Data Handling: Preparing Data for Analysis

Q: You have firm accounting data and want to attach matching market data with one row per firm-year on each side. Which command fits?

merge 1:1 id year using market.dta. A one-to-one match on the entity-time key is exactly merge 1:1 id year using market.dta. The 1:1 asserts a unique id year on both files. append stacks rows from a file with the same variables rather than joining new columns, reshape changes the shape of one dataset, and xtset only declares the panel dimension. After the merge you would check _merge to see which firm-years matched.

Q: Your data has one row per firm with columns roa2019 and roa2020. To get one row per firm-year you should:

Run reshape long roa, i(id) j(year). reshape long roa, i(id) j(year) stacks the year-suffixed columns into a single roa variable and creates a year variable, giving the long firm-year layout panel estimators need. xtset only declares the panel and does not reshape anything, duplicates drop removes repeated rows rather than columns, and xtreg requires the long shape, so wide data with year-named columns will not work.

Before any regression runs, raw files must become one clean panel. This is the step the video calls two firms, one question, before any regression runs: you merge sources that hold different pieces of the story, such as firm accounts with market data, on a shared id-time key; you reshape between wide and long so the data sits one row per entity-period; you declare the structure with `xtset id t` so Stata knows the panel dimension; and you confirm with `duplicates report id t` that each entity-period appears once. The payoff is a tidy panel $\{(i, t, y_{it}, x_{it})\}$ ready for an unobserved-effects model.

Watch the lesson

Modelling· 5:40· ECON3006

Data Handling, getting your data ready before any regression runs

Open full lesson page →

Why it matters

A panel is a stack of entities watched over time, so the data only makes sense once every row is pinned to exactly one entity and one period. Think of it as labelling boxes before you shelve them. If a firm-year shows up twice, or a merge silently drops the firms with no market match, every later coefficient is built on a broken frame. Getting the keys, the shape, and the `xtset` declaration right is unglamorous, but it is what lets the same observation line up across files and across years.

Formulas

Long-format panel layout

\{(i, t, y_{it}, x_{it})\} \quad i = 1,\dots,N \quad t = 1,\dots,T

Each row is one entity

i

observed in one period

t

. This is the long shape Stata expects after `xtset id t`, with

N

entities and

T

periods per entity in a balanced panel.

Merge key and reshape

(i, t) \;\longmapsto\; \text{one row} \qquad \text{wide} \;\xrightarrow{\ \text{reshape long}\ }\; \text{long}

A clean merge matches on the pair

(i, t)

and yields one row per entity-period. Reshaping turns year-named columns into a single time variable so the panel is long, not wide.

Worked examples

Scenario

Combine an accounting file (firm fundamentals) with a market file (returns, prices) for the same firms and years.

Solution

Open the accounting file, then run `merge 1:1 id year using market.dta`. The `1:1` says one row per `id year` on each side. Inspect `_merge` afterward: `tab _merge` shows matched rows (`_merge==3`) versus firms present in only one file (`_merge==1` or `2`). Investigate non-matches rather than dropping them blindly, since a systematic mismatch can bias the sample.

NoteSort or confirm a unique `id year` key on both files first, or the `1:1` merge will error.

Scenario

A spreadsheet stores `roa2019`, `roa2020`, `lev2019`, `lev2020` in one row per firm (wide), but the model needs one row per firm-year (long).

Solution

Run `reshape long roa lev, i(id) j(year)`. Stata stacks the year-suffixed columns into `roa` and `lev` with a new `year` variable, giving one row per firm-year. Then declare the panel with `xtset id year` and verify uniqueness with `duplicates report id year`, which should report zero duplicate `id year` combinations.

Common mistakes

✗Thinking a merge always keeps every row. By default unmatched observations stay flagged in `_merge`, but if you filter on `_merge==3` you silently drop firms with no market match, which can skew the sample.
✗Confusing wide and long. One row per firm with year-suffixed columns is wide. Panel estimators need long, one row per firm-year, so `xtset` and `xtreg` have a time dimension to work with.
✗Skipping `duplicates report id t`. A repeated entity-period slips past a casual eyeball but breaks `xtset` and quietly distorts within-entity estimates.
✗Treating this node as the data-type taxonomy or the missing-data problem. The labels cross-section, time series, and panel live in `efm-data-types`, and how to handle gaps and outliers lives in `efm-missing-data-outliers`. This node is only the mechanical prep that turns files into a declared panel.

Revision bullets

•`merge 1:1 id year using file.dta` joins two sources on the entity-time key.
•Always check `_merge` after a join, and decide on non-matches deliberately.
•`reshape long var, i(id) j(year)` turns wide year-columns into long firm-year rows.
•`xtset id year` declares the panel so `xt` commands know the entity and time dimensions.
•`duplicates report id year` confirms each entity-period is unique before estimating.
•The target shape is a clean long panel $\{(i, t, y_{it}, x_{it})\}$ for $y_{it} = x_{it}'\beta + a_i + u_{it}$ .

Quick check

You have firm accounting data and want to attach matching market data with one row per firm-year on each side. Which command fits?

Your data has one row per firm with columns `roa2019` and `roa2020`. To get one row per firm-year you should:

Connected topics

Data Types Stata Workflow Missing & Out.Panel Structure Fixed Effects

Sources

Wooldridge (2019), Ch. 1
Wooldridge, Jeffrey M. Introductory Econometrics: A Modern Approach. 7th ed. Cengage, 2019.
Section 1.3 describes cross-section, time series, pooled cross-section, and panel data structures, motivating why each observation must be keyed to an entity and a period.
Wooldridge (2019), Ch. 13
Wooldridge, Jeffrey M. Introductory Econometrics: A Modern Approach. 7th ed. Cengage, 2019.
Chapter 13 sets up two-period and multi-period panel data and the unobserved-effects model, the target structure that clean merging and reshaping prepare the data for.

How to cite this page

Dr. Phil's Quant Lab. (2026). Data Handling: Preparing Data for Analysis. Derivatives Atlas. https://phucnguyenvan.com/concept/efm-data-handling