Data Handling: Preparing Data for Analysis
Before any regression runs, raw files must become one clean panel. This is the step the video calls two firms, one question, before any regression runs: you merge sources that hold different pieces of the story, such as firm accounts with market data, on a shared id-time key; you reshape between wide and long so the data sits one row per entity-period; you declare the structure with `xtset id t` so Stata knows the panel dimension; and you confirm with `duplicates report id t` that each entity-period appears once. The payoff is a tidy panel ready for an unobserved-effects model.
Watch the lesson
Open full lesson page →Why it matters
A panel is a stack of entities watched over time, so the data only makes sense once every row is pinned to exactly one entity and one period. Think of it as labelling boxes before you shelve them. If a firm-year shows up twice, or a merge silently drops the firms with no market match, every later coefficient is built on a broken frame. Getting the keys, the shape, and the `xtset` declaration right is unglamorous, but it is what lets the same observation line up across files and across years.
Formulas
Worked examples
Combine an accounting file (firm fundamentals) with a market file (returns, prices) for the same firms and years.
Open the accounting file, then run `merge 1:1 id year using market.dta`. The `1:1` says one row per `id year` on each side. Inspect `_merge` afterward: `tab _merge` shows matched rows (`_merge==3`) versus firms present in only one file (`_merge==1` or `2`). Investigate non-matches rather than dropping them blindly, since a systematic mismatch can bias the sample.
A spreadsheet stores `roa2019`, `roa2020`, `lev2019`, `lev2020` in one row per firm (wide), but the model needs one row per firm-year (long).
Run `reshape long roa lev, i(id) j(year)`. Stata stacks the year-suffixed columns into `roa` and `lev` with a new `year` variable, giving one row per firm-year. Then declare the panel with `xtset id year` and verify uniqueness with `duplicates report id year`, which should report zero duplicate `id year` combinations.
Common mistakes
- ✗Thinking a merge always keeps every row. By default unmatched observations stay flagged in `_merge`, but if you filter on `_merge==3` you silently drop firms with no market match, which can skew the sample.
- ✗Confusing wide and long. One row per firm with year-suffixed columns is wide. Panel estimators need long, one row per firm-year, so `xtset` and `xtreg` have a time dimension to work with.
- ✗Skipping `duplicates report id t`. A repeated entity-period slips past a casual eyeball but breaks `xtset` and quietly distorts within-entity estimates.
- ✗Treating this node as the data-type taxonomy or the missing-data problem. The labels cross-section, time series, and panel live in `efm-data-types`, and how to handle gaps and outliers lives in `efm-missing-data-outliers`. This node is only the mechanical prep that turns files into a declared panel.
Revision bullets
- •`merge 1:1 id year using file.dta` joins two sources on the entity-time key.
- •Always check `_merge` after a join, and decide on non-matches deliberately.
- •`reshape long var, i(id) j(year)` turns wide year-columns into long firm-year rows.
- •`xtset id year` declares the panel so `xt` commands know the entity and time dimensions.
- •`duplicates report id year` confirms each entity-period is unique before estimating.
- •The target shape is a clean long panel for .
Quick check
You have firm accounting data and want to attach matching market data with one row per firm-year on each side. Which command fits?
Your data has one row per firm with columns `roa2019` and `roa2020`. To get one row per firm-year you should:
Connected topics
Sources
- Wooldridge (2019), Ch. 1Wooldridge, Jeffrey M. Introductory Econometrics: A Modern Approach. 7th ed. Cengage, 2019.Section 1.3 describes cross-section, time series, pooled cross-section, and panel data structures, motivating why each observation must be keyed to an entity and a period.
- Wooldridge (2019), Ch. 13Wooldridge, Jeffrey M. Introductory Econometrics: A Modern Approach. 7th ed. Cengage, 2019.Chapter 13 sets up two-period and multi-period panel data and the unobserved-effects model, the target structure that clean merging and reshaping prepare the data for.