Dummy Variables

A dummy (binary) variable takes the value 1 for one category and 0 otherwise, letting qualitative information enter a regression. In $y=\beta_0+\delta_0 d+\beta_1 x+u$, the coefficient $\delta_0$ shifts the intercept, measuring the average gap in $y$ between the group with $d=1$ and the omitted base group, holding $x$ fixed. With $g$ categories you include $g-1$ dummies and leave one out as the reference. Including all $g$ dummies plus an intercept causes perfect collinearity, the dummy-variable trap.
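
One step of algebra shows why the coefficient on the dummy reads as a gap between groups. The sketch below assumes the zero conditional mean condition $E(u \mid d, x) = 0$, which the paragraph above does not state explicitly.

```latex
\begin{aligned}
E(y \mid d=1, x) &= \beta_0 + \delta_0 + \beta_1 x \\
E(y \mid d=0, x) &= \beta_0 + \beta_1 x \\
E(y \mid d=1, x) - E(y \mid d=0, x) &= \delta_0
\end{aligned}
```

The $\beta_1 x$ term cancels in the difference, which is what "holding $x$ fixed" means here.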

Try it yourself

Model specification sandbox

One model, many specifications. See how functional form, a squared term, and a dummy variable each change what a coefficient means and how the fitted line bends. The same seeded data sit under all three views.

[Interactive figure: "Dummy shifts". Fitted lines of y against x for the base group (D = 0) and group D (D = 1).]
Settings shown: β₀ (base intercept) = 8.0, β₁ (base slope) = 1.05, δ₀ (intercept shift) = 7.0; interaction δ₁ (slope tilt) off, slider at 0.45.
base: ŷ = 8.0 + 1.05·x
group D: ŷ = 15.0 + 1.05·x
δ₀ shifts the intercept by 7.0. With the interaction off the slopes match, so the lines stay parallel.
Try this

Discussion. With the interaction off, δ₀ only lifts the line; turn it on and δ₁ tilts the slope. Which question does each parameter answer, and why must the base group and the main effect stay in the model for δ₀ and δ₁ to be readable?

y = β₀ + δ₀D + β₁x + δ₁(D·x) + u. The base group (D = 0) has intercept β₀ and slope β₁; group D has intercept β₀ + δ₀ and slope β₁ + δ₁. With the interaction off, δ₁ = 0 and the lines are parallel.
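
The intercept and slope bookkeeping can be sketched in a few lines of Python. This is a minimal illustration, not the sandbox's own code: the class name `DummyModel` and its methods are hypothetical, and the parameter values are the ones shown in the sandbox (β₀ = 8.0, β₁ = 1.05, δ₀ = 7.0, δ₁ = 0.45).

```python
from dataclasses import dataclass


@dataclass
class DummyModel:
    """y = beta0 + delta0*D + beta1*x + delta1*(D*x); hypothetical helper."""
    beta0: float
    beta1: float
    delta0: float
    delta1: float = 0.0  # 0.0 means the interaction is switched off

    def line(self, d):
        """Return (intercept, slope) for group d, where d is 0 or 1."""
        return (self.beta0 + self.delta0 * d, self.beta1 + self.delta1 * d)

    def predict(self, d, x):
        intercept, slope = self.line(d)
        return intercept + slope * x


# Interaction off: same slope in both groups, so the lines are parallel.
parallel = DummyModel(beta0=8.0, beta1=1.05, delta0=7.0)
# Interaction on: group D gets a steeper slope as well as a higher intercept.
tilted = DummyModel(beta0=8.0, beta1=1.05, delta0=7.0, delta1=0.45)

print(parallel.line(0), parallel.line(1))
print(tilted.line(0), tilted.line(1))
```

With the interaction on, the gap between the groups is no longer a single number: at a given x it equals δ₀ + δ₁·x, so δ₀ alone is the gap only at x = 0.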

Why it matters

Regression needs numbers, but many things we care about are labels, such as married or single, union or non-union, or one of four regions. A dummy converts a label into a 0/1 switch. The coefficient then reads as "how much higher or lower is $y$ for this group compared with the left-out group, on average." You always need one group to compare against, which is why one category is dropped rather than coded.

Formulas

Intercept shift
$y=\beta_0+\delta_0 d+\beta_1 x+u$
$d=0$ gives intercept $\beta_0$; $d=1$ gives intercept $\beta_0+\delta_0$. The slope on $x$ is the same for both groups.
Multiple categories
$y=\beta_0+\delta_1 d_1+\dots+\delta_{g-1} d_{g-1}+\beta_1 x+u$
For $g$ groups use $g-1$ dummies. Each $\delta_j$ is that group's mean difference from the omitted base group.
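
The encoding step can be sketched in plain Python. The helper `make_dummies` and the six-row `region` sample are made up for illustration. The last lines show why the trap arises: once the base category is coded too, the dummy columns add up to 1 in every row, which duplicates the constant.

```python
def make_dummies(labels, base):
    """Return {category: 0/1 column} for every category except `base`."""
    categories = sorted(set(labels))
    if base not in categories:
        raise ValueError(f"base {base!r} not found in labels")
    return {
        c: [1 if lab == c else 0 for lab in labels]
        for c in categories
        if c != base
    }


# Hypothetical sample with g = 4 categories.
region = ["north", "south", "east", "west", "south", "east"]

# g - 1 = 3 dummy columns; "north" is the omitted base group.
dummies = make_dummies(region, base="north")
print(sorted(dummies))

# The trap: code the base group as well, then sum the g columns row by row.
all_g = dict(dummies, north=[1 if lab == "north" else 0 for lab in region])
row_sums = [sum(col[i] for col in all_g.values()) for i in range(len(region))]
print(row_sums)  # every entry is 1, identical to the constant column
```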

Worked examples

Scenario

Estimate the wage gap associated with being female, controlling for education and experience.

Solution

Run `regress lwage female educ exper`. The coefficient on `female`, multiplied by 100, is the approximate average percent wage gap (since $y$ is logged) relative to men, the base group, holding education and experience fixed. A value of about -0.18 implies women earn roughly 18 percent less on average for the same measured characteristics.

Note: If `female` is already 0/1 you can also write `i.female` to let Stata manage the base level.
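
Reading a log-wage coefficient as a percent gap is an approximation that loosens as the coefficient grows. The short Python sketch below compares it with the exact conversion $(e^{\hat{\delta}}-1)\times 100$. The function names are illustrative, and apart from -0.18 the coefficient values are made up.

```python
import math


def approx_percent_gap(delta_hat):
    """Approximate percent gap: 100 times the log-point coefficient."""
    return 100.0 * delta_hat


def exact_percent_gap(delta_hat):
    """Exact percent gap implied by a dummy coefficient in a log(y) model."""
    return 100.0 * (math.exp(delta_hat) - 1.0)


for d in (-0.05, -0.18, -0.40):
    print(
        f"coef {d:+.2f}: approx {approx_percent_gap(d):6.1f}%, "
        f"exact {exact_percent_gap(d):6.1f}%"
    )
```

For the -0.18 in the example, the exact gap is about 16.5 percent rather than 18; for small coefficients the two readings nearly coincide.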
Scenario

Region has four categories (north, south, east, west) and you want regional wage differences.

Solution

Run `regress lwage i.region educ`. Stata automatically drops one region as the base and reports three coefficients, each the mean log-wage difference from that omitted region, holding education fixed. Forcing all four region dummies in alongside a constant triggers the dummy-variable trap; Stata detects the collinearity and omits one of them anyway.

Common mistakes

  • Including a dummy for every category plus an intercept. That is the dummy-variable trap. The dummies sum to one and are perfectly collinear with the constant. Drop one category.
  • Reading a dummy coefficient in absolute terms. With $\log(y)$ on the left, 100 times the coefficient is approximately a percent difference, and for larger values $(e^{\hat{\delta}}-1)\times 100$ is the exact percent gap.
  • Thinking the choice of base group changes the substance. It only changes which comparisons the coefficients show; predictions and fit are identical.
  • Treating a dummy coefficient as causal. It is a conditional mean difference and can still reflect omitted variables correlated with group membership.
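
The base-group point can be checked numerically. The sketch below makes a simplifying assumption: the regression contains only a constant and the region dummies, with no other controls. In that case OLS reduces to group means, so no matrix algebra is needed. The log-wage numbers and the helper names are made up for illustration.

```python
from statistics import mean

# Hypothetical log wages by region.
lwage = {
    "north": [2.0, 2.2],
    "south": [1.8, 1.9, 2.0],
    "east": [2.4, 2.6],
    "west": [2.1],
}


def saturated_fit(groups, base):
    """OLS of y on a constant and g-1 dummies, using the group-mean shortcut.

    Intercept = base-group mean; each coefficient = group mean - base mean.
    """
    means = {g: mean(ys) for g, ys in groups.items()}
    intercept = means[base]
    coefs = {g: m - intercept for g, m in means.items() if g != base}
    return intercept, coefs


def fitted(intercept, coefs, group):
    """Fitted value for one group; the base group has no dummy switched on."""
    return intercept + coefs.get(group, 0.0)


fit_north = saturated_fit(lwage, base="north")
fit_south = saturated_fit(lwage, base="south")

# Coefficients differ across the two fits, but fitted values do not.
for g in lwage:
    print(g, round(fitted(*fit_north, g), 6), round(fitted(*fit_south, g), 6))
```

Switching the base from north to south flips the sign of the north-south coefficient and relabels the others, while every group's fitted value is unchanged. With additional regressors the coefficients are no longer raw mean differences, but the same invariance of predictions and fit holds.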

Revision bullets

  • A dummy is 0/1 and shifts the intercept by its coefficient.
  • The coefficient is the mean gap from the omitted base group.
  • Use $g-1$ dummies for $g$ categories.
  • All $g$ dummies plus a constant cause the dummy-variable trap.
  • In Stata, `i.var` factor notation handles the base level automatically.

Quick check

With four regions you should include how many region dummies alongside the intercept?

In `regress lwage female educ`, a coefficient of -0.20 on female means:

Sources

  1. Wooldridge (2019), §7.1-7.3
    Wooldridge, Jeffrey M. Introductory Econometrics: A Modern Approach. 7th ed. Cengage, 2019.
    Introduces binary regressors, intercept shifts, multiple categories, and the dummy-variable trap.
  2. Wooldridge (2019), Ch. 7
    Wooldridge, Jeffrey M. Introductory Econometrics: A Modern Approach. 7th ed. Cengage, 2019.
    Worked wage examples interpreting dummy coefficients as conditional mean differences from the base group.
How to cite this page
Dr. Phil's Quant Lab. (2026). Dummy Variables. Derivatives Atlas. https://phucnguyenvan.com/concept/efm-dummy-variables