
Dummy Variables

A dummy (binary) variable takes the value 1 for one category and 0 otherwise, letting qualitative information enter a regression. In $y=\beta_0+\delta_0 d+\beta_1 x+u$, the coefficient $\delta_0$ shifts the intercept, measuring the average gap in $y$ between the group with $d=1$ and the omitted base group, holding $x$ fixed. With $g$ categories you include $g-1$ dummies and leave one out as the reference. Including all $g$ dummies plus an intercept causes perfect collinearity, known as the dummy-variable trap.

Try it yourself

Model specification sandbox

One model, many specifications. See how functional form, a squared term, and a dummy variable each change what a coefficient means and how the fitted line bends. The same seeded data sit under all three views.

Dummy shifts (sandbox settings): β₀ (base intercept) = 8.0; β₁ (base slope) = 1.05; δ₀ (intercept shift) = 7.0; interaction δ₁ (slope tilt) = 0.45, switched off. Base slope 1.05, group-D slope 1.05.

base: ŷ = 8.0 + 1.05·x
group D: ŷ = 15.0 + 1.05·x
δ₀ shifts the intercept by 7.0. With the interaction off the slopes match, so the lines stay parallel.

Try this

Discussion. With the interaction off, δ₀ only lifts the line; turn it on and δ₁ tilts the slope. Which question does each parameter answer, and why must the base group and the main effect stay in the model for δ₀ and δ₁ to be readable?

y = β₀ + δ₀D + β₁x + δ₁(D·x) + u. The base group (D = 0) has intercept β₀ and slope β₁; group D has intercept β₀ + δ₀ and slope β₁ + δ₁. With the interaction off, δ₁ = 0 and the lines are parallel.
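The two fitted lines can be read straight off the four parameters. A minimal sketch in Python, using the sandbox values quoted above; the function name `group_line` is hypothetical and exists only for this illustration:

```python
def group_line(beta0, beta1, delta0, delta1, d):
    """Return (intercept, slope) of the fitted line for group d (0 = base, 1 = group D)."""
    intercept = beta0 + delta0 * d
    slope = beta1 + delta1 * d
    return intercept, slope

# Sandbox values from the text; delta1 = 0.0 means the interaction is off.
base = group_line(8.0, 1.05, 7.0, 0.0, d=0)
group_d_parallel = group_line(8.0, 1.05, 7.0, 0.0, d=1)
group_d_tilted = group_line(8.0, 1.05, 7.0, 0.45, d=1)

print(base)              # base group: intercept 8.0, slope 1.05
print(group_d_parallel)  # interaction off: intercept 15.0, slope 1.05
print(group_d_tilted)    # interaction on: intercept 15.0, slope about 1.5
```

Setting `d=0` removes both δ terms, which is why the base group's line depends only on β₀ and β₁.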

Why it matters

Regression needs numbers, but many things we care about are labels, such as married or single, union or non-union, or one of four regions. A dummy converts a label into a 0/1 switch. The coefficient then reads as "how much higher or lower is $y$ for this group compared with the left-out group, on average." You always need one group to compare against, which is why one category is dropped rather than coded.

Formulas

Intercept shift

$y=\beta_0+\delta_0 d+\beta_1 x+u$

$d=0$ gives intercept $\beta_0$; $d=1$ gives intercept $\beta_0+\delta_0$. The slope on $x$ is the same for both groups.
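The intercept-shift reading follows from taking conditional expectations of the model for each group. The short derivation below assumes $E(u \mid d, x)=0$, the usual zero-conditional-mean condition, which is stated here as an assumption and not taken from the page:

```latex
\begin{aligned}
E(y \mid d=1, x) &= \beta_0 + \delta_0 + \beta_1 x \\
E(y \mid d=0, x) &= \beta_0 + \beta_1 x \\
E(y \mid d=1, x) - E(y \mid d=0, x) &= \delta_0
\end{aligned}
```

The $\beta_1 x$ terms cancel, so the gap between the groups is $\delta_0$ at every value of $x$.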

Multiple categories

$y=\beta_0+\delta_1 d_1+\dots+\delta_{g-1} d_{g-1}+\beta_1 x+u$

For $g$ groups use $g-1$ dummies. Each $\delta_j$ is that group’s mean difference from the omitted base group.
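As an illustration of the $g-1$ rule, the sketch below builds dummy columns from a list of category labels and leaves out a chosen base. The helper `make_dummies` and the region labels are hypothetical, made up for this example:

```python
def make_dummies(labels, base):
    """Build g-1 dummy columns for g categories, omitting `base` as the reference."""
    categories = sorted(set(labels))
    if base not in categories:
        raise ValueError(f"base {base!r} is not one of the categories")
    return {
        cat: [1 if label == cat else 0 for label in labels]
        for cat in categories
        if cat != base
    }

# Illustrative labels only: four regions, so three dummies are created.
regions = ["north", "south", "east", "west", "south", "north"]
dummies = make_dummies(regions, base="north")

for name, column in dummies.items():
    print(name, column)
```

Observations in the base group have a 0 in every column, so the intercept alone describes them.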

Worked examples

Scenario

Estimate the wage gap associated with being female, controlling for education and experience.

Solution

Run `regress lwage female educ exper`. Because the dependent variable is in logs, the coefficient on `female` times 100 is approximately the average percent wage gap relative to men, the base group, holding education and experience fixed. A value of about -0.18 implies women earn roughly 18 percent less on average for the same measured characteristics; the exact figure, $(e^{-0.18}-1)\times 100$, is about -16.5 percent.

Note: If `female` is already 0/1 you can also write `i.female` to let Stata manage the base level.
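The gap between the quick reading of a log-wage coefficient and the exact percent difference can be checked with a few lines of arithmetic. This is a sketch in Python, using only the illustrative coefficient from the example; `exact_percent_gap` is a hypothetical helper name:

```python
import math

def exact_percent_gap(delta_hat):
    """Exact percent difference implied by a dummy coefficient in a log(y) regression."""
    return (math.exp(delta_hat) - 1.0) * 100.0

delta_hat = -0.18                      # illustrative value from the example
approx = 100.0 * delta_hat             # quick reading: about -18 percent
exact = exact_percent_gap(delta_hat)   # about -16.5 percent

print(f"approximate: {approx:.1f}%  exact: {exact:.1f}%")
```

The two readings are close for small coefficients and drift apart as the coefficient grows in absolute value.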

Scenario

Region has four categories (north, south, east, west) and you want regional wage differences.

Solution

Run `regress lwage i.region educ`. Stata automatically omits one region as the base (by default the level with the lowest numeric code) and reports three coefficients, each the mean log-wage difference from that omitted region, holding education fixed. Trying to force all four region dummies with a constant would trigger the dummy-variable trap, and Stata would omit one because of collinearity.
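To see why four region dummies plus a constant cannot all be estimated, the sketch below writes out the regressor rows and checks the linear dependence directly. It is a plain-Python illustration with made-up region labels, not a reproduction of what Stata does internally:

```python
regions = ["north", "south", "east", "west", "south", "east"]
categories = ["north", "south", "east", "west"]

# One row per observation: a constant followed by a dummy for EVERY region.
rows = [[1] + [1 if r == c else 0 for c in categories] for r in regions]

# In each row the four dummies add up to the constant, so
# constant - d_north - d_south - d_east - d_west = 0: an exact linear dependence.
dependence = [row[0] - sum(row[1:]) for row in rows]
print(dependence)  # all zeros, so the columns are perfectly collinear

# Drop the north dummy (the base). The same combination no longer vanishes:
# it equals 1 on base-region rows and 0 elsewhere.
reduced = [[row[0]] + row[2:] for row in rows]
leftover = [row[0] - sum(row[1:]) for row in reduced]
print(leftover)
```

The leftover column is just the omitted north dummy, which is the information the intercept now carries for the base group.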

Common mistakes

✗ Including a dummy for every category plus an intercept. That is the dummy-variable trap. The dummies sum to one and are perfectly collinear with the constant. Drop one category.
✗ Reading a dummy coefficient in absolute terms. With $\log(y)$ on the left, $100\hat{\delta}$ is approximately the percent difference; for larger coefficients, $(e^{\hat{\delta}}-1)\times 100$ gives the exact percent gap.
✗ Thinking the choice of base group changes the substance. It only changes which comparisons the coefficients show; predictions and fit are identical.
✗ Treating a dummy coefficient as causal. It is a conditional mean difference and can still reflect omitted variables correlated with group membership.
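The point about the base group can be checked numerically. The sketch below uses the fact that, in a regression on a constant and group dummies only, the intercept is the base-group mean and each dummy coefficient is that group's mean minus the base mean. The data and the helper names `dummy_fit` and `predict` are hypothetical, made up for illustration:

```python
from statistics import mean

# Hypothetical log-wage data by region, made up for illustration.
data = {
    "north": [2.0, 2.2],
    "south": [1.7, 1.9],
    "east": [2.1, 2.3],
}

def dummy_fit(data, base):
    """OLS of y on a constant plus dummies for every group except `base`.

    With only dummies on the right-hand side, the intercept is the base-group
    mean and each dummy coefficient is that group's mean minus the base mean.
    """
    intercept = mean(data[base])
    coefs = {g: mean(ys) - intercept for g, ys in data.items() if g != base}
    return intercept, coefs

def predict(intercept, coefs, group):
    """Fitted value for a group: intercept plus its dummy coefficient (0 for the base)."""
    return intercept + coefs.get(group, 0.0)

for base in ("north", "south"):
    intercept, coefs = dummy_fit(data, base)
    preds = {g: round(predict(intercept, coefs, g), 10) for g in data}
    print(base, preds)
```

Switching the base flips the sign of the north-south coefficient but leaves every group's fitted value unchanged.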

Revision bullets

• A dummy is 0/1 and shifts the intercept by its coefficient.
• The coefficient is the mean gap from the omitted base group.
• Use $g-1$ dummies for $g$ categories.
• All $g$ dummies plus a constant cause the dummy-variable trap.
• In Stata, `i.var` factor notation handles the base level automatically.

Quick check

With four regions you should include how many region dummies alongside the intercept?

In `regress lwage female educ`, a coefficient of -0.20 on female means:


Sources

Wooldridge, Jeffrey M. Introductory Econometrics: A Modern Approach. 7th ed. Cengage, 2019.
§7.1-7.3: Introduces binary regressors, intercept shifts, multiple categories, and the dummy-variable trap.
Ch. 7: Worked wage examples interpreting dummy coefficients as conditional mean differences from the base group.

How to cite this page

Dr. Phil's Quant Lab. (2026). Dummy Variables. Derivatives Atlas. https://phucnguyenvan.com/concept/efm-dummy-variables
