
Multicollinearity

Multicollinearity is high correlation among the explanatory variables. It inflates the variance of the affected slope estimators through the variance inflation factor, $\operatorname{Var}(\hat{\beta}_j)=\frac{\sigma^2}{\mathrm{SST}_j(1-R_j^2)}$, where $R_j^2$ is the $R^2$ from regressing $x_j$ on the other regressors. Crucially, multicollinearity does not bias OLS, which stays unbiased and consistent. It is a precision problem that bites hardest in small samples: standard errors and confidence intervals widen, so estimates become imprecise rather than wrong.


Multicollinearity and the VIF

When two regressors are correlated, OLS still hits the true value on average, but each coefficient is harder to pin down. With two regressors the auxiliary fit gives R²ⱼ = ρ², so the variance inflation factor is VIF = 1 / (1 − ρ²) and the standard error widens by √VIF versus no collinearity, holding σ² and SSTⱼ fixed.

[Interactive figure: sampling distribution of the slope estimate β̂₁, with the horizontal axis in baseline standard errors from the true β₁. Curves: no collinearity (ρ = 0) and the current setting (ρ = 0.50), which is √VIF = 1.15× wider. Readouts at ρ = corr(x₁, x₂) = 0.50: VIF = 1 / (1 − ρ²) = 1.33; SE multiplier √VIF = 1.15×; auxiliary R²ⱼ = ρ² = 0.25.]
The two curves share the same centre, so β̂₁ stays unbiased: collinearity does not move the estimate, it only spreads it out. The gold curve (ρ = 0.50) is 1.15× wider than the baseline, so the standard error of β̂₁ is 1.15× larger and its t-statistic shrinks to 0.87× its baseline value, holding σ² and SSTⱼ fixed. Note that this inflates the variance of the individual coefficients; the overall fit and any joint prediction can still be precise.
These figures come from closed-form variance algebra, not simulation: VIF = 1 / (1 − ρ²) is the reciprocal of the (1 − R²ⱼ) factor in Var(β̂ⱼ) = σ² / [SSTⱼ(1 − R²ⱼ)]. A rule of thumb flags VIF > 10 (|ρ| > 0.95 in the two-regressor case). In Stata, read it with `estat vif` after `regress`.
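
The readouts above can be reproduced with a few lines of arithmetic. The following is a minimal Python sketch for the two-regressor case only, where the auxiliary R² equals ρ²; the function names `vif_from_rho` and `summarize` are illustrative and not part of any library.

```python
import math


def vif_from_rho(rho: float) -> float:
    """VIF in the two-regressor case, where the auxiliary R^2 equals rho^2."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return 1.0 / (1.0 - rho ** 2)


def summarize(rho: float) -> dict:
    """Collect the quantities shown in the figure readouts for a given rho."""
    vif = vif_from_rho(rho)
    se_multiplier = math.sqrt(vif)  # standard error widens by sqrt(VIF)
    return {
        "aux_r2": rho ** 2,
        "vif": vif,
        "se_multiplier": se_multiplier,
        "t_multiplier": 1.0 / se_multiplier,  # t-statistic shrinks by the same factor
    }


for rho in (0.0, 0.5, 0.9, 0.95):
    s = summarize(rho)
    print(f"rho={rho:.2f}  R2={s['aux_r2']:.3f}  VIF={s['vif']:.2f}  "
          f"SE x{s['se_multiplier']:.2f}  t x{s['t_multiplier']:.2f}")
```

At ρ = 0.50 this gives a VIF of 1.33, an SE multiplier of 1.15 and a t multiplier of 0.87, matching the figure; at ρ = 0.95 the VIF just exceeds the conventional cutoff of 10.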

Why it matters

If two regressors move together almost in lockstep, the data struggle to tell their separate effects apart, like asking which of two people pushing a cart did the work when they always push together. The coefficients can swing around and the standard errors balloon. But nothing is being distorted on average. With more data, or with the two variables varying more independently, the haze clears. That is why this is about precision, not bias.
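
The precision-not-bias claim can be checked with a small Monte Carlo experiment. The sketch below is an illustration under assumed settings (standard normal regressors, true slopes of 1, unit error variance, n = 50, 2000 replications), none of which come from the page; it uses only the Python standard library and the closed-form two-regressor OLS solution.

```python
import math
import random
import statistics


def ols_slope_x1(x1, x2, y):
    """OLS coefficient on x1 when y is regressed on an intercept, x1 and x2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    return (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)


def simulate(rho, n=50, reps=2000, beta1=1.0, beta2=1.0, seed=0):
    """Return the mean and standard deviation of beta1-hat across replications."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        # x2 has unit variance and population correlation rho with x1.
        x2 = [rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for a in x1]
        y = [beta1 * a + beta2 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
        estimates.append(ols_slope_x1(x1, x2, y))
    return statistics.mean(estimates), statistics.stdev(estimates)


for rho in (0.0, 0.9):
    mean, sd = simulate(rho)
    print(f"rho={rho:.1f}  mean of beta1-hat={mean:.3f}  sd of beta1-hat={sd:.3f}")
```

In both settings the average estimate should sit close to the true value of 1, while the spread at ρ = 0.9 should be roughly √VIF ≈ 2.3 times the spread at ρ = 0. Raising `n` shrinks both spreads, which is the sense in which more data clears the haze.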

Formulas

Variance of a slope estimator
$\operatorname{Var}(\hat{\beta}_j)=\frac{\sigma^2}{\mathrm{SST}_j\,(1-R_j^2)}$
$R_j^2$ is the $R^2$ from regressing $x_j$ on all other regressors. As $R_j^2\to 1$, the variance explodes.
Variance inflation factor
$\mathrm{VIF}_j=\frac{1}{1-R_j^2}$
A common rule of thumb flags a VIF above 10 (equivalently $R_j^2>0.9$), though no threshold is sacred.
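
With more than two regressors, one standard way to get every $\mathrm{VIF}_j$ at once is to read the diagonal of the inverse of the regressors' correlation matrix, since that diagonal equals $1/(1-R_j^2)$. The sketch below is a minimal standard-library Python illustration; the helper names and the three-regressor correlation matrix are hypothetical, chosen only to show the mechanics.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I].
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: regressors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vifs_from_corr(corr):
    """VIF_j is the j-th diagonal element of the inverse correlation matrix."""
    inv = invert(corr)
    return [inv[j][j] for j in range(len(corr))]


# Hypothetical correlations among three regressors; the last two move together.
corr = [
    [1.0, 0.2, 0.3],
    [0.2, 1.0, 0.9],
    [0.3, 0.9, 1.0],
]
for j, vif in enumerate(vifs_from_corr(corr), start=1):
    print(f"x{j}: VIF={vif:.2f}  implied R_j^2={1 - 1 / vif:.3f}")
```

The singular-matrix error is the perfect-collinearity case, where $R_j^2=1$ and the VIF is undefined.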

Worked examples

Scenario

A wage regression includes both total experience and tenure at the current firm, which are strongly correlated.

Solution

After `regress lwage educ exper tenure`, run `estat vif`. If `tenure` has a VIF of about 12, its standard error is inflated and its $t$ statistic may be small even though experience and tenure together explain wages well. The point estimates remain unbiased; only their precision suffers.

Note: A high joint $F$ statistic with insignificant individual $t$ statistics is a classic multicollinearity signature.
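
As a rough check on the arithmetic, and taking the illustrative VIF of 12 from the solution at face value, the implied auxiliary $R^2$ and standard-error multiplier are:

$$
R_{\text{tenure}}^2 = 1-\frac{1}{\mathrm{VIF}_{\text{tenure}}} = 1-\frac{1}{12}\approx 0.917,
\qquad
\sqrt{\mathrm{VIF}_{\text{tenure}}}=\sqrt{12}\approx 3.46 .
$$

So the standard error on `tenure` is about 3.5 times what it would be if tenure were uncorrelated with the other regressors, holding $\sigma^2$ and $\mathrm{SST}_j$ fixed.
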
Scenario

A demand model includes price and price-squared, which are mechanically correlated over the sample.

Solution

Center the variable before squaring: after `summarize price`, run `gen pc = price - r(mean)` and use `c.pc##c.pc` in the regression. This reduces the correlation between the level and the square and lowers the VIFs. The fitted curve and predictions are unchanged, which confirms that collinearity was a precision issue, not a specification error.
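
The effect of centering on the level-square correlation can be seen directly. The sketch below is a minimal standard-library Python illustration using a hypothetical, evenly spaced price grid from 10 to 20; the data and the helper `corr` are assumptions for illustration, not the page's dataset.

```python
import math
import statistics


def corr(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def vif_two_regressors(r):
    """VIF when the model has only the level and its square as regressors."""
    return 1.0 / (1.0 - r ** 2)


# Hypothetical prices: 10.0, 10.5, ..., 20.0
prices = [10 + 0.5 * i for i in range(21)]
mean_price = statistics.fmean(prices)
centered = [p - mean_price for p in prices]

r_raw = corr(prices, [p ** 2 for p in prices])
r_centered = corr(centered, [c ** 2 for c in centered])

print(f"raw:      corr(price, price^2) = {r_raw:.4f}, VIF = {vif_two_regressors(r_raw):.1f}")
print(f"centered: corr(pc, pc^2)       = {r_centered:.4f}, VIF = {vif_two_regressors(r_centered):.2f}")
```

Because this grid is symmetric around its mean, the centered correlation is essentially zero. With skewed price data, centering typically lowers the correlation without eliminating it.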

Common mistakes

  • Believing multicollinearity biases the coefficients. OLS stays unbiased and consistent; only the variances rise. It is a precision problem, not a bias.
  • Dropping a correlated regressor to "fix" it. Removing a relevant variable can introduce omitted variable bias, trading a precision problem for a far worse consistency problem.
  • Treating a VIF above 10 as a hard failure. The cutoff is a convention. What matters is whether your standard errors are small enough to answer the question.
  • Confusing perfect collinearity with high collinearity. Perfect collinearity (an exact linear relationship) breaks OLS entirely; high but imperfect collinearity merely inflates variances.
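
The trade-off in the second mistake can be made concrete with a simulation. The sketch below is an illustration under assumed settings (standard normal regressors with ρ = 0.9, true slopes of 1, unit error variance, n = 50, 2000 replications), none of which come from the page. It compares the "long" regression that keeps both regressors with the "short" regression that drops x2.

```python
import math
import random
import statistics


def centered_sums(x1, x2, y):
    """Centered sums of squares and cross-products needed for both regressions."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    return s11, s22, s12, s1y, s2y


def compare_long_and_short(rho=0.9, n=50, reps=2000, beta1=1.0, beta2=1.0, seed=1):
    """Mean and sd of the x1 coefficient with x2 included (long) and dropped (short)."""
    rng = random.Random(seed)
    long_est, short_est = [], []
    for _ in range(reps):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        x2 = [rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for a in x1]
        y = [beta1 * a + beta2 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
        s11, s22, s12, s1y, s2y = centered_sums(x1, x2, y)
        long_est.append((s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2))
        short_est.append(s1y / s11)  # simple regression of y on x1 only
    return {
        "long_mean": statistics.mean(long_est),
        "long_sd": statistics.stdev(long_est),
        "short_mean": statistics.mean(short_est),
        "short_sd": statistics.stdev(short_est),
    }


result = compare_long_and_short()
for key, value in result.items():
    print(f"{key}: {value:.3f}")
```

Under these assumptions the short regression is tighter but centred near β₁ + β₂ρ = 1.9 rather than the true 1, while the long regression is noisier but centred on 1. More data narrows the long regression's spread but does nothing for the short regression's bias.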

Revision bullets

  • High correlation among regressors inflates $\operatorname{Var}(\hat{\beta}_j)$ via the VIF.
  • It does not bias OLS; estimates stay unbiased and consistent.
  • It is a precision problem, worst in small samples.
  • Symptom: high joint $F$ statistic but insignificant individual $t$ statistics.
  • Do not drop a relevant variable just to cut collinearity; that risks omitted variable bias (OVB).

Quick check

Multicollinearity primarily affects OLS by:

A regressor has a VIF of 15. The most defensible response is:


Sources

  1. Wooldridge (2019), §3.4
    Wooldridge, Jeffrey M. Introductory Econometrics: A Modern Approach. 7th ed. Cengage, 2019.
    Derives $\operatorname{Var}(\hat{\beta}_j)$, the role of $R_j^2$, the VIF, and stresses that collinearity affects variance, not bias.
  2. Wooldridge (2019), §3.4a
    Wooldridge, Jeffrey M. Introductory Econometrics: A Modern Approach. 7th ed. Cengage, 2019.
    Discusses why dropping correlated but relevant variables can cause omitted variable bias.
How to cite this page
Dr. Phil's Quant Lab. (2026). Multicollinearity. Derivatives Atlas. https://phucnguyenvan.com/concept/efm-multicollinearity