Missing Data and Influential Outliers
Two practical data problems can distort OLS. Missing data is benign when observations are missing at random, costing only sample size, but systematic missingness (missing because of the value itself) can bias estimates much like nonrandom sampling. Outliers are extreme observations, and a high-leverage point with an unusual regressor value can move the fitted line on its own. Because OLS minimizes squared residuals, a few influential observations can dominate the estimates, so robustness checks and careful reporting matter.
Why it matters
If a handful of rows are blank for reasons unrelated to the answer, you just have a smaller dataset and OLS is fine. The danger is when data go missing for a reason tied to the outcome, like only successful firms reporting profits, which quietly skews the sample. Outliers are the other hazard: because least squares squares the residuals, one bizarre point far out on the x-axis can yank the whole line toward itself, so you should check whether your results survive dropping it.
Formulas
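The two diagnostics used in this note, leverage and Cook's distance, can be written as follows. The notation is an assumption made for illustration and is not taken from the source: $p$ is the number of estimated coefficients including the intercept, $\hat{u}_i$ is the OLS residual, and $\mathbf{x}_i'$ is row $i$ of the regressor matrix $\mathbf{X}$.

```latex
% Leverage of observation i (diagonal element of the hat matrix)
h_{ii} = \mathbf{x}_i' (\mathbf{X}'\mathbf{X})^{-1} \mathbf{x}_i,
\qquad 0 \le h_{ii} \le 1, \qquad \sum_{i=1}^{n} h_{ii} = p

% Special case: one regressor plus an intercept
h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}

% Cook's distance: residual size scaled by leverage
D_i = \frac{\hat{u}_i^2}{p\,\hat{\sigma}^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2},
\qquad \hat{\sigma}^2 = \frac{1}{n - p} \sum_{j=1}^{n} \hat{u}_j^2
```

The one-regressor form shows why a point far from $\bar{x}$ has high leverage, and the factor $h_{ii}/(1 - h_{ii})^2$ shows why Cook's distance grows quickly as leverage approaches 1.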
Worked examples
A wage dataset has missing values on education for some workers.
`regress lwage educ exper` uses listwise deletion, so rows with any missing variable are dropped. Run `misstable summarize` first to see the pattern. If education is missing roughly at random, the smaller sample is the only cost; if it is missing for a reason tied to wages, the estimates can be biased.
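The contrast between the two kinds of missingness can be checked with a small simulation. The sketch below is illustrative only: the data are simulated, the true slope of 0.10, the 30% missingness rate, and the median cutoff are all assumptions, and `exper` is omitted to keep the regression to one regressor. It imitates listwise deletion by dropping any row that contains `None`.

```python
import random
import statistics


def ols_slope(xs, ys):
    """Slope from a simple OLS regression of ys on xs (with intercept)."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


def listwise_slope(rows):
    """Drop any row containing None, then estimate the slope."""
    complete = [(e, w) for e, w in rows if e is not None and w is not None]
    educ, lwage = zip(*complete)
    return ols_slope(educ, lwage), len(complete)


random.seed(0)
TRUE_SLOPE = 0.10  # assumed return to education in the simulated data

full = []
for _ in range(5000):
    educ = random.uniform(8, 20)
    lwage = 1.0 + TRUE_SLOPE * educ + random.gauss(0, 0.5)
    full.append((educ, lwage))

# Case 1: educ goes missing for about 30% of workers, unrelated to anything.
mcar = [(None if random.random() < 0.3 else e, w) for e, w in full]

# Case 2: educ goes missing whenever lwage is above its median.
cutoff = statistics.median([w for _, w in full])
selected = [(None if w > cutoff else e, w) for e, w in full]

slope_mcar, n_mcar = listwise_slope(mcar)
slope_selected, n_selected = listwise_slope(selected)

print(f"true slope          : {TRUE_SLOPE:.3f}")
print(f"random missingness  : {slope_mcar:.3f} (n = {n_mcar})")
print(f"tied to wages       : {slope_selected:.3f} (n = {n_selected})")
```

Under these assumptions the randomly thinned sample should recover a slope close to the true value with fewer observations, while the sample selected on wages should give a slope pulled toward zero.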
One firm in a regression of R&D on sales has enormous sales, far above the rest.
After `regress rdintens sales`, run `predict lev, leverage` and `predict cook, cooksd` to flag the point, or use `lvr2plot`. Re-estimate without it; if the slope changes sharply, that single high-leverage observation was driving the result and you should report both estimates and explain the choice.
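The same check can be sketched outside Stata. The example below computes leverage and Cook's distance by hand for a one-regressor model and then re-estimates without the flagged firm. The ten data points are hypothetical values invented for illustration, and the function names are not from any library; the formulas are the standard one-regressor expressions with `p = 2` estimated coefficients.

```python
import statistics


def fit_line(xs, ys):
    """Return (intercept, slope) from simple OLS."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def leverage_and_cooks(xs, ys):
    """Leverage h_i and Cook's distance D_i for each observation."""
    n, p = len(xs), 2  # p = intercept + slope
    intercept, slope = fit_line(xs, ys)
    x_bar = statistics.fmean(xs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    resid = [y - intercept - slope * x for x, y in zip(xs, ys)]
    sigma2 = sum(e * e for e in resid) / (n - p)
    lev = [1 / n + (x - x_bar) ** 2 / sxx for x in xs]
    cook = [e * e / (p * sigma2) * h / (1 - h) ** 2
            for e, h in zip(resid, lev)]
    return lev, cook


# Hypothetical data: nine ordinary firms and one with enormous sales.
sales = [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000]
rdintens = [2.0, 2.5, 2.2, 3.0, 2.8, 3.4, 3.1, 3.9, 3.6, 1.0]

lev, cook = leverage_and_cooks(sales, rdintens)
flagged = max(range(len(sales)), key=lambda i: cook[i])

_, slope_all = fit_line(sales, rdintens)
keep = [i for i in range(len(sales)) if i != flagged]
_, slope_without = fit_line([sales[i] for i in keep],
                            [rdintens[i] for i in keep])

print(f"flagged observation : index {flagged}, sales = {sales[flagged]}")
print(f"slope, all firms    : {slope_all:.5f}")
print(f"slope, firm dropped : {slope_without:.5f}")
```

In this made-up sample the large firm is flagged and the slope changes sign once it is dropped, which is the situation where both estimates should be reported. Its raw residual is small because the line has been pulled toward it, so the sketch ranks observations by Cook's distance rather than by residual size.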
Common mistakes
- ✗ Assuming all missing data biases results. Data missing completely at random only shrink the sample; bias arises mainly when missingness depends on the outcome.
- ✗ Believing every outlier should be deleted. Extreme but valid observations may carry real information; the question is influence, not mere size.
- ✗ Thinking a big residual alone makes a point influential. Influence needs both a large residual and high leverage, which is why Cook’s distance combines the two. A high-leverage point also pulls the fitted line toward itself, so its raw residual can look small even when it is influential.
- ✗ Treating ad hoc imputation as harmless. Filling gaps with the mean or arbitrary codes can itself introduce bias and understate standard errors.
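The last point, that mean imputation understates uncertainty, can be seen in the simplest possible case: the standard error of a sample mean. This is a minimal sketch with hypothetical numbers, not a regression example. Filling the gaps with the observed mean leaves the sum of squared deviations unchanged while inflating the count, so both the standard deviation and the standard error shrink.

```python
import math
import statistics

# Hypothetical data: six observed values and four missing ones.
observed = [9.0, 12.0, 12.0, 14.0, 16.0, 18.0]
n_missing = 4


def se_of_mean(values):
    """Conventional standard error of the sample mean."""
    return statistics.stdev(values) / math.sqrt(len(values))


mean_obs = statistics.fmean(observed)
imputed = observed + [mean_obs] * n_missing  # mean imputation

sd_obs, sd_imp = statistics.stdev(observed), statistics.stdev(imputed)
se_obs, se_imp = se_of_mean(observed), se_of_mean(imputed)

print(f"mean  observed / imputed : {mean_obs:.2f} / {statistics.fmean(imputed):.2f}")
print(f"sd    observed / imputed : {sd_obs:.3f} / {sd_imp:.3f}")
print(f"se    observed / imputed : {se_obs:.3f} / {se_imp:.3f}")
```

The imputed rows add no information about the variable, yet the reported standard error falls as if the sample had grown.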
Revision bullets
- Random missingness costs sample size; systematic missingness can bias.
- Standard OLS routines such as Stata's `regress` use listwise deletion, dropping rows with any missing variable.
- Leverage measures how unusual a regressor value is.
- Influence needs a large residual and high leverage (Cook’s distance).
- Run robustness checks and document any dropped observations.
Quick check
Data that are missing completely at random mainly cause OLS to:
An observation is most likely to be influential on the OLS line when it has:
Sources
- Wooldridge (2019), §9.5. Wooldridge, Jeffrey M. Introductory Econometrics: A Modern Approach. 7th ed. Cengage, 2019. Covers missing data, nonrandom samples, outliers, leverage, and influential observations in OLS.
- Wooldridge (2019), §9.6. Wooldridge, Jeffrey M. Introductory Econometrics: A Modern Approach. 7th ed. Cengage, 2019. Discusses least absolute deviations and robustness of estimates to influential observations.