Missing Data and Influential Outliers

Two practical data problems can distort OLS. Missing data is benign when observations are missing at random, costing only sample size, but systematic missingness (missing because of the value itself) can bias estimates much like nonrandom sampling. Outliers are extreme observations, and a high-leverage point with an unusual regressor value can move the fitted line on its own. Because OLS minimizes squared residuals, a few influential observations can dominate the estimates, so robustness checks and careful reporting matter.

Why it matters

If a handful of rows are blank for reasons unrelated to the answer, you just have a smaller dataset and OLS is fine. The danger is when data go missing for a reason tied to the outcome, like only successful firms reporting profits, which quietly skews the sample. Outliers are the other hazard: because least squares squares the residuals, one bizarre point far out on the $x$-axis can yank the whole line toward itself, so you should check whether your results survive dropping it.

Formulas

Leverage of an observation (simple regression with an intercept)
$$h_i=\frac{1}{n}+\frac{(x_i-\bar{x})^2}{\sum_{j}(x_j-\bar{x})^2}$$
Leverage rises with distance from $\bar{x}$. High-leverage points have the most potential to move the OLS fit.
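To make the leverage formula concrete, here is a minimal NumPy sketch. The sales figures are made up for illustration, and `leverage_simple` is a hypothetical helper name, not a library function. The sketch also cross-checks the formula against the diagonal of the hat matrix.

```python
import numpy as np

def leverage_simple(x):
    """Leverage h_i for a simple regression with an intercept."""
    x = np.asarray(x, dtype=float)
    dev = x - x.mean()
    return 1.0 / x.size + dev**2 / np.sum(dev**2)

# Hypothetical sales figures: five ordinary firms and one very large one.
sales = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 45.0])
h = leverage_simple(sales)

# Cross-check against the diagonal of the hat matrix X (X'X)^{-1} X'.
X = np.column_stack([np.ones_like(sales), sales])
hat_diag = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

print(np.round(h, 3))
```

The leverages sum to the number of estimated coefficients (two here), so when one firm takes a leverage close to 1, the fitted line is forced to pass very near that firm's point.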
Influence (Cook’s distance, idea)
$$D_i=\frac{r_i^2}{k+1}\cdot\frac{h_i}{1-h_i}$$
Combines a large standardized residual $r_i$ with high leverage $h_i$; $k+1$ is the number of estimated coefficients, including the intercept. Large $D_i$ flags observations that, if removed, would substantially change the estimates.
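The Cook's distance formula can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than a reference implementation: `cooks_distance` is a hypothetical helper, the data are invented, and the residuals are standardized as $r_i = e_i / \sqrt{s^2(1-h_i)}$, where $e_i$ is the raw residual and $s^2$ the usual residual variance estimate.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each row; X must already contain a column of ones."""
    n, p = X.shape                                  # p = k + 1 coefficients
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)     # leverage h_i
    s2 = resid @ resid / (n - p)                    # residual variance estimate
    r = resid / np.sqrt(s2 * (1.0 - h))             # standardized residuals r_i
    return r**2 / p * h / (1.0 - h)

# Hypothetical data: the last point has an extreme regressor value
# and lies far from the line suggested by the other five.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 45.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 20.0])
X = np.column_stack([np.ones_like(x), x])
D = cooks_distance(X, y)

print(np.round(D, 3))
```

The same $D_i$ can be obtained by actually deleting row $i$, refitting, and measuring how far the coefficient vector moves, which is why a large value signals that the estimates depend heavily on that one observation.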

Worked examples

Scenario

A wage dataset has missing values on education for some workers.

Solution

`regress lwage educ exper` uses listwise deletion, so rows with any missing variable are dropped. Run `misstable summarize` first to see the pattern. If education is missing roughly at random, the smaller sample is the only cost; if it is missing for a reason tied to wages, the estimates can be biased.

Note: Compare results on the full sample versus complete cases to gauge whether missingness matters.
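A small simulation can illustrate the difference between the two kinds of missingness. Everything here is assumed for illustration: the data are simulated, the true slope of 0.08 is arbitrary, and the selection rule (rows survive only when the wage is above the median) is just one stylized way for missingness to be tied to wages.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
educ = rng.normal(12.0, 2.0, n)                       # simulated schooling
lwage = 1.0 + 0.08 * educ + rng.normal(0.0, 0.4, n)   # true slope is 0.08

def ols_slope(x, y):
    """Slope from a simple regression of y on x with an intercept."""
    dev = x - x.mean()
    return dev @ (y - y.mean()) / (dev @ dev)

# Case 1: about 30% of rows dropped completely at random.
keep_random = rng.random(n) > 0.3
# Case 2: rows kept only when the wage is above the median.
keep_high_wage = lwage > np.median(lwage)

slope_full = ols_slope(educ, lwage)
slope_mcar = ols_slope(educ[keep_random], lwage[keep_random])
slope_selected = ols_slope(educ[keep_high_wage], lwage[keep_high_wage])

print(slope_full, slope_mcar, slope_selected)
```

In this setup the randomly thinned sample should give a slope close to the full-sample one, while the wage-selected sample should give a noticeably smaller slope, because selecting on the outcome flattens the fitted line.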
Scenario

One firm in a regression of R&D on sales has enormous sales, far above the rest.

Solution

After `regress rdintens sales`, run `predict lev, leverage` and `predict cook, cooksd` to flag the point, or use `lvr2plot`. Re-estimate without it; if the slope changes sharply, that single high-leverage observation was driving the result and you should report both estimates and explain the choice.

Note: Dropping points is a judgment call: document it rather than deleting silently.
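The "re-estimate without it" step can be sketched outside Stata as well. The numbers below are invented for illustration and are not from any real R&D dataset; `np.polyfit` with degree 1 is used as a stand-in for the simple regression.

```python
import numpy as np

# Hypothetical firms: sales and R&D intensity. The last firm is the giant one.
sales = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 45.0])
rdintens = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 20.0])

# np.polyfit returns coefficients from highest degree down: [slope, intercept].
slope_all, intercept_all = np.polyfit(sales, rdintens, 1)
slope_drop, intercept_drop = np.polyfit(sales[:-1], rdintens[:-1], 1)

print(f"slope with all firms:      {slope_all:.3f}")
print(f"slope without largest firm: {slope_drop:.3f}")
```

With these made-up numbers the slope more than doubles once the largest firm is excluded, which is the kind of sensitivity that calls for reporting both estimates.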

Common mistakes

  • Assuming all missing data biases results. Data missing completely at random only shrink the sample; bias arises mainly when missingness depends on the outcome.
  • Believing every outlier should be deleted. Extreme but valid observations may carry real information; the question is influence, not mere size.
  • Thinking a big residual alone makes a point influential. Influence needs both a large residual and high leverage, which is why Cook’s distance combines the two.
  • Treating ad hoc imputation as harmless. Filling gaps with the mean or arbitrary codes can itself introduce bias and understate standard errors.
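The last point about imputation can be illustrated with a minimal sketch. The data are simulated and the 40% missingness rate is an arbitrary assumption; the example looks only at the standard error of a sample mean, which is the simplest case where mean imputation visibly understates uncertainty.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
educ = rng.normal(12.0, 2.0, n)          # simulated, hypothetical values
missing = rng.random(n) < 0.4            # about 40% missing completely at random
observed = educ[~missing]

# Mean imputation: every gap is filled with the observed mean.
imputed = np.where(missing, observed.mean(), educ)

# Naive standard errors of the mean, before and after imputation.
se_observed = observed.std(ddof=1) / np.sqrt(observed.size)
se_imputed = imputed.std(ddof=1) / np.sqrt(imputed.size)

print(se_observed, se_imputed)
```

The imputed series has the same mean but less spread and a larger apparent sample size, so its naive standard error is smaller even though no new information was added.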

Revision bullets

  • Random missingness costs sample size; systematic missingness can bias.
  • Stata's `regress`, like most OLS routines, uses listwise deletion: rows with any missing variable are dropped.
  • Leverage measures how unusual a regressor value is.
  • Influence needs a large residual and high leverage (Cook’s distance).
  • Run robustness checks and document any dropped observations.

Quick check

Data that are missing completely at random mainly cause OLS to:

An observation is most likely to be influential on the OLS line when it has:

Sources

  1. Wooldridge (2019), §9.5
    Wooldridge, Jeffrey M. Introductory Econometrics: A Modern Approach. 7th ed. Cengage, 2019.
    Covers missing data, nonrandom samples, outliers, leverage, and influential observations in OLS.
  2. Wooldridge (2019), §9.6
    Wooldridge, Jeffrey M. Introductory Econometrics: A Modern Approach. 7th ed. Cengage, 2019.
    Discusses least absolute deviations and robustness of estimates to influential observations.
How to cite this page
Dr. Phil's Quant Lab. (2026). Missing Data and Influential Outliers. Derivatives Atlas. https://phucnguyenvan.com/concept/efm-missing-data-outliers