Goodness of Fit and R-squared

The variation in $y$ decomposes as $SST = SSE + SSR$: the total sum of squares equals the explained plus the residual sum of squares. The R-squared is $R^2 = SSE / SST = 1 - SSR / SST$, the fraction of the sample variation in $y$ that the regression explains, and it always lies in $[0, 1]$. A higher $R^2$ means the points cluster more tightly around the line, but it says nothing about whether the model is correctly specified, unbiased, or causal. In applied microeconometrics a low $R^2$ is common and entirely compatible with a well-estimated, meaningful slope.

Try it yourself

Goodness of fit — R²

The variation in y splits as SST = SSE + SSR. The OLS line makes the residual part SSR as small as possible, so R² = 1 − SSR/SST is the explained share. Move your blue line and watch its loss stay above the OLS minimum.

[Interactive figure: scatter plot of y against x showing the OLS best-fit line ŷ = 2.1 + 1.55x and a user-adjustable line ("Your line") set by an intercept b₀ and a slope b₁. For the OLS fit, SST = SSE (explained) + SSR (residual), with SSR at its minimum of 26 and R² = 93%. When the adjustable line sits exactly on the OLS line the two SSRs are equal; moving it away can only raise the SSR.]
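
The claim that the OLS line has the smallest SSR of any line can be checked numerically. The sketch below is only an illustration: the data points are hypothetical (they are not the points used in the interactive figure), and `ols_fit` and `ssr_of_line` are helper names introduced here. It fits the OLS line, nudges the intercept and slope in several directions, and records the SSR of each nudged line.

```python
def ols_fit(x, y):
    """Return (intercept, slope) of the simple OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def ssr_of_line(x, y, b0, b1):
    """Sum of squared residuals for the line y = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, chosen only for illustration.
x = [1, 3, 5, 7, 9, 11]
y = [4.0, 6.5, 10.5, 12.0, 17.0, 18.5]

b0_ols, b1_ols = ols_fit(x, y)
ssr_ols = ssr_of_line(x, y, b0_ols, b1_ols)

# Nudge the intercept and slope away from the OLS values.
ssr_nudged = []
for d0 in (-0.5, 0.0, 0.5):
    for d1 in (-0.1, 0.0, 0.1):
        if d0 == 0.0 and d1 == 0.0:
            continue  # skip the OLS line itself
        ssr_nudged.append(ssr_of_line(x, y, b0_ols + d0, b1_ols + d1))

print("OLS SSR:", round(ssr_ols, 3))
print("Smallest SSR among nudged lines:", round(min(ssr_nudged), 3))
```

Every nudged line has a larger SSR than the OLS line, which is the same behaviour the sliders show.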

Why it matters

$SST$ measures how much $y$ bounces around its own mean. The regression splits that into a part the line accounts for ($SSE$) and a part left over ($SSR$). R-squared is just the explained share. It is a description of fit, not a verdict on whether your estimate is trustworthy. You can have an $R^2$ of 0.02 and still recover a credible causal effect, or an $R^2$ of 0.95 from a regression that is badly biased. Fit and validity are separate questions.
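
A small simulation can make the separation between fit and validity concrete. This is a sketch under stated assumptions, not an empirical result: both data-generating processes, the true slope of 0.5, the omitted variable `z`, and the helper `slope_and_r2` are all invented for illustration. In the first case the error is noisy but unrelated to `x`; in the second, `z` drives both `x` and `y` and is left out of the regression.

```python
import random


def slope_and_r2(x, y):
    """Simple OLS of y on x (with intercept); return (slope, R-squared)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return slope, 1 - ssr / sst


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 5000
TRUE_SLOPE = 0.5  # assumed true effect of x on y in both simulations

# Case 1: noisy error that is unrelated to x -> low R-squared, slope near 0.5.
x1 = [rng.gauss(0, 1) for _ in range(n)]
y1 = [1 + TRUE_SLOPE * xi + rng.gauss(0, 3) for xi in x1]
slope_low, r2_low = slope_and_r2(x1, y1)

# Case 2: z affects y and is strongly correlated with x, but is omitted
# from the regression -> high R-squared, slope far from 0.5.
z = [rng.gauss(0, 1) for _ in range(n)]
x2 = [zi + rng.gauss(0, 0.3) for zi in z]
y2 = [1 + TRUE_SLOPE * xi + 3 * zi + rng.gauss(0, 0.5) for xi, zi in zip(x2, z)]
slope_high, r2_high = slope_and_r2(x2, y2)

print(f"Case 1: slope = {slope_low:.3f}, R2 = {r2_low:.3f}")
print(f"Case 2: slope = {slope_high:.3f}, R2 = {r2_high:.3f}")
```

Under these assumptions the first regression fits poorly yet estimates the slope well, while the second fits tightly and overstates the slope because of the omitted variable.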

Formulas

Sum of squares decomposition
$SST = SSE + SSR$
$SST = \sum (y_i - \bar{y})^2$, $SSE = \sum (\hat{y}_i - \bar{y})^2$, $SSR = \sum \hat{u}_i^2$. The decomposition holds because OLS residuals are uncorrelated with the fitted values.
R-squared
$R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}$
Fraction of the sample variation in $y$ explained by $x$. Always between 0 and 1 in a regression with an intercept.
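
The formulas above can be computed directly. The following is a minimal sketch using only the Python standard library; the five data points and the helper names `ols_fit` and `sums_of_squares` are made up for illustration. It fits a simple regression with an intercept, computes the three sums of squares, and checks that both expressions for R-squared agree.

```python
def ols_fit(x, y):
    """Return (intercept, slope) of the simple OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sums_of_squares(x, y):
    """Return (SST, SSE, SSR) for the OLS regression of y on x."""
    b0, b1 = ols_fit(x, y)
    y_bar = sum(y) / len(y)
    fitted = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)                 # total
    sse = sum((fi - y_bar) ** 2 for fi in fitted)            # explained
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual
    return sst, sse, ssr


# Hypothetical data, chosen so the arithmetic is easy to follow.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

sst, sse, ssr = sums_of_squares(x, y)
r2 = sse / sst
r2_alt = 1 - ssr / sst

print("SST:", round(sst, 4), "SSE:", round(sse, 4), "SSR:", round(ssr, 4))
print("R-squared:", round(r2, 4), "(via 1 - SSR/SST:", round(r2_alt, 4), ")")
```

For these points the fitted line is $\hat{y} = 2.2 + 0.6x$, with $SST = 6$, $SSE = 3.6$, and $SSR = 2.4$, so $R^2$ is about 0.6 by either formula.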

Worked examples

Scenario

A student runs `regress wage educ` and Stata reports `R-squared = 0.165`. The student worries the model is broken.

Solution

An $R^2$ of 0.165 means education explains about 16.5% of the sample variation in wages. The remaining 83.5% reflects experience, ability, occupation, and other factors in $u$. This is typical for cross-sectional wage data and does not invalidate the estimated return to schooling. The slope can still be precisely estimated and economically interpretable despite the modest $R^2$.

Note: Judging a model by $R^2$ alone is a beginner habit. The credibility of the slope depends on whether the assumptions about the error term $u$ hold (for example, that $u$ is unrelated to $x$), not on $R^2$.

Common mistakes

  • A high $R^2$ means the model is correct or the estimate is causal. R-squared measures how tightly the data fit the line, not whether $x$ causes $y$. A regression can have a high $R^2$ and still suffer severe omitted variable bias.
  • A low $R^2$ means the regression is useless. Many valid microeconometric studies report $R^2$ values below 0.2. The key estimate can be statistically and economically significant even when most of the variation in $y$ is unexplained.
  • Adding more explanatory power always requires a higher $R^2$ to be worthwhile. Goodness of fit and the quality of a causal estimate are different goals. Including a control to reduce bias matters even if it barely moves $R^2$, and chasing $R^2$ can introduce bias.
  • R-squared can be negative or exceed one. In a regression with an intercept estimated by OLS, $R^2$ is bounded in $[0, 1]$ because $SSE$ and $SSR$ are non-negative and sum to $SST$.
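
The qualifier "with an intercept" in the last point does real work. As a hedged illustration (the four data points and the helper `r2_from_line` are invented here), the sketch below fits the same data twice: once by OLS with an intercept, and once through the origin with the intercept forced to zero. In the second case the decomposition no longer has to hold, and $1 - SSR/SST$ can fall below zero.

```python
def r2_from_line(x, y, b0, b1):
    """Compute 1 - SSR/SST for the line y = b0 + b1 * x."""
    y_bar = sum(y) / len(y)
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return 1 - ssr / sst


# Hypothetical data lying exactly on the line y = 11 - x.
x = [1, 2, 3, 4]
y = [10, 9, 8, 7]

# OLS with an intercept.
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar
r2_intercept = r2_from_line(x, y, intercept, slope)

# Regression through the origin: the intercept is forced to zero.
slope_origin = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
r2_origin = r2_from_line(x, y, 0.0, slope_origin)

print("R-squared with intercept:", round(r2_intercept, 3))
print("1 - SSR/SST through the origin:", round(r2_origin, 3))
```

With the intercept the fit is perfect and $R^2 = 1$; through the origin the fitted line is worse than simply predicting $\bar{y}$, so the same formula gives a negative value.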

Revision bullets

  • Decomposition $SST = SSE + SSR$ (total = explained + residual)
  • $R^2 = SSE/SST = 1 - SSR/SST$, the explained share of variation in $y$
  • $R^2$ always lies in $[0, 1]$ with an intercept
  • High $R^2$ does not imply correct specification or causality
  • Low $R^2$ is normal in cross-sectional micro data and not a defect

Quick check

An R-squared of 0.30 in a wage regression tells you that:

Which statement about R-squared is correct?

Sources

  1. Wooldridge (2019), Ch. 2.3
    Wooldridge, Jeffrey M. Introductory Econometrics: A Modern Approach. 7th ed. Cengage Learning, 2019. ISBN 978-1-337-55886-0.
    Section 2.3 derives the SST = SSE + SSR decomposition and defines R-squared as the fraction of explained variation.
  2. Wooldridge (2019), §2.6 (interpreting fit)
    Wooldridge, Jeffrey M. Introductory Econometrics: A Modern Approach. 7th ed. Cengage Learning, 2019.
    Warns against over-reliance on R-squared and notes that low values are common and acceptable in applied work.
How to cite this page
Dr. Phil's Quant Lab. (2026). Goodness of Fit and R-squared. Derivatives Atlas. https://phucnguyenvan.com/concept/efm-goodness-of-fit