Goodness of Fit and R-squared
The variation in $y$ decomposes as $SST = SSE + SSR$: the total sum of squares equals the explained sum of squares plus the residual sum of squares. The R-squared is $R^2 = SSE/SST = 1 - SSR/SST$, the fraction of the sample variation in $y$ that the regression explains, and it always lies in $[0, 1]$ when the regression includes an intercept. A higher $R^2$ means the points cluster more tightly around the line, but it says nothing about whether the model is correctly specified, unbiased, or causal. In applied microeconometrics a low $R^2$ is common and entirely compatible with a well-estimated, meaningful slope.
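The decomposition can be checked numerically. The sketch below uses a small made-up data set (the numbers are hypothetical, not taken from any survey) and helper names `ols_fit` and `fit_summary` that are introduced here only for illustration. It fits the simple OLS line by the usual closed-form formulas and computes the three sums of squares.

```python
# Hypothetical toy data (not from any real survey):
# years of schooling and hourly wage for eight workers.
educ = [8.0, 10.0, 12.0, 12.0, 14.0, 16.0, 16.0, 18.0]
wage = [6.5, 9.0, 8.0, 12.5, 11.0, 18.0, 13.5, 20.0]


def ols_fit(x, y):
    """Return (intercept, slope) of the simple OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def fit_summary(x, y):
    """Return SST, SSE, SSR and R-squared for the OLS line of y on x."""
    intercept, slope = ols_fit(x, y)
    y_bar = sum(y) / len(y)
    fitted = [intercept + slope * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation in y
    sse = sum((fi - y_bar) ** 2 for fi in fitted)           # explained part
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # residual part
    return {"sst": sst, "sse": sse, "ssr": ssr, "r2": 1.0 - ssr / sst}


summary = fit_summary(educ, wage)
print(summary)
print("SSE + SSR =", summary["sse"] + summary["ssr"])
```

Up to floating-point rounding, `SSE + SSR` matches `SST`, and `1 - SSR/SST` matches `SSE/SST`. Both equalities rely on the intercept being included in the regression.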
Try it yourself
The variation in $y$ splits as $SST = SSE + SSR$. The OLS line makes the residual part $SSR$ as small as possible, so $R^2 = 1 - SSR/SST$ is the explained share. Any other line you draw through the same data has a larger $SSR$ than the OLS line.
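One way to try this without an interactive plot is to compare the OLS line with a few lines picked by hand. The sketch below is illustrative only: the data and the candidate lines are hypothetical, and `ssr_of_line` is a helper name introduced here.

```python
# Hypothetical toy data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.5, 6.9]


def ols_fit(x, y):
    """Return (intercept, slope) of the simple OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def ssr_of_line(intercept, slope, x, y):
    """Sum of squared residuals of an arbitrary line on the data."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


y_bar = sum(y) / len(y)
sst = sum((yi - y_bar) ** 2 for yi in y)

intercept_ols, slope_ols = ols_fit(x, y)
ssr_ols = ssr_of_line(intercept_ols, slope_ols, x, y)

# Hand-picked alternatives: shift the intercept, flatten the slope,
# or guess a line from scratch.
other_lines = [
    (intercept_ols + 0.5, slope_ols),
    (intercept_ols, slope_ols - 0.2),
    (0.0, 1.0),
]
other_ssrs = [ssr_of_line(a, b, x, y) for a, b in other_lines]

print("OLS line: SSR =", ssr_ols, " 1 - SSR/SST =", 1 - ssr_ols / sst)
for (a, b), s in zip(other_lines, other_ssrs):
    # For a non-OLS line, 1 - SSR/SST is only a fit score; it need not
    # equal SSE/SST and can even be negative for a very poor line.
    print(f"line {a:.3f} + {b:.3f}x: SSR =", s, " 1 - SSR/SST =", 1 - s / sst)
```

Every alternative line has a larger `SSR` than the OLS line, which is what "least squares" means.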
Why it matters
$SST$ measures how much $y$ bounces around its own mean. The regression splits that into a part the line accounts for ($SSE$) and a part left over ($SSR$). R-squared is just the explained share. It is a description of fit, not a verdict on whether your estimate is trustworthy. You can have an $R^2$ of 0.02 and still recover a credible causal effect, or an $R^2$ of 0.95 from a regression that is badly biased. Fit and validity are separate questions.
Formulas
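The formulas this section relies on can be written out as follows. This is a reconstruction in the notation of the cited textbook (Wooldridge), assuming a simple regression with an intercept, fitted values $\hat{y}_i$, residuals $\hat{u}_i$, and sample mean $\bar{y}$.

```latex
\begin{align*}
SST &= \sum_{i=1}^{n} (y_i - \bar{y})^2
      && \text{(total sum of squares)} \\
SSE &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
      && \text{(explained sum of squares)} \\
SSR &= \sum_{i=1}^{n} \hat{u}_i^2, \quad \hat{u}_i = y_i - \hat{y}_i
      && \text{(residual sum of squares)} \\
SST &= SSE + SSR \\
R^2 &= \frac{SSE}{SST} = 1 - \frac{SSR}{SST}
\end{align*}
```

Note that some textbooks swap the labels, using SSE for the residual sum and SSR for the regression (explained) sum. The definitions above follow the convention used throughout this section.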
Worked examples
A student runs `regress wage educ` and Stata reports `R-squared = 0.165`. The student worries the model is broken.
An $R^2$ of 0.165 means education explains about 16.5% of the sample variation in wages. The remaining 83.5% reflects experience, ability, occupation, and other factors in $u$. This is typical for cross-sectional wage data and does not invalidate the estimated return to schooling. The slope can still be precisely estimated and economically interpretable despite the modest $R^2$.
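A small simulation illustrates the same point. The sketch below does not use real wage data: the data-generating process, the true slope of 0.8, and the noise level are all hypothetical choices, made so that the unobserved factors dominate the variation in the outcome while staying independent of education.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 5000
true_slope = 0.8        # hypothetical "return to schooling"

# Simulated data: the error term u is large but unrelated to educ.
educ = [rng.uniform(8.0, 18.0) for _ in range(n)]
u = [rng.gauss(0.0, 6.0) for _ in range(n)]
wage = [2.0 + true_slope * e + ui for e, ui in zip(educ, u)]

# Simple OLS by the closed-form formulas.
educ_bar = sum(educ) / n
wage_bar = sum(wage) / n
sxx = sum((e - educ_bar) ** 2 for e in educ)
sxy = sum((e - educ_bar) * (w - wage_bar) for e, w in zip(educ, wage))
slope_hat = sxy / sxx
intercept_hat = wage_bar - slope_hat * educ_bar

# Goodness of fit.
sst = sum((w - wage_bar) ** 2 for w in wage)
ssr = sum((w - (intercept_hat + slope_hat * e)) ** 2 for e, w in zip(educ, wage))
r2 = 1.0 - ssr / sst

# Conventional (homoskedastic) standard error of the slope.
sigma2_hat = ssr / (n - 2)
se_slope = math.sqrt(sigma2_hat / sxx)

print("R-squared:     ", round(r2, 3))
print("slope estimate:", round(slope_hat, 3), "(true value 0.8)")
print("standard error:", round(se_slope, 3))
```

In this setup the R-squared is low because most of the variation comes from `u`, yet the slope estimate lands close to the true value with a small standard error. A low R-squared reflects noisy outcomes, not a broken estimator.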
Common mistakes
- ✗ A high $R^2$ means the model is correct or the estimate is causal. R-squared measures how tightly the data fit the line, not whether $x$ causes $y$. A regression can have a high $R^2$ and still suffer severe omitted variable bias.
- ✗ A low $R^2$ means the regression is useless. Many valid microeconometric studies report $R^2$ values below 0.2. The key estimate can be statistically and economically significant even when most of the variation in $y$ is unexplained.
- ✗ A variable is only worth adding if it raises $R^2$. Goodness of fit and the quality of a causal estimate are different goals. Including a control to reduce bias matters even if it barely moves $R^2$, and chasing a higher $R^2$ can itself introduce bias.
- ✗ R-squared can be negative or exceed one. In a regression with an intercept estimated by OLS, $R^2$ is bounded in $[0, 1]$ because $SSE$ and $SSR$ are non-negative and sum to $SST$.
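The first mistake in the list can be made concrete with a simulation. In this sketch the data-generating process is hypothetical: `y` depends on `x` with a true coefficient of 1.0 and on a second variable `z` that is strongly correlated with `x`. The short regression omits `z`. To stay dependency-free, the coefficient on `x` in the long regression is obtained by partialling out `z` (the Frisch-Waugh result) rather than by matrix algebra. That is a choice made for this example, not something the section prescribes.

```python
import random

rng = random.Random(1)  # fixed seed so the run is reproducible
n = 5000
true_beta_x = 1.0       # hypothetical causal effect of x on y

x = [rng.gauss(0.0, 1.0) for _ in range(n)]
z = [0.9 * xi + 0.3 * rng.gauss(0.0, 1.0) for xi in x]  # z moves with x
y = [true_beta_x * xi + 3.0 * zi + 0.5 * rng.gauss(0.0, 1.0)
     for xi, zi in zip(x, z)]


def ols_fit(a, b):
    """Return (intercept, slope) of the simple OLS regression of b on a."""
    m = len(a)
    a_bar = sum(a) / m
    b_bar = sum(b) / m
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    slope = sab / saa
    return b_bar - slope * a_bar, slope


def r_squared(a, b):
    """R-squared of the simple OLS regression of b on a."""
    intercept, slope = ols_fit(a, b)
    b_bar = sum(b) / len(b)
    sst = sum((bi - b_bar) ** 2 for bi in b)
    ssr = sum((bi - (intercept + slope * ai)) ** 2 for ai, bi in zip(a, b))
    return 1.0 - ssr / sst


# Short regression: y on x only, with z omitted.
_, slope_short = ols_fit(x, y)
r2_short = r_squared(x, y)

# Long-regression coefficient on x via partialling out:
# regress x on z, keep the residuals, then regress y on those residuals.
a0, a1 = ols_fit(z, x)
x_resid = [xi - (a0 + a1 * zi) for xi, zi in zip(x, z)]
_, beta_long = ols_fit(x_resid, y)

print("short regression: slope on x =", round(slope_short, 3),
      " R-squared =", round(r2_short, 3))
print("controlling for z: coefficient on x =", round(beta_long, 3),
      "(true value 1.0)")
```

The short regression fits the data very well, but its slope absorbs the effect of the omitted `z` and lands far from the true coefficient. Controlling for `z` brings the coefficient on `x` back near 1.0.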
Revision bullets
- Decomposition: $SST = SSE + SSR$ (total = explained + residual)
- $R^2 = SSE/SST = 1 - SSR/SST$, the explained share of variation in $y$
- $R^2$ always lies in $[0, 1]$ with an intercept
- High $R^2$ does not imply correct specification or causality
- Low $R^2$ is normal in cross-sectional micro data and not a defect
Quick check
An R-squared of 0.30 in a wage regression tells you that:
Which statement about R-squared is correct?
Sources
- Wooldridge (2019), Ch. 2.3. Wooldridge, Jeffrey M. Introductory Econometrics: A Modern Approach. 7th ed. Cengage Learning, 2019. ISBN 978-1-337-55886-0. Section 2.3 derives the SST = SSE + SSR decomposition and defines R-squared as the fraction of explained variation.
- Wooldridge (2019), §2.6 (interpreting fit). Wooldridge, Jeffrey M. Introductory Econometrics: A Modern Approach. 7th ed. Cengage Learning, 2019. Warns against over-reliance on R-squared and notes that low values are common and acceptable in applied work.