Simple Regression (beginner)

Goodness of Fit and R-squared

The variation in $y$ decomposes as $SST = SSE + SSR$: the total sum of squares equals the explained sum of squares plus the residual sum of squares. The R-squared is $R^2 = SSE / SST = 1 - SSR / SST$, the fraction of the sample variation in $y$ that the regression explains; in an OLS regression with an intercept it always lies in $[0, 1]$. A higher $R^2$ means the points cluster more tightly around the line, but it says nothing about whether the model is correctly specified, whether the estimates are unbiased, or whether the relationship is causal. In applied microeconometrics a low $R^2$ is common and entirely compatible with a well-estimated, meaningful slope.

Try it yourself

Goodness of fit: R²

The variation in y splits as SST = SSE + SSR. The OLS line makes the residual part SSR as small as possible, so R² = 1 − SSR/SST is the explained share. Any other line has a loss at or above the OLS minimum.

[Interactive chart: OLS fit with SST = SSE (explained) + SSR (residual). OLS line ŷ = 2.1 + 1.55x, SSR (OLS, minimum) = 26, R² (OLS) = 93%. Sliders set your intercept b₀ and slope b₁; at b₀ = 2.1 and b₁ = 1.55 your line sits exactly on the OLS line, and moving either slider can only raise the loss.]
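
The idea that any line other than the OLS line has a larger SSR can be checked numerically. The sketch below is only an illustration: the six data points are hypothetical (they are not the data behind the chart), and the helper names `ols_fit` and `ssr` are made up for this example.

```python
# Hypothetical sample, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.1, 5.4, 6.2, 8.9, 9.6, 11.8]


def ols_fit(x, y):
    """Return the OLS intercept and slope of a simple regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def ssr(b0, b1, x, y):
    """Sum of squared residuals for the candidate line b0 + b1 * x."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))


b0_hat, b1_hat = ols_fit(x, y)
ssr_ols = ssr(b0_hat, b1_hat, x, y)

# Nudge the intercept and slope away from the OLS values in every direction.
nudges = [
    (db0, db1)
    for db0 in (-0.5, 0.0, 0.5)
    for db1 in (-0.2, 0.0, 0.2)
    if (db0, db1) != (0.0, 0.0)
]
ssr_nudged = [ssr(b0_hat + db0, b1_hat + db1, x, y) for db0, db1 in nudges]

print(f"SSR at the OLS line:             {ssr_ols:.4f}")
print(f"Smallest SSR among nudged lines: {min(ssr_nudged):.4f}")
```

Every nudged line should report a larger SSR than the OLS line, which is the numerical counterpart of moving a slider away from the minimum.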

Why it matters

$SST$ measures how much $y$ bounces around its own mean. The regression splits that into a part the line accounts for ( $SSE$ ) and a part left over ( $SSR$ ). R-squared is just the explained share. It is a description of fit, not a verdict on whether your estimate is trustworthy. You can have an $R^2$ of 0.02 and still recover a credible causal effect, or an $R^2$ of 0.95 from a regression that is badly biased. Fit and validity are separate questions.
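
A small simulation can make the gap between fit and validity concrete. This is a sketch under stated assumptions, not a result from any real dataset: the data-generating process, the coefficient values, the sample size, and the omitted variable `z` are all invented for the illustration. The true effect of `x` on `y` is set to 1, but `z` is left out of the regression and is strongly correlated with `x`.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

n = 5000
true_effect = 1.0  # assumed causal effect of x on y in this simulation

x, y = [], []
for _ in range(n):
    xi = random.gauss(0, 1)
    zi = xi + 0.2 * random.gauss(0, 1)  # omitted variable, correlated with x
    yi = 2.0 + true_effect * xi + 3.0 * zi + 0.3 * random.gauss(0, 1)
    x.append(xi)
    y.append(yi)

# Simple regression of y on x alone (z is omitted on purpose).
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
r2 = 1 - ssr / sst

print(f"true effect of x: {true_effect}")
print(f"estimated slope:  {b1:.3f}")
print(f"R-squared:        {r2:.3f}")
```

In this setup the estimated slope absorbs the effect of the omitted `z` and lands far from the true effect, while the R-squared is still very high. A tight fit and a biased slope can coexist.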

Formulas

Sum of squares decomposition

$SST = SSE + SSR$

where $SST = \sum (y_i - \bar{y})^2$, $SSE = \sum (\hat{y}_i - \bar{y})^2$, and $SSR = \sum \hat{u}_i^2$. The decomposition holds because OLS residuals are uncorrelated with the fitted values.
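
The decomposition, and the orthogonality condition behind it, can be verified on any sample. The sketch below uses a small hypothetical dataset (the numbers are made up) and computes each sum of squares directly from its definition.

```python
# Hypothetical sample, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.1, 5.4, 6.2, 8.9, 9.6, 11.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS slope and intercept for a simple regression with an intercept.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar

y_hat = [b0 + b1 * xi for xi in x]               # fitted values
u_hat = [yi - yh for yi, yh in zip(y, y_hat)]    # residuals

sst = sum((yi - y_bar) ** 2 for yi in y)         # total sum of squares
sse = sum((yh - y_bar) ** 2 for yh in y_hat)     # explained sum of squares
ssr = sum(u ** 2 for u in u_hat)                 # residual sum of squares

# Cross term that must vanish for SST = SSE + SSR to hold:
# the sample covariance (times n) between residuals and fitted values.
cross = sum(u * (yh - y_bar) for u, yh in zip(u_hat, y_hat))

print(f"SST = {sst:.6f}")
print(f"SSE + SSR = {sse + ssr:.6f}")
print(f"sum of residuals = {sum(u_hat):.2e}")
print(f"cross term = {cross:.2e}")
```

Expanding $\sum (y_i - \bar{y})^2$ with $y_i - \bar{y} = \hat{u}_i + (\hat{y}_i - \bar{y})$ leaves a cross term equal to twice `cross`; because that term is zero up to floating-point error, the two sums of squares add up to $SST$.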

R-squared

$R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}$

Fraction of the sample variation in $y$ explained by $x$. Always between 0 and 1 in a regression with an intercept.
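
The intercept condition matters. As a sketch, the code below fits the same hypothetical data twice, once with an intercept and once forcing the line through the origin, and evaluates both expressions for R-squared. The data and the helper name `fit_and_r2` are invented for this example.

```python
# Hypothetical data with a downward slope and a large positive intercept,
# so a line forced through the origin fits badly.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.2, 8.9, 8.1, 6.8, 6.0]


def fit_and_r2(x, y, intercept=True):
    """Fit y on x by OLS and return both R-squared expressions."""
    n = len(x)
    y_bar = sum(y) / n
    if intercept:
        x_bar = sum(x) / n
        b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
            (xi - x_bar) ** 2 for xi in x
        )
        b0 = y_bar - b1 * x_bar
    else:
        # Regression through the origin: minimise SSR with b0 fixed at 0.
        b1 = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
        b0 = 0.0
    y_hat = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)
    sse = sum((yh - y_bar) ** 2 for yh in y_hat)
    ssr = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return {"sse_over_sst": sse / sst, "one_minus_ssr_over_sst": 1 - ssr / sst}


with_intercept = fit_and_r2(x, y, intercept=True)
through_origin = fit_and_r2(x, y, intercept=False)

print("with intercept:", with_intercept)
print("through origin:", through_origin)
```

With an intercept the two expressions agree and fall inside $[0, 1]$. Without one, $SST = SSE + SSR$ no longer holds, so the two expressions disagree and neither is guaranteed to stay in that range.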

Worked examples

Scenario

A student runs `regress wage educ` and Stata reports `R-squared = 0.165`. The student worries the model is broken.

Solution

An $R^2$ of 0.165 means education explains about 16.5% of the sample variation in wages. The remaining 83.5% reflects experience, ability, occupation, and other factors in $u$ . This is typical for cross-sectional wage data and does not invalidate the estimated return to schooling. The slope can still be precisely estimated and economically interpretable despite the modest $R^2$ .
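
To see how a modest $R^2$ and a precisely estimated slope can coexist, here is a simulation sketch. It does not use real wage data: the sample size, the true slope of 0.5, and the error standard deviation are assumptions, picked so that the population R-squared is roughly 0.165.

```python
import math
import random

random.seed(1)  # fixed seed so the run is reproducible

n = 5000
true_slope = 0.5   # assumed slope in the simulated population
sigma_u = 1.125    # assumed error s.d.; gives a population R-squared near 0.165

x = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + true_slope * xi + sigma_u * random.gauss(0, 1) for xi in x]

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar

sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
r2 = 1 - ssr / sst

# Usual OLS standard error of the slope (homoskedastic errors assumed).
se_b1 = math.sqrt((ssr / (n - 2)) / s_xx)
t_stat = b1 / se_b1

print(f"R-squared:       {r2:.3f}")
print(f"estimated slope: {b1:.3f} (true value {true_slope})")
print(f"standard error:  {se_b1:.4f}")
print(f"t statistic:     {t_stat:.1f}")
```

Most of the variation in `y` is left unexplained, yet the slope estimate sits close to the value used to generate the data and has a small standard error, because precision depends on the sample size and the variation in `x` as well as on the error variance.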

Note: Judging a model by $R^2$ alone is a beginner habit. The credibility of the slope depends on the error assumptions, not on $R^2$.

Common mistakes

✗ A high $R^2$ means the model is correct or the estimate is causal. R-squared measures how tightly the data fit the line, not whether $x$ causes $y$. A regression can have a high $R^2$ and still suffer severe omitted variable bias.
✗ A low $R^2$ means the regression is useless. Many valid microeconometric studies report $R^2$ values below 0.2. The key estimate can be statistically and economically significant even when most of the variation in $y$ is unexplained.
✗ A variable is only worth adding if it raises $R^2$ noticeably. Goodness of fit and the quality of a causal estimate are different goals. Including a control to reduce bias matters even if it barely moves $R^2$, and chasing $R^2$ can introduce bias.
✗ R-squared can be negative or exceed one. In a regression with an intercept estimated by OLS, $R^2$ is bounded in $[0, 1]$ because $SSE$ and $SSR$ are non-negative and sum to $SST$.

Revision bullets

• Decomposition $SST = SSE + SSR$ (total = explained + residual)
• $R^2 = SSE/SST = 1 - SSR/SST$, the explained share of variation in $y$
• $R^2$ always lies in $[0, 1]$ with an intercept
• High $R^2$ does not imply correct specification or causality
• Low $R^2$ is normal in cross-sectional micro data and not a defect

Quick check

An R-squared of 0.30 in a wage regression tells you that:

Which statement about R-squared is correct?


Sources

Wooldridge (2019), Ch. 2.3
Wooldridge, Jeffrey M. Introductory Econometrics: A Modern Approach. 7th ed. Cengage Learning, 2019. ISBN 978-1-337-55886-0.
Section 2.3 derives the SST = SSE + SSR decomposition and defines R-squared as the fraction of explained variation.
Wooldridge (2019), §2.6 (interpreting fit)
Wooldridge, Jeffrey M. Introductory Econometrics: A Modern Approach. 7th ed. Cengage Learning, 2019.
Warns against over-reliance on R-squared and notes that low values are common and acceptable in applied work.

How to cite this page

Dr. Phil's Quant Lab. (2026). Goodness of Fit and R-squared. Derivatives Atlas. https://phucnguyenvan.com/concept/efm-goodness-of-fit
