3. Three Interpretations of Linear Regression
Chapter theme: three interpretations of the linear regression \(Y=\mathbf X'\boldsymbol\beta+u\). The same equation (3.1) carries three different meanings, depending on what is assumed about the error \(u\).
- §3.1 Linear conditional expectation: assume \(\mathbb E[Y\mid\mathbf X]=\mathbf X'\boldsymbol\beta\) and let \(u=Y-\mathbb E[Y\mid\mathbf X]\). Then \(\mathbb E[u\mid\mathbf X]=0\) by construction. This is the strongest exogeneity condition: \(u\) is mean independent of \(\mathbf X\) and hence orthogonal to it, \(\mathbb E[\mathbf Xu]=0\). \(\boldsymbol\beta\) merely summarizes the joint distribution and has no causal meaning.
- §3.2 Best linear approximation / best linear predictor: do not assume that \(\mathbb E[Y\mid\mathbf X]\) is linear; instead take the linear function that minimizes mean-squared error, either \(\min_b\mathbb E[(\mathbb E[Y\mid\mathbf X]-\mathbf X'b)^2]\) (3.2) or \(\min_b\mathbb E[(Y-\mathbf X'b)^2]\) (3.3). Proposition 3.1 shows that the two problems have the same solutions. The first-order condition gives \(\mathbb E[\mathbf Xu]=0\) (3.5), which is orthogonality only and is weaker than mean independence. \(\boldsymbol\beta\) still has no causal meaning.
- §3.3 Causal interpretation: posit a structural equation \(Y=g(\mathbf X,u)\). When \(g\) is linear, \(\partial g/\partial X_i=\beta_i\) is "the causal effect on \(Y\) of a unit change in \(X_i\), holding everything else fixed". Here \(\boldsymbol\beta\) is causal, but the only restriction available is the normalization \(\mathbb E[u]=0\); \(\mathbb E[\mathbf Xu]\) and \(\mathbb E[u\mid\mathbf X]\) are generally nonzero.
3. The Linear Model
Let \((Y,\mathbf X,u)\) be a random vector with \(Y,u\in\mathbb R\) and \(\mathbf X\in\mathbb R^{k+1}\). The linear model is
$$Y=\mathbf X'\boldsymbol\beta+u \tag{3.1}$$
where \(\mathbf X=(\underbrace{X_0}_{=1},X_1,\dots,X_k)'\) and \(\boldsymbol\beta=(\beta_0,\beta_1,\dots,\beta_k)'\). Here \(\beta_0\) is the intercept, \(\beta_1,\dots,\beta_k\) are the slopes, \(X_0,X_1,\dots,X_k\) are the independent variables (with \(X_0\equiv1\)), \(Y\) is the dependent variable, and \(u\) is the error term. The same equation (3.1) admits three interpretations.
3.1 Linear Conditional Expectation
This interpretation assumes \(\mathbb E[Y\mid\mathbf X]=\mathbf X'\boldsymbol\beta\), that is, the true conditional expectation is exactly linear. Define \(u=Y-\mathbb E[Y\mid\mathbf X]\); then (3.1) becomes
$$Y=\mathbb E[Y\mid\mathbf X]+u=\mathbf X'\boldsymbol\beta+u$$
so (3.1) is read as the linear conditional expectation of \(Y\) given \(\mathbf X\). Key points:
- \(\boldsymbol\beta\) is given no causal interpretation; it merely summarizes a feature of the joint distribution of \(Y\) and \(\mathbf X\).
- By construction, \(\mathbb E[u\mid\mathbf X]=\mathbb E[(Y-\mathbb E[Y\mid\mathbf X])\mid\mathbf X]=\mathbb E[Y\mid\mathbf X]-\mathbb E[Y\mid\mathbf X]=0\).
- By the law of iterated expectations, \(\mathbb E[u]=\mathbb E[\mathbb E[u\mid\mathbf X]]=\mathbb E[0]=0\).
- Hence \(u\) is mean independent of \(\mathbf X\): \(\mathbb E[u\mid\mathbf X]=\mathbb E[u]=0\).
- \(\mathbb E[u\mid\mathbf X]=0\) further implies that \(\mathbf X\) and \(u\) are orthogonal: \(\mathbb E[\mathbf Xu]=\mathbb E[\mathbb E[\mathbf Xu\mid\mathbf X]]=\mathbb E[\mathbf X\,\mathbb E[u\mid\mathbf X]]=0\).
This is the strongest assumption on \(u\) among the three interpretations (mean independence ⇒ orthogonality).
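The properties listed above can be checked numerically. The following is a minimal sketch, assuming a hypothetical data-generating process with \(\mathbb E[Y\mid X_1]=1+2X_1\) and heteroskedastic noise; the coefficients, sample size, and quartile binning are illustrative choices, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical data-generating process with an exactly linear conditional mean:
# E[Y | X1] = 1 + 2*X1. The noise is heteroskedastic on purpose, which is
# allowed because only the conditional mean of u is restricted.
x1 = rng.normal(size=n)
beta = np.array([1.0, 2.0])
X = np.column_stack([np.ones(n), x1])      # X0 = 1 is the intercept column
y = X @ beta + (1.0 + np.abs(x1)) * rng.normal(size=n)

u = y - X @ beta                           # u = Y - E[Y | X]

mean_u = u.mean()                          # sample analogue of E[u]
moment_Xu = X.T @ u / n                    # sample analogue of E[X u]

# Crude check of mean independence: average u within quartile bins of X1.
edges = np.quantile(x1, [0.25, 0.5, 0.75])
bins = np.digitize(x1, edges)              # bin labels 0..3
bin_means = np.array([u[bins == k].mean() for k in range(4)])

print("E[u]        ~", mean_u)
print("E[X u]      ~", moment_Xu)
print("E[u | bin]  ~", bin_means)
```

All three printed quantities should be close to zero up to sampling noise, even though the variance of \(u\) changes with \(X_1\).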
3.2 Best Linear Approximation / Best Linear Predictor
This interpretation does not assume that \(\mathbb E[Y\mid\mathbf X]\) is linear; instead it picks the "best" linear function \(\mathbf X'b\) in the mean-squared-error sense. There are two candidate optimization problems.
Best linear approximation: \(\mathbb E[Y\mid\mathbf X]\approx\mathbf X'b\), where \(b\) solves
$$\min_{b\in\mathbb R^{k+1}}\mathbb E[(\mathbb E[Y\mid\mathbf X]-\mathbf X'b)^2] \tag{3.2}$$
Best linear predictor: \(Y\approx\mathbf X'b\), where \(b\) solves
$$\min_{b\in\mathbb R^{k+1}}\mathbb E[(Y-\mathbf X'b)^2] \tag{3.3}$$
Proposition 3.1 Any solution to (3.2) is a solution to (3.3), and vice versa.
Proof (Proposition 3.1) Let \(\nu\equiv\mathbb E[Y\mid\mathbf X]-Y\), and add and subtract \(Y\) in (3.2):
$$\mathbb E[(\mathbb E[Y\mid\mathbf X]-\mathbf X'b)^2]=\mathbb E[(\underbrace{\mathbb E[Y\mid\mathbf X]-Y}_{\nu}+Y-\mathbf X'b)^2]=\mathbb E[\nu^2]+2\mathbb E[\nu(Y-\mathbf X'b)]+\mathbb E[(Y-\mathbf X'b)^2]$$
Since \(\nu\) is (minus) the residual of a conditional expectation, \(\mathbb E[\nu\mid\mathbf X]=0\), and the law of iterated expectations gives the orthogonality condition \(\mathbb E[\nu\mathbf X']=\mathbb E[\mathbb E[\nu\mid\mathbf X]\mathbf X']=0\). Expanding the cross term and using this condition:
$$=\mathbb E[\nu^2]+2\mathbb E[\nu Y]-2\mathbb E[\nu\mathbf X'b]+\mathbb E[(Y-\mathbf X'b)^2]=\mathbb E[\nu^2]+2\mathbb E[\nu Y]+\mathbb E[(Y-\mathbf X'b)^2] \tag{3.4}$$
The term \(\mathbb E[\nu^2]+2\mathbb E[\nu Y]\) does not depend on \(b\), so minimizing \(\mathbb E[(\mathbb E[Y\mid\mathbf X]-\mathbf X'b)^2]\) and minimizing \(\mathbb E[(Y-\mathbf X'b)^2]\) yield the same \(b\). \(\blacksquare\)
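Proposition 3.1 and the identity (3.4) can be verified exactly on a small finite population. The sketch below assumes a hypothetical discrete design (four support points for \(X_1\), a two-point conditional distribution for \(Y\)); the names `obj_approx` and `obj_predict` and all numbers are illustrative. Each problem is solved from its own normal equations, which presumes that \(\mathbb E[\mathbf X\mathbf X']\) is invertible.

```python
import numpy as np

# Hypothetical discrete population: X1 in {0, 1, 2, 3}, and given X1 = x the
# outcome is m(x) + s(x) or m(x) - s(x) with equal probability, so that
# E[Y | X1 = x] = m(x) exactly. m is deliberately nonlinear.
x_vals = np.array([0.0, 1.0, 2.0, 3.0])
p_x = np.array([0.1, 0.2, 0.3, 0.4])
m = x_vals ** 2                        # conditional mean E[Y | X]
s = np.array([1.0, 0.5, 2.0, 1.5])     # conditional spread of Y around m

# Enumerate the joint support of (X1, Y) together with its probabilities.
x_j = np.repeat(x_vals, 2)
m_j = np.repeat(m, 2)
y_j = m_j + np.tile([1.0, -1.0], 4) * np.repeat(s, 2)
p_j = np.repeat(p_x, 2) / 2.0
X_j = np.column_stack([np.ones_like(x_j), x_j])   # rows are X' = (1, X1)

def obj_approx(b):
    """Population objective (3.2): E[(E[Y|X] - X'b)^2]."""
    return np.sum(p_j * (m_j - X_j @ b) ** 2)

def obj_predict(b):
    """Population objective (3.3): E[(Y - X'b)^2]."""
    return np.sum(p_j * (y_j - X_j @ b) ** 2)

# Solve each problem from its own normal equations E[XX'] b = E[X * target].
Exx = (X_j * p_j[:, None]).T @ X_j
b_approx = np.linalg.solve(Exx, X_j.T @ (p_j * m_j))
b_predict = np.linalg.solve(Exx, X_j.T @ (p_j * y_j))

# The gap between the two objectives should not depend on b, as in (3.4).
trial_bs = [np.array([0.0, 0.0]), np.array([1.0, -2.0]), np.array([-3.0, 4.0])]
gaps = np.array([obj_approx(b) - obj_predict(b) for b in trial_bs])

print("solution of (3.2):", b_approx)
print("solution of (3.3):", b_predict)
print("objective gaps   :", gaps)
```

The two solutions coincide, and the gap is the same for every trial \(b\). In this design the constant \(\mathbb E[\nu^2]+2\mathbb E[\nu Y]\) equals \(-\mathbb E[\nu^2]\), minus the expected conditional variance of \(Y\).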
Let \(\boldsymbol\beta\) solve (3.2)/(3.3) and define \(u=Y-\mathbf X'\boldsymbol\beta\); then (3.1) is read as the best linear approximation / best linear predictor \(\mathrm{BLP}(Y\mid\mathbf X)\). Key points:
- \(\boldsymbol\beta\) still has no causal meaning.
- The first-order condition of (3.3): $$\mathbf 0=2\mathbb E[\mathbf X(Y-\mathbf X'\boldsymbol\beta)]\Rightarrow\mathbb E[\mathbf Xu]=0 \tag{3.5}$$
- Since the first element of \(\mathbf X\) is \(1\), the first row of (3.5) reads \(\mathbb E[u]=0\), so the error has mean zero without any further normalization.
This interpretation assumes less about \(u\) than §3.1: it requires only orthogonality \(\mathbb E[\mathbf Xu]=0\), not mean independence \(\mathbb E[u\mid\mathbf X]=0\).
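The gap between orthogonality and mean independence can be made concrete. The sketch below assumes a hypothetical population with \(\mathbb E[Y\mid X_1]=X_1^2\) on four support points and computes the best linear predictor as \(\boldsymbol\beta=\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY]\); this closed form follows from (3.5) under the added assumption that \(\mathbb E[\mathbf X\mathbf X']\) is invertible.

```python
import numpy as np

# Hypothetical population: X1 in {0, 1, 2, 3} with a nonlinear conditional mean.
x_vals = np.array([0.0, 1.0, 2.0, 3.0])
p_x = np.array([0.1, 0.2, 0.3, 0.4])
m = x_vals ** 2                            # E[Y | X1 = x]

X = np.column_stack([np.ones_like(x_vals), x_vals])   # rows are X' = (1, X1)
Exx = (X * p_x[:, None]).T @ X             # E[X X']
Exy = X.T @ (p_x * m)                      # E[X Y] = E[X E[Y | X]]
beta = np.linalg.solve(Exx, Exy)           # BLP coefficients implied by (3.5)

# With u = Y - X'beta, the conditional mean of u at each support point is
# E[u | X1 = x] = m(x) - X'beta, and E[X u] = E[X E[u | X]].
cond_mean_u = m - X @ beta
moment_Xu = X.T @ (p_x * cond_mean_u)

print("beta        :", beta)
print("E[u | X1=x] :", cond_mean_u)
print("E[X u]      :", moment_Xu)
```

Here \(\mathbb E[\mathbf Xu]\) is zero up to floating-point error, including its first component \(\mathbb E[u]\), while \(\mathbb E[u\mid X_1=x]\) is nonzero at every support point: the error is orthogonal to \(\mathbf X\) but not mean independent of it.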
3.3 Causal Interpretation
This interpretation posits \(Y=g(\mathbf X,u)\), where \(\mathbf X\) collects the observed determinants of \(Y\), \(u\) collects the unobserved determinants, and the function \(g(\cdot,\cdot)\) describes the structure by which \(Y\) is determined. "The effect on \(Y\) of a unit change in \(X_i\), holding other factors constant" is \(\frac{\partial g(\mathbf X,u)}{\partial X_i}\). If we assume the linear structure \(g(\mathbf X,u)=\mathbf X'\boldsymbol\beta+u\), then
$$\frac{\partial g(\mathbf X,u)}{\partial X_i}=\beta_i$$
so \(\boldsymbol\beta\) has a causal meaning. Note that \(\mathbb E[u]\) need not be zero, but it can be normalized to zero by absorbing it into the intercept, which leaves (3.1) and all slopes unchanged:
$$\beta_0^{\text{new}}=\beta_0+\mathbb E[u],\qquad u^{\text{new}}=u-\mathbb E[u]$$
Strength of the assumption on \(u\) across the three interpretations
Beyond \(\mathbb E[u]=0\), the causal interpretation imposes nothing on \(u\): \(\mathbb E[\mathbf Xu]\) and \(\mathbb E[u\mid\mathbf X]\) are generally both nonzero. This is precisely the difficulty of causal identification: the structural parameter \(\boldsymbol\beta\) is causal, but \(\mathbf X\) may be correlated with the unobserved \(u\) (endogeneity), so \(\boldsymbol\beta\) cannot simply be estimated by OLS.
- §3.1 Linear conditional expectation: \(\mathbb E[u\mid\mathbf X]=0\) (strongest, mean independence); \(\boldsymbol\beta\) is not causal.
- §3.2 Best linear predictor: \(\mathbb E[\mathbf Xu]=0\) (orthogonality only); \(\boldsymbol\beta\) is not causal.
- §3.3 Causal: only \(\mathbb E[u]=0\) (a normalization); \(\boldsymbol\beta\) is causal, but \(\mathbf X\) may be endogenous, that is, correlated with \(u\).
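The endogeneity problem can be illustrated with a small simulation. This is a sketch under a hypothetical structural model \(Y=1+2X_1+u\) in which \(X_1=v+u\) with \(v\) and \(u\) independent standard normals, so \(X_1\) is correlated with the unobserved \(u\) by construction; the numbers and the use of `np.linalg.lstsq` as the OLS solver are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical structural model: Y = 1 + 2*X1 + u, where the causal slope is 2.
beta_struct = np.array([1.0, 2.0])

v = rng.normal(size=n)            # exogenous variation in X1
u = rng.normal(size=n)            # unobserved determinant of Y, E[u] = 0
x1 = v + u                        # X1 depends on u, so X1 is endogenous

X = np.column_stack([np.ones(n), x1])
y = X @ beta_struct + u

# OLS is the sample analogue of the best linear predictor of Y given X.
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Sample analogue of E[X u] evaluated at the structural error.
moment_Xu = X.T @ u / n

print("structural beta:", beta_struct)
print("OLS estimate   :", b_ols)
print("E[X u]         ~", moment_Xu)
```

In this design \(\mathbb E[X_1u]=1\) and \(\operatorname{Var}(X_1)=2\), so the population best-linear-predictor slope is \(2+1/2=2.5\) rather than the structural slope \(2\). OLS recovers the §3.2 object, which here differs from the causal parameter.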