4. Linear Regression with Exogenous Variables
Chapter theme: population properties, OLS estimation, and inference for the exogenous (\(\mathbb E[\mathbf Xu]=0\)) linear regression. §4.1 Population properties: from \(\mathbb E[\mathbf Xu]=0\) solve \(\boldsymbol\beta=\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY]\) (4.2) (no perfect collinearity ⟺ \(\mathbb E[\mathbf X\mathbf X']\) invertible, Lemma 4.1); the Frisch-Waugh-Lovell "partial out then regress" for a subvector (4.5, Proposition 4.1); omitted-variable bias (OVB) and measurement error / attenuation bias. §4.2 OLS estimation: \(\hat{\boldsymbol\beta}_n=(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}(\frac1n\sum\mathbf X_iY_i)\) (4.10); matrix form \(\hat{\boldsymbol\beta}_n=(\mathbb X'\mathbb X)^{-1}\mathbb X'\mathbb Y\) (4.15) = projection (projection matrix \(\mathbb P\), residual-maker \(\mathbb M\), both idempotent and symmetric); subvector FWL (4.17); goodness of fit \(R^2\), adjusted \(\bar R^2\) (description of fit only, no causal meaning). §4.3 OLS properties: mean independence \(\mathbb E[u\mid\mathbf X]=0\) ⇒ unbiased (Proposition 4.2); plus homoskedasticity ⇒ Gauss-Markov BLUE (Proposition 4.3); consistency (Proposition 4.4); limiting distribution \(\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}N(0,\Omega)\) (Proposition 4.5), with \(\Omega=\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1}\) under homoskedasticity (Proposition 4.6); a robust estimator of \(\Omega\) (4.25). §4.4 Hypothesis testing: a single linear restriction (\(t\) test), multiple linear restrictions (Wald test), the Lagrange multiplier (LM) test (equivalent to Wald), and nonlinear restrictions (delta method). §4.5 GLS: under heteroskedasticity OLS is no longer efficient, and GLS \(\hat{\boldsymbol\beta}=(\mathbb X'\boldsymbol\Sigma^{-1}\mathbb X)^{-1}\mathbb X'\boldsymbol\Sigma^{-1}\mathbb Y\) is the BLUE under heteroskedasticity; use FGLS when the variance is unobservable.
4.1 Population Properties of β
Let \((Y,\mathbf X,u)\) be as defined in Chapter 3, and assume \(\mathbf X\) has no perfect collinearity, \(\mathbb E[\mathbf X\mathbf X']\) exists, \(\mathbb E[Y^2]<\infty\), and \(\mathbb E[\mathbf Xu]=\mathbf 0\) (exogeneity).
4.1.1 Solving for \(\boldsymbol\beta\).
Definition 4.1 / Lemma 4.1 Perfect collinearity: \(\mathbf X\) is perfectly collinear if \(\exists\mathbf c\ne\mathbf 0,\mathbf c\in\mathbb R^{k+1}\) with \(\mathbb P(\mathbf c'\mathbf X=0)=1\). Lemma 4.1: \(\mathbb E[\mathbf X\mathbf X']\) is invertible if and only if \(\mathbf X\) has no perfect collinearity.
Proof (Lemma 4.1) Equivalent statement: \(\mathbb E[\mathbf X\mathbf X']\) is not invertible iff there is perfect collinearity. Not invertible ⇒ collinear: not invertible ⇒ \(\exists\mathbf c\ne\mathbf 0\) with \(\mathbb E[\mathbf X\mathbf X']\mathbf c=\mathbf 0\), so \(0=\mathbf c'\mathbb E[\mathbf X\mathbf X']\mathbf c=\mathbb E[(\mathbf c'\mathbf X)^2]\), hence \(\mathbb P(\mathbf c'\mathbf X=0)=1\). Collinear ⇒ not invertible: if \(\mathbb P(\mathbf c'\mathbf X=0)=1\), then \(\mathbf c'\mathbb E[\mathbf X\mathbf X']=\mathbb E[\underbrace{\mathbf c'\mathbf X}_{=0}\mathbf X']=\mathbf 0\), so \(\mathbb E[\mathbf X\mathbf X']\) is not invertible. \(\blacksquare\)
From \(\mathbb E[\mathbf Xu]=\mathbf 0\) with \(u=Y-\mathbf X'\boldsymbol\beta\):
$$\mathbf 0=\mathbb E[\mathbf Xu]=\mathbb E[\mathbf X(Y-\mathbf X'\boldsymbol\beta)]=\mathbb E[\mathbf XY]-\mathbb E[\mathbf X\mathbf X']\boldsymbol\beta\Rightarrow\mathbb E[\mathbf X\mathbf X']\boldsymbol\beta=\mathbb E[\mathbf XY] \tag{4.1}$$
$$\Rightarrow\boldsymbol\beta=\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY] \tag{4.2}$$
Remark 4.1 If \(\mathbb E[\mathbf X\mathbf X']\) is not invertible, (4.1) has multiple solutions; any two solutions \(\hat{\boldsymbol\beta},\tilde{\boldsymbol\beta}\) satisfy \(\mathbb P(\mathbf X'\hat{\boldsymbol\beta}=\mathbf X'\tilde{\boldsymbol\beta})=1\) (the same fitted values). This is harmless for prediction / best linear approximation, but makes the causal interpretation difficult.
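Lemma 4.1 and Remark 4.1 can be illustrated numerically. The following is a minimal sketch, assuming NumPy and a made-up design in which the third regressor is an exact linear combination of the first two; the names `X`, `c`, `b_a`, and `b_b` are illustrative only. It checks that the sample analog of \(\mathbb E[\mathbf X\mathbf X']\) is rank deficient and that two different solutions of the normal equations (4.1) give the same fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Hypothetical design: third column = 2 + 3 * x1, so c = (2, 3, -1)' satisfies c'X = 0.
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, 2.0 + 3.0 * x1])
y = 1.0 + 0.5 * x1 + rng.normal(size=n)

Sxx = X.T @ X / n   # sample analog of E[XX']
Sxy = X.T @ y / n   # sample analog of E[XY]
c = np.array([2.0, 3.0, -1.0])

# Perfect collinearity shows up as a rank-deficient second-moment matrix.
rank = np.linalg.matrix_rank(Sxx, tol=1e-10)

# One solution of the normal equations (minimum norm), and a second one shifted along c.
b_a, *_ = np.linalg.lstsq(X, y, rcond=None)
b_b = b_a + 0.7 * c

same_fit = np.allclose(X @ b_a, X @ b_b)
print("rank of Sxx:", rank)
print("both solve the normal equations:",
      np.allclose(Sxx @ b_a, Sxy), np.allclose(Sxx @ b_b, Sxy))
print("same fitted values:", same_fit)
```

The two coefficient vectors differ, so individual coefficients are not pinned down, while the fitted values coincide.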
4.1.2 Solving for a subvector of \(\boldsymbol\beta\) (Frisch-Waugh-Lovell). Partition \(\mathbf X\) into \((\mathbf X_1)_{k_1\times1}\), \((\mathbf X_2)_{k_2\times1}\) and \(\boldsymbol\beta\) into \(\boldsymbol\beta_1,\boldsymbol\beta_2\),
$$Y=\mathbf X'\boldsymbol\beta+u=\mathbf X_1'\boldsymbol\beta_1+\mathbf X_2'\boldsymbol\beta_2+u \tag{4.3}$$
From \(\mathbb E[\mathbf Xu]=\mathbf 0\), \(\mathbb E[\mathbf X_1u]=\mathbf 0_{k_1\times1}\), \(\mathbb E[\mathbf X_2u]=\mathbf 0_{k_2\times1}\) (4.4). To recover only \(\boldsymbol\beta_1\), in three steps:
- Take \(Y\)'s residual \(\tilde Y=Y-\mathrm{BLP}(Y\mid\mathbf X_2)\);
- Take \(\mathbf X_1\)'s residual \(\tilde{\mathbf X}_1=\mathbf X_1-\mathrm{BLP}(\mathbf X_1\mid\mathbf X_2)\);
- Regress \(\tilde Y=\tilde{\mathbf X}_1'\boldsymbol\beta_1+\tilde u\), giving \(\tilde{\boldsymbol\beta}=\mathbb E[\tilde{\mathbf X}_1\tilde{\mathbf X}_1']^{-1}\mathbb E[\tilde{\mathbf X}_1\tilde Y]\) (4.5).
Proposition 4.1 \(\tilde{\boldsymbol\beta}=\boldsymbol\beta_1\), where \(\tilde{\boldsymbol\beta}\) is defined in (4.5) and \(\boldsymbol\beta_1\) in (4.3). That is, "partial \(\mathbf X_2\) out of \(Y\) and \(\mathbf X_1\) first, then regress" recovers exactly \(\boldsymbol\beta_1\) (the population version of the Frisch-Waugh-Lovell theorem).
Remark 4.2 If \(\mathbf X_2\) contains a constant, then \(\tilde{\mathbf X}_1\) has zero mean (by the first-order condition for the constant in \(\mathrm{BLP}(\mathbf X_1\mid\mathbf X_2)\)), so \(\mathbb E[\tilde{\mathbf X}_1\tilde{\mathbf X}_1']=\mathrm{Var}[\tilde{\mathbf X}_1]\); if \(\mathbf X_2\) contains no constant, this identity need not hold.
Corollary 4.1 (\(\mathbf X_2\) a constant). If \(\mathbf X_2=1\), then \(\mathrm{BLP}(Y\mid1)=\mathbb E[Y]\), \(\mathrm{BLP}(\mathbf X_1\mid1)=\mathbb E[\mathbf X_1]\), so \(\tilde Y=Y-\mathbb E[Y]\), \(\tilde{\mathbf X}_1=\mathbf X_1-\mathbb E[\mathbf X_1]\) (4.6); substituting into (4.5):
$$\boldsymbol\beta_1=\mathrm{Var}(\mathbf X_1)^{-1}\mathrm{Cov}(\mathbf X_1,Y) \tag{4.7}$$
i.e., after removing the constant, the slope coefficient equals "covariance over variance."
4.1.3 Omitted-variable bias (OVB). True model \(Y=\beta_0+\mathbf X_1'\boldsymbol\beta_1+\mathbf X_2'\boldsymbol\beta_2+u\); if we omit \(\mathbf X_2\) and wrongly estimate \(Y=\beta_0^*+\mathbf X_1'\boldsymbol\beta_1^*+u^*\), then by (4.7):
$$\boldsymbol\beta_1^*=\boldsymbol\beta_1+\underbrace{\mathrm{Var}(\mathbf X_1)^{-1}\mathrm{Cov}(\mathbf X_1,\mathbf X_2)\boldsymbol\beta_2}_{\text{omitted variable bias}}$$
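The bias is the coefficient from regressing \(\mathbf X_2\) on \(\mathbf X_1\), namely \(\mathrm{Var}(\mathbf X_1)^{-1}\mathrm{Cov}(\mathbf X_1,\mathbf X_2)\), times the true effect \(\boldsymbol\beta_2\) of \(\mathbf X_2\). There is no bias when \(\mathbf X_1\) and \(\mathbf X_2\) are uncorrelated (\(\mathrm{Cov}(\mathbf X_1,\mathbf X_2)=\mathbf 0\)) or when \(\boldsymbol\beta_2=\mathbf 0\).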
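The omitted-variable formula can be checked by simulation. This is a minimal sketch, assuming NumPy, scalar \(X_1\) and \(X_2\), and made-up parameter values (\(\beta_1=1\), \(\beta_2=2\), \(\mathrm{Cov}(X_1,X_2)=0.5\), \(\mathrm{Var}(X_1)=1.25\)); the helper `slope` is a hypothetical name for the "covariance over variance" formula (4.7).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
beta1, beta2 = 1.0, 2.0   # illustrative true coefficients

x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)   # Cov(x1, x2) = 0.5, Var(x1) = 1.25
u = rng.normal(size=n)
y = 0.3 + beta1 * x1 + beta2 * x2 + u

def slope(x, y):
    """Simple-regression slope Cov(x, y) / Var(x), as in (4.7)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

short_slope = slope(x1, y)          # regression that omits x2
delta = slope(x1, x2)               # coefficient from regressing x2 on x1
predicted = beta1 + delta * beta2   # beta1 plus the omitted-variable bias

print(f"short-regression slope: {short_slope:.3f}")
print(f"beta1 + bias term:      {predicted:.3f}")
```

Under these assumed values the population short-regression slope is \(1+(0.5/1.25)\cdot2=1.8\), and the two printed numbers should both be close to it.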
4.1.4 Measurement error (attenuation bias). Model \(Y=\beta_0+\mathbf X_1'\boldsymbol\beta_1+u\) (4.8), but \(\mathbf X_1\) is unobservable and we observe only \(\hat{\mathbf X}_1=\mathbf X_1+\mathbf v\). Classical measurement-error assumptions: \(\mathbb E[\mathbf v]=\mathbf 0\), \(\mathrm{Cov}(u,\mathbf v)=\mathbf 0\), \(\mathrm{Cov}(\mathbf X_1,\mathbf v)=\mathbf 0\). Regressing \(Y=\beta_0^*+\hat{\mathbf X}_1'\boldsymbol\beta_1^*+u^*\) with \(\hat{\mathbf X}_1\) in place of \(\mathbf X_1\), by Corollary 4.1:
$$\boldsymbol\beta_1^*=\mathrm{Var}(\hat{\mathbf X}_1)^{-1}\mathrm{Cov}(\hat{\mathbf X}_1,Y)=(\mathrm{Var}(\mathbf X_1)+\mathrm{Var}(\mathbf v))^{-1}\mathrm{Var}(\mathbf X_1)\boldsymbol\beta_1$$
In one dimension \(\beta_1^*=\frac{\mathrm{Var}(X_1)}{\mathrm{Var}(X_1)+\mathrm{Var}(v)}\beta_1\), where the factor \(\frac{\mathrm{Var}(X_1)}{\mathrm{Var}(X_1)+\mathrm{Var}(v)}\in(0,1)\) whenever \(\mathrm{Var}(v)>0\), so \(\beta_1^*\) is shrunk toward zero. This is called attenuation bias.
Example 4.2 More generally, partition \(\mathbf X\) into \(X_0=1\), \(X_1\in\mathbb R\) (mismeasured), \(\mathbf X_2\in\mathbb R^{k_2}\) (measured exactly), with \(Y=\beta_0+\beta_1X_1+\mathbf X_2'\boldsymbol\beta_2+u\). Use FWL to partial out \(\mathbf X_2\) first (\(\tilde{\hat X}_1=\hat X_1-\mathrm{BLP}(\hat X_1\mid1,\mathbf X_2)\), \(\tilde Y=Y-\mathrm{BLP}(Y\mid1,\mathbf X_2)\)). Provided \(v\) is also uncorrelated with \(\mathbf X_2\), so that \(\mathrm{BLP}(v\mid1,\mathbf X_2)=0\), one shows \(\tilde{\hat X}_1=\tilde X_1+v\), and the attenuation conclusion still holds: \(\beta_1^*=\frac{\mathrm{Var}(\tilde X_1)}{\mathrm{Var}(\tilde X_1)+\mathrm{Var}(v)}\beta_1\).
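A short simulation makes the attenuation factor concrete. This sketch assumes NumPy, a scalar regressor, and illustrative values \(\mathrm{Var}(X_1)=1\), \(\mathrm{Var}(v)=0.25\), \(\beta_1=2\), so the factor is \(1/1.25=0.8\); all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
beta1 = 2.0

x = rng.normal(scale=1.0, size=n)   # true regressor, Var = 1
v = rng.normal(scale=0.5, size=n)   # classical measurement error, Var = 0.25
u = rng.normal(size=n)
y = 1.0 + beta1 * x + u
x_obs = x + v                       # what is actually observed

def slope(x, y):
    """Covariance over variance, as in (4.7)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

slope_true = slope(x, y)        # infeasible regression on the true regressor
slope_obs = slope(x_obs, y)     # feasible regression on the mismeasured regressor
factor = 1.0 / (1.0 + 0.25)     # Var(X1) / (Var(X1) + Var(v))

print(f"slope with true regressor:        {slope_true:.3f}")
print(f"slope with mismeasured regressor: {slope_obs:.3f}")
print(f"attenuation factor * beta1:       {factor * beta1:.3f}")
```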
4.2 Ordinary Least Squares (OLS) Estimator
4.2.1 Estimating \(\boldsymbol\beta\). Under the population properties (\(\mathbb E[\mathbf Xu]=0\), \(\mathbb E[\mathbf X\mathbf X']<\infty\), no perfect collinearity, \((Y,\mathbf X)\sim P\)) and an i.i.d. sample, the natural sample analog of \(\boldsymbol\beta=\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY]\) (4.9) is
$$\hat{\boldsymbol\beta}_n=\Big(\frac1n\sum_{i=1}^n\mathbf X_i\mathbf X_i'\Big)^{-1}\Big(\frac1n\sum_{i=1}^n\mathbf X_iY_i\Big) \tag{4.10}$$
the ordinary least squares (OLS) estimator, so named because it minimizes the average squared distance between the observed \(Y_i\) and the fitted \(\hat Y_i=\mathbf X_i'\hat{\boldsymbol\beta}_n\): \(\hat{\boldsymbol\beta}_n=\arg\min_{\mathbf b}\frac1n\sum(Y_i-\mathbf X_i'\mathbf b)^2\). The first-order condition \(\frac1n\sum\mathbf X_i(Y_i-\mathbf X_i'\hat{\boldsymbol\beta}_n)=\mathbf 0\) (4.11) gives (4.12) (same as (4.10)). By the WLLN and the CMT, \(\det(\frac1n\sum\mathbf X_i\mathbf X_i')\xrightarrow{p}\det(\mathbb E[\mathbf X\mathbf X'])\ne0\), so \(\frac1n\sum\mathbf X_i\mathbf X_i'\) is invertible with probability approaching one as \(n\to\infty\). The residual \(\hat u_i=Y_i-\hat Y_i\) satisfies \(\sum\mathbf X_i\hat u_i=\mathbf 0\) (4.13); since \(X_{i,0}=1\), the first-order condition for \(\beta_0\) gives \(\sum\hat u_i=0\) (4.14).
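The sample-analog formula (4.10) and the residual conditions (4.13) and (4.14) can be verified directly. This is a minimal sketch, assuming NumPy and simulated data with arbitrary illustrative coefficients; it compares the moment-based formula with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 500, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # first column is the constant
beta = np.array([1.0, -2.0, 0.5])                            # illustrative true coefficients
y = X @ beta + rng.normal(size=n)

# (4.10): inverse of the sample second-moment matrix times the sample cross moment.
Sxx = X.T @ X / n
Sxy = X.T @ y / n
beta_hat = np.linalg.solve(Sxx, Sxy)

# The same estimator obtained by minimizing the sum of squared residuals.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ beta_hat
print("formula matches least squares:", np.allclose(beta_hat, beta_lstsq))
print("X'u_hat = 0 (4.13):", np.allclose(X.T @ resid, 0.0, atol=1e-8))
print("sum of residuals = 0 (4.14):", abs(resid.sum()) < 1e-8)
```

Solving the linear system with `np.linalg.solve` rather than forming an explicit inverse is a numerical choice, not part of the derivation.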
4.2.2 OLS as a projection. In matrix notation: \(\mathbb X_{n\times(k+1)}=(\mathbf X_1,\dots,\mathbf X_n)'\) (each row is one observation transposed), \(\mathbb Y=(Y_1,\dots,Y_n)'\), \(\mathbb U=(u_1,\dots,u_n)'\), \(\hat{\mathbb Y}=\mathbb X\hat{\boldsymbol\beta}_n\), \(\hat{\mathbb U}=\mathbb Y-\hat{\mathbb Y}\). Then
$$\hat{\boldsymbol\beta}_n=(\mathbb X'\mathbb X)^{-1}\mathbb X'\mathbb Y \tag{4.15}$$
solves \(\min_{\mathbf b}|\mathbb Y-\mathbb X\mathbf b|^2\). The column space of \(\mathbb X\) is \(\mathrm{Col}[\mathbb X]=\{\mathbb X\mathbf b:\mathbf b\in\mathbb R^{k+1}\}\). Define the projection matrix
$$\mathbb P=\mathbb X(\mathbb X'\mathbb X)^{-1}\mathbb X',\qquad\mathbb P\mathbb Y=\mathbb X\hat{\boldsymbol\beta}_n=\hat{\mathbb Y}$$
\(\mathbb P\mathbb Y\in\mathrm{Col}[\mathbb X]\), so \(\mathbb P\) projects \(\mathbb Y\) onto \(\mathrm{Col}[\mathbb X]\). The residual-maker matrix \(\mathbb M=\mathbb I-\mathbb P=\mathbb I-\mathbb X(\mathbb X'\mathbb X)^{-1}\mathbb X'\) projects \(\mathbb Y\) onto the space orthogonal to \(\mathrm{Col}[\mathbb X]\): \(\mathbb M\mathbb X=\mathbf 0\), \(\mathbb M\mathbb Y=\hat{\mathbb U}\), \(\mathbb M\mathbb U=\hat{\mathbb U}\). Both \(\mathbb P\) and \(\mathbb M\) are idempotent (\(\mathbb P^2=\mathbb P\), \(\mathbb M^2=\mathbb M\)) and symmetric (\(\mathbb P'=\mathbb P\), \(\mathbb M'=\mathbb M\)).
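The algebraic properties of \(\mathbb P\) and \(\mathbb M\) listed above can be checked on a small simulated design. This sketch assumes NumPy; the sample size and coefficients are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 2.0, -1.0])
U = rng.normal(size=n)          # true errors, known only because the data are simulated
Y = X @ beta + U

P = X @ np.linalg.solve(X.T @ X, X.T)   # projection onto Col[X]
M = np.eye(n) - P                       # residual maker
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
U_hat = Y - X @ beta_hat

checks = {
    "P idempotent": np.allclose(P @ P, P),
    "M idempotent": np.allclose(M @ M, M),
    "P symmetric": np.allclose(P, P.T),
    "M symmetric": np.allclose(M, M.T),
    "MX = 0": np.allclose(M @ X, 0.0),
    "PY = fitted values": np.allclose(P @ Y, X @ beta_hat),
    "MY = residuals": np.allclose(M @ Y, U_hat),
    "MU = residuals": np.allclose(M @ U, U_hat),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

Forming the \(n\times n\) matrices explicitly is fine for illustration but is rarely done in practice for large \(n\).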
4.2.3 Estimating a subvector \(\hat{\boldsymbol\beta}_{1,n}\) (sample version of FWL). Partition \(\mathbb X\) into \(\mathbb X_1,\mathbb X_2\), and let \(\mathbb P_j=\mathbb X_j(\mathbb X_j'\mathbb X_j)^{-1}\mathbb X_j'\) and \(\mathbb M_j=\mathbb I-\mathbb P_j\) be the projection and residual-maker matrices built from \(\mathbb X_j\), \(j=1,2\). The sample analog of (4.3) is \(\mathbb Y=\mathbb X_1\hat{\boldsymbol\beta}_{1,n}+\mathbb X_2\hat{\boldsymbol\beta}_{2,n}+\hat{\mathbb U}\). Left-multiply by \(\mathbb M_2\): the \(\mathbb X_2\) term drops out because \(\mathbb M_2\mathbb X_2=0\), and \(\mathbb M_2\hat{\mathbb U}=\hat{\mathbb U}\) because \(\mathbb X_2'\hat{\mathbb U}=0\), so \(\mathbb M_2\mathbb Y=\mathbb M_2\mathbb X_1\hat{\boldsymbol\beta}_{1,n}+\hat{\mathbb U}\). Then left-multiply by \((\mathbb M_2\mathbb X_1)'\) and use \((\mathbb M_2\mathbb X_1)'\hat{\mathbb U}=\mathbb X_1'\hat{\mathbb U}=0\):
$$\hat{\boldsymbol\beta}_{1,n}=[(\mathbb M_2\mathbb X_1)'(\mathbb M_2\mathbb X_1)]^{-1}(\mathbb M_2\mathbb X_1)'(\mathbb M_2\mathbb Y) \tag{4.17}$$
the Frisch-Waugh-Lovell decomposition. By idempotency and symmetry of \(\mathbb M_2\), it simplifies to \(\hat{\boldsymbol\beta}_{1,n}=[\mathbb X_1'\mathbb M_2\mathbb X_1]^{-1}\mathbb X_1'\mathbb M_2\mathbb Y\).
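The sample FWL result (4.17) can be confirmed numerically: the coefficient on \(\mathbb X_1\) from the full regression equals the coefficient from regressing \(\mathbb M_2\mathbb Y\) on \(\mathbb M_2\mathbb X_1\). This is a sketch assuming NumPy and a simulated design in which \(\mathbb X_2\) holds a constant and one control; the helper `ols` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])   # controls, including the constant
X1 = 0.8 * X2[:, [1]] + rng.normal(size=(n, 1))          # regressor of interest, correlated with X2
y = 2.0 * X1[:, 0] + X2 @ np.array([1.0, -1.0]) + rng.normal(size=n)

def ols(X, y):
    """Least-squares coefficients for a regression of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Full regression on (X1, X2); the first entry is the coefficient on X1.
b_full = ols(np.column_stack([X1, X2]), y)

# Partial X2 out of both X1 and y, then regress residual on residual, as in (4.17).
M2 = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)
b1_fwl = ols(M2 @ X1, M2 @ y)

# Simplified form using idempotency and symmetry of M2.
b1_short = np.linalg.solve(X1.T @ M2 @ X1, X1.T @ M2 @ y)

print("full regression:", b_full[:1])
print("FWL (4.17):     ", b1_fwl)
print("simplified form:", b1_short)
```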
4.2.4 Measures of fit. The \(R\)-squared is \(R^2\equiv\frac{\mathrm{ESS}}{\mathrm{TSS}}=1-\frac{\mathrm{SSR}}{\mathrm{TSS}}\) (4.18), where the explained sum of squares is \(\mathrm{ESS}=\sum(\hat Y_i-\bar Y_n)^2\), the sum of squared residuals is \(\mathrm{SSR}=\sum\hat u_i^2\), and the total sum of squares is \(\mathrm{TSS}=\sum(Y_i-\bar Y_n)^2\). Since \(\sum\hat u_i(\hat Y_i-\bar Y_n)=0\), \(\mathrm{TSS}=\mathrm{ESS}+\mathrm{SSR}\) (4.19), hence \(0\le R^2\le1\). \(R^2=1\) is a perfect fit (\(\hat u_i=0\) for all \(i\)); \(R^2=0\) means the regressors have no in-sample explanatory power beyond the constant. Adding regressors never decreases \(R^2\), because the least-squares minimization is then over a larger column space and SSR cannot increase. The adjusted \(R^2\) therefore penalizes the number of regressors through a degrees-of-freedom correction:
$$\bar R^2=1-\frac{\frac1{n-(k+1)}\sum\hat u_i^2}{\frac1{n-1}\sum(Y_i-\bar Y_n)^2}=1-\frac{n-1}{n-k-1}\frac{\mathrm{SSR}}{\mathrm{TSS}}$$
\(\bar R^2\) can be negative. Neither \(R^2\) nor \(\bar R^2\) has any causal basis — they are descriptions of fit, not causal interpretations.
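The decomposition (4.19) and the behavior of \(R^2\) and \(\bar R^2\) when an irrelevant regressor is added can be seen in a small example. This sketch assumes NumPy and simulated data; `fit_stats` is a hypothetical helper, and the regressor `z` is pure noise by construction.

```python
import numpy as np

def fit_stats(X, y):
    """Return TSS, ESS, SSR, R^2 and adjusted R^2 for an OLS fit with a constant in X."""
    n, p = X.shape                      # p = k + 1 columns, including the constant
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    y_hat = X @ beta_hat
    tss = np.sum((y - y.mean()) ** 2)
    ess = np.sum((y_hat - y.mean()) ** 2)
    ssr = np.sum((y - y_hat) ** 2)
    r2 = 1.0 - ssr / tss
    adj_r2 = 1.0 - (n - 1) / (n - p) * ssr / tss
    return {"tss": tss, "ess": ess, "ssr": ssr, "r2": r2, "adj_r2": adj_r2}

rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=n)
z = rng.normal(size=n)                  # irrelevant regressor
y = 1.0 + 0.5 * x + rng.normal(size=n)

small = fit_stats(np.column_stack([np.ones(n), x]), y)
big = fit_stats(np.column_stack([np.ones(n), x, z]), y)

print(f"R^2:      {small['r2']:.4f} -> {big['r2']:.4f}")
print(f"adj. R^2: {small['adj_r2']:.4f} -> {big['adj_r2']:.4f}")
```

\(R^2\) cannot fall when `z` is added, whereas the adjusted version may move in either direction.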
4.3 Properties of the OLS Estimator
Everything below is stated under the baseline assumptions of §4.2.1; individual subsections add further assumptions as needed.
4.3.1 Unbiasedness.
Proposition 4.2 (Unbiased) Additionally assume \(\mathbb E[u\mid\mathbf X]=0\) (mean independence). Then OLS is unbiased: \(\mathbb E[\hat{\boldsymbol\beta}_n]=\boldsymbol\beta\).
Proof (Proposition 4.2) \(\hat{\boldsymbol\beta}_n=(\mathbb X'\mathbb X)^{-1}\mathbb X'\mathbb Y=(\mathbb X'\mathbb X)^{-1}\mathbb X'(\mathbb X\boldsymbol\beta+\mathbb U)=\boldsymbol\beta+(\mathbb X'\mathbb X)^{-1}\mathbb X'\mathbb U\). Take the conditional expectation given the regressors: $$\mathbb E[\hat{\boldsymbol\beta}_n\mid\mathbf X_1,\dots,\mathbf X_n]=\boldsymbol\beta+(\mathbb X'\mathbb X)^{-1}\mathbb X'\underbrace{\mathbb E[\mathbb U\mid\mathbf X_1,\dots,\mathbf X_n]}_{=0\,\because\,\mathbb E[u_i\mid\mathbf X_i]=0}=\boldsymbol\beta$$ Here \(\mathbb E[u_i\mid\mathbf X_1,\dots,\mathbf X_n]=\mathbb E[u_i\mid\mathbf X_i]=0\) because the observations are i.i.d., so \((u_i,\mathbf X_i)\) is independent of \(\mathbf X_j\) for \(j\ne i\). By the law of iterated expectations (LIE), \(\mathbb E[\hat{\boldsymbol\beta}_n]=\mathbb E[\mathbb E[\hat{\boldsymbol\beta}_n\mid\mathbf X_1,\dots,\mathbf X_n]]=\boldsymbol\beta\). \(\blacksquare\)
4.3.2 Gauss-Markov Theorem.
Definition 4.2 / Proposition 4.3 (Gauss-Markov) Homoskedastic: \(\mathrm{Var}(u\mid\mathbf X)\) is constant; otherwise heteroskedastic. Proposition 4.3: additionally assume \(\mathbb E[u\mid\mathbf X]=0\) (mean independence) and \(\mathrm{Var}(u\mid\mathbf X)=\sigma^2\) (homoskedasticity). Then \(\hat{\boldsymbol\beta}_n\) is the "best" estimator of \(\boldsymbol\beta\) (BLUE) — among all estimators of the form \(\mathbb A'\mathbb Y\) (\(\mathbb A=\mathbb A(\mathbf X_1,\dots,\mathbf X_n)\)) that are unbiased (\(\mathbb E[\mathbb A'\mathbb Y\mid\mathbf X]=\boldsymbol\beta\)), \(\hat{\boldsymbol\beta}_n\) has the smallest \(\mathrm{Var}(\mathbb A'\mathbb Y\mid\mathbf X)\).
Proof (Proposition 4.3) Unbiasedness requires \(\mathbb E[\mathbb A'\mathbb Y\mid\mathbf X]=\mathbb A'\mathbb X\boldsymbol\beta=\boldsymbol\beta\) for every \(\boldsymbol\beta\), i.e. \(\mathbb A'\mathbb X=\mathbb I\). With i.i.d. sampling and homoskedasticity, \(\mathrm{Var}(\mathbb U\mid\mathbf X)=\sigma^2\mathbb I\), so the conditional variance is \(\mathrm{Var}(\mathbb A'\mathbb Y\mid\mathbf X)=\mathbb A'\mathrm{Var}(\mathbb U\mid\mathbf X)\mathbb A=\mathbb A'\mathbb A\sigma^2\) (4.21). OLS corresponds to \(\mathbb A'=(\mathbb X'\mathbb X)^{-1}\mathbb X'\), with variance \((\mathbb X'\mathbb X)^{-1}\sigma^2\). We show \(\mathbb A'\mathbb A-(\mathbb X'\mathbb X)^{-1}\) is positive semi-definite. Let \(\mathbb C=\mathbb A-\mathbb X(\mathbb X'\mathbb X)^{-1}\); then \(\mathbb X'\mathbb C=\mathbb X'\mathbb A-\mathbb I=\mathbf 0\) by \(\mathbb A'\mathbb X=\mathbb I\), so the cross terms in \(\mathbb A'\mathbb A\) vanish and $$\mathbb A'\mathbb A-(\mathbb X'\mathbb X)^{-1}=\mathbb C'\mathbb C$$ For any \(\mathbf c\ne\mathbf 0\), \(\mathbf c'(\mathbb A'\mathbb A-(\mathbb X'\mathbb X)^{-1})\mathbf c=\mathbf c'\mathbb C'\mathbb C\mathbf c=(\mathbb C\mathbf c)'(\mathbb C\mathbf c)=\sum_{i=1}^n m_i^2\ge0\), where \(m_i\) is the \(i\)-th entry of \(\mathbb C\mathbf c\). The difference is therefore positive semi-definite, so OLS has the smallest variance. \(\blacksquare\)
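The key identity in the proof can be checked for one concrete competitor. This sketch assumes NumPy and uses, purely as an illustration, a weighted estimator \(\mathbb A'=(\mathbb X'\mathbb W\mathbb X)^{-1}\mathbb X'\mathbb W\) with arbitrary positive weights; it is linear in \(\mathbb Y\) and satisfies \(\mathbb A'\mathbb X=\mathbb I\), so it is one member of the class the theorem covers.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
w = rng.uniform(0.5, 2.0, size=n)        # arbitrary positive weights (illustrative)

XtX_inv = np.linalg.inv(X.T @ X)
A_ols = X @ XtX_inv                      # A' Y is the OLS estimator

WX = X * w[:, None]                      # W X with W = diag(w)
A_alt = WX @ np.linalg.inv(X.T @ WX)     # A' = (X'WX)^{-1} X'W, another linear unbiased estimator

C = A_alt - A_ols
gap = A_alt.T @ A_alt - XtX_inv          # variance difference, up to the factor sigma^2
eigenvalues = np.linalg.eigvalsh(gap)

print("A'X = I:", np.allclose(A_alt.T @ X, np.eye(2)))
print("X'C = 0:", np.allclose(X.T @ C, 0.0))
print("gap = C'C:", np.allclose(gap, C.T @ C))
print("eigenvalues of the gap:", eigenvalues)
```

Nonnegative eigenvalues indicate that the competitor's conditional variance exceeds that of OLS in the positive semi-definite order, as the proposition states for the homoskedastic case.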
4.3.3 Consistency.
Proposition 4.4 (Consistent) OLS is consistent: \(\hat{\boldsymbol\beta}_n\xrightarrow{p}\boldsymbol\beta\). (Only the baseline assumptions are needed, no mean independence or homoskedasticity.)
Proof (Proposition 4.4) By WLLN, \(\frac1n\sum\mathbf X_i\mathbf X_i'\xrightarrow{p}\mathbb E[\mathbf X\mathbf X']\), \(\frac1n\sum\mathbf X_iY_i\xrightarrow{p}\mathbb E[\mathbf XY]\); no perfect collinearity makes \(\mathbb E[\mathbf X\mathbf X']\) invertible, and CMT gives \((\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}\xrightarrow{p}\mathbb E[\mathbf X\mathbf X']^{-1}\); marginal ⇒ joint convergence + CMT (multiplication is continuous): $$\hat{\boldsymbol\beta}_n=\Big(\tfrac1n\sum\mathbf X_i\mathbf X_i'\Big)^{-1}\Big(\tfrac1n\sum\mathbf X_iY_i\Big)\xrightarrow{p}\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY]=\boldsymbol\beta\quad\blacksquare$$
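Consistency can be illustrated with a small Monte Carlo experiment. This sketch assumes NumPy and a simulated design with heteroskedastic errors, chosen to show that homoskedasticity is not needed; the function name `rmse_of_slope`, the sample sizes, and the number of replications are illustrative choices.

```python
import numpy as np

beta = np.array([1.0, 2.0])   # illustrative true intercept and slope

def rmse_of_slope(n, reps, rng):
    """Root mean squared error of the OLS slope over `reps` simulated samples of size n."""
    errors = np.empty(reps)
    for r in range(reps):
        x = rng.normal(size=n)
        # Heteroskedastic error: E[u | x] = 0, but Var(u | x) depends on x.
        u = rng.normal(size=n) * (1.0 + np.abs(x))
        X = np.column_stack([np.ones(n), x])
        y = X @ beta + u
        beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
        errors[r] = beta_hat[1] - beta[1]
    return np.sqrt(np.mean(errors ** 2))

rng = np.random.default_rng(8)
rmse = {n: rmse_of_slope(n, reps=200, rng=rng) for n in (100, 1_000, 10_000)}
for n, value in rmse.items():
    print(f"n = {n:>6}: RMSE of slope = {value:.4f}")
```

The error should shrink roughly in proportion to \(1/\sqrt n\), in line with the \(\sqrt n\) rate in the limiting distribution of §4.3.4.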
4.3.4 Limiting distribution.
Proposition 4.5 / Proposition 4.6 (Limiting distribution) Proposition 4.5: additionally assume \(\mathrm{Var}(\mathbf Xu)\) exists. Then $$\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}N(0,\Omega),\quad\Omega=\mathbb E[\mathbf X\mathbf X']^{-1}\mathrm{Var}(\mathbf Xu)\mathbb E[\mathbf X\mathbf X']^{-1}$$ (the sandwich form, heteroskedasticity-robust). Proposition 4.6: if we further add \(\mathbb E[u\mid\mathbf X]=0\) (mean independence) and \(\mathrm{Var}(u\mid\mathbf X)=\sigma^2\) (homoskedasticity), then $$\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}N(0,\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1})$$
Proof (Propositions 4.5 and 4.6) Prop 4.5: expand \(\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)=(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}(\sqrt n\frac1n\sum\mathbf X_iu_i)\). The \(\mathbf X_iu_i\) are i.i.d. with mean \(\mathbb E[\mathbf Xu]=0\), so by CLT \(\sqrt n\frac1n\sum\mathbf X_iu_i\xrightarrow{d}N(0,\mathrm{Var}(\mathbf Xu))\); also \((\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}\xrightarrow{p}\mathbb E[\mathbf X\mathbf X']^{-1}\), so by Slutsky \(\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}\mathbb E[\mathbf X\mathbf X']^{-1}N(0,\mathrm{Var}(\mathbf Xu))=N(0,\Omega)\). Prop 4.6: homoskedasticity gives \(\mathrm{Var}(\mathbf Xu)=\mathbb E[\mathbf X\mathbf X'u^2]=\mathbb E[\mathbf X\mathbf X'\mathbb E[u^2\mid\mathbf X]]=\sigma^2\mathbb E[\mathbf X\mathbf X']\) (4.22)(4.23), so \(\Omega=\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1}\) (4.24). \(\blacksquare\)
4.3.5 估计 \(\Omega\). 我们不知真实 \(\Omega\),需构造一致估计。同方差下 \(\Omega=\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1}\),自然估计 \(\hat\Omega_n=(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}\hat\sigma_n^2\)(\(\hat\sigma_n^2=\frac1n\sum\hat u_i^2\))。一般(异方差稳健):
$$\hat\Omega_n=\Big(\tfrac1n\sum\mathbf X_i\mathbf X_i'\Big)^{-1}\Big(\tfrac1n\sum\mathbf X_i\mathbf X_i'\hat u_i^2\Big)\Big(\tfrac1n\sum\mathbf X_i\mathbf X_i'\Big)^{-1} \tag{4.25}$$
需证中间项 \(\frac1n\sum\mathbf X_i\mathbf X_i'\hat u_i^2\xrightarrow{p}\mathrm{Var}(\mathbf Xu)\)。把 \(\hat u_i^2=u_i^2+(\hat u_i^2-u_i^2)\) 拆分 (4.26),第一项由 WLLN \(\frac1n\sum\mathbf X_i\mathbf X_i'u_i^2\xrightarrow{p}\mathbb E[\mathbf X\mathbf X'u^2]\),第二项用 \(\hat u_i^2-u_i^2=-2u_i\mathbf X_i'(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)+[\mathbf X_i'(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)]^2\) 与下述引理证其 \(\xrightarrow{p}0\)。
引理 4.2 设 \(\mathbf Z_1,\dots,\mathbf Z_n\) i.i.d.、\(\mathbb E[|\mathbf Z_i|^r]<\infty\),则 \(\max_{1\le i\le n}|\mathbf Z_i|=o_P(n^{1/r})\),即 \(\frac{\max_{1\le i\le n}|\mathbf Z_i|}{n^{1/r}}\xrightarrow{p}0\)。
最终 \(\hat\Omega_n\xrightarrow{p}\Omega\)。
4.3.5 Estimating \(\Omega\). We do not know the true \(\Omega\) and must construct a consistent estimator. Under homoskedasticity \(\Omega=\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1}\), with the natural estimator \(\hat\Omega_n=(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}\hat\sigma_n^2\) (\(\hat\sigma_n^2=\frac1n\sum\hat u_i^2\)). In general (heteroskedasticity-robust):
$$\hat\Omega_n=\Big(\tfrac1n\sum\mathbf X_i\mathbf X_i'\Big)^{-1}\Big(\tfrac1n\sum\mathbf X_i\mathbf X_i'\hat u_i^2\Big)\Big(\tfrac1n\sum\mathbf X_i\mathbf X_i'\Big)^{-1} \tag{4.25}$$
We must show the middle term \(\frac1n\sum\mathbf X_i\mathbf X_i'\hat u_i^2\xrightarrow{p}\mathrm{Var}(\mathbf Xu)\). Split \(\hat u_i^2=u_i^2+(\hat u_i^2-u_i^2)\) (4.26): the first term by WLLN \(\frac1n\sum\mathbf X_i\mathbf X_i'u_i^2\xrightarrow{p}\mathbb E[\mathbf X\mathbf X'u^2]\); the second uses \(\hat u_i^2-u_i^2=-2u_i\mathbf X_i'(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)+[\mathbf X_i'(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)]^2\) together with the lemma below to show it \(\xrightarrow{p}0\).
Lemma 4.2 Let \(\mathbf Z_1,\dots,\mathbf Z_n\) be i.i.d. with \(\mathbb E[|\mathbf Z_i|^r]<\infty\). Then \(\max_{1\le i\le n}|\mathbf Z_i|=o_P(n^{1/r})\), i.e. \(\frac{\max_{1\le i\le n}|\mathbf Z_i|}{n^{1/r}}\xrightarrow{p}0\).
Finally \(\hat\Omega_n\xrightarrow{p}\Omega\).
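As a minimal sketch of the two estimators of \(\Omega\) in this subsection, the code below computes the homoskedastic estimator and the robust estimator (4.25) on simulated data. The function names `ols` and `omega_hat` and the data-generating process are assumptions made for illustration, not part of the notes.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and residuals; X is n x (k+1) with a constant."""
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    return beta_hat, y - X @ beta_hat

def omega_hat(X, resid, robust=True):
    """Estimate Omega, the asymptotic variance of sqrt(n)(beta_hat - beta)."""
    n = X.shape[0]
    Q_inv = np.linalg.inv(X.T @ X / n)            # (1/n sum X_i X_i')^{-1}
    if not robust:
        return Q_inv * np.mean(resid**2)          # homoskedastic estimator
    meat = (X * resid[:, None]**2).T @ X / n      # 1/n sum X_i X_i' u_hat_i^2
    return Q_inv @ meat @ Q_inv                   # sandwich form (4.25)

# Assumed heteroskedastic design: Var(u | x) = 0.5 + x^2.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
u = rng.normal(size=n) * np.sqrt(0.5 + x**2)
y = X @ np.array([1.0, 2.0]) + u

beta_hat, resid = ols(X, y)
omega_robust = omega_hat(X, resid, robust=True)
omega_homo = omega_hat(X, resid, robust=False)

# Standard errors of beta_hat itself divide Omega_hat by n.
se_robust = np.sqrt(np.diag(omega_robust) / n)
se_homo = np.sqrt(np.diag(omega_homo) / n)
print(se_robust, se_homo)
```

In this design the homoskedastic standard error for the slope understates the sampling variability, which is the practical reason to prefer (4.25) when homoskedasticity is in doubt.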
4.4 Hypothesis Testing
4.4.1 单一线性约束(\(t\) 检验,\(p=1\)). 检验 \(H_0:\mathbf r'\boldsymbol\beta=c\) vs \(H_1:\mathbf r'\boldsymbol\beta\ne c\)(\(\mathbf r\in\mathbb R^{k+1}\)、\(c\in\mathbb R\))。由 \(\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}N(0,\Omega)\) 与 CMT:\(\sqrt n(\mathbf r'\hat{\boldsymbol\beta}_n-\mathbf r'\boldsymbol\beta)\xrightarrow{d}N(0,\mathbf r'\Omega\mathbf r)\)。取统计量
$$T_n=\frac{\sqrt n(\mathbf r'\hat{\boldsymbol\beta}_n-c)}{\sqrt{\mathbf r'\hat\Omega_n\mathbf r}}\xrightarrow{d}N(0,1)\quad(H_0)$$
(\(H_0\) 下成立。)\(\phi_n=\mathbf 1\{|T_n|>z_{1-\frac\alpha2}\}\) 水平一致。
4.4.2 多重线性约束(Wald 检验). 检验 \(H_0:\mathbf R\boldsymbol\beta=\mathbf c\) vs \(H_1:\mathbf R\boldsymbol\beta\ne\mathbf c\),\(\mathbf R\) 为 \(p\times(k+1)\)、行线性独立(\(\Omega\) 正定时 \(\mathbf R\Omega\mathbf R'\) 可逆)。由 \(\sqrt n(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf R\boldsymbol\beta)\xrightarrow{d}N(0,\mathbf R\Omega\mathbf R')\) 与事实「\(\mathbf x\in\mathbb R^p\)、\(\mathbf x\sim N(0,\Sigma)\)、\(\Sigma\) 可逆 \(\Rightarrow\mathbf x'\Sigma^{-1}\mathbf x\sim\chi^2_p\)」:
$$T_n=n(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)'(\mathbf R\hat\Omega_n\mathbf R')^{-1}(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)\xrightarrow{d}\chi^2_p\quad(H_0)$$
(\(H_0\) 下成立。)\(\phi_n=\mathbf 1\{T_n>c_{p,1-\alpha}\}\)(\(\chi^2_p\) 的 \(1-\alpha\) 分位)。
4.4.1 A single linear restriction (\(t\) test, \(p=1\)). Test \(H_0:\mathbf r'\boldsymbol\beta=c\) vs \(H_1:\mathbf r'\boldsymbol\beta\ne c\) (\(\mathbf r\in\mathbb R^{k+1}\), \(c\in\mathbb R\)). From \(\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}N(0,\Omega)\) and CMT: \(\sqrt n(\mathbf r'\hat{\boldsymbol\beta}_n-\mathbf r'\boldsymbol\beta)\xrightarrow{d}N(0,\mathbf r'\Omega\mathbf r)\). Take the statistic
$$T_n=\frac{\sqrt n(\mathbf r'\hat{\boldsymbol\beta}_n-c)}{\sqrt{\mathbf r'\hat\Omega_n\mathbf r}}\xrightarrow{d}N(0,1)\quad(\text{under }H_0)$$
and \(\phi_n=\mathbf 1\{|T_n|>z_{1-\frac\alpha2}\}\) is consistent in level.
4.4.2 Multiple linear restrictions (Wald test). Test \(H_0:\mathbf R\boldsymbol\beta=\mathbf c\) vs \(H_1:\mathbf R\boldsymbol\beta\ne\mathbf c\), with \(\mathbf R\) a \(p\times(k+1)\) matrix of linearly independent rows (so that, with \(\Omega\) positive definite, \(\mathbf R\Omega\mathbf R'\) is invertible). From \(\sqrt n(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf R\boldsymbol\beta)\xrightarrow{d}N(0,\mathbf R\Omega\mathbf R')\) and the fact that \(\mathbf x\in\mathbb R^p\), \(\mathbf x\sim N(0,\Sigma)\) with \(\Sigma\) invertible implies \(\mathbf x'\Sigma^{-1}\mathbf x\sim\chi^2_p\):
$$T_n=n(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)'(\mathbf R\hat\Omega_n\mathbf R')^{-1}(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)\xrightarrow{d}\chi^2_p\quad(\text{under }H_0)$$
and \(\phi_n=\mathbf 1\{T_n>c_{p,1-\alpha}\}\) (the \(1-\alpha\) quantile of \(\chi^2_p\)).
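The two statistics above can be computed directly from \(\hat{\boldsymbol\beta}_n\) and \(\hat\Omega_n\). The sketch below is illustrative only: the simulated design, the restrictions tested, and \(\alpha=0.05\) are assumptions. To stay within the standard library it takes \(p=2\), where the \(\chi^2_2\) quantile has the closed form \(-2\ln\alpha\); for general \(p\) a chi-square quantile routine would be needed.

```python
import math
from statistics import NormalDist

import numpy as np

# Assumed design: constant plus two regressors, true beta = (1, 0.5, 0).
rng = np.random.default_rng(1)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 0.5, 0.0])
u = rng.normal(size=n) * (1.0 + 0.5 * np.abs(X[:, 1]))  # heteroskedastic
y = X @ beta + u

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
Q_inv = np.linalg.inv(X.T @ X / n)
Omega_hat = Q_inv @ ((X * resid[:, None]**2).T @ X / n) @ Q_inv  # (4.25)

alpha = 0.05

# t test of H0: r'beta = c with r = (0, 0, 1)', c = 0 (true in this design).
r, c = np.array([0.0, 0.0, 1.0]), 0.0
t_stat = math.sqrt(n) * (r @ beta_hat - c) / math.sqrt(r @ Omega_hat @ r)
reject_t = abs(t_stat) > NormalDist().inv_cdf(1 - alpha / 2)

# Wald test of H0: R beta = c_vec with p = 2 (both slopes zero; false here).
R = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
c_vec = np.zeros(2)
d = R @ beta_hat - c_vec
wald = n * d @ np.linalg.solve(R @ Omega_hat @ R.T, d)
crit = -2.0 * math.log(alpha)   # exact 1 - alpha quantile of chi-square(2)
reject_wald = wald > crit

print(t_stat, reject_t, wald, reject_wald)
```

With \(p=1\) the Wald statistic reduces to \(T_n^2\) from the \(t\) test, so the two tests reject on the same samples.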
4.4.3 拉格朗日乘子(LM)检验. 约束最小二乘(CLS) 估计量 \(\tilde{\boldsymbol\beta}_n\) 解 \(\min_{\boldsymbol\beta:\mathbf R\boldsymbol\beta=\mathbf c}\frac1n\sum(Y_i-\mathbf X_i'\boldsymbol\beta)^2\)。拉格朗日
$$\mathcal L(\boldsymbol\beta,\boldsymbol\lambda)=\frac1{2n}\sum(Y_i-\mathbf X_i'\boldsymbol\beta)^2+\boldsymbol\lambda'(\mathbf R\boldsymbol\beta-\mathbf c)$$
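(\(\frac12\) 用于消去求导产生的因子 2)。一阶条件 \(\frac{\partial\mathcal L}{\partial\boldsymbol\beta}=-\frac1n\sum\mathbf X_i(Y_i-\mathbf X_i'\tilde{\boldsymbol\beta}_n)+\mathbf R'\tilde{\boldsymbol\lambda}_n=\mathbf 0\)、\(\frac{\partial\mathcal L}{\partial\boldsymbol\lambda}=\mathbf R\tilde{\boldsymbol\beta}_n-\mathbf c=\mathbf 0\)。将第一式左乘 \(\mathbf R(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}\) 并利用 \(\mathbf R\tilde{\boldsymbol\beta}_n=\mathbf c\),得 \(\mathbf R(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}\mathbf R'\tilde{\boldsymbol\lambda}_n=\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c\),故乘子为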
$$\tilde{\boldsymbol\lambda}_n=\Big(\mathbf R\big(\tfrac1n\sum\mathbf X_i\mathbf X_i'\big)^{-1}\mathbf R'\Big)^{-1}(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)$$
\(H_0\)(\(\mathbf R\boldsymbol\beta=\mathbf c\))下 \(\tilde{\boldsymbol\lambda}_n\xrightarrow{p}0\),且 \(\sqrt n\tilde{\boldsymbol\lambda}_n\xrightarrow{d}N(0,\Pi)\),其中 \(\Pi=(\mathbf R\mathbb E[\mathbf X\mathbf X']^{-1}\mathbf R')^{-1}\mathbf R\mathbb E[\mathbf X\mathbf X']^{-1}\mathrm{Var}(\mathbf Xu)\mathbb E[\mathbf X\mathbf X']^{-1}\mathbf R'(\mathbf R\mathbb E[\mathbf X\mathbf X']^{-1}\mathbf R')^{-1}\)。取 \(\hat\Pi\) 为 \(\Pi\) 的代入估计(以 \(\frac1n\sum\mathbf X_i\mathbf X_i'\) 代替 \(\mathbb E[\mathbf X\mathbf X']\)、以 \(\frac1n\sum\mathbf X_i\mathbf X_i'\hat u_i^2\) 代替 \(\mathrm{Var}(\mathbf Xu)\)),则 \(H_0\) 下检验统计量 \(T_n=n\tilde{\boldsymbol\lambda}_n'\hat\Pi^{-1}\tilde{\boldsymbol\lambda}_n\xrightarrow{d}\chi^2_p\),\(\phi_n=\mathbf 1\{T_n>c_{p,1-\alpha}\}\)。
LM 检验 = Wald 检验 把 LM 统计量重排,可证 \(T_n=n(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)'(\mathbf R\hat\Omega_n\mathbf R')^{-1}(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)\)——与 Wald 统计量完全相同,故两检验等价、渐近分布同为 \(\chi^2_p\)。
4.4.4 非线性约束(Delta 方法). 检验 \(H_0:f(\boldsymbol\beta)=\mathbf c\) vs \(H_1:f(\boldsymbol\beta)\ne\mathbf c\),\(f:\mathbb R^{k+1}\to\mathbb R^p\) 在 \(\boldsymbol\beta\) 处连续可微、雅可比 \(D_\beta f(\boldsymbol\beta)=\frac{\partial f(\boldsymbol\beta)}{\partial\boldsymbol\beta'}\)(\(p\times(k+1)\))行满秩(\(p\le k+1\))。由 Delta 方法 \(\sqrt n(f(\hat{\boldsymbol\beta}_n)-f(\boldsymbol\beta))\xrightarrow{d}N(0,D_\beta f(\boldsymbol\beta)\Omega D_\beta f(\boldsymbol\beta)')\),故
$$T_n=n(f(\hat{\boldsymbol\beta}_n)-\mathbf c)'\big(D_\beta f(\hat{\boldsymbol\beta}_n)\hat\Omega_n D_\beta f(\hat{\boldsymbol\beta}_n)'\big)^{-1}(f(\hat{\boldsymbol\beta}_n)-\mathbf c)\xrightarrow{d}\chi^2_p$$
\(\phi_n=\mathbf 1\{T_n>c_{p,1-\alpha}\}\)。
4.4.3 Lagrange multiplier (LM) test. The constrained least squares (CLS) estimator \(\tilde{\boldsymbol\beta}_n\) solves \(\min_{\boldsymbol\beta:\mathbf R\boldsymbol\beta=\mathbf c}\frac1n\sum(Y_i-\mathbf X_i'\boldsymbol\beta)^2\). The Lagrangian
$$\mathcal L(\boldsymbol\beta,\boldsymbol\lambda)=\frac1{2n}\sum(Y_i-\mathbf X_i'\boldsymbol\beta)^2+\boldsymbol\lambda'(\mathbf R\boldsymbol\beta-\mathbf c)$$
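(the \(\frac12\) cancels the factor 2 produced by differentiation). The first-order conditions are \(\frac{\partial\mathcal L}{\partial\boldsymbol\beta}=-\frac1n\sum\mathbf X_i(Y_i-\mathbf X_i'\tilde{\boldsymbol\beta}_n)+\mathbf R'\tilde{\boldsymbol\lambda}_n=\mathbf 0\) and \(\frac{\partial\mathcal L}{\partial\boldsymbol\lambda}=\mathbf R\tilde{\boldsymbol\beta}_n-\mathbf c=\mathbf 0\). Premultiplying the first condition by \(\mathbf R(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}\) and using \(\mathbf R\tilde{\boldsymbol\beta}_n=\mathbf c\) gives \(\mathbf R(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}\mathbf R'\tilde{\boldsymbol\lambda}_n=\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c\), so the multiplier is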
$$\tilde{\boldsymbol\lambda}_n=\Big(\mathbf R\big(\tfrac1n\sum\mathbf X_i\mathbf X_i'\big)^{-1}\mathbf R'\Big)^{-1}(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)$$
Under \(H_0\) (\(\mathbf R\boldsymbol\beta=\mathbf c\)), \(\tilde{\boldsymbol\lambda}_n\xrightarrow{p}0\) and \(\sqrt n\tilde{\boldsymbol\lambda}_n\xrightarrow{d}N(0,\Pi)\), where \(\Pi=(\mathbf R\mathbb E[\mathbf X\mathbf X']^{-1}\mathbf R')^{-1}\mathbf R\mathbb E[\mathbf X\mathbf X']^{-1}\mathrm{Var}(\mathbf Xu)\mathbb E[\mathbf X\mathbf X']^{-1}\mathbf R'(\mathbf R\mathbb E[\mathbf X\mathbf X']^{-1}\mathbf R')^{-1}\). Let \(\hat\Pi\) be the plug-in estimator of \(\Pi\) (replace \(\mathbb E[\mathbf X\mathbf X']\) by \(\frac1n\sum\mathbf X_i\mathbf X_i'\) and \(\mathrm{Var}(\mathbf Xu)\) by \(\frac1n\sum\mathbf X_i\mathbf X_i'\hat u_i^2\)). Then under \(H_0\) the test statistic satisfies \(T_n=n\tilde{\boldsymbol\lambda}_n'\hat\Pi^{-1}\tilde{\boldsymbol\lambda}_n\xrightarrow{d}\chi^2_p\), and the test is \(\phi_n=\mathbf 1\{T_n>c_{p,1-\alpha}\}\).
LM test = Wald test Rearranging the LM statistic, one shows \(T_n=n(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)'(\mathbf R\hat\Omega_n\mathbf R')^{-1}(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)\) — exactly the same as the Wald statistic, so the two tests are equivalent with the same asymptotic \(\chi^2_p\) distribution.
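The equality of the LM and Wald statistics can be verified numerically. The sketch below assumes a simulated design and a single restriction chosen for illustration, and it assumes that \(\hat\Pi\) is the plug-in estimator built from sample moments and the OLS residuals; under that choice the two statistics agree up to floating-point error.

```python
import numpy as np

# Assumed design and restriction (illustrative): H0: beta_1 + beta_2 = 0.1.
rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.3, -0.2]) + rng.normal(size=n) * (1 + X[:, 1]**2)
R = np.array([[0.0, 1.0, 1.0]])
c = np.array([0.1])

Q_inv = np.linalg.inv(X.T @ X / n)
beta_hat = Q_inv @ (X.T @ y / n)                       # unrestricted OLS
resid = y - X @ beta_hat
Omega_hat = Q_inv @ ((X * resid[:, None]**2).T @ X / n) @ Q_inv

# Multiplier and constrained estimator from the first-order conditions.
A = R @ Q_inv @ R.T
lam = np.linalg.solve(A, R @ beta_hat - c)
beta_tilde = beta_hat - Q_inv @ R.T @ lam

# Plug-in Pi_hat (assumption: built from OLS residuals via Omega_hat).
A_inv = np.linalg.inv(A)
Pi_hat = A_inv @ (R @ Omega_hat @ R.T) @ A_inv

lm = n * lam @ np.linalg.solve(Pi_hat, lam)
d = R @ beta_hat - c
wald = n * d @ np.linalg.solve(R @ Omega_hat @ R.T, d)
print(lm, wald)
```

A variant that estimates \(\mathrm{Var}(\mathbf Xu)\) from the restricted residuals would generally differ from the Wald statistic in finite samples; that variant is not what the sketch computes.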
4.4.4 Nonlinear restrictions (delta method). Test \(H_0:f(\boldsymbol\beta)=\mathbf c\) vs \(H_1:f(\boldsymbol\beta)\ne\mathbf c\), where \(f:\mathbb R^{k+1}\to\mathbb R^p\) is continuously differentiable at \(\boldsymbol\beta\) with full-row-rank Jacobian \(D_\beta f(\boldsymbol\beta)=\frac{\partial f(\boldsymbol\beta)}{\partial\boldsymbol\beta'}\) (\(p\times(k+1)\), \(p\le k+1\)). By the delta method \(\sqrt n(f(\hat{\boldsymbol\beta}_n)-f(\boldsymbol\beta))\xrightarrow{d}N(0,D_\beta f(\boldsymbol\beta)\Omega D_\beta f(\boldsymbol\beta)')\), so
$$T_n=n(f(\hat{\boldsymbol\beta}_n)-\mathbf c)'\big(D_\beta f(\hat{\boldsymbol\beta}_n)\hat\Omega_n D_\beta f(\hat{\boldsymbol\beta}_n)'\big)^{-1}(f(\hat{\boldsymbol\beta}_n)-\mathbf c)\xrightarrow{d}\chi^2_p$$
and \(\phi_n=\mathbf 1\{T_n>c_{p,1-\alpha}\}\).
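A minimal sketch of the delta-method test, assuming the illustrative restriction \(f(\boldsymbol\beta)=\beta_1/\beta_2=2\) (so \(p=1\)) and a simulated design in which it holds. The helper names `f` and `jac` are hypothetical. Since \(p=1\), the \(\chi^2_1\) critical value is the square of the standard normal quantile \(z_{1-\alpha/2}\).

```python
from statistics import NormalDist

import numpy as np

def f(b):
    """Restriction function f(beta) = beta_1 / beta_2, returned as a length-1 vector."""
    return np.array([b[1] / b[2]])

def jac(b):
    """Jacobian of f with respect to beta', a 1 x (k+1) matrix."""
    return np.array([[0.0, 1.0 / b[2], -b[1] / b[2]**2]])

# Assumed design: true beta = (1, 2, 1), so beta_1 / beta_2 = 2 and H0 holds.
rng = np.random.default_rng(3)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, 1.0]) + rng.normal(size=n)
c = np.array([2.0])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
Q_inv = np.linalg.inv(X.T @ X / n)
Omega_hat = Q_inv @ ((X * resid[:, None]**2).T @ X / n) @ Q_inv

J = jac(beta_hat)                    # Jacobian evaluated at beta_hat
V = J @ Omega_hat @ J.T              # estimated variance of sqrt(n) f(beta_hat)
diff = f(beta_hat) - c
T_n = n * diff @ np.linalg.solve(V, diff)

alpha = 0.05
crit = NormalDist().inv_cdf(1 - alpha / 2) ** 2   # chi-square(1) quantile
reject = T_n > crit
print(T_n, crit, reject)
```

The ratio is only differentiable where \(\beta_2\ne0\), which is why the rank condition on the Jacobian is stated at the true \(\boldsymbol\beta\).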
4.5 Generalized Least Squares (GLS) Estimator
高斯-马尔可夫定理表明 OLS 在同方差下是 BLUE;但异方差下 OLS 不再最优,广义最小二乘(GLS) 更有效。设 \((Y_i,\mathbf X_i)\) i.i.d.,\(Y_i\in\mathbb R\)、\(\mathbf X_i\in\mathbb R^{k+1}\),\(\mathbb E[Y_i\mid\mathbf X_i]=\mathbf X_i'\boldsymbol\beta\)、\(\mathrm{Var}(Y_i\mid\mathbf X_i)=\sigma^2(\mathbf X_i)\)(已知、\(>0\))。\(\mathbb D=\mathrm{diag}(\sigma^2(\mathbf X_1),\dots,\sigma^2(\mathbf X_n))\),\(\mathbb X\) 列线性独立。
4.5.1 无偏性. 估计量 \(\tilde{\boldsymbol\beta}_n=\mathbb A'\mathbb Y\),GLS 取 \(\mathbb A'=(\mathbb X'\mathbb D^{-1}\mathbb X)^{-1}\mathbb X'\mathbb D^{-1}\)。\(\mathbb X'\mathbb D^{-1}\mathbb X\) 正定(故可逆):对 \(\mathbf c\ne\mathbf 0\),因 \(\mathbb X\) 列线性独立,\(\mathbb X\mathbf c=(m_1,\dots,m_n)'\ne\mathbf 0\),故 \(\mathbf c'\mathbb X'\mathbb D^{-1}\mathbb X\mathbf c=\sum_i\frac{m_i^2}{\sigma^2(\mathbf X_i)}>0\)。GLS 无偏:\(\mathbb A'\mathbb X=\mathbb I\),故 \(\mathbb E[\tilde{\boldsymbol\beta}_n\mid\mathbb X]=\mathbb A'\mathbb E[\mathbb Y\mid\mathbb X]=\mathbb A'\mathbb X\boldsymbol\beta=\boldsymbol\beta\)。
4.5.2 条件方差协方差矩阵.
$$\mathrm{Var}(\tilde{\boldsymbol\beta}_n\mid\mathbb X)=\mathbb A'\mathbb D\mathbb A=(\mathbb X'\mathbb D^{-1}\mathbb X)^{-1} \tag{4.28}$$
(\(\mathrm{Var}(\mathbb Y\mid\mathbb X)=\mathbb D\)、\(\mathbb A'=(\mathbb X'\mathbb D^{-1}\mathbb X)^{-1}\mathbb X'\mathbb D^{-1}\) 代入化简)。
The Gauss-Markov theorem shows OLS is BLUE under homoskedasticity; but under heteroskedasticity OLS is no longer efficient, and generalized least squares (GLS) is more efficient. Let \((Y_i,\mathbf X_i)\) be i.i.d., \(Y_i\in\mathbb R\), \(\mathbf X_i\in\mathbb R^{k+1}\), \(\mathbb E[Y_i\mid\mathbf X_i]=\mathbf X_i'\boldsymbol\beta\), \(\mathrm{Var}(Y_i\mid\mathbf X_i)=\sigma^2(\mathbf X_i)\) (known, \(>0\)). Let \(\mathbb D=\mathrm{diag}(\sigma^2(\mathbf X_1),\dots,\sigma^2(\mathbf X_n))\), and assume the columns of \(\mathbb X\) are linearly independent.
4.5.1 Unbiasedness. Consider a linear estimator \(\tilde{\boldsymbol\beta}_n=\mathbb A'\mathbb Y\); GLS takes \(\mathbb A'=(\mathbb X'\mathbb D^{-1}\mathbb X)^{-1}\mathbb X'\mathbb D^{-1}\). \(\mathbb X'\mathbb D^{-1}\mathbb X\) is positive definite (hence invertible): for \(\mathbf c\ne\mathbf 0\), linear independence of the columns of \(\mathbb X\) gives \(\mathbb X\mathbf c=(m_1,\dots,m_n)'\ne\mathbf 0\), so \(\mathbf c'\mathbb X'\mathbb D^{-1}\mathbb X\mathbf c=\sum_i\frac{m_i^2}{\sigma^2(\mathbf X_i)}>0\). GLS is unbiased: \(\mathbb A'\mathbb X=\mathbb I\), so \(\mathbb E[\tilde{\boldsymbol\beta}_n\mid\mathbb X]=\mathbb A'\mathbb E[\mathbb Y\mid\mathbb X]=\mathbb A'\mathbb X\boldsymbol\beta=\boldsymbol\beta\).
4.5.2 Conditional variance-covariance matrix.
$$\mathrm{Var}(\tilde{\boldsymbol\beta}_n\mid\mathbb X)=\mathbb A'\mathbb D\mathbb A=(\mathbb X'\mathbb D^{-1}\mathbb X)^{-1} \tag{4.28}$$
(substitute \(\mathrm{Var}(\mathbb Y\mid\mathbb X)=\mathbb D\) and \(\mathbb A'=(\mathbb X'\mathbb D^{-1}\mathbb X)^{-1}\mathbb X'\mathbb D^{-1}\) and simplify).
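The algebra in 4.5.1 and 4.5.2 can be checked numerically. The sketch below assumes an illustrative design with a known skedastic function \(\sigma^2(x)=x^2\) (an assumption for the example, not from the notes) and confirms \(\mathbb A'\mathbb X=\mathbb I\), equation (4.28), and the equivalence of GLS with OLS on data reweighted by \(1/\sigma(\mathbf X_i)\).

```python
import numpy as np

# Assumed design: one regressor on [1, 3], known Var(Y | x) = x^2.
rng = np.random.default_rng(3)
n = 200
x = rng.uniform(1.0, 3.0, size=n)
X = np.column_stack([np.ones(n), x])
sigma2 = x**2
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * np.sqrt(sigma2)

D = np.diag(sigma2)
D_inv = np.diag(1.0 / sigma2)
XtDinvX = X.T @ D_inv @ X

# A' = (X' D^{-1} X)^{-1} X' D^{-1}, a (k+1) x n matrix.
A_t = np.linalg.solve(XtDinvX, X.T @ D_inv)
beta_gls = A_t @ y
var_gls = A_t @ D @ A_t.T            # A' D A, left side of (4.28)

# Same estimate via OLS on observations scaled by 1 / sigma(X_i).
w = 1.0 / np.sqrt(sigma2)
beta_wls, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
print(beta_gls, beta_wls)
```

Forming the dense \(n\times n\) matrix \(\mathbb D\) is only for transparency; in practice the reweighting form avoids it.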
4.5.3 异方差下 GLS 是 BLUE. 在所有线性无偏估计量中,GLS 的条件方差最小。
证明(GLS BLUE) 设另一线性无偏估计 \(\hat{\boldsymbol\beta}_n=\hat{\mathbb A}'\mathbb Y\)(\(\hat{\mathbb A}'\mathbb X=\mathbb I\)),需证 \(\mathrm{Var}(\hat{\boldsymbol\beta}_n\mid\mathbb X)-\mathrm{Var}(\tilde{\boldsymbol\beta}_n\mid\mathbb X)\) 半正定。令 \(\mathbb C=\hat{\mathbb A}-\mathbb A\)。由 \(\mathbb A'\mathbb D=(\mathbb X'\mathbb D^{-1}\mathbb X)^{-1}\mathbb X'\) 及 \(\mathbb X'\mathbb C=\mathbb X'\hat{\mathbb A}-\mathbb X'\mathbb A=\mathbb I-\mathbb I=\mathbf 0\) 得 \(\mathbb A'\mathbb D\mathbb C=\mathbf 0\),转置得 \(\mathbb C'\mathbb D\mathbb A=\mathbf 0\),于是 $$\mathrm{Var}(\hat{\boldsymbol\beta}_n\mid\mathbb X)-\mathrm{Var}(\tilde{\boldsymbol\beta}_n\mid\mathbb X)=(\mathbb A+\mathbb C)'\mathbb D(\mathbb A+\mathbb C)-\mathbb A'\mathbb D\mathbb A=\mathbb C'\mathbb D\mathbb C$$ 对任意 \(\mathbf s\in\mathbb R^{k+1}\),记 \(\mathbb C\mathbf s=(v_1,\dots,v_n)'\),则 \(\mathbf s'\mathbb C'\mathbb D\mathbb C\mathbf s=(\mathbb C\mathbf s)'\mathbb D(\mathbb C\mathbf s)=\sum_i v_i^2\sigma^2(\mathbf X_i)\ge0\),半正定。故 GLS 在高斯-马尔可夫意义下「最优」。\(\blacksquare\)
Remark 4.4(FGLS) 有时 \(\sigma^2(\mathbf X_i)\) 不可知,需先估计,称可行 GLS(FGLS)。FGLS 在异方差下通常比 OLS 更有效(近似 GLS),但因估计 \(\sigma^2(\mathbf X_i)\) 损失部分效率,OLS 与 FGLS 的效率比较不确定。
4.5.4 一般情形. 回归 \(Y_i=\mathbf X_i'\boldsymbol\beta+\varepsilon_i\)、堆叠 \(\mathbb Y=\mathbb X\boldsymbol\beta+\boldsymbol\varepsilon\),条件方差协方差 \(\mathrm{Cov}(\boldsymbol\varepsilon\mid\mathbb X)=\boldsymbol\Sigma\)(不要求对角)。GLS 解
$$\min_{\boldsymbol\beta}(\mathbb Y-\mathbb X\boldsymbol\beta)'\boldsymbol\Sigma^{-1}(\mathbb Y-\mathbb X\boldsymbol\beta)$$
一阶条件 \(-2(\mathbb Y-\mathbb X\boldsymbol\beta)'\boldsymbol\Sigma^{-1}\mathbb X=\mathbf 0_{1\times(k+1)}\) 给出正规方程
$$(\mathbb X'\boldsymbol\Sigma^{-1}\mathbb X)\boldsymbol\beta=\mathbb X'\boldsymbol\Sigma^{-1}\mathbb Y$$
在 \(\mathbb X'\boldsymbol\Sigma^{-1}\mathbb X\) 可逆时解为 \(\hat{\boldsymbol\beta}=(\mathbb X'\boldsymbol\Sigma^{-1}\mathbb X)^{-1}\mathbb X'\boldsymbol\Sigma^{-1}\mathbb Y\)。这是一般情形,因 \(\boldsymbol\Sigma\) 不必对角(可含序列相关等)。
4.5.3 GLS is BLUE under heteroskedasticity. Among all linear unbiased estimators, GLS has the smallest conditional variance.
Proof (GLS BLUE) Let \(\hat{\boldsymbol\beta}_n=\hat{\mathbb A}'\mathbb Y\) be another linear unbiased estimator (\(\hat{\mathbb A}'\mathbb X=\mathbb I\)); we show \(\mathrm{Var}(\hat{\boldsymbol\beta}_n\mid\mathbb X)-\mathrm{Var}(\tilde{\boldsymbol\beta}_n\mid\mathbb X)\) is positive semi-definite. Let \(\mathbb C=\hat{\mathbb A}-\mathbb A\). Since \(\mathbb A'\mathbb D=(\mathbb X'\mathbb D^{-1}\mathbb X)^{-1}\mathbb X'\) and \(\mathbb X'\mathbb C=\mathbb X'\hat{\mathbb A}-\mathbb X'\mathbb A=\mathbb I-\mathbb I=\mathbf 0\), we get \(\mathbb A'\mathbb D\mathbb C=\mathbf 0\), and transposing gives \(\mathbb C'\mathbb D\mathbb A=\mathbf 0\), so $$\mathrm{Var}(\hat{\boldsymbol\beta}_n\mid\mathbb X)-\mathrm{Var}(\tilde{\boldsymbol\beta}_n\mid\mathbb X)=(\mathbb A+\mathbb C)'\mathbb D(\mathbb A+\mathbb C)-\mathbb A'\mathbb D\mathbb A=\mathbb C'\mathbb D\mathbb C$$ For any \(\mathbf s\in\mathbb R^{k+1}\), write \(\mathbb C\mathbf s=(v_1,\dots,v_n)'\); then \(\mathbf s'\mathbb C'\mathbb D\mathbb C\mathbf s=(\mathbb C\mathbf s)'\mathbb D(\mathbb C\mathbf s)=\sum_i v_i^2\sigma^2(\mathbf X_i)\ge0\), so the difference is positive semi-definite. Hence GLS is "best" in the Gauss-Markov sense. \(\blacksquare\)
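The steps of the proof can be illustrated with OLS as the competing linear unbiased estimator, \(\hat{\mathbb A}'=(\mathbb X'\mathbb X)^{-1}\mathbb X'\). The sketch below is only a numerical check under an assumed design with \(\sigma^2(x)=x^2\); it confirms \(\mathbb A'\mathbb D\mathbb C=\mathbf 0\) and that the variance difference equals \(\mathbb C'\mathbb D\mathbb C\) with nonnegative eigenvalues.

```python
import numpy as np

# Assumed design: one regressor on [1, 3], known Var(Y | x) = x^2.
rng = np.random.default_rng(5)
n = 100
x = rng.uniform(1.0, 3.0, size=n)
X = np.column_stack([np.ones(n), x])
D = np.diag(x**2)
D_inv = np.diag(1.0 / x**2)

# Weight matrices are n x (k+1), so each estimator is (weights)' Y.
A = np.linalg.solve(X.T @ D_inv @ X, X.T @ D_inv).T   # GLS weights
A_hat = np.linalg.solve(X.T @ X, X.T).T               # OLS weights
C = A_hat - A

var_gls = A.T @ D @ A            # conditional variance of GLS
var_ols = A_hat.T @ D @ A_hat    # conditional variance of OLS
diff = var_ols - var_gls

cross = A.T @ D @ C              # should vanish by the proof
min_eig = np.linalg.eigvalsh(diff).min()
print(np.abs(cross).max(), min_eig)
```

Any other choice of \(\hat{\mathbb A}\) with \(\hat{\mathbb A}'\mathbb X=\mathbb I\) could be substituted for the OLS weights and the same checks would apply.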
Remark 4.4 (FGLS) Sometimes \(\sigma^2(\mathbf X_i)\) is unknown and must be estimated first; the resulting procedure is called feasible GLS (FGLS). Because it approximates GLS, FGLS is usually more efficient than OLS under heteroskedasticity, but estimating \(\sigma^2(\mathbf X_i)\) costs some efficiency, so FGLS is not guaranteed to be more efficient than OLS.
4.5.4 General case. Regression \(Y_i=\mathbf X_i'\boldsymbol\beta+\varepsilon_i\), stacked \(\mathbb Y=\mathbb X\boldsymbol\beta+\boldsymbol\varepsilon\), with conditional variance-covariance \(\mathrm{Cov}(\boldsymbol\varepsilon\mid\mathbb X)=\boldsymbol\Sigma\) (not required to be diagonal). GLS solves
$$\min_{\boldsymbol\beta}(\mathbb Y-\mathbb X\boldsymbol\beta)'\boldsymbol\Sigma^{-1}(\mathbb Y-\mathbb X\boldsymbol\beta)$$
The first-order condition \(-2(\mathbb Y-\mathbb X\boldsymbol\beta)'\boldsymbol\Sigma^{-1}\mathbb X=\mathbf 0_{1\times(k+1)}\) gives the normal equations
$$(\mathbb X'\boldsymbol\Sigma^{-1}\mathbb X)\boldsymbol\beta=\mathbb X'\boldsymbol\Sigma^{-1}\mathbb Y$$
whose solution, when \(\mathbb X'\boldsymbol\Sigma^{-1}\mathbb X\) is invertible, is \(\hat{\boldsymbol\beta}=(\mathbb X'\boldsymbol\Sigma^{-1}\mathbb X)^{-1}\mathbb X'\boldsymbol\Sigma^{-1}\mathbb Y\). This is the general case, since \(\boldsymbol\Sigma\) need not be diagonal (it may include serial correlation, etc.).
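A minimal sketch of GLS with a non-diagonal \(\boldsymbol\Sigma\), assuming for illustration a stationary AR(1) error covariance \(\Sigma_{ij}=\rho^{|i-j|}/(1-\rho^2)\) with known \(\rho\) (this covariance and the design are assumptions, not from the notes). It computes the closed-form estimator and checks that it coincides with OLS on data whitened by the Cholesky factor \(\mathbf L\) of \(\boldsymbol\Sigma=\mathbf L\mathbf L'\).

```python
import numpy as np

# Assumed AR(1) error covariance with known rho (illustrative only).
rng = np.random.default_rng(4)
n, rho = 150, 0.6
idx = np.arange(n)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :]) / (1 - rho**2)
L = np.linalg.cholesky(Sigma)            # Sigma = L L'

X = np.column_stack([np.ones(n), rng.normal(size=n)])
eps = L @ rng.normal(size=n)             # Cov(eps) = Sigma
y = X @ np.array([1.0, 2.0]) + eps

# Closed form: (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} Y, using solves
# instead of an explicit inverse of Sigma.
Sigma_inv_X = np.linalg.solve(Sigma, X)  # Sigma^{-1} X
beta_gls = np.linalg.solve(X.T @ Sigma_inv_X, Sigma_inv_X.T @ y)

# Equivalent route: whiten with L^{-1}, then run OLS.
X_w = np.linalg.solve(L, X)
y_w = np.linalg.solve(L, y)
beta_whitened, *_ = np.linalg.lstsq(X_w, y_w, rcond=None)
print(beta_gls, beta_whitened)
```

The whitening route makes explicit that GLS is OLS applied to a transformed model whose errors have identity covariance.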
本章脉络 从「总体 \(\boldsymbol\beta\)」到「样本 OLS」到「推断」到「更有效的 GLS」。 §4.1 在外生性 \(\mathbb E[\mathbf Xu]=0\) 下解出总体 \(\boldsymbol\beta=\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY]\),并用 FWL / OVB / 测量误差揭示「控制变量」与「偏误」的代数。§4.2 把总体矩换成样本矩得 OLS,几何上是投影。§4.3 是 OLS 的统计性质阶梯:均值独立 ⇒ 无偏;+ 同方差 ⇒ BLUE;基本假设 ⇒ 一致 + 渐近正态(三明治方差 \(\Omega\),同方差退化为 \(\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1}\))。§4.4 基于 \(\sqrt n(\hat{\boldsymbol\beta}-\boldsymbol\beta)\xrightarrow{d}N(0,\Omega)\) 构造 \(t\) / Wald / LM(与 Wald 等价)/ Delta 方法检验。§4.5 在异方差下用 GLS 恢复 BLUE。本章始终假设外生 \(\mathbb E[\mathbf Xu]=0\);下一章处理内生 \(\mathbb E[\mathbf Xu]\ne0\)(工具变量)。
Chapter arc From "population \(\boldsymbol\beta\)" to "sample OLS" to "inference" to "the more efficient GLS." §4.1 solves the population \(\boldsymbol\beta=\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY]\) under exogeneity \(\mathbb E[\mathbf Xu]=0\), and uses FWL / OVB / measurement error to reveal the algebra of "control variables" and "bias." §4.2 replaces population moments with sample moments to get OLS, which geometrically is a projection. §4.3 is the ladder of OLS's statistical properties: mean independence ⇒ unbiased; + homoskedasticity ⇒ BLUE; the baseline assumptions ⇒ consistent + asymptotically normal (the sandwich variance \(\Omega\), degenerating to \(\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1}\) under homoskedasticity). §4.4 builds \(t\) / Wald / LM (equivalent to Wald) / delta-method tests on \(\sqrt n(\hat{\boldsymbol\beta}-\boldsymbol\beta)\xrightarrow{d}N(0,\Omega)\). §4.5 uses GLS to restore the BLUE property under heteroskedasticity. This chapter always assumes exogeneity \(\mathbb E[\mathbf Xu]=0\); the next chapter handles endogeneity \(\mathbb E[\mathbf Xu]\ne0\) (instrumental variables).