4. Linear Regression with Exogenous Variables

Note

本章主题:外生(\(\mathbb E[\mathbf Xu]=0\))线性回归的总体性质、OLS 估计与推断。 §4.1 总体性质:由 \(\mathbb E[\mathbf Xu]=0\) 解出 \(\boldsymbol\beta=\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY]\) (4.2)(无完全共线性 ⟺ \(\mathbb E[\mathbf X\mathbf X']\) 可逆,引理 4.1);子向量的 Frisch-Waugh-Lovell「先净化再回归」(4.5, 命题 4.1);遗漏变量偏误(OVB)与测量误差 / 衰减偏误。 §4.2 OLS 估计:\(\hat{\boldsymbol\beta}_n=(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}(\frac1n\sum\mathbf X_iY_i)\) (4.10);矩阵形式 \(\hat{\boldsymbol\beta}_n=(\mathbb X'\mathbb X)^{-1}\mathbb X'\mathbb Y\) (4.15) = 投影(投影阵 \(\mathbb P\)、残差生成阵 \(\mathbb M\),皆幂等对称);子向量 FWL (4.17);拟合优度 \(R^2\)、调整 \(\bar R^2\)(仅描述拟合、无因果含义)。 §4.3 OLS 性质:均值独立 \(\mathbb E[u\mid\mathbf X]=0\) ⇒ 无偏(命题 4.2);再加同方差 ⇒ 高斯-马尔可夫 BLUE(命题 4.3);一致性(命题 4.4);极限分布 \(\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}N(0,\Omega)\) (命题 4.5),同方差下 \(\Omega=\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1}\)(命题 4.6);\(\Omega\) 的稳健估计 (4.25)。 §4.4 假设检验:单一线性约束(\(t\) 检验)、多重线性约束(Wald 检验)、拉格朗日乘子(LM)检验(与 Wald 等价)、非线性约束(Delta 方法)。 §4.5 GLS:异方差下 OLS 不再最优,GLS \(\hat{\boldsymbol\beta}=(\mathbb X'\boldsymbol\Sigma^{-1}\mathbb X)^{-1}\mathbb X'\boldsymbol\Sigma^{-1}\mathbb Y\) 为异方差下的 BLUE;不可观测方差时用 FGLS。

Note

Chapter theme: population properties, OLS estimation, and inference for the exogenous (\(\mathbb E[\mathbf Xu]=0\)) linear regression. §4.1 Population properties: from \(\mathbb E[\mathbf Xu]=0\) solve \(\boldsymbol\beta=\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY]\) (4.2) (no perfect collinearity ⟺ \(\mathbb E[\mathbf X\mathbf X']\) invertible, Lemma 4.1); the Frisch-Waugh-Lovell "partial out then regress" for a subvector (4.5, Proposition 4.1); omitted-variable bias (OVB) and measurement error / attenuation bias. §4.2 OLS estimation: \(\hat{\boldsymbol\beta}_n=(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}(\frac1n\sum\mathbf X_iY_i)\) (4.10); matrix form \(\hat{\boldsymbol\beta}_n=(\mathbb X'\mathbb X)^{-1}\mathbb X'\mathbb Y\) (4.15) = projection (projection matrix \(\mathbb P\), residual-maker \(\mathbb M\), both idempotent and symmetric); subvector FWL (4.17); goodness of fit \(R^2\), adjusted \(\bar R^2\) (description of fit only, no causal meaning). §4.3 OLS properties: mean independence \(\mathbb E[u\mid\mathbf X]=0\) ⇒ unbiased (Proposition 4.2); plus homoskedasticity ⇒ Gauss-Markov BLUE (Proposition 4.3); consistency (Proposition 4.4); limiting distribution \(\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}N(0,\Omega)\) (Proposition 4.5), with \(\Omega=\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1}\) under homoskedasticity (Proposition 4.6); a robust estimator of \(\Omega\) (4.25). §4.4 Hypothesis testing: a single linear restriction (\(t\) test), multiple linear restrictions (Wald test), the Lagrange multiplier (LM) test (equivalent to Wald), and nonlinear restrictions (delta method). §4.5 GLS: under heteroskedasticity OLS is no longer efficient, and GLS \(\hat{\boldsymbol\beta}=(\mathbb X'\boldsymbol\Sigma^{-1}\mathbb X)^{-1}\mathbb X'\boldsymbol\Sigma^{-1}\mathbb Y\) is the BLUE under heteroskedasticity; use FGLS when the variance is unobservable.

4.1 Population Properties of β

设 \((Y,\mathbf X,u)\) 如第 3 章定义,假设 \(\mathbf X\) 无完全共线性、\(\mathbb E[\mathbf X\mathbf X']\) 存在、\(\mathbb E[Y^2]<\infty\)、\(\mathbb E[\mathbf Xu]=\mathbf 0\)(外生性)。

4.1.1 解出 \(\boldsymbol\beta\).

Important

定义 4.1 / 引理 4.1 完全共线性:\(\mathbf X\) 完全共线 若 \(\exists\mathbf c\ne\mathbf 0,\mathbf c\in\mathbb R^{k+1}\) 使 \(\mathbb P(\mathbf c'\mathbf X=0)=1\)。引理 4.1:\(\mathbb E[\mathbf X\mathbf X']\) 可逆 当且仅当 \(\mathbf X\) 无完全共线性。

Note

证明(引理 4.1) 等价命题:\(\mathbb E[\mathbf X\mathbf X']\) 不可逆 当且仅当存在完全共线性。 不可逆 ⇒ 完全共线:\(\mathbb E[\mathbf X\mathbf X']\) 不可逆 ⇒ \(\exists\mathbf c\ne\mathbf 0\) 使 \(\mathbb E[\mathbf X\mathbf X']\mathbf c=\mathbf 0\),于是 \(0=\mathbf c'\mathbb E[\mathbf X\mathbf X']\mathbf c=\mathbb E[(\mathbf c'\mathbf X)^2]\),故 \(\mathbb P(\mathbf c'\mathbf X=0)=1\)。 完全共线 ⇒ 不可逆:设 \(\mathbb P(\mathbf c'\mathbf X=0)=1\),则 \(\mathbf c'\mathbb E[\mathbf X\mathbf X']=\mathbb E[\underbrace{\mathbf c'\mathbf X}_{=0}\mathbf X']=\mathbf 0\),故 \(\mathbb E[\mathbf X\mathbf X']\) 不可逆。\(\blacksquare\)

由 \(\mathbb E[\mathbf Xu]=\mathbf 0\) 与 \(u=Y-\mathbf X'\boldsymbol\beta\):

$$\mathbf 0=\mathbb E[\mathbf Xu]=\mathbb E[\mathbf X(Y-\mathbf X'\boldsymbol\beta)]=\mathbb E[\mathbf XY]-\mathbb E[\mathbf X\mathbf X']\boldsymbol\beta\Rightarrow\mathbb E[\mathbf X\mathbf X']\boldsymbol\beta=\mathbb E[\mathbf XY] \tag{4.1}$$

$$\Rightarrow\boldsymbol\beta=\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY] \tag{4.2}$$

Tip

Remark 4.1 若 \(\mathbb E[\mathbf X\mathbf X']\) 不可逆,(4.1) 有多解;任意两解 \(\hat{\boldsymbol\beta},\tilde{\boldsymbol\beta}\) 满足 \(\mathbb P(\mathbf X'\hat{\boldsymbol\beta}=\mathbf X'\tilde{\boldsymbol\beta})=1\)(同样的预测值)。这对预测/最优线性近似无碍,但使因果解读难以进行。

Let \((Y,\mathbf X,u)\) be as defined in Chapter 3, and assume \(\mathbf X\) has no perfect collinearity, \(\mathbb E[\mathbf X\mathbf X']\) exists, \(\mathbb E[Y^2]<\infty\), and \(\mathbb E[\mathbf Xu]=\mathbf 0\) (exogeneity).

4.1.1 Solving for \(\boldsymbol\beta\).

Important

Definition 4.1 / Lemma 4.1 Perfect collinearity: \(\mathbf X\) is perfectly collinear if \(\exists\mathbf c\ne\mathbf 0,\mathbf c\in\mathbb R^{k+1}\) with \(\mathbb P(\mathbf c'\mathbf X=0)=1\). Lemma 4.1: \(\mathbb E[\mathbf X\mathbf X']\) is invertible if and only if \(\mathbf X\) has no perfect collinearity.

Note

Proof (Lemma 4.1) Equivalent statement: \(\mathbb E[\mathbf X\mathbf X']\) is not invertible iff there is perfect collinearity. Not invertible ⇒ collinear: not invertible ⇒ \(\exists\mathbf c\ne\mathbf 0\) with \(\mathbb E[\mathbf X\mathbf X']\mathbf c=\mathbf 0\), so \(0=\mathbf c'\mathbb E[\mathbf X\mathbf X']\mathbf c=\mathbb E[(\mathbf c'\mathbf X)^2]\), hence \(\mathbb P(\mathbf c'\mathbf X=0)=1\). Collinear ⇒ not invertible: if \(\mathbb P(\mathbf c'\mathbf X=0)=1\), then \(\mathbf c'\mathbb E[\mathbf X\mathbf X']=\mathbb E[\underbrace{\mathbf c'\mathbf X}_{=0}\mathbf X']=\mathbf 0\), so \(\mathbb E[\mathbf X\mathbf X']\) is not invertible. \(\blacksquare\)

From \(\mathbb E[\mathbf Xu]=\mathbf 0\) with \(u=Y-\mathbf X'\boldsymbol\beta\):

$$\mathbf 0=\mathbb E[\mathbf Xu]=\mathbb E[\mathbf X(Y-\mathbf X'\boldsymbol\beta)]=\mathbb E[\mathbf XY]-\mathbb E[\mathbf X\mathbf X']\boldsymbol\beta\Rightarrow\mathbb E[\mathbf X\mathbf X']\boldsymbol\beta=\mathbb E[\mathbf XY] \tag{4.1}$$

$$\Rightarrow\boldsymbol\beta=\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY] \tag{4.2}$$

Tip

Remark 4.1 If \(\mathbb E[\mathbf X\mathbf X']\) is not invertible, (4.1) has multiple solutions; any two solutions \(\hat{\boldsymbol\beta},\tilde{\boldsymbol\beta}\) satisfy \(\mathbb P(\mathbf X'\hat{\boldsymbol\beta}=\mathbf X'\tilde{\boldsymbol\beta})=1\) (the same fitted values). This is harmless for prediction / best linear approximation, but makes the causal interpretation difficult.
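The non-uniqueness in Remark 4.1 can be seen numerically. The sketch below is illustrative only: the simulated design with regressors \((1, x, 2x)\), the seed, and all variable names are assumptions made for this example, and sample moments stand in for the population moments in (4.1).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
x = rng.normal(size=n)

# Regressors (1, x, 2x): the third is an exact multiple of the second,
# so c = (0, 2, -1) gives c'X_i = 0 for every observation.
X = np.column_stack([np.ones(n), x, 2.0 * x])
y = 1.0 + 3.0 * x + rng.normal(size=n)

Sxx = X.T @ X / n      # sample analog of E[XX'], singular here
Sxy = X.T @ y / n      # sample analog of E[XY]

# lstsq returns the minimum-norm least-squares solution and the numerical rank.
b_a, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

# A second solution: shift the first one along the null direction c.
c = np.array([0.0, 2.0, -1.0])
b_b = b_a + 5.0 * c

solves_a = np.allclose(Sxx @ b_a, Sxy)    # both satisfy the normal equations
solves_b = np.allclose(Sxx @ b_b, Sxy)
same_fit = np.allclose(X @ b_a, X @ b_b)  # and give identical fitted values
```

Both coefficient vectors solve the sample version of (4.1) and produce the same fitted values, yet they differ coefficient by coefficient, which is why individual coefficients cannot be interpreted under perfect collinearity.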

4.1.2 解出 \(\boldsymbol\beta\) 的子向量(Frisch-Waugh-Lovell). 把 \(\mathbf X\) 分为 \((\mathbf X_1)_{k_1\times1}\)、\((\mathbf X_2)_{k_2\times1}\),\(\boldsymbol\beta\) 分为 \(\boldsymbol\beta_1,\boldsymbol\beta_2\),

$$Y=\mathbf X'\boldsymbol\beta+u=\mathbf X_1'\boldsymbol\beta_1+\mathbf X_2'\boldsymbol\beta_2+u \tag{4.3}$$

由 \(\mathbb E[\mathbf Xu]=\mathbf 0\) 得 \(\mathbb E[\mathbf X_1u]=\mathbf 0_{k_1\times1}\)、\(\mathbb E[\mathbf X_2u]=\mathbf 0_{k_2\times1}\) (4.4)。若只关心 \(\boldsymbol\beta_1\),可分三步:

  1. 取 \(Y\) 的残差 \(\tilde Y=Y-\mathrm{BLP}(Y\mid\mathbf X_2)\);
  2. 取 \(\mathbf X_1\) 的残差 \(\tilde{\mathbf X}_1=\mathbf X_1-\mathrm{BLP}(\mathbf X_1\mid\mathbf X_2)\);
  3. 回归 \(\tilde Y=\tilde{\mathbf X}_1'\boldsymbol\beta_1+\tilde u\),得 \(\tilde{\boldsymbol\beta}=\mathbb E[\tilde{\mathbf X}_1\tilde{\mathbf X}_1']^{-1}\mathbb E[\tilde{\mathbf X}_1\tilde Y]\) (4.5)。

Important

命题 4.1 \(\tilde{\boldsymbol\beta}=\boldsymbol\beta_1\),其中 \(\tilde{\boldsymbol\beta}\) 由 (4.5) 定义、\(\boldsymbol\beta_1\) 由 (4.3) 定义。即「先把 \(\mathbf X_2\) 从 \(Y\) 与 \(\mathbf X_1\) 中净化掉、再回归」恰好得到 \(\boldsymbol\beta_1\)(Frisch-Waugh-Lovell 定理的总体版本)。

Tip

Remark 4.2 若 \(\mathbf X_2\) 含常数项,则 \(\tilde{\mathbf X}_1\) 必为零均值(由一阶条件),故 \(\mathbb E[\tilde{\mathbf X}_1\tilde{\mathbf X}_1']=\mathrm{Var}[\tilde{\mathbf X}_1]\);若 \(\mathbf X_2\) 不含常数,则 \(\tilde{\mathbf X}_1\) 未必为零均值,该等式一般不成立。

4.1.2 Solving for a subvector of \(\boldsymbol\beta\) (Frisch-Waugh-Lovell). Partition \(\mathbf X\) into \((\mathbf X_1)_{k_1\times1}\), \((\mathbf X_2)_{k_2\times1}\) and \(\boldsymbol\beta\) into \(\boldsymbol\beta_1,\boldsymbol\beta_2\),

$$Y=\mathbf X'\boldsymbol\beta+u=\mathbf X_1'\boldsymbol\beta_1+\mathbf X_2'\boldsymbol\beta_2+u \tag{4.3}$$

From \(\mathbb E[\mathbf Xu]=\mathbf 0\), \(\mathbb E[\mathbf X_1u]=\mathbf 0_{k_1\times1}\), \(\mathbb E[\mathbf X_2u]=\mathbf 0_{k_2\times1}\) (4.4). To recover only \(\boldsymbol\beta_1\), in three steps:

  1. Take \(Y\)'s residual \(\tilde Y=Y-\mathrm{BLP}(Y\mid\mathbf X_2)\);
  2. Take \(\mathbf X_1\)'s residual \(\tilde{\mathbf X}_1=\mathbf X_1-\mathrm{BLP}(\mathbf X_1\mid\mathbf X_2)\);
  3. Regress \(\tilde Y=\tilde{\mathbf X}_1'\boldsymbol\beta_1+\tilde u\), giving \(\tilde{\boldsymbol\beta}=\mathbb E[\tilde{\mathbf X}_1\tilde{\mathbf X}_1']^{-1}\mathbb E[\tilde{\mathbf X}_1\tilde Y]\) (4.5).

Important

Proposition 4.1 \(\tilde{\boldsymbol\beta}=\boldsymbol\beta_1\), where \(\tilde{\boldsymbol\beta}\) is defined in (4.5) and \(\boldsymbol\beta_1\) in (4.3). That is, "partial \(\mathbf X_2\) out of \(Y\) and \(\mathbf X_1\) first, then regress" recovers exactly \(\boldsymbol\beta_1\) (the population version of the Frisch-Waugh-Lovell theorem).

Tip

Remark 4.2 If \(\mathbf X_2\) contains a constant, then \(\tilde{\mathbf X}_1\) must have zero mean (by the first-order condition), so \(\mathbb E[\tilde{\mathbf X}_1\tilde{\mathbf X}_1']=\mathrm{Var}[\tilde{\mathbf X}_1]\); if \(\mathbf X_2\) contains no constant, \(\tilde{\mathbf X}_1\) need not have zero mean and this equality generally fails.

推论 4.1(\(\mathbf X_2\) 为常数的特例). 若 \(\mathbf X_2=1\),则 \(\mathrm{BLP}(Y\mid1)=\mathbb E[Y]\)、\(\mathrm{BLP}(\mathbf X_1\mid1)=\mathbb E[\mathbf X_1]\),于是 \(\tilde Y=Y-\mathbb E[Y]\)、\(\tilde{\mathbf X}_1=\mathbf X_1-\mathbb E[\mathbf X_1]\) (4.6),代入 (4.5):

$$\boldsymbol\beta_1=\mathrm{Var}(\mathbf X_1)^{-1}\mathrm{Cov}(\mathbf X_1,Y) \tag{4.7}$$

即去除常数后,斜率系数等于「协方差除以方差」。

4.1.3 遗漏变量偏误(OVB). 真实模型 \(Y=\beta_0+\mathbf X_1'\boldsymbol\beta_1+\mathbf X_2'\boldsymbol\beta_2+u\),若遗漏 \(\mathbf X_2\)、错误地估计 \(Y=\beta_0^*+\mathbf X_1'\boldsymbol\beta_1^*+u^*\),则由 (4.7):

$$\boldsymbol\beta_1^*=\boldsymbol\beta_1+\underbrace{\mathrm{Var}(\mathbf X_1)^{-1}\mathrm{Cov}(\mathbf X_1,\mathbf X_2)\boldsymbol\beta_2}_{\text{omitted variable bias}}$$

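偏误 = 「\(\mathbf X_2\) 对 \(\mathbf X_1\) 回归的系数 \(\mathrm{Var}(\mathbf X_1)^{-1}\mathrm{Cov}(\mathbf X_1,\mathbf X_2)\)」乘以「\(\mathbf X_2\) 的真实效应 \(\boldsymbol\beta_2\)」。推导:把真实模型代入 (4.7),由 \(\mathrm{Cov}(\mathbf X_1,u)=\mathbf 0\) 得 \(\mathrm{Cov}(\mathbf X_1,Y)=\mathrm{Var}(\mathbf X_1)\boldsymbol\beta_1+\mathrm{Cov}(\mathbf X_1,\mathbf X_2)\boldsymbol\beta_2\)。当 \(\mathbf X_1\) 与 \(\mathbf X_2\) 不相关(\(\mathrm{Cov}=0\))或 \(\boldsymbol\beta_2=0\) 时无偏误。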

Corollary 4.1 (\(\mathbf X_2\) a constant). If \(\mathbf X_2=1\), then \(\mathrm{BLP}(Y\mid1)=\mathbb E[Y]\), \(\mathrm{BLP}(\mathbf X_1\mid1)=\mathbb E[\mathbf X_1]\), so \(\tilde Y=Y-\mathbb E[Y]\), \(\tilde{\mathbf X}_1=\mathbf X_1-\mathbb E[\mathbf X_1]\) (4.6); substituting into (4.5):

$$\boldsymbol\beta_1=\mathrm{Var}(\mathbf X_1)^{-1}\mathrm{Cov}(\mathbf X_1,Y) \tag{4.7}$$

i.e., after removing the constant, the slope coefficient equals "covariance over variance."
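As a quick numerical check of (4.7), the sample slopes from a regression with a constant coincide with the sample "variance inverse times covariance". This is only a sketch on simulated data; the coefficients, seed, and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X1 = rng.normal(size=(n, 2))
X1[:, 1] += 0.5 * X1[:, 0]            # make the two regressors correlated
y = 2.0 + X1 @ np.array([1.0, -3.0]) + rng.normal(size=n)

# Regression of y on (1, X1); keep only the slope coefficients.
Xc = np.column_stack([np.ones(n), X1])
slopes_ols = np.linalg.lstsq(Xc, y, rcond=None)[0][1:]

# Sample analog of Var(X1)^{-1} Cov(X1, Y); np.cov uses the same
# normalization (n - 1) in both pieces, so it cancels.
V = np.cov(X1, rowvar=False)
C = np.array([np.cov(X1[:, j], y)[0, 1] for j in range(X1.shape[1])])
slopes_cov = np.linalg.solve(V, C)
```

The two slope vectors agree up to floating-point error, because demeaning is exactly what partialling out the constant does.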

4.1.3 Omitted-variable bias (OVB). True model \(Y=\beta_0+\mathbf X_1'\boldsymbol\beta_1+\mathbf X_2'\boldsymbol\beta_2+u\); if we omit \(\mathbf X_2\) and wrongly estimate \(Y=\beta_0^*+\mathbf X_1'\boldsymbol\beta_1^*+u^*\), then by (4.7):

$$\boldsymbol\beta_1^*=\boldsymbol\beta_1+\underbrace{\mathrm{Var}(\mathbf X_1)^{-1}\mathrm{Cov}(\mathbf X_1,\mathbf X_2)\boldsymbol\beta_2}_{\text{omitted variable bias}}$$

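The bias is the coefficient from regressing \(\mathbf X_2\) on \(\mathbf X_1\), namely \(\mathrm{Var}(\mathbf X_1)^{-1}\mathrm{Cov}(\mathbf X_1,\mathbf X_2)\), times the true effect \(\boldsymbol\beta_2\) of \(\mathbf X_2\). It follows from substituting the true model into (4.7): since \(\mathrm{Cov}(\mathbf X_1,u)=\mathbf 0\), \(\mathrm{Cov}(\mathbf X_1,Y)=\mathrm{Var}(\mathbf X_1)\boldsymbol\beta_1+\mathrm{Cov}(\mathbf X_1,\mathbf X_2)\boldsymbol\beta_2\). There is no bias when \(\mathbf X_1\) and \(\mathbf X_2\) are uncorrelated (\(\mathrm{Cov}=0\)) or \(\boldsymbol\beta_2=0\).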
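The omitted-variable formula also holds exactly in any sample with the OLS estimates in place of the population coefficients, which makes it easy to verify. The sketch below assumes a simulated design with scalar \(X_1\) and \(X_2\); the helper `ols` and all numbers are hypothetical choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)        # X2 is correlated with X1
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)

def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
b_long = ols(np.column_stack([ones, x1, x2]), y)   # (b0, b1, b2): correct model
b_short = ols(np.column_stack([ones, x1]), y)      # (b0*, b1*): X2 omitted
delta = ols(np.column_stack([ones, x1]), x2)[1]    # slope from regressing X2 on X1

bias_actual = b_short[1] - b_long[1]
bias_formula = delta * b_long[2]
```

Here `bias_actual` and `bias_formula` coincide up to rounding, and with this design the short-regression slope lands near \(2 + 0.8 \times 1.5\) rather than near 2.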

4.1.4 测量误差(衰减偏误). 模型 \(Y=\beta_0+\mathbf X_1'\boldsymbol\beta_1+u\) (4.8),但 \(\mathbf X_1\) 不可观测,只能观测 \(\hat{\mathbf X}_1=\mathbf X_1+\mathbf v\)。经典测量误差假设:\(\mathbb E[\mathbf v]=\mathbf 0\)、\(\mathrm{Cov}(u,\mathbf v)=\mathbf 0\)、\(\mathrm{Cov}(\mathbf X_1,\mathbf v)=\mathbf 0\)。用 \(\hat{\mathbf X}_1\) 替代 \(\mathbf X_1\) 回归 \(Y=\beta_0^*+\hat{\mathbf X}_1'\boldsymbol\beta_1^*+u^*\),由推论 4.1:

$$\boldsymbol\beta_1^*=\mathrm{Var}(\hat{\mathbf X}_1)^{-1}\mathrm{Cov}(\hat{\mathbf X}_1,Y)=(\mathrm{Var}(\mathbf X_1)+\mathrm{Var}(\mathbf v))^{-1}\mathrm{Var}(\mathbf X_1)\boldsymbol\beta_1$$

一维时 \(\beta_1^*=\frac{\mathrm{Var}(X_1)}{\mathrm{Var}(X_1)+\mathrm{Var}(v)}\beta_1\),系数 \(\frac{\mathrm{Var}(X_1)}{\mathrm{Var}(X_1)+\mathrm{Var}(v)}\in(0,1)\),故 \(\beta_1^*\) 被向零方向压缩,称为衰减偏误(attenuation bias)。

例 4.2 更一般地把 \(\mathbf X\) 分为 \(X_0=1\)、\(X_1\in\mathbb R\)(误测)、\(\mathbf X_2\in\mathbb R^{k_2}\)(精确观测),\(Y=\beta_0+\beta_1X_1+\mathbf X_2'\boldsymbol\beta_2+u\);用 FWL 先净化 \(\mathbf X_2\)(\(\tilde{\hat X}_1=\hat X_1-\mathrm{BLP}(\hat X_1\mid1,\mathbf X_2)\)、\(\tilde Y=Y-\mathrm{BLP}(Y\mid1,\mathbf X_2)\))再处理。若另假设 \(v\) 与 \(\mathbf X_2\) 不相关(从而 \(\mathrm{BLP}(v\mid1,\mathbf X_2)=0\)),可证 \(\tilde{\hat X}_1=\tilde X_1+v\),衰减结论仍成立:\(\beta_1^*=\frac{\mathrm{Var}(\tilde X_1)}{\mathrm{Var}(\tilde X_1)+\mathrm{Var}(v)}\beta_1\)。

4.1.4 Measurement error (attenuation bias). Model \(Y=\beta_0+\mathbf X_1'\boldsymbol\beta_1+u\) (4.8), but \(\mathbf X_1\) is unobservable and we observe only \(\hat{\mathbf X}_1=\mathbf X_1+\mathbf v\). Classical measurement-error assumptions: \(\mathbb E[\mathbf v]=\mathbf 0\), \(\mathrm{Cov}(u,\mathbf v)=\mathbf 0\), \(\mathrm{Cov}(\mathbf X_1,\mathbf v)=\mathbf 0\). Regressing \(Y=\beta_0^*+\hat{\mathbf X}_1'\boldsymbol\beta_1^*+u^*\) with \(\hat{\mathbf X}_1\) in place of \(\mathbf X_1\), by Corollary 4.1:

$$\boldsymbol\beta_1^*=\mathrm{Var}(\hat{\mathbf X}_1)^{-1}\mathrm{Cov}(\hat{\mathbf X}_1,Y)=(\mathrm{Var}(\mathbf X_1)+\mathrm{Var}(\mathbf v))^{-1}\mathrm{Var}(\mathbf X_1)\boldsymbol\beta_1$$

In one dimension \(\beta_1^*=\frac{\mathrm{Var}(X_1)}{\mathrm{Var}(X_1)+\mathrm{Var}(v)}\beta_1\), where the factor \(\frac{\mathrm{Var}(X_1)}{\mathrm{Var}(X_1)+\mathrm{Var}(v)}\in(0,1)\), so \(\beta_1^*\) is shrunk toward zero — called attenuation bias.

Example 4.2 More generally, partition \(\mathbf X\) into \(X_0=1\), \(X_1\in\mathbb R\) (mismeasured), \(\mathbf X_2\in\mathbb R^{k_2}\) (measured exactly), \(Y=\beta_0+\beta_1X_1+\mathbf X_2'\boldsymbol\beta_2+u\), and use FWL to partial out \(\mathbf X_2\) first (\(\tilde{\hat X}_1=\hat X_1-\mathrm{BLP}(\hat X_1\mid1,\mathbf X_2)\), \(\tilde Y=Y-\mathrm{BLP}(Y\mid1,\mathbf X_2)\)). Assuming in addition that \(v\) is uncorrelated with \(\mathbf X_2\) (so that \(\mathrm{BLP}(v\mid1,\mathbf X_2)=0\)), one shows \(\tilde{\hat X}_1=\tilde X_1+v\), and the attenuation conclusion still holds: \(\beta_1^*=\frac{\mathrm{Var}(\tilde X_1)}{\mathrm{Var}(\tilde X_1)+\mathrm{Var}(v)}\beta_1\).
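A small simulation makes the one-dimensional attenuation factor concrete. It is a sketch under assumed values: \(\mathrm{Var}(X_1)=\mathrm{Var}(v)=1\), so the factor is \(1/2\), and \(\beta_1=2\); the seed, sample size, and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
beta1 = 2.0

x = rng.normal(scale=1.0, size=n)     # true regressor, Var(X1) = 1
v = rng.normal(scale=1.0, size=n)     # classical measurement error, Var(v) = 1
u = rng.normal(size=n)
y = 0.5 + beta1 * x + u
x_obs = x + v                         # what the researcher observes

def slope(a, b):
    """Sample Cov(a, b) / Var(a): the slope of b on a with a constant."""
    c = np.cov(a, b)
    return c[0, 1] / c[0, 0]

slope_true = slope(x, y)              # regression on the true regressor
slope_obs = slope(x_obs, y)           # regression on the mismeasured regressor
attenuation = 1.0 / (1.0 + 1.0)       # Var(X1) / (Var(X1) + Var(v))
```

With these values `slope_true` is close to 2, while `slope_obs` is close to `attenuation * beta1`, that is, about 1.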

4.2 Ordinary Least Squares (OLS) Estimator

4.2.1 估计 \(\boldsymbol\beta\). 在总体性质(\(\mathbb E[\mathbf Xu]=0\)、\(\mathbb E[\mathbf X\mathbf X']<\infty\)、无完全共线性、\((Y,\mathbf X)\sim P\))与样本 i.i.d. 假设下,总体 \(\boldsymbol\beta=\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY]\) (4.9) 的自然样本类比为

$$\hat{\boldsymbol\beta}_n=\Big(\frac1n\sum_{i=1}^n\mathbf X_i\mathbf X_i'\Big)^{-1}\Big(\frac1n\sum_{i=1}^n\mathbf X_iY_i\Big) \tag{4.10}$$

即普通最小二乘(OLS)估计量,它最小化观测值 \(Y_i\) 与拟合值 \(\hat Y_i=\mathbf X_i'\hat{\boldsymbol\beta}_n\) 的平方距离和:\(\hat{\boldsymbol\beta}_n=\arg\min_{\mathbf b}\frac1n\sum(Y_i-\mathbf X_i'\mathbf b)^2\)。一阶条件 \(\frac1n\sum\mathbf X_i(Y_i-\mathbf X_i'\hat{\boldsymbol\beta}_n)=\mathbf 0\) (4.11) 给出 (4.12)(与 (4.10) 同);由 CMT,\(\det(\frac1n\sum\mathbf X_i\mathbf X_i')\xrightarrow{p}\det(\mathbb E[\mathbf X\mathbf X'])\ne0\),故当 \(n\to\infty\) 时 \(\frac1n\sum\mathbf X_i\mathbf X_i'\) 以趋于 1 的概率可逆。残差 \(\hat u_i=Y_i-\hat Y_i\) 满足 \(\sum\mathbf X_i\hat u_i=\mathbf 0\) (4.13);因 \(X_{i,0}=1\),对 \(\beta_0\) 的一阶条件给出 \(\sum\hat u_i=0\) (4.14)。

4.2.2 OLS 作为投影. 用矩阵记号:\(\mathbb X_{n\times(k+1)}=(\mathbf X_1,\dots,\mathbf X_n)'\)(每行一个观测的转置)、\(\mathbb Y=(Y_1,\dots,Y_n)'\)、\(\mathbb U=(u_1,\dots,u_n)'\)、\(\hat{\mathbb Y}=\mathbb X\hat{\boldsymbol\beta}_n\)、\(\hat{\mathbb U}=\mathbb Y-\hat{\mathbb Y}\)。则

$$\hat{\boldsymbol\beta}_n=(\mathbb X'\mathbb X)^{-1}\mathbb X'\mathbb Y \tag{4.15}$$

为 \(\min_{\mathbf b}|\mathbb Y-\mathbb X\mathbf b|^2\) 之解。\(\mathbb X\) 的列空间 \(\mathrm{Col}[\mathbb X]=\{\mathbb X\mathbf b:\mathbf b\in\mathbb R^{k+1}\}\)。定义投影阵

$$\mathbb P=\mathbb X(\mathbb X'\mathbb X)^{-1}\mathbb X',\qquad\mathbb P\mathbb Y=\mathbb X\hat{\boldsymbol\beta}_n=\hat{\mathbb Y}$$

\(\mathbb P\mathbb Y\in\mathrm{Col}[\mathbb X]\),故 \(\mathbb P\) 把 \(\mathbb Y\) 投影到 \(\mathrm{Col}[\mathbb X]\)。残差生成阵 \(\mathbb M=\mathbb I-\mathbb P=\mathbb I-\mathbb X(\mathbb X'\mathbb X)^{-1}\mathbb X'\),把 \(\mathbb Y\) 投影到与 \(\mathrm{Col}[\mathbb X]\) 正交的空间:\(\mathbb M\mathbb X=\mathbf 0\)、\(\mathbb M\mathbb Y=\hat{\mathbb U}\)、\(\mathbb M\mathbb U=\hat{\mathbb U}\)。\(\mathbb P\) 与 \(\mathbb M\) 皆幂等(\(\mathbb P^2=\mathbb P\)、\(\mathbb M^2=\mathbb M\))且对称(\(\mathbb P'=\mathbb P\)、\(\mathbb M'=\mathbb M\))。

4.2.1 Estimating \(\boldsymbol\beta\). Under the population properties (\(\mathbb E[\mathbf Xu]=0\), \(\mathbb E[\mathbf X\mathbf X']<\infty\), no perfect collinearity, \((Y,\mathbf X)\sim P\)) and an i.i.d. sample, the natural sample analog of \(\boldsymbol\beta=\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY]\) (4.9) is

$$\hat{\boldsymbol\beta}_n=\Big(\frac1n\sum_{i=1}^n\mathbf X_i\mathbf X_i'\Big)^{-1}\Big(\frac1n\sum_{i=1}^n\mathbf X_iY_i\Big) \tag{4.10}$$

the ordinary least squares (OLS) estimator, which minimizes the sum of squared distances between the observed \(Y_i\) and the fitted \(\hat Y_i=\mathbf X_i'\hat{\boldsymbol\beta}_n\): \(\hat{\boldsymbol\beta}_n=\arg\min_{\mathbf b}\frac1n\sum(Y_i-\mathbf X_i'\mathbf b)^2\). The first-order condition \(\frac1n\sum\mathbf X_i(Y_i-\mathbf X_i'\hat{\boldsymbol\beta}_n)=\mathbf 0\) (4.11) gives (4.12) (same as (4.10)); by CMT, \(\det(\frac1n\sum\mathbf X_i\mathbf X_i')\xrightarrow{p}\det(\mathbb E[\mathbf X\mathbf X'])\ne0\), so \(\frac1n\sum\mathbf X_i\mathbf X_i'\) is invertible with probability approaching one as \(n\to\infty\). The residual \(\hat u_i=Y_i-\hat Y_i\) satisfies \(\sum\mathbf X_i\hat u_i=\mathbf 0\) (4.13); since \(X_{i,0}=1\), the first-order condition for \(\beta_0\) gives \(\sum\hat u_i=0\) (4.14).
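The estimator (4.10) and the first-order conditions (4.13) and (4.14) can be checked in a few lines. This is a minimal sketch on simulated data; the function name `ols_fit`, the seed, and the coefficient values are assumptions made for the example.

```python
import numpy as np

def ols_fit(X, y):
    """beta_hat as in (4.10): solve (1/n) sum X_i X_i' b = (1/n) sum X_i Y_i."""
    n = X.shape[0]
    Sxx = X.T @ X / n
    Sxy = X.T @ y / n
    return np.linalg.solve(Sxx, Sxy)

rng = np.random.default_rng(4)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # first column is the constant
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta_hat = ols_fit(X, y)
resid = y - X @ beta_hat

foc = X.T @ resid          # (4.13): should be the zero vector
resid_sum = resid.sum()    # (4.14): first entry of foc, since X[:, 0] = 1
```

Solving the linear system rather than forming the inverse explicitly is a numerical convenience; the result is the same estimator.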

4.2.2 OLS as a projection. In matrix notation: \(\mathbb X_{n\times(k+1)}=(\mathbf X_1,\dots,\mathbf X_n)'\) (each row is one observation transposed), \(\mathbb Y=(Y_1,\dots,Y_n)'\), \(\mathbb U=(u_1,\dots,u_n)'\), \(\hat{\mathbb Y}=\mathbb X\hat{\boldsymbol\beta}_n\), \(\hat{\mathbb U}=\mathbb Y-\hat{\mathbb Y}\). Then

$$\hat{\boldsymbol\beta}_n=(\mathbb X'\mathbb X)^{-1}\mathbb X'\mathbb Y \tag{4.15}$$

solves \(\min_{\mathbf b}|\mathbb Y-\mathbb X\mathbf b|^2\). The column space of \(\mathbb X\) is \(\mathrm{Col}[\mathbb X]=\{\mathbb X\mathbf b:\mathbf b\in\mathbb R^{k+1}\}\). Define the projection matrix

$$\mathbb P=\mathbb X(\mathbb X'\mathbb X)^{-1}\mathbb X',\qquad\mathbb P\mathbb Y=\mathbb X\hat{\boldsymbol\beta}_n=\hat{\mathbb Y}$$

\(\mathbb P\mathbb Y\in\mathrm{Col}[\mathbb X]\), so \(\mathbb P\) projects \(\mathbb Y\) onto \(\mathrm{Col}[\mathbb X]\). The residual-maker matrix \(\mathbb M=\mathbb I-\mathbb P=\mathbb I-\mathbb X(\mathbb X'\mathbb X)^{-1}\mathbb X'\) projects \(\mathbb Y\) onto the space orthogonal to \(\mathrm{Col}[\mathbb X]\): \(\mathbb M\mathbb X=\mathbf 0\), \(\mathbb M\mathbb Y=\hat{\mathbb U}\), \(\mathbb M\mathbb U=\hat{\mathbb U}\). Both \(\mathbb P\) and \(\mathbb M\) are idempotent (\(\mathbb P^2=\mathbb P\), \(\mathbb M^2=\mathbb M\)) and symmetric (\(\mathbb P'=\mathbb P\), \(\mathbb M'=\mathbb M\)).
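The algebraic properties of \(\mathbb P\) and \(\mathbb M\) listed above are easy to confirm numerically. The following sketch uses a small simulated design; the dimensions, seed, and names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 0.5, -2.0])
U = rng.normal(size=n)
Y = X @ beta + U

# P = X (X'X)^{-1} X' and M = I - P
P = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - P

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # (4.15)
Y_hat = X @ beta_hat
U_hat = Y - Y_hat

checks = {
    "P idempotent": np.allclose(P @ P, P),
    "M idempotent": np.allclose(M @ M, M),
    "P symmetric": np.allclose(P, P.T),
    "M symmetric": np.allclose(M, M.T),
    "MX = 0": np.allclose(M @ X, 0.0),
    "PY = Y_hat": np.allclose(P @ Y, Y_hat),
    "MY = U_hat": np.allclose(M @ Y, U_hat),
    "MU = U_hat": np.allclose(M @ U, U_hat),
}
```

The last check holds because \(\mathbb M\mathbb Y=\mathbb M(\mathbb X\boldsymbol\beta+\mathbb U)=\mathbb M\mathbb U\). Forming the \(n\times n\) matrices is fine for a demonstration but is usually avoided for large \(n\).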

4.2.3 估计子向量 \(\hat{\boldsymbol\beta}_{1,n}\)(FWL 的样本版本). 把 \(\mathbb X\) 分为 \(\mathbb X_1,\mathbb X_2\),并对 \(j=1,2\) 定义 \(\mathbb P_j=\mathbb X_j(\mathbb X_j'\mathbb X_j)^{-1}\mathbb X_j'\)、\(\mathbb M_j=\mathbb I-\mathbb P_j\)。(4.3) 的样本类比为 \(\mathbb Y=\mathbb X_1\hat{\boldsymbol\beta}_{1,n}+\mathbb X_2\hat{\boldsymbol\beta}_{2,n}+\hat{\mathbb U}\)。用 \(\mathbb M_2\) 左乘可消去 \(\mathbb X_2\)(\(\mathbb M_2\mathbb X_2=0\)),且 \(\hat{\mathbb U}\) 不变(因 \(\mathbb X_2'\hat{\mathbb U}=0\),故 \(\mathbb M_2\hat{\mathbb U}=\hat{\mathbb U}\)):\(\mathbb M_2\mathbb Y=\mathbb M_2\mathbb X_1\hat{\boldsymbol\beta}_{1,n}+\hat{\mathbb U}\)。再左乘 \((\mathbb M_2\mathbb X_1)'\) 并用 \((\mathbb M_2\mathbb X_1)'\hat{\mathbb U}=\mathbb X_1'\mathbb M_2\hat{\mathbb U}=\mathbb X_1'\hat{\mathbb U}=0\):

$$\hat{\boldsymbol\beta}_{1,n}=[(\mathbb M_2\mathbb X_1)'(\mathbb M_2\mathbb X_1)]^{-1}(\mathbb M_2\mathbb X_1)'(\mathbb M_2\mathbb Y) \tag{4.17}$$

Frisch-Waugh-Lovell 分解。由 \(\mathbb M_2\) 幂等对称,可简化为 \(\hat{\boldsymbol\beta}_{1,n}=[\mathbb X_1'\mathbb M_2\mathbb X_1]^{-1}\mathbb X_1'\mathbb M_2\mathbb Y\)。

4.2.4 拟合优度. \(R\) 方 \(R^2\equiv\frac{\mathrm{ESS}}{\mathrm{TSS}}=1-\frac{\mathrm{SSR}}{\mathrm{TSS}}\) (4.18),其中解释平方和 \(\mathrm{ESS}=\sum(\hat Y_i-\bar Y_n)^2\)、残差平方和 \(\mathrm{SSR}=\sum\hat u_i^2\)、总平方和 \(\mathrm{TSS}=\sum(Y_i-\bar Y_n)^2\)。由 \(\sum\hat u_i(\hat Y_i-\bar Y_n)=0\) 得 \(\mathrm{TSS}=\mathrm{ESS}+\mathrm{SSR}\) (4.19),故 \(0\le R^2\le1\)。\(R^2=1\) 为完美拟合(对所有 \(i\) 有 \(\hat u_i=0\))、\(R^2=0\) 表示回归元在样本中无线性预测能力。加入更多变量 \(R^2\) 不减(最小二乘问题多了一个自由系数,最小化后的 SSR 不会增大),故引入调整 \(R^2\) 对回归元个数作惩罚:

$$\bar R^2=1-\frac{\frac1{n-(k+1)}\sum\hat u_i^2}{\frac1{n-1}\sum(Y_i-\bar Y_n)^2}=1-\frac{n-1}{n-k-1}\frac{\mathrm{SSR}}{\mathrm{TSS}}$$

\(\bar R^2\) 可为负。\(R^2\) 与 \(\bar R^2\) 都无因果基础——只是拟合的描述,不是因果解读。

4.2.3 Estimating a subvector \(\hat{\boldsymbol\beta}_{1,n}\) (sample version of FWL). Partition \(\mathbb X\) into \(\mathbb X_1,\mathbb X_2\) and, for \(j=1,2\), define \(\mathbb P_j=\mathbb X_j(\mathbb X_j'\mathbb X_j)^{-1}\mathbb X_j'\) and \(\mathbb M_j=\mathbb I-\mathbb P_j\). The sample analog of (4.3) is \(\mathbb Y=\mathbb X_1\hat{\boldsymbol\beta}_{1,n}+\mathbb X_2\hat{\boldsymbol\beta}_{2,n}+\hat{\mathbb U}\). Left-multiplying by \(\mathbb M_2\) removes \(\mathbb X_2\) (\(\mathbb M_2\mathbb X_2=0\)) and leaves \(\hat{\mathbb U}\) unchanged (\(\mathbb M_2\hat{\mathbb U}=\hat{\mathbb U}\), because \(\mathbb X_2'\hat{\mathbb U}=0\)): \(\mathbb M_2\mathbb Y=\mathbb M_2\mathbb X_1\hat{\boldsymbol\beta}_{1,n}+\hat{\mathbb U}\). Then left-multiply by \((\mathbb M_2\mathbb X_1)'\) and use \((\mathbb M_2\mathbb X_1)'\hat{\mathbb U}=\mathbb X_1'\mathbb M_2\hat{\mathbb U}=\mathbb X_1'\hat{\mathbb U}=0\):

$$\hat{\boldsymbol\beta}_{1,n}=[(\mathbb M_2\mathbb X_1)'(\mathbb M_2\mathbb X_1)]^{-1}(\mathbb M_2\mathbb X_1)'(\mathbb M_2\mathbb Y) \tag{4.17}$$

the Frisch-Waugh-Lovell decomposition. By idempotency and symmetry of \(\mathbb M_2\), it simplifies to \(\hat{\boldsymbol\beta}_{1,n}=[\mathbb X_1'\mathbb M_2\mathbb X_1]^{-1}\mathbb X_1'\mathbb M_2\mathbb Y\).
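The sample FWL result (4.17) can be verified by comparing the full regression with the two-step "residualize, then regress" procedure. A minimal sketch, assuming simulated data; the helper `residualize` and all numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 400
X1 = rng.normal(size=(n, 2))
X2 = np.column_stack([np.ones(n), rng.normal(size=n) + X1[:, 0]])  # correlated with X1
Y = X1 @ np.array([1.0, -1.0]) + X2 @ np.array([0.5, 2.0]) + rng.normal(size=n)

def residualize(A, Z):
    """Return M_Z A, the residual from regressing (each column of) A on Z."""
    coef = np.linalg.lstsq(Z, A, rcond=None)[0]
    return A - Z @ coef

# Full regression of Y on (X1, X2); the first two coefficients belong to X1.
beta_full = np.linalg.lstsq(np.column_stack([X1, X2]), Y, rcond=None)[0]

# FWL: partial X2 out of both X1 and Y, then regress residual on residual.
X1_tilde = residualize(X1, X2)
Y_tilde = residualize(Y, X2)
beta1_fwl = np.linalg.lstsq(X1_tilde, Y_tilde, rcond=None)[0]
```

`beta1_fwl` matches `beta_full[:2]` up to floating-point error. Residualizing by least squares avoids building the \(n\times n\) matrix \(\mathbb M_2\) explicitly.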

4.2.4 Measures of fit. The \(R\)-squared is \(R^2\equiv\frac{\mathrm{ESS}}{\mathrm{TSS}}=1-\frac{\mathrm{SSR}}{\mathrm{TSS}}\) (4.18), where the explained sum of squares is \(\mathrm{ESS}=\sum(\hat Y_i-\bar Y_n)^2\), the sum of squared residuals is \(\mathrm{SSR}=\sum\hat u_i^2\), and the total sum of squares is \(\mathrm{TSS}=\sum(Y_i-\bar Y_n)^2\). Since \(\sum\hat u_i(\hat Y_i-\bar Y_n)=0\), \(\mathrm{TSS}=\mathrm{ESS}+\mathrm{SSR}\) (4.19), hence \(0\le R^2\le1\). \(R^2=1\) is a perfect fit (\(\hat u_i=0\) for all \(i\)); \(R^2=0\) means the regressors have no linear predictive power in the sample. Adding regressors never decreases \(R^2\), because the minimized SSR cannot increase when the least-squares problem gains a free coefficient, so the adjusted \(R^2\) penalizes the number of regressors:

$$\bar R^2=1-\frac{\frac1{n-(k+1)}\sum\hat u_i^2}{\frac1{n-1}\sum(Y_i-\bar Y_n)^2}=1-\frac{n-1}{n-k-1}\frac{\mathrm{SSR}}{\mathrm{TSS}}$$

\(\bar R^2\) can be negative. Neither \(R^2\) nor \(\bar R^2\) has any causal basis — they are descriptions of fit, not causal interpretations.
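The two fit measures are simple to compute, and adding an irrelevant regressor shows why \(R^2\) alone is not a useful guide to model choice. This is a sketch under assumptions: the function `r_squared`, the simulated data, and the "noise" regressor are all hypothetical.

```python
import numpy as np

def r_squared(X, y):
    """Return (R2, adjusted R2); X must contain a column of ones."""
    n, p = X.shape                          # p = k + 1 regressors incl. constant
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_hat
    ssr = resid @ resid
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ssr / tss
    r2_adj = 1.0 - (n - 1) / (n - p) * ssr / tss
    return r2, r2_adj

rng = np.random.default_rng(7)
n = 100
x = rng.normal(size=n)
noise = rng.normal(size=n)                  # regressor unrelated to y
y = 1.0 + 2.0 * x + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_small, noise])

r2_small, adj_small = r_squared(X_small, y)
r2_big, adj_big = r_squared(X_big, y)
```

`r2_big` is never below `r2_small`, even though `noise` has nothing to do with `y`; the adjusted measure may move in either direction.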

4.3 Properties of the OLS Estimator

以下都在 §4.2.1 的基本假设下,子节会另加假设。

4.3.1 无偏性.

Important

命题 4.2(无偏) 额外假设 \(\mathbb E[u\mid\mathbf X]=0\)(均值独立),则 OLS 无偏:\(\mathbb E[\hat{\boldsymbol\beta}_n]=\boldsymbol\beta\)。

Note

证明(命题 4.2) \(\hat{\boldsymbol\beta}_n=(\mathbb X'\mathbb X)^{-1}\mathbb X'\mathbb Y=(\mathbb X'\mathbb X)^{-1}\mathbb X'(\mathbb X\boldsymbol\beta+\mathbb U)=\boldsymbol\beta+(\mathbb X'\mathbb X)^{-1}\mathbb X'\mathbb U\)。取条件期望: $$\mathbb E[\hat{\boldsymbol\beta}_n\mid\mathbf X_1,\dots,\mathbf X_n]=\boldsymbol\beta+(\mathbb X'\mathbb X)^{-1}\mathbb X'\underbrace{\mathbb E[\mathbb U\mid\mathbf X_1,\dots,\mathbf X_n]}_{=0\,\because\,\mathbb E[u_i\mid\mathbf X_i]=0}=\boldsymbol\beta$$ 由重期望定律 \(\mathbb E[\hat{\boldsymbol\beta}_n]=\mathbb E[\mathbb E[\hat{\boldsymbol\beta}_n\mid\mathbf X]]=\boldsymbol\beta\)。\(\blacksquare\)

4.3.2 高斯-马尔可夫定理.

Important

定义 4.2 / 命题 4.3(高斯-马尔可夫) 同方差:\(\mathrm{Var}(u\mid\mathbf X)\) 为常数;否则为异方差。 命题 4.3:额外假设 \(\mathbb E[u\mid\mathbf X]=0\)(均值独立)与 \(\mathrm{Var}(u\mid\mathbf X)=\sigma^2\)(同方差),则 \(\hat{\boldsymbol\beta}_n\) 是 \(\boldsymbol\beta\) 的「最优」估计量(BLUE),即在所有形如 \(\mathbb A'\mathbb Y\)(\(\mathbb A=\mathbb A(\mathbf X_1,\dots,\mathbf X_n)\))且无偏(\(\mathbb E[\mathbb A'\mathbb Y\mid\mathbf X]=\boldsymbol\beta\))的估计量中,\(\hat{\boldsymbol\beta}_n\) 的 \(\mathrm{Var}(\mathbb A'\mathbb Y\mid\mathbf X)\) 最小。

Note

证明(命题 4.3) 无偏要求 \(\mathbb E[\mathbb A'\mathbb Y\mid\mathbf X]=\mathbb A'\mathbb X\boldsymbol\beta=\boldsymbol\beta\),即 \(\mathbb A'\mathbb X=\mathbb I\)。条件方差 \(\mathrm{Var}(\mathbb A'\mathbb Y\mid\mathbf X)=\mathbb A'\mathrm{Var}(\mathbb U\mid\mathbf X)\mathbb A=\mathbb A'\mathbb A\sigma^2\) (4.21)。OLS 对应 \(\mathbb A'=(\mathbb X'\mathbb X)^{-1}\mathbb X'\),其方差 \((\mathbb X'\mathbb X)^{-1}\sigma^2\)。需证 \(\mathbb A'\mathbb A-(\mathbb X'\mathbb X)^{-1}\) 半正定。令 \(\mathbb C=\mathbb A-\mathbb X(\mathbb X'\mathbb X)^{-1}\),由 \(\mathbb A'\mathbb X=\mathbb I\) 得 \(\mathbb X'\mathbb C=\mathbb X'\mathbb A-\mathbb I=\mathbf 0\);把 \(\mathbb A=\mathbb C+\mathbb X(\mathbb X'\mathbb X)^{-1}\) 代入 \(\mathbb A'\mathbb A\) 展开,交叉项消失,于是 $$\mathbb A'\mathbb A-(\mathbb X'\mathbb X)^{-1}=\mathbb C'\mathbb C$$ 对任意 \(\mathbf c\ne\mathbf 0\),\(\mathbf c'(\mathbb A'\mathbb A-(\mathbb X'\mathbb X)^{-1})\mathbf c=\mathbf c'\mathbb C'\mathbb C\mathbf c=(\mathbb C\mathbf c)'(\mathbb C\mathbf c)=\sum_i m_i^2\ge0\),其中 \(m_i\) 为 \(\mathbb C\mathbf c\) 的第 \(i\) 个分量;故该差半正定,OLS 方差最小。\(\blacksquare\)

All below are under the §4.2.1 baseline assumptions; subsections add extra assumptions.

4.3.1 Unbiasedness.

Important

Proposition 4.2 (Unbiased) Additionally assume \(\mathbb E[u\mid\mathbf X]=0\) (mean independence). Then OLS is unbiased: \(\mathbb E[\hat{\boldsymbol\beta}_n]=\boldsymbol\beta\).

Note

Proof (Proposition 4.2) \(\hat{\boldsymbol\beta}_n=(\mathbb X'\mathbb X)^{-1}\mathbb X'\mathbb Y=(\mathbb X'\mathbb X)^{-1}\mathbb X'(\mathbb X\boldsymbol\beta+\mathbb U)=\boldsymbol\beta+(\mathbb X'\mathbb X)^{-1}\mathbb X'\mathbb U\). Take conditional expectation: $$\mathbb E[\hat{\boldsymbol\beta}_n\mid\mathbf X_1,\dots,\mathbf X_n]=\boldsymbol\beta+(\mathbb X'\mathbb X)^{-1}\mathbb X'\underbrace{\mathbb E[\mathbb U\mid\mathbf X_1,\dots,\mathbf X_n]}_{=0\,\because\,\mathbb E[u_i\mid\mathbf X_i]=0}=\boldsymbol\beta$$ By LIE, \(\mathbb E[\hat{\boldsymbol\beta}_n]=\mathbb E[\mathbb E[\hat{\boldsymbol\beta}_n\mid\mathbf X]]=\boldsymbol\beta\). \(\blacksquare\)
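Unbiasedness is a finite-sample statement, so it can be illustrated by averaging \(\hat{\boldsymbol\beta}_n\) over many small samples. The Monte Carlo sketch below is an assumption-laden illustration: the sample size, number of replications, seed, and the heteroskedastic error design (which still satisfies \(\mathbb E[u\mid\mathbf X]=0\)) are all choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(8)
beta = np.array([1.0, 2.0])
n, reps = 30, 5_000

estimates = np.empty((reps, 2))
for r in range(reps):
    x = rng.normal(size=n)
    # Error with E[u | x] = 0 but Var(u | x) = x^2: mean independent, heteroskedastic.
    u = np.abs(x) * rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    y = X @ beta + u
    estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0]

mean_estimate = estimates.mean(axis=0)   # Monte Carlo approximation of E[beta_hat]
```

The average of the estimates is close to \(\boldsymbol\beta\) even with \(n=30\) and without homoskedasticity, consistent with Proposition 4.2 requiring only mean independence.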

4.3.2 Gauss-Markov Theorem.

Important

Definition 4.2 / Proposition 4.3 (Gauss-Markov) Homoskedastic: \(\mathrm{Var}(u\mid\mathbf X)\) is constant; otherwise heteroskedastic. Proposition 4.3: additionally assume \(\mathbb E[u\mid\mathbf X]=0\) (mean independence) and \(\mathrm{Var}(u\mid\mathbf X)=\sigma^2\) (homoskedasticity). Then \(\hat{\boldsymbol\beta}_n\) is the "best" estimator of \(\boldsymbol\beta\) (BLUE) — among all estimators of the form \(\mathbb A'\mathbb Y\) (\(\mathbb A=\mathbb A(\mathbf X_1,\dots,\mathbf X_n)\)) that are unbiased (\(\mathbb E[\mathbb A'\mathbb Y\mid\mathbf X]=\boldsymbol\beta\)), \(\hat{\boldsymbol\beta}_n\) has the smallest \(\mathrm{Var}(\mathbb A'\mathbb Y\mid\mathbf X)\).

Note

Proof (Proposition 4.3) Unbiasedness requires \(\mathbb E[\mathbb A'\mathbb Y\mid\mathbf X]=\mathbb A'\mathbb X\boldsymbol\beta=\boldsymbol\beta\), i.e. \(\mathbb A'\mathbb X=\mathbb I\). The conditional variance is \(\mathrm{Var}(\mathbb A'\mathbb Y\mid\mathbf X)=\mathbb A'\mathrm{Var}(\mathbb U\mid\mathbf X)\mathbb A=\mathbb A'\mathbb A\sigma^2\) (4.21). OLS corresponds to \(\mathbb A'=(\mathbb X'\mathbb X)^{-1}\mathbb X'\), with variance \((\mathbb X'\mathbb X)^{-1}\sigma^2\). We show \(\mathbb A'\mathbb A-(\mathbb X'\mathbb X)^{-1}\) is positive semi-definite. Let \(\mathbb C=\mathbb A-\mathbb X(\mathbb X'\mathbb X)^{-1}\); since \(\mathbb A'\mathbb X=\mathbb I\), \(\mathbb X'\mathbb C=\mathbb X'\mathbb A-\mathbb I=\mathbf 0\). Expanding \(\mathbb A'\mathbb A\) with \(\mathbb A=\mathbb C+\mathbb X(\mathbb X'\mathbb X)^{-1}\), the cross terms vanish, so $$\mathbb A'\mathbb A-(\mathbb X'\mathbb X)^{-1}=\mathbb C'\mathbb C$$ For any \(\mathbf c\ne\mathbf 0\), \(\mathbf c'(\mathbb A'\mathbb A-(\mathbb X'\mathbb X)^{-1})\mathbf c=\mathbf c'\mathbb C'\mathbb C\mathbf c=(\mathbb C\mathbf c)'(\mathbb C\mathbf c)=\sum_i m_i^2\ge0\), where \(m_i\) is the \(i\)-th entry of \(\mathbb C\mathbf c\). The difference is therefore positive semi-definite, so OLS has the smallest variance. \(\blacksquare\)
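The key identity in the proof can be checked against one concrete competitor. The sketch below assumes, purely for illustration, that the alternative linear unbiased estimator is weighted least squares with arbitrary positive weights; the weights, seed, and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

# OLS: A_ols' = (X'X)^{-1} X', so A_ols = X (X'X)^{-1}.
XtX_inv = np.linalg.inv(X.T @ X)
A_ols = X @ XtX_inv

# Competitor: A' = (X'WX)^{-1} X'W with W = diag(w), which satisfies A'X = I.
w = rng.uniform(0.5, 2.0, size=n)
WX = X * w[:, None]                         # W X
A_alt = WX @ np.linalg.inv(X.T @ WX)        # A = W X (X'WX)^{-1}

C = A_alt - A_ols
gap = A_alt.T @ A_alt - XtX_inv             # variance difference divided by sigma^2
eigenvalues = np.linalg.eigvalsh((gap + gap.T) / 2)
```

Here `A_alt.T @ X` is the identity, `X.T @ C` is zero, `gap` equals `C.T @ C`, and all eigenvalues of `gap` are nonnegative up to rounding, which is the positive semi-definiteness the proof establishes.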

4.3.3 一致性.

Important

命题 4.4(一致) OLS 一致:\(\hat{\boldsymbol\beta}_n\xrightarrow{p}\boldsymbol\beta\)。(仅需基本假设,无须均值独立或同方差。)

Note

证明(命题 4.4) 由 WLLN,\(\frac1n\sum\mathbf X_i\mathbf X_i'\xrightarrow{p}\mathbb E[\mathbf X\mathbf X']\)、\(\frac1n\sum\mathbf X_iY_i\xrightarrow{p}\mathbb E[\mathbf XY]\);无完全共线性使 \(\mathbb E[\mathbf X\mathbf X']\) 可逆,CMT 给出 \((\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}\xrightarrow{p}\mathbb E[\mathbf X\mathbf X']^{-1}\);边际收敛 ⇒ 联合收敛 + CMT(乘法连续): $$\hat{\boldsymbol\beta}_n=\Big(\tfrac1n\sum\mathbf X_i\mathbf X_i'\Big)^{-1}\Big(\tfrac1n\sum\mathbf X_iY_i\Big)\xrightarrow{p}\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY]=\boldsymbol\beta\quad\blacksquare$$

4.3.4 极限分布.

Important

命题 4.5 / 命题 4.6(极限分布) 命题 4.5:额外假设 \(\mathrm{Var}(\mathbf Xu)\) 存在,则 $$\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}N(0,\Omega),\quad\Omega=\mathbb E[\mathbf X\mathbf X']^{-1}\mathrm{Var}(\mathbf Xu)\mathbb E[\mathbf X\mathbf X']^{-1}$$ (三明治形式,异方差稳健)。命题 4.6:若再加 \(\mathbb E[u\mid\mathbf X]=0\)(均值独立)与 \(\mathrm{Var}(u\mid\mathbf X)=\sigma^2\)(同方差),则 $$\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}N(0,\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1})$$

Note

证明(命题 4.5 与 4.6) 命题 4.5:展开 \(\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)=(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}(\sqrt n\frac1n\sum\mathbf X_iu_i)\)。\(\mathbf X_iu_i\) i.i.d.、均值 \(\mathbb E[\mathbf Xu]=0\),由 CLT \(\sqrt n\frac1n\sum\mathbf X_iu_i\xrightarrow{d}N(0,\mathrm{Var}(\mathbf Xu))\);又 \((\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}\xrightarrow{p}\mathbb E[\mathbf X\mathbf X']^{-1}\),由 Slutsky:\(\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}\mathbb E[\mathbf X\mathbf X']^{-1}N(0,\mathrm{Var}(\mathbf Xu))=N(0,\Omega)\)。 命题 4.6:同方差使 \(\mathrm{Var}(\mathbf Xu)=\mathbb E[\mathbf X\mathbf X'u^2]=\mathbb E[\mathbf X\mathbf X'\mathbb E[u^2\mid\mathbf X]]=\sigma^2\mathbb E[\mathbf X\mathbf X']\) (4.22)(4.23),代入得 \(\Omega=\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1}\) (4.24)。\(\blacksquare\)

4.3.3 Consistency.

Important

Proposition 4.4 (Consistent) OLS is consistent: \(\hat{\boldsymbol\beta}_n\xrightarrow{p}\boldsymbol\beta\). (Only the baseline assumptions are needed, no mean independence or homoskedasticity.)

Note

Proof (Proposition 4.4) By WLLN, \(\frac1n\sum\mathbf X_i\mathbf X_i'\xrightarrow{p}\mathbb E[\mathbf X\mathbf X']\), \(\frac1n\sum\mathbf X_iY_i\xrightarrow{p}\mathbb E[\mathbf XY]\); no perfect collinearity makes \(\mathbb E[\mathbf X\mathbf X']\) invertible, and CMT gives \((\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}\xrightarrow{p}\mathbb E[\mathbf X\mathbf X']^{-1}\); marginal ⇒ joint convergence + CMT (multiplication is continuous): $$\hat{\boldsymbol\beta}_n=\Big(\tfrac1n\sum\mathbf X_i\mathbf X_i'\Big)^{-1}\Big(\tfrac1n\sum\mathbf X_iY_i\Big)\xrightarrow{p}\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY]=\boldsymbol\beta\quad\blacksquare$$
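Consistency can be illustrated by computing (4.10) on samples of increasing size. The following is a sketch only: the heteroskedastic design, the sample sizes, and the function name `ols_error` are assumptions, and a single draw per sample size is shown rather than a full Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(10)
beta = np.array([1.0, -2.0])

def ols_error(n):
    """Largest absolute coordinate of beta_hat - beta in one sample of size n."""
    x = rng.normal(size=n)
    u = (1.0 + np.abs(x)) * rng.normal(size=n)     # heteroskedastic, E[Xu] = 0
    X = np.column_stack([np.ones(n), x])
    y = X @ beta + u
    b = np.linalg.solve(X.T @ X / n, X.T @ y / n)  # sample moments as in (4.10)
    return np.max(np.abs(b - beta))

sizes = [100, 10_000, 1_000_000]
errors = [ols_error(n) for n in sizes]
```

The estimation error at the largest sample size is small, in line with \(\hat{\boldsymbol\beta}_n\xrightarrow{p}\boldsymbol\beta\); individual draws need not decrease monotonically, since convergence in probability is a statement about the distribution, not about every path.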

4.3.4 Limiting distribution.

Important

Proposition 4.5 / Proposition 4.6 (Limiting distribution) Proposition 4.5: additionally assume \(\mathrm{Var}(\mathbf Xu)\) exists. Then $$\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}N(0,\Omega),\quad\Omega=\mathbb E[\mathbf X\mathbf X']^{-1}\mathrm{Var}(\mathbf Xu)\mathbb E[\mathbf X\mathbf X']^{-1}$$ (the sandwich form, heteroskedasticity-robust). Proposition 4.6: if we further add \(\mathbb E[u\mid\mathbf X]=0\) (mean independence) and \(\mathrm{Var}(u\mid\mathbf X)=\sigma^2\) (homoskedasticity), then $$\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}N(0,\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1})$$

Note

Proof (Propositions 4.5 and 4.6) Prop 4.5: expand \(\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)=(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}(\sqrt n\frac1n\sum\mathbf X_iu_i)\). The \(\mathbf X_iu_i\) are i.i.d. with mean \(\mathbb E[\mathbf Xu]=0\), so by CLT \(\sqrt n\frac1n\sum\mathbf X_iu_i\xrightarrow{d}N(0,\mathrm{Var}(\mathbf Xu))\); also \((\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}\xrightarrow{p}\mathbb E[\mathbf X\mathbf X']^{-1}\), so by Slutsky \(\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}\mathbb E[\mathbf X\mathbf X']^{-1}N(0,\mathrm{Var}(\mathbf Xu))=N(0,\Omega)\). Prop 4.6: homoskedasticity gives \(\mathrm{Var}(\mathbf Xu)=\mathbb E[\mathbf X\mathbf X'u^2]=\mathbb E[\mathbf X\mathbf X'\mathbb E[u^2\mid\mathbf X]]=\sigma^2\mathbb E[\mathbf X\mathbf X']\) (4.22)(4.23), so \(\Omega=\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1}\) (4.24). \(\blacksquare\)
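The difference between the sandwich variance and the homoskedastic formula shows up clearly in a simulation. The design below is an assumption chosen so that \(\Omega\) can be computed by hand: with \(\mathbf X=(1,x)'\), \(x\sim N(0,1)\), and \(u=x\,e\) for independent \(e\sim N(0,1)\), we get \(\mathbb E[\mathbf X\mathbf X']=\mathbb I\) and \(\mathrm{Var}(\mathbf Xu)=\mathrm{diag}(1,3)\), so \(\Omega=\mathrm{diag}(1,3)\), whereas \(\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1}=\mathbb I\).

```python
import numpy as np

rng = np.random.default_rng(11)
beta = np.array([1.0, 2.0])
n, reps = 500, 4_000

z = np.empty((reps, 2))
for r in range(reps):
    x = rng.normal(size=n)
    u = x * rng.normal(size=n)                # Var(u | x) = x^2
    X = np.column_stack([np.ones(n), x])
    y = X @ beta + u
    b = np.linalg.solve(X.T @ X, X.T @ y)
    z[r] = np.sqrt(n) * (b - beta)            # sqrt(n) (beta_hat - beta)

mc_cov = np.cov(z, rowvar=False)              # Monte Carlo covariance of z
omega_sandwich = np.diag([1.0, 3.0])          # Proposition 4.5 for this design
omega_homosk = np.eye(2)                      # Proposition 4.6 formula, not valid here
```

The Monte Carlo covariance is close to `omega_sandwich`; the slope variance is near 3 rather than 1, so the homoskedastic formula would understate it in this design.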

4.3.5 Estimating \(\Omega\). We do not know the true \(\Omega\) and must construct a consistent estimator. Under homoskedasticity \(\Omega=\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1}\), with the natural estimator \(\hat\Omega_n=(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}\hat\sigma_n^2\) (\(\hat\sigma_n^2=\frac1n\sum\hat u_i^2\)). In general (heteroskedasticity-robust):

$$\hat\Omega_n=\Big(\tfrac1n\sum\mathbf X_i\mathbf X_i'\Big)^{-1}\Big(\tfrac1n\sum\mathbf X_i\mathbf X_i'\hat u_i^2\Big)\Big(\tfrac1n\sum\mathbf X_i\mathbf X_i'\Big)^{-1} \tag{4.25}$$

We must show that the middle term satisfies \(\frac1n\sum\mathbf X_i\mathbf X_i'\hat u_i^2\xrightarrow{p}\mathrm{Var}(\mathbf Xu)\). Split \(\hat u_i^2=u_i^2+(\hat u_i^2-u_i^2)\) (4.26). For the first part, the WLLN gives \(\frac1n\sum\mathbf X_i\mathbf X_i'u_i^2\xrightarrow{p}\mathbb E[\mathbf X\mathbf X'u^2]\), which equals \(\mathrm{Var}(\mathbf Xu)\) because \(\mathbb E[\mathbf Xu]=0\). For the second part, \(\hat u_i=u_i-\mathbf X_i'(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\) implies \(\hat u_i^2-u_i^2=-2u_i\mathbf X_i'(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)+[\mathbf X_i'(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)]^2\); together with the lemma below, this shows that its contribution \(\xrightarrow{p}0\).

Important

Lemma 4.2 Let \(\mathbf Z_1,\dots,\mathbf Z_n\) be i.i.d. with \(\mathbb E[|\mathbf Z_i|^r]<\infty\). Then \(\max_{1\le i\le n}|\mathbf Z_i|=o_P(n^{1/r})\), i.e. \(\frac{\max_{1\le i\le n}|\mathbf Z_i|}{n^{1/r}}\xrightarrow{p}0\).

Finally \(\hat\Omega_n\xrightarrow{p}\Omega\).
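Both estimators of \(\Omega\) are a few lines of matrix algebra. The following is a minimal sketch, assuming a hypothetical simulated design (intercept plus one standard normal regressor, errors scaled by \(1+|x|\)); the function name `ols_with_omega` and all numerical settings are illustrative assumptions, not part of the text.

```python
import numpy as np

def ols_with_omega(X, y):
    """Return OLS coefficients with the homoskedastic and the robust (4.25) estimate of Omega."""
    n = X.shape[0]
    Sxx_inv = np.linalg.inv(X.T @ X / n)
    beta_hat = Sxx_inv @ (X.T @ y / n)
    u_hat = y - X @ beta_hat                       # OLS residuals
    omega_homo = Sxx_inv * np.mean(u_hat ** 2)     # (1/n sum X X')^{-1} * sigma_hat^2
    meat = (X * u_hat[:, None] ** 2).T @ X / n     # (1/n) sum X_i X_i' u_hat_i^2
    omega_robust = Sxx_inv @ meat @ Sxx_inv        # sandwich form (4.25)
    return beta_hat, omega_homo, omega_robust

# Hypothetical heteroskedastic DGP, used only for illustration
rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
u = rng.normal(size=n) * (1.0 + np.abs(x))         # Var(u | x) = (1 + |x|)^2
y = X @ np.array([1.0, 2.0]) + u

beta_hat, omega_homo, omega_robust = ols_with_omega(X, y)
print(np.diag(omega_homo), np.diag(omega_robust))
```

In this design the error variance rises with \(|x|\), so the robust estimate of the slope variance should come out noticeably larger than the homoskedastic one; under homoskedastic errors the two would be close.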

4.4 Hypothesis Testing

4.4.1 A single linear restriction (\(t\) test, \(p=1\)). Test \(H_0:\mathbf r'\boldsymbol\beta=c\) vs \(H_1:\mathbf r'\boldsymbol\beta\ne c\) (\(\mathbf r\in\mathbb R^{k+1}\), \(c\in\mathbb R\)). From \(\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}N(0,\Omega)\) and CMT: \(\sqrt n(\mathbf r'\hat{\boldsymbol\beta}_n-\mathbf r'\boldsymbol\beta)\xrightarrow{d}N(0,\mathbf r'\Omega\mathbf r)\). Take the statistic

$$T_n=\frac{\sqrt n(\mathbf r'\hat{\boldsymbol\beta}_n-c)}{\sqrt{\mathbf r'\hat\Omega_n\mathbf r}}\xrightarrow{d}N(0,1)\quad(\text{under }H_0)$$

and the test \(\phi_n=\mathbf 1\{|T_n|>z_{1-\frac\alpha2}\}\) is consistent in level: its rejection probability under \(H_0\) converges to \(\alpha\).

4.4.2 Multiple linear restrictions (Wald test). Test \(H_0:\mathbf R\boldsymbol\beta=\mathbf c\) vs \(H_1:\mathbf R\boldsymbol\beta\ne\mathbf c\), with \(\mathbf R\) a \(p\times(k+1)\) matrix of linearly independent rows (so \(\mathbf R\Omega\mathbf R'\) is invertible). From \(\sqrt n(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf R\boldsymbol\beta)\xrightarrow{d}N(0,\mathbf R\Omega\mathbf R')\) and the fact that \(\mathbf x\sim N(0,\Sigma)\) with \(\mathbf x\in\mathbb R^p\) and \(\Sigma\) invertible implies \(\mathbf x'\Sigma^{-1}\mathbf x\sim\chi^2_p\):

$$T_n=n(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)'(\mathbf R\hat\Omega_n\mathbf R')^{-1}(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)\xrightarrow{d}\chi^2_p\quad(\text{under }H_0)$$

and \(\phi_n=\mathbf 1\{T_n>c_{p,1-\alpha}\}\), where \(c_{p,1-\alpha}\) is the \(1-\alpha\) quantile of \(\chi^2_p\).
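The \(t\) and Wald statistics can be computed directly from \(\hat{\boldsymbol\beta}_n\) and \(\hat\Omega_n\). The sketch below uses made-up inputs: the values of `beta_hat`, `omega_hat`, and `n`, and the helper names `t_stat` and `wald_stat`, are hypothetical stand-ins for the output of an actual OLS fit. The critical values are the standard \(N(0,1)\) and \(\chi^2_2\) quantiles for \(\alpha=0.05\).

```python
import numpy as np

def t_stat(beta_hat, omega_hat, r, c, n):
    """t statistic: sqrt(n) (r'b - c) / sqrt(r' Omega r)."""
    return float(np.sqrt(n) * (r @ beta_hat - c) / np.sqrt(r @ omega_hat @ r))

def wald_stat(beta_hat, omega_hat, R, c, n):
    """Wald statistic: n (R b - c)' (R Omega R')^{-1} (R b - c)."""
    R = np.atleast_2d(R)
    diff = R @ beta_hat - np.atleast_1d(c)
    return float(n * diff @ np.linalg.solve(R @ omega_hat @ R.T, diff))

# Hypothetical inputs standing in for an OLS fit with k + 1 = 3 coefficients
n = 500
beta_hat = np.array([0.9, 2.1, -0.4])
omega_hat = np.array([[2.0, 0.3, 0.0],
                      [0.3, 1.5, 0.2],
                      [0.0, 0.2, 1.0]])

r = np.array([0.0, 1.0, 0.0])                  # H0: beta_1 = 2
T_t = t_stat(beta_hat, omega_hat, r, 2.0, n)

R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])                # H0: beta_1 = 2 and beta_2 = 0
T_w = wald_stat(beta_hat, omega_hat, R, np.array([2.0, 0.0]), n)

Z_0975 = 1.960        # 0.975 quantile of N(0, 1)
CHI2_2_095 = 5.991    # 0.95 quantile of chi^2 with 2 degrees of freedom
reject_t = abs(T_t) > Z_0975
reject_wald = T_w > CHI2_2_095
print(T_t, reject_t, T_w, reject_wald)
```

With a single restriction (\(p=1\)) the Wald statistic equals the square of the \(t\) statistic, so the two-sided \(t\) test and the Wald test reach the same decision.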

4.4.3 Lagrange multiplier (LM) test. The constrained least squares (CLS) estimator \(\tilde{\boldsymbol\beta}_n\) solves \(\min_{\boldsymbol\beta:\mathbf R\boldsymbol\beta=\mathbf c}\frac1n\sum(Y_i-\mathbf X_i'\boldsymbol\beta)^2\). The Lagrangian is

$$\mathcal L(\boldsymbol\beta,\boldsymbol\lambda)=\frac1{2n}\sum(Y_i-\mathbf X_i'\boldsymbol\beta)^2+\boldsymbol\lambda'(\mathbf R\boldsymbol\beta-\mathbf c)$$

(the factor \(\frac12\) cancels the 2 produced by differentiating the square). The first-order conditions are \(\frac{\partial\mathcal L}{\partial\boldsymbol\beta}=-\frac1n\sum\mathbf X_i(Y_i-\mathbf X_i'\tilde{\boldsymbol\beta}_n)+\mathbf R'\tilde{\boldsymbol\lambda}_n=\mathbf 0\) and \(\frac{\partial\mathcal L}{\partial\boldsymbol\lambda}=\mathbf R\tilde{\boldsymbol\beta}_n-\mathbf c=\mathbf 0\). Solving them for the multiplier gives

$$\tilde{\boldsymbol\lambda}_n=\Big(\mathbf R\big(\tfrac1n\sum\mathbf X_i\mathbf X_i'\big)^{-1}\mathbf R'\Big)^{-1}(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)$$

Under \(H_0\) (\(\mathbf R\boldsymbol\beta=\mathbf c\)), \(\tilde{\boldsymbol\lambda}_n\xrightarrow{p}0\). Its asymptotic distribution is \(\sqrt n\tilde{\boldsymbol\lambda}_n\xrightarrow{d}N(0,\Pi)\), with \(\Pi=(\mathbf R\mathbb E[\mathbf X\mathbf X']^{-1}\mathbf R')^{-1}\mathbf R\mathbb E[\mathbf X\mathbf X']^{-1}\mathrm{Var}(\mathbf Xu)\mathbb E[\mathbf X\mathbf X']^{-1}\mathbf R'(\mathbf R\mathbb E[\mathbf X\mathbf X']^{-1}\mathbf R')^{-1}\). The test statistic is \(T_n=n\tilde{\boldsymbol\lambda}_n'\hat\Pi^{-1}\tilde{\boldsymbol\lambda}_n\xrightarrow{d}\chi^2_p\) under \(H_0\), where \(\hat\Pi\) is a consistent estimator of \(\Pi\), and the test is \(\phi_n=\mathbf 1\{T_n>c_{p,1-\alpha}\}\).
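The multiplier formula and its limit follow from the two first-order conditions in a few lines. Write \(\hat Q_n=\frac1n\sum\mathbf X_i\mathbf X_i'\) (a shorthand introduced here only for this derivation). The first condition gives

$$\hat Q_n\tilde{\boldsymbol\beta}_n=\tfrac1n\sum\mathbf X_iY_i-\mathbf R'\tilde{\boldsymbol\lambda}_n\ \Longrightarrow\ \tilde{\boldsymbol\beta}_n=\hat{\boldsymbol\beta}_n-\hat Q_n^{-1}\mathbf R'\tilde{\boldsymbol\lambda}_n$$

Imposing the constraint \(\mathbf R\tilde{\boldsymbol\beta}_n=\mathbf c\) then yields

$$\mathbf R\hat{\boldsymbol\beta}_n-\mathbf R\hat Q_n^{-1}\mathbf R'\tilde{\boldsymbol\lambda}_n=\mathbf c\ \Longrightarrow\ \tilde{\boldsymbol\lambda}_n=\big(\mathbf R\hat Q_n^{-1}\mathbf R'\big)^{-1}(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)$$

Under \(H_0\), \(\mathbf c=\mathbf R\boldsymbol\beta\), so by Slutsky's theorem

$$\sqrt n\tilde{\boldsymbol\lambda}_n=\big(\mathbf R\hat Q_n^{-1}\mathbf R'\big)^{-1}\mathbf R\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}\big(\mathbf R\mathbb E[\mathbf X\mathbf X']^{-1}\mathbf R'\big)^{-1}N(0,\mathbf R\Omega\mathbf R')=N(0,\Pi)$$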

Important

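LM test = Wald test Rearranging the LM statistic, with \(\hat\Pi\) taken to be the plug-in estimator that replaces \(\mathbb E[\mathbf X\mathbf X']\) by \(\frac1n\sum\mathbf X_i\mathbf X_i'\) and \(\mathrm{Var}(\mathbf Xu)\) by the middle term of (4.25), one shows \(T_n=n(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)'(\mathbf R\hat\Omega_n\mathbf R')^{-1}(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)\), which is exactly the Wald statistic. The two tests are therefore equivalent, with the same asymptotic \(\chi^2_p\) distribution.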
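The equivalence can be checked numerically. The sketch below is an illustration under stated assumptions: the simulated data, the restriction matrix `R`, and the choice of \(\hat\Pi\) as the plug-in built from the unrestricted \(\hat\Omega_n\) of (4.25) are all assumptions made here; a \(\hat\Pi\) built from restricted residuals would in general give a different number.

```python
import numpy as np

# Hypothetical data: 3 coefficients, heteroskedastic errors
rng = np.random.default_rng(2)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n) * (1.0 + np.abs(X[:, 1]))

R = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0]])       # H0: beta_1 + beta_2 = 0 and beta_0 = 1
c = np.array([0.0, 1.0])

Q_inv = np.linalg.inv(X.T @ X / n)
beta_hat = Q_inv @ (X.T @ y / n)                                   # unrestricted OLS
u_hat = y - X @ beta_hat
omega_hat = Q_inv @ ((X * u_hat[:, None] ** 2).T @ X / n) @ Q_inv  # (4.25)

B = np.linalg.inv(R @ Q_inv @ R.T)
lam = B @ (R @ beta_hat - c)                # Lagrange multiplier
beta_cls = beta_hat - Q_inv @ R.T @ lam     # constrained least squares estimator
pi_hat = B @ R @ omega_hat @ R.T @ B        # plug-in estimate of Pi (assumption)

diff = R @ beta_hat - c
T_lm = float(n * lam @ np.linalg.solve(pi_hat, lam))
T_wald = float(n * diff @ np.linalg.solve(R @ omega_hat @ R.T, diff))
print(T_lm, T_wald)
```

The two printed statistics agree up to floating-point error, and `beta_cls` satisfies the restriction exactly.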

4.4.4 Nonlinear restrictions (delta method). Test \(H_0:f(\boldsymbol\beta)=\mathbf c\) vs \(H_1:f(\boldsymbol\beta)\ne\mathbf c\), where \(f:\mathbb R^{k+1}\to\mathbb R^p\) is continuously differentiable at \(\boldsymbol\beta\) with full-row-rank Jacobian \(D_\beta f(\boldsymbol\beta)=\frac{\partial f(\boldsymbol\beta)}{\partial\boldsymbol\beta'}\) (\(p\times(k+1)\), \(p\le k+1\)). By the delta method \(\sqrt n(f(\hat{\boldsymbol\beta}_n)-f(\boldsymbol\beta))\xrightarrow{d}N(0,D_\beta f(\boldsymbol\beta)\Omega D_\beta f(\boldsymbol\beta)')\), so

$$T_n=n(f(\hat{\boldsymbol\beta}_n)-\mathbf c)'\big(D_\beta f(\hat{\boldsymbol\beta}_n)\hat\Omega_n D_\beta f(\hat{\boldsymbol\beta}_n)'\big)^{-1}(f(\hat{\boldsymbol\beta}_n)-\mathbf c)\xrightarrow{d}\chi^2_p\quad(\text{under }H_0)$$

and \(\phi_n=\mathbf 1\{T_n>c_{p,1-\alpha}\}\).
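As a concrete instance, consider testing a ratio of coefficients. The sketch below assumes the hypothetical restriction \(f(\boldsymbol\beta)=\beta_1/\beta_2\) (which needs \(\beta_2\ne0\) for the Jacobian to exist), along with made-up values for `beta_hat`, `omega_hat`, and `n`; the helper name `delta_wald` is also an assumption.

```python
import numpy as np

def delta_wald(beta_hat, omega_hat, f, jac, c, n):
    """n (f(b) - c)' (J Omega J')^{-1} (f(b) - c), with J the Jacobian of f at b."""
    diff = np.atleast_1d(f(beta_hat)) - np.atleast_1d(c)
    J = np.atleast_2d(jac(beta_hat))          # p x (k + 1)
    return float(n * diff @ np.linalg.solve(J @ omega_hat @ J.T, diff))

# Hypothetical restriction f(beta) = beta_1 / beta_2, so p = 1 and k + 1 = 3
def f(b):
    return b[1] / b[2]

def jac(b):
    # Partial derivatives of beta_1 / beta_2 with respect to (beta_0, beta_1, beta_2)
    return np.array([0.0, 1.0 / b[2], -b[1] / b[2] ** 2])

# Made-up estimates standing in for an actual OLS fit
n = 500
beta_hat = np.array([0.9, 2.1, 1.0])
omega_hat = np.array([[2.0, 0.3, 0.0],
                      [0.3, 1.5, 0.2],
                      [0.0, 0.2, 1.0]])

T_n = delta_wald(beta_hat, omega_hat, f, jac, 2.0, n)   # H0: beta_1 / beta_2 = 2
CHI2_1_095 = 3.841                                      # 0.95 quantile of chi^2_1
reject = T_n > CHI2_1_095
print(T_n, reject)
```

An analytic Jacobian is used here; a finite-difference approximation would serve when \(f\) is awkward to differentiate by hand.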

4.5 Generalized Least Squares (GLS) Estimator

The Gauss-Markov theorem shows that OLS is BLUE under homoskedasticity. Under heteroskedasticity OLS is no longer efficient, and generalized least squares (GLS) is more efficient. Let \((Y_i,\mathbf X_i)\) be i.i.d., \(Y_i\in\mathbb R\), \(\mathbf X_i\in\mathbb R^{k+1}\), \(\mathbb E[Y_i\mid\mathbf X_i]=\mathbf X_i'\boldsymbol\beta\), \(\mathrm{Var}(Y_i\mid\mathbf X_i)=\sigma^2(\mathbf X_i)\) (known and \(>0\)). Let \(\mathbb D=\mathrm{diag}(\sigma^2(\mathbf X_1),\dots,\sigma^2(\mathbf X_n))\), and assume the columns of \(\mathbb X\) are linearly independent.

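4.5.1 Unbiasedness. Consider linear estimators of the form \(\tilde{\boldsymbol\beta}_n=\mathbb A'\mathbb Y\); GLS takes \(\mathbb A'=(\mathbb X'\mathbb D^{-1}\mathbb X)^{-1}\mathbb X'\mathbb D^{-1}\). The matrix \(\mathbb X'\mathbb D^{-1}\mathbb X\) is positive definite (hence invertible): for \(\mathbf c\ne\mathbf 0\), write \(\mathbb X\mathbf c=(m_1,\dots,m_n)'\), which is nonzero because the columns of \(\mathbb X\) are linearly independent, so \(\mathbf c'\mathbb X'\mathbb D^{-1}\mathbb X\mathbf c=\sum_i\frac{m_i^2}{\sigma^2(\mathbf X_i)}>0\). GLS is unbiased because \(\mathbb A'\mathbb X=\mathbb I\): \(\mathbb E[\tilde{\boldsymbol\beta}_n\mid\mathbb X]=\mathbb A'\mathbb E[\mathbb Y\mid\mathbb X]=\mathbb A'\mathbb X\boldsymbol\beta=\boldsymbol\beta\).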

4.5.2 Conditional variance-covariance matrix.

$$\mathrm{Var}(\tilde{\boldsymbol\beta}_n\mid\mathbb X)=\mathbb A'\mathbb D\mathbb A=(\mathbb X'\mathbb D^{-1}\mathbb X)^{-1} \tag{4.28}$$

(substitute \(\mathrm{Var}(\mathbb Y\mid\mathbb X)=\mathbb D\) and \(\mathbb A'=(\mathbb X'\mathbb D^{-1}\mathbb X)^{-1}\mathbb X'\mathbb D^{-1}\) and simplify).
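The two identities \(\mathbb A'\mathbb X=\mathbb I\) and \(\mathbb A'\mathbb D\mathbb A=(\mathbb X'\mathbb D^{-1}\mathbb X)^{-1}\) are easy to verify numerically. The sketch below assumes a hypothetical skedastic function \(\sigma^2(x)=x^2\) and simulated data; it also shows the equivalent weighted least squares computation, in which each observation is divided by \(\sigma(\mathbf X_i)\) before running OLS.

```python
import numpy as np

# Hypothetical design: regressor on [1, 3], known variance function sigma^2(x) = x^2
rng = np.random.default_rng(3)
n = 50
x = rng.uniform(1.0, 3.0, size=n)
X = np.column_stack([np.ones(n), x])
sigma2 = x ** 2
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * np.sqrt(sigma2)

D = np.diag(sigma2)
D_inv = np.diag(1.0 / sigma2)
A_t = np.linalg.solve(X.T @ D_inv @ X, X.T @ D_inv)   # A' = (X'D^{-1}X)^{-1} X'D^{-1}
beta_gls = A_t @ y
var_gls = A_t @ D @ A_t.T                              # A' D A, as in (4.28)

# Equivalent weighted least squares: scale each row by 1 / sigma(X_i), then run OLS
w = 1.0 / np.sqrt(sigma2)
beta_wls, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
print(beta_gls, beta_wls)
```

Forming the full \(n\times n\) matrix \(\mathbb D\) is done here only to mirror the formulas; with diagonal \(\mathbb D\) the row-scaling version avoids it.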

4.5.3 GLS is BLUE under heteroskedasticity. Among all linear unbiased estimators, GLS has the smallest conditional variance.

Note

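Proof (GLS BLUE) Let \(\hat{\boldsymbol\beta}_n=\hat{\mathbb A}'\mathbb Y\) be another linear unbiased estimator (so \(\hat{\mathbb A}'\mathbb X=\mathbb I\)); we show that \(\mathrm{Var}(\hat{\boldsymbol\beta}_n\mid\mathbb X)-\mathrm{Var}(\tilde{\boldsymbol\beta}_n\mid\mathbb X)\) is positive semi-definite. Let \(\mathbb C=\hat{\mathbb A}-\mathbb A\). Since \(\mathbb X'\mathbb C=\mathbb X'\hat{\mathbb A}-\mathbb X'\mathbb A=\mathbb I-\mathbb I=\mathbf 0\) and \(\mathbb A'\mathbb D=(\mathbb X'\mathbb D^{-1}\mathbb X)^{-1}\mathbb X'\), we get \(\mathbb A'\mathbb D\mathbb C=(\mathbb X'\mathbb D^{-1}\mathbb X)^{-1}\mathbb X'\mathbb C=\mathbf 0\) and, transposing, \(\mathbb C'\mathbb D\mathbb A=\mathbf 0\). Hence \(\mathrm{Var}(\hat{\boldsymbol\beta}_n\mid\mathbb X)=(\mathbb A+\mathbb C)'\mathbb D(\mathbb A+\mathbb C)=\mathbb A'\mathbb D\mathbb A+\mathbb C'\mathbb D\mathbb C\), so $$\mathrm{Var}(\hat{\boldsymbol\beta}_n\mid\mathbb X)-\mathrm{Var}(\tilde{\boldsymbol\beta}_n\mid\mathbb X)=\mathbb C'\mathbb D\mathbb C$$ For any \(\mathbf s\in\mathbb R^{k+1}\), write \(\mathbf v=\mathbb C\mathbf s\in\mathbb R^n\); then \(\mathbf s'\mathbb C'\mathbb D\mathbb C\mathbf s=\mathbf v'\mathbb D\mathbf v=\sum_i v_i^2\sigma^2(\mathbf X_i)\ge0\), so \(\mathbb C'\mathbb D\mathbb C\) is positive semi-definite. So GLS is "best" in the Gauss-Markov sense. \(\blacksquare\)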
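OLS is itself a linear unbiased estimator, with \(\hat{\mathbb A}'=(\mathbb X'\mathbb X)^{-1}\mathbb X'\), so the proof can be illustrated by comparing it with GLS. The sketch below assumes a hypothetical design with \(\sigma^2(x)=x^2\); it checks that the cross term vanishes and that the variance gap is positive semi-definite.

```python
import numpy as np

# Hypothetical design with known conditional variances sigma^2(x) = x^2
rng = np.random.default_rng(4)
n = 60
x = rng.uniform(1.0, 3.0, size=n)
X = np.column_stack([np.ones(n), x])
D = np.diag(x ** 2)
D_inv = np.diag(1.0 / x ** 2)

A = np.linalg.solve(X.T @ D_inv @ X, X.T @ D_inv).T   # GLS weights, n x (k + 1)
A_hat = np.linalg.solve(X.T @ X, X.T).T               # OLS weights, also satisfy A_hat'X = I
C = A_hat - A

cross = A.T @ D @ C                                   # A'DC, zero by the argument in the proof
gap = A_hat.T @ D @ A_hat - A.T @ D @ A               # Var(OLS | X) - Var(GLS | X)
eigenvalues = np.linalg.eigvalsh((gap + gap.T) / 2)   # symmetrize against rounding error
print(eigenvalues)
```

All eigenvalues of the gap are nonnegative up to rounding, which is the matrix sense in which GLS has the smaller conditional variance.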

Tip

Remark 4.4 (FGLS) Sometimes \(\sigma^2(\mathbf X_i)\) is unknown and must be estimated first; the resulting procedure is called feasible GLS (FGLS). FGLS approximates GLS and is usually more efficient than OLS under heteroskedasticity, but estimating \(\sigma^2(\mathbf X_i)\) costs some efficiency, so the efficiency ranking between OLS and FGLS is not guaranteed in general.

4.5.4 General case. Consider the regression \(Y_i=\mathbf X_i'\boldsymbol\beta+\varepsilon_i\), stacked as \(\mathbb Y=\mathbb X\boldsymbol\beta+\boldsymbol\varepsilon\), with conditional variance-covariance matrix \(\mathrm{Cov}(\boldsymbol\varepsilon\mid\mathbb X)=\boldsymbol\Sigma\) (invertible, not required to be diagonal). GLS solves

$$\min_{\boldsymbol\beta}(\mathbb Y-\mathbb X\boldsymbol\beta)'\boldsymbol\Sigma^{-1}(\mathbb Y-\mathbb X\boldsymbol\beta)$$

The first-order condition \(-2(\mathbb Y-\mathbb X\hat{\boldsymbol\beta})'\boldsymbol\Sigma^{-1}\mathbb X=\mathbf 0_{1\times(k+1)}\) gives

$$\hat{\boldsymbol\beta}=(\mathbb X'\boldsymbol\Sigma^{-1}\mathbb X)^{-1}\mathbb X'\boldsymbol\Sigma^{-1}\mathbb Y$$

With \(\boldsymbol\Sigma=\mathbb D\) this reduces to the estimator of §4.5.1. This is the general case, since \(\boldsymbol\Sigma\) need not be diagonal (it may include serial correlation, etc.).
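With a non-diagonal \(\boldsymbol\Sigma\), GLS can be computed either from the closed-form expression or by "whitening" the data. The sketch below assumes a hypothetical AR(1)-type covariance \(\Sigma_{ij}=\rho^{|i-j|}\) with \(\rho=0.6\); it uses the Cholesky factor \(\boldsymbol\Sigma=\mathbb L\mathbb L'\) and runs OLS on \(\mathbb L^{-1}\mathbb Y\) and \(\mathbb L^{-1}\mathbb X\), which minimizes the same criterion.

```python
import numpy as np

# Hypothetical serially correlated errors: Sigma_ij = rho^|i - j|
rng = np.random.default_rng(5)
n, rho = 40, 0.6
X = np.column_stack([np.ones(n), rng.normal(size=n)])
idx = np.arange(n)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])

L = np.linalg.cholesky(Sigma)                            # Sigma = L L'
y = X @ np.array([1.0, 2.0]) + L @ rng.normal(size=n)    # errors with covariance Sigma

# Closed form: (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} Y, using solves instead of inverses
Si_X = np.linalg.solve(Sigma, X)                         # Sigma^{-1} X
beta_gls = np.linalg.solve(X.T @ Si_X, Si_X.T @ y)

# Whitening: premultiply by L^{-1}, then run OLS on the transformed data
Xw = np.linalg.solve(L, X)
yw = np.linalg.solve(L, y)
beta_white, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
print(beta_gls, beta_white)
```

Both routes give the same coefficients up to rounding; solving linear systems rather than forming \(\boldsymbol\Sigma^{-1}\) explicitly is a numerical choice made for this sketch.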

Important

Chapter arc From "population \(\boldsymbol\beta\)" to "sample OLS" to "inference" to "the more efficient GLS." §4.1 solves for the population \(\boldsymbol\beta=\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY]\) under exogeneity \(\mathbb E[\mathbf Xu]=0\), and uses FWL / OVB / measurement error to reveal the algebra of "control variables" and "bias." §4.2 replaces population moments with sample moments to get OLS, which geometrically is a projection. §4.3 is the ladder of statistical properties of OLS: mean independence ⇒ unbiased; + homoskedasticity ⇒ BLUE; the baseline assumptions ⇒ consistent + asymptotically normal (the sandwich variance \(\Omega\), which reduces to \(\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1}\) under homoskedasticity). §4.4 builds \(t\) / Wald / LM (equivalent to Wald) / delta-method tests on \(\sqrt n(\hat{\boldsymbol\beta}-\boldsymbol\beta)\xrightarrow{d}N(0,\Omega)\). §4.5 uses GLS to restore the BLUE property under heteroskedasticity. This chapter always assumes exogeneity \(\mathbb E[\mathbf Xu]=0\); the next chapter handles endogeneity \(\mathbb E[\mathbf Xu]\ne0\) (instrumental variables).