4. Linear Regression with Exogenous Variables

Jun He May 31, 2026

计量经济学 Econometrics · 线性回归 Linear Regression · 最小二乘 OLS · 外生性 Exogeneity · 高斯-马尔可夫 Gauss-Markov · 投影 Projection · 渐近分布 Asymptotic Distribution · 假设检验 Hypothesis Testing · 广义最小二乘 GLS · 学习笔记 Study Note

Note

本章主题：外生（$\mathbb E[\mathbf Xu]=0$）线性回归的总体性质、OLS 估计与推断。 §4.1 总体性质：由 $\mathbb E[\mathbf Xu]=0$ 解出 $\boldsymbol\beta=\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY]$ (4.2)（无完全共线性 ⟺ $\mathbb E[\mathbf X\mathbf X']$ 可逆，引理 4.1）；子向量的 Frisch-Waugh-Lovell「先净化再回归」(4.5, 命题 4.1)；遗漏变量偏误 OVB 与测量误差 / 衰减偏误。§4.2 OLS 估计：$\hat{\boldsymbol\beta}_n=(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}(\frac1n\sum\mathbf X_iY_i)$ (4.10)；矩阵形式 $\hat{\boldsymbol\beta}_n=(\mathbb X'\mathbb X)^{-1}\mathbb X'\mathbb Y$ (4.15) = 投影（投影阵 $\mathbb P$、残差生成阵 $\mathbb M$，皆幂等对称）；子向量 FWL (4.17)；拟合优度 $R^2$、调整 $\bar R^2$（仅描述拟合、无因果含义）。§4.3 OLS 性质：均值独立 $\mathbb E[u\mid\mathbf X]=0$ ⇒ 无偏（命题 4.2）；再加同方差 ⇒ 高斯-马尔可夫 BLUE（命题 4.3）；一致性（命题 4.4）；极限分布 $\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}N(0,\Omega)$ (命题 4.5)，同方差下 $\Omega=\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1}$（命题 4.6）；$\Omega$ 的稳健估计 (4.25)。§4.4 假设检验：单一线性约束（$t$ 检验）、多重线性约束 Wald 检验、拉格朗日乘子 LM 检验（与 Wald 等价）、非线性约束 Delta 方法。§4.5 GLS：异方差下 OLS 不再最优，GLS $\hat{\boldsymbol\beta}=(\mathbb X'\boldsymbol\Sigma^{-1}\mathbb X)^{-1}\mathbb X'\boldsymbol\Sigma^{-1}\mathbb Y$ 为异方差下的 BLUE；不可观测方差时用 FGLS。

Note

Chapter theme: population properties, OLS estimation, and inference for the exogenous ($\mathbb E[\mathbf Xu]=0$) linear regression. §4.1 Population properties: from $\mathbb E[\mathbf Xu]=0$ solve $\boldsymbol\beta=\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY]$ (4.2) (no perfect collinearity ⟺ $\mathbb E[\mathbf X\mathbf X']$ invertible, Lemma 4.1); the Frisch-Waugh-Lovell "partial out then regress" for a subvector (4.5, Proposition 4.1); omitted-variable bias (OVB) and measurement error / attenuation bias. §4.2 OLS estimation: $\hat{\boldsymbol\beta}_n=(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}(\frac1n\sum\mathbf X_iY_i)$ (4.10); matrix form $\hat{\boldsymbol\beta}_n=(\mathbb X'\mathbb X)^{-1}\mathbb X'\mathbb Y$ (4.15) = projection (projection matrix $\mathbb P$, residual-maker $\mathbb M$, both idempotent and symmetric); subvector FWL (4.17); goodness of fit $R^2$, adjusted $\bar R^2$ (description of fit only, no causal meaning). §4.3 OLS properties: mean independence $\mathbb E[u\mid\mathbf X]=0$ ⇒ unbiased (Proposition 4.2); plus homoskedasticity ⇒ Gauss-Markov BLUE (Proposition 4.3); consistency (Proposition 4.4); limiting distribution $\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}N(0,\Omega)$ (Proposition 4.5), with $\Omega=\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1}$ under homoskedasticity (Proposition 4.6); a robust estimator of $\Omega$ (4.25). §4.4 Hypothesis testing: a single linear restriction ($t$ test), multiple linear restrictions (Wald test), the Lagrange multiplier (LM) test (equivalent to Wald), and nonlinear restrictions (delta method). §4.5 GLS: under heteroskedasticity OLS is no longer efficient, and GLS $\hat{\boldsymbol\beta}=(\mathbb X'\boldsymbol\Sigma^{-1}\mathbb X)^{-1}\mathbb X'\boldsymbol\Sigma^{-1}\mathbb Y$ is the BLUE under heteroskedasticity; use FGLS when the variance is unobservable.

4.1 Population Properties of β

设 $(Y,\mathbf X,u)$ 如第 3 章定义，假设 $\mathbf X$ 无完全共线性、$\mathbb E[\mathbf X\mathbf X']$ 存在、$\mathbb E[Y^2]<\infty$、$\mathbb E[\mathbf Xu]=\mathbf 0$（外生性）。

4.1.1 解出 $\boldsymbol\beta$.

Important

定义 4.1 / 引理 4.1 完全共线性：$\mathbf X$ 完全共线若 $\exists\mathbf c\ne\mathbf 0,\mathbf c\in\mathbb R^{k+1}$ 使 $\mathbb P(\mathbf c'\mathbf X=0)=1$。引理 4.1：$\mathbb E[\mathbf X\mathbf X']$ 可逆当且仅当 $\mathbf X$ 无完全共线性。

Note

证明（引理 4.1）等价命题：$\mathbb E[\mathbf X\mathbf X']$ 不可逆当且仅当存在完全共线性。 不可逆 ⇒ 完全共线：$\mathbb E[\mathbf X\mathbf X']$ 不可逆 ⇒ $\exists\mathbf c\ne\mathbf 0$ 使 $\mathbb E[\mathbf X\mathbf X']\mathbf c=\mathbf 0$，于是 $0=\mathbf c'\mathbb E[\mathbf X\mathbf X']\mathbf c=\mathbb E[(\mathbf c'\mathbf X)^2]$，故 $\mathbb P(\mathbf c'\mathbf X=0)=1$。 完全共线 ⇒ 不可逆：设 $\mathbb P(\mathbf c'\mathbf X=0)=1$，则 $\mathbf c'\mathbb E[\mathbf X\mathbf X']=\mathbb E[\underbrace{\mathbf c'\mathbf X}_{=0}\mathbf X']=\mathbf 0$，故 $\mathbb E[\mathbf X\mathbf X']$ 不可逆。$\blacksquare$

由 $\mathbb E[\mathbf Xu]=\mathbf 0$ 与 $u=Y-\mathbf X'\boldsymbol\beta$：

$$\mathbf 0=\mathbb E[\mathbf Xu]=\mathbb E[\mathbf X(Y-\mathbf X'\boldsymbol\beta)]=\mathbb E[\mathbf XY]-\mathbb E[\mathbf X\mathbf X']\boldsymbol\beta\Rightarrow\mathbb E[\mathbf X\mathbf X']\boldsymbol\beta=\mathbb E[\mathbf XY] \tag{4.1}$$

$$\Rightarrow\boldsymbol\beta=\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY] \tag{4.2}$$

Tip

Remark 4.1 若 $\mathbb E[\mathbf X\mathbf X']$ 不可逆，(4.1) 有多解；任意两解 $\hat{\boldsymbol\beta},\tilde{\boldsymbol\beta}$ 满足 $\mathbb P(\mathbf X'\hat{\boldsymbol\beta}=\mathbf X'\tilde{\boldsymbol\beta})=1$（同样的预测值）。这对预测/最优线性近似无碍，但使因果解读难以进行。

Let $(Y,\mathbf X,u)$ be as defined in Chapter 3, and assume $\mathbf X$ has no perfect collinearity, $\mathbb E[\mathbf X\mathbf X']$ exists, $\mathbb E[Y^2]<\infty$, and $\mathbb E[\mathbf Xu]=\mathbf 0$ (exogeneity).

4.1.1 Solving for $\boldsymbol\beta$.

Important

Definition 4.1 / Lemma 4.1 Perfect collinearity: $\mathbf X$ is perfectly collinear if $\exists\mathbf c\ne\mathbf 0,\mathbf c\in\mathbb R^{k+1}$ with $\mathbb P(\mathbf c'\mathbf X=0)=1$. Lemma 4.1: $\mathbb E[\mathbf X\mathbf X']$ is invertible if and only if $\mathbf X$ has no perfect collinearity.

Note

Proof (Lemma 4.1) Equivalent statement: $\mathbb E[\mathbf X\mathbf X']$ is not invertible iff there is perfect collinearity. Not invertible ⇒ collinear: not invertible ⇒ $\exists\mathbf c\ne\mathbf 0$ with $\mathbb E[\mathbf X\mathbf X']\mathbf c=\mathbf 0$, so $0=\mathbf c'\mathbb E[\mathbf X\mathbf X']\mathbf c=\mathbb E[(\mathbf c'\mathbf X)^2]$, hence $\mathbb P(\mathbf c'\mathbf X=0)=1$. Collinear ⇒ not invertible: if $\mathbb P(\mathbf c'\mathbf X=0)=1$, then $\mathbf c'\mathbb E[\mathbf X\mathbf X']=\mathbb E[\underbrace{\mathbf c'\mathbf X}_{=0}\mathbf X']=\mathbf 0$, so $\mathbb E[\mathbf X\mathbf X']$ is not invertible. $\blacksquare$

From $\mathbb E[\mathbf Xu]=\mathbf 0$ with $u=Y-\mathbf X'\boldsymbol\beta$:

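$$\mathbf 0=\mathbb E[\mathbf Xu]=\mathbb E[\mathbf X(Y-\mathbf X'\boldsymbol\beta)]=\mathbb E[\mathbf XY]-\mathbb E[\mathbf X\mathbf X']\boldsymbol\beta\Rightarrow\mathbb E[\mathbf X\mathbf X']\boldsymbol\beta=\mathbb E[\mathbf XY] \tag{4.1}$$

$$\Rightarrow\boldsymbol\beta=\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY] \tag{4.2}$$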

Tip

Remark 4.1 If $\mathbb E[\mathbf X\mathbf X']$ is not invertible, (4.1) has multiple solutions; any two solutions $\hat{\boldsymbol\beta},\tilde{\boldsymbol\beta}$ satisfy $\mathbb P(\mathbf X'\hat{\boldsymbol\beta}=\mathbf X'\tilde{\boldsymbol\beta})=1$ (the same fitted values). This is harmless for prediction / best linear approximation, but makes the causal interpretation difficult.
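As an illustration of Definition 4.1 and Remark 4.1, here is a standard textbook case (the dummy-variable trap), which is not taken from the notes themselves. Let $D$ be binary and $\mathbf X=(1,D,1-D)'$, and take $\mathbf c=(1,-1,-1)'$:

$$\mathbf c'\mathbf X=1-D-(1-D)=0\ \text{ with probability one}\quad\Longrightarrow\quad\mathbb E[\mathbf X\mathbf X']\mathbf c=\mathbb E[\mathbf X(\mathbf X'\mathbf c)]=\mathbf 0 .$$

So $\mathbb E[\mathbf X\mathbf X']$ is singular. If $\boldsymbol\beta$ solves $\mathbb E[\mathbf X\mathbf X']\boldsymbol\beta=\mathbb E[\mathbf XY]$, then so does $\boldsymbol\beta+t\mathbf c$ for every $t\in\mathbb R$, and all of these solutions give the same prediction:

$$\mathbf X'(\boldsymbol\beta+t\mathbf c)=\mathbf X'\boldsymbol\beta+t\,\mathbf c'\mathbf X=\mathbf X'\boldsymbol\beta .$$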

4.1.2 解出 $\boldsymbol\beta$ 的子向量（Frisch-Waugh-Lovell）. 把 $\mathbf X$ 分为 $(\mathbf X_1)_{k_1\times1}$、$(\mathbf X_2)_{k_2\times1}$，$\boldsymbol\beta$ 分为 $\boldsymbol\beta_1,\boldsymbol\beta_2$，

$$Y=\mathbf X'\boldsymbol\beta+u=\mathbf X_1'\boldsymbol\beta_1+\mathbf X_2'\boldsymbol\beta_2+u \tag{4.3}$$

由 $\mathbb E[\mathbf Xu]=\mathbf 0$ 得 $\mathbb E[\mathbf X_1u]=\mathbf 0_{k_1\times1}$、$\mathbb E[\mathbf X_2u]=\mathbf 0_{k_2\times1}$ (4.4)。若只关心 $\boldsymbol\beta_1$，可分三步：

取 $Y$ 的残差 $\tilde Y=Y-\mathrm{BLP}(Y\mid\mathbf X_2)$；
取 $\mathbf X_1$ 的残差 $\tilde{\mathbf X}_1=\mathbf X_1-\mathrm{BLP}(\mathbf X_1\mid\mathbf X_2)$；
回归 $\tilde Y=\tilde{\mathbf X}_1'\boldsymbol\beta_1+\tilde u$，得 $\tilde{\boldsymbol\beta}=\mathbb E[\tilde{\mathbf X}_1\tilde{\mathbf X}_1']^{-1}\mathbb E[\tilde{\mathbf X}_1\tilde Y]$ (4.5)。

Important

命题 4.1 $\tilde{\boldsymbol\beta}=\boldsymbol\beta_1$，其中 $\tilde{\boldsymbol\beta}$ 由 (4.5) 定义、$\boldsymbol\beta_1$ 由 (4.3) 定义。即「先把 $\mathbf X_2$ 从 $Y$ 与 $\mathbf X_1$ 中净化掉、再回归」恰好得到 $\boldsymbol\beta_1$（Frisch-Waugh-Lovell 定理的总体版本）。

Tip

Remark 4.2 若 $\mathbf X_2$ 含常数项，则 $\tilde{\mathbf X}_1$ 必为零均值（BLP 的一阶条件使残差与 $\mathbf X_2$ 的每个分量正交，包括常数 1），故 $\mathbb E[\tilde{\mathbf X}_1\tilde{\mathbf X}_1']=\mathrm{Var}[\tilde{\mathbf X}_1]$；若 $\mathbf X_2$ 不含常数，则此简化一般不成立。

4.1.2 Solving for a subvector of $\boldsymbol\beta$ (Frisch-Waugh-Lovell). Partition $\mathbf X$ into $(\mathbf X_1)_{k_1\times1}$, $(\mathbf X_2)_{k_2\times1}$ and $\boldsymbol\beta$ into $\boldsymbol\beta_1,\boldsymbol\beta_2$,

$$Y=\mathbf X'\boldsymbol\beta+u=\mathbf X_1'\boldsymbol\beta_1+\mathbf X_2'\boldsymbol\beta_2+u \tag{4.3}$$

From $\mathbb E[\mathbf Xu]=\mathbf 0$, $\mathbb E[\mathbf X_1u]=\mathbf 0_{k_1\times1}$, $\mathbb E[\mathbf X_2u]=\mathbf 0_{k_2\times1}$ (4.4). To recover only $\boldsymbol\beta_1$, in three steps:

Take $Y$'s residual $\tilde Y=Y-\mathrm{BLP}(Y\mid\mathbf X_2)$;
Take $\mathbf X_1$'s residual $\tilde{\mathbf X}_1=\mathbf X_1-\mathrm{BLP}(\mathbf X_1\mid\mathbf X_2)$;
Regress $\tilde Y=\tilde{\mathbf X}_1'\boldsymbol\beta_1+\tilde u$, giving $\tilde{\boldsymbol\beta}=\mathbb E[\tilde{\mathbf X}_1\tilde{\mathbf X}_1']^{-1}\mathbb E[\tilde{\mathbf X}_1\tilde Y]$ (4.5).

Important

Proposition 4.1 $\tilde{\boldsymbol\beta}=\boldsymbol\beta_1$, where $\tilde{\boldsymbol\beta}$ is defined in (4.5) and $\boldsymbol\beta_1$ in (4.3). That is, "partial $\mathbf X_2$ out of $Y$ and $\mathbf X_1$ first, then regress" recovers exactly $\boldsymbol\beta_1$ (the population version of the Frisch-Waugh-Lovell theorem).

Tip

Remark 4.2 If $\mathbf X_2$ contains a constant, then $\tilde{\mathbf X}_1$ must have zero mean (the first-order condition of the BLP makes the residual orthogonal to every component of $\mathbf X_2$, including the constant 1), so $\mathbb E[\tilde{\mathbf X}_1\tilde{\mathbf X}_1']=\mathrm{Var}[\tilde{\mathbf X}_1]$; if $\mathbf X_2$ contains no constant, this simplification need not hold.
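Proposition 4.1 is stated without proof, so the following is one possible argument rather than the author's own. It uses only linearity of the BLP and the orthogonality conditions (4.4). Applying $\mathrm{BLP}(\cdot\mid\mathbf X_2)$ to (4.3), and noting that $\mathrm{BLP}(u\mid\mathbf X_2)=0$ because $\mathbb E[\mathbf X_2u]=\mathbf 0$, gives

$$\tilde Y=Y-\mathrm{BLP}(Y\mid\mathbf X_2)=\big(\mathbf X_1-\mathrm{BLP}(\mathbf X_1\mid\mathbf X_2)\big)'\boldsymbol\beta_1+u=\tilde{\mathbf X}_1'\boldsymbol\beta_1+u .$$

Since $\tilde{\mathbf X}_1$ is a linear function of $(\mathbf X_1,\mathbf X_2)$, (4.4) implies $\mathbb E[\tilde{\mathbf X}_1u]=\mathbf 0$, hence

$$\mathbb E[\tilde{\mathbf X}_1\tilde Y]=\mathbb E[\tilde{\mathbf X}_1\tilde{\mathbf X}_1']\boldsymbol\beta_1\quad\Longrightarrow\quad\tilde{\boldsymbol\beta}=\mathbb E[\tilde{\mathbf X}_1\tilde{\mathbf X}_1']^{-1}\mathbb E[\tilde{\mathbf X}_1\tilde Y]=\boldsymbol\beta_1 .$$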

推论 4.1（$\mathbf X_2$ 为常数的特例）. 若 $\mathbf X_2=1$，则 $\mathrm{BLP}(Y\mid1)=\mathbb E[Y]$、$\mathrm{BLP}(\mathbf X_1\mid1)=\mathbb E[\mathbf X_1]$，于是 $\tilde Y=Y-\mathbb E[Y]$、$\tilde{\mathbf X}_1=\mathbf X_1-\mathbb E[\mathbf X_1]$ (4.6)，代入 (4.5)：

$$\boldsymbol\beta_1=\mathrm{Var}(\mathbf X_1)^{-1}\mathrm{Cov}(\mathbf X_1,Y) \tag{4.7}$$

即去除常数后，斜率系数等于「协方差除以方差」。

4.1.3 遗漏变量偏误（OVB）. 真实模型 $Y=\beta_0+\mathbf X_1'\boldsymbol\beta_1+\mathbf X_2'\boldsymbol\beta_2+u$，若遗漏 $\mathbf X_2$、错误地估计 $Y=\beta_0^*+\mathbf X_1'\boldsymbol\beta_1^*+u^*$，则由 (4.7)：

$$\boldsymbol\beta_1^*=\boldsymbol\beta_1+\underbrace{\mathrm{Var}(\mathbf X_1)^{-1}\mathrm{Cov}(\mathbf X_1,\mathbf X_2)\boldsymbol\beta_2}_{\text{omitted variable bias}}$$

偏误 = 「$\mathbf X_2$ 对 $\mathbf X_1$ 的回归系数」乘以「$\mathbf X_2$ 的真实效应 $\boldsymbol\beta_2$」。当 $\mathbf X_1$ 与 $\mathbf X_2$ 不相关（$\mathrm{Cov}(\mathbf X_1,\mathbf X_2)=\mathbf 0$）或 $\boldsymbol\beta_2=\mathbf 0$ 时无偏误。

Corollary 4.1 ($\mathbf X_2$ a constant). If $\mathbf X_2=1$, then $\mathrm{BLP}(Y\mid1)=\mathbb E[Y]$, $\mathrm{BLP}(\mathbf X_1\mid1)=\mathbb E[\mathbf X_1]$, so $\tilde Y=Y-\mathbb E[Y]$, $\tilde{\mathbf X}_1=\mathbf X_1-\mathbb E[\mathbf X_1]$ (4.6); substituting into (4.5):

$$\boldsymbol\beta_1=\mathrm{Var}(\mathbf X_1)^{-1}\mathrm{Cov}(\mathbf X_1,Y) \tag{4.7}$$

i.e., after removing the constant, the slope coefficient equals "covariance over variance."

4.1.3 Omitted-variable bias (OVB). True model $Y=\beta_0+\mathbf X_1'\boldsymbol\beta_1+\mathbf X_2'\boldsymbol\beta_2+u$; if we omit $\mathbf X_2$ and wrongly estimate $Y=\beta_0^*+\mathbf X_1'\boldsymbol\beta_1^*+u^*$, then by (4.7):

$$\boldsymbol\beta_1^*=\boldsymbol\beta_1+\underbrace{\mathrm{Var}(\mathbf X_1)^{-1}\mathrm{Cov}(\mathbf X_1,\mathbf X_2)\boldsymbol\beta_2}_{\text{omitted variable bias}}$$

The bias equals "the coefficient from regressing $\mathbf X_2$ on $\mathbf X_1$" times "the true effect $\boldsymbol\beta_2$ of $\mathbf X_2$." There is no bias when $\mathbf X_1$ and $\mathbf X_2$ are uncorrelated ($\mathrm{Cov}(\mathbf X_1,\mathbf X_2)=\mathbf 0$) or when $\boldsymbol\beta_2=\mathbf 0$.
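The OVB formula can be checked on simulated data. The sketch below is illustrative only: the design (scalar $X_1$, $X_2=0.5X_1+\text{noise}$, $\beta_1=2$, $\beta_2=3$), the seed, and the helper `cov` are assumptions made for this example. The short-regression slope is computed with the sample analog of (4.7) and compared with $\beta_1+\mathrm{Var}(X_1)^{-1}\mathrm{Cov}(X_1,X_2)\beta_2$.

```python
import random

def cov(a, b):
    """Sample covariance (divides by n); cov(a, a) is the sample variance."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

# Hypothetical data-generating process, chosen only for illustration.
rng = random.Random(0)
n = 20000
beta1, beta2 = 2.0, 3.0
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.5 * a + rng.gauss(0, 1) for a in x1]      # X2 is correlated with X1
u = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 + beta1 * a + beta2 * b + e for a, b, e in zip(x1, x2, u)]

# Short regression of Y on (1, X1): slope = Cov(X1, Y) / Var(X1), as in (4.7).
short_slope = cov(x1, y) / cov(x1, x1)

# OVB formula: beta1 + [slope from regressing X2 on X1] * beta2.
delta = cov(x1, x2) / cov(x1, x1)
predicted = beta1 + delta * beta2

print("short-regression slope:", round(short_slope, 3))
print("beta1 + delta * beta2 :", round(predicted, 3))
```

In this design the population value of the short-regression slope is $2+0.5\times3=3.5$, so both printed numbers should be close to 3.5 rather than to the true $\beta_1=2$.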

4.1.4 测量误差（衰减偏误）. 模型 $Y=\beta_0+\mathbf X_1'\boldsymbol\beta_1+u$ (4.8)，但 $\mathbf X_1$ 不可观测，只能观测 $\hat{\mathbf X}_1=\mathbf X_1+\mathbf v$。经典测量误差假设：$\mathbb E[\mathbf v]=\mathbf 0$、$\mathrm{Cov}(u,\mathbf v)=\mathbf 0$、$\mathrm{Cov}(\mathbf X_1,\mathbf v)=\mathbf 0$。用 $\hat{\mathbf X}_1$ 替代 $\mathbf X_1$ 回归 $Y=\beta_0^*+\hat{\mathbf X}_1'\boldsymbol\beta_1^*+u^*$，由推论 4.1：

$$\boldsymbol\beta_1^*=\mathrm{Var}(\hat{\mathbf X}_1)^{-1}\mathrm{Cov}(\hat{\mathbf X}_1,Y)=(\mathrm{Var}(\mathbf X_1)+\mathrm{Var}(\mathbf v))^{-1}\mathrm{Var}(\mathbf X_1)\boldsymbol\beta_1$$

一维时 $\beta_1^*=\frac{\mathrm{Var}(X_1)}{\mathrm{Var}(X_1)+\mathrm{Var}(v)}\beta_1$，系数 $\frac{\mathrm{Var}(X_1)}{\mathrm{Var}(X_1)+\mathrm{Var}(v)}\in(0,1)$，故 $\beta_1^*$ 被向零方向压缩——称衰减偏误（attenuation bias）。

例 4.2 更一般地把 $\mathbf X$ 分为 $X_0=1$、$X_1\in\mathbb R$（误测）、$\mathbf X_2\in\mathbb R^{k_2}$（精确观测），$Y=\beta_0+\beta_1X_1+\mathbf X_2'\boldsymbol\beta_2+u$；用 FWL 先净化 $\mathbf X_2$（$\tilde{\hat X}_1=\hat X_1-\mathrm{BLP}(\hat X_1\mid1,\mathbf X_2)$、$\tilde Y=Y-\mathrm{BLP}(Y\mid1,\mathbf X_2)$）再处理。若再假设 $\mathrm{Cov}(\mathbf X_2,v)=\mathbf 0$（从而 $\mathrm{BLP}(v\mid1,\mathbf X_2)=0$），可证 $\tilde{\hat X}_1=\tilde X_1+v$，衰减结论仍成立：$\beta_1^*=\frac{\mathrm{Var}(\tilde X_1)}{\mathrm{Var}(\tilde X_1)+\mathrm{Var}(v)}\beta_1$。

4.1.4 Measurement error (attenuation bias). Model $Y=\beta_0+\mathbf X_1'\boldsymbol\beta_1+u$ (4.8), but $\mathbf X_1$ is unobservable and we observe only $\hat{\mathbf X}_1=\mathbf X_1+\mathbf v$. Classical measurement-error assumptions: $\mathbb E[\mathbf v]=\mathbf 0$, $\mathrm{Cov}(u,\mathbf v)=\mathbf 0$, $\mathrm{Cov}(\mathbf X_1,\mathbf v)=\mathbf 0$. Regressing $Y=\beta_0^*+\hat{\mathbf X}_1'\boldsymbol\beta_1^*+u^*$ with $\hat{\mathbf X}_1$ in place of $\mathbf X_1$, by Corollary 4.1:

$$\boldsymbol\beta_1^*=\mathrm{Var}(\hat{\mathbf X}_1)^{-1}\mathrm{Cov}(\hat{\mathbf X}_1,Y)=(\mathrm{Var}(\mathbf X_1)+\mathrm{Var}(\mathbf v))^{-1}\mathrm{Var}(\mathbf X_1)\boldsymbol\beta_1$$

In one dimension $\beta_1^*=\frac{\mathrm{Var}(X_1)}{\mathrm{Var}(X_1)+\mathrm{Var}(v)}\beta_1$, where the factor $\frac{\mathrm{Var}(X_1)}{\mathrm{Var}(X_1)+\mathrm{Var}(v)}\in(0,1)$, so $\beta_1^*$ is shrunk toward zero — called attenuation bias.

Example 4.2 More generally, partition $\mathbf X$ into $X_0=1$, $X_1\in\mathbb R$ (mismeasured), $\mathbf X_2\in\mathbb R^{k_2}$ (measured exactly), $Y=\beta_0+\beta_1X_1+\mathbf X_2'\boldsymbol\beta_2+u$, and use FWL to partial out $\mathbf X_2$ first ($\tilde{\hat X}_1=\hat X_1-\mathrm{BLP}(\hat X_1\mid1,\mathbf X_2)$, $\tilde Y=Y-\mathrm{BLP}(Y\mid1,\mathbf X_2)$). If in addition $\mathrm{Cov}(\mathbf X_2,v)=\mathbf 0$ (so that $\mathrm{BLP}(v\mid1,\mathbf X_2)=0$), one shows $\tilde{\hat X}_1=\tilde X_1+v$, and the attenuation conclusion still holds: $\beta_1^*=\frac{\mathrm{Var}(\tilde X_1)}{\mathrm{Var}(\tilde X_1)+\mathrm{Var}(v)}\beta_1$.
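A small simulation can make the scalar attenuation factor concrete. The design below is an assumption for illustration ($\mathrm{Var}(X_1)=\mathrm{Var}(v)=1$, $\beta_1=2$, a fixed seed, a hypothetical helper `cov`), so the attenuation factor is $1/(1+1)=0.5$ and the slope on the mismeasured regressor should be near 1.

```python
import random

def cov(a, b):
    """Sample covariance (divides by n); cov(a, a) is the sample variance."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

# Hypothetical design: classical measurement error in a scalar regressor.
rng = random.Random(0)
n = 20000
beta1 = 2.0
x = [rng.gauss(0, 1) for _ in range(n)]            # true regressor, Var = 1
v = [rng.gauss(0, 1) for _ in range(n)]            # measurement error, Var = 1
u = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 + beta1 * xi + ui for xi, ui in zip(x, u)]
x_obs = [xi + vi for xi, vi in zip(x, v)]          # observed regressor

slope_true = cov(x, y) / cov(x, x)                 # uses the unobservable X1
slope_obs = cov(x_obs, y) / cov(x_obs, x_obs)      # uses the mismeasured X1
factor = cov(x, x) / (cov(x, x) + cov(v, v))       # sample attenuation factor

print("slope with true regressor    :", round(slope_true, 3))
print("slope with observed regressor:", round(slope_obs, 3))
print("attenuation factor * beta1   :", round(factor * beta1, 3))
```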

4.2 Ordinary Least Squares (OLS) Estimator

4.2.1 估计 $\boldsymbol\beta$. 在总体性质（$\mathbb E[\mathbf Xu]=0$、$\mathbb E[\mathbf X\mathbf X']<\infty$、无完全共线性、$(Y,\mathbf X)\sim P$）与样本 i.i.d. 假设下，总体 $\boldsymbol\beta=\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY]$ (4.9) 的自然样本类比为

$$\hat{\boldsymbol\beta}_n=\Big(\frac1n\sum_{i=1}^n\mathbf X_i\mathbf X_i'\Big)^{-1}\Big(\frac1n\sum_{i=1}^n\mathbf X_iY_i\Big) \tag{4.10}$$

称普通最小二乘（OLS）估计量：它最小化观测值 $Y_i$ 与拟合值 $\hat Y_i=\mathbf X_i'\hat{\boldsymbol\beta}_n$ 的平方距离和：$\hat{\boldsymbol\beta}_n=\arg\min_{\mathbf b}\frac1n\sum(Y_i-\mathbf X_i'\mathbf b)^2$。一阶条件 $\frac1n\sum\mathbf X_i(Y_i-\mathbf X_i'\hat{\boldsymbol\beta}_n)=\mathbf 0$ (4.11) 给出 (4.12)（与 (4.10) 同）；由 WLLN 与 CMT，$\det(\frac1n\sum\mathbf X_i\mathbf X_i')\xrightarrow{p}\det(\mathbb E[\mathbf X\mathbf X'])\ne0$，故当 $n\to\infty$ 时 $\frac1n\sum\mathbf X_i\mathbf X_i'$ 可逆的概率趋于 1。残差 $\hat u_i=Y_i-\hat Y_i$ 满足 $\sum\mathbf X_i\hat u_i=\mathbf 0$ (4.13)；因 $X_{i,0}=1$，对 $\beta_0$ 的一阶条件给出 $\sum\hat u_i=0$ (4.14)。

4.2.2 OLS 作为投影. 用矩阵记号：$\mathbb X_{n\times(k+1)}=(\mathbf X_1,\dots,\mathbf X_n)'$（每行一个观测的转置）、$\mathbb Y=(Y_1,\dots,Y_n)'$、$\mathbb U=(u_1,\dots,u_n)'$、$\hat{\mathbb Y}=\mathbb X\hat{\boldsymbol\beta}_n$、$\hat{\mathbb U}=\mathbb Y-\hat{\mathbb Y}$。则

$$\hat{\boldsymbol\beta}_n=(\mathbb X'\mathbb X)^{-1}\mathbb X'\mathbb Y \tag{4.15}$$

为 $\min_{\mathbf b}|\mathbb Y-\mathbb X\mathbf b|^2$ 之解。$\mathbb X$ 的列空间 $\mathrm{Col}[\mathbb X]=\{\mathbb X\mathbf b:\mathbf b\in\mathbb R^{k+1}\}$。定义投影阵

$$\mathbb P=\mathbb X(\mathbb X'\mathbb X)^{-1}\mathbb X',\qquad\mathbb P\mathbb Y=\mathbb X\hat{\boldsymbol\beta}_n=\hat{\mathbb Y}$$

$\mathbb P\mathbb Y\in\mathrm{Col}[\mathbb X]$，故 $\mathbb P$ 把 $\mathbb Y$ 投影到 $\mathrm{Col}[\mathbb X]$。残差生成阵 $\mathbb M=\mathbb I-\mathbb P=\mathbb I-\mathbb X(\mathbb X'\mathbb X)^{-1}\mathbb X'$，把 $\mathbb Y$ 投影到与 $\mathrm{Col}[\mathbb X]$ 正交的空间：$\mathbb M\mathbb X=\mathbf 0$、$\mathbb M\mathbb Y=\hat{\mathbb U}$、$\mathbb M\mathbb U=\hat{\mathbb U}$。$\mathbb P$ 与 $\mathbb M$ 皆幂等（$\mathbb P^2=\mathbb P$、$\mathbb M^2=\mathbb M$）且对称（$\mathbb P'=\mathbb P$、$\mathbb M'=\mathbb M$）。

4.2.1 Estimating $\boldsymbol\beta$. Under the population properties ($\mathbb E[\mathbf Xu]=0$, $\mathbb E[\mathbf X\mathbf X']<\infty$, no perfect collinearity, $(Y,\mathbf X)\sim P$) and an i.i.d. sample, the natural sample analog of $\boldsymbol\beta=\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY]$ (4.9) is

$$\hat{\boldsymbol\beta}_n=\Big(\frac1n\sum_{i=1}^n\mathbf X_i\mathbf X_i'\Big)^{-1}\Big(\frac1n\sum_{i=1}^n\mathbf X_iY_i\Big) \tag{4.10}$$

the ordinary least squares (OLS) estimator, so called because it minimizes the sum of squared distances between the observed $Y_i$ and the fitted $\hat Y_i=\mathbf X_i'\hat{\boldsymbol\beta}_n$: $\hat{\boldsymbol\beta}_n=\arg\min_{\mathbf b}\frac1n\sum(Y_i-\mathbf X_i'\mathbf b)^2$. The first-order condition $\frac1n\sum\mathbf X_i(Y_i-\mathbf X_i'\hat{\boldsymbol\beta}_n)=\mathbf 0$ (4.11) gives (4.12) (the same as (4.10)). By WLLN and CMT, $\det(\frac1n\sum\mathbf X_i\mathbf X_i')\xrightarrow{p}\det(\mathbb E[\mathbf X\mathbf X'])\ne0$, so $\frac1n\sum\mathbf X_i\mathbf X_i'$ is invertible with probability approaching one as $n\to\infty$. The residual $\hat u_i=Y_i-\hat Y_i$ satisfies $\sum\mathbf X_i\hat u_i=\mathbf 0$ (4.13); since $X_{i,0}=1$, the first-order condition for $\beta_0$ gives $\sum\hat u_i=0$ (4.14).
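The estimator (4.10) and the residual conditions (4.13) and (4.14) can be verified on a toy data set. The sketch below uses only the Python standard library; the data and the helpers `solve` and `ols` are hypothetical names introduced for this example. It solves the normal equations $\big(\sum\mathbf X_i\mathbf X_i'\big)\mathbf b=\sum\mathbf X_iY_i$ directly instead of forming an inverse.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ols(X, y):
    """OLS coefficients from the normal equations; X is a list of rows."""
    k = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(k)]
    return solve(XtX, Xty)

# Toy data (illustrative only); the first column of X is the constant.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
X = [[1.0, xi] for xi in x]

beta_hat = ols(X, y)
u_hat = [yi - sum(b * v for b, v in zip(beta_hat, row)) for row, yi in zip(X, y)]

sum_u = sum(u_hat)                                  # (4.14): should be ~0
sum_xu = sum(xi * ui for xi, ui in zip(x, u_hat))   # (4.13): should be ~0
print("beta_hat:", beta_hat)
print("sum of residuals:", sum_u, " sum of x*residual:", sum_xu)
```

Solving the linear system rather than inverting $\sum\mathbf X_i\mathbf X_i'$ is a common numerical choice; both routes give the same $\hat{\boldsymbol\beta}_n$ whenever the matrix is invertible.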

4.2.2 OLS as a projection. In matrix notation: $\mathbb X_{n\times(k+1)}=(\mathbf X_1,\dots,\mathbf X_n)'$ (each row is one observation transposed), $\mathbb Y=(Y_1,\dots,Y_n)'$, $\mathbb U=(u_1,\dots,u_n)'$, $\hat{\mathbb Y}=\mathbb X\hat{\boldsymbol\beta}_n$, $\hat{\mathbb U}=\mathbb Y-\hat{\mathbb Y}$. Then

$$\hat{\boldsymbol\beta}_n=(\mathbb X'\mathbb X)^{-1}\mathbb X'\mathbb Y \tag{4.15}$$

solves $\min_{\mathbf b}|\mathbb Y-\mathbb X\mathbf b|^2$. The column space of $\mathbb X$ is $\mathrm{Col}[\mathbb X]=\{\mathbb X\mathbf b:\mathbf b\in\mathbb R^{k+1}\}$. Define the projection matrix

$$\mathbb P=\mathbb X(\mathbb X'\mathbb X)^{-1}\mathbb X',\qquad\mathbb P\mathbb Y=\mathbb X\hat{\boldsymbol\beta}_n=\hat{\mathbb Y}$$

$\mathbb P\mathbb Y\in\mathrm{Col}[\mathbb X]$, so $\mathbb P$ projects $\mathbb Y$ onto $\mathrm{Col}[\mathbb X]$. The residual-maker matrix $\mathbb M=\mathbb I-\mathbb P=\mathbb I-\mathbb X(\mathbb X'\mathbb X)^{-1}\mathbb X'$ projects $\mathbb Y$ onto the space orthogonal to $\mathrm{Col}[\mathbb X]$: $\mathbb M\mathbb X=\mathbf 0$, $\mathbb M\mathbb Y=\hat{\mathbb U}$, $\mathbb M\mathbb U=\hat{\mathbb U}$. Both $\mathbb P$ and $\mathbb M$ are idempotent ($\mathbb P^2=\mathbb P$, $\mathbb M^2=\mathbb M$) and symmetric ($\mathbb P'=\mathbb P$, $\mathbb M'=\mathbb M$).
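The stated properties of $\mathbb P$ and $\mathbb M$ follow in one line each from the definitions; the verification is written out here for completeness:

$$\mathbb P^2=\mathbb X(\mathbb X'\mathbb X)^{-1}\underbrace{\mathbb X'\mathbb X(\mathbb X'\mathbb X)^{-1}}_{=\mathbb I}\mathbb X'=\mathbb P,\qquad\mathbb P'=\mathbb X\big[(\mathbb X'\mathbb X)^{-1}\big]'\mathbb X'=\mathbb P,$$

$$\mathbb M^2=(\mathbb I-\mathbb P)^2=\mathbb I-2\mathbb P+\mathbb P^2=\mathbb I-\mathbb P=\mathbb M,\qquad\mathbb M\mathbb X=\mathbb X-\mathbb X(\mathbb X'\mathbb X)^{-1}\mathbb X'\mathbb X=\mathbf 0,$$

$$\mathbb M\mathbb Y=\mathbb M(\mathbb X\boldsymbol\beta+\mathbb U)=\mathbb M\mathbb U=\hat{\mathbb U}.$$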

4.2.3 估计子向量 $\hat{\boldsymbol\beta}_{1,n}$（FWL 的样本版本）. 把 $\mathbb X$ 分为 $\mathbb X_1,\mathbb X_2$，并记 $\mathbb P_j=\mathbb X_j(\mathbb X_j'\mathbb X_j)^{-1}\mathbb X_j'$、$\mathbb M_j=\mathbb I-\mathbb P_j$（$j=1,2$）为由 $\mathbb X_j$ 构造的投影阵与残差生成阵。样本类比 $\mathbb Y=\mathbb X_1\hat{\boldsymbol\beta}_{1,n}+\mathbb X_2\hat{\boldsymbol\beta}_{2,n}+\hat{\mathbb U}$。用 $\mathbb M_2$ 左乘消去 $\mathbb X_2$（$\mathbb M_2\mathbb X_2=0$），并注意 $\mathbb M_2\hat{\mathbb U}=\hat{\mathbb U}$（因 $\mathbb X_2'\hat{\mathbb U}=\mathbf 0$）：$\mathbb M_2\mathbb Y=\mathbb M_2\mathbb X_1\hat{\boldsymbol\beta}_{1,n}+\hat{\mathbb U}$。再左乘 $(\mathbb M_2\mathbb X_1)'$ 并用 $\mathbb X_1'\hat{\mathbb U}=0$：

$$\hat{\boldsymbol\beta}_{1,n}=[(\mathbb M_2\mathbb X_1)'(\mathbb M_2\mathbb X_1)]^{-1}(\mathbb M_2\mathbb X_1)'(\mathbb M_2\mathbb Y) \tag{4.17}$$

称 Frisch-Waugh-Lovell 分解。由 $\mathbb M_2$ 幂等对称，可简化为 $\hat{\boldsymbol\beta}_{1,n}=[\mathbb X_1'\mathbb M_2\mathbb X_1]^{-1}\mathbb X_1'\mathbb M_2\mathbb Y$。

4.2.4 拟合优度. $R$ 方 $R^2\equiv\frac{\mathrm{ESS}}{\mathrm{TSS}}=1-\frac{\mathrm{SSR}}{\mathrm{TSS}}$ (4.18)，其中解释平方和 $\mathrm{ESS}=\sum(\hat Y_i-\bar Y_n)^2$、残差平方和 $\mathrm{SSR}=\sum\hat u_i^2$、总平方和 $\mathrm{TSS}=\sum(Y_i-\bar Y_n)^2$。由 $\sum\hat u_i(\hat Y_i-\bar Y_n)=0$ 得 $\mathrm{TSS}=\mathrm{ESS}+\mathrm{SSR}$ (4.19)，故 $0\le R^2\le1$。$R^2=1$ 为完美拟合（$\hat u_i=0$）、$R^2=0$ 表示回归元在样本内没有解释力。加入更多变量 $R^2$ 不减（在更大的回归元集合上最小化，SSR 不会增大），故引入调整 $R^2$ 作惩罚：

$$\bar R^2=1-\frac{\frac1{n-(k+1)}\sum\hat u_i^2}{\frac1{n-1}\sum(Y_i-\bar Y_n)^2}=1-\frac{n-1}{n-k-1}\frac{\mathrm{SSR}}{\mathrm{TSS}}$$

$\bar R^2$ 可为负。$R^2$ 与 $\bar R^2$ 都无因果基础——只是拟合的描述，不是因果解读。

4.2.3 Estimating a subvector $\hat{\boldsymbol\beta}_{1,n}$ (sample version of FWL). Partition $\mathbb X$ into $\mathbb X_1,\mathbb X_2$, and let $\mathbb P_j=\mathbb X_j(\mathbb X_j'\mathbb X_j)^{-1}\mathbb X_j'$ and $\mathbb M_j=\mathbb I-\mathbb P_j$ ($j=1,2$) be the projection and residual-maker matrices built from $\mathbb X_j$. The sample analog is $\mathbb Y=\mathbb X_1\hat{\boldsymbol\beta}_{1,n}+\mathbb X_2\hat{\boldsymbol\beta}_{2,n}+\hat{\mathbb U}$. Left-multiply by $\mathbb M_2$ to remove $\mathbb X_2$ ($\mathbb M_2\mathbb X_2=0$), noting that $\mathbb M_2\hat{\mathbb U}=\hat{\mathbb U}$ (because $\mathbb X_2'\hat{\mathbb U}=\mathbf 0$): $\mathbb M_2\mathbb Y=\mathbb M_2\mathbb X_1\hat{\boldsymbol\beta}_{1,n}+\hat{\mathbb U}$. Then left-multiply by $(\mathbb M_2\mathbb X_1)'$ and use $\mathbb X_1'\hat{\mathbb U}=0$:

$$\hat{\boldsymbol\beta}_{1,n}=[(\mathbb M_2\mathbb X_1)'(\mathbb M_2\mathbb X_1)]^{-1}(\mathbb M_2\mathbb X_1)'(\mathbb M_2\mathbb Y) \tag{4.17}$$

the Frisch-Waugh-Lovell decomposition. By idempotency and symmetry of $\mathbb M_2$, it simplifies to $\hat{\boldsymbol\beta}_{1,n}=[\mathbb X_1'\mathbb M_2\mathbb X_1]^{-1}\mathbb X_1'\mathbb M_2\mathbb Y$.
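The sample FWL result is an exact algebraic identity, so it can be checked numerically on any data set without perfect collinearity. In the sketch below the data and the helpers `solve`, `ols`, and `residuals` are hypothetical, introduced only for illustration. Here $\mathbb X_1$ is a single regressor and $\mathbb X_2$ holds the constant and one control; applying $\mathbb M_2$ is implemented as "take residuals from a regression on $\mathbb X_2$".

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ols(X, y):
    """OLS coefficients from the normal equations; X is a list of rows."""
    k = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(k)]
    return solve(XtX, Xty)

def residuals(X, y):
    """Residuals from regressing y on X, i.e. the vector M y."""
    b = ols(X, y)
    return [yi - sum(bj * xj for bj, xj in zip(b, row)) for row, yi in zip(X, y)]

# Toy data (illustrative only).
x1 = [1.0, 2.0, 4.0, 3.0, 5.0, 7.0]
x2 = [2.0, 1.0, 3.0, 5.0, 4.0, 6.0]
y = [3.0, 4.0, 8.0, 9.0, 11.0, 15.0]

# Full regression of y on (1, x1, x2); the coefficient on x1 is at index 1.
beta_full = ols([[1.0, a, b] for a, b in zip(x1, x2)], y)

# FWL: partial (1, x2) out of both y and x1, then regress residual on residual.
X2 = [[1.0, b] for b in x2]
y_tilde = residuals(X2, y)
x1_tilde = residuals(X2, x1)
beta_fwl = sum(a * b for a, b in zip(x1_tilde, y_tilde)) / sum(a * a for a in x1_tilde)

print("coefficient on x1, full regression:", beta_full[1])
print("coefficient on x1, FWL two-step   :", beta_fwl)
```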

4.2.4 Measures of fit. The $R$-squared is $R^2\equiv\frac{\mathrm{ESS}}{\mathrm{TSS}}=1-\frac{\mathrm{SSR}}{\mathrm{TSS}}$ (4.18), where the explained sum of squares is $\mathrm{ESS}=\sum(\hat Y_i-\bar Y_n)^2$, the sum of squared residuals is $\mathrm{SSR}=\sum\hat u_i^2$, and the total sum of squares is $\mathrm{TSS}=\sum(Y_i-\bar Y_n)^2$. Since $\sum\hat u_i(\hat Y_i-\bar Y_n)=0$, $\mathrm{TSS}=\mathrm{ESS}+\mathrm{SSR}$ (4.19), hence $0\le R^2\le1$. $R^2=1$ is a perfect fit ($\hat u_i=0$), and $R^2=0$ means the regressors have no explanatory power in the sample. Adding more variables never decreases $R^2$ (minimizing over a larger set of regressors cannot increase SSR), so the adjusted $R^2$ is introduced as a penalty:

$$\bar R^2=1-\frac{\frac1{n-(k+1)}\sum\hat u_i^2}{\frac1{n-1}\sum(Y_i-\bar Y_n)^2}=1-\frac{n-1}{n-k-1}\frac{\mathrm{SSR}}{\mathrm{TSS}}$$

$\bar R^2$ can be negative. Neither $R^2$ nor $\bar R^2$ has any causal basis — they are descriptions of fit, not causal interpretations.
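The decomposition (4.19) and the two fit measures are easy to compute by hand. The following sketch uses toy data, chosen only for illustration, with one regressor plus a constant ($k=1$); the slope comes from the sample analog of (4.7).

```python
# Toy data (illustrative only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n, k = len(y), 1                      # k = number of non-constant regressors

# Simple regression with an intercept: slope = sample Cov(x, y) / Var(x).
x_bar, y_bar = sum(x) / n, sum(y) / n
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

y_hat = [intercept + slope * xi for xi in x]
u_hat = [yi - fi for yi, fi in zip(y, y_hat)]

tss = sum((yi - y_bar) ** 2 for yi in y)     # total sum of squares
ess = sum((fi - y_bar) ** 2 for fi in y_hat) # explained sum of squares
ssr = sum(ui ** 2 for ui in u_hat)           # sum of squared residuals

r2 = 1.0 - ssr / tss
adj_r2 = 1.0 - (n - 1) / (n - k - 1) * ssr / tss

print("TSS, ESS + SSR:", tss, ess + ssr)
print("R^2:", r2, " adjusted R^2:", adj_r2)
```

Because $(n-1)/(n-k-1)>1$ whenever $k\ge1$, the adjusted measure lies below $R^2$ unless the fit is perfect, and it turns negative once $\mathrm{SSR}/\mathrm{TSS}$ exceeds $(n-k-1)/(n-1)$.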

4.3 Properties of the OLS Estimator

以下都在 §4.2.1 的基本假设下，子节会另加假设。

4.3.1 无偏性.

Important

命题 4.2（无偏）额外假设 $\mathbb E[u\mid\mathbf X]=0$（均值独立），则 OLS 无偏：$\mathbb E[\hat{\boldsymbol\beta}_n]=\boldsymbol\beta$。

Note

证明（命题 4.2） $\hat{\boldsymbol\beta}_n=(\mathbb X'\mathbb X)^{-1}\mathbb X'\mathbb Y=(\mathbb X'\mathbb X)^{-1}\mathbb X'(\mathbb X\boldsymbol\beta+\mathbb U)=\boldsymbol\beta+(\mathbb X'\mathbb X)^{-1}\mathbb X'\mathbb U$。取条件期望： $$\mathbb E[\hat{\boldsymbol\beta}_n\mid\mathbf X_1,\dots,\mathbf X_n]=\boldsymbol\beta+(\mathbb X'\mathbb X)^{-1}\mathbb X'\underbrace{\mathbb E[\mathbb U\mid\mathbf X_1,\dots,\mathbf X_n]}_{=0\,\because\,\mathbb E[u_i\mid\mathbf X_i]=0}=\boldsymbol\beta$$ 由重期望定律 $\mathbb E[\hat{\boldsymbol\beta}_n]=\mathbb E[\mathbb E[\hat{\boldsymbol\beta}_n\mid\mathbf X]]=\boldsymbol\beta$。$\blacksquare$

4.3.2 高斯-马尔可夫定理.

Important

定义 4.2 / 命题 4.3（高斯-马尔可夫） 同方差：$\mathrm{Var}(u\mid\mathbf X)$ 为常数；否则异方差。命题 4.3：额外假设 $\mathbb E[u\mid\mathbf X]=0$（均值独立）与 $\mathrm{Var}(u\mid\mathbf X)=\sigma^2$（同方差），则 $\hat{\boldsymbol\beta}_n$ 是 $\boldsymbol\beta$ 的「最优」估计量（BLUE）——在所有形如 $\mathbb A'\mathbb Y$（$\mathbb A=\mathbb A(\mathbf X_1,\dots,\mathbf X_n)$）且无偏（$\mathbb E[\mathbb A'\mathbb Y\mid\mathbf X]=\boldsymbol\beta$）的估计量中，$\mathrm{Var}(\mathbb A'\mathbb Y\mid\mathbf X)$ 最小。

Note

证明（命题 4.3）对任意 $\boldsymbol\beta$ 无偏要求 $\mathbb E[\mathbb A'\mathbb Y\mid\mathbf X]=\mathbb A'\mathbb X\boldsymbol\beta=\boldsymbol\beta$，即 $\mathbb A'\mathbb X=\mathbb I$。在 i.i.d. 抽样与同方差下 $\mathrm{Var}(\mathbb U\mid\mathbf X)=\sigma^2\mathbb I$，故条件方差 $\mathrm{Var}(\mathbb A'\mathbb Y\mid\mathbf X)=\mathbb A'\mathrm{Var}(\mathbb U\mid\mathbf X)\mathbb A=\mathbb A'\mathbb A\sigma^2$ (4.21)。OLS 对应 $\mathbb A'=(\mathbb X'\mathbb X)^{-1}\mathbb X'$，其方差 $(\mathbb X'\mathbb X)^{-1}\sigma^2$。需证 $\mathbb A'\mathbb A-(\mathbb X'\mathbb X)^{-1}$ 半正定。令 $\mathbb C=\mathbb A-\mathbb X(\mathbb X'\mathbb X)^{-1}$，可证 $\mathbb X'\mathbb C=\mathbf 0$，于是 $$\mathbb A'\mathbb A-(\mathbb X'\mathbb X)^{-1}=\mathbb C'\mathbb C$$ 对任意 $\mathbf c\ne\mathbf 0$，记 $\mathbf m=\mathbb C\mathbf c$，则 $\mathbf c'(\mathbb A'\mathbb A-(\mathbb X'\mathbb X)^{-1})\mathbf c=\mathbf c'\mathbb C'\mathbb C\mathbf c=\mathbf m'\mathbf m=\sum_i m_i^2\ge0$，故该差为半正定，OLS 方差最小。$\blacksquare$

Everything below holds under the §4.2.1 baseline assumptions; each subsection states the extra assumptions it needs.

4.3.1 Unbiasedness.

Important

Proposition 4.2 (Unbiased) Additionally assume $\mathbb E[u\mid\mathbf X]=0$ (mean independence). Then OLS is unbiased: $\mathbb E[\hat{\boldsymbol\beta}_n]=\boldsymbol\beta$.

Note

Proof (Proposition 4.2) $\hat{\boldsymbol\beta}_n=(\mathbb X'\mathbb X)^{-1}\mathbb X'\mathbb Y=(\mathbb X'\mathbb X)^{-1}\mathbb X'(\mathbb X\boldsymbol\beta+\mathbb U)=\boldsymbol\beta+(\mathbb X'\mathbb X)^{-1}\mathbb X'\mathbb U$. Take conditional expectation: $$\mathbb E[\hat{\boldsymbol\beta}_n\mid\mathbf X_1,\dots,\mathbf X_n]=\boldsymbol\beta+(\mathbb X'\mathbb X)^{-1}\mathbb X'\underbrace{\mathbb E[\mathbb U\mid\mathbf X_1,\dots,\mathbf X_n]}_{=0\,\because\,\mathbb E[u_i\mid\mathbf X_i]=0}=\boldsymbol\beta$$ By LIE, $\mathbb E[\hat{\boldsymbol\beta}_n]=\mathbb E[\mathbb E[\hat{\boldsymbol\beta}_n\mid\mathbf X]]=\boldsymbol\beta$. $\blacksquare$

4.3.2 Gauss-Markov Theorem.

Important

Definition 4.2 / Proposition 4.3 (Gauss-Markov) Homoskedastic: $\mathrm{Var}(u\mid\mathbf X)$ is constant; otherwise heteroskedastic. Proposition 4.3: additionally assume $\mathbb E[u\mid\mathbf X]=0$ (mean independence) and $\mathrm{Var}(u\mid\mathbf X)=\sigma^2$ (homoskedasticity). Then $\hat{\boldsymbol\beta}_n$ is the "best" estimator of $\boldsymbol\beta$ (BLUE) — among all estimators of the form $\mathbb A'\mathbb Y$ ($\mathbb A=\mathbb A(\mathbf X_1,\dots,\mathbf X_n)$) that are unbiased ($\mathbb E[\mathbb A'\mathbb Y\mid\mathbf X]=\boldsymbol\beta$), $\hat{\boldsymbol\beta}_n$ has the smallest $\mathrm{Var}(\mathbb A'\mathbb Y\mid\mathbf X)$.

Note

Proof (Proposition 4.3) Unbiasedness for every $\boldsymbol\beta$ requires $\mathbb E[\mathbb A'\mathbb Y\mid\mathbf X]=\mathbb A'\mathbb X\boldsymbol\beta=\boldsymbol\beta$, i.e. $\mathbb A'\mathbb X=\mathbb I$. Under i.i.d. sampling and homoskedasticity, $\mathrm{Var}(\mathbb U\mid\mathbf X)=\sigma^2\mathbb I$, so the conditional variance is $\mathrm{Var}(\mathbb A'\mathbb Y\mid\mathbf X)=\mathbb A'\mathrm{Var}(\mathbb U\mid\mathbf X)\mathbb A=\mathbb A'\mathbb A\sigma^2$ (4.21). OLS corresponds to $\mathbb A'=(\mathbb X'\mathbb X)^{-1}\mathbb X'$, with variance $(\mathbb X'\mathbb X)^{-1}\sigma^2$. We show $\mathbb A'\mathbb A-(\mathbb X'\mathbb X)^{-1}$ is positive semi-definite. Let $\mathbb C=\mathbb A-\mathbb X(\mathbb X'\mathbb X)^{-1}$; one shows $\mathbb X'\mathbb C=\mathbf 0$, so $$\mathbb A'\mathbb A-(\mathbb X'\mathbb X)^{-1}=\mathbb C'\mathbb C$$ For any $\mathbf c\ne\mathbf 0$, write $\mathbf m=\mathbb C\mathbf c$; then $\mathbf c'(\mathbb A'\mathbb A-(\mathbb X'\mathbb X)^{-1})\mathbf c=\mathbf c'\mathbb C'\mathbb C\mathbf c=\mathbf m'\mathbf m=\sum_i m_i^2\ge0$, so the difference is positive semi-definite and OLS has the smallest variance. $\blacksquare$
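The two steps that the proof leaves to the reader ("one shows $\mathbb X'\mathbb C=\mathbf 0$" and the identity for $\mathbb A'\mathbb A$) can be written out as follows, using only $\mathbb A'\mathbb X=\mathbb I$ and the definition of $\mathbb C$:

$$\mathbb X'\mathbb C=\mathbb X'\mathbb A-\mathbb X'\mathbb X(\mathbb X'\mathbb X)^{-1}=(\mathbb A'\mathbb X)'-\mathbb I=\mathbf 0,$$

$$\mathbb A'\mathbb A=\big(\mathbb C+\mathbb X(\mathbb X'\mathbb X)^{-1}\big)'\big(\mathbb C+\mathbb X(\mathbb X'\mathbb X)^{-1}\big)=\mathbb C'\mathbb C+\underbrace{\mathbb C'\mathbb X}_{=\mathbf 0}(\mathbb X'\mathbb X)^{-1}+(\mathbb X'\mathbb X)^{-1}\underbrace{\mathbb X'\mathbb C}_{=\mathbf 0}+(\mathbb X'\mathbb X)^{-1}=\mathbb C'\mathbb C+(\mathbb X'\mathbb X)^{-1}.$$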

4.3.3 一致性.

Important

命题 4.4（一致） OLS 一致：$\hat{\boldsymbol\beta}_n\xrightarrow{p}\boldsymbol\beta$。（仅需基本假设，无须均值独立或同方差。）

Note

证明（命题 4.4）由 WLLN，$\frac1n\sum\mathbf X_i\mathbf X_i'\xrightarrow{p}\mathbb E[\mathbf X\mathbf X']$、$\frac1n\sum\mathbf X_iY_i\xrightarrow{p}\mathbb E[\mathbf XY]$；无完全共线性使 $\mathbb E[\mathbf X\mathbf X']$ 可逆，CMT 给出 $(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}\xrightarrow{p}\mathbb E[\mathbf X\mathbf X']^{-1}$；边际收敛 ⇒ 联合收敛 + CMT（乘法连续）： $$\hat{\boldsymbol\beta}_n=\Big(\tfrac1n\sum\mathbf X_i\mathbf X_i'\Big)^{-1}\Big(\tfrac1n\sum\mathbf X_iY_i\Big)\xrightarrow{p}\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY]=\boldsymbol\beta\quad\blacksquare$$

4.3.4 极限分布.

Important

命题 4.5 / 命题 4.6（极限分布） 命题 4.5：额外假设 $\mathrm{Var}(\mathbf Xu)$ 存在，则 $$\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}N(0,\Omega),\quad\Omega=\mathbb E[\mathbf X\mathbf X']^{-1}\mathrm{Var}(\mathbf Xu)\mathbb E[\mathbf X\mathbf X']^{-1}$$ （三明治形式，异方差稳健）。命题 4.6：若再加 $\mathbb E[u\mid\mathbf X]=0$（均值独立）与 $\mathrm{Var}(u\mid\mathbf X)=\sigma^2$（同方差），则 $$\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}N(0,\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1})$$

Note

证明（命题 4.5 与 4.6） 命题 4.5：展开 $\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)=(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}(\sqrt n\frac1n\sum\mathbf X_iu_i)$。$\mathbf X_iu_i$ i.i.d.、均值 $\mathbb E[\mathbf Xu]=0$，由 CLT $\sqrt n\frac1n\sum\mathbf X_iu_i\xrightarrow{d}N(0,\mathrm{Var}(\mathbf Xu))$；又 $(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}\xrightarrow{p}\mathbb E[\mathbf X\mathbf X']^{-1}$，由 Slutsky：$\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}\mathbb E[\mathbf X\mathbf X']^{-1}N(0,\mathrm{Var}(\mathbf Xu))=N(0,\Omega)$。 命题 4.6：同方差使 $\mathrm{Var}(\mathbf Xu)=\mathbb E[\mathbf X\mathbf X'u^2]=\mathbb E[\mathbf X\mathbf X'\mathbb E[u^2\mid\mathbf X]]=\sigma^2\mathbb E[\mathbf X\mathbf X']$ (4.22)(4.23)，代入得 $\Omega=\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1}$ (4.24)。$\blacksquare$

4.3.3 Consistency.

Important

Proposition 4.4 (Consistent) OLS is consistent: $\hat{\boldsymbol\beta}_n\xrightarrow{p}\boldsymbol\beta$. (Only the baseline assumptions are needed, no mean independence or homoskedasticity.)

Note

Proof (Proposition 4.4) By WLLN, $\frac1n\sum\mathbf X_i\mathbf X_i'\xrightarrow{p}\mathbb E[\mathbf X\mathbf X']$, $\frac1n\sum\mathbf X_iY_i\xrightarrow{p}\mathbb E[\mathbf XY]$; no perfect collinearity makes $\mathbb E[\mathbf X\mathbf X']$ invertible, and CMT gives $(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}\xrightarrow{p}\mathbb E[\mathbf X\mathbf X']^{-1}$; marginal ⇒ joint convergence + CMT (multiplication is continuous): $$\hat{\boldsymbol\beta}_n=\Big(\tfrac1n\sum\mathbf X_i\mathbf X_i'\Big)^{-1}\Big(\tfrac1n\sum\mathbf X_iY_i\Big)\xrightarrow{p}\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY]=\boldsymbol\beta\quad\blacksquare$$

4.3.4 Limiting distribution.

Important

Proposition 4.5 / Proposition 4.6 (Limiting distribution) Proposition 4.5: additionally assume $\mathrm{Var}(\mathbf Xu)$ exists. Then $$\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}N(0,\Omega),\quad\Omega=\mathbb E[\mathbf X\mathbf X']^{-1}\mathrm{Var}(\mathbf Xu)\mathbb E[\mathbf X\mathbf X']^{-1}$$ (the sandwich form, heteroskedasticity-robust). Proposition 4.6: if we further add $\mathbb E[u\mid\mathbf X]=0$ (mean independence) and $\mathrm{Var}(u\mid\mathbf X)=\sigma^2$ (homoskedasticity), then $$\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}N(0,\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1})$$

Note

Proof (Propositions 4.5 and 4.6) Prop 4.5: expand $\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)=(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}(\sqrt n\frac1n\sum\mathbf X_iu_i)$. The $\mathbf X_iu_i$ are i.i.d. with mean $\mathbb E[\mathbf Xu]=0$, so by CLT $\sqrt n\frac1n\sum\mathbf X_iu_i\xrightarrow{d}N(0,\mathrm{Var}(\mathbf Xu))$; also $(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}\xrightarrow{p}\mathbb E[\mathbf X\mathbf X']^{-1}$, so by Slutsky $\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}\mathbb E[\mathbf X\mathbf X']^{-1}N(0,\mathrm{Var}(\mathbf Xu))=N(0,\Omega)$. Prop 4.6: homoskedasticity gives $\mathrm{Var}(\mathbf Xu)=\mathbb E[\mathbf X\mathbf X'u^2]=\mathbb E[\mathbf X\mathbf X'\mathbb E[u^2\mid\mathbf X]]=\sigma^2\mathbb E[\mathbf X\mathbf X']$ (4.22)(4.23), so $\Omega=\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1}$ (4.24). $\blacksquare$

4.3.5 估计 $\Omega$. 我们不知真实 $\Omega$，需构造一致估计。同方差下 $\Omega=\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1}$，自然估计 $\hat\Omega_n=(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}\hat\sigma_n^2$（$\hat\sigma_n^2=\frac1n\sum\hat u_i^2$）。一般（异方差稳健）：

$$\hat\Omega_n=\Big(\tfrac1n\sum\mathbf X_i\mathbf X_i'\Big)^{-1}\Big(\tfrac1n\sum\mathbf X_i\mathbf X_i'\hat u_i^2\Big)\Big(\tfrac1n\sum\mathbf X_i\mathbf X_i'\Big)^{-1} \tag{4.25}$$

需证中间项 $\frac1n\sum\mathbf X_i\mathbf X_i'\hat u_i^2\xrightarrow{p}\mathrm{Var}(\mathbf Xu)$。把 $\hat u_i^2=u_i^2+(\hat u_i^2-u_i^2)$ 拆分 (4.26)，第一项由 WLLN $\frac1n\sum\mathbf X_i\mathbf X_i'u_i^2\xrightarrow{p}\mathbb E[\mathbf X\mathbf X'u^2]$，第二项用 $\hat u_i^2-u_i^2=-2u_i\mathbf X_i'(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)+[\mathbf X_i'(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)]^2$ 与下述引理证其 $\xrightarrow{p}0$。

Important

引理 4.2 设 $\mathbf Z_1,\dots,\mathbf Z_n$ i.i.d.、$\mathbb E[|\mathbf Z_i|^r]<\infty$，则 $\max_{1\le i\le n}|\mathbf Z_i|=o_P(n^{1/r})$，即 $\frac{\max_{1\le i\le n}|\mathbf Z_i|}{n^{1/r}}\xrightarrow{p}0$。

最终 $\hat\Omega_n\xrightarrow{p}\Omega$。

4.3.5 Estimating $\Omega$. We do not know the true $\Omega$ and must construct a consistent estimator. Under homoskedasticity $\Omega=\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1}$, with the natural estimator $\hat\Omega_n=(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}\hat\sigma_n^2$ ($\hat\sigma_n^2=\frac1n\sum\hat u_i^2$). In general (heteroskedasticity-robust):

$$\hat\Omega_n=\Big(\tfrac1n\sum\mathbf X_i\mathbf X_i'\Big)^{-1}\Big(\tfrac1n\sum\mathbf X_i\mathbf X_i'\hat u_i^2\Big)\Big(\tfrac1n\sum\mathbf X_i\mathbf X_i'\Big)^{-1} \tag{4.25}$$

We must show the middle term $\frac1n\sum\mathbf X_i\mathbf X_i'\hat u_i^2\xrightarrow{p}\mathrm{Var}(\mathbf Xu)$. Split $\hat u_i^2=u_i^2+(\hat u_i^2-u_i^2)$ (4.26): the first term by WLLN $\frac1n\sum\mathbf X_i\mathbf X_i'u_i^2\xrightarrow{p}\mathbb E[\mathbf X\mathbf X'u^2]$; the second uses $\hat u_i^2-u_i^2=-2u_i\mathbf X_i'(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)+[\mathbf X_i'(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)]^2$ together with the lemma below to show it $\xrightarrow{p}0$.

Important

Lemma 4.2 Let $\mathbf Z_1,\dots,\mathbf Z_n$ be i.i.d. with $\mathbb E[|\mathbf Z_i|^r]<\infty$. Then $\max_{1\le i\le n}|\mathbf Z_i|=o_P(n^{1/r})$, i.e. $\frac{\max_{1\le i\le n}|\mathbf Z_i|}{n^{1/r}}\xrightarrow{p}0$.

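Combining the two terms with $(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}\xrightarrow{p}\mathbb E[\mathbf X\mathbf X']^{-1}$ and the continuous mapping theorem gives $\hat\Omega_n\xrightarrow{p}\Omega$.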
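A minimal NumPy sketch of the two estimators of $\Omega$ follows. The function name `ols_with_robust_variance` and the simulated data are hypothetical, introduced only for illustration; the formulas are the homoskedastic estimator and the sandwich (4.25) from the text.

```python
import numpy as np

def ols_with_robust_variance(X, y):
    """Return (beta_hat, Omega_hom, Omega_robust) for y = X beta + u.

    X is the n x (k+1) design matrix (include the column of ones yourself).
    Both Omega estimates target the variance of sqrt(n) * (beta_hat - beta).
    """
    n = X.shape[0]
    Q_inv = np.linalg.inv(X.T @ X / n)            # ((1/n) sum X_i X_i')^{-1}
    beta_hat = Q_inv @ (X.T @ y / n)
    u_hat = y - X @ beta_hat                      # OLS residuals
    Omega_hom = np.mean(u_hat ** 2) * Q_inv       # homoskedastic estimator
    meat = (X * u_hat[:, None] ** 2).T @ X / n    # (1/n) sum X_i X_i' u_hat_i^2
    Omega_robust = Q_inv @ meat @ Q_inv           # sandwich estimator (4.25)
    return beta_hat, Omega_hom, Omega_robust

# Illustrative heteroskedastic data: Var(u | x) = 0.5 + x^2 (an assumption).
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * np.sqrt(0.5 + x ** 2)

beta_hat, Omega_hom, Omega_robust = ols_with_robust_variance(X, y)
se_robust = np.sqrt(np.diag(Omega_robust) / n)    # standard errors of beta_hat
```

Dividing $\hat\Omega_n$ by $n$ before taking square roots converts the variance of $\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)$ into standard errors for $\hat{\boldsymbol\beta}_n$ itself.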

4.4 Hypothesis Testing

4.4.1 单一线性约束（$t$ 检验，$p=1$）. 检验 $H_0:\mathbf r'\boldsymbol\beta=c$ vs $H_1:\mathbf r'\boldsymbol\beta\ne c$（$\mathbf r\in\mathbb R^{k+1}$、$c\in\mathbb R$）。由 $\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}N(0,\Omega)$ 与 CMT：$\sqrt n(\mathbf r'\hat{\boldsymbol\beta}_n-\mathbf r'\boldsymbol\beta)\xrightarrow{d}N(0,\mathbf r'\Omega\mathbf r)$。取统计量

$$T_n=\frac{\sqrt n(\mathbf r'\hat{\boldsymbol\beta}_n-c)}{\sqrt{\mathbf r'\hat\Omega_n\mathbf r}}\xrightarrow{d}N(0,1)\quad(H_0)$$

（$H_0$ 下成立。）$\phi_n=\mathbf 1\{|T_n|>z_{1-\frac\alpha2}\}$ 水平一致。

4.4.2 多重线性约束（Wald 检验）. 检验 $H_0:\mathbf R\boldsymbol\beta=\mathbf c$ vs $H_1:\mathbf R\boldsymbol\beta\ne\mathbf c$，$\mathbf R$ 为 $p\times(k+1)$、行线性独立（使 $\mathbf R\Omega\mathbf R'$ 可逆）。由 $\sqrt n(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf R\boldsymbol\beta)\xrightarrow{d}N(0,\mathbf R\Omega\mathbf R')$ 与事实「$\mathbf x\in\mathbb R^p$、$\mathbf x\sim N(\mathbf 0,\Sigma)$ 且 $\Sigma$ 可逆 $\Rightarrow\mathbf x'\Sigma^{-1}\mathbf x\sim\chi^2_p$」：

$$T_n=n(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)'(\mathbf R\hat\Omega_n\mathbf R')^{-1}(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)\xrightarrow{d}\chi^2_p\quad(H_0)$$

（$H_0$ 下成立。）$\phi_n=\mathbf 1\{T_n>c_{p,1-\alpha}\}$（$\chi^2_p$ 的 $1-\alpha$ 分位）。

4.4.1 A single linear restriction ($t$ test, $p=1$). Test $H_0:\mathbf r'\boldsymbol\beta=c$ vs $H_1:\mathbf r'\boldsymbol\beta\ne c$ ($\mathbf r\in\mathbb R^{k+1}$, $c\in\mathbb R$). From $\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}N(0,\Omega)$ and CMT: $\sqrt n(\mathbf r'\hat{\boldsymbol\beta}_n-\mathbf r'\boldsymbol\beta)\xrightarrow{d}N(0,\mathbf r'\Omega\mathbf r)$. Take the statistic

$$T_n=\frac{\sqrt n(\mathbf r'\hat{\boldsymbol\beta}_n-c)}{\sqrt{\mathbf r'\hat\Omega_n\mathbf r}}\xrightarrow{d}N(0,1)\quad(\text{under }H_0)$$

and the test $\phi_n=\mathbf 1\{|T_n|>z_{1-\frac\alpha2}\}$ is consistent in level, i.e. $\mathbb E[\phi_n]\to\alpha$ under $H_0$.

4.4.2 Multiple linear restrictions (Wald test). Test $H_0:\mathbf R\boldsymbol\beta=\mathbf c$ vs $H_1:\mathbf R\boldsymbol\beta\ne\mathbf c$, with $\mathbf R$ a $p\times(k+1)$ matrix of linearly independent rows (so $\mathbf R\Omega\mathbf R'$ is invertible). From $\sqrt n(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf R\boldsymbol\beta)\xrightarrow{d}N(0,\mathbf R\Omega\mathbf R')$ and the fact that $\mathbf x\sim N(\mathbf 0,\Sigma)$ with $\mathbf x\in\mathbb R^p$ and $\Sigma$ invertible implies $\mathbf x'\Sigma^{-1}\mathbf x\sim\chi^2_p$:

$$T_n=n(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)'(\mathbf R\hat\Omega_n\mathbf R')^{-1}(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)\xrightarrow{d}\chi^2_p\quad(\text{under }H_0)$$

and $\phi_n=\mathbf 1\{T_n>c_{p,1-\alpha}\}$ (the $1-\alpha$ quantile of $\chi^2_p$).
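The $t$ and Wald statistics above can be sketched in a few lines of NumPy. The helper names (`robust_omega`, `t_stat`, `wald_stat`), the simulated data and the two hypotheses are assumptions made for the example. To avoid extra dependencies, the critical values use the standard library's `NormalDist` for $z_{1-\alpha/2}$ and the closed form $c_{2,1-\alpha}=-2\ln\alpha$, which holds because $\chi^2_2$ is an exponential distribution with mean 2.

```python
import math
from statistics import NormalDist
import numpy as np

def robust_omega(X, y):
    """OLS coefficients and the sandwich estimator of Omega."""
    n = X.shape[0]
    Q_inv = np.linalg.inv(X.T @ X / n)
    beta_hat = Q_inv @ (X.T @ y / n)
    u_hat = y - X @ beta_hat
    meat = (X * u_hat[:, None] ** 2).T @ X / n
    return beta_hat, Q_inv @ meat @ Q_inv

def t_stat(beta_hat, Omega_hat, n, r, c):
    """T_n for the single restriction H0: r'beta = c."""
    return math.sqrt(n) * (r @ beta_hat - c) / math.sqrt(r @ Omega_hat @ r)

def wald_stat(beta_hat, Omega_hat, n, R, c):
    """T_n for the p restrictions H0: R beta = c."""
    d = R @ beta_hat - c
    return n * d @ np.linalg.solve(R @ Omega_hat @ R.T, d)

# Illustrative data with true beta = (1, 0.5, 0.5).
rng = np.random.default_rng(1)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, 0.5]) + rng.normal(size=n)
beta_hat, Omega_hat = robust_omega(X, y)

alpha = 0.05
# H0: beta_1 = beta_2 (true in this design), two-sided t test.
r = np.array([0.0, 1.0, -1.0])
T_single = t_stat(beta_hat, Omega_hat, n, r, 0.0)
reject_single = abs(T_single) > NormalDist().inv_cdf(1 - alpha / 2)

# H0: beta_1 = beta_2 = 0 (false in this design), Wald test with p = 2.
R = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
T_joint = wald_stat(beta_hat, Omega_hat, n, R, np.zeros(2))
chi2_crit_p2 = -2.0 * math.log(alpha)   # exact 1 - alpha quantile of chi^2_2
reject_joint = T_joint > chi2_crit_p2
```

With $p=1$ the Wald statistic is the square of the $t$ statistic, so the two tests reject in exactly the same samples.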

4.4.3 拉格朗日乘子（LM）检验. 约束最小二乘（CLS） 估计量 $\tilde{\boldsymbol\beta}_n$ 解 $\min_{\boldsymbol\beta:\mathbf R\boldsymbol\beta=\mathbf c}\frac1n\sum(Y_i-\mathbf X_i'\boldsymbol\beta)^2$。拉格朗日

$$\mathcal L(\boldsymbol\beta,\boldsymbol\lambda)=\frac1{2n}\sum(Y_i-\mathbf X_i'\boldsymbol\beta)^2+\boldsymbol\lambda'(\mathbf R\boldsymbol\beta-\mathbf c)$$

（$\frac12$ 用于消去常数）。一阶条件 $\frac{\partial\mathcal L}{\partial\boldsymbol\beta}=-\frac1n\sum\mathbf X_i(Y_i-\mathbf X_i'\tilde{\boldsymbol\beta}_n)+\mathbf R'\tilde{\boldsymbol\lambda}_n=\mathbf 0$、$\frac{\partial\mathcal L}{\partial\boldsymbol\lambda}=\mathbf R\tilde{\boldsymbol\beta}_n-\mathbf c=\mathbf 0$。可解出乘子

$$\tilde{\boldsymbol\lambda}_n=\Big(\mathbf R\big(\tfrac1n\sum\mathbf X_i\mathbf X_i'\big)^{-1}\mathbf R'\Big)^{-1}(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)$$

$H_0$（$\mathbf R\boldsymbol\beta=\mathbf c$）下 $\tilde{\boldsymbol\lambda}_n\xrightarrow{p}0$。其渐近分布 $\sqrt n\tilde{\boldsymbol\lambda}_n\xrightarrow{d}N(0,\Pi)$（$\Pi=(\mathbf R\mathbb E[\mathbf X\mathbf X']^{-1}\mathbf R')^{-1}\mathbf R\mathbb E[\mathbf X\mathbf X']^{-1}\mathrm{Var}(\mathbf Xu)\mathbb E[\mathbf X\mathbf X']^{-1}\mathbf R'(\mathbf R\mathbb E[\mathbf X\mathbf X']^{-1}\mathbf R')^{-1}$）。检验 $T_n=n\tilde{\boldsymbol\lambda}_n'\hat\Pi^{-1}\tilde{\boldsymbol\lambda}_n\xrightarrow{d}\chi^2_p$，$\phi_n=\mathbf 1\{T_n>\chi^2_{p,1-\alpha}\}$。

Important

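LM 检验 = Wald 检验。取 $\hat\Pi=\hat{\mathbf B}\,\mathbf R\hat\Omega_n\mathbf R'\,\hat{\mathbf B}$、$\hat{\mathbf B}=\big(\mathbf R(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}\mathbf R'\big)^{-1}$，代入 $\tilde{\boldsymbol\lambda}_n=\hat{\mathbf B}(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)$ 重排 LM 统计量，可证 $T_n=n(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)'(\mathbf R\hat\Omega_n\mathbf R')^{-1}(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)$，与 Wald 统计量完全相同，故两检验等价、渐近分布同为 $\chi^2_p$。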

4.4.4 非线性约束（Delta 方法）. 检验 $H_0:f(\boldsymbol\beta)=\mathbf c$ vs $H_1:f(\boldsymbol\beta)\ne\mathbf c$，$f:\mathbb R^{k+1}\to\mathbb R^p$ 在 $\boldsymbol\beta$ 处连续可微、雅可比 $D_\beta f(\boldsymbol\beta)=\frac{\partial f(\boldsymbol\beta)}{\partial\boldsymbol\beta'}$（$p\times(k+1)$）行满秩（$p\le k+1$）。由 Delta 方法 $\sqrt n(f(\hat{\boldsymbol\beta}_n)-f(\boldsymbol\beta))\xrightarrow{d}N(0,D_\beta f(\boldsymbol\beta)\Omega D_\beta f(\boldsymbol\beta)')$，故

$$T_n=n(f(\hat{\boldsymbol\beta}_n)-\mathbf c)'\big(D_\beta f(\hat{\boldsymbol\beta}_n)\hat\Omega_n D_\beta f(\hat{\boldsymbol\beta}_n)'\big)^{-1}(f(\hat{\boldsymbol\beta}_n)-\mathbf c)\xrightarrow{d}\chi^2_p$$

$\phi_n=\mathbf 1\{T_n>c_{p,1-\alpha}\}$。

4.4.3 Lagrange multiplier (LM) test. The constrained least squares (CLS) estimator $\tilde{\boldsymbol\beta}_n$ solves $\min_{\boldsymbol\beta:\mathbf R\boldsymbol\beta=\mathbf c}\frac1n\sum(Y_i-\mathbf X_i'\boldsymbol\beta)^2$. The Lagrangian

$$\mathcal L(\boldsymbol\beta,\boldsymbol\lambda)=\frac1{2n}\sum(Y_i-\mathbf X_i'\boldsymbol\beta)^2+\boldsymbol\lambda'(\mathbf R\boldsymbol\beta-\mathbf c)$$

(the factor $\frac12$ cancels the $2$ produced by differentiating the square). The first-order conditions are $\frac{\partial\mathcal L}{\partial\boldsymbol\beta}=-\frac1n\sum\mathbf X_i(Y_i-\mathbf X_i'\tilde{\boldsymbol\beta}_n)+\mathbf R'\tilde{\boldsymbol\lambda}_n=\mathbf 0$ and $\frac{\partial\mathcal L}{\partial\boldsymbol\lambda}=\mathbf R\tilde{\boldsymbol\beta}_n-\mathbf c=\mathbf 0$. Solving these for the multiplier gives

$$\tilde{\boldsymbol\lambda}_n=\Big(\mathbf R\big(\tfrac1n\sum\mathbf X_i\mathbf X_i'\big)^{-1}\mathbf R'\Big)^{-1}(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)$$

Under $H_0$ ($\mathbf R\boldsymbol\beta=\mathbf c$), $\tilde{\boldsymbol\lambda}_n\xrightarrow{p}0$ and $\sqrt n\tilde{\boldsymbol\lambda}_n\xrightarrow{d}N(0,\Pi)$, where $\Pi=(\mathbf R\mathbb E[\mathbf X\mathbf X']^{-1}\mathbf R')^{-1}\mathbf R\mathbb E[\mathbf X\mathbf X']^{-1}\mathrm{Var}(\mathbf Xu)\mathbb E[\mathbf X\mathbf X']^{-1}\mathbf R'(\mathbf R\mathbb E[\mathbf X\mathbf X']^{-1}\mathbf R')^{-1}$. With $\hat\Pi$ a consistent estimator of $\Pi$, the test statistic is $T_n=n\tilde{\boldsymbol\lambda}_n'\hat\Pi^{-1}\tilde{\boldsymbol\lambda}_n\xrightarrow{d}\chi^2_p$ under $H_0$, and the test is $\phi_n=\mathbf 1\{T_n>c_{p,1-\alpha}\}$.

Important

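LM test = Wald test. Take $\hat\Pi=\hat{\mathbf B}\,\mathbf R\hat\Omega_n\mathbf R'\,\hat{\mathbf B}$ with $\hat{\mathbf B}=\big(\mathbf R(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}\mathbf R'\big)^{-1}$. Substituting $\tilde{\boldsymbol\lambda}_n=\hat{\mathbf B}(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)$ and rearranging the LM statistic, one shows $T_n=n(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)'(\mathbf R\hat\Omega_n\mathbf R')^{-1}(\mathbf R\hat{\boldsymbol\beta}_n-\mathbf c)$, which is exactly the Wald statistic, so the two tests are equivalent with the same asymptotic $\chi^2_p$ distribution.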
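The equality of the two statistics can be verified numerically. The sketch below assumes the plug-in $\hat\Pi$ is built from the sample second-moment matrix and the robust $\hat\Omega_n$ computed with unrestricted OLS residuals; the data and restriction are illustrative. It also recovers the constrained estimator from the first-order condition as $\tilde{\boldsymbol\beta}_n=\hat{\boldsymbol\beta}_n-(\frac1n\sum\mathbf X_i\mathbf X_i')^{-1}\mathbf R'\tilde{\boldsymbol\lambda}_n$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
# Illustrative heteroskedastic errors.
y = X @ np.array([1.0, 0.3, -0.2]) + rng.normal(size=n) * (1 + np.abs(X[:, 1]))
R = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])   # H0: beta_1 = beta_2 = 0
c = np.zeros(2)

Q_inv = np.linalg.inv(X.T @ X / n)
beta_hat = Q_inv @ (X.T @ y / n)
u_hat = y - X @ beta_hat
Omega_hat = Q_inv @ ((X * u_hat[:, None] ** 2).T @ X / n) @ Q_inv

# Multiplier and constrained estimator from the first-order conditions.
B = np.linalg.inv(R @ Q_inv @ R.T)
lam = B @ (R @ beta_hat - c)
beta_tilde = beta_hat - Q_inv @ R.T @ lam

# Plug-in estimate of Pi, then the LM and Wald statistics.
Pi_hat = B @ (R @ Omega_hat @ R.T) @ B
lm = n * lam @ np.linalg.solve(Pi_hat, lam)
d = R @ beta_hat - c
wald = n * d @ np.linalg.solve(R @ Omega_hat @ R.T, d)
print(lm, wald)   # the two numbers agree up to floating-point error
```

If $\hat\Pi$ were instead built from restricted residuals, the two statistics would generally differ in finite samples while sharing the same limit under $H_0$.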

4.4.4 Nonlinear restrictions (delta method). Test $H_0:f(\boldsymbol\beta)=\mathbf c$ vs $H_1:f(\boldsymbol\beta)\ne\mathbf c$, where $f:\mathbb R^{k+1}\to\mathbb R^p$ is continuously differentiable at $\boldsymbol\beta$ with full-row-rank Jacobian $D_\beta f(\boldsymbol\beta)=\frac{\partial f(\boldsymbol\beta)}{\partial\boldsymbol\beta'}$ ($p\times(k+1)$, $p\le k+1$). By the delta method $\sqrt n(f(\hat{\boldsymbol\beta}_n)-f(\boldsymbol\beta))\xrightarrow{d}N(0,D_\beta f(\boldsymbol\beta)\Omega D_\beta f(\boldsymbol\beta)')$, so

$$T_n=n(f(\hat{\boldsymbol\beta}_n)-\mathbf c)'\big(D_\beta f(\hat{\boldsymbol\beta}_n)\hat\Omega_n D_\beta f(\hat{\boldsymbol\beta}_n)'\big)^{-1}(f(\hat{\boldsymbol\beta}_n)-\mathbf c)\xrightarrow{d}\chi^2_p\quad(\text{under }H_0)$$

and $\phi_n=\mathbf 1\{T_n>c_{p,1-\alpha}\}$.
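As a sketch of the delta-method test, consider the hypothetical restriction $f(\boldsymbol\beta)=\beta_1/\beta_2$ with $p=1$, whose Jacobian is $(0,\ 1/\beta_2,\ -\beta_1/\beta_2^2)$. The function `delta_wald`, the restriction and the simulated data are assumptions for illustration; the statistic is $n\,(f(\hat{\boldsymbol\beta}_n)-\mathbf c)'\big(D_\beta f(\hat{\boldsymbol\beta}_n)\hat\Omega_n D_\beta f(\hat{\boldsymbol\beta}_n)'\big)^{-1}(f(\hat{\boldsymbol\beta}_n)-\mathbf c)$ with the robust $\hat\Omega_n$.

```python
from statistics import NormalDist
import numpy as np

def delta_wald(beta_hat, Omega_hat, n, f, jac, c):
    """Wald-type statistic for H0: f(beta) = c via the delta method."""
    g = np.atleast_1d(f(beta_hat)) - np.atleast_1d(c)
    D = np.atleast_2d(jac(beta_hat))          # p x (k+1) Jacobian at beta_hat
    return n * g @ np.linalg.solve(D @ Omega_hat @ D.T, g)

# Hypothetical restriction: the ratio beta_1 / beta_2 and its Jacobian.
f = lambda b: b[1] / b[2]
jac = lambda b: np.array([0.0, 1.0 / b[2], -b[1] / b[2] ** 2])

# Illustrative data with true beta = (1, 2, 4), so the true ratio is 0.5.
rng = np.random.default_rng(3)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, 4.0]) + rng.normal(size=n)

Q_inv = np.linalg.inv(X.T @ X / n)
beta_hat = Q_inv @ (X.T @ y / n)
u_hat = y - X @ beta_hat
Omega_hat = Q_inv @ ((X * u_hat[:, None] ** 2).T @ X / n) @ Q_inv

crit = NormalDist().inv_cdf(0.975) ** 2       # 0.95 quantile of chi^2_1
T_true = delta_wald(beta_hat, Omega_hat, n, f, jac, 0.5)    # H0 holds
T_false = delta_wald(beta_hat, Omega_hat, n, f, jac, 1.0)   # H0 fails
```

The Jacobian is evaluated at $\hat{\boldsymbol\beta}_n$, which is why $f$ must be continuously differentiable near $\boldsymbol\beta$; the ratio example breaks down if $\beta_2$ is close to zero.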

4.5 Generalized Least Squares (GLS) Estimator

高斯-马尔可夫定理表明 OLS 在同方差下是 BLUE；但异方差下 OLS 不再最优，广义最小二乘（GLS） 更有效。设 $(Y_i,\mathbf X_i)$ i.i.d.，$Y_i\in\mathbb R$、$\mathbf X_i\in\mathbb R^{k+1}$，$\mathbb E[Y_i\mid\mathbf X_i]=\mathbf X_i'\boldsymbol\beta$、$\mathrm{Var}(Y_i\mid\mathbf X_i)=\sigma^2(\mathbf X_i)$（已知、$>0$）。$\mathbb D=\mathrm{diag}(\sigma^2(\mathbf X_1),\dots,\sigma^2(\mathbf X_n))$，$\mathbb X$ 列线性独立。

4.5.1 无偏性. 估计量 $\tilde{\boldsymbol\beta}_n=\mathbb A'\mathbb Y$，GLS 取 $\mathbb A'=(\mathbb X'\mathbb D^{-1}\mathbb X)^{-1}\mathbb X'\mathbb D^{-1}$。$\mathbb X'\mathbb D^{-1}\mathbb X$ 正定（故可逆）：对 $\mathbf c\ne\mathbf 0$，$\mathbf c'\mathbb X'\mathbb D^{-1}\mathbb X\mathbf c=\sum_i\frac{m_i^2}{\sigma^2(\mathbf X_i)}>0$（$\mathbb X\mathbf c=(m_1,\dots,m_n)'\ne\mathbf 0$）。GLS 无偏：$\mathbb A'\mathbb X=\mathbb I$。

4.5.2 条件方差协方差矩阵.

$$\mathrm{Var}(\tilde{\boldsymbol\beta}_n\mid\mathbb X)=\mathbb A'\mathbb D\mathbb A=(\mathbb X'\mathbb D^{-1}\mathbb X)^{-1} \tag{4.28}$$

（$\mathrm{Var}(\mathbb Y\mid\mathbb X)=\mathbb D$、$\mathbb A'=(\mathbb X'\mathbb D^{-1}\mathbb X)^{-1}\mathbb X'\mathbb D^{-1}$ 代入化简）。

The Gauss-Markov theorem shows OLS is BLUE under homoskedasticity; but under heteroskedasticity OLS is no longer efficient, and generalized least squares (GLS) is more efficient. Let $(Y_i,\mathbf X_i)$ be i.i.d., $Y_i\in\mathbb R$, $\mathbf X_i\in\mathbb R^{k+1}$, $\mathbb E[Y_i\mid\mathbf X_i]=\mathbf X_i'\boldsymbol\beta$, $\mathrm{Var}(Y_i\mid\mathbf X_i)=\sigma^2(\mathbf X_i)$ (known, $>0$). Let $\mathbb D=\mathrm{diag}(\sigma^2(\mathbf X_1),\dots,\sigma^2(\mathbf X_n))$, with $\mathbb X$'s columns linearly independent.

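4.5.1 Unbiasedness. Consider linear estimators $\tilde{\boldsymbol\beta}_n=\mathbb A'\mathbb Y$; GLS takes $\mathbb A'=(\mathbb X'\mathbb D^{-1}\mathbb X)^{-1}\mathbb X'\mathbb D^{-1}$. The matrix $\mathbb X'\mathbb D^{-1}\mathbb X$ is positive definite (hence invertible): for $\mathbf c\ne\mathbf 0$, $\mathbf c'\mathbb X'\mathbb D^{-1}\mathbb X\mathbf c=\sum_i\frac{m_i^2}{\sigma^2(\mathbf X_i)}>0$, where $\mathbb X\mathbf c=(m_1,\dots,m_n)'\ne\mathbf 0$ because the columns of $\mathbb X$ are linearly independent. GLS is unbiased: since $\mathbb A'\mathbb X=\mathbb I$, $\mathbb E[\tilde{\boldsymbol\beta}_n\mid\mathbb X]=\mathbb A'\mathbb E[\mathbb Y\mid\mathbb X]=\mathbb A'\mathbb X\boldsymbol\beta=\boldsymbol\beta$.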

4.5.2 Conditional variance-covariance matrix.

$$\mathrm{Var}(\tilde{\boldsymbol\beta}_n\mid\mathbb X)=\mathbb A'\mathbb D\mathbb A=(\mathbb X'\mathbb D^{-1}\mathbb X)^{-1} \tag{4.28}$$

(substitute $\mathrm{Var}(\mathbb Y\mid\mathbb X)=\mathbb D$ and $\mathbb A'=(\mathbb X'\mathbb D^{-1}\mathbb X)^{-1}\mathbb X'\mathbb D^{-1}$ and simplify).
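The GLS formulas with diagonal $\mathbb D$ can be checked directly. In the sketch below the skedastic function $\sigma^2(x)=x^2$ and the data are assumptions for illustration. It verifies $\mathbb A'\mathbb X=\mathbb I$ and (4.28), and also shows the equivalent computation as OLS on observations rescaled by $1/\sigma(\mathbf X_i)$ (weighted least squares).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(1.0, 3.0, size=n)
X = np.column_stack([np.ones(n), x])
sigma2 = x ** 2                         # assumed known Var(Y_i | X_i)
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * np.sqrt(sigma2)

# A' = (X' D^{-1} X)^{-1} X' D^{-1}, exploiting that D is diagonal.
XtDinv = X.T / sigma2                   # X' D^{-1}, shape (k+1, n)
A_t = np.linalg.solve(XtDinv @ X, XtDinv)
beta_gls = A_t @ y

# Equivalent route: OLS after dividing each observation by sigma(X_i).
w = 1.0 / np.sqrt(sigma2)
beta_wls, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)

# Conditional variance A' D A, which should equal (X' D^{-1} X)^{-1}.
var_gls = A_t @ (sigma2[:, None] * A_t.T)
```

The rescaling view explains why GLS restores the Gauss-Markov setting: the transformed errors $u_i/\sigma(\mathbf X_i)$ have conditional variance one.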

4.5.3 异方差下 GLS 是 BLUE. 在所有线性无偏估计量中，GLS 的条件方差最小。

Note

证明（GLS BLUE）设另一线性无偏估计 $\hat{\boldsymbol\beta}_n=\hat{\mathbb A}'\mathbb Y$（$\hat{\mathbb A}'\mathbb X=\mathbb I$），需证 $\mathrm{Var}(\hat{\boldsymbol\beta}_n\mid\mathbb X)-\mathrm{Var}(\tilde{\boldsymbol\beta}_n\mid\mathbb X)$ 半正定。令 $\mathbb C=\hat{\mathbb A}-\mathbb A$。由 $\mathbb C'\mathbb X=\hat{\mathbb A}'\mathbb X-\mathbb A'\mathbb X=\mathbf 0$ 与 $\mathbb A'\mathbb D=(\mathbb X'\mathbb D^{-1}\mathbb X)^{-1}\mathbb X'$ 得 $\mathbb A'\mathbb D\mathbb C=(\mathbb X'\mathbb D^{-1}\mathbb X)^{-1}\mathbb X'\mathbb C=\mathbf 0$，转置得 $\mathbb C'\mathbb D\mathbb A=\mathbf 0$，于是 $$\mathrm{Var}(\hat{\boldsymbol\beta}_n\mid\mathbb X)-\mathrm{Var}(\tilde{\boldsymbol\beta}_n\mid\mathbb X)=(\mathbb A+\mathbb C)'\mathbb D(\mathbb A+\mathbb C)-\mathbb A'\mathbb D\mathbb A=\mathbb C'\mathbb D\mathbb C$$ 对任意 $\mathbf s\in\mathbb R^{k+1}$，记 $\mathbf m=\mathbb C\mathbf s\in\mathbb R^n$，则 $\mathbf s'\mathbb C'\mathbb D\mathbb C\mathbf s=\mathbf m'\mathbb D\mathbf m=\sum_i m_i^2\sigma^2(\mathbf X_i)\ge0$，半正定。故 GLS 在高斯-马尔可夫意义下「最优」。$\blacksquare$

Tip

Remark 4.4（FGLS）有时 $\sigma^2(\mathbf X_i)$ 不可知，需先估计，称可行 GLS（FGLS）。FGLS 在异方差下通常比 OLS 更有效（近似 GLS），但因估计 $\sigma^2(\mathbf X_i)$ 损失部分效率，OLS 与 FGLS 的效率比较不确定。

4.5.4 一般情形. 回归 $Y_i=\mathbf X_i'\boldsymbol\beta+\varepsilon_i$、堆叠 $\mathbb Y=\mathbb X\boldsymbol\beta+\boldsymbol\varepsilon$，条件方差协方差 $\mathrm{Cov}(\boldsymbol\varepsilon\mid\mathbb X)=\boldsymbol\Sigma$（不要求对角）。GLS 解

$$\min_{\boldsymbol\beta}(\mathbb Y-\mathbb X\boldsymbol\beta)'\boldsymbol\Sigma^{-1}(\mathbb Y-\mathbb X\boldsymbol\beta)$$

一阶条件 $-2(\mathbb Y-\mathbb X\boldsymbol\beta)'\boldsymbol\Sigma^{-1}\mathbb X=\mathbf 0_{1\times K}$ 给出

$$(\mathbb X'\boldsymbol\Sigma^{-1}\mathbb X)\boldsymbol\beta=\mathbb X'\boldsymbol\Sigma^{-1}\mathbb Y$$

即 $\boldsymbol\beta=(\mathbb X'\boldsymbol\Sigma^{-1}\mathbb X)^{-1}\mathbb X'\boldsymbol\Sigma^{-1}\mathbb Y$。这是一般情形，因 $\boldsymbol\Sigma$ 不必对角（可含序列相关等）。

4.5.3 GLS is BLUE under heteroskedasticity. Among all linear unbiased estimators, GLS has the smallest conditional variance.

Note

Proof (GLS BLUE) Let $\hat{\boldsymbol\beta}_n=\hat{\mathbb A}'\mathbb Y$ be another linear unbiased estimator, so $\hat{\mathbb A}'\mathbb X=\mathbb I$; we show $\mathrm{Var}(\hat{\boldsymbol\beta}_n\mid\mathbb X)-\mathrm{Var}(\tilde{\boldsymbol\beta}_n\mid\mathbb X)$ is positive semi-definite. Let $\mathbb C=\hat{\mathbb A}-\mathbb A$. Since $\mathbb C'\mathbb X=\hat{\mathbb A}'\mathbb X-\mathbb A'\mathbb X=\mathbf 0$ and $\mathbb A'\mathbb D=(\mathbb X'\mathbb D^{-1}\mathbb X)^{-1}\mathbb X'$, we get $\mathbb A'\mathbb D\mathbb C=(\mathbb X'\mathbb D^{-1}\mathbb X)^{-1}\mathbb X'\mathbb C=\mathbf 0$ and, transposing, $\mathbb C'\mathbb D\mathbb A=\mathbf 0$, so $$\mathrm{Var}(\hat{\boldsymbol\beta}_n\mid\mathbb X)-\mathrm{Var}(\tilde{\boldsymbol\beta}_n\mid\mathbb X)=(\mathbb A+\mathbb C)'\mathbb D(\mathbb A+\mathbb C)-\mathbb A'\mathbb D\mathbb A=\mathbb C'\mathbb D\mathbb C$$ For any $\mathbf s\in\mathbb R^{k+1}$, write $\mathbf m=\mathbb C\mathbf s\in\mathbb R^n$; then $\mathbf s'\mathbb C'\mathbb D\mathbb C\mathbf s=\mathbf m'\mathbb D\mathbf m=\sum_i m_i^2\sigma^2(\mathbf X_i)\ge0$, so $\mathbb C'\mathbb D\mathbb C$ is positive semi-definite. Hence GLS is "best" in the Gauss-Markov sense. $\blacksquare$
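The key steps of the proof can be checked numerically by taking OLS as the competing linear unbiased estimator, with $\hat{\mathbb A}'=(\mathbb X'\mathbb X)^{-1}\mathbb X'$. The design and the variance function $\sigma^2(x)=x^2$ below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
x = rng.uniform(1.0, 3.0, size=n)
X = np.column_stack([np.ones(n), x])
d = x ** 2                                   # diagonal of D (assumed known)

# A for GLS and A_hat for OLS, both stored as n x (k+1) matrices.
A_gls = np.linalg.solve((X.T / d) @ X, X.T / d).T
A_ols = np.linalg.solve(X.T @ X, X.T).T
C = A_ols - A_gls

var_gls = A_gls.T @ (d[:, None] * A_gls)     # A' D A
var_ols = A_ols.T @ (d[:, None] * A_ols)     # A_hat' D A_hat
cross = A_gls.T @ (d[:, None] * C)           # A' D C, zero by the proof
gap = var_ols - var_gls                      # equals C' D C

# Eigenvalues of the (symmetrized) gap are nonnegative up to rounding.
eigs = np.linalg.eigvalsh((gap + gap.T) / 2)
print(eigs)
```

Nonnegative eigenvalues of the gap mean every linear combination $\mathbf s'\boldsymbol\beta$ is estimated at least as precisely by GLS as by OLS in this design.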

Tip

Remark 4.4 (FGLS) When $\sigma^2(\mathbf X_i)$ is unknown it must be estimated first; the resulting estimator is called feasible GLS (FGLS). FGLS approximates GLS and is usually more efficient than OLS under heteroskedasticity, but estimating $\sigma^2(\mathbf X_i)$ costs some efficiency, so FGLS is not guaranteed to be more efficient than OLS.
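One common way to make GLS feasible is sketched below. It is not a method given in these notes: it assumes a log-linear variance model $\log\sigma^2(\mathbf X_i)=\mathbf X_i'\boldsymbol\gamma$, estimates $\boldsymbol\gamma$ by regressing $\log\hat u_i^2$ on $\mathbf X_i$, and then runs weighted least squares with the fitted variances. The data-generating process is likewise an assumption, chosen so that the variance model is correctly specified.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
x = rng.uniform(0.0, 2.0, size=n)
X = np.column_stack([np.ones(n), x])
# Illustrative data: Var(u | x) = exp(x), so log sigma^2(x) is linear in x.
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * np.exp(0.5 * x)

# Step 1: OLS and its residuals.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
u_hat = y - X @ beta_ols

# Step 2: fit the assumed variance model by regressing log(u_hat^2) on X.
# The intercept is biased, but a common scale factor does not change WLS.
gamma, *_ = np.linalg.lstsq(X, np.log(u_hat ** 2), rcond=None)
sigma2_hat = np.exp(X @ gamma)

# Step 3: weighted least squares with the estimated variances.
w = 1.0 / np.sqrt(sigma2_hat)
beta_fgls, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
```

If the assumed variance model is badly misspecified, the weights can be far from $1/\sigma(\mathbf X_i)$, which is one reason the OLS versus FGLS efficiency ranking is not guaranteed.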

4.5.4 General case. Regression $Y_i=\mathbf X_i'\boldsymbol\beta+\varepsilon_i$, stacked $\mathbb Y=\mathbb X\boldsymbol\beta+\boldsymbol\varepsilon$, with conditional variance-covariance $\mathrm{Cov}(\boldsymbol\varepsilon\mid\mathbb X)=\boldsymbol\Sigma$ (not required to be diagonal). GLS solves

$$\min_{\boldsymbol\beta}(\mathbb Y-\mathbb X\boldsymbol\beta)'\boldsymbol\Sigma^{-1}(\mathbb Y-\mathbb X\boldsymbol\beta)$$

Assume $\boldsymbol\Sigma$ is positive definite and $\mathbb X$ has linearly independent columns, so that $\mathbb X'\boldsymbol\Sigma^{-1}\mathbb X$ is invertible. The first-order condition $-2(\mathbb Y-\mathbb X\hat{\boldsymbol\beta})'\boldsymbol\Sigma^{-1}\mathbb X=\mathbf 0_{1\times(k+1)}$ gives the normal equations

$$(\mathbb X'\boldsymbol\Sigma^{-1}\mathbb X)\hat{\boldsymbol\beta}=\mathbb X'\boldsymbol\Sigma^{-1}\mathbb Y$$

i.e. $\hat{\boldsymbol\beta}=(\mathbb X'\boldsymbol\Sigma^{-1}\mathbb X)^{-1}\mathbb X'\boldsymbol\Sigma^{-1}\mathbb Y$. This is the general case, since $\boldsymbol\Sigma$ need not be diagonal (it may include serial correlation, etc.).
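For a non-diagonal $\boldsymbol\Sigma$, the GLS solution $(\mathbb X'\boldsymbol\Sigma^{-1}\mathbb X)^{-1}\mathbb X'\boldsymbol\Sigma^{-1}\mathbb Y$ can be computed either directly or by whitening with a Cholesky factor $\boldsymbol\Sigma=\mathbb L\mathbb L'$ and running OLS of $\mathbb L^{-1}\mathbb Y$ on $\mathbb L^{-1}\mathbb X$. The sketch below uses a hypothetical AR(1)-type matrix $\Sigma_{ij}=\rho^{|i-j|}$ to mimic serial correlation; $\rho$, the sample size and the data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n, rho = 150, 0.6
idx = np.arange(n)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # assumed known Sigma
L = np.linalg.cholesky(Sigma)                         # Sigma = L L'

X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + L @ rng.normal(size=n) # errors with Cov = Sigma

# Direct formula, using solves instead of forming Sigma^{-1} explicitly.
Sinv_X = np.linalg.solve(Sigma, X)                    # Sigma^{-1} X
beta_gls = np.linalg.solve(X.T @ Sinv_X, Sinv_X.T @ y)

# Whitening: OLS of L^{-1} y on L^{-1} X gives the same estimate.
Xw = np.linalg.solve(L, X)
yw = np.linalg.solve(L, y)
beta_white, *_ = np.linalg.lstsq(Xw, yw, rcond=None)

# The first-order condition (Y - X beta)' Sigma^{-1} X = 0 holds at beta_gls.
foc = (y - X @ beta_gls) @ Sinv_X
```

Working with linear solves rather than an explicit inverse of $\boldsymbol\Sigma$ is a numerical design choice; both routes implement the same estimator.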

Important

本章脉络 从「总体 $\boldsymbol\beta$」到「样本 OLS」到「推断」到「更有效的 GLS」。 §4.1 在外生性 $\mathbb E[\mathbf Xu]=0$ 下解出总体 $\boldsymbol\beta=\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY]$，并用 FWL / OVB / 测量误差揭示「控制变量」与「偏误」的代数。§4.2 把总体矩换成样本矩得 OLS，几何上是投影。§4.3 是 OLS 的统计性质阶梯：均值独立 ⇒ 无偏；+ 同方差 ⇒ BLUE；基本假设 ⇒ 一致 + 渐近正态（三明治方差 $\Omega$，同方差退化为 $\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1}$）。§4.4 基于 $\sqrt n(\hat{\boldsymbol\beta}-\boldsymbol\beta)\xrightarrow{d}N(0,\Omega)$ 构造 $t$ / Wald / LM（与 Wald 等价）/ Delta 方法检验。§4.5 在异方差下用 GLS 恢复 BLUE。本章始终假设外生 $\mathbb E[\mathbf Xu]=0$；下一章处理内生 $\mathbb E[\mathbf Xu]\ne0$（工具变量）。

Important

Chapter arc From "population $\boldsymbol\beta$" to "sample OLS" to "inference" to "the more efficient GLS." §4.1 solves the population $\boldsymbol\beta=\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY]$ under exogeneity $\mathbb E[\mathbf Xu]=0$, and uses FWL / OVB / measurement error to reveal the algebra of "control variables" and "bias." §4.2 replaces population moments with sample moments to get OLS, which geometrically is a projection. §4.3 is the ladder of OLS's statistical properties: mean independence ⇒ unbiased; + homoskedasticity ⇒ BLUE; the baseline assumptions ⇒ consistent + asymptotically normal (the sandwich variance $\Omega$, degenerating to $\sigma^2\mathbb E[\mathbf X\mathbf X']^{-1}$ under homoskedasticity). §4.4 builds $t$ / Wald / LM (equivalent to Wald) / delta-method tests on $\sqrt n(\hat{\boldsymbol\beta}-\boldsymbol\beta)\xrightarrow{d}N(0,\Omega)$. §4.5 uses GLS to restore the BLUE property under heteroskedasticity. This chapter always assumes exogeneity $\mathbb E[\mathbf Xu]=0$; the next chapter handles endogeneity $\mathbb E[\mathbf Xu]\ne0$ (instrumental variables).