5. Linear Regression with Endogenous Variables

Jun He May 31, 2026

Econometrics, Endogeneity, Instrumental Variables, TSLS, Weak Instruments, LATE, Potential Outcomes, Study Note

Note

Chapter theme: endogenous ($\mathbb E[\mathbf Xu]\ne0$) linear regression and instrumental variables. §5.1 Sources of endogeneity: omitted variables, measurement error, and simultaneous causality all break $\mathbb E[\mathbf Xu]=0$; when $\mathbb E[\mathbf Xu]\ne0$, OLS is biased and inconsistent. §5.2 Population properties: introduce instruments $\mathbf Z$ (exogenous $\mathbb E[\mathbf Zu]=0$, relevant $\mathrm{rank}(\mathbb E[\mathbf Z\mathbf X'])=k+1$, order condition $l+1\ge k+1$), and from $\mathbb E[\mathbf Z\mathbf X']\boldsymbol\beta=\mathbb E[\mathbf ZY]$ (5.8) solve for $\boldsymbol\beta$: exact identification $\boldsymbol\beta=\mathbb E[\mathbf Z\mathbf X']^{-1}\mathbb E[\mathbf ZY]$ (5.9), over-identification $\boldsymbol\beta=(\boldsymbol\Pi'\mathbb E[\mathbf Z\mathbf Z']\boldsymbol\Pi)^{-1}\boldsymbol\Pi'\mathbb E[\mathbf ZY]$ (5.10). §5.3 Estimation: exact-identification IV $\hat{\boldsymbol\beta}_n=(\frac1n\sum\mathbf Z_i\mathbf X_i')^{-1}(\frac1n\sum\mathbf Z_iY_i)$ (5.21); over-identification TSLS $\hat{\boldsymbol\beta}_n=(\mathbb X'\mathbb P_Z\mathbb X)^{-1}\mathbb X'\mathbb P_Z\mathbb Y$. §5.4 TSLS properties: consistent, asymptotically normal (sandwich $\Omega$); the optimal instrument is $\boldsymbol\Pi$ (Proposition 5.1). §5.5 Weak instruments: when relevance is weak ($\pi\ll1/\sqrt n$) the finite-sample distribution is poor; the Anderson-Rubin test is unaffected by weak instruments. §5.6 TSLS under heterogeneous effects = LATE: under random assignment OLS = ATE; otherwise there is selection bias and instruments are needed; in the potential-outcomes framework people split into four types (never/always takers, compliers, defiers), and monotonicity rules out defiers, so the IV estimator = the local average treatment effect (LATE) among compliers (5.42).

5.1 Problems Caused by Endogeneity

In the linear model $Y=\mathbf X'\boldsymbol\beta+u$ (5.1), this chapter does not assume $\mathbb E[\mathbf Xu]=\mathbf 0$, but we can still normalize $\mathbb E[u]=0$ by absorbing any nonzero mean of $u$ into the intercept.

Important

Definition 5.1 (Exogenous/endogenous) In (5.1), $X_j$ is an exogenous variable if $\mathbb E[X_ju]=0$; an endogenous variable if $\mathbb E[X_ju]\ne0$.

5.1.1 Omitted variables. True model $Y=\beta_0+\beta_1X_1+\beta_2X_2+u$ (5.2); with both included, exogeneity $\mathbb E[\mathbf Xu]=0$ holds. But $X_2$ is unobservable, and we can only estimate $Y=\beta_0^*+\beta_1^*X_1+u^*$ (5.3). Rearranging gives $\beta_0^*=\beta_0+\mathbb E[X_2]\beta_2$, $\beta_1^*=\beta_1$, $u^*=u+(X_2-\mathbb E[X_2])\beta_2$ ($\mathbb E[u^*]=0$). But

$$\mathrm{Cov}(X_1,u^*)=\underbrace{\mathrm{Cov}(X_1,u)}_{=0}+\mathrm{Cov}(X_1,X_2)\beta_2=\mathrm{Cov}(X_1,X_2)\beta_2$$

so unless $\mathrm{Cov}(X_1,X_2)=0$ or $\beta_2=0$, $X_1$ is endogenous ($\mathbb E[X_1u^*]\ne0$).
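The covariance formula above can be checked by simulation. The following is a minimal sketch, not part of the original notes: the parameter values, the normal distributions, and the helper names `ols_slope` and `short_regression_slope` are all illustrative assumptions. It simulates the long model (5.2), runs the short regression (5.3), and compares the slope with $\beta_1+\mathrm{Cov}(X_1,X_2)\beta_2/\mathrm{Var}(X_1)$.

```python
import random

def ols_slope(xs, ys):
    """Slope from regressing ys on xs with an intercept: Cov(x, y) / Var(x)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var = sum((x - mx) ** 2 for x in xs) / n
    return cov / var

def short_regression_slope(n=50_000, beta1=1.0, beta2=2.0, rho=0.5, seed=0):
    """Simulate (5.2), then estimate the short regression (5.3) that omits X2."""
    rng = random.Random(seed)
    x1s, ys = [], []
    for _ in range(n):
        x1 = rng.gauss(0, 1)                 # Var(X1) = 1 (assumed)
        x2 = rho * x1 + rng.gauss(0, 1)      # Cov(X1, X2) = rho
        u = rng.gauss(0, 1)                  # exogenous error of the long model
        x1s.append(x1)
        ys.append(1.0 + beta1 * x1 + beta2 * x2 + u)
    return ols_slope(x1s, ys)

slope = short_regression_slope()
# The slope should be close to beta1 + rho * beta2 = 2.0 rather than beta1 = 1.0.
print(round(slope, 3))
```

Setting `rho=0.0` (or `beta2=0.0`) in this sketch removes the endogeneity, and the short-regression slope returns to roughly `beta1`.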

5.1.2 Measurement error. The true model is $Y=\beta_0+\mathbf X_1'\boldsymbol\beta_1+u$ (5.4) with exogeneity $\mathbb E[\mathbf Xu]=\mathbf 0$, but $\mathbf X_1$ is unobservable and we observe $\hat{\mathbf X}_1=\mathbf X_1+\mathbf v$ instead (classical measurement error: $\mathbb E[\mathbf v]=\mathbf 0$, $\mathrm{Cov}(\mathbf X_1,\mathbf v)=\mathbf 0$, $\mathrm{Cov}(u,\mathbf v)=\mathbf 0$). Substituting $\mathbf X_1=\hat{\mathbf X}_1-\mathbf v$ into (5.4) and writing the result as $Y=\beta_0^*+\hat{\mathbf X}_1'\boldsymbol\beta_1^*+u^*$ (5.5) gives $\beta_0^*=\beta_0$, $\boldsymbol\beta_1^*=\boldsymbol\beta_1$, and $u^*=-\mathbf v'\boldsymbol\beta_1+u$ ($\mathbb E[u^*]=0$); but

$$\mathrm{Cov}(\hat{\mathbf X}_1,u^*)=-\mathrm{Var}(\mathbf v)\boldsymbol\beta_1$$

so unless $\mathrm{Var}(\mathbf v)=\mathbf 0$ or $\boldsymbol\beta_1=\mathbf 0$, $\hat{\mathbf X}_1$ is endogenous.
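As a worked special case (a sketch, not from the original notes), take a single mismeasured regressor and write $\sigma_X^2=\mathrm{Var}(X_1)$ and $\sigma_v^2=\mathrm{Var}(v)$; both symbols are notation introduced here. Since $\mathrm{Var}(\hat X_1)=\sigma_X^2+\sigma_v^2$ under classical measurement error, the OLS slope on $\hat X_1$ converges to

$$\beta_1+\frac{\mathrm{Cov}(\hat X_1,u^*)}{\mathrm{Var}(\hat X_1)}=\beta_1-\frac{\sigma_v^2}{\sigma_X^2+\sigma_v^2}\beta_1=\frac{\sigma_X^2}{\sigma_X^2+\sigma_v^2}\beta_1$$

so the estimate is shrunk toward zero. For instance, if the noise is as variable as the true regressor ($\sigma_v^2=\sigma_X^2$), the limit is $\beta_1/2$.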

5.1.3 Simultaneous causality. Demand $Q^d=\beta_0^d+\beta_1^d\tilde P+u^d$ (5.6) and supply $Q^s=\beta_0^s+\beta_1^s\tilde P+u^s$ (5.7), with $\mathbb E[u^d]=\mathbb E[u^s]=0$ and $\mathbb E[u^du^s]=0$. Imposing market clearing $Q^d=Q^s$ and solving for the price gives

$$\tilde P=\frac{(\beta_0^s-\beta_0^d)+(u^s-u^d)}{\beta_1^d-\beta_1^s}$$

$\tilde P$ contains both $u^s$ and $u^d$, so $\mathbb E[\tilde Pu^d]=-\frac{\mathbb E[(u^d)^2]}{\beta_1^d-\beta_1^s}\ne0$ and $\mathbb E[\tilde Pu^s]=\frac{\mathbb E[(u^s)^2]}{\beta_1^d-\beta_1^s}\ne0$, i.e. $\tilde P$ is endogenous in both (5.6) and (5.7).

5.1.4 OLS is biased and inconsistent when $\mathbb E[\mathbf Xu]\ne0$. Since $\mathbb E[u\mid\mathbf X]=0\Rightarrow\mathbb E[\mathbf Xu]=0$ (law of iterated expectations), the contrapositive gives $\mathbb E[\mathbf Xu]\ne0\Rightarrow\mathbb E[u\mid\mathbf X]\ne0$. Biased: $\mathbb E[\hat{\boldsymbol\beta}_n\mid\mathbb X]=\boldsymbol\beta+(\mathbb X'\mathbb X)^{-1}\mathbb X'\underbrace{\mathbb E[\mathbb U\mid\mathbb X]}_{\ne\mathbf 0}\ne\boldsymbol\beta$. Inconsistent:

$$\hat{\boldsymbol\beta}_n\xrightarrow{p}\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf XY]=\boldsymbol\beta+\underbrace{\mathbb E[\mathbf X\mathbf X']^{-1}\mathbb E[\mathbf Xu]}_{\ne\mathbf 0}\ne\boldsymbol\beta$$
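The inconsistency can be seen numerically in the supply and demand example. The sketch below is illustrative only: the intercepts, slopes, unit-variance normal shocks, and the names `ols_slope` and `market_data` are assumptions made here. It generates equilibrium prices from the market-clearing solution and regresses quantity on price.

```python
import random

def ols_slope(xs, ys):
    """Slope from regressing ys on xs with an intercept: Cov(x, y) / Var(x)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var = sum((x - mx) ** 2 for x in xs) / n
    return cov / var

def market_data(n=50_000, b0d=10.0, b1d=-1.0, b0s=2.0, b1s=1.0, seed=0):
    """Draw equilibrium (price, quantity) pairs from (5.6) and (5.7)."""
    rng = random.Random(seed)
    prices, quantities = [], []
    for _ in range(n):
        ud, us = rng.gauss(0, 1), rng.gauss(0, 1)     # uncorrelated shocks
        p = ((b0s - b0d) + (us - ud)) / (b1d - b1s)   # market-clearing price
        q = b0d + b1d * p + ud                        # quantity on the demand curve
        prices.append(p)
        quantities.append(q)
    return prices, quantities

prices, quantities = market_data()
slope = ols_slope(prices, quantities)
# With equal shock variances the OLS limit is the average of the two slopes (0 here),
# which is neither the demand slope (-1) nor the supply slope (+1).
print(round(slope, 3))
```

With these assumed values, $\mathrm{Cov}(Q,\tilde P)=\beta_1^d\mathrm{Var}(\tilde P)+\mathrm{Cov}(u^d,\tilde P)=-0.5+0.5=0$, so the regression line is flat even though both structural slopes are nonzero.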

5.2 Population Properties of β

5.2.1 Assumptions for the instruments. Keep the assumptions that $\mathbf X$ has no perfect collinearity, $\mathbb E[\mathbf X\mathbf X']$ exists, and $\mathbb E[Y^2]<\infty$. Introduce instruments $\mathbf Z\in\mathbb R^{l+1}$, adding:

Instrument exogeneity: $\mathbb E[\mathbf Zu]=\mathbf 0$;
Instrument relevance (rank condition): $\mathrm{rank}(\mathbb E[\mathbf Z\mathbf X'])=k+1$;
Order condition: $l+1\ge k+1$ (necessary for relevance);
$\mathbf Z$ has no perfect collinearity, and $\mathbb E[\mathbf Z\mathbf Z']$, $\mathbb E[\mathbf Z\mathbf X']$ exist.

$\mathbf Z$ includes all exogenous variables in $\mathbf X$, in particular $Z_0=X_0=1$. By exogeneity $\mathbb E[\mathbf Zu]=\mathbb E[\mathbf Z(Y-\mathbf X'\boldsymbol\beta)]=\mathbf 0$:

$$\mathbb E[\mathbf Z\mathbf X']\boldsymbol\beta=\mathbb E[\mathbf ZY] \tag{5.8}$$

5.2.2 Exact identification. When $l+1=k+1$, $\boldsymbol\beta$ is exactly identified: $\mathbb E[\mathbf Z\mathbf X']$ is square and, by the rank condition, invertible:

$$\boldsymbol\beta=\mathbb E[\mathbf Z\mathbf X']^{-1}\mathbb E[\mathbf ZY] \tag{5.9}$$

5.2.3 Over-identification. When $l+1>k+1$, $\boldsymbol\beta$ is over-identified: $\mathbb E[\mathbf Z\mathbf X']$ is non-square and cannot be inverted, so introduce $\boldsymbol\Pi$. Define $\boldsymbol\Pi$ such that $\mathrm{BLP}(\mathbf X\mid\mathbf Z)=\boldsymbol\Pi'\mathbf Z$, i.e. $\boldsymbol\Pi=\mathbb E[\mathbf Z\mathbf Z']^{-1}\mathbb E[\mathbf Z\mathbf X']$ (5.25). Lemma 5.1: since $\mathbf Z$ has no perfect collinearity, $\mathbb E[\mathbf Z\mathbf Z']$ is invertible and $\mathrm{rank}(\boldsymbol\Pi)=\mathrm{rank}(\mathbb E[\mathbf Z\mathbf X'])=k+1$ (using the rank inequality $\mathrm{rank}(AB)\le\min\{\mathrm{rank}(A),\mathrm{rank}(B)\}$, Theorem 5.1), so $\boldsymbol\Pi'\mathbb E[\mathbf Z\mathbf X']=\boldsymbol\Pi'\mathbb E[\mathbf Z\mathbf Z']\boldsymbol\Pi$ is invertible. Left-multiply (5.8) by $\boldsymbol\Pi'$:

$$\boldsymbol\Pi'\mathbb E[\mathbf Z\mathbf Z']\boldsymbol\Pi\boldsymbol\beta=\boldsymbol\Pi'\mathbb E[\mathbf ZY]\Rightarrow\boldsymbol\beta=(\boldsymbol\Pi'\mathbb E[\mathbf Z\mathbf Z']\boldsymbol\Pi)^{-1}\boldsymbol\Pi'\mathbb E[\mathbf ZY] \tag{5.10}$$

5.2.4 The meaning of instrument relevance. Because the BLP error $\mathbf X-\boldsymbol\Pi'\mathbf Z$ is uncorrelated with $\mathbf Z$, $\mathbb E[\mathbf Z\mathbf X']=\mathbb E[\mathbf Z(\boldsymbol\Pi'\mathbf Z)']=\mathbb E[\mathbf Z\mathbf Z']\boldsymbol\Pi$; rank $k+1$ therefore requires $\boldsymbol\Pi$ to have full column rank. This means the endogenous variables must be related to the instruments in linearly independent ways: if an endogenous variable is unrelated to all instruments, i.e. $\mathbb E[\mathbf ZX_j]=\mathbf 0$, the corresponding column of $\boldsymbol\Pi$ is zero and the rank condition fails.
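A worked special case may help make the rank condition concrete. As an illustration (the scalar notation $Z_1$, $X_1$ is introduced here), take one endogenous regressor and one instrument, $\mathbf X=(1,X_1)'$ and $\mathbf Z=(1,Z_1)'$. Then

$$\mathbb E[\mathbf Z\mathbf X']=\begin{pmatrix}1&\mathbb E[X_1]\\ \mathbb E[Z_1]&\mathbb E[Z_1X_1]\end{pmatrix},\qquad\det\mathbb E[\mathbf Z\mathbf X']=\mathbb E[Z_1X_1]-\mathbb E[Z_1]\mathbb E[X_1]=\mathrm{Cov}(Z_1,X_1)$$

so relevance holds exactly when $\mathrm{Cov}(Z_1,X_1)\ne0$, and solving (5.8) gives

$$\beta_1=\frac{\mathrm{Cov}(Z_1,Y)}{\mathrm{Cov}(Z_1,X_1)},\qquad\beta_0=\mathbb E[Y]-\beta_1\mathbb E[X_1]$$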

5.3 Estimation

Let $(Y,\mathbf X,\mathbf Z,u)$ satisfy the above assumptions with an i.i.d. sample. Two cases: exact identification uses IV, over-identification uses two-stage least squares (TSLS).

5.3.1 Exact identification: IV estimator. The sample analog of $\boldsymbol\beta=\mathbb E[\mathbf Z\mathbf X']^{-1}\mathbb E[\mathbf ZY]$:

$$\hat{\boldsymbol\beta}_n=\Big(\frac1n\sum\mathbf Z_i\mathbf X_i'\Big)^{-1}\Big(\frac1n\sum\mathbf Z_iY_i\Big) \tag{5.21}$$

with first-order condition $\frac1n\sum\mathbf Z_i\hat u_i=\mathbf 0$ (5.22); matrix form $\hat{\boldsymbol\beta}_n=(\mathbb Z'\mathbb X)^{-1}\mathbb Z'\mathbb Y$.

5.3.2 Over-identification: TSLS estimator. $\boldsymbol\beta=(\boldsymbol\Pi'\mathbb E[\mathbf Z\mathbf X'])^{-1}\boldsymbol\Pi'\mathbb E[\mathbf ZY]$ (5.23) $=(\boldsymbol\Pi'\mathbb E[\mathbf Z\mathbf Z']\boldsymbol\Pi)^{-1}\boldsymbol\Pi'\mathbb E[\mathbf ZY]$ (5.24), with sample

$$\hat{\boldsymbol\beta}_n=\Big(\hat{\boldsymbol\Pi}_n'\frac1n\sum\mathbf Z_i\mathbf Z_i'\hat{\boldsymbol\Pi}_n\Big)^{-1}\hat{\boldsymbol\Pi}_n'\frac1n\sum\mathbf Z_iY_i \tag{5.30}$$

with $\hat{\boldsymbol\Pi}_n=(\frac1n\sum\mathbf Z_i\mathbf Z_i')^{-1}(\frac1n\sum\mathbf Z_i\mathbf X_i')$. Matrix form $\hat{\boldsymbol\beta}_n=(\mathbb X'\mathbb P_Z\mathbb X)^{-1}\mathbb X'\mathbb P_Z\mathbb Y$, where $\mathbb P_Z=\mathbb Z(\mathbb Z'\mathbb Z)^{-1}\mathbb Z'$ (the projection matrix of $\mathbf Z$) and $\tilde{\mathbb X}=\mathbb P_Z\mathbb X$ (the fitted value of $\mathbf X$ on $\mathbf Z$).

Tip

Remark 5.1 (IV vs TSLS) IV reading: an exactly-identified IV using the shorter instrument $\hat{\boldsymbol\Pi}_n'\mathbf Z_i$ (dimension $k+1$). TSLS reading: the first stage regresses $\mathbf X$ on $\mathbf Z$ to get the fitted $\tilde{\mathbb X}=\mathbb P_Z\mathbb X$, the second stage regresses $Y$ on $\tilde{\mathbb X}$. The two readings give the same estimator; if $\hat{\boldsymbol\Pi}_n$ is invertible (exact identification $l+1=k+1$), IV and TSLS coincide.
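The equivalence in Remark 5.1 can be verified on simulated data. This is a minimal sketch under simplifying assumptions that are not in the original notes: one endogenous regressor, two instruments, mean-zero variables so the intercept is dropped, and an illustrative data-generating process. The names `solve_sym_2x2`, `simulate`, and `estimate` are hypothetical helpers.

```python
import random

def solve_sym_2x2(a11, a12, a22, b1, b2):
    """Solve [[a11, a12], [a12, a22]] p = [b1, b2] by the explicit 2x2 inverse."""
    det = a11 * a22 - a12 * a12
    return (a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det

def simulate(n=20_000, beta=2.0, seed=1):
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z1, z2, v = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        u = 0.8 * v + 0.6 * rng.gauss(0, 1)   # Cov(u, v) != 0 makes X endogenous
        x = 1.0 * z1 + 0.5 * z2 + v           # first-stage relation
        data.append((z1, z2, x, beta * x + u))
    return data

def estimate(data):
    def total(f):
        return sum(f(z1, z2, x, y) for z1, z2, x, y in data)
    # First stage: Pi_hat = (Z'Z)^{-1} Z'X
    p1, p2 = solve_sym_2x2(
        total(lambda z1, z2, x, y: z1 * z1),
        total(lambda z1, z2, x, y: z1 * z2),
        total(lambda z1, z2, x, y: z2 * z2),
        total(lambda z1, z2, x, y: z1 * x),
        total(lambda z1, z2, x, y: z2 * x),
    )
    fitted = [p1 * z1 + p2 * z2 for z1, z2, _, _ in data]   # P_Z X
    xs = [row[2] for row in data]
    ys = [row[3] for row in data]
    dot = lambda a, b: sum(s * t for s, t in zip(a, b))
    tsls = dot(fitted, ys) / dot(fitted, fitted)   # TSLS reading: Y on fitted X
    iv = dot(fitted, ys) / dot(fitted, xs)         # IV reading: instrument Pi_hat'Z
    ols = dot(xs, ys) / dot(xs, xs)                # inconsistent benchmark
    return ols, iv, tsls

ols, iv, tsls = estimate(simulate())
print(round(ols, 3), round(iv, 3), round(tsls, 3))
```

The two readings agree up to floating-point rounding because $\tilde{\mathbb X}'\mathbb X=\mathbb X'\mathbb P_Z\mathbb X=\tilde{\mathbb X}'\tilde{\mathbb X}$, while OLS stays away from the true coefficient (2.0 in this sketch).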

5.4 Properties of the TSLS Estimator

Focus on the over-identification case (more general).

5.4.1 Consistency: $\hat{\boldsymbol\beta}_n\xrightarrow{p}\boldsymbol\beta$ (WLLN + CMT, replacing sample moments by population moments).
5.4.2 Limiting distribution: additionally assume $\mathrm{Var}(\mathbf Zu)$ exists; then $$\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)\xrightarrow{d}N(0,\Omega),\quad\Omega=\mathbb E[\boldsymbol\Pi'\mathbf Z\mathbf Z'\boldsymbol\Pi]^{-1}\boldsymbol\Pi'\mathrm{Var}(\mathbf Zu)\boldsymbol\Pi\,\mathbb E[\boldsymbol\Pi'\mathbf Z\mathbf Z'\boldsymbol\Pi]^{-1} \tag{5.34}$$ (expand (5.33) + CLT $\sqrt n\frac1n\sum\mathbf Z_iu_i\xrightarrow{d}N(0,\mathrm{Var}(\mathbf Zu))$ + Slutsky).
5.4.3 Efficiency (optimal instrument): with an arbitrary $\boldsymbol\Gamma$ ($\boldsymbol\Gamma'\mathbb E[\mathbf Z\mathbf X']$ invertible), construct $\tilde{\boldsymbol\beta}_n=(\frac1n\sum\boldsymbol\Gamma'\mathbf Z_i\mathbf X_i')^{-1}(\frac1n\sum\boldsymbol\Gamma'\mathbf Z_iY_i)$ with asymptotic variance $\tilde\Omega$.

Important

Proposition 5.1 (Optimal instrument) If $\mathbb E[u\mid\mathbf Z]=0$ and $\mathrm{Var}(u\mid\mathbf Z)=\sigma^2$, then $\boldsymbol\Gamma=\boldsymbol\Pi$ is "best" — for any $\boldsymbol\Gamma$ with $\mathrm{rank}(\boldsymbol\Gamma'\mathbb E[\mathbf Z\mathbf X'])=k+1$, $\Omega\le\tilde\Omega$. That is, TSLS (with weight $\boldsymbol\Pi$) has the smallest asymptotic variance among all linear combinations of instruments.

5.4.4 Estimating $\Omega$: the sample analog $\hat\Omega_n$ of (5.34) (the middle term $\frac1n\sum\mathbf Z_i\mathbf Z_i'\hat u_i^2\xrightarrow{p}\mathrm{Var}(\mathbf Zu)$, as in the OLS robust variance).
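To see what (5.34) and its sample analog look like in the simplest case, here is a worked specialization (an illustration under added assumptions: scalar $X$ and $Z$, no intercept, so $\boldsymbol\Pi=\pi=\mathbb E[ZX]/\mathbb E[Z^2]$). Since $\mathbb E[Zu]=0$, $\mathrm{Var}(Zu)=\mathbb E[Z^2u^2]$ and

$$\Omega=\frac{\pi^2\,\mathbb E[Z^2u^2]}{(\pi^2\,\mathbb E[Z^2])^2}=\frac{\mathbb E[Z^2u^2]}{\mathbb E[ZX]^2},\qquad\hat\Omega_n=\frac{\frac1n\sum Z_i^2\hat u_i^2}{\big(\frac1n\sum Z_iX_i\big)^2}$$

If in addition $\mathbb E[u^2\mid Z]=\sigma^2$, this reduces to $\Omega=\sigma^2/(\pi^2\,\mathbb E[Z^2])$, so a small first-stage coefficient $\pi$ inflates the asymptotic variance.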

5.5 Weak Instruments

When the rank condition $\mathrm{rank}(\mathbb E[\mathbf Z\mathbf X'])=k+1$ is "close to failing" (the matrix is nearly rank-deficient), the finite-sample distribution of $\sqrt n(\hat{\boldsymbol\beta}_n-\boldsymbol\beta)$ is poorly approximated by its normal limit.

Example 5.1 (weak instruments). $Y_i=\beta X_i+u_i$, $X_i=\pi Z_i+v_i$, $Z_i$ deterministic, $(u_i,v_i)$ jointly normal. For the exactly-identified IV estimator $\hat\beta_n=\frac{\frac1n\sum Z_iY_i}{\frac1n\sum Z_iX_i}$,

$$\sqrt n(\hat\beta_n-\beta)=\frac{\overbrace{\sqrt n\frac1n\sum Z_iu_i}^{\equiv A}}{\underbrace{\frac1n\sum Z_i^2\pi+\frac1n\sum Z_iv_i}_{\equiv B}}$$

By the CLT and WLLN, treating the denominator $B$ as a constant is a good approximation only if its mean is much larger than its standard deviation, i.e. $\overline{Z_n^2}\pi\gg\sqrt{\frac1n\overline{Z_n^2}\sigma_2^2}$ with $\sigma_2^2=\mathrm{Var}(v_i)$, which requires $\pi\gg\frac1{\sqrt n}$. If the instrument is weak ($\pi$ tiny), the approximation is very poor.

Example 5.2 (Anderson-Rubin test). Test $H_0:\boldsymbol\beta=\mathbf c$ vs $H_1:\boldsymbol\beta\ne\mathbf c$ without first estimating $\hat{\boldsymbol\beta}$. Under $H_0$ the error is $u_i(\mathbf c)=Y_i-\mathbf X_i'\mathbf c$, so instrument exogeneity implies $\mathbb E[\mathbf Z_i(Y_i-\mathbf X_i'\mathbf c)]=\mathbf 0$, and this moment condition is what we test. The statistic is

$$T_n=n\bar w_n(\mathbf c)'\hat\Sigma_n^{-1}\bar w_n(\mathbf c),\quad\bar w_n(\mathbf c)=\frac1n\sum\mathbf Z_i(Y_i-\mathbf X_i'\mathbf c),\ \hat\Sigma_n=\frac1n\sum\mathbf Z_i\mathbf Z_i'u_i^2(\mathbf c)$$

and the test is $\phi_n=\mathbf 1\{T_n>c_n\}$, where $c_n=\chi^2_{l+1,1-\alpha}$ is the $1-\alpha$ quantile of the chi-square distribution with $l+1$ degrees of freedom.

Tip

Remark 5.2 The AR test avoids estimating $\hat\beta$ first, directly comparing $\mathbb E[\mathbf Zu]$ with zero, not involving the rank condition (it does not connect $\mathbf Z$ with $\mathbf X$), so it does not suffer from the weak-instrument problem.
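Remark 5.2 can be illustrated with a small Monte Carlo. The sketch below makes several assumptions that go beyond the notes: a scalar model without intercept (so $l+1=1$), a randomly drawn rather than deterministic $Z_i$, an illustrative error structure, and the 5% critical value 3.841 of the chi-square distribution with one degree of freedom. The helper names `ar_statistic`, `draw`, and `rejection_rate` are hypothetical.

```python
import random

CHI2_1_95 = 3.841  # 0.95 quantile of chi-square with 1 degree of freedom

def ar_statistic(zs, xs, ys, c):
    """Scalar Anderson-Rubin statistic T_n for H0: beta = c."""
    n = len(zs)
    w = [z * (y - c * x) for z, x, y in zip(zs, xs, ys)]   # Z_i * u_i(c)
    w_bar = sum(w) / n
    sigma_hat = sum(wi * wi for wi in w) / n               # (1/n) sum Z_i^2 u_i(c)^2
    return n * w_bar * w_bar / sigma_hat

def draw(n, beta, pi, rng):
    """One sample from Y = beta*X + u, X = pi*Z + v with correlated (u, v)."""
    zs, xs, ys = [], [], []
    for _ in range(n):
        z, v = rng.gauss(0, 1), rng.gauss(0, 1)
        u = 0.8 * v + 0.6 * rng.gauss(0, 1)
        x = pi * z + v
        zs.append(z)
        xs.append(x)
        ys.append(beta * x + u)
    return zs, xs, ys

def rejection_rate(pi, reps=300, n=400, beta=2.0, seed=0):
    """Share of samples in which the AR test rejects the true beta."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        zs, xs, ys = draw(n, beta, pi, rng)
        rejections += ar_statistic(zs, xs, ys, beta) > CHI2_1_95
    return rejections / reps

rate_weak = rejection_rate(pi=0.01)   # very weak instrument
print(rate_weak)                      # should stay near the nominal 5% level
```

Under $H_0$ the statistic depends on the data only through $Z_iu_i$, so the first-stage coefficient `pi` plays no role in its null distribution; that is the sense in which the test is robust to weak instruments. Weak instruments still cost power, which this sketch does not examine.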

5.6 Interpretation of TSLS under Heterogeneous Effects

The homogeneous-effect model $Y_i=\mathbf X_i'\boldsymbol\beta+u_i$ ($\boldsymbol\beta$ the same for everyone) attributes all outcome differences between individuals with the same $\mathbf X_i$ to $u_i$. Relax it to heterogeneous effects, where $\boldsymbol\beta$ is random (varying by individual): $Y_i=\mathbf X_i'\boldsymbol\beta_i$ (5.35), the random-coefficient model. Consider a binary treatment $D_i\in\{0,1\}$: $Y_i=\beta_{0,i}+\beta_{1,i}D_i$ (5.36). Potential outcomes: $Y_i(0)=\beta_{0,i}$ (untreated), $Y_i(1)=\beta_{0,i}+\beta_{1,i}$ (treated). The treatment effect is $\beta_{1,i}=Y_i(1)-Y_i(0)$, and the average treatment effect (ATE) is $\mathbb E[\beta_{1,i}]=\mathbb E[Y_i(1)-Y_i(0)]$. The actual outcome follows the switching equation

$$Y_i=Y_i(1)D_i+Y_i(0)(1-D_i) \tag{5.37}$$

5.6.1 $(Y_i(0),Y_i(1))\perp D_i$ (random assignment). Independence ⇒ mean independence: $\mathbb E[Y_i(1)]=\mathbb E[Y_i(1)\mid D_i=1]=\mathbb E[Y_i(1)\mid D_i=0]$ (similarly for $Y_i(0)$) — potential outcomes are the same in the treatment and control groups. Then OLS $\hat\beta_1=\frac{\frac1n\sum(D_i-\bar D)(Y_i-\bar Y)}{\frac1n\sum(D_i-\bar D)^2}\xrightarrow{p}\frac{\mathrm{Cov}(Y_i,D_i)}{\mathrm{Var}(D_i)}=\mathbb E[Y_i(1)-Y_i(0)]$ = ATE (5.38).
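The last equality in (5.38) can be spelled out as follows (a worked step added here, with $p=\mathbb P(D_i=1)\in(0,1)$ as introduced notation). For binary $D_i$,

$$\mathrm{Cov}(Y_i,D_i)=\mathbb E[Y_iD_i]-\mathbb E[Y_i]\,p=p(1-p)\big(\mathbb E[Y_i\mid D_i=1]-\mathbb E[Y_i\mid D_i=0]\big),\qquad\mathrm{Var}(D_i)=p(1-p)$$

so the OLS limit is the difference in group means. By the switching equation (5.37) and independence,

$$\frac{\mathrm{Cov}(Y_i,D_i)}{\mathrm{Var}(D_i)}=\mathbb E[Y_i(1)\mid D_i=1]-\mathbb E[Y_i(0)\mid D_i=0]=\mathbb E[Y_i(1)]-\mathbb E[Y_i(0)]$$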

5.6.2 $(Y_i(0),Y_i(1))$ not independent of $D_i$ (self-selection). Then OLS

$$\beta_1=\mathbb E[Y_i(1)\mid D_i=1]-\mathbb E[Y_i(0)\mid D_i=0] \tag{5.39}$$

is "the mean of those actually treated" minus "the mean of those actually untreated" — it contains selection bias and is not the ATE (even if the true treatment effect is zero, $\beta_1$ can be nonzero). We need an instrument $Z_i\in\{0,1\}$. The first stage $D_i=\pi_{0,i}+\pi_{1,i}Z_i$, with potential treatments $D_i(0)=\pi_{0,i}$, $D_i(1)=\pi_{0,i}+\pi_{1,i}$, and switching equation $D_i=D_i(1)Z_i+D_i(0)(1-Z_i)$ (5.40). People split into four types:

never takers: never take treatment regardless of the instrument ($D_i(1)=D_i(0)=0$);
always takers: always take treatment regardless ($D_i(1)=D_i(0)=1$);
compliers: take treatment when encouraged ($1=D_i(1)>D_i(0)=0$);
defiers: do the opposite of the encouragement ($0=D_i(1)<D_i(0)=1$).

Three conditions: instrument exogeneity $(Y_i(1),Y_i(0),D_i(1),D_i(0))\perp Z_i$ (subjects randomly selected for instrument treatment); instrument relevance $\mathbb P(D_i(1)\ne D_i(0))>0$; monotonicity $\mathbb P(D_i(1)\ge D_i(0))=1$ (rules out defiers).

Model $Y_i=\beta_0+\beta_1D_i+u_i$, $D_i=\delta_0+\delta_1Z_i+v_i$. The IV estimator $\hat\beta^{IV}\xrightarrow{p}\beta_1=\frac{\mathrm{Cov}(Z_i,Y_i)}{\mathrm{Cov}(Z_i,D_i)}=\frac{\mathbb E[Y_i\mid Z_i=1]-\mathbb E[Y_i\mid Z_i=0]}{\mathbb E[D_i\mid Z_i=1]-\mathbb E[D_i\mid Z_i=0]}$ (5.41). The numerator $=\mathbb E[(Y_i(1)-Y_i(0))\mid D_i(1)>D_i(0)]\,\mathbb P(D_i(1)>D_i(0))$, the denominator $=\mathbb P(D_i(1)>D_i(0))$, so

$$\beta_1=\mathbb E[(Y_i(1)-Y_i(0))\mid D_i(1)>D_i(0)] \tag{5.42}$$

i.e. the local average treatment effect (LATE) among compliers — $D_i(1)>D_i(0)$ are exactly the compliers ($D_i(1)=1,D_i(0)=0$).
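The result (5.42) can be checked with a simulation of the ratio in (5.41). The sketch below is illustrative only: the type shares, the type-specific effects, and the function name `simulate_wald` are assumptions made here, and there are no defiers, so monotonicity holds by construction.

```python
import random

def simulate_wald(n=100_000, seed=0):
    """Ratio (5.41): difference in mean Y over difference in mean D across Z."""
    rng = random.Random(seed)
    # Per type: (D(0), D(1), mean of Y(0), treatment effect); values are illustrative.
    types = {
        "never":    (0, 0, 0.0, 0.0),
        "always":   (1, 1, 1.0, 3.0),
        "complier": (0, 1, 1.0, 2.0),
    }
    names, weights = ["never", "always", "complier"], [0.3, 0.2, 0.5]
    sums = {0: [0.0, 0.0, 0], 1: [0.0, 0.0, 0]}   # per Z: sum of Y, sum of D, count
    for _ in range(n):
        d0, d1, base, effect = types[rng.choices(names, weights)[0]]
        z = rng.randint(0, 1)            # instrument assigned at random
        d = d1 if z == 1 else d0         # switching equation (5.40)
        y0 = base + rng.gauss(0, 1)      # potential outcome Y(0)
        y = y0 + effect * d              # switching equation (5.37)
        s = sums[z]
        s[0] += y
        s[1] += d
        s[2] += 1
    mean_y = {z: s[0] / s[2] for z, s in sums.items()}
    mean_d = {z: s[1] / s[2] for z, s in sums.items()}
    return (mean_y[1] - mean_y[0]) / (mean_d[1] - mean_d[0])

wald = simulate_wald()
# Should be close to the complier effect 2.0, not the ATE 0.3*0 + 0.2*3 + 0.5*2 = 1.6.
print(round(wald, 3))
```

In this design the always takers have the largest effect, yet they contribute nothing to the ratio because the instrument never changes their treatment status.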

Important

Remark 5.4 (the meaning of LATE). LATE identifies only the average treatment effect among compliers, and it is tied to the instrument used: different instruments induce different complier groups, so the LATE may differ across instruments. Monotonicity is crucial: if defiers are present, the numerator of (5.41) becomes the compliers' average effect weighted by $\mathbb P(D_i(1)>D_i(0))$ minus the defiers' average effect weighted by $\mathbb P(D_i(1)<D_i(0))$, so the two groups can offset each other and the IV estimand need not equal the average effect of any subgroup.

Important

Chapter arc Endogeneity breaks OLS → instrumental variables repair it → under heterogeneous effects IV gives only the LATE. §5.1's three examples (omitted variables, measurement error, simultaneity) show $\mathbb E[\mathbf Xu]\ne0$ makes OLS biased and inconsistent. §5.2–5.3 introduce instruments $\mathbf Z$ satisfying exogeneity + relevance: exact identification → IV, over-identification → TSLS (= fit $\mathbf X$ with $\mathbf Z$ first, then regress). §5.4 gives consistency, asymptotic normality, and the optimal instrument $\boldsymbol\Pi$. §5.5 warns that weak instruments ($\pi\ll1/\sqrt n$) have poor finite-sample behavior, which the AR test can sidestep. §5.6, in the heterogeneous-effects / potential-outcomes framework, shows IV no longer identifies the ATE but the LATE among compliers (with monotonicity ruling out defiers being key). This chapter and the next (maximum likelihood) form the two pillars of estimation theory.