6. Maximum Likelihood Estimator

Jun He May 31, 2026

计量经济学 Econometrics · 极大似然 Maximum Likelihood · 似然函数 Likelihood · 一致性 Consistency · Fisher信息 Fisher Information · KL散度 Kullback-Leibler · Wald-Score-LR · 学习笔记 Study Note

Note

本章主题：极大似然（ML）估计。 §6.1 无条件 ML：似然 $\ell_n(\theta)=\prod p_\theta(x_i)$（联合密度）、对数似然 $L_n(\theta)=\frac1n\sum\log p_\theta(x_i)$、估计量 $\hat\theta_n\in\arg\max_\theta L_n(\theta)$；例子展示 ML 可能不唯一/不存在。§6.2 条件 ML：$\ell_n(\theta)=\prod p_\theta(y_i\mid x_i)$（无条件是其特例），如 Probit / Logit。§6.3 性质：一致性：$L_n(\theta)\xrightarrow{p}L(\theta)=\mathbb E_{\theta_0}[\log p_\theta(Y\mid X)]$，由 Jensen 不等式证 $L(\theta)$ 在真值 $\theta_0$ 唯一最大（引理 6.1），配合「近似最大化 + 一致收敛 + 良好分离」三条件得 $\hat\theta_n\xrightarrow{p}\theta_0$（引理 6.2）；误设定 → KL 散度：ML 收敛到最小化 Kullback-Leibler 散度的参数；极限分布 $\sqrt n(\hat\theta_n-\theta_0)\xrightarrow{d}N(0,\Omega)$，$\Omega=B^{-1}AB^{-1}$，由信息矩阵等式 $B=-A$ 简化为 $\Omega=A^{-1}=(-B)^{-1}$（Fisher 信息阵的逆）。§6.4 推断：检验 $f(\theta_0)=0$ 的三种渐近等价检验：Wald（比较 $f(\hat\theta_n)$ 与 0）、Score / LM（比较受约束估计的得分与 0）、似然比 LR（$2n[L_n(\hat\theta_n)-L_n(\tilde\theta_n)]\xrightarrow{d}\chi^2_p$，其中因子 $n$ 来自 $L_n$ 是平均对数似然；简单假设下由 Neyman-Pearson 引理为一致最优）。

Note

Chapter theme: maximum likelihood (ML) estimation. §6.1 Unconditional ML: likelihood $\ell_n(\theta)=\prod p_\theta(x_i)$ (the joint density), log-likelihood $L_n(\theta)=\frac1n\sum\log p_\theta(x_i)$, estimator $\hat\theta_n\in\arg\max_\theta L_n(\theta)$; examples show ML may be non-unique / non-existent. §6.2 Conditional ML: $\ell_n(\theta)=\prod p_\theta(y_i\mid x_i)$ (unconditional is a special case), e.g. Probit / Logit. §6.3 Properties: consistency: $L_n(\theta)\xrightarrow{p}L(\theta)=\mathbb E_{\theta_0}[\log p_\theta(Y\mid X)]$, with Jensen's inequality proving $L(\theta)$ uniquely maximized at the truth $\theta_0$ (Lemma 6.1), combined with "near-maximizer + uniform convergence + well-separatedness" to get $\hat\theta_n\xrightarrow{p}\theta_0$ (Lemma 6.2); misspecification → KL divergence: ML converges to the parameter minimizing the Kullback-Leibler divergence; limiting distribution $\sqrt n(\hat\theta_n-\theta_0)\xrightarrow{d}N(0,\Omega)$, $\Omega=B^{-1}AB^{-1}$, simplified by the information matrix equality $B=-A$ to $\Omega=A^{-1}=(-B)^{-1}$ (the inverse Fisher information). §6.4 Inference: three asymptotically equivalent tests of $f(\theta_0)=0$: Wald (compares $f(\hat\theta_n)$ with 0), Score / LM (compares the restricted estimator's score with 0), and the likelihood ratio LR ($2n[L_n(\hat\theta_n)-L_n(\tilde\theta_n)]\xrightarrow{d}\chi^2_p$, where the factor $n$ appears because $L_n$ is the average log-likelihood; under a simple hypothesis it is uniformly most powerful by the Neyman-Pearson Lemma).

6.1 Unconditional Maximum Likelihood Estimator

设 $X_1,\dots,X_n\overset{iid}\sim P=P_{\theta_0}$，$P_\theta$ 为带参数 $\theta\in\Theta$ 的分布（$\theta$ 未必是真值 $\theta_0$），$p_\theta$ 为对应密度（关于公共测度 $\mu$）。

Important

定义 6.1 / 6.2 / 6.3 无条件似然 $\ell_n(\theta)\equiv\prod_{i=1}^np_\theta(x_i)$（在实现值处求值的联合密度，i.i.d. 故为同型边际密度之积）。ML 估计量 $\hat\theta_n\in\arg\max_\theta\ell_n(\theta)$。无条件对数似然 $L_n(\theta)\equiv\frac1n\log\ell_n(\theta)=\frac1n\sum_{i=1}^n\log p_\theta(x_i)$。

Tip

Remark 6.1 / 6.2 似然与密度形式相同，区别仅在视角：密度把 $X_i$ 当变量、$\theta$ 当参数；似然把 $\theta$ 当变量、$X_i$ 当用于求值的已知数。又 $\log$ 单调递增，故 $\arg\max\ell_n=\arg\max L_n$。

例子.
例 6.1（$\mathrm{Unif}[0,\theta_0]$）：$p_\theta(x)=\frac1\theta\mathbf 1\{0\le x\le\theta\}$，$\ell_n(\theta)=\frac1{\theta^n}\mathbf 1\{0\le X_i\le\theta,\forall i\}$。示性函数要求 $\theta\ge\max X_i$，而 $\frac1{\theta^n}$ 关于 $\theta$ 递减，故 $\hat\theta_n=\max_{1\le i\le n}X_i$。
例 6.2（不存在）：支撑改为开区间 $(0,\theta_0)$，需 $\theta>\max X_i$（严格），似然的上确界在 $\theta\downarrow\max X_i$ 时逼近但取不到，故 ML 不存在（只能取近似最大）。
例 6.3（不唯一）：$p_\theta(x)=\mathbf 1\{\theta\le x\le\theta+1\}$，任意 $\theta\in[\max X_i-1,\min X_i]$ 皆使似然取最大值 1。
例 6.4（Bernoulli）：$p_\theta(x)=\theta^x(1-\theta)^{1-x}$，$L_n(\theta)=\log(\theta)\bar X_n+\log(1-\theta)(1-\bar X_n)$，f.o.c. $\Rightarrow\hat\theta_n=\bar X_n$（二阶条件 $<0$ 确认最大）。
例 6.5（删失正态）：$X=\max\{Z-\theta,0\}$、$Z\sim N(0,1)$，$p_\theta(x)=\Phi(\theta)\mathbf 1\{x=0\}+\phi(x+\theta)\mathbf 1\{x>0\}$，$\hat\theta_n$ 难显式刻画。

Let $X_1,\dots,X_n\overset{iid}\sim P=P_{\theta_0}$, where $P_\theta$ is a distribution with parameter $\theta\in\Theta$ ($\theta$ need not be the truth $\theta_0$), and $p_\theta$ the corresponding density (with respect to a common measure $\mu$).

Important

Definitions 6.1 / 6.2 / 6.3 Unconditional likelihood $\ell_n(\theta)\equiv\prod_{i=1}^np_\theta(x_i)$ (the joint density evaluated at the realized values; a product of same-form marginal densities by i.i.d.). ML estimator $\hat\theta_n\in\arg\max_\theta\ell_n(\theta)$. Unconditional log-likelihood $L_n(\theta)\equiv\frac1n\log\ell_n(\theta)=\frac1n\sum_{i=1}^n\log p_\theta(x_i)$.

Tip

Remark 6.1 / 6.2 Likelihood and density have the same form, differing only in viewpoint: the density treats $X_i$ as the variable and $\theta$ as the parameter; the likelihood treats $\theta$ as the variable and $X_i$ as known values used for evaluation. Since $\log$ is monotone increasing, $\arg\max\ell_n=\arg\max L_n$.

Examples.
Example 6.1 ($\mathrm{Unif}[0,\theta_0]$): $p_\theta(x)=\frac1\theta\mathbf 1\{0\le x\le\theta\}$, $\ell_n(\theta)=\frac1{\theta^n}\mathbf 1\{0\le X_i\le\theta,\forall i\}$. The indicator requires $\theta\ge\max X_i$, and $\frac1{\theta^n}$ is decreasing in $\theta$, so $\hat\theta_n=\max_{1\le i\le n}X_i$.
Example 6.2 (non-existence): with the support changed to the open interval $(0,\theta_0)$, we need $\theta>\max X_i$ (strict). The supremum of the likelihood is approached as $\theta\downarrow\max X_i$ but never attained, so the ML estimator does not exist (only a near-maximizer does).
Example 6.3 (non-uniqueness): $p_\theta(x)=\mathbf 1\{\theta\le x\le\theta+1\}$; any $\theta\in[\max X_i-1,\min X_i]$ attains the maximal likelihood value 1.
Example 6.4 (Bernoulli): $p_\theta(x)=\theta^x(1-\theta)^{1-x}$, $L_n(\theta)=\log(\theta)\bar X_n+\log(1-\theta)(1-\bar X_n)$, f.o.c. $\Rightarrow\hat\theta_n=\bar X_n$ (second-order condition $<0$ confirms a maximum).
Example 6.5 (censored normal): $X=\max\{Z-\theta,0\}$, $Z\sim N(0,1)$, $p_\theta(x)=\Phi(\theta)\mathbf 1\{x=0\}+\phi(x+\theta)\mathbf 1\{x>0\}$, with $\hat\theta_n$ hard to characterize explicitly.
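A minimal numerical sketch of the Bernoulli and uniform examples, assuming NumPy is available. The sample sizes, seed, grid, and the "true" values 0.3 and 2.0 are illustrative choices, not part of the notes; the helper name `bernoulli_avg_loglik` is hypothetical.

```python
import numpy as np

def bernoulli_avg_loglik(theta, x):
    """Average log-likelihood L_n(theta) for Bernoulli data x in {0, 1}."""
    xbar = np.mean(x)
    return xbar * np.log(theta) + (1.0 - xbar) * np.log(1.0 - theta)

rng = np.random.default_rng(0)            # arbitrary seed (assumption)
x = rng.binomial(1, 0.3, size=500)        # theta_0 = 0.3 is an assumed truth

# Grid search over (0, 1); the first-order condition gives the sample mean.
grid = np.linspace(0.001, 0.999, 999)
theta_grid = grid[np.argmax(bernoulli_avg_loglik(grid, x))]
theta_closed = x.mean()

# Unif[0, theta_0]: the likelihood is theta^(-n) for theta >= max(X_i) and 0
# otherwise, so the maximizer is the sample maximum.
u = rng.uniform(0.0, 2.0, size=500)       # theta_0 = 2.0 is an assumed truth
theta_unif = u.max()

print(theta_grid, theta_closed, theta_unif)
```

The grid maximizer should agree with the sample mean up to the grid spacing, and the uniform estimate always lies at or below the true upper bound, which hints at the non-standard behavior discussed later for this model.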

6.2 Conditional Maximum Likelihood Estimator

设 $(Y_i,X_i)$ i.i.d.，$Y_i$ 给定 $X_i$ 的条件分布 $Y_i\mid X_i\sim P_{\theta_0}$、$X_i$ 边际 $\sim P_X$，条件密度 $p_\theta(y\mid x)$。

Important

定义 6.4 / 6.5 / 6.6 条件似然 $\ell_n(\theta)\equiv\prod_{i=1}^np_\theta(y_i\mid x_i)$。条件 ML 估计量 $\hat\theta_n\in\arg\max_\theta\ell_n(\theta)$。条件对数似然 $L_n(\theta)\equiv\frac1n\log\ell_n(\theta)=\frac1n\sum\log p_\theta(y_i\mid x_i)$。

无条件似然是条件似然的特例：$X_i$ 退化为常数时，条件密度 $p_\theta(y\mid x)$ 即为无条件密度 $p_\theta(y)$。故下文均以更一般的条件情形讨论。

例 6.6（Probit / Logit）. $(Y_i,X_i)$ i.i.d.，$Y_i\in\{0,1\}$、$X_i\in\mathbb R^{k+1}$（首元为 1）、$\theta_0\in\mathbb R^{k+1}$。Probit：

$$p_\theta(y\mid x)=\Phi(x'\theta)^y[1-\Phi(x'\theta)]^{1-y}$$

$$L_n(\theta)=\frac1n\sum_{i=1}^n[Y_i\log\Phi(X_i'\theta)+(1-Y_i)\log(1-\Phi(X_i'\theta))]$$

若把 $\Phi$ 换成 logistic c.d.f. $G(z)=\frac{\exp(z)}{1+\exp(z)}$，即 Logit 模型。

Let $(Y_i,X_i)$ be i.i.d., with the conditional distribution $Y_i\mid X_i\sim P_{\theta_0}$, the marginal $X_i\sim P_X$, and conditional density $p_\theta(y\mid x)$.

Important

Definitions 6.4 / 6.5 / 6.6 Conditional likelihood $\ell_n(\theta)\equiv\prod_{i=1}^np_\theta(y_i\mid x_i)$. Conditional ML estimator $\hat\theta_n\in\arg\max_\theta\ell_n(\theta)$. Conditional log-likelihood $L_n(\theta)\equiv\frac1n\log\ell_n(\theta)=\frac1n\sum\log p_\theta(y_i\mid x_i)$.

The unconditional likelihood is a special case of the conditional one: when $X_i$ degenerates to a constant, the conditional density $p_\theta(y\mid x)$ reduces to the unconditional density $p_\theta(y)$. The discussion below therefore uses the more general conditional case.

Example 6.6 (Probit / Logit). $(Y_i,X_i)$ i.i.d., $Y_i\in\{0,1\}$, $X_i\in\mathbb R^{k+1}$ (first element 1), $\theta_0\in\mathbb R^{k+1}$. Probit:

$$p_\theta(y\mid x)=\Phi(x'\theta)^y[1-\Phi(x'\theta)]^{1-y}$$

$$L_n(\theta)=\frac1n\sum_{i=1}^n[Y_i\log\Phi(X_i'\theta)+(1-Y_i)\log(1-\Phi(X_i'\theta))]$$

Replacing $\Phi$ with the logistic c.d.f. $G(z)=\frac{\exp(z)}{1+\exp(z)}$ gives the Logit model.
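As an illustration of conditional ML, the following sketch maximizes the Logit average log-likelihood by Newton-Raphson on simulated data. It assumes NumPy; the function names `logit_avg_loglik` and `logit_mle`, the sample size, the seed, and the "true" parameter `theta0` are all hypothetical choices made for the example, and Newton-Raphson is just one convenient optimizer, not a method prescribed by the notes.

```python
import numpy as np

def logit_avg_loglik(theta, X, y):
    """L_n(theta) = mean of y*log G(x'theta) + (1-y)*log(1 - G(x'theta))."""
    z = X @ theta
    # log G(z) = -log(1 + exp(-z)) and log(1 - G(z)) = -log(1 + exp(z)),
    # both evaluated stably with logaddexp.
    return np.mean(-y * np.logaddexp(0.0, -z) - (1.0 - y) * np.logaddexp(0.0, z))

def logit_mle(X, y, iters=25):
    """Newton-Raphson on the average log-likelihood (concave in theta)."""
    theta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(iters):
        g = 1.0 / (1.0 + np.exp(-(X @ theta)))               # G(x_i' theta)
        score = X.T @ (y - g) / n                             # D_theta L_n
        hess = -(X * (g * (1.0 - g))[:, None]).T @ X / n      # D^2 L_n
        theta = theta - np.linalg.solve(hess, score)
    return theta

rng = np.random.default_rng(1)                                # arbitrary seed
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])         # first element is 1
theta0 = np.array([0.5, -1.0])                                # assumed truth
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(X @ theta0)))).astype(float)

theta_hat = logit_mle(X, y)
print(theta_hat)
```

For Probit the same structure applies with $\Phi$ in place of $G$, although the score and Hessian expressions change accordingly.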

6.3 Properties of the Maximum Likelihood Estimator

6.3.1 一致性. WLLN 提示 $L_n(\theta)=\frac1n\sum\log p_\theta(Y_i\mid X_i)\xrightarrow{p}\mathbb E_{\theta_0}[\log p_\theta(Y\mid X)]\equiv L(\theta)$。

Important

引理 6.1（极限下 ML 的唯一性）若对每个 $\theta\ne\theta_0$，$\mathbb P(p_\theta(Y\mid X)\ne p_{\theta_0}(Y\mid X))>0$，则 $L(\theta)=\mathbb E_{\theta_0}[\log p_\theta(Y\mid X)]$ 在 $\theta=\theta_0$ 唯一最大。

Note

证明（引理 6.1，Jensen）令 $M(\theta)\equiv L(\theta)-L(\theta_0)=\mathbb E_{\theta_0}[\log\frac{p_\theta(Y\mid X)}{p_{\theta_0}(Y\mid X)}]$。由 $\log$ 严格凹 + Jensen： $$M(\theta)\le\log\mathbb E_{\theta_0}\Big[\frac{p_\theta(Y\mid X)}{p_{\theta_0}(Y\mid X)}\Big]=\log\Big(\int\int\frac{p_\theta(y\mid x)}{p_{\theta_0}(y\mid x)}p_{\theta_0}(y\mid x)\,d\mu(y)\,dP_X(x)\Big)=\log 1=0 \tag{6.1}$$ （因 $p_\theta(y\mid x)$ 为条件密度积分为 1）。严格凹使不等式严格除非 $\mathbb P(\frac{p_\theta}{p_{\theta_0}}=c)=1$，而积分为 1 迫使 $c=1$；由假设 $\mathbb P(p_\theta\ne p_{\theta_0})>0$ 排除 $c=1$，故 $M(\theta)<0$ 对 $\theta\ne\theta_0$，即 $\theta_0$ 唯一最大 (6.2)。$\blacksquare$

Important

引理 6.2（ML 一致性的充分条件）若 $\hat\theta_n$ 满足：(1)近似最大化 $L_n(\hat\theta_n)\ge L_n(\theta_0)-o_P(1)$；(2)一致收敛 $\sup_{\theta\in\Theta}|L_n(\theta)-L(\theta)|\xrightarrow{p}0$；(3)良好分离：对任意 $\delta>0$，$\sup_{\theta\notin B_\delta(\theta_0)}L(\theta)<L(\theta_0)$，则 $\hat\theta_n\xrightarrow{p}\theta_0$。

（良好分离意味着：在去掉 $\theta_0$ 邻域 $B_\delta(\theta_0)$ 的定义域上，$L(\theta)$ 的上确界严格低于 $L(\theta_0)$——即 $L$ 在 $\theta_0$ 处有「孤立」尖峰，保证唯一最大值。引理 6.3/6.4/6.5 给出更原始的充分条件：近似最大化由上确界定义显然；良好分离需 $L$ 连续 + $\Theta$ 紧 + 唯一最大；一致收敛需 $\Theta$ 紧 + 存在支配函数。）

6.3.1 Consistency. WLLN suggests $L_n(\theta)=\frac1n\sum\log p_\theta(Y_i\mid X_i)\xrightarrow{p}\mathbb E_{\theta_0}[\log p_\theta(Y\mid X)]\equiv L(\theta)$.

Important

Lemma 6.1 (Uniqueness of ML in the limit) If for every $\theta\ne\theta_0$, $\mathbb P(p_\theta(Y\mid X)\ne p_{\theta_0}(Y\mid X))>0$, then $L(\theta)=\mathbb E_{\theta_0}[\log p_\theta(Y\mid X)]$ is uniquely maximized at $\theta=\theta_0$.

Note

Proof (Lemma 6.1, Jensen) Let $M(\theta)\equiv L(\theta)-L(\theta_0)=\mathbb E_{\theta_0}[\log\frac{p_\theta(Y\mid X)}{p_{\theta_0}(Y\mid X)}]$. By strict concavity of $\log$ + Jensen: $$M(\theta)\le\log\mathbb E_{\theta_0}\Big[\frac{p_\theta(Y\mid X)}{p_{\theta_0}(Y\mid X)}\Big]=\log\Big(\int\int\frac{p_\theta(y\mid x)}{p_{\theta_0}(y\mid x)}p_{\theta_0}(y\mid x)\,d\mu(y)\,dP_X(x)\Big)=\log 1=0 \tag{6.1}$$ (since $p_\theta(y\mid x)$ is a conditional density integrating to 1). Strict concavity makes it strict unless $\mathbb P(\frac{p_\theta}{p_{\theta_0}}=c)=1$, and integrating to 1 forces $c=1$; the assumption $\mathbb P(p_\theta\ne p_{\theta_0})>0$ rules out $c=1$, so $M(\theta)<0$ for $\theta\ne\theta_0$, i.e. $\theta_0$ is the unique maximizer (6.2). $\blacksquare$

Important

Lemma 6.2 (Sufficient conditions for ML consistency) If $\hat\theta_n$ satisfies: (1) near-maximizer $L_n(\hat\theta_n)\ge L_n(\theta_0)-o_P(1)$; (2) uniform convergence $\sup_{\theta\in\Theta}|L_n(\theta)-L(\theta)|\xrightarrow{p}0$; (3) well-separatedness $\sup_{\theta\notin B_\delta(\theta_0)}L(\theta)<L(\theta_0)$ for every $\delta>0$, then $\hat\theta_n\xrightarrow{p}\theta_0$.

(Well-separatedness means: on the domain excluding the neighborhood $B_\delta(\theta_0)$ of $\theta_0$, the supremum of $L(\theta)$ is strictly below $L(\theta_0)$ — i.e. $L$ has an "isolated" peak at $\theta_0$, ensuring a unique maximum. Lemmas 6.3/6.4/6.5 give more primitive sufficient conditions: near-maximizer is obvious from the definition of supremum; well-separatedness needs $L$ continuous + $\Theta$ compact + unique maximum; uniform convergence needs $\Theta$ compact + a dominating function.)

6.3.2 误设定与 Kullback-Leibler 散度. 当 $Y\mid X$ 对任何 $\theta$ 都不能表示为 $P_\theta$（模型误设定），同样的条件仍保证 $\hat\theta_n$ 依概率收敛到 $L(\theta)=\mathbb E_{f}[\log p_\theta(Y\mid X)]$ 的最大值点（$f$ 为真实条件密度，期望在 $f$ 下取）。把它改写：

$$\arg\max_\theta\mathbb E_{f}[\log p_\theta(Y\mid X)]=\arg\min_\theta\mathbb E_{f}\Big[\log\frac{f(Y\mid X)}{p_\theta(Y\mid X)}\Big]$$

后者中 $\mathbb E_f[\log\frac{f}{p_\theta}]$ 称 $f$ 与 $p_\theta$ 的 Kullback-Leibler 散度。由 Jensen，KL 散度 $\ge0$，且 $=0$ 当且仅当 $f=p_\theta$。故 ML 收敛到使「与真实分布的 KL 距离」最小的参数（KL 类比「距离」但不对称、非真正距离）。

6.3.3 极限分布.

Important

命题 6.1 / 命题 6.2（极限分布 + 信息矩阵等式）在适当正则条件下（$\log p_\theta(y\mid x)$ 关于 $\theta$ 二次连续可微、存在支配函数、$B$ 可逆、$\Theta$ 紧）， $$\sqrt n(\hat\theta_n-\theta_0)\xrightarrow{d}N(0,\Omega),\quad\Omega=B^{-1}AB^{-1}$$ 其中 $A=\mathbb E_{\theta_0}[D_\theta\log p_{\theta_0}\,D_\theta\log p_{\theta_0}']$（得分的外积）、$B=\mathbb E_{\theta_0}[D^2_{\theta\theta'}\log p_{\theta_0}]$（Hessian 的期望）。命题 6.2（信息矩阵等式）：$B=-A$，故 $\Omega=-B^{-1}=A^{-1}$。

Note

证明（命题 6.1，MVT + CLT）一阶条件 $0=D_\theta L_n(\hat\theta_n)$。对 $D_\theta L_n(\hat\theta_n)$ 在 $\theta_0$ 处用中值定理（逐分量）：$0=D_\theta L_n(\theta_0)+H_n(\hat\theta_n-\theta_0)$，$H_n=D^2_{\theta\theta'}L_n$ 在中间点求值。由 WLLN + CMT $H_n\xrightarrow{p}B$；由 CLT $\sqrt n D_\theta L_n(\theta_0)=\sqrt n\frac1n\sum D_\theta\log p_{\theta_0}(Y_i\mid X_i)\xrightarrow{d}N(0,A)$（用 $\mathbb E_{\theta_0}[D_\theta\log p_{\theta_0}]=0$ (6.7)）。整理 $-H_n\sqrt n(\hat\theta_n-\theta_0)=\sqrt n D_\theta L_n(\theta_0)$，由 Slutsky：$\sqrt n(\hat\theta_n-\theta_0)\xrightarrow{d}-B^{-1}N(0,A)=N(0,B^{-1}AB^{-1})$。$\blacksquare$

Important

定义 6.7（信息矩阵）称 $-B=-\mathbb E_{\theta_0}[D^2_{\theta\theta'}\log p_{\theta_0}(Y\mid X)]$（在 $\theta_0$ 处求值的 Hessian 期望之负）为 Fisher 信息矩阵。由信息矩阵等式，$\Omega=(-B)^{-1}$ 即信息阵之逆。

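例 6.9（非正态极限）. $\mathrm{Unif}[0,\theta_0]$、$\hat\theta_n=\max X_i$：$-n(\hat\theta_n-\theta_0)\xrightarrow{d}$ 均值为 $\theta_0$（速率参数 $1/\theta_0$）的指数分布，因为 $\mathbb P(n(\theta_0-\hat\theta_n)>t)=(1-\frac{t}{n\theta_0})^n\to e^{-t/\theta_0}$。极限非正态：支撑依赖于参数，$\sqrt n$ 标准化失效，收敛速率为 $n$。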

6.3.2 Misspecification and Kullback-Leibler divergence. When $Y\mid X$ cannot be represented by $P_\theta$ for any $\theta$ (a misspecified model), the same conditions still imply that $\hat\theta_n$ converges in probability to the maximizer of $L(\theta)=\mathbb E_{f}[\log p_\theta(Y\mid X)]$, where $f$ is the true conditional density and the expectation is taken under $f$. Rewriting:

$$\arg\max_\theta\mathbb E_{f}[\log p_\theta(Y\mid X)]=\arg\min_\theta\mathbb E_{f}\Big[\log\frac{f(Y\mid X)}{p_\theta(Y\mid X)}\Big]$$

where $\mathbb E_f[\log\frac{f}{p_\theta}]$ is the Kullback-Leibler divergence between $f$ and $p_\theta$. By Jensen, KL divergence $\ge0$, with $=0$ iff $f=p_\theta$. So ML converges to the parameter minimizing the "KL distance to the true distribution" (KL is analogous to a "distance" but is asymmetric, not a true distance).
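The two claims above (KL is non-negative but asymmetric, and ML under misspecification targets the KL minimizer) can be checked numerically. The sketch below assumes NumPy; the discrete distributions, the Gamma(2, 1) data-generating process, and the exponential working model $p_\theta(y)=\theta e^{-\theta y}$ are illustrative assumptions introduced only for this example.

```python
import numpy as np

def kl(f, p):
    """KL divergence sum_i f_i * log(f_i / p_i) for discrete distributions."""
    f, p = np.asarray(f, dtype=float), np.asarray(p, dtype=float)
    return float(np.sum(f * np.log(f / p)))

f = np.array([0.5, 0.4, 0.1])
p = np.array([0.2, 0.3, 0.5])
kl_fp, kl_pf = kl(f, p), kl(p, f)       # both positive, and not equal

# Misspecified ML: the data are Gamma(2, 1) (mean 2), but the fitted model is
# exponential with rate theta. Then E_f[log p_theta(Y)] = log(theta) - 2*theta,
# which is maximized at theta_star = 1/2, the KL minimizer within the model.
rng = np.random.default_rng(2)          # arbitrary seed
y = rng.gamma(shape=2.0, scale=1.0, size=200_000)
theta_hat = 1.0 / y.mean()              # closed-form ML for the exponential model
theta_star = 0.5

print(kl_fp, kl_pf, theta_hat)
```

Here no value of $\theta$ makes the exponential model correct, yet the estimator still settles near a well-defined pseudo-true value.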

6.3.3 Limiting distribution.

Important

Proposition 6.1 / Proposition 6.2 (Limiting distribution + information matrix equality) Under suitable regularity conditions ($\log p_\theta(y\mid x)$ twice continuously differentiable in $\theta$, a dominating function exists, $B$ invertible, $\Theta$ compact), $$\sqrt n(\hat\theta_n-\theta_0)\xrightarrow{d}N(0,\Omega),\quad\Omega=B^{-1}AB^{-1}$$ where $A=\mathbb E_{\theta_0}[D_\theta\log p_{\theta_0}\,D_\theta\log p_{\theta_0}']$ (outer product of the score) and $B=\mathbb E_{\theta_0}[D^2_{\theta\theta'}\log p_{\theta_0}]$ (expectation of the Hessian). Proposition 6.2 (information matrix equality): $B=-A$, so $\Omega=-B^{-1}=A^{-1}$.

Note

Proof (Proposition 6.1, MVT + CLT) First-order condition $0=D_\theta L_n(\hat\theta_n)$. Apply the mean value theorem (component-wise) to $D_\theta L_n(\hat\theta_n)$ around $\theta_0$: $0=D_\theta L_n(\theta_0)+H_n(\hat\theta_n-\theta_0)$, with $H_n=D^2_{\theta\theta'}L_n$ evaluated at an intermediate point. By WLLN + CMT $H_n\xrightarrow{p}B$; by CLT $\sqrt n D_\theta L_n(\theta_0)=\sqrt n\frac1n\sum D_\theta\log p_{\theta_0}(Y_i\mid X_i)\xrightarrow{d}N(0,A)$ (using $\mathbb E_{\theta_0}[D_\theta\log p_{\theta_0}]=0$ (6.7)). Rearranging $-H_n\sqrt n(\hat\theta_n-\theta_0)=\sqrt n D_\theta L_n(\theta_0)$, by Slutsky $\sqrt n(\hat\theta_n-\theta_0)\xrightarrow{d}-B^{-1}N(0,A)=N(0,B^{-1}AB^{-1})$. $\blacksquare$

Important

Definition 6.7 (Information matrix) $-B=-\mathbb E_{\theta_0}[D^2_{\theta\theta'}\log p_{\theta_0}(Y\mid X)]$, the negative expected Hessian evaluated at $\theta_0$, is the Fisher information matrix. By the information matrix equality, $\Omega=(-B)^{-1}$ is the inverse of the information matrix.
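A small check of the information matrix equality in the scalar Bernoulli model, where the expectations can be computed exactly by summing over $x\in\{0,1\}$. This is only a sketch assuming NumPy; the value `theta0 = 0.3` is an arbitrary illustrative choice.

```python
import numpy as np

theta0 = 0.3                                   # assumed true parameter
xs = np.array([0.0, 1.0])                      # support of a Bernoulli draw
probs = np.array([1.0 - theta0, theta0])       # P(X = 0), P(X = 1)

# log p_theta(x) = x*log(theta) + (1 - x)*log(1 - theta)
score = xs / theta0 - (1.0 - xs) / (1.0 - theta0)            # first derivative
hess = -xs / theta0**2 - (1.0 - xs) / (1.0 - theta0)**2      # second derivative

mean_score = np.sum(probs * score)     # E[score] at theta0, should be 0
A = np.sum(probs * score**2)           # outer product of the score
B = np.sum(probs * hess)               # expected Hessian

omega_sandwich = A / B**2              # scalar version of B^{-1} A B^{-1}
omega_fisher = 1.0 / (-B)              # inverse Fisher information

print(A, B, omega_sandwich, omega_fisher)
```

Both variance formulas reduce to $\theta_0(1-\theta_0)$, the familiar variance of a Bernoulli draw, which matches $\hat\theta_n=\bar X_n$ in this model.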

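Example 6.9 (non-normal limit). For $\mathrm{Unif}[0,\theta_0]$ with $\hat\theta_n=\max X_i$: $-n(\hat\theta_n-\theta_0)$ converges in distribution to an exponential distribution with mean $\theta_0$ (rate $1/\theta_0$), since $\mathbb P(n(\theta_0-\hat\theta_n)>t)=(1-\frac{t}{n\theta_0})^n\to e^{-t/\theta_0}$. The limit is non-normal because the support depends on the parameter; $\sqrt n$ standardization fails, and the convergence rate is $n$.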
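The non-normal limit can be seen in a short Monte Carlo experiment. The sketch assumes NumPy; `theta0`, `n`, the number of replications, and the seed are illustrative. It compares the simulated $n(\theta_0-\hat\theta_n)$ with an exponential distribution of mean $\theta_0$, whose survival function at $t=\theta_0$ is $e^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(3)                 # arbitrary seed
theta0, n, reps = 2.0, 1000, 5000              # illustrative choices

# Each row is one sample of size n; the ML estimate is the row maximum.
maxima = rng.uniform(0.0, theta0, size=(reps, n)).max(axis=1)
scaled = n * (theta0 - maxima)                 # equals -n*(theta_hat - theta0)

sim_mean = scaled.mean()                       # compare with theta0
sim_tail = np.mean(scaled > theta0)            # compare with exp(-1)

print(sim_mean, sim_tail, np.exp(-1.0))
```

Note that `scaled` is never negative, since the estimator cannot exceed $\theta_0$; a normal limit centered at zero would be impossible here.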

6.4 Inference

设 $f:\mathbb R^k\to\mathbb R^p$ 连续可微、$D_\theta f(\theta_0)$ 行线性独立。检验 $H_0:f(\theta_0)=\mathbf 0$ vs $H_1:f(\theta_0)\ne\mathbf 0$（$p$ 个约束）。$\tilde\theta_n$ 为受约束 ML（在 $f(\theta)=0$ 下最大 $L_n$），$\hat\theta_n$ 为无约束 ML。得分 $S(\theta)=D_\theta\log p_\theta(Y\mid X)$。三种检验：

Wald：比较无约束估计的 $f(\hat\theta_n)$ 与 0；
Score / LM：比较受约束估计的得分 $D_\theta L_n(\tilde\theta_n)$ 与 0；
似然比 LR：直接比较 $L_n(\hat\theta_n)$ 与 $L_n(\tilde\theta_n)$。

6.4.2 Wald 检验. 由 Delta 方法 $\sqrt n(f(\hat\theta_n)-f(\theta_0))\xrightarrow{d}N(0,D_\theta f(\theta_0)\Omega D_\theta f(\theta_0)')$ (6.13)。以 $\hat\Omega_n=(-\hat B)^{-1}$ 或 $\hat A^{-1}$ 估计 $\Omega$（由信息矩阵等式二者皆一致，但有限样本下一般不相等）。检验统计量

$$T_n^{\text{Wald}}=n\,f(\hat\theta_n)'\big(D_\theta f(\hat\theta_n)\hat\Omega_n D_\theta f(\hat\theta_n)'\big)^{-1}f(\hat\theta_n)\xrightarrow{d}\chi^2_p$$

临界值 $\chi^2_{p,1-\alpha}$。

6.4.3 Score（LM）检验. 无约束 $\hat\theta_n$ 精确满足 $D_\theta L_n(\hat\theta_n)=0$；若 $H_0$ 为真，受约束估计处的得分 $D_\theta L_n(\tilde\theta_n)$ 也应接近零。用中值定理展开（(6.14)–(6.18)）可得：在 $H_0$ 与信息矩阵等式下，$\sqrt n D_\theta L_n(\tilde\theta_n)$ 的极限为均值零的正态分布，方差为 $F'(F\Omega F')^{-1}F$，其中 $F=D_\theta f(\theta_0)$；该矩阵秩为 $p$，这正是自由度 $p$ 的来源。检验统计量（一般式）

$$T_n^{\text{Score}}=n\,(D_\theta L_n(\tilde\theta_n))'\big(\hat\Omega_n\big)(D_\theta L_n(\tilde\theta_n))\xrightarrow{d}\chi^2_p$$

（$k=p$ 即约束数等于参数数时可简化）。临界值 $\chi^2_{p,1-\alpha}$。

6.4.4 Wald 与 Score 的关系. 两者极限同为 $\chi^2_p$，且统计量可经代数变换互化——若 $\hat H_n=\tilde H_n$、$\hat F_n=\tilde F_n$ 等假设 (6.19)(6.20)(6.21) 成立则严格相等。但有限样本下这些假设通常不精确成立，故两检验的数值常有差异（虽渐近分布相同）。

6.4.5 似然比（LR）检验. $H_0$ 为真时受约束 $\tilde\theta_n$ 与无约束 $\hat\theta_n$ 应给出相近的似然，故 $L_n(\hat\theta_n)-L_n(\tilde\theta_n)$ 小。可证

$$2n[L_n(\hat\theta_n)-L_n(\tilde\theta_n)]\xrightarrow{d}\chi^2_p$$

对简单假设（$H_0:\theta=\theta_1$ vs $H_1:\theta=\theta_2$），LR 检验由 Neyman-Pearson 引理为一致最优（UMP）检验。

Let $f:\mathbb R^k\to\mathbb R^p$ be continuously differentiable with $D_\theta f(\theta_0)$ of linearly independent rows. Test $H_0:f(\theta_0)=\mathbf 0$ vs $H_1:f(\theta_0)\ne\mathbf 0$ ($p$ restrictions). $\tilde\theta_n$ is the restricted ML (maximizing $L_n$ subject to $f(\theta)=0$), $\hat\theta_n$ the unrestricted ML. The score $S(\theta)=D_\theta\log p_\theta(Y\mid X)$. Three tests:

Wald: compares $f(\hat\theta_n)$ from the unrestricted estimator with 0;
Score / LM: compares the restricted estimator's score $D_\theta L_n(\tilde\theta_n)$ with 0;
Likelihood ratio LR: directly compares $L_n(\hat\theta_n)$ with $L_n(\tilde\theta_n)$.

6.4.2 Wald test. By the delta method $\sqrt n(f(\hat\theta_n)-f(\theta_0))\xrightarrow{d}N(0,D_\theta f(\theta_0)\Omega D_\theta f(\theta_0)')$ (6.13). Estimate $\Omega$ by $\hat\Omega_n=(-\hat B)^{-1}$ or $\hat A^{-1}$ (both are consistent by the information matrix equality, but they generally differ in finite samples). The statistic

$$T_n^{\text{Wald}}=n\,f(\hat\theta_n)'\big(D_\theta f(\hat\theta_n)\hat\Omega_n D_\theta f(\hat\theta_n)'\big)^{-1}f(\hat\theta_n)\xrightarrow{d}\chi^2_p$$

with critical value $\chi^2_{p,1-\alpha}$.

6.4.3 Score (LM) test. The unrestricted $\hat\theta_n$ satisfies $D_\theta L_n(\hat\theta_n)=0$ exactly; if $H_0$ is true, the score at the restricted estimator, $D_\theta L_n(\tilde\theta_n)$, should also be close to zero. A mean value expansion ((6.14)–(6.18)) shows that, under $H_0$ and the information matrix equality, $\sqrt n D_\theta L_n(\tilde\theta_n)$ has a mean-zero normal limit with variance $F'(F\Omega F')^{-1}F$, where $F=D_\theta f(\theta_0)$; this matrix has rank $p$, which is where the $p$ degrees of freedom come from. The statistic (general form)

$$T_n^{\text{Score}}=n\,(D_\theta L_n(\tilde\theta_n))'\big(\hat\Omega_n\big)(D_\theta L_n(\tilde\theta_n))\xrightarrow{d}\chi^2_p$$

(it simplifies when $k=p$, restrictions equal parameters), with critical value $\chi^2_{p,1-\alpha}$.

6.4.4 Relationship between Wald and Score. Both have the same $\chi^2_p$ limit and the statistics can be transformed into each other algebraically — they are exactly equal if assumptions (6.19)(6.20)(6.21) ($\hat H_n=\tilde H_n$, $\hat F_n=\tilde F_n$, etc.) hold. But in finite samples these assumptions usually do not hold exactly, so the two tests often differ numerically (though sharing the same asymptotic distribution).

6.4.5 Likelihood ratio (LR) test. When $H_0$ is true, the restricted $\tilde\theta_n$ and unrestricted $\hat\theta_n$ should give similar likelihoods, so $L_n(\hat\theta_n)-L_n(\tilde\theta_n)$ is small. One can show

$$2n[L_n(\hat\theta_n)-L_n(\tilde\theta_n)]\xrightarrow{d}\chi^2_p$$

For a simple hypothesis ($H_0:\theta=\theta_1$ vs $H_1:\theta=\theta_2$), the LR test is the uniformly most powerful (UMP) test by the Neyman-Pearson Lemma.
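To make the three statistics concrete, here is a sketch for the Bernoulli model with the single restriction $f(\theta)=\theta-\theta_1$ (so $p=k=1$ and the restricted estimator is $\theta_1$ itself). It assumes NumPy; the sample size, seed, and null value `theta_null = 0.5` are illustrative. Because $L_n$ is the average log-likelihood, the LR statistic is computed as $2n$ times the gap in $L_n$. The hard-coded critical value 3.8415 is the rounded 0.95 quantile of $\chi^2_1$.

```python
import numpy as np

def avg_loglik(theta, xbar):
    """Bernoulli average log-likelihood L_n(theta), written in terms of xbar."""
    return xbar * np.log(theta) + (1.0 - xbar) * np.log(1.0 - theta)

rng = np.random.default_rng(4)                 # arbitrary seed
n, theta_null = 400, 0.5                       # illustrative choices
x = rng.binomial(1, theta_null, size=n)        # data generated under H0
xbar = x.mean()                                # unrestricted ML estimate

# Wald: f(theta_hat) = xbar - theta_null, Omega estimated at theta_hat.
wald = n * (xbar - theta_null) ** 2 / (xbar * (1.0 - xbar))

# Score: D L_n at the restricted estimate, Omega estimated at theta_null.
score_at_null = (xbar - theta_null) / (theta_null * (1.0 - theta_null))
score_stat = n * score_at_null ** 2 * theta_null * (1.0 - theta_null)

# LR: 2n times the gap in the average log-likelihood.
lr = 2.0 * n * (avg_loglik(xbar, xbar) - avg_loglik(theta_null, xbar))

crit = 3.8415                                  # chi-square(1) 0.95 quantile, rounded
print(wald, score_stat, lr, [s > crit for s in (wald, score_stat, lr)])
```

The three numbers are typically close but not identical, which illustrates the finite-sample differences mentioned in the discussion of Wald versus Score.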

Important

本章脉络 似然 → 一致性 → 渐近正态 → 三大检验。 §6.1–6.2 定义（条件）似然与 ML 估计量，强调 ML 可能不唯一、不存在或非正态。§6.3 核心三步：用 Jensen 证总体对数似然 $L(\theta)$ 在真值唯一最大（引理 6.1）；配合一致收敛 + 良好分离得一致性（引理 6.2）；误设定时 ML 找的是 KL 散度最小者；渐近 $\sqrt n(\hat\theta_n-\theta_0)\xrightarrow{d}N(0,\Omega)$，由信息矩阵等式 $B=-A$ 化简为 Fisher 信息阵之逆。§6.4 的 Wald / Score / LR 三检验渐近等价（皆 $\chi^2_p$），分别从「无约束估计偏离约束」「受约束估计的得分」「两者似然差」三个角度构造。至此 Part I（实证分析 I）完结；下一章进入 Part II（实证分析 II），从贝叶斯推断开始。

Important

Chapter arc Likelihood → consistency → asymptotic normality → three tests. §6.1–6.2 define the (conditional) likelihood and ML estimator, stressing that ML may be non-unique, non-existent, or non-normal. §6.3's core three steps: Jensen proves the population log-likelihood $L(\theta)$ is uniquely maximized at the truth (Lemma 6.1); combined with uniform convergence + well-separatedness gives consistency (Lemma 6.2); under misspecification ML finds the KL-divergence minimizer; asymptotically $\sqrt n(\hat\theta_n-\theta_0)\xrightarrow{d}N(0,\Omega)$, simplified by the information matrix equality $B=-A$ to the inverse Fisher information. §6.4's Wald / Score / LR tests are asymptotically equivalent (all $\chi^2_p$), built from three angles: "the unrestricted estimate's deviation from the restriction," "the restricted estimate's score," and "the likelihood gap between the two." This completes Part I (Empirical Analysis I); the next chapter begins Part II (Empirical Analysis II) with Bayesian inference.