6. Maximum Likelihood Estimator

Note

Chapter theme: maximum likelihood (ML) estimation. §6.1 Unconditional ML: likelihood \(\ell_n(\theta)=\prod p_\theta(x_i)\) (the joint density), log-likelihood \(L_n(\theta)=\frac1n\sum\log p_\theta(x_i)\), estimator \(\hat\theta_n\in\arg\max_\theta L_n(\theta)\); examples show that the ML estimator may fail to be unique or to exist. §6.2 Conditional ML: \(\ell_n(\theta)=\prod p_\theta(y_i\mid x_i)\) (unconditional ML is a special case), e.g. Probit / Logit. §6.3 Properties. Consistency: \(L_n(\theta)\xrightarrow{p}L(\theta)=\mathbb E_{\theta_0}[\log p_\theta(Y\mid X)]\); Jensen's inequality shows that \(L(\theta)\) is uniquely maximized at the truth \(\theta_0\) (Lemma 6.1), and the three conditions "near-maximizer + uniform convergence + well-separatedness" then give \(\hat\theta_n\xrightarrow{p}\theta_0\) (Lemma 6.2). Misspecification: ML converges to the parameter minimizing the Kullback-Leibler divergence from the true distribution. Limiting distribution: \(\sqrt n(\hat\theta_n-\theta_0)\xrightarrow{d}N(0,\Omega)\) with \(\Omega=B^{-1}AB^{-1}\), which the information matrix equality \(B=-A\) simplifies to \(\Omega=A^{-1}=(-B)^{-1}\) (the inverse Fisher information). §6.4 Inference: three asymptotically equivalent tests of \(f(\theta_0)=0\): Wald (compares \(f(\hat\theta_n)\) with 0), Score / LM (compares the restricted estimator's score with 0), and the likelihood ratio LR (\(2n[L_n(\hat\theta_n)-L_n(\tilde\theta_n)]\xrightarrow{d}\chi^2_p\), where the factor \(n\) undoes the \(\frac1n\) normalization in \(L_n\); under a simple hypothesis it is uniformly most powerful by the Neyman-Pearson Lemma).

6.1 Unconditional Maximum Likelihood Estimator

Let \(X_1,\dots,X_n\overset{iid}\sim P=P_{\theta_0}\), where \(P_\theta\) is a distribution with parameter \(\theta\in\Theta\) (\(\theta\) need not be the truth \(\theta_0\)), and \(p_\theta\) the corresponding density (with respect to a common measure \(\mu\)).

Important

Definitions 6.1 / 6.2 / 6.3 Unconditional likelihood \(\ell_n(\theta)\equiv\prod_{i=1}^np_\theta(x_i)\) (the joint density evaluated at the realized values; a product of same-form marginal densities by i.i.d.). ML estimator \(\hat\theta_n\in\arg\max_\theta\ell_n(\theta)\). Unconditional log-likelihood \(L_n(\theta)\equiv\frac1n\log\ell_n(\theta)=\frac1n\sum_{i=1}^n\log p_\theta(x_i)\).

Tip

Remark 6.1 / 6.2 Likelihood and density have the same form, differing only in viewpoint: the density treats \(X_i\) as the variable and \(\theta\) as the parameter; the likelihood treats \(\theta\) as the variable and \(X_i\) as known values used for evaluation. Since \(\log\) is monotone increasing, \(\arg\max\ell_n=\arg\max L_n\).

Examples.

Example 6.1 (\(\mathrm{Unif}[0,\theta_0]\)): \(p_\theta(x)=\frac1\theta\mathbf 1\{0\le x\le\theta\}\), so \(\ell_n(\theta)=\frac1{\theta^n}\mathbf 1\{0\le X_i\le\theta,\forall i\}\). The likelihood is zero unless \(\theta\ge\max X_i\), and on that range \(\frac1{\theta^n}\) is decreasing in \(\theta\), so \(\hat\theta_n=\max_{1\le i\le n}X_i\).

Example 6.2 (non-existence): change the support to the open interval \((0,\theta_0)\). The constraint becomes the strict inequality \(\theta>\max X_i\), so the supremum of \(\ell_n\) is approached as \(\theta\downarrow\max X_i\) but never attained. The ML estimator does not exist; only near-maximizers do.

Example 6.3 (non-uniqueness): \(p_\theta(x)=\mathbf 1\{\theta\le x\le\theta+1\}\). The likelihood equals 1 for every \(\theta\in[\max X_i-1,\min X_i]\) and 0 otherwise, so every point of that interval is an ML estimator.

Example 6.4 (Bernoulli): \(p_\theta(x)=\theta^x(1-\theta)^{1-x}\), \(L_n(\theta)=\log(\theta)\bar X_n+\log(1-\theta)(1-\bar X_n)\). The first-order condition gives \(\hat\theta_n=\bar X_n\), and the second derivative is negative, which confirms a maximum.

Example 6.5 (censored normal): \(X=\max\{Z-\theta,0\}\), \(Z\sim N(0,1)\), \(p_\theta(x)=\Phi(\theta)\mathbf 1\{x=0\}+\phi(x+\theta)\mathbf 1\{x>0\}\). Here \(\hat\theta_n\) is hard to characterize explicitly.
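The closed forms in Examples 6.1 and 6.4 can be checked numerically. The following is a minimal sketch using only the Python standard library; the sample sizes, seed, true parameters, and grids are illustrative assumptions, not part of the notes.

```python
import math
import random

def uniform_loglik(theta, xs):
    """Average log-likelihood for Unif[0, theta]; -inf where the likelihood is zero."""
    if theta <= 0 or theta < max(xs) or min(xs) < 0:
        return -math.inf
    return -math.log(theta)

def bernoulli_loglik(theta, xs):
    """Average log-likelihood L_n(theta) for Bernoulli(theta), 0 < theta < 1."""
    xbar = sum(xs) / len(xs)
    return xbar * math.log(theta) + (1 - xbar) * math.log(1 - theta)

random.seed(0)  # illustrative seed
unif_sample = [random.uniform(0, 2.0) for _ in range(50)]             # theta_0 = 2
bern_sample = [1 if random.random() < 0.3 else 0 for _ in range(50)]  # theta_0 = 0.3

# Closed-form ML estimates from Examples 6.1 and 6.4.
theta_hat_unif = max(unif_sample)
theta_hat_bern = sum(bern_sample) / len(bern_sample)

# Grid check: no grid point achieves a higher log-likelihood than the closed form.
grid_unif = [0.01 * k for k in range(1, 401)]
grid_bern = [0.001 * k for k in range(1, 1000)]
best_unif = max(uniform_loglik(t, unif_sample) for t in grid_unif)
best_bern = max(bernoulli_loglik(t, bern_sample) for t in grid_bern)

print(theta_hat_unif, uniform_loglik(theta_hat_unif, unif_sample), best_unif)
print(theta_hat_bern, bernoulli_loglik(theta_hat_bern, bern_sample), best_bern)
```

Any value of \(\theta\) slightly below the sample maximum gives the uniform model a log-likelihood of \(-\infty\), which is the discontinuity that drives Examples 6.1 and 6.2.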

6.2 Conditional Maximum Likelihood Estimator

Let \((Y_i,X_i)\) be i.i.d., with the conditional distribution \(Y_i\mid X_i\sim P_{\theta_0}\), the marginal \(X_i\sim P_X\), and conditional density \(p_\theta(y\mid x)\).

Important

Definitions 6.4 / 6.5 / 6.6 Conditional likelihood \(\ell_n(\theta)\equiv\prod_{i=1}^np_\theta(y_i\mid x_i)\). Conditional ML estimator \(\hat\theta_n\in\arg\max_\theta\ell_n(\theta)\). Conditional log-likelihood \(L_n(\theta)\equiv\frac1n\log\ell_n(\theta)=\frac1n\sum\log p_\theta(y_i\mid x_i)\).

The unconditional likelihood is a special case of the conditional one: when \(X_i\) is degenerate (a constant), \(p_\theta(y\mid x)=p_\theta(y)\). The discussion below therefore uses the more general conditional case.

Example 6.6 (Probit / Logit). \((Y_i,X_i)\) i.i.d., \(Y_i\in\{0,1\}\), \(X_i\in\mathbb R^{k+1}\) (first element 1), \(\theta_0\in\mathbb R^{k+1}\). Probit:

$$p_\theta(y\mid x)=\Phi(x'\theta)^y[1-\Phi(x'\theta)]^{1-y}$$

$$L_n(\theta)=\frac1n\sum_{i=1}^n[Y_i\log\Phi(X_i'\theta)+(1-Y_i)\log(1-\Phi(X_i'\theta))]$$

Replacing \(\Phi\) with the logistic c.d.f. \(G(z)=\frac{\exp(z)}{1+\exp(z)}\) gives the Logit model.
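To make the Logit version of Example 6.6 concrete, here is a minimal sketch that maximizes the conditional log-likelihood by Newton-Raphson for an intercept plus one regressor. The data-generating values, seed, iteration count, and function names are assumptions chosen for illustration; Newton-Raphson is one common choice of optimizer, not something the notes prescribe.

```python
import math
import random

def log_sigmoid(z):
    """Numerically stable log G(z) for the logistic c.d.f. G."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def logit_loglik(theta, data):
    """Average conditional log-likelihood L_n(theta); theta = (intercept, slope)."""
    total = 0.0
    for x, y in data:
        z = theta[0] + theta[1] * x
        # log(1 - G(z)) equals log G(-z)
        total += y * log_sigmoid(z) + (1 - y) * log_sigmoid(-z)
    return total / len(data)

def logit_gradient_hessian(theta, data):
    """Gradient and Hessian of L_n for the regressor vector (1, x)."""
    n = len(data)
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(theta[0] + theta[1] * x)))
        g0 += (y - p) / n
        g1 += (y - p) * x / n
        w = p * (1 - p) / n
        h00 -= w
        h01 -= w * x
        h11 -= w * x * x
    return (g0, g1), (h00, h01, h11)

def fit_logit(data, iterations=25):
    """Newton-Raphson update theta <- theta - H^{-1} g with a hand-coded 2x2 solve."""
    theta = (0.0, 0.0)
    for _ in range(iterations):
        (g0, g1), (h00, h01, h11) = logit_gradient_hessian(theta, data)
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (-h01 * g0 + h00 * g1) / det
        theta = (theta[0] - step0, theta[1] - step1)
    return theta

random.seed(1)           # illustrative seed
theta_0 = (0.5, -1.0)    # hypothetical true parameter
data = []
for _ in range(2000):
    x = random.gauss(0.0, 1.0)
    p = 1.0 / (1.0 + math.exp(-(theta_0[0] + theta_0[1] * x)))
    data.append((x, 1 if random.random() < p else 0))

theta_hat = fit_logit(data)
print(theta_hat, logit_loglik(theta_hat, data), logit_loglik(theta_0, data))
```

Because the Logit log-likelihood is concave in \(\theta\), the stationary point found by Newton-Raphson is the global maximizer, so \(L_n(\hat\theta_n)\ge L_n(\theta_0)\) in the sample.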

6.3 Properties of the Maximum Likelihood Estimator

6.3.1 Consistency. WLLN suggests \(L_n(\theta)=\frac1n\sum\log p_\theta(Y_i\mid X_i)\xrightarrow{p}\mathbb E_{\theta_0}[\log p_\theta(Y\mid X)]\equiv L(\theta)\).

Important

Lemma 6.1 (Uniqueness of ML in the limit) If for every \(\theta\ne\theta_0\), \(\mathbb P(p_\theta(Y\mid X)\ne p_{\theta_0}(Y\mid X))>0\), then \(L(\theta)=\mathbb E_{\theta_0}[\log p_\theta(Y\mid X)]\) is uniquely maximized at \(\theta=\theta_0\).

Note

Proof (Lemma 6.1, Jensen) Let \(M(\theta)\equiv L(\theta)-L(\theta_0)=\mathbb E_{\theta_0}[\log\frac{p_\theta(Y\mid X)}{p_{\theta_0}(Y\mid X)}]\). Since \(\log\) is strictly concave, Jensen's inequality gives $$M(\theta)\le\log\mathbb E_{\theta_0}\Big[\frac{p_\theta(Y\mid X)}{p_{\theta_0}(Y\mid X)}\Big]=\log\Big(\int\int\frac{p_\theta(y\mid x)}{p_{\theta_0}(y\mid x)}p_{\theta_0}(y\mid x)\,d\mu(y)\,dP_X(x)\Big)=\log 1=0 \tag{6.1}$$ where the double integral equals 1 because \(p_\theta(y\mid x)\) is a conditional density. Strict concavity makes the inequality strict unless the ratio is almost surely constant, \(\mathbb P(\frac{p_\theta}{p_{\theta_0}}=c)=1\), and integrating to 1 forces \(c=1\), i.e. \(p_\theta=p_{\theta_0}\) almost surely. The assumption \(\mathbb P(p_\theta\ne p_{\theta_0})>0\) rules this case out, so \(M(\theta)<0\) for every \(\theta\ne\theta_0\), i.e. \(\theta_0\) is the unique maximizer (6.2). \(\blacksquare\)
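Lemma 6.1 can be illustrated in the Bernoulli model, where the population log-likelihood has the closed form \(L(\theta)=\theta_0\log\theta+(1-\theta_0)\log(1-\theta)\). The sketch below evaluates \(M(\theta)=L(\theta)-L(\theta_0)\) on a grid; the value \(\theta_0=0.3\) and the grid are illustrative assumptions.

```python
import math

def population_loglik(theta, theta_0):
    """L(theta) = E_{theta_0}[log p_theta(X)] for a Bernoulli model."""
    return theta_0 * math.log(theta) + (1 - theta_0) * math.log(1 - theta)

def m_gap(theta, theta_0):
    """M(theta) = L(theta) - L(theta_0), which Lemma 6.1 says is <= 0."""
    return population_loglik(theta, theta_0) - population_loglik(theta_0, theta_0)

theta_0 = 0.3                               # hypothetical true parameter
grid = [k / 100 for k in range(1, 100)]     # 0.01, 0.02, ..., 0.99
gaps = {theta: m_gap(theta, theta_0) for theta in grid}
argmax_theta = max(gaps, key=gaps.get)

print(argmax_theta, gaps[argmax_theta])
```

On this grid \(M\) is zero only at \(\theta_0\) and strictly negative elsewhere, which matches the strict inequality in the proof.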

Important

Lemma 6.2 (Sufficient conditions for ML consistency) If \(\hat\theta_n\) satisfies: (1) near-maximizer \(L_n(\hat\theta_n)\ge L_n(\theta_0)-o_P(1)\); (2) uniform convergence \(\sup_{\theta\in\Theta}|L_n(\theta)-L(\theta)|\xrightarrow{p}0\); (3) well-separatedness \(\sup_{\theta\notin B_\delta(\theta_0)}L(\theta)<L(\theta_0)\) for every \(\delta>0\), then \(\hat\theta_n\xrightarrow{p}\theta_0\).

(Well-separatedness means: on the domain excluding the neighborhood \(B_\delta(\theta_0)\) of \(\theta_0\), the supremum of \(L(\theta)\) is strictly below \(L(\theta_0)\) — i.e. \(L\) has an "isolated" peak at \(\theta_0\), ensuring a unique maximum. Lemmas 6.3/6.4/6.5 give more primitive sufficient conditions: near-maximizer is obvious from the definition of supremum; well-separatedness needs \(L\) continuous + \(\Theta\) compact + unique maximum; uniform convergence needs \(\Theta\) compact + a dominating function.)
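The uniform convergence condition of Lemma 6.2 can be seen at work in the Bernoulli model, where \(L_n(\theta)-L(\theta)=(\bar X_n-\theta_0)[\log\theta-\log(1-\theta)]\). The sketch below is a simulation under assumed values: the compact parameter set \(\Theta=[0.05,0.95]\), \(\theta_0=0.3\), the seed, and the sample sizes are all illustrative.

```python
import math
import random

THETA_0 = 0.3                                        # hypothetical true parameter
GRID = [0.05 + 0.9 * k / 200 for k in range(201)]    # grid on Theta = [0.05, 0.95]

def avg_loglik(theta, mean_x):
    """Bernoulli average log-likelihood; with mean_x = THETA_0 this is L(theta)."""
    return mean_x * math.log(theta) + (1 - mean_x) * math.log(1 - theta)

def sup_gap(n, rng):
    """Return (max over the grid of |L_n - L|, |theta_hat - theta_0|) for one sample."""
    xbar = sum(1 for _ in range(n) if rng.random() < THETA_0) / n
    gap = max(abs(avg_loglik(t, xbar) - avg_loglik(t, THETA_0)) for t in GRID)
    return gap, abs(xbar - THETA_0)

rng = random.Random(0)  # illustrative seed
results = {n: sup_gap(n, rng) for n in (100, 10_000, 100_000)}
for n, (gap, err) in results.items():
    print(n, gap, err)
```

In this model the uniform gap is exactly \(|\bar X_n-\theta_0|\log 19\), attained at the endpoints of \(\Theta\), so it vanishes at the same rate as the estimation error. Compactness matters here: on \((0,1)\) the factor \(\log\theta-\log(1-\theta)\) is unbounded.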

6.3.2 Misspecification and Kullback-Leibler divergence. When \(Y\mid X\) cannot be represented by \(P_\theta\) for any \(\theta\) (a misspecified model), the same conditions still imply that \(\hat\theta_n\) converges in probability to the maximizer of \(L(\theta)=\mathbb E_{f}[\log p_\theta(Y\mid X)]\), where \(f(y\mid x)\) is the true conditional density. Rewriting:

$$\arg\max_\theta\mathbb E_{f}[\log p_\theta(Y\mid X)]=\arg\min_\theta\mathbb E_{f}\Big[\log\frac{f(Y\mid X)}{p_\theta(Y\mid X)}\Big]$$

where \(\mathbb E_f[\log\frac{f}{p_\theta}]\) is the Kullback-Leibler divergence between \(f\) and \(p_\theta\). The two problems have the same solution because \(\mathbb E_f[\log f(Y\mid X)]\) does not depend on \(\theta\). By Jensen, the KL divergence is \(\ge0\), with equality iff \(f=p_\theta\). So ML converges to the parameter whose model density is closest to the true distribution in KL divergence (KL behaves like a "distance" but is asymmetric, so it is not a true distance).
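A small discrete example makes the KL minimizer tangible. In the sketch below the true distribution on \(\{0,1,2\}\) is a made-up pmf that is not binomial, and the (misspecified) model is \(\mathrm{Binomial}(2,\theta)\). For this model \(\mathbb E_f[\log p_\theta(Y)]=\mathbb E_f[Y]\log\theta+(2-\mathbb E_f[Y])\log(1-\theta)+\text{const}\), so the minimizer should be \(\theta^*=\mathbb E_f[Y]/2\). The pmf and the grid are illustrative assumptions.

```python
import math

TRUE_PMF = [0.5, 0.1, 0.4]  # hypothetical true f on {0, 1, 2}; not a binomial pmf

def binomial2_pmf(theta):
    """Model p_theta: Binomial(2, theta) on {0, 1, 2}."""
    return [(1 - theta) ** 2, 2 * theta * (1 - theta), theta ** 2]

def kl_divergence(f, p):
    """KL(f || p) = E_f[log(f / p)] for pmfs on a common finite support."""
    return sum(fy * math.log(fy / py) for fy, py in zip(f, p) if fy > 0)

grid = [k / 1000 for k in range(1, 1000)]
theta_star = min(grid, key=lambda t: kl_divergence(TRUE_PMF, binomial2_pmf(t)))
kl_min = kl_divergence(TRUE_PMF, binomial2_pmf(theta_star))
mean_y = sum(y * fy for y, fy in enumerate(TRUE_PMF))

print(theta_star, mean_y / 2, kl_min)
```

The minimized divergence stays strictly positive because no \(\theta\) reproduces the true pmf, yet the limit point \(\theta^*\) is still well defined.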

6.3.3 Limiting distribution.

Important

Proposition 6.1 / Proposition 6.2 (Limiting distribution + information matrix equality) Under suitable regularity conditions (\(\log p_\theta(y\mid x)\) twice continuously differentiable in \(\theta\), a dominating function exists, \(B\) invertible, \(\Theta\) compact), $$\sqrt n(\hat\theta_n-\theta_0)\xrightarrow{d}N(0,\Omega),\quad\Omega=B^{-1}AB^{-1}$$ where \(A=\mathbb E_{\theta_0}[D_\theta\log p_{\theta_0}\,D_\theta\log p_{\theta_0}']\) (outer product of the score) and \(B=\mathbb E_{\theta_0}[D^2_{\theta\theta'}\log p_{\theta_0}]\) (expectation of the Hessian). Proposition 6.2 (information matrix equality): \(B=-A\), so \(\Omega=-B^{-1}=A^{-1}\).

Note

Proof (Proposition 6.1, MVT + CLT) First-order condition \(0=D_\theta L_n(\hat\theta_n)\). Apply the mean value theorem (component-wise) to \(D_\theta L_n(\hat\theta_n)\) around \(\theta_0\): \(0=D_\theta L_n(\theta_0)+H_n(\hat\theta_n-\theta_0)\), with \(H_n=D^2_{\theta\theta'}L_n\) evaluated at an intermediate point. By WLLN + CMT \(H_n\xrightarrow{p}B\); by CLT \(\sqrt n D_\theta L_n(\theta_0)=\sqrt n\frac1n\sum D_\theta\log p_{\theta_0}(Y_i\mid X_i)\xrightarrow{d}N(0,A)\) (using \(\mathbb E_{\theta_0}[D_\theta\log p_{\theta_0}]=0\) (6.7)). Rearranging \(-H_n\sqrt n(\hat\theta_n-\theta_0)=\sqrt n D_\theta L_n(\theta_0)\), by Slutsky \(\sqrt n(\hat\theta_n-\theta_0)\xrightarrow{d}-B^{-1}N(0,A)=N(0,B^{-1}AB^{-1})\). \(\blacksquare\)
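For the Bernoulli model the objects \(A\), \(B\), and \(\Omega\) can be computed exactly and compared with a simulation. The sketch below assumes \(\theta_0=0.3\), uses \(\hat\theta_n=\bar X_n\) from Example 6.4, and picks an arbitrary seed, sample size, and number of replications.

```python
import random

def score(x, theta):
    """D_theta log p_theta(x) for the Bernoulli model."""
    return x / theta - (1 - x) / (1 - theta)

def hessian(x, theta):
    """Second derivative of log p_theta(x) in theta for the Bernoulli model."""
    return -x / theta ** 2 - (1 - x) / (1 - theta) ** 2

def expect(g, theta_0):
    """Exact expectation of g(X) under Bernoulli(theta_0)."""
    return theta_0 * g(1) + (1 - theta_0) * g(0)

theta_0 = 0.3                                           # hypothetical true parameter
A = expect(lambda x: score(x, theta_0) ** 2, theta_0)   # outer product of the score
B = expect(lambda x: hessian(x, theta_0), theta_0)      # expected Hessian
omega = A / B ** 2                                      # scalar B^{-1} A B^{-1}

# Monte Carlo variance of sqrt(n) * (theta_hat - theta_0) with theta_hat = sample mean.
rng = random.Random(0)  # illustrative seed
n, reps = 400, 2000
draws = []
for _ in range(reps):
    xbar = sum(1 for _ in range(n) if rng.random() < theta_0) / n
    draws.append(n ** 0.5 * (xbar - theta_0))
mc_mean = sum(draws) / reps
mc_var = sum((d - mc_mean) ** 2 for d in draws) / (reps - 1)

print(A, B, omega, mc_var)
```

Here \(A=-B=\frac{1}{\theta_0(1-\theta_0)}\), so the sandwich collapses to \(\Omega=\theta_0(1-\theta_0)\), the inverse Fisher information, and the simulated variance should be close to that value.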

Important

Definition 6.7 (Information matrix) \(-B=-\mathbb E_{\theta_0}[D^2_{\theta\theta'}\log p_{\theta_0}(Y\mid X)]\) is the Fisher information matrix. By the information matrix equality, \(\Omega=(-B)^{-1}\) is the inverse of the information matrix.

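Example 6.9 (non-normal limit). \(\mathrm{Unif}[0,\theta_0]\), \(\hat\theta_n=\max X_i\): \(-n(\hat\theta_n-\theta_0)\xrightarrow{d}\mathrm{Exp}(\theta_0)\), where \(\mathrm{Exp}(\theta_0)\) is read here as the exponential distribution with mean \(\theta_0\), since \(\mathbb P(n(\theta_0-\hat\theta_n)>t)=(1-\frac{t}{n\theta_0})^n\to e^{-t/\theta_0}\). The limit is non-normal because the support depends on the parameter: the \(\sqrt n\) normalization fails and the convergence rate is \(n\).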
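The non-normal limit in Example 6.9 is easy to simulate. The sketch below draws repeated samples from \(\mathrm{Unif}[0,\theta_0]\) and looks at \(n(\theta_0-\max X_i)\); the choice \(\theta_0=2\), the seed, and the simulation sizes are illustrative assumptions. The tail check uses the fact that an exponential variable with mean \(\theta_0\) exceeds \(\theta_0\) with probability \(e^{-1}\).

```python
import math
import random

theta_0, n, reps = 2.0, 200, 2000   # hypothetical parameter and simulation sizes
rng = random.Random(0)              # illustrative seed
scaled = []
for _ in range(reps):
    theta_hat = max(rng.uniform(0.0, theta_0) for _ in range(n))
    scaled.append(n * (theta_0 - theta_hat))   # equals -n (theta_hat - theta_0)

sample_mean = sum(scaled) / reps
tail_freq = sum(1 for t in scaled if t > theta_0) / reps
limit_tail = math.exp(-1.0)   # P(T > theta_0) for T exponential with mean theta_0

print(sample_mean, tail_freq, limit_tail)
```

Every draw is nonnegative because \(\hat\theta_n\le\theta_0\), so the estimator is biased downward and the limit is one-sided, unlike a normal limit.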

6.4 Inference

Let \(f:\mathbb R^k\to\mathbb R^p\) be continuously differentiable with \(D_\theta f(\theta_0)\) of linearly independent rows. Test \(H_0:f(\theta_0)=\mathbf 0\) vs \(H_1:f(\theta_0)\ne\mathbf 0\) (\(p\) restrictions). \(\tilde\theta_n\) is the restricted ML (maximizing \(L_n\) subject to \(f(\theta)=0\)), \(\hat\theta_n\) the unrestricted ML. The score \(S(\theta)=D_\theta\log p_\theta(Y\mid X)\). Three tests:

  • Wald: compares \(f(\hat\theta_n)\) from the unrestricted estimator with 0;
  • Score / LM: compares the restricted estimator's score \(D_\theta L_n(\tilde\theta_n)\) with 0;
  • Likelihood ratio LR: directly compares \(L_n(\hat\theta_n)\) with \(L_n(\tilde\theta_n)\).

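6.4.2 Wald test. By the delta method \(\sqrt n(f(\hat\theta_n)-f(\theta_0))\xrightarrow{d}N(0,D_\theta f(\theta_0)\Omega D_\theta f(\theta_0)')\) (6.13). Under \(H_0\), \(f(\theta_0)=\mathbf 0\), so this is the limit of \(\sqrt n f(\hat\theta_n)\) itself. Estimate \(\Omega\) by \(\hat\Omega_n=(-\hat B)^{-1}\) or \(\hat A^{-1}\), where \(\hat A\) and \(\hat B\) are sample analogues of \(A\) and \(B\); the two agree asymptotically by the information matrix equality. The statistic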

$$T_n^{\text{Wald}}=n\,f(\hat\theta_n)'\big(D_\theta f(\hat\theta_n)\hat\Omega_n D_\theta f(\hat\theta_n)'\big)^{-1}f(\hat\theta_n)\xrightarrow{d}\chi^2_p$$

with critical value \(\chi^2_{p,1-\alpha}\).
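As a worked instance of the Wald statistic, take the Bernoulli model with the single restriction \(f(\theta)=\theta-\theta_{\text{null}}\) (so \(p=1\) and \(D_\theta f=1\)) and \(\hat\Omega_n=\hat\theta_n(1-\hat\theta_n)\). The sketch below is illustrative: the data summary (\(\bar X_n=0.56\), \(n=200\)) is made up, and 3.841 is the usual 5% critical value of \(\chi^2_1\).

```python
CHI2_1_95 = 3.841  # 0.95 quantile of the chi-square distribution with 1 d.o.f.

def wald_stat(xbar, n, theta_null):
    """Wald statistic for H0: theta_0 = theta_null in a Bernoulli model."""
    f_hat = xbar - theta_null        # f(theta_hat), with theta_hat = xbar
    omega_hat = xbar * (1 - xbar)    # inverse Fisher information at theta_hat
    d_f = 1.0                        # D_theta f for f(theta) = theta - theta_null
    return n * f_hat * (1.0 / (d_f * omega_hat * d_f)) * f_hat

stat = wald_stat(xbar=0.56, n=200, theta_null=0.5)   # hypothetical data summary
reject = stat > CHI2_1_95
print(stat, reject)
```

The statistic evaluates the variance at the unrestricted estimate, which is what distinguishes it from the Score statistic in this model.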

6.4.3 Score (LM) test. The unrestricted \(\hat\theta_n\) satisfies \(D_\theta L_n(\hat\theta_n)=0\). Under \(H_0\) the restricted \(\tilde\theta_n\) should be close to \(\hat\theta_n\), so its score \(D_\theta L_n(\tilde\theta_n)\) should be close to zero. A mean value expansion ((6.14)–(6.18)) gives the limit of \(\sqrt n D_\theta L_n(\tilde\theta_n)\), a result of the type \(\sqrt n D_\theta L_n(\tilde\theta_n)\xrightarrow{d}N(0,B^{-1}A)\). The statistic (general form)

$$T_n^{\text{Score}}=n\,(D_\theta L_n(\tilde\theta_n))'\big(\hat\Omega_n\big)(D_\theta L_n(\tilde\theta_n))\xrightarrow{d}\chi^2_p$$

(the expression simplifies when \(k=p\), i.e. when the number of restrictions equals the number of parameters), with critical value \(\chi^2_{p,1-\alpha}\).
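The Score statistic can be worked out for the same kind of Bernoulli example, with \(k=p=1\) and \(H_0:\theta_0=\theta_{\text{null}}\), so that the restricted estimate is \(\tilde\theta_n=\theta_{\text{null}}\). The sketch assumes that \(\hat\Omega_n\) is evaluated at the restricted estimate, which is a common convention but is not stated in the notes; the data summary is made up.

```python
def avg_score(theta, xbar):
    """D_theta L_n(theta) for the Bernoulli average log-likelihood."""
    return xbar / theta - (1 - xbar) / (1 - theta)

def score_stat(xbar, n, theta_null):
    """Score (LM) statistic for H0: theta_0 = theta_null in a Bernoulli model."""
    s = avg_score(theta_null, xbar)                # score at the restricted estimate
    omega_tilde = theta_null * (1 - theta_null)    # assumed: Omega evaluated under H0
    return n * s * omega_tilde * s

stat = score_stat(xbar=0.56, n=200, theta_null=0.5)   # hypothetical data summary
print(stat)
```

Only the restricted estimate is needed, which is the practical appeal of the Score test when the unrestricted model is hard to fit. With these inputs the value differs slightly from the Wald statistic computed at the unrestricted estimate, a finite-sample gap that disappears asymptotically.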

6.4.4 Relationship between Wald and Score. Both have the same \(\chi^2_p\) limit and the statistics can be transformed into each other algebraically — they are exactly equal if assumptions (6.19)(6.20)(6.21) (\(\hat H_n=\tilde H_n\), \(\hat F_n=\tilde F_n\), etc.) hold. But in finite samples these assumptions usually do not hold exactly, so the two tests often differ numerically (though sharing the same asymptotic distribution).

6.4.5 Likelihood ratio (LR) test. When \(H_0\) is true, the restricted \(\tilde\theta_n\) and unrestricted \(\hat\theta_n\) should give similar likelihoods, so \(L_n(\hat\theta_n)-L_n(\tilde\theta_n)\) is small. Because \(L_n\) is the average log-likelihood (it carries the factor \(\frac1n\)), the gap must be scaled by \(n\). One can show

$$2n[L_n(\hat\theta_n)-L_n(\tilde\theta_n)]\xrightarrow{d}\chi^2_p$$

For a simple hypothesis (\(H_0:\theta=\theta_1\) vs \(H_1:\theta=\theta_2\)), the LR test is the uniformly most powerful (UMP) test by the Neyman-Pearson Lemma.
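A Bernoulli example shows the LR statistic in practice. With \(L_n\) defined as the average log-likelihood, the \(\chi^2_p\) approximation applies to \(2n\) times the likelihood gap, which is what the sketch below computes for \(H_0:\theta_0=\theta_{\text{null}}\). The data summary is made up for illustration.

```python
import math

def avg_loglik(theta, xbar):
    """Bernoulli average log-likelihood L_n(theta)."""
    return xbar * math.log(theta) + (1 - xbar) * math.log(1 - theta)

def lr_stat(xbar, n, theta_null):
    """2n [L_n(theta_hat) - L_n(theta_tilde)] for H0: theta_0 = theta_null."""
    theta_hat = xbar            # unrestricted ML estimate
    theta_tilde = theta_null    # restricted ML estimate under H0
    return 2 * n * (avg_loglik(theta_hat, xbar) - avg_loglik(theta_tilde, xbar))

stat = lr_stat(xbar=0.56, n=200, theta_null=0.5)   # hypothetical data summary
unscaled_gap = stat / 200                          # 2 [L_n(hat) - L_n(tilde)]
print(stat, unscaled_gap)
```

The unscaled gap is of order \(1/n\), which is why it cannot have a nondegenerate \(\chi^2\) limit on its own. The statistic is always nonnegative because the unrestricted maximum cannot be below the restricted one.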

Important

Chapter arc Likelihood → consistency → asymptotic normality → three tests. §6.1–6.2 define the (conditional) likelihood and ML estimator, stressing that ML may be non-unique, non-existent, or non-normal. §6.3's core three steps: Jensen proves the population log-likelihood \(L(\theta)\) is uniquely maximized at the truth (Lemma 6.1); combined with uniform convergence + well-separatedness gives consistency (Lemma 6.2); under misspecification ML finds the KL-divergence minimizer; asymptotically \(\sqrt n(\hat\theta_n-\theta_0)\xrightarrow{d}N(0,\Omega)\), simplified by the information matrix equality \(B=-A\) to the inverse Fisher information. §6.4's Wald / Score / LR tests are asymptotically equivalent (all \(\chi^2_p\)), built from three angles: "the unrestricted estimate's deviation from the restriction," "the restricted estimate's score," and "the likelihood gap between the two." This completes Part I (Empirical Analysis I); the next chapter begins Part II (Empirical Analysis II) with Bayesian inference.