12. Stochastic Processes

Jun He May 31, 2026

计量经济学Econometrics 随机过程Stochastic Process 遍历性Ergodicity 鞅分解Martingale Decomposition 大数定律Law of Large Numbers 中心极限定理Central Limit Theorem 似然过程Likelihood Process 学习笔记Study Note

Note

本章主题：随机过程。 §12.1 构造随机过程：从测度论起步——概率空间 $(\Omega,\mathcal F,\mathbb P)$、随机向量（可测函数）、可测变换 $\mathbb S$ 生成随机过程 $X_t=X(\mathbb S^t(\omega))$；保测（measure preserving）$\Rightarrow$ 平稳。§12.2 条件期望与大数定律：不变事件、按不变事件划分定义条件期望 $\mathbb E[X\mid\mathcal I]$、遍历（所有不变事件概率为 0 或 1）；Birkhoff 大数定律（定理 12.1）允许时间依赖；遍历 $\Rightarrow\mathbb E[X\mid\mathcal I]=\mathbb E[X]$；极限经验测度。§12.3 平稳遍历过程：可交换性；遍历分解 $\mathbb P(\Lambda)=\sum Q_r^j(\Lambda)\mathbb P(\Lambda_j)$（命题 12.2/12.3）；例 1 VAR（平稳遍历 vs 平稳不遍历）、例 2 有限态 Markov 链。§12.4 平稳增量过程：鞅；鞅分解 $Y_t=Y_0+t\nu+\sum M_j+(\tilde X_0-\tilde X_t)$——一般情形、Markov 差分情形（算子 $\mathbb T$、强收缩、$(\mathbb I-\mathbb T)^{-1}$）、VAR 情形；永久冲击 $(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)\mathbf W_{t+1}$；永久收入模型。§12.5 中心极限定理：Billingsley CLT $\frac1{\sqrt t}\sum M_j\xrightarrow{d}N(0,\mathbb E[M_j^2])$，$\mathbb E[M_t^2]=$ 零频谱密度 $s_x(0)$；协整。§12.6 显状态：似然过程：似然构造、似然比（正鞅）、对数似然比（上鞅）、MLE、得分过程（鞅，Fisher 信息）、冗余参数。§12.7 隐状态：递归学习：状态不可观测时的 regime switching 模型贝叶斯递归更新 $Q_{t+1}$。

Note

Chapter theme: stochastic processes. §12.1 Constructing a stochastic process: starting from measure theory — the probability space $(\Omega,\mathcal F,\mathbb P)$, random vectors (measurable functions), a measurable transformation $\mathbb S$ generating a process $X_t=X(\mathbb S^t(\omega))$; measure preserving $\Rightarrow$ stationary. §12.2 Conditional expectation and LLN: invariant events, conditional expectation $\mathbb E[X\mid\mathcal I]$ defined via the partition into invariant events, ergodicity (all invariant events have probability 0 or 1); the Birkhoff LLN (Theorem 12.1) allowing temporal dependence; ergodic $\Rightarrow\mathbb E[X\mid\mathcal I]=\mathbb E[X]$; limiting empirical measures. §12.3 Stationary and ergodic process: exchangeability; the ergodic decomposition $\mathbb P(\Lambda)=\sum Q_r^j(\Lambda)\mathbb P(\Lambda_j)$ (Propositions 12.2/12.3); Example 1 VAR (stationary-ergodic vs stationary-not-ergodic), Example 2 finite-state Markov chains. §12.4 Stationary increment process: martingales; the martingale decomposition $Y_t=Y_0+t\nu+\sum M_j+(\tilde X_0-\tilde X_t)$ — general case, Markov-difference case (operator $\mathbb T$, strong contraction, $(\mathbb I-\mathbb T)^{-1}$), VAR case; the permanent shock $(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)\mathbf W_{t+1}$; the permanent income model. §12.5 CLT: the Billingsley CLT $\frac1{\sqrt t}\sum M_j\xrightarrow{d}N(0,\mathbb E[M_j^2])$, with $\mathbb E[M_t^2]$ the zero-frequency spectral density $s_x(0)$; cointegration. §12.6 Revealed states: the likelihood process: likelihood construction, likelihood ratio (positive martingale), log-likelihood ratio (supermartingale), MLE, the score process (a martingale, Fisher information), nuisance parameters. §12.7 Hidden states: recursive learning: Bayesian recursive updating of $Q_{t+1}$ in the regime-switching model when the state is unobservable.

12.1 Constructing a Stochastic Process

12.1.1 Sample Space and Probability Measure

$\omega\in\Omega$ 为样本空间 $\Omega$ 中的样本点。
$\mathcal F$ 是 $\Omega$ 上的 $\sigma$-代数。
$(\Omega,\mathcal F)$ 为可测空间，$\mathcal F$ 中元素称可测集。
函数 $\mathbb P:\mathcal F\to[0,1]$ 为概率测度，满足：
非负性：$\forall E\in\mathcal F$，$\mathbb P(E)\ge0$。
空集：$\mathbb P(\varnothing)=0$。
可数可加性（$\sigma$-可加性）：对 $\mathcal F$ 中任意两两不交的可数集族 $\{E_k\}_{k=1}^\infty$，$\mathbb P\big(\bigcup_{k=1}^\infty E_k\big)=\sum_{k=1}^\infty\mathbb P(E_k)$。
单位归一化：$\mathbb P(\Omega)=1$。
$(\Omega,\mathcal F,\mathbb P)$ 为带概率测度的测度空间，也称概率空间。（详见附录 23。）

$\omega\in\Omega$ is a sample point in the sample space $\Omega$.
$\mathcal F$ is a $\sigma$-algebra over $\Omega$.
$(\Omega,\mathcal F)$ is a measurable space, with elements of $\mathcal F$ called measurable sets.
A function $\mathbb P:\mathcal F\to[0,1]$ is a probability measure with the properties:
Non-negativity: $\forall E\in\mathcal F$, $\mathbb P(E)\ge0$.
Null set: $\mathbb P(\varnothing)=0$.
Countable additivity ($\sigma$-additivity): for any countable collection $\{E_k\}_{k=1}^\infty$ of pairwise disjoint sets in $\mathcal F$, $\mathbb P\big(\bigcup_{k=1}^\infty E_k\big)=\sum_{k=1}^\infty\mathbb P(E_k)$.
Unit normalization: $\mathbb P(\Omega)=1$.
$(\Omega,\mathcal F,\mathbb P)$ is a measure space with a probability measure, also called a probability space. (See Appendix 23.)

12.1.2 Random Vector

Important

定义 12.1（随机向量与可测函数） $n$ 维随机向量是函数 $X:\Omega\to\mathbb R^n$，满足 $$\{X\in\mathfrak b\}\equiv X^{-1}(\mathfrak b)\equiv\{\omega\in\Omega:\ X(\omega)\in\mathfrak b\}\in\mathcal F\quad\forall\mathfrak b\in\mathcal B(\mathbb R^n)$$ 其中 $\mathcal B(\mathbb R^n)$ 是 $\mathbb R^n$ 上的 Borel $\sigma$-代数（含所有 Borel 集），$\mathfrak b$ 为任意 Borel 集。由于 $\{X\in\mathfrak b\}\in\mathcal F$，$X$ 也称可测函数或测量函数。赋给 $\{X\in\mathfrak b\}$ 的概率测度即赋给 $\{\omega:X(\omega)\in\mathfrak b\}$ 的概率测度。

Tip

注记 12.1 每个 $\mathbb R^n$ 中的随机向量对应一个抽象样本空间中的抽象样本点。随机性与概率属于该对应的抽象样本点，它抽象、可能不可观测；测量函数把随机性与概率转给随机向量，后者在现实中可观测——这正是"测量函数"之名的由来。于是"随机向量 $X$ 落在 $\mathbb R^n$ 某子空间 $\mathfrak b$"的事件，等价定义为"样本点 $\omega$ 落在样本空间相应子空间 $X^{-1}(\mathfrak b)$"的事件。

Important

Definition 12.1 (Random vector and measurable function) An $n$-dimensional random vector is a function $X:\Omega\to\mathbb R^n$ such that $$\{X\in\mathfrak b\}\equiv X^{-1}(\mathfrak b)\equiv\{\omega\in\Omega:\ X(\omega)\in\mathfrak b\}\in\mathcal F\quad\forall\mathfrak b\in\mathcal B(\mathbb R^n)$$ where $\mathcal B(\mathbb R^n)$ is the Borel $\sigma$-algebra over $\mathbb R^n$ (containing all Borel sets) and $\mathfrak b$ is any Borel set. Since $\{X\in\mathfrak b\}\in\mathcal F$, $X$ is also called a measurable function or measurement function. The probability measure assigned to $\{X\in\mathfrak b\}$ is the probability measure assigned to $\{\omega:X(\omega)\in\mathfrak b\}$.

Tip

Remark 12.1 Each random vector in $\mathbb R^n$ corresponds to an abstract sample point in the abstract sample space. The randomness and probability belong to that corresponding sample point, which is abstract and may not be observable; the measurement function gives the randomness and probability to the random vector, which is observable in the real world — hence the name. So the event that a random vector $X$ lies in a subspace $\mathfrak b$ of $\mathbb R^n$ is equivalently defined as the event that the sample point $\omega$ lies in the corresponding subspace $X^{-1}(\mathfrak b)$ of the sample space.

12.1.3 Stochastic Process

Important

定义 12.2（可测变换） $\mathbb S:\Omega\to\Omega$ 称为可测变换，若对 $\forall\Lambda\in\mathcal F$， $$\mathbb S^{-1}(\Lambda)\equiv\{\omega:\ \mathbb S(\omega)\in\Lambda\}\in\mathcal F$$

Tip

注记 12.2 我们总是看原像（$\Omega$ 的子集）来判定可测性。若原像是 $\mathcal F$ 中元素，就说变换 $\mathbb S$（或随机向量 $X$）可测。另注 $\mathbb S$ 是确定性映射，本身没有任何随机性。

Important

定义 12.3（随机过程）设 $\mathbb S:\Omega\to\Omega$ 为可测变换。一个 $n$ 维随机过程是 $n$ 维随机向量的无穷序列 $\{X_t\}_{t=0}^\infty$，其中 $$X_t=X\big(\mathbb S^t(\omega)\big)$$ 且 $X_0=X(\mathbb S^0(\omega))=X(\omega)$。

Important

定义 12.4（联合分布）考虑随机向量的向量 $$\mathbf X_{t,\tau}(\omega)=\begin{bmatrix}X_t(\omega)\\X_{t+1}(\omega)\\\vdots\\X_{t+\tau}(\omega)\end{bmatrix},\quad\forall t$$ 即对随机向量 $X_t$ 堆叠 $\tau+1$ 个相继观测。如前，$\mathbf X_{t,\tau}:\Omega\to\mathbb R^{(\tau+1)n}$ 为可测函数（$\forall\mathfrak b\in\mathcal B(\mathbb R^{(\tau+1)n})$，$\mathbf X_{t,\tau}^{-1}(\mathfrak b)\in\mathcal F$）。则 $\mathbf X_{t,\tau}(\omega)$ 的联合分布由 $$\mathbb P(\{\mathbf X_{t,\tau}(\omega)\in\mathfrak b\})\equiv\mathbb P\big(\mathbf X_{t,\tau}^{-1}(\mathfrak b)\big)\equiv\mathbb P(\{\omega\in\Omega:\ \mathbf X_{t,\tau}(\omega)\in\mathfrak b\})$$ 给出。

Important

Definition 12.2 (Measurable transformation) $\mathbb S:\Omega\to\Omega$ is a measurable transformation if for all $\Lambda\in\mathcal F$, $$\mathbb S^{-1}(\Lambda)\equiv\{\omega:\ \mathbb S(\omega)\in\Lambda\}\in\mathcal F$$

Tip

Remark 12.2 We always look at the pre-image (a subset of $\Omega$) to determine measurability. If the pre-image is an element in $\mathcal F$, we say the transformation $\mathbb S$ (or random vector $X$) is measurable. Also note $\mathbb S$ is a deterministic mapping with nothing random at all.

Important

Definition 12.3 (Stochastic process) Let $\mathbb S:\Omega\to\Omega$ be a measurable transformation. An $n$-dimensional stochastic process is an infinite sequence of $n$-dimensional random vectors $\{X_t\}_{t=0}^\infty$ where $$X_t=X\big(\mathbb S^t(\omega)\big)$$ with $X_0=X(\mathbb S^0(\omega))=X(\omega)$.

Important

Definition 12.4 (Joint distribution) Consider a vector of random vectors $$\mathbf X_{t,\tau}(\omega)=\begin{bmatrix}X_t(\omega)\\X_{t+1}(\omega)\\\vdots\\X_{t+\tau}(\omega)\end{bmatrix},\quad\forall t$$ stacking $\tau+1$ consecutive observations of the random vector $X_t$. As before, $\mathbf X_{t,\tau}:\Omega\to\mathbb R^{(\tau+1)n}$ is a measurable function (for all $\mathfrak b\in\mathcal B(\mathbb R^{(\tau+1)n})$, $\mathbf X_{t,\tau}^{-1}(\mathfrak b)\in\mathcal F$). Then the joint distribution of $\mathbf X_{t,\tau}(\omega)$ is given by $$\mathbb P(\{\mathbf X_{t,\tau}(\omega)\in\mathfrak b\})\equiv\mathbb P\big(\mathbf X_{t,\tau}^{-1}(\mathfrak b)\big)\equiv\mathbb P(\{\omega\in\Omega:\ \mathbf X_{t,\tau}(\omega)\in\mathfrak b\})$$

12.1.4 Stationary Stochastic Processes

Important

定义 12.5（保测）称 $(\mathbb S,\mathbb P)$ 保测（measure preserving），若对 $\forall\Lambda\in\mathcal F$， $$\mathbb P\big(\mathbb S^{-1}(\Lambda)\big)=\mathbb P(\Lambda)$$

Tip

注记 12.3 保测意即：样本点落在子集 $\Lambda$ 的概率，等于经一次变换 $\mathbb S$ 后落在 $\Lambda$ 的概率，即 $\mathbb P(\{\omega:\mathbb S(\omega)\in\Lambda\})=\mathbb P(\Lambda)$。条件用原像 $\mathbb S^{-1}(\Lambda)$ 而非像 $\mathbb S(\Lambda)$ 表述，是因为 $\mathbb S$ 未必一一对应。直观上，$\mathbb S$ 从外部移入 $\Lambda$ 的概率质量恰好等于它从 $\Lambda$ 移出的概率质量。

Important

定义 12.6（平稳随机过程）当 $(\mathbb S,\mathbb P)$ 保测时，$X_t$ 的分布对 $\forall t$ 相同，于是称随机过程 $\{X_t\}_{t=0}^\infty$ 平稳。

Note

证明（保测 $\Rightarrow$ 平稳）考虑 $X_t$ 与 $X_{t+1}$。对 $\forall\mathfrak b\in\mathcal B(\mathbb R^n)$， $$\begin{aligned}X_t&=X(\mathbb S^t(\omega))\\\Rightarrow\mathbb P(\{X_t\in\mathfrak b\})&=\mathbb P(\{\omega:X(\mathbb S^t(\omega))\in\mathfrak b\})=\mathbb P(\{\omega:\mathbb S^t(\omega)\in X^{-1}(\mathfrak b)\})=\mathbb P(\mathbb S^{-t}(X^{-1}(\mathfrak b)))\end{aligned}$$ 类似地 $$\mathbb P(\{X_{t+1}\in\mathfrak b\})=\mathbb P(\mathbb S^{-t}(\mathbb S^{-1}(X^{-1}(\mathfrak b))))\overset{\text{m.p.}}{=}\mathbb P(\mathbb S^{-t}(X^{-1}(\mathfrak b)))=\mathbb P(\{X_t\in\mathfrak b\})$$ 由归纳，$\mathbb P(\{X_t\in\mathfrak b\})$ 对 $\forall t$、$\forall\mathfrak b$ 相同，故 $\{X_t\}_{t=0}^\infty$ 平稳。$\blacksquare$

Tip

注记 12.4 平稳随机过程 $\{X_t\}_{t=0}^\infty$ 也意味着（堆叠的）随机过程 $\{\mathbf X_{t,\tau}\}_{t=0}^\infty$ 平稳。

Important

Definition 12.5 (Measure preserving) We say $(\mathbb S,\mathbb P)$ is measure preserving if, for all $\Lambda\in\mathcal F$, $$\mathbb P\big(\mathbb S^{-1}(\Lambda)\big)=\mathbb P(\Lambda)$$

Tip

Remark 12.3 Measure preserving means that the probability of being in a subset $\Lambda$ equals the probability of landing in $\Lambda$ after one application of $\mathbb S$, that is, $\mathbb P(\{\omega:\mathbb S(\omega)\in\Lambda\})=\mathbb P(\Lambda)$. The condition is stated through the pre-image $\mathbb S^{-1}(\Lambda)$ rather than the image $\mathbb S(\Lambda)$ because $\mathbb S$ need not be one-to-one. Intuitively, the probability mass that $\mathbb S$ moves into $\Lambda$ from outside exactly offsets the mass that it moves out of $\Lambda$.

Important

Definition 12.6 (Stationary stochastic process) When $(\mathbb S,\mathbb P)$ is measure preserving, the distribution of $X_t$ is identical for all $t$, so we say the stochastic process $\{X_t\}_{t=0}^\infty$ is stationary.

Note

Proof (measure preserving $\Rightarrow$ stationary) Consider $X_t$ and $X_{t+1}$. For all $\mathfrak b\in\mathcal B(\mathbb R^n)$, $$\begin{aligned}X_t&=X(\mathbb S^t(\omega))\\\Rightarrow\mathbb P(\{X_t\in\mathfrak b\})&=\mathbb P(\{\omega:X(\mathbb S^t(\omega))\in\mathfrak b\})=\mathbb P(\{\omega:\mathbb S^t(\omega)\in X^{-1}(\mathfrak b)\})=\mathbb P(\mathbb S^{-t}(X^{-1}(\mathfrak b)))\end{aligned}$$ Similarly $$\mathbb P(\{X_{t+1}\in\mathfrak b\})=\mathbb P(\mathbb S^{-t}(\mathbb S^{-1}(X^{-1}(\mathfrak b))))\overset{\text{m.p.}}{=}\mathbb P(\mathbb S^{-t}(X^{-1}(\mathfrak b)))=\mathbb P(\{X_t\in\mathfrak b\})$$ By induction, $\mathbb P(\{X_t\in\mathfrak b\})$ is identical for all $t$ and all $\mathfrak b$, so $\{X_t\}_{t=0}^\infty$ is stationary. $\blacksquare$

Tip

Remark 12.4 A stationary stochastic process $\{X_t\}_{t=0}^\infty$ also implies the (stacked) stochastic process $\{\mathbf X_{t,\tau}\}_{t=0}^\infty$ is stationary.
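To make Definition 12.5 and the proof above concrete, here is a minimal sketch on a toy four-point sample space. Everything in it is a made-up illustration: the rotation `S`, the uniform and skewed measures, and the measurement function `X` are assumptions, not objects from the text. It checks $\mathbb P(\mathbb S^{-1}(\Lambda))=\mathbb P(\Lambda)$ for every event and compares the distribution of $X_t$ across $t$.

```python
from fractions import Fraction
from itertools import combinations

omega_space = [0, 1, 2, 3]          # toy sample space (assumption)


def S(w):
    """Deterministic transformation: rotate one step around the circle."""
    return (w + 1) % 4


def X(w):
    """Toy measurement function with values in {0, 1}."""
    return w % 2


def prob(event, weights):
    return sum((weights[w] for w in event), Fraction(0))


def preimage(event):
    return frozenset(w for w in omega_space if S(w) in event)


def all_events():
    for size in range(len(omega_space) + 1):
        for subset in combinations(omega_space, size):
            yield frozenset(subset)


def is_measure_preserving(weights):
    """Check P(S^{-1}(E)) == P(E) for every event E."""
    return all(prob(preimage(E), weights) == prob(E, weights)
               for E in all_events())


def dist_X_t(t, weights):
    """Distribution of X_t = X(S^t(omega)) under the given measure."""
    dist = {0: Fraction(0), 1: Fraction(0)}
    for w in omega_space:
        v = w
        for _ in range(t):
            v = S(v)
        dist[X(v)] += weights[w]
    return dist


uniform = {w: Fraction(1, 4) for w in omega_space}
skewed = {0: Fraction(1, 2), 1: Fraction(1, 6),
          2: Fraction(1, 6), 3: Fraction(1, 6)}

print(is_measure_preserving(uniform), is_measure_preserving(skewed))
print([dist_X_t(t, uniform) for t in range(3)])
print([dist_X_t(t, skewed) for t in range(3)])
```

Under the uniform measure the rotation is measure preserving and the distribution of $X_t$ does not change with $t$. Under the skewed measure the same deterministic $\mathbb S$ is not measure preserving, and the distribution of $X_t$ moves with $t$.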

12.1.5 A Specific Case of Constructing a Stochastic Process

设 $\omega=(\mathbf r_0,\mathbf r_1,\dots)_{k\times1}$ 为样本空间 $\Omega\subseteq\mathbb R^{k\times n}$ 中的样本点（$\mathbf r_j\in\mathbb R^n$，$\forall j=0,1,2,\dots$）。
$\mathbb S:\Omega\to\Omega$ 为可测变换，对 $\omega=(\mathbf r_0,\mathbf r_1,\dots)_{k\times1}$ 满足 $$\mathbb S(\omega)\equiv(\mathbf r_1,\mathbf r_2,\dots)_{k\times1},\quad\mathbb S^2(\omega)\equiv(\mathbf r_2,\mathbf r_3,\dots)_{k\times1},\quad\dots,\quad\mathbb S^l(\omega)\equiv(\mathbf r_l,\mathbf r_{l+1},\dots)_{k\times1}$$ 即 $\mathbb S$ 是"向前移一步"映射。
$X:\Omega\to\mathbb R^n$ 为测量函数，其像取样本点向量的第一个元素： $$X_0(\omega)=X(\omega)=\mathbf r_0,\quad X_1(\omega)=X(\mathbb S(\omega))=\mathbf r_1,\quad\dots,\quad X_l(\omega)=X(\mathbb S^l(\omega))=\mathbf r_l$$ 于是可定义 $\{X_t\}_{t=0}^\infty$ 为随机过程。

Let $\omega=(\mathbf r_0,\mathbf r_1,\dots)_{k\times1}$ be a sample point in the sample space $\Omega\subseteq\mathbb R^{k\times n}$ ($\mathbf r_j\in\mathbb R^n$ for all $j=0,1,2,\dots$).
$\mathbb S:\Omega\to\Omega$ is a measurable transformation s.t. for $\omega=(\mathbf r_0,\mathbf r_1,\dots)_{k\times1}$, $$\mathbb S(\omega)\equiv(\mathbf r_1,\mathbf r_2,\dots)_{k\times1},\quad\mathbb S^2(\omega)\equiv(\mathbf r_2,\mathbf r_3,\dots)_{k\times1},\quad\dots,\quad\mathbb S^l(\omega)\equiv(\mathbf r_l,\mathbf r_{l+1},\dots)_{k\times1}$$ i.e. $\mathbb S$ is a "shifting-one-step-ahead" mapping.
$X:\Omega\to\mathbb R^n$ is a measurement function whose image is the first element in the sample point vector: $$X_0(\omega)=X(\omega)=\mathbf r_0,\quad X_1(\omega)=X(\mathbb S(\omega))=\mathbf r_1,\quad\dots,\quad X_l(\omega)=X(\mathbb S^l(\omega))=\mathbf r_l$$ Then we can define $\{X_t\}_{t=0}^\infty$ as a stochastic process.
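The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a sample point is represented by a finite truncation of the infinite sequence, and the names `shift` and `measure` and the numbers in `omega` are hypothetical.

```python
def shift(omega, steps=1):
    """Apply S^steps: drop the first `steps` entries of the sequence."""
    return omega[steps:]


def measure(omega):
    """Measurement function X: return the first entry of the sequence."""
    return omega[0]


# Finite truncation of one sample point omega = (r_0, r_1, ...) with n = 2.
# The numbers are made up for illustration.
omega = ((0.1, 1.0), (0.2, 2.0), (0.3, 3.0), (0.4, 4.0), (0.5, 5.0))

# X_t(omega) = X(S^t(omega)) reads off r_t.
path = [measure(shift(omega, t)) for t in range(4)]
print(path)
```

Once `omega` is fixed, nothing in `path` is random: all randomness sits in which sample point is drawn, which is the point made in Remark 12.2.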

12.2 Conditional Expectation and Law of Large Numbers

12.2.1 Invariant Events

Important

定义 12.7（不变事件）称 $\Lambda\in\mathcal F$ 为不变事件，若 $\mathbb S^{-1}(\Lambda)=\Lambda$。

Tip

注记 12.5 $\mathbb S^{-1}(\Lambda)=\Lambda$ 意味着没有从不变事件外部进入的概率。另一方面，$\mathbb S^{-1}(\Lambda)=\Lambda\Rightarrow\mathbb S(\Lambda)=\mathbb S(\mathbb S^{-1}(\Lambda))\subseteq\Lambda$，即也没有离开的概率：从不变事件内部任一样本点出发、经 $\mathbb S$ 变换，总落回同一不变事件内。注意 $\mathbb S^{-1}(\Lambda)=\Lambda$（无离开、无进入）比 $\mathbb S(\Lambda)\subseteq\Lambda$（仅无离开）更强。$\Omega$ 与 $\varnothing$ 都是不变事件。不变事件是相对 $\mathbb S$ 定义的概念，与概率测度 $\mathbb P$ 无关。

Important

Definition 12.7 (Invariant event) We say $\Lambda\in\mathcal F$ is an invariant event if $\mathbb S^{-1}(\Lambda)=\Lambda$.

Tip

Remark 12.5 $\mathbb S^{-1}(\Lambda)=\Lambda$ means that no probability enters an invariant event from points outside it. There is also no exiting: since $\mathbb S^{-1}(\Lambda)=\Lambda\Rightarrow\mathbb S(\Lambda)=\mathbb S(\mathbb S^{-1}(\Lambda))\subseteq\Lambda$, starting from any sample point inside an invariant event and applying $\mathbb S$ always ends at a sample point inside the same invariant event. Note that $\mathbb S^{-1}(\Lambda)=\Lambda$ (no exiting and no entering) is stronger than $\mathbb S(\Lambda)\subseteq\Lambda$ (only no exiting). Both $\Omega$ and $\varnothing$ are invariant events. An invariant event is a notion defined w.r.t. $\mathbb S$ and has nothing to do with the probability measure $\mathbb P$.
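A small enumeration can illustrate the difference between "no entering and no exiting" and "no exiting only". The six-point space and the map `S` below are hypothetical choices made for this sketch; point 5 is deliberately sent into the cycle $\{3,4\}$ so that $\mathbb S$ is not one-to-one.

```python
from itertools import combinations

# Toy transformation on {0,...,5}: 0 -> 1 -> 2 -> 0, 3 <-> 4, and 5 -> 3.
S = {0: 1, 1: 2, 2: 0, 3: 4, 4: 3, 5: 3}
points = sorted(S)


def preimage(event):
    """S^{-1}(event): points that S sends into the event."""
    return frozenset(w for w in points if S[w] in event)


def image(event):
    """S(event): where S sends the points of the event."""
    return frozenset(S[w] for w in event)


def subsets():
    for size in range(len(points) + 1):
        for combo in combinations(points, size):
            yield frozenset(combo)


# Invariant events in the sense of Definition 12.7.
invariant = [E for E in subsets() if preimage(E) == E]

# Events that nothing leaves (S(E) inside E) but that are not invariant.
closed_not_invariant = [E for E in subsets()
                        if image(E) <= E and preimage(E) != E]

print([sorted(E) for E in invariant])
print(sorted({3, 4}), frozenset({3, 4}) in closed_not_invariant)
```

Here $\{3,4\}$ is never left, but point 5 enters it from outside, so it is not invariant; the invariant event containing it is $\{3,4,5\}$. Note that no probability measure appears anywhere in the code.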

12.2.2 Conditional Expectation

设 $\{\Lambda_j\}_{j=1}^\infty$ 为 $\Omega$ 的可数划分，其中每个 $\Lambda_j$ 是不变事件。令 $\mathcal I=\{\Lambda_1,\Lambda_2,\dots\}$ 为划分 $\{\Lambda_j\}_{j=1}^\infty$ 中所有元素的集合。则随机向量 $X$ 关于 $\omega\in\Lambda_j$ 的条件期望为 $$\mathbb E[X\mid\Lambda_j]=\int_{\Lambda_j}\frac{X(\omega)\,d\mathbb P}{\mathbb P(\{\Lambda_j\})}$$ 而 $X$ 关于划分 $\mathcal I$ 的条件期望是 $\omega$ 的函数： $$\mathbb E[X\mid\mathcal I](\omega)=\int_{\Lambda_j}\frac{X(\omega)\,d\mathbb P}{\mathbb P(\{\Lambda_j\})}\quad\text{if }\omega\in\Lambda_j$$

（划分满足 $\Lambda_i\cap\Lambda_j=\varnothing$（$i\ne j$）、$\bigcup_{j=1}^\infty\Lambda_j=\Omega$、每个 $\Lambda_j$ 非空。由于 $\Omega$ 与 $\varnothing$ 总是不变事件，这样的划分总能构造。）

Let $\{\Lambda_j\}_{j=1}^\infty$ denote a countable partition of $\Omega$ where each $\Lambda_j$ is an invariant event. Let $\mathcal I=\{\Lambda_1,\Lambda_2,\dots\}$ be a collection of all the elements in the partition $\{\Lambda_j\}_{j=1}^\infty$. Then the expectation of a random vector $X$ conditional on $\omega\in\Lambda_j$ is $$\mathbb E[X\mid\Lambda_j]=\int_{\Lambda_j}\frac{X(\omega)\,d\mathbb P}{\mathbb P(\{\Lambda_j\})}$$ and the expectation of $X$ conditional on the partition $\mathcal I$ is a function of $\omega$: $$\mathbb E[X\mid\mathcal I](\omega)=\int_{\Lambda_j}\frac{X(\omega)\,d\mathbb P}{\mathbb P(\{\Lambda_j\})}\quad\text{if }\omega\in\Lambda_j$$

(The partition satisfies $\Lambda_i\cap\Lambda_j=\varnothing$ ($i\ne j$), $\bigcup_{j=1}^\infty\Lambda_j=\Omega$, and each $\Lambda_j$ is non-empty. Since $\Omega$ and $\varnothing$ are always invariant events, such a partition can always be obtained.)
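As a worked instance of the two formulas above, the sketch below computes $\mathbb E[X\mid\Lambda_j]$ and the function $\mathbb E[X\mid\mathcal I](\omega)$ on a six-point sample space. The point masses `P`, the values `X`, and the two-block partition are made-up numbers chosen for illustration.

```python
from fractions import Fraction as F

# Toy probability measure on {0,...,5} and a scalar random variable X.
P = {0: F(1, 10), 1: F(2, 10), 2: F(1, 10),
     3: F(3, 10), 4: F(2, 10), 5: F(1, 10)}
X = {0: 1, 1: 2, 2: 3, 3: 10, 4: 20, 5: 30}

# Assumed partition of the sample space into two invariant events.
partition = [frozenset({0, 1, 2}), frozenset({3, 4, 5})]


def prob(event):
    return sum((P[w] for w in event), F(0))


def cond_exp_given(block):
    """E[X | block] = (integral of X over block) / P(block)."""
    return sum((X[w] * P[w] for w in block), F(0)) / prob(block)


def cond_exp(omega):
    """E[X | I](omega): constant on each block of the partition."""
    for block in partition:
        if omega in block:
            return cond_exp_given(block)
    raise ValueError("omega is not in any block of the partition")


unconditional = sum((X[w] * P[w] for w in P), F(0))
iterated = sum((prob(b) * cond_exp_given(b) for b in partition), F(0))

print([cond_exp(w) for w in sorted(P)])
print(unconditional, iterated)
```

The conditional expectation is a step function of $\omega$, and averaging it over the blocks with weights $\mathbb P(\Lambda_j)$ recovers $\mathbb E[X]$.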

12.2.3 Ergodic Distribution

Important

定义 12.8（遍历分布）保测的 $(\mathbb S,\mathbb P)$ 称遍历（ergodic），若所有不变事件的概率为 0 或 1。

Tip

注记 12.6 总能把 $\Omega$ 划分为两两不交的不变事件。若每个不变事件概率为 0 或 1，则我们将"出发于、停留于、最终落于"那个概率为 1 的唯一不变事件。

练习 12.1（有限态 Markov 链） 如何用 $(\mathbb S,\mathbb P)$ 定义有限态（一阶）Markov 链？

先想样本点 $\omega$ 与样本空间 $\Omega$。设 Markov 链有 $k$ 个状态 $S\equiv\{s_1,s_2,\dots,s_k\}$。令样本点 $\omega_i=(s_{i_1},s_{i_2},s_{i_3},\dots)$ 为状态的无穷序列；对 $j=1,2,\dots$，$i_j\in\{1,2,\dots,k\}$，即 $s_{i_j}\in S=\{s_1,\dots,s_k\}$。样本空间 $\Omega=\{\omega_i=(s_{i_1},s_{i_2},\dots):s_{i_j}\in S\}$；$\Omega$ 元素的多样性来自每个位置 $s_{i_j}$ 从 $S$ 任取的自由。
$\mathbb S$ 的定义：$\mathbb S:\Omega\to\Omega$ 可测变换，对 $\omega_i=(s_{i_1},s_{i_2},s_{i_3},\dots)$ 满足 $\mathbb S^l(\omega_i)\equiv(s_{i_{l+1}},s_{i_{l+2}},\dots)$，即"向前移一步"映射；因 $\omega_i$ 无穷长，这是从 $\Omega$ 到自身的（确定性）映射。
$X:\Omega\to S$ 为测量函数，像取样本点向量的第一个元素（一个状态）：$X(\omega_i)=s_{i_1}$、$X(\mathbb S(\omega_i))=s_{i_2}$、…、$X(\mathbb S^l(\omega_i))=s_{i_{l+1}}$。则定义序列 $\{X_t(\omega_i)\}_{t=0}^\infty$。
$\mathbf X_{l+1}(\omega_i)$ 为 $\{X_t\}_{t=0}^\infty$ 前 $l+1$ 个元素构成的 $(l+1)\times1$ 向量。因 $\mathbb S$ 确定，给定 $\omega_i$ 时序列 $\{X_t(\omega_i)\}$ 与 $\mathbf X_{l+1}(\omega_i)$ 都确定、毫无随机性；$\mathbf X_{l+1}(\omega_i)$ 可解读为该 Markov 链前 $l+1$ 步实现状态的向量。
$\mathbb P$ 的定义：赋给 $\mathbf X_{l+1}(\omega_i)=(s_{i_1},s_{i_2},\dots,s_{i_{l+1}})'$ 的概率可两种算法：
用 $(\mathbb S,\mathbb P)$：直接赋给 $\mathbf X_{l+1}(\omega_i)$ 的概率 $\mathbb P(\{\omega_i:X(\mathbb S^j(\omega_i))=s_{i_{j+1}},\forall j=0,1,\dots,l\})$，即把所有前 $l+1$ 个元素与 $\mathbf X_{l+1}(\omega_i)$ 一致的样本点概率累加。
用转移矩阵 $\mathbf T$（$p_{s_a s_b}$ 为从 $s_a$ 转到 $s_b$ 的概率）：给定初始分布 $q_0=(q^0_{s_1},q^0_{s_2},\dots,q^0_{s_k})$，$\mathbb P$ 定义为使 $$\mathbb P(\{\omega_i:X(\mathbb S^j(\omega_i))=s_{i_{j+1}},\forall j=0,1,\dots,l\})=q^0_{s_{i_1}}\prod_{m=1}^l p_{s_{i_m}s_{i_{m+1}}}$$ 对任意 $l\ge0$ 与任意索引序列 $(i_1,i_2,\dots,i_{l+1})$（$i_j\in\{1,2,\dots,k\}$）成立的测度。

练习 12.2 用 $(\mathbb S,\mathbb P)$ 定义的遍历性与用转移矩阵 $\mathbf T$ 定义的遍历性有何关系？

$(\mathbb S,\mathbb P)$ 的遍历性：给定初始分布 $q_0$ 与转移矩阵 $\mathbf T$ 即可恢复测度 $\mathbb P$，并按练习 12.1 定义 $\mathbb S$；若 $\mathbb P$ 对 $\mathbb S$ 下所有不变事件赋概率 0 或 1，则 $(\mathbb S,\mathbb P)$ 遍历。
转移矩阵 $\mathbf T$ 的遍历性：从任意状态（任意初始分布）出发，$\forall s_j\in S$ 在有限步内被到达的概率严格为正。
两者关系：$(\mathbb S,\mathbb P)$ 的遍历依赖初始分布，$\mathbf T$ 的遍历不依赖（因构造等价测度 $\mathbb P$ 依赖初始分布 $q_0$，是否遍历确实取决于 $q_0$）。$(\mathbb S,\mathbb P)$ 的遍历允许状态空间 $S$ 有多个不变事件（作为 $S$ 的子集），而 $\mathbf T$ 的遍历要求唯一不变事件就是 $S$。所以给定 $\mathbf T$，若 $(\mathbb S,\mathbb P)$ 对任意 $q_0$ 都遍历，则它是 $\mathbf T$ 遍历的必要非充分条件——还需"无暂态"。否则可能唯一不变事件（相对 $\mathbb S$）是整个状态空间，但存在暂态：此时 $(\mathbb S,\mathbb P)$ 对任意 $q_0$ 仍遍历，但 $\mathbf T$ 不遍历。

Important

Definition 12.8 (Ergodic distribution) A measure-preserving $(\mathbb S,\mathbb P)$ is ergodic if all invariant events have probability either 0 or 1.

Tip

Remark 12.6 We can always partition $\Omega$ into pairwise disjoint invariant events. If every invariant event has probability either 0 or 1, we would start within, stay within, and end up within the only invariant event assigned probability 1.

Exercise 12.1 (Finite-state Markov chain). How is $(\mathbb S,\mathbb P)$ defined for a finite-state (first-order) Markov chain?

First consider the sample point $\omega$ and sample space $\Omega$. Suppose there are $k$ states $S\equiv\{s_1,s_2,\dots,s_k\}$. Let $\omega_i=(s_{i_1},s_{i_2},s_{i_3},\dots)$ be a sample point (a state sequence with infinite terms); for $j=1,2,\dots$, $i_j\in\{1,2,\dots,k\}$, i.e. $s_{i_j}\in S=\{s_1,\dots,s_k\}$. The sample space is $\Omega=\{\omega_i=(s_{i_1},s_{i_2},\dots):s_{i_j}\in S\}$; the diversity of elements in $\Omega$ comes from the arbitrary choice of $s_{i_j}$ from $S$ for each position.
Definition of $\mathbb S$: $\mathbb S:\Omega\to\Omega$ is a measurable transformation s.t. for $\omega_i=(s_{i_1},s_{i_2},s_{i_3},\dots)$, $\mathbb S^l(\omega_i)\equiv(s_{i_{l+1}},s_{i_{l+2}},\dots)$, a "shifting-one-step-ahead" mapping; since $\omega_i$ has infinite length, this is a (deterministic) mapping from $\Omega$ to itself.
$X:\Omega\to S$ is a measurement function whose image is the first element in the sample point vector (a state): $X(\omega_i)=s_{i_1}$, $X(\mathbb S(\omega_i))=s_{i_2}$, …, $X(\mathbb S^l(\omega_i))=s_{i_{l+1}}$. Then we define a sequence $\{X_t(\omega_i)\}_{t=0}^\infty$.
$\mathbf X_{l+1}(\omega_i)$ is the $(l+1)\times1$ vector of the first $l+1$ elements in $\{X_t\}_{t=0}^\infty$. Since $\mathbb S$ is deterministic, given $\omega_i$ both $\{X_t(\omega_i)\}$ and $\mathbf X_{l+1}(\omega_i)$ are deterministic and not random at all; $\mathbf X_{l+1}(\omega_i)$ can be interpreted as a vector of realized states in the first $l+1$ steps of that Markov chain.
Definition of $\mathbb P$: the probability assigned to $\mathbf X_{l+1}(\omega_i)=(s_{i_1},s_{i_2},\dots,s_{i_{l+1}})'$ can be computed two ways:
using $(\mathbb S,\mathbb P)$: the probability assigned to $\mathbf X_{l+1}(\omega_i)$ by $(\mathbb S,\mathbb P)$, $\mathbb P(\{\omega_i:X(\mathbb S^j(\omega_i))=s_{i_{j+1}},\forall j=0,1,\dots,l\})$, adding up all probabilities of sample points whose first $l+1$ elements coincide with $\mathbf X_{l+1}(\omega_i)$.
using the transition matrix $\mathbf T$ (with $p_{s_a s_b}$ the probability of moving from $s_a$ to $s_b$): given an initial distribution $q_0=(q^0_{s_1},q^0_{s_2},\dots,q^0_{s_k})$, $\mathbb P$ is defined as a measure such that $$\mathbb P(\{\omega_i:X(\mathbb S^j(\omega_i))=s_{i_{j+1}},\forall j=0,1,\dots,l\})=q^0_{s_{i_1}}\prod_{m=1}^l p_{s_{i_m}s_{i_{m+1}}}$$ holds for every $l\ge0$ and every index sequence $(i_1,i_2,\dots,i_{l+1})$ with $i_j\in\{1,2,\dots,k\}$.
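The transition-matrix route to $\mathbb P$ can be sketched as follows. The two states, the initial distribution `q0`, and the entries of `T` are hypothetical numbers; the function returns the probability of the set of sample points whose first entries match a given finite path.

```python
from fractions import Fraction as F
from itertools import product

states = ("s1", "s2")

# Assumed initial distribution and transition probabilities p[(a, b)].
q0 = {"s1": F(1, 4), "s2": F(3, 4)}
T = {("s1", "s1"): F(9, 10), ("s1", "s2"): F(1, 10),
     ("s2", "s1"): F(1, 2), ("s2", "s2"): F(1, 2)}


def path_probability(path):
    """q0 of the first state times the product of one-step transitions."""
    prob = q0[path[0]]
    for a, b in zip(path, path[1:]):
        prob *= T[(a, b)]
    return prob


example = path_probability(("s1", "s1", "s2"))

# Probabilities of all paths of a given length add up to one.
total = sum(path_probability(p) for p in product(states, repeat=3))

# Consistency: summing out the last state gives the shorter path's probability.
marginal = sum(path_probability(("s1", "s1", s)) for s in states)

print(example, total, marginal == path_probability(("s1", "s1")))
```

The last check is the consistency across path lengths that a measure on infinite sequences needs to satisfy; the sketch only verifies it for one short path.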

Exercise 12.2. How should we understand the relationship between ergodicity defined for $(\mathbb S,\mathbb P)$ and that for the transition matrix $\mathbf T$?

Ergodicity for $(\mathbb S,\mathbb P)$: given an initial distribution $q_0$ and transition matrix $\mathbf T$ we can recover the measure $\mathbb P$, and define $\mathbb S$ as in Exercise 12.1; if $\mathbb P$ assigns probability either 0 or 1 to all invariant events under $\mathbb S$, then $(\mathbb S,\mathbb P)$ is ergodic.
Ergodicity for $\mathbf T$: starting from any state (any initial distribution), $\forall s_j\in S$ has a strictly positive probability of being reached within finite steps.
The relationship: ergodicity for $(\mathbb S,\mathbb P)$ does rely on the initial distribution, while ergodicity for $\mathbf T$ doesn't (since constructing the equivalent measure $\mathbb P$ depends on $q_0$, whether it is ergodic does depend on $q_0$). Ergodicity for $(\mathbb S,\mathbb P)$ allows the state space $S$ to have multiple invariant events (as subsets of $S$), while ergodicity for $\mathbf T$ requires that the only invariant event is $S$. So given $\mathbf T$, if $(\mathbb S,\mathbb P)$ is ergodic for any arbitrary $q_0$, then it is a necessary but not sufficient condition for $\mathbf T$ to be ergodic — we also need the condition that there is no transient state. Otherwise it is possible that the only invariant event w.r.t. $\mathbb S$ is the whole state space but there are transient states: then $(\mathbb S,\mathbb P)$ is still ergodic for any arbitrary $q_0$, but $\mathbf T$ is not ergodic.
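The condition stated above for $\mathbf T$ (every state reached with strictly positive probability in finitely many steps from any state) can be checked mechanically. The sketch below does so with a transitive closure of the "one-step reachable" relation; the function name `is_irreducible` and the three example matrices are assumptions made for illustration.

```python
def is_irreducible(T):
    """True if every state is reachable from every state in finitely many steps."""
    k = len(T)
    # reach[a][b]: b can be reached from a in one step with positive probability.
    reach = [[T[a][b] > 0 for b in range(k)] for a in range(k)]
    # Warshall's algorithm: allow paths through intermediate state m.
    for m in range(k):
        for a in range(k):
            for b in range(k):
                reach[a][b] = reach[a][b] or (reach[a][m] and reach[m][b])
    return all(reach[a][b] for a in range(k) for b in range(k))


# All states communicate.
T_communicating = [[0.9, 0.1],
                   [0.5, 0.5]]

# Two absorbing states: the chain never moves between them.
T_two_blocks = [[1.0, 0.0],
                [0.0, 1.0]]

# State 0 is transient: once the chain reaches state 1 it never returns.
T_transient = [[0.5, 0.5],
               [0.0, 1.0]]

for name, T in [("communicating", T_communicating),
                ("two blocks", T_two_blocks),
                ("transient", T_transient)]:
    print(name, is_irreducible(T))
```

The third matrix is the case singled out in the discussion above: a transient state makes the check on $\mathbf T$ fail regardless of the initial distribution.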

12.2.4 Law of Large Numbers

可把通常要求 i.i.d. 随机向量的大数定律推广到允许跨期依赖的情形。

Important

定理 12.1（Birkhoff 大数定律）设 $(\mathbb S,\mathbb P)$ 保测，则 1.（几乎必然收敛，从而也依概率收敛）对任意满足 $\mathbb E[|X|]<\infty$ 的 $X$， $$\frac1N\sum_{t=1}^N X_t(\omega)\xrightarrow{\text{a.s.}}\mathbb E[X\mid\mathcal I](\omega)$$ 2.（均方收敛）对任意满足 $\mathbb E[|X|^2]<\infty$ 的 $X$， $$\mathbb E\Big[\Big|\frac1N\sum_{t=1}^N X_t(\omega)-\mathbb E[X\mid\mathcal I](\omega)\Big|^2\Big]\to0\quad\text{as }N\to\infty$$

Important

命题 12.1 设 $(\mathbb S,\mathbb P)$ 遍历，则 $\mathbb E[X\mid\mathcal I]=\mathbb E[X]$。

Note

证明不失一般性，设 $\mathbb P(\{\Lambda_j\})=1$、$\forall i\ne j$ 有 $\mathbb P(\{\Lambda_i\})=0$。则 $$\mathbb E[X]=\sum_{k=1}^\infty\Big(\int_{\Lambda_k}X(\omega)\,d\mathbb P\Big)\overset{\text{erg}}{=}\int_{\Lambda_j}X(\omega)\,d\mathbb P=\int_{\Lambda_j}\frac{X(\omega)\,d\mathbb P}1=\int_{\Lambda_j}\frac{X(\omega)\,d\mathbb P}{\mathbb P(\{\Lambda_j\})}=\mathbb E[X\mid\mathcal I]$$ $\blacksquare$

Tip

注记 12.7（Birkhoff 大数定律的直观）它直观告诉我们：当你出发于某不变事件，你无法离开它（这正是条件大数定律依赖于你从哪个 $\omega$ 出发的原因）。即便数据 $X_t(\omega)$ 相关，由于平稳，相关性会随期数间隔变大而衰减、可忽略。序列足够长时，相邻数据彼此相关，而 Birkhoff 大数定律对那些相关子序列起作用；几乎有无穷多这样的子序列，它们极限里有相同的均值，加总仍生成同一极限均值，即被揭示的真实均值。若 $(\mathbb S,\mathbb P)$ 遍历，则只有唯一可被到达的不变事件，条件期望即无条件期望（条件作用的空间变为整个空间）。

We can extend our usual version of the Law of Large Numbers (requiring i.i.d. random vectors) to one that allows inter-temporal dependency.

Important

Theorem 12.1 (Birkhoff Law of Large Numbers) Suppose $(\mathbb S,\mathbb P)$ is measure preserving. Then 1. (Almost sure convergence, which implies convergence in probability) For any $X$ such that $\mathbb E[|X|]<\infty$, $$\frac1N\sum_{t=1}^N X_t(\omega)\xrightarrow{\text{a.s.}}\mathbb E[X\mid\mathcal I](\omega)$$ 2. (Mean-square convergence) For any $X$ such that $\mathbb E[|X|^2]<\infty$, $$\mathbb E\Big[\Big|\frac1N\sum_{t=1}^N X_t(\omega)-\mathbb E[X\mid\mathcal I](\omega)\Big|^2\Big]\to0\quad\text{as }N\to\infty$$

Important

Proposition 12.1 Suppose $(\mathbb S,\mathbb P)$ is ergodic. Then $\mathbb E[X\mid\mathcal I]=\mathbb E[X]$.

Note

Proof WLOG, assume $\mathbb P(\{\Lambda_j\})=1$ and for all $i\ne j$, $\mathbb P(\{\Lambda_i\})=0$. Then $$\mathbb E[X]=\sum_{k=1}^\infty\Big(\int_{\Lambda_k}X(\omega)\,d\mathbb P\Big)\overset{\text{erg}}{=}\int_{\Lambda_j}X(\omega)\,d\mathbb P=\int_{\Lambda_j}\frac{X(\omega)\,d\mathbb P}1=\int_{\Lambda_j}\frac{X(\omega)\,d\mathbb P}{\mathbb P(\{\Lambda_j\})}=\mathbb E[X\mid\mathcal I]$$ $\blacksquare$

Tip

Remark 12.7 (Intuition of the Birkhoff LLN) When you start within an invariant event, you cannot leave it, which is why the limit in the conditional LLN depends on which $\omega$ you start from. The data $X_t(\omega)$ may be correlated over time, but within an invariant event that cannot be split further, the dependence between observations far apart in time averages out. A long sequence can then be viewed as many blocks of neighboring observations: observations are correlated within a block, blocks far apart are only weakly related, and all blocks share the same mean in the limit, so averaging across blocks reveals that common mean. If $(\mathbb S,\mathbb P)$ is ergodic, only one invariant event with positive probability can ever be visited, so the conditional expectation coincides with the unconditional one (the conditioning set is effectively the whole space).
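A deterministic toy example can make the dependence of the limit on the starting point visible. In the sketch below the six-point space, the two-cycle map `S`, the uniform measure, and the values of `X` are all assumptions chosen so that the arithmetic is exact; it is not a simulation of any model in the text.

```python
from fractions import Fraction as F

# Two disjoint cycles: 0 -> 1 -> 2 -> 0 and 3 -> 4 -> 5 -> 3.
S = {0: 1, 1: 2, 2: 0, 3: 4, 4: 5, 5: 3}
X = {0: 1, 1: 2, 2: 3, 3: 10, 4: 20, 5: 30}
P = {w: F(1, 6) for w in S}          # uniform, preserved by S


def time_average(omega, N):
    """(1/N) * sum_{t=1}^{N} X(S^t(omega))."""
    total, w = 0, omega
    for _ in range(N):
        w = S[w]
        total += X[w]
    return F(total, N)


def conditional_expectation(omega):
    """E[X | I](omega): average of X over the invariant cycle containing omega."""
    orbit, w = {omega}, S[omega]
    while w not in orbit:
        orbit.add(w)
        w = S[w]
    mass = sum((P[v] for v in orbit), F(0))
    return sum((X[v] * P[v] for v in orbit), F(0)) / mass


unconditional = sum((X[w] * P[w] for w in S), F(0))

for start in (0, 4):
    print(start, time_average(start, 3000), conditional_expectation(start))
print(unconditional)
```

Starting in the first cycle the time average settles at that cycle's mean, starting in the second cycle it settles at the other mean, and neither equals $\mathbb E[X]$, because this $(\mathbb S,\mathbb P)$ has two invariant events of positive probability and is therefore not ergodic.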

12.2.5 Limiting Empirical Measures

在概率空间 $(\Omega,\mathcal F,\mathbb P)$ 中，$(\mathbb S,\mathbb P)$ 保测、$\{\Lambda_j\}_{j=1}^\infty$ 为 $\Omega$ 的可数划分（每个 $\Lambda_j$ 不变），并设该划分不可再细分：任何非空真子集 $\hat\Lambda\subsetneq\Lambda_j$ 都不是不变事件。定义指示函数 $$\mathbf 1_\Lambda(\omega)=\begin{cases}1&\text{if }\omega\in\Lambda\\0&\text{otherwise}\end{cases}$$

Important

定义 12.9（极限经验测度）对 $\forall\Lambda\in\mathcal F$、$\forall\omega\in\Lambda_j$，极限经验测度 $Q_r^j:\mathcal F\to[0,1]$ 定义为 $$Q_r^j(\Lambda)\equiv\lim_{N\to\infty}\frac1N\sum_{t=1}^N\mathbf 1_\Lambda\big(\mathbb S^t(\omega)\big)$$

Note

推导（由大数定律）由定理 12.1（设 $\mathbb P(\Lambda_j)>0$）， $$\begin{aligned}Q_r^j(\Lambda)&\equiv\lim_{N\to\infty}\frac1N\sum_{t=1}^N\mathbf 1_\Lambda(\mathbb S^t(\omega))\overset{\text{LLN}}{=}\mathbb E[\mathbf 1_\Lambda(\mathbb S^t(\omega))\mid\mathcal I](\omega)\\&=\int_{\Lambda_j}\frac{\mathbf 1_\Lambda(\mathbb S^t(\omega))\,d\mathbb P}{\mathbb P(\{\omega\in\Lambda_j\})}=\frac{\int_\Omega\mathbf 1_\Lambda(\mathbb S^t(\omega))\mathbf 1_{\Lambda_j}(\mathbb S^t(\omega))\,d\mathbb P}{\mathbb P(\{\omega\in\Lambda_j\})}\\&=\frac{\mathbb E[\mathbf 1_\Lambda(\mathbb S^t(\omega))\mathbf 1_{\Lambda_j}(\mathbb S^t(\omega))]}{\mathbb P(\{\omega\in\Lambda_j\})}=\frac{\mathbb P(\{\mathbb S^t(\omega)\in\Lambda\cap\Lambda_j\})}{\mathbb P(\{\omega\in\Lambda_j\})}\overset{\text{stat}}{=}\frac{\mathbb P(\{\omega\in\Lambda\cap\Lambda_j\})}{\mathbb P(\{\omega\in\Lambda_j\})}\\\Rightarrow Q_r^j(\Lambda)&=\frac{\mathbb P(\Lambda\cap\Lambda_j)}{\mathbb P(\Lambda_j)}\end{aligned}$$ 其中第二行用到 $\Lambda_j$ 不变，故 $\mathbf 1_{\Lambda_j}(\omega)=\mathbf 1_{\Lambda_j}(\mathbb S^t(\omega))$。又注意 $Q_r^j(\Lambda_j)=\frac{\mathbb P(\Lambda_j\cap\Lambda_j)}{\mathbb P(\Lambda_j)}=1$。所以极限经验测度对 $\Lambda_j$ 赋概率 1、对 $\Lambda_i$（$\forall i\ne j$）赋 0。故划分中每个不变事件的 $Q_r^j$ 概率为 0 或 1，$(\mathbb S,Q_r^j)$ 是遍历的；$\{Q_r^j\}_{j=1}^\infty$ 称为遍历构件。$\blacksquare$

由此得 $$\mathbb P(\Lambda)=\sum_{j=1}^\infty\mathbb P(\Lambda_j)Q_r^j(\Lambda)\tag{12.1}$$ 其中每个 $Q_r^j$ 解读为遍历构件。

Note

证明（12.1） $$\sum_{j=1}^\infty\mathbb P(\Lambda_j)Q_r^j(\Lambda)=\sum_{j=1}^\infty\mathbb P(\Lambda_j)\frac{\mathbb P(\Lambda\cap\Lambda_j)}{\mathbb P(\Lambda_j)}=\sum_{j=1}^\infty\mathbb P(\Lambda\cap\Lambda_j)=\mathbb P(\Lambda)$$ （末步因 $\{\Lambda_j\}$ 是 $\Omega$ 的划分。）$\blacksquare$

In the probability space $(\Omega,\mathcal F,\mathbb P)$ with $(\mathbb S,\mathbb P)$ measure preserving and $\{\Lambda_j\}_{j=1}^\infty$ a countable partition of $\Omega$ (each $\Lambda_j$ invariant), assume the partition cannot be refined: no non-empty proper subset $\hat\Lambda\subsetneq\Lambda_j$ is an invariant event. Define the indicator function $$\mathbf 1_\Lambda(\omega)=\begin{cases}1&\text{if }\omega\in\Lambda\\0&\text{otherwise}\end{cases}$$

Important

Definition 12.9 (Limiting empirical measure) For all $\Lambda\in\mathcal F$ and all $\omega\in\Lambda_j$, the limiting empirical measure $Q_r^j:\mathcal F\to[0,1]$ is defined by $$Q_r^j(\Lambda)\equiv\lim_{N\to\infty}\frac1N\sum_{t=1}^N\mathbf 1_\Lambda\big(\mathbb S^t(\omega)\big)$$

Note

Derivation (from the LLN) By Theorem 12.1 (assuming $\mathbb P(\Lambda_j)>0$), $$\begin{aligned}Q_r^j(\Lambda)&\equiv\lim_{N\to\infty}\frac1N\sum_{t=1}^N\mathbf 1_\Lambda(\mathbb S^t(\omega))\overset{\text{LLN}}{=}\mathbb E[\mathbf 1_\Lambda(\mathbb S^t(\omega))\mid\mathcal I](\omega)\\&=\int_{\Lambda_j}\frac{\mathbf 1_\Lambda(\mathbb S^t(\omega))\,d\mathbb P}{\mathbb P(\{\omega\in\Lambda_j\})}=\frac{\int_\Omega\mathbf 1_\Lambda(\mathbb S^t(\omega))\mathbf 1_{\Lambda_j}(\mathbb S^t(\omega))\,d\mathbb P}{\mathbb P(\{\omega\in\Lambda_j\})}\\&=\frac{\mathbb E[\mathbf 1_\Lambda(\mathbb S^t(\omega))\mathbf 1_{\Lambda_j}(\mathbb S^t(\omega))]}{\mathbb P(\{\omega\in\Lambda_j\})}=\frac{\mathbb P(\{\mathbb S^t(\omega)\in\Lambda\cap\Lambda_j\})}{\mathbb P(\{\omega\in\Lambda_j\})}\overset{\text{stat}}{=}\frac{\mathbb P(\{\omega\in\Lambda\cap\Lambda_j\})}{\mathbb P(\{\omega\in\Lambda_j\})}\\\Rightarrow Q_r^j(\Lambda)&=\frac{\mathbb P(\Lambda\cap\Lambda_j)}{\mathbb P(\Lambda_j)}\end{aligned}$$ where the second line uses the invariance of $\Lambda_j$, so that $\mathbf 1_{\Lambda_j}(\omega)=\mathbf 1_{\Lambda_j}(\mathbb S^t(\omega))$. Note also $Q_r^j(\Lambda_j)=\frac{\mathbb P(\Lambda_j\cap\Lambda_j)}{\mathbb P(\Lambda_j)}=1$. So the limiting empirical measure assigns probability 1 to $\Lambda_j$ and 0 to $\Lambda_i$ for all $i\ne j$. Thus every invariant event in the partition has $Q_r^j$-probability 0 or 1, and $(\mathbb S,Q_r^j)$ is ergodic; the measures $\{Q_r^j\}_{j=1}^\infty$ are called ergodic building blocks. $\blacksquare$

This gives $$\mathbb P(\Lambda)=\sum_{j=1}^\infty\mathbb P(\Lambda_j)Q_r^j(\Lambda)\tag{12.1}$$ where each $Q_r^j$ is interpreted as an ergodic building block.

Note

Proof (12.1) $$\sum_{j=1}^\infty\mathbb P(\Lambda_j)Q_r^j(\Lambda)=\sum_{j=1}^\infty\mathbb P(\Lambda_j)\frac{\mathbb P(\Lambda\cap\Lambda_j)}{\mathbb P(\Lambda_j)}=\sum_{j=1}^\infty\mathbb P(\Lambda\cap\Lambda_j)=\mathbb P(\Lambda)$$ (the last step because $\{\Lambda_j\}$ is a partition of $\Omega$.) $\blacksquare$
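To make Definition 12.9 and (12.1) concrete, here is a minimal sketch on a toy space that is not from the text: six points with the uniform measure and a permutation `S` made of two cycles. The names `S`, `PARTITION`, `prob` and `time_average` are illustrative assumptions. Because the dynamics are periodic, a finite time average over a multiple of the cycle lengths already equals the limit.

```python
from fractions import Fraction

# Hypothetical toy example (not from the text): Omega = {0,...,5} with uniform P.
# S is a permutation with two cycles, so each cycle is an invariant event with
# no nonempty proper invariant subset.
S = {0: 1, 1: 2, 2: 3, 3: 0, 4: 5, 5: 4}
PARTITION = [{0, 1, 2, 3}, {4, 5}]          # Lambda_1, Lambda_2


def prob(event):
    """P(event) under the uniform measure on the six points."""
    return Fraction(len(event), len(S))


def time_average(event, omega, n_steps=12):
    """(1/N) * sum_{t=1}^{N} 1_event(S^t(omega)).

    N = 12 is a multiple of both cycle lengths, so the finite average
    already equals its N -> infinity limit in this toy example.
    """
    hits, state = 0, omega
    for _ in range(n_steps):
        state = S[state]
        hits += state in event
    return Fraction(hits, n_steps)


event = {0, 1, 2, 4}                        # an arbitrary Lambda
q = [time_average(event, min(block)) for block in PARTITION]
ratio = [prob(event & block) / prob(block) for block in PARTITION]
total = sum(prob(block) * q_j for block, q_j in zip(PARTITION, q))

print(q == ratio)            # time averages vs. P(Lambda ∩ Lambda_j) / P(Lambda_j)
print(total == prob(event))  # decomposition (12.1)
```

Exact rational arithmetic is used so that the two identities can be compared with `==` rather than with a tolerance.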

12.3 Stationary and Ergodic Process

12.3.1 Exchangeability

若序列 $\{\mathbf x_t:t=0,1,\dots\}$ 中各成员的联合分布不依赖它们出现的时间顺序，则称该随机过程可交换。形式化如下。

Important

定义 12.10（置换）时间索引的置换是非负整数集到自身的一对一映射。

Important

定义 12.11（可交换性）随机向量序列 $\{\mathbf x_t:t=0,1,2,\dots\}$ 称可交换，若对任意只改变有限个索引（其余索引保持不变）的置换 $\xi(0),\xi(1),\dots$，序列 $\{\mathbf x_{\xi(0)},\mathbf x_{\xi(1)},\dots\}$ 诱导的联合概率分布与原序列诱导的联合分布相等。

Tip

注记 12.8 可交换是极强、且常不合适的假设——例如在时间序列情境下。下文我们常把平稳（对时间序列可成立）与可交换对比。

A stochastic process is said to be exchangeable if the joint distributions of members of the sequence $\{\mathbf x_t:t=0,1,\dots\}$ do not depend on their order of appearance. This is formalized below.

Important

Definition 12.10 (Permutation) A permutation of the time index is a one-to-one mapping of the set of non-negative integers into itself.

Important

Definition 12.11 (Exchangeability) A sequence of random vectors $\{\mathbf x_t:t=0,1,2,\dots\}$ is exchangeable if, for every permutation $\xi(0),\xi(1),\dots$ that keeps all but a finite number of indices fixed, the joint probability distribution induced by $\{\mathbf x_{\xi(0)},\mathbf x_{\xi(1)},\dots\}$ equals the one induced by the original sequence.

Tip

Remark 12.8 Exchangeability is an extremely strong, and often inappropriate, assumption — for example in time-series contexts. In the discussion that follows, we will often contrast stationarity, a statement that we can make about time series, with exchangeability.

12.3.2 Ergodic Decomposition

前文从测度空间 $(\Omega,\mathcal F,\mathbb P)$ 出发，研究可测保测变换 $\mathbb S$ 的性质，把 $\Omega$ 划分为可数不变事件族 $\mathcal I=\{\Lambda_1,\Lambda_2,\dots\}$（$\Lambda_i\cap\Lambda_k=\varnothing$，$\bigcup_j\Lambda_j=\Omega$）。则由 (12.1)，对 $\forall\Lambda\in\mathcal F$， $$\mathbb P(\Lambda)=\sum_{j=1}^\infty\mathbb P(\Lambda_j\cap\Lambda)=\sum_{j=1}^\infty\Big(\frac{\mathbb P(\Lambda_j\cap\Lambda)}{\mathbb P(\Lambda_j)}\Big)\mathbb P(\Lambda_j)=\sum_{j=1}^\infty Q_r^j(\Lambda)\mathbb P(\Lambda_j)$$ 其中 $Q_r^j(\Lambda_j)=1$、$Q_r^j(\Lambda_k)=0$（$k\ne j$）。故每个遍历构件 $Q_r^j$ 对唯一对应的不变事件 $\Lambda_j$ 赋概率 1，它们的选择对各不变事件互不重叠。这就是 $\mathcal I$ 的遍历分解。

Important

命题 12.2 $(\mathbb S,Q_r^j)$ 保测且对 $\forall j$ 遍历。

Note

证明 保测： 任取 $\Lambda\in\mathcal F$。由 $(\mathbb S,\mathbb P)$ 保测， $$Q_r^j(\Lambda)=\frac{\mathbb P(\Lambda_j\cap\Lambda)}{\mathbb P(\Lambda_j)}=\frac{\mathbb P(\mathbb S^{-1}(\Lambda_j\cap\Lambda))}{\mathbb P(\Lambda_j)}=\frac{\mathbb P(\Lambda_j\cap\mathbb S^{-1}(\Lambda))}{\mathbb P(\Lambda_j)}=Q_r^j(\mathbb S^{-1}(\Lambda))$$ 第二个等号用 $(\mathbb S,\mathbb P)$ 保测；第三个用原像与交集可交换以及 $\Lambda_j$ 是不变事件：$\mathbb S^{-1}(\Lambda_j\cap\Lambda)=\mathbb S^{-1}(\Lambda_j)\cap\mathbb S^{-1}(\Lambda)=\Lambda_j\cap\mathbb S^{-1}(\Lambda)$。故 $(\mathbb S,Q_r^j)$ 保测。 遍历： 须证 $Q_r^j$ 对每个不变事件赋的概率为 0 或 1。任取不变事件 $\tilde\Lambda_k$（属于某族 $\tilde{\mathcal I}=\{\tilde\Lambda_1,\tilde\Lambda_2,\dots\}$），则 $$Q_r^j(\tilde\Lambda_k)=\frac{\mathbb P(\Lambda_j\cap\tilde\Lambda_k)}{\mathbb P(\Lambda_j)}$$ 两个不变事件之交 $\Lambda_j\cap\tilde\Lambda_k$ 仍是不变事件，且含于 $\Lambda_j$。而划分 $\{\Lambda_j\}$ 的取法保证 $\Lambda_j$ 的任何非空真子集 $\hat\Lambda\subset\Lambda_j$ 都不是不变事件，故要么 $\Lambda_j\cap\tilde\Lambda_k=\varnothing$，要么 $\Lambda_j\subseteq\tilde\Lambda_k$：前者 $Q_r^j(\tilde\Lambda_k)=0$，后者 $Q_r^j(\tilde\Lambda_k)=\frac{\mathbb P(\Lambda_j)}{\mathbb P(\Lambda_j)}=1$。故 $(\mathbb S,Q_r^j)$ 遍历。$\blacksquare$

现在反向构造。我们想用一个 $\mathbb S$ 来研究一组使 $(\mathbb S,\tilde{\mathbb P})$ 保测的概率测度 $\tilde{\mathbb P}$。$Q_r^j$ 是用"遍历构件"构造的，它们让 $(\mathbb S,\tilde{\mathbb P})$ 保测。为引出此构造，先给更一般命题，再用可数情形类比。

Important

命题 12.3 可用 $\mathcal Q$ 上的概率测度 $\pi$ 构造任意概率测度 $\tilde{\mathbb P}\in\mathcal P$： $$\tilde{\mathbb P}(\Lambda)=\int_{\mathcal Q}Q_r(\Lambda)\,\pi(dQ_r)$$ 对 $\forall\Lambda\in\mathcal F$，其中 $\mathcal P\equiv\Delta(\Omega)$ 是 $\Omega$ 上所有概率测度的空间、$\mathcal Q$ 含所有由不变划分 $\mathcal I$ 经上文定义的 $\mathbb S$-不变测度 $Q_r$。则 $(\mathbb S,\tilde{\mathbb P})$ 保测。

把上述思想搬到可数设定：$Q_r^j$（$j=1,2,\dots$）如定义 12.9 所给，取任意概率 $\pi_j\ge0$（$\sum_{j=1}^\infty\pi_j=1$），构造 $$\tilde{\mathbb P}(\Lambda)=\sum_{j=1}^\infty Q_r^j(\Lambda)\pi_j$$ 则 $\tilde{\mathbb P}$ 保测，但通常不遍历。

Note

证明（保测但通常不遍历）由命题 12.2，$Q_r^j$ 对 $\forall j$ 保测，故 $$\tilde{\mathbb P}(\mathbb S^{-1}(\Lambda))=\sum_{j=1}^\infty Q_r^j(\mathbb S^{-1}(\Lambda))\pi_j=\sum_{j=1}^\infty Q_r^j(\Lambda)\pi_j=\tilde{\mathbb P}(\Lambda)$$ 即 $\tilde{\mathbb P}$ 保测。但 $\tilde{\mathbb P}$ 通常不遍历，因 $\pi_j$ 任意，可能对某不变事件 $\Lambda$ 赋严格在 0 与 1 之间的概率。例：设 $Q_r^j(\Lambda)=1$、$Q_r^k(\Lambda)=0$，取 $\pi_j=0.5$、$\pi_k=0.5$、$\pi_i=0$（$\forall i\notin\{j,k\}$），则 $\tilde{\mathbb P}(\Lambda)=0.5\in(0,1)$。$\blacksquare$

Tip

注记 12.9 遍历分解之所以重要，是因为它展示了如何构造让 $\mathbb S$ 保测的备选概率测度 $\tilde{\mathbb P}$。每个由不变事件 $\Lambda_j$ 唯一决定的遍历测度 $Q_r^j$ 是 $\tilde{\mathbb P}$ 所用的遍历构件，但 $\tilde{\mathbb P}$ 还可对底层不变事件 $\Lambda_j$ 自由赋权重 $\pi_j$——这是我们手上有的自由度。于是可在样本空间 $\Omega$（$\mathbb S$ 保测于其上）创造一族概率测度 $\tilde{\mathbb P}$。若想建立关于平稳概率测度 $\tilde{\mathbb P}$ 的统计模型，可分两步：第一步确定 $\mathbb P$ 即遍历分布（由遍历分布性质，每个 $Q_r^j$ 可由长数据序列在条件大数定律下揭示）；第二步确定赋给各 $\Lambda_j$ 的概率 $\pi_j$（大数定律与长数据序列对揭示不变事件无能为力——不变事件互不沟通）。故各不变事件的相对重要性与权重，是 $\tilde{\mathbb P}$ 模型中主观性的来源。

In the analysis above, we started with the measure space $(\Omega,\mathcal F,\mathbb P)$ and studied properties of measure-preserving transformations $\mathbb S$, partitioning $\Omega$ into a countable collection of invariant events $\mathcal I=\{\Lambda_1,\Lambda_2,\dots\}$ ($\Lambda_i\cap\Lambda_k=\varnothing$ for $i\ne k$, $\bigcup_j\Lambda_j=\Omega$). Then by (12.1), for any $\Lambda\in\mathcal F$, $$\mathbb P(\Lambda)=\sum_{j=1}^\infty\mathbb P(\Lambda_j\cap\Lambda)=\sum_{j=1}^\infty\Big(\frac{\mathbb P(\Lambda_j\cap\Lambda)}{\mathbb P(\Lambda_j)}\Big)\mathbb P(\Lambda_j)=\sum_{j=1}^\infty Q_r^j(\Lambda)\mathbb P(\Lambda_j)$$ where $Q_r^j(\Lambda_j)=1$ and $Q_r^j(\Lambda_k)=0$ ($k\ne j$). So each ergodic building block $Q_r^j$ assigns probability 1 to exactly one invariant event $\Lambda_j$ of the partition, and distinct building blocks concentrate on disjoint events. This is the ergodic decomposition of $\mathbb P$ with respect to $\mathcal I$.

Important

Proposition 12.2 $(\mathbb S,Q_r^j)$ is measure preserving and ergodic for all $j$.

Note

Proof Measure preserving: consider an arbitrary set $\Lambda\in\mathcal F$. Since $(\mathbb S,\mathbb P)$ is measure preserving, $$Q_r^j(\Lambda)=\frac{\mathbb P(\Lambda_j\cap\Lambda)}{\mathbb P(\Lambda_j)}=\frac{\mathbb P(\mathbb S^{-1}(\Lambda_j\cap\Lambda))}{\mathbb P(\Lambda_j)}=\frac{\mathbb P(\Lambda_j\cap\mathbb S^{-1}(\Lambda))}{\mathbb P(\Lambda_j)}=Q_r^j(\mathbb S^{-1}(\Lambda))$$ The second equality uses that $(\mathbb S,\mathbb P)$ is measure preserving. The third uses that preimages commute with intersections and that $\Lambda_j$ is an invariant event: $\mathbb S^{-1}(\Lambda_j\cap\Lambda)=\mathbb S^{-1}(\Lambda_j)\cap\mathbb S^{-1}(\Lambda)=\Lambda_j\cap\mathbb S^{-1}(\Lambda)$. So $(\mathbb S,Q_r^j)$ is measure preserving. Ergodic: we must show that $Q_r^j$ assigns probability zero or one to every invariant event. Let $\tilde\Lambda_k$ be an arbitrary invariant event, a member of some collection $\tilde{\mathcal I}=\{\tilde\Lambda_1,\tilde\Lambda_2,\dots\}$; then $$Q_r^j(\tilde\Lambda_k)=\frac{\mathbb P(\Lambda_j\cap\tilde\Lambda_k)}{\mathbb P(\Lambda_j)}$$ The intersection $\Lambda_j\cap\tilde\Lambda_k$ of two invariant events is itself invariant and is contained in $\Lambda_j$. Recall that the partition $\{\Lambda_j\}$ was chosen so that no nonempty proper subset $\hat\Lambda\subset\Lambda_j$ is an invariant event. So either $\Lambda_j\cap\tilde\Lambda_k=\varnothing$ or $\Lambda_j\subseteq\tilde\Lambda_k$. The former gives $Q_r^j(\tilde\Lambda_k)=0$, the latter $Q_r^j(\tilde\Lambda_k)=\frac{\mathbb P(\Lambda_j)}{\mathbb P(\Lambda_j)}=1$. Thus $(\mathbb S,Q_r^j)$ is ergodic. $\blacksquare$

Now we go in the opposite direction: start with a transformation $\mathbb S$ and study the set of probability measures $\tilde{\mathbb P}$ for which $(\mathbb S,\tilde{\mathbb P})$ is measure preserving. We interpret the $Q_r^j$'s constructed above as the "ergodic building blocks" of this set. To motivate the construction, we first state a more general proposition and then use its countable analog.

Important

Proposition 12.3 Let $\mathcal P\equiv\Delta(\Omega)$ be the space of all probability measures defined over $\Omega$, and let $\mathcal Q$ contain all the probability measures $Q_r$ defined above by an invariant (w.r.t. $\mathbb S$) partition $\mathcal I$ of $\Omega$. Given a probability measure $\pi$ over $\mathcal Q$, we can construct a probability measure $\tilde{\mathbb P}\in\mathcal P$ in the following way: $$\tilde{\mathbb P}(\Lambda)=\int_{\mathcal Q}Q_r(\Lambda)\,\pi(dQ_r)$$ for all $\Lambda\in\mathcal F$. Then $(\mathbb S,\tilde{\mathbb P})$ is measure preserving.

Replicating this idea in the countable set-up: with the $Q_r^j$'s ($j=1,2,\dots$) of Definition 12.9, take arbitrary probabilities $\pi_j\ge0$ ($\sum_{j=1}^\infty\pi_j=1$) and construct $$\tilde{\mathbb P}(\Lambda)=\sum_{j=1}^\infty Q_r^j(\Lambda)\pi_j$$ Then $\tilde{\mathbb P}$ is measure preserving but typically not ergodic.

Note

Proof (measure preserving but typically not ergodic) By Proposition 12.2, $Q_r^j$ is measure preserving for all $j$, so $$\tilde{\mathbb P}(\mathbb S^{-1}(\Lambda))=\sum_{j=1}^\infty Q_r^j(\mathbb S^{-1}(\Lambda))\pi_j=\sum_{j=1}^\infty Q_r^j(\Lambda)\pi_j=\tilde{\mathbb P}(\Lambda)$$ which implies $\tilde{\mathbb P}$ is measure preserving. But $\tilde{\mathbb P}$ is typically not ergodic: the $\pi_j$'s are arbitrary, so $\tilde{\mathbb P}$ may assign a probability strictly between zero and one to an invariant event $\Lambda$. For example, suppose $Q_r^j(\Lambda)=1$ and $Q_r^k(\Lambda)=0$; take $\pi_j=0.5$, $\pi_k=0.5$ and $\pi_i=0$ for all $i\notin\{j,k\}$. Then $\tilde{\mathbb P}(\Lambda)=0.5\in(0,1)$. $\blacksquare$
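The countable construction can be checked by brute force on a small example. The sketch below assumes a hypothetical six-point space with a two-cycle permutation `S`; the helper names `q`, `mixture` and `preimage` and the weights are illustrative choices, not objects from the text. It verifies that the mixture is measure preserving on every event and that an invariant event receives probability strictly between 0 and 1.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy example: S permutes Omega = {0,...,5} in two cycles.
S = {0: 1, 1: 2, 2: 3, 3: 0, 4: 5, 5: 4}
BLOCKS = [{0, 1, 2, 3}, {4, 5}]             # invariant events Lambda_1, Lambda_2


def q(j, event):
    """Ergodic building block Q^j: uniform probability on the j-th cycle."""
    return Fraction(len(event & BLOCKS[j]), len(BLOCKS[j]))


def mixture(weights, event):
    """P~(event) = sum_j pi_j * Q^j(event)."""
    return sum(w * q(j, event) for j, w in enumerate(weights))


def preimage(event):
    """S^{-1}(event) = {omega : S(omega) in event}."""
    return {omega for omega in S if S[omega] in event}


weights = [Fraction(1, 2), Fraction(1, 2)]  # pi_1, pi_2: a free modelling choice

# Measure preservation, checked on every one of the 2^6 events.
all_events = [set(c) for r in range(7) for c in combinations(range(6), r)]
preserved = all(mixture(weights, preimage(e)) == mixture(weights, e) for e in all_events)

# Lambda_1 is invariant, yet its probability is strictly between 0 and 1.
p_block = mixture(weights, BLOCKS[0])
print(preserved, preimage(BLOCKS[0]) == BLOCKS[0], p_block)
```

Changing `weights` to `[1, 0]` recovers the ergodic building block on the first cycle, which illustrates that ergodicity is lost only when more than one block carries positive weight.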

Tip

Remark 12.9 The ergodic decomposition is of interest because it shows how to construct alternative probability measures $\tilde{\mathbb P}$ for which $\mathbb S$ is measure preserving. Given the probability measure $\mathbb P$ used to define it, each ergodic measure $Q_r^j$ is uniquely determined by its invariant event $\Lambda_j$; these are the building blocks. The weights $\pi_j$ that $\tilde{\mathbb P}$ assigns to the invariant events $\Lambda_j$, by contrast, are free, and this is the latitude we have. Varying them creates a family of probability measures $\tilde{\mathbb P}$ over the whole space $\Omega$ for which $\mathbb S$ is measure preserving. If we want to build a statistical model of a stationary probability measure $\tilde{\mathbb P}$ over $\Omega$, we can proceed in two steps. First, determine the $\mathbb P$ that correctly sets up the $Q_r^j$'s; each $Q_r^j$ is ergodic, so it can be revealed by a long data series through the conditional Law of Large Numbers. Second, determine the probabilities $\pi_j$ assigned to the $\Lambda_j$'s. Here the Law of Large Numbers and long data series cannot help, because invariant events do not communicate with each other: a single realization stays in one $\Lambda_j$ forever. The relative weight of each invariant event is therefore the source of subjectivity in the model of $\tilde{\mathbb P}$.

12.3.3 Example 1: Vector Autoregressions

平稳遍历情形。 考虑模型 $\mathbf X:\Omega\to\mathbb R^n$、$(\mathbb S,\mathbb P)$ 给出： $$\mathbf X_{t+1}=\mathbf A\mathbf X_t+\mathbf B\mathbf W_{t+1},\quad \mathbf W_{t+1}\sim N(0,\mathbf I)\ \text{i.i.d.}$$ $\mathbf A$ 稳定（特征值绝对值 $<1$）。目标是求该过程的平稳分布并验证遍历。注意 $\mathbf X_0$ 设为确定（或正态）、$\mathbf W_{t+1}$ i.i.d. 正态，故 $\{\mathbf X_t\}_{t=0}^\infty$ 的平稳分布也正态。考察矩： 1. 均值。 $\mu_{t+1}=\mathbb E[\mathbf X_{t+1}]=\mathbb E[\mathbf A\mathbf X_t+\mathbf B\mathbf W_{t+1}]=\mathbf A\mu_t+\mathbf 0$。若平稳，$\mu=\mathbf A\mu\Rightarrow\mathbf 0=(\mathbf A-\mathbf I)\mu\Rightarrow\mu=\mathbf 0$。 2. 方差。 $\Sigma_{t+1}=\mathbf A\Sigma_t\mathbf A'+\mathbf B\mathbf B'$；平稳时 $\Sigma=\mathbf A\Sigma\mathbf A'+\mathbf B\mathbf B'$。猜 $\Sigma=\sum_{j=0}^\infty\mathbf A^j\mathbf B\mathbf B'(\mathbf A')^j$。

Note

验证 $$\mathbf A\Sigma\mathbf A'+\mathbf B\mathbf B'=\sum_{j=0}^\infty\mathbf A^{j+1}\mathbf B\mathbf B'(\mathbf A')^{j+1}+\mathbf B\mathbf B'=\mathbf A^0\mathbf B\mathbf B'(\mathbf A')^0+\sum_{j=1}^\infty\mathbf A^j\mathbf B\mathbf B'(\mathbf A')^j=\sum_{j=0}^\infty\mathbf A^j\mathbf B\mathbf B'(\mathbf A')^j=\Sigma$$ 由反复代入：$\Sigma_{t+1}=\mathbf A(\mathbf A\Sigma_{t-1}\mathbf A'+\mathbf B\mathbf B')\mathbf A'+\mathbf B\mathbf B'=\dots=\sum_{j=0}^t\mathbf A^j\mathbf B\mathbf B'(\mathbf A')^j+\mathbf A^{t+1}\Sigma_0(\mathbf A')^{t+1}$，令 $t\to\infty$ 得 $\Sigma_\infty=\sum_{j=0}^\infty\mathbf A^j\mathbf B\mathbf B'(\mathbf A')^j$。$\blacksquare$

所以 $(\mathbb S,\mathbb P)$ 平稳时 $\mathbf X_t\sim N(\mathbf 0,\Sigma)$。由联合正态性质，$\mathbb R^n$ 中任一点都有严格正概率被到达，故唯一不变事件就是 $\mathbb R^n$。既然 $(\mathbb S,\mathbb P)$ 对 $\mathbb R^n$ 赋概率 1，故遍历。

Stationary and ergodic case. Consider the model $\mathbf X:\Omega\to\mathbb R^n$, with $(\mathbb S,\mathbb P)$ defined by $$\mathbf X_{t+1}=\mathbf A\mathbf X_t+\mathbf B\mathbf W_{t+1},\quad \mathbf W_{t+1}\sim N(0,\mathbf I)\ \text{i.i.d.}$$ where $\mathbf A$ is stable (all eigenvalues have absolute value $<1$). Our goal is to find the stationary distribution and verify that it is ergodic. $\mathbf X_0$ is assumed to be deterministic (or normal) and the $\mathbf W_{t+1}$'s are i.i.d. normal for all $t$, so the stationary distribution of $\{\mathbf X_t\}_{t=0}^\infty$ is also normal and is pinned down by its first two moments:
1. Mean. $\mu_{t+1}=\mathbb E[\mathbf X_{t+1}]=\mathbb E[\mathbf A\mathbf X_t+\mathbf B\mathbf W_{t+1}]=\mathbf A\mu_t+\mathbf 0$. If stationary, $\mu=\mathbf A\mu\Rightarrow(\mathbf A-\mathbf I)\mu=\mathbf 0\Rightarrow\mu=\mathbf 0$, where the last step uses that $\mathbf A-\mathbf I$ is invertible because 1 is not an eigenvalue of a stable $\mathbf A$.
2. Variance. $\Sigma_{t+1}=\mathbf A\Sigma_t\mathbf A'+\mathbf B\mathbf B'$; in the stationary case $\Sigma=\mathbf A\Sigma\mathbf A'+\mathbf B\mathbf B'$. We guess $\Sigma=\sum_{j=0}^\infty\mathbf A^j\mathbf B\mathbf B'(\mathbf A')^j$.

Note

Verification $$\mathbf A\Sigma\mathbf A'+\mathbf B\mathbf B'=\sum_{j=0}^\infty\mathbf A^{j+1}\mathbf B\mathbf B'(\mathbf A')^{j+1}+\mathbf B\mathbf B'=\mathbf A^0\mathbf B\mathbf B'(\mathbf A')^0+\sum_{j=1}^\infty\mathbf A^j\mathbf B\mathbf B'(\mathbf A')^j=\sum_{j=0}^\infty\mathbf A^j\mathbf B\mathbf B'(\mathbf A')^j=\Sigma$$ By repeated substitution: $\Sigma_{t+1}=\mathbf A(\mathbf A\Sigma_{t-1}\mathbf A'+\mathbf B\mathbf B')\mathbf A'+\mathbf B\mathbf B'=\dots=\sum_{j=0}^t\mathbf A^j\mathbf B\mathbf B'(\mathbf A')^j+\mathbf A^{t+1}\Sigma_0(\mathbf A')^{t+1}$, and as $t\to\infty$, $\Sigma_\infty=\sum_{j=0}^\infty\mathbf A^j\mathbf B\mathbf B'(\mathbf A')^j$. $\blacksquare$

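So for $(\mathbb S,\mathbb P)$ to be stationary, we need $\mathbf X_t\sim N(\mathbf 0,\Sigma)$. When $\Sigma$ is nonsingular, the normal density is strictly positive everywhere, so every region of $\mathbb R^n$ with positive volume is reached with strictly positive probability. The process therefore cannot be confined to a proper subset of the state space, and the only invariant event with positive probability is the whole space $\mathbb R^n$. Since $(\mathbb S,\mathbb P)$ assigns probability 1 to $\mathbb R^n$, it is ergodic.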
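The fixed point $\Sigma=\mathbf A\Sigma\mathbf A'+\mathbf B\mathbf B'$ can be approximated numerically by iterating the variance recursion from $\Sigma_0=\mathbf 0$, which accumulates the partial sums of $\sum_j\mathbf A^j\mathbf B\mathbf B'(\mathbf A')^j$. The following is a minimal sketch in plain Python; the matrices `A` and `B` are hypothetical values chosen so that `A` is stable, and the small matrix helpers are written out only to keep the example dependency-free.

```python
def matmul(X, Y):
    """Matrix product of two lists-of-rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def transpose(X):
    return [list(row) for row in zip(*X)]


def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


# Hypothetical parameter values; A is upper triangular with eigenvalues 0.5 and 0.3.
A = [[0.5, 0.2], [0.0, 0.3]]
B = [[1.0, 0.0], [0.5, 1.0]]
BBt = matmul(B, transpose(B))


def step(Sigma):
    """One step of the recursion Sigma_{t+1} = A Sigma_t A' + B B'."""
    return add(matmul(matmul(A, Sigma), transpose(A)), BBt)


Sigma = [[0.0, 0.0], [0.0, 0.0]]            # Sigma_0 = 0: deterministic X_0
for _ in range(200):
    Sigma = step(Sigma)

# Distance from being a fixed point of the recursion.
residual = max(abs(s - n) for rs, rn in zip(Sigma, step(Sigma)) for s, n in zip(rs, rn))
print(Sigma, residual)
```

In this example the second state follows a scalar AR(1), so its stationary variance can be cross-checked by hand as $(0.5^2+1^2)/(1-0.3^2)$.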

平稳但不遍历情形。 考虑均值非零的 VAR。把 Markov 状态分块： $$\mathbf X_t=\begin{bmatrix}\mathbf x_t^1\\x_t^2\end{bmatrix},\quad\mathbf A=\begin{bmatrix}\mathbf A_{11}&a_{12}\\\mathbf 0&1\end{bmatrix},\quad\mathbf B=\begin{bmatrix}\mathbf B_1\\0\end{bmatrix}$$ 即 $$\begin{bmatrix}\mathbf x_{t+1}^1\\x_{t+1}^2\end{bmatrix}=\begin{bmatrix}\mathbf A_{11}&a_{12}\\\mathbf 0&1\end{bmatrix}\begin{bmatrix}\mathbf x_t^1\\x_t^2\end{bmatrix}+\begin{bmatrix}\mathbf B_1\\0\end{bmatrix}\mathbf W_{t+1}$$ 其中 $x_t^2\ne0$ 为标量、$\mathbf x_t^1$ 为 $(n-1)$ 维向量、$a_{12}$ 为与之相容的 $(n-1)\times1$ 列向量（仅当 $n=2$ 时为标量）。因 $\mathbf A$ 第二行是行向量 $[\mathbf 0,1]$ 且 $\mathbf B$ 的第二块为零，故 $x_t^2$ 对 $\forall t$ 恒为常数 $x_0^2$。设 $\mathbf A_{11}$ 稳定且 $\mathbf x_t^1$ 平稳，则 $$\mathbb E[\mathbf x_{t+1}^1]=\mathbf A_{11}\mathbb E[\mathbf x_t^1]+a_{12}\mathbb E[x_t^2]\Rightarrow(\mathbf I-\mathbf A_{11})\mathbb E[\mathbf x_t^1]=a_{12}\mathbb E[x_t^2]\Rightarrow\mu_1\equiv\mathbb E[\mathbf x_t^1]=(\mathbf I-\mathbf A_{11})^{-1}a_{12}x_0^2$$ 方差：令 $\Sigma_t=\begin{bmatrix}\Sigma_t^{11}&0\\0&0\end{bmatrix}$，则 $$\Sigma_{t+1}^{11}=\operatorname{Var}(\mathbf x_{t+1}^1)=\operatorname{Var}(\mathbf A_{11}\mathbf x_t^1+a_{12}x_t^2+\mathbf B_1\mathbf W_{t+1})=\mathbf A_{11}\Sigma_t^{11}\mathbf A_{11}'+\mathbf B_1\mathbf B_1'$$ 平稳时 $\Sigma^{11}=\sum_{j=0}^\infty\mathbf A_{11}^j\mathbf B_1\mathbf B_1'(\mathbf A_{11}')^j$。故 $(\mathbb S,\mathbb P)$ 平稳时，给定 $x_0^2$ 有 $\mathbf X_t\sim N(\mu,\Sigma)$， $$\mu=\begin{bmatrix}(\mathbf I-\mathbf A_{11})^{-1}a_{12}x_0^2\\x_0^2\end{bmatrix},\quad\Sigma=\begin{bmatrix}\Sigma^{11}&0\\0&0\end{bmatrix}$$ 此时 $\Sigma$ 奇异：过程永不离开超平面 $\{x^2=x_0^2\}$，故 $\mathbb R^n$ 不再是唯一不变事件。对每个固定的 $x_0^2$，$\mathbb R^{n-1}$ 上对应一个不变事件；除非确知 $\mathbb P$ 对某特定 $x_0^2$ 赋概率 1、对其余赋 0，否则不能断言 $(\mathbb S,\mathbb P)$ 遍历。

Stationary but not ergodic case. Consider a different VAR whose stationary distribution has a non-zero mean. Partition the Markov state as $$\mathbf X_t=\begin{bmatrix}\mathbf x_t^1\\x_t^2\end{bmatrix},\quad\mathbf A=\begin{bmatrix}\mathbf A_{11}&a_{12}\\\mathbf 0&1\end{bmatrix},\quad\mathbf B=\begin{bmatrix}\mathbf B_1\\0\end{bmatrix}$$ i.e. $$\begin{bmatrix}\mathbf x_{t+1}^1\\x_{t+1}^2\end{bmatrix}=\begin{bmatrix}\mathbf A_{11}&a_{12}\\\mathbf 0&1\end{bmatrix}\begin{bmatrix}\mathbf x_t^1\\x_t^2\end{bmatrix}+\begin{bmatrix}\mathbf B_1\\0\end{bmatrix}\mathbf W_{t+1}$$ where $x_t^2\ne0$ is a scalar, $\mathbf x_t^1$ is an $(n-1)$-vector, and $a_{12}$ is a conformable $(n-1)\times1$ column vector (a scalar only when $n=2$). Because the second row of $\mathbf A$ is the row vector $[\mathbf 0,1]$ and the second block of $\mathbf B$ is zero, $x_{t+1}^2=x_t^2$, so $x_t^2=x_0^2$ is constant for all $t$. Suppose $\mathbf A_{11}$ is stable and $\mathbf x_t^1$ is stationary; then $$\mathbb E[\mathbf x_{t+1}^1]=\mathbf A_{11}\mathbb E[\mathbf x_t^1]+a_{12}\mathbb E[x_t^2]\Rightarrow(\mathbf I-\mathbf A_{11})\mathbb E[\mathbf x_t^1]=a_{12}\mathbb E[x_t^2]\Rightarrow\mu_1\equiv\mathbb E[\mathbf x_t^1]=(\mathbf I-\mathbf A_{11})^{-1}a_{12}x_0^2$$ For the variance, let $\Sigma_t=\begin{bmatrix}\Sigma_t^{11}&0\\0&0\end{bmatrix}$; then $$\Sigma_{t+1}^{11}=\operatorname{Var}(\mathbf x_{t+1}^1)=\operatorname{Var}(\mathbf A_{11}\mathbf x_t^1+a_{12}x_t^2+\mathbf B_1\mathbf W_{t+1})=\mathbf A_{11}\Sigma_t^{11}\mathbf A_{11}'+\mathbf B_1\mathbf B_1'$$ In the stationary case, $\Sigma^{11}=\sum_{j=0}^\infty\mathbf A_{11}^j\mathbf B_1\mathbf B_1'(\mathbf A_{11}')^j$. So for $(\mathbb S,\mathbb P)$ to be stationary, conditional on $x_0^2$ we need $\mathbf X_t\sim N(\mu,\Sigma)$ where $$\mu=\begin{bmatrix}(\mathbf I-\mathbf A_{11})^{-1}a_{12}x_0^2\\x_0^2\end{bmatrix},\quad\Sigma=\begin{bmatrix}\Sigma^{11}&0\\0&0\end{bmatrix}$$ Now $\Sigma$ is singular: the process never leaves the hyperplane $\{x^2=x_0^2\}$, so $\mathbb R^n$ is no longer the only invariant event. Each fixed value of $x_0^2$ defines its own invariant event, a copy of $\mathbb R^{n-1}$ on which $\mathbf x_t^1$ moves. Unless we know for sure that $\mathbb P$ assigns probability 1 to one particular $x_0^2$ and probability 0 to all others, we cannot conclude that $(\mathbb S,\mathbb P)$ is ergodic.
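A simulation makes the failure of ergodicity visible: time averages of $\mathbf x_t^1$ converge, but to a limit that depends on which invariant event (which value of $x_0^2$) the path lives in. The sketch below assumes the scalar case $n=2$ with hypothetical parameter values `A11`, `A12`, `B1`; the function name `time_average_x1` is illustrative.

```python
import random

# Hypothetical scalar parameter values (n = 2): x1' = A11*x1 + A12*x2 + B1*W, x2' = x2.
A11, A12, B1 = 0.5, 1.0, 1.0


def time_average_x1(x2, n_steps=200_000, seed=0):
    """Time average of x^1 along one simulated path with the constant x^2 = x2."""
    rng = random.Random(seed)
    x1 = A12 * x2 / (1.0 - A11)             # start at the conditional mean
    total = 0.0
    for _ in range(n_steps):
        x1 = A11 * x1 + A12 * x2 + B1 * rng.gauss(0.0, 1.0)
        total += x1
    return total / n_steps


avg_low = time_average_x1(x2=1.0)           # theory: A12 * 1 / (1 - A11) = 2
avg_high = time_average_x1(x2=3.0)          # theory: A12 * 3 / (1 - A11) = 6
print(avg_low, avg_high)
```

Each path reveals its own conditional mean $(1-a_{11})^{-1}a_{12}x_0^2$, but no single path carries information about how likely the other values of $x_0^2$ are, which is the point made in Remark 12.9.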

12.3.4 Example 2: Finite-State Markov Chains

考虑转移矩阵 $\mathbb P_{k\times k}$，元素 $p_{ij}$ 表示从状态 $i$ 转到状态 $j$ 的概率（$\sum_{j=1}^k p_{ij}=1$）。把状态 $i$ 记为坐标向量 $u_i=[0,\dots,1,\dots,0]'$（第 $i$ 位为 1、其余为 0）。可用 $\mathbf X_t\in\mathbb R^k$ 表示过程 $\{\mathbf X_t\}$ 为这些坐标向量在每期 $t$ 的实现（线性组合）。进一步，可把 $\mathbf X_t$ 坐标的任意线性函数 $f:\mathbb R^k\to\mathbb R$ 表示为向量 $f_{k\times1}$（$\forall u\in\mathbb R^k$）：$f(u)=u'f_{k\times1}$。更一般地考虑条件期望——因为 $$\mathbb E[f(\mathbf X_{t+1})\mid\mathbf X_t=u]=u'\mathbb P f_{k\times1}$$

Note

推导 $$u'\mathbb P=\begin{bmatrix}u_1&\cdots&u_k\end{bmatrix}\begin{bmatrix}p_{11}&\cdots&p_{1k}\\\vdots&&\vdots\\p_{k1}&\cdots&p_{kk}\end{bmatrix}=\begin{bmatrix}\mathbb E[u_1^\star\mid\mathbf X_t=u]&\cdots&\mathbb E[u_k^\star\mid\mathbf X_t=u]\end{bmatrix}\equiv\mathbb E[\mathbf X_{t+1}\mid\mathbf X_t=u]'$$ $\Rightarrow u'\mathbb P f_{k\times1}=\mathbb E[\mathbf X_{t+1}\mid\mathbf X_t=u]'f_{k\times1}=f(\mathbb E[\mathbf X_{t+1}\mid\mathbf X_t=u])=\mathbb E[f(\mathbf X_{t+1})\mid\mathbf X_t=u]$（末步用 $\mathbb E[\cdot]$ 的线性）。$\blacksquare$

一般地，把 $\mathbb E[f(\mathbf X_{t+1})\mid\mathbf X_t=u]$ 看作 $\mathbf X_t$ 的函数时，该函数由向量 $\mathbb P f_{k\times1}$ 表示，即 $$\mathbb E[f(\mathbf X_{t+1})\mid\mathbf X_t]=\mathbf X_t'\mathbb P f_{k\times1}$$

Important

命题 12.4 设 $\mathbb P f_{k\times1}=f_{k\times1}$，即 $\mathbb E[f(\mathbf X_{t+1})\mid\mathbf X_t]=f(\mathbf X_t)$。则 $\{\mathbf X_t\}$ 平稳时 $f(\mathbf X_{t+1})=f(\mathbf X_t)$。

Note

证明只需证给定 $f(\mathbf X_{t+1})=\mathbb E[f(\mathbf X_{t+1})\mid\mathbf X_t]+e_{t+1}$ 时 $\mathbb E[e_{t+1}^2]=0$（这将推出 $e_{t+1}=0$，从而 $f(\mathbf X_{t+1})=f(\mathbf X_t)$）。 $$\begin{aligned}\mathbb E[e_{t+1}^2]&=\mathbb E[(f(\mathbf X_{t+1})-\mathbb E[f(\mathbf X_{t+1})\mid\mathbf X_t])^2]\\&=\mathbb E[(f(\mathbf X_{t+1})-f(\mathbf X_t))^2]\\&=\mathbb E[f(\mathbf X_{t+1})^2]-2\mathbb E[f(\mathbf X_{t+1})f(\mathbf X_t)]+\mathbb E[f(\mathbf X_t)^2]\\&\overset{\text{LIE}}{=}\mathbb E[f(\mathbf X_{t+1})^2]-2\mathbb E[\mathbb E[f(\mathbf X_{t+1})f(\mathbf X_t)\mid\mathbf X_t]]+\mathbb E[f(\mathbf X_t)^2]\\&=\mathbb E[f(\mathbf X_{t+1})^2]-2\mathbb E[f(\mathbf X_t)\mathbb E[f(\mathbf X_{t+1})\mid\mathbf X_t]]+\mathbb E[f(\mathbf X_t)^2]\\&=\mathbb E[f(\mathbf X_{t+1})^2]-2\mathbb E[f(\mathbf X_t)f(\mathbf X_t)]+\mathbb E[f(\mathbf X_t)^2]\\&=\mathbb E[f(\mathbf X_{t+1})^2]-2\mathbb E[f(\mathbf X_t)^2]+\mathbb E[f(\mathbf X_t)^2]\\&=\mathbb E[f(\mathbf X_{t+1})^2]-\mathbb E[f(\mathbf X_t)^2]\overset{\text{stat}}{=}0\end{aligned}$$ $\blacksquare$

Tip

注记 12.10 直观：只要 $\mathbb E[f(\mathbf X_{t+1})\mid\mathbf X_t]=f(\mathbf X_t)$（如当期消费的某函数等于对下期同一函数的期望，即当期等于未来期望），且 $\{\mathbf X_t\}$ 平稳，则实现值 $f(\mathbf X_{t+1})$ 不仅等于其条件期望 $\mathbb E[f(\mathbf X_{t+1})\mid\mathbf X_t]$，也等于当期的 $f(\mathbf X_t)$。该等式只在函数 $f$ 的层面成立，未必有 $\mathbf X_{t+1}=\mathbf X_t$。

Consider a transition matrix $\mathbb P_{k\times k}$ with elements $p_{ij}$, the probability of moving from state $i$ to state $j$ ($\sum_{j=1}^k p_{ij}=1$). Denote state $i$ by the coordinate vector $u_i=[0,\dots,1,\dots,0]'$ (1 in position $i$, 0 everywhere else), so that in each period $t$ the realization of $\mathbf X_t\in\mathbb R^k$ is one of these coordinate vectors. Further, we can represent any linear function $f:\mathbb R^k\to\mathbb R$ of $\mathbf X_t$'s coordinates by a vector $f_{k\times1}$: $f(u)=u'f_{k\times1}$ for all $u\in\mathbb R^k$. Conditional expectations take the same form, since $$\mathbb E[f(\mathbf X_{t+1})\mid\mathbf X_t=u]=u'\mathbb P f_{k\times1}$$

Note

Derivation $$u'\mathbb P=\begin{bmatrix}u_1&\cdots&u_k\end{bmatrix}\begin{bmatrix}p_{11}&\cdots&p_{1k}\\\vdots&&\vdots\\p_{k1}&\cdots&p_{kk}\end{bmatrix}=\begin{bmatrix}\mathbb E[u_1^\star\mid\mathbf X_t=u]&\cdots&\mathbb E[u_k^\star\mid\mathbf X_t=u]\end{bmatrix}\equiv\mathbb E[\mathbf X_{t+1}\mid\mathbf X_t=u]'$$ $\Rightarrow u'\mathbb P f_{k\times1}=\mathbb E[\mathbf X_{t+1}\mid\mathbf X_t=u]'f_{k\times1}=f(\mathbb E[\mathbf X_{t+1}\mid\mathbf X_t=u])=\mathbb E[f(\mathbf X_{t+1})\mid\mathbf X_t=u]$ (the last step by linearity of $\mathbb E[\cdot]$). $\blacksquare$

In general, viewed as a function of $\mathbf X_t$, the conditional expectation $\mathbb E[f(\mathbf X_{t+1})\mid\mathbf X_t=u]$ is represented by the vector $\mathbb P f_{k\times1}$: $$\mathbb E[f(\mathbf X_{t+1})\mid\mathbf X_t]=\mathbf X_t'\mathbb P f_{k\times1}$$

Important

Proposition 12.4 Suppose $\mathbb P f_{k\times1}=f_{k\times1}$, i.e. $\mathbb E[f(\mathbf X_{t+1})\mid\mathbf X_t]=f(\mathbf X_t)$. Then if $\{\mathbf X_t\}$ is stationary, $f(\mathbf X_{t+1})=f(\mathbf X_t)$.

Note

Proof It suffices to show that given $f(\mathbf X_{t+1})=\mathbb E[f(\mathbf X_{t+1})\mid\mathbf X_t]+e_{t+1}$, we have $\mathbb E[e_{t+1}^2]=0$ (which implies $e_{t+1}=0$, hence $f(\mathbf X_{t+1})=f(\mathbf X_t)$). $$\begin{aligned}\mathbb E[e_{t+1}^2]&=\mathbb E[(f(\mathbf X_{t+1})-\mathbb E[f(\mathbf X_{t+1})\mid\mathbf X_t])^2]\\&=\mathbb E[(f(\mathbf X_{t+1})-f(\mathbf X_t))^2]\\&=\mathbb E[f(\mathbf X_{t+1})^2]-2\mathbb E[f(\mathbf X_{t+1})f(\mathbf X_t)]+\mathbb E[f(\mathbf X_t)^2]\\&\overset{\text{LIE}}{=}\mathbb E[f(\mathbf X_{t+1})^2]-2\mathbb E[\mathbb E[f(\mathbf X_{t+1})f(\mathbf X_t)\mid\mathbf X_t]]+\mathbb E[f(\mathbf X_t)^2]\\&=\mathbb E[f(\mathbf X_{t+1})^2]-2\mathbb E[f(\mathbf X_t)\mathbb E[f(\mathbf X_{t+1})\mid\mathbf X_t]]+\mathbb E[f(\mathbf X_t)^2]\\&=\mathbb E[f(\mathbf X_{t+1})^2]-2\mathbb E[f(\mathbf X_t)f(\mathbf X_t)]+\mathbb E[f(\mathbf X_t)^2]\\&=\mathbb E[f(\mathbf X_{t+1})^2]-\mathbb E[f(\mathbf X_t)^2]\overset{\text{stat}}{=}0\end{aligned}$$ $\blacksquare$

Tip

Remark 12.10 The idea is that as long as $\mathbb E[f(\mathbf X_{t+1})\mid\mathbf X_t]=f(\mathbf X_t)$ (e.g. some function of state-contingent current consumption equals the expectation of the same function of next-period consumption, i.e. current equals expected future) and the series $\{\mathbf X_t\}$ is stationary, the realized $f(\mathbf X_{t+1})$ equals not only its conditional expectation $\mathbb E[f(\mathbf X_{t+1})\mid\mathbf X_t]$ but also the current $f(\mathbf X_t)$. The equality holds at the level of the function $f$; it does not require $\mathbf X_{t+1}=\mathbf X_t$.
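The vector representation and Proposition 12.4 can be illustrated with a small reducible chain. The sketch below uses a hypothetical 3-state transition matrix in which states 0 and 1 form one closed class and state 2 is absorbing; the indicator of the first class then satisfies $\mathbb P f=f$. The helper names `apply` and `simulate` are illustrative.

```python
import random

# Hypothetical 3-state chain: states 0 and 1 communicate, state 2 is absorbing.
P = [[0.9, 0.1, 0.0],
     [0.4, 0.6, 0.0],
     [0.0, 0.0, 1.0]]


def apply(P, f):
    """Vector P f: entry i is E[f(X_{t+1}) | X_t = u_i]."""
    return [sum(p * fj for p, fj in zip(row, f)) for row in P]


def simulate(start, n_steps, rng):
    """Simulate state indices X_0, ..., X_n from the transition matrix P."""
    path = [start]
    for _ in range(n_steps):
        path.append(rng.choices(range(len(P)), weights=P[path[-1]])[0])
    return path


g = [1.0, 2.0, 5.0]            # a generic function of the state
f = [1.0, 1.0, 0.0]            # indicator of the closed class {0, 1}

cond_exp_g = apply(P, g)       # conditional expectation of g, state by state
is_invariant = all(abs(a - b) < 1e-12 for a, b in zip(apply(P, f), f))   # P f = f

rng = random.Random(0)
values_a = {f[x] for x in simulate(0, 1000, rng)}   # path started in class {0, 1}
values_b = {f[x] for x in simulate(2, 1000, rng)}   # path started in state 2
print(cond_exp_g, is_invariant, values_a, values_b)
```

Along each path the state itself keeps moving inside its class while $f(\mathbf X_t)$ never changes, which is the distinction drawn in Remark 12.10. In this particular example $f$ is constant along paths for any initial distribution; the stationarity assumption matters for general $f$ with $\mathbb P f=f$.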

12.4 Stationary Increment Process

许多情形（如宏观时间序列）下，考虑带随机增长成分的过程比平稳过程更合理。本节研究平稳增量过程。遵循一定步骤，总能把平稳增量过程分解为若干成分，其中包含一个鞅成分。之所以做这个鞅分解，是因为每个鞅增量都是对所关心的平稳增量过程的永久冲击。

In many situations, such as macroeconomic time series, it is more reasonable to model a process with a stochastic growth component than a stationary process. This section therefore studies stationary increment processes. Following a fixed sequence of steps, we can always decompose such a process into several components, one of which is a martingale. The martingale decomposition is of interest because every martingale increment is a permanent shock to the stationary increment process.

12.4.1 Martingale

记法：
- 设 $\{X_t\}_{t=0}^\infty$ 为平稳遍历过程。
- 定义 $\{Y_t\}_{t=0}^\infty$ 由 $Y_{t+1}-Y_t=X_{t+1}$；等价地 $Y_t=Y_0+\sum_{j=1}^t X_j$。
- 记到 $t$ 期为止的信息集（即 $Y_0,X_1,X_2,\dots,X_t$）为 $\mathcal F_t$。

Tip

注记 12.11 两个有用事实：第一（LIE），$\mathbb E[\mathbb E[Z\mid\mathcal F_{t+1}]\mid\mathcal F_t]=\mathbb E[Z\mid\mathcal F_t]$；第二，$\mathbb E[\mathbb E[Z\mid\mathcal F_t]\mid\mathcal F_{t+1}]=\mathbb E[Z\mid\mathcal F_t]$——因为 $\mathbb E[Z\mid\mathcal F_t]$ 已把 $t$ 期后所有信息退化为一个期望数，$t+1$ 期没有新变量进入该表达式，故外层期望不起作用。

Important

定义 12.12（鞅）随机过程 $\{Y_t\}_{t=0}^\infty$ 是鞅，若 $$\mathbb E[Y_{t+1}\mid\mathcal F_t]=Y_t\quad\text{or}\quad\mathbb E\big[\underbrace{Y_{t+1}-Y_t}_{=X_{t+1}}\mid\mathcal F_t\big]=0$$

Notations:
- Let $\{X_t\}_{t=0}^\infty$ be a stationary and ergodic process.
- Define $\{Y_t\}_{t=0}^\infty$ by $Y_{t+1}-Y_t=X_{t+1}$; alternatively $Y_t=Y_0+\sum_{j=1}^t X_j$.
- Denote the set of information up to period $t$ (i.e. $Y_0,X_1,X_2,\dots,X_t$) by $\mathcal F_t$.

Tip

Remark 12.11 Two helpful facts. First (LIE), $\mathbb E[\mathbb E[Z\mid\mathcal F_{t+1}]\mid\mathcal F_t]=\mathbb E[Z\mid\mathcal F_t]$. Second, $\mathbb E[\mathbb E[Z\mid\mathcal F_t]\mid\mathcal F_{t+1}]=\mathbb E[Z\mid\mathcal F_t]$, because $\mathbb E[Z\mid\mathcal F_t]$ is already a function of period-$t$ information only and $\mathcal F_t\subseteq\mathcal F_{t+1}$. Nothing dated $t+1$ remains random in the expression, so conditioning on $\mathcal F_{t+1}$ leaves it unchanged.

Important

Definition 12.12 (Martingale) A stochastic process $\{Y_t\}_{t=0}^\infty$ is a martingale if $$\mathbb E[Y_{t+1}\mid\mathcal F_t]=Y_t\quad\text{or}\quad\mathbb E\big[\underbrace{Y_{t+1}-Y_t}_{=X_{t+1}}\mid\mathcal F_t\big]=0$$

12.4.2 Martingale Decomposition: General Case

对差分过程 $\{Y_{t+1}-Y_t\}$ 由一般平稳过程 $\{X_t\}$ 生成的 $\{Y_t\}$ 做鞅分解，算法如下。

第 1 步： 写下所关心的过程 $Y_{t+1}-Y_t=X_{t+1}$ (12.2)，其中 $\{X_t\}$ 平稳（$X_t$ 是平稳过程的一般表示，含 §12.4.3 可加泛函 $\kappa(X_t,W_{t+1})$ 特例、§12.4.4 VAR 特例 $\mathbf D\mathbf X_{t-1}+\mathbf F\mathbf W_t+\nu$）。
第 2 步： 计算 $\nu=\mathbb E[X_t]$（$\forall t$）。
第 3 步： 定义 $$\overline X_t=\sum_{j=0}^\infty\mathbb E[X_{t+j}-\nu\mid\mathcal F_t]\tag{12.3}$$ $$\tilde X_t=\sum_{j=1}^\infty\mathbb E[X_{t+j}-\nu\mid\mathcal F_t]\tag{12.4}$$ 注意 $X_{t+j}-\nu$ 无条件期望为零，故 $\mathbb E[\overline X_t]=0$、$\mathbb E[\tilde X_t]=0$；又 $\overline X_t-\tilde X_t=\mathbb E[X_t-\nu\mid\mathcal F_t]=X_t-\nu$，且 $\mathbb E[\overline X_{t+1}\mid\mathcal F_t]=\tilde X_t$（LIE）。
第 4 步： 定义 $M_{t+1}\equiv\overline X_{t+1}-\tilde X_t$ (12.5)。
第 5 步： 分解 $X_{t+1}$： $$X_{t+1}=(\overline X_{t+1}-\tilde X_{t+1})+\nu=\underbrace{(\overline X_{t+1}-\tilde X_t)}_{\equiv M_{t+1}}+(\tilde X_t-\tilde X_{t+1}+\nu)$$ 其中 $\mathbb E[M_{t+1}\mid\mathcal F_t]=\mathbb E[\overline X_{t+1}\mid\mathcal F_t]-\tilde X_t=0$。
第 6 步： 把 $Y_t$ 写为鞅分解： $$Y_t=Y_0+\sum_{j=1}^t X_j=Y_0+\sum_{j=1}^t\big[M_j+(\tilde X_{j-1}-\tilde X_j+\nu)\big]=Y_0+t\nu+\sum_{j=1}^t M_j+\big(\tilde X_0-\tilde X_t\big)$$

其中：$Y_0$ 与 $\tilde X_0$ 是不变数；$t\nu$ 是线性于 $t$ 的时间趋势；$\{\tilde X_t\}$ 平稳；$\big\{\sum_{j=1}^t M_j\big\}$ 是鞅，因 $$\mathbb E\Big[\sum_{j=1}^{t+1}M_j\mid\mathcal F_t\Big]=\underbrace{\mathbb E[M_{t+1}\mid\mathcal F_t]}_{=0}+\sum_{j=1}^t M_j=\sum_{j=1}^t M_j\tag{12.6}$$ $M_j$ 称鞅增量，$\{M_t\}$ 平稳。

Important

定义 12.13（鞅分解）称 $$Y_t=Y_0+t\nu+\sum_{j=1}^t M_j+\big(\tilde X_0-\tilde X_t\big)\tag{12.7}$$ 为 $Y_t$ 的鞅分解。

由 (12.6)，$M_j$ 是 $j$ 期对平稳增量过程的永久冲击：一旦实现就永不衰减。考察今天 $Y_t$ 与无穷远未来 $Y_{t+\infty}$ 去趋势距离基于 $\mathcal F_t$ vs $\mathcal F_{t+1}$ 的差，正是 $M_{t+1}$——这个差只因 $t+1$ 期实现而出现，且看向无穷远仍存在，故 $M_{t+1}$ 是 $t+1$ 期永久冲击。

Important

命题 12.5 $\sum_{j=1}^t M_j$ 的方差随 $t$ 线性增长。

Note

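证明 对 $j<k$，$M_j$ 关于 $\mathcal F_{k-1}$ 可测，故由 LIE，$\mathbb E[M_jM_k]=\mathbb E[M_j\,\mathbb E[M_k\mid\mathcal F_{k-1}]]=0$。交叉项全为零，又 $\{M_t\}$ 平稳且均值为零，故 $$\operatorname{Var}\Big(\sum_{j=1}^tM_j\Big)=\sum_{j=1}^t\mathbb E[M_j^2]=t\,\mathbb E[M_1^2]$$ 随 $t$ 线性增长。$\blacksquare$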

We can do a martingale decomposition to a process $\{Y_t\}$ whose difference process $\{Y_{t+1}-Y_t\}$ is generated by a general stationary process $\{X_t\}$. The algorithm follows.

Step 1: Write down the process of interest $Y_{t+1}-Y_t=X_{t+1}$ (12.2), where $\{X_t\}$ is stationary ($X_t$ is a general representation of a stationary process, including the additive-functional $\kappa(X_t,W_{t+1})$ special case of §12.4.3 and the VAR special case $\mathbf D\mathbf X_{t-1}+\mathbf F\mathbf W_t+\nu$ of §12.4.4).
Step 2: Calculate $\nu=\mathbb E[X_t]$ for all $t$.
Step 3: Define $$\overline X_t=\sum_{j=0}^\infty\mathbb E[X_{t+j}-\nu\mid\mathcal F_t]\tag{12.3}$$ $$\tilde X_t=\sum_{j=1}^\infty\mathbb E[X_{t+j}-\nu\mid\mathcal F_t]\tag{12.4}$$ Note $X_{t+j}-\nu$ has zero unconditional expectation, so $\mathbb E[\overline X_t]=0$, $\mathbb E[\tilde X_t]=0$; also $\overline X_t-\tilde X_t=\mathbb E[X_t-\nu\mid\mathcal F_t]=X_t-\nu$, and $\mathbb E[\overline X_{t+1}\mid\mathcal F_t]=\tilde X_t$ (LIE).
Step 4: Define $M_{t+1}\equiv\overline X_{t+1}-\tilde X_t$ (12.5).
Step 5: Decompose $X_{t+1}$: $$X_{t+1}=(\overline X_{t+1}-\tilde X_{t+1})+\nu=\underbrace{(\overline X_{t+1}-\tilde X_t)}_{\equiv M_{t+1}}+(\tilde X_t-\tilde X_{t+1}+\nu)$$ where $\mathbb E[M_{t+1}\mid\mathcal F_t]=\mathbb E[\overline X_{t+1}\mid\mathcal F_t]-\tilde X_t=0$.
Step 6: Rewrite $Y_t$ as a martingale decomposition: $$Y_t=Y_0+\sum_{j=1}^t X_j=Y_0+\sum_{j=1}^t\big[M_j+(\tilde X_{j-1}-\tilde X_j+\nu)\big]=Y_0+t\nu+\sum_{j=1}^t M_j+\big(\tilde X_0-\tilde X_t\big)$$

where: $Y_0$ and $\tilde X_0$ are invariant numbers; $t\nu$ is the time trend (linear in $t$); $\{\tilde X_t\}$ is stationary; $\big\{\sum_{j=1}^t M_j\big\}$ is a martingale, since $$\mathbb E\Big[\sum_{j=1}^{t+1}M_j\mid\mathcal F_t\Big]=\underbrace{\mathbb E[M_{t+1}\mid\mathcal F_t]}_{=0}+\sum_{j=1}^t M_j=\sum_{j=1}^t M_j\tag{12.6}$$ The $M_j$'s are called martingale increments, and $\{M_t\}$ is stationary.

Important

Definition 12.13 (Martingale decomposition) We call $$Y_t=Y_0+t\nu+\sum_{j=1}^t M_j+\big(\tilde X_0-\tilde X_t\big)\tag{12.7}$$ a martingale decomposition of $Y_t$.

By (12.6), $M_j$ is the permanent shock in period $j$ to the stationary increment process: once realized, it stays in $\sum_j M_j$ and never decays. To see this, compare the forecast of the distant future $Y_{t+j}$ based on $\mathcal F_{t+1}$ with the forecast based on $\mathcal F_t$. The trend $(t+j)\nu$ cancels, and provided the forecast of the stationary component $\tilde X_{t+j}$ converges to its unconditional mean of zero as $j\to\infty$, $$\lim_{j\to\infty}\big(\mathbb E[Y_{t+j}\mid\mathcal F_{t+1}]-\mathbb E[Y_{t+j}\mid\mathcal F_t]\big)=M_{t+1}$$ The revision arises only because of period $t+1$'s realization and is still there at the infinite horizon, so $M_{t+1}$ is the permanent shock in period $t+1$.
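The six steps can be traced numerically in a case where the infinite sums in (12.3) and (12.4) have closed forms. The sketch below assumes a hypothetical AR(1) increment, $X_{t+1}-\nu=\rho(X_t-\nu)+\sigma W_{t+1}$, with $X_0$ taken to be part of the date-0 information set; all parameter values are illustrative. It builds $\overline X_t$, $\tilde X_t$ and $M_t$ from their definitions and checks identity (12.7) along one simulated path.

```python
import random

# Hypothetical AR(1) increment: X_{t+1} - nu = rho * (X_t - nu) + sigma * W_{t+1}.
NU, RHO, SIGMA, T = 0.3, 0.8, 1.0, 50
rng = random.Random(1)

x = [NU + 0.5]                                   # X_0, assumed known at date 0
for _ in range(T):
    x.append(NU + RHO * (x[-1] - NU) + SIGMA * rng.gauss(0.0, 1.0))

y = [2.0]                                        # Y_0
for t in range(1, T + 1):
    y.append(y[-1] + x[t])                       # Y_t = Y_{t-1} + X_t

# For an AR(1), E[X_{t+j} - nu | F_t] = rho**j * (X_t - nu), so the sums in
# (12.3) and (12.4) are geometric series.
x_bar = [(xt - NU) / (1 - RHO) for xt in x]              # j = 0, 1, 2, ...
x_tilde = [RHO * (xt - NU) / (1 - RHO) for xt in x]      # j = 1, 2, ...

# Martingale increments from (12.5): M_t = x_bar_t - x_tilde_{t-1}; m[0] is a placeholder.
m = [0.0] + [x_bar[t] - x_tilde[t - 1] for t in range(1, T + 1)]

# Right-hand side of (12.7), rebuilt term by term.
rebuilt = [y[0] + t * NU + sum(m[1:t + 1]) + x_tilde[0] - x_tilde[t] for t in range(T + 1)]
max_gap = max(abs(a - b) for a, b in zip(y, rebuilt))

# Each increment is a scaled innovation: M_t = (X_t - nu - rho * (X_{t-1} - nu)) / (1 - rho).
innovations = [x[t] - NU - RHO * (x[t - 1] - NU) for t in range(1, T + 1)]
max_dev = max(abs(m[t] - innovations[t - 1] / (1 - RHO)) for t in range(1, T + 1))
print(max_gap, max_dev)
```

In this special case the permanent shock is the one-period innovation scaled by $1/(1-\rho)$, its cumulative effect on the level of $Y$.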

Important

Proposition 12.5 The variance of $\sum_{j=1}^t M_j$ grows linearly in $t$.

Note

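Proof For $j<k$, $\mathbb E[M_jM_k]=\mathbb E\big[M_j\,\mathbb E[M_k\mid\mathcal F_{k-1}]\big]=0$ by the law of iterated expectations, so all cross terms vanish. Since $\{M_t\}$ is stationary with zero mean, $$\operatorname{Var}\Big(\sum_{j=1}^t M_j\Big)=\sum_{j=1}^t\mathbb E[M_j^2]=t\,\mathbb E[M_1^2]$$ which is linear in $t$. $\blacksquare$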

12.4.3 Martingale Decomposition: Markov Difference Process (a special case)

设 $\{X_t\}_{t=1}^\infty$ 为（多元）Markov 过程，定义 $\{Y_t\}_{t=1}^\infty$ 由 $$Y_{t+1}-Y_t=\phi(X_{t+1})\tag{12.8}$$ 定义算子 $\mathbb T$： $$(\mathbb T\phi)(x)=\mathbb E[\phi(X_{t+1})\mid X_t=x]$$ 令 $\mathbb L^2$ 为有限二阶矩函数空间、$\mathbb Z$ 为零均值子空间： $$\mathbb L^2=\{\phi:\mathbb E[\phi(X_t)^2]<\infty\},\qquad\mathbb Z=\{\phi\in\mathbb L^2:\mathbb E[\phi(X_t)]=0\}$$ 对 $\phi\in\mathbb Z$，$\phi(X_{t+1})=\mathbb E[\phi(X_{t+1})\mid X_t]+e_{t+1}$，$e_{t+1}$ 为与 $\mathbb E[\phi(X_{t+1})\mid X_t]$ 不相关的估计误差。于是 $$\operatorname{Var}(\phi(X_{t+1}))=\operatorname{Var}(\mathbb E[\phi(X_{t+1})\mid X_t])+\operatorname{Var}(e_{t+1})\ge\operatorname{Var}(\mathbb E[\phi(X_{t+1})\mid X_t])=\operatorname{Var}((\mathbb T\phi)(X_t))$$ 又 $\phi\in\mathbb Z\Rightarrow\mathbb E[\phi(X_t)]=0$，且 $\mathbb E[(\mathbb T\phi)(X_t)]=\mathbb E[\mathbb E[\phi(X_{t+1})\mid X_t]]=\mathbb E[\phi(X_{t+1})]=0$，故 $\mathbb T\phi\in\mathbb Z$。由 $\operatorname{Var}(\phi(X_{t+1}))\ge\operatorname{Var}((\mathbb T\phi)(X_t))$， $$\mathbb E[(\phi(X_t))^2]\ge\mathbb E[((\mathbb T\phi)(X_t))^2]\tag{12.9}$$ 故 $\mathbb T\phi$ 零均值有限二阶矩，$\mathbb T\phi\in\mathbb Z$。对 $\phi\in\mathbb Z$，记 $\|\phi\|=(\mathbb E[(\phi(X_t))^2])^{1/2}$，则 (12.9) 给出 $\|\phi\|\ge\|\mathbb T\phi\|$——即 $\mathbb T$ 是弱收缩。设可进一步施加 $\mathbb T$ 为 $\mathbb Z$ 上的强收缩：$\|\mathbb T\phi\|\le\lambda\|\phi\|$，$\lambda\in(0,1)$。则由归纳，对 $j\ge1$，$\mathbb E[\phi(X_{t+j})\mid X_t]=(\mathbb T^j\phi)(X_t)$。

Important

命题 12.6 在强收缩假设下， $$\sum_{j=0}^\infty\mathbb E[\phi(X_{t+j})\mid X_t]=\sum_{j=0}^\infty(\mathbb T^j\phi)(X_t)$$ 是良定义的。

Note

证明定义 $\big(\sum_{j=0}^\infty\mathbb T^j\big)\phi(X_t)\equiv\sum_{j=0}^\infty(\mathbb T^j\phi)(X_t)$，记恒等算子 $\mathbb I$，定义 $$(\mathbb I-\mathbb T)^{-1}=\sum_{j=0}^\infty\mathbb T^j$$ 这是合适定义，因 $(\mathbb I-\mathbb T)\sum_{j=0}^\infty\mathbb T^j=\sum_{j=0}^\infty\mathbb T^j-\sum_{j=1}^\infty\mathbb T^j=\mathbb I$。Lemma 12.1：$\|X+Y\|\le\|X\|+\|Y\|$（证：$\sigma_{X+Y}^2=\sigma_X^2+\sigma_Y^2+2\rho\sigma_X\sigma_Y\le(\sigma_X+\sigma_Y)^2$）。于是 $$\Big\|\sum_{j=0}^\infty\mathbb E[\phi(X_{t+j})\mid X_t]\Big\|=\Big\|\sum_{j=0}^\infty(\mathbb T^j\phi)(X_t)\Big\|\le\sum_{j=0}^\infty\|(\mathbb T^j\phi)(X_t)\|\le\sum_{j=0}^\infty\lambda^j\|\phi(X_t)\|=\frac{\|\phi(X_t)\|}{1-\lambda}$$ 有界，故 $\sum_{j=0}^\infty\mathbb E[\phi(X_{t+j})\mid X_t]$ 的二阶矩存在、一阶矩亦存在，良定义。$\blacksquare$

Markov 差分过程的鞅提取算法。 设 $\{Y_t\}$ 的差分过程由 Markov 过程 $\{X_t\}$ 与 i.i.d. $\{W_{t+1}\}$ 生成：$Y_{t+1}-Y_t=\kappa(X_t,W_{t+1})$（对应一般情形的 $X_{t+1}$）、$X_{t+1}=\psi(X_t,W_{t+1})$；记当期变量 $x\equiv X_t$、$w\equiv W_t$，下期变量 $x^\star\equiv X_{t+1}$、$w^\star\equiv W_{t+1}$。算法分两块提取鞅增量： 1. 从可加泛函对其条件均值的偏离中提取鞅。 定义条件均值 $\bar f(x)=\mathbb E[\kappa(x,w^\star)\mid x]$，偏离 $\kappa_2(x,w^\star)=\kappa(x,w^\star)-\bar f(x)$。则 $\mathbb E[\kappa_2(x,w^\star)\mid x]=\bar f(x)-\bar f(x)=0$，$\kappa_2$ 是鞅增量的一部分。 2. 从条件均值对无条件均值的偏离中提取鞅。 定义无条件均值 $\nu=\mathbb E[\kappa(x,w^\star)]=\mathbb E[\mathbb E[\kappa(x,w^\star)\mid x]]$，偏离 $f(x)=\bar f(x)-\nu$；定义 $g(x)=f(x)+\mathbb T g(x)$，则 $g(x)=(\mathbb I-\mathbb T)^{-1}f(x)=\sum_{j=0}^\infty(\mathbb T^j f)(x)=\sum_{j=0}^\infty\mathbb E[f(X_{t+j})\mid X_t=x]$；定义 $$\kappa_1(x,w^\star)\equiv\sum_{j=0}^\infty\mathbb E[f(X_{t+1+j})\mid X_{t+1}=\psi(x,w^\star)]-\sum_{j=0}^\infty\mathbb E[f(X_{t+1+j})\mid X_t=x]=g(\psi(x,w^\star))-g(x)+f(x)$$ 可证 $\mathbb E[\kappa_1(x,w^\star)\mid x]=0$，故 $\kappa_1$ 也是鞅增量的一部分。 3. 合并两部分并重写 $\{Y_t\}$： $$\kappa(x,w^\star)=\nu+f(x)+\kappa_2(x,w^\star)=\nu+(\kappa_1(x,w^\star)-g(\psi(x,w^\star))+g(x))+\kappa_2(x,w^\star)=\nu+\underbrace{(\kappa_1+\kappa_2)}_{\text{mart.}}-g(\psi(x,w^\star))+g(x)$$ $$\Rightarrow Y_t=Y_0+t\nu+\sum_{j=1}^t\kappa_a(X_{j-1},W_j)-g(X_t)+g(X_0)$$ 其中 $\sum_{j=1}^t\kappa_a(X_{j-1},W_j)$ 是鞅成分、$[-g(X_t)]$ 是平稳部分、$t\nu$ 是趋势、$g(X_0)+Y_0$ 是不变部分。此算法与一般情形、以及下面更特殊的 VAR Markov 情形一致。

Let $\{X_t\}_{t=1}^\infty$ be a (multivariate) Markov process, and define $\{Y_t\}_{t=1}^\infty$ by $$Y_{t+1}-Y_t=\phi(X_{t+1})\tag{12.8}$$ Define the operator $\mathbb T$ by $$(\mathbb T\phi)(x)=\mathbb E[\phi(X_{t+1})\mid X_t=x]$$ Let $\mathbb L^2$ be the space of functions with finite second moment, and $\mathbb Z$ its zero-mean subspace: $$\mathbb L^2=\{\phi:\mathbb E[\phi(X_t)^2]<\infty\},\qquad\mathbb Z=\{\phi\in\mathbb L^2:\mathbb E[\phi(X_t)]=0\}$$ For $\phi\in\mathbb Z$, write $\phi(X_{t+1})=\mathbb E[\phi(X_{t+1})\mid X_t]+e_{t+1}$, where the forecast error $e_{t+1}$ is uncorrelated with $\mathbb E[\phi(X_{t+1})\mid X_t]$. So $$\operatorname{Var}(\phi(X_{t+1}))=\operatorname{Var}(\mathbb E[\phi(X_{t+1})\mid X_t])+\operatorname{Var}(e_{t+1})\ge\operatorname{Var}((\mathbb T\phi)(X_t))$$ Also, $\phi\in\mathbb Z$ gives $\mathbb E[\phi(X_t)]=0$, and $\mathbb E[(\mathbb T\phi)(X_t)]=\mathbb E[\mathbb E[\phi(X_{t+1})\mid X_t]]=\mathbb E[\phi(X_{t+1})]=0$ by the law of iterated expectations, so $\mathbb T\phi$ has zero mean. Both sides of the variance inequality therefore are second moments, and using stationarity of $\{X_t\}$ to replace $X_{t+1}$ by $X_t$ on the left, $$\mathbb E[(\phi(X_t))^2]\ge\mathbb E[((\mathbb T\phi)(X_t))^2]\tag{12.9}$$ Hence $\mathbb T\phi$ has zero mean and a finite second moment, i.e. $\mathbb T\phi\in\mathbb Z$. For $\phi\in\mathbb Z$, denote $\|\phi\|=(\mathbb E[(\phi(X_t))^2])^{1/2}$; then (12.9) says $\|\mathbb T\phi\|\le\|\phi\|$, i.e. $\mathbb T$ is a weak contraction. Suppose we can further impose that $\mathbb T$ is a strong contraction on $\mathbb Z$: $\|\mathbb T\phi\|\le\lambda\|\phi\|$ for some $\lambda\in(0,1)$. By the Markov property and induction, for $j\ge1$, $\mathbb E[\phi(X_{t+j})\mid X_t]=(\mathbb T^j\phi)(X_t)$, and the strong contraction gives $\|\mathbb T^j\phi\|\le\lambda^j\|\phi\|$.

Important

Proposition 12.6 Under the strong-contraction assumption, $$\sum_{j=0}^\infty\mathbb E[\phi(X_{t+j})\mid X_t]=\sum_{j=0}^\infty(\mathbb T^j\phi)(X_t)$$ is well-defined.

Note

Proof Define $\big(\sum_{j=0}^\infty\mathbb T^j\big)\phi(X_t)\equiv\sum_{j=0}^\infty(\mathbb T^j\phi)(X_t)$, denote the identity operator $\mathbb I$, and define $$(\mathbb I-\mathbb T)^{-1}=\sum_{j=0}^\infty\mathbb T^j$$ which is appropriate because $(\mathbb I-\mathbb T)\sum_{j=0}^\infty\mathbb T^j=\sum_{j=0}^\infty\mathbb T^j-\sum_{j=1}^\infty\mathbb T^j=\mathbb I$. Lemma 12.1: $\|X+Y\|\le\|X\|+\|Y\|$ (proof: $\sigma_{X+Y}^2=\sigma_X^2+\sigma_Y^2+2\rho\sigma_X\sigma_Y\le(\sigma_X+\sigma_Y)^2$). So $$\Big\|\sum_{j=0}^\infty\mathbb E[\phi(X_{t+j})\mid X_t]\Big\|=\Big\|\sum_{j=0}^\infty(\mathbb T^j\phi)(X_t)\Big\|\le\sum_{j=0}^\infty\|(\mathbb T^j\phi)(X_t)\|\le\sum_{j=0}^\infty\lambda^j\|\phi(X_t)\|=\frac{\|\phi(X_t)\|}{1-\lambda}$$ is bounded, so the second moment of $\sum_{j=0}^\infty\mathbb E[\phi(X_{t+j})\mid X_t]$ exists, hence so does its first moment, and it is well-defined. $\blacksquare$
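As a quick sanity check of Proposition 12.6, consider a scalar AR(1) state. This is an illustrative assumption, not part of the text above: $X_{t+1}=aX_t+bW_{t+1}$ with $|a|<1$, and $\phi(x)=x$, which has zero mean under the stationary distribution. Every step of the proof can then be computed in closed form:

```latex
\begin{aligned}
(\mathbb T\phi)(x) &= \mathbb E[X_{t+1}\mid X_t=x] = a\,x,
  &\|\mathbb T\phi\| &= |a|\,\|\phi\|,\\
(\mathbb T^j\phi)(x) &= a^j x,
  &\|\mathbb T^j\phi\| &= |a|^j\,\|\phi\|,\\
\sum_{j=0}^\infty(\mathbb T^j\phi)(x) &= \frac{x}{1-a},
  &\Big\|\sum_{j=0}^\infty\mathbb T^j\phi\Big\| &= \frac{\|\phi\|}{1-a}\le\frac{\|\phi\|}{1-|a|}.
\end{aligned}
```

For this particular $\phi$ the contraction constant is $\lambda=|a|$, and the bound from the proof holds with equality when $0\le a<1$.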

Martingale Extraction Algorithm for the Markov difference process. Suppose the difference process of $\{Y_t\}$ is generated by a Markov process $\{X_t\}$ and i.i.d. shocks $\{W_{t+1}\}$: $Y_{t+1}-Y_t=\kappa(X_t,W_{t+1})$ (playing the role of $X_{t+1}$ in the general case) and $X_{t+1}=\psi(X_t,W_{t+1})$. Denote current-period variables by $x\equiv X_t$, $w\equiv W_t$ and next-period variables by $x^\star\equiv X_{t+1}$, $w^\star\equiv W_{t+1}$. The algorithm extracts the martingale increment in two parts and then combines them:

1. Extract a martingale increment from the deviation of the additive functional from its conditional mean. Define the conditional mean $\bar f(x)=\mathbb E[\kappa(x,w^\star)\mid x]$ and the deviation $\kappa_2(x,w^\star)=\kappa(x,w^\star)-\bar f(x)$. Then $\mathbb E[\kappa_2(x,w^\star)\mid x]=\bar f(x)-\bar f(x)=0$, so $\kappa_2$ is part of the martingale increment.

2. Extract a martingale increment from the deviation of the conditional mean from the unconditional mean. Define $\nu=\mathbb E[\kappa(x,w^\star)]=\mathbb E[\mathbb E[\kappa(x,w^\star)\mid x]]$ and the deviation $f(x)=\bar f(x)-\nu$. Let $g$ solve $g(x)=f(x)+(\mathbb T g)(x)$, so that $g(x)=(\mathbb I-\mathbb T)^{-1}f(x)=\sum_{j=0}^\infty(\mathbb T^j f)(x)=\sum_{j=0}^\infty\mathbb E[f(X_{t+j})\mid X_t=x]$. Define $$\kappa_1(x,w^\star)\equiv\sum_{j=0}^\infty\mathbb E[f(X_{t+1+j})\mid X_{t+1}=\psi(x,w^\star)]-\sum_{j=0}^\infty\mathbb E[f(X_{t+1+j})\mid X_t=x]=g(\psi(x,w^\star))-g(x)+f(x)$$ Since $\mathbb E[g(\psi(x,w^\star))\mid x]=(\mathbb T g)(x)=g(x)-f(x)$, we get $\mathbb E[\kappa_1(x,w^\star)\mid x]=0$, so $\kappa_1$ is also part of the martingale increment.

3. Put the two parts together and rewrite $\{Y_t\}$. Writing $\kappa_a\equiv\kappa_1+\kappa_2$ for the combined martingale increment, $$\kappa(x,w^\star)=\nu+f(x)+\kappa_2(x,w^\star)=\nu+(\kappa_1(x,w^\star)-g(\psi(x,w^\star))+g(x))+\kappa_2(x,w^\star)=\nu+\underbrace{(\kappa_1+\kappa_2)}_{\equiv\kappa_a\ \text{(mart. increment)}}-g(\psi(x,w^\star))+g(x)$$ $$\Rightarrow Y_t=Y_0+t\nu+\sum_{j=1}^t\kappa_a(X_{j-1},W_j)-g(X_t)+g(X_0)$$ where $\sum_{j=1}^t\kappa_a(X_{j-1},W_j)$ is the martingale component, $[-g(X_t)]$ is the stationary part, $t\nu$ is the trend, and $g(X_0)+Y_0$ is the invariant part.

This algorithm is consistent with the general case and with the even more special VAR Markov case below.
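The extraction steps above can be sketched numerically on a finite-state Markov chain, where $\mathbb T$ is just the transition matrix. Everything concrete here is a hypothetical choice for illustration: the 3-state matrix `P`, the increment table `kappa`, and the simplifying assumption that the shock $w^\star$ matters only through the next state $x^\star$, so that $\kappa$ is a function of $(x,x^\star)$. The sketch computes $\bar f$, $\nu$, $f$, $g$, and $\kappa_a=\kappa_1+\kappa_2$, then checks that $\kappa_a$ has zero conditional mean and that the decomposition holds along a simulated path.

```python
import numpy as np

# Hypothetical 3-state chain: P[i, k] = Pr(X_{t+1} = k | X_t = i)
P = np.array([[0.8, 0.15, 0.05],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
# Hypothetical increment kappa[i, k] = Y_{t+1} - Y_t on a transition i -> k
kappa = np.array([[0.0, 1.0, 2.0],
                  [-1.0, 0.5, 1.5],
                  [-2.0, -0.5, 0.3]])
n = P.shape[0]

# Stationary distribution: pi' P = pi' and sum(pi) = 1, solved as a stacked system
lhs = np.vstack([P.T - np.eye(n), np.ones((1, n))])
rhs = np.append(np.zeros(n), 1.0)
pi = np.linalg.lstsq(lhs, rhs, rcond=None)[0]

f_bar = (P * kappa).sum(axis=1)   # conditional mean E[kappa | x]
nu = pi @ f_bar                   # unconditional mean (trend)
f = f_bar - nu                    # zero-mean deviation

# g = (I - T)^{-1} f on the zero-mean subspace; adding 1 pi' makes the matrix
# invertible without changing the solution, because pi' f = 0
g = np.linalg.solve(np.eye(n) - P + np.outer(np.ones(n), pi), f)

kappa_2 = kappa - f_bar[:, None]                  # deviation from conditional mean
kappa_1 = g[None, :] - g[:, None] + f[:, None]    # g(x*) - g(x) + f(x)
kappa_a = kappa_1 + kappa_2                       # combined martingale increment

cond_mean = (P * kappa_a).sum(axis=1)             # E[kappa_a | x], zero in every state

# Pathwise check of Y_t = Y_0 + t*nu + sum_j kappa_a - g(X_t) + g(X_0)
rng = np.random.default_rng(0)
T = 500
X = np.zeros(T + 1, dtype=int)
for t in range(T):
    X[t + 1] = rng.choice(n, p=P[X[t]])
Y = np.concatenate([[0.0], np.cumsum(kappa[X[:-1], X[1:]])])
mart = np.concatenate([[0.0], np.cumsum(kappa_a[X[:-1], X[1:]])])
Y_decomp = Y[0] + nu * np.arange(T + 1) + mart - g[X] + g[X[0]]
max_err = np.max(np.abs(Y - Y_decomp))
print(cond_mean, max_err)
```

Solving with `I - P + 1 pi'` rather than `I - P` is a design choice of this sketch: `I - P` is singular on constant functions, and the rank-one correction selects the solution with $\pi'g=0$, which coincides with $\sum_j\mathbb T^jf$.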

12.4.4 Martingale Decomposition: VAR (a special case of the Markov difference process)

特化到生成 $\{Y_t\}$ 差分过程的 VAR Markov 过程 $\{\mathbf X_t\}$： $$\mathbf X_{t+1}=\mathbf A\mathbf X_t+\mathbf B\mathbf W_{t+1},\quad\mathbf W_{t+1}\sim N(0,\mathbf I)\ \text{i.i.d.},\ \mathbf A\text{ stable}$$ 定义 $$\mathbf Y_{t+1}-\mathbf Y_t=\mathbf D\mathbf X_t+\mathbf F\mathbf W_{t+1}+\nu$$ 令 $\mathbf H_{t+1}=\mathbf D\mathbf X_t+\mathbf F\mathbf W_{t+1}$，则 $\mathbf Y_{t+1}-\mathbf Y_t=\mathbf H_{t+1}+\nu$。

Tip

注记 12.12 $\nu$ 解读为趋势，$\mathbf H_{t+1}$ 为去趋势增量。

定义 $\overline{\mathbf H}_t=\mathbb E[\sum_{j=0}^\infty\mathbf H_{t+j}\mid\mathcal F_t]=\mathbf D\mathbf X_{t-1}+\mathbf F\mathbf W_t+\mathbf D\mathbf X_t+\mathbf D\mathbf A\mathbf X_t+\dots=\mathbf D\mathbf X_{t-1}+\mathbf F\mathbf W_t+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf X_t$。则鞅增量 $$\overline{\mathbf H}_{t+1}-\mathbb E[\overline{\mathbf H}_{t+1}\mid\mathcal F_t]=(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)\mathbf W_{t+1}$$

Note

推导 $$\begin{aligned}\overline{\mathbf H}_{t+1}-\mathbb E[\overline{\mathbf H}_{t+1}\mid\mathcal F_t]&=\big(\mathbf D\mathbf X_t+\mathbf F\mathbf W_{t+1}+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf X_{t+1}\big)-\big(\mathbf D\mathbf X_t+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf A\mathbf X_t\big)\\&=\mathbf F\mathbf W_{t+1}+\mathbf D(\mathbf I-\mathbf A)^{-1}(\mathbf X_{t+1}-\mathbf A\mathbf X_t)\\&=\mathbf F\mathbf W_{t+1}+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B\mathbf W_{t+1}=(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)\mathbf W_{t+1}\end{aligned}$$ $\blacksquare$

由 §12.4.3 的鞅提取，把所有结果合并得 VAR 情形的鞅分解 $$\mathbf Y_t=\mathbf Y_0+t\nu+\Big[\sum_{j=1}^t(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)\mathbf W_j\Big]+\mathbf D(\mathbf I-\mathbf A)^{-1}(\mathbf X_0-\mathbf X_t)\tag{12.10}$$ 对照一般分解 (12.7)：$(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)\mathbf W_j$ 映射到 $M_j$、$\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf X_t$ 映射到 $\tilde X_t$。故 $(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)\mathbf W_j$ 是鞅增量，即永久冲击。

Tip

注记 12.13 $\overline{\mathbf H}_{t+1}-\mathbb E[\overline{\mathbf H}_{t+1}\mid\mathcal F_t]$ 是 $\mathbf W_j$ 的线性组合，是 $\mathbf Y_{t+\infty}$ 与 $\mathbf Y_t$（给定 $\mathbf W_j$ 实现到 $t+1$ 期）的期望距离。若冲击是暂时的，单个冲击 $\mathbf W_{t+1}$ 的实现应不改变 $\mathbf Y_{t+\infty}$、从而不改变期望距离，即 $\overline{\mathbf H}_{t+1}-\mathbb E[\overline{\mathbf H}_{t+1}\mid\mathcal F_t]=0$；但实际并非总是如此——当 $(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)$ 非零（设非奇异），$(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)\mathbf W_{t+1}$ 必不为零，故 $\mathbf W_{t+1}$ 是永久冲击。

Specializing to the VAR Markov process $\{\mathbf X_t\}$ generating $\{Y_t\}$'s difference process: $$\mathbf X_{t+1}=\mathbf A\mathbf X_t+\mathbf B\mathbf W_{t+1},\quad\mathbf W_{t+1}\sim N(0,\mathbf I)\ \text{i.i.d.},\ \mathbf A\text{ stable}$$ Define $$\mathbf Y_{t+1}-\mathbf Y_t=\mathbf D\mathbf X_t+\mathbf F\mathbf W_{t+1}+\nu$$ Let $\mathbf H_{t+1}=\mathbf D\mathbf X_t+\mathbf F\mathbf W_{t+1}$; then $\mathbf Y_{t+1}-\mathbf Y_t=\mathbf H_{t+1}+\nu$.

Tip

Remark 12.12 $\nu$ is interpreted as the trend and $\mathbf H_{t+1}$ as the detrended increment.

Define $\overline{\mathbf H}_t=\mathbb E[\sum_{j=0}^\infty\mathbf H_{t+j}\mid\mathcal F_t]=\mathbf D\mathbf X_{t-1}+\mathbf F\mathbf W_t+\mathbf D\mathbf X_t+\mathbf D\mathbf A\mathbf X_t+\dots=\mathbf D\mathbf X_{t-1}+\mathbf F\mathbf W_t+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf X_t$. Then the martingale increment is $$\overline{\mathbf H}_{t+1}-\mathbb E[\overline{\mathbf H}_{t+1}\mid\mathcal F_t]=(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)\mathbf W_{t+1}$$

Note

Derivation $$\begin{aligned}\overline{\mathbf H}_{t+1}-\mathbb E[\overline{\mathbf H}_{t+1}\mid\mathcal F_t]&=\big(\mathbf D\mathbf X_t+\mathbf F\mathbf W_{t+1}+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf X_{t+1}\big)-\big(\mathbf D\mathbf X_t+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf A\mathbf X_t\big)\\&=\mathbf F\mathbf W_{t+1}+\mathbf D(\mathbf I-\mathbf A)^{-1}(\mathbf X_{t+1}-\mathbf A\mathbf X_t)\\&=\mathbf F\mathbf W_{t+1}+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B\mathbf W_{t+1}=(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)\mathbf W_{t+1}\end{aligned}$$ $\blacksquare$

By the martingale extraction of §12.4.3, putting everything together gives the VAR martingale decomposition $$\mathbf Y_t=\mathbf Y_0+t\nu+\Big[\sum_{j=1}^t(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)\mathbf W_j\Big]+\mathbf D(\mathbf I-\mathbf A)^{-1}(\mathbf X_0-\mathbf X_t)\tag{12.10}$$ Comparing with the general decomposition (12.7): $(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)\mathbf W_j$ maps into $M_j$ and $\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf X_t$ maps into $\tilde X_t$. So $(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)\mathbf W_j$ is the martingale increment, i.e. the permanent shock.
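Decomposition (12.10) is an exact pathwise identity, so it can be checked by simulation. The sketch below uses hypothetical matrices (`A` upper triangular with diagonal entries below one, the others drawn at random) and compares the simulated $\mathbf Y_t$ with the right-hand side of (12.10) at every date.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k, T = 3, 2, 2, 200                      # state, shock, and observation dimensions

A = np.array([[0.5, 0.1, 0.0],
              [0.0, 0.3, 0.2],
              [0.0, 0.0, 0.4]])                # stable: eigenvalues 0.5, 0.3, 0.4
B = rng.normal(size=(n, m))
D = rng.normal(size=(k, n))
F = rng.normal(size=(k, m))
nu = np.array([0.01, 0.02])

X = np.zeros((T + 1, n))
Y = np.zeros((T + 1, k))
X[0] = rng.normal(size=n)
W = rng.normal(size=(T + 1, m))                # W[0] is never used
for t in range(T):
    Y[t + 1] = Y[t] + D @ X[t] + F @ W[t + 1] + nu
    X[t + 1] = A @ X[t] + B @ W[t + 1]

G = D @ np.linalg.inv(np.eye(n) - A)           # D (I - A)^{-1}
perm = F + G @ B                               # permanent-response matrix
mart = np.cumsum(W[1:] @ perm.T, axis=0)       # sum_{j<=t} (F + D (I-A)^{-1} B) W_j
t_idx = np.arange(1, T + 1)[:, None]

# Right-hand side of (12.10) for t = 1..T
Y_decomp = Y[0] + t_idx * nu + mart + (X[0] - X[1:]) @ G.T
max_err = np.max(np.abs(Y_decomp - Y[1:]))
print(max_err)                                 # floating-point noise only
```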

Tip

Remark 12.13 $\overline{\mathbf H}_{t+1}-\mathbb E[\overline{\mathbf H}_{t+1}\mid\mathcal F_t]$ is a linear combination of the elements of $\mathbf W_{t+1}$: it is the revision, once $\mathbf W_{t+1}$ is realized, in the expected detrended distance between $\mathbf Y_{t+\infty}$ and $\mathbf Y_t$. If the shock were purely transitory, the realization of the single shock $\mathbf W_{t+1}$ would make no difference to $\mathbf Y_{t+\infty}$, hence none to the expected distance, i.e. $\overline{\mathbf H}_{t+1}-\mathbb E[\overline{\mathbf H}_{t+1}\mid\mathcal F_t]=0$. This is not the case in general: when $(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)$ is non-singular, $(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)\mathbf W_{t+1}$ is nonzero whenever $\mathbf W_{t+1}\ne\mathbf 0$, which happens with probability one, so $\mathbf W_{t+1}$ is a permanent shock.

12.4.5–12.4.6 Permanent Shocks under VAR: Definition and Identification

永久冲击。 考虑 $\mathbf Y_t$ 对 $\mathbf W_1$ 的脉冲响应：设 $\mathbf X_0=\mathbf 0$、$\mathbf W_t=0$（$\forall t\ge2$），由 (12.10)， $$\mathbf Y_t=\mathbf Y_0+t\nu+(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)\mathbf W_1-\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf X_t\tag{12.13}$$ 即便 $t\to\infty$，$\mathbf Y_t$ 对 $\mathbf W_1$ 的响应仍恒为 $(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)\mathbf W_1$（永不衰减），故 $\mathbf W_1$ 是永久冲击、永久响应为 $(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)\mathbf W_1$。以上是对向量 $\mathbf Y_t$ 的永久冲击定义；也可逐元素考虑：设 $Y_t^j,\nu^j,F^j,D^j$ 为 $\mathbf Y_t,\nu,\mathbf F,\mathbf D$ 的第 $j$ 行（$Y_t^j,\nu^j$ 标量，$F^j,D^j$ 行向量），由 (12.10)， $$Y_t^j=Y_0^j+t\nu^j+\Big[\sum_{j=1}^t(F^j+D^j(\mathbf I-\mathbf A)^{-1}\mathbf B)\mathbf W_j\Big]+D^j(\mathbf I-\mathbf A)^{-1}(\mathbf X_0-\mathbf X_t)$$ $\mathbf Y_t$ 第 $j$ 元的永久冲击 $\mathbf W^j_t$ 平行于 $(F^j+D^j(\mathbf I-\mathbf A)^{-1}\mathbf B)'$；永久响应是向量 $(F^j+D^j(\mathbf I-\mathbf A)^{-1}\mathbf B)$ 的模长。

识别。 $(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)$ 决定永久响应；矩阵 $\mathbf A,\mathbf D$ 可由理论或回归估得，但无法在无额外假设下识别 $\mathbf F,\mathbf B$——存在无法消去的任意性。可在 $\mathbf B$ 中堆叠行以减少待估元素：设 $\mathbf X_t$ 为 $n\times1$、$\mathbf A$ 为 $n\times n$、$\mathbf W_t$ 为 $m\times1$、$\mathbf B$ 为 $n\times m$（$n>m$），$\operatorname{rank}(\mathbf B)=m$，则可用 $n\times n$ 矩阵 $\mathbf R$ 左乘把 $\mathbf B$ 的行堆叠为 $\mathbf{RB}=\begin{bmatrix}\mathbf B_1\\\mathbf 0_{(n-m)\times m}\end{bmatrix}$（$\mathbf B_1$ 为 $m\times m$）。则 $$\mathbf{RX}_{t+1}=\mathbf{RAX}_t+\mathbf{RBW}_{t+1}\Rightarrow\mathbf Z_{t+1}=\mathbf C\mathbf Z_t+\begin{bmatrix}\mathbf B_1\\\mathbf 0\end{bmatrix}\mathbf W_{t+1}$$ （$\mathbf Z_{t+1}=\mathbf{RX}_{t+1}$、$\mathbf C=\mathbf{RAR}^{-1}$。）误差项 $\mathbf Z_{t+1}-\mathbf C\mathbf Z_t$ 可观测，其方差—协方差矩阵 $\begin{bmatrix}\mathbf B_1\\\mathbf 0\end{bmatrix}\mathbf I\begin{bmatrix}\mathbf B_1'&\mathbf 0\end{bmatrix}=\begin{bmatrix}\mathbf B_1\mathbf B_1'&\mathbf 0\\\mathbf 0&\mathbf 0\end{bmatrix}$ 可估。但只能估出 $\mathbf B_1\mathbf B_1'$、非 $\mathbf B_1$——因任意 $m\times m$ 正交阵 $\mathbf Q$（$\mathbf{QQ}'=\mathbf I$）使 $\hat{\mathbf B}_1=\mathbf B_1\mathbf Q$ 满足 $\hat{\mathbf B}_1\hat{\mathbf B}_1'=\mathbf B_1\mathbf B_1'$。$\mathbf Q$ 任意，故不能定下 $\hat{\mathbf B}_1$。但 $\mathbf B$（或 $\mathbf B_1$）正是 $(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)$ 的关键成分，决定永久响应。故须用 §10.3 的技巧（Cholesky、长期识别、符号约束）限制 $\mathbf B$ 的候选范围。

Permanent shock. Consider the impulse response of $\mathbf Y_t$ to $\mathbf W_1$: suppose $\mathbf X_0=\mathbf 0$ and $\mathbf W_t=\mathbf 0$ for all $t\ge2$; by (12.10), $$\mathbf Y_t=\mathbf Y_0+t\nu+(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)\mathbf W_1-\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf X_t\tag{12.13}$$ Here $\mathbf X_t=\mathbf A^{t-1}\mathbf B\mathbf W_1\to\mathbf 0$ as $t\to\infty$ because $\mathbf A$ is stable, so the last term dies out while the term $(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)\mathbf W_1$ never decays. Hence $\mathbf W_1$ is a permanent shock with permanent response $(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)\mathbf W_1$. The above defines the permanent shock to the vector $\mathbf Y_t$; we can also consider it element-wise. Let $Y_t^j,\nu^j,F^j,D^j$ be the $j$-th rows of $\mathbf Y_t,\nu,\mathbf F,\mathbf D$ ($Y_t^j,\nu^j$ scalars, $F^j,D^j$ row vectors); by (12.10), $$Y_t^j=Y_0^j+t\nu^j+\Big[\sum_{s=1}^t(F^j+D^j(\mathbf I-\mathbf A)^{-1}\mathbf B)\mathbf W_s\Big]+D^j(\mathbf I-\mathbf A)^{-1}(\mathbf X_0-\mathbf X_t)$$ The permanent shock $\mathbf W^j_t$ to the $j$-th element of $\mathbf Y_t$ is parallel to $(F^j+D^j(\mathbf I-\mathbf A)^{-1}\mathbf B)'$; the permanent response to a unit-length shock in that direction is the norm of the vector $(F^j+D^j(\mathbf I-\mathbf A)^{-1}\mathbf B)$.

Identification. $(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)$ determines the permanent response. The matrices $\mathbf A,\mathbf D$ can be given by theory or estimated by regression, but $\mathbf F,\mathbf B$ cannot be identified without further assumptions: some arbitrariness remains that the data cannot remove. We can first stack rows of $\mathbf B$ to reduce the number of elements to estimate. Suppose $\mathbf X_t$ is $n\times1$, $\mathbf A$ is $n\times n$, $\mathbf W_t$ is $m\times1$, and $\mathbf B$ is $n\times m$ ($n>m$) with $\operatorname{rank}(\mathbf B)=m$; then an invertible $n\times n$ matrix $\mathbf R$ stacks the rows of $\mathbf B$ into $\mathbf{RB}=\begin{bmatrix}\mathbf B_1\\\mathbf 0_{(n-m)\times m}\end{bmatrix}$ (with $\mathbf B_1$ being $m\times m$). Then $$\mathbf{RX}_{t+1}=\mathbf{RAX}_t+\mathbf{RBW}_{t+1}\Rightarrow\mathbf Z_{t+1}=\mathbf C\mathbf Z_t+\begin{bmatrix}\mathbf B_1\\\mathbf 0\end{bmatrix}\mathbf W_{t+1}$$ where $\mathbf Z_{t+1}=\mathbf{RX}_{t+1}$ and $\mathbf C=\mathbf{RAR}^{-1}$. The error term $\mathbf Z_{t+1}-\mathbf C\mathbf Z_t$ is observable, so its variance-covariance matrix $\begin{bmatrix}\mathbf B_1\\\mathbf 0\end{bmatrix}\mathbf I\begin{bmatrix}\mathbf B_1'&\mathbf 0\end{bmatrix}=\begin{bmatrix}\mathbf B_1\mathbf B_1'&\mathbf 0\\\mathbf 0&\mathbf 0\end{bmatrix}$ is estimable. But we can only estimate $\mathbf B_1\mathbf B_1'$, not $\mathbf B_1$ itself: for any $m\times m$ orthogonal $\mathbf Q$ ($\mathbf{QQ}'=\mathbf I$), $\hat{\mathbf B}_1=\mathbf B_1\mathbf Q$ satisfies $\hat{\mathbf B}_1\hat{\mathbf B}_1'=\mathbf B_1\mathbf B_1'$. As $\mathbf Q$ is arbitrary, the covariance matrix alone cannot pin down $\mathbf B_1$. Yet $\mathbf B$ (or $\mathbf B_1$) is a crucial component of $(\mathbf F+\mathbf D(\mathbf I-\mathbf A)^{-1}\mathbf B)$, which determines the permanent response. So we must use the techniques of §10.3 (Cholesky, long-run identification, sign restrictions) to restrict the set of candidates for $\mathbf B$.
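The rotation argument can be illustrated with a small numerical sketch. All matrices are hypothetical, and to keep it short it assumes $n=m=2$, so $\mathbf B=\mathbf B_1$ and no stacking step is needed. It also rotates $\mathbf F$ together with $\mathbf B$ (an assumption of this sketch, equivalent to relabelling the shocks as $\mathbf Q'\mathbf W$), so that the rotated model implies the same innovation covariances. The covariance $\mathbf B_1\mathbf B_1'$ is unchanged, but the response to each individual shock is not. A Cholesky factor is shown as one possible normalization.

```python
import numpy as np

A = np.array([[0.6, 0.1],
              [0.0, 0.4]])                 # stable state matrix
D = np.array([[1.0, 0.5]])
F = np.array([[0.3, 0.2]])
B1 = np.array([[1.0, 0.0],
               [0.5, 0.8]])                # "true" impact matrix (lower triangular here)

theta = 0.7                                # arbitrary rotation angle
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # orthogonal: Q Q' = I
B1_hat, F_hat = B1 @ Q, F @ Q              # rotated, observationally equivalent model

Sigma = B1 @ B1.T                          # what the data can estimate
Sigma_hat = B1_hat @ B1_hat.T              # identical to Sigma


def permanent_response(F, D, A, B):
    """Return F + D (I - A)^{-1} B."""
    return F + D @ np.linalg.inv(np.eye(A.shape[0]) - A) @ B


resp = permanent_response(F, D, A, B1)
resp_hat = permanent_response(F_hat, D, A, B1_hat)   # equals resp @ Q

# One normalization: the lower-triangular Cholesky factor of Sigma
B1_chol = np.linalg.cholesky(Sigma)
print(resp, resp_hat)
```

Only the length of the permanent-response vector is the same across rotations; which shock carries it depends on $\mathbf Q$, which is why an identifying restriction is needed.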

12.4.7 Permanent Income Model

目标是识别永久冲击。考虑简单永久收入模型 $$K_{t+1}-K_t+C_t=[\exp(\rho)-1]K_t+Y_t$$ 其中 $Y_t$ 为 $t$ 期收入（设 $\{\log Y_t\}_{t=0}^\infty$ 平稳增量过程）、$K_t$ 为资本、$\rho$ 为生产率参数（$[\exp(\rho)-1]K_t\approx\rho K_t$ 为资本 $K_t$ 的产出）、$C_t$ 为消费（由宏观模型可证 $\{\log C_t\}_{t=0}^\infty$ 是鞅）。目标函数 $\sum_{t=0}^\infty\beta^t\log C_t$。定义 $\hat K_t=K_t/Y_t$、$\hat C_t=\log C_t-\log Y_t$，除以 $Y_t$ 重写： $$\hat K_{t+1}\frac{Y_{t+1}}{Y_t}-\hat K_t+\exp(\hat C_t)=[\exp(\rho)-1]\hat K_t+1$$ 特化状态过程 $\{X_t\}$ 为平稳 Markov（标量是向量特例），用 VAR 结果 $$X_{t+1}=AX_t+\mathbf{BW}_{t+1},\qquad\log(Y_{t+1})-\log(Y_t)=\nu+DX_t+\mathbf{FW}_{t+1}$$ $A$ 稳定（$<1$）、$\mathbf W_{t+1}\sim N(0,\mathbf I)$ i.i.d.、$\mathbf B,\mathbf F$ 为 $k\times1$、$\mathbf W_{t+1}$ 为 $k\times1$ 个体冲击向量。设 $X_0=0$、$\mathbf W_1\ne\mathbf 0$、$\mathbf W_t=\mathbf 0$（$t\ne1$），考察对 $\mathbf W_1$ 的脉冲响应： $$\log(Y_t)=\log(Y_0)+t\nu+(\mathbf F+D(\mathbf I-A)^{-1}\mathbf B)\mathbf W_1-D(\mathbf I-A)^{-1}A^{t-1}X_t\xrightarrow{t\to\infty}\log(Y_0)+t\nu+(\mathbf F+D(\mathbf I-A)^{-1}\mathbf B)\mathbf W_1$$ （宏观情形还需线性化系统、解出消费/储蓄的政策函数，此处略。）关键在于对暂时与永久冲击的不同响应：若冲击 $\mathbf W_1$ 与 $(\mathbf F+D(\mathbf I-A)^{-1}\mathbf B)$ 正交，则对长期收入无影响（暂时冲击）；若平行，则对长期收入持续（永久冲击）。直观地，暂时响应（如 $-D(\mathbf I-A)^{-1}A^{t-1}\mathbf W_1$）随消费/储蓄决策抵消、不永久改变收入；而永久响应一旦 $\mathbf W_1$ 实现就永久改变收入水平。

图示（散点/路径图，已转述）： 永久收入冲击下对数收入逐渐上行至新水平、对数消费立即跳到新水平并保持（响应永久）；暂时收入冲击下对数收入冲高后衰减回原水平、对数消费几乎不动（响应暂时）。

Tip

注记这些结果靠时间可分偏好得到。若放松为习惯效用，则消费过程 $\{\log C_t\}$ 不再是鞅；习惯效用下，消费的边际效用才是鞅。

Our goal is to identify permanent shocks. Consider a simple permanent income model $$K_{t+1}-K_t+C_t=[\exp(\rho)-1]K_t+Y_t$$ where $Y_t$ is time-$t$ income (with $\{\log Y_t\}_{t=0}^\infty$ a stationary increment process), $K_t$ is time-$t$ capital, $\rho$ is a productivity parameter ($[\exp(\rho)-1]K_t\approx\rho K_t$ is production from capital $K_t$), and $C_t$ is time-$t$ consumption (from the macro model one can show $\{\log C_t\}_{t=0}^\infty$ is a martingale). The objective is $\sum_{t=0}^\infty\beta^t\log C_t$. Define $\hat K_t=K_t/Y_t$ and $\hat C_t=\log C_t-\log Y_t$, and divide the budget constraint through by $Y_t$: $$\hat K_{t+1}\frac{Y_{t+1}}{Y_t}-\hat K_t+\exp(\hat C_t)=[\exp(\rho)-1]\hat K_t+1$$ Specialize the state process $\{X_t\}$ to a stationary scalar Markov process (a scalar being a special case of a vector) and use the VAR results: $$X_{t+1}=AX_t+\mathbf{BW}_{t+1},\qquad\log(Y_{t+1})-\log(Y_t)=\nu+DX_t+\mathbf{FW}_{t+1}$$ with $A$ stable ($|A|<1$), $\mathbf W_{t+1}\sim N(0,\mathbf I)$ an i.i.d. $k\times1$ vector of individual shocks, and $\mathbf B,\mathbf F$ being $1\times k$. Assume $X_0=0$, $\mathbf W_1\ne\mathbf 0$, $\mathbf W_t=\mathbf 0$ ($t\ne1$), and consider the impulse response to $\mathbf W_1$. Since $X_t=A^{t-1}\mathbf B\mathbf W_1$, $$\log(Y_t)=\log(Y_0)+t\nu+(\mathbf F+D(\mathbf I-A)^{-1}\mathbf B)\mathbf W_1-D(\mathbf I-A)^{-1}A^{t-1}\mathbf B\mathbf W_1$$ and the last term vanishes as $t\to\infty$, so for large $t$, $\log(Y_t)\approx\log(Y_0)+t\nu+(\mathbf F+D(\mathbf I-A)^{-1}\mathbf B)\mathbf W_1$. (In a macro setting one would also linearize the system and solve for policy functions for consumption and saving; this step is skipped here.) What matters is the differential response to transitory vs permanent shocks: if the shock $\mathbf W_1$ is orthogonal to $(\mathbf F+D(\mathbf I-A)^{-1}\mathbf B)'$, it has no effect on long-term income (transitory); if parallel, its effect on long-term income persists (permanent). Intuitively, transitory responses (e.g. the term $-D(\mathbf I-A)^{-1}A^{t-1}\mathbf B\mathbf W_1$ in the response of log income to $\mathbf W_1$) are offset by consumption and savings decisions and do not permanently change income; the permanent response, once $\mathbf W_1$ is realized, permanently changes the level of income.
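The contrast between a shock parallel to the permanent-response vector and one orthogonal to it can be traced out directly from the state and income equations. The parameter values below are hypothetical, `B` and `F` are stored as length-$k$ arrays with $k=2$, and the consumption side of the model is not simulated; the sketch only covers the detrended log-income response.

```python
import numpy as np

A, D = 0.8, 1.0                          # scalar state dynamics, |A| < 1
B = np.array([0.5, 0.2])                 # loadings of X on the k = 2 shocks
F = np.array([1.0, -0.3])                # direct loadings of income growth
perm = F + D / (1.0 - A) * B             # F + D (1 - A)^{-1} B


def detrended_response(W1, horizon=200):
    """Return log Y_t - log Y_0 - t*nu for t = 1..horizon, given X_0 = 0
    and W_t = 0 for t >= 2."""
    x, level, path = 0.0, 0.0, []
    for t in range(1, horizon + 1):
        shock = W1 if t == 1 else np.zeros_like(W1)
        level += D * x + F @ shock       # increment uses X_{t-1} and W_t
        x = A * x + B @ shock            # update the state to X_t
        path.append(level)
    return np.array(path)


W_par = perm / np.linalg.norm(perm)                            # parallel shock
W_orth = np.array([-perm[1], perm[0]]) / np.linalg.norm(perm)  # orthogonal shock

resp_par = detrended_response(W_par)     # settles at ||perm||
resp_orth = detrended_response(W_orth)   # moves initially, then returns to zero
print(resp_par[[0, -1]], resp_orth[[0, -1]])
```

The orthogonal shock still moves income in the short run through the term $-D(1-A)^{-1}A^{t-1}\mathbf B\mathbf W_1$, but leaves no trace in the long run.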

Figures (scatter/path plots, paraphrased): under a permanent income shock log income gradually rises to a new level and log consumption immediately jumps to a new level and stays there (the response is permanent); under a transitory income shock log income spikes and then decays back to the original level while log consumption barely moves (the response is transitory).

Tip

Remark These results rely on time-separable preferences. If we relax this assumption to habit utility, the consumption process $\{\log C_t\}$ will no longer be a martingale; under habit utility, the marginal utility of consumption will be the martingale.

12.5 Central Limit Theorem

上节得到含 (i) 时间趋势、(ii) 鞅成分的分解。现利用此分解与中心极限定理（CLT）的一个变体，刻画平稳增量过程的极限行为。下面的定理把 i.i.d. 情形的 CLT 推广到鞅差。回忆平稳增量过程的设定：$Y_{t+1}-Y_t=X_{t+1}$（$\{X_t\}$ 平稳遍历，$\mathbb E[X_t]=\nu$），$\overline X_t=\sum_{j=0}^\infty\mathbb E[X_{t+j}-\nu\mid\mathcal F_t]$，$M_t=\overline X_t-\tilde X_{t-1}=\overline X_t-\mathbb E[\overline X_t\mid\mathcal F_{t-1}]$，$Y_t-Y_0=\sum_{j=1}^t M_j-\tilde X_t+\tilde X_0+t\nu$，$\mathbb E[M_{t+1}\mid\mathcal F_t]=0$。

Important

定理 12.2（Billingsley 中心极限定理）设 $\big\{\sum_{j=0}^t M_j\big\}_{t=0}^\infty$ 为可加鞅过程，其增量 $M_t$ 平稳、遍历、鞅差（$\mathbb E[M_{t+1}\mid\mathcal F_t]=0$）。则 $$\frac1{\sqrt t}\Big(\sum_{j=0}^t M_j\Big)\xrightarrow{d}N\big(0,\mathbb E[(M_j)^2]\big)$$

由此可刻画 $Y_t$ 的极限分布。由鞅分解， $$Y_t-Y_0-t\nu=\sum_{j=1}^t M_j+(\tilde X_0-\tilde X_t)\Rightarrow\frac1{\sqrt t}(Y_t-Y_0-t\nu)=\frac1{\sqrt t}\Big(\sum_{j=1}^t M_j\Big)-\frac1{\sqrt t}\tilde X_t+\frac1{\sqrt t}\tilde X_0$$ 右端第一项由 CLT 收敛到正态，后两项依概率收敛到零，故 $$\frac1{\sqrt t}(Y_t-Y_0-t\nu)\xrightarrow{d}N\big(0,\mathbb E[(M_j)^2]\big)$$

观察： 1. $\operatorname{Var}(M_t)=\mathbb E[(M_t)^2]\ne\operatorname{Var}(X_t)$。 2. $\operatorname{Var}\big(\frac1{\sqrt t}\sum_{j=1}^t(X_j-\nu)\big)=\operatorname{Var}\big(\frac1{\sqrt t}(Y_t-Y_0-t\nu)\big)\to\operatorname{Var}(M_t)=\mathbb E[(M_j)^2]$。 - $\operatorname{Var}(M_t)$ 是 $\{X_t-\nu\}_{t=-\infty}^\infty$ 在零频率的谱密度：由谱密度定义 $s_x(\omega)=\frac1{2\pi}\sum_{j=-\infty}^\infty\gamma_j e^{-i\omega j}$，取 $\omega=0$ 得 $s_x(0)=\frac1{2\pi}\sum_{j=-\infty}^\infty\gamma_j$。

Note

推导（$\operatorname{Var}(M_t)\cong s_x(0)$） $$\begin{aligned}\operatorname{Var}\Big(\frac1{\sqrt t}\sum_{j=1}^t(X_j-\nu)\Big)&=\operatorname{Cov}\Big(\frac1{\sqrt t}\sum_{j=1}^t(X_j-\nu),\frac1{\sqrt t}\sum_{j=1}^t(X_j-\nu)\Big)=\frac1t\sum_{j=1}^t\sum_{i=1}^t\operatorname{Cov}((X_j-\nu),(X_i-\nu))\\&=\frac1t\sum_{j=1}^t\sum_{i=1}^t\gamma_{j-i}\xrightarrow{t\to\infty}\sum_{j=-\infty}^\infty\gamma_j\end{aligned}$$ 另一方面 $\operatorname{Var}\big(\frac1{\sqrt t}\sum_{j=1}^t(X_j-\nu)\big)\to\operatorname{Var}(M_t)$。故 $\operatorname{Var}(M_t)\cong s_x(0)$。$\blacksquare$ 3. 设 $Y_0=0$、$Y_t=\sum_{j=1}^t X_j$，则 $$\frac1t Y_t\xrightarrow{p}\nu\ \text{(LLN)},\qquad\frac1{\sqrt t}(Y_t-t\nu)\xrightarrow{d}N\big(0,\mathbb E[(M_j)^2]\big)\ \text{(CLT)}$$ 末式也可写成更熟悉的形式 $$\sqrt t\Big(\frac1t Y_t-\nu\Big)\xrightarrow{d}N\big(0,\mathbb E[(M_j)^2]\big)$$

In the prior section we produced a decomposition with (i) a time trend and (ii) a martingale component. Now we exploit that decomposition and a variant of the Central Limit Theorem (CLT) to characterize the limiting behavior of a stationary increment process. The next theorem extends the CLT from the i.i.d. case to martingale differences. Recall the set-up of a stationary increment process: $Y_{t+1}-Y_t=X_{t+1}$ ($\{X_t\}$ stationary, ergodic, $\mathbb E[X_t]=\nu$), $\overline X_t=\sum_{j=0}^\infty\mathbb E[X_{t+j}-\nu\mid\mathcal F_t]$, $M_t=\overline X_t-\tilde X_{t-1}=\overline X_t-\mathbb E[\overline X_t\mid\mathcal F_{t-1}]$, $Y_t-Y_0=\sum_{j=1}^t M_j-\tilde X_t+\tilde X_0+t\nu$, $\mathbb E[M_{t+1}\mid\mathcal F_t]=0$.

Important

Theorem 12.2 (Billingsley Central Limit Theorem) Let $\big\{\sum_{j=1}^t M_j\big\}_{t=1}^\infty$ be an additive martingale process whose increments $M_t$ are stationary and ergodic martingale differences ($\mathbb E[M_{t+1}\mid\mathcal F_t]=0$) with $\mathbb E[(M_j)^2]<\infty$. Then $$\frac1{\sqrt t}\Big(\sum_{j=1}^t M_j\Big)\xrightarrow{d}N\big(0,\mathbb E[(M_j)^2]\big)$$

This characterizes the limiting distribution of $Y_t$. By the martingale decomposition, $$Y_t-Y_0-t\nu=\sum_{j=1}^t M_j+(\tilde X_0-\tilde X_t)\Rightarrow\frac1{\sqrt t}(Y_t-Y_0-t\nu)=\frac1{\sqrt t}\Big(\sum_{j=1}^t M_j\Big)-\frac1{\sqrt t}\tilde X_t+\frac1{\sqrt t}\tilde X_0$$ The first term on the RHS converges to a normal by the CLT, and the other two converge to zero in probability, so $$\frac1{\sqrt t}(Y_t-Y_0-t\nu)\xrightarrow{d}N\big(0,\mathbb E[(M_j)^2]\big)$$

Observations:

1. $\operatorname{Var}(M_t)=\mathbb E[(M_t)^2]\ne\operatorname{Var}(X_t)$ in general.

2. $\operatorname{Var}\big(\frac1{\sqrt t}\sum_{j=1}^t(X_j-\nu)\big)=\operatorname{Var}\big(\frac1{\sqrt t}(Y_t-Y_0-t\nu)\big)\to\operatorname{Var}(M_t)=\mathbb E[(M_j)^2]$. Up to the factor $2\pi$, $\operatorname{Var}(M_t)$ is the spectral density of $\{X_t-\nu\}_{t=-\infty}^\infty$ at frequency zero: with $\gamma_j=\operatorname{Cov}(X_t,X_{t-j})$ and the spectral density defined as $s_x(\omega)=\frac1{2\pi}\sum_{j=-\infty}^\infty\gamma_j e^{-i\omega j}$, setting $\omega=0$ gives $s_x(0)=\frac1{2\pi}\sum_{j=-\infty}^\infty\gamma_j$, so $\operatorname{Var}(M_t)=\sum_{j=-\infty}^\infty\gamma_j=2\pi s_x(0)$.

Note

Derivation ($\operatorname{Var}(M_t)\cong s_x(0)$) $$\begin{aligned}\operatorname{Var}\Big(\frac1{\sqrt t}\sum_{j=1}^t(X_j-\nu)\Big)&=\operatorname{Cov}\Big(\frac1{\sqrt t}\sum_{j=1}^t(X_j-\nu),\frac1{\sqrt t}\sum_{j=1}^t(X_j-\nu)\Big)=\frac1t\sum_{j=1}^t\sum_{i=1}^t\operatorname{Cov}((X_j-\nu),(X_i-\nu))\\&=\frac1t\sum_{j=1}^t\sum_{i=1}^t\gamma_{j-i}=\sum_{k=-(t-1)}^{t-1}\Big(1-\frac{|k|}t\Big)\gamma_k\xrightarrow{t\to\infty}\sum_{j=-\infty}^\infty\gamma_j\end{aligned}$$ where the limit uses $\sum_j|\gamma_j|<\infty$. On the other hand $\operatorname{Var}\big(\frac1{\sqrt t}\sum_{j=1}^t(X_j-\nu)\big)\to\operatorname{Var}(M_t)$. Therefore $\operatorname{Var}(M_t)=\sum_{j=-\infty}^\infty\gamma_j=2\pi s_x(0)$, i.e. $\operatorname{Var}(M_t)\cong s_x(0)$ up to the normalizing constant $2\pi$. $\blacksquare$

3. Set $Y_0=0$, $Y_t=\sum_{j=1}^t X_j$; then $$\frac1t Y_t\xrightarrow{p}\nu\ \text{(LLN)},\qquad\frac1{\sqrt t}(Y_t-t\nu)\xrightarrow{d}N\big(0,\mathbb E[(M_j)^2]\big)\ \text{(CLT)}$$ The last can be written in the more familiar form $$\sqrt t\Big(\frac1t Y_t-\nu\Big)\xrightarrow{d}N\big(0,\mathbb E[(M_j)^2]\big)$$
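These observations can be checked by Monte Carlo on a scalar version of the VAR case from §12.4.4. The parameter values, sample sizes, and seed are hypothetical choices. The sketch compares the sample variance of $\frac1{\sqrt t}(Y_t-t\nu)$ with $\mathbb E[M^2]=(f+db/(1-a))^2$, with the variance of a single increment, and with the sum of the autocovariances of the increments computed in closed form for this model.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, d, f, nu = 0.5, 1.0, 1.0, 1.0, 0.1     # hypothetical scalar parameters
n_paths, T = 4000, 2000

# Start X_0 from its stationary distribution N(0, b^2 / (1 - a^2))
x = rng.normal(scale=b / np.sqrt(1 - a**2), size=n_paths)
y = np.zeros(n_paths)
for _ in range(T):
    w = rng.normal(size=n_paths)
    y += nu + d * x + f * w                  # Y_{t+1} - Y_t = nu + d X_t + f W_{t+1}
    x = a * x + b * w                        # X_{t+1} = a X_t + b W_{t+1}

scaled = (y - T * nu) / np.sqrt(T)           # (Y_T - Y_0 - T nu) / sqrt(T), with Y_0 = 0

var_mart = (f + d * b / (1 - a)) ** 2        # E[M^2] from the VAR martingale increment
var_incr = d**2 * b**2 / (1 - a**2) + f**2   # variance of one increment (gamma_0)

# Autocovariances of the increments at lags j >= 1, summed over both signs
j = np.arange(1, 200)
gamma_j = d**2 * a**j * b**2 / (1 - a**2) + d * f * b * a ** (j - 1)
long_run_var = var_incr + 2 * gamma_j.sum()  # sum of gamma_j over all integer lags

print(scaled.var(), var_mart, long_run_var, var_incr)
```

With these values the increment variance is $7/3$ while $\mathbb E[M^2]$ and the sum of autocovariances are both $9$, which illustrates Observation 1; the sample variance of `scaled` should be close to the latter, up to simulation error.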

12.5.1 Cointegration

考虑过程 $$Y_t=r_1 Y_t^1+r_2 Y_t^2$$ 其中 $$Y_t^1=Y_0^1+t\nu_1+\sum_{j=1}^t M_j^1+(\tilde X_0^1-\tilde X_t^1),\qquad Y_t^2=Y_0^2+t\nu_2+\sum_{j=1}^t M_j^2+(\tilde X_0^2-\tilde X_t^2)$$ 则可对 $Y_t$ 写出类似分解： $$Y_t=(r_1 Y_0^1+r_2 Y_0^2)+t(r_1\nu_1+r_2\nu_2)+\sum_{j=1}^t(r_1 M_j^1+r_2 M_j^2)+\big((r_1\tilde X_0^1+r_2\tilde X_0^2)-(r_1\tilde X_t^1+r_2\tilde X_t^2)\big)$$ 称 $Y^1$ 与 $Y^2$ 协整，若 $\exists r_1,r_2\ne0$ 使 $$r_1\nu_1+r_2\nu_2=0\quad\text{and}\quad r_1 M_t^1+r_2 M_t^2=0,\ \forall t$$ $(r_1,r_2)$ 称协整向量。代入得 $$Y_t=(r_1 Y_0^1+r_2 Y_0^2)+\big((r_1\tilde X_0^1+r_2\tilde X_0^2)-(r_1\tilde X_t^1+r_2\tilde X_t^2)\big)$$ 全部为平稳成分（趋势项与永久冲击项被协整向量消去）。故当 $Y_t^1,Y_t^2$ 协整时，$Y_{t+1}-Y_t$ 只含平稳成分。

Consider the process $$Y_t=r_1 Y_t^1+r_2 Y_t^2$$ where $$Y_t^1=Y_0^1+t\nu_1+\sum_{j=1}^t M_j^1+(\tilde X_0^1-\tilde X_t^1),\qquad Y_t^2=Y_0^2+t\nu_2+\sum_{j=1}^t M_j^2+(\tilde X_0^2-\tilde X_t^2)$$ We can write a similar decomposition for $Y_t$: $$Y_t=(r_1 Y_0^1+r_2 Y_0^2)+t(r_1\nu_1+r_2\nu_2)+\sum_{j=1}^t(r_1 M_j^1+r_2 M_j^2)+\big((r_1\tilde X_0^1+r_2\tilde X_0^2)-(r_1\tilde X_t^1+r_2\tilde X_t^2)\big)$$ We say $Y^1$ and $Y^2$ are cointegrated if there exist $r_1,r_2\ne0$ such that $$r_1\nu_1+r_2\nu_2=0\quad\text{and}\quad r_1 M_t^1+r_2 M_t^2=0,\ \forall t$$ $(r_1,r_2)$ is the cointegrating vector. Plugging in, $$Y_t=(r_1 Y_0^1+r_2 Y_0^2)+\big((r_1\tilde X_0^1+r_2\tilde X_0^2)-(r_1\tilde X_t^1+r_2\tilde X_t^2)\big)$$ whose terms are all either time-invariant or stationary (the trend and permanent-shock terms are eliminated by the cointegrating vector). So when $Y_t^1$ and $Y_t^2$ are cointegrated, the combination $Y_t=r_1Y_t^1+r_2Y_t^2$ contains only stationary components, even though $Y_t^1$ and $Y_t^2$ individually may carry a trend and a martingale component.
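A small constructed example may make the cancellation concrete. Every number is hypothetical: the two series share one permanent shock with loadings 1 and 2, have trends 0.1 and 0.2, and their transient terms are drawn as independent AR(1) paths purely for illustration (in a full model they would be driven by the same shocks). With $(r_1,r_2)=(2,-1)$ both cointegration conditions hold.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
t = np.arange(T + 1)
W = np.concatenate([[0.0], rng.normal(size=T)])   # common shock; W[0] is a placeholder


def ar1(rho, size, rng):
    """AR(1) path started at zero, standing in for the transient term X-tilde."""
    x = np.zeros(size)
    eps = rng.normal(size=size)
    for s in range(1, size):
        x[s] = rho * x[s - 1] + eps[s]
    return x


xt1, xt2 = ar1(0.5, T + 1, rng), ar1(0.8, T + 1, rng)
nu1, nu2 = 0.1, 0.2
M1, M2 = 1.0 * W, 2.0 * W                         # same permanent shock, different loadings

# Build each series from its martingale decomposition, with Y_0 = 0
Y1 = nu1 * t + np.cumsum(M1) + (xt1[0] - xt1)
Y2 = nu2 * t + np.cumsum(M2) + (xt2[0] - xt2)

r1, r2 = 2.0, -1.0                                # r1*nu1 + r2*nu2 = 0 and r1*M1 + r2*M2 = 0
Y = r1 * Y1 + r2 * Y2
stationary_part = (r1 * xt1[0] + r2 * xt2[0]) - (r1 * xt1 + r2 * xt2)
print(Y1.std(), Y2.std(), Y.std())
```

The combination `Y` coincides with the transient terms alone, so its dispersion stays bounded while `Y1` and `Y2` drift.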

12.6 Revealed States: Likelihood Process

12.6.1 Likelihood Construction

设 $\{\mathbf W_{t+1}\}$ 为冲击过程，$\mathbb E[\mathbf W_{t+1}\mid\mathcal F_t]=0$。定义状态过程 $\{\mathbf X_t\}$ 为平稳 Markov 过程 $$\mathbf X_{t+1}=\phi(\mathbf X_t,\mathbf W_{t+1})$$ $\mathbf W_{t+1}\sim N(0,\mathbf I)$ i.i.d.、$\mathbf X_t$ 为 $t$ 期状态。构造观测过程 $\{\mathbf Y_t\}$ 为状态与冲击的函数： $$\mathbf Y_{t+1}-\mathbf Y_t=\kappa(\mathbf X_t,\mathbf W_{t+1})$$ 假设：(1) $\mathbf X_0$ 可观测；(2) 存在已知函数 $\chi$ 使 $\mathbf W_{t+1}=\chi(\mathbf X_t,\mathbf Y_{t+1}-\mathbf Y_t)$（反演假设：可用 $\mathbf X_t$ 与 $\mathbf Y_{t+1}-\mathbf Y_t$ 反解 $\mathbf W_{t+1}$）；(3) 用 $\phi$ 由 $\mathbf X_t$ 与 $\mathbf W_{t+1}$ 反解 $\mathbf X_{t+1}$；由 $\mathbf X_0$ 可观测可归纳反解过程 $\{\mathbf X_t\}$ 全部项——即显状态。设 $\mathbf Y_{t+1}-\mathbf Y_t$ 有密度 $\psi(\cdot\mid\mathbf x)$（关于测度 $\tau$、条件于 $\mathbf X_t=\mathbf x$）。

Important

例 12.1（一阶 VAR Markov 过程）考虑一阶 VAR Markov 过程 $\{\mathbf X_t\}$：$\mathbf X_{t+1}=\mathbf A\mathbf X_t+\mathbf B\mathbf W_{t+1}$，即 $\phi(\mathbf X_t,\mathbf W_{t+1})=\mathbf A\mathbf X_t+\mathbf B\mathbf W_{t+1}$，且 $\mathbf Y_{t+1}-\mathbf Y_t=\mathbf D\mathbf X_t+\mathbf F\mathbf W_{t+1}$（即 $\kappa(\mathbf X_t,\mathbf W_{t+1})=\mathbf D\mathbf X_t+\mathbf F\mathbf W_{t+1}$）。$\mathbf Y_t,\mathbf Y_{t+1}$ 为 $k\times1$、$\mathbf F$ 为 $k\times k$ 非奇异，则 $\mathbf W_{t+1}=\mathbf F^{-1}(\mathbf Y_{t+1}-\mathbf Y_t)-\mathbf F^{-1}\mathbf D\mathbf X_t$，即 $\chi(\mathbf X_t,\mathbf Y_{t+1}-\mathbf Y_t)=\mathbf F^{-1}(\mathbf Y_{t+1}-\mathbf Y_t)-\mathbf F^{-1}\mathbf D\mathbf X_t$。由 $\mathbf W_{t+1}\sim N(0,\mathbf I)$ i.i.d.，条件于 $\mathbf X_t$ 有 $\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t\sim N(\underbrace{\mathbf D\mathbf X_t}_{\text{revealed}},\mathbf F\mathbf F')$；因均值 $\mathbf D\mathbf X_t$ 随状态变化，增量本身并非 i.i.d.。

由此构造（条件）似然过程 $\{L_t\}$： $$L_t=\prod_{j=1}^t\psi(\mathbf Y_j-\mathbf Y_{j-1}\mid\mathbf X_{j-1})$$ 对数似然 $$\log L_t=\sum_{j=1}^t\log\psi(\mathbf Y_j-\mathbf Y_{j-1}\mid\mathbf X_{j-1})$$ $\log L_t$ 有平稳增量（因增量 $\log\psi(\mathbf Y_j-\mathbf Y_{j-1}\mid\mathbf X_{j-1})$ 由平稳分布抽取），故可对 $\log L_t$ 做鞅分解。

Let $\{\mathbf W_{t+1}\}$ be a process of shocks, $\mathbb E[\mathbf W_{t+1}\mid\mathcal F_t]=0$. Define the state process $\{\mathbf X_t\}$ as a stationary Markov process $$\mathbf X_{t+1}=\phi(\mathbf X_t,\mathbf W_{t+1})$$ with $\mathbf W_{t+1}\sim N(0,\mathbf I)$ i.i.d. and $\mathbf X_t$ the state at time $t$. Construct the observation process $\{\mathbf Y_t\}$ as a function of state and shock: $$\mathbf Y_{t+1}-\mathbf Y_t=\kappa(\mathbf X_t,\mathbf W_{t+1})$$ Assumptions: (1) $\mathbf X_0$ is observed; (2) there exists a known function $\chi$ s.t. $\mathbf W_{t+1}=\chi(\mathbf X_t,\mathbf Y_{t+1}-\mathbf Y_t)$ (the inversion assumption: we can back out $\mathbf W_{t+1}$ from $\mathbf X_t$ and $\mathbf Y_{t+1}-\mathbf Y_t$); (3) apply $\phi$ to back out $\mathbf X_{t+1}$ from $\mathbf X_t$ and $\mathbf W_{t+1}$; since $\mathbf X_0$ is observed, we can use induction to back out all terms in $\{\mathbf X_t\}$ — i.e. we have revealed states. Assume $\mathbf Y_{t+1}-\mathbf Y_t$ has a density $\psi(\cdot\mid\mathbf x)$ w.r.t. measure $\tau$ conditional on $\mathbf X_t=\mathbf x$.

Important

Example 12.1 (First-order VAR Markov process) Consider a first-order VAR Markov process $\{\mathbf X_t\}$: $\mathbf X_{t+1}=\mathbf A\mathbf X_t+\mathbf B\mathbf W_{t+1}$, i.e. $\phi(\mathbf X_t,\mathbf W_{t+1})=\mathbf A\mathbf X_t+\mathbf B\mathbf W_{t+1}$, and $\mathbf Y_{t+1}-\mathbf Y_t=\mathbf D\mathbf X_t+\mathbf F\mathbf W_{t+1}$ (i.e. $\kappa(\mathbf X_t,\mathbf W_{t+1})=\mathbf D\mathbf X_t+\mathbf F\mathbf W_{t+1}$). With $\mathbf Y_t,\mathbf Y_{t+1}$ being $k\times1$ and $\mathbf F$ a $k\times k$ non-singular matrix, $\mathbf W_{t+1}=\mathbf F^{-1}(\mathbf Y_{t+1}-\mathbf Y_t)-\mathbf F^{-1}\mathbf D\mathbf X_t$, i.e. $\chi(\mathbf X_t,\mathbf Y_{t+1}-\mathbf Y_t)=\mathbf F^{-1}(\mathbf Y_{t+1}-\mathbf Y_t)-\mathbf F^{-1}\mathbf D\mathbf X_t$. Since $\mathbf W_{t+1}\sim N(0,\mathbf I)$ i.i.d., conditional on $\mathbf X_t$ we have $\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t\sim N(\underbrace{\mathbf D\mathbf X_t}_{\text{revealed}},\mathbf F\mathbf F')$. The increments themselves are not i.i.d., because the conditional mean $\mathbf D\mathbf X_t$ moves with the state.

This constructs the (conditional) likelihood process $\{L_t\}$: $$L_t=\prod_{j=1}^t\psi(\mathbf Y_j-\mathbf Y_{j-1}\mid\mathbf X_{j-1})$$ with log-likelihood $$\log L_t=\sum_{j=1}^t\log\psi(\mathbf Y_j-\mathbf Y_{j-1}\mid\mathbf X_{j-1})$$ $\log L_t$ has stationary increments (the increment $\log\psi(\mathbf Y_j-\mathbf Y_{j-1}\mid\mathbf X_{j-1})$ is drawn from a stationary distribution), so we can do a martingale decomposition on $\log L_t$.
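The construction can be sketched in code for the scalar version of Example 12.1. This is an illustrative sketch under assumptions, not the notes' own implementation: the parameter values, the choice $X_0=0$, and the function names `simulate` and `log_likelihood` are hypothetical.

```python
import math
import random

def simulate(a, b, d, f, t_max, x0=0.0, seed=0):
    """Simulate scalar X_{t+1} = a X_t + b W_{t+1}, dY_{t+1} = d X_t + f W_{t+1}."""
    rng = random.Random(seed)
    x, dys = x0, []
    for _ in range(t_max):
        w = rng.gauss(0.0, 1.0)
        dys.append(d * x + f * w)   # observed increment Y_{t+1} - Y_t
        x = a * x + b * w
    return dys

def log_likelihood(dys, a, b, d, f, x0=0.0):
    """Accumulate log L_t by revealing the state step by step."""
    x, log_l = x0, 0.0
    for dy in dys:
        w = (dy - d * x) / f                      # chi(X_t, Y_{t+1} - Y_t)
        # log density of N(d*x, f^2) evaluated at the observed increment
        log_l += -0.5 * math.log(2 * math.pi * f * f) - 0.5 * w * w
        x = a * x + b * w                         # phi(X_t, W_{t+1})
    return log_l

dys = simulate(a=0.8, b=1.0, d=0.5, f=1.0, t_max=2000)
print(log_likelihood(dys, a=0.8, b=1.0, d=0.5, f=1.0))  # at the simulated parameters
print(log_likelihood(dys, a=0.8, b=1.0, d=0.0, f=1.0))  # at a wrong value of d
```

Each pass through the loop uses only $\mathbf X_0$ and the observed increments, which is the sense in which the states are revealed.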

12.6.2 Likelihood Ratio

设 $\theta,\theta_0$ 为模型中某参数的两个候选（真值 $\theta_0$）。密度有两版 $\psi(\cdot\mid\mathbf x,\theta)$、$\psi(\cdot\mid\mathbf x,\theta_0)$，对应两个似然过程 $$L_t(\theta)=\prod_{j=1}^t\psi(\mathbf Y_j-\mathbf Y_{j-1}\mid\mathbf X_{j-1},\theta),\qquad L_t(\theta_0)=\prod_{j=1}^t\psi(\mathbf Y_j-\mathbf Y_{j-1}\mid\mathbf X_{j-1},\theta_0)$$ 则 $$\frac{L_{t+1}(\theta)}{L_{t+1}(\theta_0)}=\frac{L_t(\theta)}{L_t(\theta_0)}\cdot\frac{\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta)}{\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta_0)}\tag{12.15}$$ 取 $\mathbf y^\star\in\mathcal Y$ 为 $\mathbf Y_{t+1}-\mathbf Y_t$ 的值，考察 (12.15) 右端末项的条件期望： $$\mathbb E_{\theta_0}\Big[\frac{\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta)}{\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta_0)}\mid\mathcal F_t\Big]=\int_{\mathcal Y}\frac{\psi(\mathbf y^\star\mid\mathbf x,\theta)}{\psi(\mathbf y^\star\mid\mathbf x,\theta_0)}\psi(\mathbf y^\star\mid\mathbf x,\theta_0)\tau(d\mathbf y^\star)=\int_{\mathcal Y}\psi(\mathbf y^\star\mid\mathbf x,\theta)\tau(d\mathbf y^\star)=1$$ 代回 (12.15)，得 $$\mathbb E_{\theta_0}\Big[\frac{L_{t+1}(\theta)}{L_{t+1}(\theta_0)}\mid\mathcal F_t\Big]=\frac{L_t(\theta)}{L_t(\theta_0)}\tag{12.16}$$ 即过程 $\big\{\frac{L_t(\theta)}{L_t(\theta_0)}\big\}_{t=1}^\infty$ 是正鞅。

Let $\theta,\theta_0$ be two candidates for a parameter in the model (true value $\theta_0$). The density has two versions $\psi(\cdot\mid\mathbf x,\theta)$, $\psi(\cdot\mid\mathbf x,\theta_0)$, with two likelihood processes $$L_t(\theta)=\prod_{j=1}^t\psi(\mathbf Y_j-\mathbf Y_{j-1}\mid\mathbf X_{j-1},\theta),\qquad L_t(\theta_0)=\prod_{j=1}^t\psi(\mathbf Y_j-\mathbf Y_{j-1}\mid\mathbf X_{j-1},\theta_0)$$ Then $$\frac{L_{t+1}(\theta)}{L_{t+1}(\theta_0)}=\frac{L_t(\theta)}{L_t(\theta_0)}\cdot\frac{\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta)}{\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta_0)}\tag{12.15}$$ Take $\mathbf y^\star\in\mathcal Y$ as the value of $\mathbf Y_{t+1}-\mathbf Y_t$, and consider the conditional expectation of the last term on the RHS of (12.15): $$\mathbb E_{\theta_0}\Big[\frac{\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta)}{\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta_0)}\mid\mathcal F_t\Big]=\int_{\mathcal Y}\frac{\psi(\mathbf y^\star\mid\mathbf x,\theta)}{\psi(\mathbf y^\star\mid\mathbf x,\theta_0)}\psi(\mathbf y^\star\mid\mathbf x,\theta_0)\tau(d\mathbf y^\star)=\int_{\mathcal Y}\psi(\mathbf y^\star\mid\mathbf x,\theta)\tau(d\mathbf y^\star)=1$$ Plugging back into (12.15), $$\mathbb E_{\theta_0}\Big[\frac{L_{t+1}(\theta)}{L_{t+1}(\theta_0)}\mid\mathcal F_t\Big]=\frac{L_t(\theta)}{L_t(\theta_0)}\tag{12.16}$$ i.e. the process $\big\{\frac{L_t(\theta)}{L_t(\theta_0)}\big\}_{t=1}^\infty$ is a positive martingale.
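The key step, that the one-period ratio has conditional expectation 1 under $\theta_0$, can be checked numerically. The sketch below assumes a scalar Gaussian density $\psi(\cdot\mid x,\theta)=N(\theta x,1)$ and uses a plain trapezoid rule; the density, the grid and the function names are illustrative assumptions.

```python
import math

def normal_pdf(y, mean, sd=1.0):
    """Density of N(mean, sd^2) at y."""
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def expected_ratio(x, theta, theta0, lo=-20.0, hi=20.0, n=40001):
    """Trapezoid approximation of E_{theta0}[psi(y|x,theta) / psi(y|x,theta0)]."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for k in range(n):
        y = lo + k * h
        p0 = normal_pdf(y, theta0 * x)            # density under theta0
        ratio = normal_pdf(y, theta * x) / p0     # one-period likelihood ratio
        weight = 0.5 if k in (0, n - 1) else 1.0  # trapezoid end-point weights
        total += weight * ratio * p0 * h
    return total

print(expected_ratio(x=1.0, theta=0.3, theta0=0.8))  # close to 1
```

The integrand is the ratio times the $\theta_0$ density, mirroring the middle expression in the derivation before it collapses to $\int\psi(\mathbf y^\star\mid\mathbf x,\theta)\tau(d\mathbf y^\star)$.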

12.6.3 Log-likelihood Ratio

$\log$ 严格凹，由 Jensen 不等式 $\mathbb E[\log(a)]\le\log(\mathbb E[a])$，故 $$\mathbb E_{\theta_0}\Big[\log\Big(\frac{L_{t+1}(\theta)}{L_{t+1}(\theta_0)}\Big)\mid\mathcal F_t\Big]\le\log\Big(\mathbb E_{\theta_0}\Big[\frac{L_{t+1}(\theta)}{L_{t+1}(\theta_0)}\mid\mathcal F_t\Big]\Big)$$ 对 (12.16) 两边取 $\log$，得 $$\mathbb E_{\theta_0}\Big[\log\Big(\frac{L_{t+1}(\theta)}{L_{t+1}(\theta_0)}\Big)\mid\mathcal F_t\Big]\le\log\Big(\frac{L_t(\theta)}{L_t(\theta_0)}\Big)\tag{12.17}$$ 由 $\log$ 严格凹，(12.17) 仅当 $\theta_0=\theta$ 时取等。故过程 $\big\{\log\big(\frac{L_t(\theta)}{L_t(\theta_0)}\big)\big\}_{t=1}^\infty$ 是上鞅（super-martingale）。

Since $\log$ is strictly concave, by Jensen's inequality $\mathbb E[\log(a)]\le\log(\mathbb E[a])$, so $$\mathbb E_{\theta_0}\Big[\log\Big(\frac{L_{t+1}(\theta)}{L_{t+1}(\theta_0)}\Big)\mid\mathcal F_t\Big]\le\log\Big(\mathbb E_{\theta_0}\Big[\frac{L_{t+1}(\theta)}{L_{t+1}(\theta_0)}\mid\mathcal F_t\Big]\Big)$$ Taking $\log$ of both sides of (12.16), $$\mathbb E_{\theta_0}\Big[\log\Big(\frac{L_{t+1}(\theta)}{L_{t+1}(\theta_0)}\Big)\mid\mathcal F_t\Big]\le\log\Big(\frac{L_t(\theta)}{L_t(\theta_0)}\Big)\tag{12.17}$$ By strict concavity of $\log$, (12.17) holds with equality only when $\theta_0=\theta$. So the process $\big\{\log\big(\frac{L_t(\theta)}{L_t(\theta_0)}\big)\big\}_{t=1}^\infty$ is a super-martingale.
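As a worked instance (a sketch under assumed specifics, not taken from the original notes), consider the scalar case $Y_{t+1}-Y_t=\theta X_t+W_{t+1}$ with $W_{t+1}\sim N(0,1)$, treating $X_t$ as given, so that $\psi(\cdot\mid x,\theta)$ is the $N(\theta x,1)$ density. Under $\theta_0$, $Y_{t+1}-Y_t-\theta X_t=(\theta_0-\theta)X_t+W_{t+1}$, hence $$\log\frac{\psi(Y_{t+1}-Y_t\mid X_t,\theta)}{\psi(Y_{t+1}-Y_t\mid X_t,\theta_0)}=-\tfrac12\big((\theta_0-\theta)X_t+W_{t+1}\big)^2+\tfrac12W_{t+1}^2=-\tfrac12(\theta_0-\theta)^2X_t^2-(\theta_0-\theta)X_tW_{t+1}$$ Taking the conditional expectation and using $\mathbb E[W_{t+1}\mid\mathcal F_t]=0$, $$\mathbb E_{\theta_0}\Big[\log\frac{\psi(Y_{t+1}-Y_t\mid X_t,\theta)}{\psi(Y_{t+1}-Y_t\mid X_t,\theta_0)}\mid\mathcal F_t\Big]=-\tfrac12(\theta_0-\theta)^2X_t^2\le0$$ with equality only if $\theta=\theta_0$ or $X_t=0$. The log-likelihood ratio therefore drifts downward each period, as (12.17) requires.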

12.6.4 Maximum Likelihood Estimator

定义 $$v(\theta)=\mathbb E_{\theta_0}\big[\log(\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta))-\log(\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta_0))\big]$$ 即无条件期望（不条件于 $\mathcal F_t$）。

Important

命题 12.7 $v(\theta)\le0$。

Note

证明 先按 (12.15) 写 $\frac{L_{t+1}(\theta)}{L_{t+1}(\theta_0)}=\frac{L_t(\theta)}{L_t(\theta_0)}\cdot\frac{\psi(\cdots\theta)}{\psi(\cdots\theta_0)}$，取 $\log$ 后代入 (12.17)；因 $\log\frac{L_t(\theta)}{L_t(\theta_0)}$ 关于 $\mathcal F_t$ 可测，可从两边消去，得 $$\mathbb E_{\theta_0}\Big[\log\Big(\frac{\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta)}{\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta_0)}\Big)\mid\mathcal F_t\Big]\le0$$ 然后由迭代期望律（LIE） $$\begin{aligned}v(\theta)&=\mathbb E_{\theta_0}\big[\log(\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta))-\log(\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta_0))\big]\\&\overset{\text{LIE}}{=}\mathbb E_{\theta_0}\Big[\underbrace{\mathbb E_{\theta_0}\Big[\log\Big(\frac{\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta)}{\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta_0)}\Big)\mid\mathcal F_t\Big]}_{\le0}\Big]\le0\end{aligned}$$ $\blacksquare$

注意 $v(\theta)\le0$ 严格成立，除非 $\theta_0=\theta$ 或 $\psi(\cdot\mid\mathbf X_t,\theta)=\psi(\cdot\mid\mathbf X_t,\theta_0)$（一般不成立）。由平稳性（并假设遍历性，使极限为无条件期望），对 $\log(\psi(\mathbf Y_j-\mathbf Y_{j-1}\mid\mathbf X_{j-1},\theta))$ 与 $\log(\psi(\cdots\theta_0))$ 应用 Birkhoff 大数定律： $$\frac1t(\log(L_t(\theta))-\log(L_t(\theta_0)))=\frac1t\sum_{j=1}^t[\log(\psi(\cdots\theta))-\log(\psi(\cdots\theta_0))]\to\mathbb E_{\theta_0}[\log(\psi(\cdots\theta))-\log(\psi(\cdots\theta_0))]=v(\theta)$$ 即 $\frac1t(\log(L_t(\theta))-\log(L_t(\theta_0)))\to v(\theta)$。只要 $\theta_0\ne\theta$，$v(\theta)<0$，于是 $$\frac1t(\log(L_t(\theta))-\log(L_t(\theta_0)))\to v(\theta)<0\Rightarrow\log(L_t(\theta))-\log(L_t(\theta_0))\to-\infty\Rightarrow\frac{L_t(\theta)}{L_t(\theta_0)}\to0$$

Tip

注记 12.14 观察：$\max_{\theta\in\Theta}v(\theta)$ 在 $\theta=\theta_0$ 取得；定义 $\tilde v(\theta)=\lim_{t\to\infty}\frac1t\log(L_t(\theta))$，则 $\max_{\theta\in\Theta}\tilde v(\theta)$ 也在 $\theta=\theta_0$ 取得。这为最大似然估计提供启发式依据。

Define $$v(\theta)=\mathbb E_{\theta_0}\big[\log(\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta))-\log(\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta_0))\big]$$ which is an unconditional expectation (not conditional on $\mathcal F_t$).

Important

Proposition 12.7 $v(\theta)\le0$.

Note

Proof First, write $\frac{L_{t+1}(\theta)}{L_{t+1}(\theta_0)}=\frac{L_t(\theta)}{L_t(\theta_0)}\cdot\frac{\psi(\cdots\theta)}{\psi(\cdots\theta_0)}$ as in (12.15), take $\log$, and substitute into (12.17). Since $\log\frac{L_t(\theta)}{L_t(\theta_0)}$ is $\mathcal F_t$-measurable, it cancels from both sides, leaving $$\mathbb E_{\theta_0}\Big[\log\Big(\frac{\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta)}{\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta_0)}\Big)\mid\mathcal F_t\Big]\le0$$ Then, by the law of iterated expectations (LIE), $$\begin{aligned}v(\theta)&=\mathbb E_{\theta_0}\big[\log(\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta))-\log(\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta_0))\big]\\&\overset{\text{LIE}}{=}\mathbb E_{\theta_0}\Big[\underbrace{\mathbb E_{\theta_0}\Big[\log\Big(\frac{\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta)}{\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta_0)}\Big)\mid\mathcal F_t\Big]}_{\le0}\Big]\le0\end{aligned}$$ $\blacksquare$

Note that the inequality $v(\theta)\le0$ is strict unless $\theta_0=\theta$ or $\psi(\cdot\mid\mathbf X_t,\theta)=\psi(\cdot\mid\mathbf X_t,\theta_0)$ (not true in general). By stationarity (and ergodicity, so that the limit is the unconditional mean), applying the Birkhoff LLN to $\log(\psi(\mathbf Y_j-\mathbf Y_{j-1}\mid\mathbf X_{j-1},\theta))$ and $\log(\psi(\cdots\theta_0))$: $$\frac1t(\log(L_t(\theta))-\log(L_t(\theta_0)))=\frac1t\sum_{j=1}^t[\log(\psi(\cdots\theta))-\log(\psi(\cdots\theta_0))]\to\mathbb E_{\theta_0}[\log(\psi(\cdots\theta))-\log(\psi(\cdots\theta_0))]=v(\theta)$$ i.e. $\frac1t(\log(L_t(\theta))-\log(L_t(\theta_0)))\to v(\theta)$. As long as $\theta_0\ne\theta$, $v(\theta)<0$, so $$\frac1t(\log(L_t(\theta))-\log(L_t(\theta_0)))\to v(\theta)<0\Rightarrow\log(L_t(\theta))-\log(L_t(\theta_0))\to-\infty\Rightarrow\frac{L_t(\theta)}{L_t(\theta_0)}\to0$$
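The convergence of the average log-likelihood ratio can be illustrated by simulation. The sketch below is hypothetical: it assumes the scalar model $Y_{t+1}-Y_t=\theta_0X_t+W_{t+1}$, $X_{t+1}=aX_t+bW_{t+1}$ with the state path taken as observed, for which $v(\theta)=-\tfrac12(\theta-\theta_0)^2\,\mathbb E[X_t^2]$ and $\mathbb E[X_t^2]=b^2/(1-a^2)$. The parameter values and function names are assumptions.

```python
import random

def average_log_ratio(theta, theta0=1.0, a=0.5, b=1.0, t_max=100_000, seed=1):
    """(1/t) * (log L_t(theta) - log L_t(theta0)) along one simulated path."""
    rng = random.Random(seed)
    x, total = 0.0, 0.0
    for _ in range(t_max):
        w = rng.gauss(0.0, 1.0)
        dy = theta0 * x + w                     # increment generated under theta0
        # unit-variance Gaussian log densities; normalizing constants cancel
        total += -0.5 * (dy - theta * x) ** 2 + 0.5 * (dy - theta0 * x) ** 2
        x = a * x + b * w
    return total / t_max

def v(theta, theta0=1.0, a=0.5, b=1.0):
    """Closed-form limit for this Gaussian example."""
    return -0.5 * (theta - theta0) ** 2 * b * b / (1.0 - a * a)

for theta in (0.5, 1.0, 1.5):
    print(theta, average_log_ratio(theta), v(theta))
```

The sample average is zero at $\theta=\theta_0$ and negative elsewhere, so the likelihood ratio against any other candidate decays toward zero.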

Tip

Remark 12.14 Observations: $\max_{\theta\in\Theta}v(\theta)$ obtains its solution at $\theta=\theta_0$; defining $\tilde v(\theta)=\lim_{t\to\infty}\frac1t\log(L_t(\theta))$, then $\max_{\theta\in\Theta}\tilde v(\theta)$ also obtains its solution at $\theta=\theta_0$. This gives some heuristic justification for the maximum likelihood estimator.

12.6.5 Score Process

由 §12.6.4，$v(\theta)=\mathbb E_{\theta_0}[\log(\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta))]$（去掉与 $\theta$ 无关项）在 $\theta=\theta_0$ 取最大，故一阶条件 $$\int_{\mathcal Y}\frac{\partial\log(\psi(\mathbf y^\star\mid\mathbf x,\theta))}{\partial\theta}\Big|_{\theta=\theta_0}\psi(\mathbf y^\star\mid\mathbf x,\theta_0)\tau(d\mathbf y^\star)=0\tag{12.19}$$

Important

定义 12.14（得分过程）对 §12.6.1 的模型，得分过程 $\{S_t\}_{t=0}^\infty$ 定义为 $$S_t=\sum_{j=1}^t\frac{\partial\log(\psi(\mathbf Y_j-\mathbf Y_{j-1}\mid\mathbf X_{j-1},\theta))}{\partial\theta}\Big|_{\theta=\theta_0}=\frac{\partial\log L_t(\theta)}{\partial\theta}\Big|_{\theta=\theta_0}$$

Important

命题 12.8 得分过程 $\{S_t\}_{t=0}^\infty$ 是鞅。

Note

证明由 (12.19)， $$\int_{\mathcal Y}\frac{\partial\log(\psi(\mathbf y^\star\mid\mathbf x,\theta))}{\partial\theta}\Big|_{\theta=\theta_0}\psi(\mathbf y^\star\mid\mathbf x,\theta_0)\tau(d\mathbf y^\star)=0\Rightarrow\mathbb E_{\theta_0}\Big[\frac{\partial\log(\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta))}{\partial\theta}\Big|_{\theta=\theta_0}\mid\mathbf X_t\Big]=0$$ $$\Rightarrow\mathbb E_{\theta_0}[S_{t+1}-S_t\mid\mathcal F_t]=0\Rightarrow\mathbb E_{\theta_0}[S_{t+1}\mid\mathcal F_t]=S_t$$ $\blacksquare$

由于 $\{S_t\}$ 的增量由平稳分布抽取（$\mathbf X_t$ 平稳、$\mathbf W_{t+1}$ i.i.d.），可对平稳增量鞅过程 $\{S_t\}$ 应用 Birkhoff 大数定律与 Billingsley 中心极限定理： $$\frac1t S_t\to\mathbb E_{\theta_0}[S_{t+1}-S_t]=\mathbb E_{\theta_0}\big[\mathbb E_{\theta_0}[S_{t+1}-S_t\mid\mathcal F_t]\big]=0\tag{12.20}$$ $$\frac1{\sqrt t}S_t\xrightarrow{d}N\big(0,\mathbb E[(S_{t+1}-S_t)(S_{t+1}-S_t)']\big)\tag{12.21}$$ 其中 $\mathbb E[(S_{t+1}-S_t)(S_{t+1}-S_t)']$ 恰为 $\frac{\partial\log(\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta))}{\partial\theta}\big|_{\theta=\theta_0}$ 的方差—协方差矩阵，也称 Fisher 信息。还有 $$\sqrt t\big(\theta_t^{\text{MLE}}-\theta_0\big)\xrightarrow{d}N\Big(0,\mathbb E[(S_{t+1}-S_t)(S_{t+1}-S_t)']^{-1}\Big)\tag{12.22}$$ $\theta_t^{\text{MLE}}$ 为用到 $t$ 期数据的最大似然估计。

Tip

注记 12.15 即便此处没有 $(\mathbf X_t,\mathbf Y_{t+1}-\mathbf Y_t)$ 的 i.i.d. 性质，由 Birkhoff 大数定律与 Billingsley 中心极限定理仍能得到 (12.20)、(12.21)，进而 (12.22)——证明与 §6.3.3 i.i.d. 情形类似（彼处用 WLLN 与 CLT）。

By §12.6.4, $v(\theta)=\mathbb E_{\theta_0}[\log(\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta))]$ (dropping terms unrelated to $\theta$) is maximized at $\theta=\theta_0$, so the first-order condition is $$\int_{\mathcal Y}\frac{\partial\log(\psi(\mathbf y^\star\mid\mathbf x,\theta))}{\partial\theta}\Big|_{\theta=\theta_0}\psi(\mathbf y^\star\mid\mathbf x,\theta_0)\tau(d\mathbf y^\star)=0\tag{12.19}$$

Important

Definition 12.14 (Score process) For the model of §12.6.1, the score process $\{S_t\}_{t=0}^\infty$ is defined by $$S_t=\sum_{j=1}^t\frac{\partial\log(\psi(\mathbf Y_j-\mathbf Y_{j-1}\mid\mathbf X_{j-1},\theta))}{\partial\theta}\Big|_{\theta=\theta_0}=\frac{\partial\log L_t(\theta)}{\partial\theta}\Big|_{\theta=\theta_0}$$

Important

Proposition 12.8 The score process $\{S_t\}_{t=0}^\infty$ is a martingale.

Note

Proof From (12.19), $$\int_{\mathcal Y}\frac{\partial\log(\psi(\mathbf y^\star\mid\mathbf x,\theta))}{\partial\theta}\Big|_{\theta=\theta_0}\psi(\mathbf y^\star\mid\mathbf x,\theta_0)\tau(d\mathbf y^\star)=0\Rightarrow\mathbb E_{\theta_0}\Big[\frac{\partial\log(\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta))}{\partial\theta}\Big|_{\theta=\theta_0}\mid\mathbf X_t\Big]=0$$ $$\Rightarrow\mathbb E_{\theta_0}[S_{t+1}-S_t\mid\mathcal F_t]=0\Rightarrow\mathbb E_{\theta_0}[S_{t+1}\mid\mathcal F_t]=S_t$$ $\blacksquare$

Since the increments of $\{S_t\}$ are drawn from a stationary distribution (by stationarity of $\mathbf X_t$ and the i.i.d. property of $\mathbf W_{t+1}$), we can apply the Birkhoff LLN and Billingsley CLT to the stationary-increment martingale process $\{S_t\}$: $$\frac1t S_t\to\mathbb E_{\theta_0}[S_{t+1}-S_t]=\mathbb E_{\theta_0}\big[\mathbb E_{\theta_0}[S_{t+1}-S_t\mid\mathcal F_t]\big]=0\tag{12.20}$$ $$\frac1{\sqrt t}S_t\xrightarrow{d}N\big(0,\mathbb E[(S_{t+1}-S_t)(S_{t+1}-S_t)']\big)\tag{12.21}$$ where $\mathbb E[(S_{t+1}-S_t)(S_{t+1}-S_t)']$ is exactly the variance-covariance matrix of $\frac{\partial\log(\psi(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t,\theta))}{\partial\theta}\big|_{\theta=\theta_0}$, also called the Fisher information. We also have $$\sqrt t\big(\theta_t^{\text{MLE}}-\theta_0\big)\xrightarrow{d}N\Big(0,\mathbb E[(S_{t+1}-S_t)(S_{t+1}-S_t)']^{-1}\Big)\tag{12.22}$$ where $\theta_t^{\text{MLE}}$ is the maximum likelihood estimator based on data up to period $t$.
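A small simulation can illustrate (12.20) and the role of the Fisher information. The sketch is hypothetical: it assumes the scalar model $Y_{t+1}-Y_t=\theta_0X_t+W_{t+1}$, $X_{t+1}=aX_t+bW_{t+1}$ with the state path taken as observed, so the score increment at $\theta_0$ is $W_{t+1}X_t$ and the Fisher information is $\mathbb E[X_t^2]=b^2/(1-a^2)$. Names and parameter values are assumptions.

```python
import random

def score_and_mle(theta0=1.0, a=0.5, b=1.0, t_max=100_000, seed=2):
    """Return (S_t / t, theta_MLE, sample Fisher information) for one path."""
    rng = random.Random(seed)
    x = 0.0
    score = 0.0                 # S_t, the score evaluated at theta0
    sum_xy = sum_xx = 0.0
    for _ in range(t_max):
        w = rng.gauss(0.0, 1.0)
        dy = theta0 * x + w
        score += (dy - theta0 * x) * x    # increment W_{t+1} * X_t
        sum_xy += x * dy
        sum_xx += x * x
        x = a * x + b * w
    theta_mle = sum_xy / sum_xx           # maximizer of the Gaussian log-likelihood
    fisher = sum_xx / t_max               # sample analogue of E[X_t^2]
    return score / t_max, theta_mle, fisher

mean_score, theta_mle, fisher = score_and_mle()
print(mean_score, theta_mle, fisher)
```

With $a=0.5$ and $b=1$ the population Fisher information is $4/3$, so (12.22) suggests a standard error of roughly $\sqrt{3/(4t)}$ for the estimate in this example.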

Tip

Remark 12.15 Even though we don't have i.i.d. property of $(\mathbf X_t,\mathbf Y_{t+1}-\mathbf Y_t)$ in this case, by the Birkhoff LLN and Billingsley CLT we still reach (12.20) and (12.21), then (12.22) — the proof follows similarly to what we did in §6.3.3 in the i.i.d. case (with WLLN and CLT there).

12.6.6 Nuisance Parameters

设模型有多个未知参数，只关心其一。令 $\theta_0$（标量）为感兴趣参数，把其余未知参数堆叠为向量 $\tilde\theta_0$。因不关心 $\tilde\theta_0$ 里的参数，称之为冗余参数（nuisance parameters）。我们将说明：冗余参数越多，关心参数的信息越少（估计越不准、越不确定）。

定义 $\theta_0$ 的得分过程 $\{S_t\}_{t=0}^\infty$（随机标量过程）、$\tilde\theta_0$ 的得分过程 $\{\tilde S_t\}_{t=0}^\infty$（随机向量过程），把 $S_t$ 与 $\tilde S_t$ 堆叠为单一向量，令 $$V=\mathbb E_{\theta_0,\tilde\theta_0}\Big[\begin{pmatrix}S_{t+1}-S_t\\\tilde S_{t+1}-\tilde S_t\end{pmatrix}\begin{pmatrix}S_{t+1}-S_t\\\tilde S_{t+1}-\tilde S_t\end{pmatrix}'\Big]\tag{12.23}$$ 则 $$\sqrt t\Big(\begin{pmatrix}\theta_0^{\text{MLE}}\\\tilde\theta_0^{\text{MLE}}\end{pmatrix}-\begin{pmatrix}\theta_0\\\tilde\theta_0\end{pmatrix}\Big)\xrightarrow{d}N(0,V^{-1})$$ 其中 $V^{-1}$ 的 $(1,1)$ 元是 $\theta_0^{\text{MLE}}$ 的渐近方差。计算 $V^{-1}$：跑总体最小二乘回归 $$S_{t+1}-S_t=\beta'\big(\tilde S_{t+1}-\tilde S_t\big)+u_{t+1}\tag{12.24}$$ $u_{t+1}$ 为标量残差，与 $\tilde S_{t+1}-\tilde S_t$ 不相关。由 (12.24)， $$\begin{pmatrix}S_{t+1}-S_t\\\tilde S_{t+1}-\tilde S_t\end{pmatrix}=\begin{bmatrix}1&\beta'\\\mathbf 0&\mathbf I\end{bmatrix}\begin{pmatrix}u_{t+1}\\\tilde S_{t+1}-\tilde S_t\end{pmatrix}$$ 代入 (12.23) 并利用 $u_{t+1}$ 与 $\tilde S_{t+1}-\tilde S_t$ 不相关，得 $$V=\begin{bmatrix}1&\beta'\\\mathbf 0&\mathbf I\end{bmatrix}\begin{bmatrix}\mathbb E_{\theta_0,\tilde\theta_0}[u_{t+1}^2]&\mathbf 0\\\mathbf 0&\mathbb E_{\theta_0,\tilde\theta_0}[(\tilde S_{t+1}-\tilde S_t)(\tilde S_{t+1}-\tilde S_t)']\end{bmatrix}\begin{bmatrix}1&\mathbf 0\\\beta&\mathbf I\end{bmatrix}$$ 再用分块矩阵求逆 $$\begin{bmatrix}1&\beta'\\\mathbf 0&\mathbf I\end{bmatrix}^{-1}=\begin{bmatrix}1&-\beta'\\\mathbf 0&\mathbf I\end{bmatrix},\qquad\begin{bmatrix}1&\mathbf 0\\\beta&\mathbf I\end{bmatrix}^{-1}=\begin{bmatrix}1&\mathbf 0\\-\beta&\mathbf I\end{bmatrix}$$ 得 $$V^{-1}=\begin{bmatrix}1&\mathbf 0\\-\beta&\mathbf I\end{bmatrix}\begin{bmatrix}\frac1{\mathbb E_{\theta_0,\tilde\theta_0}[u_{t+1}^2]}&\mathbf 0\\\mathbf 0&\mathbb E_{\theta_0,\tilde\theta_0}[(\tilde S_{t+1}-\tilde S_t)(\tilde S_{t+1}-\tilde S_t)']^{-1}\end{bmatrix}\begin{bmatrix}1&-\beta'\\\mathbf 0&\mathbf I\end{bmatrix}$$ 故 $V^{-1}$ 的 $(1,1)$ 元为 $\frac1{\mathbb E_{\theta_0,\tilde\theta_0}[u_{t+1}^2]}$，即 $\tilde\theta_0$ 未知时 $\theta_0^{\text{MLE}}$ 的渐近方差；其倒数 $\mathbb E_{\theta_0,\tilde\theta_0}[u_{t+1}^2]$ 才是在 $\tilde\theta_0$ 未知时估计 $\theta_0$ 的 Fisher 信息。冗余参数越多，$(\tilde S_{t+1}-\tilde S_t)$ 解释掉的越多、残差方差 $\mathbb E[u_{t+1}^2]$ 越小，故 Fisher 信息 $\mathbb E[u_{t+1}^2]$ 越低、$\theta_0^{\text{MLE}}$ 的方差 $\frac1{\mathbb E[u_{t+1}^2]}$ 越大。

若反而已知所有冗余参数 $\tilde\theta_0$，则 $\theta_0$（标量）的 Fisher 信息为 $\mathbb E_{\theta_0}[(S_{t+1}-S_t)^2]$——即得分过程增量的方差。关键观察： $$\mathbb E_{\theta_0}[(S_{t+1}-S_t)^2]=\beta'\mathbb E_{\theta_0}\big[(\tilde S_{t+1}-\tilde S_t)(\tilde S_{t+1}-\tilde S_t)'\big]\beta+\mathbb E_{\theta_0,\tilde\theta_0}[u_{t+1}^2]\ge\mathbb E_{\theta_0,\tilde\theta_0}[u_{t+1}^2]$$ $$\Rightarrow\frac1{\mathbb E_{\theta_0}[(S_{t+1}-S_t)^2]}\le\frac1{\mathbb E_{\theta_0,\tilde\theta_0}[u_{t+1}^2]}$$ 即冗余参数已知时的 Fisher 信息大于未知时，故 $\theta_0^{\text{MLE}}$ 方差更小。

Tip

注记 12.16 此结果非常直观：对同一组观测序列（数据）与似然过程（及其导数得分过程），未知参数越多，情况越糟——更多参数会"稀释"信息，使我们对关心参数的估计更不准、更不确定。

Suppose there are multiple unknown parameters in the model, and we are only interested in one of them. Let $\theta_0$ (a scalar) denote the parameter of interest, and stack all other unknown parameters into a vector $\tilde\theta_0$. Since we are not interested in the parameters in $\tilde\theta_0$, they are called nuisance parameters. We will illustrate that the more nuisance parameters we have, the less information about the parameter of interest.

Define the score process $\{S_t\}_{t=0}^\infty$ for $\theta_0$ (a process of random scalars) and the score process $\{\tilde S_t\}_{t=0}^\infty$ for $\tilde\theta_0$ (a process of random vectors), stack $S_t$ and $\tilde S_t$ into a single vector, and let $$V=\mathbb E_{\theta_0,\tilde\theta_0}\Big[\begin{pmatrix}S_{t+1}-S_t\\\tilde S_{t+1}-\tilde S_t\end{pmatrix}\begin{pmatrix}S_{t+1}-S_t\\\tilde S_{t+1}-\tilde S_t\end{pmatrix}'\Big]\tag{12.23}$$ Then $$\sqrt t\Big(\begin{pmatrix}\theta_0^{\text{MLE}}\\\tilde\theta_0^{\text{MLE}}\end{pmatrix}-\begin{pmatrix}\theta_0\\\tilde\theta_0\end{pmatrix}\Big)\xrightarrow{d}N(0,V^{-1})$$ where the $(1,1)$ entry of $V^{-1}$ is the asymptotic variance of $\theta_0^{\text{MLE}}$. To compute $V^{-1}$, run a population least-squares regression $$S_{t+1}-S_t=\beta'\big(\tilde S_{t+1}-\tilde S_t\big)+u_{t+1}\tag{12.24}$$ where $u_{t+1}$ is a scalar residual uncorrelated with $\tilde S_{t+1}-\tilde S_t$. By (12.24), $$\begin{pmatrix}S_{t+1}-S_t\\\tilde S_{t+1}-\tilde S_t\end{pmatrix}=\begin{bmatrix}1&\beta'\\\mathbf 0&\mathbf I\end{bmatrix}\begin{pmatrix}u_{t+1}\\\tilde S_{t+1}-\tilde S_t\end{pmatrix}$$ Substituting into (12.23) and using that $u_{t+1}$ is uncorrelated with $\tilde S_{t+1}-\tilde S_t$, $$V=\begin{bmatrix}1&\beta'\\\mathbf 0&\mathbf I\end{bmatrix}\begin{bmatrix}\mathbb E_{\theta_0,\tilde\theta_0}[u_{t+1}^2]&\mathbf 0\\\mathbf 0&\mathbb E_{\theta_0,\tilde\theta_0}[(\tilde S_{t+1}-\tilde S_t)(\tilde S_{t+1}-\tilde S_t)']\end{bmatrix}\begin{bmatrix}1&\mathbf 0\\\beta&\mathbf I\end{bmatrix}$$ Using the block-matrix inverses $$\begin{bmatrix}1&\beta'\\\mathbf 0&\mathbf I\end{bmatrix}^{-1}=\begin{bmatrix}1&-\beta'\\\mathbf 0&\mathbf I\end{bmatrix},\qquad\begin{bmatrix}1&\mathbf 0\\\beta&\mathbf I\end{bmatrix}^{-1}=\begin{bmatrix}1&\mathbf 0\\-\beta&\mathbf I\end{bmatrix}$$ we get $$V^{-1}=\begin{bmatrix}1&\mathbf 0\\-\beta&\mathbf I\end{bmatrix}\begin{bmatrix}\frac1{\mathbb E_{\theta_0,\tilde\theta_0}[u_{t+1}^2]}&\mathbf 0\\\mathbf 0&\mathbb E_{\theta_0,\tilde\theta_0}[(\tilde S_{t+1}-\tilde S_t)(\tilde S_{t+1}-\tilde S_t)']^{-1}\end{bmatrix}\begin{bmatrix}1&-\beta'\\\mathbf 0&\mathbf I\end{bmatrix}$$ so the $(1,1)$ entry of $V^{-1}$ is $\frac1{\mathbb E_{\theta_0,\tilde\theta_0}[u_{t+1}^2]}$. This is the asymptotic variance of $\theta_0^{\text{MLE}}$ when $\tilde\theta_0$ is unknown; its reciprocal $\mathbb E_{\theta_0,\tilde\theta_0}[u_{t+1}^2]$ is the Fisher information for estimating $\theta_0$ with $\tilde\theta_0$ unknown. The more nuisance parameters, the more $(\tilde S_{t+1}-\tilde S_t)$ explains and the smaller the residual variance $\mathbb E[u_{t+1}^2]$, so the lower the Fisher information $\mathbb E[u_{t+1}^2]$ and the higher the variance $\frac1{\mathbb E[u_{t+1}^2]}$ of $\theta_0^{\text{MLE}}$.

If instead we know all the nuisance parameters $\tilde\theta_0$, the Fisher information for $\theta_0$ (a scalar) is $\mathbb E_{\theta_0}[(S_{t+1}-S_t)^2]$ — the variance of the score-process increment. The important observation: $$\mathbb E_{\theta_0}[(S_{t+1}-S_t)^2]=\beta'\mathbb E_{\theta_0}\big[(\tilde S_{t+1}-\tilde S_t)(\tilde S_{t+1}-\tilde S_t)'\big]\beta+\mathbb E_{\theta_0,\tilde\theta_0}[u_{t+1}^2]\ge\mathbb E_{\theta_0,\tilde\theta_0}[u_{t+1}^2]$$ $$\Rightarrow\frac1{\mathbb E_{\theta_0}[(S_{t+1}-S_t)^2]}\le\frac1{\mathbb E_{\theta_0,\tilde\theta_0}[u_{t+1}^2]}$$ i.e. the Fisher information when nuisance parameters are known is greater than when they are unknown, so the variance of $\theta_0^{\text{MLE}}$ is smaller.
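The comparison can be checked on a small numerical example. The sketch below assumes a single (scalar) nuisance parameter, so that $V$ is $2\times2$; the entries of $V$ and the function name `info_with_nuisance` are hypothetical.

```python
def info_with_nuisance(v11, v12, v22):
    """Scalar-nuisance case with V = [[v11, v12], [v12, v22]].

    Returns (beta, resid_var, var_unknown, var_known), where
    beta        = v12 / v22 is the population regression coefficient in (12.24),
    resid_var   = E[u^2] = v11 - v12^2 / v22,
    var_unknown = (1,1) entry of V^{-1} (nuisance parameter unknown),
    var_known   = 1 / v11 (nuisance parameter known).
    """
    beta = v12 / v22
    resid_var = v11 - beta * v12
    det = v11 * v22 - v12 * v12
    var_unknown = v22 / det        # (V^{-1})_{11} from the 2x2 inverse formula
    var_known = 1.0 / v11
    return beta, resid_var, var_unknown, var_known

beta, resid_var, var_unknown, var_known = info_with_nuisance(2.0, 1.0, 2.0)
print(beta, resid_var, var_unknown, var_known)
```

Here the residual variance is $1.5$, below the total score variance $2$, so the variance with the nuisance parameter unknown ($1/1.5$) exceeds the variance with it known ($1/2$).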

Tip

Remark 12.16 This result is very intuitive: for the same sequence of observations (data) and likelihood process (and its derivative score process), the more parameters unknown, the worse it is, because more parameters would "dilute" the information, and thus our estimate of the parameter of interest would be less accurate and less certain.

12.7 Hidden States: Recursive Learning

§12.6 假设初始状态 $\mathbf X_0$ 已知，并作反演假设（存在函数 $\chi$ 使 $\mathbf W_{t+1}=\chi(\mathbf X_t,\mathbf Y_{t+1}-\mathbf Y_t)$、$\mathbf X_{t+1}=\phi(\mathbf X_t,\mathbf W_{t+1})$），从而可迭代反解整列冲击 $\{\mathbf W_{t+1}\}_{t=0}^\infty$： $$\mathbf W_1=\chi(\mathbf X_0,\mathbf Y_1-\mathbf Y_0),\quad\mathbf X_1=\phi(\mathbf X_0,\mathbf W_1),\quad\mathbf W_2=\chi(\mathbf X_1,\mathbf Y_2-\mathbf Y_1),\quad\dots$$ 现考虑更一般的情形：不观测初始状态 $\mathbf X_0$。

12.7.1 Regime Switching Model

$\{\mathbf X_t\}$ 为 $n$ 态 Markov 过程，$\mathbf X_t$ 表 $t$ 期状态；实现状态是 $n\times1$ 坐标向量（若 $t$ 期系统处于状态 $j$，则 $\mathbf X_t$ 除第 $j$ 位为 1 外全为 0）。
$\{\mathbf X_t\}$ 按 $n\times n$ 转移矩阵 $\mathbf P$ 演化， $$\mathbf P=\begin{bmatrix}\pi_{11}&\pi_{12}&\cdots&\pi_{1n}\\\pi_{21}&\pi_{22}&\cdots&\pi_{2n}\\\vdots&\vdots&\ddots&\vdots\\\pi_{n1}&\pi_{n2}&\cdots&\pi_{nn}\end{bmatrix},\quad\mathbf P\mathbf 1_n=\mathbf 1_n$$ $\mathbf P$ 的 $(i,j)$ 元 $\pi_{ij}$ 是从状态 $i$ 转到状态 $j$ 的概率。
密度 $\{\psi_1,\dots,\psi_n\}$，若状态 $j$ 实现则 $\mathbf Y_{t+1}-\mathbf Y_t$ 有密度 $\psi_j$（这些状态—条件密度时间不变）。
$\mathbf Y^t=(\mathbf Y_1-\mathbf Y_0,\mathbf Y_2-\mathbf Y_1,\dots,\mathbf Y_t-\mathbf Y_{t-1})'$ 是观测向量。
$\mathbf Q_0$ 是 $\mathbf X_0$ 的概率向量（初始状态的"最佳猜测"）。
$\mathbf Q_t$ 为条件概率， $$\mathbf Q_t=\begin{bmatrix}q_t^1\\q_t^2\\\vdots\\q_t^n\end{bmatrix}=\mathbb P(\mathbf X_t\mid\mathbf Y^t,\mathbf Q_0)$$ $q_t^i$ 给出当期状态为第 $i$ 态的概率。用 Bayes 规则写 $\mathbf Q_{t+1}$： $$\begin{aligned}\mathbf Q_{t+1}&=\mathbb P(\mathbf X_{t+1}\mid\mathbf Y^{t+1},\mathbf Q_0)=\mathbb P(\mathbf X_{t+1}\mid\mathbf Y_{t+1}-\mathbf Y_t,\mathbf Y^t,\mathbf Q_0)\\&=\frac{\mathbb P(\mathbf X_{t+1},\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf Y^t,\mathbf Q_0)}{\mathbb P(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf Y^t,\mathbf Q_0)}=\frac{\mathbb P(\mathbf X_{t+1},\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf Q_t)}{\mathbb P(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf Q_t)}\end{aligned}\tag{12.25}$$ 设 $\mathbf X_{t+1}$ 与 $\mathbf Y_{t+1}-\mathbf Y_t$ 条件于 $\mathbf X_t$（$\forall t$）独立。分四步显式刻画 $\mathbf Q_{t+1}$。

第 1 步：算 $(\mathbf X_{t+1},\mathbf Y_{t+1}-\mathbf Y_t)$ 条件于 $\mathbf X_t$ 的联合分布。 记 $\operatorname{vec}\{\psi_i(\mathbf y^\star)\}$ 为 $n\times1$ 向量 $(\psi_1(\mathbf y^\star),\dots,\psi_n(\mathbf y^\star))'$。由条件独立， $$\begin{aligned}&\underbrace{\mathbb P(\mathbf X_{t+1}\mid\mathbf X_t=\mathbf x_i)}_{\mathbf X_{t+1}\text{ density}}\underbrace{\mathbb P(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t=\mathbf x_i)}_{\mathbf Y_{t+1}-\mathbf Y_t\text{ density}}=\underbrace{\mathbf P'\mathbf x_i}_{i\text{-th row of }\mathbf P}\times\underbrace{\mathbf x_i'\operatorname{vec}\{\psi_i(\mathbf y^\star)\}}_{=\psi_i(\mathbf y^\star)}\\&=\begin{bmatrix}\pi_{i1}\\\pi_{i2}\\\vdots\\\pi_{in}\end{bmatrix}\times\psi_i(\mathbf y^\star)=\begin{bmatrix}\pi_{i1}\psi_i(\mathbf y^\star)\\\pi_{i2}\psi_i(\mathbf y^\star)\\\vdots\\\pi_{in}\psi_i(\mathbf y^\star)\end{bmatrix}\end{aligned}$$ 其中 $\mathbf y^\star\equiv\mathbf Y_{t+1}-\mathbf Y_t$（$\forall t$）。

第 2 步：算 $(\mathbf X_{t+1},\mathbf Y_{t+1}-\mathbf Y_t)$ 条件于 $\mathbf Q_t$ 的联合分布： $$\mathbb P(\mathbf X_{t+1},\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf Q_t)=\begin{bmatrix}q_t^1\pi_{11}\psi_1(\mathbf y^\star)+\dots+q_t^n\pi_{n1}\psi_n(\mathbf y^\star)\\q_t^1\pi_{12}\psi_1(\mathbf y^\star)+\dots+q_t^n\pi_{n2}\psi_n(\mathbf y^\star)\\\vdots\\q_t^1\pi_{1n}\psi_1(\mathbf y^\star)+\dots+q_t^n\pi_{nn}\psi_n(\mathbf y^\star)\end{bmatrix}\tag{12.26}$$ 亦可重写为：由 $\mathbf Q_t$ 是当期状态各态条件概率向量，$\mathbb E[\mathbf X_t\mathbf X_t'\mid\mathbf Q_t]=q_t^1\operatorname{diag}[\mathbf e_1]+\dots+q_t^n\operatorname{diag}[\mathbf e_n]=\operatorname{diag}[\mathbf Q_t]$（$\operatorname{diag}[\mathbf e_i]$ 仅 $(i,i)$ 位为 1），故 $$\mathbb P(\mathbf X_{t+1},\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf Q_t)=\mathbb E[\mathbf P'\mathbf X_t\mathbf X_t'\operatorname{vec}\{\psi_i(\mathbf y^\star)\}\mid\mathbf Q_t]=\mathbf P'\underbrace{\operatorname{diag}\{\mathbf Q_t\}}_{\mathbb E[\mathbf X_t\mathbf X_t'\mid\mathbf Q_t]}\operatorname{vec}\{\psi_i(\mathbf y^\star)\}\tag{12.27}$$ 与 (12.26) 一致。

第 3 步：算 $\mathbf Y_{t+1}-\mathbf Y_t$ 条件于 $\mathbf Q_t$ 的分布。 由全概率公式， $$\mathbb P(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf Q_t)=\sum_{i=1}^n\mathbb P(\mathbf X_{t+1}=\mathbf x_i,\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf Q_t)=\mathbf 1_n'\mathbf P'\operatorname{diag}\{\mathbf Q_t\}\operatorname{vec}\{\psi_i(\mathbf y^\star)\}=\mathbf Q_t'\operatorname{vec}\{\psi_i(\mathbf y^\star)\}$$ （用 $\mathbf 1_n'\mathbf P'=\mathbf 1_n'$，再 $\mathbf 1_n'\operatorname{diag}\{\mathbf Q_t\}=\mathbf Q_t'$。）

第 4 步：构造 $\mathbf Q_{t+1}$。 用第 1–3 步与 (12.25) 的 Bayes 规则， $$\mathbf Q_{t+1}=\frac1{\underbrace{\mathbf Q_t'\operatorname{vec}\{\psi_i(\mathbf Y_{t+1}-\mathbf Y_t)\}}_{\mathbb P(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf Q_t)}}\underbrace{\mathbf P'\operatorname{diag}\{\mathbf Q_t\}\operatorname{vec}\{\psi_i(\mathbf Y_{t+1}-\mathbf Y_t)\}}_{\mathbb P(\mathbf X_{t+1},\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf Q_t)}$$ 于是可迭代地用此法在各期得到状态的条件概率序列 $\{\mathbf Q_{t+1}\}$——这正是隐状态下的递归学习（贝叶斯滤波）。

§12.6 assumed the initial state $\mathbf X_0$ was known and made the inversion assumption (a function $\chi$ s.t. $\mathbf W_{t+1}=\chi(\mathbf X_t,\mathbf Y_{t+1}-\mathbf Y_t)$, $\mathbf X_{t+1}=\phi(\mathbf X_t,\mathbf W_{t+1})$), so we could iteratively back out the entire sequence of shocks $\{\mathbf W_{t+1}\}_{t=0}^\infty$: $$\mathbf W_1=\chi(\mathbf X_0,\mathbf Y_1-\mathbf Y_0),\quad\mathbf X_1=\phi(\mathbf X_0,\mathbf W_1),\quad\mathbf W_2=\chi(\mathbf X_1,\mathbf Y_2-\mathbf Y_1),\quad\dots$$ Now consider a more general setting: we do not observe the initial state $\mathbf X_0$.

12.7.1 Regime Switching Model

$\{\mathbf X_t\}$ is an $n$-state Markov process, $\mathbf X_t$ modeling the state at time $t$; the realized state is an $n\times1$ coordinate vector (if at time $t$ the system is in state $j$, then $\mathbf X_t$ has zeros everywhere except a one at position $j$).
$\{\mathbf X_t\}$ evolves according to an $n\times n$ transition matrix $\mathbf P$, $$\mathbf P=\begin{bmatrix}\pi_{11}&\pi_{12}&\cdots&\pi_{1n}\\\pi_{21}&\pi_{22}&\cdots&\pi_{2n}\\\vdots&\vdots&\ddots&\vdots\\\pi_{n1}&\pi_{n2}&\cdots&\pi_{nn}\end{bmatrix},\quad\mathbf P\mathbf 1_n=\mathbf 1_n$$ the $(i,j)$ element $\pi_{ij}$ being the probability of transferring to state $j$ from state $i$.
Densities $\{\psi_1,\dots,\psi_n\}$, where $\mathbf Y_{t+1}-\mathbf Y_t$ has density $\psi_j$ if state $j$ is realized (these state-contingent densities are time-invariant).
$\mathbf Y^t=(\mathbf Y_1-\mathbf Y_0,\mathbf Y_2-\mathbf Y_1,\dots,\mathbf Y_t-\mathbf Y_{t-1})'$ is the vector of observations.
$\mathbf Q_0$ is a probability vector for $\mathbf X_0$ (a "best guess" of the initial state).
$\mathbf Q_t$ is a conditional probability, $$\mathbf Q_t=\begin{bmatrix}q_t^1\\q_t^2\\\vdots\\q_t^n\end{bmatrix}=\mathbb P(\mathbf X_t\mid\mathbf Y^t,\mathbf Q_0)$$ where $q_t^i$ gives the probability that the current state is the $i$-th state. Using Bayes' rule, write $\mathbf Q_{t+1}$ as $$\begin{aligned}\mathbf Q_{t+1}&=\mathbb P(\mathbf X_{t+1}\mid\mathbf Y^{t+1},\mathbf Q_0)=\mathbb P(\mathbf X_{t+1}\mid\mathbf Y_{t+1}-\mathbf Y_t,\mathbf Y^t,\mathbf Q_0)\\&=\frac{\mathbb P(\mathbf X_{t+1},\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf Y^t,\mathbf Q_0)}{\mathbb P(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf Y^t,\mathbf Q_0)}=\frac{\mathbb P(\mathbf X_{t+1},\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf Q_t)}{\mathbb P(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf Q_t)}\end{aligned}\tag{12.25}$$ Finally, assume $\mathbf X_{t+1}$ and $\mathbf Y_{t+1}-\mathbf Y_t$ are independent conditional on $\mathbf X_t$ for all $t$. We characterize $\mathbf Q_{t+1}$ explicitly in four steps.

Step 1: joint distribution of $(\mathbf X_{t+1},\mathbf Y_{t+1}-\mathbf Y_t)$ conditional on $\mathbf X_t$. Write $\operatorname{vec}\{\psi_i(\mathbf y^\star)\}$ for the $n\times1$ vector $(\psi_1(\mathbf y^\star),\dots,\psi_n(\mathbf y^\star))'$. By conditional independence, $$\begin{aligned}&\underbrace{\mathbb P(\mathbf X_{t+1}\mid\mathbf X_t=\mathbf x_i)}_{\mathbf X_{t+1}\text{ density}}\underbrace{\mathbb P(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf X_t=\mathbf x_i)}_{\mathbf Y_{t+1}-\mathbf Y_t\text{ density}}=\underbrace{\mathbf P'\mathbf x_i}_{i\text{-th row of }\mathbf P}\times\underbrace{\mathbf x_i'\operatorname{vec}\{\psi_i(\mathbf y^\star)\}}_{=\psi_i(\mathbf y^\star)}\\&=\begin{bmatrix}\pi_{i1}\\\pi_{i2}\\\vdots\\\pi_{in}\end{bmatrix}\times\psi_i(\mathbf y^\star)=\begin{bmatrix}\pi_{i1}\psi_i(\mathbf y^\star)\\\pi_{i2}\psi_i(\mathbf y^\star)\\\vdots\\\pi_{in}\psi_i(\mathbf y^\star)\end{bmatrix}\end{aligned}$$ where $\mathbf y^\star\equiv\mathbf Y_{t+1}-\mathbf Y_t$ for all $t$.

Step 2: joint distribution of $(\mathbf X_{t+1},\mathbf Y_{t+1}-\mathbf Y_t)$ conditional on $\mathbf Q_t$: $$\mathbb P(\mathbf X_{t+1},\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf Q_t)=\begin{bmatrix}q_t^1\pi_{11}\psi_1(\mathbf y^\star)+\dots+q_t^n\pi_{n1}\psi_n(\mathbf y^\star)\\q_t^1\pi_{12}\psi_1(\mathbf y^\star)+\dots+q_t^n\pi_{n2}\psi_n(\mathbf y^\star)\\\vdots\\q_t^1\pi_{1n}\psi_1(\mathbf y^\star)+\dots+q_t^n\pi_{nn}\psi_n(\mathbf y^\star)\end{bmatrix}\tag{12.26}$$ This can also be rewritten: since $\mathbf Q_t$ is the vector of conditional probabilities for the current state, $\mathbb E[\mathbf X_t\mathbf X_t'\mid\mathbf Q_t]=q_t^1\operatorname{diag}[\mathbf e_1]+\dots+q_t^n\operatorname{diag}[\mathbf e_n]=\operatorname{diag}[\mathbf Q_t]$ ($\operatorname{diag}[\mathbf e_i]$ has 1 only in position $(i,i)$), so $$\mathbb P(\mathbf X_{t+1},\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf Q_t)=\mathbb E[\mathbf P'\mathbf X_t\mathbf X_t'\operatorname{vec}\{\psi_i(\mathbf y^\star)\}\mid\mathbf Q_t]=\mathbf P'\underbrace{\operatorname{diag}\{\mathbf Q_t\}}_{\mathbb E[\mathbf X_t\mathbf X_t'\mid\mathbf Q_t]}\operatorname{vec}\{\psi_i(\mathbf y^\star)\}\tag{12.27}$$ consistent with (12.26).

Step 3: distribution of $\mathbf Y_{t+1}-\mathbf Y_t$ conditional on $\mathbf Q_t$. By the law of total probability, $$\mathbb P(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf Q_t)=\sum_{i=1}^n\mathbb P(\mathbf X_{t+1}=\mathbf x_i,\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf Q_t)=\mathbf 1_n'\mathbf P'\operatorname{diag}\{\mathbf Q_t\}\operatorname{vec}\{\psi_i(\mathbf y^\star)\}=\mathbf Q_t'\operatorname{vec}\{\psi_i(\mathbf y^\star)\}$$ (using $\mathbf 1_n'\mathbf P'=\mathbf 1_n'$, then $\mathbf 1_n'\operatorname{diag}\{\mathbf Q_t\}=\mathbf Q_t'$.)

Step 4: construct $\mathbf Q_{t+1}$. Using the terms from Steps 1–3 and the Bayes' rule of (12.25), $$\mathbf Q_{t+1}=\frac1{\underbrace{\mathbf Q_t'\operatorname{vec}\{\psi_i(\mathbf Y_{t+1}-\mathbf Y_t)\}}_{\mathbb P(\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf Q_t)}}\underbrace{\mathbf P'\operatorname{diag}\{\mathbf Q_t\}\operatorname{vec}\{\psi_i(\mathbf Y_{t+1}-\mathbf Y_t)\}}_{\mathbb P(\mathbf X_{t+1},\mathbf Y_{t+1}-\mathbf Y_t\mid\mathbf Q_t)}$$ We can then iteratively use this method to obtain a sequence of conditional probabilities $\{\mathbf Q_{t+1}\}$ of the states in each period — this is exactly recursive learning (Bayesian filtering) under hidden states.
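The four steps collapse into a one-line update, which can be sketched as follows. This is a minimal illustration, not the notes' own code: the function name `update_beliefs`, the two-state transition matrix and the density values are hypothetical.

```python
def update_beliefs(q, p, psi_vals):
    """One step Q_t -> Q_{t+1} = P' diag(Q_t) psi / (Q_t' psi).

    q        : list of n probabilities (Q_t)
    p        : n x n transition matrix as nested lists, p[i][j] = pi_ij
    psi_vals : [psi_1(y*), ..., psi_n(y*)] at the observed increment y*
    """
    n = len(q)
    # Step 3: density of the increment given Q_t, i.e. Q_t' vec{psi}
    denom = sum(q[i] * psi_vals[i] for i in range(n))
    # Step 2: entry j of P' diag(Q_t) vec{psi}, as in (12.26)
    joint = [sum(p[i][j] * q[i] * psi_vals[i] for i in range(n))
             for j in range(n)]
    # Step 4: Bayes' rule
    return [v / denom for v in joint]

p = [[0.9, 0.1], [0.2, 0.8]]
q1 = update_beliefs([0.5, 0.5], p, [0.4, 0.1])
print(q1)   # approximately [0.76, 0.24]
```

Starting from $\mathbf Q_0$ and calling the update once per observed increment produces the whole sequence of beliefs. In practice `psi_vals` would be obtained by evaluating each state-contingent density at the observed $\mathbf Y_{t+1}-\mathbf Y_t$.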