15. Selection on Observables and Matching

Note

Chapter theme: selection on observables and matching. Using the Roy model set-up, the chapter discusses a specific difficulty in identification (selection bias) and how to overcome it (assume selection on observables, then match). §15.1 Non-parametric methods: estimate an unknown quantity with as few assumptions as possible; the kernel density estimator \(\hat f(x)=\frac1{Nh}\sum_i K(\frac{X_i-x}h)\) (bias \(\approx\frac12f''(x)h^2\kappa_2\), variance \(\approx\frac1{Nh}f(x)R(K)\), the bias-variance trade-off); kernel regression \(\hat g(x)=\sum_i\omega_h(X_i,x)Y_i\); local constant vs local linear regression. §15.2 Observational data in the cross-section: selection into treatment (Definition 15.1); OLS gives \(\beta^{\text{OLS}}=\text{ATE}\) when \((Y_0,Y_1)\perp D\), otherwise there is selection bias (15.10); the conditional independence assumption (CIA) \(\{Y_d\}\perp D\mid X\) and the common support (overlap) condition \(0<\mathbb P(D=d\mid X)<1\); identifying ATE/ATT/ATU by a weighted average of within-matched-cell mean contrasts. §15.3 Propensity scores: Definition 15.2 \(p(x)=\mathbb P(D=1\mid X=x)\); the Propensity Score Theorem (15.1), CIA \(\Rightarrow(Y_0,Y_1)\perp D\mid p(X)\), which allows conditioning/matching on the scalar \(p(X)\) instead of the high-dimensional \(X\). §15.4 Curse of dimensionality and alternative methods: non-parametric estimation that matches directly on \(X\) faces the curse of dimensionality; alternatives (regression, matching on the propensity score, inexact matching); the saturated OLS \(\beta^{\text{OLS}}=\sum_k\text{ATE}(x_k)\omega_k\) (Proposition 15.1, cells with higher treatment variance get higher weight), with \(\beta^{\text{OLS}}=\text{ATE}\) under RCT.

15.1 Non-parametric Methods

Non-parametric estimation aims to estimate an unknown quantity while making as few assumptions about the data generating process (DGP) as possible. We consider two problems:
- density estimation;
- regression: local constant and local linear.

Both problems can be solved by non-parametric methods, which are essentially local averaging methods. The key question is how to define "local".

15.1.1 Density Estimation

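Sample analog of density. For a random variable \(X\), the density \(f(x)\) is defined as $$f(x)=\lim_{h\to0}\frac{F(x+h)-F(x-h)}{2h}=\lim_{h\to0}\frac{\mathbb P(x-h < X < x+h)}{2h}$$ Replacing the probability by its sample frequency at a small fixed \(h\) gives the histogram estimator $$\tilde f(x)=\frac1{2Nh}\sum_{i=1}^N\mathbf 1\{x-h < X_i < x+h\}$$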

Kernel. Generalize the histogram estimator to a family of density estimators: $$\hat f(x)=\frac1{Nh}\sum_{i=1}^N K\Big(\frac{X_i-x}h\Big)$$
- the weighting function \(K(\cdot)\) is called a kernel function;
- \(h\) is a smoothing parameter called the bandwidth.

The histogram estimator is the special case of the uniform kernel \(K(z)=\frac12\mathbf 1\{|z| < 1\}\). For \(\hat f(x)\to f(x)\) we require \(h\to0\) and \(Nh\to\infty\). We usually assume \(K(\cdot)\) satisfies:
1. Symmetry: \(K(z)=K(-z)\).
2. Integrates to 1: \(\int K(z)\,dz=1\).
3. Zero "mean" property: \(\int z\cdot K(z)\,dz=0\) (15.2).
4. Finite second moment property: \(\int z^2\cdot K(z)\,dz=\kappa_2<\infty\) (15.3).

Tip

Remark 15.1 Under the usual assumptions, the kernel \(K(\cdot)\) is itself a density. Write \(z_i=\frac{X_i-x}h\) for the deviation of \(X_i\) from \(x\), normalized by \(h\). Then $$\hat f(x)=\frac1{Nh}\sum_{i=1}^N K\Big(\frac{X_i-x}h\Big)=\frac1N\sum_{i=1}^N\frac1h K(z_i)$$ and, viewed as a function of \(x\), each term \(\frac1h K(\frac{X_i-x}h)\) is the density \(K\) recentred at \(X_i\) and rescaled by \(h\). So the estimated density \(\hat f(x)\) is the average of \(N\) rescaled kernel densities, one centred at each observation, evaluated at \(x\).

Desired properties: \(\int\hat f(x)\,dx=1\); \(\int x\hat f(x)\,dx=\frac1N\sum_{i=1}^N X_i\).

Note

Proof (\(\int\hat f(x)\,dx=1\)) Substitute \(z=\frac{X_i-x}h\), so that \(dx=-h\,dz\) and the limits of integration flip: $$\int\hat f(x)\,dx=\frac1N\sum_{i=1}^N\int\frac1h K\Big(\frac{X_i-x}h\Big)\,dx=\frac1N\sum_{i=1}^N\underbrace{\int K(z)\,dz}_{=1\text{ by assumption}}=\frac1N\sum_{i=1}^N1=1$$ \(\blacksquare\)
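The two properties above can be checked numerically. The following is a minimal sketch, assuming a Gaussian kernel, a made-up five-point sample, and a hypothetical bandwidth `h = 0.5`; none of these choices come from the text, and the midpoint-rule integrator is only there for the check.

```python
import math

def gaussian_kernel(z):
    """Standard normal density: symmetric, integrates to 1, zero mean, kappa_2 = 1."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def kde(x, data, h, kernel=gaussian_kernel):
    """Kernel density estimate f_hat(x) = (1 / (N h)) * sum_i K((X_i - x) / h)."""
    n = len(data)
    return sum(kernel((xi - x) / h) for xi in data) / (n * h)

def integrate(func, lo, hi, steps=20000):
    """Plain midpoint rule, used only to verify the two properties numerically."""
    width = (hi - lo) / steps
    return sum(func(lo + (k + 0.5) * width) for k in range(steps)) * width

data = [-1.3, -0.4, 0.1, 0.8, 2.2]   # hypothetical sample (assumption)
h = 0.5                              # hypothetical bandwidth (assumption)

total_mass = integrate(lambda x: kde(x, data, h), -10.0, 10.0)
mean_of_fhat = integrate(lambda x: x * kde(x, data, h), -10.0, 10.0)
sample_mean = sum(data) / len(data)

print(total_mass, mean_of_fhat, sample_mean)
```

The first printed number should be close to 1 and the last two should agree closely, reflecting that the kernel integrates to 1 and has zero mean.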

Estimation bias. Consider the bias of the estimator \(\hat f(x)\). With \(X_i\) i.i.d. with density \(f\) and the change of variables \(z=\frac{u-x}h\), $$\mathbb E[\hat f(x)]=\frac1h\mathbb E\Big[K\Big(\frac{X_i-x}h\Big)\Big]=\int\frac1h K\Big(\frac{u-x}h\Big)f(u)\,du=\int K(z)f(x+zh)\,dz\tag{15.4}$$ which typically cannot be solved analytically, so we take a second-order Taylor expansion of \(f(x+zh)\) around \(x\) in the last expression of (15.4): $$\begin{aligned}\int K(z)f(x+zh)\,dz&\approx\int K(z)\big[f(x)+f'(x)zh+\tfrac12f''(x)z^2h^2\big]\,dz\\&=f(x)\underbrace{\int K(z)\,dz}_{=1}+f'(x)h\underbrace{\int zK(z)\,dz}_{=0}+\tfrac12f''(x)h^2\underbrace{\int z^2K(z)\,dz}_{=\kappa_2}\\&=f(x)+\tfrac12f''(x)h^2\kappa_2\end{aligned}$$ i.e. $$\mathbb E[\hat f(x)]-f(x)\approx\underbrace{\tfrac12f''(x)h^2\kappa_2}_{\text{bias}}\tag{15.5}$$ So \(\hat f(x)\) is a biased estimator, with bias approximately \(\frac12f''(x)h^2\kappa_2\) when \(h\) is small.

Estimation variance. Consider the variance of the estimator \(\hat f(x)\) (assume \(X_i\) i.i.d.): $$\operatorname{Var}(\hat f(x))=\frac1{Nh^2}\mathbb E\Big[K\Big(\tfrac{X_i-x}h\Big)^2\Big]-\frac1N\Big(\frac1h\mathbb E\Big[K\Big(\tfrac{X_i-x}h\Big)\Big]\Big)^2\tag{15.6}$$ Call the first term Part 1 and the second term Part 2. Part 2 is of order \(\frac1N\), since \(\frac1h\mathbb E[K(\frac{X_i-x}h)]=\mathbb E[\hat f(x)]\approx f(x)\). Part 1 is of order \(\frac1{Nh}\): it also goes to 0 because \(Nh\to\infty\), but more slowly than Part 2 because \(h\to0\). Keeping only the dominant term, $$\operatorname{Var}(\hat f(x))\approx\frac1{Nh}f(x)R(K)\tag{15.7}$$ where \(R(K)\equiv\int K(z)^2\,dz\) is called the roughness of the kernel.

Note

Derivation of (15.7) from Part 1 of (15.6) $$\begin{aligned}\frac1h\mathbb E\Big[K\Big(\tfrac{X_i-x}h\Big)^2\Big]&=\int K(z)^2 f(x+zh)\,dz\approx\int K(z)^2\big[f(x)+f'(x)zh+\tfrac12f''(x)z^2h^2\big]\,dz\\&\approx f(x)\underbrace{\int K(z)^2\,dz}_{\equiv R(K)}\end{aligned}$$ where the last step keeps only the leading term, which is valid when \(h\) is small enough. Multiplying by \(\frac1{Nh}\) gives Part 1 \(\approx\frac1{Nh}f(x)R(K)\). We drop Part 2 but not Part 1 because Part 2 \(\approx\frac1Nf(x)^2\) is of order \(\frac1N\), which is negligible relative to \(\frac1{Nh}\) as \(h\to0\). \(\blacksquare\)

Bias-variance trade-off. From (15.5) and (15.7), for a given sample size \(N\), a smaller \(h\) gives a smaller bias but a higher variance. However, the bias does not depend on \(N\) while the variance decreases in \(N\), so we can let the bandwidth \(h\) shrink as the sample grows, as long as it shrinks more slowly than \(\frac1N\) (so that \(Nh\to\infty\)). Both the bias and the variance then go to zero. Minimizing the approximate mean squared error, which is of order \(h^4+\frac1{Nh}\), gives the familiar rate \(h\propto N^{-1/5}\).
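A rough numerical check of the two approximations is possible in one special case. The sketch below assumes \(X\sim N(0,1)\) and a Gaussian kernel, choices made here only because they give closed forms: \(\kappa_2=1\), \(R(K)=\frac1{2\sqrt\pi}\), and \(\mathbb E[\hat f(x)]\) equals the \(N(0,1+h^2)\) density at \(x\). The bias approximation is written with the second-order Taylor coefficient \(\frac12\) made explicit; all function names are hypothetical.

```python
import math

def normal_pdf(x, variance):
    """Density of N(0, variance) evaluated at x."""
    return math.exp(-0.5 * x * x / variance) / math.sqrt(2.0 * math.pi * variance)

KAPPA_2 = 1.0                                   # second moment of the Gaussian kernel
ROUGHNESS = 1.0 / (2.0 * math.sqrt(math.pi))    # R(K) for the Gaussian kernel

def exact_bias(x, h):
    # E[f_hat(x)] is the N(0, 1 + h^2) density when X ~ N(0, 1) (assumed DGP).
    return normal_pdf(x, 1.0 + h * h) - normal_pdf(x, 1.0)

def approx_bias(x, h):
    # (1/2) f''(x) h^2 kappa_2, with f''(x) = (x^2 - 1) f(x) for the standard normal.
    second_derivative = (x * x - 1.0) * normal_pdf(x, 1.0)
    return 0.5 * second_derivative * h * h * KAPPA_2

def exact_variance(x, h, n):
    # Part 1: (1/h^2) E[K^2] = R(K)/h * N(0, 1 + h^2/2) density; Part 2: (E[f_hat])^2.
    part1 = ROUGHNESS / h * normal_pdf(x, 1.0 + 0.5 * h * h)
    part2 = normal_pdf(x, 1.0 + h * h) ** 2
    return (part1 - part2) / n

def approx_variance(x, h, n):
    return normal_pdf(x, 1.0) * ROUGHNESS / (n * h)

for h in (0.4, 0.2, 0.1):
    print(h, exact_bias(0.0, h), approx_bias(0.0, h),
          exact_variance(0.0, h, 1000), approx_variance(0.0, h, 1000))
```

As \(h\) falls, the exact bias shrinks in magnitude and tracks the approximation more closely, while the variance rises. The variance approximation runs somewhat above the exact value at moderate \(h\) because it ignores Part 2.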

15.1.2 Kernel Regression

We often want to estimate the relationship between an outcome \(Y\) and an explanatory variable \(X\), generally characterized by the conditional mean $$g(x)\equiv\mathbb E[Y\mid X=x]$$ using the same technique as in non-parametric density estimation. From $$\mathbb E[Y\mid X=x]=\int y f(y\mid X=x)\,dy=\int y\frac{f(y,x)}{f(x)}\,dy\tag{15.8}$$ the sample analog is \(\hat{\mathbb E}[Y\mid X=x]=\int y\frac{\hat f(y,x)}{\hat f(x)}\,dy\). We already know how to estimate \(f(x)\), so it only remains to estimate \(f(y,x)\). Use the product bivariate kernel \(K(u,v)=K_1(u)K_2(v)\): $$\hat f(y,x)=\frac1{Nh^2}\sum_{i=1}^N K\Big(\frac{X_i-x}h,\frac{Y_i-y}h\Big)=\frac1{Nh^2}\sum_{i=1}^N K_1\Big(\frac{X_i-x}h\Big)K_2\Big(\frac{Y_i-y}h\Big)$$ Because \(K_2\) integrates to 1 and has zero mean, \(\int y\,\frac1hK_2(\frac{Y_i-y}h)\,dy=Y_i\), so we obtain the non-parametric estimator of \(g(x)\): $$\begin{aligned}\hat g(x)\equiv\hat{\mathbb E}[Y\mid X=x]&=\frac{\int y\hat f(y,x)\,dy}{\hat f(x)}=\frac{\frac1{Nh}\sum_{i=1}^N K_1(\frac{X_i-x}h)Y_i}{\frac1{Nh}\sum_{i=1}^N K_1(\frac{X_i-x}h)}\\&=\frac{\sum_{i=1}^N K_1(\frac{X_i-x}h)Y_i}{\sum_{i=1}^N K_1(\frac{X_i-x}h)}=\sum_{i=1}^N\omega_h(X_i,x)Y_i\end{aligned}$$ with weight \(\omega_h(X_i,x)=\frac{K_1(\frac{X_i-x}h)}{\sum_{j=1}^N K_1(\frac{X_j-x}h)}\). When \(K_1\) is the uniform kernel \(K_1(z)=\frac12\mathbf 1\{|z| < 1\}\), only observations with \(x-h < X_i < x+h\) receive positive (and equal) weight, so \(\hat g(x)\) is the simple average of \(Y_i\) over those observations.

For a given kernel, the bandwidth \(h\) can be chosen by an optimality criterion, e.g. leave-one-out cross-validation, which minimizes the out-of-sample squared prediction error: $$h^\star=\arg\min_h\sum_{i=1}^N\big(\hat g_{h,(-i)}(X_i)-Y_i\big)^2$$ where \(\hat g_{h,(-i)}(x)\) is the estimator computed without using \((X_i,Y_i)\).
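The kernel regression estimator and the leave-one-out criterion can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the Gaussian kernel, the simulated data (\(Y=\sin(2X)+\text{noise}\)), the bandwidth grid, and all function names are choices made here.

```python
import math
import random

def gaussian_kernel(z):
    # The normalizing constant is omitted because it cancels in the weights.
    return math.exp(-0.5 * z * z)

def nw_estimate(x, xs, ys, h, skip=None):
    """Kernel regression g_hat(x) = sum_i w_h(X_i, x) Y_i; `skip` leaves one observation out."""
    num = den = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        if i == skip:
            continue
        w = gaussian_kernel((xi - x) / h)
        num += w * yi
        den += w
    return num / den

def loo_cv_score(xs, ys, h):
    """Sum of squared leave-one-out prediction errors for bandwidth h."""
    return sum((nw_estimate(xs[i], xs, ys, h, skip=i) - ys[i]) ** 2
               for i in range(len(xs)))

def choose_bandwidth(xs, ys, grid):
    """Grid search for the bandwidth that minimizes the leave-one-out criterion."""
    return min(grid, key=lambda h: loo_cv_score(xs, ys, h))

rng = random.Random(0)                                   # fixed seed for reproducibility
xs = [rng.uniform(0.0, 3.0) for _ in range(200)]         # hypothetical regressor
ys = [math.sin(2.0 * x) + rng.gauss(0.0, 0.3) for x in xs]
grid = [0.05, 0.1, 0.2, 0.5, 1.0, 2.0]                   # hypothetical candidate bandwidths

h_star = choose_bandwidth(xs, ys, grid)
print(h_star, nw_estimate(1.0, xs, ys, h_star), math.sin(2.0))
```

A grid search is used only for simplicity; the criterion is a minimization, so a very large bandwidth (which averages over almost the whole sample) scores poorly.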

15.1.3 Local Constant Regression and Local Linear Regression

Theoretically \(g(x)=\mathbb E[Y\mid X=x]\) is point identified. In estimation, however, we have to work with a small band (local neighborhood) around \(X=x\) and estimate the estimand (e.g. \(\text{ATE}(x)\)) from the data in that neighborhood rather than only at \(X=x\). It then matters how we believe \(g(\tilde x)\) changes for \(\tilde x\in(x-h,x+h)\) slightly away from \(x\). There are two common choices:
- Local constant regression: assume the estimand is constant within each small enough neighborhood. This is only an approximation, valid for small enough neighborhoods.
- Local linear regression: assume the estimand varies linearly with \(X\) within each small enough neighborhood. This is still an approximation, but it relaxes the constant assumption and so typically approximates better than local constant regression, notably near the boundary of the support of \(X\).

Local constant regression. Kernel regression is equivalent to the local constant regression below: $$g(x)=\alpha,\qquad\hat\alpha=\arg\min_a\sum_{i=1}^N K_1\Big(\frac{X_i-x}h\Big)(Y_i-a)^2$$

Note

Proof (equivalence to kernel regression) The f.o.c.: $$2\sum_{i=1}^N K_1\Big(\tfrac{X_i-x}h\Big)(Y_i-a)=0\Rightarrow\sum_{i=1}^N K_1\Big(\tfrac{X_i-x}h\Big)Y_i=\sum_{i=1}^N K_1\Big(\tfrac{X_i-x}h\Big)a\Rightarrow\hat\alpha=\frac{\sum_{i=1}^N K_1(\frac{X_i-x}h)Y_i}{\sum_{i=1}^N K_1(\frac{X_i-x}h)}$$ which is exactly \(\hat g(x)\) in kernel regression. \(\blacksquare\)

Local linear regression. In each small band around \(X=x\), we relax the local constant assumption \(g(x)=\alpha\) by allowing \(g\) to vary linearly within the neighborhood: $$g(\tilde x)\equiv\mathbb E[Y\mid X=\tilde x]=\alpha+\beta(\tilde x-x)\quad\text{for }\tilde x\in(x-h,x+h)$$ The estimates \((\hat\alpha,\hat\beta)\) solve the kernel-weighted least squares problem $$(\hat\alpha,\hat\beta)=\arg\min_{a,b}\sum_{i=1}^N K_1\Big(\frac{X_i-x}h\Big)(Y_i-a-b(X_i-x))^2$$ Evaluating the fitted line at \(\tilde x=x\), local linear regression gives \(\hat g(x)=\hat\alpha+\hat\beta(x-x)=\hat\alpha\), while \(\hat\beta\) estimates the local slope.
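The difference between the two estimators is easiest to see at a boundary point. The sketch below solves the weighted least squares problem in closed form from its two normal equations; the Gaussian kernel, the noiseless line \(g(x)=1+2x\), and the bandwidth are assumptions made for illustration.

```python
import math

def kernel(z):
    # Unnormalized Gaussian kernel; constants cancel in both estimators.
    return math.exp(-0.5 * z * z)

def local_constant(x, xs, ys, h):
    """Kernel-weighted average of Y (the local constant / kernel regression estimate)."""
    w = [kernel((xi - x) / h) for xi in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)

def local_linear(x, xs, ys, h):
    """Weighted least squares of Y on (1, X - x); returns (alpha_hat, beta_hat)."""
    w = [kernel((xi - x) / h) for xi in xs]
    d = [xi - x for xi in xs]
    s0 = sum(w)
    s1 = sum(wi * di for wi, di in zip(w, d))
    s2 = sum(wi * di * di for wi, di in zip(w, d))
    t0 = sum(wi * yi for wi, yi in zip(w, ys))
    t1 = sum(wi * di * yi for wi, di, yi in zip(w, d, ys))
    det = s0 * s2 - s1 * s1                    # determinant of the 2x2 normal equations
    alpha = (s2 * t0 - s1 * t1) / det
    beta = (s0 * t1 - s1 * t0) / det
    return alpha, beta

xs = [i / 20.0 for i in range(21)]             # grid on [0, 1]
ys = [1.0 + 2.0 * x for x in xs]               # noiseless line g(x) = 1 + 2x (assumption)

alpha_hat, beta_hat = local_linear(0.0, xs, ys, h=0.2)
constant_hat = local_constant(0.0, xs, ys, h=0.2)
print(alpha_hat, beta_hat, constant_hat)
```

At the left boundary \(x=0\) all neighbors lie to the right, so the local constant estimate is pulled above \(g(0)=1\), whereas the local linear fit recovers the intercept and slope of a linear \(g\) up to floating-point error.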

15.2 Observational Data in the Cross-Section

15.2.1 Selection into Treatment

Important

Definition 15.1 (Selection into treatment) There is selection into the treatment state if \(Y_d\mid D=d\) is distributed differently from \(Y_d\mid D=d'\) for \(d'\ne d\). In terms of mean treatment effects, selection typically shows up as $$\underbrace{\mathbb E[Y_1-Y_0\mid D=1]}_{\text{ATT}}\ne\underbrace{\mathbb E[Y_1-Y_0\mid D=0]}_{\text{ATU}}$$ where neither side is directly observable, because each involves a counterfactual mean.

Selection into treatment presumes that agents have some knowledge of their potential outcomes \((Y_0,Y_1)\) and decide whether to take treatment (\(D=0\) or \(1\)) based on that knowledge, i.e. \((Y_0,Y_1)\not\perp D\) (dependence). In contrast, under random assignment of the treatment state (RCT), \(D\) is determined without regard to \(\{Y_d\}_{d\in\mathcal D}\), i.e. \((Y_0,Y_1)\perp D\) (independence).

We always use OLS regression to estimate \(\beta^{\text{OLS}}\). As discussed before, when \((Y_0,Y_1)\perp D\) (independence) is true, \(\beta^{\text{OLS}}\) is the ATE of the whole group: $$\beta^{\text{OLS}}=\mathbb E[Y\mid D=1]-\mathbb E[Y\mid D=0]=\mathbb E[Y_1\mid D=1]-\mathbb E[Y_0\mid D=0]\overset{\perp}{=}\mathbb E[Y_1]-\mathbb E[Y_0]=\mathbb E[Y_1-Y_0]$$ However, when \((Y_0,Y_1)\not\perp D\) (dependence) is true, we can only do the following decomposition: $$\begin{aligned}\mathbb E[Y\mid D=1]-\mathbb E[Y\mid D=0]&=\mathbb E[Y_1\mid D=1]-\mathbb E[Y_0\mid D=0]\\&=\underbrace{\mathbb E[Y_1\mid D=1]-\mathbb E[Y_0\mid D=1]}_{\text{ATT}}+\underbrace{\mathbb E[Y_0\mid D=1]-\mathbb E[Y_0\mid D=0]}_{\text{selection bias}}\\&=\underbrace{\mathbb E[Y_1\mid D=0]-\mathbb E[Y_0\mid D=0]}_{\text{ATU}}+\underbrace{\mathbb E[Y_1\mid D=1]-\mathbb E[Y_1\mid D=0]}_{\text{selection bias}}\end{aligned}\tag{15.10}$$ Selection bias is the part that comes from selection into treatment. For example in (15.10), it is possible that the treated select to be treated because they know they have higher potential outcome \(Y_1\) than the potential \(Y_1\) of those untreated, and then $$\underbrace{\mathbb E[Y_1\mid D=1]-\mathbb E[Y_1\mid D=0]}_{\text{selection bias}}>0\Rightarrow\beta^{\text{OLS}}>\text{ATU}$$ Selection bias is zero under random assignment (RCT), which implies \(\text{ATT}=\text{ATU}=\text{ATE}\) under RCT.
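The decomposition in (15.10) can be illustrated by simulation, where both potential outcomes are visible (which real data never allow). This is a sketch under a hypothetical data generating process chosen here: independent normal potential outcomes and Roy-style selection in which agents take treatment when \(Y_1>Y_0\).

```python
import random

rng = random.Random(1)
n = 20000
# Hypothetical DGP (assumption): independent normal potential outcomes.
y0 = [rng.gauss(0.0, 1.0) for _ in range(n)]
y1 = [rng.gauss(0.5, 1.0) for _ in range(n)]
# Roy-style selection: agents know (Y0, Y1) and take treatment when it pays off.
d = [1 if a1 > a0 else 0 for a0, a1 in zip(y0, y1)]

def mean(values):
    return sum(values) / len(values)

def cond_mean(values, d_values, d_target):
    """Sample mean of `values` among units with D == d_target."""
    return mean([v for v, di in zip(values, d_values) if di == d_target])

naive = cond_mean(y1, d, 1) - cond_mean(y0, d, 0)      # E[Y|D=1] - E[Y|D=0]
att = cond_mean(y1, d, 1) - cond_mean(y0, d, 1)
atu = cond_mean(y1, d, 0) - cond_mean(y0, d, 0)
bias_y0 = cond_mean(y0, d, 1) - cond_mean(y0, d, 0)    # selection bias paired with ATT
bias_y1 = cond_mean(y1, d, 1) - cond_mean(y1, d, 0)    # selection bias paired with ATU
ate = mean(y1) - mean(y0)

print(naive, att + bias_y0, atu + bias_y1, ate)
```

Both decompositions reproduce the naive contrast exactly in the sample, and with this selection rule the selection bias in \(Y_1\) is positive, so the naive contrast exceeds the ATU.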

15.2.2 Identification under Selection on Observables

We could solve the problem of selection into treatment by random assignment (RCT) in an experiment but, as discussed, experiments can have low external validity. We can instead assume selection on observables, which is a relaxation of random assignment.

Selection on observables means that agents select into treatment based entirely on the variables in a vector \(X\) that the researcher observes. So, conditional on the observables, treatment is as good as randomly assigned.

Important

Conditional independence assumption (CIA) Suppose we observe \((Y,D,X)\), where \(X\) are covariates (the observables that agents base their decision upon) and \(D\) is the decision random variable with \(D\in\mathcal D\). The selection on observables assumption is $$\{Y_d\}_{d\in\mathcal D}\perp D\mid X$$ This is the conditional independence assumption (CIA). Justifying it is the most difficult part of this section.

Tip

Remark 15.2 Even when we don't directly have \(\{Y_d\}\perp D\mid X\), we might still be able to reach conditional independence in the following case. Suppose that selection is both on observables and a vector of unobservable random variables \(\theta\), i.e. \(\{Y_d\}_{d\in\mathcal D}\perp D\mid X,\theta\). If we have very good reasoning to say that the unobservable variables are pinned down by some function \(\tau\) of observable variables \(X\) and \(W\), i.e. \(\theta=\tau(X,W)\), then even when we don't know the functional form of \(\tau\), we still have $$\{Y_d\}_{d\in\mathcal D}\perp D\mid X,W\tag{15.11}$$ because conditional on \(X\) and \(W\) we know that \(\theta\) is pinned down. So here \(W\), together with \(X\), is a proxy of the unobservable \(\theta\), and then we have conditional independence in (15.11) again.

Common support (overlap condition) and balancing. We also need the overlap assumption \(0<\mathbb P(D=d\mid X=x)<1\) for all \(d\in\mathcal D\) and every \(x\) in the support of \(X\), which means that treated and untreated have common support in each matched cell \(X=x\) (both treated and untreated agents exist in each matched cell). Common support is crucial because we first estimate the average treatment effect \(\text{ATE}(x)\) in each matched cell and then average across cells. If common support fails in a cell, the mean contrast in that cell cannot be computed, and this ruins the estimate of the overall effect. We also want a balancing condition: the treated and untreated should be comparable in number within each cell, to reduce the variance of the estimate. Otherwise, even if common support holds, a cell with thousands of treated and only one or two untreated is still problematic.
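Both conditions can be inspected cell by cell before any effect is estimated. The sketch below assumes a discrete \(X\); the function names, the toy data, and the `min_share` cut-off used to flag imbalance are hypothetical choices, not thresholds taken from the text.

```python
def cell_counts(d_values, x_values):
    """Count treated and untreated units in each cell X = x."""
    counts = {}
    for d, x in zip(d_values, x_values):
        treated, untreated = counts.get(x, (0, 0))
        counts[x] = (treated + d, untreated + (1 - d))
    return counts

def overlap_report(d_values, x_values, min_share=0.1):
    """Flag cells without common support, and cells that are badly unbalanced.

    `min_share` is a hypothetical cut-off on the estimated P(D=1 | X=x).
    """
    report = {}
    for x, (n1, n0) in cell_counts(d_values, x_values).items():
        share = n1 / (n1 + n0)
        report[x] = {
            "treated": n1,
            "untreated": n0,
            "p_hat": share,
            "overlap": n1 > 0 and n0 > 0,
            "balanced": min_share <= share <= 1.0 - min_share,
        }
    return report

# Toy data (assumption): cell "a" is fine, "b" has no untreated, "c" is very unbalanced.
x_values = ["a"] * 5 + ["b"] * 4 + ["c"] * 20
d_values = [1, 1, 1, 0, 0] + [1, 1, 1, 1] + [1] * 19 + [0]

report = overlap_report(d_values, x_values)
for x, row in report.items():
    print(x, row)
```

Cells that fail the overlap check cannot contribute a mean contrast at all, while cells that pass it but fail the balance check contribute a very noisy one.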

Identify counterfactual potential outcomes in expectation with conditional independence. Now we have the conditional version of random assignment: $$F_d(y\mid x)\equiv\mathbb P(Y_d\le y\mid X=x)=\mathbb P(Y_d\le y\mid D=d,X=x)=\mathbb P(Y\le y\mid D=d,X=x)$$ where the first equality uses the CIA and requires the overlap condition (so that the event \(\{D=d,X=x\}\) has positive probability), and the second uses \(Y=Y_d\) when \(D=d\). Then we can average over \(X\) to point identify the unconditional CDF of \(Y_d\): $$\begin{aligned}F_d(y)\equiv\mathbb P(Y_d\le y)&=\mathbb E_X[\mathbb P(Y_d\le y\mid X)]=\mathbb E_X[\mathbb P(Y\le y\mid D=d,X)]\\&\overset{\text{discrete}}{=}\sum_x\mathbb P(Y\le y\mid D=d,X=x)\mathbb P(X=x)\end{aligned}$$ which in turn identifies \(\mathbb E[Y_d]=\int y\,dF_d(y)\).

Tip

Remark 15.3 The basic idea is that within the same cell (\(X=x\)), the treated and untreated do not differ in the expected value of their potential outcomes, so \(\mathbb E[Y_d\mid X=x]\) for the whole cell can be represented without bias by the mean within the subgroup of that cell with \(D=d\), i.e. $$\mathbb E[Y_d\mid X=x]=\mathbb E[Y_d\mid D=d,X=x]=\mathbb E[Y\mid D=d,X=x]$$ which is a mean of outcomes that we actually observe. Then, taking the weighted average over cells with weights \(\mathbb P(X=x)\), we get the unconditional mean free of selection bias: $$\mathbb E[Y_d]=\sum_x\mathbb E[Y_d\mid X=x]\mathbb P(X=x)$$ Back in the binary decision problem \(\mathcal D=\{0,1\}\), the overlap assumption becomes \(0<\mathbb P(D=1\mid X)<1\).

Tip

Remark 15.4 To understand why the overlap assumption is important, suppose we want to compare the earnings of people who serve in the army with the earnings of those who don't. In particular, suppose that men are much more likely to serve in the army but that, conditional on gender, army service is as good as randomly assigned. We want to make statements like \(\mathbb E[Y_{\text{Army}}\mid\text{Army}=1,\text{Sex}=\text{Female}]=\mathbb E[Y_{\text{Army}}\mid\text{Army}=0,\text{Sex}=\text{Female}]\). However, if we do not observe at least one woman who serves in the army, the term on the LHS cannot be estimated. More generally, if some cell of the observables contains no treated or no untreated units, we cannot get the ATE for that cell, and this becomes a problem when we take the weighted average for the overall ATE, ATT or ATU.

15.2.3 Conditional Independence and Choice of Observable Controls

The next question is: when might conditional independence hold? There are two typical cases:
1. We have detailed information about the treatment assignment mechanism, which tells us that only a certain set of observable variables influences both potential outcomes and treatment selection. Conditional on those variables (within groups matched on them), potential outcomes and treatment selection are independent.
2. We have extremely rich data, so we can control for whatever we think is relevant to the selection decision.

Tip

Remark 15.5 For selection on observables to be plausible, \(X\) should be predetermined. Usually a sufficient criterion for \(X\) being predetermined is that \(X\) is measured before (as opposed to after) \(D\). Some examples of variables that are affected by the treatment and might be accidentally included in \(X\): (1) earnings after the program; (2) employment after the program; (3) marital status after the program. If such variables are included in \(X\) as controls, we clearly won't have \((Y_0,Y_1)\perp D\mid X\). To see why, suppose the after-program income \(Y\) itself is included in \(X\). Then in each matched cell \(Y=y\), the agents who took treatment (\(D=1\)) must be those with \(Y_1=y\), and the agents who didn't (\(D=0\)) must be those with \(Y_0=y\). There is clearly selection into treatment within the cell, so \((Y_0,Y_1)\not\perp D\mid X\).

Possibility of selection on unobservables: example from Neal and Johnson (1996, JPE). The question of interest is the effect of a pure racial factor on the Black-White wage gap. To look at this, they regress earnings \(Y\) on race (\(D=1\): white; \(D=0\): black) and some covariates, with and without test scores \(T\) (a binary indicator separating high from low scores). They find that test scores account for a large part of the earnings gap: conditional on test scores, there doesn't seem to be a large difference between black and white wages, i.e. $$\mathbb E[Y\mid D=1,X\backslash T]-\mathbb E[Y\mid D=0,X\backslash T]\gg\mathbb E[Y\mid D=1,X\backslash T,T]-\mathbb E[Y\mid D=0,X\backslash T,T]$$ Does this mean that test scores are the factor that differentiates black and white wages? Many found this explanation implausible. Suppose we control for test scores by regressing \(Y\) on \(D\), \(X\backslash T\) and \(T\): $$Y=\alpha+\beta D+\gamma T+\eta X\backslash T+\varepsilon$$ There might still exist unobservables that also determine wages, such as effort \(E\). Perhaps Black workers, as a disadvantaged group, need to put in much more effort to achieve the same test score, so that \(E\) and \(D\) are correlated conditional on \(T\). Since \(E\) also affects wages, \((Y_0,Y_1)\perp D\mid X\) is broken by the unobservable \(E\), and the problem is triggered by including \(T\) in \(X\). The lesson is that we should always be careful about the choice of control variables, and think twice about the story behind conditional independence and the unobservables that story may hide.

Bias and the inclusion of additional controls. Is it always true that the bias becomes smaller when you include more variables?

Suppose that \((Y_0,Y_1)\perp D\mid X\). Let \(X_1\subset X_2\subset X\). When all of \(X\) is available, we can always remove the bias by adding controls to \(X_1\) or \(X_2\) until we reach \(X\). However, when we cannot observe all of \(X\), it is unclear which of \(X_1\) and \(X_2\) gives the smaller bias. Including more is not always better: as discussed above, if we throw many variables into \(X\), we may well end up including outcome variables, which increases the bias. Eliminating the bias is just a limiting case.

15.2.4 Identification of ATE, ATT, and ATU by Weighted Average of Mean Contrasts in Each Matched Cell

Important

注记 15.6 现在选择进入处理下的识别问题可分两步解决: 1. 第 1 步: 找合适的可观测控制变量、并论证条件独立 \(\{Y_d\}_{d\in\mathcal D}\perp D\mid X\)(这是最难的部分!)。 2. 第 2 步: 在匹配 \(X\) 的每个单元内算 ATE,即 \(\mathbb E[Y_d-Y_{d'}\mid X=x]=\mathbb E[Y\mid D=d,X=x]-\mathbb E[Y\mid D=d',X=x]\),再对各单元加权平均得关心的目标参数,权重依目标参数而定。

设 \(D\in\{0,1\}\) 二值。则 $$\begin{aligned}\text{ATE}&\equiv\mathbb E[Y_1-Y_0]=\mathbb E[\mathbb E[Y_1\mid X]-\mathbb E[Y_0\mid X]]=\mathbb E\big[\underbrace{\mathbb E[Y\mid D=1,X]}_{\text{data}}-\underbrace{\mathbb E[Y\mid D=0,X]}_{\text{data}}\big]\\&\overset{\text{discrete}}{=}\sum_x\big(\mathbb E[Y\mid D=1,X=x]-\mathbb E[Y\mid D=0,X=x]\big)\underbrace{\mathbb P(X=x)}_{\text{weights}}\\&\overset{\text{continuous}}{=}\int\big(\mathbb E[Y\mid D=1,X=x]-\mathbb E[Y\mid D=0,X=x]\big)\underbrace{f_X(x)}_{\text{weights}}\,dx\end{aligned}$$ (由条件独立与重叠支撑。)同理 $$\text{ATT}\equiv\mathbb E[Y_1-Y_0\mid D=1]=\int\big(\mathbb E[Y\mid D=1,X=x]-\mathbb E[Y\mid D=0,X=x]\big)\underbrace{f_{X\mid D=1}(x)}_{\text{weights}}\,dx$$ $$\text{ATU}\equiv\mathbb E[Y_1-Y_0\mid D=0]=\int\big(\mathbb E[Y\mid D=1,X=x]-\mathbb E[Y\mid D=0,X=x]\big)\underbrace{f_{X\mid D=0}(x)}_{\text{weights}}\,dx$$ 其中 $$f_{X\mid D=1}(x)=\frac{\mathbb P(D=1,X=x)}{\mathbb P(D=1)}=f(x)\frac{\mathbb P(D=1\mid X=x)}{\mathbb P(D=1)},\quad f_{X\mid D=0}(x)=f(x)\frac{\mathbb P(D=0\mid X=x)}{\mathbb P(D=0)}$$ 每项都可用非参数方法估计(见 §15.1)。

Important

注记 15.7 这些论证只需均值独立:\(\mathbb E[Y_d\mid D=0,X]=\mathbb E[Y_d\mid D=1,X]=\mathbb E[Y_d\mid X]\) 对 \(d\in\{0,1\}\),而非完全独立。

Important

Remark 15.6 Now, the problem of identification under selection into treatment can be solved in two steps: 1. Step 1: Find appropriate observable variables as control variables, and argue for the conditional independence \(\{Y_d\}_{d\in\mathcal D}\perp D\mid X\) (this is the most difficult part!). 2. Step 2: Calculate the ATE within each cell of matched \(X\), i.e. \(\mathbb E[Y_d-Y_{d'}\mid X=x]=\mathbb E[Y\mid D=d,X=x]-\mathbb E[Y\mid D=d',X=x]\), and then take weighted averages of the ATE in each cell to obtain the target parameter we are interested in, and the weights depend on the target parameter.

Suppose that \(D\in\{0,1\}\) is binary. We then have that $$\begin{aligned}\text{ATE}&\equiv\mathbb E[Y_1-Y_0]=\mathbb E[\mathbb E[Y_1\mid X]-\mathbb E[Y_0\mid X]]=\mathbb E\big[\underbrace{\mathbb E[Y\mid D=1,X]}_{\text{data}}-\underbrace{\mathbb E[Y\mid D=0,X]}_{\text{data}}\big]\\&\overset{\text{discrete}}{=}\sum_x\big(\mathbb E[Y\mid D=1,X=x]-\mathbb E[Y\mid D=0,X=x]\big)\underbrace{\mathbb P(X=x)}_{\text{weights}}\\&\overset{\text{continuous}}{=}\int\big(\mathbb E[Y\mid D=1,X=x]-\mathbb E[Y\mid D=0,X=x]\big)\underbrace{f_X(x)}_{\text{weights}}\,dx\end{aligned}$$ (which follows by conditional independence and the overlapping support condition.) Likewise $$\text{ATT}\equiv\mathbb E[Y_1-Y_0\mid D=1]=\int\big(\mathbb E[Y\mid D=1,X=x]-\mathbb E[Y\mid D=0,X=x]\big)\underbrace{f_{X\mid D=1}(x)}_{\text{weights}}\,dx$$ $$\text{ATU}\equiv\mathbb E[Y_1-Y_0\mid D=0]=\int\big(\mathbb E[Y\mid D=1,X=x]-\mathbb E[Y\mid D=0,X=x]\big)\underbrace{f_{X\mid D=0}(x)}_{\text{weights}}\,dx$$ where $$f_{X\mid D=1}(x)=\frac{\mathbb P(D=1,X=x)}{\mathbb P(D=1)}=f(x)\frac{\mathbb P(D=1\mid X=x)}{\mathbb P(D=1)},\quad f_{X\mid D=0}(x)=f(x)\frac{\mathbb P(D=0\mid X=x)}{\mathbb P(D=0)}$$ Every item inside these expressions can be estimated by non-parametric methods (see §15.1).

Important

Remark 15.7 These arguments only require mean independence: \(\mathbb E[Y_d\mid D=0,X]=\mathbb E[Y_d\mid D=1,X]=\mathbb E[Y_d\mid X]\) for \(d\in\{0,1\}\), but not full independence.
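The two steps of Remark 15.6 can be sketched for a discrete \(X\) as follows. This is a minimal illustration, not a definitive implementation: the function name `matched_cell_estimates` and the small data set are hypothetical, and cell means and cell frequencies stand in for the conditional expectations and weights in the formulas above.

```python
from collections import defaultdict


def matched_cell_estimates(data):
    """Estimate ATE, ATT and ATU from rows (x, d, y): discrete x, binary d."""
    sums = defaultdict(lambda: [0.0, 0.0])   # x -> [sum of y | d=0, sum of y | d=1]
    counts = defaultdict(lambda: [0, 0])     # x -> [n untreated, n treated]
    for x, d, y in data:
        sums[x][d] += y
        counts[x][d] += 1

    n0 = sum(c[0] for c in counts.values())
    n1 = sum(c[1] for c in counts.values())
    ate = att = atu = 0.0
    for x, (c0, c1) in counts.items():
        if c0 == 0 or c1 == 0:
            raise ValueError(f"no overlap in cell {x!r}")
        contrast = sums[x][1] / c1 - sums[x][0] / c0   # E[Y|D=1,x] - E[Y|D=0,x]
        ate += contrast * (c0 + c1) / (n0 + n1)        # weight P(X=x)
        att += contrast * c1 / n1                      # weight P(X=x | D=1)
        atu += contrast * c0 / n0                      # weight P(X=x | D=0)
    return {"ATE": ate, "ATT": att, "ATU": atu}


# Hypothetical data: cell 0 has 2 treated / 8 untreated with contrast 2,
# cell 1 has 8 treated / 2 untreated with contrast 4.
data = ([(0, 1, 3.0)] * 2 + [(0, 0, 1.0)] * 8
        + [(1, 1, 10.0)] * 8 + [(1, 0, 6.0)] * 2)

estimates = matched_cell_estimates(data)

# Naive comparison of raw group means, ignoring X.
treated = [y for _, d, y in data if d == 1]
untreated = [y for _, d, y in data if d == 0]
naive = sum(treated) / len(treated) - sum(untreated) / len(untreated)

print(estimates, naive)
```

In this constructed example the three targets differ only through the weights, while the naive contrast is far from all of them because treated units are concentrated in the high-outcome cell.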

15.3 Propensity Scores

15.3.1 Definition and Theorem

Important

定义 15.2(倾向得分) 考虑二值处理 \(D\in\{0,1\}\)。\(p(x)\equiv\mathbb P(D=1\mid X=x)\) 称倾向得分(propensity score)。令 \(P\equiv p(X)\) 为随机变量 \(\mathbb P(D=1\mid X)\)。

Important

定理 15.1(倾向得分定理) 设 \(X\) 的条件独立假设(CIA)\((Y_0,Y_1)\perp D\mid X\) 成立,则 $$(Y_0,Y_1)\perp D\mid p(X)\tag{15.12}$$ 且 $$\mathbb P(D=1\mid Y_0,Y_1,p(X))=\mathbb P(D=1\mid p(X))\tag{15.13}$$

Note

证明 (15.13) 由独立定义蕴含 (15.12),故只需证 (15.13)。先证引理。 引理 15.1. 倾向得分 \(p(X)\) 满足 \(D\perp X\mid p(X)\)。 证:由定义 \(p(X)=\mathbb P(D=1\mid X)\),这正是钉住 \(D\) 分布所需——条件于 \(p(X)\),\(D\) 的分布已被钉住、不再依赖 \(X\)。\(\square\) 然后 $$\begin{aligned}\mathbb P(D=1\mid Y_0,Y_1,p(X))&=\mathbb E[\mathbf 1\{D=1\}\mid Y_0,Y_1,p(X)]\\&=\mathbb E[\mathbb E[\mathbf 1\{D=1\}\mid Y_0,Y_1,p(X),X]\mid Y_0,Y_1,p(X)]\\&\overset{\text{CIA}}{=}\mathbb E[\mathbb E[\mathbf 1\{D=1\}\mid p(X),X]\mid Y_0,Y_1,p(X)]\\&\overset{\text{Lemma}}{=}\mathbb E[\mathbb E[\mathbf 1\{D=1\}\mid p(X)]\mid Y_0,Y_1,p(X)]\\&=\mathbb E[\mathbf 1\{D=1\}\mid p(X)]=\mathbb P(D=1\mid p(X))\end{aligned}$$ 其中倒数第二行用 \(\mathbb E[\mathbb E[X\mid A]\mid A,B]=\mathbb E[X\mid A]\)(条件于更少信息时外层期望不起作用)。\(\blacksquare\)

Important

例 15.1(条件于 \(p(X)\) 的 ATE 估计) 由定理 15.1,基于倾向得分的 \(\mathbb E[Y_1-Y_0]\) 估计为 $$\mathbb E[Y_1-Y_0]=\int_0^1\big(\underbrace{\mathbb E[Y\mid D=1,p(X)=p]}_{\text{data}}-\underbrace{\mathbb E[Y\mid D=0,p(X)=p]}_{\text{data}}\big)f(p)\,dp$$ \(f(p)=\mathbb P(p(X)=p)\) 为随机变量 \(p(X)\) 的密度。

Note

证明(无偏) $$\begin{aligned}\text{ATE}\equiv\mathbb E[Y_1-Y_0]&=\mathbb E[\mathbb E[Y_1\mid p(X)]-\mathbb E[Y_0\mid p(X)]]=\mathbb E\big[\mathbb E[Y\mid D=1,p(X)]-\mathbb E[Y\mid D=0,p(X)]\big]\\&=\int_0^1\big(\mathbb E[Y\mid D=1,p(X)=p]-\mathbb E[Y\mid D=0,p(X)=p]\big)f(p)\,dp\end{aligned}$$ \(\blacksquare\)

Tip

注记 15.8 此定理意味我们可在 \(p(X)\) 上而非 \(X\) 上条件。这很有用,因 \(X\) 可能高维、而 \(p(X)\) 是标量。后面讨论匹配时会方便地在该低维对象上匹配。

Important

Definition 15.2 (Propensity score) Consider a binary treatment with \(D\in\{0,1\}\). \(p(x)\equiv\mathbb P(D=1\mid X=x)\) is called the propensity score. Let \(P\equiv p(X)\) be the random variable \(\mathbb P(D=1\mid X)\).

Important

Theorem 15.1 (The Propensity Score Theorem) Suppose that the conditional independence assumption (CIA) for \(X\), i.e. \((Y_0,Y_1)\perp D\mid X\), holds. Then $$(Y_0,Y_1)\perp D\mid p(X)\tag{15.12}$$ and $$\mathbb P(D=1\mid Y_0,Y_1,p(X))=\mathbb P(D=1\mid p(X))\tag{15.13}$$

Note

Proof (15.13) implies (15.12) by the definition of independence, so we only need to show (15.13). First, a lemma. Lemma 15.1. The propensity score \(p(X)\) satisfies \(D\perp X\mid p(X)\). Proof: \(D\) is binary, so its conditional distribution is pinned down by \(\mathbb P(D=1\mid X,p(X))=\mathbb P(D=1\mid X)=p(X)\), which is a function of \(p(X)\) alone; conditional on \(p(X)\), the distribution of \(D\) therefore no longer depends on \(X\). \(\square\) Then $$\begin{aligned}\mathbb P(D=1\mid Y_0,Y_1,p(X))&=\mathbb E[\mathbf 1\{D=1\}\mid Y_0,Y_1,p(X)]\\&=\mathbb E[\mathbb E[\mathbf 1\{D=1\}\mid Y_0,Y_1,p(X),X]\mid Y_0,Y_1,p(X)]\\&\overset{\text{CIA}}{=}\mathbb E[\mathbb E[\mathbf 1\{D=1\}\mid p(X),X]\mid Y_0,Y_1,p(X)]\\&\overset{\text{Lemma}}{=}\mathbb E[\mathbb E[\mathbf 1\{D=1\}\mid p(X)]\mid Y_0,Y_1,p(X)]\\&=\mathbb E[\mathbf 1\{D=1\}\mid p(X)]=\mathbb P(D=1\mid p(X))\end{aligned}$$ where the last step uses \(\mathbb E[\mathbb E[Z\mid A]\mid A,B]=\mathbb E[Z\mid A]\): \(\mathbb E[Z\mid A]\) is a function of \(A\), so conditioning additionally on \(B\) leaves it unchanged. \(\blacksquare\)

Important

Example 15.1 (Estimator for ATE conditional on \(p(X)\)) By Theorem 15.1, \(\mathbb E[Y_1-Y_0]\) can be written in terms of the propensity score, which suggests an estimator based on $$\mathbb E[Y_1-Y_0]=\int_0^1\big(\underbrace{\mathbb E[Y\mid D=1,p(X)=p]}_{\text{data}}-\underbrace{\mathbb E[Y\mid D=0,p(X)=p]}_{\text{data}}\big)f(p)\,dp$$ where \(f(p)\) is the p.d.f. of the random variable \(p(X)\).

Note

Proof (unbiased) $$\begin{aligned}\text{ATE}\equiv\mathbb E[Y_1-Y_0]&=\mathbb E[\mathbb E[Y_1\mid p(X)]-\mathbb E[Y_0\mid p(X)]]=\mathbb E\big[\mathbb E[Y\mid D=1,p(X)]-\mathbb E[Y\mid D=0,p(X)]\big]\\&=\int_0^1\big(\mathbb E[Y\mid D=1,p(X)=p]-\mathbb E[Y\mid D=0,p(X)=p]\big)f(p)\,dp\end{aligned}$$ \(\blacksquare\)

Tip

Remark 15.8 This theorem implies that we can condition on \(p(X)\) instead of \(X\). This is useful because \(X\) is potentially a high-dimensional object while \(p(X)\) is a scalar, so when we discuss matching later it will be convenient to match on the lower-dimensional object.
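A small constructed example may help to see what conditioning on \(p(X)\) instead of \(X\) buys. The sketch below assumes a hypothetical design with four cells of \(X\) but only two distinct propensity values, and potential outcomes that are constant within each cell (so CIA holds trivially). The names `stratified_ate`, `design` and `p_hat` are illustrative, not from the notes.

```python
from collections import defaultdict


def stratified_ate(rows, key):
    """Weighted average of treated-minus-untreated mean contrasts within strata.

    rows: (x, d, y) tuples; key: function mapping x to a stratum label.
    """
    strata = defaultdict(lambda: {0: [], 1: []})
    for x, d, y in rows:
        strata[key(x)][d].append(y)
    total = 0.0
    for groups in strata.values():
        n_s = len(groups[0]) + len(groups[1])
        contrast = (sum(groups[1]) / len(groups[1])
                    - sum(groups[0]) / len(groups[0]))
        total += contrast * n_s / len(rows)   # weight = share of the stratum
    return total


# Hypothetical cells: x -> (treated count out of 10, y0, y1).
design = {"a": (2, 1.0, 3.0), "b": (2, 5.0, 6.0),
          "c": (8, 6.0, 10.0), "d": (8, 0.0, 1.0)}
rows = []
for x, (n_treated, y0, y1) in design.items():
    rows += [(x, 1, y1)] * n_treated + [(x, 0, y0)] * (10 - n_treated)

# Cell-frequency propensity score p(x) = P(D=1 | X=x); only two distinct values.
p_hat = {x: n_treated / 10 for x, (n_treated, _, _) in design.items()}

ate_match_on_x = stratified_ate(rows, key=lambda x: x)         # four cells
ate_match_on_p = stratified_ate(rows, key=lambda x: p_hat[x])  # two strata
print(ate_match_on_x, ate_match_on_p)
```

Both routes give the same number here, although the second uses two strata rather than four; with many covariates the reduction in the number of cells is what makes matching on the scalar score attractive.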

15.3.2 Rewrite ATE, ATT and ATU with Propensity Score in Expression but Matching on X

用 \(p(X)\),可在每个匹配单元 \(X=x\) 内重写 \(\text{ATE}(x)\),再把 ATE/ATT/ATU 写为 \(\text{ATE}(x)\) 的加权平均。先看 \(\text{ATE}(x)\):由 $$\mathbb E[YD\mid X=x]=\mathbb E[Y_1\mid D=1,X=x]\mathbb P(D=1\mid X=x)\overset{\text{CIA}}{=}\mathbb E[Y_1\mid X=x]p(x)$$ $$\Rightarrow\mathbb E[Y_1\mid X=x]=\frac{\mathbb E[YD\mid X=x]}{p(x)}=\mathbb E\Big[\frac{YD}{p(x)}\mid X=x\Big]$$ 同理 \(\mathbb E[Y_0\mid X=x]=\mathbb E\big[\frac{Y(1-D)}{1-p(x)}\mid X=x\big]\)。于是 $$\begin{aligned}\text{ATE}(x)&=\mathbb E[Y_1-Y_0\mid X=x]=\mathbb E\Big[\frac{YD}{p(x)}\mid X=x\Big]-\mathbb E\Big[\frac{Y(1-D)}{1-p(x)}\mid X=x\Big]\\&=\mathbb E\Big[\frac{YD(1-p(x))-Y(1-D)p(x)}{p(x)(1-p(x))}\mid X=x\Big]=\mathbb E\Big[\frac{(D-p(x))Y}{p(x)(1-p(x))}\mid X=x\Big]\end{aligned}$$ 然后在连续情形重写 $$\text{ATE}\equiv\mathbb E[\text{ATE}(X)]=\int\text{ATE}(x)\underbrace{f_X(x)}_{\text{weights}}\,dx=\int\mathbb E\Big[\frac{(D-p(x))Y}{p(x)(1-p(x))}\mid X=x\Big]\underbrace{f_X(x)}_{\text{weights}}\,dx$$ $$\text{ATT}\equiv\mathbb E[\text{ATE}(X)\mid D=1]=\int\mathbb E\Big[\frac{(D-p(x))Y}{p(x)(1-p(x))}\mid X=x\Big]\underbrace{\frac{\mathbb P(D=1\mid X=x)}{\mathbb P(D=1)}f_X(x)}_{\text{weights}}\,dx$$ $$\text{ATU}\equiv\mathbb E[\text{ATE}(X)\mid D=0]=\int\mathbb E\Big[\frac{(D-p(x))Y}{p(x)(1-p(x))}\mid X=x\Big]\underbrace{\frac{\mathbb P(D=0\mid X=x)}{\mathbb P(D=0)}f_X(x)}_{\text{weights}}\,dx$$ 这些表达式中每项都可用非参数方法(§15.1)估计。

Tip

注记 15.9 这些结果是在 \(X\) 上匹配推得的。也可基于定理 15.1 在 \(p(X)\) 上匹配推得类似结果。注意在 \(p(X)\) 上匹配时仍需重叠条件 \(0<p(X)<1\)。

Using \(p(X)\), we can rewrite the ATE in each matched cell \(X=x\), i.e. \(\text{ATE}(x)\), and then rewrite ATE, ATT and ATU as weighted averages of \(\text{ATE}(x)\). First look at \(\text{ATE}(x)\): since $$\mathbb E[YD\mid X=x]=\mathbb E[Y_1\mid D=1,X=x]\mathbb P(D=1\mid X=x)\overset{\text{CIA}}{=}\mathbb E[Y_1\mid X=x]p(x)$$ $$\Rightarrow\mathbb E[Y_1\mid X=x]=\frac{\mathbb E[YD\mid X=x]}{p(x)}=\mathbb E\Big[\frac{YD}{p(x)}\mid X=x\Big]$$ and similarly \(\mathbb E[Y_0\mid X=x]=\mathbb E\big[\frac{Y(1-D)}{1-p(x)}\mid X=x\big]\), we have $$\begin{aligned}\text{ATE}(x)&=\mathbb E[Y_1-Y_0\mid X=x]=\mathbb E\Big[\frac{YD}{p(x)}\mid X=x\Big]-\mathbb E\Big[\frac{Y(1-D)}{1-p(x)}\mid X=x\Big]\\&=\mathbb E\Big[\frac{YD(1-p(x))-Y(1-D)p(x)}{p(x)(1-p(x))}\mid X=x\Big]=\mathbb E\Big[\frac{(D-p(x))Y}{p(x)(1-p(x))}\mid X=x\Big]\end{aligned}$$ Then in the continuous case we can rewrite $$\text{ATE}\equiv\mathbb E[\text{ATE}(X)]=\int\text{ATE}(x)\underbrace{f_X(x)}_{\text{weights}}\,dx=\int\mathbb E\Big[\frac{(D-p(x))Y}{p(x)(1-p(x))}\mid X=x\Big]\underbrace{f_X(x)}_{\text{weights}}\,dx$$ $$\text{ATT}\equiv\mathbb E[\text{ATE}(X)\mid D=1]=\int\mathbb E\Big[\frac{(D-p(x))Y}{p(x)(1-p(x))}\mid X=x\Big]\underbrace{\frac{\mathbb P(D=1\mid X=x)}{\mathbb P(D=1)}f_X(x)}_{\text{weights}}\,dx$$ $$\text{ATU}\equiv\mathbb E[\text{ATE}(X)\mid D=0]=\int\mathbb E\Big[\frac{(D-p(x))Y}{p(x)(1-p(x))}\mid X=x\Big]\underbrace{\frac{\mathbb P(D=0\mid X=x)}{\mathbb P(D=0)}f_X(x)}_{\text{weights}}\,dx$$ Every item inside these expressions can be estimated by non-parametric methods (§15.1).

Tip

Remark 15.9 These results are derived for matching on \(X\). We can instead derive similar results for matching on \(p(X)\) based on Theorem 15.1. Note that when matching on \(p(X)\), we still need the overlapping condition \(0<p(X)<1\).
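The rewritten expressions have simple sample analogues. By the law of iterated expectations the ATE expression equals \(\mathbb E\big[\frac{(D-p(X))Y}{p(X)(1-p(X))}\big]\), and multiplying by the ATT and ATU weights gives \(\mathbb E\big[\frac{(D-p(X))Y}{(1-p(X))\mathbb P(D=1)}\big]\) and \(\mathbb E\big[\frac{(D-p(X))Y}{p(X)\mathbb P(D=0)}\big]\). The sketch below replaces expectations by sample means and, as an assumption made only for this illustration, estimates \(p(x)\) by the treated share in each discrete cell; `ipw_estimates` and the data are hypothetical.

```python
def ipw_estimates(data):
    """Propensity-weighted sample analogues of ATE, ATT, ATU from rows (x, d, y)."""
    n = len(data)
    # Cell-frequency estimate of p(x) = P(D=1 | X=x).
    cell_n, cell_treated = {}, {}
    for x, d, _ in data:
        cell_n[x] = cell_n.get(x, 0) + 1
        cell_treated[x] = cell_treated.get(x, 0) + d
    p = {x: cell_treated[x] / cell_n[x] for x in cell_n}
    if any(px <= 0.0 or px >= 1.0 for px in p.values()):
        raise ValueError("overlap fails: need 0 < p(x) < 1 in every cell")

    share_treated = sum(d for _, d, _ in data) / n   # P(D=1)
    ate = att = atu = 0.0
    for x, d, y in data:
        px = p[x]
        ate += (d - px) * y / (px * (1 - px))
        att += (d - px) * y / ((1 - px) * share_treated)
        atu += (d - px) * y / (px * (1 - share_treated))
    return {"ATE": ate / n, "ATT": att / n, "ATU": atu / n}


# Hypothetical data: cell 0 has p = 0.2 and contrast 2, cell 1 has p = 0.8 and contrast 4.
data = ([(0, 1, 3.0)] * 2 + [(0, 0, 1.0)] * 8
        + [(1, 1, 10.0)] * 8 + [(1, 0, 6.0)] * 2)

weighted = ipw_estimates(data)
print(weighted)
```

With cell-frequency propensity scores these weighted means coincide with the weighted averages of within-cell mean contrasts, so the two representations are numerically the same object in this discrete setting.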

15.4 Curse of Dimensionality and Alternative Methods

15.4.1 Problem of Non-parametric Estimates When Matching Directly on X

至此我们都在 \(X\) 上匹配,而 \(X\) 一般是高维向量。定理 15.1 意味也可在倾向得分 \(p(X)\)(标量)上匹配。不同方法蕴含不同识别、从而不同的估计构造,进而估计的效率、收敛速率、有限样本表现也不同。

由于估计用非参数方法,很重要的一点是:尽量在每个单元放足够多数据点以降低偏差与方差。但数据量有限时,高维控制变量 \(X\) 需要指数级增加的数据点。例:设有 \(K=30\) 个二值协变量作控制 \(X\),则至少需 \(\min N=2^{30+1}=2147483648\) 个数据点——这就是维度灾难(curse of dimensionality)

Up to now, we have been matching on \(X\), which is generally a high-dimensional vector. Theorem 15.1 implies that we can also match on the propensity score \(p(X)\) (a scalar). Different methods rest on different identification arguments and thus lead to different estimators, which may differ in efficiency, rate of convergence, and finite-sample performance.

Since we will be using non-parametric methods in the estimation, it is important to have enough data points within each matched cell to keep both bias and variance under control. However, for a given sample size, the number of data points needed grows exponentially with the dimension of the control variables \(X\). For example, suppose that you have \(K=30\) binary covariates as control variables \(X\). There are then \(2^{30}\) cells, and each cell needs at least one treated and one untreated observation, so \(\min N=2\cdot 2^{30}=2^{30+1}=2147483648\). This is the curse of dimensionality.
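The arithmetic behind this example is easy to reproduce. The helper below is a hypothetical illustration of the counting argument (one treated and one untreated unit per cell), not an estimator; the sample size of one million is an arbitrary choice for comparison.

```python
def min_sample_size(k_binary_covariates):
    """Smallest N giving one treated and one untreated unit in every cell."""
    n_cells = 2 ** k_binary_covariates   # one cell per covariate combination
    return 2 * n_cells                   # two treatment arms per cell


for k in (5, 10, 20, 30):
    print(k, min_sample_size(k))

# Upper bound on the share of the 2**30 cells that a sample of one million
# units can touch at all (each unit falls in exactly one cell).
n_obs = 1_000_000
max_share_of_cells_hit = n_obs / 2 ** 30
print(max_share_of_cells_hit)
```

Even a sample of a million units can reach fewer than one in a thousand of the cells, so most within-cell contrasts are simply not defined.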

15.4.2 Alternative Methods

为克服维度灾难,可做如下: 1. 回归分析。 (a) OLS 回归无法给我们 ATE、ATT 或 ATU,而是给每个匹配单元 ATE 的一个不同加权平均;(b) 见下。 2. 在倾向得分上匹配:通过部分使用参数模型解决维度问题。 (a) 与在 \(X\) 上匹配类似;(b) 然而即便匹配可避维度灾难,估计倾向得分时仍面临此问题;(c) 实务中可用 probit 或 logit 回归模型(而非非参数法)估倾向得分、避免维度灾难;(d) 故最终可对匹配单元的均值用非参数法、但对倾向得分的估计用参数模型。 3. 非参数估计下的非精确匹配。 (a) 先在 \(X\) 上定义一个度量;(b) 再允许匹配有正带宽;(c) 但要思考带宽的最优选择、权衡偏差与方差(见 §15.1)。

结论:若坚持用完全非参数方法、做尽量少的假设,则代价是维度灾难。但若结合参数与非参数方法(如在倾向得分上匹配),则可解决维度问题,这又是不同目标间的权衡与平衡。

To overcome the curse we can do the following: 1. Regression analysis. (a) OLS regression does not in general give us ATE, ATT, or ATU. Instead, it gives a differently weighted average of the ATE in each matched cell; (b) see below for details. 2. Matching on the propensity score: solve the dimensionality problem by partly using parametric models. (a) This is similar to what we have done for matching on \(X\); (b) however, although matching itself avoids the curse of dimensionality, we still face the problem when estimating the propensity score non-parametrically; (c) in practice, we could use a probit or logit regression model instead of non-parametric methods to estimate the propensity score and avoid the curse of dimensionality; (d) so we could end up using non-parametric methods for the averages within matched cells but a parametric model for the propensity score. 3. Inexact matching when doing non-parametric estimation. (a) First, define a metric on \(X\); (b) then, allow for a positive matching bandwidth; (c) but we need to think about the optimal choice of bandwidth, taking into account the trade-off between bias and variance. See more discussion on this in §15.1.

In conclusion, if we insist on using a completely non-parametric method in estimation so as to make as few assumptions as possible, then the price to pay is the dimensionality problem. But if we combine parametric and non-parametric methods (e.g. matching on the propensity score), the dimensionality problem can be mitigated; it is again a trade-off between different objectives.

OLS 回归。 设已找到正确控制变量 \(X\) 并放入 OLS 回归,则处理 \(D\) 的系数估计 \(\beta^{\text{OLS}}\) 是各单元 \(\text{ATE}(x_i)\) 的加权平均,但权重不同于 ATE/ATT/ATU 所用,是带特殊权重的别的对象。具体地,考虑饱和于 \(X\)的回归模型:\(X_i\) 有 \(K\) 个离散态 \(x_k\),\(\forall i\)。施加重叠条件,处理与未处理者有共同支撑,即 \(X_i\) 把 agents 划为 \(K\) 个子组、每个子组内既有处理又有未处理者: $$Y_i=\sum_{k=1}^K d_{ik}\alpha_k+\beta^{\text{OLS}}D_i+U_i,\qquad d_{ik}=\mathbf 1\{X_i=x_k\}\tag{15.14}$$ 其中:\(\alpha_k\) 为 \(X_k=x_k\) 的系数;\(\beta^{\text{OLS}}\) 为处理的回归估计量;CIA \((Y_{0i},Y_{1i})\perp D_i\mid X_i\) 成立。

Important

命题 15.1 可证 \(\beta^{\text{OLS}}\) 是 \(\text{ATE}(x_i)\) 的加权平均,且处理方差越大的态权重越高,即 $$\beta^{\text{OLS}}=\sum_{k=1}^K\big\{\mathbb E[Y_i\mid D_i=1,X_i=x_k]-\mathbb E[Y_i\mid D_i=0,X_i=x_k]\big\}\omega_k$$ $$\omega_k=\frac{\mathbb P(D_i=1\mid X_i=x_k)(1-\mathbb P(D_i=1\mid X_i=x_k))\mathbb P(X_i=x_k)}{\sum_{k=1}^K\mathbb P(D_i=1\mid X_i=x_k)(1-\mathbb P(D_i=1\mid X_i=x_k))\mathbb P(X_i=x_k)}$$

Note

证明 由命题 4.1(FWL,把 BLP 换为 \(\mathbb E[\mathbf X_1\mid\mathbf X_2]\) 满足者):对回归 \(Y=\beta_1 X_1+\beta_2 X_2+u\) 与 \(\tilde Y=\beta\tilde X_2+v\)(\(\tilde Y=Y-\mathbb E[Y\mid X_1]\)、\(\tilde X_2=X_2-\mathbb E[X_2\mid X_1]\))有 \(\beta=\beta_2\)。故对 (15.14),令 \(\tilde D_i=D_i-\mathbb E[D_i\mid X_i]\)、\(\tilde Y_i=Y_i-\mathbb E[Y_i\mid X_i]\),回归化为 \(\tilde Y_i=\tilde\beta^{\text{OLS}}\tilde D_i+U_i\),\(\beta^{\text{OLS}}=\tilde\beta^{\text{OLS}}\),且 $$\begin{aligned}\beta^{\text{OLS}}=\tilde\beta^{\text{OLS}}&=\frac{\operatorname{Cov}(\tilde Y_i,\tilde D_i)}{\operatorname{Var}(\tilde D_i)}=\frac{\mathbb E[\tilde Y_i\tilde D_i]-\overbrace{\mathbb E[\tilde Y_i]\mathbb E[\tilde D_i]}^{=0}}{\operatorname{Var}(\tilde D_i)}=\frac{\mathbb E[(Y_i-\mathbb E[Y_i\mid X_i])(D_i-\mathbb E[D_i\mid X_i])]}{\operatorname{Var}(\tilde D_i)}\\&=\frac{\mathbb E[Y_i(D_i-\mathbb E[D_i\mid X_i])]-\overbrace{\mathbb E[\mathbb E[Y_i\mid X_i](D_i-\mathbb E[D_i\mid X_i])]}^{=0}}{\operatorname{Var}(\tilde D_i)}=\frac{\mathbb E[Y_i(D_i-\mathbb E[D_i\mid X_i])]}{\operatorname{Var}(\tilde D_i)}\\&=\frac{\mathbb E[(Y_i-U_i+U_i)(D_i-\mathbb E[D_i\mid X_i])]}{\operatorname{Var}(\tilde D_i)}\overset{\text{exog}}{=}\frac{\mathbb E[(Y_i-U_i)(D_i-\mathbb E[D_i\mid X_i])]}{\operatorname{Var}(\tilde D_i)}=\frac{\mathbb E[\mathbb E[Y_i\mid X_i=x,D_i](D_i-\mathbb E[D_i\mid X_i])]}{\operatorname{Var}(\tilde D_i)}\end{aligned}$$ 定义各单元 ATE \(\delta_x\equiv\text{ATE}(x)=\mathbb E[Y_i\mid X_i=x,D_i=1]-\mathbb E[Y_i\mid X_i=x,D_i=0]\),则 \(\mathbb E[Y_i\mid X_i=x,D_i]=D_i\delta_x+\mathbb E[Y_i\mid X_i=x,D_i=0]\)。代入(分子里 \(\mathbb E[Y_i\mid X_i=x,D_i=0]\) 项乘 \((D_i-\mathbb E[D_i\mid X_i])\) 期望为零),并用 \(\mathbb E[D_i(D_i-\mathbb E[D_i\mid X_i])\mid X_i]=\operatorname{Var}(D_i\mid X_i)\) 得 $$\beta^{\text{OLS}}=\frac{\mathbb E[\operatorname{Var}(D_i\mid X_i)\delta_{x_i}]}{\mathbb E[\operatorname{Var}(D_i\mid X_i)]}$$ 二值 \(D_i\) 下 \(\operatorname{Var}(D_i\mid X_i=x_k)=\mathbb P(D_i=1\mid X_i=x_k)(1-\mathbb P(D_i=1\mid X_i=x_k))\),对离散 \(x_k\)(概率 \(\mathbb P(X_i=x_k)\))取期望即得命题中的 \(\beta^{\text{OLS}}=\sum_k\delta_{x_k}\omega_k\) 与权重 \(\omega_k\)。\(\blacksquare\)

显然 OLS 估计中,处理 \(D_i\) 方差更大的单元 \(x_k\) 权重 \(\omega_k\) 更高——这合理:处理方差更大意味更精确的估计、更小残差;由 OLS 最小化残差平方和,结果即如此。注意 \(\delta_{x_k}=\text{ATE}(x_k)\) 也可由非参数方法估计。

Important

注记 15.10 随机对照试验(RCT)下 \(D\perp X_i\),则 $$\omega_k=\frac{\mathbb P(D_i=1\mid X_i=x_k)(1-\mathbb P(D_i=1\mid X_i=x_k))\mathbb P(X_i=x_k)}{\mathbb P(D_i=1)(1-\mathbb P(D_i=1))\sum_{k=1}^K\mathbb P(X_i=x_k)}=\mathbb P(X_i=x_k)$$ 于是 $$\beta^{\text{OLS}}=\sum_{k=1}^K\underbrace{\big\{\mathbb E[Y_i\mid D_i=1,X_i=x_k]-\mathbb E[Y_i\mid D_i=0,X_i=x_k]\big\}}_{=\delta(x_k)}\mathbb P(X_i=x_k)=\mathbb E[\delta(x_k)]=\beta^{\text{ATE}}$$ 但一般无 \(D\perp X_i\),故 OLS 对 ATE 不奏效,即 \(\beta^{\text{OLS}}\ne\beta^{\text{ATE}}\)。此时若要估 ATE/ATT/ATU,须用非参数法(在每个单元 \(X\in(x-h,x+h)\) 内用局部线性核回归估 \(\text{ATE}(x)\)、再用非参数法估各单元权重、最后按目标参数取加权平均)。要估 ATE/ATT/ATU,须分别估下列对象(皆非参数): $$\beta^{\text{ATE}}=\sum_{k=1}^K\big\{\mathbb E[Y_i\mid D_i=1,X_i=x_k]-\mathbb E[Y_i\mid D_i=0,X_i=x_k]\big\}\mathbb P(X_i=x_k)$$ $$\beta^{\text{ATT}}=\sum_{k=1}^K\big\{\mathbb E[Y_i\mid D_i=1,X_i=x_k]-\mathbb E[Y_i\mid D_i=0,X_i=x_k]\big\}\mathbb P(X_i=x_k\mid D_i=1)$$ $$\beta^{\text{ATU}}=\sum_{k=1}^K\big\{\mathbb E[Y_i\mid D_i=1,X_i=x_k]-\mathbb E[Y_i\mid D_i=0,X_i=x_k]\big\}\mathbb P(X_i=x_k\mid D_i=0)$$

还有其他替代法,如在不同对象上匹配。但每种方法都面对相同权衡:对数据生成过程更多假设(参数法)或更严重维度问题(非参数法);在维度问题内部,又有偏差与方差的权衡。

OLS Regression. Suppose we have found the correct control variables \(X\) and add them to the OLS regression; then the coefficient estimand \(\beta^{\text{OLS}}\) on the treatment \(D\) is a weighted average of \(\text{ATE}(x_k)\), but the weights differ from those used for ATE, ATT and ATU, so \(\beta^{\text{OLS}}\) is a different object with its own special weights. To be more specific, consider the following saturated-in-\(X\) regression model: there are \(K\) possible discrete states \(x_1,\dots,x_K\) for \(X_i\), \(\forall i\). By imposing the overlapping condition, we have common support for treated and untreated, i.e. \(X_i\) partitions the agents into \(K\) subgroups, each of which contains both treated and untreated agents. The model is: $$Y_i=\sum_{k=1}^K d_{ik}\alpha_k+\beta^{\text{OLS}}D_i+U_i,\qquad d_{ik}=\mathbf 1\{X_i=x_k\}\tag{15.14}$$ where: \(\alpha_k\) is the coefficient on the dummy for \(X_i=x_k\); \(\beta^{\text{OLS}}\) is the regression estimand for treatment; CIA \((Y_{0i},Y_{1i})\perp D_i\mid X_i\) holds.

Important

Proposition 15.1 We can then show that \(\beta^{\text{OLS}}\) is a weighted average of \(\text{ATE}(x_k)\) such that states with higher variance of treatment get higher weights, i.e. $$\beta^{\text{OLS}}=\sum_{k=1}^K\big\{\mathbb E[Y_i\mid D_i=1,X_i=x_k]-\mathbb E[Y_i\mid D_i=0,X_i=x_k]\big\}\omega_k$$ $$\omega_k=\frac{\mathbb P(D_i=1\mid X_i=x_k)(1-\mathbb P(D_i=1\mid X_i=x_k))\mathbb P(X_i=x_k)}{\sum_{k=1}^K\mathbb P(D_i=1\mid X_i=x_k)(1-\mathbb P(D_i=1\mid X_i=x_k))\mathbb P(X_i=x_k)}$$

Note

Proof By Proposition 4.1 (FWL, with the best linear predictor replaced by the conditional expectation, which is valid here because the model is saturated in \(X\)): for regressions \(Y=\beta_1 X_1+\beta_2 X_2+u\) and \(\tilde Y=\beta\tilde X_2+v\) (\(\tilde Y=Y-\mathbb E[Y\mid X_1]\), \(\tilde X_2=X_2-\mathbb E[X_2\mid X_1]\)) we have \(\beta=\beta_2\). So for (15.14), define \(\tilde D_i=D_i-\mathbb E[D_i\mid X_i]\), \(\tilde Y_i=Y_i-\mathbb E[Y_i\mid X_i]\); the regression becomes \(\tilde Y_i=\tilde\beta^{\text{OLS}}\tilde D_i+U_i\) with \(\beta^{\text{OLS}}=\tilde\beta^{\text{OLS}}\), and $$\begin{aligned}\beta^{\text{OLS}}=\tilde\beta^{\text{OLS}}&=\frac{\operatorname{Cov}(\tilde Y_i,\tilde D_i)}{\operatorname{Var}(\tilde D_i)}=\frac{\mathbb E[\tilde Y_i\tilde D_i]-\overbrace{\mathbb E[\tilde Y_i]\mathbb E[\tilde D_i]}^{=0}}{\operatorname{Var}(\tilde D_i)}=\frac{\mathbb E[(Y_i-\mathbb E[Y_i\mid X_i])(D_i-\mathbb E[D_i\mid X_i])]}{\operatorname{Var}(\tilde D_i)}\\&=\frac{\mathbb E[Y_i(D_i-\mathbb E[D_i\mid X_i])]}{\operatorname{Var}(\tilde D_i)}=\frac{\mathbb E[(Y_i-U_i+U_i)(D_i-\mathbb E[D_i\mid X_i])]}{\operatorname{Var}(\tilde D_i)}\\&\overset{\text{exog}}{=}\frac{\mathbb E[(Y_i-U_i)(D_i-\mathbb E[D_i\mid X_i])]}{\operatorname{Var}(\tilde D_i)}=\frac{\mathbb E[\mathbb E[Y_i\mid X_i,D_i](D_i-\mathbb E[D_i\mid X_i])]}{\operatorname{Var}(\tilde D_i)}\end{aligned}$$ The first equality of the second line drops \(\mathbb E[\mathbb E[Y_i\mid X_i](D_i-\mathbb E[D_i\mid X_i])]=0\), which holds by the law of iterated expectations. Define the within-cell ATE \(\delta_x\equiv\text{ATE}(x)=\mathbb E[Y_i\mid X_i=x,D_i=1]-\mathbb E[Y_i\mid X_i=x,D_i=0]\), so \(\mathbb E[Y_i\mid X_i,D_i]=D_i\delta_{X_i}+\mathbb E[Y_i\mid X_i,D_i=0]\). Substituting (the \(\mathbb E[Y_i\mid X_i,D_i=0]\) term is a function of \(X_i\) only, so its product with \((D_i-\mathbb E[D_i\mid X_i])\) has zero expectation), and using \(\mathbb E[D_i(D_i-\mathbb E[D_i\mid X_i])\mid X_i]=\operatorname{Var}(D_i\mid X_i)\) together with \(\operatorname{Var}(\tilde D_i)=\mathbb E[\operatorname{Var}(D_i\mid X_i)]\), $$\beta^{\text{OLS}}=\frac{\mathbb E[\operatorname{Var}(D_i\mid X_i)\delta_{X_i}]}{\mathbb E[\operatorname{Var}(D_i\mid X_i)]}$$ For binary \(D_i\), \(\operatorname{Var}(D_i\mid X_i=x_k)=\mathbb P(D_i=1\mid X_i=x_k)(1-\mathbb P(D_i=1\mid X_i=x_k))\); taking the expectation over discrete \(x_k\) (with probability \(\mathbb P(X_i=x_k)\)) yields the proposition's \(\beta^{\text{OLS}}=\sum_k\delta_{x_k}\omega_k\) with weights \(\omega_k\). \(\blacksquare\)

Clearly, in the OLS estimand, the weight \(\omega_k\) is higher for cells \(x_k\) in which the conditional variance of treatment \(D_i\) is higher, i.e. where \(\mathbb P(D_i=1\mid X_i=x_k)\) is closer to \(1/2\). This makes sense: a cell with more variation in \(D_i\) is more informative about the treated-untreated contrast, so its within-cell effect is estimated more precisely. Since OLS minimizes the sum of squared residuals, it naturally puts more weight on such cells. Also note that \(\delta_{x_k}=\text{ATE}(x_k)\) can itself be estimated by non-parametric methods.
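Proposition 15.1 can be checked numerically on a small constructed data set. The sketch below is an illustration under assumptions: it computes the saturated-regression coefficient through the FWL route used in the proof (within-cell demeaning rather than an explicit dummy regression), and the function names and data are hypothetical.

```python
def saturated_ols_beta(data):
    """Coefficient on d from regressing y on d and a full set of x-cell dummies.

    Uses FWL: residualising on the cell dummies is within-cell demeaning.
    data: rows (x, d, y) with discrete x and binary d.
    """
    cells = {}
    for x, d, y in data:
        cells.setdefault(x, []).append((d, y))
    num = den = 0.0
    for rows in cells.values():
        d_bar = sum(d for d, _ in rows) / len(rows)
        y_bar = sum(y for _, y in rows) / len(rows)
        num += sum((d - d_bar) * (y - y_bar) for d, y in rows)
        den += sum((d - d_bar) ** 2 for d, _ in rows)
    return num / den


def variance_weighted_contrast(data):
    """Sum_k delta_k * omega_k, omega_k proportional to p_k (1 - p_k) P(X = x_k)."""
    cells = {}
    for x, d, y in data:
        cells.setdefault(x, []).append((d, y))
    num = den = 0.0
    for rows in cells.values():
        treated = [y for d, y in rows if d == 1]
        untreated = [y for d, y in rows if d == 0]
        p_k = len(treated) / len(rows)
        delta_k = sum(treated) / len(treated) - sum(untreated) / len(untreated)
        w_k = p_k * (1 - p_k) * len(rows) / len(data)
        num += delta_k * w_k
        den += w_k
    return num / den


# Hypothetical data: cell 0 has p = 0.5 and contrast 2; cell 1 has p = 0.1 and contrast 4.
data = ([(0, 1, 3.0)] * 5 + [(0, 0, 1.0)] * 5
        + [(1, 1, 10.0)] * 1 + [(1, 0, 6.0)] * 9)

beta_ols = saturated_ols_beta(data)
beta_weighted = variance_weighted_contrast(data)
ate = 0.5 * 2.0 + 0.5 * 4.0   # equal cell shares P(X = x_k) = 0.5
print(beta_ols, beta_weighted, ate)
```

The two computations agree, and both sit below the ATE here because the cell with the larger effect has a treated share far from \(1/2\) and therefore receives less weight.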

Important

Remark 15.10 When \(D_i\perp X_i\), which holds in a randomized controlled trial (RCT), \(\mathbb P(D_i=1\mid X_i=x_k)=\mathbb P(D_i=1)\) for every \(k\), so $$\omega_k=\frac{\mathbb P(D_i=1\mid X_i=x_k)(1-\mathbb P(D_i=1\mid X_i=x_k))\mathbb P(X_i=x_k)}{\mathbb P(D_i=1)(1-\mathbb P(D_i=1))\sum_{k=1}^K\mathbb P(X_i=x_k)}=\mathbb P(X_i=x_k)$$ Then $$\beta^{\text{OLS}}=\sum_{k=1}^K\underbrace{\big\{\mathbb E[Y_i\mid D_i=1,X_i=x_k]-\mathbb E[Y_i\mid D_i=0,X_i=x_k]\big\}}_{=\delta(x_k)}\mathbb P(X_i=x_k)=\mathbb E[\delta(X_i)]=\beta^{\text{ATE}}$$ In general, however, we do not have \(D_i\perp X_i\), and then OLS does not recover the ATE, i.e. typically \(\beta^{\text{OLS}}\ne\beta^{\text{ATE}}\). If selection on observables holds, we can instead use non-parametric methods (e.g. local linear kernel regression rather than a single global OLS fit) to estimate \(\text{ATE}(x)\) within each cell \(X\in(x-h,x+h)\), estimate the weight of each cell by non-parametric methods, and finally obtain the ATE as the weighted average of \(\text{ATE}(x)\) using the estimated weights. To estimate ATE, ATT and ATU, we then have to estimate the following objects non-parametrically: $$\beta^{\text{ATE}}=\sum_{k=1}^K\big\{\mathbb E[Y_i\mid D_i=1,X_i=x_k]-\mathbb E[Y_i\mid D_i=0,X_i=x_k]\big\}\mathbb P(X_i=x_k)$$ $$\beta^{\text{ATT}}=\sum_{k=1}^K\big\{\mathbb E[Y_i\mid D_i=1,X_i=x_k]-\mathbb E[Y_i\mid D_i=0,X_i=x_k]\big\}\mathbb P(X_i=x_k\mid D_i=1)$$ $$\beta^{\text{ATU}}=\sum_{k=1}^K\big\{\mathbb E[Y_i\mid D_i=1,X_i=x_k]-\mathbb E[Y_i\mid D_i=0,X_i=x_k]\big\}\mathbb P(X_i=x_k\mid D_i=0)$$

There are also some other alternative methods including matching on different objects. But every method faces the same trade-off: more assumptions on the data generating process (parametric methods) or more severe dimensionality problem (non-parametric methods). Inside the dimensionality problem, we have the trade-off between bias and variance.