Difference-in-Differences

Difference-in-differences (DiD) estimates a treatment effect by comparing how outcomes change for a treated group versus a control group around an event. This chapter applies it to corporate bonds around the 2015 Paris Agreement: do the yields of bonds issued by polluting firms move differently afterward? It builds the bond panel, runs a two-way fixed-effects DiD, and checks parallel trends with an event study. Use the R | Python toggle to switch.

library(tidyverse)
library(arrow)
library(fixest)
library(broom)
import pandas as pd
import numpy as np
import linearmodels as lm

Data preparation

We load bond characteristics (FISD) and transactions (TRACE), set the treatment date to the Paris Agreement, and flag bonds from polluting industries (by two-digit SIC code) as the treatment group. Bonds need at least a year to maturity so the post-period is meaningful.

fisd <- read_parquet("data-r/fisd.parquet") |>
  select(complete_cusip, maturity, offering_amt, sic_code) |>
  drop_na()

trace_enhanced <- open_dataset("data-r/trace_enhanced") |>
  select(cusip_id, trd_exctn_dt, rptd_pr, entrd_vol_qt, yld_pt) |>
  collect() |>
  drop_na()

treatment_date <- ymd("2015-12-12")
polluting_industries <- c(49, 13, 45, 29, 28, 33, 40, 20, 26, 42, 10, 53, 32, 99, 37)

bonds <- fisd |>
  filter(offering_amt > 0) |>
  mutate(
    time_to_maturity = as.numeric(maturity - treatment_date) / 365,
    sic_code = as.integer(substr(sic_code, 1, 2)),
    log_offering_amt = log(offering_amt)
  ) |>
  filter(time_to_maturity >= 1) |>
  select(cusip_id = complete_cusip, time_to_maturity, log_offering_amt, sic_code) |>
  mutate(polluter = sic_code %in% polluting_industries)
fisd = pd.read_parquet("data-python/fisd.parquet")[
    ["complete_cusip", "maturity", "offering_amt", "sic_code"]].dropna()

trace_enhanced = pd.read_parquet("data-python/trace_enhanced.parquet")[
    ["cusip_id", "trd_exctn_dt", "rptd_pr", "entrd_vol_qt", "yld_pt"]].dropna()

treatment_date = pd.Timestamp("2015-12-12")
polluting_industries = [49, 13, 45, 29, 28, 33, 40, 20, 26, 42, 10, 53, 32, 99, 37]

bonds = (fisd
  .query("offering_amt > 0")
  .assign(
    time_to_maturity=lambda x: (x["maturity"] - treatment_date).dt.days / 365,
    sic_code=lambda x: x["sic_code"].astype(str).str[:2].astype(int),
    log_offering_amt=lambda x: np.log(x["offering_amt"]))
  .query("time_to_maturity >= 1")
  .rename(columns={"complete_cusip": "cusip_id"})
  .assign(polluter=lambda x: x["sic_code"].isin(polluting_industries))
  .get(["cusip_id", "time_to_maturity", "log_offering_amt", "polluter"]))

Building the bond-month panel

We aggregate trades to a bond-month panel of volume-weighted average yields (requiring at least five trades a day for reliability), join the bond characteristics, and define post_period and the DiD interaction treated = polluter × post.

trace_aggregated <- trace_enhanced |>
  filter(rptd_pr > 25) |>
  group_by(cusip_id, trd_exctn_dt) |>
  summarize(
    avg_yield = weighted.mean(yld_pt, entrd_vol_qt * rptd_pr),
    trades = n(), .groups = "drop"
  ) |>
  drop_na(avg_yield) |>
  filter(trades >= 5) |>
  mutate(date = floor_date(trd_exctn_dt, "month")) |>
  group_by(cusip_id, date) |>
  slice_max(trd_exctn_dt) |>
  ungroup() |>
  select(cusip_id, date, avg_yield)

bonds_panel <- bonds |>
  inner_join(trace_aggregated, join_by(cusip_id), multiple = "all") |>
  drop_na() |>
  mutate(
    post_period = date >= floor_date(treatment_date, "months"),
    treated = polluter & post_period
  )
trace_aggregated = (trace_enhanced
  .query("rptd_pr > 25")
  .assign(weight=lambda x: x["entrd_vol_qt"] * x["rptd_pr"])
  .groupby(["cusip_id", "trd_exctn_dt"])
  .apply(lambda x: pd.Series({
    "avg_yield": np.average(x["yld_pt"], weights=x["weight"]),
    "trades": len(x)}))
  .reset_index()
  .query("trades >= 5")
  .assign(date=lambda x: x["trd_exctn_dt"].dt.to_period("M").dt.to_timestamp())
  .sort_values("trd_exctn_dt")
  .groupby(["cusip_id", "date"]).tail(1)
  .get(["cusip_id", "date", "avg_yield"]))

bonds_panel = (bonds
  .merge(trace_aggregated, how="inner", on="cusip_id")
  .dropna()
  .assign(
    post_period=lambda x: x["date"] >= treatment_date.replace(day=1),
    treated=lambda x: x["polluter"] & (x["date"] >= treatment_date.replace(day=1))))

The DiD regression

The estimate is the coefficient on treated, run two ways: a pooled regression with controls, and a two-way fixed-effects version with bond and month fixed effects (which absorb time-invariant bond traits and common monthly shocks). A negative coefficient would say polluters' yields rose less — or fell — relative to others after Paris.

model_without_fe <- feols(
  avg_yield ~ treated + post_period + polluter +
    log_offering_amt + time_to_maturity,
  vcov = "iid", data = bonds_panel
)

model_with_fe <- feols(
  avg_yield ~ treated | cusip_id + date,
  vcov = "iid", data = bonds_panel
)

etable(model_without_fe, model_with_fe, coefstat = "tstat", digits = 3)
model_without_fe = lm.PanelOLS.from_formula(
    "avg_yield ~ 1 + treated + post_period + polluter + log_offering_amt + time_to_maturity",
    data=bonds_panel.set_index(["cusip_id", "date"])
).fit()

model_with_fe = lm.PanelOLS.from_formula(
    "avg_yield ~ treated + EntityEffects + TimeEffects",
    data=bonds_panel.set_index(["cusip_id", "date"])
).fit()

print(model_without_fe.summary)
print(model_with_fe.summary)

The credibility of DiD rests on parallel pre-trends. An event-study specification interacts the polluter flag with each month (relative to treatment), so plotting those coefficients with confidence bands shows whether treated and control bonds tracked together before the event and diverged after. A flat, near-zero pre-period is the visual test.

model_without_fe_time <- feols(
  avg_yield ~ polluter + date:polluter + time_to_maturity + log_offering_amt,
  vcov = "iid",
  data = bonds_panel |> mutate(date = factor(date))
)

model_without_fe_coefs <- tidy(model_without_fe_time) |>
  filter(str_detect(term, "date")) |>
  mutate(
    date = ymd(substr(term, nchar(term) - 9, nchar(term))),
    treatment = str_detect(term, "TRUE"),
    ci_up = estimate + qnorm(0.975) * std.error,
    ci_low = estimate + qnorm(0.025) * std.error
  )

model_without_fe_coefs |>
  ggplot(aes(date, color = treatment, shape = treatment)) +
  geom_vline(aes(xintercept = floor_date(treatment_date, "month")), linetype = "dashed") +
  geom_hline(aes(yintercept = 0), linetype = "dashed") +
  geom_errorbar(aes(ymin = ci_low, ymax = ci_up), alpha = 0.5) +
  geom_point(aes(y = estimate)) +
  labs(x = NULL, y = "Yield", color = "Polluter?", shape = "Polluter?",
       title = "Polluters respond stronger to Paris Agreement than green firms")
bonds_panel_event = bonds_panel.copy()
bonds_panel_event["rel_month"] = (
    (bonds_panel_event["date"].dt.year - treatment_date.year) * 12
    + (bonds_panel_event["date"].dt.month - treatment_date.month))

event_dummies = pd.get_dummies(bonds_panel_event["rel_month"], prefix="rel")
event_data = pd.concat(
    [bonds_panel_event, event_dummies.mul(bonds_panel_event["polluter"], axis=0)], axis=1)

terms = [c for c in event_dummies.columns if c != "rel_0"]
model_event = lm.PanelOLS.from_formula(
    "avg_yield ~ " + " + ".join(terms) + " + EntityEffects + TimeEffects",
    data=event_data.set_index(["cusip_id", "date"])
).fit(cov_type="clustered", cluster_entity=True)
print(model_event.summary)

A caution the chapter stresses: when treatment is staggered across units at different times, the simple two-way fixed-effects estimator can be biased, and newer estimators built for staggered adoption should be used instead.


Study notes following the Tidy Finance curriculum by Scheuch, Voigt, Weiss, and Frey. Prose is my own; the R/Python code is reproduced from the book's open-source source, licensed CC BY-NC-SA 4.0.