Fixed Effects and Clustered Standard Errors

This chapter covers the panel-regression mechanics that make firm-level inference credible: fixed effects to absorb unobserved heterogeneity, and clustered standard errors to handle correlated residuals. The running example is a corporate-investment regression (investment on cash flow and Tobin's Q). The code uses fixest in R and linearmodels in Python; each step is shown in R first, then in Python.

# R
library(tidyverse)
library(tidyfinance)
library(fixest)

# Python
import pandas as pd
import numpy as np
import sqlite3
import linearmodels as lm

Building the investment panel

We construct the regression variables from Compustat, plus market capitalization from CRSP: investment (capex over lagged assets), cash flows (operating cash flow over lagged assets), and Tobin's Q (market value of assets over book assets). The dependent variable is next year's investment, so every regressor is measured before the outcome it explains.

# R
# Assumption: the R database sits next to the Python one used below; adjust the path as needed.
tidy_finance <- DBI::dbConnect(
  RSQLite::SQLite(), "data/tidy_finance_r.sqlite", extended_types = TRUE)
crsp_monthly <- tbl(tidy_finance, "crsp_monthly") |> collect()
compustat <- tbl(tidy_finance, "compustat") |> collect()

data_investment <- compustat |>
  mutate(date = floor_date(ymd(datadate), "month")) |>
  # Lagged assets: relabel year t as t + 1, then join back on (gvkey, year)
  left_join(
    compustat |> select(gvkey, year, at_lag = at) |> mutate(year = year + 1),
    join_by(gvkey, year)
  ) |>
  mutate(investment = capx / at_lag, cash_flows = oancf / at_lag) |>
  group_by(gvkey) |>
  # Order by year explicitly so lead() picks the next available fiscal year, not the next row
  mutate(investment_lead = lead(investment, order_by = year)) |>
  ungroup() |>
  left_join(crsp_monthly |> select(gvkey, date, mktcap), join_by(gvkey, date)) |>
  mutate(tobins_q = (mktcap + at - be - txdb) / at) |>
  select(gvkey, year, investment_lead, cash_flows, tobins_q) |>
  drop_na()

# Python
tidy_finance = sqlite3.connect(database="data/tidy_finance_python.sqlite")
crsp_monthly = pd.read_sql_query("SELECT * FROM crsp_monthly", tidy_finance, parse_dates={"date"})
compustat = pd.read_sql_query("SELECT * FROM compustat", tidy_finance, parse_dates={"datadate"})

data_investment = (compustat
  .assign(date=lambda x: x["datadate"].dt.to_period("M").dt.to_timestamp())
  # Lagged assets: relabel year t as t + 1, then merge back on (gvkey, year)
  .merge(compustat.get(["gvkey", "year", "at"])
    .rename(columns={"at": "at_lag"})
    .assign(year=lambda x: x["year"] + 1),
    how="left", on=["gvkey", "year"])
  .assign(
    investment=lambda x: x["capx"] / x["at_lag"],
    cash_flows=lambda x: x["oancf"] / x["at_lag"])
  # Sort before shifting so shift(-1) picks the next available fiscal year within each firm
  .sort_values(["gvkey", "year"])
  .assign(investment_lead=lambda x: x.groupby("gvkey")["investment"].shift(-1))
  .merge(crsp_monthly.get(["gvkey", "date", "mktcap"]), how="left", on=["gvkey", "date"])
  .assign(tobins_q=lambda x: (x["mktcap"] + x["at"] - x["be"] - x["txdb"]) / x["at"])
  .get(["gvkey", "year", "investment_lead", "cash_flows", "tobins_q"])
  .dropna())
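
To see what the lag and lead steps do, here is a minimal sketch on a made-up three-year panel for a single firm. The `toy` frame and all its numbers are hypothetical, not Compustat data. It mirrors the Python pipeline above: lagged assets come from shifting `year` forward by one before merging, and next year's investment comes from `shift(-1)` within firm.

```python
import pandas as pd

# Hypothetical toy panel: one firm, three fiscal years (illustrative numbers only).
toy = pd.DataFrame({
    "gvkey": ["001", "001", "001"],
    "year": [2000, 2001, 2002],
    "at": [100.0, 120.0, 150.0],
    "capx": [10.0, 12.0, 30.0],
})

# Lagged assets: relabel year t as t + 1, then merge back on (gvkey, year).
at_lag = (toy.get(["gvkey", "year", "at"])
  .rename(columns={"at": "at_lag"})
  .assign(year=lambda x: x["year"] + 1))

toy = (toy
  .merge(at_lag, how="left", on=["gvkey", "year"])
  .assign(investment=lambda x: x["capx"] / x["at_lag"])
  .sort_values(["gvkey", "year"])
  .assign(investment_lead=lambda x: x.groupby("gvkey")["investment"].shift(-1)))

print(toy)
```

In this toy panel the first year has no lagged assets and the last year has no lead, so a later `dropna()` on ratios scaled by `at_lag` keeps only the middle year. The same logic costs each firm its first and last year in the real panel.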

Winsorizing

Accounting ratios have extreme outliers that can dominate a regression. Winsorizing caps each variable at its 1st and 99th percentiles, computed over the pooled sample. It is a standard, transparent way to limit the influence of a few extreme firm-years without dropping them.

# R
# Replace values beyond the cut and 1 - cut quantiles with those quantiles
winsorize <- function(x, cut) {
  x <- replace(x, x > quantile(x, 1 - cut, na.rm = TRUE), quantile(x, 1 - cut, na.rm = TRUE))
  x <- replace(x, x < quantile(x, cut, na.rm = TRUE), quantile(x, cut, na.rm = TRUE))
  x
}

data_investment <- data_investment |>
  mutate(across(c(investment_lead, cash_flows, tobins_q), ~ winsorize(., 0.01)))

# Python
# Replace values beyond the cut and 1 - cut quantiles with those quantiles
def winsorize(x, cut):
    tmp = x.copy()
    upper = np.nanquantile(x, 1 - cut)
    lower = np.nanquantile(x, cut)
    tmp[tmp > upper] = upper
    tmp[tmp < lower] = lower
    return tmp

data_investment = data_investment.assign(
  investment_lead=lambda x: winsorize(x["investment_lead"], 0.01),
  cash_flows=lambda x: winsorize(x["cash_flows"], 0.01),
  tobins_q=lambda x: winsorize(x["tobins_q"], 0.01))
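
A quick check of what winsorizing does, as a sketch on made-up data. The `ratio` series is hypothetical: the values 0 to 100 plus two absurd firm-years. The function is the Python `winsorize` from above, repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def winsorize(x, cut):
    """Cap values below the `cut` quantile and above the `1 - cut` quantile."""
    tmp = x.copy()
    upper = np.nanquantile(x, 1 - cut)
    lower = np.nanquantile(x, cut)
    tmp[tmp > upper] = upper
    tmp[tmp < lower] = lower
    return tmp

# Hypothetical ratio: 0..100 plus two extreme firm-years (illustrative only).
ratio = pd.Series(np.concatenate([np.arange(101.0), [1e6, -1e6]]))
capped = winsorize(ratio, 0.01)

print("observations before/after:", len(ratio), len(capped))
print("max before/after:", ratio.max(), capped.max())
print("median before/after:", ratio.median(), capped.median())
```

The extremes are pulled in to the 1st and 99th percentile values, the number of observations is unchanged, and the middle of the distribution is untouched. That is the sense in which winsorizing limits influence without dropping firm-years.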

OLS, then fixed effects

We start with pooled OLS, then add firm fixed effects (identifying the effect from within-firm variation), then add year fixed effects too (absorbing common shocks). The | gvkey and | gvkey + year syntax in fixest makes the progression compact; linearmodels does the same via the EntityEffects and TimeEffects formula terms, applied to a data frame indexed by firm and year.

# R
# Pooled OLS: no fixed effects
model_ols <- feols(investment_lead ~ cash_flows + tobins_q, data = data_investment)

# Firm fixed effects
model_fe_firm <- feols(
  investment_lead ~ cash_flows + tobins_q | gvkey, data = data_investment)

# Firm and year fixed effects
model_fe_firmyear <- feols(
  investment_lead ~ cash_flows + tobins_q | gvkey + year, data = data_investment)

# Python
# PanelOLS expects a (firm, year) MultiIndex
data_panel = data_investment.set_index(["gvkey", "year"])

# Pooled OLS: intercept only, no fixed effects
model_ols = lm.PanelOLS.from_formula(
    "investment_lead ~ cash_flows + tobins_q + 1", data=data_panel).fit()

# Firm fixed effects
model_fe_firm = lm.PanelOLS.from_formula(
    "investment_lead ~ cash_flows + tobins_q + EntityEffects", data=data_panel).fit()

# Firm and year fixed effects
model_fe_firmyear = lm.PanelOLS.from_formula(
    "investment_lead ~ cash_flows + tobins_q + EntityEffects + TimeEffects",
    data=data_panel).fit()
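
What "identifying the effect from within-firm variation" means can be sketched with plain NumPy on simulated data. Everything here (the data-generating process, the true slope of 0.5, the helper `demean_by`) is a hypothetical illustration, not the libraries' implementation. The sketch compares pooled OLS, OLS on firm-demeaned data (the within estimator), and a regression with one dummy per firm.

```python
import numpy as np

rng = np.random.default_rng(0)
n_firms, n_years = 50, 8
firm = np.repeat(np.arange(n_firms), n_years)

# Simulated data (hypothetical): a firm effect alpha that is correlated with x.
alpha = rng.normal(size=n_firms)
x = alpha[firm] + rng.normal(size=firm.size)
y = 0.5 * x + 2.0 * alpha[firm] + rng.normal(scale=0.1, size=firm.size)

def demean_by(group, v):
    """Subtract each group's mean from its observations."""
    means = np.bincount(group, weights=v) / np.bincount(group)
    return v - means[group]

# Pooled OLS slope (with intercept): biased here because alpha is omitted.
X_pooled = np.column_stack([np.ones_like(x), x])
beta_pooled = np.linalg.lstsq(X_pooled, y, rcond=None)[0][1]

# Within estimator: OLS on firm-demeaned data.
x_w, y_w = demean_by(firm, x), demean_by(firm, y)
beta_within = (x_w @ y_w) / (x_w @ x_w)

# Dummy-variable regression: one indicator per firm, no intercept.
dummies = (firm[:, None] == np.arange(n_firms)[None, :]).astype(float)
beta_lsdv = np.linalg.lstsq(np.column_stack([x, dummies]), y, rcond=None)[0][0]

print("pooled:", beta_pooled)
print("within:", beta_within)
print("firm dummies:", beta_lsdv)
```

The within and dummy-variable slopes coincide up to floating-point error, and both sit close to the simulated slope, while the pooled slope absorbs part of the omitted firm effect. Demeaning is what makes firm fixed effects feasible with thousands of firms, where building one dummy column per firm would be wasteful.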

Clustered standard errors

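Residuals are typically correlated within firms (over time) and within periods (across firms), so standard errors that assume independent residuals tend to be too small. Clustering by firm, and then by firm and year (two-way), allows for these dependencies when estimating the coefficient covariance matrix. Clustering changes only the standard errors; the coefficient estimates are identical to those of the two-way fixed-effects model above. The clustering dimension should follow the suspected correlation structure: cluster by firm when shocks persist within a firm, and add year when common shocks hit many firms at once.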

# R
# Cluster by firm
model_cluster_firm <- feols(
  investment_lead ~ cash_flows + tobins_q | gvkey + year,
  vcov = ~gvkey, data = data_investment)

# Two-way clustering by firm and year
model_cluster_firmyear <- feols(
  investment_lead ~ cash_flows + tobins_q | gvkey + year,
  vcov = ~ gvkey + year, data = data_investment)

# Python
# Cluster by firm
model_cluster_firm = lm.PanelOLS.from_formula(
    "investment_lead ~ cash_flows + tobins_q + EntityEffects + TimeEffects",
    data=data_panel).fit(cov_type="clustered", cluster_entity=True)

# Two-way clustering by firm and year
model_cluster_firmyear = lm.PanelOLS.from_formula(
    "investment_lead ~ cash_flows + tobins_q + EntityEffects + TimeEffects",
    data=data_panel).fit(cov_type="clustered", cluster_entity=True, cluster_time=True)
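
The mechanics behind clustering can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not what fixest or linearmodels compute exactly: the helper `ols_vcov` is hypothetical, the data are simulated with a persistent firm component in both the regressor and the error, and no small-sample corrections are applied, so the numbers would not match library output. The two-way matrix uses the inclusion-exclusion rule (firm plus year minus their intersection), where each firm-year intersection is a single observation.

```python
import numpy as np

def ols_vcov(X, y, clusters=None):
    """OLS with an iid or one-way cluster-robust covariance (no small-sample correction)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    n, k = X.shape
    if clusters is None:
        vcov = XtX_inv * (resid @ resid) / (n - k)
    else:
        meat = np.zeros((k, k))
        for g in np.unique(clusters):
            in_g = clusters == g
            score = X[in_g].T @ resid[in_g]  # sum of x_i * u_i within cluster g
            meat += np.outer(score, score)
        vcov = XtX_inv @ meat @ XtX_inv
    return beta, vcov

# Simulated panel (hypothetical): firm-level persistence in both x and the error.
rng = np.random.default_rng(1)
n_firms, n_years = 200, 10
firm = np.repeat(np.arange(n_firms), n_years)
year = np.tile(np.arange(n_years), n_firms)
x = rng.normal(size=n_firms)[firm] + 0.5 * rng.normal(size=firm.size)
u = rng.normal(size=n_firms)[firm] + 0.5 * rng.normal(size=firm.size)
y = 1.0 * x + u
X = np.column_stack([np.ones_like(x), x])

beta, v_iid = ols_vcov(X, y)
_, v_firm = ols_vcov(X, y, clusters=firm)
_, v_year = ols_vcov(X, y, clusters=year)
_, v_obs = ols_vcov(X, y, clusters=np.arange(firm.size))  # firm-year intersections
v_twoway = v_firm + v_year - v_obs

se_iid = np.sqrt(np.diag(v_iid))
se_firm = np.sqrt(np.diag(v_firm))
se_twoway = np.sqrt(np.diag(v_twoway))
print("slope:", beta[1])
print("SE iid / firm / two-way:", se_iid[1], se_firm[1], se_twoway[1])
```

In this simulation the firm-clustered standard error on the slope is clearly larger than the iid one, because both the regressor and the residual persist within firm. The point estimate is the same in every case; only the covariance matrix changes.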

Study notes following the Tidy Finance curriculum by Scheuch, Voigt, Weiss, and Frey. Prose is my own; the R/Python code is reproduced from the book's open-source source, licensed CC BY-NC-SA 4.0.