Size Sorts and p-Hacking
This chapter sorts on firm size and uses the exercise to show how many defensible choices stand between raw data and a published number. By wrapping portfolio construction in a function with switches for breakpoints, exchange filters, and weighting, we can see how the "size premium" moves across specifications — a hands-on lesson in p-hacking and researcher degrees of freedom. Use the R | Python toggle to switch.
library(tidyverse)
library(tidyfinance)
import pandas as pd
import numpy as np
import sqlite3
Data preparation
We load the full monthly CRSP panel and the market factor.
crsp_monthly <- tbl(tidy_finance, "crsp_monthly") |>
collect()
factors_ff3_monthly <- tbl(tidy_finance, "factors_ff3_monthly") |>
select(date, mkt_excess) |>
collect()
tidy_finance = sqlite3.connect(database="data/tidy_finance_python.sqlite")
crsp_monthly = pd.read_sql_query(
sql="SELECT * FROM crsp_monthly", con=tidy_finance, parse_dates={"date"}
)
factors_ff3_monthly = pd.read_sql_query(
sql="SELECT date, mkt_excess FROM factors_ff3_monthly",
con=tidy_finance, parse_dates={"date"}
)
The size distribution
Market cap is extremely skewed — a handful of giant firms dominate. Looking at the top quantiles of market cap each month makes the concentration concrete and motivates why the weighting scheme matters so much.
crsp_monthly |>
group_by(date) |>
summarize(
top01 = quantile(mktcap, 0.99),
top05 = quantile(mktcap, 0.95),
top10 = quantile(mktcap, 0.90),
top25 = quantile(mktcap, 0.75)
)
(crsp_monthly
.groupby("date")
.apply(lambda x: x["mktcap"].quantile([0.99, 0.95, 0.90, 0.75]))
)
Flexible breakpoints
We generalize the sorting helper to accept arbitrary percentile breakpoints, then form size portfolios on lagged market cap. Using all-stock vs. NYSE-only breakpoints is one of the choices that quietly changes the result, because the size distribution is so skewed.
assign_portfolio <- function(data, sorting_variable, percentiles) {
breakpoints <- data |>
pull({{ sorting_variable }}) |>
quantile(probs = percentiles, na.rm = TRUE, names = FALSE)
data |>
mutate(portfolio = findInterval(
pick(everything()) |> pull({{ sorting_variable }}),
breakpoints, all.inside = TRUE
)) |>
pull(portfolio)
}
returns_size <- crsp_monthly |>
group_by(date) |>
mutate(portfolio = assign_portfolio(
data = pick(everything()),
sorting_variable = mktcap_lag,
percentiles = seq(0, 1, 0.2)
)) |>
group_by(portfolio, date) |>
summarize(ret = weighted.mean(ret_excess, mktcap_lag), .groups = "drop")
def assign_portfolio(data, sorting_variable, percentiles):
breakpoints = np.quantile(
data[sorting_variable].dropna(), percentiles, method="linear"
)
return pd.cut(
data[sorting_variable], bins=breakpoints,
labels=range(1, len(breakpoints)), include_lowest=True
)
returns_size = (crsp_monthly
.groupby("date")
.apply(lambda x: x.assign(
portfolio=assign_portfolio(x, "mktcap_lag", np.arange(0, 1.2, 0.2))))
.reset_index(drop=True)
.groupby(["portfolio", "date"])
.apply(lambda x: np.average(x["ret_excess"], weights=x["mktcap_lag"]))
.reset_index(name="ret")
)
One function, many specifications
The heart of the chapter: a function that computes the size premium given the number of portfolios, which exchanges to use for breakpoints, and whether to weight equally or by value. Each argument is a defensible modelling choice — and sweeping over them shows how much the headline premium depends on decisions that rarely make it into a paper's main text.
compute_portfolio_returns <- function(n_portfolios = 10,
exchanges = c("NYSE", "NASDAQ", "AMEX"),
value_weighted = TRUE,
data = crsp_monthly) {
data |>
group_by(date) |>
mutate(portfolio = assign_portfolio(
data = pick(everything()),
sorting_variable = mktcap_lag,
percentiles = seq(0, 1, length.out = n_portfolios + 1)
)) |>
ungroup() |>
group_by(portfolio, date) |>
summarize(
ret = if_else(value_weighted,
weighted.mean(ret_excess, mktcap_lag), mean(ret_excess)),
.groups = "drop"
) |>
group_by(portfolio) |>
summarize(ret = mean(ret)) |>
summarize(size_premium = ret[portfolio == min(portfolio)] -
ret[portfolio == max(portfolio)])
}
def compute_portfolio_returns(n_portfolios=10,
exchanges=["NYSE", "NASDAQ", "AMEX"],
value_weighted=True,
data=crsp_monthly):
returns = (data
.groupby("date")
.apply(lambda x: x.assign(
portfolio=assign_portfolio(
x, "mktcap_lag", np.linspace(0, 1, n_portfolios+1))))
.reset_index(drop=True)
.groupby(["portfolio", "date"])
.apply(lambda x:
np.average(x["ret_excess"], weights=x["mktcap_lag"])
if value_weighted else x["ret_excess"].mean())
.reset_index(name="ret")
.groupby("portfolio")["ret"].mean()
)
return returns.iloc[0] - returns.iloc[-1]
Sweeping the garden of forking paths
Building a grid over all the switches and evaluating the premium for each combination produces a distribution of "results" from the same hypothesis. Seeing how widely the premium ranges — sometimes significant, sometimes not — is the chapter's punchline: a finding that survives only one particular combination of choices deserves suspicion. The defense is reporting across specifications, not cherry-picking.
p_hacking_setup <- expand_grid(
n_portfolios = c(2, 5, 10),
exchanges = list(c("NYSE"), c("NYSE", "NASDAQ", "AMEX")),
value_weighted = c(TRUE, FALSE),
data = list(crsp_monthly)
)
p_hacking_results <- p_hacking_setup |>
mutate(size_premium = pmap_dbl(
list(n_portfolios, exchanges, value_weighted, data),
compute_portfolio_returns
))
from itertools import product
p_hacking_setup = list(product(
[2, 5, 10],
[["NYSE"], ["NYSE", "NASDAQ", "AMEX"]],
[True, False],
))
p_hacking_results = pd.DataFrame([
{"n_portfolios": n, "exchanges": e, "value_weighted": vw,
"size_premium": compute_portfolio_returns(n, e, vw)}
for n, e, vw in p_hacking_setup
])
Study notes following the Tidy Finance curriculum by Scheuch, Voigt, Weiss, and Frey. Prose is my own; the R/Python code is reproduced from the book's open-source source, licensed CC BY-NC-SA 4.0.