Size Sorts and p-Hacking

Jun He May 31, 2026

This chapter sorts on firm size and uses the exercise to show how many defensible choices stand between raw data and a published number. By wrapping portfolio construction in a function with switches for breakpoints, exchange filters, and weighting, we can see how the "size premium" moves across specifications — a hands-on lesson in p-hacking and researcher degrees of freedom. Use the R | Python toggle to switch.

library(tidyverse)
library(tidyfinance)

import pandas as pd
import numpy as np
import sqlite3

Data preparation

We load the full monthly CRSP panel and the market factor.

crsp_monthly <- tbl(tidy_finance, "crsp_monthly") |>
  collect()

factors_ff3_monthly <- tbl(tidy_finance, "factors_ff3_monthly") |>
  select(date, mkt_excess) |>
  collect()

tidy_finance = sqlite3.connect(database="data/tidy_finance_python.sqlite")

crsp_monthly = pd.read_sql_query(
  sql="SELECT * FROM crsp_monthly", con=tidy_finance, parse_dates={"date"}
)

factors_ff3_monthly = pd.read_sql_query(
  sql="SELECT date, mkt_excess FROM factors_ff3_monthly",
  con=tidy_finance, parse_dates={"date"}
)

The size distribution

Market cap is extremely skewed — a handful of giant firms dominate. Looking at the top quantiles of market cap each month makes the concentration concrete and motivates why the weighting scheme matters so much.

crsp_monthly |>
  group_by(date) |>
  summarize(
    top01 = quantile(mktcap, 0.99),
    top05 = quantile(mktcap, 0.95),
    top10 = quantile(mktcap, 0.90),
    top25 = quantile(mktcap, 0.75)
  )

(crsp_monthly
  .groupby("date")
  .apply(lambda x: x["mktcap"].quantile([0.99, 0.95, 0.90, 0.75]))
)

Flexible breakpoints

We generalize the sorting helper to accept arbitrary percentile breakpoints, then form size portfolios on lagged market cap. Using all-stock vs. NYSE-only breakpoints is one of the choices that quietly changes the result, because the size distribution is so skewed.

assign_portfolio <- function(data, sorting_variable, percentiles) {
  breakpoints <- data |>
    pull({{ sorting_variable }}) |>
    quantile(probs = percentiles, na.rm = TRUE, names = FALSE)
  data |>
    mutate(portfolio = findInterval(
      pick(everything()) |> pull({{ sorting_variable }}),
      breakpoints, all.inside = TRUE
    )) |>
    pull(portfolio)
}

returns_size <- crsp_monthly |>
  group_by(date) |>
  mutate(portfolio = assign_portfolio(
    data = pick(everything()),
    sorting_variable = mktcap_lag,
    percentiles = seq(0, 1, 0.2)
  )) |>
  group_by(portfolio, date) |>
  summarize(ret = weighted.mean(ret_excess, mktcap_lag), .groups = "drop")

def assign_portfolio(data, sorting_variable, percentiles):
    breakpoints = np.quantile(
        data[sorting_variable].dropna(), percentiles, method="linear"
    )
    return pd.cut(
        data[sorting_variable], bins=breakpoints,
        labels=range(1, len(breakpoints)), include_lowest=True
    )

returns_size = (crsp_monthly
  .groupby("date")
  .apply(lambda x: x.assign(
      portfolio=assign_portfolio(x, "mktcap_lag", np.arange(0, 1.2, 0.2))))
  .reset_index(drop=True)
  .groupby(["portfolio", "date"])
  .apply(lambda x: np.average(x["ret_excess"], weights=x["mktcap_lag"]))
  .reset_index(name="ret")
)

One function, many specifications

The heart of the chapter: a function that computes the size premium given the number of portfolios, which exchanges to use for breakpoints, and whether to weight equally or by value. Each argument is a defensible modelling choice — and sweeping over them shows how much the headline premium depends on decisions that rarely make it into a paper's main text.

compute_portfolio_returns <- function(n_portfolios = 10,
                                      exchanges = c("NYSE", "NASDAQ", "AMEX"),
                                      value_weighted = TRUE,
                                      data = crsp_monthly) {
  data |>
    group_by(date) |>
    mutate(portfolio = assign_portfolio(
      data = pick(everything()),
      sorting_variable = mktcap_lag,
      percentiles = seq(0, 1, length.out = n_portfolios + 1)
    )) |>
    ungroup() |>
    group_by(portfolio, date) |>
    summarize(
      ret = if_else(value_weighted,
        weighted.mean(ret_excess, mktcap_lag), mean(ret_excess)),
      .groups = "drop"
    ) |>
    group_by(portfolio) |>
    summarize(ret = mean(ret)) |>
    summarize(size_premium = ret[portfolio == min(portfolio)] -
      ret[portfolio == max(portfolio)])
}

def compute_portfolio_returns(n_portfolios=10,
                              exchanges=["NYSE", "NASDAQ", "AMEX"],
                              value_weighted=True,
                              data=crsp_monthly):
    returns = (data
      .groupby("date")
      .apply(lambda x: x.assign(
          portfolio=assign_portfolio(
            x, "mktcap_lag", np.linspace(0, 1, n_portfolios+1))))
      .reset_index(drop=True)
      .groupby(["portfolio", "date"])
      .apply(lambda x:
        np.average(x["ret_excess"], weights=x["mktcap_lag"])
        if value_weighted else x["ret_excess"].mean())
      .reset_index(name="ret")
      .groupby("portfolio")["ret"].mean()
    )
    return returns.iloc[0] - returns.iloc[-1]

Sweeping the garden of forking paths

Building a grid over all the switches and evaluating the premium for each combination produces a distribution of "results" from the same hypothesis. Seeing how widely the premium ranges — sometimes significant, sometimes not — is the chapter's punchline: a finding that survives only one particular combination of choices deserves suspicion. The defense is reporting across specifications, not cherry-picking.

p_hacking_setup <- expand_grid(
  n_portfolios = c(2, 5, 10),
  exchanges = list(c("NYSE"), c("NYSE", "NASDAQ", "AMEX")),
  value_weighted = c(TRUE, FALSE),
  data = list(crsp_monthly)
)

p_hacking_results <- p_hacking_setup |>
  mutate(size_premium = pmap_dbl(
    list(n_portfolios, exchanges, value_weighted, data),
    compute_portfolio_returns
  ))

from itertools import product

p_hacking_setup = list(product(
    [2, 5, 10],
    [["NYSE"], ["NYSE", "NASDAQ", "AMEX"]],
    [True, False],
))

p_hacking_results = pd.DataFrame([
    {"n_portfolios": n, "exchanges": e, "value_weighted": vw,
     "size_premium": compute_portfolio_returns(n, e, vw)}
    for n, e, vw in p_hacking_setup
])

Study notes following the Tidy Finance curriculum by Scheuch, Voigt, Weiss, and Frey. Prose is my own; the R/Python code is reproduced from the book's open-source source, licensed CC BY-NC-SA 4.0.