Replicating Fama–French Factors

Jun He May 31, 2026

This chapter reconstructs the SMB and HML factors from scratch, following the Fama–French recipe, then checks the home-grown factors against the published series. It is a stress test of the whole data pipeline: the construction details (NYSE breakpoints, June rebalancing, accounting lags) are exactly where replications drift. Each step is shown in R first and then in Python.

library(tidyverse)
library(tidyfinance)

import pandas as pd
import numpy as np
import sqlite3

Data preparation

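We load returns, the accounting variables (book equity, operating profitability, investment), and the published SMB/HML to compare against. Only book equity enters the SMB and HML construction below; operating profitability and investment are loaded but not used in this section.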

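# Open the R database. The file path is an assumption that mirrors the Python
# file name used below; extended_types = TRUE returns date columns as dates.
tidy_finance <- DBI::dbConnect(
  RSQLite::SQLite(), "data/tidy_finance_r.sqlite", extended_types = TRUE)

crsp_monthly <- tbl(tidy_finance, "crsp_monthly") |>
  select(permno, gvkey, date, ret_excess, mktcap, mktcap_lag, exchange) |>
  collect()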

compustat <- tbl(tidy_finance, "compustat") |>
  select(gvkey, datadate, be, op, inv) |>
  collect()

factors_ff3_monthly <- tbl(tidy_finance, "factors_ff3_monthly") |>
  select(date, smb, hml) |>
  collect()

tidy_finance = sqlite3.connect(database="data/tidy_finance_python.sqlite")

crsp_monthly = pd.read_sql_query(
  "SELECT permno, gvkey, date, ret_excess, mktcap, mktcap_lag, exchange FROM crsp_monthly",
  tidy_finance, parse_dates={"date"})

compustat = pd.read_sql_query(
  "SELECT gvkey, datadate, be, op, inv FROM compustat",
  tidy_finance, parse_dates={"datadate"})

factors_ff3_monthly = pd.read_sql_query(
  "SELECT date, smb, hml FROM factors_ff3_monthly",
  tidy_finance, parse_dates={"date"})

Sorting variables and the timing convention

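The Fama–French timing is precise: size is market equity in June of year t; book-to-market uses book equity from the fiscal year ending in t−1 divided by market equity from December of t−1. Portfolios are formed at the end of June and held for twelve months, from July of t through June of t+1. In the code, every input is stamped with a common sorting_date of July 1 of year t, which works only if monthly dates are recorded at the start of the month. Encoding these sorting dates correctly is the crux of the replication.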

size <- crsp_monthly |>
  filter(month(date) == 6) |>
  mutate(sorting_date = date %m+% months(1)) |>
  select(permno, exchange, sorting_date, size = mktcap)

market_equity <- crsp_monthly |>
  filter(month(date) == 12) |>
  mutate(sorting_date = date %m+% months(7)) |>
  select(permno, gvkey, sorting_date, me = mktcap)

book_to_market <- compustat |>
  mutate(sorting_date = ymd(str_c(year(datadate) + 1, "0701"))) |>
  select(gvkey, sorting_date, be) |>
  inner_join(market_equity, join_by(gvkey, sorting_date)) |>
  mutate(bm = be / me) |>
  # Keep gvkey: the portfolio assignment and the return join below both use it.
  select(permno, gvkey, sorting_date, me, bm)

sorting_variables <- size |>
  inner_join(book_to_market, join_by(permno, sorting_date)) |>
  drop_na() |>
  distinct(permno, sorting_date, .keep_all = TRUE)

size = (crsp_monthly
  .query("date.dt.month == 6")
  .assign(sorting_date=lambda x: x["date"] + pd.DateOffset(months=1))
  .rename(columns={"mktcap": "size"})
  .get(["permno", "exchange", "sorting_date", "size"]))

market_equity = (crsp_monthly
  .query("date.dt.month == 12")
  .assign(sorting_date=lambda x: x["date"] + pd.DateOffset(months=7))
  .rename(columns={"mktcap": "me"})
  .get(["permno", "gvkey", "sorting_date", "me"]))

book_to_market = (compustat
  .assign(sorting_date=lambda x: pd.to_datetime(
    (x["datadate"].dt.year + 1).astype(str) + "0701", format="%Y%m%d"))
  .get(["gvkey", "sorting_date", "be"])
  .merge(market_equity, how="inner", on=["gvkey", "sorting_date"])
  .assign(bm=lambda x: x["be"] / x["me"])
  # Keep gvkey: the portfolio assignment and the return merge below both use it.
  .get(["permno", "gvkey", "sorting_date", "me", "bm"]))

sorting_variables = (size
  .merge(book_to_market, how="inner", on=["permno", "sorting_date"])
  .dropna()
  .drop_duplicates(subset=["permno", "sorting_date"]))
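The timing convention can be checked on toy dates. The sketch below is illustrative only: the three dates are made up for formation year t = 2001, and it assumes (as the code above implicitly does) that monthly dates are stamped on the first day of the month.

```python
import pandas as pd

# Toy dates for formation year t = 2001 (illustrative, not real data).
# Assumption: monthly dates are stamped on the first day of the month.
june_t = pd.Timestamp("2001-06-01")              # size: market equity in June of t
december_t_minus_1 = pd.Timestamp("2000-12-01")  # market equity in December of t-1
fiscal_year_end = pd.Timestamp("2000-09-30")     # book equity, fiscal year ending in t-1

sorting_dates = {
    "size": june_t + pd.DateOffset(months=1),
    "market_equity": december_t_minus_1 + pd.DateOffset(months=7),
    "book_equity": pd.to_datetime(
        str(fiscal_year_end.year + 1) + "0701", format="%Y%m%d"),
}

for name, value in sorting_dates.items():
    print(name, value.date())

# Caveat: with month-end stamps the same offset does not land on July 1,
# so the inner joins on sorting_date would find no matches.
month_end_june = pd.Timestamp("2001-06-30") + pd.DateOffset(months=1)
print(month_end_june.date())
```

All three inputs land on the same sorting date, which is what lets the inner joins line them up. If the data were stamped at month end, the offsets would need to be replaced by an explicit July 1 construction.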

NYSE breakpoints

The sorting helper computes breakpoints using only NYSE stocks, a defining Fama–French choice that prevents the many small NASDAQ stocks from dominating the cutoffs. The breakpoints are estimated on NYSE stocks but then applied to every stock in the sample. Size splits at the median (2 groups); book-to-market splits at the 30th and 70th percentiles (3 groups).

assign_portfolio <- function(data, sorting_variable, percentiles, exchanges) {
  breakpoints <- data |>
    filter(exchange %in% exchanges) |>
    pull({{ sorting_variable }}) |>
    quantile(probs = percentiles, na.rm = TRUE, names = FALSE)
  data |>
    mutate(portfolio = findInterval(
      pick(everything()) |> pull({{ sorting_variable }}),
      breakpoints, all.inside = TRUE)) |>
    pull(portfolio)
}

portfolios <- sorting_variables |>
  group_by(sorting_date) |>
  mutate(
    portfolio_size = assign_portfolio(pick(everything()), size, c(0, 0.5, 1), c("NYSE")),
    portfolio_bm = assign_portfolio(pick(everything()), bm, c(0, 0.3, 0.7, 1), c("NYSE"))
  ) |>
  select(permno, gvkey, sorting_date, portfolio_size, portfolio_bm)

def assign_portfolio(data, sorting_variable, percentiles, exchanges):
    breakpoints = (data
      .query("exchange in @exchanges")
      .get(sorting_variable)
      .quantile(percentiles, interpolation="linear"))
    breakpoints.iloc[0] = -np.inf
    breakpoints.iloc[breakpoints.size - 1] = np.inf
    return pd.cut(
      data[sorting_variable], bins=breakpoints,
      labels=range(1, breakpoints.size), include_lowest=True, right=False)

portfolios = (sorting_variables
  .groupby("sorting_date")
  .apply(lambda x: x.assign(
      portfolio_size=assign_portfolio(x, "size", [0, 0.5, 1], ["NYSE"]),
      portfolio_bm=assign_portfolio(x, "bm", [0, 0.3, 0.7, 1], ["NYSE"])))
  .reset_index(drop=True)
  .get(["permno", "gvkey", "sorting_date", "portfolio_size", "portfolio_bm"]))
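To see why the choice of breakpoint sample matters, here is a minimal sketch on a made-up cross-section of four NYSE stocks and six much smaller NASDAQ stocks. The numbers and the column names `portfolio_nyse` and `portfolio_all` are hypothetical and serve only to contrast the two cutoffs.

```python
import numpy as np
import pandas as pd

# Toy cross-section (made-up market caps): 4 NYSE stocks, 6 small NASDAQ stocks.
stocks = pd.DataFrame({
    "exchange": ["NYSE"] * 4 + ["NASDAQ"] * 6,
    "size": [100.0, 200.0, 300.0, 400.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Median computed from NYSE stocks only versus from the full sample.
nyse_median = stocks.loc[stocks["exchange"] == "NYSE", "size"].median()
all_median = stocks["size"].median()

# Stocks below the cutoff are "small" (1), the rest are "big" (2).
stocks["portfolio_nyse"] = np.where(stocks["size"] < nyse_median, 1, 2)
stocks["portfolio_all"] = np.where(stocks["size"] < all_median, 1, 2)

big_counts = {
    "nyse_breakpoint": int((stocks["portfolio_nyse"] == 2).sum()),
    "all_stock_breakpoint": int((stocks["portfolio_all"] == 2).sum()),
}
print(nyse_median, all_median, big_counts)
```

With the all-stock median, the cutoff is set by the tiny stocks and every NYSE stock ends up in the big group. With the NYSE median, the cutoff sits among the larger stocks and the big group holds only the two largest.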

Building SMB and HML

The six size×value portfolios give the factors. SMB averages the small portfolios minus the big ones; HML averages the high book-to-market portfolios minus the low ones. Each portfolio return is value-weighted; the factor is constructed so it is roughly neutral to the other dimension.
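Written out, the two averages are as follows. The notation is introduced here for illustration: r_t^{XY} is the value-weighted return in month t of the portfolio with size group X (S = small, B = big) and book-to-market group Y (L = low, M = medium, H = high).

```latex
\begin{aligned}
\mathrm{SMB}_t &= \tfrac{1}{3}\left(r_t^{SL} + r_t^{SM} + r_t^{SH}\right)
                - \tfrac{1}{3}\left(r_t^{BL} + r_t^{BM} + r_t^{BH}\right), \\
\mathrm{HML}_t &= \tfrac{1}{2}\left(r_t^{SH} + r_t^{BH}\right)
                - \tfrac{1}{2}\left(r_t^{SL} + r_t^{BL}\right).
\end{aligned}
```

Each leg of SMB contains one portfolio from every book-to-market group, and each leg of HML contains one small and one big portfolio, which is what makes each factor roughly neutral to the other dimension.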

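# Attach each monthly return to the most recent formation date at or before it,
# then keep the twelve months after formation (July of t through June of t+1).
# Starting from crsp_monthly matters: closest() keeps one match per row of the
# left table, so starting from portfolios would keep a single month per year.
portfolios <- crsp_monthly |>
  inner_join(portfolios, join_by(permno, gvkey, closest(date >= sorting_date))) |>
  filter(date < sorting_date %m+% months(12))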

factors_replicated <- portfolios |>
  group_by(portfolio_size, portfolio_bm, date) |>
  summarize(ret = weighted.mean(ret_excess, mktcap_lag), .groups = "drop") |>
  group_by(date) |>
  summarize(
    smb_replicated = mean(ret[portfolio_size == 1]) - mean(ret[portfolio_size == 2]),
    hml_replicated = mean(ret[portfolio_bm == 3]) - mean(ret[portfolio_bm == 1])
  )

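# Pair every monthly return with its stock's formation dates, then keep the
# twelve months starting at each formation date (July of t through June of t+1).
# A boolean mask replaces .query(), which cannot resolve `pd` inside the string.
portfolios = (portfolios
  .merge(crsp_monthly, how="inner", on=["permno", "gvkey"])
  .loc[lambda x: (x["date"] >= x["sorting_date"])
       & (x["date"] < x["sorting_date"] + pd.DateOffset(months=12))])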

factors_replicated = (portfolios
  .groupby(["portfolio_size", "portfolio_bm", "date"])
  .apply(lambda x: np.average(x["ret_excess"], weights=x["mktcap_lag"]))
  .reset_index(name="ret")
  .groupby("date")
  .apply(lambda x: pd.Series({
    "smb_replicated": x.loc[x["portfolio_size"] == 1, "ret"].mean() - x.loc[x["portfolio_size"] == 2, "ret"].mean(),
    "hml_replicated": x.loc[x["portfolio_bm"] == 3, "ret"].mean() - x.loc[x["portfolio_bm"] == 1, "ret"].mean()
  }))
  .reset_index())
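The arithmetic behind the two aggregation steps can be traced by hand on made-up numbers. The sketch below is illustrative only: the returns and market caps are invented, and it follows the same portfolio coding as the code above (size 1 = small, 2 = big; book-to-market 1 = low, 3 = high).

```python
import numpy as np
import pandas as pd

# Step 1: value-weighted return of one toy portfolio with two stocks.
# Weights are lagged market caps, as in the code above.
returns = np.array([0.02, -0.01])
mktcap_lag = np.array([300.0, 100.0])
value_weighted = np.average(returns, weights=mktcap_lag)  # (0.02*300 - 0.01*100) / 400

# Step 2: made-up monthly returns for the six size x value portfolios.
six = pd.DataFrame({
    "portfolio_size": [1, 1, 1, 2, 2, 2],
    "portfolio_bm": [1, 2, 3, 1, 2, 3],
    "ret": [0.010, 0.020, 0.030, 0.000, 0.010, 0.020],
})

# SMB: mean of the three small portfolios minus mean of the three big ones.
smb = (six.loc[six["portfolio_size"] == 1, "ret"].mean()
       - six.loc[six["portfolio_size"] == 2, "ret"].mean())
# HML: mean of the two high-B/M portfolios minus mean of the two low-B/M ones.
hml = (six.loc[six["portfolio_bm"] == 3, "ret"].mean()
       - six.loc[six["portfolio_bm"] == 1, "ret"].mean())

print(round(value_weighted, 4), round(smb, 4), round(hml, 4))
```

The medium book-to-market portfolios enter SMB but drop out of HML, so the two factors average over different numbers of portfolios.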

Evaluating the replication

The test regresses the published factor on the replicated one. An intercept near zero, a slope near one, and a high R² indicate that the pipeline is faithful. The replication usually gets very close but is rarely identical, which is itself an honest measure of how sensitive these benchmarks are to implementation choices.

test <- factors_replicated |>
  inner_join(factors_ff3_monthly, join_by(date)) |>
  mutate(across(c(smb_replicated, hml_replicated), ~ round(., 4)))

summary(lm(smb ~ smb_replicated, data = test))
summary(lm(hml ~ hml_replicated, data = test))

import statsmodels.formula.api as smf

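test = (factors_replicated
  .merge(factors_ff3_monthly, how="inner", on="date")
  # Round the replicated factors to four decimals, as in the R version.
  .round({"smb_replicated": 4, "hml_replicated": 4}))

# print() so the summaries also appear when run as a script.
print(smf.ols("smb ~ smb_replicated", data=test).fit().summary())
print(smf.ols("hml ~ hml_replicated", data=test).fit().summary())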
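What the regression output should look like for a faithful replication can be sketched with synthetic data. The two series below are random stand-ins, not real factor returns: the "published" series is simply the "replicated" one plus small noise, and the noise levels are arbitrary assumptions.

```python
import numpy as np

# Synthetic stand-ins (NOT real factor data): a "replicated" series and a
# "published" series equal to it plus small noise.
rng = np.random.default_rng(0)
replicated = rng.normal(0.0, 0.03, size=240)
published = replicated + rng.normal(0.0, 0.002, size=240)

# OLS of published on replicated: published = intercept + slope * replicated.
# np.polyfit returns coefficients from the highest degree down.
slope, intercept = np.polyfit(replicated, published, deg=1)
fitted = intercept + slope * replicated
r_squared = 1.0 - (np.sum((published - fitted) ** 2)
                   / np.sum((published - published.mean()) ** 2))

print(f"intercept={intercept:.4f}, slope={slope:.3f}, R2={r_squared:.3f}")
```

A slope well below one, an intercept away from zero, or a low R² on the real data would point to a systematic difference in construction, such as the breakpoint sample or the timing of the accounting data.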

Study notes following the Tidy Finance curriculum by Scheuch, Voigt, Weiss, and Frey. Prose is my own; the R/Python code is reproduced from the book's open-source code, licensed CC BY-NC-SA 4.0.