Replicating Fama–French Factors
This chapter reconstructs the SMB and HML factors from scratch, following the Fama–French recipe, then checks the home-grown factors against the published series. It is a stress test of the whole data pipeline: the construction details (NYSE breakpoints, June rebalancing, accounting lags) are exactly where replications drift. Each step is shown in R first, then in Python.
library(tidyverse)
library(tidyfinance)
import pandas as pd
import numpy as np
import sqlite3
Data preparation
We load returns, the accounting variables needed for the factors (book equity, operating profitability, investment), and the published SMB/HML to compare against.
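# Connect to the SQLite database; the file name is assumed by analogy with the Python path used below
tidy_finance <- DBI::dbConnect(RSQLite::SQLite(), "data/tidy_finance_r.sqlite", extended_types = TRUE)
crsp_monthly <- tbl(tidy_finance, "crsp_monthly") |>
select(permno, gvkey, date, ret_excess, mktcap, mktcap_lag, exchange) |>
collect()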
compustat <- tbl(tidy_finance, "compustat") |>
select(gvkey, datadate, be, op, inv) |>
collect()
factors_ff3_monthly <- tbl(tidy_finance, "factors_ff3_monthly") |>
select(date, smb, hml) |>
collect()
tidy_finance = sqlite3.connect(database="data/tidy_finance_python.sqlite")
crsp_monthly = pd.read_sql_query(
"SELECT permno, gvkey, date, ret_excess, mktcap, mktcap_lag, exchange FROM crsp_monthly",
tidy_finance, parse_dates={"date"})
compustat = pd.read_sql_query(
"SELECT gvkey, datadate, be, op, inv FROM compustat",
tidy_finance, parse_dates={"datadate"})
factors_ff3_monthly = pd.read_sql_query(
"SELECT date, smb, hml FROM factors_ff3_monthly",
tidy_finance, parse_dates={"date"})
Sorting variables and the timing convention
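The Fama–French timing is precise: size is market equity in June of year t; book-to-market uses book equity from the fiscal year ending in t−1 divided by market equity from December of t−1. Portfolios are formed at the end of June and held for twelve months, from July of t through June of t+1, so every sorting_date below is the first of July of year t (assuming the monthly dates are stamped at the start of the month). Encoding these dates correctly is the crux of the replication.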
size <- crsp_monthly |>
filter(month(date) == 6) |>
mutate(sorting_date = date %m+% months(1)) |>
select(permno, exchange, sorting_date, size = mktcap)
market_equity <- crsp_monthly |>
filter(month(date) == 12) |>
mutate(sorting_date = date %m+% months(7)) |>
select(permno, gvkey, sorting_date, me = mktcap)
book_to_market <- compustat |>
mutate(sorting_date = ymd(str_c(year(datadate) + 1, "0701"))) |>
select(gvkey, sorting_date, be) |>
inner_join(market_equity, join_by(gvkey, sorting_date)) |>
mutate(bm = be / me) |>
select(permno, gvkey, sorting_date, me, bm)
sorting_variables <- size |>
inner_join(book_to_market, join_by(permno, sorting_date)) |>
drop_na() |>
distinct(permno, sorting_date, .keep_all = TRUE)
size = (crsp_monthly
.query("date.dt.month == 6")
.assign(sorting_date=lambda x: x["date"] + pd.DateOffset(months=1))
.rename(columns={"mktcap": "size"})
.get(["permno", "exchange", "sorting_date", "size"]))
market_equity = (crsp_monthly
.query("date.dt.month == 12")
.assign(sorting_date=lambda x: x["date"] + pd.DateOffset(months=7))
.rename(columns={"mktcap": "me"})
.get(["permno", "gvkey", "sorting_date", "me"]))
book_to_market = (compustat
.assign(sorting_date=lambda x: pd.to_datetime(
(x["datadate"].dt.year + 1).astype(str) + "0701", format="%Y%m%d"))
.get(["gvkey", "sorting_date", "be"])
.merge(market_equity, how="inner", on=["gvkey", "sorting_date"])
.assign(bm=lambda x: x["be"] / x["me"])
.get(["permno", "gvkey", "sorting_date", "me", "bm"]))
sorting_variables = (size
.merge(book_to_market, how="inner", on=["permno", "sorting_date"])
.dropna()
.drop_duplicates(subset=["permno", "sorting_date"]))
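The timing convention above can be sanity-checked with a small standalone helper. This is only a sketch: the function names `sorting_date_for` and `accounting_year_for` are hypothetical and not part of the book's code, and month-start date stamps are assumed.

```python
from datetime import date


def sorting_date_for(return_month: date) -> date:
    """Map a return month to the July 1 sorting date it belongs to.

    Returns from July of year t through June of t+1 use the sort
    formed at the end of June of year t.
    """
    year = return_month.year if return_month.month >= 7 else return_month.year - 1
    return date(year, 7, 1)


def accounting_year_for(sort_date: date) -> int:
    """Calendar year (t-1) of the fiscal year end and of the December
    market equity that feed the book-to-market ratio sorted on July 1 of t."""
    return sort_date.year - 1


# Walk through one holding period to see which sort each month uses.
for month in [date(2020, 6, 1), date(2020, 7, 1), date(2021, 6, 1), date(2021, 7, 1)]:
    sort = sorting_date_for(month)
    print(month, "uses sort", sort, "with accounting year", accounting_year_for(sort))
```

A return in June 2021 still belongs to the sort of July 2020, which in turn relies on accounting data for fiscal years ending in 2019. Misplacing either lag by a year is an easy way for a replication to drift.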
NYSE breakpoints
The sorting helper computes breakpoints using only NYSE stocks — a defining Fama–French choice that prevents the many small NASDAQ stocks from dominating the cutoffs. Size splits at the median (2 groups); book-to-market splits at the 30th and 70th percentiles (3 groups).
assign_portfolio <- function(data, sorting_variable, percentiles, exchanges) {
breakpoints <- data |>
filter(exchange %in% exchanges) |>
pull({{ sorting_variable }}) |>
quantile(probs = percentiles, na.rm = TRUE, names = FALSE)
data |>
mutate(portfolio = findInterval(
pick(everything()) |> pull({{ sorting_variable }}),
breakpoints, all.inside = TRUE)) |>
pull(portfolio)
}
portfolios <- sorting_variables |>
group_by(sorting_date) |>
mutate(
portfolio_size = assign_portfolio(pick(everything()), size, c(0, 0.5, 1), c("NYSE")),
portfolio_bm = assign_portfolio(pick(everything()), bm, c(0, 0.3, 0.7, 1), c("NYSE"))
) |>
select(permno, gvkey, sorting_date, portfolio_size, portfolio_bm)
def assign_portfolio(data, sorting_variable, percentiles, exchanges):
    # Breakpoints come only from stocks listed on the given exchanges
    breakpoints = (data
        .query("exchange in @exchanges")
        .get(sorting_variable)
        .quantile(percentiles, interpolation="linear"))
    # Open the outer bins so stocks outside the NYSE range are still assigned
    breakpoints.iloc[0] = -np.inf
    breakpoints.iloc[breakpoints.size - 1] = np.inf
    return pd.cut(
        data[sorting_variable], bins=breakpoints,
        labels=range(1, breakpoints.size), include_lowest=True, right=False)
portfolios = (sorting_variables
.groupby("sorting_date")
.apply(lambda x: x.assign(
portfolio_size=assign_portfolio(x, "size", [0, 0.5, 1], ["NYSE"]),
portfolio_bm=assign_portfolio(x, "bm", [0, 0.3, 0.7, 1], ["NYSE"])))
.reset_index(drop=True)
.get(["permno", "gvkey", "sorting_date", "portfolio_size", "portfolio_bm"]))
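To see why the choice of breakpoint universe matters, the following toy sketch compares a size split based on the NYSE median with one based on the median of all stocks. The tickers and market caps are made up for illustration, and `size_groups` is a hypothetical helper, not the book's `assign_portfolio`.

```python
import bisect
import statistics

# Hypothetical toy cross-section: (ticker, exchange, market cap in $ millions)
stocks = [
    ("A", "NYSE", 100.0), ("B", "NYSE", 400.0), ("C", "NYSE", 900.0),
    ("D", "NASDAQ", 5.0), ("E", "NASDAQ", 10.0),
    ("F", "NASDAQ", 20.0), ("G", "NASDAQ", 30.0),
]


def size_groups(stocks, breakpoint_exchanges):
    """Assign 1 (small) or 2 (big) using the median of the chosen exchanges."""
    reference = [cap for _, exch, cap in stocks if exch in breakpoint_exchanges]
    cutoff = statistics.median(reference)
    # bisect_right sends a stock exactly at the cutoff to the upper group,
    # mirroring left-closed bins
    groups = {name: 1 + bisect.bisect_right([cutoff], cap) for name, _, cap in stocks}
    return cutoff, groups


nyse_cutoff, nyse_groups = size_groups(stocks, {"NYSE"})
all_cutoff, all_groups = size_groups(stocks, {"NYSE", "NASDAQ"})

print("NYSE cutoff:", nyse_cutoff, nyse_groups)
print("All-stock cutoff:", all_cutoff, all_groups)
```

With the all-stock median, the four tiny NASDAQ names pull the cutoff down so far that every NYSE stock lands in the big group. With the NYSE median, the smallest NYSE stock is classified as small alongside them.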
Building SMB and HML
The six size×value portfolios give the factors. SMB averages the small portfolios minus the big ones; HML averages the high book-to-market portfolios minus the low ones. Each portfolio return is value-weighted; the factor is constructed so it is roughly neutral to the other dimension.
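# Match each assignment to all months from July of t through June of t+1.
# A plain inequality is used because closest() would keep only the first month per assignment.
portfolios <- portfolios |>
inner_join(crsp_monthly, join_by(permno, gvkey, sorting_date <= date)) |>
filter(date < sorting_date %m+% months(12))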
factors_replicated <- portfolios |>
group_by(portfolio_size, portfolio_bm, date) |>
summarize(ret = weighted.mean(ret_excess, mktcap_lag), .groups = "drop") |>
group_by(date) |>
summarize(
smb_replicated = mean(ret[portfolio_size == 1]) - mean(ret[portfolio_size == 2]),
hml_replicated = mean(ret[portfolio_bm == 3]) - mean(ret[portfolio_bm == 1])
)
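# Keep the twelve months from July of t through June of t+1 for each assignment.
# A boolean mask is used because DataFrame.query cannot see pd without the @ prefix.
portfolios = (portfolios
.merge(crsp_monthly, how="inner", on=["permno", "gvkey"])
.loc[lambda x: (x["date"] >= x["sorting_date"])
& (x["date"] < x["sorting_date"] + pd.DateOffset(months=12))])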
factors_replicated = (portfolios
.groupby(["portfolio_size", "portfolio_bm", "date"])
.apply(lambda x: np.average(x["ret_excess"], weights=x["mktcap_lag"]))
.reset_index(name="ret")
.groupby("date")
.apply(lambda x: pd.Series({
"smb_replicated": x.loc[x["portfolio_size"] == 1, "ret"].mean() - x.loc[x["portfolio_size"] == 2, "ret"].mean(),
"hml_replicated": x.loc[x["portfolio_bm"] == 3, "ret"].mean() - x.loc[x["portfolio_bm"] == 1, "ret"].mean()
}))
.reset_index())
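The arithmetic behind the two factors is easy to check by hand for a single month. The sketch below uses made-up portfolio returns, and the helpers `value_weighted` and `smb_hml` are hypothetical names introduced only for illustration.

```python
def value_weighted(returns, lagged_caps):
    """Value-weighted return, with lagged market caps as weights."""
    return sum(r * w for r, w in zip(returns, lagged_caps)) / sum(lagged_caps)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def smb_hml(portfolio_returns):
    """Compute (SMB, HML) from six returns keyed by (portfolio_size, portfolio_bm)."""
    small = mean(r for (s, _), r in portfolio_returns.items() if s == 1)
    big = mean(r for (s, _), r in portfolio_returns.items() if s == 2)
    high = mean(r for (_, b), r in portfolio_returns.items() if b == 3)
    low = mean(r for (_, b), r in portfolio_returns.items() if b == 1)
    return small - big, high - low


# Hypothetical one-month returns: size 1 = small, 2 = big; bm 1 = low, 3 = high
portfolio_returns = {
    (1, 1): 0.010, (1, 2): 0.014, (1, 3): 0.018,
    (2, 1): 0.006, (2, 2): 0.008, (2, 3): 0.013,
}

smb, hml = smb_hml(portfolio_returns)
print(f"SMB = {smb:.4f}, HML = {hml:.4f}")

# A two-stock portfolio: the larger stock dominates the value-weighted return
print(value_weighted([0.10, 0.00], [1.0, 3.0]))
```

SMB averages over all three book-to-market groups within each size group, and HML averages over both size groups within each value group. That averaging is what makes each factor roughly neutral to the other dimension.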
Evaluating the replication
The test regresses the published factor on the replicated one: an intercept near zero, a slope near one, and a high R² mean the pipeline is faithful. The replication usually gets very close but is rarely identical, which is itself an honest measure of how sensitive these benchmarks are to implementation.
test <- factors_replicated |>
inner_join(factors_ff3_monthly, join_by(date)) |>
mutate(across(c(smb_replicated, hml_replicated), ~ round(., 4)))
summary(lm(smb ~ smb_replicated, data = test))
summary(lm(hml ~ hml_replicated, data = test))
import statsmodels.formula.api as smf
test = (factors_replicated
.merge(factors_ff3_monthly, how="inner", on="date"))
smf.ols("smb ~ smb_replicated", data=test).fit().summary()
smf.ols("hml ~ hml_replicated", data=test).fit().summary()
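For readers who want to see what the regression summary boils down to, here is a minimal sketch of the three numbers that matter, computed from first principles. The function `slope_and_r2` is a hypothetical helper and the factor returns are made up, not real data.

```python
def slope_and_r2(published, replicated):
    """OLS of published on replicated with an intercept.

    Returns (intercept, slope, R^2).
    """
    n = len(published)
    mean_x = sum(replicated) / n
    mean_y = sum(published) / n
    sxx = sum((x - mean_x) ** 2 for x in replicated)
    syy = sum((y - mean_y) ** 2 for y in published)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(replicated, published))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = sxy ** 2 / (sxx * syy)  # squared correlation in a one-regressor model
    return intercept, slope, r2


# Hypothetical monthly factor returns (illustrative only)
replicated = [0.010, -0.020, 0.015, 0.000, -0.005]
published = [0.011, -0.019, 0.014, 0.001, -0.006]

intercept, slope, r2 = slope_and_r2(published, replicated)
print(f"intercept = {intercept:.4f}, slope = {slope:.3f}, R2 = {r2:.3f}")
```

A faithful replication shows up as an intercept close to zero, a slope close to one, and an R² close to one. Systematic timing or breakpoint errors tend to lower the R² even when the slope looks reasonable.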
Study notes following the Tidy Finance curriculum by Scheuch, Voigt, Weiss, and Frey. Prose is my own; the R/Python code is reproduced from the book's open-source source, licensed CC BY-NC-SA 4.0.