Value and Bivariate Sorts

Jun He, May 31, 2026

This chapter sorts on two characteristics at once (size and book-to-market) to study the value premium while controlling for size. It builds the book-to-market ratio with careful accounting lags, then forms portfolios with both independent and dependent (sequential) double sorts. Each step is shown in R first, then in Python.

library(tidyverse)
library(tidyfinance)

import pandas as pd
import numpy as np
import sqlite3

Data preparation

We need market data (CRSP), accounting fundamentals (Compustat), and the monthly Fama–French three-factor data, which include the market factor.

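# Open the SQLite database first; the R file name is an assumption that mirrors the Python path below.
tidy_finance <- DBI::dbConnect(
  RSQLite::SQLite(), "data/tidy_finance_r.sqlite", extended_types = TRUE)

crsp_monthly <- tbl(tidy_finance, "crsp_monthly") |> collect()
compustat <- tbl(tidy_finance, "compustat") |> collect()
factors_ff3_monthly <- tbl(tidy_finance, "factors_ff3_monthly") |> collect()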

tidy_finance = sqlite3.connect(database="data/tidy_finance_python.sqlite")

crsp_monthly = pd.read_sql_query(
  "SELECT * FROM crsp_monthly", tidy_finance, parse_dates={"date"})
compustat = pd.read_sql_query(
  "SELECT * FROM compustat", tidy_finance, parse_dates={"datadate"})
factors_ff3_monthly = pd.read_sql_query(
  "SELECT * FROM factors_ff3_monthly", tidy_finance, parse_dates={"date"})

The book-to-market ratio

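Book equity comes from Compustat; market equity comes from CRSP. The timing is the delicate part. Book-to-market divides book equity by market capitalization in the fiscal-year-end month, and the ratio only becomes a sorting variable six months later (sorting_date), so portfolios are formed on accounting data that was already public. Market equity, the size variable, is lagged by one month for the same reason. This avoids look-ahead bias and follows the Fama–French convention.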

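# Book equity, stamped with the first day of the fiscal-year-end month
be <- compustat |>
  select(gvkey, datadate, be) |>
  drop_na() |>
  mutate(date = floor_date(ymd(datadate), "month"))

# Size variable: market cap becomes usable one month later
me <- crsp_monthly |>
  mutate(sorting_date = date %m+% months(1)) |>
  select(permno, gvkey, sorting_date, me = mktcap)

# Book-to-market: book equity over same-month market cap, usable six months later
bm <- be |>
  inner_join(crsp_monthly, join_by(gvkey, date)) |>
  mutate(bm = be / mktcap,
         sorting_date = date %m+% months(6),
         comp_date = date) |>
  select(permno, gvkey, sorting_date, comp_date, bm)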

# Book equity, stamped with the first day of the fiscal-year-end month
be = (compustat
  .get(["gvkey", "datadate", "be"])
  .dropna()
  .assign(date=lambda x: pd.to_datetime(x["datadate"]).dt.to_period("M").dt.to_timestamp())
)

# Size variable: market cap becomes usable one month later
me = (crsp_monthly
  .assign(sorting_date=lambda x: x["date"] + pd.DateOffset(months=1))
  .rename(columns={"mktcap": "me"})
  .get(["permno", "gvkey", "sorting_date", "me"])
)

# Book-to-market: book equity over same-month market cap, usable six months later
bm = (be
  .merge(crsp_monthly, how="inner", on=["gvkey", "date"])
  .assign(
    bm=lambda x: x["be"] / x["mktcap"],
    sorting_date=lambda x: x["date"] + pd.DateOffset(months=6),
    comp_date=lambda x: x["date"])
  .get(["permno", "gvkey", "sorting_date", "comp_date", "bm"])
)

Independent sorts

An independent double sort ranks stocks on size and on book-to-market separately, then intersects the two sets of groups, giving one portfolio for every size–value combination. The value premium is the return spread between the highest and lowest book-to-market groups, where each group's return is the equal-weighted average of its value-weighted size portfolios.
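To make the ranking step concrete, here is a minimal Python sketch of an independent 2×2 sort on a single toy cross-section. The `assign_portfolio` helper below is an illustrative stand-in with the same call shape as the function used in this chapter, not a copy of it; the toy data and the choice of two groups per dimension are assumptions made for readability.

```python
import numpy as np
import pandas as pd


def assign_portfolio(data, sorting_variable, n_portfolios):
    """Return 1-based quantile portfolio labels for one cross-section (sketch)."""
    values = data[sorting_variable]
    # Breakpoints are equally spaced quantiles of the non-missing values.
    breakpoints = np.quantile(values.dropna(), np.linspace(0, 1, n_portfolios + 1))
    # Only the interior breakpoints matter: count how many lie at or below each value.
    labels = np.searchsorted(breakpoints[1:-1], values.to_numpy(), side="right") + 1
    # Stocks with a missing sorting variable get no portfolio.
    return pd.Series(labels, index=data.index).where(values.notna())


# Toy cross-section for one month (hypothetical numbers).
stocks = pd.DataFrame({
    "permno": range(1, 9),
    "me": [10, 20, 30, 40, 50, 60, 70, 80],
    "bm": [0.9, 0.2, 0.7, 0.4, 0.3, 0.8, 0.1, 0.6],
})

# Independent sort: each ranking uses the full cross-section.
sorted_stocks = stocks.assign(
    portfolio_me=assign_portfolio(stocks, "me", 2),
    portfolio_bm=assign_portfolio(stocks, "bm", 2),
)

# Intersecting the two rankings gives one cell per size-value combination.
cell_counts = sorted_stocks.groupby(["portfolio_me", "portfolio_bm"]).size()
print(cell_counts)
```

In the full panel the same assignment is repeated within each date. Here the two characteristics are unrelated by construction, so the four cells come out evenly filled; nothing in an independent sort enforces that.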

data_for_sorts <- crsp_monthly |>
  left_join(bm, join_by(permno, gvkey, date == sorting_date)) |>
  left_join(me, join_by(permno, gvkey, date == sorting_date))

value_portfolios <- data_for_sorts |>
  group_by(date) |>
  mutate(
    portfolio_bm = assign_portfolio(pick(everything()), bm, n_portfolios = 5),
    portfolio_me = assign_portfolio(pick(everything()), me, n_portfolios = 5)
  ) |>
  group_by(date, portfolio_bm, portfolio_me) |>
  summarize(ret = weighted.mean(ret_excess, mktcap_lag), .groups = "drop")

value_premium <- value_portfolios |>
  group_by(date, portfolio_bm) |>
  summarize(ret = mean(ret), .groups = "drop") |>
  group_by(date) |>
  summarize(value_premium =
    ret[portfolio_bm == max(portfolio_bm)] -
    ret[portfolio_bm == min(portfolio_bm)])

data_for_sorts = (crsp_monthly
  .merge(bm, how="left",
         left_on=["permno", "gvkey", "date"],
         right_on=["permno", "gvkey", "sorting_date"])
  .merge(me, how="left",
         left_on=["permno", "gvkey", "date"],
         right_on=["permno", "gvkey", "sorting_date"])
)
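After the left joins above, bm is populated only in the month that equals sorting_date, so most stock-months have no value to sort on. A common remedy is to carry each firm's latest ratio forward until the next report arrives and to drop values that have gone stale. The following is a minimal sketch of that idea on toy data, not part of the code above: `carry_forward` and `max_age_months` are hypothetical names, and the 18-month cutoff assumes comp_date marks the fiscal-year-end month, as it does in this chapter (six months of lag plus twelve months of use).

```python
import numpy as np
import pandas as pd


def carry_forward(data, max_age_months=18):
    """Forward-fill bm within each stock and drop values that are too old (sketch)."""
    filled = data.sort_values(["permno", "date"]).copy()
    # Fill the ratio and the date it refers to, stock by stock.
    filled[["bm", "comp_date"]] = (
        filled.groupby("permno")[["bm", "comp_date"]].ffill()
    )
    # Keep a row only if its accounting data is younger than the cutoff.
    cutoff = filled["date"] - pd.DateOffset(months=max_age_months)
    return filled[filled["comp_date"] > cutoff].reset_index(drop=True)


# Toy panel: fiscal year ends December 2019, so bm first appears in June 2020.
# No later report arrives, which lets the staleness filter bite.
dates = pd.date_range("2020-06-01", periods=15, freq="MS")
panel = pd.DataFrame({
    "permno": 1,
    "date": dates,
    "bm": [0.8] + [np.nan] * 14,
    "comp_date": [pd.Timestamp("2019-12-01")] + [pd.NaT] * 14,
})

usable = carry_forward(panel)
print(usable[["date", "bm", "comp_date"]])
```

In this toy panel the ratio is used from June 2020 through May 2021 and then dropped, because no newer fiscal year has been reported.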

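# assign_portfolio(data, sorting_variable, n_portfolios) is not imported above;
# it is assumed to be the quantile-based helper from the curriculum's univariate sorts.
value_portfolios = (data_for_sorts
  .groupby("date")
  # Rank on both characteristics within each month, independently of each other
  .apply(lambda x: x.assign(
      portfolio_bm=assign_portfolio(x, "bm", 5),
      portfolio_me=assign_portfolio(x, "me", 5)))
  .reset_index(drop=True)
  .groupby(["date", "portfolio_bm", "portfolio_me"])
  # Value-weighted excess return of each size-value cell
  .apply(lambda x: np.average(x["ret_excess"], weights=x["mktcap_lag"]))
  .reset_index(name="ret")
)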
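The R code computes value_premium from the portfolio returns, while the Python code stops at value_portfolios. A Python counterpart might look like the sketch below. It runs on a small hypothetical value_portfolios table with integer portfolio labels, so the numbers are illustrative only and `bm_returns` and `wide` are names introduced here.

```python
import pandas as pd

# Hypothetical 2x2 portfolio returns for two months.
value_portfolios = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01"] * 4 + ["2021-02-01"] * 4),
    "portfolio_bm": [1, 1, 2, 2] * 2,
    "portfolio_me": [1, 2, 1, 2] * 2,
    "ret": [0.01, 0.02, 0.03, 0.04, -0.02, 0.00, 0.01, 0.03],
})

# Step 1: average each book-to-market group equally across its size portfolios.
bm_returns = (value_portfolios
    .groupby(["date", "portfolio_bm"], as_index=False)["ret"]
    .mean()
)

# Step 2: per month, subtract the lowest book-to-market group from the highest.
wide = bm_returns.pivot(index="date", columns="portfolio_bm", values="ret")
value_premium = (
    (wide[wide.columns.max()] - wide[wide.columns.min()])
    .rename("value_premium")
    .reset_index()
)
print(value_premium)
```

Pivoting to one column per book-to-market group keeps the high-minus-low step a plain column subtraction, which avoids boolean indexing inside a grouped apply.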

Dependent sorts

A dependent (sequential) sort ranks first on size, then ranks on book-to-market within each size group. This guarantees that every size group contains the same number of stocks in each book-to-market portfolio (up to rounding), which matters when the two characteristics are correlated and independent sorts would leave some cells thinly populated.

value_portfolios <- data_for_sorts |>
  group_by(date) |>
  mutate(portfolio_me = assign_portfolio(pick(everything()), me, n_portfolios = 5)) |>
  group_by(date, portfolio_me) |>
  mutate(portfolio_bm = assign_portfolio(pick(everything()), bm, n_portfolios = 5)) |>
  group_by(date, portfolio_bm, portfolio_me) |>
  summarize(ret = weighted.mean(ret_excess, mktcap_lag), .groups = "drop")

value_portfolios = (data_for_sorts
  .groupby("date")
  .apply(lambda x: x.assign(
      portfolio_me=assign_portfolio(x, "me", 5)))
  .reset_index(drop=True)
  .groupby(["date", "portfolio_me"])
  .apply(lambda x: x.assign(
      portfolio_bm=assign_portfolio(x, "bm", 5)))
  .reset_index(drop=True)
  .groupby(["date", "portfolio_bm", "portfolio_me"])
  .apply(lambda x: np.average(x["ret_excess"], weights=x["mktcap_lag"]))
  .reset_index(name="ret")
)
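The balance argument is easy to check on simulated data. The sketch below draws one cross-section in which book-to-market is negatively correlated with size, then counts stocks per cell under both designs. Everything here is an assumption for illustration: the data are simulated, and `quantile_label` uses `pd.qcut` as a simple stand-in for assign_portfolio.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_stocks = 1000

# Simulated characteristics: larger firms tend to have lower book-to-market.
me = rng.lognormal(size=n_stocks)
bm = np.exp(-0.8 * np.log(me) + 0.6 * rng.normal(size=n_stocks))
stocks = pd.DataFrame({"me": me, "bm": bm})


def quantile_label(values, n_portfolios):
    """1-based quantile labels via pd.qcut (stand-in for assign_portfolio)."""
    return pd.qcut(values, n_portfolios, labels=False) + 1


# First dimension is the same in both designs.
stocks["portfolio_me"] = quantile_label(stocks["me"], 5)

# Independent: rank book-to-market over the whole cross-section.
stocks["bm_independent"] = quantile_label(stocks["bm"], 5)

# Dependent: rank book-to-market separately inside each size group.
stocks["bm_dependent"] = (stocks
    .groupby("portfolio_me")["bm"]
    .transform(lambda x: quantile_label(x, 5))
)

independent_counts = pd.crosstab(stocks["portfolio_me"], stocks["bm_independent"])
dependent_counts = pd.crosstab(stocks["portfolio_me"], stocks["bm_dependent"])
print(independent_counts)
print(dependent_counts)
```

With 1,000 stocks and five groups per dimension, every dependent cell holds 40 stocks, while the independent cells are uneven because small stocks cluster in the high book-to-market groups.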

The two designs usually give similar value premia. The comparison is itself the lesson, since the choice between them is one more researcher degree of freedom. The size–value grid built here is the same kind of structure the next chapter uses to construct the Fama–French SMB and HML factors.

Study notes following the Tidy Finance curriculum by Scheuch, Voigt, Weiss, and Frey. Prose is my own; the R/Python code is reproduced from the book's open-source source, licensed CC BY-NC-SA 4.0.