Univariate Portfolio Sorts

Jun He May 31, 2026

The portfolio sort is the workhorse of empirical asset pricing: rank stocks on a characteristic, group them, and test whether the groups earn different returns. This chapter sorts on market beta, builds a long–short spread, tests whether its average return differs from zero, generalizes the sort to any number of portfolios, and estimates CAPM alphas for the sorted portfolios. Each step is shown first in R, then in Python.

library(tidyverse)
library(tidyfinance)

import pandas as pd
import numpy as np
import sqlite3
import statsmodels.formula.api as smf

Data preparation

We pull the monthly returns panel, the market factor, and the betas estimated in the previous chapter from the local database.

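# Connect to the local SQLite database. The file name is assumed by analogy
# with the Python version below; adjust the path to your setup.
# extended_types = TRUE returns date columns as Date rather than numeric.
tidy_finance <- DBI::dbConnect(
  RSQLite::SQLite(), "data/tidy_finance_r.sqlite", extended_types = TRUE
)

crsp_monthly <- tbl(tidy_finance, "crsp_monthly") |>
  select(permno, date, ret_excess, mktcap_lag) |>
  collect()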

factors_ff3_monthly <- tbl(tidy_finance, "factors_ff3_monthly") |>
  select(date, mkt_excess) |>
  collect()

beta <- tbl(tidy_finance, "beta") |>
  select(permno, date, beta_monthly) |>
  collect()

tidy_finance = sqlite3.connect(database="data/tidy_finance_python.sqlite")

crsp_monthly = pd.read_sql_query(
  sql="SELECT permno, date, ret_excess, mktcap_lag FROM crsp_monthly",
  con=tidy_finance, parse_dates={"date"}
)

factors_ff3_monthly = pd.read_sql_query(
  sql="SELECT date, mkt_excess FROM factors_ff3_monthly",
  con=tidy_finance, parse_dates={"date"}
)

beta = pd.read_sql_query(
  sql="SELECT permno, date, beta_monthly FROM beta",
  con=tidy_finance, parse_dates={"date"}
)

Sorting by market beta

The crucial timing step: we lag beta by one month so that the sort uses only information available before the return period, which rules out look-ahead bias. Then each month we split stocks at the median beta into "low" and "high" portfolios and compute each portfolio's value-weighted return (weighting by lagged market cap).

beta_lag <- beta |>
  mutate(date = date %m+% months(1)) |>
  select(permno, date, beta_lag = beta_monthly) |>
  drop_na()

data_for_sorts <- crsp_monthly |>
  inner_join(beta_lag, join_by(permno, date))

beta_portfolios <- data_for_sorts |>
  group_by(date) |>
  mutate(
    breakpoint = median(beta_lag),
    portfolio = case_when(
      beta_lag <= breakpoint ~ "low",
      beta_lag > breakpoint ~ "high"
    )
  ) |>
  group_by(portfolio, date) |>
  summarize(ret = weighted.mean(ret_excess, mktcap_lag), .groups = "drop")

beta_lag = (beta
  .assign(date=lambda x: x["date"] + pd.DateOffset(months=1))
  .get(["permno", "date", "beta_monthly"])
  .rename(columns={"beta_monthly": "beta_lag"})
  .dropna()
)

data_for_sorts = crsp_monthly.merge(beta_lag, how="inner", on=["permno", "date"])

beta_portfolios = (data_for_sorts
  .groupby("date")
  .apply(lambda x: x.assign(
      portfolio=pd.cut(
        x["beta_lag"],
        bins=[-np.inf, x["beta_lag"].median(), np.inf],
        labels=["low", "high"])))
  .reset_index(drop=True)
  .groupby(["portfolio", "date"])
  .apply(lambda x: np.average(x["ret_excess"], weights=x["mktcap_lag"]))
  .reset_index(name="ret")
)
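To see what the date shift does, here is a minimal sketch on made-up data. The `permno`, betas, and returns are invented for illustration, and the dates are assumed to be stored as the first day of each month. After the shift, each return is paired with the beta estimated one month earlier, and the first return drops out because no earlier beta exists.

```python
import pandas as pd

# Toy beta estimates for one hypothetical stock (values invented for illustration).
dates = pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"])
beta = pd.DataFrame({
    "permno": [10001, 10001, 10001],
    "date": dates,
    "beta_monthly": [0.8, 1.1, 1.3],
})

# Toy excess returns for the same stock and months.
returns = pd.DataFrame({
    "permno": [10001, 10001, 10001],
    "date": dates,
    "ret_excess": [0.01, -0.02, 0.03],
})

# Move each beta's date forward one month so it lines up with next month's return.
beta_lag = (beta
    .assign(date=lambda x: x["date"] + pd.DateOffset(months=1))
    .rename(columns={"beta_monthly": "beta_lag"})
)

# The inner join keeps only months that have both a return and a lagged beta.
merged = returns.merge(beta_lag, how="inner", on=["permno", "date"])
print(merged)
```

The February return is matched with the January beta and the March return with the February beta, so no return is ever sorted on a beta from its own month.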

Performance evaluation

The long–short spread goes long the high-beta portfolio and short the low-beta one. Regressing the spread on a constant tests whether its average return is significantly different from zero.

beta_longshort <- beta_portfolios |>
  pivot_wider(names_from = portfolio, values_from = ret) |>
  mutate(long_short = high - low)

model_fit <- lm(long_short ~ 1, data = beta_longshort)
summary(model_fit)

beta_longshort = (beta_portfolios
  .pivot_table(index="date", columns="portfolio", values="ret")
  .reset_index()
  .assign(long_short=lambda x: x["high"] - x["low"])
)

model_fit = smf.ols("long_short ~ 1", data=beta_longshort).fit()
model_fit.summary()
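Why does a regression on a constant test the mean? With only an intercept, the OLS coefficient is the sample mean, and its t-statistic is the mean divided by its standard error. The sketch below checks this with NumPy on made-up returns (the numbers are illustrative, not results from the data). Note that this plain standard error ignores any autocorrelation or heteroskedasticity in the return series.

```python
import numpy as np

# Made-up monthly long-short returns (illustrative numbers only).
long_short = np.array([0.02, -0.01, 0.03, 0.00, 0.01, -0.02, 0.04, 0.01])

n = long_short.size
mean_ret = long_short.mean()

# Standard error of the mean, using the n - 1 degrees-of-freedom correction.
std_err = long_short.std(ddof=1) / np.sqrt(n)
t_stat = mean_ret / std_err

# OLS on a column of ones: the single fitted coefficient is the intercept.
X = np.ones((n, 1))
coef, *_ = np.linalg.lstsq(X, long_short, rcond=None)

print(f"mean = {mean_ret:.4f}, intercept = {coef[0]:.4f}, t = {t_stat:.2f}")
```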

A reusable sorting function

Two portfolios is the simplest case. To generalize to deciles (or any number), we compute breakpoints from the sorting variable's quantiles and assign each stock to the interval it falls in. This assign_portfolio helper reappears throughout the asset-pricing chapters.

assign_portfolio <- function(data, sorting_variable, n_portfolios) {
  breakpoints <- data |>
    pull({{ sorting_variable }}) |>
    quantile(
      probs = seq(0, 1, length.out = n_portfolios + 1),
      na.rm = TRUE, names = FALSE
    )
  data |>
    mutate(portfolio = findInterval(
      pick(everything()) |> pull({{ sorting_variable }}),
      breakpoints, all.inside = TRUE
    )) |>
    pull(portfolio)
}

def assign_portfolio(data, sorting_variable, n_portfolios):
    breakpoints = np.quantile(
        data[sorting_variable].dropna(),
        np.linspace(0, 1, n_portfolios + 1),
        method="linear"
    )
    assigned = pd.cut(
        data[sorting_variable], bins=breakpoints,
        labels=range(1, n_portfolios + 1), include_lowest=True
    )
    return assigned
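As a quick check of the helper, the sketch below repeats the Python definition so that it runs on its own, then applies it to a single made-up cross-section of 100 evenly spaced betas (invented values, not estimates). With ten portfolios, each decile should receive ten stocks.

```python
import numpy as np
import pandas as pd

def assign_portfolio(data, sorting_variable, n_portfolios):
    # Breakpoints are the equally spaced quantiles of the sorting variable.
    breakpoints = np.quantile(
        data[sorting_variable].dropna(),
        np.linspace(0, 1, n_portfolios + 1),
        method="linear"
    )
    # Each stock gets the label of the interval its value falls in.
    return pd.cut(
        data[sorting_variable], bins=breakpoints,
        labels=range(1, n_portfolios + 1), include_lowest=True
    )

# One hypothetical cross-section: 100 stocks with betas 0.02, 0.04, ..., 2.00.
toy = pd.DataFrame({
    "permno": np.arange(1, 101),
    "beta_lag": np.arange(1, 101) / 50,
})

toy["portfolio"] = assign_portfolio(toy, "beta_lag", n_portfolios=10)
counts = toy["portfolio"].value_counts().sort_index()
print(counts)
```

One caveat of this design: `pd.cut` raises an error when two breakpoints coincide, which can happen if many stocks share the same value of the sorting variable.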

CAPM alphas across portfolios

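Regressing each portfolio's excess return on the market factor decomposes it into a beta (market exposure) and an alpha (the average return that market exposure cannot explain). If the CAPM held, every alpha would be indistinguishable from zero; a monotone pattern in alphas across the sorted portfolios suggests that the sorting characteristic is related to returns in a way that the market factor alone does not capture.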

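beta_portfolios_summary <- beta_portfolios |>
  # Attach the market factor; without this join mkt_excess is not available.
  left_join(factors_ff3_monthly, join_by(date)) |>
  group_by(portfolio) |>
  summarize(
    alpha = as.numeric(lm(ret ~ 1 + mkt_excess)$coefficients[1]),
    beta = as.numeric(lm(ret ~ 1 + mkt_excess)$coefficients[2]),
    ret = mean(ret)
  )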

def estimate_portfolio_capm(data):
    fit = smf.ols("ret ~ 1 + mkt_excess", data=data).fit()
    return pd.Series({
        "alpha": fit.params["Intercept"],
        "beta": fit.params["mkt_excess"],
        "ret": data["ret"].mean()
    })

beta_portfolios_summary = (beta_portfolios
  .merge(factors_ff3_monthly, how="left", on="date")
  .groupby("portfolio")
  .apply(estimate_portfolio_capm)
  .reset_index()
)
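For a single regressor, the two CAPM coefficients have closed forms: the slope is the covariance of the portfolio return with the market divided by the market's variance, and the intercept is the mean return minus the slope times the mean market return. The sketch below verifies this on made-up numbers, constructing a noise-free portfolio with a known alpha and beta (both values are hypothetical) so that the estimates can be checked.

```python
import numpy as np

# Made-up monthly market excess returns (illustrative only).
mkt_excess = np.array([0.01, -0.02, 0.03, 0.00, 0.02, -0.01])

# Build a portfolio with known alpha and beta so the estimates can be verified.
true_alpha, true_beta = 0.002, 1.5
ret = true_alpha + true_beta * mkt_excess

# Slope of a univariate OLS regression: Cov(ret, mkt) / Var(mkt).
beta_hat = np.cov(ret, mkt_excess, ddof=1)[0, 1] / np.var(mkt_excess, ddof=1)

# Intercept: the average return left over after removing market exposure.
alpha_hat = ret.mean() - beta_hat * mkt_excess.mean()

print(f"alpha = {alpha_hat:.4f}, beta = {beta_hat:.4f}")
```

With real portfolio returns the fit is not exact, so alpha should be judged together with its standard error rather than by its point estimate alone.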

Study notes following the Tidy Finance curriculum by Scheuch, Voigt, Weiss, and Frey. Prose is my own; the R/Python code is reproduced from the book's open-source code, licensed CC BY-NC-SA 4.0.