WRDS, CRSP, and Compustat

CRSP and Compustat, accessed through WRDS, are the backbone datasets of empirical equity research. This chapter connects to WRDS, downloads and cleans CRSP stock data and Compustat fundamentals, and merges them into the firm-month panel that anchors the empirical chapters. Use the R | Python toggle to switch.

library(tidyverse)
library(tidyfinance)
library(arrow)
library(dbplyr)
library(RPostgres)
import pandas as pd
import numpy as np
import tidyfinance as tf
from sqlalchemy import create_engine

Accessing WRDS

WRDS is a PostgreSQL server; we connect with credentials stored in environment variables (never in code).

wrds <- dbConnect(
  Postgres(),
  host = "wrds-pgdata.wharton.upenn.edu",
  dbname = "wrds", port = 9737, sslmode = "require",
  user = Sys.getenv("WRDS_USER"),
  password = Sys.getenv("WRDS_PASSWORD")
)
import os
connection_string = (
    "postgresql+psycopg2://"
    f"{os.getenv('WRDS_USER')}:{os.getenv('WRDS_PASSWORD')}"
    "@wrds-pgdata.wharton.upenn.edu:9737/wrds"
)
wrds = create_engine(connection_string, pool_pre_ping=True)

Downloading and preparing CRSP

We pull monthly stock data joined to the security-info history, filtering to US common equity on the three major exchanges. Two cleaning steps matter most: computing market cap as price × shares outstanding, and building a one-month lagged market cap (the weighting variable for value-weighted portfolios).

crsp_monthly <- download_data(
  type = "wrds_crsp_monthly",
  start_date = start_date,
  end_date = end_date
)
crsp_monthly = tf.download_data(
    domain="wrds", dataset="crsp_monthly",
    start_date=start_date, end_date=end_date
)

Under the hood, download_data runs a SQL query against crsp.msf_v2 and the security-info history, then maps exchange and industry codes and computes returns in excess of the risk-free rate. The manual version makes the steps explicit:

crsp_monthly <- crsp_monthly |>
  mutate(
    mktcap = shrout * prc / 10^6,
    mktcap = na_if(mktcap, 0)
  )

mktcap_lag <- crsp_monthly |>
  mutate(date = date %m+% months(1)) |>
  select(permno, date, mktcap_lag = mktcap)

crsp_monthly <- crsp_monthly |>
  left_join(mktcap_lag, join_by(permno, date)) |>
  mutate(
    exchange = case_when(
      primaryexch == "N" ~ "NYSE",
      primaryexch == "A" ~ "AMEX",
      primaryexch == "Q" ~ "NASDAQ",
      .default = "Other"
    )
  )
crsp_monthly = (crsp_monthly
    .assign(mktcap=lambda x: (x["shrout"] * x["altprc"] / 1e6).replace(0, np.nan)))

mktcap_lag = (crsp_monthly
    .assign(date=lambda x: x["date"] + pd.DateOffset(months=1))
    .get(["permno", "date", "mktcap"])
    .rename(columns={"mktcap": "mktcap_lag"}))

def assign_exchange(p):
    return {"N": "NYSE", "A": "AMEX", "Q": "NASDAQ"}.get(p, "Other")

crsp_monthly = (crsp_monthly
    .merge(mktcap_lag, how="left", on=["permno", "date"])
    .assign(exchange=lambda x: x["primaryexch"].apply(assign_exchange)))

Excess returns and the industry mapping

Joining the Fama–French risk-free rate gives excess returns; the SIC code maps to a coarse industry label used throughout the book.

factors_ff3_monthly <- read_parquet("data-r/factors_ff3_monthly.parquet") |>
  select(date, risk_free)

crsp_monthly <- crsp_monthly |>
  left_join(factors_ff3_monthly, join_by(date)) |>
  mutate(ret_excess = ret - risk_free) |>
  select(-risk_free) |>
  drop_na(ret_excess, mktcap, mktcap_lag)

write_parquet(crsp_monthly, "data-r/crsp_monthly.parquet")
factors_ff3_monthly = pd.read_parquet(
    "data-python/factors_ff3_monthly.parquet")[["date", "risk_free"]]

crsp_monthly = (crsp_monthly
    .merge(factors_ff3_monthly, how="left", on="date")
    .assign(ret_excess=lambda x: x["ret"] - x["risk_free"])
    .drop(columns=["risk_free"])
    .dropna(subset=["ret_excess", "mktcap", "mktcap_lag"]))

crsp_monthly.to_parquet("data-python/crsp_monthly.parquet")

Preparing Compustat

Compustat's funda table gives accounting fundamentals. The two derived quantities used most are book equity (be, from the standard coalesce of stockholders' equity definitions) and operating profitability (op); we also keep the latest filing per firm-year and build a lagged total-assets for the investment characteristic.

compustat_annual <- download_data(
  type = "wrds_compustat_annual",
  start_date = start_date,
  end_date = end_date
)
compustat_annual = tf.download_data(
    domain="wrds", dataset="compustat_annual",
    start_date=start_date, end_date=end_date
)
compustat_annual <- compustat_annual |>
  mutate(
    be = coalesce(seq, ceq + pstk, at - lt) +
      coalesce(txditc, txdb + itcb, 0) -
      coalesce(pstkrv, pstkl, pstk, 0),
    be = if_else(be <= 0, NA, be),
    op = (sale - coalesce(cogs, 0) - coalesce(xsga, 0) - coalesce(xint, 0)) / be
  ) |>
  mutate(year = year(datadate)) |>
  group_by(gvkey, year) |>
  filter(datadate == max(datadate)) |>
  ungroup()
compustat_annual = (compustat_annual
    .assign(
        be=lambda x: (
            x["seq"].combine_first(x["ceq"] + x["pstk"]).combine_first(x["at"] - x["lt"])
            + x["txditc"].combine_first(x["txdb"] + x["itcb"]).fillna(0)
            - x["pstkrv"].combine_first(x["pstkl"]).combine_first(x["pstk"]).fillna(0)))
    .assign(be=lambda x: x["be"].where(x["be"] > 0, np.nan))
    .assign(op=lambda x: (
        (x["sale"] - x["cogs"].fillna(0) - x["xsga"].fillna(0) - x["xint"].fillna(0)) / x["be"]))
    .assign(year=lambda x: x["datadate"].dt.year)
    .sort_values("datadate")
    .groupby(["gvkey", "year"]).tail(1))

Merging CRSP with Compustat

The two databases use different identifiers, bridged by the CCM linking table (permnogvkey, valid within link dates). Joining it onto CRSP attaches each stock-month to its accounting gvkey, producing the firm-month panel the empirical chapters build on.

ccm_links <- download_data(type = "wrds_ccm_links")

crsp_monthly <- crsp_monthly |>
  left_join(ccm_links, join_by(permno, date))

write_parquet(crsp_monthly, "data-r/crsp_monthly.parquet")
ccm_links = tf.download_data(domain="wrds", dataset="ccm_links")

crsp_monthly = crsp_monthly.merge(ccm_links, how="left", on=["permno", "date"])
crsp_monthly.to_parquet("data-python/crsp_monthly.parquet")

A useful sanity check is the share of securities that successfully match to a book-equity value over time, by exchange — coverage is far higher for NYSE/Compustat-followed firms, a selection effect worth keeping in mind.


Study notes following the Tidy Finance curriculum by Scheuch, Voigt, Weiss, and Frey. Prose is my own; the R/Python code is reproduced from the book's open-source source, licensed CC BY-NC-SA 4.0.