Factor Selection via Machine Learning
With hundreds of candidate predictors (the "factor zoo"), ordinary least squares overfits. This chapter uses penalized regression (lasso, ridge, elastic net) to select among many correlated predictors, tuning the penalty by time-series cross-validation so that evaluation respects the arrow of time. The R code uses tidymodels with glmnet; the Python code uses scikit-learn.
library(tidyverse)
library(tidyfinance)
library(tidymodels)
library(glmnet)
library(timetk)
import pandas as pd
import numpy as np
import sqlite3
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
Data preparation
We build a monthly dataset that joins many factor candidates by date (the Fama–French five factors, the q-factors, and macro predictors), with next month's market excess return as the target.
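# Connect to the SQLite database (path is an assumption, mirroring the Python setup).
tidy_finance <- DBI::dbConnect(RSQLite::SQLite(), "data/tidy_finance_r.sqlite", extended_types = TRUE)
# Prefix columns by source so the recipe can pick them up via contains("factor") and contains("macro").
factors_ff5_monthly <- tbl(tidy_finance, "factors_ff5_monthly") |> collect() |>
  rename_with(~ str_c("factor_ff_", .), -date)
factors_q_monthly <- tbl(tidy_finance, "factors_q_monthly") |> collect() |>
  rename_with(~ str_c("factor_q_", .), -date)
macro_predictors <- tbl(tidy_finance, "macro_predictors") |> collect() |>
  rename_with(~ str_c("macro_", .), -date)
data <- factors_ff5_monthly |>
  left_join(factors_q_monthly, join_by(date)) |>
  left_join(macro_predictors, join_by(date)) |>
  arrange(date) |>
  # Target: next month's market excess return (lead() relies on the date ordering above).
  mutate(ret_excess_lead = lead(factor_ff_mkt_excess)) |>
  drop_na()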
tidy_finance = sqlite3.connect(database="data/tidy_finance_python.sqlite")
factors_ff5_monthly = pd.read_sql_query("SELECT * FROM factors_ff5_monthly", tidy_finance, parse_dates={"date"})
factors_q_monthly = pd.read_sql_query("SELECT * FROM factors_q_monthly", tidy_finance, parse_dates={"date"})
macro_predictors = pd.read_sql_query("SELECT * FROM macro_predictors", tidy_finance, parse_dates={"date"})
data = (factors_ff5_monthly
    .merge(factors_q_monthly, how="left", on="date")
    .merge(macro_predictors, how="left", on="date")
    .sort_values("date")  # shift(-1) below relies on chronological row order
    .assign(ret_excess_lead=lambda x: x["mkt_excess"].shift(-1))
    .dropna())
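To make the target construction concrete, here is a minimal sketch on a toy frame. The numbers are made up for illustration; only the column names `date`, `mkt_excess`, and `ret_excess_lead` come from the code above. It shows that each row ends up holding the following month's return and that the last month drops out because it has no successor.

```python
import pandas as pd

# Toy monthly series (made-up values, for illustration only).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31", "2020-04-30"]),
    "mkt_excess": [0.01, -0.02, 0.03, 0.00],
})

# Sort chronologically, then pull next month's return onto the current row.
toy = toy.sort_values("date")
toy["ret_excess_lead"] = toy["mkt_excess"].shift(-1)

# The final month has no "next month", so its target is NaN and is dropped.
aligned = toy.dropna()
print(aligned)
```

Every predictor on a given row is therefore known at the end of that month, while the target is realized one month later.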
A preprocessing recipe
Before fitting, predictors are standardized so that the penalty treats them comparably, and they can be interacted (characteristics × macro predictors, letting premia vary with the macro state). Bundling preprocessing with the model in a workflow or pipeline means the scaling statistics are estimated on each training window only and then applied unchanged to the matching validation window, which prevents leakage.
rec <- recipe(ret_excess_lead ~ ., data = data) |>
step_rm(date) |>
step_interact(terms = ~ contains("factor"):contains("macro")) |>
step_normalize(all_predictors()) |>
step_center(ret_excess_lead, skip = TRUE)
from sklearn.compose import ColumnTransformer
predictors = [c for c in data.columns if c not in ["date", "ret_excess_lead"]]
X = data[predictors]
y = data["ret_excess_lead"]
preprocessor = ColumnTransformer([("scale", StandardScaler(), predictors)])
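The Python preprocessing above only scales. One possible way to add the factor × macro interactions described in the text is sketched below with plain pandas. The helper `add_interactions` and the `factor_` / `macro_` column prefixes are hypothetical; the Python data loaded earlier is not prefixed, so columns would need renaming first. It also matches on prefixes rather than on substrings.

```python
import pandas as pd
from itertools import product


def add_interactions(df, left_prefix="factor_", right_prefix="macro_"):
    """Append one product column for every (factor, macro) column pair."""
    left = [c for c in df.columns if c.startswith(left_prefix)]
    right = [c for c in df.columns if c.startswith(right_prefix)]
    out = df.copy()
    for f, m in product(left, right):
        out[f"{f}_x_{m}"] = df[f] * df[m]
    return out


# Toy input with hypothetical column names and made-up values.
toy = pd.DataFrame({
    "factor_smb": [0.01, 0.02],
    "factor_hml": [0.03, -0.01],
    "macro_dp": [2.0, 3.0],
})
expanded = add_interactions(toy)
print(expanded.columns.tolist())
```

Interactions multiply the number of predictors (here two factors × one macro variable adds two columns), which is one reason the penalty matters.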
The lasso
The lasso adds an L1 penalty that drives many coefficients to exactly zero — performing variable selection automatically (mixture = 1 is pure lasso; lower values blend in ridge, giving the elastic net). The penalty strength is left as a tuning parameter.
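For reference, the elastic-net objective that glmnet minimizes can be written as below. The notation is introduced here rather than taken from the text: $N$ is the number of observations, $\beta_0$ the intercept, $\beta$ the slope vector, $\lambda$ the penalty strength (`penalty`), and $\alpha$ the mixing weight (`mixture`).

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^\top \beta\right)^2
  + \lambda\left[\alpha\,\lVert\beta\rVert_1
  + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right]
```

Setting $\alpha = 1$ leaves only the L1 term (lasso), and $\alpha = 0$ leaves only the squared L2 term (ridge). scikit-learn's `ElasticNet` uses the same form, with `alpha` playing the role of $\lambda$ and `l1_ratio` the role of $\alpha$.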
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) |>
set_engine("glmnet")
lm_fit <- workflow() |>
add_recipe(rec) |>
add_model(lasso_spec)
pipeline = Pipeline([("preprocess", preprocessor), ("lasso", Lasso())])
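The selection behaviour described above can be seen in a small synthetic experiment. This is a sketch under stated assumptions: the data are simulated, only the first two of ten columns carry signal, and the penalty `alpha=0.05` is fixed by hand rather than tuned.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 200, 10
X_sim = rng.normal(size=(n, p))
# Only columns 0 and 1 drive the target; the remaining eight are pure noise.
y_sim = 0.5 * X_sim[:, 0] - 0.3 * X_sim[:, 1] + 0.1 * rng.normal(size=n)

# Scale, then fit a lasso with a hand-picked penalty (assumption, not tuned).
model = Pipeline([("scale", StandardScaler()), ("lasso", Lasso(alpha=0.05))])
model.fit(X_sim, y_sim)

coef = model.named_steps["lasso"].coef_
kept = np.flatnonzero(coef)  # indices of coefficients that are not exactly zero
print("columns with non-zero coefficients:", kept)
print("coefficients:", np.round(coef, 3))
```

The two signal columns should survive with coefficients shrunk toward zero, while noise columns are set to exactly zero rather than merely made small. That exact zeroing is what distinguishes the L1 penalty from ridge.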
Time-series cross-validation
The penalty is tuned by cross-validation that never lets the future inform the past: each fold trains on an initial window and validates on the months immediately after. Sweeping a grid of penalties and choosing the lowest out-of-sample RMSE selects the model; the fitted lasso keeps only the factors that earn their place and zeroes the rest.
data_folds <- time_series_cv(
data = data, date_var = date,
initial = "5 years", assess = "48 months",
cumulative = FALSE, slice_limit = 20
)
lasso_tune <- lm_fit |>
tune_grid(
resamples = data_folds,
grid = grid_regular(penalty(), levels = 20),
metrics = metric_set(rmse)
)
autoplot(lasso_tune)
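To see what "never lets the future inform the past" means in terms of row indices, the following sketch prints the folds produced by scikit-learn's `TimeSeriesSplit` with a 60-month rolling training window and 48-month validation blocks. The sample length of 300 months and the four splits are hypothetical, and the slices need not coincide with those timetk produces.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_months = 300  # hypothetical sample length
cv = TimeSeriesSplit(n_splits=4, test_size=48, max_train_size=60)

folds = list(cv.split(np.arange(n_months)))
for train_idx, test_idx in folds:
    # Each training window ends right before its validation window starts.
    print(f"train {train_idx[0]}-{train_idx[-1]}  ->  validate {test_idx[0]}-{test_idx[-1]}")
```

Without `max_train_size`, the training window would expand to include all earlier observations instead of rolling forward at a fixed length.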
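# Rolling 60-month training window with 48-month validation blocks, as in the R code.
# TimeSeriesSplit needs more than n_splits * test_size observations, so 20 splits of
# 48 months would require over 960 months; derive the number of splits from the sample.
n_splits = len(X) // 48 - 1
data_folds = TimeSeriesSplit(n_splits=n_splits, test_size=48, max_train_size=60)
param_grid = {"lasso__alpha": np.logspace(-4, 0, 20)}
lasso_tune = GridSearchCV(
    pipeline, param_grid=param_grid,
    cv=data_folds, scoring="neg_root_mean_squared_error"
)
lasso_tune.fit(X, y)
print(lasso_tune.best_params_)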
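Once the search has run, the surviving predictors can be read off the refitted model. The sketch below is self-contained and uses simulated data in which only `x0` matters; the names `X_toy`, `y_toy`, and `search` are illustrative stand-ins for `X`, `y`, and `lasso_tune` above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 240
X_toy = pd.DataFrame(rng.normal(size=(n, 6)), columns=[f"x{i}" for i in range(6)])
y_toy = 0.4 * X_toy["x0"] + 0.1 * rng.normal(size=n)  # only x0 carries signal

alphas = np.logspace(-4, 0, 20)
search = GridSearchCV(
    Pipeline([("scale", StandardScaler()), ("lasso", Lasso())]),
    param_grid={"lasso__alpha": alphas},
    cv=TimeSeriesSplit(n_splits=3, test_size=48, max_train_size=60),
    scoring="neg_root_mean_squared_error",
)
search.fit(X_toy, y_toy)

# best_estimator_ is the pipeline refitted on all rows at the chosen penalty.
coef = pd.Series(search.best_estimator_.named_steps["lasso"].coef_, index=X_toy.columns)
selected = coef[coef != 0]
print("chosen alpha:", search.best_params_["lasso__alpha"])
print("cross-validated RMSE:", -search.best_score_)  # the score is negated RMSE
print(selected)
```

Because the predictors were standardized inside the pipeline, the reported coefficients are on the standardized scale, so their magnitudes are comparable across predictors.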
Study notes following the Tidy Finance curriculum by Scheuch, Voigt, Weiss, and Frey. Prose is my own; the R/Python code is reproduced from the book's open-source source, licensed CC BY-NC-SA 4.0.