Factor Selection via Machine Learning

Jun He May 31, 2026

With hundreds of candidate predictors (the "factor zoo"), ordinary least squares overfits: it has enough free coefficients to fit noise in the sample. This chapter uses penalized regression (lasso, ridge, elastic net) to select among many correlated characteristics, with the penalty tuned by time-series cross-validation so that the evaluation respects the arrow of time. The R code uses tidymodels with glmnet; the Python code uses scikit-learn. Each example appears first in R, then in Python.

library(tidyverse)
library(tidyfinance)
library(tidymodels)
library(glmnet)
library(timetk)

import pandas as pd
import numpy as np
import sqlite3
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV

Data preparation

We build a monthly data set that joins many factor candidates (the Fama–French factors, the q-factors, and macro predictors) and use next month's market excess return as the target.

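# Open the SQLite database first; the file path is an assumption that mirrors
# the Python connection below, so adjust it to your own setup.
tidy_finance <- DBI::dbConnect(
  RSQLite::SQLite(), "data/tidy_finance_r.sqlite", extended_types = TRUE
)

factors_ff3_monthly <- tbl(tidy_finance, "factors_ff3_monthly") |> collect()
factors_ff5_monthly <- tbl(tidy_finance, "factors_ff5_monthly") |> collect()
factors_q_monthly <- tbl(tidy_finance, "factors_q_monthly") |> collect()
macro_predictors <- tbl(tidy_finance, "macro_predictors") |> collect()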

data <- factors_ff5_monthly |>
  left_join(factors_q_monthly, join_by(date)) |>
  left_join(macro_predictors, join_by(date)) |>
  mutate(ret_excess_lead = lead(mkt_excess)) |>
  drop_na()

tidy_finance = sqlite3.connect(database="data/tidy_finance_python.sqlite")

factors_ff5_monthly = pd.read_sql_query("SELECT * FROM factors_ff5_monthly", tidy_finance, parse_dates={"date"})
factors_q_monthly = pd.read_sql_query("SELECT * FROM factors_q_monthly", tidy_finance, parse_dates={"date"})
macro_predictors = pd.read_sql_query("SELECT * FROM macro_predictors", tidy_finance, parse_dates={"date"})

data = (factors_ff5_monthly
  .merge(factors_q_monthly, how="left", on="date")
  .merge(macro_predictors, how="left", on="date")
  .assign(ret_excess_lead=lambda x: x["mkt_excess"].shift(-1))
  .dropna())
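To see why the lead keeps the target free of look-ahead, here is a minimal sketch on a made-up three-month series (the dates and returns are hypothetical, not taken from the database). It shows how `shift(-1)` pairs each month's predictors with the following month's return and why the last row drops out.

```python
import pandas as pd

# Hypothetical toy series: three months of market excess returns.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31"]),
    "mkt_excess": [0.01, -0.02, 0.03],
})

# shift(-1) pulls next month's value up onto the current row, so each row
# holds this month's predictors next to next month's return.
toy = toy.assign(ret_excess_lead=lambda x: x["mkt_excess"].shift(-1))

# The final month has no "next month", so its target is NaN and is dropped.
aligned = toy.dropna()
print(aligned)
```

The January row is paired with February's return and the February row with March's; March itself has no target and disappears.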

A preprocessing recipe

Before fitting, predictors are standardized (so the penalty treats them comparably) and can be interacted (characteristics × macro predictors, letting premia vary with the macro state). Bundling preprocessing with the model in a workflow/pipeline guarantees identical transforms in training and validation — no leakage.

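# Assumes factor columns contain "factor" and macro columns contain "macro" in
# their names; rename the joined columns first if they do not, otherwise the
# interaction step has nothing to match.
rec <- recipe(ret_excess_lead ~ ., data = data) |>
  step_rm(date) |> # the date is an index, not a predictor
  step_interact(terms = ~ contains("factor"):contains("macro")) |>
  step_normalize(all_predictors()) |> # zero mean, unit variance
  step_center(ret_excess_lead, skip = TRUE) # center the target; skipped when new data are baked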

from sklearn.compose import ColumnTransformer

predictors = [c for c in data.columns if c not in ["date", "ret_excess_lead"]]
X = data[predictors]
y = data["ret_excess_lead"]

preprocessor = ColumnTransformer([("scale", StandardScaler(), predictors)])
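The Python preprocessor above only standardizes. As a rough sketch of how the factor × macro interactions could be added on the Python side, the example below uses a hypothetical helper `add_interactions` and made-up column names with `factor_` and `macro_` prefixes (both are assumptions for illustration; the real tables may be named differently). Wrapping the helper and the scaler in one `Pipeline` means the scaling statistics come from the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def add_interactions(frame, left_prefix="factor_", right_prefix="macro_"):
    """Hypothetical helper: append one product column per (factor, macro) pair."""
    left = [c for c in frame.columns if c.startswith(left_prefix)]
    right = [c for c in frame.columns if c.startswith(right_prefix)]
    out = frame.copy()
    for f in left:
        for m in right:
            out[f"{f}_x_{m}"] = frame[f] * frame[m]
    return out


# Synthetic predictors with assumed column names.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(
    rng.normal(size=(120, 4)),
    columns=["factor_a", "factor_b", "macro_dp", "macro_tbl"],
)

prep = Pipeline([
    ("interact", FunctionTransformer(add_interactions)),
    ("scale", StandardScaler()),
])

# Fit on the earlier rows only, then reuse the same means and standard
# deviations on the later rows.
train, valid = toy_X.iloc[:90], toy_X.iloc[90:]
train_scaled = prep.fit_transform(train)
valid_scaled = prep.transform(valid)
print(train_scaled.shape, valid_scaled.shape)
```

The training columns have mean zero after scaling, while the validation columns generally do not, because they are scaled with the training statistics rather than their own.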

The lasso

The lasso adds an L1 penalty that drives many coefficients to exactly zero — performing variable selection automatically (mixture = 1 is pure lasso; lower values blend in ridge, giving the elastic net). The penalty strength is left as a tuning parameter.
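For reference, the objective behind these settings can be written out. This is the standard elastic-net form that glmnet documents for a Gaussian response, stated here as background rather than taken from the notes; the symbols $N$ (number of months), $x_t$ (standardized predictors at month $t$), $\lambda$ and $\alpha$ are notation introduced for this sketch.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{2N}\sum_{t=1}^{N}\bigl(y_{t+1} - \beta_0 - x_t^{\top}\beta\bigr)^2
  + \lambda\Bigl[\alpha\,\lVert\beta\rVert_1
  + \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\Bigr]
```

Here $\lambda$ corresponds to `penalty` and $\alpha$ to `mixture`: $\alpha = 1$ leaves only the L1 term (lasso), $\alpha = 0$ leaves only the squared L2 term (ridge), and values in between give the elastic net.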

lasso_spec <- linear_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

lm_fit <- workflow() |>
  add_recipe(rec) |>
  add_model(lasso_spec)

pipeline = Pipeline([("preprocess", preprocessor), ("lasso", Lasso())])
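To make the "exactly zero" claim concrete, here is a small sketch on synthetic data (all numbers are made up). Only the first three of twenty predictors drive the target; the example counts how many coefficients each penalty sets to zero. In scikit-learn the penalty strength is called `alpha`, and `ElasticNet` exposes the lasso/ridge blend as `l1_ratio`, which plays roughly the role of `mixture` in the R code. Note that `Lasso` and `Ridge` scale `alpha` differently, so the two values below are not directly comparable.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 standardized predictors, only the first 3 matter.
rng = np.random.default_rng(1)
n_obs, n_predictors = 300, 20
toy_X = StandardScaler().fit_transform(rng.normal(size=(n_obs, n_predictors)))
true_beta = np.zeros(n_predictors)
true_beta[:3] = [1.0, -0.8, 0.6]
toy_y = toy_X @ true_beta + rng.normal(size=n_obs)

# L1 penalty (lasso) versus L2 penalty (ridge).
lasso = Lasso(alpha=0.2).fit(toy_X, toy_y)
ridge = Ridge(alpha=0.2).fit(toy_X, toy_y)

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("coefficients set to zero, lasso:", n_zero_lasso)
print("coefficients set to zero, ridge:", n_zero_ridge)
```

The lasso should zero most of the seventeen noise predictors while keeping the three true ones; ridge shrinks every coefficient but leaves none exactly at zero.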

Time-series cross-validation

The penalty is tuned by cross-validation that never lets the future inform the past: each fold trains on a five-year window and validates on the 48 months immediately after. Sweeping a grid of penalties and choosing the one with the lowest out-of-sample RMSE selects the model; the fitted lasso keeps only the factors that earn their place and zeroes the rest.
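The fold layout is easier to see on plain indices. The sketch below assumes a hypothetical sample of 300 months and four folds (both numbers are illustrative only) and prints which months each fold trains and validates on, using scikit-learn's `TimeSeriesSplit` with a 60-month rolling training window.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_months = 300  # hypothetical sample length
splitter = TimeSeriesSplit(n_splits=4, test_size=48, max_train_size=60)

# Each fold is a pair (training indices, validation indices).
folds = list(splitter.split(np.arange(n_months)))
for i, (train_idx, test_idx) in enumerate(folds):
    print(
        f"fold {i}: train {train_idx[0]}-{train_idx[-1]}, "
        f"validate {test_idx[0]}-{test_idx[-1]}"
    )
```

In every fold the last training month comes directly before the first validation month, so no validation observation can leak into the fit.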

data_folds <- time_series_cv(
  data = data, date_var = date,
  initial = "5 years", assess = "48 months",
  cumulative = FALSE, slice_limit = 20
)

lasso_tune <- lm_fit |>
  tune_grid(
    resamples = data_folds,
    grid = grid_regular(penalty(), levels = 20),
    metrics = metric_set(rmse)
  )

autoplot(lasso_tune)

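# TimeSeriesSplit requires n_splits * test_size < len(data); twenty 48-month
# folds would need more than 960 months, so cap the count by the sample length.
assessment_months = 48
n_splits = min(20, len(data) // assessment_months - 1)
# max_train_size=60 mirrors the five-year rolling window (cumulative = FALSE) in the R code.
data_folds = TimeSeriesSplit(
    n_splits=n_splits, test_size=assessment_months, max_train_size=60
)
param_grid = {"lasso__alpha": np.logspace(-4, 0, 20)}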

lasso_tune = GridSearchCV(
    pipeline, param_grid=param_grid,
    cv=data_folds, scoring="neg_root_mean_squared_error"
)
lasso_tune.fit(X, y)

print(lasso_tune.best_params_)
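Once the search has picked a penalty, the refitted pipeline shows which predictors survive. The sketch below runs on synthetic data (the names `toy_X`, `toy_y`, and columns `x0` to `x9` are made up) and reads the coefficients through `best_estimator_`, `named_steps`, and `coef_`. The same three attributes could be applied to the tuned object above, assuming its final step is also named "lasso".

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: ten predictors, only x0 and x3 drive the target.
rng = np.random.default_rng(42)
n_months = 360
toy_X = pd.DataFrame(
    rng.normal(size=(n_months, 10)), columns=[f"x{i}" for i in range(10)]
)
toy_y = 1.0 * toy_X["x0"] - 0.8 * toy_X["x3"] + rng.normal(scale=0.5, size=n_months)

toy_search = GridSearchCV(
    Pipeline([("scale", StandardScaler()), ("lasso", Lasso())]),
    param_grid={"lasso__alpha": np.logspace(-4, 0, 20)},
    cv=TimeSeriesSplit(n_splits=5, test_size=48, max_train_size=60),
    scoring="neg_root_mean_squared_error",
)
toy_search.fit(toy_X, toy_y)

# GridSearchCV refits the best pipeline on all rows; pull out its lasso step.
best_lasso = toy_search.best_estimator_.named_steps["lasso"]
coefs = pd.Series(best_lasso.coef_, index=toy_X.columns)

# Keep the non-zero coefficients, largest in absolute value first.
selected = coefs[coefs != 0].sort_values(key=abs, ascending=False)
print(selected)
```

Because the predictors are standardized inside the pipeline, the coefficient magnitudes are comparable across predictors, which makes the sorted list a rough ranking of importance.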

Study notes following the Tidy Finance curriculum by Scheuch, Voigt, Weiss, and Frey. Prose is my own; the R/Python code is reproduced from the book's open-source source, licensed CC BY-NC-SA 4.0.