Working with Stock Returns
This first hands-on chapter sets up the workflow used everywhere later: download price data, turn it into returns, and visualize it. The code uses the tidyfinance helper package, which wraps common downloads (here, from Yahoo Finance) behind a single download_data call in both languages; in the code below, the data set is selected with type in R and with domain in Python. Each step is shown first in R, then in Python.
# R: tidyverse for wrangling and ggplot2, tidyfinance for downloads, scales for axis labels
library(tidyverse)
library(tidyfinance)
library(scales)
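# Python: pandas and numpy for data handling, tidyfinance for downloads, plotnine for the plots
import pandas as pd
import numpy as np
import tidyfinance as tf
from plotnine import *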
Downloading data
We start with daily prices for a single symbol, Apple (AAPL), over more than two decades: from the start of 2000 to the end of 2024 in the R code and to the end of 2023 in the Python code. The download returns a tidy data frame with one row per symbol-day: the symbol, the date, trading volume, the open/high/low/close prices, and an adjusted close. The adjusted close corrects for stock splits and dividends, so its changes reflect the return an investor holding the stock would actually have earned; it is the series to use for return calculations.
# R: daily AAPL prices via tidyfinance
prices <- download_data(
  type = "stock_prices",
  symbols = "AAPL",
  start_date = "2000-01-01",
  end_date = "2024-12-31"
)
prices
# Python: same download; this sample ends one year earlier (2023)
prices = tf.download_data(
    domain="stock_prices",
    symbols="AAPL",
    start_date="2000-01-01",
    end_date="2023-12-31"
)
prices.head().round(3)
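To see why the adjusted close is the right input for returns, consider a hypothetical 2-for-1 split. The prices below are invented for illustration and are not AAPL data; the column name close for the unadjusted price is an assumption about the download's layout. A minimal sketch, assuming pandas is installed:

```python
import pandas as pd

# Hypothetical prices around a 2-for-1 split on the third day (not real data).
toy_prices = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07"]),
    "close": [100.0, 102.0, 51.5, 52.0],          # raw quote halves at the split
    "adjusted_close": [50.0, 51.0, 51.5, 52.0],   # pre-split prices restated in post-split units
})

# Day-over-day returns from each price series.
raw_ret = toy_prices["close"].pct_change()
adj_ret = toy_prices["adjusted_close"].pct_change()

comparison = toy_prices.assign(raw_ret=raw_ret, adj_ret=adj_ret)
print(comparison.round(4))
```

On the split day the raw series shows a return of about -49.5% although the holder lost nothing, while the adjusted series shows a gain of about 1%.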
Plotting the adjusted close over time gives the familiar long-run price path. The grammar-of-graphics style — ggplot2 in R, plotnine in Python — maps a column to each visual aesthetic (date to the x-axis, adjusted price to the y-axis) and builds the chart in layers.
# R: line chart of the adjusted close
prices |>
  ggplot(aes(x = date, y = adjusted_close)) +
  geom_line() +
  labs(
    x = NULL, y = NULL,
    title = "Apple stock prices between beginning of 2000 and end of 2024"
  )
# Python: the same chart with plotnine
apple_prices_figure = (
    ggplot(prices, aes(y="adjusted_close", x="date"))
    + geom_line()
    + labs(x="", y="", title="Apple stock prices from 2000 to 2023")
)
apple_prices_figure.show()
Computing returns
Returns, not prices, are the unit of analysis: they are comparable across stocks and over time. A simple return is the adjusted price divided by the previous day's, minus one. Sorting by date first guarantees lag/pct_change compares the right rows; the first observation has no predecessor, so we drop the resulting missing value.
# R: sort by date, compute the simple return, drop the first (missing) value
returns <- prices |>
  arrange(date) |>
  mutate(ret = adjusted_close / lag(adjusted_close) - 1) |>
  select(symbol, date, ret) |>
  drop_na(ret)
returns
# Python: pct_change() is the price divided by the previous row's price, minus one
returns = (prices
    .sort_values("date")
    .assign(ret=lambda x: x["adjusted_close"].pct_change())
    .get(["symbol", "date", "ret"])
    .dropna()
)
returns
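The role of the sort is easiest to see on a tiny example. The three prices below are hypothetical and deliberately stored out of date order; the sketch assumes only pandas.

```python
import pandas as pd

# Hypothetical adjusted prices, deliberately stored out of date order (not real data).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-04", "2024-01-02", "2024-01-03"]),
    "adjusted_close": [99.0, 100.0, 110.0],
})

# Wrong: pct_change() compares each row with the row above it, whatever its date.
unsorted_ret = toy["adjusted_close"].pct_change()

# Right: sort first, so "previous row" means "previous trading day".
sorted_toy = toy.sort_values("date").reset_index(drop=True)
sorted_ret = sorted_toy["adjusted_close"].pct_change()

print(sorted_toy.assign(ret=sorted_ret))
```

After sorting, the prices run 100, 110, 99, so the returns are missing, +10%, and -10%. Without the sort, the second row would report 100 / 99 - 1, a return between two days that are not adjacent.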
A histogram of daily returns shows their distribution, with a dashed line at the 5% quantile — a quick read on tail risk. Daily equity returns are famously fat-tailed: extreme days occur far more often than a normal distribution would predict.
# R: histogram of daily returns with a dashed line at the 5% quantile
quantile_05 <- quantile(returns$ret, probs = 0.05)
returns |>
  ggplot(aes(x = ret)) +
  geom_histogram(bins = 100) +
  geom_vline(aes(xintercept = quantile_05), linetype = "dashed") +
  labs(x = NULL, y = NULL,
       title = "Distribution of daily Apple stock returns") +
  scale_x_continuous(labels = percent)
# Python: the same histogram; percent_format() labels the x-axis in percent
from mizani.formatters import percent_format
quantile_05 = returns["ret"].quantile(0.05)
apple_returns_figure = (
    ggplot(returns, aes(x="ret"))
    + geom_histogram(bins=100)
    + geom_vline(aes(xintercept=quantile_05), linetype="dashed")
    + labs(x="", y="", title="Distribution of daily Apple stock returns")
    + scale_x_continuous(labels=percent_format())
)
apple_returns_figure.show()
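What "fat-tailed" means can be checked on simulated data. The sketch below is an illustration under stated assumptions, not an analysis of Apple returns: it draws one normal series and one Student-t series with 3 degrees of freedom (a distribution chosen here only because it is a standard fat-tailed example), scales both to a 1% daily standard deviation, and compares the frequency of extreme days.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 100_000

# Two simulated return series scaled to a 1% daily standard deviation (not real data).
normal_ret = pd.Series(rng.normal(loc=0.0, scale=0.01, size=n))
# A Student-t with 3 degrees of freedom has variance 3, so divide by sqrt(3) to match scales.
fat_ret = pd.Series(rng.standard_t(df=3, size=n) / np.sqrt(3) * 0.01)

# Share of days with a move larger than 4% in absolute value (four standard deviations).
extreme_normal = (normal_ret.abs() > 0.04).mean()
extreme_fat = (fat_ret.abs() > 0.04).mean()

# Series.kurt() reports excess kurtosis, which is about 0 for a normal distribution.
kurt_normal = normal_ret.kurt()
kurt_fat = fat_ret.kurt()

# The 5% quantile, the quantity marked by the dashed line in the histogram.
q05_normal = normal_ret.quantile(0.05)
q05_fat = fat_ret.quantile(0.05)

print(f"share of |ret| > 4%: normal {extreme_normal:.5f}, fat-tailed {extreme_fat:.5f}")
print(f"excess kurtosis:     normal {kurt_normal:.2f}, fat-tailed {kurt_fat:.2f}")
```

The same three statistics can be computed on the downloaded returns column to see which of the two simulated series the real data resembles more.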
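Summary statistics (mean, standard deviation, minimum, maximum) quantify the same picture. The calls below compute them over the full sample; grouping by calendar year first gives the same statistics year by year and shows how volatility changes over time.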
# R: full-sample mean, standard deviation, minimum, and maximum of daily returns
returns |>
  summarize(across(
    ret,
    list(daily_mean = mean, daily_sd = sd, daily_min = min, daily_max = max)
  ))
# Python: describe() also reports the count and the quartiles
pd.DataFrame(returns["ret"].describe()).round(3).T
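The year-by-year version only adds a grouping step. The sketch below runs on simulated returns (a calm 2019 and a turbulent 2020, both invented for illustration) so that it works without a download; the statistic names mirror the R call above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

# Simulated daily returns on business days (not real data): calm 2019, turbulent 2020.
dates = pd.bdate_range("2019-01-01", "2020-12-31")
scale = np.where(dates.year == 2019, 0.01, 0.03)
toy_returns = pd.DataFrame({"date": dates, "ret": rng.normal(0.0, scale)})

# Group by the calendar year of each date, then compute the same four statistics.
yearly_stats = (toy_returns
    .groupby(toy_returns["date"].dt.year)["ret"]
    .agg(daily_mean="mean", daily_sd="std", daily_min="min", daily_max="max")
)
print(yearly_stats.round(4))
```

Each row of the result is one year, so a jump in daily_sd from one row to the next marks a more volatile year.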
Scaling up to many stocks
The real workflow operates on a cross-section. We pull the current constituents of an index — the Dow Jones Industrial Average — then download prices for all of them at once by passing the vector of symbols. This is the pattern that generalizes to thousands of stocks.
# R: current constituents of the Dow Jones Industrial Average
symbols <- download_data(
  type = "constituents",
  index = "Dow Jones Industrial Average"
)
symbols
# Python: current constituents of the Dow Jones Industrial Average
symbols = tf.download_data(
    domain="constituents",
    index="Dow Jones Industrial Average"
)
# R: one call downloads prices for every constituent symbol
prices_daily <- download_data(
  type = "stock_prices",
  symbols = symbols$symbol,
  start_date = "2000-01-01",
  end_date = "2024-12-31"
)
# Python: pass the symbols as a plain list
prices_daily = tf.download_data(
    domain="stock_prices",
    symbols=symbols["symbol"].tolist(),
    start_date="2000-01-01",
    end_date="2023-12-31"
)
Because the data is tidy, the same plotting call handles the whole panel; mapping symbol to color draws one line per stock, and the legend is suppressed since the shape, not the names, is the point. Computing returns now groups by symbol so each stock's lag stays within its own series.
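# R: sort, then lag within each symbol so no return spans two stocks
returns_daily <- prices_daily |>
  arrange(symbol, date) |>
  group_by(symbol) |>
  mutate(ret = adjusted_close / lag(adjusted_close) - 1) |>
  ungroup() |>
  select(symbol, date, ret) |>
  drop_na(ret)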
# Python: sort, then take pct_change() within each symbol
returns_daily = (prices_daily
    .sort_values(["symbol", "date"])
    .assign(ret=lambda x: x.groupby("symbol")["adjusted_close"].pct_change())
    .get(["symbol", "date", "ret"])
    .dropna(subset=["ret"])
)
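Why the grouping matters shows up at the boundary between two stocks. The two-symbol panel below is hypothetical (the tickers AAA and BBB and all prices are invented); the sketch compares an ungrouped lag with a grouped one.

```python
import pandas as pd

# Hypothetical two-stock panel (not real data), sorted by symbol and date.
panel = pd.DataFrame({
    "symbol": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "date": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"] * 2),
    "adjusted_close": [10.0, 11.0, 12.1, 200.0, 190.0, 209.0],
})

# Wrong: an ungrouped pct_change() treats AAA's last price as BBB's previous price.
panel["ret_ungrouped"] = panel["adjusted_close"].pct_change()

# Right: grouping restarts the lag at the first row of every symbol.
panel["ret"] = panel.groupby("symbol")["adjusted_close"].pct_change()

print(panel)
```

In the ungrouped column, the first BBB row reports 200 / 12.1 - 1, a return of more than 1,500% that never happened. In the grouped column that row is missing, as it should be.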
Different frequencies
Daily data is often aggregated to a lower frequency. To get monthly returns, floor each date to the start of its month, then compound the daily returns within each month — the product of one-plus-returns, minus one. Comparing the daily and monthly return distributions for one stock shows how aggregation thins the tails.
# R: floor dates to the month start, then compound daily returns within each month
returns_monthly <- returns_daily |>
  mutate(date = floor_date(date, "month")) |>
  group_by(symbol, date) |>
  summarize(ret = prod(1 + ret) - 1, .groups = "drop")
# Python: to_period("M").dt.to_timestamp() maps each date to the first day of its month
returns_monthly = (returns_daily
    .assign(date=returns_daily["date"].dt.to_period("M").dt.to_timestamp())
    .groupby(["symbol", "date"], as_index=False)
    .agg(ret=("ret", lambda x: np.prod(1 + x) - 1))
)
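A short worked example shows why the monthly return is a product rather than a sum. The five daily returns below are hypothetical; the summed column is included only for comparison and is not part of the method above.

```python
import numpy as np
import pandas as pd

# Hypothetical daily returns for one stock across two months (not real data).
daily = pd.DataFrame({
    "symbol": ["AAA"] * 5,
    "date": pd.to_datetime(
        ["2024-01-29", "2024-01-30", "2024-01-31", "2024-02-01", "2024-02-02"]
    ),
    "ret": [0.01, -0.02, 0.03, 0.10, -0.10],
})

monthly = (daily
    .assign(date=lambda x: x["date"].dt.to_period("M").dt.to_timestamp())
    .groupby(["symbol", "date"], as_index=False)
    .agg(
        ret_compounded=("ret", lambda r: np.prod(1 + r) - 1),
        ret_summed=("ret", "sum"),  # shown only for comparison
    )
)
print(monthly.round(6))
```

For January, 1.01 * 0.98 * 1.03 - 1 is about 1.95%, slightly below the 2% sum. For February, a 10% gain followed by a 10% loss compounds to -1%, while the sum is zero.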
Other forms of aggregation
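The same split-apply-combine logic answers other questions. Aggregate daily dollar trading volume, for instance, is the sum of volume * adjusted_close across stocks for each day. The R code keeps the sum in dollars and rescales only the axis labels to billions, while the Python code divides by 1e9 before summing. Plotting each day's volume against the previous day's reveals strong persistence in market activity.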
# R: sum dollar volume across stocks per day; the axis labels are rescaled to billions
trading_volume <- prices_daily |>
  group_by(date) |>
  summarize(trading_volume = sum(volume * adjusted_close))
trading_volume |>
  ggplot(aes(x = date, y = trading_volume)) +
  geom_line() +
  labs(x = NULL, y = NULL,
       title = "Aggregate daily trading volume") +
  scale_y_continuous(labels = unit_format(unit = "B", scale = 1e-9))
# Python: convert to billions of dollars first, then sum across stocks per day
trading_volume = (prices_daily
    .assign(trading_volume=lambda x: (x["volume"] * x["adjusted_close"]) / 1e9)
    .groupby("date")["trading_volume"]
    .sum()
    .reset_index()
)
trading_volume_figure = (
    ggplot(trading_volume, aes(x="date", y="trading_volume"))
    + geom_line()
    + labs(x="", y="",
           title="Aggregate daily trading volume in billion USD")
)
trading_volume_figure.show()
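Both steps described above, the aggregation and the comparison of today's volume with yesterday's, can be sketched without a download. The first part uses a hypothetical two-stock panel to check the arithmetic. The second part simulates a persistent volume series (an AR(1) process in logs, an assumption made only so that the example is persistent by construction) and measures the persistence as the correlation with the one-day lag; the column name trading_volume_lag is illustrative.

```python
import numpy as np
import pandas as pd

# Part 1: hypothetical two-stock panel (not real data) to show the aggregation.
panel = pd.DataFrame({
    "symbol": ["AAA", "BBB", "AAA", "BBB"],
    "date": pd.to_datetime(["2024-01-02", "2024-01-02", "2024-01-03", "2024-01-03"]),
    "volume": [1_000, 500, 2_000, 100],
    "adjusted_close": [10.0, 200.0, 11.0, 190.0],
})
dollar_volume = (panel
    .assign(trading_volume=lambda x: x["volume"] * x["adjusted_close"])
    .groupby("date")["trading_volume"]
    .sum()
    .reset_index()
)
print(dollar_volume)

# Part 2: simulated persistent volume (assumption: AR(1) in logs with coefficient 0.9).
rng = np.random.default_rng(seed=2)
n_days = 2_000
shocks = rng.normal(0.0, 0.2, size=n_days)
log_volume = np.zeros(n_days)
for t in range(1, n_days):
    log_volume[t] = 0.9 * log_volume[t - 1] + shocks[t]

simulated = pd.DataFrame({"trading_volume": np.exp(log_volume)})
# shift(1) moves every value down one row, so each day sits next to the previous day's value.
simulated["trading_volume_lag"] = simulated["trading_volume"].shift(1)

# Series.corr() skips the first row, where the lag is missing.
persistence = simulated["trading_volume"].corr(simulated["trading_volume_lag"])
print(f"correlation with previous day: {persistence:.2f}")
```

In the first part, the daily totals are 1,000 * 10 + 500 * 200 = 110,000 and 2,000 * 11 + 100 * 190 = 41,000. Applying the shift-and-correlate step from the second part to the downloaded trading_volume series gives a one-number summary of the persistence.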
Study notes following the Tidy Finance curriculum by Scheuch, Voigt, Weiss, and Frey. Prose is my own; the R/Python code is reproduced from the book's open-source source, licensed CC BY-NC-SA 4.0.