Machine Learning (`wraquant.ml`)

Example

>>> import pandas as pd, numpy as np
>>> np.random.seed(0)
>>> returns = pd.Series(np.random.randn(100) * 0.01, name='ret')
>>> feats = rolling_features(returns, windows=(5, 21))
>>> feats.columns.tolist()[:3]
['mean_w5', 'std_w5', 'skew_w5']
>>> feats.shape[1]  # 6 stats * 2 windows
12

See also

return_features: Lagged and cumulative return features.
volatility_features: Realised volatility and vol-of-vol features.

return_features(prices, lags=(1, 2, 3, 5, 10, 21))

Compute lagged and cumulative return features from a price series.

Use return features as inputs to ML models predicting future returns or direction. Lagged returns capture momentum and mean-reversion signals at multiple horizons; cumulative returns capture trend strength.

Parameters:

prices (Series) – Price series (e.g. adjusted close).
lags (Sequence[int], default: (1, 2, 3, 5, 10, 21)) – Lag periods for returns.

Returns:

DataFrame with columns ret_lag{l} (log return l periods ago, a momentum/mean-reversion signal) and cum_ret_{l} (cumulative log return over the last l periods, a trend signal) for each lag l. Early rows are NaN.

Return type:

DataFrame

Example

>>> import pandas as pd, numpy as np
>>> prices = pd.Series([100, 101, 102, 100, 103, 105, 104],
...                     name='close')
>>> feats = return_features(prices, lags=(1, 3))
>>> list(feats.columns)
['ret_lag1', 'cum_ret_1', 'ret_lag3', 'cum_ret_3']
>>> feats['cum_ret_3'].iloc[-1] > 0  # cumulative 3-period return
True

See also

rolling_features: Rolling statistical features.
technical_features: Technical analysis features (RSI, MACD, etc.).

technical_features(high, low, close, volume=None)

Compute common technical analysis features for ML pipelines.

Use these features as inputs to ML models when you want to capture classic technical signals without depending on the full wraquant.ta module. Combines momentum (RSI, MACD), volatility (ATR, Bollinger), and optionally volume (OBV) into a single DataFrame.

Computes RSI, MACD histogram, Bollinger Band %B, and ATR. If volume is provided, On-Balance Volume (OBV) is also included.

Parameters:

high (Series) – High prices.
low (Series) – Low prices.
close (Series) – Close prices.
volume (Series | None, default: None) – Trade volume (optional). When provided, adds OBV which tracks cumulative buying/selling pressure.

Returns:

DataFrame with columns:

rsi: Relative Strength Index (0-100). Values above 70 indicate overbought; below 30 indicate oversold.
macd_hist: MACD histogram. Positive values indicate bullish momentum; negative values indicate bearish.
bb_pctb: Bollinger Band %B (typically in the 0-1 range). Values above 1 mean price is above the upper band; values below 0 mean it is below the lower band.
atr: Average True Range. Higher values indicate more volatile price action.
obv (optional): On-Balance Volume. Rising OBV confirms an uptrend.

Return type:

DataFrame

Example

>>> import pandas as pd, numpy as np
>>> np.random.seed(0)
>>> n = 100
>>> close = pd.Series(100 + np.cumsum(np.random.randn(n) * 0.5))
>>> high = close + np.abs(np.random.randn(n) * 0.3)
>>> low = close - np.abs(np.random.randn(n) * 0.3)
>>> feats = technical_features(high, low, close)
>>> list(feats.columns)
['rsi', 'macd_hist', 'bb_pctb', 'atr']

See also

return_features: Lagged and cumulative return features.
volatility_features: Realised volatility features.

ta_features(high, low, close, volume=None, include=None)

Generate ML features using wraquant’s full technical analysis library.

Unlike technical_features (which uses inline implementations), this function imports directly from wraquant.ta to leverage the full 263-indicator library. This bridges the ml and ta modules so that ML pipelines can access production-quality TA indicators without manual wiring.

By default, computes a curated set of the most ML-relevant indicators: RSI, MACD histogram, Bollinger Band %B, ATR, and optionally OBV. Use the include parameter to select additional indicators.

Parameters:

high (Series) – High prices.
low (Series) – Low prices.
close (Series) – Close prices.
volume (Series | None, default: None) – Trade volume (optional). Required for volume-based indicators (OBV, MFI).
include (Optional[Sequence[str]], default: None) – Subset of indicators to include. Options: 'rsi', 'macd', 'bbands', 'atr', 'obv'. If None, includes all available indicators.

Return type:

DataFrame

Returns:

DataFrame with one column per indicator, indexed like the input series. Column names are descriptive (e.g., ta_rsi, ta_macd_hist, ta_bb_pctb, ta_atr, ta_obv).

Example

>>> import pandas as pd, numpy as np
>>> np.random.seed(0)
>>> n = 100
>>> close = pd.Series(100 + np.cumsum(np.random.randn(n) * 0.5))
>>> high = close + np.abs(np.random.randn(n) * 0.3)
>>> low = close - np.abs(np.random.randn(n) * 0.3)
>>> feats = ta_features(high, low, close)
>>> 'ta_rsi' in feats.columns
True

See also

technical_features: Inline implementation (no ta/ dependency).
wraquant.ta.momentum.rsi: Full RSI implementation.
wraquant.ta.momentum.macd: Full MACD implementation.

volatility_features(returns, windows=(5, 10, 21, 63))

Compute realised-volatility-related features.

Use volatility features to capture the current risk environment and volatility regime. Realised volatility is often among the most informative features in financial ML models: volatility clusters (the GARCH effect), so past volatility predicts future volatility far better than past returns predict future returns.

Parameters:

returns (Series) – Log or simple return series.
windows (Sequence[int], default: (5, 10, 21, 63)) – Window sizes for rolling calculations.

Returns:

Columns:

realized_vol_w{w}: Annualised rolling standard deviation (sqrt(252) scaling). Interpretation: a value of 0.20 means ~20% annualised volatility.
vol_of_vol_w{w}: Rolling std of the rolling vol. High values indicate unstable volatility (vol-of-vol regime).
vol_ratio_w{w1}_w{w2}: Ratio of short-window vol to long-window vol. Values > 1 indicate vol is spiking (risk-off signal); values < 1 indicate vol compression.

Return type:

Example

>>> import pandas as pd, numpy as np
>>> np.random.seed(0)
>>> rets = pd.Series(np.random.randn(200) * 0.01, name='daily_ret')
>>> feats = volatility_features(rets, windows=(5, 21))
>>> 'realized_vol_w5' in feats.columns
True
>>> 'vol_ratio_w5_w21' in feats.columns
True
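
The annualisation and ratio logic described under Returns can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper names `realized_vol` and `vol_ratio` are hypothetical, lists stand in for pandas Series, `None` stands in for NaN, and the use of the sample standard deviation is an assumption.

```python
import math
from statistics import stdev

def realized_vol(returns, window, periods_per_year=252):
    """Annualised rolling sample std of returns (hypothetical helper)."""
    out = []
    for t in range(len(returns)):
        if t + 1 < window:
            out.append(None)  # window not yet full
        else:
            chunk = returns[t + 1 - window : t + 1]
            out.append(stdev(chunk) * math.sqrt(periods_per_year))
    return out

def vol_ratio(short_vol, long_vol):
    """Element-wise short-window vol divided by long-window vol."""
    return [
        s / l if s is not None and l is not None else None
        for s, l in zip(short_vol, long_vol)
    ]

# Four quiet days followed by two large moves (made-up numbers).
rets = [0.001, -0.001, 0.001, -0.001, 0.02, -0.02]
short = realized_vol(rets, window=2)
long_ = realized_vol(rets, window=4)
ratio = vol_ratio(short, long_)
```

On the last bar the two-day window contains only the large moves while the four-day window still includes quiet days, so the ratio exceeds 1, which is the "vol is spiking" reading described above.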

See also

rolling_features: General rolling statistical features.
wraquant.vol: Full volatility modelling (GARCH, stochastic vol).

microstructure_features(high, low, close, volume)

Compute market-microstructure features.

Use microstructure features to capture liquidity conditions, information asymmetry, and trading activity. These are particularly valuable for short-horizon alpha models and execution-aware strategies where liquidity predicts future returns or trading costs.

Parameters:

high (Series) – High prices.
low (Series) – Low prices.
close (Series) – Close prices.
volume (Series) – Trade volume.

Returns:

Columns:

amihud_illiq: Amihud illiquidity ratio (21-day rolling mean of |return| / dollar_volume). Higher values indicate less liquid, more price-impactful markets.
kyle_lambda: Kyle’s lambda (21-day rolling OLS slope of |price change| on signed sqrt-volume). Measures the price impact per unit of informed flow. Higher values suggest more information asymmetry.
log_volume: Natural log of volume. Smooths the skewed volume distribution for ML model consumption.
volume_ma_ratio: Current volume / 21-day moving average. Values > 1 indicate above-average activity (potential event).
dollar_volume: Price * volume. Absolute measure of trading activity and liquidity.

Return type:

Example

>>> import pandas as pd, numpy as np
>>> np.random.seed(0)
>>> n = 100
>>> close = pd.Series(100 + np.cumsum(np.random.randn(n) * 0.5))
>>> high = close + np.abs(np.random.randn(n) * 0.3)
>>> low = close - np.abs(np.random.randn(n) * 0.3)
>>> volume = pd.Series(np.random.randint(1_000_000, 5_000_000, n))
>>> feats = microstructure_features(high, low, close, volume)
>>> list(feats.columns)
['amihud_illiq', 'kyle_lambda', 'log_volume', 'volume_ma_ratio', 'dollar_volume']

References

Amihud (2002), “Illiquidity and stock returns”
Kyle (1985), “Continuous Auctions and Insider Trading”

See also

technical_features: Price-based technical indicators.

label_fixed_horizon(returns, horizon=5, threshold=0.0)

Label future return direction over a fixed horizon.

Use fixed-horizon labelling as the simplest way to create supervised learning targets for directional prediction. Each observation is labelled based on the cumulative return over the next horizon periods. This is the standard approach for “will the price go up or down over the next N days?” classification.

Parameters:

returns (Series) – Period (e.g. daily) returns.
horizon (int, default: 5) – Number of periods to accumulate forward returns (default 5, i.e. one trading week).
threshold (float, default: 0.0) – If threshold > 0, three labels are produced: 1 (up beyond threshold), 0 (flat), -1 (down beyond threshold). If threshold == 0, binary labels (1 / 0) are produced where 1 means positive cumulative return.

Returns:

Integer labels aligned to the original index. The last horizon rows will be NaN (no future data available).

Return type:

Example

>>> import pandas as pd, numpy as np
>>> rets = pd.Series([0.01, -0.005, 0.02, 0.01, -0.03, 0.015, 0.005])
>>> labels = label_fixed_horizon(rets, horizon=3, threshold=0.0)
>>> labels.iloc[0]  # sum of rets[1:4] = -0.005+0.02+0.01 > 0
1
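
As a rough illustration of the labelling rule, the sketch below reproduces the binary and three-class cases in plain Python. The function name `fixed_horizon_labels` is hypothetical, lists stand in for pandas Series, and `None` stands in for the NaN rows; it is not the library's code.

```python
def fixed_horizon_labels(returns, horizon=5, threshold=0.0):
    """Label each bar by the sum of the next `horizon` returns (hypothetical helper)."""
    labels = []
    for i in range(len(returns)):
        future = returns[i + 1 : i + 1 + horizon]
        if len(future) < horizon:
            labels.append(None)  # not enough future data for a full horizon
            continue
        fwd = sum(future)
        if threshold > 0:
            # Three classes: up beyond threshold, down beyond threshold, else flat.
            labels.append(1 if fwd > threshold else -1 if fwd < -threshold else 0)
        else:
            # Binary: 1 for a positive cumulative return, else 0.
            labels.append(1 if fwd > 0 else 0)
    return labels

rets = [0.01, -0.005, 0.02, 0.01, -0.03, 0.015, 0.005]
binary = fixed_horizon_labels(rets, horizon=3)
ternary = fixed_horizon_labels(rets, horizon=3, threshold=0.02)
```

With `horizon=3` the last three entries have no complete forward window and stay unlabelled, matching the NaN behaviour described under Returns.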

Notes

Fixed-horizon labelling does not adapt to volatility. In high-vol regimes, the threshold is hit more often; in low-vol regimes, most labels become 0. For volatility-adaptive labels, use label_triple_barrier.

See also

label_triple_barrier: Volatility-adaptive labelling (Lopez de Prado).

label_triple_barrier(close, upper=None, lower=None, max_holding=10)

Triple-barrier labelling (Lopez de Prado).

Use triple-barrier labelling when you want targets that adapt to market conditions. Unlike fixed-horizon labels, this method defines a profit-taking barrier (upper), a stop-loss barrier (lower), and a maximum holding period (vertical). Whichever barrier is hit first determines the label. This produces cleaner labels in volatile markets because the barriers can be scaled by volatility.

For each bar the method sets three barriers:

Upper: price rises by upper fraction -> label = 1
Lower: price falls by lower fraction -> label = -1
Vertical: max_holding bars elapse -> label = sign of return

If upper or lower is None the corresponding horizontal barrier is disabled.

Parameters:

close (Series) – Close price series.
upper (float | None, default: None) – Fractional distance for the upper barrier (e.g. 0.02 for 2 %).
lower (float | None, default: None) – Fractional distance for the lower barrier (positive value; e.g. 0.02 for -2 %).
max_holding (int, default: 10) – Maximum holding period in bars (vertical barrier).

Returns:

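Integer labels in {-1, 0, 1} aligned to the input index. 1 = profit-taking barrier hit first, or vertical barrier reached with a positive return (bullish); -1 = stop-loss barrier hit first, or vertical barrier reached with a negative return (bearish); 0 = vertical barrier reached with exactly zero return. The last max_holding entries may be NaN.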

Return type:

Example

>>> import pandas as pd
>>> close = pd.Series([100, 101, 102, 103, 100, 97, 98, 99, 100, 101])
>>> labels = label_triple_barrier(close, upper=0.03, lower=0.03, max_holding=5)
>>> labels.iloc[0]  # price rises 3% by bar 3 (103/100 - 1 = 0.03)
1
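
The three-barrier rule can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: `triple_barrier_labels` is a hypothetical name, lists replace pandas Series, and bars whose holding window runs past the end of the data are left as `None` unless a horizontal barrier is touched first.

```python
def triple_barrier_labels(close, upper=None, lower=None, max_holding=10):
    """Label each bar by the first barrier touched (hypothetical helper)."""
    n = len(close)
    labels = []
    for i in range(n):
        end = min(i + max_holding, n - 1)
        label = None
        for j in range(i + 1, end + 1):
            ret = close[j] / close[i] - 1.0
            if upper is not None and ret >= upper:
                label = 1   # profit-taking barrier touched first
                break
            if lower is not None and ret <= -lower:
                label = -1  # stop-loss barrier touched first
                break
        if label is None and i + max_holding <= n - 1:
            # Vertical barrier: sign of the return at the end of the holding period.
            ret = close[end] / close[i] - 1.0
            label = (ret > 0) - (ret < 0)
        labels.append(label)  # stays None if the window is incomplete and nothing was hit
    return labels

close = [100, 101, 102, 103, 100, 97, 98, 99, 100, 101]
labels = triple_barrier_labels(close, upper=0.03, lower=0.03, max_holding=5)
```

Bar 0 reaches +3 % at bar 3 and is labelled 1, while bar 3 (entry at 103) falls more than 3 % by bar 5 and is labelled -1.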

Notes

In practice, set upper and lower proportional to recent volatility (e.g., upper = lower = daily_vol * sqrt(max_holding)). This makes the labels regime-adaptive.

References

Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 3

See also

label_fixed_horizon: Simpler fixed-horizon labelling.

interaction_features(data, columns=None)

Create pairwise interaction terms between features.

Use interaction features when you suspect that predictive power lies in the combination of features rather than individual signals. For example, momentum * volatility captures whether momentum is occurring in a high- or low-volatility environment, which may predict returns differently.

For each pair of selected columns (A, B), computes:

A_x_B: element-wise product (captures multiplicative relationships)
A_div_B: element-wise ratio A / B (captures relative magnitudes)

Parameters:

data (DataFrame) – Feature DataFrame.
columns (Optional[Sequence[str]], default: None) – Columns to use for interaction terms. If None, all columns are used.

Returns:

DataFrame containing all pairwise interaction features, with column names like col1_x_col2 and col1_div_col2.

Return type:

DataFrame

Example

>>> import pandas as pd, numpy as np
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
>>> result = interaction_features(df, columns=['a', 'b'])
>>> 'a_x_b' in result.columns
True
>>> 'a_div_b' in result.columns
True
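
To make the naming scheme concrete, here is a minimal dependency-free sketch of the pairwise product and ratio terms. The name `interaction_sketch` is hypothetical, a dict of lists stands in for a DataFrame, and returning NaN on division by zero is an assumption rather than documented behaviour.

```python
from itertools import combinations

def interaction_sketch(data, columns=None):
    """Pairwise product and ratio terms for a dict of equal-length lists."""
    columns = list(data) if columns is None else list(columns)
    out = {}
    for a, b in combinations(columns, 2):
        out[f"{a}_x_{b}"] = [x * y for x, y in zip(data[a], data[b])]
        out[f"{a}_div_{b}"] = [
            x / y if y != 0 else float("nan")  # assumed handling of zero denominators
            for x, y in zip(data[a], data[b])
        ]
    return out

df = {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]}
pair = interaction_sketch(df, columns=["a", "b"])
everything = interaction_sketch(df)
```

Note that the number of output columns grows quadratically: three input columns give three pairs and therefore six interaction features.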

cross_asset_features(asset, benchmark, windows=(10, 21, 63))

Compute cross-asset relationship features.

Use cross-asset features to capture how an asset co-moves with a benchmark or related instrument. Rolling correlation and beta detect changing exposures (useful for regime detection); relative strength identifies momentum divergence between the asset and its benchmark.

Given an asset return series and a benchmark (or related asset) return series, computes rolling correlation, rolling beta, and relative strength for each window.

Parameters:

asset (Series) – Return series for the asset of interest.
benchmark (Series) – Return series for the benchmark or related asset.
windows (Sequence[int], default: (10, 21, 63)) – Rolling window sizes for correlation and beta calculations.

Returns:

DataFrame with columns:

rolling_corr_w{w}: rolling Pearson correlation
rolling_beta_w{w}: rolling OLS beta (cov / var of benchmark)
relative_strength_w{w}: cumulative return ratio (asset / benchmark) over the window

Return type:

DataFrame

Example

>>> import pandas as pd, numpy as np
>>> np.random.seed(0)
>>> asset = pd.Series(np.random.randn(200) * 0.01, name='asset')
>>> bench = pd.Series(np.random.randn(200) * 0.01, name='bench')
>>> result = cross_asset_features(asset, bench, windows=[10, 21])
>>> 'rolling_corr_w10' in result.columns
True
>>> 'rolling_beta_w21' in result.columns
True
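
The correlation and beta columns follow the usual definitions (beta = covariance with the benchmark divided by benchmark variance). The sketch below computes both over a rolling window in plain Python; `rolling_corr_beta` is a hypothetical helper, lists replace Series, `None` marks incomplete windows, and a window with zero variance would raise a division error here.

```python
import math

def rolling_corr_beta(asset, bench, window):
    """Rolling Pearson correlation and OLS beta (hypothetical helper)."""
    corrs, betas = [], []
    for t in range(len(asset)):
        if t + 1 < window:
            corrs.append(None)  # window not yet full
            betas.append(None)
            continue
        a = asset[t + 1 - window : t + 1]
        b = bench[t + 1 - window : t + 1]
        mean_a, mean_b = sum(a) / window, sum(b) / window
        cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
        var_a = sum((x - mean_a) ** 2 for x in a)
        var_b = sum((y - mean_b) ** 2 for y in b)
        corrs.append(cov / math.sqrt(var_a * var_b))
        betas.append(cov / var_b)  # beta = cov(asset, bench) / var(bench)
    return corrs, betas

bench = [0.01, -0.02, 0.015, 0.005, -0.01, 0.02]
asset = [2 * r for r in bench]  # asset moves exactly twice as much as the benchmark
corrs, betas = rolling_corr_beta(asset, bench, window=4)
```

Because the synthetic asset is an exact multiple of the benchmark, every full window yields a correlation of 1 and a beta of 2 (up to floating-point error).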

regime_features(regime_probabilities, regime_labels=None)

Create features from regime probabilities or labels.

Use regime features when you have upstream regime detection (e.g., HMM, Markov-switching) and want to feed regime state into downstream ML models. Regime duration and transition probability are predictive because regimes tend to persist (duration) but eventually break down (transition probability rises before a switch).

Given regime probabilities (e.g., from an HMM or Markov-switching model), constructs features useful for downstream ML models: current regime identity, regime duration (how many consecutive periods in the current regime), and estimated transition probability (rolling mean of regime changes).

Parameters:

regime_probabilities (DataFrame) – DataFrame where each column is the probability of a regime (e.g., columns ['bull', 'bear'] with probabilities summing to 1).
regime_labels (Series | None, default: None) – Hard regime labels. If None, the most probable regime at each step is used (argmax of the probability columns).

Returns:

DataFrame with columns:

current_regime: integer label of the current regime
regime_duration: number of consecutive periods in the current regime
regime_change: binary indicator (1 if regime changed)
transition_prob_w{w}: rolling mean of regime changes for w in [5, 10, 21]
one column per regime probability from the input

Return type:

DataFrame

Example

>>> import pandas as pd, numpy as np
>>> np.random.seed(42)
>>> p = np.random.dirichlet([5, 2], size=100)  # one draw, so each row sums to 1
>>> probs = pd.DataFrame({'bull': p[:, 0], 'bear': p[:, 1]})
>>> result = regime_features(probs)
>>> 'current_regime' in result.columns
True
>>> 'regime_duration' in result.columns
True
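
How the duration, change, and transition columns relate to each other is easiest to see on a tiny input. The following sketch is an assumption-laden illustration: `regime_state_features` is a hypothetical name, nested lists replace the probability DataFrame, and leaving the rolling mean as `None` until the window is full is a guess at the library's behaviour.

```python
def regime_state_features(probabilities, window=5):
    """Derive label, change, duration and rolling change rate (hypothetical helper)."""
    # Hard label = index of the most probable regime (argmax over each row).
    labels = [max(range(len(p)), key=p.__getitem__) for p in probabilities]
    change, duration, transition = [], [], []
    for t, label in enumerate(labels):
        changed = int(t > 0 and label != labels[t - 1])
        change.append(changed)
        # Duration restarts at 1 whenever the regime switches.
        duration.append(1 if t == 0 or changed else duration[-1] + 1)
        if t + 1 < window:
            transition.append(None)  # rolling window not yet full
        else:
            transition.append(sum(change[t + 1 - window : t + 1]) / window)
    return labels, change, duration, transition

probs = [[0.8, 0.2], [0.7, 0.3], [0.4, 0.6], [0.3, 0.7], [0.1, 0.9], [0.6, 0.4]]
labels, change, duration, transition = regime_state_features(probs, window=3)
```

Here the regime sequence is 0, 0, 1, 1, 1, 0, so the duration counter runs 1, 2, 1, 2, 3, 1 and the change flag fires at the third and sixth observations.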

purged_kfold(X, y, n_splits=5, embargo_pct=0.01)

Purged K-Fold cross-validation.

Use purged K-fold instead of standard K-fold whenever your labels overlap in time (e.g., forward returns computed over a window). Standard K-fold leaks future information because a training sample’s label may depend on prices that appear in the test set. Purging removes an embargo zone after each test fold to break this leakage.

Ensures that training observations that immediately follow a test observation are removed (embargo) so that information cannot leak through overlapping labels.

Parameters:

X (DataFrame | ndarray) – Feature matrix (only its length is used).
y (Series | ndarray) – Target vector (only its length is used).
n_splits (int, default: 5) – Number of folds.
embargo_pct (float, default: 0.01) – Fraction of total samples to embargo after each test fold. For daily data with 5-day forward labels, 0.01 embargoes ~2.5 days on a 252-sample dataset.

Yields:

tuple[np.ndarray, np.ndarray] – (train_indices, test_indices) for each fold.

Return type:

Generator[tuple[ndarray, ndarray], None, None]

Example

>>> import numpy as np
>>> X = np.random.randn(500, 3)
>>> y = np.random.randn(500)
>>> folds = list(purged_kfold(X, y, n_splits=5, embargo_pct=0.02))
>>> len(folds)
5
>>> train_idx, test_idx = folds[0]
>>> len(train_idx) + len(test_idx) < 500  # embargo removes some samples
True
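
The index arithmetic behind the embargo can be sketched without any dependencies. This follows the description above (contiguous test folds, with the samples immediately after each test fold dropped from training); the name `purged_kfold_indices` and the choice to let the last fold absorb any remainder are assumptions, not the library's code.

```python
def purged_kfold_indices(n_samples, n_splits=5, embargo_pct=0.01):
    """Yield (train, test) index lists with an embargo after each test fold."""
    embargo = int(n_samples * embargo_pct)
    fold_size = n_samples // n_splits
    for k in range(n_splits):
        start = k * fold_size
        # The last fold absorbs any remainder (assumed convention).
        stop = n_samples if k == n_splits - 1 else start + fold_size
        test = list(range(start, stop))
        # Exclude the test fold plus the `embargo` samples right after it.
        excluded = set(range(start, min(stop + embargo, n_samples)))
        train = [i for i in range(n_samples) if i not in excluded]
        yield train, test

folds = list(purged_kfold_indices(500, n_splits=5, embargo_pct=0.02))
first_train, first_test = folds[0]
last_train, last_test = folds[-1]
```

With 500 samples and a 2 % embargo, the first fold tests on indices 0-99, skips 100-109, and trains on 110-499. The final fold has nothing after it to embargo, so its train and test sets together cover all 500 samples.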

References

Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 7

See also

combinatorial_purged_kfold: Generates all C(n, k) purged splits.
wraquant.ml.pipeline.FinancialPipeline: Pipeline that uses purged K-fold.

combinatorial_purged_kfold(X, y, n_splits=5, n_test_splits=2, embargo_pct=0.01)

Combinatorial purged K-Fold cross-validation.

Use combinatorial purged K-fold when you need more backtest paths than standard purged K-fold provides. By choosing n_test_splits groups as the test set from n_splits total groups, this generates C(n_splits, n_test_splits) distinct train/test splits – each with an embargo to prevent information leakage.

Generates all C(n_splits, n_test_splits) train/test combinations, applying an embargo after each test group to prevent leakage.

Parameters:

X (DataFrame | ndarray) – Feature matrix.
y (Series | ndarray) – Target vector.
n_splits (int, default: 5) – Total number of groups.
n_test_splits (int, default: 2) – Number of groups held out for testing in each split.
embargo_pct (float, default: 0.01) – Fraction of total samples to embargo after each test group.

Yields:

tuple[np.ndarray, np.ndarray] – (train_indices, test_indices) for each combination.

Return type:

Generator[tuple[ndarray, ndarray], None, None]

Example

>>> import numpy as np
>>> X = np.random.randn(500, 3)
>>> y = np.random.randn(500)
>>> folds = list(combinatorial_purged_kfold(X, y, n_splits=5, n_test_splits=2))
>>> len(folds)  # C(5, 2) = 10
10
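
A dependency-free sketch of the combinatorial variant is shown below. It enumerates every choice of test groups with `itertools.combinations` and applies an embargo after each chosen group. The function name `combinatorial_purged_indices` and the remainder handling are illustrative assumptions.

```python
from itertools import combinations

def combinatorial_purged_indices(n_samples, n_splits=5, n_test_splits=2,
                                 embargo_pct=0.01):
    """Yield (train, test) index lists for every combination of test groups."""
    embargo = int(n_samples * embargo_pct)
    fold_size = n_samples // n_splits
    bounds = [
        (k * fold_size, n_samples if k == n_splits - 1 else (k + 1) * fold_size)
        for k in range(n_splits)
    ]
    for groups in combinations(range(n_splits), n_test_splits):
        test, excluded = [], set()
        for g in groups:
            start, stop = bounds[g]
            test.extend(range(start, stop))
            # Exclude the group itself plus the embargo zone that follows it.
            excluded.update(range(start, min(stop + embargo, n_samples)))
        train = [i for i in range(n_samples) if i not in excluded]
        yield train, test

splits = list(combinatorial_purged_indices(500, n_splits=5, n_test_splits=2))
```

Adjacent test groups share a single embargo zone (the one after the later group), while non-adjacent groups each cost their own, so training-set sizes differ slightly between combinations.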

References

Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 12

See also

purged_kfold: Simpler purged K-fold with n_splits folds.

fractional_differentiation(series, d=0.5, threshold=1e-05)

Fractionally differentiate a time series.

Use fractional differentiation to make a price or factor series stationary (required by many ML models) while retaining as much memory (long-range dependence) as possible. Standard first differencing (d=1) makes the series stationary but destroys all memory. Fractional differencing with d=0.3-0.5 achieves stationarity while preserving most of the signal.

Applies the fractional differentiation operator of order d (Hosking, 1981) to obtain a (near-)stationary series while preserving long-range memory.

The operator is defined as:

(1 - B)^d = sum_{k=0}^{inf} C(d,k) * (-B)^k

where B is the backshift operator and C(d,k) are the binomial-like weights.

Parameters:

series (Series) – Input time series (e.g., log prices).
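d (float, default: 0.5) – Fractional differentiation order (0 < d < 1 for partial differentiation; d = 1 is the standard first difference). Start with d=0.5: if the ADF test rejects a unit root at the desired significance level, try smaller values to retain more memory; if it does not, increase d. The target is the smallest d for which the test still rejects.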
threshold (float, default: 1e-05) – Minimum absolute weight to retain. Smaller values use more lagged observations but increase computational cost.

Returns:

Fractionally differentiated series (initial rows where the full convolution is not available are dropped). Test stationarity with an ADF test; if non-stationary, increase d.

Return type:

Example

>>> import pandas as pd, numpy as np
>>> np.random.seed(42)
>>> prices = pd.Series(100 + np.cumsum(np.random.randn(300) * 0.5),
...                     name='close')
>>> frac_diff = fractional_differentiation(prices, d=0.4)
>>> len(frac_diff) < len(prices)  # initial rows dropped
True
>>> frac_diff.std() > 0  # non-trivial output
True
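
The weights in the expansion of (1 - B)^d satisfy the recursion w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k, which is the standard way to compute them. The sketch below applies that recursion with a weight cut-off and then convolves the weights with the series. The helper names are hypothetical and the truncation rule is only assumed to mirror the `threshold` parameter.

```python
def frac_diff_weights(d, threshold=1e-5):
    """Weights of (1 - B)^d, truncated once |w_k| drops below `threshold`."""
    weights = [1.0]
    k = 1
    while True:
        w = -weights[-1] * (d - k + 1) / k
        if abs(w) < threshold:
            break
        weights.append(w)
        k += 1
    return weights

def frac_diff(series, d, threshold=1e-5):
    """Apply the truncated weights; rows without a full window are dropped."""
    w = frac_diff_weights(d, threshold)
    width = len(w)
    return [
        sum(w[k] * series[t - k] for k in range(width))
        for t in range(width - 1, len(series))
    ]

w_half = frac_diff_weights(0.5, threshold=1e-3)  # 1, -0.5, -0.125, ...
w_one = frac_diff_weights(1.0)                   # collapses to the first difference
first_diff = frac_diff([1, 3, 6, 10], d=1.0)
```

For d = 1 the recursion terminates after two weights (1 and -1), recovering ordinary differencing. For fractional d the weights decay slowly, which is why a smaller `threshold` keeps more lags and drops more initial rows.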

References

Hosking (1981), “Fractional Differencing”
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 5

See also

denoised_correlation: Random Matrix Theory denoising.

denoised_correlation(returns, n_components=None)

Denoise a correlation matrix using Random Matrix Theory.

Use denoised correlation before portfolio optimization or clustering to remove noise eigenvalues that arise from finite-sample estimation. When T/N (observations/assets) is not large, the sample correlation matrix contains substantial noise. RMT denoising replaces eigenvalues consistent with random noise (Marchenko-Pastur distribution) with their average, producing a cleaner matrix that leads to more stable portfolio weights.

Eigenvalues that fall within the Marchenko-Pastur distribution are replaced by their average, shrinking noise while preserving signal.

Parameters:

returns (DataFrame) – T x N return matrix (rows = observations, columns = assets).
n_components (int | None, default: None) – Number of signal eigenvalues to keep. If None, they are determined automatically from the Marchenko-Pastur bound.

Returns:

Denoised correlation matrix of shape (N, N). The matrix is symmetric, positive semi-definite, and has unit diagonal. Use it in place of returns.corr() for portfolio optimization.

Return type:

Example

>>> import pandas as pd, numpy as np
>>> np.random.seed(42)
>>> returns = pd.DataFrame(np.random.randn(252, 10) * 0.01)
>>> clean_corr = denoised_correlation(returns)
>>> clean_corr.shape
(10, 10)
>>> np.allclose(np.diag(clean_corr), 1.0)  # unit diagonal
True

Notes

The Marchenko-Pastur upper bound is:

lambda_+ = sigma^2 * (1 + sqrt(N/T))^2

Eigenvalues above this threshold are retained as “signal”; those below are replaced.
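
As a worked illustration of the bound, the sketch below evaluates lambda_+ for T = 252 and N = 10 with sigma^2 = 1 and applies the "replace noise eigenvalues by their average" step to a made-up spectrum. The eigenvalues are invented for the example (chosen to sum to N), and both helper names are hypothetical. A full implementation would also rebuild the matrix from the eigenvectors and rescale it to a unit diagonal.

```python
import math

def marchenko_pastur_upper(n_assets, n_obs, sigma2=1.0):
    """Upper edge of the Marchenko-Pastur distribution."""
    return sigma2 * (1.0 + math.sqrt(n_assets / n_obs)) ** 2

def denoise_eigenvalues(eigenvalues, n_assets, n_obs):
    """Keep eigenvalues above the bound; replace the rest by their average."""
    bound = marchenko_pastur_upper(n_assets, n_obs)
    noise = [ev for ev in eigenvalues if ev <= bound]
    avg = sum(noise) / len(noise) if noise else 0.0
    return [ev if ev > bound else avg for ev in eigenvalues]

bound = marchenko_pastur_upper(n_assets=10, n_obs=252)
# Illustrative spectrum only, not computed from real data.
eigenvalues = [3.9, 1.7, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
cleaned = denoise_eigenvalues(eigenvalues, n_assets=10, n_obs=252)
```

Averaging the noise eigenvalues leaves their sum unchanged, so the trace of the correlation matrix is preserved by this step.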

References

Laloux et al. (1999), “Noise dressing of financial correlation matrices”
Lopez de Prado (2020), “Machine Learning for Asset Managers”, Ch. 2

See also

detoned_correlation: Remove the market mode from a correlation matrix.

detoned_correlation(corr, n_components=1)

Remove the first n_components eigenvectors (market mode) from a correlation matrix.

Use detoned correlation when you want to uncover residual co-movement structure after removing the dominant market factor. The first eigenvector of asset returns typically represents the “market mode” (all assets moving together). Removing it reveals sector, style, or idiosyncratic clustering that is hidden when the market factor dominates. This is particularly useful before hierarchical clustering or community detection.

Parameters:

corr (ndarray) – Correlation matrix of shape (N, N).
n_components (int, default: 1) – Number of leading eigenvalues/vectors to remove (default 1, which removes only the market factor).

Returns:

De-toned correlation matrix of shape (N, N). The matrix is symmetric with unit diagonal but is not positive definite (some eigenvalues are set to zero).

Return type:

Example

>>> import numpy as np
>>> np.random.seed(42)
>>> corr = np.corrcoef(np.random.randn(5, 252))
>>> detoned = detoned_correlation(corr, n_components=1)
>>> detoned.shape
(5, 5)
>>> np.allclose(np.diag(detoned), 1.0)
True

References

Lopez de Prado (2020), “Machine Learning for Asset Managers”, Ch. 2

See also

denoised_correlation: Remove noise eigenvalues from a correlation matrix.
wraquant.ml.clustering.correlation_clustering: Cluster assets by correlation.

walk_forward_train(model, X, y, train_size=252, test_size=21, step_size=21)

Walk-forward analysis with an expanding training window.

Use walk-forward analysis to evaluate a model under realistic conditions where only past data is available for training at each step. This is the standard time-series cross-validation approach in quantitative finance, avoiding the look-ahead bias inherent in random K-fold splits.

At each step the model is cloned (via scikit-learn’s clone), fitted on the training window, and used to predict the test window.

Parameters:

model (Any) – A scikit-learn-compatible estimator that implements fit and predict.
X (DataFrame | ndarray) – Feature matrix.
y (Series | ndarray) – Target vector.
train_size (int, default: 252) – Number of training observations in the first window (default 252, approximately one trading year).
test_size (int, default: 21) – Number of test observations per fold (default 21, approximately one trading month).
step_size (int, default: 21) – Number of observations to step forward between folds.

Returns:

Dictionary with keys:

predictions (np.ndarray): Concatenated out-of-sample predictions across all folds.
actuals (np.ndarray): Corresponding true values. Compare with predictions to measure forecast accuracy.
test_indices (np.ndarray): Original row indices for each prediction, useful for aligning results back to a DatetimeIndex.
n_folds (int): Number of walk-forward folds executed.

Return type:

Example

>>> from sklearn.linear_model import Ridge
>>> import numpy as np, pandas as pd
>>> np.random.seed(42)
>>> X = pd.DataFrame(np.random.randn(500, 3), columns=['mom', 'vol', 'size'])
>>> y = X['mom'] * 0.5 + np.random.randn(500) * 0.1
>>> result = walk_forward_train(Ridge(), X, y, train_size=252, test_size=21)
>>> result['n_folds'] > 0
True
>>> len(result['predictions']) == len(result['actuals'])
True
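
The fold boundaries of an expanding-window walk-forward scheme can be sketched as follows. `expanding_window_folds` is a hypothetical helper, and dropping an incomplete final test window is an assumption; the library may treat the tail differently.

```python
def expanding_window_folds(n_samples, train_size=252, test_size=21, step_size=21):
    """Return (train_range, test_range) pairs; training always starts at index 0."""
    folds = []
    train_end = train_size
    while train_end + test_size <= n_samples:
        train = range(0, train_end)                     # expanding window
        test = range(train_end, train_end + test_size)  # strictly after training data
        folds.append((train, test))
        train_end += step_size
    return folds

folds = expanding_window_folds(500, train_size=252, test_size=21, step_size=21)
first_train, first_test = folds[0]
last_train, last_test = folds[-1]
```

With 500 observations and the default sizes this yields 11 folds: the first trains on indices 0-251 and tests on 252-272, and each later fold extends the training window by 21 observations.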

Notes

The window is expanding (all data from the start up to the current train end is used). For a rolling window, see wraquant.ml.pipeline.walk_forward_backtest which supports both modes.

See also

wraquant.ml.pipeline.walk_forward_backtest: Full walk-forward backtest with PnL.
wraquant.ml.preprocessing.purged_kfold: Purged K-fold cross-validation.

ensemble_predict(models, X, method='mean')

Generate ensemble predictions from multiple fitted models.

Use ensemble prediction to combine several models (e.g., Ridge, Random Forest, Gradient Boosting) into a single, more robust forecast. Ensembles reduce variance and are standard practice in alpha research and competition-winning pipelines.

Parameters:

models (Sequence[Any]) – Fitted scikit-learn-compatible estimators. Each must implement predict(X).
X (DataFrame | ndarray) – Feature matrix.
method (Literal['mean', 'median', 'vote'], default: 'mean') – Aggregation method. 'mean' and 'median' average the raw predictions (best for regression); 'vote' takes the mode (majority vote, best for classification).

Returns:

Aggregated predictions. For 'mean'/'median', the values are continuous. For 'vote', the values are discrete class labels.

Return type:

Example

>>> from sklearn.linear_model import Ridge, Lasso
>>> import numpy as np
>>> np.random.seed(0)
>>> X_train = np.random.randn(200, 3)
>>> y_train = X_train @ [1, 0.5, 0] + np.random.randn(200) * 0.1
>>> m1 = Ridge().fit(X_train, y_train)
>>> m2 = Lasso(alpha=0.01).fit(X_train, y_train)
>>> X_test = np.random.randn(50, 3)
>>> preds = ensemble_predict([m1, m2], X_test, method='mean')
>>> preds.shape
(50,)
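
The three aggregation methods reduce to a column-wise mean, median, or mode over the per-model predictions. The sketch below shows this with the standard library; `aggregate_predictions` is a hypothetical name and each inner list stands in for one model's `predict(X)` output.

```python
from statistics import mean, median, mode

def aggregate_predictions(per_model, method="mean"):
    """Combine per-model prediction lists sample by sample."""
    agg = {"mean": mean, "median": median, "vote": mode}[method]
    # zip(*per_model) groups the predictions that refer to the same sample.
    return [agg(column) for column in zip(*per_model)]

regression_preds = [[1.0, 2.0], [3.0, 4.0], [8.0, 0.0]]
avg = aggregate_predictions(regression_preds, method="mean")
med = aggregate_predictions(regression_preds, method="median")

class_preds = [[1, 0, 1], [1, 1, 0], [0, 1, 1]]
vote = aggregate_predictions(class_preds, method="vote")
```

The median is less sensitive than the mean to a single outlying model (the third model's 8.0 above). For voting, an odd number of models avoids ties in binary classification.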

See also

walk_forward_train: Walk-forward evaluation for individual models.

feature_importance_mdi(model, feature_names)

Mean Decrease Impurity (MDI) feature importance.

Use MDI as a fast, first-pass feature ranking after fitting a tree-based model. MDI measures how much each feature contributes to reducing node impurity (Gini for classification, variance for regression) across all trees.

Reads model.feature_importances_ (available on tree-based estimators after fitting) and returns a sorted pd.Series.

Parameters:

model (Any) – A fitted tree-based estimator with a feature_importances_ attribute (e.g. RandomForestClassifier).
feature_names (Sequence[str]) – Feature names corresponding to the columns of the training data.

Returns:

Importance values indexed by feature name, sorted descending. Higher values indicate features that contributed more to splits. Values sum to 1.0 for scikit-learn tree ensembles.

Return type:

Example

>>> from sklearn.ensemble import RandomForestClassifier
>>> import numpy as np
>>> np.random.seed(42)
>>> X = np.random.randn(300, 4)
>>> y = (X[:, 0] > 0).astype(int)
>>> rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
>>> imp = feature_importance_mdi(rf, ['momentum', 'vol', 'size', 'value'])
>>> imp.index[0]  # most important feature
'momentum'

Notes

MDI is biased toward high-cardinality and continuous features. For an unbiased alternative, use feature_importance_mda (permutation importance).

See also

feature_importance_mda: Permutation-based importance (unbiased).
wraquant.ml.advanced.random_forest_importance: Combined RF fit + importance.

feature_importance_mda(model, X, y, feature_names, n_repeats=10)

Mean Decrease Accuracy (permutation importance).

Use MDA when you need an unbiased estimate of feature importance that accounts for feature interactions and is not affected by cardinality bias. Unlike MDI, MDA evaluates on held-out data and directly measures how much predictive power is lost when a feature is shuffled.

Repeatedly permutes each feature and measures the decrease in the model’s score.

Parameters:

model (Any) – A fitted scikit-learn-compatible estimator.
X (DataFrame | ndarray) – Feature matrix (test or validation set).
y (Series | ndarray) – True labels.
feature_names (Sequence[str]) – Feature names corresponding to columns of X.
n_repeats (int, default: 10) – Number of permutation repeats per feature. More repeats yield more stable estimates but increase runtime linearly.

Returns:

Mean importance values indexed by feature name, sorted descending. Positive values indicate features whose permutation hurts the model score; negative values suggest noise features.

Return type:

Example

>>> from sklearn.ensemble import RandomForestClassifier
>>> import numpy as np
>>> np.random.seed(42)
>>> X = np.random.randn(300, 4)
>>> y = (X[:, 0] + 0.3 * X[:, 2] > 0).astype(int)
>>> rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
>>> imp = feature_importance_mda(rf, X, y, ['mom', 'vol', 'size', 'val'])
>>> imp.iloc[0] > 0  # top feature has positive importance
True
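
The permutation procedure itself is short enough to sketch without scikit-learn. In this illustration the "model" is a fixed rule that looks only at the first feature, accuracy is the score, and `permutation_importance_sketch` is a hypothetical helper rather than the library's implementation.

```python
import random

def permutation_importance_sketch(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each column is shuffled in turn."""
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    baseline = accuracy(X)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)  # break the link between this feature and y
            permuted = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(permuted))
        importances.append(sum(drops) / n_repeats)
    return importances

rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(300)]
y = [int(row[0] > 0) for row in X]          # target depends on feature 0 only
rule = lambda row: int(row[0] > 0)          # stand-in for a fitted model
imp = permutation_importance_sketch(rule, X, y)
```

Shuffling the informative feature cuts accuracy to roughly chance level, while shuffling the ignored feature changes nothing, so its importance is exactly zero.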

Notes

MDA is model-agnostic and works with any estimator that exposes a score method. Correlated features share importance: permuting one leaves its correlated partner to compensate, so both appear less important than they truly are.

References

Breiman (2001), “Random Forests”
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 8

See also

feature_importance_mdi: Faster but biased impurity-based importance.
wraquant.ml.pipeline.feature_importance_shap: SHAP-based importance.

sequential_feature_selection(model, X, y, n_features=5, direction='forward', cv=5)

Sequential (forward / backward) feature selection.

Use sequential feature selection when you want to find a compact subset of features that maximises predictive performance. Forward selection greedily adds the best feature at each step; backward selection starts with all features and removes the least useful.

Parameters:

model (Any) – A scikit-learn-compatible estimator.
X (DataFrame | ndarray) – Feature matrix.
y (Series | ndarray) – Target vector.
n_features (int, default: 5) – Number of features to select.
direction (Literal['forward', 'backward'], default: 'forward') – Selection direction. Forward is faster when n_features is small relative to total features; backward is faster when you want to drop only a few.
cv (int, default: 5) – Number of cross-validation folds.

Returns:

Selected feature names (if X is a DataFrame) or column indices.

Return type:

list[str | int]

Example

>>> from sklearn.linear_model import Ridge
>>> import numpy as np, pandas as pd
>>> np.random.seed(42)
>>> X = pd.DataFrame(np.random.randn(200, 6),
...                   columns=['f1','f2','f3','f4','f5','f6'])
>>> y = X['f1'] * 2 + X['f3'] + np.random.randn(200) * 0.1
>>> selected = sequential_feature_selection(Ridge(), X, y, n_features=2)
>>> len(selected)
2
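As a rough illustration of the greedy forward direction, the sketch below adds one column at a time, keeping whichever candidate gives the lowest cross-validated error. The helpers `cv_mse` and `forward_select` are hypothetical and use a least-squares model with contiguous folds; the library function may differ in scoring and fold construction.

```python
import numpy as np

def cv_mse(X, y, cols, n_folds=5):
    """Cross-validated MSE of a least-squares fit on the given columns."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    errors = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        coef, *_ = np.linalg.lstsq(X[train_mask][:, cols], y[train_mask], rcond=None)
        pred = X[test_idx][:, cols] @ coef
        errors.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(errors))

def forward_select(X, y, n_features):
    """Greedily add the column that most reduces cross-validated error."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features:
        best = min(remaining, key=lambda j: cv_mse(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(42)
X = rng.standard_normal((200, 6))
y = 2.0 * X[:, 0] + X[:, 2] + 0.1 * rng.standard_normal(200)
selected = forward_select(X, y, n_features=2)
```

Backward selection follows the same pattern in reverse: start from all columns and repeatedly drop the one whose removal hurts the score least.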

See also

feature_importance_mdi: Impurity-based ranking (faster, less rigorous).
feature_importance_mda: Permutation-based ranking.

class FinancialPipeline[source]¶

Bases: object

Sklearn Pipeline wrapper that enforces chronological splitting.

Standard sklearn Pipeline + cross_val_score uses K-fold splits that ignore time ordering, so training folds can contain observations from after the test fold and leak future information into the model. FinancialPipeline wraps an sklearn Pipeline and replaces all cross-validation with purged K-fold, which respects time ordering and applies an embargo window to prevent information leakage through overlapping labels.

Parameters:

steps (list[tuple[str, Any]]) – List of (name, transform) tuples defining the pipeline, identical to the steps parameter of sklearn.pipeline.Pipeline.
n_splits (int, default: 5) – Number of folds for purged K-fold cross-validation.
embargo_pct (float, default: 0.01) – Fraction of total samples to embargo after each test fold, preventing label leakage from overlapping targets.

Example

>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.linear_model import Ridge
>>> import numpy as np
>>> X = np.random.randn(500, 5)
>>> y = X @ np.array([1, 0.5, 0, 0, 0]) + np.random.randn(500) * 0.1
>>> pipe = FinancialPipeline(
...     steps=[('scaler', StandardScaler()), ('ridge', Ridge())],
...     n_splits=5,
... )
>>> result = pipe.fit_evaluate(X, y)
>>> len(result['fold_scores']) == 5
True
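To make the embargo idea concrete, here is a simplified split generator. It is a sketch under stated assumptions: `purged_kfold_indices` is a hypothetical name, test folds are contiguous blocks, and only the embargo after each test fold is applied. A full purge would also drop training samples whose label windows overlap the test fold, which requires label end times that this sketch does not model.

```python
import numpy as np

def purged_kfold_indices(n_samples, n_splits=5, embargo_pct=0.01):
    """Yield (train_idx, test_idx) with an embargo gap after each test fold."""
    embargo = int(n_samples * embargo_pct)
    indices = np.arange(n_samples)
    for test_idx in np.array_split(indices, n_splits):
        start, stop = test_idx[0], test_idx[-1] + 1
        train_mask = np.ones(n_samples, dtype=bool)
        train_mask[start:stop] = False            # remove the test fold itself
        train_mask[stop:stop + embargo] = False   # remove the embargo window after it
        yield indices[train_mask], test_idx

splits = list(purged_kfold_indices(500, n_splits=5, embargo_pct=0.01))
```

With 500 samples and `embargo_pct=0.01`, five observations immediately after each test fold are excluded from training.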

References

Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 7

__init__(steps, n_splits=5, embargo_pct=0.01)[source]¶

Parameters:

steps (list[tuple[str, Any]])
n_splits (int, default: 5)
embargo_pct (float, default: 0.01)

Return type:

None

fit(X, y)[source]¶

Fit the pipeline on the full dataset.

Parameters:

X (DataFrame | ndarray) – Feature matrix.
y (Series | ndarray) – Target vector.

Returns:

Self, for method chaining.

Return type:

FinancialPipeline

predict(X)[source]¶

Generate predictions using the fitted pipeline.

Parameters:

X (DataFrame | ndarray) – Feature matrix.

Returns:

Predictions.

Return type:

ndarray

fit_evaluate(X, y)[source]¶

Fit with purged K-fold cross-validation and return results.

Uses purged K-fold splitting to evaluate the pipeline without data leakage. After cross-validation, fits the pipeline on the full dataset.

Parameters:

X (DataFrame | ndarray) – Feature matrix.
y (Series | ndarray) – Target vector.

Returns:

fold_scores (list) – Per-fold R-squared scores.
mean_score (float) – Mean of the fold scores.
std_score (float) – Standard deviation of the fold scores.
pipeline – The fitted sklearn Pipeline.

Return type:

dict

walk_forward_backtest(model, X, y, train_size=252, test_size=21, step_size=21, expanding=True)[source]¶

Full walk-forward ML backtest with PnL tracking.

Walk-forward validation is the gold standard for evaluating ML models in finance because it mirrors real trading: train on historical data, predict the next period, observe actual outcome, then advance.

Why walk-forward instead of standard cross-validation? Standard K-fold CV ignores time ordering (and shuffles observations when shuffling is enabled), so the model can train on data from after the test period and “peek” at the future. In finance, this biases performance estimates upward, often substantially. Walk-forward enforces strict temporal ordering: the model only ever trains on data that would have been available at the time of prediction.

The function supports both expanding windows (training set grows over time, using all available history) and rolling windows (fixed-size training window that slides forward). Expanding windows are preferred when you believe the data-generating process is stable; rolling windows are better when you expect structural breaks or regime changes.

Parameters:

model (Any) – A scikit-learn-compatible estimator with fit and predict.
X (DataFrame | ndarray) – Feature matrix.
y (Series | ndarray) – Target vector (typically forward returns for PnL calculation).
train_size (int, default: 252) – Number of training observations in the initial window.
test_size (int, default: 21) – Number of test observations per fold.
step_size (int, default: 21) – Number of observations to advance between folds.
expanding (bool, default: True) – If True, the training window expands over time. If False, a rolling window of fixed train_size is used.

Returns:

predictions (np.ndarray) – Concatenated out-of-sample predictions.
actuals (np.ndarray) – Corresponding true values.
pnl (np.ndarray) – Per-period PnL (prediction * actual, assuming long when prediction > 0).
sharpe (float) – Annualised Sharpe ratio of the PnL series (assuming 252 trading days).
hit_rate (float) – Fraction of periods where the prediction sign matches the actual sign.
equity_curve (np.ndarray) – Cumulative PnL.

Return type:

dict

Example

>>> from sklearn.linear_model import Ridge
>>> import numpy as np
>>> np.random.seed(42)
>>> X = np.random.randn(600, 5)
>>> y = X @ np.array([0.5, 0.3, 0, 0, 0]) + np.random.randn(600) * 0.5
>>> result = walk_forward_backtest(Ridge(), X, y, train_size=200, test_size=20)
>>> len(result['predictions']) > 0
True
>>> 'sharpe' in result
True
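The windowing logic and the summary statistics can be sketched as follows. This is an illustration only: `walk_forward_windows` is a hypothetical helper, the model is a least-squares fit, and the Sharpe ratio uses the sample standard deviation, which may not match the library's convention.

```python
import numpy as np

def walk_forward_windows(n, train_size, test_size, step_size, expanding=True):
    """Yield ((train_start, train_end), (test_start, test_end)) index pairs."""
    start = train_size
    while start + test_size <= n:
        train_start = 0 if expanding else start - train_size
        yield (train_start, start), (start, start + test_size)
        start += step_size

rng = np.random.default_rng(42)
X = rng.standard_normal((600, 5))
y = X @ np.array([0.5, 0.3, 0.0, 0.0, 0.0]) + 0.5 * rng.standard_normal(600)

windows = list(walk_forward_windows(len(y), train_size=200, test_size=20, step_size=20))
preds, actuals = [], []
for (a, b), (c, d) in windows:
    # Refit on data available up to the test window, then predict the window.
    coef, *_ = np.linalg.lstsq(X[a:b], y[a:b], rcond=None)
    preds.append(X[c:d] @ coef)
    actuals.append(y[c:d])
preds, actuals = np.concatenate(preds), np.concatenate(actuals)

pnl = preds * actuals                                   # per-period PnL
sharpe = pnl.mean() / pnl.std(ddof=1) * np.sqrt(252)    # annualised, 252 days assumed
hit_rate = np.mean(np.sign(preds) == np.sign(actuals))
equity_curve = np.cumsum(pnl)
```

Passing `expanding=False` keeps the training window at a fixed `train_size`, so the oldest observations are dropped as the window slides forward.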

References

Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 12
Bailey et al. (2014), “The Deflated Sharpe Ratio”

feature_importance_shap(model, X, feature_names=None, max_samples=500)[source]¶

Compute SHAP-based feature importance for any sklearn model.

SHAP (SHapley Additive exPlanations) values provide a theoretically grounded decomposition of each prediction into per-feature contributions. Unlike impurity-based importance (MDI), SHAP values are consistent and account for feature interactions.

Parameters:

model (Any) – A fitted scikit-learn-compatible estimator.
X (DataFrame | ndarray) – Feature matrix to explain (typically the test set).
feature_names (Optional[Sequence[str]], default: None) – Feature names. If None and X is a DataFrame, column names are used.
max_samples (int, default: 500) – Maximum number of samples to use for computing SHAP values. Subsampled if X has more rows than this.

Returns:

shap_values (np.ndarray, shape (n_samples, n_features)) – Per-sample SHAP values.
feature_importance (np.ndarray, shape (n_features,)) – Mean absolute SHAP value per feature, sorted descending.
feature_names (list) – Feature names ordered by importance.

Return type:

dict

Raises:

MissingDependencyError – If shap is not installed.

Example

>>> from sklearn.ensemble import RandomForestRegressor
>>> import numpy as np
>>> np.random.seed(42)
>>> X = np.random.randn(200, 5)
>>> y = X[:, 0] * 2 + X[:, 1] + np.random.randn(200) * 0.1
>>> model = RandomForestRegressor(n_estimators=50, random_state=42)
>>> model.fit(X, y)
RandomForestRegressor(n_estimators=50, random_state=42)
>>> result = feature_importance_shap(model, X)
>>> result["shap_values"].shape[1] == 5
True
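For intuition about how per-sample SHAP values become a global ranking, the sketch below uses the closed form for a linear model with independent features, where each SHAP value is the coefficient times the feature's deviation from its mean. It does not call the shap package, and the coefficients are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((200, 3))
coef = np.array([2.0, 1.0, 0.0])   # illustrative linear model
intercept = 0.5

# Linear SHAP under feature independence: phi_ij = coef_j * (x_ij - mean_j).
shap_values = coef * (X - X.mean(axis=0))
base_value = intercept + X.mean(axis=0) @ coef

# Global importance: mean absolute SHAP value per feature, sorted descending.
importance = np.abs(shap_values).mean(axis=0)
order = np.argsort(importance)[::-1]
```

The additivity property holds by construction: each row of `shap_values` sums, together with `base_value`, to the model's prediction for that row.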

References

Lundberg & Lee (2017), “A Unified Approach to Interpreting Model Predictions”

correlation_clustering(returns, n_clusters=None, method='hierarchical')[source]¶

Cluster assets by their return correlations.

Use correlation clustering to group assets that move together, which is useful for portfolio diversification (allocate across clusters), risk management (monitor cluster concentration), and statistical arbitrage (trade within-cluster mean-reversion).

The correlation-based distance is d(i,j) = sqrt(0.5 * (1 - rho_ij)), which maps perfect correlation to distance 0 and perfect negative correlation to distance 1.
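A quick numerical check of the distance formula, as a sketch with a hypothetical helper name `correlation_distance`:

```python
import numpy as np

def correlation_distance(corr):
    """d(i,j) = sqrt(0.5 * (1 - rho_ij)), clipped to guard against rounding."""
    return np.sqrt(np.clip(0.5 * (1.0 - corr), 0.0, 1.0))

corr = np.array([[ 1.0, 0.0, -1.0],
                 [ 0.0, 1.0,  0.0],
                 [-1.0, 0.0,  1.0]])
dist = correlation_distance(corr)
```

Uncorrelated assets sit at an intermediate distance of sqrt(0.5), roughly 0.707.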

Parameters:

returns (DataFrame) – T x N return matrix (rows = observations, columns = assets).
n_clusters (int | None, default: None) – Number of clusters. If None the optimal number is chosen automatically (silhouette score for hierarchical, or defaults to 3 for spectral).
method (Literal['hierarchical', 'spectral'], default: 'hierarchical') – Clustering algorithm. Hierarchical uses Ward linkage and produces a dendrogram-compatible linkage matrix. Spectral uses the correlation matrix as affinity and finds clusters via eigenvalue decomposition.

Returns:

labels (np.ndarray) – Cluster assignment for each asset (0-indexed, length N). Assets with the same label belong to the same cluster.
n_clusters (int) – Number of clusters found or specified.
linkage_matrix (np.ndarray or None) – Linkage matrix (hierarchical only). Pass to scipy.cluster.hierarchy.dendrogram for visualization.

Return type:

dict

Example

>>> import pandas as pd, numpy as np
>>> np.random.seed(42)
>>> # 3 groups of correlated assets
>>> factor = np.random.randn(252, 3)
>>> returns = pd.DataFrame(
...     np.column_stack([factor[:, i % 3] + np.random.randn(252) * 0.5
...                      for i in range(9)]),
...     columns=[f'asset_{i}' for i in range(9)]
... )
>>> result = correlation_clustering(returns, n_clusters=3)
>>> result['n_clusters']
3
>>> len(result['labels']) == 9
True

See also

regime_clustering: Cluster time periods into regimes.
optimal_clusters: Determine optimal cluster count.
wraquant.ml.preprocessing.detoned_correlation: Remove market mode before clustering.

regime_clustering(features, n_regimes=2, method='gmm')[source]¶

Cluster time periods into market regimes.

Use regime clustering when you want to identify distinct market states (e.g., bull/bear, risk-on/risk-off, high/low volatility) from observable features without a pre-defined model. GMM is preferred because it assigns soft probabilities to each regime; KMeans provides hard assignments only.

Parameters:

features (DataFrame | ndarray) – Feature matrix where each row is a time observation. Common inputs include rolling volatility, returns, spreads, and VIX.
n_regimes (int, default: 2) – Number of regimes to identify (default 2, typical for risk-on/risk-off).
method (Literal['gmm', 'kmeans'], default: 'gmm') – Clustering algorithm. 'gmm' (Gaussian Mixture Model) provides probabilistic assignments; 'kmeans' provides hard assignments and is faster.

Returns:

labels (np.ndarray) – Regime assignment for each time period (0-indexed).
n_regimes (int) – Number of regimes.
model (object) – Fitted GaussianMixture or KMeans model. For GMM, call model.predict_proba(X) to get regime probabilities.

Return type:

dict

Example

>>> import numpy as np, pandas as pd
>>> np.random.seed(42)
>>> vol = np.concatenate([np.random.randn(100) * 0.5 + 0.1,
...                       np.random.randn(100) * 0.5 + 0.3])
>>> features = pd.DataFrame({'vol': vol, 'vol_sq': vol ** 2})
>>> result = regime_clustering(features, n_regimes=2)
>>> result['n_regimes']
2
>>> len(result['labels']) == 200
True

See also

correlation_clustering: Cluster assets (cross-sectional).
optimal_clusters: Find the optimal number of clusters/regimes.
wraquant.regimes: HMM and Markov-switching regime detection.

optimal_clusters(data, max_k=10, method='silhouette')[source]¶

Determine the optimal number of clusters.

Use this function before calling correlation_clustering or regime_clustering to choose the number of clusters from the data rather than guessing.

Parameters:

data (DataFrame | ndarray) – Feature matrix.
max_k (int, default: 10) – Maximum number of clusters to evaluate (default 10).
method (Literal['silhouette', 'bic'], default: 'silhouette') – Selection criterion. 'silhouette' uses the silhouette score with KMeans (higher is better, range [-1, 1]); 'bic' uses the Bayesian Information Criterion with a Gaussian Mixture Model (lower is better). Silhouette is faster; BIC is more principled for probabilistic models.

Returns:

Optimal number of clusters (between 2 and max_k). Use this value as n_clusters in correlation_clustering or n_regimes in regime_clustering.

Return type:

int

Example

>>> import numpy as np
>>> np.random.seed(42)
>>> # Generate data with 3 natural clusters
>>> data = np.vstack([np.random.randn(50, 2) + [0, 0],
...                   np.random.randn(50, 2) + [5, 5],
...                   np.random.randn(50, 2) + [10, 0]])
>>> k = optimal_clusters(data, max_k=6)
>>> 2 <= k <= 6
True

See also

correlation_clustering: Cluster assets by correlation.
regime_clustering: Cluster time periods into regimes.

classification_metrics(y_true, y_pred, y_prob=None)[source]¶

Compute standard classification metrics.

Use classification metrics to evaluate direction-prediction models (e.g., predicting up/down/flat labels). These metrics assess the statistical quality of the classifier independently of PnL; pair with financial_metrics for economic evaluation.

Parameters:

y_true (Series | ndarray) – True class labels.
y_pred (Series | ndarray) – Predicted class labels.
y_prob (Series | ndarray | None, default: None) – Predicted probabilities (for the positive class in binary classification). When provided, log-loss and AUC are included.

Returns:

accuracy (float) – Fraction of correct predictions.
precision (float) – Macro-averaged precision (how many predicted positives are actually positive).
recall (float) – Macro-averaged recall (how many actual positives are captured).
f1 (float) – Macro-averaged F1 score (harmonic mean of precision and recall).
log_loss (float, only if y_prob is given) – Cross-entropy loss. Lower is better; measures calibration quality.
auc (float, only if y_prob is given, binary only) – Area under the ROC curve. 0.5 = random, 1.0 = perfect.

Return type:

dict

Example

>>> import numpy as np
>>> y_true = np.array([1, 0, 1, 1, 0, 1])
>>> y_pred = np.array([1, 0, 0, 1, 0, 1])
>>> metrics = classification_metrics(y_true, y_pred)
>>> metrics['accuracy']
0.8333333333333334
>>> metrics['f1'] > 0.5
True
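Macro averaging computes each metric per class and then takes the unweighted mean. The sketch below works through the same labels by hand in NumPy; it assumes every class is predicted at least once, so no zero-division guard is included.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])

precisions, recalls, f1s = [], [], []
for label in np.unique(y_true):
    tp = np.sum((y_pred == label) & (y_true == label))   # true positives for this class
    fp = np.sum((y_pred == label) & (y_true != label))   # false positives
    fn = np.sum((y_pred != label) & (y_true == label))   # false negatives
    p, r = tp / (tp + fp), tp / (tp + fn)
    precisions.append(p)
    recalls.append(r)
    f1s.append(2 * p * r / (p + r))

accuracy = float(np.mean(y_true == y_pred))
macro_precision = float(np.mean(precisions))
macro_recall = float(np.mean(recalls))
macro_f1 = float(np.mean(f1s))
```

Class 0 has precision 2/3 and recall 1; class 1 has precision 1 and recall 3/4. The macro values are the plain averages of those pairs.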

See also

financial_metrics: PnL-based evaluation of directional predictions.
backtest_predictions: Full backtest with transaction costs.

financial_metrics(y_true, y_pred, returns)[source]¶

Compute finance-specific evaluation metrics from predictions.

Use financial metrics to evaluate whether a model’s predictions translate into actual trading profits. A model can have high accuracy but poor financial performance if it is right on small moves and wrong on large moves. These metrics directly measure economic value.

The predicted labels are interpreted as position signals: 1 for long, -1 for short, 0 for flat.

Parameters:

y_true (Series | ndarray) – True directional labels.
y_pred (Series | ndarray) – Predicted directional labels (used as signals).
returns (Series | ndarray) – Actual period returns corresponding to each observation.

Returns:

strategy_return (float) – Cumulative strategy return (sum of signal * return).
sharpe (float) – Annualised Sharpe ratio (252 trading days). Values above 1.0 are generally considered good; above 2.0 is excellent.
hit_rate (float) – Fraction of periods where predicted sign matches actual sign. A hit rate above 0.5 is necessary but not sufficient for profitability.
profit_factor (float) – Gross profit / gross loss. Values above 1.0 indicate a profitable strategy; above 2.0 is strong.

Return type:

dict

Example

>>> import numpy as np
>>> y_true = np.array([1, -1, 1, 1, -1])
>>> y_pred = np.array([1, -1, -1, 1, 1])
>>> returns = np.array([0.02, -0.01, 0.015, 0.005, -0.02])
>>> metrics = financial_metrics(y_true, y_pred, returns)
>>> metrics['hit_rate']
0.6
>>> 'sharpe' in metrics
True
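The metric definitions above can be reproduced by hand. The following sketch uses its own small data set and assumes the sample standard deviation for the Sharpe ratio; the library's exact conventions may differ.

```python
import numpy as np

signals = np.array([1, -1, -1, 1, 1])
returns = np.array([0.02, -0.01, 0.015, 0.005, -0.01])

strategy = signals * returns                 # per-period strategy returns
strategy_return = strategy.sum()             # cumulative (additive) return
hit_rate = np.mean(np.sign(signals) == np.sign(returns))
gross_profit = strategy[strategy > 0].sum()
gross_loss = -strategy[strategy < 0].sum()
profit_factor = gross_profit / gross_loss
sharpe = strategy.mean() / strategy.std(ddof=1) * np.sqrt(252)
```

Three of five signals have the right sign, and the winning periods earn 0.035 against 0.025 lost, giving a profit factor of 1.4.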

See also

classification_metrics: Standard ML classification metrics.
backtest_predictions: Full backtest with transaction costs.

learning_curve(model, X, y, train_sizes=None, cv=5)[source]¶

Generate a learning curve for a model.

Use learning curves to diagnose whether a model suffers from high bias (underfitting) or high variance (overfitting). If training and test scores converge at a low value, the model is too simple. If there is a large gap between training and test scores, the model is overfitting and more data or regularisation is needed.

Parameters:

model (Any) – A scikit-learn-compatible estimator.
X (DataFrame | ndarray) – Feature matrix.
y (Series | ndarray) – Target vector.
train_sizes (Union[Sequence[int | float], ndarray, None], default: None) – Training set sizes (absolute counts or fractions). Defaults to np.linspace(0.1, 1.0, 10).
cv (int, default: 5) – Number of cross-validation folds.

Returns:

train_sizes (np.ndarray) – Absolute number of training samples at each point.
train_scores (np.ndarray, shape (len(sizes), cv)) – Training scores at each size/fold. Plot the mean across folds to visualize training performance.
test_scores (np.ndarray, shape (len(sizes), cv)) – Test scores at each size/fold. The gap between train and test mean scores indicates overfitting.

Return type:

dict[str, ndarray]

Example

>>> from sklearn.linear_model import Ridge
>>> import numpy as np
>>> X = np.random.randn(300, 5)
>>> y = X @ [1, 0.5, 0, 0, 0] + np.random.randn(300) * 0.1
>>> result = learning_curve(Ridge(), X, y, cv=3)
>>> result['train_sizes'].shape[0]  # 10 points by default
10

See also

classification_metrics: Evaluate classification quality.
financial_metrics: Evaluate economic value of predictions.

backtest_predictions(predictions, returns, cost_bps=10)[source]¶

Backtest a prediction signal against actual returns.

Use backtest_predictions as a quick sanity check of a model’s economic value before building a full backtest. It applies realistic transaction costs (proportional to position changes) and computes key performance metrics including Sharpe, max drawdown, and turnover.

Parameters:

predictions (Series | ndarray) – Predicted position signals (e.g. 1, 0, -1). The signal is applied as a position: signal * return.
returns (Series | ndarray) – Actual period returns corresponding to each prediction.
cost_bps (float, default: 10) – Transaction cost in basis points applied on each position change (default 10 bps). For equities, 5-10 bps is typical; for futures, 1-3 bps.

Returns:

gross_returns (np.ndarray) – Per-period strategy returns before costs.
net_returns (np.ndarray) – Per-period strategy returns after costs.
cumulative_return (float) – Total cumulative net return. Positive = profitable.
sharpe (float) – Annualised Sharpe ratio of net returns. Above 1.0 is generally good; above 2.0 is excellent.
max_drawdown (float) – Maximum peak-to-trough decline in cumulative PnL. Always negative or zero.
turnover (float) – Mean absolute position change per period. Higher turnover means higher transaction costs.

Return type:

dict

Example

>>> import numpy as np
>>> preds = np.array([1, 1, -1, 1, -1, 0, 1])
>>> rets = np.array([0.01, -0.005, -0.02, 0.015, 0.01, 0.005, 0.008])
>>> result = backtest_predictions(preds, rets, cost_bps=10)
>>> result['cumulative_return'] != 0
True
>>> result['max_drawdown'] <= 0
True
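A hand-rolled version of the cost and drawdown arithmetic is sketched below. It assumes the strategy starts flat (so the first trade incurs a cost), that cumulative PnL is additive, and that drawdown is measured against the running peak of that curve; these are illustrative choices, not confirmed library behaviour.

```python
import numpy as np

preds = np.array([1, 1, -1, 1, -1, 0, 1], dtype=float)
rets = np.array([0.01, -0.005, -0.02, 0.015, 0.01, 0.005, 0.008])
cost_bps = 10

cost = cost_bps / 1e4                                    # 10 bps = 0.001 per unit traded
position_change = np.abs(np.diff(preds, prepend=0.0))    # start from a flat position
gross_returns = preds * rets
net_returns = gross_returns - cost * position_change

equity = np.cumsum(net_returns)
drawdown = equity - np.maximum.accumulate(equity)        # distance below the running peak
max_drawdown = drawdown.min()
turnover = position_change.mean()
```

Flipping from long to short counts as a position change of 2, which is why high-turnover signals lose more to costs.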

See also

financial_metrics: Quick financial metrics without transaction costs.
wraquant.ml.pipeline.walk_forward_backtest: Walk-forward backtest.

lstm_forecast(series, seq_length=20, hidden_dim=64, n_layers=2, dropout=0.1, n_epochs=50, lr=0.001, train_ratio=0.8, batch_size=32)[source]¶

Forecast a financial time series using an LSTM network.

Long Short-Term Memory networks are recurrent neural networks capable of learning long-range dependencies in sequential data. In finance, LSTMs are used to capture complex temporal patterns in price, volume, and return series that linear models miss.

The function auto-creates overlapping input/target sequences from the raw time series, splits into train/test sets chronologically (no shuffle to avoid lookahead bias), trains the model, and returns predictions on the test set.
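The sequence construction and chronological split can be sketched as below. `make_sequences` is a hypothetical helper, and the split is applied after building sequences; the library may order these steps differently, so sample counts need not match its output.

```python
import numpy as np

def make_sequences(series, seq_length):
    """Build overlapping look-back windows and their one-step-ahead targets."""
    X = np.stack([series[i:i + seq_length] for i in range(len(series) - seq_length)])
    y = series[seq_length:]
    return X, y

series = np.arange(100, dtype=float)   # toy series so the windows are easy to read
X, y = make_sequences(series, seq_length=5)

# Chronological split: the first 80% of windows train, the rest test (no shuffling).
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
```

Here the first window is `[0, 1, 2, 3, 4]` with target `5`, and every training target precedes every test target in time.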

When to use:

Use LSTM for multi-step forecasting when you have >1000 observations and suspect non-linear temporal dependencies. Works well for return prediction, volatility forecasting, and spread modeling.

Mathematical background:

At each time step t, the LSTM cell computes:

f_t = sigma(W_f [h_{t-1}, x_t] + b_f)    (forget gate)
i_t = sigma(W_i [h_{t-1}, x_t] + b_i)    (input gate)
o_t = sigma(W_o [h_{t-1}, x_t] + b_o)    (output gate)
c_t = f_t * c_{t-1} + i_t * tanh(W_c [h_{t-1}, x_t] + b_c)
h_t = o_t * tanh(c_t)

The cell state c_t acts as a conveyor belt, allowing gradients to flow across many time steps without vanishing.
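The cell equations translate almost line for line into NumPy. This is a didactic sketch with randomly initialised weights and a hypothetical `lstm_step` helper; the library itself trains a PyTorch module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W and b are dicts keyed by gate name (f, i, o, c)."""
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])           # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])           # input gate
    o_t = sigmoid(W["o"] @ z + b["o"])           # output gate
    c_t = f_t * c_prev + i_t * np.tanh(W["c"] @ z + b["c"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
hidden_dim, input_dim = 4, 1
W = {k: 0.1 * rng.standard_normal((hidden_dim, hidden_dim + input_dim)) for k in "fioc"}
b = {k: np.zeros(hidden_dim) for k in "fioc"}

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in [0.01, -0.02, 0.015]:                   # a short univariate input sequence
    h, c = lstm_step(np.array([x]), h, c, W, b)
```

Because `h_t` is a sigmoid-gated tanh, every hidden unit stays strictly inside (-1, 1).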

Parameters:

series (Series | ndarray) – Univariate time series (e.g., log returns, prices, spreads).
seq_length (int, default: 20) – Number of look-back time steps for each input sequence.
hidden_dim (int, default: 64) – Number of hidden units in each LSTM layer.
n_layers (int, default: 2) – Number of stacked LSTM layers.
dropout (float, default: 0.1) – Dropout probability between LSTM layers (applied only when n_layers > 1).
n_epochs (int, default: 50) – Number of training epochs.
lr (float, default: 0.001) – Learning rate for the Adam optimizer.
train_ratio (float, default: 0.8) – Fraction of data used for training (the rest is used for testing). The split is chronological – no shuffling.
batch_size (int, default: 32) – Mini-batch size for training.

Returns:

predictions (np.ndarray) – Test-set predictions.
actuals (np.ndarray) – Actual test values.
train_losses (list) – Per-epoch training losses.
model (torch.nn.Module) – The trained model.

Return type:

dict

Raises:

ImportError – If PyTorch is not installed.

Example

>>> import numpy as np
>>> returns = np.cumsum(np.random.randn(500) * 0.01)
>>> result = lstm_forecast(returns, seq_length=10, n_epochs=20)
>>> result["predictions"].shape
(80,)

Notes

Financial time series are notoriously noisy; LSTM is prone to overfitting on noise. Use dropout, early stopping, and validation.
Chronological train/test split is critical to avoid lookahead bias.
Normalisation (handled internally) is essential for gradient stability.

References

Hochreiter & Schmidhuber (1997), “Long Short-Term Memory”
Fischer & Krauss (2018), “Deep learning with long short-term memory networks for financial market predictions”

transformer_forecast(series, seq_length=20, d_model=64, n_heads=4, n_encoder_layers=2, dim_feedforward=128, dropout=0.1, n_epochs=50, lr=0.001, train_ratio=0.8, batch_size=32)[source]¶

Forecast a financial time series using a Transformer encoder.

Transformer models use self-attention to capture dependencies at any distance in the input sequence, unlike RNNs which process sequentially. This makes them especially effective at discovering long-range patterns such as seasonality, lead-lag relationships, and regime persistence in financial data.

When to use:

Use Transformers when you have sufficient data (>2000 observations) and suspect that long-range dependencies matter. They often outperform LSTMs on longer sequences but require more data and compute.

Mathematical background:

Self-attention computes:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

where Q, K, V are linear projections of the input. Multi-head attention runs h parallel attention heads and concatenates:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W_O

Positional encoding injects order information:

PE(pos, 2i) = sin(pos / 10000^{2i/d_model})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})
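A single attention head and the sinusoidal encoding can be written directly from the formulas. In this sketch the projections for Q, K and V are taken as the identity for brevity (a real layer learns them), `d_model` is assumed even, and the function names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

def positional_encoding(seq_length, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    pos = np.arange(seq_length)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((seq_length, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
pe = positional_encoding(seq_length=15, d_model=8)
x = rng.standard_normal((15, 8)) + pe            # embedded sequence plus position signal
out, weights = attention(x, x, x)
```

Each row of `weights` is a probability distribution over the 15 time steps, which is what lets a position attend to any other position regardless of distance.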

Parameters:

series (Series | ndarray) – Univariate time series.
seq_length (int, default: 20) – Number of look-back time steps.
d_model (int, default: 64) – Embedding dimension (must be divisible by n_heads).
n_heads (int, default: 4) – Number of attention heads.
n_encoder_layers (int, default: 2) – Number of Transformer encoder layers.
dim_feedforward (int, default: 128) – Hidden dimension in the feedforward sub-layers.
dropout (float, default: 0.1) – Dropout probability.
n_epochs (int, default: 50) – Number of training epochs.
lr (float, default: 0.001) – Learning rate for Adam.
train_ratio (float, default: 0.8) – Fraction of data for training.
batch_size (int, default: 32) – Mini-batch size.

Returns:

predictions (np.ndarray) – Test-set predictions.
actuals (np.ndarray) – Actual test values.
train_losses (list) – Per-epoch training losses.
model (torch.nn.Module) – The trained model.

Return type:

dict

Raises:

ImportError – If PyTorch is not installed.

Example

>>> import numpy as np
>>> prices = np.cumsum(np.random.randn(600) * 0.01) + 100
>>> result = transformer_forecast(prices, seq_length=15, n_epochs=10)
>>> len(result["predictions"]) > 0
True

Notes

Transformers are data-hungry; on small datasets (<500 obs) they will overfit severely.
Quadratic memory in sequence length: keep seq_length reasonable (< 256 for typical financial data).
No inherent notion of order without positional encoding.

References

Vaswani et al. (2017), “Attention Is All You Need”
Li et al. (2019), “Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting”

autoencoder_features(X, latent_dim=8, hidden_dim=64, n_epochs=50, lr=0.001, batch_size=32, beta=1.0)[source]¶

Extract latent features using a Variational Autoencoder (VAE).

A VAE learns a compressed, continuous latent representation of high-dimensional input features. In finance, this is valuable for:

Regime detection: Cluster the latent codes to find market states.
Anomaly detection: High reconstruction error flags unusual market conditions (flash crashes, liquidity crises).
Feature compression: Reduce hundreds of technical indicators to a handful of orthogonal latent factors.

When to use:

Use when you have a wide feature matrix (>20 features) and want to discover latent structure, detect anomalies, or reduce dimensionality in a non-linear way that PCA cannot capture.

Mathematical background:

The VAE optimises the Evidence Lower Bound (ELBO):

L = E_q[log p(x|z)] - beta * KL(q(z|x) || p(z))

where q(z|x) = N(mu(x), sigma^2(x)) is the encoder, p(x|z) is the decoder, and p(z) = N(0, I) is the prior. The KL term regularises the latent space to be smooth and continuous.
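In practice the ELBO is maximised by minimising its negative. The sketch below assumes a Gaussian decoder with fixed variance, so the reconstruction term reduces to a squared error, and uses the closed-form KL divergence between a diagonal Gaussian and the standard normal prior. `vae_loss` is a hypothetical helper, not the library's training loop.

```python
import numpy as np

def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    """Negative ELBO up to constants: reconstruction error + beta * KL."""
    recon = np.mean(np.sum((x - x_recon) ** 2, axis=1))
    # KL(N(mu, sigma^2) || N(0, I)) = -0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2)
    kl = np.mean(-0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=1))
    return recon + beta * kl, recon, kl

x = np.ones((4, 3))
x_recon = np.zeros((4, 3))
mu = np.ones((4, 5))          # encoder means for a 5-dimensional latent space
logvar = np.zeros((4, 5))     # log-variances of zero, i.e. unit variance

loss, recon, kl = vae_loss(x, x_recon, mu, logvar, beta=2.0)
```

When the encoder output matches the prior exactly (zero mean, unit variance), the KL term vanishes and only reconstruction error remains.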

Parameters:

X (DataFrame | ndarray) – Feature matrix of shape (n_samples, n_features).
latent_dim (int, default: 8) – Dimensionality of the latent space.
hidden_dim (int, default: 64) – Hidden layer size in encoder/decoder.
n_epochs (int, default: 50) – Training epochs.
lr (float, default: 0.001) – Learning rate.
batch_size (int, default: 32) – Mini-batch size.
beta (float, default: 1.0) – Weight on the KL divergence term. beta=1 is standard VAE; beta<1 gives more reconstruction accuracy; beta>1 forces more disentangled representations.

Returns:

latent_features (np.ndarray, shape (n_samples, latent_dim)) – The encoded representations.
reconstruction_error (np.ndarray) – Per-sample reconstruction MSE.
train_losses (list) – Per-epoch total losses.
model – The trained VAE module.

Return type:

dict

Raises:

ImportError – If PyTorch is not installed.

Example

>>> import numpy as np
>>> X = np.random.randn(500, 30)  # 30 features
>>> result = autoencoder_features(X, latent_dim=5, n_epochs=20)
>>> result["latent_features"].shape
(500, 5)

Notes

Normalise your features before encoding; the VAE assumes roughly standard-normal inputs for stable training.
The latent space is stochastic; for deterministic embeddings, use the mean (mu) which is what this function returns.
Reconstruction error thresholds for anomaly detection should be calibrated on clean training data.

References

Kingma & Welling (2014), “Auto-Encoding Variational Bayes”
An & Cho (2015), “Variational Autoencoder based Anomaly Detection using Reconstruction Probability”

gru_forecast(series, seq_length=20, hidden_dim=64, n_layers=2, dropout=0.1, n_epochs=50, lr=0.001, train_ratio=0.8, batch_size=32)[source]¶

Forecast a financial time series using a GRU network.

Gated Recurrent Units are a simplified variant of LSTMs that merge the cell and hidden state, resulting in fewer parameters and faster training while achieving comparable performance on many financial forecasting tasks.

When to use:

Use GRU as a computationally cheaper alternative to LSTM. Preferred when you have moderate-sized datasets (500-5000 observations) or need faster iteration during model development.

Mathematical background:

The GRU update equations at time step t:

z_t = sigma(W_z [h_{t-1}, x_t])    (update gate)
r_t = sigma(W_r [h_{t-1}, x_t])    (reset gate)
h_t_hat = tanh(W [r_t * h_{t-1}, x_t])    (candidate)
h_t = (1 - z_t) * h_{t-1} + z_t * h_t_hat

Compared to LSTM, GRU has no separate cell state and uses two gates instead of three, giving ~25% fewer parameters.
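The update equations and the parameter comparison can be checked with a few lines of NumPy. This sketch follows the bias-free form written above and counts weight matrices only; `gru_step` and the random weights are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step following the update equations (no bias terms)."""
    z_in = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ z_in)                                   # update gate
    r_t = sigmoid(W_r @ z_in)                                   # reset gate
    h_hat = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_hat

rng = np.random.default_rng(0)
hidden_dim, input_dim = 64, 1
shape = (hidden_dim, hidden_dim + input_dim)
W_z, W_r, W_h = (0.1 * rng.standard_normal(shape) for _ in range(3))

h = gru_step(np.array([0.01]), np.zeros(hidden_dim), W_z, W_r, W_h)

# A GRU layer has three weight blocks where an LSTM layer has four.
gru_weights = 3 * hidden_dim * (hidden_dim + input_dim)
lstm_weights = 4 * hidden_dim * (hidden_dim + input_dim)
ratio = gru_weights / lstm_weights
```

The ratio of 0.75 is where the "~25% fewer parameters" figure comes from.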

Parameters:

series (Series | ndarray) – Univariate time series.
seq_length (int, default: 20) – Number of look-back time steps.
hidden_dim (int, default: 64) – Number of hidden units per GRU layer.
n_layers (int, default: 2) – Number of stacked GRU layers.
dropout (float, default: 0.1) – Dropout between layers (only when n_layers > 1).
n_epochs (int, default: 50) – Training epochs.
lr (float, default: 0.001) – Learning rate.
train_ratio (float, default: 0.8) – Fraction of data for training.
batch_size (int, default: 32) – Mini-batch size.

Returns:

predictions (np.ndarray) – Test-set predictions.
actuals (np.ndarray) – Actual test values.
train_losses (list) – Per-epoch training losses.
model (torch.nn.Module) – The trained model.

Return type:

dict

Raises:

ImportError – If PyTorch is not installed.

Example

>>> import numpy as np
>>> vol = np.abs(np.random.randn(400)) * 0.02
>>> result = gru_forecast(vol, seq_length=10, n_epochs=15)
>>> result["predictions"].shape[0] > 0
True

Notes

Same overfitting risks as LSTM; use dropout and validation.
On very long sequences (>200 steps), Transformers may outperform GRU.

References

Cho et al. (2014), “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation”

multivariate_lstm_forecast(features, target, seq_length=20, hidden_dim=64, n_layers=2, dropout=0.1, n_epochs=50, lr=0.001, train_ratio=0.8, batch_size=32)[source]¶

Forecast a target series using multiple input features via LSTM.

Multivariate LSTM ingests a DataFrame of features (e.g., returns of correlated assets, macro indicators, technical signals) and learns to predict a single target variable. This outperforms univariate LSTM when cross-asset signals exist – for example, when sector ETF returns lead individual stock returns, when VIX changes anticipate equity moves, or when order-flow imbalance across related instruments carries predictive information for the target.

The function normalises each feature column independently (z-score), creates multivariate look-back sequences, trains the LSTM with a chronological train/test split, and returns predictions on the held-out test set along with train and test MSE metrics.
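The preprocessing steps can be sketched as follows. One assumption is made explicit here: the z-score statistics are computed on the training rows only, to avoid leaking test-period information. The text above does not say which rows the library uses, so treat this as one reasonable choice rather than a description of its behaviour.

```python
import numpy as np

rng = np.random.default_rng(42)
# Three synthetic feature columns on very different scales (two returns, one VIX-like).
features = rng.standard_normal((500, 3)) * [0.01, 0.01, 15.0] + [0.0, 0.0, 15.0]
target = rng.standard_normal(500) * 0.01
seq_length, train_ratio = 10, 0.8

# Column-wise z-score using training-period statistics only (assumption).
split = int(len(features) * train_ratio)
mean = features[:split].mean(axis=0)
std = features[:split].std(axis=0)
z = (features - mean) / std

# Multivariate look-back windows: shape (n_samples, seq_length, n_features).
X = np.stack([z[i:i + seq_length] for i in range(len(z) - seq_length)])
y = target[seq_length:]
```

Each sample is a `seq_length` by `n_features` slab, so the LSTM input dimension is the number of feature columns rather than 1.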

Mathematical background:

The LSTM cell equations are the same as in lstm_forecast, but the input dimensionality is now n_features rather than 1:

x_t in R^{n_features}
f_t = sigma(W_f [h_{t-1}, x_t] + b_f)
i_t = sigma(W_i [h_{t-1}, x_t] + b_i)
o_t = sigma(W_o [h_{t-1}, x_t] + b_o)

The weight matrices W_f, W_i, W_o, W_c have input dimension n_features instead of 1, allowing the network to learn cross-feature temporal dependencies.

Parameters:

features (DataFrame) – DataFrame of shape (T, n_features) containing the input features. All columns are used as inputs to the LSTM.
target (Series | ndarray) – Target variable of length T to predict.
seq_length (int, default: 20) – Number of look-back time steps for each input sequence.
hidden_dim (int, default: 64) – Number of hidden units in each LSTM layer.
n_layers (int, default: 2) – Number of stacked LSTM layers.
dropout (float, default: 0.1) – Dropout probability between LSTM layers (applied only when n_layers > 1).
n_epochs (int, default: 50) – Number of training epochs.
lr (float, default: 0.001) – Learning rate for the Adam optimizer.
train_ratio (float, default: 0.8) – Fraction of data used for training (chronological split).
batch_size (int, default: 32) – Mini-batch size for training.

Returns:

predictions: np.ndarray of test-set predictions.
actuals: np.ndarray of actual test values.
train_losses: list of per-epoch training losses.
train_mse: float MSE on the training set.
test_mse: float MSE on the test set.
model: the trained torch.nn.Module.

Return type:

Raises:

ImportError – If PyTorch is not installed.

Example

>>> import numpy as np, pandas as pd
>>> np.random.seed(42)
>>> df = pd.DataFrame({
...     'asset_a': np.cumsum(np.random.randn(500) * 0.01),
...     'asset_b': np.cumsum(np.random.randn(500) * 0.01),
...     'vix': np.abs(np.random.randn(500)) * 15 + 15,
... })
>>> target = pd.Series(np.cumsum(np.random.randn(500) * 0.01))
>>> result = multivariate_lstm_forecast(df, target, seq_length=10, n_epochs=5)
>>> result["predictions"].shape[0] > 0
True

References

Hochreiter & Schmidhuber (1997), “Long Short-Term Memory”
Fischer & Krauss (2018), “Deep learning with long short-term memory networks for financial market predictions”

temporal_fusion_transformer(features, target, seq_length=20, hidden_dim=64, n_heads=4, n_lstm_layers=1, dropout=0.1, n_epochs=50, lr=0.001, train_ratio=0.8, batch_size=32)[source]¶

Simplified Temporal Fusion Transformer for interpretable forecasting.

The Temporal Fusion Transformer (TFT) is an architecture designed for interpretable time-series forecasting. This implementation provides the core TFT components: a variable selection network that learns which input features matter, an LSTM encoder for temporal processing, multi-head attention for capturing long-range dependencies, and gated residual connections for stable gradient flow.

Unlike black-box models, TFT produces per-feature importance weights that reveal which inputs drive each prediction – critical for building trust in trading signals and satisfying model governance requirements.

Architecture:

Variable Selection Network (VSN): A soft-attention gate over input features. Each feature is projected to hidden_dim, then a shared softmax gate selects the most relevant ones.
LSTM Encoder: Processes the selected features sequentially to capture local temporal patterns.
Multi-Head Attention: Attends over the LSTM outputs to capture long-range dependencies (e.g., monthly seasonality).
Gated Residual Network (GRN): skip connections with gating for stable training on noisy financial data.
Output layer: Linear projection to produce the forecast.
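The variable selection step can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration (the parameter names W_proj and W_gate, a gate computed as a linear map of the raw inputs, and importance taken as the time-average of the gate weights); the actual module may be organised differently.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, n_features, hidden_dim = 6, 3, 4
x = rng.normal(size=(T, n_features))

# Hypothetical parameters (randomly initialised here, learned in practice).
W_proj = rng.normal(size=(n_features, hidden_dim))   # one projection per feature
W_gate = rng.normal(size=(n_features, n_features))   # shared gate weights

# Project every scalar feature to a hidden_dim vector.
projected = x[:, :, None] * W_proj[None, :, :]        # (T, n_features, hidden_dim)
# Softmax gate: one weight per feature at each time step, rows sum to 1.
weights = softmax(x @ W_gate, axis=-1)                # (T, n_features)
# Gate-weighted sum of the projected features feeds the LSTM encoder.
selected = (weights[:, :, None] * projected).sum(axis=1)  # (T, hidden_dim)
# Averaging the gate weights over time gives one importance per feature.
importance = weights.mean(axis=0)                     # (n_features,)
```

Because the gate is a softmax, the importance weights are non-negative and sum to 1, which is what makes them directly comparable across features.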

Parameters:

features (DataFrame) – DataFrame of shape (T, n_features) containing the input features.
target (Series | ndarray) – Target variable of length T.
seq_length (int, default: 20) – Number of look-back time steps.
hidden_dim (int, default: 64) – Dimensionality of the hidden representations.
n_heads (int, default: 4) – Number of attention heads (must divide hidden_dim).
n_lstm_layers (int, default: 1) – Number of LSTM layers in the encoder.
dropout (float, default: 0.1) – Dropout probability.
n_epochs (int, default: 50) – Number of training epochs.
lr (float, default: 0.001) – Learning rate for Adam.
train_ratio (float, default: 0.8) – Fraction of data for training (chronological split).
batch_size (int, default: 32) – Mini-batch size.

Returns:

predictions: np.ndarray of test-set predictions.
actuals: np.ndarray of actual test values.
train_losses: list of per-epoch training losses.
feature_importance: np.ndarray of shape (n_features,) giving the learned importance weight for each input feature (higher = more important).
feature_names: list of feature names from the input DataFrame.
model: the trained torch.nn.Module.

Return type:

Raises:

ImportError – If PyTorch is not installed.

Example

>>> import numpy as np, pandas as pd
>>> np.random.seed(42)
>>> df = pd.DataFrame({
...     'momentum': np.random.randn(500),
...     'volume': np.abs(np.random.randn(500)),
...     'spread': np.random.randn(500) * 0.1,
... })
>>> target = pd.Series(np.cumsum(np.random.randn(500) * 0.01))
>>> result = temporal_fusion_transformer(
...     df, target, seq_length=10, hidden_dim=16, n_heads=2, n_epochs=5
... )
>>> result["predictions"].shape[0] > 0
True
>>> len(result["feature_importance"]) == 3
True

References

Lim et al. (2021), “Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting”

svm_classifier(X_train, y_train, X_test, y_test, kernel='rbf', C_range=(0.1, 1.0, 10.0), gamma_range=('scale', 0.01, 0.1), cv=5)[source]¶

Train an SVM classifier for market regime classification.

Support Vector Machines find the maximum-margin hyperplane separating classes. With the RBF kernel, SVMs can capture non-linear decision boundaries in feature space, making them effective for classifying market regimes (bull/bear/neutral) from derived features like volatility, momentum, and volume profiles.

When to use:

Use SVM when you have a moderate number of features (5-100), moderate dataset size (500-50k), and need robust classification with good generalisation. SVMs handle high-dimensional spaces well and are resistant to overfitting when C is properly tuned.

Mathematical background:

SVM solves:

min_{w,b} (1/2) ||w||^2 + C * sum_i max(0, 1 - y_i(w.x_i + b))

The RBF kernel maps inputs to infinite-dimensional space:

K(x, x’) = exp(-gamma * ||x - x’||^2)

Grid search over C (regularisation) and gamma (kernel width) selects the best hyperparameters via cross-validation.
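The two formulas can be evaluated directly in plain Python. This sketch only computes the soft-margin objective for a given (w, b) and the RBF kernel value for a pair of points; it does not solve the optimisation problem, and the function names are illustrative. Note that the hinge-loss form assumes labels in {-1, +1}.

```python
import math

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def soft_margin_objective(w, b, X, y, C):
    """(1/2)||w||^2 + C * sum of hinge losses; labels must be -1 or +1."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge = 0.0
    for x, yi in zip(X, y):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        hinge += max(0.0, 1.0 - yi * score)
    return margin_term + C * hinge

X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]]
y = [1, 1, -1]
# Only the middle point lies inside the margin (hinge loss 0.5).
obj = soft_margin_objective([1.0, 0.0], 0.0, X, y, C=1.0)
k_same = rbf_kernel([0.0, 0.0], [0.0, 0.0], gamma=0.1)
k_far = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)
```

A larger C penalises margin violations more heavily, while a larger gamma makes the kernel decay faster with distance, so each training point influences a smaller neighbourhood.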

Parameters:

X_train (DataFrame | ndarray) – Training feature matrix.
y_train (Series | ndarray) – Training labels (e.g., 1 = bull, 0 = neutral, -1 = bear).
X_test (DataFrame | ndarray) – Test feature matrix.
y_test (Series | ndarray) – Test labels.
kernel (Literal['rbf', 'linear', 'poly'], default: 'rbf') – SVM kernel function.
C_range (Sequence[float], default: (0.1, 1.0, 10.0)) – Regularisation parameter values to search.
gamma_range (Sequence[float | str], default: ('scale', 0.01, 0.1)) – Kernel coefficient values to search (ignored for linear kernel).
cv (int, default: 5) – Cross-validation folds for grid search.

Returns:

model: fitted SVC.
predictions: np.ndarray of test predictions.
accuracy: float.
confusion_matrix: np.ndarray.
best_params: dict of best C and gamma.
cv_score: float (mean CV accuracy).

Return type:

Example

>>> import numpy as np
>>> X = np.random.randn(200, 5)
>>> y = (X[:, 0] > 0).astype(int)
>>> result = svm_classifier(X[:150], y[:150], X[150:], y[150:])
>>> result["accuracy"] > 0.5
True

Scale features before training (StandardScaler recommended).
Kernel SVM training needs up to O(n^2) memory and between O(n^2) and O(n^3) time – avoid for n > 100k.
For imbalanced classes, set class_weight='balanced' on the SVC.

References

Cortes & Vapnik (1995), “Support-Vector Networks”

random_forest_importance(X, y, feature_names=None, n_estimators=100, max_depth=5, random_state=42, task='classification')[source]¶

Rank features by importance using a Random Forest.

Random Forests aggregate many decorrelated decision trees and measure each feature’s contribution to reducing impurity (Gini for classification, variance for regression). This produces a natural feature ranking useful for selecting the most predictive signals from a large universe of technical indicators, fundamental factors, or alternative data features.

When to use:

Use as a first-pass feature selector when you have many candidate features (>20) and want to identify which ones carry signal. Fast, non-parametric, and handles mixed feature types.

Mathematical background:

Mean Decrease Impurity (MDI) for feature j:

Imp(j) = sum_{t in T_j} p(t) * Delta_i(t)

where T_j is the set of tree nodes splitting on feature j, p(t) is the fraction of samples reaching node t, and Delta_i(t) is the impurity decrease. MDI is averaged over all trees in the forest.
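As a worked example of the MDI formula, the sketch below computes importances for one hand-built classification tree using Gini impurity. The tree, its class counts, and the final normalisation to sum to 1 are all made up for illustration; a real forest averages this quantity over its trees.

```python
def gini(counts):
    """Gini impurity from per-class sample counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def impurity_decrease(parent, left, right):
    """Delta_i(t): parent impurity minus the size-weighted child impurities."""
    n, n_l, n_r = sum(parent), sum(left), sum(right)
    return gini(parent) - (n_l / n) * gini(left) - (n_r / n) * gini(right)

# Hypothetical tree on 100 samples.
# Each entry: (feature index, parent counts, left counts, right counts).
n_total = 100
splits = [
    (0, [50, 50], [40, 10], [10, 40]),  # root splits on feature 0
    (1, [40, 10], [40, 0], [0, 10]),    # left child splits on feature 1
]

importance = {}
for feature, parent, left, right in splits:
    p_t = sum(parent) / n_total          # fraction of samples reaching node t
    delta = impurity_decrease(parent, left, right)
    importance[feature] = importance.get(feature, 0.0) + p_t * delta

# Assumption: normalise so the importances sum to 1.
total = sum(importance.values())
normalised = {j: v / total for j, v in importance.items()}
```

Here the root split contributes 1.0 * 0.18 to feature 0 and the second split contributes 0.5 * 0.32 = 0.16 to feature 1, so a split near the root counts for more even when its raw impurity decrease is smaller.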

Parameters:

X (DataFrame | ndarray) – Feature matrix.
y (Series | ndarray) – Target vector.
feature_names (Optional[Sequence[str]], default: None) – Feature names. If None and X is a DataFrame, column names are used.
n_estimators (int, default: 100) – Number of trees.
max_depth (int | None, default: 5) – Maximum tree depth (None for unlimited).
random_state (int, default: 42) – Random seed for reproducibility.
task (Literal['classification', 'regression'], default: 'classification') – Type of prediction task.

Returns:

importance: pd.Series of feature importances sorted descending.
model: fitted RandomForest estimator.
oob_score: float (out-of-bag score if available, else None).

Return type:

Example

>>> import numpy as np
>>> X = np.random.randn(300, 10)
>>> y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)
>>> result = random_forest_importance(X, y)
>>> result["importance"].index[0]  # top feature is likely 0
0

MDI importance is biased toward high-cardinality features; consider permutation importance (feature_importance_mda) as a complement.
Correlated features share importance, causing both to appear weaker.

References

Breiman (2001), “Random Forests”
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch.8

gradient_boost_forecast(X_train, y_train, X_test, y_test=None, task='regression', n_estimators=200, max_depth=4, learning_rate=0.1, subsample=0.8, cv=5, feature_names=None)[source]¶

Gradient boosting for forecasting or classification.

Gradient Boosting sequentially fits weak learners (shallow trees) to the residuals of the ensemble, greedily minimising a loss function. It is the workhorse of tabular ML in quant finance – used for return prediction, alpha factor construction, default prediction, and more.

When to use:

Use gradient boosting as your default tabular model. It handles non-linearities, feature interactions, and missing values naturally. Preferred over linear models when you have >500 samples and >5 features.

Mathematical background:

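At each stage m, the model adds a tree h_m, scaled by the learning rate nu:

F_m(x) = F_{m-1}(x) + nu * h_m(x)

where h_m is chosen to reduce the loss of the current ensemble:

h_m = argmin_h sum_i L(y_i, F_{m-1}(x_i) + h(x_i))

In practice h_m is fitted to the negative gradient of the loss evaluated at F_{m-1}. For regression with squared loss, this is the residual y_i - F_{m-1}(x_i). For classification with log-loss, it is the difference between the label and the predicted probability, y_i - p_i.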
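The stage-wise update can be seen in a from-scratch sketch for squared loss, where each weak learner is a depth-1 tree (a stump) on a single feature. This is an illustration under those simplifying assumptions, with hypothetical helper names fit_stump and boost; it is not how the library's estimator is implemented.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-threshold split on a 1-D feature under squared error."""
    best = None
    for thr in np.unique(x)[:-1]:            # keep both sides non-empty
        left, right = residual[x <= thr], residual[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    _, thr, left_val, right_val = best
    return lambda q: np.where(q <= thr, left_val, right_val)

def boost(x, y, n_estimators=50, learning_rate=0.1):
    f0 = y.mean()                            # F_0: constant prediction
    pred = np.full_like(y, f0)
    stumps = []
    for _ in range(n_estimators):
        residual = y - pred                  # negative gradient of 0.5 * (y - F)^2
        h = fit_stump(x, residual)
        pred = pred + learning_rate * h(x)   # F_m = F_{m-1} + nu * h_m
        stumps.append(h)
    return f0, stumps, pred

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)
f0, stumps, pred = boost(x, y)
mse_start = np.mean((y - y.mean()) ** 2)
mse_end = np.mean((y - pred) ** 2)
```

Training error falls with every stage, which is why the number of stages and the learning rate have to be tuned together against held-out data rather than training loss.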

Parameters:

X_train (DataFrame | ndarray) – Training feature matrix.
y_train (Series | ndarray) – Training target.
X_test (DataFrame | ndarray) – Test feature matrix.
y_test (Series | ndarray | None, default: None) – Test target (if provided, test metrics are computed).
task (Literal['classification', 'regression'], default: 'regression') – Prediction task.
n_estimators (int, default: 200) – Number of boosting stages.
max_depth (int, default: 4) – Maximum depth of individual trees.
learning_rate (float, default: 0.1) – Shrinkage applied to each tree’s contribution.
subsample (float, default: 0.8) – Fraction of training samples used per tree (stochastic boosting).
cv (int, default: 5) – Cross-validation folds for reporting training CV score.
feature_names (Optional[Sequence[str]], default: None) – Feature names for importance ranking.

Returns:

model: fitted GradientBoosting estimator.
predictions: np.ndarray of test predictions.
feature_importance: pd.Series (sorted descending).
cv_scores: np.ndarray of cross-validation scores.
test_score: float or None (R^2 for regression, accuracy for classification).

Return type:

Example

>>> import numpy as np
>>> X = np.random.randn(300, 5)
>>> y = X[:, 0] * 2 + X[:, 1] + np.random.randn(300) * 0.5
>>> result = gradient_boost_forecast(X[:250], y[:250], X[250:], y[250:])
>>> result["test_score"] > 0
True

Overfits if n_estimators is too large; use early stopping or CV.
Sensitive to learning_rate / n_estimators trade-off.
For >100k samples, consider XGBoost/LightGBM for speed.

References

Friedman (2001), “Greedy Function Approximation: A Gradient Boosting Machine”

gaussian_process_regression(X_train, y_train, X_test, kernel='rbf', alpha=0.01, n_restarts=5)[source]¶

Gaussian Process regression with uncertainty quantification.

Gaussian Processes (GPs) define a distribution over functions and provide both point predictions and calibrated confidence intervals. In finance, GPs are used for smooth yield-curve fitting, volatility-surface interpolation, and any setting where uncertainty matters as much as the prediction.

When to use:

Use GP when you need uncertainty estimates (e.g., confidence bands on a yield curve) and have a small-to-moderate dataset (<5000 observations). The cubic complexity makes GPs impractical for large datasets without approximations.

Mathematical background:

A GP assumes f(x) ~ GP(m(x), k(x, x’)), where:

m(x) is the mean function (usually 0)
k(x, x’) is the kernel (covariance function)

Posterior predictive at test point x*:

mu* = k(x*, X) [K + sigma^2 I]^{-1} y
sigma*^2 = k(x*, x*) - k(x*, X) [K + sigma^2 I]^{-1} k(X, x*)

where K_{ij} = k(x_i, x_j) and sigma^2 is the noise variance.
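The posterior formulas translate almost line by line into NumPy. The sketch below assumes a zero mean function and an RBF kernel with a fixed length scale of 1; the library function instead tunes kernel hyperparameters (see n_restarts), so this shows the algebra only. A Cholesky factorisation is used in place of an explicit matrix inverse.

```python
import numpy as np

def rbf(A, B, length_scale=1.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq / length_scale ** 2)

def gp_posterior(X, y, X_star, noise_var=0.01):
    """Posterior mean and standard deviation at X_star (zero prior mean)."""
    K = rbf(X, X) + noise_var * np.eye(len(X))           # K + sigma^2 I
    K_s = rbf(X_star, X)                                 # k(x*, X)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # [K + sigma^2 I]^{-1} y
    mu = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    var = np.diag(rbf(X_star, X_star)) - (v ** 2).sum(axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))             # clip tiny negatives

X = np.linspace(0, 10, 30).reshape(-1, 1)
y = np.sin(X).ravel()
X_star = np.array([[5.0], [20.0]])   # one point inside the data, one far outside
mu, std = gp_posterior(X, y, X_star)
```

At x* = 5 the prediction is close to sin(5) with a small standard deviation, while at x* = 20, far from any training point, the mean reverts to the prior (0) and the standard deviation to the prior value (1).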

Parameters:

X_train (DataFrame | ndarray) – Training features.
y_train (Series | ndarray) – Training target.
X_test (DataFrame | ndarray) – Test features.
kernel (str, default: 'rbf') – Kernel type: 'rbf', 'matern', or 'rational_quadratic'.
alpha (float, default: 0.01) – Noise level (regularisation diagonal added to the kernel matrix).
n_restarts (int, default: 5) – Number of optimiser restarts for kernel hyperparameters.

Returns:

predictions: np.ndarray of mean predictions.
std: np.ndarray of predictive standard deviations.
confidence_lower: np.ndarray (mean - 1.96 * std).
confidence_upper: np.ndarray (mean + 1.96 * std).
model: fitted GaussianProcessRegressor.

Return type:

Example

>>> import numpy as np
>>> X_train = np.linspace(0, 10, 50).reshape(-1, 1)
>>> y_train = np.sin(X_train).ravel() + np.random.randn(50) * 0.1
>>> X_test = np.linspace(0, 10, 20).reshape(-1, 1)
>>> result = gaussian_process_regression(X_train, y_train, X_test)
>>> result["predictions"].shape
(20,)
>>> result["std"].shape
(20,)

Complexity is O(n^3) for training and O(n^2) per prediction.
For large datasets, use sparse GP approximations (not included here).
Kernel choice strongly affects results; try multiple kernels.

References

Rasmussen & Williams (2006), “Gaussian Processes for Machine Learning”

isolation_forest_anomaly(returns, contamination=0.05, n_estimators=200, random_state=42)[source]¶

Detect anomalous days in return data using Isolation Forest.

Isolation Forest detects anomalies by randomly partitioning data and measuring how quickly each observation is isolated. Anomalous points (outlier returns, flash crashes, liquidity events) are isolated in fewer splits because they sit far from the bulk of the distribution.

When to use:

Use for unsupervised anomaly detection in returns, volumes, or spreads. Works well when you do not have labelled anomalies and want to flag unusual market days for review. Robust to high-dimensional feature spaces.

Mathematical background:

For a sample x, the anomaly score is based on the average path length E[h(x)] across the isolation trees:

s(x, n) = 2^{-E[h(x)] / c(n)}

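where c(n) is the average path length of an unsuccessful search in a binary search tree of n samples. Scores close to 1 indicate anomalies; scores well below 0.5 indicate normal points. The anomaly_scores returned by this function use the opposite orientation (lower = more anomalous).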
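The score formula is easy to evaluate by hand. The sketch below uses the normalising constant from Liu, Ting & Zhou, c(n) = 2 H(n - 1) - 2 (n - 1) / n, with the harmonic number approximated as ln(i) + Euler's constant; the small-n special cases are a convention assumed here.

```python
import math

EULER_GAMMA = 0.5772156649015329

def c(n):
    """Average path length of an unsuccessful BST search over n samples."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA   # H(n - 1) approximation
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """s(x, n) = 2^(-E[h(x)] / c(n)); higher means more anomalous."""
    return 2.0 ** (-mean_path_length / c(n))

n = 256
score_short = anomaly_score(2.0, n)     # isolated after ~2 splits
score_typical = anomaly_score(c(n), n)  # average path length
score_long = anomaly_score(12.0, n)     # needs many splits to isolate
```

For n = 256, c(n) is roughly 10.2, so a point isolated in two splits scores about 0.87, a point with an average path length scores exactly 0.5, and points needing more splits score below 0.5.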

Parameters:

returns (Series | DataFrame | ndarray) – Return data. If 1-D, treated as a single feature; if 2-D, each column is a feature (e.g., return, volume, spread).
contamination (float, default: 0.05) – Expected fraction of anomalies in the dataset (0 < c < 0.5).
n_estimators (int, default: 200) – Number of isolation trees.
random_state (int, default: 42) – Random seed.

Returns:

anomaly_labels: np.ndarray of -1 (anomaly) / 1 (normal).
anomaly_scores: np.ndarray of continuous anomaly scores (lower = more anomalous).
anomaly_mask: np.ndarray of bool (True for anomalies).
n_anomalies: int.
model: fitted IsolationForest.

Return type:

Example

>>> import numpy as np
>>> rets = np.random.randn(500) * 0.01
>>> rets[100] = 0.15  # inject anomaly
>>> result = isolation_forest_anomaly(rets, contamination=0.02)
>>> result["anomaly_mask"][100]
True

The contamination parameter is a prior; misspecification leads to over- or under-detection.
Isolation Forest assumes anomalies are both rare and different; clustered anomalies may be missed.
For time-series anomaly detection, consider adding lagged features.

References

Liu, Ting & Zhou (2008), “Isolation Forest”

pca_factor_model(returns, n_components=None, explained_variance_threshold=0.9)[source]¶

Build a PCA-based latent factor model from asset returns.

Principal Component Analysis extracts orthogonal linear combinations of asset returns that explain the most variance. The first PC typically captures the market factor, the second often captures a value/growth or sector rotation, and so on.

When to use:

Use PCA factor models for dimensionality reduction in portfolio construction, risk decomposition, statistical arbitrage (pairs trading on residuals), and understanding co-movement structure.

Mathematical background:

Given return matrix R (T x N), PCA decomposes the covariance:

Sigma = V Lambda V^T

where Lambda = diag(lambda_1, …, lambda_N) are eigenvalues and V are eigenvectors (loadings). Factor returns are:

F = R @ V[:, :k] (T x k)

The fraction of variance explained by the first k components:

sum(lambda_1..k) / sum(lambda_1..N)
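The decomposition and the variance-threshold rule can be reproduced with a plain eigendecomposition. This is a sketch on simulated one-factor returns, not the library's code path; centring the returns before projecting is an assumption made here (it matches the usual PCA convention).

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 252, 5
# Simulated returns: one common "market" factor plus idiosyncratic noise.
market = rng.normal(scale=0.01, size=(T, 1))
R = market + rng.normal(scale=0.003, size=(T, N))

R_centered = R - R.mean(axis=0)
Sigma = np.cov(R_centered, rowvar=False)        # N x N sample covariance

# eigh returns eigenvalues in ascending order; sort them descending.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
lam, V = eigvals[order], eigvecs[:, order]

ratio = lam / lam.sum()                         # per-component variance ratio
cumulative = np.cumsum(ratio)
# Smallest k whose cumulative ratio reaches the 0.9 threshold.
k = int(np.searchsorted(cumulative, 0.9) + 1)
F = R_centered @ V[:, :k]                       # factor returns, shape (T, k)
```

Because the simulated assets share a single dominant factor, the first component carries most of the variance and a small k is enough to reach the threshold.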

Parameters:

returns (DataFrame) – T x N return matrix (rows = observations, columns = assets).
n_components (int | None, default: None) – Number of principal components. If None, selects enough to explain explained_variance_threshold of total variance.
explained_variance_threshold (float, default: 0.9) – Minimum cumulative explained variance ratio when n_components is None.

Returns:

loadings: pd.DataFrame of shape (N, n_components), asset loadings on each factor.
factor_returns: pd.DataFrame of shape (T, n_components), time series of factor returns.
explained_variance_ratio: np.ndarray of per-component variance ratios.
cumulative_variance: np.ndarray of cumulative variance ratios.
n_components: int.
model: fitted PCA object.

Return type:

Example

>>> import numpy as np, pandas as pd
>>> returns = pd.DataFrame(np.random.randn(252, 20) * 0.01)
>>> result = pca_factor_model(returns, n_components=3)
>>> result["factor_returns"].shape
(252, 3)

PCA is linear; for non-linear dimensionality reduction, use the VAE in wraquant.ml.deep.autoencoder_features.
Eigenvalues from small samples are noisy; use Random Matrix Theory denoising (wraquant.ml.preprocessing.denoised_correlation) first.
Components are not guaranteed to have economic meaning.

References

Jolliffe (2002), “Principal Component Analysis”
Avellaneda & Lee (2010), “Statistical arbitrage in the US equities market”

online_linear_regression(X, y, forgetting_factor=1.0, initial_covariance=100.0)[source]¶

Recursive Least Squares (RLS) online linear regression.

Processes observations one at a time, updating regression coefficients with each new data point. This is the online analogue of ordinary least squares and is fundamental to adaptive signal processing in finance: tracking time-varying betas, hedge ratios, and factor loadings.

When to use:

Use online regression when you need to:

Track a hedge ratio that drifts over time (pairs trading).
Estimate time-varying factor exposures (rolling beta).
Build adaptive trading signals that respond to regime changes.
Process streaming tick data without re-estimating from scratch.

Mathematical background:

Recursive Least Squares maintains:

K_t = P_{t-1} x_t / (lambda + x_t^T P_{t-1} x_t)
w_t = w_{t-1} + K_t (y_t - x_t^T w_{t-1})
P_t = (1/lambda) * (P_{t-1} - K_t x_t^T P_{t-1})

where:

w_t is the coefficient vector at time t
P_t is the inverse of the exponentially weighted X^T X matrix (proportional to the coefficient covariance, initialised as P_0 = c * I)
K_t is the gain vector (analogous to the Kalman gain)
lambda is the forgetting factor (1 = no forgetting, <1 = down-weight old data)

With lambda = 1 and infinite data, RLS converges to OLS. With lambda < 1, the effective window length is approximately 1 / (1 - lambda) observations.
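A compact NumPy version of the recursion shows how the three updates fit together. It is a minimal sketch rather than the library's implementation: the function name rls is illustrative, and starting from w_0 = 0 is an assumption.

```python
import numpy as np

def rls(X, y, forgetting_factor=1.0, initial_covariance=100.0):
    """Recursive least squares; returns coefficient path and one-step-ahead predictions."""
    T, p = X.shape
    w = np.zeros(p)                               # assumed starting point w_0 = 0
    P = initial_covariance * np.eye(p)            # P_0 = c * I
    coefs = np.empty((T, p))
    preds = np.empty(T)
    for t in range(T):
        x = X[t]
        preds[t] = x @ w                          # uses coefficients from t-1
        Px = P @ x
        K = Px / (forgetting_factor + x @ Px)     # gain vector K_t
        w = w + K * (y[t] - preds[t])             # coefficient update
        P = (P - np.outer(K, Px)) / forgetting_factor  # P is symmetric, so x^T P = (P x)^T
        coefs[t] = w
    return coefs, preds

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
beta = np.array([1.0, -0.5])
y = X @ beta + rng.normal(scale=0.05, size=400)
coefs, preds = rls(X, y, forgetting_factor=0.99)
```

With forgetting_factor=0.99 the effective window is about 1 / (1 - 0.99) = 100 observations, so the final coefficients are estimated mostly from the last hundred rows.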

Parameters:

X (DataFrame | ndarray) – Feature matrix of shape (T, p) where T is the number of observations and p is the number of features.
y (Series | ndarray) – Target vector of length T.
forgetting_factor (float, default: 1.0) – Forgetting factor lambda in (0, 1]. Values close to 1 give long memory; values like 0.99 give an effective window of ~100 observations. Use 0.95-0.99 for fast-adapting signals.
initial_covariance (float, default: 100.0) – Scalar multiplier for the initial covariance matrix P_0 = c * I. Larger values make the filter more responsive early on.

Returns:

coefficients: np.ndarray of shape (T, p), the time-varying coefficient vector at each step.
predictions: np.ndarray of shape (T,), one-step-ahead predictions (each y_hat_t uses coefficients estimated from data up to t-1).
residuals: np.ndarray of shape (T,), prediction errors.
final_coefficients: np.ndarray of shape (p,), the coefficients at the last time step.

Return type:

Example

>>> import numpy as np
>>> np.random.seed(42)
>>> T = 500
>>> X = np.random.randn(T, 2)
>>> # True coefficients shift halfway through
>>> beta_true = np.where(np.arange(T)[:, None] < 250,
...     [1.0, 0.5], [0.5, 1.0])
>>> y = np.sum(X * beta_true, axis=1) + np.random.randn(T) * 0.1
>>> result = online_linear_regression(X, y, forgetting_factor=0.98)
>>> result["coefficients"].shape
(500, 2)
>>> # After convergence, coefficients should track the true values
>>> np.abs(result["final_coefficients"][0] - 0.5) < 0.3
True

The forgetting factor is critical: too low causes noisy estimates, too high causes slow adaptation to regime changes.
RLS assumes the noise variance is constant; for heteroskedastic data, consider the exponential weighted variant or Kalman filters.
Initial predictions (before the filter converges) should be discarded in any evaluation.

References

Haykin (2002), “Adaptive Filter Theory”, Ch. 13 (RLS)
Montana et al. (2009), “Flexible least squares for temporal data mining and statistical arbitrage”

exponential_weighted_regression(X, y, halflife=63.0, min_periods=30)[source]¶

Exponentially weighted linear regression favouring recent data.

At each time step t, fits a weighted least squares regression where observation weights decay exponentially into the past. This produces smooth, adaptive coefficient estimates that naturally respond to regime changes without the abrupt sensitivity of rolling-window OLS.

When to use:

Use exponential weighted regression when:

You want smoother coefficient paths than RLS.
The halflife of predictive relationships is approximately known (e.g., 63 trading days ~ 3 months).
You need an interpretable “recency bias” in your factor model.

Mathematical background:

At time t, the weight for observation s (where s <= t) is:

w_s = exp(-ln(2) * (t - s) / halflife)

The weighted regression solves:

beta_t = (X_t^T W_t X_t)^{-1} X_t^T W_t y_t

where W_t = diag(w_0, w_1, …, w_t). This is equivalent to EWMA smoothing of the sufficient statistics X^T X and X^T y.
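For a single time step, the weighted solve looks as follows in NumPy. The sketch computes only the estimate at the last observation (the library repeats this for every t), and the helper name ew_coefficients is illustrative.

```python
import numpy as np

def ew_coefficients(X, y, halflife):
    """Weighted least squares at the last time step t = len(y) - 1."""
    t = len(y) - 1
    age = t - np.arange(len(y))                   # t - s for each observation s
    w = np.exp(-np.log(2.0) * age / halflife)     # weight halves every `halflife` rows
    XtW = X.T * w                                 # same as X^T @ diag(w)
    beta = np.linalg.solve(XtW @ X, XtW @ y)      # (X^T W X)^{-1} X^T W y
    return beta, w

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = X @ np.array([0.8, 0.5]) + rng.normal(scale=0.05, size=300)
beta, w = ew_coefficients(X, y, halflife=63)
```

The most recent observation has weight 1 and the observation 63 rows earlier has weight 0.5, which is the sense in which halflife is measured in observations.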

Parameters:

X (DataFrame | ndarray) – Feature matrix of shape (T, p).
y (Series | ndarray) – Target vector of length T.
halflife (float, default: 63.0) – Halflife in observations. After halflife observations, the weight of a past data point has decayed to 50%. Common financial values: 21 (1 month), 63 (1 quarter), 252 (1 year).
min_periods (int, default: 30) – Minimum number of observations before producing a coefficient estimate. Earlier entries are filled with NaN.

Returns:

coefficients: np.ndarray of shape (T, p), time-varying coefficients (NaN for the first min_periods - 1 rows).
predictions: np.ndarray of shape (T,), fitted values using contemporaneous coefficients.
residuals: np.ndarray of shape (T,), prediction errors.
final_coefficients: np.ndarray of shape (p,), last estimated coefficients.

Return type:

Example

>>> import numpy as np
>>> np.random.seed(0)
>>> T = 300
>>> X = np.random.randn(T, 2)
>>> beta_true = np.column_stack([
...     np.linspace(1, 0, T),      # drifting coefficient
...     np.full(T, 0.5),            # constant coefficient
... ])
>>> y = np.sum(X * beta_true, axis=1) + np.random.randn(T) * 0.1
>>> result = exponential_weighted_regression(X, y, halflife=60)
>>> result["coefficients"].shape
(300, 2)

Halflife selection is subjective; cross-validate if possible.
For very short halflives (<10), the effective sample size is small and estimates become noisy.
Assumes homoskedastic errors; for heteroskedastic data, consider EWMA-weighted robust regression.
Numerically less stable than RLS for ill-conditioned problems.

References

Pozzi et al. (2012), “Exponentially weighted moving average charts for detecting concept drift”
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 17

Features¶

Feature engineering functions for transforming raw market data into predictive signals.

Feature engineering utilities for financial machine learning.

All functions in this module use only numpy and pandas – no external TA libraries are required.

rolling_features(data, windows=(5, 10, 21, 63))[source]¶

Generate rolling statistical features for each window length.

For every window the following statistics are computed: mean, std, skew, kurtosis, min, and max.

Parameters:

data (Series | DataFrame) – Numeric time-series data. If a DataFrame is passed, features are generated independently for each column.
windows (Sequence[int], default: (5, 10, 21, 63)) – Rolling-window sizes (default (5, 10, 21, 63)), corresponding roughly to 1-week, 2-week, 1-month, and 1-quarter horizons.

Returns:

Return type:

Example

>>> import pandas as pd, numpy as np
>>> np.random.seed(0)
>>> returns = pd.Series(np.random.randn(100) * 0.01, name='ret')
>>> feats = rolling_features(returns, windows=(5, 21))
>>> feats.columns.tolist()[:3]
['mean_w5', 'std_w5', 'skew_w5']
>>> feats.shape[1]  # 6 stats * 2 windows
12

See also

return_features: Lagged and cumulative return features.
volatility_features: Realised volatility and vol-of-vol features.

return_features(prices, lags=(1, 2, 3, 5, 10, 21))[source]¶

Compute lagged and cumulative return features from a price series.

Parameters:

prices (Series) – Price series (e.g. adjusted close).
lags (Sequence[int], default: (1, 2, 3, 5, 10, 21)) – Lag periods for returns (default (1, 2, 3, 5, 10, 21)).

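Returns:

DataFrame with columns ret_lag{l} (log return l periods ago, a momentum/mean-reversion signal) and cum_ret_{l} (cumulative log return over the last l periods, a trend signal) for each lag l. Early rows are NaN.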

Return type:

Example

>>> import pandas as pd, numpy as np
>>> prices = pd.Series([100, 101, 102, 100, 103, 105, 104],
...                     name='close')
>>> feats = return_features(prices, lags=(1, 3))
>>> list(feats.columns)
['ret_lag1', 'cum_ret_1', 'ret_lag3', 'cum_ret_3']
>>> feats['cum_ret_3'].iloc[-1] > 0  # cumulative 3-period return
True

See also

rolling_features: Rolling statistical features.
technical_features: Technical analysis features (RSI, MACD, etc.).

technical_features(high, low, close, volume=None)[source]¶

Compute common technical analysis features for ML pipelines.

Computes RSI, MACD histogram, Bollinger Band %B, and ATR. If volume is provided, On-Balance Volume (OBV) is also included.

Parameters:

high (Series) – High prices.
low (Series) – Low prices.
close (Series) – Close prices.
volume (Series | None, default: None) – Trade volume (optional). When provided, adds OBV which tracks cumulative buying/selling pressure.

Returns:

DataFrame with columns:

rsi: Relative Strength Index (0-100). Values above 70 indicate overbought; below 30 indicate oversold.
macd_hist: MACD histogram. Positive values indicate bullish momentum; negative values indicate bearish.
bb_pctb: Bollinger Band %B (0-1 range typically). Values above 1 mean price is above the upper band.
atr: Average True Range. Higher values indicate more volatile price action.
obv (optional): On-Balance Volume. Rising OBV confirms an uptrend.

Return type:

Example

>>> import pandas as pd, numpy as np
>>> np.random.seed(0)
>>> n = 100
>>> close = pd.Series(100 + np.cumsum(np.random.randn(n) * 0.5))
>>> high = close + np.abs(np.random.randn(n) * 0.3)
>>> low = close - np.abs(np.random.randn(n) * 0.3)
>>> feats = technical_features(high, low, close)
>>> list(feats.columns)
['rsi', 'macd_hist', 'bb_pctb', 'atr']

See also

return_features: Lagged and cumulative return features.
volatility_features: Realised volatility features.

ta_features(high, low, close, volume=None, include=None)[source]¶

Generate ML features using wraquant’s full technical analysis library.

By default, computes a curated set of the most ML-relevant indicators: RSI, MACD histogram, Bollinger Band %B, ATR, and optionally OBV. Use the include parameter to restrict the output to a subset of these indicators.

Parameters:

high (Series) – High prices.
low (Series) – Low prices.
close (Series) – Close prices.
volume (Series | None, default: None) – Trade volume (optional). Required for volume-based indicators (OBV, MFI).
include (Optional[Sequence[str]], default: None) – Subset of indicators to include. Options: 'rsi', 'macd', 'bbands', 'atr', 'obv'. If None, includes all available indicators.

Return type:

Returns:

DataFrame with one column per indicator, indexed like the input series. Column names are descriptive (e.g., ta_rsi, ta_macd_hist, ta_bb_pctb, ta_atr, ta_obv).

Example

>>> import pandas as pd, numpy as np
>>> np.random.seed(0)
>>> n = 100
>>> close = pd.Series(100 + np.cumsum(np.random.randn(n) * 0.5))
>>> high = close + np.abs(np.random.randn(n) * 0.3)
>>> low = close - np.abs(np.random.randn(n) * 0.3)
>>> feats = ta_features(high, low, close)
>>> 'ta_rsi' in feats.columns
True

See also

technical_features: Inline implementation (no ta/ dependency).
wraquant.ta.momentum.rsi: Full RSI implementation.
wraquant.ta.momentum.macd: Full MACD implementation.

volatility_features(returns, windows=(5, 10, 21, 63))[source]¶

Compute realised-volatility-related features.

Parameters:

returns (Series) – Log or simple return series.
windows (Sequence[int], default: (5, 10, 21, 63)) – Window sizes for rolling calculations (default (5, 10, 21, 63)).

Returns:

Columns:

realized_vol_w{w}: Annualised rolling standard deviation (sqrt(252) scaling). Interpretation: a value of 0.20 means ~20% annualised volatility.
vol_of_vol_w{w}: Rolling std of the rolling vol. High values indicate unstable volatility (vol-of-vol regime).
vol_ratio_w{w1}_w{w2}: Ratio of short-window vol to long-window vol. Values > 1 indicate vol is spiking (risk-off signal); values < 1 indicate vol compression.

Return type:

Example

>>> import pandas as pd, numpy as np
>>> np.random.seed(0)
>>> rets = pd.Series(np.random.randn(200) * 0.01, name='daily_ret')
>>> feats = volatility_features(rets, windows=(5, 21))
>>> 'realized_vol_w5' in feats.columns
True
>>> 'vol_ratio_w5_w21' in feats.columns
True

See also

rolling_features: General rolling statistical features.
wraquant.vol: Full volatility modelling (GARCH, stochastic vol).

microstructure_features(high, low, close, volume)[source]¶

Compute market-microstructure features.

Parameters:

high (Series) – High prices.
low (Series) – Low prices.
close (Series) – Close prices.
volume (Series) – Trade volume.

Returns:

Columns:

amihud_illiq: Amihud illiquidity ratio (21-day rolling mean of |return| / dollar_volume). Higher values indicate less liquid, more price-impactful markets.
kyle_lambda: Kyle’s lambda (21-day rolling OLS slope of |price change| on signed sqrt-volume). Measures the price impact per unit of informed flow. Higher values suggest more information asymmetry.
log_volume: Natural log of volume. Smooths the skewed volume distribution for ML model consumption.
volume_ma_ratio: Current volume / 21-day moving average. Values > 1 indicate above-average activity (potential event).
dollar_volume: Price * volume. Absolute measure of trading activity and liquidity.

Return type:

Example

>>> import pandas as pd, numpy as np
>>> np.random.seed(0)
>>> n = 100
>>> close = pd.Series(100 + np.cumsum(np.random.randn(n) * 0.5))
>>> high = close + np.abs(np.random.randn(n) * 0.3)
>>> low = close - np.abs(np.random.randn(n) * 0.3)
>>> volume = pd.Series(np.random.randint(1_000_000, 5_000_000, n))
>>> feats = microstructure_features(high, low, close, volume)
>>> list(feats.columns)
['amihud_illiq', 'kyle_lambda', 'log_volume', 'volume_ma_ratio', 'dollar_volume']

References

Amihud (2002), “Illiquidity and stock returns”
Kyle (1985), “Continuous Auctions and Insider Trading”

See also

technical_features: Price-based technical indicators.
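
As a rough illustration of the amihud_illiq column, the sketch below computes a 21-day rolling mean of |return| / dollar volume. It assumes simple (not log) returns and uses a hypothetical helper name; the library may differ in both respects.

```python
import numpy as np
import pandas as pd


def sketch_amihud(close, volume, window=21):
    """Illustrative Amihud ratio (hypothetical helper, not wraquant code)."""
    dollar_volume = close * volume
    # Simple one-period return; the first value is NaN
    abs_ret = (close.diff() / close.shift(1)).abs()
    return (abs_ret / dollar_volume).rolling(window).mean()


rng = np.random.default_rng(0)
close = pd.Series(100 + np.cumsum(rng.normal(0.0, 0.5, 100)))
volume = pd.Series(rng.integers(1_000_000, 5_000_000, 100))

liquid = sketch_amihud(close, volume)
thin = sketch_amihud(close, volume / 10)  # same prices, one tenth of the volume
```

Cutting volume by a factor of ten raises the ratio tenfold, which matches the reading that higher values indicate a less liquid market.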

label_fixed_horizon(returns, horizon=5, threshold=0.0)[source]¶

Label future return direction over a fixed horizon.

Parameters:

returns (Series) – Period (e.g. daily) returns.
horizon (int, default: 5) – Number of periods to accumulate forward returns (default 5, i.e. one trading week).
threshold (float, default: 0.0) – If threshold > 0, three labels are produced: 1 (up beyond threshold), 0 (flat), -1 (down beyond threshold). If threshold == 0, binary labels (1 / 0) are produced where 1 means positive cumulative return.

Returns:

Integer labels aligned to the original index. The last horizon rows will be NaN (no future data available).

Return type:

Example

>>> import pandas as pd, numpy as np
>>> rets = pd.Series([0.01, -0.005, 0.02, 0.01, -0.03, 0.015, 0.005])
>>> labels = label_fixed_horizon(rets, horizon=3, threshold=0.0)
>>> labels.iloc[0]  # sum of rets[1:4] = -0.005+0.02+0.01 > 0
1

Notes

See also

label_triple_barrier: Volatility-adaptive labelling (Lopez de Prado).
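
The labelling rule can be sketched as follows. This is a minimal version under stated assumptions: forward returns are summed over the next horizon bars excluding the current one (consistent with the example above), and labels are stored as floats so that the trailing rows can hold NaN. The helper name is hypothetical.

```python
import numpy as np
import pandas as pd


def sketch_fixed_horizon(returns, horizon=5, threshold=0.0):
    """Illustrative fixed-horizon labels (hypothetical helper)."""
    # Sum of the next `horizon` returns, excluding the current bar
    fwd = returns.rolling(horizon).sum().shift(-horizon)
    if threshold > 0:
        labels = pd.Series(0.0, index=returns.index)
        labels[fwd > threshold] = 1.0
        labels[fwd < -threshold] = -1.0
    else:
        labels = (fwd > 0).astype(float)
    # No future data for the last `horizon` rows
    labels[fwd.isna()] = np.nan
    return labels


rets = pd.Series([0.01, -0.005, 0.02, 0.01, -0.03, 0.015, 0.005])
binary = sketch_fixed_horizon(rets, horizon=3)
ternary = sketch_fixed_horizon(rets, horizon=3, threshold=0.02)
```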

label_triple_barrier(close, upper=None, lower=None, max_holding=10)[source]¶

Triple-barrier labelling (Lopez de Prado).

For each bar the method sets three barriers:

Upper: price rises by upper fraction -> label = 1
Lower: price falls by lower fraction -> label = -1
Vertical: max_holding bars elapse -> label = sign of return

If upper or lower is None the corresponding horizontal barrier is disabled.

Parameters:

close (Series) – Close price series.
upper (float | None, default: None) – Fractional distance for the upper barrier (e.g. 0.02 for 2 %).
lower (float | None, default: None) – Fractional distance for the lower barrier (positive value; e.g. 0.02 for -2 %).
max_holding (int, default: 10) – Maximum holding period in bars (vertical barrier).

Returns:

Return type:

Example

>>> import pandas as pd
>>> close = pd.Series([100, 101, 102, 103, 100, 97, 98, 99, 100, 101])
>>> labels = label_triple_barrier(close, upper=0.03, lower=0.03, max_holding=5)
>>> labels.iloc[0]  # price rises 3% by bar 3 (103/100 - 1 = 0.03)
1

Notes

In practice, set upper and lower proportional to recent volatility (e.g., upper = lower = daily_vol * sqrt(max_holding)). This makes the labels regime-adaptive.

References

Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 3

See also

label_fixed_horizon: Simpler fixed-horizon labelling.
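
The three-barrier logic can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the library code: the holding window is truncated near the end of the series, the final bar gets NaN, and the helper name is hypothetical.

```python
import numpy as np


def sketch_triple_barrier(close, upper=None, lower=None, max_holding=10):
    """Illustrative triple-barrier labels (hypothetical helper)."""
    close = np.asarray(close, dtype=float)
    n = len(close)
    labels = np.full(n, np.nan)
    for t in range(n - 1):
        end = min(t + max_holding, n - 1)
        # Return path from bar t to each later bar in the holding window
        path = close[t + 1:end + 1] / close[t] - 1.0
        label = np.sign(path[-1])  # vertical barrier: sign of the final return
        for r in path:
            if upper is not None and r >= upper:
                label = 1.0
                break
            if lower is not None and r <= -lower:
                label = -1.0
                break
        labels[t] = label
    return labels


close = [100, 101, 102, 103, 100, 97, 98, 99, 100, 101]
labels = sketch_triple_barrier(close, upper=0.03, lower=0.03, max_holding=5)
vertical_only = sketch_triple_barrier(close, max_holding=5)
```

With both horizontal barriers disabled, bar 0 is labelled by the sign of the 5-bar return (97/100 - 1), which is negative.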

interaction_features(data, columns=None)[source]¶

Create pairwise interaction terms between features.

For each pair of selected columns (A, B), computes:

A_x_B: element-wise product (captures multiplicative relationships)
A_div_B: element-wise ratio A / B (captures relative magnitudes)

Parameters:

data (DataFrame) – Feature DataFrame.
columns (Optional[Sequence[str]], default: None) – Columns to use for interaction terms. If None, all columns are used.

Returns:

DataFrame containing all pairwise interaction features, with column names like col1_x_col2 and col1_div_col2.

Return type:

Example

>>> import pandas as pd, numpy as np
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
>>> result = interaction_features(df, columns=['a', 'b'])
>>> 'a_x_b' in result.columns
True
>>> 'a_div_b' in result.columns
True

cross_asset_features(asset, benchmark, windows=(10, 21, 63))[source]¶

Compute cross-asset relationship features.

Given an asset return series and a benchmark (or related asset) return series, computes rolling correlation, rolling beta, and relative strength for each window.

Parameters:

asset (Series) – Return series for the asset of interest.
benchmark (Series) – Return series for the benchmark or related asset.
windows (Sequence[int], default: (10, 21, 63)) – Rolling window sizes for correlation and beta calculations.

Returns:

over the window

Return type:

Example

>>> import pandas as pd, numpy as np
>>> np.random.seed(0)
>>> asset = pd.Series(np.random.randn(200) * 0.01, name='asset')
>>> bench = pd.Series(np.random.randn(200) * 0.01, name='bench')
>>> result = cross_asset_features(asset, bench, windows=[10, 21])
>>> 'rolling_corr_w10' in result.columns
True
>>> 'rolling_beta_w21' in result.columns
True
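
For orientation, rolling correlation and rolling beta for a single window can be sketched with pandas as below. The helper name is hypothetical, beta is taken as rolling covariance over rolling benchmark variance, and relative strength is omitted because its definition is not given here.

```python
import numpy as np
import pandas as pd


def sketch_cross_asset(asset, benchmark, window=21):
    """Illustrative rolling correlation and beta (hypothetical helper)."""
    corr = asset.rolling(window).corr(benchmark)
    # Beta = Cov(asset, benchmark) / Var(benchmark) over the same window
    beta = asset.rolling(window).cov(benchmark) / benchmark.rolling(window).var()
    return pd.DataFrame({
        f"rolling_corr_w{window}": corr,
        f"rolling_beta_w{window}": beta,
    })


rng = np.random.default_rng(0)
bench = pd.Series(rng.normal(0.0, 0.01, 200))
# Synthetic asset with a true beta of 1.5 plus small idiosyncratic noise
asset = 1.5 * bench + pd.Series(rng.normal(0.0, 0.002, 200))
out = sketch_cross_asset(asset, bench)
```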

regime_features(regime_probabilities, regime_labels=None)[source]¶

Create features from regime probabilities or labels.

Parameters:

regime_probabilities (DataFrame) – DataFrame where each column is the probability of a regime (e.g., columns ['bull', 'bear'] with probabilities summing to 1).
regime_labels (Series | None, default: None) – Hard regime labels. If None, the most probable regime at each step is used (argmax of the probability columns).

Returns:

DataFrame with columns:

current_regime: integer label of the current regime
regime_duration: number of consecutive periods in the current regime
regime_change: binary indicator (1 if regime changed)
transition_prob_w{w}: rolling mean of regime changes for w in [5, 10, 21]
one column per regime probability from the input

Return type:

Example

>>> import pandas as pd, numpy as np
>>> np.random.seed(42)
>>> p = np.random.dirichlet([5, 2], size=100)  # each row sums to 1
>>> probs = pd.DataFrame({'bull': p[:, 0], 'bear': p[:, 1]})
>>> result = regime_features(probs)
>>> 'current_regime' in result.columns
True
>>> 'regime_duration' in result.columns
True
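
The derivation of the label, change, and duration columns can be sketched as below. This is an assumption-laden illustration: durations are counted from 1, the first row is treated as "no change", and the helper name is hypothetical.

```python
import numpy as np
import pandas as pd


def sketch_regime_features(probs):
    """Illustrative regime label, change flag and run length (hypothetical)."""
    # Hard label = index of the most probable regime column
    current = pd.Series(np.argmax(probs.to_numpy(), axis=1), index=probs.index)
    change = (current != current.shift(1)).astype(int)
    change.iloc[0] = 0  # the first row has no predecessor
    # Each regime change starts a new run; count positions within each run
    run_id = change.cumsum()
    duration = current.groupby(run_id).cumcount() + 1
    return pd.DataFrame({
        "current_regime": current,
        "regime_change": change,
        "regime_duration": duration,
    })


probs = pd.DataFrame({
    "bull": [0.9, 0.8, 0.3, 0.2, 0.1, 0.7],
    "bear": [0.1, 0.2, 0.7, 0.8, 0.9, 0.3],
})
out = sketch_regime_features(probs)
```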

Preprocessing¶

Purged CV, fractional differentiation, and correlation matrix denoising.

Financial data preprocessing utilities.

Implements purged cross-validation, fractional differentiation, and random-matrix-theory denoising – all central to the Advances in Financial Machine Learning workflow (Lopez de Prado).

purged_kfold(X, y, n_splits=5, embargo_pct=0.01)[source]¶

Purged K-Fold cross-validation.

Ensures that training observations that immediately follow a test observation are removed (embargo) so that information cannot leak through overlapping labels.

Parameters:

X (DataFrame | ndarray) – Feature matrix (only its length is used).
y (Series | ndarray) – Target vector (only its length is used).
n_splits (int, default: 5) – Number of folds.
embargo_pct (float, default: 0.01) – Fraction of total samples to embargo after each test fold. For daily data with 5-day forward labels, 0.01 embargoes ~2.5 days on a 252-sample dataset.

Yields:

tuple[np.ndarray, np.ndarray] – (train_indices, test_indices) for each fold.

Return type:

Generator[tuple[ndarray, ndarray], None, None]

Example

>>> import numpy as np
>>> X = np.random.randn(500, 3)
>>> y = np.random.randn(500)
>>> folds = list(purged_kfold(X, y, n_splits=5, embargo_pct=0.02))
>>> len(folds)
5
>>> train_idx, test_idx = folds[0]
>>> len(train_idx) + len(test_idx) < 500  # embargo removes some samples
True

References

Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 7

See also

combinatorial_purged_kfold: Generates all C(n, k) purged splits.
wraquant.ml.pipeline.FinancialPipeline: Pipeline that uses purged K-fold.
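
The embargo mechanics can be sketched with index arithmetic alone. The generator below is a hypothetical stand-in, not the library function: it takes only the sample count, uses contiguous folds, and applies just the embargo described above (purging of training samples whose label windows overlap the test fold from the left is not shown).

```python
import numpy as np


def sketch_purged_kfold(n_samples, n_splits=5, embargo_pct=0.01):
    """Illustrative contiguous K-fold with an embargo (hypothetical helper)."""
    indices = np.arange(n_samples)
    embargo = int(n_samples * embargo_pct)
    for test_idx in np.array_split(indices, n_splits):
        start, stop = test_idx[0], test_idx[-1] + 1
        train_mask = np.ones(n_samples, dtype=bool)
        train_mask[start:stop] = False            # remove the test fold
        train_mask[stop:stop + embargo] = False   # embargo right after it
        yield indices[train_mask], test_idx


folds = list(sketch_purged_kfold(500, n_splits=5, embargo_pct=0.02))
first_train, first_test = folds[0]
last_train, last_test = folds[-1]
```

For 500 samples and a 2% embargo, the first fold trains on 390 rows (500 minus 100 test rows and 10 embargoed rows), while the last fold has nothing after it to embargo and trains on 400.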

combinatorial_purged_kfold(X, y, n_splits=5, n_test_splits=2, embargo_pct=0.01)[source]¶

Combinatorial purged K-Fold cross-validation.

Generates all C(n_splits, n_test_splits) train/test combinations, applying an embargo after each test group to prevent leakage.

Parameters:

X (DataFrame | ndarray) – Feature matrix.
y (Series | ndarray) – Target vector.
n_splits (int, default: 5) – Total number of groups.
n_test_splits (int, default: 2) – Number of groups held out for testing in each split.
embargo_pct (float, default: 0.01) – Fraction of total samples to embargo after each test group.

Yields:

tuple[np.ndarray, np.ndarray] – (train_indices, test_indices) for each combination.

Return type:

Generator[tuple[ndarray, ndarray], None, None]

Example

>>> import numpy as np
>>> X = np.random.randn(500, 3)
>>> y = np.random.randn(500)
>>> folds = list(combinatorial_purged_kfold(X, y, n_splits=5, n_test_splits=2))
>>> len(folds)  # C(5, 2) = 10
10

References

Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 12

See also

purged_kfold: Simpler purged K-fold with n_splits folds.
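
A sketch of how the C(n_splits, n_test_splits) combinations can be enumerated, assuming contiguous groups and an embargo applied after every test group. The helper name and its sample-count argument are hypothetical.

```python
from itertools import combinations

import numpy as np


def sketch_cpcv(n_samples, n_splits=5, n_test_splits=2, embargo_pct=0.01):
    """Illustrative combinatorial purged splits (hypothetical helper)."""
    groups = np.array_split(np.arange(n_samples), n_splits)
    embargo = int(n_samples * embargo_pct)
    for combo in combinations(range(n_splits), n_test_splits):
        train_mask = np.ones(n_samples, dtype=bool)
        for g in combo:
            start, stop = groups[g][0], groups[g][-1] + 1
            train_mask[start:stop] = False            # held-out group
            train_mask[stop:stop + embargo] = False   # embargo after the group
        test_idx = np.concatenate([groups[g] for g in combo])
        yield np.flatnonzero(train_mask), test_idx


splits = list(sketch_cpcv(500, n_splits=5, n_test_splits=2))
train0, test0 = splits[0]  # test groups 0 and 1
```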

fractional_differentiation(series, d=0.5, threshold=1e-05)[source]¶

Fractionally differentiate a time series.

Applies the fractional differentiation operator of order d (Hosking, 1981) to obtain a (near-)stationary series while preserving long-range memory.

The operator is defined as:

(1 - B)^d = sum_{k=0}^{inf} C(d,k) * (-B)^k

where B is the backshift operator and C(d,k) are the binomial-like weights.

Parameters:

series (Series) – Input time series (e.g., log prices).
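d (float, default: 0.5) – Fractional differentiation order (0 < d < 1 for partial differentiation; d = 1 is the standard first difference). Aim for the smallest d at which an ADF test still rejects a unit root at the desired significance level: start with d=0.5, lower it while the test keeps rejecting, and raise it if the series is still non-stationary.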
threshold (float, default: 1e-05) – Minimum absolute weight to retain. Smaller values use more lagged observations but increase computational cost.

Returns:

Fractionally differentiated series (initial rows where the full convolution is not available are dropped). Test stationarity with an ADF test; if non-stationary, increase d.

Return type:

Example

>>> import pandas as pd, numpy as np
>>> np.random.seed(42)
>>> prices = pd.Series(100 + np.cumsum(np.random.randn(300) * 0.5),
...                     name='close')
>>> frac_diff = fractional_differentiation(prices, d=0.4)
>>> len(frac_diff) < len(prices)  # initial rows dropped
True
>>> frac_diff.std() > 0  # non-trivial output
True

References

Hosking (1981), “Fractional Differencing”
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 5

See also

denoised_correlation: Random Matrix Theory denoising.
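
The weights of the (1 - B)^d expansion follow the recursion w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k. The sketch below applies them as a fixed-width filter; the helper names, the max_lags cap, and the fixed-width truncation are assumptions rather than the library's exact behaviour.

```python
import numpy as np


def fracdiff_weights(d, threshold=1e-5, max_lags=10_000):
    """Weights of (1 - B)^d, truncated once |w_k| < threshold."""
    weights = [1.0]
    k = 1
    while k < max_lags:
        w = -weights[-1] * (d - k + 1) / k
        if abs(w) < threshold:
            break
        weights.append(w)
        k += 1
    return np.array(weights)


def sketch_fracdiff(values, d=0.5, threshold=1e-5):
    """Illustrative fixed-width fractional difference (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    w = fracdiff_weights(d, threshold)
    if len(w) > len(values):
        raise ValueError("series shorter than the weight window; raise threshold")
    # 'valid' convolution gives out[t] = sum_k w[k] * values[t - k]
    return np.convolve(values, w, mode="valid")


rng = np.random.default_rng(42)
prices = 100 + np.cumsum(rng.normal(0.0, 0.5, 300))
frac = sketch_fracdiff(prices, d=0.5, threshold=1e-3)
first_diff = sketch_fracdiff(prices, d=1.0)
```

Setting d = 1 collapses the weights to [1, -1], so the output equals the ordinary first difference.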

denoised_correlation(returns, n_components=None)[source]¶

Denoise a correlation matrix using Random Matrix Theory.

Eigenvalues that fall within the Marchenko-Pastur distribution are replaced by their average, shrinking noise while preserving signal.

Parameters:

returns (DataFrame) – T x N return matrix (rows = observations, columns = assets).
n_components (int | None, default: None) – Number of signal eigenvalues to keep. If None, they are determined automatically from the Marchenko-Pastur bound.

Returns:

Denoised correlation matrix of shape (N, N). The matrix is symmetric, positive semi-definite, and has unit diagonal. Use it in place of returns.corr() for portfolio optimization.

Return type:

Example

>>> import pandas as pd, numpy as np
>>> np.random.seed(42)
>>> returns = pd.DataFrame(np.random.randn(252, 10) * 0.01)
>>> clean_corr = denoised_correlation(returns)
>>> clean_corr.shape
(10, 10)
>>> np.allclose(np.diag(clean_corr), 1.0)  # unit diagonal
True

Notes

The Marchenko-Pastur upper bound is:

lambda_+ = sigma^2 * (1 + sqrt(N/T))^2

Eigenvalues above this threshold are retained as “signal”; those below are replaced.

References

Laloux et al. (1999), “Noise dressing of financial correlation matrices”
Lopez de Prado (2020), “Machine Learning for Asset Managers”, Ch. 2

See also

detoned_correlation: Remove the market mode from a correlation matrix.
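
The eigenvalue-replacement step can be sketched as follows, assuming sigma^2 = 1 (appropriate for a correlation matrix) and an automatic cutoff at lambda_+. The helper name is hypothetical, and the library may fit sigma^2 or handle n_components differently.

```python
import numpy as np


def sketch_denoise(returns):
    """Illustrative Marchenko-Pastur denoising (hypothetical helper)."""
    T, N = returns.shape
    corr = np.corrcoef(returns, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    # Upper Marchenko-Pastur edge with sigma^2 = 1 (assumption)
    lambda_plus = (1.0 + np.sqrt(N / T)) ** 2
    noise = eigvals <= lambda_plus
    cleaned = eigvals.copy()
    if noise.any():
        # Replace all noise eigenvalues by their average (trace preserved)
        cleaned[noise] = eigvals[noise].mean()
    denoised = (eigvecs * cleaned) @ eigvecs.T
    # Rescale so the diagonal is exactly one again
    scale = np.sqrt(np.diag(denoised))
    return denoised / np.outer(scale, scale)


rng = np.random.default_rng(42)
market = rng.normal(size=(252, 1))
returns = 0.01 * (market + rng.normal(size=(252, 10)))
clean = sketch_denoise(returns)
```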

detoned_correlation(corr, n_components=1)[source]¶

Remove the first n_components eigenvectors (market mode) from a correlation matrix.

Parameters:

corr (ndarray) – Correlation matrix of shape (N, N).
n_components (int, default: 1) – Number of leading eigenvalues/vectors to remove (default 1, which removes only the market factor).

Returns:

De-toned correlation matrix of shape (N, N). The matrix is symmetric with unit diagonal but is not positive definite (some eigenvalues are set to zero).

Return type:

Example

>>> import numpy as np
>>> np.random.seed(42)
>>> corr = np.corrcoef(np.random.randn(5, 252))
>>> detoned = detoned_correlation(corr, n_components=1)
>>> detoned.shape
(5, 5)
>>> np.allclose(np.diag(detoned), 1.0)
True

References

Lopez de Prado (2020), “Machine Learning for Asset Managers”, Ch. 2

See also

denoised_correlation: Remove noise eigenvalues from a correlation matrix.
wraquant.ml.clustering.correlation_clustering: Cluster assets by correlation.
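
Detoning can be sketched by zeroing the leading eigenvalues and rescaling back to a unit diagonal. This is a minimal illustration with a hypothetical helper name; it assumes the residual diagonal stays strictly positive.

```python
import numpy as np


def sketch_detone(corr, n_components=1):
    """Illustrative removal of the leading eigen-components (hypothetical)."""
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    eigvals[:n_components] = 0.0               # drop the market mode(s)
    detoned = (eigvecs * eigvals) @ eigvecs.T
    scale = np.sqrt(np.diag(detoned))
    return detoned / np.outer(scale, scale)


rng = np.random.default_rng(0)
market = rng.normal(size=(252, 1))
returns = market + rng.normal(size=(252, 5))   # one common factor plus noise
corr = np.corrcoef(returns, rowvar=False)
detoned = sketch_detone(corr)

off_diag = ~np.eye(5, dtype=bool)
mean_before = corr[off_diag].mean()
mean_after = detoned[off_diag].mean()
```

Removing the common factor pulls the average off-diagonal correlation down, and the result has a zero eigenvalue, so it cannot be inverted directly.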

Models¶

Walk-forward training, ensembles, and feature importance.

Model wrappers for financial machine-learning workflows.

Functions that require scikit-learn are guarded by the @requires_extra('ml') decorator so that the rest of the package can be imported without it.

walk_forward_train(model, X, y, train_size=252, test_size=21, step_size=21)[source]¶

Walk-forward analysis with an expanding training window.

At each step the model is cloned (via scikit-learn’s clone), fitted on the training window, and used to predict the test window.

Parameters:

model (Any) – A scikit-learn-compatible estimator that implements fit and predict.
X (DataFrame | ndarray) – Feature matrix.
y (Series | ndarray) – Target vector.
train_size (int, default: 252) – Number of training observations in the first window (default 252, approximately one trading year).
test_size (int, default: 21) – Number of test observations per fold (default 21, approximately one trading month).
step_size (int, default: 21) – Number of observations to step forward between folds.

Returns:

predictions (np.ndarray): Concatenated out-of-sample predictions across all folds.
actuals (np.ndarray): Corresponding true values. Compare with predictions to measure forecast accuracy.
test_indices (np.ndarray): Original row indices for each prediction, useful for aligning results back to a DatetimeIndex.
n_folds (int): Number of walk-forward folds executed.

Return type:

Example

>>> from sklearn.linear_model import Ridge
>>> import numpy as np, pandas as pd
>>> np.random.seed(42)
>>> X = pd.DataFrame(np.random.randn(500, 3), columns=['mom', 'vol', 'size'])
>>> y = X['mom'] * 0.5 + np.random.randn(500) * 0.1
>>> result = walk_forward_train(Ridge(), X, y, train_size=252, test_size=21)
>>> result['n_folds'] > 0
True
>>> len(result['predictions']) == len(result['actuals'])
True

Notes

The window is expanding (all data from the start up to the current train end is used). For a rolling window, see wraquant.ml.pipeline.walk_forward_backtest which supports both modes.

See also

wraquant.ml.pipeline.walk_forward_backtest: Full walk-forward backtest with PnL.
wraquant.ml.preprocessing.purged_kfold: Purged K-fold cross-validation.
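
The expanding-window schedule can be sketched as an index generator. The helper name is hypothetical, and dropping an incomplete final test window is an assumption about edge handling.

```python
import numpy as np


def sketch_walk_forward_splits(n_samples, train_size=252, test_size=21, step_size=21):
    """Illustrative expanding-window splits (hypothetical helper)."""
    train_end = train_size
    # Stop once a full test window no longer fits (assumption)
    while train_end + test_size <= n_samples:
        train_idx = np.arange(0, train_end)                    # always from the start
        test_idx = np.arange(train_end, train_end + test_size)
        yield train_idx, test_idx
        train_end += step_size


splits = list(sketch_walk_forward_splits(500))
test_indices = np.concatenate([test for _, test in splits])
```

With 500 rows and the default sizes this yields 11 folds: the first trains on rows 0-251 and tests on 252-272, and each later fold grows the training set by 21 rows.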

ensemble_predict(models, X, method='mean')[source]¶

Generate ensemble predictions from multiple fitted models.

Parameters:

models (Sequence[Any]) – Fitted scikit-learn-compatible estimators. Each must implement predict(X).
X (DataFrame | ndarray) – Feature matrix.
method (Literal['mean', 'median', 'vote'], default: 'mean') – Aggregation method. 'mean' and 'median' average the raw predictions (best for regression); 'vote' takes the mode (majority vote, best for classification).

Returns:

Aggregated predictions. For 'mean'/'median', the values are continuous. For 'vote', the values are discrete class labels.

Return type:

Example

>>> from sklearn.linear_model import Ridge, Lasso
>>> import numpy as np
>>> np.random.seed(0)
>>> X_train = np.random.randn(200, 3)
>>> y_train = X_train @ [1, 0.5, 0] + np.random.randn(200) * 0.1
>>> m1 = Ridge().fit(X_train, y_train)
>>> m2 = Lasso(alpha=0.01).fit(X_train, y_train)
>>> X_test = np.random.randn(50, 3)
>>> preds = ensemble_predict([m1, m2], X_test, method='mean')
>>> preds.shape
(50,)

See also

walk_forward_train: Walk-forward evaluation for individual models.
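
The three aggregation methods can be sketched on pre-computed prediction arrays. The helper below is hypothetical and takes predictions rather than models; breaking ties in 'vote' toward the smallest label is an assumption.

```python
import numpy as np


def sketch_aggregate(predictions, method="mean"):
    """Illustrative aggregation of per-model predictions (hypothetical)."""
    stacked = np.vstack(predictions)  # shape (n_models, n_samples)
    if method == "mean":
        return stacked.mean(axis=0)
    if method == "median":
        return np.median(stacked, axis=0)
    if method == "vote":
        voted = []
        for column in stacked.T:
            values, counts = np.unique(column, return_counts=True)
            voted.append(values[np.argmax(counts)])  # ties go to the smallest label
        return np.array(voted)
    raise ValueError(f"unknown method: {method!r}")


class_preds = [np.array([1, 0, 1]), np.array([1, 1, 0]), np.array([0, 1, 1])]
voted = sketch_aggregate(class_preds, method="vote")
averaged = sketch_aggregate(class_preds, method="mean")
```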

feature_importance_mdi(model, feature_names)[source]¶

Mean Decrease Impurity (MDI) feature importance.

Reads model.feature_importances_ (available on tree-based estimators after fitting) and returns a sorted pd.Series.

Parameters:

model (Any) – A fitted tree-based estimator with a feature_importances_ attribute (e.g. RandomForestClassifier).
feature_names (Sequence[str]) – Feature names corresponding to the columns of the training data.

Returns:

Importance values indexed by feature name, sorted descending. Higher values indicate features that contributed more to splits. Values sum to 1.0 for scikit-learn tree ensembles.

Return type:

Example

>>> from sklearn.ensemble import RandomForestClassifier
>>> import numpy as np
>>> np.random.seed(42)
>>> X = np.random.randn(300, 4)
>>> y = (X[:, 0] > 0).astype(int)
>>> rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
>>> imp = feature_importance_mdi(rf, ['momentum', 'vol', 'size', 'value'])
>>> imp.index[0]  # most important feature
'momentum'

Notes

MDI is biased toward high-cardinality and continuous features. For an unbiased alternative, use feature_importance_mda (permutation importance).

See also

feature_importance_mda: Permutation-based importance (unbiased).
wraquant.ml.advanced.random_forest_importance: Combined RF fit + importance.

feature_importance_mda(model, X, y, feature_names, n_repeats=10)[source]¶

Mean Decrease Accuracy (permutation importance).

Repeatedly permutes each feature and measures the decrease in the model’s score.

Parameters:

model (Any) – A fitted scikit-learn-compatible estimator.
X (DataFrame | ndarray) – Feature matrix (test or validation set).
y (Series | ndarray) – True labels.
feature_names (Sequence[str]) – Feature names corresponding to columns of X.
n_repeats (int, default: 10) – Number of permutation repeats per feature. More repeats yield more stable estimates but increase runtime linearly.

Returns:

Mean importance values indexed by feature name, sorted descending. Positive values indicate features whose permutation hurts the model score; negative values suggest noise features.

Return type:

Example

>>> from sklearn.ensemble import RandomForestClassifier
>>> import numpy as np
>>> np.random.seed(42)
>>> X = np.random.randn(300, 4)
>>> y = (X[:, 0] + 0.3 * X[:, 2] > 0).astype(int)
>>> rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
>>> imp = feature_importance_mda(rf, X, y, ['mom', 'vol', 'size', 'val'])
>>> imp.iloc[0] > 0  # top feature has positive importance
True

Notes

References

Breiman (2001), “Random Forests”
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 8

See also

feature_importance_mdi: Faster but biased impurity-based importance.
wraquant.ml.pipeline.feature_importance_shap: SHAP-based importance.
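
The permutation procedure can be sketched without scikit-learn. In this illustration the "model" is a hypothetical stand-in (a plain predict callable), the score is accuracy, and the helper name is invented for the example.

```python
import numpy as np


def sketch_permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Illustrative mean decrease in accuracy (hypothetical helper)."""
    rng = np.random.default_rng(seed)

    def accuracy(features):
        return np.mean(predict(features) == y)

    baseline = accuracy(X)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffle one column to break its link with the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - accuracy(X_perm))
        importances[j] = np.mean(drops)
    return importances


rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)


def predict(features):
    # Stand-in for a fitted model that only looks at column 0
    return (features[:, 0] > 0).astype(int)


imp = sketch_permutation_importance(predict, X, y)
```

Only the first feature drives the predictions, so shuffling it costs roughly half the accuracy while the other columns score exactly zero.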

sequential_feature_selection(model, X, y, n_features=5, direction='forward', cv=5)[source]¶

Sequential (forward / backward) feature selection.

Parameters:

model (Any) – A scikit-learn-compatible estimator.
X (DataFrame | ndarray) – Feature matrix.
y (Series | ndarray) – Target vector.
n_features (int, default: 5) – Number of features to select.
direction (Literal['forward', 'backward'], default: 'forward') – Selection direction. Forward is faster when n_features is small relative to total features; backward is faster when you want to drop only a few.
cv (int, default: 5) – Number of cross-validation folds.

Returns:

Selected feature names (if X is a DataFrame) or column indices.

Return type:

list[str | int]

Example

>>> from sklearn.linear_model import Ridge
>>> import numpy as np, pandas as pd
>>> np.random.seed(42)
>>> X = pd.DataFrame(np.random.randn(200, 6),
...                   columns=['f1','f2','f3','f4','f5','f6'])
>>> y = X['f1'] * 2 + X['f3'] + np.random.randn(200) * 0.1
>>> selected = sequential_feature_selection(Ridge(), X, y, n_features=2)
>>> len(selected)
2

See also

feature_importance_mdi: Impurity-based ranking (faster, less rigorous).
feature_importance_mda: Permutation-based ranking.

Deep Learning¶

LSTM, GRU, Transformer, and autoencoder architectures for time-series forecasting. Requires PyTorch.

Deep learning models for quantitative finance.

Provides PyTorch-based neural network architectures tailored for financial time-series forecasting and feature extraction. All torch imports are guarded so the rest of the package works without PyTorch installed.

Models included:

LSTM forecasting
Transformer-based forecasting
GRU forecasting
Variational Autoencoder for feature extraction

lstm_forecast(series, seq_length=20, hidden_dim=64, n_layers=2, dropout=0.1, n_epochs=50, lr=0.001, train_ratio=0.8, batch_size=32)[source]¶

Forecast a financial time series using an LSTM network.

When to use:

Use LSTM for multi-step forecasting when you have >1000 observations and suspect non-linear temporal dependencies. Works well for return prediction, volatility forecasting, and spread modeling.

Mathematical background:

At each time step t, the LSTM cell computes:

f_t = sigma(W_f [h_{t-1}, x_t] + b_f)  (forget gate)
i_t = sigma(W_i [h_{t-1}, x_t] + b_i)  (input gate)
o_t = sigma(W_o [h_{t-1}, x_t] + b_o)  (output gate)
c_t = f_t * c_{t-1} + i_t * tanh(W_c [h_{t-1}, x_t] + b_c)
h_t = o_t * tanh(c_t)

The cell state c_t acts as a conveyor belt, allowing gradients to flow across many time steps without vanishing.
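
A single cell update following the gate equations above can be written in NumPy. This is only a sketch for intuition: the function works through torch.nn internally, and the dictionary layout of the weights here is an assumption made for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W[k] has shape (hidden, hidden + n_inputs) for k in f, i, o, c."""
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f = sigmoid(W["f"] @ z + b["f"])           # forget gate
    i = sigmoid(W["i"] @ z + b["i"])           # input gate
    o = sigmoid(W["o"] @ z + b["o"])           # output gate
    c = f * c_prev + i * np.tanh(W["c"] @ z + b["c"])
    h = o * np.tanh(c)
    return h, c


hidden, n_inputs = 3, 1
W = {k: np.zeros((hidden, hidden + n_inputs)) for k in "fioc"}
b = {k: np.zeros(hidden) for k in "fioc"}
h, c = lstm_step(np.array([0.5]), np.zeros(hidden), np.ones(hidden), W, b)
```

With all-zero weights every gate equals 0.5, so the cell state is simply halved (c = 0.5) and h = 0.5 * tanh(0.5).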

Parameters:

series (Series | ndarray) – Univariate time series (e.g., log returns, prices, spreads).
seq_length (int, default: 20) – Number of look-back time steps for each input sequence.
hidden_dim (int, default: 64) – Number of hidden units in each LSTM layer.
n_layers (int, default: 2) – Number of stacked LSTM layers.
dropout (float, default: 0.1) – Dropout probability between LSTM layers (applied only when n_layers > 1).
n_epochs (int, default: 50) – Number of training epochs.
lr (float, default: 0.001) – Learning rate for the Adam optimizer.
train_ratio (float, default: 0.8) – Fraction of data used for training (the rest is used for testing). The split is chronological – no shuffling.
batch_size (int, default: 32) – Mini-batch size for training.

Returns:

predictions: np.ndarray of test-set predictions, actuals: np.ndarray of actual test values, train_losses: list of per-epoch training losses, model: the trained torch.nn.Module.

Return type:

Raises:

ImportError – If PyTorch is not installed.

Example

>>> import numpy as np
>>> returns = np.cumsum(np.random.randn(500) * 0.01)
>>> result = lstm_forecast(returns, seq_length=10, n_epochs=20)
>>> result["predictions"].shape
(80,)

Financial time series are notoriously noisy; LSTM is prone to overfitting on noise. Use dropout, early stopping, and validation.
Chronological train/test split is critical to avoid lookahead bias.
Normalisation (handled internally) is essential for gradient stability.

References

Hochreiter & Schmidhuber (1997), “Long Short-Term Memory”
Fischer & Krauss (2018), “Deep learning with long short-term memory networks for financial market predictions”

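transformer_forecast(series, seq_length=20, d_model=64, n_heads=4, n_encoder_layers=2, dim_feedforward=128, dropout=0.1, n_epochs=50, lr=0.001, train_ratio=0.8, batch_size=32)[source]¶

Forecast a financial time series using a Transformer encoder.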

When to use:

Use Transformers when you have sufficient data (>2000 observations) and suspect that long-range dependencies matter. They often outperform LSTMs on longer sequences but require more data and compute.

Mathematical background:

Self-attention computes:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

where Q, K, V are linear projections of the input. Multi-head attention runs h parallel attention heads and concatenates:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W_O

Positional encoding injects order information:

PE(pos, 2i) = sin(pos / 10000^{2i/d_model})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})
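
Both formulas can be checked with a small NumPy sketch. It covers a single attention head and assumes an even d_model; the function names are illustrative and unrelated to the PyTorch modules used internally.

```python
import numpy as np


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights


def positional_encoding(seq_length, d_model):
    """Sinusoidal encoding; d_model is assumed to be even."""
    pos = np.arange(seq_length)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((seq_length, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe


rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
context, weights = scaled_dot_product_attention(Q, K, V)
pe = positional_encoding(seq_length=10, d_model=8)
```

Each row of weights sums to one, and position 0 encodes as alternating 0 (sine) and 1 (cosine).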

Parameters:

series (Series | ndarray) – Univariate time series.
seq_length (int, default: 20) – Number of look-back time steps.
d_model (int, default: 64) – Embedding dimension (must be divisible by n_heads).
n_heads (int, default: 4) – Number of attention heads.
n_encoder_layers (int, default: 2) – Number of Transformer encoder layers.
dim_feedforward (int, default: 128) – Hidden dimension in the feedforward sub-layers.
dropout (float, default: 0.1) – Dropout probability.
n_epochs (int, default: 50) – Number of training epochs.
lr (float, default: 0.001) – Learning rate for Adam.
train_ratio (float, default: 0.8) – Fraction of data for training.
batch_size (int, default: 32) – Mini-batch size.

Returns:

predictions: np.ndarray of test-set predictions, actuals: np.ndarray of actual test values, train_losses: list of per-epoch training losses, model: the trained torch.nn.Module.

Return type:

Raises:

ImportError – If PyTorch is not installed.

Example

>>> import numpy as np
>>> prices = np.cumsum(np.random.randn(600) * 0.01) + 100
>>> result = transformer_forecast(prices, seq_length=15, n_epochs=10)
>>> len(result["predictions"]) > 0
True

Transformers are data-hungry; on small datasets (<500 obs) they will overfit severely.
Quadratic memory in sequence length: keep seq_length reasonable (< 256 for typical financial data).
No inherent notion of order without positional encoding.

References

Vaswani et al. (2017), “Attention Is All You Need”
Li et al. (2019), “Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting”

autoencoder_features(X, latent_dim=8, hidden_dim=64, n_epochs=50, lr=0.001, batch_size=32, beta=1.0)[source]¶

Extract latent features using a Variational Autoencoder (VAE).

A VAE learns a compressed, continuous latent representation of high-dimensional input features. In finance, this is valuable for:

Regime detection: Cluster the latent codes to find market states.
Anomaly detection: High reconstruction error flags unusual market conditions (flash crashes, liquidity crises).
Feature compression: Reduce hundreds of technical indicators to a handful of orthogonal latent factors.

When to use:

Use when you have a wide feature matrix (>20 features) and want to discover latent structure, detect anomalies, or reduce dimensionality in a non-linear way that PCA cannot capture.

Mathematical background:

The VAE optimises the Evidence Lower Bound (ELBO):

L = E_q[log p(x|z)] - beta * KL(q(z|x) || p(z))

where q(z|x) = N(mu(x), sigma^2(x)) is the encoder, p(x|z) is the decoder, and p(z) = N(0, I) is the prior. The KL term regularises the latent space to be smooth and continuous.
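
For a diagonal Gaussian encoder and a standard-normal prior, the KL term has the closed form KL = -0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2). The sketch below evaluates the resulting loss (negative ELBO) in NumPy; using squared error as the reconstruction term is an assumption (it corresponds to a Gaussian decoder), and the function name is illustrative.

```python
import numpy as np


def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """Negative ELBO with a squared-error reconstruction term (assumption)."""
    recon = np.sum((x - x_recon) ** 2, axis=1)
    # Closed-form KL(N(mu, sigma^2) || N(0, I)) per sample
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)
    return np.mean(recon + beta * kl), np.mean(recon), np.mean(kl)


x = np.ones((4, 6))
zeros = np.zeros((4, 3))

# Perfect reconstruction and posterior equal to the prior: loss is zero
total_zero, _, kl_zero = vae_loss(x, x, mu=zeros, log_var=zeros)

# Shifting the posterior mean to 1 in each of 3 latent dims costs 0.5 per dim
total_shift, _, kl_shift = vae_loss(x, x, mu=np.ones((4, 3)), log_var=zeros, beta=2.0)
```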

Parameters:

X (DataFrame | ndarray) – Feature matrix of shape (n_samples, n_features).
latent_dim (int, default: 8) – Dimensionality of the latent space.
hidden_dim (int, default: 64) – Hidden layer size in encoder/decoder.
n_epochs (int, default: 50) – Training epochs.
lr (float, default: 0.001) – Learning rate.
batch_size (int, default: 32) – Mini-batch size.
beta (float, default: 1.0) – Weight on the KL divergence term. beta=1 is standard VAE; beta<1 gives more reconstruction accuracy; beta>1 forces more disentangled representations.

Returns:

Return type:

Raises:

ImportError – If PyTorch is not installed.

Example

>>> import numpy as np
>>> X = np.random.randn(500, 30)  # 30 features
>>> result = autoencoder_features(X, latent_dim=5, n_epochs=20)
>>> result["latent_features"].shape
(500, 5)

Normalise your features before encoding; the VAE assumes roughly standard-normal inputs for stable training.
The latent space is stochastic; for deterministic embeddings, use the mean (mu), which is what this function returns.
Reconstruction error thresholds for anomaly detection should be calibrated on clean training data.

References

Kingma & Welling (2014), “Auto-Encoding Variational Bayes”
An & Cho (2015), “Variational Autoencoder based Anomaly Detection using Reconstruction Probability”

gru_forecast(series, seq_length=20, hidden_dim=64, n_layers=2, dropout=0.1, n_epochs=50, lr=0.001, train_ratio=0.8, batch_size=32)[source]¶

Forecast a financial time series using a GRU network.

When to use:

Use GRU as a computationally cheaper alternative to LSTM. Preferred when you have moderate-sized datasets (500-5000 observations) or need faster iteration during model development.

Mathematical background:

The GRU update equations at time step t:

z_t = sigma(W_z [h_{t-1}, x_t])  (update gate)
r_t = sigma(W_r [h_{t-1}, x_t])  (reset gate)
h_t_hat = tanh(W [r_t * h_{t-1}, x_t])  (candidate)
h_t = (1 - z_t) * h_{t-1} + z_t * h_t_hat

Compared to LSTM, GRU has no separate cell state and uses two gates instead of three, giving ~25% fewer parameters.
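
One GRU step and the parameter comparison can be sketched in NumPy. This is for intuition only: biases are omitted as in the equations above, and the function names and weight layout are assumptions rather than the torch.nn.GRU internals.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step; each W has shape (hidden, hidden + n_inputs)."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ hx)                                      # update gate
    r = sigmoid(W_r @ hx)                                      # reset gate
    h_hat = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))   # candidate
    return (1.0 - z) * h_prev + z * h_hat


def n_weights(n_blocks, hidden, n_inputs):
    """Weight count for n_blocks matrices of shape (hidden, hidden + n_inputs)."""
    return n_blocks * hidden * (hidden + n_inputs)


hidden, n_inputs = 4, 2
zero_W = np.zeros((hidden, hidden + n_inputs))
h_new = gru_step(np.ones(n_inputs), np.ones(hidden), zero_W, zero_W, zero_W)

# GRU has 3 weight blocks (z, r, candidate); LSTM has 4 (f, i, o, c)
ratio = n_weights(3, 64, 1) / n_weights(4, 64, 1)
```

The ratio is 0.75, which is where the "~25% fewer parameters" figure comes from.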

Parameters:

series (Series | ndarray) – Univariate time series.
seq_length (int, default: 20) – Number of look-back time steps.
hidden_dim (int, default: 64) – Number of hidden units per GRU layer.
n_layers (int, default: 2) – Number of stacked GRU layers.
dropout (float, default: 0.1) – Dropout between layers (only when n_layers > 1).
n_epochs (int, default: 50) – Training epochs.
lr (float, default: 0.001) – Learning rate.
train_ratio (float, default: 0.8) – Fraction of data for training.
batch_size (int, default: 32) – Mini-batch size.

Returns:

predictions: np.ndarray of test-set predictions, actuals: np.ndarray of actual test values, train_losses: list of per-epoch training losses, model: the trained torch.nn.Module.

Return type:

Raises:

ImportError – If PyTorch is not installed.

Example

>>> import numpy as np
>>> vol = np.abs(np.random.randn(400)) * 0.02
>>> result = gru_forecast(vol, seq_length=10, n_epochs=15)
>>> result["predictions"].shape[0] > 0
True

Same overfitting risks as LSTM; use dropout and validation.
On very long sequences (>200 steps), Transformers may outperform GRU.

References

Cho et al. (2014), “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation”

multivariate_lstm_forecast(features, target, seq_length=20, hidden_dim=64, n_layers=2, dropout=0.1, n_epochs=50, lr=0.001, train_ratio=0.8, batch_size=32)[source]¶

Forecast a target series using multiple input features via LSTM.

Mathematical background:

The LSTM cell equations are the same as in lstm_forecast, but the input dimensionality is now n_features rather than 1:

x_t in R^{n_features}
f_t = sigma(W_f [h_{t-1}, x_t] + b_f)
i_t = sigma(W_i [h_{t-1}, x_t] + b_i)
o_t = sigma(W_o [h_{t-1}, x_t] + b_o)

The weight matrices W_f, W_i, W_o, W_c have input dimension n_features instead of 1, allowing the network to learn cross-feature temporal dependencies.

Parameters:

features (DataFrame) – DataFrame of shape (T, n_features) containing the input features. All columns are used as inputs to the LSTM.
target (Series | ndarray) – Target variable of length T to predict.
seq_length (int, default: 20) – Number of look-back time steps for each input sequence.
hidden_dim (int, default: 64) – Number of hidden units in each LSTM layer.
n_layers (int, default: 2) – Number of stacked LSTM layers.
dropout (float, default: 0.1) – Dropout probability between LSTM layers (applied only when n_layers > 1).
n_epochs (int, default: 50) – Number of training epochs.
lr (float, default: 0.001) – Learning rate for the Adam optimizer.
train_ratio (float, default: 0.8) – Fraction of data used for training (chronological split).
batch_size (int, default: 32) – Mini-batch size for training.

Returns:

Return type:

Raises:

ImportError – If PyTorch is not installed.

Example

>>> import numpy as np, pandas as pd
>>> np.random.seed(42)
>>> df = pd.DataFrame({
...     'asset_a': np.cumsum(np.random.randn(500) * 0.01),
...     'asset_b': np.cumsum(np.random.randn(500) * 0.01),
...     'vix': np.abs(np.random.randn(500)) * 15 + 15,
... })
>>> target = pd.Series(np.cumsum(np.random.randn(500) * 0.01))
>>> result = multivariate_lstm_forecast(df, target, seq_length=10, n_epochs=5)
>>> result["predictions"].shape[0] > 0
True

References

Hochreiter & Schmidhuber (1997), “Long Short-Term Memory”
Fischer & Krauss (2018), “Deep learning with long short-term memory networks for financial market predictions”

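temporal_fusion_transformer(features, target, seq_length=20, hidden_dim=64, n_heads=4, n_lstm_layers=1, dropout=0.1, n_epochs=50, lr=0.001, train_ratio=0.8, batch_size=32)[source]¶

Simplified Temporal Fusion Transformer for interpretable forecasting.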

Architecture:

Variable Selection Network (VSN): A soft-attention gate over input features. Each feature is projected to hidden_dim, then a shared softmax gate selects the most relevant ones.
LSTM Encoder: Processes the selected features sequentially to capture local temporal patterns.
Multi-Head Attention: Attends over the LSTM outputs to capture long-range dependencies (e.g., monthly seasonality).
Gated Residual Network (GRN): Skip connections with gating for stable training on noisy financial data.
Output layer: Linear projection to produce the forecast.
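The variable selection step can be pictured as a softmax-weighted sum of per-feature projections. The sketch below is an illustration only: W_proj and W_gate are random stand-ins for learned parameters, and their shapes are assumptions rather than the layer shapes used by temporal_fusion_transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, hidden_dim = 3, 16
x = rng.normal(size=n_features)             # one time step of inputs

# Each scalar feature gets its own projection vector into hidden_dim.
W_proj = rng.normal(size=(n_features, hidden_dim))
projected = x[:, None] * W_proj             # (n_features, hidden_dim)

# A gate scores the features; softmax turns the scores into weights summing to 1.
W_gate = rng.normal(size=(n_features, n_features))
logits = W_gate @ x
weights = np.exp(logits - logits.max())     # subtract max for numerical stability
weights /= weights.sum()

# The selected representation is the weighted sum of the projected features.
selected = weights @ projected              # (hidden_dim,)
print(weights.round(3), selected.shape)
```

Because the weights are non-negative and sum to one, they can be read as relative feature relevance at that time step.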

Parameters:

features (DataFrame) – DataFrame of shape (T, n_features) containing the input features.
target (Series | ndarray) – Target variable of length T.
seq_length (int, default: 20) – Number of look-back time steps.
hidden_dim (int, default: 64) – Dimensionality of the hidden representations.
n_heads (int, default: 4) – Number of attention heads (must divide hidden_dim).
n_lstm_layers (int, default: 1) – Number of LSTM layers in the encoder.
dropout (float, default: 0.1) – Dropout probability.
n_epochs (int, default: 50) – Number of training epochs.
lr (float, default: 0.001) – Learning rate for Adam.
train_ratio (float, default: 0.8) – Fraction of data for training (chronological split).
batch_size (int, default: 32) – Mini-batch size.

Returns:

Return type:

Raises:

ImportError – If PyTorch is not installed.

Example

>>> import numpy as np, pandas as pd
>>> np.random.seed(42)
>>> df = pd.DataFrame({
...     'momentum': np.random.randn(500),
...     'volume': np.abs(np.random.randn(500)),
...     'spread': np.random.randn(500) * 0.1,
... })
>>> target = pd.Series(np.cumsum(np.random.randn(500) * 0.01))
>>> result = temporal_fusion_transformer(
...     df, target, seq_length=10, hidden_dim=16, n_heads=2, n_epochs=5
... )
>>> result["predictions"].shape[0] > 0
True
>>> len(result["feature_importance"]) == 3
True

References

Lim et al. (2021), “Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting”

Advanced Models

SVM, Random Forest, Gradient Boosting, Gaussian Process, Isolation Forest, PCA factor models.

Advanced scikit-learn models for quantitative finance.

Provides production-ready wrappers around SVM, Random Forest, Gradient Boosting, Gaussian Process, Isolation Forest, and PCA – all with finance-specific defaults, comprehensive docstrings, and clean return interfaces.

All functions guard sklearn imports behind @requires_extra('ml') so the rest of wraquant works without scikit-learn installed.

svm_classifier(X_train, y_train, X_test, y_test, kernel='rbf', C_range=(0.1, 1.0, 10.0), gamma_range=('scale', 0.01, 0.1), cv=5)

Train an SVM classifier for market regime classification.

When to use:

Mathematical background:

SVM solves:

min_{w,b} (1/2) ||w||^2 + C * sum_i max(0, 1 - y_i (w . x_i + b))

The RBF kernel maps inputs to an infinite-dimensional space:

K(x, x') = exp(-gamma * ||x - x'||^2)

Grid search over C (regularisation) and gamma (kernel width) selects the best hyperparameters via cross-validation.
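The objective and kernel can be evaluated directly for a small case. This is a sketch of the formulas only (binary labels in {-1, +1}, a fixed w and b), not of the solver or the grid search that svm_classifier runs; hinge_objective and rbf_kernel are illustrative names.

```python
import numpy as np

def hinge_objective(w, b, X, y, C):
    """Soft-margin objective for labels y in {-1, +1}."""
    margins = y * (X @ w + b)
    return 0.5 * w @ w + C * np.maximum(0.0, 1.0 - margins).sum()

def rbf_kernel(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off

X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, -1.0, -1.0])
w, b = np.array([1.0, 0.0]), 0.0

# Only the third point violates the margin: hinge = 1 - (-1)(0.5) = 1.5.
obj = hinge_objective(w, b, X, y, C=1.0)
K = rbf_kernel(X, X, gamma=0.5)
print(obj, K[0, 1])
```

Larger C penalises margin violations more heavily; larger gamma makes the kernel decay faster with distance, so each training point influences a smaller neighbourhood.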

Parameters:

X_train (DataFrame | ndarray) – Training feature matrix.
y_train (Series | ndarray) – Training labels (e.g., 1 = bull, 0 = neutral, -1 = bear).
X_test (DataFrame | ndarray) – Test feature matrix.
y_test (Series | ndarray) – Test labels.
kernel (Literal['rbf', 'linear', 'poly'], default: 'rbf') – SVM kernel function.
C_range (Sequence[float], default: (0.1, 1.0, 10.0)) – Regularisation parameter values to search.
gamma_range (Sequence[float | str], default: ('scale', 0.01, 0.1)) – Kernel coefficient values to search (ignored for linear kernel).
cv (int, default: 5) – Cross-validation folds for grid search.

Returns:

model – Fitted SVC.
predictions (np.ndarray) – Test predictions.
accuracy (float)
confusion_matrix (np.ndarray)
best_params (dict) – Best C and gamma.
cv_score (float) – Mean CV accuracy.

Return type:

Example

>>> import numpy as np
>>> X = np.random.randn(200, 5)
>>> y = (X[:, 0] > 0).astype(int)
>>> result = svm_classifier(X[:150], y[:150], X[150:], y[150:])
>>> result["accuracy"] > 0.5
True

Scale features before training (StandardScaler recommended).
SVM training needs up to O(n^2) memory and between O(n^2) and O(n^3) time – avoid for n > 100k.
For imbalanced classes, set class_weight='balanced' on the SVC.

References

Cortes & Vapnik (1995), “Support-Vector Networks”

random_forest_importance(X, y, feature_names=None, n_estimators=100, max_depth=5, random_state=42, task='classification')

Rank features by importance using a Random Forest.

When to use:

Use as a first-pass feature selector when you have many candidate features (>20) and want to identify which ones carry signal. Fast, non-parametric, and handles mixed feature types.

Mathematical background:

Mean Decrease Impurity (MDI) for feature j:

Imp(j) = sum_{t in T_j} p(t) * Delta_i(t)

where T_j is the set of tree nodes splitting on feature j, p(t) is the fraction of samples reaching node t, and Delta_i(t) is the impurity decrease. MDI is averaged over all trees in the forest.
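A single term p(t) * Delta_i(t) of that sum can be worked out by hand. The sketch assumes Gini impurity and one split at the root node; gini and impurity_decrease are illustrative helpers, not wraquant functions.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)

def impurity_decrease(y_node, go_left, n_total):
    """p(t) * Delta_i(t) for one split; go_left is a boolean mask over the node."""
    y_left, y_right = y_node[go_left], y_node[~go_left]
    n = len(y_node)
    child = (len(y_left) * gini(y_left) + len(y_right) * gini(y_right)) / n
    return (n / n_total) * (gini(y_node) - child)

y = np.array([0, 0, 0, 1, 1, 1, 1, 0])
x0 = np.array([0.1, 0.2, 0.3, 0.9, 0.8, 0.7, 0.6, 0.4])

# Splitting on x0 < 0.5 separates the classes perfectly: 0.5 -> 0 impurity.
contrib = impurity_decrease(y, x0 < 0.5, n_total=len(y))
print(contrib)
```

Summing such terms over every node that splits on feature j, then averaging across trees, gives the MDI score.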

Parameters:

X (DataFrame | ndarray) – Feature matrix.
y (Series | ndarray) – Target vector.
feature_names (Optional[Sequence[str]], default: None) – Feature names. If None and X is a DataFrame, column names are used.
n_estimators (int, default: 100) – Number of trees.
max_depth (int | None, default: 5) – Maximum tree depth (None for unlimited).
random_state (int, default: 42) – Random seed for reproducibility.
task (Literal['classification', 'regression'], default: 'classification') – Type of prediction task.

Returns:

importance (pd.Series) – Feature importances sorted descending.
model – Fitted RandomForest estimator.
oob_score (float | None) – Out-of-bag score if available, else None.

Return type:

Example

>>> import numpy as np
>>> X = np.random.randn(300, 10)
>>> y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)
>>> result = random_forest_importance(X, y)
>>> result["importance"].index[0]  # top feature is likely 0
0

MDI importance is biased toward high-cardinality features; consider permutation importance (feature_importance_mda) as a complement.
Correlated features share importance, causing both to appear weaker.

References

Breiman (2001), “Random Forests”
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 8

Gradient boosting for forecasting or classification.

When to use:

Mathematical background:

At each stage m, the model adds a tree h_m scaled by the learning rate nu:

F_m(x) = F_{m-1}(x) + nu * h_m(x)

where h_m is fitted to the negative gradient of the loss, approximating:

h_m = argmin_h sum_i L(y_i, F_{m-1}(x_i) + h(x_i))

For regression with squared loss, h_m fits the residuals y_i - F_{m-1}(x_i). For classification with log-loss, h_m fits the gradient residuals y_i - p_i, where p_i is the current predicted probability.
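The stage-wise update can be sketched for squared loss with depth-1 trees (stumps) on a single feature. This is a toy illustration under those assumptions, not the estimator used by the library; fit_stump and predict_stump are hypothetical helpers.

```python
import numpy as np

def fit_stump(x, r):
    """Best single-threshold split of 1-D x for squared error on residuals r."""
    best = (np.inf, None, 0.0, 0.0)
    for thr in np.unique(x)[:-1]:          # exclude the max so both sides are non-empty
        left = x <= thr
        lv, rv = r[left].mean(), r[~left].mean()
        sse = ((r[left] - lv) ** 2).sum() + ((r[~left] - rv) ** 2).sum()
        if sse < best[0]:
            best = (sse, thr, lv, rv)
    return best[1:]

def predict_stump(stump, x):
    thr, lv, rv = stump
    return np.where(x <= thr, lv, rv)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

nu, n_stages = 0.1, 100
F = np.full_like(y, y.mean())              # F_0: constant prediction
mse_start = np.mean((y - F) ** 2)
for _ in range(n_stages):
    residuals = y - F                      # negative gradient of 0.5 * (y - F)^2
    stump = fit_stump(x, residuals)
    F = F + nu * predict_stump(stump, x)   # F_m = F_{m-1} + nu * h_m
mse_end = np.mean((y - F) ** 2)
print(mse_start, mse_end)
```

A smaller nu needs more stages to reach the same training error, which is the learning_rate / n_estimators trade-off.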

Parameters:

X_train (DataFrame | ndarray) – Training feature matrix.
y_train (Series | ndarray) – Training target.
X_test (DataFrame | ndarray) – Test feature matrix.
y_test (Series | ndarray | None, default: None) – Test target (if provided, test metrics are computed).
task (Literal['classification', 'regression'], default: 'regression') – Prediction task.
n_estimators (int, default: 200) – Number of boosting stages.
max_depth (int, default: 4) – Maximum depth of individual trees.
learning_rate (float, default: 0.1) – Shrinkage applied to each tree’s contribution.
subsample (float, default: 0.8) – Fraction of training samples used per tree (stochastic boosting).
cv (int, default: 5) – Cross-validation folds for reporting training CV score.
feature_names (Optional[Sequence[str]], default: None) – Feature names for importance ranking.

Returns:

Return type:

Example

>>> import numpy as np
>>> X = np.random.randn(300, 5)
>>> y = X[:, 0] * 2 + X[:, 1] + np.random.randn(300) * 0.5
>>> result = gradient_boost_forecast(X[:250], y[:250], X[250:], y[250:])
>>> result["test_score"] > 0
True

Overfits if n_estimators is too large; use early stopping or CV.
Sensitive to learning_rate / n_estimators trade-off.
For >100k samples, consider XGBoost/LightGBM for speed.

References

Friedman (2001), “Greedy Function Approximation: A Gradient Boosting Machine”

gaussian_process_regression(X_train, y_train, X_test, kernel='rbf', alpha=0.01, n_restarts=5)

Gaussian Process regression with uncertainty quantification.

When to use:

Mathematical background:

A GP assumes f(x) ~ GP(m(x), k(x, x')), where:

m(x) is the mean function (usually 0)
k(x, x') is the kernel (covariance function)

Posterior predictive at test point x*:

mu* = k(x*, X) [K + sigma^2 I]^{-1} y
sigma*^2 = k(x*, x*) - k(x*, X) [K + sigma^2 I]^{-1} k(X, x*)

where K_{ij} = k(x_i, x_j) and sigma^2 is the noise variance.
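The posterior formulas translate almost line for line into NumPy. The sketch assumes a zero mean function and an RBF kernel with a fixed length scale of 1 (no hyperparameter optimisation, unlike the n_restarts behaviour described below); rbf and gp_posterior are illustrative names.

```python
import numpy as np

def rbf(A, B, length_scale=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 * length_scale^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(X, y, X_star, noise_var=0.01):
    K = rbf(X, X) + noise_var * np.eye(len(X))    # K + sigma^2 I
    K_s = rbf(X_star, X)                          # k(x*, X)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # [K + sigma^2 I]^{-1} y
    mu = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    var = np.diag(rbf(X_star, X_star)) - np.sum(v**2, axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

rng = np.random.default_rng(0)
X_train = np.linspace(0, 10, 50).reshape(-1, 1)
y_train = np.sin(X_train).ravel() + rng.normal(scale=0.1, size=50)
X_test = np.linspace(0, 10, 20).reshape(-1, 1)

mu, std = gp_posterior(X_train, y_train, X_test)
print(mu.shape, std.shape)
```

The Cholesky factorisation avoids forming the matrix inverse explicitly and is the source of the cubic training cost.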

Parameters:

X_train (DataFrame | ndarray) – Training features.
y_train (Series | ndarray) – Training target.
X_test (DataFrame | ndarray) – Test features.
kernel (str, default: 'rbf') – Kernel type: 'rbf', 'matern', or 'rational_quadratic'.
alpha (float, default: 0.01) – Noise level (regularisation diagonal added to the kernel matrix).
n_restarts (int, default: 5) – Number of optimiser restarts for kernel hyperparameters.

Returns:

Return type:

Example

>>> import numpy as np
>>> X_train = np.linspace(0, 10, 50).reshape(-1, 1)
>>> y_train = np.sin(X_train).ravel() + np.random.randn(50) * 0.1
>>> X_test = np.linspace(0, 10, 20).reshape(-1, 1)
>>> result = gaussian_process_regression(X_train, y_train, X_test)
>>> result["predictions"].shape
(20,)
>>> result["std"].shape
(20,)

Complexity is O(n^3) for training; each prediction costs O(n) for the mean and O(n^2) for the variance.
For large datasets, use sparse GP approximations (not included here).
Kernel choice strongly affects results; try multiple kernels.

References

Rasmussen & Williams (2006), “Gaussian Processes for Machine Learning”

isolation_forest_anomaly(returns, contamination=0.05, n_estimators=200, random_state=42)

Detect anomalous days in return data using Isolation Forest.

When to use:

Mathematical background:

For a sample x, the anomaly score is based on the average path length E[h(x)] across the isolation trees:

s(x, n) = 2^{-E[h(x)] / c(n)}

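where c(n) is the average path length of an unsuccessful search in a binary search tree of n samples. Scores close to 1 indicate anomalies; scores well below 0.5 indicate normal points; if every score is near 0.5, the sample has no distinct anomalies.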
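The normalising constant and the score can be computed directly. The sketch uses the approximation from Liu, Ting & Zhou (2008), c(n) = 2 H(n-1) - 2 (n-1)/n with H(i) ≈ ln(i) + 0.5772; how the library maps its own scores onto this scale is not specified here.

```python
import numpy as np

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = np.log(n - 1) + EULER_GAMMA        # H(n-1) approximation
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """s(x, n) = 2^(-E[h(x)] / c(n))."""
    return 2.0 ** (-mean_path_length / c(n))

n = 256
print(c(n))                        # roughly 10.2
print(anomaly_score(1.0, n))       # isolated after one split: score near 1
print(anomaly_score(c(n), n))      # average path length: score 0.5
```

Points that are isolated in few splits have short paths and therefore high scores.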

Parameters:

returns (Series | DataFrame | ndarray) – Return data. If 1-D, treated as a single feature; if 2-D, each column is a feature (e.g., return, volume, spread).
contamination (float, default: 0.05) – Expected fraction of anomalies in the dataset (0 < c < 0.5).
n_estimators (int, default: 200) – Number of isolation trees.
random_state (int, default: 42) – Random seed.

Returns:

Return type:

Example

>>> import numpy as np
>>> rets = np.random.randn(500) * 0.01
>>> rets[100] = 0.15  # inject anomaly
>>> result = isolation_forest_anomaly(rets, contamination=0.02)
>>> result["anomaly_mask"][100]
True

The contamination parameter is a prior; misspecification leads to over- or under-detection.
Isolation Forest assumes anomalies are both rare and different; clustered anomalies may be missed.
For time-series anomaly detection, consider adding lagged features.

References

Liu, Ting & Zhou (2008), “Isolation Forest”

pca_factor_model(returns, n_components=None, explained_variance_threshold=0.9)

Build a PCA-based latent factor model from asset returns.

When to use:

Use PCA factor models for dimensionality reduction in portfolio construction, risk decomposition, statistical arbitrage (pairs trading on residuals), and understanding co-movement structure.

Mathematical background:

Given return matrix R (T x N), PCA decomposes the covariance:

Sigma = V Lambda V^T

where Lambda = diag(lambda_1, …, lambda_N) are eigenvalues and V are eigenvectors (loadings). Factor returns are:

F = R @ V[:, :k] (T x k)

The fraction of variance explained by the first k components is:

sum(lambda_1..k) / sum(lambda_1..N)
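The decomposition and the threshold rule can be sketched with an eigendecomposition of the sample covariance. Two assumptions are made here: returns are demeaned before projection, and k is the smallest number of components whose cumulative ratio reaches the threshold; whether pca_factor_model does exactly this is not confirmed.

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(scale=0.01, size=(252, 20))    # T x N synthetic returns
R_c = R - R.mean(axis=0)                      # demean each asset (assumption)

Sigma = np.cov(R_c, rowvar=False)             # N x N covariance
eigvals, eigvecs = np.linalg.eigh(Sigma)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
lam, V = eigvals[order], eigvecs[:, order]    # descending eigenvalues, loadings

explained = np.cumsum(lam) / lam.sum()        # cumulative explained variance
k = int(np.searchsorted(explained, 0.9) + 1)  # smallest k reaching 90%

F = R_c @ V[:, :k]                            # T x k factor returns
print(k, F.shape)
```

The sample variance of each factor return column equals the corresponding eigenvalue, which is a quick consistency check on the loadings.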

Parameters:

returns (DataFrame) – T x N return matrix (rows = observations, columns = assets).
n_components (int | None, default: None) – Number of principal components. If None, selects enough to explain explained_variance_threshold of total variance.
explained_variance_threshold (float, default: 0.9) – Minimum cumulative explained variance ratio when n_components is None.

Returns:

Return type:

Example

>>> import numpy as np, pandas as pd
>>> returns = pd.DataFrame(np.random.randn(252, 20) * 0.01)
>>> result = pca_factor_model(returns, n_components=3)
>>> result["factor_returns"].shape
(252, 3)

PCA is linear; for non-linear dimensionality reduction, use the VAE in wraquant.ml.deep.autoencoder_features.
Eigenvalues from small samples are noisy; use Random Matrix Theory denoising (wraquant.ml.preprocessing.denoised_correlation) first.
Components are not guaranteed to have economic meaning.

References

Jolliffe (2002), “Principal Component Analysis”
Avellaneda & Lee (2010), “Statistical arbitrage in the US equities market”

Clustering

Correlation-based clustering, regime clustering, optimal cluster selection.

Financial clustering methods.

Provides correlation-based asset clustering, market-regime detection, and optimal-cluster-count selection.

correlation_clustering(returns, n_clusters=None, method='hierarchical')

Cluster assets by their return correlations.

The correlation-based distance is d(i,j) = sqrt(0.5 * (1 - rho_ij)), which maps perfect correlation to distance 0 and perfect negative correlation to distance 1.
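The distance mapping can be checked on a few synthetic series. This sketch only builds the distance matrix; the clustering step itself is left out, and correlation_distance is an illustrative helper name.

```python
import numpy as np

def correlation_distance(returns):
    """d(i, j) = sqrt(0.5 * (1 - rho_ij)) for a T x N return matrix."""
    rho = np.corrcoef(returns, rowvar=False)
    # Clip guards against round-off pushing 1 - rho slightly outside [0, 2].
    return np.sqrt(0.5 * np.clip(1.0 - rho, 0.0, 2.0))

rng = np.random.default_rng(42)
f = rng.normal(size=252)
returns = np.column_stack([
    f,                                   # asset 0
    f + 0.1 * rng.normal(size=252),      # asset 1: almost a copy of asset 0
    -f,                                  # asset 2: perfectly anti-correlated
    rng.normal(size=252),                # asset 3: unrelated
])
D = correlation_distance(returns)
print(D.round(3))
```

Uncorrelated pairs land near sqrt(0.5) ≈ 0.707, between the two extremes. A condensed form of D (for example via scipy.spatial.distance.squareform) is what a hierarchical linkage routine would typically consume.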

Parameters:

returns (DataFrame) – T x N return matrix (rows = observations, columns = assets).
n_clusters (int | None, default: None) – Number of clusters. If None, the number is chosen automatically: by silhouette score for hierarchical clustering, or fixed at 3 for spectral clustering.
method (Literal['hierarchical', 'spectral'], default: 'hierarchical') – Clustering algorithm. Hierarchical uses Ward linkage and produces a dendrogram-compatible linkage matrix. Spectral uses the correlation matrix as affinity and finds clusters via eigenvalue decomposition.

Returns:

labels (np.ndarray) – Cluster assignment for each asset (0-indexed, length N). Assets with the same label belong to the same cluster.
n_clusters (int) – Number of clusters found or specified.
linkage_matrix (np.ndarray or None) – Linkage matrix (hierarchical only). Pass to scipy.cluster.hierarchy.dendrogram for visualization.

Return type:

Example

>>> import pandas as pd, numpy as np
>>> np.random.seed(42)
>>> # 3 groups of correlated assets
>>> factor = np.random.randn(252, 3)
>>> returns = pd.DataFrame(
...     np.column_stack([factor[:, i % 3] + np.random.randn(252) * 0.5
...                      for i in range(9)]),
...     columns=[f'asset_{i}' for i in range(9)]
... )
>>> result = correlation_clustering(returns, n_clusters=3)
>>> result['n_clusters']
3
>>> len(result['labels']) == 9
True

See also

regime_clustering: Cluster time periods into regimes.
optimal_clusters: Determine optimal cluster count.
wraquant.ml.preprocessing.detoned_correlation: Remove market mode before clustering.

regime_clustering(features, n_regimes=2, method='gmm')

Cluster time periods into market regimes.

Parameters:

features (DataFrame | ndarray) – Feature matrix where each row is a time observation. Common inputs include rolling volatility, returns, spreads, and VIX.
n_regimes (int, default: 2) – Number of regimes to identify (default 2, typical for risk-on/risk-off).
method (Literal['gmm', 'kmeans'], default: 'gmm') – Clustering algorithm. 'gmm' (Gaussian Mixture Model) provides probabilistic assignments; 'kmeans' provides hard assignments and is faster.

Returns:

labels (np.ndarray) – Regime assignment for each time period (0-indexed).
n_regimes (int) – Number of regimes.
model (object) – Fitted GaussianMixture or KMeans model. For GMM, call model.predict_proba(X) to get regime probabilities.

Return type:

Example

>>> import numpy as np, pandas as pd
>>> np.random.seed(42)
>>> vol = np.concatenate([np.random.randn(100) * 0.5 + 0.1,
...                       np.random.randn(100) * 0.5 + 0.3])
>>> features = pd.DataFrame({'vol': vol, 'vol_sq': vol ** 2})
>>> result = regime_clustering(features, n_regimes=2)
>>> result['n_regimes']
2
>>> len(result['labels']) == 200
True

See also

correlation_clustering: Cluster assets (cross-sectional).
optimal_clusters: Find the optimal number of clusters/regimes.
wraquant.regimes: HMM and Markov-switching regime detection.

optimal_clusters(data, max_k=10, method='silhouette')

Determine the optimal number of clusters.

Use this function before calling correlation_clustering or regime_clustering to select the number of clusters data-adaptively rather than guessing.

Parameters:

data (DataFrame | ndarray) – Feature matrix.
max_k (int, default: 10) – Maximum number of clusters to evaluate (default 10).
method (Literal['silhouette', 'bic'], default: 'silhouette') – Selection criterion. 'silhouette' uses the silhouette score with KMeans (higher is better, range [-1, 1]); 'bic' uses the Bayesian Information Criterion with a Gaussian Mixture Model (lower is better). Silhouette is faster; BIC is more principled for probabilistic models.

Returns:

Optimal number of clusters (between 2 and max_k). Use this value as n_clusters in correlation_clustering or n_regimes in regime_clustering.

Return type:

int

Example

>>> import numpy as np
>>> np.random.seed(42)
>>> # Generate data with 3 natural clusters
>>> data = np.vstack([np.random.randn(50, 2) + [0, 0],
...                   np.random.randn(50, 2) + [5, 5],
...                   np.random.randn(50, 2) + [10, 0]])
>>> k = optimal_clusters(data, max_k=6)
>>> 2 <= k <= 6
True

See also

correlation_clustering: Cluster assets by correlation.
regime_clustering: Cluster time periods into regimes.

Evaluation

Classification metrics, financial metrics, learning curves, and backtest evaluation of predictions.

Model evaluation utilities for financial machine learning.

Provides both standard classification metrics and finance-specific performance measures such as Sharpe ratio from predictions and backtesting with transaction costs.

classification_metrics(y_true, y_pred, y_prob=None)

Compute standard classification metrics.

Parameters:

y_true (Series | ndarray) – True class labels.
y_pred (Series | ndarray) – Predicted class labels.
y_prob (Series | ndarray | None, default: None) – Predicted probabilities (for the positive class in binary classification). When provided, log-loss and AUC are included.

Returns:

accuracy (float) – Fraction of correct predictions.
precision (float) – Macro-averaged precision (how many predicted positives are actually positive).
recall (float) – Macro-averaged recall (how many actual positives are captured).
f1 (float) – Macro-averaged F1 score (harmonic mean of precision and recall).
log_loss (float, only if y_prob given) – Cross-entropy loss. Lower is better; measures calibration quality.
auc (float, only if y_prob given, binary only) – Area under the ROC curve. 0.5 = random, 1.0 = perfect.

Return type:

Example

>>> import numpy as np
>>> y_true = np.array([1, 0, 1, 1, 0, 1])
>>> y_pred = np.array([1, 0, 0, 1, 0, 1])
>>> metrics = classification_metrics(y_true, y_pred)
>>> metrics['accuracy']
0.8333333333333334
>>> metrics['f1'] > 0.5
True
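Macro averaging computes each metric per class and then takes the unweighted mean across classes. The sketch below reproduces that calculation by hand for the same labels as the example above; macro_scores is a hypothetical helper, and the zero-division convention (score 0 when a class is never predicted) is an assumption.

```python
import numpy as np

def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    precisions, recalls, f1s = [], [], []
    for cls in np.unique(y_true):
        tp = np.sum((y_pred == cls) & (y_true == cls))
        fp = np.sum((y_pred == cls) & (y_true != cls))
        fn = np.sum((y_pred != cls) & (y_true == cls))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return {
        "accuracy": float(np.mean(y_true == y_pred)),
        "precision": float(np.mean(precisions)),
        "recall": float(np.mean(recalls)),
        "f1": float(np.mean(f1s)),
    }

scores = macro_scores([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 0, 1])
print(scores)
```

Because every class counts equally, a rare class (for example a crash regime) affects the macro scores as much as a common one.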

See also

financial_metrics: PnL-based evaluation of directional predictions.
backtest_predictions: Full backtest with transaction costs.

financial_metrics(y_true, y_pred, returns)

Compute finance-specific evaluation metrics from predictions.

The predicted labels are interpreted as position signals: 1 for long, -1 for short, 0 for flat.

Parameters:

y_true (Series | ndarray) – True directional labels.
y_pred (Series | ndarray) – Predicted directional labels (used as signals).
returns (Series | ndarray) – Actual period returns corresponding to each observation.

Returns:

strategy_return (float) – Cumulative strategy return (sum of signal * return).
sharpe (float) – Annualised Sharpe ratio (252 trading days). Values above 1.0 are generally considered good; above 2.0 is excellent.
hit_rate (float) – Fraction of periods where predicted sign matches actual sign. A hit rate above 0.5 is necessary but not sufficient for profitability.
profit_factor (float) – Gross profit / gross loss. Values above 1.0 indicate a profitable strategy; above 2.0 is strong.

Return type:

Example

>>> import numpy as np
>>> y_true = np.array([1, -1, 1, 1, -1])
>>> y_pred = np.array([1, -1, -1, 1, 1])
>>> returns = np.array([0.02, -0.01, 0.015, 0.005, -0.02])
>>> metrics = financial_metrics(y_true, y_pred, returns)
>>> metrics['hit_rate']
0.6
>>> metrics['sharpe'] != 0
True
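The four metrics follow from the per-period PnL signal * return. The sketch below is one plausible reading of the definitions, with assumptions labelled: the Sharpe ratio uses the sample standard deviation (ddof=1) and no risk-free rate, and the hit rate compares the sign of the signal with the sign of the realised return. financial_metrics_sketch is a hypothetical name.

```python
import numpy as np

def financial_metrics_sketch(y_pred, returns, periods_per_year=252):
    signal = np.asarray(y_pred, dtype=float)
    returns = np.asarray(returns, dtype=float)
    pnl = signal * returns                          # per-period strategy return
    gains, losses = pnl[pnl > 0].sum(), -pnl[pnl < 0].sum()
    std = pnl.std(ddof=1)                           # assumption: sample std
    return {
        "strategy_return": pnl.sum(),
        "sharpe": np.sqrt(periods_per_year) * pnl.mean() / std if std > 0 else 0.0,
        "hit_rate": np.mean(np.sign(signal) == np.sign(returns)),
        "profit_factor": gains / losses if losses > 0 else np.inf,
    }

preds = np.array([1, -1, -1, 1, 1])
rets = np.array([0.02, -0.01, 0.015, 0.005, -0.01])
m = financial_metrics_sketch(preds, rets)
print(m)
```

Here three of five signals have the right sign (hit rate 0.6), gross profit is 0.035 and gross loss 0.025, so the profit factor is 1.4.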

See also

classification_metrics: Standard ML classification metrics.
backtest_predictions: Full backtest with transaction costs.

learning_curve(model, X, y, train_sizes=None, cv=5)

Generate a learning curve for a model.

Parameters:

model (Any) – A scikit-learn-compatible estimator.
X (DataFrame | ndarray) – Feature matrix.
y (Series | ndarray) – Target vector.
train_sizes (Union[Sequence[int | float], ndarray, None], default: None) – Training set sizes (absolute counts or fractions). Defaults to np.linspace(0.1, 1.0, 10).
cv (int, default: 5) – Number of cross-validation folds.

Returns:

train_sizes (np.ndarray) – Absolute number of training samples at each point.
train_scores (np.ndarray, shape (len(sizes), cv)) – Training scores at each size/fold. Plot the mean across folds to visualize training performance.
test_scores (np.ndarray, shape (len(sizes), cv)) – Test scores at each size/fold. The gap between train and test mean scores indicates overfitting.

Return type:

dict[str, ndarray]

Example

>>> from sklearn.linear_model import Ridge
>>> import numpy as np
>>> X = np.random.randn(300, 5)
>>> y = X @ [1, 0.5, 0, 0, 0] + np.random.randn(300) * 0.1
>>> result = learning_curve(Ridge(), X, y, cv=3)
>>> result['train_sizes'].shape[0]  # 10 points by default
10

See also

classification_metrics: Evaluate classification quality.
financial_metrics: Evaluate economic value of predictions.

backtest_predictions(predictions, returns, cost_bps=10)

Backtest a prediction signal against actual returns.

Parameters:

predictions (Series | ndarray) – Predicted position signals (e.g. 1, 0, -1). The signal is applied as a position: signal * return.
returns (Series | ndarray) – Actual period returns corresponding to each prediction.
cost_bps (float, default: 10) – Transaction cost in basis points applied on each position change (default 10 bps). For equities, 5-10 bps is typical; for futures, 1-3 bps.

Returns:

gross_returns (np.ndarray) – Per-period strategy returns before costs.
net_returns (np.ndarray) – Per-period strategy returns after costs.
cumulative_return (float) – Total cumulative net return. Positive = profitable.
sharpe (float) – Annualised Sharpe ratio of net returns. Above 1.0 is generally good; above 2.0 is excellent.
max_drawdown (float) – Maximum peak-to-trough decline in cumulative PnL. Always negative or zero.
turnover (float) – Mean absolute position change per period. Higher turnover means higher transaction costs.

Return type:

Example

>>> import numpy as np
>>> preds = np.array([1, 1, -1, 1, -1, 0, 1])
>>> rets = np.array([0.01, -0.005, -0.02, 0.015, 0.01, 0.005, 0.008])
>>> result = backtest_predictions(preds, rets, cost_bps=10)
>>> result['cumulative_return'] != 0
True
>>> result['max_drawdown'] <= 0
True
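The cost mechanics can be traced by hand on the same inputs. The sketch rests on several assumptions that may differ from backtest_predictions: PnL is accumulated additively (not compounded), the first period pays for entering from a flat position, and costs are charged in proportion to the absolute position change. backtest_sketch is a hypothetical name.

```python
import numpy as np

def backtest_sketch(predictions, returns, cost_bps=10.0):
    pos = np.asarray(predictions, dtype=float)
    rets = np.asarray(returns, dtype=float)
    gross = pos * rets
    # Position change per period; the first period trades from flat into pos[0].
    trades = np.abs(np.diff(pos, prepend=0.0))
    net = gross - trades * cost_bps / 1e4
    equity = np.cumsum(net)                          # additive cumulative PnL
    # Running peak, including the starting equity of zero.
    peak = np.maximum.accumulate(np.concatenate([[0.0], equity]))[1:]
    return {
        "gross_returns": gross,
        "net_returns": net,
        "cumulative_return": equity[-1],
        "max_drawdown": (equity - peak).min(),
        "turnover": trades.mean(),
    }

preds = np.array([1, 1, -1, 1, -1, 0, 1])
rets = np.array([0.01, -0.005, -0.02, 0.015, 0.01, 0.005, 0.008])
bt = backtest_sketch(preds, rets, cost_bps=10)
print(bt["cumulative_return"], bt["max_drawdown"], bt["turnover"])
```

Total position change is 9 units, so costs are 9 * 0.001 = 0.009 against a gross return of 0.038. Flipping from long to short counts as a change of 2, which is why frequent reversals are expensive.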

See also

financial_metrics: Quick financial metrics without transaction costs.
wraquant.ml.pipeline.walk_forward_backtest: Walk-forward backtest.

Online Learning

Incrementally updating models for streaming data.

Online (streaming) machine learning for quantitative finance.

Provides recursive and weighted regression algorithms that update incrementally with each new observation, enabling real-time tracking of time-varying relationships in financial data.

These algorithms require only numpy and pandas – no optional dependencies.

online_linear_regression(X, y, forgetting_factor=1.0, initial_covariance=100.0)

Recursive Least Squares (RLS) online linear regression.

When to use:

Mathematical background:

Recursive Least Squares maintains:

K_t = P_{t-1} x_t / (lambda + x_t^T P_{t-1} x_t)
w_t = w_{t-1} + K_t (y_t - x_t^T w_{t-1})
P_t = (1/lambda) * (P_{t-1} - K_t x_t^T P_{t-1})

With lambda = 1 and infinite data, RLS converges to OLS. With lambda < 1, the effective window length is approximately 1 / (1 - lambda) observations.
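The three recursions map onto a short loop. This is a minimal sketch assuming w_0 = 0 and P_0 = c * I with no intercept term; rls is an illustrative name and returns only the coefficient path, not the full result dictionary of online_linear_regression.

```python
import numpy as np

def rls(X, y, lam=1.0, c=100.0):
    """Recursive least squares; returns the (T, p) path of coefficients."""
    T, p = X.shape
    w = np.zeros(p)
    P = c * np.eye(p)
    coefs = np.empty((T, p))
    for t in range(T):
        x = X[t]
        Px = P @ x
        K = Px / (lam + x @ Px)              # gain vector K_t
        w = w + K * (y[t] - x @ w)           # correct by the prediction error
        P = (P - np.outer(K, Px)) / lam      # K x^T P equals outer(K, P x), P symmetric
        coefs[t] = w
    return coefs

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
y = X @ np.array([1.0, 0.5]) + rng.normal(scale=0.1, size=500)

coefs = rls(X, y, lam=1.0, c=1e6)
ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(coefs[-1], ols)
```

With lam = 1 and a large initial covariance, the final row matches the batch OLS solution closely; lowering lam discounts old observations geometrically.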

Parameters:

X (DataFrame | ndarray) – Feature matrix of shape (T, p) where T is the number of observations and p is the number of features.
y (Series | ndarray) – Target vector of length T.
forgetting_factor (float, default: 1.0) – Forgetting factor lambda in (0, 1]. Values close to 1 give long memory; values like 0.99 give an effective window of ~100 observations. Use 0.95-0.99 for fast-adapting signals.
initial_covariance (float, default: 100.0) – Scalar multiplier for the initial covariance matrix P_0 = c * I. Larger values make the filter more responsive early on.

Returns:

Return type:

Example

>>> import numpy as np
>>> np.random.seed(42)
>>> T = 500
>>> X = np.random.randn(T, 2)
>>> # True coefficients shift halfway through
>>> beta_true = np.where(np.arange(T)[:, None] < 250,
...     [1.0, 0.5], [0.5, 1.0])
>>> y = np.sum(X * beta_true, axis=1) + np.random.randn(T) * 0.1
>>> result = online_linear_regression(X, y, forgetting_factor=0.98)
>>> result["coefficients"].shape
(500, 2)
>>> # After convergence, coefficients should track the true values
>>> np.abs(result["final_coefficients"][0] - 0.5) < 0.3
True

The forgetting factor is critical: too low causes noisy estimates, too high causes slow adaptation to regime changes.
RLS assumes the noise variance is constant; for heteroskedastic data, consider the exponential weighted variant or Kalman filters.
Initial predictions (before the filter converges) should be discarded in any evaluation.

References

Haykin (2002), “Adaptive Filter Theory”, Ch. 13 (RLS)
Montana et al. (2009), “Flexible least squares for temporal data mining and statistical arbitrage”

exponential_weighted_regression(X, y, halflife=63.0, min_periods=30)

Exponentially weighted linear regression favouring recent data.

When to use:

Use exponential weighted regression when:

You want smoother coefficient paths than RLS.
The halflife of predictive relationships is approximately known (e.g., 63 trading days ~ 3 months).
You need an interpretable “recency bias” in your factor model.

Mathematical background:

At time t, the weight for observation s (where s <= t) is:

w_s = exp(-ln(2) * (t - s) / halflife)

The weighted regression solves:

beta_t = (X_t^T W_t X_t)^{-1} X_t^T W_t y_t

where W_t = diag(w_0, w_1, …, w_t). This is equivalent to EWMA smoothing of the sufficient statistics X^T X and X^T y.
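The estimate at a single time t can be written as one weighted solve. The sketch recomputes the weights from scratch rather than updating sufficient statistics incrementally, and omits the min_periods handling; ew_weights and ew_beta are illustrative names.

```python
import numpy as np

def ew_weights(t, halflife):
    """w_s = exp(-ln(2) * (t - s) / halflife) for s = 0..t."""
    s = np.arange(t + 1)
    return np.exp(-np.log(2) * (t - s) / halflife)

def ew_beta(X, y, t, halflife):
    """Weighted least squares using observations 0..t."""
    w = ew_weights(t, halflife)
    Xt, yt = X[: t + 1], y[: t + 1]
    Xw = Xt * w[:, None]                       # rows of X scaled by their weights
    return np.linalg.solve(Xw.T @ Xt, Xw.T @ yt)   # (X^T W X)^{-1} X^T W y

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = X @ np.array([1.0, 0.5]) + rng.normal(scale=0.1, size=300)

beta = ew_beta(X, y, t=299, halflife=60)
print(ew_weights(100, 50)[50], beta)
```

An observation exactly one halflife in the past carries weight 0.5, two halflives 0.25, and so on.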

Parameters:

X (DataFrame | ndarray) – Feature matrix of shape (T, p).
y (Series | ndarray) – Target vector of length T.
halflife (float, default: 63.0) – Halflife in observations. After halflife observations, the weight of a past data point has decayed to 50%. Common financial values: 21 (1 month), 63 (1 quarter), 252 (1 year).
min_periods (int, default: 30) – Minimum number of observations before producing a coefficient estimate. Earlier entries are filled with NaN.

Returns:

Return type:

Example

>>> import numpy as np
>>> np.random.seed(0)
>>> T = 300
>>> X = np.random.randn(T, 2)
>>> beta_true = np.column_stack([
...     np.linspace(1, 0, T),      # drifting coefficient
...     np.full(T, 0.5),            # constant coefficient
... ])
>>> y = np.sum(X * beta_true, axis=1) + np.random.randn(T) * 0.1
>>> result = exponential_weighted_regression(X, y, halflife=60)
>>> result["coefficients"].shape
(300, 2)

Halflife selection is subjective; cross-validate if possible.
For very short halflives (<10), the effective sample size is small and estimates become noisy.
Assumes homoskedastic errors; for heteroskedastic data, consider EWMA-weighted robust regression.
Numerically less stable than RLS for ill-conditioned problems.

References

Ross et al. (2012), “Exponentially weighted moving average charts for detecting concept drift”
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 17

Pipeline

FinancialPipeline, walk-forward backtest, and SHAP integration.

Financial ML pipeline utilities.

Provides chronology-aware pipeline wrappers, walk-forward backtesting with PnL tracking, and SHAP-based feature importance – all designed to prevent data leakage that is rampant in naive ML-for-finance workflows.

class FinancialPipeline

Bases: object

Sklearn Pipeline wrapper that enforces chronological splitting.

Parameters:

steps (list[tuple[str, Any]]) – List of (name, transform) tuples defining the pipeline, identical to the steps parameter of sklearn.pipeline.Pipeline.
n_splits (int, default: 5) – Number of folds for purged K-fold cross-validation.
embargo_pct (float, default: 0.01) – Fraction of total samples to embargo after each test fold, preventing label leakage from overlapping targets.
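The interaction of n_splits and embargo_pct can be sketched as an index generator. This is a simplified illustration: it removes the test block and an embargo window after it from the training set, and assumes labels do not overlap in time (full purging would also drop training samples whose label horizon reaches into the test fold). purged_kfold_indices is a hypothetical helper, not the splitter used by FinancialPipeline.

```python
import numpy as np

def purged_kfold_indices(n_samples, n_splits=5, embargo_pct=0.01):
    """Yield (train_idx, test_idx) for contiguous folds with an embargo."""
    embargo = int(n_samples * embargo_pct)
    folds = np.array_split(np.arange(n_samples), n_splits)
    for test_idx in folds:
        start, stop = test_idx[0], test_idx[-1] + 1
        train_mask = np.ones(n_samples, dtype=bool)
        # Drop the test block plus the embargoed samples right after it.
        train_mask[start:min(stop + embargo, n_samples)] = False
        yield np.flatnonzero(train_mask), test_idx

splits = list(purged_kfold_indices(500, n_splits=5, embargo_pct=0.01))
for train_idx, test_idx in splits:
    print(test_idx[0], test_idx[-1], len(train_idx))
```

With 500 samples and embargo_pct=0.01, five observations after each test fold are excluded from training, so the first fold trains on indices 105 onward.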

Example

>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.linear_model import Ridge
>>> import numpy as np
>>> X = np.random.randn(500, 5)
>>> y = X @ np.array([1, 0.5, 0, 0, 0]) + np.random.randn(500) * 0.1
>>> pipe = FinancialPipeline(
...     steps=[('scaler', StandardScaler()), ('ridge', Ridge())],
...     n_splits=5,
... )
>>> result = pipe.fit_evaluate(X, y)
>>> len(result['fold_scores']) == 5
True

References

Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 7

__init__(steps, n_splits=5, embargo_pct=0.01)

Parameters:

steps (list[tuple[str, Any]])
n_splits (int, default: 5)
embargo_pct (float, default: 0.01)

Return type:

None

fit(X, y)

Fit the pipeline on the full dataset.

Parameters:

X (DataFrame | ndarray) – Feature matrix.
y (Series | ndarray) – Target vector.

Returns:

Self, for method chaining.

Return type:

FinancialPipeline

predict(X)

Generate predictions using the fitted pipeline.

Parameters:

X (DataFrame | ndarray) – Feature matrix.

Returns:

Predictions.

Return type:

ndarray

fit_evaluate(X, y)

Fit with purged K-fold cross-validation and return results.

Uses purged K-fold splitting to evaluate the pipeline without data leakage. After cross-validation, fits the pipeline on the full dataset.

Parameters:

X (DataFrame | ndarray) – Feature matrix.
y (Series | ndarray) – Target vector.

Returns:

fold_scores (list) – Per-fold R-squared scores.
mean_score (float) – Mean of fold scores.
std_score (float) – Standard deviation of fold scores.
pipeline – The fitted sklearn Pipeline.

Return type:

walk_forward_backtest(model, X, y, train_size=252, test_size=21, step_size=21, expanding=True)

Full walk-forward ML backtest with PnL tracking.

Why walk-forward instead of standard cross-validation?

Standard K-fold CV ignores time ordering: most training folds contain observations that come after the test fold, allowing the model to “peek” at future data during training. In finance, this biases performance estimates upward, often severely. Walk-forward enforces strict temporal ordering: the model only ever trains on data that would have been available at the time of prediction.

Parameters:

model (Any) – A scikit-learn-compatible estimator with fit and predict.
X (DataFrame | ndarray) – Feature matrix.
y (Series | ndarray) – Target vector (typically forward returns for PnL calculation).
train_size (int, default: 252) – Number of training observations in the initial window.
test_size (int, default: 21) – Number of test observations per fold.
step_size (int, default: 21) – Number of observations to advance between folds.
expanding (bool, default: True) – If True, the training window expands over time. If False, a rolling window of fixed train_size is used.

Returns:

Return type:

Example

>>> from sklearn.linear_model import Ridge
>>> import numpy as np
>>> np.random.seed(42)
>>> X = np.random.randn(600, 5)
>>> y = X @ np.array([0.5, 0.3, 0, 0, 0]) + np.random.randn(600) * 0.5
>>> result = walk_forward_backtest(Ridge(), X, y, train_size=200, test_size=20)
>>> len(result['predictions']) > 0
True
>>> 'sharpe' in result
True
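How train_size, test_size, step_size and expanding shape the folds can be seen from the index ranges alone. The generator below is a sketch under one assumption, that a trailing partial test window is dropped; walk_forward_splits is a hypothetical helper and does no model fitting.

```python
def walk_forward_splits(n_samples, train_size=252, test_size=21,
                        step_size=21, expanding=True):
    """Yield (train_range, test_range) pairs in chronological order."""
    start = train_size
    while start + test_size <= n_samples:
        # Expanding: always train from the first observation.
        # Rolling: keep only the most recent train_size observations.
        train_start = 0 if expanding else start - train_size
        yield range(train_start, start), range(start, start + test_size)
        start += step_size

expanding_folds = list(walk_forward_splits(300, expanding=True))
rolling_folds = list(walk_forward_splits(300, expanding=False))
for train, test in expanding_folds:
    print(train.start, train.stop, test.start, test.stop)
```

Every training range ends exactly where its test range begins, so no fold ever sees observations from its own future.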

References

Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 12
Bailey et al. (2014), “The Deflated Sharpe Ratio”

feature_importance_shap(model, X, feature_names=None, max_samples=500)

Compute SHAP-based feature importance for any sklearn model.

Parameters:

model (Any) – A fitted scikit-learn-compatible estimator.
X (DataFrame | ndarray) – Feature matrix to explain (typically the test set).
feature_names (Optional[Sequence[str]], default: None) – Feature names. If None and X is a DataFrame, column names are used.
max_samples (int, default: 500) – Maximum number of samples to use for computing SHAP values. Subsampled if X has more rows than this.

Returns:

Return type: