Machine Learning (wraquant.ml)
The ML module implements the full machine learning pipeline for financial prediction: feature engineering, preprocessing, model training, evaluation, and online learning – all designed to avoid the pitfalls that make naive ML on financial data fail (lookahead bias, non-stationarity, overfitting).
Pipeline stages:
Feature engineering – return features, TA features, volatility features, triple-barrier labels
Preprocessing – purged K-fold CV, fractional differentiation, denoised correlation
Model training – walk-forward, ensembles, feature importance (MDA/MDI), sequential selection
Deep learning – LSTM, GRU, Transformer, Temporal Fusion Transformer, autoencoders
Evaluation – financial metrics (Sharpe, profit factor), learning curves, backtest predictions
Online learning – incremental regression for streaming data
Quick Example
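import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from wraquant.ml import (
    technical_features, label_triple_barrier,
    purged_kfold, walk_forward_train, financial_metrics,
)
# high, low, close: price Series sharing one DatetimeIndex
# Engineer features from TA indicators (RSI, MACD histogram, %B, ATR)
features = technical_features(high, low, close)
# Triple-barrier labels (de Prado method)
labels = label_triple_barrier(close, upper=0.02, lower=0.01, max_holding=10)
# Align features and labels, dropping warm-up and trailing NaN rows
data = features.join(labels.rename("label"), how="inner").dropna()
X, y = data.drop(columns="label"), data["label"].astype(int)
# Purged cross-validation (no information leakage)
splits = list(purged_kfold(X, y, n_splits=5, embargo_pct=0.01))
# Walk-forward training (gold standard for financial ML)
result = walk_forward_train(
    RandomForestClassifier(n_estimators=100),
    X, y,
    train_size=504,
    test_size=63,
)
accuracy = np.mean(result['predictions'] == result['actuals'])
print(f"OOS accuracy: {accuracy:.4f}")
# Evaluate as a trading strategy (simplified: hold the predicted side over the next bar)
next_ret = close.pct_change().shift(-1).reindex(X.index).to_numpy()
strategy_returns = pd.Series(result['predictions'] * next_ret[result['test_indices']]).dropna()
fin = financial_metrics(strategy_returns)
print(f"Sharpe: {fin['sharpe_ratio']:.4f}")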
Feature Importance
from wraquant.ml import feature_importance_mda
# MDA: permutation-based importance (preferred over MDI)
mda = feature_importance_mda(model, X, y, purged_cv=True)
print(mda['importance'].head(10))
Deep Learning
from wraquant.ml import lstm_forecast, transformer_forecast
result = lstm_forecast(X, y, hidden_size=64, sequence_length=21, epochs=50)
print(f"LSTM accuracy: {result['test_accuracy']:.4f}")
See also
ML Alpha Research – Full ML alpha research tutorial
Technical Analysis (wraquant.ta) – TA indicators used for feature engineering
Backtesting (wraquant.backtest) – Backtest ML-generated signals
API Reference
Machine learning utilities for quantitative finance.
This module implements the full ML pipeline for financial prediction and analysis: feature engineering, preprocessing, model training, evaluation, and online learning – all designed to avoid the pitfalls that make naive ML on financial data fail (lookahead bias, non-stationarity, overfitting on noise).
Pipeline overview
A typical financial ML workflow moves through five stages, each supported by a sub-module here:
Feature engineering (features) – transform raw market data into predictive signals.
  - return_features – lagged and cumulative log returns.
  - volatility_features – realized vol, vol-of-vol, and short/long vol ratios.
  - technical_features – inline implementations of common TA indicators (RSI, MACD histogram, Bollinger %B, ATR, optional OBV) as a feature DataFrame.
  - rolling_features – rolling statistics (mean, std, skew, kurtosis, min, max) at multiple windows.
  - microstructure_features – Amihud illiquidity, Kyle's lambda, and volume-based activity measures.
  - label_fixed_horizon – binary or ternary labels based on forward returns over a fixed window.
  - label_triple_barrier – labels based on the triple-barrier method (de Prado): a trade is labeled by which barrier (profit target, stop-loss, or time expiry) is hit first. Preferred over fixed-horizon labels for realistic strategy evaluation.
  - interaction_features – pairwise interaction terms between features (products and ratios).
  - cross_asset_features – rolling correlation, beta, and relative strength between assets.
  - regime_features – features derived from regime probabilities (current regime, duration, transition probability).
Preprocessing (preprocessing) – prepare data for training without introducing bias.
  - purged_kfold – time-series cross-validation with a purge gap to prevent information leakage between train and test folds.
  - combinatorial_purged_kfold – generates all combinatorial train/test splits with purging (de Prado Chapter 12).
  - fractional_differentiation – make a price series stationary while retaining as much memory as possible (unlike simple differencing, which destroys long-range dependencies).
  - denoised_correlation – apply Marchenko-Pastur random matrix theory to shrink noisy eigenvalues of the correlation matrix.
  - detoned_correlation – remove the market mode (first eigenvector) to expose sector-level structure.
Model training (models, advanced, deep, online, pipeline) – fit models designed for financial prediction.
  - walk_forward_train – expanding-window training with out-of-sample prediction at each step. The gold standard for financial model validation.
  - ensemble_predict – blend predictions from multiple models.
  - feature_importance_mdi / feature_importance_mda – Mean Decrease Impurity and Mean Decrease Accuracy feature importance (de Prado Chapter 8). MDA is generally preferred because it is measured out-of-sample, although like MDI it remains sensitive to substitution effects among correlated features.
  - sequential_feature_selection – forward/backward feature selection with cross-validation.
Pipeline utilities:
  - FinancialPipeline – sklearn Pipeline wrapper enforcing chronological splitting with purged K-fold CV.
  - walk_forward_backtest – full walk-forward ML backtest with PnL, Sharpe, hit rate, and equity curve.
  - feature_importance_shap – SHAP-based feature importance.
Advanced sklearn wrappers:
  - svm_classifier, random_forest_importance, gradient_boost_forecast, gaussian_process_regression, isolation_forest_anomaly, pca_factor_model.
Deep learning (requires PyTorch):
  - lstm_forecast, gru_forecast, transformer_forecast – recurrent and attention-based time-series forecasting.
  - multivariate_lstm_forecast – LSTM with multiple input features.
  - temporal_fusion_transformer – interpretable forecasting with variable selection and attention.
  - autoencoder_features – VAE-based latent feature extraction and anomaly detection.
Online / streaming:
  - online_linear_regression, exponential_weighted_regression – models that update incrementally with each new observation.
Clustering (clustering) – discover structure in returns.
  - correlation_clustering – hierarchical clustering of the correlation matrix to find asset groups.
  - regime_clustering – cluster return features to identify market regimes.
  - optimal_clusters – determine the optimal number of clusters via silhouette score and gap statistic.
Evaluation (evaluation) – measure performance correctly.
  - classification_metrics – accuracy, precision, recall, F1 for classification models.
  - financial_metrics – Sharpe, Sortino, max drawdown, and hit rate of the model’s predictions when used as a trading signal.
  - learning_curve – diagnose bias/variance trade-off as training set size grows.
  - backtest_predictions – convert model predictions into a P&L series and compute strategy-level metrics.
Common pitfalls
- Lookahead bias: always use purged_kfold or walk_forward_train, never random shuffled CV.
- Non-stationarity: apply fractional_differentiation to price levels before using them as features.
- Overfitting: financial signal-to-noise ratio is very low; use feature_importance_mda to prune irrelevant features and monitor learning_curve for divergence between train and test error.
- Label leakage: label_triple_barrier barriers must be computed on future data only; the purge gap in cross-validation must be at least as wide as the label horizon.
References
de Prado (2018), “Advances in Financial Machine Learning”
Dixon, Halperin & Bilokon (2020), “Machine Learning in Finance”
- rolling_features(data, windows=(5, 10, 21, 63))
Generate rolling statistical features for each window length.
Use rolling features as a general-purpose feature engineering step before training ML models on time-series data. The rolling statistics capture time-varying moments that can signal changes in trend (mean), risk (std), asymmetry (skew), and tail behaviour (kurtosis).
For every window the following statistics are computed: mean, std, skew, kurtosis, min, and max.
- Parameters:
data (Series | DataFrame) – Numeric time-series data. If a DataFrame is passed, features are generated independently for each column.
windows (Sequence[int], default: (5, 10, 21, 63)) – Rolling-window sizes, corresponding roughly to 1-week, 2-week, 1-month, and 1-quarter horizons.
- Returns:
DataFrame whose columns are named {col}_{stat}_w{window} (or {stat}_w{window} when data is a Series). The number of feature columns equals n_cols * len(windows) * 6. Early rows contain NaN where the window has insufficient data.
- Return type:
Example
>>> import pandas as pd, numpy as np
>>> np.random.seed(0)
>>> returns = pd.Series(np.random.randn(100) * 0.01, name='ret')
>>> feats = rolling_features(returns, windows=(5, 21))
>>> feats.columns.tolist()[:3]
['mean_w5', 'std_w5', 'skew_w5']
>>> feats.shape[1]  # 6 stats * 2 windows
12
See also
return_features – Lagged and cumulative return features.
volatility_features – Realised volatility and vol-of-vol features.
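The six rolling statistics described above can be approximated with plain pandas. This is a minimal sketch, not the library's implementation: `rolling_stats_sketch` is a hypothetical helper, and the column names for kurtosis, min, and max are assumptions (the documented example only shows `mean`, `std`, and `skew`).

```python
import numpy as np
import pandas as pd


def rolling_stats_sketch(series: pd.Series, windows=(5, 21)) -> pd.DataFrame:
    """Illustrative only: six rolling statistics per window for one Series."""
    out = {}
    for w in windows:
        roll = series.rolling(w)
        out[f"mean_w{w}"] = roll.mean()
        out[f"std_w{w}"] = roll.std()
        out[f"skew_w{w}"] = roll.skew()
        out[f"kurtosis_w{w}"] = roll.kurt()  # assumed column name
        out[f"min_w{w}"] = roll.min()        # assumed column name
        out[f"max_w{w}"] = roll.max()        # assumed column name
    return pd.DataFrame(out)


rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0.0, 0.01, 100), name="ret")
feats = rolling_stats_sketch(returns, windows=(5, 21))
```

The first `w - 1` rows of each window's columns are NaN, which is why downstream code usually drops the warm-up period before training.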
- return_features(prices, lags=(1, 2, 3, 5, 10, 21))
Compute lagged and cumulative return features from a price series.
Use return features as inputs to ML models predicting future returns or direction. Lagged returns capture momentum and mean-reversion signals at multiple horizons; cumulative returns capture trend strength.
- Parameters:
- Returns:
DataFrame with columns ret_lag{l} (log return l periods ago, a momentum/mean-reversion signal) and cum_ret_{l} (cumulative log return over the last l periods, a trend signal) for each lag l. Early rows are NaN.
- Return type:
Example
>>> import pandas as pd, numpy as np
>>> prices = pd.Series([100, 101, 102, 100, 103, 105, 104],
...                    name='close')
>>> feats = return_features(prices, lags=(1, 3))
>>> list(feats.columns)
['ret_lag1', 'cum_ret_1', 'ret_lag3', 'cum_ret_3']
>>> feats['cum_ret_3'].iloc[-1] > 0  # cumulative 3-period return
True
See also
rolling_features – Rolling statistical features.
technical_features – Technical analysis features (RSI, MACD, etc.).
- technical_features(high, low, close, volume=None)
Compute common technical analysis features for ML pipelines.
Use these features as inputs to ML models when you want to capture classic technical signals without depending on the full wraquant.ta module. Combines momentum (RSI, MACD), volatility (ATR, Bollinger), and optionally volume (OBV) into a single DataFrame.
Computes RSI, MACD histogram, Bollinger Band %B, and ATR. If volume is provided, On-Balance Volume (OBV) is also included.
- Parameters:
- Returns:
DataFrame with columns:
  - rsi: Relative Strength Index (0-100). Values above 70 indicate overbought; below 30 indicate oversold.
  - macd_hist: MACD histogram. Positive values indicate bullish momentum; negative values indicate bearish.
  - bb_pctb: Bollinger Band %B (0-1 range typically). Values above 1 mean price is above the upper band.
  - atr: Average True Range. Higher values indicate more volatile price action.
  - obv (optional): On-Balance Volume. Rising OBV confirms an uptrend.
- Return type:
Example
>>> import pandas as pd, numpy as np
>>> np.random.seed(0)
>>> n = 100
>>> close = pd.Series(100 + np.cumsum(np.random.randn(n) * 0.5))
>>> high = close + np.abs(np.random.randn(n) * 0.3)
>>> low = close - np.abs(np.random.randn(n) * 0.3)
>>> feats = technical_features(high, low, close)
>>> list(feats.columns)
['rsi', 'macd_hist', 'bb_pctb', 'atr']
See also
return_features – Lagged and cumulative return features.
volatility_features – Realised volatility features.
- ta_features(high, low, close, volume=None, include=None)
Generate ML features using wraquant’s full technical analysis library.
Unlike technical_features (which uses inline implementations), this function imports directly from wraquant.ta to leverage the full 263-indicator library. This bridges the ml and ta modules so that ML pipelines can access production-quality TA indicators without manual wiring.
By default, computes a curated set of the most ML-relevant indicators: RSI, MACD histogram, Bollinger Band %B, ATR, and optionally OBV. Use the include parameter to restrict the output to a subset of these.
- Parameters:
high (Series) – High prices.
low (Series) – Low prices.
close (Series) – Close prices.
volume (Series | None, default: None) – Trade volume (optional). Required for volume-based indicators (OBV, MFI).
include (Optional[Sequence[str]], default: None) – Subset of indicators to include. Options: 'rsi', 'macd', 'bbands', 'atr', 'obv'. If None, includes all available indicators.
- Return type:
- Returns:
DataFrame with one column per indicator, indexed like the input series. Column names are descriptive (e.g., ta_rsi, ta_macd_hist, ta_bb_pctb, ta_atr, ta_obv).
Example
>>> import pandas as pd, numpy as np
>>> np.random.seed(0)
>>> n = 100
>>> close = pd.Series(100 + np.cumsum(np.random.randn(n) * 0.5))
>>> high = close + np.abs(np.random.randn(n) * 0.3)
>>> low = close - np.abs(np.random.randn(n) * 0.3)
>>> feats = ta_features(high, low, close)
>>> 'ta_rsi' in feats.columns
True
See also
technical_features – Inline implementation (no ta/ dependency).
wraquant.ta.momentum.rsi – Full RSI implementation.
wraquant.ta.momentum.macd – Full MACD implementation.
- volatility_features(returns, windows=(5, 10, 21, 63))
Compute realised-volatility-related features.
Use volatility features to capture the current risk environment and volatility regime. Realised volatility is the most important feature in many financial ML models because volatility clusters (GARCH effect) and predicts future volatility better than returns predict future returns.
- Parameters:
- Returns:
Columns:
  - realized_vol_w{w}: Annualised rolling standard deviation (sqrt(252) scaling). Interpretation: a value of 0.20 means ~20% annualised volatility.
  - vol_of_vol_w{w}: Rolling std of the rolling vol. High values indicate unstable volatility (vol-of-vol regime).
  - vol_ratio_w{w1}_w{w2}: Ratio of short-window vol to long-window vol. Values > 1 indicate vol is spiking (risk-off signal); values < 1 indicate vol compression.
- Return type:
Example
>>> import pandas as pd, numpy as np
>>> np.random.seed(0)
>>> rets = pd.Series(np.random.randn(200) * 0.01, name='daily_ret')
>>> feats = volatility_features(rets, windows=(5, 21))
>>> 'realized_vol_w5' in feats.columns
True
>>> 'vol_ratio_w5_w21' in feats.columns
True
See also
rolling_features – General rolling statistical features.
wraquant.vol – Full volatility modelling (GARCH, stochastic vol).
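As a rough illustration of the three column families above, the following pandas sketch annualises a rolling standard deviation with sqrt(252) and derives vol-of-vol and a short/long ratio from it. `realized_vol_sketch` is a hypothetical helper; using the same window for vol-of-vol, and pairing only the shortest and longest windows for the ratio, are assumptions rather than documented behaviour.

```python
import numpy as np
import pandas as pd


def realized_vol_sketch(returns: pd.Series, windows=(5, 21)) -> pd.DataFrame:
    """Illustrative only: realised vol, vol-of-vol, and a short/long vol ratio."""
    out = {}
    for w in windows:
        rv = returns.rolling(w).std() * np.sqrt(252)  # annualised rolling std
        out[f"realized_vol_w{w}"] = rv
        out[f"vol_of_vol_w{w}"] = rv.rolling(w).std()  # assumption: same window
    short_w, long_w = min(windows), max(windows)
    out[f"vol_ratio_w{short_w}_w{long_w}"] = (
        out[f"realized_vol_w{short_w}"] / out[f"realized_vol_w{long_w}"]
    )
    return pd.DataFrame(out)


rng = np.random.default_rng(0)
rets = pd.Series(rng.normal(0.0, 0.01, 500), name="daily_ret")
vol_feats = realized_vol_sketch(rets, windows=(5, 21))
```

With daily returns of roughly 1% standard deviation, the realised-vol columns hover around 0.16, which matches the "0.20 means ~20% annualised" reading given above.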
- microstructure_features(high, low, close, volume)
Compute market-microstructure features.
Use microstructure features to capture liquidity conditions, information asymmetry, and trading activity. These are particularly valuable for short-horizon alpha models and execution-aware strategies where liquidity predicts future returns or trading costs.
- Parameters:
- Returns:
Columns:
  - amihud_illiq: Amihud illiquidity ratio (21-day rolling mean of |return| / dollar_volume). Higher values indicate less liquid, more price-impactful markets.
  - kyle_lambda: Kyle’s lambda (21-day rolling OLS slope of |price change| on signed sqrt-volume). Measures the price impact per unit of informed flow. Higher values suggest more information asymmetry.
  - log_volume: Natural log of volume. Smooths the skewed volume distribution for ML model consumption.
  - volume_ma_ratio: Current volume / 21-day moving average. Values > 1 indicate above-average activity (potential event).
  - dollar_volume: Price * volume. Absolute measure of trading activity and liquidity.
- Return type:
Example
>>> import pandas as pd, numpy as np
>>> np.random.seed(0)
>>> n = 100
>>> close = pd.Series(100 + np.cumsum(np.random.randn(n) * 0.5))
>>> high = close + np.abs(np.random.randn(n) * 0.3)
>>> low = close - np.abs(np.random.randn(n) * 0.3)
>>> volume = pd.Series(np.random.randint(1_000_000, 5_000_000, n))
>>> feats = microstructure_features(high, low, close, volume)
>>> list(feats.columns)
['amihud_illiq', 'kyle_lambda', 'log_volume', 'volume_ma_ratio', 'dollar_volume']
References
Amihud (2002), “Illiquidity and stock returns”
Kyle (1985), “Continuous Auctions and Insider Trading”
See also
technical_features – Price-based technical indicators.
- label_fixed_horizon(returns, horizon=5, threshold=0.0)
Label future return direction over a fixed horizon.
Use fixed-horizon labelling as the simplest way to create supervised learning targets for directional prediction. Each observation is labelled based on the cumulative return over the next horizon periods. This is the standard approach for “will the price go up or down over the next N days?” classification.
- Parameters:
returns (Series) – Period (e.g. daily) returns.
horizon (int, default: 5) – Number of periods to accumulate forward returns (default 5, i.e. one trading week).
threshold (float, default: 0.0) – If threshold > 0, three labels are produced: 1 (up beyond threshold), 0 (flat), -1 (down beyond threshold). If threshold == 0, binary labels (1/0) are produced where 1 means positive cumulative return.
- Returns:
Integer labels aligned to the original index. The last horizon rows will be NaN (no future data available).
- Return type:
Example
>>> import pandas as pd, numpy as np
>>> rets = pd.Series([0.01, -0.005, 0.02, 0.01, -0.03, 0.015, 0.005])
>>> labels = label_fixed_horizon(rets, horizon=3, threshold=0.0)
>>> labels.iloc[0]  # sum of rets[1:4] = -0.005+0.02+0.01 > 0
1
Notes
Fixed-horizon labelling does not adapt to volatility. In high-vol regimes, the threshold is hit more often; in low-vol regimes, most labels become 0. For volatility-adaptive labels, use label_triple_barrier.
See also
label_triple_barrier – Volatility-adaptive labelling (Lopez de Prado).
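The labelling rule above can be sketched in a few lines of pandas. This is an illustration under assumptions, not the library's code: `fixed_horizon_sketch` is a hypothetical name, forward returns are summed (rather than compounded), and labels are stored as floats so that the trailing rows can hold NaN.

```python
import numpy as np
import pandas as pd


def fixed_horizon_sketch(returns: pd.Series, horizon: int = 5,
                         threshold: float = 0.0) -> pd.Series:
    """Illustrative only: label each row by the sum of the next `horizon` returns."""
    # rolling sum ending at t+horizon, shifted back so it sits at row t
    fwd = returns.rolling(horizon).sum().shift(-horizon)
    if threshold > 0:
        labels = pd.Series(0.0, index=returns.index)
        labels[fwd > threshold] = 1.0
        labels[fwd < -threshold] = -1.0
    else:
        labels = (fwd > 0).astype(float)
    labels[fwd.isna()] = np.nan  # no future data for the last `horizon` rows
    return labels


rets = pd.Series([0.01, -0.005, 0.02, 0.01, -0.03, 0.015, 0.005])
labels = fixed_horizon_sketch(rets, horizon=3)
```

Because every label at row t depends on returns through t + horizon, any cross-validation scheme applied to these labels needs a gap of at least `horizon` rows between training and test samples.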
- label_triple_barrier(close, upper=None, lower=None, max_holding=10)
Triple-barrier labelling (Lopez de Prado).
Use triple-barrier labelling when you want targets that adapt to market conditions. Unlike fixed-horizon labels, this method defines a profit-taking barrier (upper), a stop-loss barrier (lower), and a maximum holding period (vertical). Whichever barrier is hit first determines the label. This produces cleaner labels in volatile markets because the barriers can be scaled by volatility.
For each bar the method sets three barriers:
Upper: price rises by upper fraction -> label = 1
Lower: price falls by lower fraction -> label = -1
Vertical: max_holding bars elapse -> label = sign of return
If upper or lower is None the corresponding horizontal barrier is disabled.
- Parameters:
close (Series) – Close price series.
upper (float | None, default: None) – Fractional distance for the upper barrier (e.g. 0.02 for 2 %).
lower (float | None, default: None) – Fractional distance for the lower barrier (positive value; e.g. 0.02 for -2 %).
max_holding (int, default: 10) – Maximum holding period in bars (vertical barrier).
- Returns:
Integer labels in {-1, 0, 1} aligned to the input index. 1 = profit-taking barrier hit first (bullish), -1 = stop-loss barrier hit first (bearish), 0 = vertical barrier hit with zero return. The last max_holding entries may be NaN.
- Return type:
Example
>>> import pandas as pd
>>> close = pd.Series([100, 101, 102, 103, 100, 97, 98, 99, 100, 101])
>>> labels = label_triple_barrier(close, upper=0.03, lower=0.03, max_holding=5)
>>> labels.iloc[0]  # price rises 3% by bar 3 (103/100 - 1 = 0.03)
1
Notes
In practice, set upper and lower proportional to recent volatility (e.g., upper = lower = daily_vol * sqrt(max_holding)). This makes the labels regime-adaptive.
References
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 3
See also
label_fixed_horizon – Simpler fixed-horizon labelling.
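The barrier logic described above can be written as a simple double loop. The sketch below is one possible reading, with assumptions stated plainly: `triple_barrier_sketch` is a hypothetical name, barriers are checked on close-to-close returns with inclusive comparisons, and rows without a full `max_holding` look-ahead are left as NaN.

```python
import numpy as np
import pandas as pd


def triple_barrier_sketch(close: pd.Series, upper=None, lower=None,
                          max_holding: int = 10) -> pd.Series:
    """Illustrative only: label each bar by the first barrier touched."""
    prices = close.to_numpy(dtype=float)
    n = len(prices)
    labels = np.full(n, np.nan)
    for i in range(n - max_holding):
        label = None
        for j in range(i + 1, i + max_holding + 1):
            ret = prices[j] / prices[i] - 1.0
            if upper is not None and ret >= upper:
                label = 1.0   # profit-taking barrier hit first
                break
            if lower is not None and ret <= -lower:
                label = -1.0  # stop-loss barrier hit first
                break
        if label is None:
            # vertical barrier: sign of the return at the end of the holding period
            label = float(np.sign(prices[i + max_holding] / prices[i] - 1.0))
        labels[i] = label
    return pd.Series(labels, index=close.index)


close = pd.Series([100, 101, 102, 103, 100, 97, 98, 99, 100, 101])
tb_labels = triple_barrier_sketch(close, upper=0.03, lower=0.03, max_holding=5)
```

On the documented example series, bar 0 reaches +3% at bar 3 before any -3% move, while bar 1 (entry at 101) falls through its lower barrier at bar 5 before gaining 3%.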
- interaction_features(data, columns=None)
Create pairwise interaction terms between features.
Use interaction features when you suspect that predictive power lies in the combination of features rather than individual signals. For example, momentum * volatility captures whether momentum is occurring in a high- or low-volatility environment, which may predict returns differently.
For each pair of selected columns (A, B), computes:
  - A_x_B: element-wise product (captures multiplicative relationships)
  - A_div_B: element-wise ratio A / B (captures relative magnitudes)
- Parameters:
- Returns:
DataFrame containing all pairwise interaction features, with column names like col1_x_col2 and col1_div_col2.
- Return type:
Example
>>> import pandas as pd, numpy as np
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
>>> result = interaction_features(df, columns=['a', 'b'])
>>> 'a_x_b' in result.columns
True
>>> 'a_div_b' in result.columns
True
- cross_asset_features(asset, benchmark, windows=(10, 21, 63))
Compute cross-asset relationship features.
Use cross-asset features to capture how an asset co-moves with a benchmark or related instrument. Rolling correlation and beta detect changing exposures (useful for regime detection); relative strength identifies momentum divergence between the asset and its benchmark.
Given an asset return series and a benchmark (or related asset) return series, computes rolling correlation, rolling beta, and relative strength for each window.
- Parameters:
- Returns:
DataFrame with columns:
  - rolling_corr_w{w}: rolling Pearson correlation
  - rolling_beta_w{w}: rolling OLS beta (cov / var of benchmark)
  - relative_strength_w{w}: cumulative return ratio (asset / benchmark) over the window
- Return type:
Example
>>> import pandas as pd, numpy as np
>>> np.random.seed(0)
>>> asset = pd.Series(np.random.randn(200) * 0.01, name='asset')
>>> bench = pd.Series(np.random.randn(200) * 0.01, name='bench')
>>> result = cross_asset_features(asset, bench, windows=[10, 21])
>>> 'rolling_corr_w10' in result.columns
True
>>> 'rolling_beta_w21' in result.columns
True
- regime_features(regime_probabilities, regime_labels=None)
Create features from regime probabilities or labels.
Use regime features when you have upstream regime detection (e.g., HMM, Markov-switching) and want to feed regime state into downstream ML models. Regime duration and transition probability are predictive because regimes tend to persist (duration) but eventually break down (transition probability rises before a switch).
Given regime probabilities (e.g., from an HMM or Markov-switching model), constructs features useful for downstream ML models: current regime identity, regime duration (how many consecutive periods in the current regime), and estimated transition probability (rolling mean of regime changes).
- Parameters:
regime_probabilities (DataFrame) – DataFrame where each column is the probability of a regime (e.g., columns ['bull', 'bear'] with probabilities summing to 1).
regime_labels (Series | None, default: None) – Hard regime labels. If None, the most probable regime at each step is used (argmax of the probability columns).
- Returns:
DataFrame with columns:
  - current_regime: integer label of the current regime
  - regime_duration: number of consecutive periods in the current regime
  - regime_change: binary indicator (1 if regime changed)
  - transition_prob_w{w}: rolling mean of regime changes for w in [5, 10, 21]
  - one column per regime probability from the input
- Return type:
Example
>>> import pandas as pd, numpy as np
>>> np.random.seed(42)
>>> p = np.random.dirichlet([5, 2], size=100)  # single draw so each row sums to 1
>>> probs = pd.DataFrame(p, columns=['bull', 'bear'])
>>> result = regime_features(probs)
>>> 'current_regime' in result.columns
True
>>> 'regime_duration' in result.columns
True
- purged_kfold(X, y, n_splits=5, embargo_pct=0.01)
Purged K-Fold cross-validation.
Use purged K-fold instead of standard K-fold whenever your labels overlap in time (e.g., forward returns computed over a window). Standard K-fold leaks future information because a training sample’s label may depend on prices that appear in the test set. Purging removes an embargo zone after each test fold to break this leakage.
Ensures that training observations that immediately follow a test observation are removed (embargo) so that information cannot leak through overlapping labels.
- Parameters:
X (DataFrame | ndarray) – Feature matrix (only its length is used).
y (Series | ndarray) – Target vector (only its length is used).
n_splits (int, default: 5) – Number of folds.
embargo_pct (float, default: 0.01) – Fraction of total samples to embargo after each test fold. For daily data with 5-day forward labels, 0.01 embargoes ~2.5 days on a 252-sample dataset, so covering the full 5-day label horizon there needs a value of about 0.02.
- Yields:
tuple[np.ndarray, np.ndarray] – (train_indices, test_indices) for each fold.
- Return type:
Example
>>> import numpy as np
>>> X = np.random.randn(500, 3)
>>> y = np.random.randn(500)
>>> folds = list(purged_kfold(X, y, n_splits=5, embargo_pct=0.02))
>>> len(folds)
5
>>> train_idx, test_idx = folds[0]
>>> len(train_idx) + len(test_idx) < 500  # embargo removes some samples
True
References
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 7
See also
combinatorial_purged_kfold – Generates all C(n, k) purged splits.
wraquant.ml.pipeline.FinancialPipeline – Pipeline that uses purged K-fold.
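To make the embargo mechanics concrete, here is a minimal generator that splits contiguous folds and drops an embargo zone after each test fold. It is a sketch under assumptions: `purged_kfold_sketch` is a hypothetical name, it takes the sample count directly, and the embargo length is `int(n_samples * embargo_pct)`; the library's rounding may differ.

```python
import numpy as np


def purged_kfold_sketch(n_samples: int, n_splits: int = 5,
                        embargo_pct: float = 0.01):
    """Illustrative only: contiguous K-fold with an embargo after each test fold."""
    indices = np.arange(n_samples)
    embargo = int(n_samples * embargo_pct)
    for test_idx in np.array_split(indices, n_splits):
        test_end = test_idx[-1] + 1
        train_mask = np.ones(n_samples, dtype=bool)
        train_mask[test_idx] = False                     # remove the test fold
        train_mask[test_end:test_end + embargo] = False  # remove the embargo zone
        yield indices[train_mask], test_idx


folds = list(purged_kfold_sketch(500, n_splits=5, embargo_pct=0.02))
first_train, first_test = folds[0]
```

With 500 samples and `embargo_pct=0.02`, the first fold tests on rows 0-99, embargoes rows 100-109, and trains on the remaining 390 rows.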
- combinatorial_purged_kfold(X, y, n_splits=5, n_test_splits=2, embargo_pct=0.01)
Combinatorial purged K-Fold cross-validation.
Use combinatorial purged K-fold when you need more backtest paths than standard purged K-fold provides. By choosing n_test_splits groups as the test set from n_splits total groups, this generates C(n_splits, n_test_splits) distinct train/test splits – each with an embargo to prevent information leakage.
Generates all C(n_splits, n_test_splits) train/test combinations, applying an embargo after each test group to prevent leakage.
- Parameters:
- Yields:
tuple[np.ndarray, np.ndarray] – (train_indices, test_indices) for each combination.
- Return type:
Example
>>> import numpy as np
>>> X = np.random.randn(500, 3)
>>> y = np.random.randn(500)
>>> folds = list(combinatorial_purged_kfold(X, y, n_splits=5, n_test_splits=2))
>>> len(folds)  # C(5, 2) = 10
10
References
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 12
See also
purged_kfold – Simpler purged K-fold with n_splits folds.
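The combinatorial variant can be sketched with `itertools.combinations` over contiguous groups. As before, this is only an illustration: `cpcv_sketch` is a hypothetical name, and applying the embargo as `int(n_samples * embargo_pct)` rows after every test group is an assumption about the details.

```python
from itertools import combinations

import numpy as np


def cpcv_sketch(n_samples: int, n_splits: int = 5, n_test_splits: int = 2,
                embargo_pct: float = 0.01):
    """Illustrative only: all C(n_splits, n_test_splits) purged splits."""
    indices = np.arange(n_samples)
    groups = np.array_split(indices, n_splits)
    embargo = int(n_samples * embargo_pct)
    for combo in combinations(range(n_splits), n_test_splits):
        train_mask = np.ones(n_samples, dtype=bool)
        for g in combo:
            start, stop = groups[g][0], groups[g][-1] + 1
            train_mask[start:stop + embargo] = False  # test group plus trailing embargo
        test_idx = np.concatenate([groups[g] for g in combo])
        yield indices[train_mask], test_idx


cpcv_folds = list(cpcv_sketch(500, n_splits=5, n_test_splits=2))
```

Each test set here is the union of two groups, so one pass over the ten combinations gives every group four appearances in a test set.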
- fractional_differentiation(series, d=0.5, threshold=1e-05)
Fractionally differentiate a time series.
Use fractional differentiation to make a price or factor series stationary (required by many ML models) while retaining as much memory (long-range dependence) as possible. Standard first differencing (d=1) makes the series stationary but destroys all memory. Fractional differencing with d=0.3-0.5 achieves stationarity while preserving most of the signal.
Applies the fractional differentiation operator of order d (Hosking, 1981) to obtain a (near-)stationary series while preserving long-range memory.
The operator is defined as:
(1 - B)^d = sum_{k=0}^{inf} C(d,k) * (-B)^k
where B is the backshift operator and C(d,k) are the binomial-like weights.
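Expanding the operator gives weights that follow the recursion w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k. The sketch below computes the truncated weights and applies them by convolution. The function names and the `max_lags` cap are hypothetical, and the truncation rule (stop at the first weight below `threshold`) is an assumption about how the cutoff is applied.

```python
import numpy as np


def frac_diff_weights(d: float, threshold: float = 1e-5,
                      max_lags: int = 10_000) -> np.ndarray:
    """Weights of (1 - B)^d, truncated once |w_k| falls below `threshold`."""
    weights = [1.0]
    for k in range(1, max_lags):
        w = -weights[-1] * (d - k + 1) / k
        if abs(w) < threshold:
            break
        weights.append(w)
    return np.array(weights)


def frac_diff_sketch(values: np.ndarray, d: float,
                     threshold: float = 1e-5) -> np.ndarray:
    """Illustrative only: apply the truncated weights; warm-up rows are dropped."""
    w = frac_diff_weights(d, threshold)
    if len(w) > len(values):
        raise ValueError("series is shorter than the weight window")
    # 'valid' keeps only positions where every weight has a matching observation
    return np.convolve(values, w, mode="valid")


rng = np.random.default_rng(42)
prices = 100 + np.cumsum(rng.normal(0.0, 0.5, 300))
fd = frac_diff_sketch(prices, d=0.4, threshold=1e-3)
```

Setting d = 1 reduces the weights to [1, -1], so the result is the ordinary first difference; smaller d makes the weights decay slowly, which is how memory of past levels is retained.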
- Parameters:
series (Series) – Input time series (e.g., log prices).
d (float, default: 0.5) – Fractional differentiation order (0 < d < 1 for partial differentiation; d = 1 is the standard first difference). Aim for the smallest d at which the ADF test rejects the unit-root null at the desired significance level: start with d=0.5, lower it while the test still rejects, and raise it if the test fails to reject.
threshold (float, default: 1e-05) – Minimum absolute weight to retain. Smaller values use more lagged observations but increase computational cost.
- Returns:
Fractionally differentiated series (initial rows where the full convolution is not available are dropped). Test stationarity with an ADF test; if non-stationary, increase d.
- Return type:
Example
>>> import pandas as pd, numpy as np
>>> np.random.seed(42)
>>> prices = pd.Series(100 + np.cumsum(np.random.randn(300) * 0.5),
...                    name='close')
>>> frac_diff = fractional_differentiation(prices, d=0.4)
>>> len(frac_diff) < len(prices)  # initial rows dropped
True
>>> frac_diff.std() > 0  # non-trivial output
True
References
Hosking (1981), “Fractional Differencing”
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 5
See also
denoised_correlation – Random Matrix Theory denoising.
- denoised_correlation(returns, n_components=None)
Denoise a correlation matrix using Random Matrix Theory.
Use denoised correlation before portfolio optimization or clustering to remove noise eigenvalues that arise from finite-sample estimation. When T/N (observations/assets) is not large, the sample correlation matrix contains substantial noise. RMT denoising replaces eigenvalues consistent with random noise (Marchenko-Pastur distribution) with their average, producing a cleaner matrix that leads to more stable portfolio weights.
Eigenvalues that fall within the Marchenko-Pastur distribution are replaced by their average, shrinking noise while preserving signal.
- Parameters:
- Returns:
Denoised correlation matrix of shape (N, N). The matrix is symmetric, positive semi-definite, and has unit diagonal. Use it in place of returns.corr() for portfolio optimization.
- Return type:
Example
>>> import pandas as pd, numpy as np
>>> np.random.seed(42)
>>> returns = pd.DataFrame(np.random.randn(252, 10) * 0.01)
>>> clean_corr = denoised_correlation(returns)
>>> clean_corr.shape
(10, 10)
>>> np.allclose(np.diag(clean_corr), 1.0)  # unit diagonal
True
Notes
The Marchenko-Pastur upper bound is:
lambda_+ = sigma^2 * (1 + sqrt(N/T))^2
Eigenvalues above this threshold are retained as “signal”; those below are replaced.
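The thresholding rule can be sketched with NumPy's symmetric eigendecomposition. This is a simplified illustration, not the library's routine: `mp_denoise_sketch` is a hypothetical name, sigma^2 is fixed at 1 (a common choice for correlation matrices, assumed here), and the final rescaling to a unit diagonal is an assumed post-processing step.

```python
import numpy as np


def mp_denoise_sketch(returns: np.ndarray) -> np.ndarray:
    """Illustrative only: average the eigenvalues below the Marchenko-Pastur bound."""
    T, N = returns.shape
    corr = np.corrcoef(returns, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    lambda_plus = (1.0 + np.sqrt(N / T)) ** 2  # sigma^2 = 1 assumed
    noise = eigvals < lambda_plus
    cleaned = eigvals.copy()
    if noise.any():
        cleaned[noise] = eigvals[noise].mean()  # replace noise eigenvalues by their average
    denoised = (eigvecs * cleaned) @ eigvecs.T
    scale = np.sqrt(np.diag(denoised))
    return denoised / np.outer(scale, scale)    # restore the unit diagonal


rng = np.random.default_rng(42)
sample_returns = rng.normal(0.0, 0.01, size=(252, 10))
clean = mp_denoise_sketch(sample_returns)
```

Averaging preserves the sum of the replaced eigenvalues, so the trace of the matrix is unchanged before the final rescaling.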
References
Laloux et al. (1999), “Noise dressing of financial correlation matrices”
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 2
See also
detoned_correlation – Remove the market mode from a correlation matrix.
- detoned_correlation(corr, n_components=1)
Remove the first n_components eigenvectors (market mode) from a correlation matrix.
Use detoned correlation when you want to uncover residual co-movement structure after removing the dominant market factor. The first eigenvector of asset returns typically represents the “market mode” (all assets moving together). Removing it reveals sector, style, or idiosyncratic clustering that is hidden when the market factor dominates. This is particularly useful before hierarchical clustering or community detection.
- Parameters:
- Returns:
De-toned correlation matrix of shape (N, N). The matrix is symmetric with unit diagonal but is not positive definite (some eigenvalues are set to zero).
- Return type:
Example
>>> import numpy as np
>>> np.random.seed(42)
>>> corr = np.corrcoef(np.random.randn(5, 252))
>>> detoned = detoned_correlation(corr, n_components=1)
>>> detoned.shape
(5, 5)
>>> np.allclose(np.diag(detoned), 1.0)
True
References
Lopez de Prado (2020), “Machine Learning for Asset Managers”, Ch. 2
See also
denoised_correlation – Remove noise eigenvalues from a correlation matrix.
wraquant.ml.clustering.correlation_clustering – Cluster assets by correlation.
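Detoning can be sketched as rebuilding the matrix from all but the largest eigen-components and rescaling to a unit diagonal. `detone_sketch` is a hypothetical name, the sketch assumes `n_components >= 1`, and the rescaling step is an assumption about how the unit diagonal is restored.

```python
import numpy as np


def detone_sketch(corr: np.ndarray, n_components: int = 1) -> np.ndarray:
    """Illustrative only: drop the top eigen-components, then rescale."""
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    keep_vals = eigvals[:-n_components]      # discard the largest ones
    keep_vecs = eigvecs[:, :-n_components]
    residual = (keep_vecs * keep_vals) @ keep_vecs.T
    scale = np.sqrt(np.diag(residual))
    return residual / np.outer(scale, scale)


rng = np.random.default_rng(42)
sample_corr = np.corrcoef(rng.normal(size=(252, 5)), rowvar=False)
detoned_sketch = detone_sketch(sample_corr, n_components=1)
```

The result has rank N - n_components, which is why it is singular and should not be inverted directly (for example inside a mean-variance optimiser).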
- walk_forward_train(model, X, y, train_size=252, test_size=21, step_size=21)
Walk-forward (expanding-window) analysis.
Use walk-forward analysis to evaluate a model under realistic conditions where only past data is available for training at each step. This is the standard time-series cross-validation approach in quantitative finance, avoiding the look-ahead bias inherent in random K-fold splits.
At each step the model is cloned (via scikit-learn’s clone), fitted on the training window, and used to predict the test window.
- Parameters:
model (Any) – A scikit-learn-compatible estimator that implements fit and predict.
train_size (int, default: 252) – Number of training observations in the first window (default 252, approximately one trading year).
test_size (int, default: 21) – Number of test observations per fold (default 21, approximately one trading month).
step_size (int, default: 21) – Number of observations to step forward between folds.
- Returns:
predictions (np.ndarray) – Concatenated out-of-sample predictions across all folds.
actuals (np.ndarray) – Corresponding true values. Compare with predictions to measure forecast accuracy.
test_indices (np.ndarray) – Original row indices for each prediction, useful for aligning results back to a DatetimeIndex.
n_folds (int) – Number of walk-forward folds executed.
- Return type:
Example
>>> from sklearn.linear_model import Ridge
>>> import numpy as np, pandas as pd
>>> np.random.seed(42)
>>> X = pd.DataFrame(np.random.randn(500, 3), columns=['mom', 'vol', 'size'])
>>> y = X['mom'] * 0.5 + np.random.randn(500) * 0.1
>>> result = walk_forward_train(Ridge(), X, y, train_size=252, test_size=21)
>>> result['n_folds'] > 0
True
>>> len(result['predictions']) == len(result['actuals'])
True
Notes
The window is expanding (all data from the start up to the current train end is used). For a rolling window, see wraquant.ml.pipeline.walk_forward_backtest, which supports both modes.
See also
wraquant.ml.pipeline.walk_forward_backtest – Full walk-forward backtest with PnL.
wraquant.ml.preprocessing.purged_kfold – Purged K-fold cross-validation.
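The expanding-window fold layout can be sketched with plain index arithmetic. The helper name `expanding_window_folds` is hypothetical, and dropping a final fold that has fewer than `test_size` rows is an assumption of this sketch; the library may treat the tail differently.

```python
import numpy as np

def expanding_window_folds(n_samples, train_size=252, test_size=21, step_size=21):
    """Yield (train_idx, test_idx) pairs for an expanding-window walk-forward.

    Hypothetical helper for illustration only.
    """
    train_end = train_size
    while train_end + test_size <= n_samples:
        train_idx = np.arange(0, train_end)                     # all history so far
        test_idx = np.arange(train_end, train_end + test_size)  # next block
        yield train_idx, test_idx
        train_end += step_size

folds = list(expanding_window_folds(500))
first_train, first_test = folds[0]
last_train, last_test = folds[-1]
print(len(folds), first_test[0], last_test[-1])
```

Every training index precedes every test index within a fold, which is the property that rules out look-ahead.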
- ensemble_predict(models, X, method='mean')[source]¶
Generate ensemble predictions from multiple fitted models.
Use ensemble prediction to combine several models (e.g., Ridge, Random Forest, Gradient Boosting) into a single, more robust forecast. Averaging models whose errors are not perfectly correlated reduces prediction variance, which is why ensembles are standard practice in alpha research and competition-winning pipelines.
- Parameters:
models (Sequence[Any]) – Fitted scikit-learn-compatible estimators. Each must implement predict(X).
method (Literal['mean', 'median', 'vote'], default: 'mean') – Aggregation method. 'mean' and 'median' take the mean or median of the raw predictions (best for regression); 'vote' takes the mode (majority vote, best for classification).
- Returns:
Aggregated predictions. For 'mean' / 'median', the values are continuous. For 'vote', the values are discrete class labels.
- Return type:
Example
>>> from sklearn.linear_model import Ridge, Lasso
>>> import numpy as np
>>> np.random.seed(0)
>>> X_train = np.random.randn(200, 3)
>>> y_train = X_train @ [1, 0.5, 0] + np.random.randn(200) * 0.1
>>> m1 = Ridge().fit(X_train, y_train)
>>> m2 = Lasso(alpha=0.01).fit(X_train, y_train)
>>> X_test = np.random.randn(50, 3)
>>> preds = ensemble_predict([m1, m2], X_test, method='mean')
>>> preds.shape
(50,)
See also
walk_forward_train – Walk-forward evaluation for individual models.
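The three aggregation modes can be illustrated on a small matrix of per-model predictions. `aggregate_predictions` is a hypothetical stand-in, not the library function, and its tie-breaking rule for votes (smallest label wins) is an assumption.

```python
import numpy as np

def aggregate_predictions(pred_matrix, method="mean"):
    """Combine predictions stacked as rows of shape (n_models, n_samples)."""
    pred_matrix = np.asarray(pred_matrix)
    if method == "mean":
        return pred_matrix.mean(axis=0)
    if method == "median":
        return np.median(pred_matrix, axis=0)
    if method == "vote":
        # Majority vote per sample; ties go to the smallest label (assumption).
        out = []
        for column in pred_matrix.T:
            labels, counts = np.unique(column, return_counts=True)
            out.append(labels[np.argmax(counts)])
        return np.array(out)
    raise ValueError(f"unknown method: {method!r}")

regression_preds = np.array([[0.10, -0.20, 0.30],
                             [0.30, -0.10, 0.10],
                             [0.20,  0.00, 0.50]])
class_preds = np.array([[1, 0, 1],
                        [1, 1, 0],
                        [0, 1, 1]])

mean_pred = aggregate_predictions(regression_preds, "mean")
median_pred = aggregate_predictions(regression_preds, "median")
vote_pred = aggregate_predictions(class_preds, "vote")
print(mean_pred, median_pred, vote_pred)
```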
- feature_importance_mdi(model, feature_names)[source]¶
Mean Decrease Impurity (MDI) feature importance.
Use MDI as a fast, first-pass feature ranking after fitting a tree-based model. MDI measures how much each feature contributes to reducing node impurity (Gini for classification, variance for regression) across all trees.
Reads model.feature_importances_ (available on tree-based estimators after fitting) and returns a sorted pd.Series.
- Parameters:
- Returns:
Importance values indexed by feature name, sorted descending. Higher values indicate features that contributed more to splits. Values sum to 1.0 for scikit-learn tree ensembles.
- Return type:
Example
>>> from sklearn.ensemble import RandomForestClassifier
>>> import numpy as np
>>> np.random.seed(42)
>>> X = np.random.randn(300, 4)
>>> y = (X[:, 0] > 0).astype(int)
>>> rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
>>> imp = feature_importance_mdi(rf, ['momentum', 'vol', 'size', 'value'])
>>> imp.index[0]  # most important feature
'momentum'
Notes
MDI is biased toward high-cardinality and continuous features. For an alternative that does not share this cardinality bias, use feature_importance_mda (permutation importance).
See also
feature_importance_mda – Permutation-based importance (no cardinality bias).
wraquant.ml.advanced.random_forest_importance – Combined RF fit + importance.
- feature_importance_mda(model, X, y, feature_names, n_repeats=10)[source]¶
Mean Decrease Accuracy (permutation importance).
Use MDA when you need an unbiased estimate of feature importance that accounts for feature interactions and is not affected by cardinality bias. Unlike MDI, MDA evaluates on held-out data and directly measures how much predictive power is lost when a feature is shuffled.
Repeatedly permutes each feature and measures the decrease in the model’s score.
- Parameters:
model (Any) – A fitted scikit-learn-compatible estimator.
X (DataFrame | ndarray) – Feature matrix (test or validation set).
feature_names (Sequence[str]) – Feature names corresponding to columns of X.
n_repeats (int, default: 10) – Number of permutation repeats per feature. More repeats yield more stable estimates but increase runtime linearly.
- Returns:
Mean importance values indexed by feature name, sorted descending. Positive values indicate features whose permutation hurts the model score; negative values suggest noise features.
- Return type:
Example
>>> from sklearn.ensemble import RandomForestClassifier
>>> import numpy as np
>>> np.random.seed(42)
>>> X = np.random.randn(300, 4)
>>> y = (X[:, 0] + 0.3 * X[:, 2] > 0).astype(int)
>>> rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
>>> imp = feature_importance_mda(rf, X, y, ['mom', 'vol', 'size', 'val'])
>>> imp.iloc[0] > 0  # top feature has positive importance
True
Notes
MDA is model-agnostic and works with any estimator that exposes a score method. Correlated features share importance: permuting one leaves its correlated partner to compensate, so both appear less important than they truly are.
References
Breiman (2001), “Random Forests”
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 8
See also
feature_importance_mdi – Faster but biased impurity-based importance.
wraquant.ml.pipeline.feature_importance_shap – SHAP-based importance.
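The permutation procedure itself is short enough to sketch with NumPy. This is not the library's code: `permutation_importance_sketch` and `r2_score` are hypothetical helpers, ordinary least squares stands in for any fitted estimator, and R² is assumed as the score.

```python
import numpy as np

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def permutation_importance_sketch(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in R^2 when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = r2_score(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link to y
            drops.append(baseline - r2_score(y, predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances

rng = np.random.default_rng(42)
X_train = rng.normal(size=(400, 3))
y_train = 2.0 * X_train[:, 0] + 0.5 * X_train[:, 2] + rng.normal(scale=0.1, size=400)
X_test = rng.normal(size=(200, 3))
y_test = 2.0 * X_test[:, 0] + 0.5 * X_test[:, 2] + rng.normal(scale=0.1, size=200)

# Ordinary least squares stands in for any fitted estimator.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
predict = lambda X: X @ coef

importance = permutation_importance_sketch(predict, X_test, y_test)
ranking = np.argsort(importance)[::-1]
print(ranking, importance)
```

Scoring on a held-out set, as here, keeps the importance estimate from rewarding features the model has merely memorised.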
- sequential_feature_selection(model, X, y, n_features=5, direction='forward', cv=5)[source]¶
Sequential (forward / backward) feature selection.
Use sequential feature selection when you want to find a compact subset of features that maximises predictive performance. Forward selection greedily adds the best feature at each step; backward selection starts with all features and removes the least useful.
- Parameters:
model (Any) – A scikit-learn-compatible estimator.
n_features (int, default: 5) – Number of features to select.
direction (Literal['forward', 'backward'], default: 'forward') – Selection direction. Forward is faster when n_features is small relative to total features; backward is faster when you want to drop only a few.
cv (int, default: 5) – Number of cross-validation folds.
- Returns:
Selected feature names (if X is a DataFrame) or column indices.
- Return type:
Example
>>> from sklearn.linear_model import Ridge
>>> import numpy as np, pandas as pd
>>> np.random.seed(42)
>>> X = pd.DataFrame(np.random.randn(200, 6),
...                  columns=['f1','f2','f3','f4','f5','f6'])
>>> y = X['f1'] * 2 + X['f3'] + np.random.randn(200) * 0.1
>>> selected = sequential_feature_selection(Ridge(), X, y, n_features=2)
>>> len(selected)
2
See also
feature_importance_mdi – Impurity-based ranking (faster, less rigorous).
feature_importance_mda – Permutation-based ranking.
- class FinancialPipeline[source]¶
Bases: object
Sklearn Pipeline wrapper that enforces chronological splitting.
Standard sklearn Pipeline + cross_val_score uses K-fold splits that ignore time ordering, so training folds contain observations from after the test fold and leak future information into training. FinancialPipeline wraps an sklearn Pipeline and replaces all cross-validation with purged K-fold that respects time ordering and applies an embargo window to prevent information leakage through overlapping labels.
- Parameters:
steps (list[tuple[str, Any]]) – List of (name, transform) tuples defining the pipeline, identical to the steps parameter of sklearn.pipeline.Pipeline.
n_splits (int, default: 5) – Number of folds for purged K-fold cross-validation.
embargo_pct (float, default: 0.01) – Fraction of total samples to embargo after each test fold, preventing label leakage from overlapping targets.
Example
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.linear_model import Ridge
>>> import numpy as np
>>> X = np.random.randn(500, 5)
>>> y = X @ np.array([1, 0.5, 0, 0, 0]) + np.random.randn(500) * 0.1
>>> pipe = FinancialPipeline(
...     steps=[('scaler', StandardScaler()), ('ridge', Ridge())],
...     n_splits=5,
... )
>>> result = pipe.fit_evaluate(X, y)
>>> len(result['fold_scores']) == 5
True
References
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 7
- fit(X, y)[source]¶
Fit the pipeline on the full dataset.
- Parameters:
- Returns:
Self, for method chaining.
- Return type:
- walk_forward_backtest(model, X, y, train_size=252, test_size=21, step_size=21, expanding=True)[source]¶
Full walk-forward ML backtest with PnL tracking.
Walk-forward validation is the gold standard for evaluating ML models in finance because it mirrors real trading: train on historical data, predict the next period, observe actual outcome, then advance.
- Why walk-forward instead of standard cross-validation?
Standard K-Fold CV ignores time ordering (and is often run with shuffling), so most folds train on observations that come after the test period, letting the model “peek” at future data. In finance, this creates a large upward bias in performance estimates. Walk-forward enforces strict temporal ordering: the model only ever trains on data that would have been available at the time of prediction.
The function supports both expanding windows (training set grows over time, using all available history) and rolling windows (fixed-size training window that slides forward). Expanding windows are preferred when you believe the data-generating process is stable; rolling windows are better when you expect structural breaks or regime changes.
- Parameters:
model (Any) – A scikit-learn-compatible estimator with fit and predict.
y (Series | ndarray) – Target vector (typically forward returns for PnL calculation).
train_size (int, default: 252) – Number of training observations in the initial window.
test_size (int, default: 21) – Number of test observations per fold.
step_size (int, default: 21) – Number of observations to advance between folds.
expanding (bool, default: True) – If True, the training window expands over time. If False, a rolling window of fixed train_size is used.
- Returns:
predictions: np.ndarray of concatenated out-of-sample predictions,
actuals: np.ndarray of corresponding true values,
pnl: np.ndarray of per-period PnL (prediction * actual, assuming long when prediction > 0),
sharpe: float annualised Sharpe ratio of the PnL series (assuming 252 trading days),
hit_rate: float fraction of periods where prediction sign matches actual sign,
equity_curve: np.ndarray cumulative PnL.
- Return type:
Example
>>> from sklearn.linear_model import Ridge
>>> import numpy as np
>>> np.random.seed(42)
>>> X = np.random.randn(600, 5)
>>> y = X @ np.array([0.5, 0.3, 0, 0, 0]) + np.random.randn(600) * 0.5
>>> result = walk_forward_backtest(Ridge(), X, y, train_size=200, test_size=20)
>>> len(result['predictions']) > 0
True
>>> 'sharpe' in result
True
References
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 12
Bailey et al. (2014), “The Deflated Sharpe Ratio”
- feature_importance_shap(model, X, feature_names=None, max_samples=500)[source]¶
Compute SHAP-based feature importance for any sklearn model.
SHAP (SHapley Additive exPlanations) values provide a theoretically grounded decomposition of each prediction into per-feature contributions. Unlike impurity-based importance (MDI), SHAP values are consistent and account for feature interactions.
- Parameters:
model (Any) – A fitted scikit-learn-compatible estimator.
X (DataFrame | ndarray) – Feature matrix to explain (typically the test set).
feature_names (Optional[Sequence[str]], default: None) – Feature names. If None and X is a DataFrame, column names are used.
max_samples (int, default: 500) – Maximum number of samples to use for computing SHAP values. Subsampled if X has more rows than this.
- Returns:
shap_values: np.ndarray of shape (n_samples, n_features) containing per-sample SHAP values,
feature_importance: np.ndarray of shape (n_features,) giving mean absolute SHAP value per feature (sorted descending),
feature_names: list of feature names ordered by importance.
- Return type:
- Raises:
MissingDependencyError – If shap is not installed.
Example
>>> from sklearn.ensemble import RandomForestRegressor
>>> import numpy as np
>>> np.random.seed(42)
>>> X = np.random.randn(200, 5)
>>> y = X[:, 0] * 2 + X[:, 1] + np.random.randn(200) * 0.1
>>> model = RandomForestRegressor(n_estimators=50, random_state=42)
>>> model.fit(X, y)
RandomForestRegressor(n_estimators=50, random_state=42)
>>> result = feature_importance_shap(model, X)
>>> result["shap_values"].shape[1] == 5
True
References
Lundberg & Lee (2017), “A Unified Approach to Interpreting Model Predictions”
- correlation_clustering(returns, n_clusters=None, method='hierarchical')[source]¶
Cluster assets by their return correlations.
Use correlation clustering to group assets that move together, which is useful for portfolio diversification (allocate across clusters), risk management (monitor cluster concentration), and statistical arbitrage (trade within-cluster mean-reversion).
The correlation-based distance is d(i,j) = sqrt(0.5 * (1 - rho_ij)), which maps perfect correlation to distance 0 and perfect negative correlation to distance 1.
- Parameters:
returns (DataFrame) – T x N return matrix (rows = observations, columns = assets).
n_clusters (int | None, default: None) – Number of clusters. If None the optimal number is chosen automatically (silhouette score for hierarchical, or defaults to 3 for spectral).
method (Literal['hierarchical', 'spectral'], default: 'hierarchical') – Clustering algorithm. Hierarchical uses Ward linkage and produces a dendrogram-compatible linkage matrix. Spectral uses the correlation matrix as affinity and finds clusters via eigenvalue decomposition.
- Returns:
labels (np.ndarray) – Cluster assignment for each asset (0-indexed, length N). Assets with the same label belong to the same cluster.
n_clusters (int) – Number of clusters found or specified.
linkage_matrix (np.ndarray or None) – Linkage matrix (hierarchical only). Pass to scipy.cluster.hierarchy.dendrogram for visualization.
- Return type:
Example
>>> import pandas as pd, numpy as np
>>> np.random.seed(42)
>>> # 3 groups of correlated assets
>>> factor = np.random.randn(252, 3)
>>> returns = pd.DataFrame(
...     np.column_stack([factor[:, i % 3] + np.random.randn(252) * 0.5
...                      for i in range(9)]),
...     columns=[f'asset_{i}' for i in range(9)]
... )
>>> result = correlation_clustering(returns, n_clusters=3)
>>> result['n_clusters']
3
>>> len(result['labels']) == 9
True
See also
regime_clustering – Cluster time periods into regimes.
optimal_clusters – Determine optimal cluster count.
wraquant.ml.preprocessing.detoned_correlation – Remove market mode before clustering.
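The distance d(i,j) = sqrt(0.5 * (1 - rho_ij)) used by correlation_clustering can be computed directly from a correlation matrix. The sketch below uses simulated two-factor returns (made up for illustration) and stops at the distance matrix; it does not reproduce the library's clustering step.

```python
import numpy as np

rng = np.random.default_rng(0)
factor = rng.normal(size=(252, 2))
# Assets 0-1 load on the first factor, assets 2-3 on the second.
returns = np.column_stack([
    factor[:, 0] + 0.3 * rng.normal(size=252),
    factor[:, 0] + 0.3 * rng.normal(size=252),
    factor[:, 1] + 0.3 * rng.normal(size=252),
    factor[:, 1] + 0.3 * rng.normal(size=252),
])

corr = np.corrcoef(returns, rowvar=False)
# Clip guards against tiny negative values from floating-point rounding.
dist = np.sqrt(np.clip(0.5 * (1.0 - corr), 0.0, None))
print(np.round(dist, 2))
```

Assets sharing a factor end up close together, while assets on different factors sit near sqrt(0.5). If SciPy is available, the condensed form of `dist` (via `scipy.spatial.distance.squareform`) is the kind of input `scipy.cluster.hierarchy.linkage` expects.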
- regime_clustering(features, n_regimes=2, method='gmm')[source]¶
Cluster time periods into market regimes.
Use regime clustering when you want to identify distinct market states (e.g., bull/bear, risk-on/risk-off, high/low volatility) from observable features without a pre-defined model. GMM is preferred because it assigns soft probabilities to each regime; KMeans provides hard assignments only.
- Parameters:
features (DataFrame | ndarray) – Feature matrix where each row is a time observation. Common inputs include rolling volatility, returns, spreads, and VIX.
n_regimes (int, default: 2) – Number of regimes to identify (default 2, typical for risk-on/risk-off).
method (Literal['gmm', 'kmeans'], default: 'gmm') – Clustering algorithm. 'gmm' (Gaussian Mixture Model) provides probabilistic assignments; 'kmeans' provides hard assignments and is faster.
- Returns:
labels (np.ndarray) – Regime assignment for each time period (0-indexed).
n_regimes (int) – Number of regimes.
model (object) – Fitted GaussianMixture or KMeans model. For GMM, call model.predict_proba(X) to get regime probabilities.
- Return type:
Example
>>> import numpy as np, pandas as pd
>>> np.random.seed(42)
>>> vol = np.concatenate([np.random.randn(100) * 0.5 + 0.1,
...                       np.random.randn(100) * 0.5 + 0.3])
>>> features = pd.DataFrame({'vol': vol, 'vol_sq': vol ** 2})
>>> result = regime_clustering(features, n_regimes=2)
>>> result['n_regimes']
2
>>> len(result['labels']) == 200
True
See also
correlation_clustering – Cluster assets (cross-sectional).
optimal_clusters – Find the optimal number of clusters/regimes.
wraquant.regimes – HMM and Markov-switching regime detection.
- optimal_clusters(data, max_k=10, method='silhouette')[source]¶
Determine the optimal number of clusters.
Use this function before calling correlation_clustering or regime_clustering to select the number of clusters data-adaptively rather than guessing.
- Parameters:
max_k (int, default: 10) – Maximum number of clusters to evaluate (default 10).
method (Literal['silhouette', 'bic'], default: 'silhouette') – Selection criterion. 'silhouette' uses the silhouette score with KMeans (higher is better, range [-1, 1]); 'bic' uses the Bayesian Information Criterion with a Gaussian Mixture Model (lower is better). Silhouette is faster; BIC is more principled for probabilistic models.
- Returns:
Optimal number of clusters (between 2 and max_k). Use this value as n_clusters in correlation_clustering or n_regimes in regime_clustering.
- Return type:
Example
>>> import numpy as np
>>> np.random.seed(42)
>>> # Generate data with 3 natural clusters
>>> data = np.vstack([np.random.randn(50, 2) + [0, 0],
...                   np.random.randn(50, 2) + [5, 5],
...                   np.random.randn(50, 2) + [10, 0]])
>>> k = optimal_clusters(data, max_k=6)
>>> 2 <= k <= 6
True
See also
correlation_clustering – Cluster assets by correlation.
regime_clustering – Cluster time periods into regimes.
- classification_metrics(y_true, y_pred, y_prob=None)[source]¶
Compute standard classification metrics.
Use classification metrics to evaluate direction-prediction models (e.g., predicting up/down/flat labels). These metrics assess the statistical quality of the classifier independently of PnL; pair with financial_metrics for economic evaluation.
- Parameters:
- Returns:
accuracy (float) – Fraction of correct predictions.
precision (float) – Macro-averaged precision (how many predicted positives are actually positive).
recall (float) – Macro-averaged recall (how many actual positives are captured).
f1 (float) – Macro-averaged F1 score (harmonic mean of precision and recall).
log_loss (float, only if y_prob given) – Cross-entropy loss. Lower is better; measures calibration quality.
auc (float, only if y_prob given, binary only) – Area under the ROC curve. 0.5 = random, 1.0 = perfect.
- Return type:
Example
>>> import numpy as np
>>> y_true = np.array([1, 0, 1, 1, 0, 1])
>>> y_pred = np.array([1, 0, 0, 1, 0, 1])
>>> metrics = classification_metrics(y_true, y_pred)
>>> metrics['accuracy']
0.8333333333333334
>>> metrics['f1'] > 0.5
True
See also
financial_metrics – PnL-based evaluation of directional predictions.
backtest_predictions – Full backtest with transaction costs.
- financial_metrics(y_true, y_pred, returns)[source]¶
Compute finance-specific evaluation metrics from predictions.
Use financial metrics to evaluate whether a model’s predictions translate into actual trading profits. A model can have high accuracy but poor financial performance if it is right on small moves and wrong on large moves. These metrics directly measure economic value.
The predicted labels are interpreted as position signals: 1 for long, -1 for short, 0 for flat.
- Parameters:
- Returns:
strategy_return (float) – Cumulative strategy return (sum of signal * return).
sharpe (float) – Annualised Sharpe ratio (252 trading days). Values above 1.0 are generally considered good; above 2.0 is excellent.
hit_rate (float) – Fraction of periods where predicted sign matches actual sign. A hit rate above 0.5 is necessary but not sufficient for profitability.
profit_factor (float) – Gross profit / gross loss. Values above 1.0 indicate a profitable strategy; above 2.0 is strong.
- Return type:
Example
>>> import numpy as np
>>> y_true = np.array([1, -1, 1, 1, -1])
>>> y_pred = np.array([1, -1, -1, 1, 1])
>>> returns = np.array([0.02, -0.01, 0.015, 0.005, -0.02])
>>> metrics = financial_metrics(y_true, y_pred, returns)
>>> metrics['hit_rate']
0.6
>>> metrics['sharpe'] != 0
True
See also
classification_metrics – Standard ML classification metrics.
backtest_predictions – Full backtest with transaction costs.
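The four quantities can be worked through by hand on a short signal series. This is a sketch, not the library's code: it assumes a zero risk-free rate, a sample standard deviation (ddof=1), and a hit rate measured against the sign of the realised return, any of which the actual implementation may define differently.

```python
import numpy as np

signals = np.array([1, -1, 1, 1, -1, 1])            # +1 long, -1 short
returns = np.array([0.02, 0.01, -0.01, 0.03, -0.02, 0.01])

strategy = signals * returns                        # per-period strategy return
strategy_return = strategy.sum()                    # additive cumulative return
hit_rate = np.mean(np.sign(signals) == np.sign(returns))

gross_profit = strategy[strategy > 0].sum()
gross_loss = -strategy[strategy < 0].sum()
profit_factor = gross_profit / gross_loss

# Annualise with 252 periods per year; zero risk-free rate assumed.
sharpe = np.sqrt(252) * strategy.mean() / strategy.std(ddof=1)
print(strategy_return, hit_rate, profit_factor, sharpe)
```

Here the hit rate is only 4/6 while the profit factor is 4.0, because the two losing periods are small relative to the winners.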
- learning_curve(model, X, y, train_sizes=None, cv=5)[source]¶
Generate a learning curve for a model.
Use learning curves to diagnose whether a model suffers from high bias (underfitting) or high variance (overfitting). If training and test scores converge at a low value, the model is too simple. If there is a large gap between training and test scores, the model is overfitting and more data or regularisation is needed.
- Parameters:
- Returns:
train_sizes (np.ndarray) – Absolute number of training samples at each point.
train_scores (np.ndarray, shape (len(sizes), cv)) – Training scores at each size/fold. Plot the mean across folds to visualize training performance.
test_scores (np.ndarray, shape (len(sizes), cv)) – Test scores at each size/fold. The gap between train and test mean scores indicates overfitting.
- Return type:
Example
>>> from sklearn.linear_model import Ridge
>>> import numpy as np
>>> X = np.random.randn(300, 5)
>>> y = X @ [1, 0.5, 0, 0, 0] + np.random.randn(300) * 0.1
>>> result = learning_curve(Ridge(), X, y, cv=3)
>>> result['train_sizes'].shape[0]  # 10 points by default
10
See also
classification_metrics – Evaluate classification quality.
financial_metrics – Evaluate economic value of predictions.
- backtest_predictions(predictions, returns, cost_bps=10)[source]¶
Backtest a prediction signal against actual returns.
Use backtest_predictions as a quick sanity check of a model’s economic value before building a full backtest. It applies realistic transaction costs (proportional to position changes) and computes key performance metrics including Sharpe, max drawdown, and turnover.
- Parameters:
predictions (Series | ndarray) – Predicted position signals (e.g. 1, 0, -1). The signal is applied as a position: signal * return.
returns (Series | ndarray) – Actual period returns corresponding to each prediction.
cost_bps (float, default: 10) – Transaction cost in basis points applied on each position change (default 10 bps). For equities, 5-10 bps is typical; for futures, 1-3 bps.
- Returns:
gross_returns (np.ndarray) – Per-period strategy returns before costs.
net_returns (np.ndarray) – Per-period strategy returns after costs.
cumulative_return (float) – Total cumulative net return. Positive = profitable.
sharpe (float) – Annualised Sharpe ratio of net returns. Above 1.0 is generally good; above 2.0 is excellent.
max_drawdown (float) – Maximum peak-to-trough decline in cumulative PnL. Always negative or zero.
turnover (float) – Mean absolute position change per period. Higher turnover means higher transaction costs.
- Return type:
Example
>>> import numpy as np
>>> preds = np.array([1, 1, -1, 1, -1, 0, 1])
>>> rets = np.array([0.01, -0.005, -0.02, 0.015, 0.01, 0.005, 0.008])
>>> result = backtest_predictions(preds, rets, cost_bps=10)
>>> result['cumulative_return'] != 0
True
>>> result['max_drawdown'] <= 0
True
See also
financial_metrics – Quick financial metrics without transaction costs.
wraquant.ml.pipeline.walk_forward_backtest – Walk-forward backtest.
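How proportional costs, drawdown, and turnover interact can be seen on a five-period example. The sketch assumes the book starts flat, that costs are charged on the absolute position change, and that PnL accumulates additively; the library may make different choices.

```python
import numpy as np

positions = np.array([1, 1, -1, 1, 0])
returns = np.array([0.01, -0.005, -0.02, 0.015, 0.01])
cost_bps = 10

gross = positions * returns
# Absolute position change per period; assumes a flat starting position.
trades = np.abs(np.diff(positions, prepend=0))
net = gross - trades * cost_bps / 1e4

equity = np.cumsum(net)                        # additive PnL (assumption)
running_peak = np.maximum.accumulate(equity)
max_drawdown = np.min(equity - running_peak)   # <= 0 by construction
turnover = trades.mean()
print(net.sum(), max_drawdown, turnover)
```

Flipping from long to short counts as a position change of 2, so reversals cost twice as much as entries or exits.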
- lstm_forecast(series, seq_length=20, hidden_dim=64, n_layers=2, dropout=0.1, n_epochs=50, lr=0.001, train_ratio=0.8, batch_size=32)[source]¶
Forecast a financial time series using an LSTM network.
Long Short-Term Memory networks are recurrent neural networks capable of learning long-range dependencies in sequential data. In finance, LSTMs are used to capture complex temporal patterns in price, volume, and return series that linear models miss.
The function auto-creates overlapping input/target sequences from the raw time series, splits into train/test sets chronologically (no shuffle to avoid lookahead bias), trains the model, and returns predictions on the test set.
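The windowing and chronological split described above can be sketched as follows. `make_sequences` is a hypothetical helper, and the one-step-ahead target is an assumption; the internal normalisation step is omitted.

```python
import numpy as np

def make_sequences(series, seq_length):
    """Return (X, y): each row of X holds seq_length past values, y the next one."""
    series = np.asarray(series, dtype=float)
    n_windows = len(series) - seq_length
    X = np.stack([series[i:i + seq_length] for i in range(n_windows)])
    y = series[seq_length:]
    return X, y

series = np.arange(100, dtype=float)           # toy series: 0, 1, ..., 99
X, y = make_sequences(series, seq_length=10)

split = int(len(X) * 0.8)                      # chronological split, no shuffle
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]
print(X.shape, X_train.shape, X_test.shape)
```

Because the windows overlap, the first test windows still contain inputs that overlap the training period; only the targets are strictly later.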
- When to use:
Use LSTM for multi-step forecasting when you have >1000 observations and suspect non-linear temporal dependencies. Works well for return prediction, volatility forecasting, and spread modeling.
- Mathematical background:
- At each time step t, the LSTM cell computes:
f_t = sigma(W_f [h_{t-1}, x_t] + b_f) (forget gate)
i_t = sigma(W_i [h_{t-1}, x_t] + b_i) (input gate)
o_t = sigma(W_o [h_{t-1}, x_t] + b_o) (output gate)
c_t = f_t * c_{t-1} + i_t * tanh(W_c [h_{t-1}, x_t] + b_c)
h_t = o_t * tanh(c_t)
The cell state c_t acts as a conveyor belt, allowing gradients to flow across many time steps without vanishing.
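A single LSTM step following the gate equations above can be written in a few lines of NumPy. This is an illustrative sketch with random weights, not the PyTorch model the function trains; the dictionary layout of `W` and `b` is an assumption made for readability.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W[k] has shape (hidden, hidden + input), b[k] shape (hidden,)."""
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])         # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])         # input gate
    o_t = sigmoid(W["o"] @ z + b["o"])         # output gate
    c_t = f_t * c_prev + i_t * np.tanh(W["c"] @ z + b["c"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
hidden, n_in = 4, 1
W = {k: rng.normal(scale=0.5, size=(hidden, hidden + n_in)) for k in "fioc"}
b = {k: np.zeros(hidden) for k in "fioc"}

h, c = np.zeros(hidden), np.zeros(hidden)
for value in [0.01, -0.02, 0.015]:             # three time steps of a toy series
    h, c = lstm_step(np.array([value]), h, c, W, b)
print(h, c)
```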
- Parameters:
series (Series | ndarray) – Univariate time series (e.g., log returns, prices, spreads).
seq_length (int, default: 20) – Number of look-back time steps for each input sequence.
hidden_dim (int, default: 64) – Number of hidden units in each LSTM layer.
n_layers (int, default: 2) – Number of stacked LSTM layers.
dropout (float, default: 0.1) – Dropout probability between LSTM layers (applied only when n_layers > 1).
n_epochs (int, default: 50) – Number of training epochs.
lr (float, default: 0.001) – Learning rate for the Adam optimizer.
train_ratio (float, default: 0.8) – Fraction of data used for training (the rest is used for testing). The split is chronological – no shuffling.
batch_size (int, default: 32) – Mini-batch size for training.
- Returns:
predictions: np.ndarray of test-set predictions,
actuals: np.ndarray of actual test values,
train_losses: list of per-epoch training losses,
model: the trained torch.nn.Module.
- Return type:
- Raises:
ImportError – If PyTorch is not installed.
Example
>>> import numpy as np
>>> returns = np.cumsum(np.random.randn(500) * 0.01)
>>> result = lstm_forecast(returns, seq_length=10, n_epochs=20)
>>> result["predictions"].shape
(80,)
Financial time series are notoriously noisy; LSTM is prone to overfitting on noise. Use dropout, early stopping, and validation.
Chronological train/test split is critical to avoid lookahead bias.
Normalisation (handled internally) is essential for gradient stability.
References
Hochreiter & Schmidhuber (1997), “Long Short-Term Memory”
Fischer & Krauss (2018), “Deep learning with long short-term memory networks for financial market predictions”
- transformer_forecast(series, seq_length=20, d_model=64, n_heads=4, n_encoder_layers=2, dim_feedforward=128, dropout=0.1, n_epochs=50, lr=0.001, train_ratio=0.8, batch_size=32)[source]¶
Forecast a financial time series using a Transformer encoder.
Transformer models use self-attention to capture dependencies at any distance in the input sequence, unlike RNNs which process sequentially. This makes them especially effective at discovering long-range patterns such as seasonality, lead-lag relationships, and regime persistence in financial data.
- When to use:
Use Transformers when you have sufficient data (>2000 observations) and suspect that long-range dependencies matter. They often outperform LSTMs on longer sequences but require more data and compute.
- Mathematical background:
- Self-attention computes:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
where Q, K, V are linear projections of the input. Multi-head attention runs h parallel attention heads and concatenates:
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W_O
- Positional encoding injects order information:
PE(pos, 2i) = sin(pos / 10000^{2i/d_model})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})
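Scaled dot-product attention and the sinusoidal positional encoding can both be sketched in NumPy. For brevity the example uses the input itself as Q, K, and V (no learned projections, single head) and assumes an even d_model; it illustrates the formulas rather than the trained model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # rows sum to 1
    return weights @ V, weights

def positional_encoding(seq_length, d_model):
    pos = np.arange(seq_length)[:, None]
    two_i = np.arange(0, d_model, 2)[None, :]  # the "2i" in the formula
    angle = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_length, d_model))
    pe[:, 0::2] = np.sin(angle)                # even dimensions
    pe[:, 1::2] = np.cos(angle)                # odd dimensions
    return pe

rng = np.random.default_rng(0)
seq_length, d_model = 5, 8
x = rng.normal(size=(seq_length, d_model)) + positional_encoding(seq_length, d_model)
out, weights = scaled_dot_product_attention(x, x, x)
print(out.shape, weights.sum(axis=1))
```

The attention weight matrix has shape (seq_length, seq_length), which is the source of the quadratic memory cost in sequence length.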
- Parameters:
seq_length (int, default: 20) – Number of look-back time steps.
d_model (int, default: 64) – Embedding dimension (must be divisible by n_heads).
n_heads (int, default: 4) – Number of attention heads.
n_encoder_layers (int, default: 2) – Number of Transformer encoder layers.
dim_feedforward (int, default: 128) – Hidden dimension in the feedforward sub-layers.
dropout (float, default: 0.1) – Dropout probability.
n_epochs (int, default: 50) – Number of training epochs.
lr (float, default: 0.001) – Learning rate for Adam.
train_ratio (float, default: 0.8) – Fraction of data for training.
batch_size (int, default: 32) – Mini-batch size.
- Returns:
predictions: np.ndarray of test-set predictions,
actuals: np.ndarray of actual test values,
train_losses: list of per-epoch training losses,
model: the trained torch.nn.Module.
- Return type:
- Raises:
ImportError – If PyTorch is not installed.
Example
>>> import numpy as np
>>> prices = np.cumsum(np.random.randn(600) * 0.01) + 100
>>> result = transformer_forecast(prices, seq_length=15, n_epochs=10)
>>> len(result["predictions"]) > 0
True
Transformers are data-hungry; on small datasets (<500 obs) they will overfit severely.
Quadratic memory in sequence length: keep seq_length reasonable (< 256 for typical financial data).
No inherent notion of order without positional encoding.
References
Vaswani et al. (2017), “Attention Is All You Need”
Li et al. (2019), “Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting”
- autoencoder_features(X, latent_dim=8, hidden_dim=64, n_epochs=50, lr=0.001, batch_size=32, beta=1.0)[source]¶
Extract latent features using a Variational Autoencoder (VAE).
A VAE learns a compressed, continuous latent representation of high-dimensional input features. In finance, this is valuable for:
Regime detection: Cluster the latent codes to find market states.
Anomaly detection: High reconstruction error flags unusual market conditions (flash crashes, liquidity crises).
Feature compression: Reduce hundreds of technical indicators to a handful of latent factors.
- When to use:
Use when you have a wide feature matrix (>20 features) and want to discover latent structure, detect anomalies, or reduce dimensionality in a non-linear way that PCA cannot capture.
- Mathematical background:
- The VAE optimises the Evidence Lower Bound (ELBO):
L = E_q[log p(x|z)] - beta * KL(q(z|x) || p(z))
where q(z|x) = N(mu(x), sigma^2(x)) is the encoder, p(x|z) is the decoder, and p(z) = N(0, I) is the prior. The KL term regularises the latent space to be smooth and continuous.
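The loss that is minimised in practice is the negative of this ELBO. A minimal NumPy sketch follows, assuming a Gaussian decoder (so the reconstruction term reduces to squared error up to constants) and the closed-form KL divergence between a diagonal Gaussian and N(0, I). `negative_elbo` is a hypothetical helper name.

```python
import numpy as np

def negative_elbo(x, x_recon, mu, log_var, beta=1.0):
    """Squared-error reconstruction plus beta * KL(q(z|x) || N(0, I)), batch mean."""
    recon = np.mean(np.sum((x - x_recon) ** 2, axis=1))
    # Closed form for a diagonal Gaussian against a standard normal prior.
    kl = np.mean(-0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1))
    return recon + beta * kl, recon, kl

x = np.zeros((2, 3))
x_recon = np.zeros((2, 3))          # perfect reconstruction
mu = np.ones((2, 3))                # encoder mean shifted away from the prior
log_var = np.zeros((2, 3))          # unit variance
loss, recon, kl = negative_elbo(x, x_recon, mu, log_var, beta=2.0)
print(loss, recon, kl)
```

With unit variance and a mean of 1 in each of three latent dimensions, the KL term is 1.5 per sample, so beta=2 gives a total loss of 3.0 even though reconstruction is perfect.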
- Parameters:
X (DataFrame | ndarray) – Feature matrix of shape (n_samples, n_features).
latent_dim (int, default: 8) – Dimensionality of the latent space.
hidden_dim (int, default: 64) – Hidden layer size in encoder/decoder.
n_epochs (int, default: 50) – Training epochs.
lr (float, default: 0.001) – Learning rate.
batch_size (int, default: 32) – Mini-batch size.
beta (float, default: 1.0) – Weight on the KL divergence term. beta=1 is standard VAE; beta<1 gives more reconstruction accuracy; beta>1 forces more disentangled representations.
- Returns:
latent_features: np.ndarray of shape (n_samples, latent_dim) – the encoded representations,
reconstruction_error: np.ndarray of per-sample reconstruction MSE,
train_losses: list of per-epoch total losses,
model: the trained VAE module.
- Return type:
- Raises:
ImportError – If PyTorch is not installed.
Example
>>> import numpy as np
>>> X = np.random.randn(500, 30)  # 30 features
>>> result = autoencoder_features(X, latent_dim=5, n_epochs=20)
>>> result["latent_features"].shape
(500, 5)
Normalise your features before encoding; the VAE assumes roughly standard-normal inputs for stable training.
The latent space is stochastic; for deterministic embeddings, use the mean (mu) which is what this function returns.
Reconstruction error thresholds for anomaly detection should be calibrated on clean training data.
References
Kingma & Welling (2014), “Auto-Encoding Variational Bayes”
An & Cho (2015), “Variational Autoencoder based Anomaly Detection using Reconstruction Probability”
- gru_forecast(series, seq_length=20, hidden_dim=64, n_layers=2, dropout=0.1, n_epochs=50, lr=0.001, train_ratio=0.8, batch_size=32)[source]¶
Forecast a financial time series using a GRU network.
Gated Recurrent Units are a simplified variant of LSTMs that merge the cell and hidden state, resulting in fewer parameters and faster training while achieving comparable performance on many financial forecasting tasks.
- When to use:
Use GRU as a computationally cheaper alternative to LSTM. Preferred when you have moderate-sized datasets (500-5000 observations) or need faster iteration during model development.
- Mathematical background:
- The GRU update equations at time step t:
z_t = sigma(W_z [h_{t-1}, x_t]) (update gate)
r_t = sigma(W_r [h_{t-1}, x_t]) (reset gate)
h_t_hat = tanh(W [r_t * h_{t-1}, x_t]) (candidate)
h_t = (1 - z_t) * h_{t-1} + z_t * h_t_hat
Compared to LSTM, GRU has no separate cell state and uses two gates instead of three, giving ~25% fewer parameters.
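The update equations above can be traced in a few lines of NumPy. This is a minimal sketch of a single GRU cell with bias terms omitted and random, untrained weights; it is not the PyTorch module that `gru_forecast` trains, and the names `gru_step`, `W_z`, `W_r`, and `W_h` are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU update following the equations above (bias terms omitted)."""
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ concat)  # update gate
    r_t = sigmoid(W_r @ concat)  # reset gate
    h_hat = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_hat

rng = np.random.default_rng(0)
hidden_dim, input_dim = 4, 1
# Hypothetical random weights; a trained network would learn these.
W_z, W_r, W_h = (
    rng.normal(scale=0.5, size=(hidden_dim, hidden_dim + input_dim))
    for _ in range(3)
)

# Run the cell over a short univariate sequence.
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(10, input_dim)):
    h = gru_step(x_t, h, W_z, W_r, W_h)
```

Because each new state is a convex combination of the previous state and a tanh output, every entry of `h` stays inside (-1, 1).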
- Parameters:
seq_length (int, default: 20) – Number of look-back time steps.
hidden_dim (int, default: 64) – Number of hidden units per GRU layer.
n_layers (int, default: 2) – Number of stacked GRU layers.
dropout (float, default: 0.1) – Dropout between layers (only when n_layers > 1).
n_epochs (int, default: 50) – Training epochs.
lr (float, default: 0.001) – Learning rate.
train_ratio (float, default: 0.8) – Fraction of data for training.
batch_size (int, default: 32) – Mini-batch size.
- Returns:
predictions: np.ndarray of test-set predictions
actuals: np.ndarray of actual test values
train_losses: list of per-epoch training losses
model: the trained torch.nn.Module
- Return type:
- Raises:
ImportError – If PyTorch is not installed.
Example
>>> import numpy as np
>>> vol = np.abs(np.random.randn(400)) * 0.02
>>> result = gru_forecast(vol, seq_length=10, n_epochs=15)
>>> result["predictions"].shape[0] > 0
True
Same overfitting risks as LSTM; use dropout and validation.
On very long sequences (>200 steps), Transformers may outperform GRU.
References
Cho et al. (2014), “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation”
- multivariate_lstm_forecast(features, target, seq_length=20, hidden_dim=64, n_layers=2, dropout=0.1, n_epochs=50, lr=0.001, train_ratio=0.8, batch_size=32)[source]¶
Forecast a target series using multiple input features via LSTM.
Multivariate LSTM ingests a DataFrame of features (e.g., returns of correlated assets, macro indicators, technical signals) and learns to predict a single target variable. This outperforms univariate LSTM when cross-asset signals exist – for example, when sector ETF returns lead individual stock returns, when VIX changes anticipate equity moves, or when order-flow imbalance across related instruments carries predictive information for the target.
The function normalises each feature column independently (z-score), creates multivariate look-back sequences, trains the LSTM with a chronological train/test split, and returns predictions on the held-out test set along with train and test MSE metrics.
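The preprocessing described above can be sketched as follows. This illustration assumes the z-score statistics are estimated on the training rows only (to avoid lookahead) and that each window predicts the target one step after its last row; the library's internal choices may differ, and `make_sequences` is a hypothetical helper name.

```python
import numpy as np

def make_sequences(features, target, seq_length, train_ratio=0.8):
    """Z-score the columns, build look-back windows, split chronologically."""
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    split = int(len(features) * train_ratio)

    # Per-column z-score, fitted on the training rows only (assumption).
    mu = features[:split].mean(axis=0)
    sd = features[:split].std(axis=0)
    sd[sd == 0] = 1.0
    z = (features - mu) / sd

    # Window i covers rows i .. i+seq_length-1 and predicts target[i+seq_length].
    n = len(z) - seq_length
    X = np.stack([z[i:i + seq_length] for i in range(n)])
    y = target[seq_length:]

    # Training samples are those whose target falls before the split point.
    n_train = split - seq_length
    return X[:n_train], y[:n_train], X[n_train:], y[n_train:]

rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 3))
tgt = rng.normal(size=500)
X_tr, y_tr, X_te, y_te = make_sequences(feats, tgt, seq_length=10)
```

Each sample has shape `(seq_length, n_features)`, which is the layout a batch-first LSTM expects.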
- Mathematical background:
The LSTM cell equations are the same as in lstm_forecast, but the input dimensionality is now n_features rather than 1:
x_t in R^{n_features}
f_t = sigma(W_f [h_{t-1}, x_t] + b_f)
i_t = sigma(W_i [h_{t-1}, x_t] + b_i)
o_t = sigma(W_o [h_{t-1}, x_t] + b_o)
The weight matrices W_f, W_i, W_o, W_c have input dimension n_features instead of 1, allowing the network to learn cross-feature temporal dependencies.
- Parameters:
features (DataFrame) – DataFrame of shape (T, n_features) containing the input features. All columns are used as inputs to the LSTM.
target (Series | ndarray) – Target variable of length T to predict.
seq_length (int, default: 20) – Number of look-back time steps for each input sequence.
hidden_dim (int, default: 64) – Number of hidden units in each LSTM layer.
n_layers (int, default: 2) – Number of stacked LSTM layers.
dropout (float, default: 0.1) – Dropout probability between LSTM layers (applied only when n_layers > 1).
n_epochs (int, default: 50) – Number of training epochs.
lr (float, default: 0.001) – Learning rate for the Adam optimizer.
train_ratio (float, default: 0.8) – Fraction of data used for training (chronological split).
batch_size (int, default: 32) – Mini-batch size for training.
- Returns:
predictions: np.ndarray of test-set predictions
actuals: np.ndarray of actual test values
train_losses: list of per-epoch training losses
train_mse: float MSE on the training set
test_mse: float MSE on the test set
model: the trained torch.nn.Module
- Return type:
- Raises:
ImportError – If PyTorch is not installed.
Example
>>> import numpy as np, pandas as pd
>>> np.random.seed(42)
>>> df = pd.DataFrame({
...     'asset_a': np.cumsum(np.random.randn(500) * 0.01),
...     'asset_b': np.cumsum(np.random.randn(500) * 0.01),
...     'vix': np.abs(np.random.randn(500)) * 15 + 15,
... })
>>> target = pd.Series(np.cumsum(np.random.randn(500) * 0.01))
>>> result = multivariate_lstm_forecast(df, target, seq_length=10, n_epochs=5)
>>> result["predictions"].shape[0] > 0
True
References
Hochreiter & Schmidhuber (1997), “Long Short-Term Memory”
Fischer & Krauss (2018), “Deep learning with long short-term memory networks for financial market predictions”
- temporal_fusion_transformer(features, target, seq_length=20, hidden_dim=64, n_heads=4, n_lstm_layers=1, dropout=0.1, n_epochs=50, lr=0.001, train_ratio=0.8, batch_size=32)[source]¶
Simplified Temporal Fusion Transformer for interpretable forecasting.
The Temporal Fusion Transformer was designed for interpretable multi-horizon forecasting (Lim et al., 2021). This implementation provides the core TFT components: a variable selection network that learns which input features matter, an LSTM encoder for temporal processing, multi-head attention for capturing long-range dependencies, and gated residual connections for stable gradient flow.
Unlike black-box models, TFT produces per-feature importance weights that reveal which inputs drive each prediction – critical for building trust in trading signals and satisfying model governance requirements.
- Architecture:
Variable Selection Network (VSN): A soft-attention gate over input features. Each feature is projected to hidden_dim, then a shared softmax gate selects the most relevant ones.
LSTM Encoder: Processes the selected features sequentially to capture local temporal patterns.
Multi-Head Attention: Attends over the LSTM outputs to capture long-range dependencies (e.g., monthly seasonality).
Gated Residual Network (GRN): Skip connections with gating for stable training on noisy financial data.
Output layer: Linear projection to produce the forecast.
- Parameters:
features (DataFrame) – DataFrame of shape (T, n_features) containing the input features.
seq_length (int, default: 20) – Number of look-back time steps.
hidden_dim (int, default: 64) – Dimensionality of the hidden representations.
n_heads (int, default: 4) – Number of attention heads (must divide hidden_dim).
n_lstm_layers (int, default: 1) – Number of LSTM layers in the encoder.
dropout (float, default: 0.1) – Dropout probability.
n_epochs (int, default: 50) – Number of training epochs.
lr (float, default: 0.001) – Learning rate for Adam.
train_ratio (float, default: 0.8) – Fraction of data for training (chronological split).
batch_size (int, default: 32) – Mini-batch size.
- Returns:
predictions: np.ndarray of test-set predictions
actuals: np.ndarray of actual test values
train_losses: list of per-epoch training losses
feature_importance: np.ndarray of shape (n_features,) giving the learned importance weight for each input feature (higher = more important)
feature_names: list of feature names from the input DataFrame
model: the trained torch.nn.Module
- Return type:
- Raises:
ImportError – If PyTorch is not installed.
Example
>>> import numpy as np, pandas as pd
>>> np.random.seed(42)
>>> df = pd.DataFrame({
...     'momentum': np.random.randn(500),
...     'volume': np.abs(np.random.randn(500)),
...     'spread': np.random.randn(500) * 0.1,
... })
>>> target = pd.Series(np.cumsum(np.random.randn(500) * 0.01))
>>> result = temporal_fusion_transformer(
...     df, target, seq_length=10, hidden_dim=16, n_heads=2, n_epochs=5
... )
>>> result["predictions"].shape[0] > 0
True
>>> len(result["feature_importance"]) == 3
True
References
Lim et al. (2021), “Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting”
- svm_classifier(X_train, y_train, X_test, y_test, kernel='rbf', C_range=(0.1, 1.0, 10.0), gamma_range=('scale', 0.01, 0.1), cv=5)[source]¶
Train an SVM classifier for market regime classification.
Support Vector Machines find the maximum-margin hyperplane separating classes. With the RBF kernel, SVMs can capture non-linear decision boundaries in feature space, making them effective for classifying market regimes (bull/bear/neutral) from derived features like volatility, momentum, and volume profiles.
- When to use:
Use SVM when you have a moderate number of features (5-100), moderate dataset size (500-50k), and need robust classification with good generalisation. SVMs handle high-dimensional spaces well and are resistant to overfitting when C is properly tuned.
- Mathematical background:
- SVM solves:
min_{w,b} (1/2) ||w||^2 + C * sum_i max(0, 1 - y_i(w.x_i + b))
- The RBF kernel maps inputs to infinite-dimensional space:
K(x, x’) = exp(-gamma * ||x - x’||^2)
Grid search over C (regularisation) and gamma (kernel width) selects the best hyperparameters via cross-validation.
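To make the role of gamma concrete, the RBF kernel above can be evaluated directly in NumPy. This is a small sketch, not the code path used by the SVC; `rbf_kernel` here is a local helper written for illustration.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq_dist = (
        (A ** 2).sum(axis=1)[:, None]
        + (B ** 2).sum(axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negatives caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
K = rbf_kernel(X, X, gamma=0.1)
```

Points at squared distance 1 get similarity exp(-0.1), while points at squared distance 25 get exp(-2.5). A larger gamma shrinks the neighbourhood in which two observations count as similar, which is why gamma and C are tuned jointly.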
- Parameters:
y_train (Series | ndarray) – Training labels (e.g., 1 = bull, 0 = neutral, -1 = bear).
kernel (Literal['rbf', 'linear', 'poly'], default: 'rbf') – SVM kernel function.
C_range (Sequence[float], default: (0.1, 1.0, 10.0)) – Regularisation parameter values to search.
gamma_range (Sequence[float | str], default: ('scale', 0.01, 0.1)) – Kernel coefficient values to search (ignored for linear kernel).
cv (int, default: 5) – Cross-validation folds for grid search.
- Returns:
model: fitted SVC
predictions: np.ndarray of test predictions
accuracy: float
confusion_matrix: np.ndarray
best_params: dict of best C and gamma
cv_score: float (mean CV accuracy)
- Return type:
Example
>>> import numpy as np
>>> X = np.random.randn(200, 5)
>>> y = (X[:, 0] > 0).astype(int)
>>> result = svm_classifier(X[:150], y[:150], X[150:], y[150:])
>>> result["accuracy"] > 0.5
True
Scale features before training (StandardScaler recommended).
Kernel SVM training needs up to O(n^2) memory and between O(n^2) and O(n^3) time in the number of samples n – avoid for n > 100k.
For imbalanced classes, set class_weight='balanced' on the SVC.
References
Cortes & Vapnik (1995), “Support-Vector Networks”
- random_forest_importance(X, y, feature_names=None, n_estimators=100, max_depth=5, random_state=42, task='classification')[source]¶
Rank features by importance using a Random Forest.
Random Forests aggregate many decorrelated decision trees and measure each feature’s contribution to reducing impurity (Gini for classification, variance for regression). This produces a natural feature ranking useful for selecting the most predictive signals from a large universe of technical indicators, fundamental factors, or alternative data features.
- When to use:
Use as a first-pass feature selector when you have many candidate features (>20) and want to identify which ones carry signal. Fast, non-parametric, and handles mixed feature types.
- Mathematical background:
- Mean Decrease Impurity (MDI) for feature j:
Imp(j) = sum_{t in T_j} p(t) * Delta_i(t)
where T_j is the set of tree nodes splitting on feature j, p(t) is the fraction of samples reaching node t, and Delta_i(t) is the impurity decrease. MDI is averaged over all trees in the forest.
- Parameters:
feature_names (Optional[Sequence[str]], default: None) – Feature names. If None and X is a DataFrame, column names are used.
n_estimators (int, default: 100) – Number of trees.
max_depth (int | None, default: 5) – Maximum tree depth (None for unlimited).
random_state (int, default: 42) – Random seed for reproducibility.
task (Literal['classification', 'regression'], default: 'classification') – Type of prediction task.
- Returns:
importance: pd.Series of feature importances sorted descending
model: fitted RandomForest estimator
oob_score: float (out-of-bag score if available, else None)
- Return type:
Example
>>> import numpy as np
>>> X = np.random.randn(300, 10)
>>> y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)
>>> result = random_forest_importance(X, y)
>>> result["importance"].index[0]  # top feature is likely 0
0
MDI importance is biased toward high-cardinality features; consider permutation importance (feature_importance_mda) as a complement.
Correlated features share importance, causing both to appear weaker.
References
Breiman (2001), “Random Forests”
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch.8
- gradient_boost_forecast(X_train, y_train, X_test, y_test=None, task='regression', n_estimators=200, max_depth=4, learning_rate=0.1, subsample=0.8, cv=5, feature_names=None)[source]¶
Gradient boosting for forecasting or classification.
Gradient Boosting sequentially fits weak learners (shallow trees) to the residuals of the ensemble, greedily minimising a loss function. It is the workhorse of tabular ML in quant finance – used for return prediction, alpha factor construction, default prediction, and more.
- When to use:
Use gradient boosting as your default tabular model. It handles non-linearities, feature interactions, and missing values naturally. Preferred over linear models when you have >500 samples and >5 features.
- Mathematical background:
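- At each stage m, the ensemble is updated by adding a shrunken tree:
F_m(x) = F_{m-1}(x) + nu * h_m(x)
- where nu is the learning rate and h_m is the tree that best reduces the loss:
h_m = argmin_h sum_i L(y_i, F_{m-1}(x_i) + h(x_i))
In practice h_m is fit to the negative gradient of the loss evaluated at F_{m-1} (the pseudo-residuals). For regression with squared loss, these are the ordinary residuals y_i - F_{m-1}(x_i). For classification with log-loss, they are y_i - p_i, where p_i is the current predicted probability.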
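The stagewise update can be illustrated with a toy booster built from depth-1 trees (stumps) on a single feature under squared loss. This is a teaching sketch only; `fit_stump` and `boost` are hypothetical helpers, not part of the library or of the fitted GradientBoosting estimator it returns.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-split regression stump on a 1-D feature (squared error)."""
    best = None
    for thr in np.unique(x)[:-1]:  # drop the max so the right leaf is non-empty
        left = x <= thr
        pred_l, pred_r = residual[left].mean(), residual[~left].mean()
        sse = ((residual[left] - pred_l) ** 2).sum() \
            + ((residual[~left] - pred_r) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, pred_l, pred_r)
    _, thr, pred_l, pred_r = best
    return lambda q: np.where(q <= thr, pred_l, pred_r)

def boost(x, y, n_estimators=50, nu=0.1):
    """F_m = F_{m-1} + nu * h_m, with h_m fit to the current residuals."""
    f0 = y.mean()
    F = np.full_like(y, f0)
    trees = []
    for _ in range(n_estimators):
        residual = y - F  # negative gradient of 0.5 * (y - F)^2
        h = fit_stump(x, residual)
        F = F + nu * h(x)
        trees.append(h)
    return lambda q: f0 + nu * sum(h(q) for h in trees)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 6.0, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

model = boost(x, y)
mse_const = np.mean((y - y.mean()) ** 2)
mse_boost = np.mean((y - model(x)) ** 2)
```

Each round lowers the training error relative to the constant baseline; a smaller `nu` needs more rounds to reach the same fit, which is the learning_rate / n_estimators trade-off.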
- Parameters:
y_test (Series | ndarray | None, default: None) – Test target (if provided, test metrics are computed).
task (Literal['classification', 'regression'], default: 'regression') – Prediction task.
n_estimators (int, default: 200) – Number of boosting stages.
max_depth (int, default: 4) – Maximum depth of individual trees.
learning_rate (float, default: 0.1) – Shrinkage applied to each tree’s contribution.
subsample (float, default: 0.8) – Fraction of training samples used per tree (stochastic boosting).
cv (int, default: 5) – Cross-validation folds for reporting training CV score.
feature_names (Optional[Sequence[str]], default: None) – Feature names for importance ranking.
- Returns:
model: fitted GradientBoosting estimator
predictions: np.ndarray of test predictions
feature_importance: pd.Series (sorted descending)
cv_scores: np.ndarray of cross-validation scores
test_score: float or None (R^2 for regression, accuracy for classification)
- Return type:
Example
>>> import numpy as np
>>> X = np.random.randn(300, 5)
>>> y = X[:, 0] * 2 + X[:, 1] + np.random.randn(300) * 0.5
>>> result = gradient_boost_forecast(X[:250], y[:250], X[250:], y[250:])
>>> result["test_score"] > 0
True
Overfits if n_estimators is too large; use early stopping or CV.
Sensitive to learning_rate / n_estimators trade-off.
For >100k samples, consider XGBoost/LightGBM for speed.
References
Friedman (2001), “Greedy Function Approximation: A Gradient Boosting Machine”
- gaussian_process_regression(X_train, y_train, X_test, kernel='rbf', alpha=0.01, n_restarts=5)[source]¶
Gaussian Process regression with uncertainty quantification.
Gaussian Processes (GPs) define a distribution over functions and provide both point predictions and calibrated confidence intervals. In finance, GPs are used for smooth yield-curve fitting, volatility-surface interpolation, and any setting where uncertainty matters as much as the prediction.
- When to use:
Use GP when you need uncertainty estimates (e.g., confidence bands on a yield curve) and have a small-to-moderate dataset (<5000 observations). The cubic complexity makes GPs impractical for large datasets without approximations.
- Mathematical background:
- A GP assumes f(x) ~ GP(m(x), k(x, x’)), where:
m(x) is the mean function (usually 0)
k(x, x’) is the kernel (covariance function)
- Posterior predictive at test point x*:
mu* = k(x*, X) [K + sigma^2 I]^{-1} y
sigma*^2 = k(x*, x*) - k(x*, X) [K + sigma^2 I]^{-1} k(X, x*)
where K_{ij} = k(x_i, x_j) and sigma^2 is the noise variance.
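The posterior formulas above can be evaluated directly with a Cholesky factorisation. The sketch below assumes a zero mean function, a unit-variance RBF kernel with a fixed length scale, and a known noise variance; the library instead returns a fitted GaussianProcessRegressor, so `rbf`, `gp_posterior`, `length_scale`, and `noise_var` are illustrative names only.

```python
import numpy as np

def rbf(A, B, length_scale=1.0):
    """Unit-variance RBF kernel between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_posterior(X, y, X_star, noise_var=0.01):
    """Posterior mean and standard deviation at X_star (zero prior mean)."""
    K = rbf(X, X) + noise_var * np.eye(len(X))  # K + sigma^2 I
    K_s = rbf(X_star, X)                        # k(x*, X)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # [K + sigma^2 I]^{-1} y
    mu = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    var = np.diag(rbf(X_star, X_star)) - (v ** 2).sum(axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

rng = np.random.default_rng(0)
X_train = np.linspace(0, 10, 50).reshape(-1, 1)
y_train = np.sin(X_train).ravel() + rng.normal(scale=0.1, size=50)
X_test = np.linspace(0, 10, 20).reshape(-1, 1)

mu, std = gp_posterior(X_train, y_train, X_test)
lower, upper = mu - 1.96 * std, mu + 1.96 * std
```

The Cholesky solve is the O(n^3) step, which is why exact GPs become impractical beyond a few thousand training points.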
- Parameters:
- Returns:
predictions: np.ndarray of mean predictions
std: np.ndarray of predictive standard deviations
confidence_lower: np.ndarray (mean - 1.96 * std)
confidence_upper: np.ndarray (mean + 1.96 * std)
model: fitted GaussianProcessRegressor
- Return type:
Example
>>> import numpy as np
>>> X_train = np.linspace(0, 10, 50).reshape(-1, 1)
>>> y_train = np.sin(X_train).ravel() + np.random.randn(50) * 0.1
>>> X_test = np.linspace(0, 10, 20).reshape(-1, 1)
>>> result = gaussian_process_regression(X_train, y_train, X_test)
>>> result["predictions"].shape
(20,)
>>> result["std"].shape
(20,)
Complexity is O(n^3) for training and O(n^2) per prediction.
For large datasets, use sparse GP approximations (not included here).
Kernel choice strongly affects results; try multiple kernels.
References
Rasmussen & Williams (2006), “Gaussian Processes for Machine Learning”
- isolation_forest_anomaly(returns, contamination=0.05, n_estimators=200, random_state=42)[source]¶
Detect anomalous days in return data using Isolation Forest.
Isolation Forest detects anomalies by randomly partitioning data and measuring how quickly each observation is isolated. Anomalous points (outlier returns, flash crashes, liquidity events) are isolated in fewer splits because they sit far from the bulk of the distribution.
- When to use:
Use for unsupervised anomaly detection in returns, volumes, or spreads. Works well when you do not have labelled anomalies and want to flag unusual market days for review. Robust to high-dimensional feature spaces.
- Mathematical background:
For a sample x, the anomaly score is based on the average path length E[h(x)] across the isolation trees:
s(x, n) = 2^{-E[h(x)] / c(n)}
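where c(n) is the average path length of an unsuccessful search in a binary search tree of n samples. A score close to 1 indicates an anomaly; a score well below 0.5 indicates a normal point. Note that the returned anomaly_scores use the opposite orientation (lower = more anomalous).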
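The score formula can be checked numerically. This sketch uses the normalising constant from Liu, Ting & Zhou (2008), c(n) = 2 H(n-1) - 2 (n-1) / n, with the harmonic number approximated as H(i) ≈ ln(i) + 0.5772; the helper names are illustrative, and the values are the paper's scores rather than the `anomaly_scores` output.

```python
import numpy as np

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful BST search over n samples."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = np.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """s(x, n) = 2 ** (-E[h(x)] / c(n))."""
    return 2.0 ** (-mean_path_length / c(n))

n = 256  # sub-sample size per tree
score_short = anomaly_score(2.0, n)     # isolated after about 2 splits
score_average = anomaly_score(c(n), n)  # path length equal to c(n)
score_long = anomaly_score(20.0, n)     # buried deep in the tree
```

A path length equal to c(n) yields exactly 0.5; shorter paths push the score toward 1, longer paths toward 0.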
- Parameters:
returns (Series | DataFrame | ndarray) – Return data. If 1-D, treated as a single feature; if 2-D, each column is a feature (e.g., return, volume, spread).
contamination (float, default: 0.05) – Expected fraction of anomalies in the dataset (0 < c < 0.5).
n_estimators (int, default: 200) – Number of isolation trees.
random_state (int, default: 42) – Random seed.
- Returns:
anomaly_labels: np.ndarray of -1 (anomaly) / 1 (normal)
anomaly_scores: np.ndarray of continuous anomaly scores (lower = more anomalous)
anomaly_mask: np.ndarray of bool (True for anomalies)
n_anomalies: int
model: fitted IsolationForest
- Return type:
Example
>>> import numpy as np
>>> rets = np.random.randn(500) * 0.01
>>> rets[100] = 0.15  # inject anomaly
>>> result = isolation_forest_anomaly(rets, contamination=0.02)
>>> result["anomaly_mask"][100]
True
The contamination parameter is a prior; misspecification leads to over- or under-detection.
Isolation Forest assumes anomalies are both rare and different; clustered anomalies may be missed.
For time-series anomaly detection, consider adding lagged features.
References
Liu, Ting & Zhou (2008), “Isolation Forest”
- pca_factor_model(returns, n_components=None, explained_variance_threshold=0.9)[source]¶
Build a PCA-based latent factor model from asset returns.
Principal Component Analysis extracts orthogonal linear combinations of asset returns that explain the most variance. The first PC typically captures the market factor, the second often captures a value/growth or sector rotation, and so on.
- When to use:
Use PCA factor models for dimensionality reduction in portfolio construction, risk decomposition, statistical arbitrage (pairs trading on residuals), and understanding co-movement structure.
- Mathematical background:
- Given return matrix R (T x N), PCA decomposes the covariance:
Sigma = V Lambda V^T
where Lambda = diag(lambda_1, …, lambda_N) are eigenvalues and V are eigenvectors (loadings). Factor returns are:
F = R @ V[:, :k] (T x k)
- The fraction of variance explained by the first k components:
sum(lambda_1..k) / sum(lambda_1..N)
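The decomposition above can be reproduced with a plain eigendecomposition. The sketch below uses synthetic returns driven by one common "market" factor (an assumption made purely for illustration) and shows how loadings, factor returns, and the explained-variance ratio relate; it does not call the fitted PCA object the library returns.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, k = 252, 20, 3

# Synthetic returns: one common factor plus idiosyncratic noise.
market = rng.normal(scale=0.01, size=(T, 1))
R = market + rng.normal(scale=0.005, size=(T, N))
R = R - R.mean(axis=0)  # demean each asset

# Sigma = V Lambda V^T, eigenvalues sorted in descending order.
Sigma = np.cov(R, rowvar=False)
eigvals, V = np.linalg.eigh(Sigma)  # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

F = R @ V[:, :k]                                  # factor returns, T x k
explained = eigvals[:k].sum() / eigvals.sum()     # variance explained by k PCs
first_pc_share = eigvals[0] / eigvals.sum()       # share of the first PC
```

The factor returns are uncorrelated by construction, and with a strong common driver the first component carries most of the variance, consistent with its interpretation as a market factor.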
- Parameters:
returns (DataFrame) – T x N return matrix (rows = observations, columns = assets).
n_components (int | None, default: None) – Number of principal components. If None, selects enough to explain explained_variance_threshold of total variance.
explained_variance_threshold (float, default: 0.9) – Minimum cumulative explained variance ratio when n_components is None.
- Returns:
loadings: pd.DataFrame of shape (N, n_components) – asset loadings on each factor
factor_returns: pd.DataFrame of shape (T, n_components) – time series of factor returns
explained_variance_ratio: np.ndarray of per-component variance ratios
cumulative_variance: np.ndarray of cumulative variance ratios
n_components: int
model: fitted PCA object
- Return type:
Example
>>> import numpy as np, pandas as pd
>>> returns = pd.DataFrame(np.random.randn(252, 20) * 0.01)
>>> result = pca_factor_model(returns, n_components=3)
>>> result["factor_returns"].shape
(252, 3)
PCA is linear; for non-linear dimensionality reduction, use the VAE in wraquant.ml.deep.autoencoder_features.
Eigenvalues from small samples are noisy; use Random Matrix Theory denoising (wraquant.ml.preprocessing.denoised_correlation) first.
Components are not guaranteed to have economic meaning.
References
Jolliffe (2002), “Principal Component Analysis”
Avellaneda & Lee (2010), “Statistical arbitrage in the US equities market”
- online_linear_regression(X, y, forgetting_factor=1.0, initial_covariance=100.0)[source]¶
Recursive Least Squares (RLS) online linear regression.
Processes observations one at a time, updating regression coefficients with each new data point. This is the online analogue of ordinary least squares and is fundamental to adaptive signal processing in finance: tracking time-varying betas, hedge ratios, and factor loadings.
- When to use:
Use online regression when you need to:
- Track a hedge ratio that drifts over time (pairs trading).
- Estimate time-varying factor exposures (rolling beta).
- Build adaptive trading signals that respond to regime changes.
- Process streaming tick data without re-estimating from scratch.
- Mathematical background:
- Recursive Least Squares maintains:
K_t = P_{t-1} x_t / (lambda + x_t^T P_{t-1} x_t)
w_t = w_{t-1} + K_t (y_t - x_t^T w_{t-1})
P_t = (1/lambda) * (P_{t-1} - K_t x_t^T P_{t-1})
where:
- w_t is the coefficient vector at time t
- P_t is the inverse of the exponentially weighted Gram matrix of the inputs (proportional to the covariance of the coefficient estimates)
- K_t is the gain vector (the analogue of the Kalman gain)
- lambda is the forgetting factor (1 = no forgetting, <1 = down-weight old data)
With lambda = 1 and infinite data, RLS converges to OLS. With lambda < 1, the effective window length is approximately 1 / (1 - lambda) observations.
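The recursion is short enough to write out in NumPy. This is a minimal sketch of textbook RLS under the stated update equations, with zero initial coefficients assumed; `rls`, `lam`, and `c` are illustrative names and the library's implementation may differ in details.

```python
import numpy as np

def rls(X, y, lam=1.0, c=100.0):
    """Recursive least squares with forgetting factor lam and P_0 = c * I."""
    T, p = X.shape
    w = np.zeros(p)          # assumed initial coefficients
    P = c * np.eye(p)
    coefs = np.empty((T, p))
    preds = np.empty(T)
    for t in range(T):
        x = X[t]
        preds[t] = x @ w                  # one-step-ahead: uses w_{t-1}
        Px = P @ x
        K = Px / (lam + x @ Px)           # gain vector K_t
        w = w + K * (y[t] - preds[t])     # coefficient update
        P = (P - np.outer(K, Px)) / lam   # K_t x_t^T P_{t-1}, using symmetric P
        coefs[t] = w
    return coefs, preds

rng = np.random.default_rng(42)
T = 500
X = rng.normal(size=(T, 2))
# True coefficients switch from [1.0, 0.5] to [0.5, 1.0] halfway through.
beta_true = np.where(np.arange(T)[:, None] < 250, [1.0, 0.5], [0.5, 1.0])
y = (X * beta_true).sum(axis=1) + rng.normal(scale=0.1, size=T)

coefs, preds = rls(X, y, lam=0.98)
```

With lam = 0.98 the effective window is about 50 observations, so the estimates sit near the first regime just before the break and near the second regime by the end of the sample.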
- Parameters:
X (DataFrame | ndarray) – Feature matrix of shape (T, p) where T is the number of observations and p is the number of features.
forgetting_factor (float, default: 1.0) – Forgetting factor lambda in (0, 1]. Values close to 1 give long memory; values like 0.99 give an effective window of ~100 observations. Use 0.95-0.99 for fast-adapting signals.
initial_covariance (float, default: 100.0) – Scalar multiplier for the initial covariance matrix P_0 = c * I. Larger values make the filter more responsive early on.
- Returns:
coefficients: np.ndarray of shape (T, p) – the time-varying coefficient vector at each step
predictions: np.ndarray of shape (T,) – one-step-ahead predictions (each y_hat_t uses coefficients estimated from data up to t-1)
residuals: np.ndarray of shape (T,) – prediction errors
final_coefficients: np.ndarray of shape (p,) – the coefficients at the last time step
- Return type:
Example
>>> import numpy as np
>>> np.random.seed(42)
>>> T = 500
>>> X = np.random.randn(T, 2)
>>> # True coefficients shift halfway through
>>> beta_true = np.where(np.arange(T)[:, None] < 250,
...                      [1.0, 0.5], [0.5, 1.0])
>>> y = np.sum(X * beta_true, axis=1) + np.random.randn(T) * 0.1
>>> result = online_linear_regression(X, y, forgetting_factor=0.98)
>>> result["coefficients"].shape
(500, 2)
>>> # After convergence, coefficients should track the true values
>>> np.abs(result["final_coefficients"][0] - 0.5) < 0.3
True
The forgetting factor is critical: too low causes noisy estimates, too high causes slow adaptation to regime changes.
RLS assumes the noise variance is constant; for heteroskedastic data, consider the exponential weighted variant or Kalman filters.
Initial predictions (before the filter converges) should be discarded in any evaluation.
References
Haykin (2002), “Adaptive Filter Theory”, Ch. 13 (RLS)
Montana et al. (2009), “Flexible least squares for temporal data mining and statistical arbitrage”
- exponential_weighted_regression(X, y, halflife=63.0, min_periods=30)[source]¶
Exponentially weighted linear regression favouring recent data.
At each time step t, fits a weighted least squares regression where observation weights decay exponentially into the past. This produces smooth, adaptive coefficient estimates that naturally respond to regime changes without the abrupt sensitivity of rolling-window OLS.
- When to use:
Use exponential weighted regression when:
- You want smoother coefficient paths than RLS.
- The halflife of predictive relationships is approximately known (e.g., 63 trading days ~ 3 months).
- You need an interpretable “recency bias” in your factor model.
- Mathematical background:
- At time t, the weight for observation s (where s <= t) is:
w_s = exp(-ln(2) * (t - s) / halflife)
- The weighted regression solves:
beta_t = (X_t^T W_t X_t)^{-1} X_t^T W_t y_t
where W_t = diag(w_0, w_1, …, w_t). This is equivalent to EWMA smoothing of the sufficient statistics X^T X and X^T y.
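For a single time step t, the weighted regression above reduces to a few lines. The sketch below builds the weights from the halflife formula and solves the weighted normal equations; `ew_weights` and `ew_regression_at` are hypothetical helper names, and no intercept is included.

```python
import numpy as np

def ew_weights(t, halflife):
    """w_s = exp(-ln(2) * (t - s) / halflife) for s = 0..t."""
    s = np.arange(t + 1)
    return np.exp(-np.log(2) * (t - s) / halflife)

def ew_regression_at(X, y, t, halflife):
    """beta_t = (X^T W X)^{-1} X^T W y using observations 0..t."""
    w = ew_weights(t, halflife)
    X_t, y_t = X[: t + 1], y[: t + 1]
    Xw = X_t * w[:, None]  # rows of X scaled by their weights, i.e. W_t X_t
    return np.linalg.solve(Xw.T @ X_t, Xw.T @ y_t)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = X @ np.array([1.0, 0.5]) + rng.normal(scale=0.1, size=300)

weights = ew_weights(120, halflife=60)
beta = ew_regression_at(X, y, t=299, halflife=60)
```

An observation exactly one halflife in the past carries weight 0.5, and the most recent observation carries weight 1. Repeating the solve for every t gives the full coefficient path, at a higher cost than the recursive RLS update.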
- Parameters:
halflife (float, default: 63.0) – Halflife in observations. After halflife observations, the weight of a past data point has decayed to 50%. Common financial values: 21 (1 month), 63 (1 quarter), 252 (1 year).
min_periods (int, default: 30) – Minimum number of observations before producing a coefficient estimate. Earlier entries are filled with NaN.
- Returns:
coefficients: np.ndarray of shape (T, p) – time-varying coefficients (NaN for the first min_periods - 1 rows)
predictions: np.ndarray of shape (T,) – fitted values using contemporaneous coefficients
residuals: np.ndarray of shape (T,) – prediction errors
final_coefficients: np.ndarray of shape (p,) – last estimated coefficients
- Return type:
Example
>>> import numpy as np
>>> np.random.seed(0)
>>> T = 300
>>> X = np.random.randn(T, 2)
>>> beta_true = np.column_stack([
...     np.linspace(1, 0, T),  # drifting coefficient
...     np.full(T, 0.5),       # constant coefficient
... ])
>>> y = np.sum(X * beta_true, axis=1) + np.random.randn(T) * 0.1
>>> result = exponential_weighted_regression(X, y, halflife=60)
>>> result["coefficients"].shape
(300, 2)
Halflife selection is subjective; cross-validate if possible.
For very short halflives (<10), the effective sample size is small and estimates become noisy.
Assumes homoskedastic errors; for heteroskedastic data, consider EWMA-weighted robust regression.
Numerically less stable than RLS for ill-conditioned problems.
References
Ross et al. (2012), “Exponentially weighted moving average charts for detecting concept drift”
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 17
Features¶
Feature engineering functions for transforming raw market data into predictive signals.
Feature engineering utilities for financial machine learning.
All functions in this module use only numpy and pandas – no external TA libraries are required.
- rolling_features(data, windows=(5, 10, 21, 63))[source]¶
Generate rolling statistical features for each window length.
Use rolling features as a general-purpose feature engineering step before training ML models on time-series data. The rolling statistics capture time-varying moments that can signal changes in trend (mean), risk (std), asymmetry (skew), and tail behaviour (kurtosis).
For every window the following statistics are computed: mean, std, skew, kurtosis, min, and max.
- Parameters:
data (Series | DataFrame) – Numeric time-series data. If a DataFrame is passed, features are generated independently for each column.
windows (Sequence[int], default: (5, 10, 21, 63)) – Rolling-window sizes. The defaults correspond roughly to 1-week, 2-week, 1-month, and 1-quarter horizons.
- Returns:
DataFrame whose columns are named {col}_{stat}_w{window} (or {stat}_w{window} when data is a Series). The number of feature columns equals n_cols * len(windows) * 6. Early rows contain NaN where the window has insufficient data.
- Return type:
Example
>>> import pandas as pd, numpy as np
>>> np.random.seed(0)
>>> returns = pd.Series(np.random.randn(100) * 0.01, name='ret')
>>> feats = rolling_features(returns, windows=(5, 21))
>>> feats.columns.tolist()[:3]
['mean_w5', 'std_w5', 'skew_w5']
>>> feats.shape[1]  # 6 stats * 2 windows
12
See also
return_features: Lagged and cumulative return features.
volatility_features: Realised volatility and vol-of-vol features.
- return_features(prices, lags=(1, 2, 3, 5, 10, 21))[source]¶
Compute lagged and cumulative return features from a price series.
Use return features as inputs to ML models predicting future returns or direction. Lagged returns capture momentum and mean-reversion signals at multiple horizons; cumulative returns capture trend strength.
- Parameters:
- Returns:
DataFrame with columns ret_lag{l} (log return l periods ago, a momentum/mean-reversion signal) and cum_ret_{l} (cumulative log return over the last l periods, a trend signal) for each lag l. Early rows are NaN.
- Return type:
Example
>>> import pandas as pd, numpy as np
>>> prices = pd.Series([100, 101, 102, 100, 103, 105, 104],
...                    name='close')
>>> feats = return_features(prices, lags=(1, 3))
>>> list(feats.columns)
['ret_lag1', 'cum_ret_1', 'ret_lag3', 'cum_ret_3']
>>> feats['cum_ret_3'].iloc[-1] > 0  # cumulative 3-period return
True
See also
rolling_features: Rolling statistical features.
technical_features: Technical analysis features (RSI, MACD, etc.).
- technical_features(high, low, close, volume=None)[source]¶
Compute common technical analysis features for ML pipelines.
Use these features as inputs to ML models when you want to capture classic technical signals without depending on the full wraquant.ta module. Combines momentum (RSI, MACD), volatility (ATR, Bollinger), and optionally volume (OBV) into a single DataFrame.
Computes RSI, MACD histogram, Bollinger Band %B, and ATR. If volume is provided, On-Balance Volume (OBV) is also included.
- Parameters:
- Returns:
DataFrame with columns:
rsi: Relative Strength Index (0-100). Values above 70 indicate overbought; below 30 indicate oversold.
macd_hist: MACD histogram. Positive values indicate bullish momentum; negative values indicate bearish.
bb_pctb: Bollinger Band %B (0-1 range typically). Values above 1 mean price is above the upper band.
atr: Average True Range. Higher values indicate more volatile price action.
obv (optional): On-Balance Volume. Rising OBV confirms an uptrend.
- Return type:
Example
>>> import pandas as pd, numpy as np
>>> np.random.seed(0)
>>> n = 100
>>> close = pd.Series(100 + np.cumsum(np.random.randn(n) * 0.5))
>>> high = close + np.abs(np.random.randn(n) * 0.3)
>>> low = close - np.abs(np.random.randn(n) * 0.3)
>>> feats = technical_features(high, low, close)
>>> list(feats.columns)
['rsi', 'macd_hist', 'bb_pctb', 'atr']
See also
return_features: Lagged and cumulative return features.
volatility_features: Realised volatility features.
- ta_features(high, low, close, volume=None, include=None)[source]¶
Generate ML features using wraquant’s full technical analysis library.
Unlike technical_features (which uses inline implementations), this function imports directly from wraquant.ta to leverage the full 263-indicator library. This bridges the ml and ta modules so that ML pipelines can access production-quality TA indicators without manual wiring.
By default, computes a curated set of the most ML-relevant indicators: RSI, MACD histogram, Bollinger Band %B, ATR, and optionally OBV. Use the include parameter to restrict the output to a subset of these indicators.
- Parameters:
high (Series) – High prices.
low (Series) – Low prices.
close (Series) – Close prices.
volume (Series | None, default: None) – Trade volume (optional). Required for volume-based indicators (OBV, MFI).
include (Optional[Sequence[str]], default: None) – Subset of indicators to include. Options: 'rsi', 'macd', 'bbands', 'atr', 'obv'. If None, includes all available indicators.
- Return type:
- Returns:
DataFrame with one column per indicator, indexed like the input series. Column names are descriptive (e.g., ta_rsi, ta_macd_hist, ta_bb_pctb, ta_atr, ta_obv).
Example
>>> import pandas as pd, numpy as np
>>> np.random.seed(0)
>>> n = 100
>>> close = pd.Series(100 + np.cumsum(np.random.randn(n) * 0.5))
>>> high = close + np.abs(np.random.randn(n) * 0.3)
>>> low = close - np.abs(np.random.randn(n) * 0.3)
>>> feats = ta_features(high, low, close)
>>> 'ta_rsi' in feats.columns
True
See also
technical_features: Inline implementation (no ta/ dependency).
wraquant.ta.momentum.rsi: Full RSI implementation.
wraquant.ta.momentum.macd: Full MACD implementation.
- volatility_features(returns, windows=(5, 10, 21, 63))[source]¶
Compute realised-volatility-related features.
Use volatility features to capture the current risk environment and volatility regime. Realised volatility is the most important feature in many financial ML models because volatility clusters (GARCH effect) and predicts future volatility better than returns predict future returns.
- Parameters:
- Returns:
Columns:
realized_vol_w{w}: Annualised rolling standard deviation (sqrt(252) scaling). Interpretation: a value of 0.20 means ~20% annualised volatility.
vol_of_vol_w{w}: Rolling std of the rolling vol. High values indicate unstable volatility (vol-of-vol regime).
vol_ratio_w{w1}_w{w2}: Ratio of short-window vol to long-window vol. Values > 1 indicate vol is spiking (risk-off signal); values < 1 indicate vol compression.
- Return type:
Example
>>> import pandas as pd, numpy as np
>>> np.random.seed(0)
>>> rets = pd.Series(np.random.randn(200) * 0.01, name='daily_ret')
>>> feats = volatility_features(rets, windows=(5, 21))
>>> 'realized_vol_w5' in feats.columns
True
>>> 'vol_ratio_w5_w21' in feats.columns
True
See also
rolling_features: General rolling statistical features.
wraquant.vol: Full volatility modelling (GARCH, stochastic vol).
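The three column families described for volatility_features can be sketched with plain pandas. This is a minimal illustration, not wraquant's implementation: the helper name `volatility_feature_sketch` is hypothetical, and reusing the same window `w` for the vol-of-vol calculation is an assumption.

```python
import numpy as np
import pandas as pd


def volatility_feature_sketch(returns: pd.Series, windows=(5, 21)) -> pd.DataFrame:
    """Hypothetical re-creation of the documented columns (not wraquant's code)."""
    out = pd.DataFrame(index=returns.index)
    for w in windows:
        # Annualised rolling standard deviation with sqrt(252) scaling.
        vol = returns.rolling(w).std() * np.sqrt(252)
        out[f"realized_vol_w{w}"] = vol
        # Rolling std of the rolling vol (assumption: same window length w).
        out[f"vol_of_vol_w{w}"] = vol.rolling(w).std()
    short, long = min(windows), max(windows)
    # Short-window vol divided by long-window vol; > 1 means vol is rising.
    out[f"vol_ratio_w{short}_w{long}"] = (
        out[f"realized_vol_w{short}"] / out[f"realized_vol_w{long}"]
    )
    return out


rng = np.random.default_rng(0)
rets = pd.Series(rng.normal(0.0, 0.01, 200))
vol_feats = volatility_feature_sketch(rets)
```

The leading rows are NaN until each rolling window is full, so downstream code would normally drop or mask them before model fitting.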
- microstructure_features(high, low, close, volume)[source]¶
Compute market-microstructure features.
Use microstructure features to capture liquidity conditions, information asymmetry, and trading activity. These are particularly valuable for short-horizon alpha models and execution-aware strategies where liquidity predicts future returns or trading costs.
- Parameters:
- Returns:
Columns:
amihud_illiq: Amihud illiquidity ratio (21-day rolling mean of |return| / dollar_volume). Higher values indicate less liquid, more price-impactful markets.
kyle_lambda: Kyle’s lambda (21-day rolling OLS slope of |price change| on signed sqrt-volume). Measures the price impact per unit of informed flow. Higher values suggest more information asymmetry.
log_volume: Natural log of volume. Smooths the skewed volume distribution for ML model consumption.
volume_ma_ratio: Current volume / 21-day moving average. Values > 1 indicate above-average activity (potential event).
dollar_volume: Price * volume. Absolute measure of trading activity and liquidity.
- Return type:
Example
>>> import pandas as pd, numpy as np
>>> np.random.seed(0)
>>> n = 100
>>> close = pd.Series(100 + np.cumsum(np.random.randn(n) * 0.5))
>>> high = close + np.abs(np.random.randn(n) * 0.3)
>>> low = close - np.abs(np.random.randn(n) * 0.3)
>>> volume = pd.Series(np.random.randint(1_000_000, 5_000_000, n))
>>> feats = microstructure_features(high, low, close, volume)
>>> list(feats.columns)
['amihud_illiq', 'kyle_lambda', 'log_volume', 'volume_ma_ratio', 'dollar_volume']
References
Amihud (2002), “Illiquidity and stock returns”
Kyle (1985), “Continuous Auctions and Insider Trading”
See also
technical_features: Price-based technical indicators.
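As a rough illustration of the liquidity columns above, the sketch below computes the Amihud ratio, log volume, the volume/moving-average ratio, and dollar volume with pandas. The helper `liquidity_sketch` is hypothetical and deliberately omits kyle_lambda, whose rolling regression depends on implementation details not given here.

```python
import numpy as np
import pandas as pd


def liquidity_sketch(close: pd.Series, volume: pd.Series, window: int = 21) -> pd.DataFrame:
    """Hypothetical helper covering a subset of the documented columns."""
    dollar_volume = close * volume
    abs_return = close.pct_change().abs()
    return pd.DataFrame({
        # Rolling mean of |return| / dollar volume (Amihud, 2002).
        "amihud_illiq": (abs_return / dollar_volume).rolling(window).mean(),
        "log_volume": np.log(volume),
        "volume_ma_ratio": volume / volume.rolling(window).mean(),
        "dollar_volume": dollar_volume,
    })


# Same price path traded at two volume levels: the thinner market should
# show exactly twice the illiquidity.
close = pd.Series(100.0 * 1.01 ** np.arange(60))
thin = liquidity_sketch(close, pd.Series(np.full(60, 1_000.0)))
deep = liquidity_sketch(close, pd.Series(np.full(60, 2_000.0)))
```

Because the ratio divides by dollar volume, zero-volume bars produce infinities; a production pipeline would need to handle those explicitly.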
- label_fixed_horizon(returns, horizon=5, threshold=0.0)[source]¶
Label future return direction over a fixed horizon.
Use fixed-horizon labelling as the simplest way to create supervised learning targets for directional prediction. Each observation is labelled based on the cumulative return over the next horizon periods. This is the standard approach for “will the price go up or down over the next N days?” classification.
- Parameters:
returns (Series) – Period (e.g. daily) returns.
horizon (int, default: 5) – Number of periods to accumulate forward returns (default 5, i.e. one trading week).
threshold (float, default: 0.0) – If threshold > 0, three labels are produced: 1 (up beyond threshold), 0 (flat), -1 (down beyond threshold). If threshold == 0, binary labels (1/0) are produced where 1 means positive cumulative return.
- Returns:
Integer labels aligned to the original index. The last horizon rows will be NaN (no future data available).
- Return type:
Example
>>> import pandas as pd, numpy as np
>>> rets = pd.Series([0.01, -0.005, 0.02, 0.01, -0.03, 0.015, 0.005])
>>> labels = label_fixed_horizon(rets, horizon=3, threshold=0.0)
>>> labels.iloc[0]  # sum of rets[1:4] = -0.005+0.02+0.01 > 0
1
Notes
Fixed-horizon labelling does not adapt to volatility. In high-vol regimes, the threshold is hit more often; in low-vol regimes, most labels become 0. For volatility-adaptive labels, use label_triple_barrier.
See also
label_triple_barrier: Volatility-adaptive labelling (Lopez de Prado).
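The labelling rule can be sketched in a few lines of pandas. This is an illustrative reading of the description, not the library source: `fixed_horizon_sketch` is a hypothetical name, and summing simple returns (rather than compounding them) follows the comment in the example above.

```python
import numpy as np
import pandas as pd


def fixed_horizon_sketch(returns: pd.Series, horizon: int = 5, threshold: float = 0.0) -> pd.Series:
    """Hypothetical helper: label by the sum of the next `horizon` returns."""
    # rolling(h).sum() at t+h covers returns t+1..t+h; shifting by -h aligns it to t.
    fwd = returns.rolling(horizon).sum().shift(-horizon)
    if threshold > 0:
        values = np.where(fwd > threshold, 1.0, np.where(fwd < -threshold, -1.0, 0.0))
    else:
        values = np.where(fwd > 0, 1.0, 0.0)
    labels = pd.Series(values, index=returns.index)
    labels[fwd.isna()] = np.nan  # no future data for the last `horizon` rows
    return labels


rets = pd.Series([0.01, -0.005, 0.02, 0.01, -0.03, 0.015, 0.005])
binary = fixed_horizon_sketch(rets, horizon=3)
ternary = fixed_horizon_sketch(rets, horizon=3, threshold=0.008)
```

Note that the current bar's own return is excluded from the forward sum, which is what keeps the label free of information already visible in the features.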
- label_triple_barrier(close, upper=None, lower=None, max_holding=10)[source]¶
Triple-barrier labelling (Lopez de Prado).
Use triple-barrier labelling when you want targets that adapt to market conditions. Unlike fixed-horizon labels, this method defines a profit-taking barrier (upper), a stop-loss barrier (lower), and a maximum holding period (vertical). Whichever barrier is hit first determines the label. This produces cleaner labels in volatile markets because the barriers can be scaled by volatility.
For each bar the method sets three barriers:
Upper: price rises by upper fraction -> label = 1
Lower: price falls by lower fraction -> label = -1
Vertical: max_holding bars elapse -> label = sign of return
If upper or lower is None, the corresponding horizontal barrier is disabled.
- Parameters:
close (Series) – Close price series.
upper (float | None, default: None) – Fractional distance for the upper barrier (e.g. 0.02 for 2 %).
lower (float | None, default: None) – Fractional distance for the lower barrier (positive value; e.g. 0.02 for -2 %).
max_holding (int, default: 10) – Maximum holding period in bars (vertical barrier).
- Returns:
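Integer labels in {-1, 0, 1} aligned to the input index. 1 = profit-taking barrier hit first (bullish), -1 = stop-loss barrier hit first (bearish). If the vertical barrier is reached first, the label is the sign of the return over the holding period, so 0 occurs only when that return is exactly zero. The last max_holding entries may be NaN.
- Return type: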
Example
>>> import pandas as pd
>>> close = pd.Series([100, 101, 102, 103, 100, 97, 98, 99, 100, 101])
>>> labels = label_triple_barrier(close, upper=0.03, lower=0.03, max_holding=5)
>>> labels.iloc[0]  # price rises 3% by bar 3 (103/100 - 1 = 0.03)
1
Notes
In practice, set upper and lower proportional to recent volatility (e.g., upper = lower = daily_vol * sqrt(max_holding)). This makes the labels regime-adaptive.
References
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 3
See also
label_fixed_horizon: Simpler fixed-horizon labelling.
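The first-touch logic of the three barriers can be written as a short loop. The sketch below is an assumption-laden illustration (the name `triple_barrier_sketch` is hypothetical, barriers are checked on close-to-close returns only, and the upper barrier is tested before the lower one on each bar); the library may differ in these details.

```python
import numpy as np
import pandas as pd


def triple_barrier_sketch(close: pd.Series, upper=None, lower=None, max_holding: int = 10) -> pd.Series:
    """Hypothetical helper: label each bar by whichever barrier is touched first."""
    prices = close.to_numpy(dtype=float)
    n = len(prices)
    labels = np.full(n, np.nan)
    for t in range(n - max_holding):
        # Returns relative to the entry bar over the holding window.
        path = prices[t + 1 : t + 1 + max_holding] / prices[t] - 1.0
        label = np.sign(path[-1])  # vertical barrier: sign of the final return
        for r in path:
            if upper is not None and r >= upper:
                label = 1.0  # profit-taking barrier touched first
                break
            if lower is not None and r <= -lower:
                label = -1.0  # stop-loss barrier touched first
                break
        labels[t] = label
    return pd.Series(labels, index=close.index)


close = pd.Series([100, 102, 104, 103, 99, 95, 96, 96, 96, 96], dtype=float)
tb_labels = triple_barrier_sketch(close, upper=0.03, lower=0.03, max_holding=3)
```

In this toy series bar 0 reaches +4 % on the second step (label 1), bar 3 drops about 3.9 % on the first step (label -1), and bar 6 sees flat prices, so the vertical barrier yields 0.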
- interaction_features(data, columns=None)[source]¶
Create pairwise interaction terms between features.
Use interaction features when you suspect that predictive power lies in the combination of features rather than individual signals. For example, momentum * volatility captures whether momentum is occurring in a high- or low-volatility environment, which may predict returns differently.

For each pair of selected columns (A, B), computes:
A_x_B: element-wise product (captures multiplicative relationships)
A_div_B: element-wise ratio A / B (captures relative magnitudes)
- Parameters:
- Returns:
DataFrame containing all pairwise interaction features, with column names like col1_x_col2 and col1_div_col2.
- Return type:
Example
>>> import pandas as pd, numpy as np
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
>>> result = interaction_features(df, columns=['a', 'b'])
>>> 'a_x_b' in result.columns
True
>>> 'a_div_b' in result.columns
True
- cross_asset_features(asset, benchmark, windows=(10, 21, 63))[source]¶
Compute cross-asset relationship features.
Use cross-asset features to capture how an asset co-moves with a benchmark or related instrument. Rolling correlation and beta detect changing exposures (useful for regime detection); relative strength identifies momentum divergence between the asset and its benchmark.
Given an asset return series and a benchmark (or related asset) return series, computes rolling correlation, rolling beta, and relative strength for each window.
- Parameters:
- Returns:
DataFrame with columns:
- rolling_corr_w{w}: rolling Pearson correlation
- rolling_beta_w{w}: rolling OLS beta (cov / var of benchmark)
- relative_strength_w{w}: cumulative return ratio (asset / benchmark) over the window
- Return type:
Example
>>> import pandas as pd, numpy as np
>>> np.random.seed(0)
>>> asset = pd.Series(np.random.randn(200) * 0.01, name='asset')
>>> bench = pd.Series(np.random.randn(200) * 0.01, name='bench')
>>> result = cross_asset_features(asset, bench, windows=[10, 21])
>>> 'rolling_corr_w10' in result.columns
True
>>> 'rolling_beta_w21' in result.columns
True
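For a single window, the three cross-asset columns can be approximated with pandas rolling operations. This is a sketch under assumptions: `cross_asset_sketch` is a hypothetical name, and relative strength is interpreted here as the ratio of compounded gross returns over the window, which the description does not pin down.

```python
import numpy as np
import pandas as pd


def cross_asset_sketch(asset: pd.Series, benchmark: pd.Series, window: int = 21) -> pd.DataFrame:
    """Hypothetical helper for one window length."""

    def growth(r: pd.Series) -> pd.Series:
        # Compounded gross return over the window (assumed definition).
        return (1.0 + r).rolling(window).apply(np.prod, raw=True)

    return pd.DataFrame({
        f"rolling_corr_w{window}": asset.rolling(window).corr(benchmark),
        # OLS beta = cov(asset, benchmark) / var(benchmark).
        f"rolling_beta_w{window}": asset.rolling(window).cov(benchmark)
        / benchmark.rolling(window).var(),
        f"relative_strength_w{window}": growth(asset) / growth(benchmark),
    })


rng = np.random.default_rng(0)
bench = pd.Series(rng.normal(0.0, 0.01, 200))
asset = 2.0 * bench  # a perfectly levered copy: beta 2, correlation 1
ca_feats = cross_asset_sketch(asset, bench)
```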
- regime_features(regime_probabilities, regime_labels=None)[source]¶
Create features from regime probabilities or labels.
Use regime features when you have upstream regime detection (e.g., HMM, Markov-switching) and want to feed regime state into downstream ML models. Regime duration and transition probability are predictive because regimes tend to persist (duration) but eventually break down (transition probability rises before a switch).
Given regime probabilities (e.g., from an HMM or Markov-switching model), constructs features useful for downstream ML models: current regime identity, regime duration (how many consecutive periods in the current regime), and estimated transition probability (rolling mean of regime changes).
- Parameters:
regime_probabilities (DataFrame) – DataFrame where each column is the probability of a regime (e.g., columns ['bull', 'bear'] with probabilities summing to 1).
regime_labels (Series | None, default: None) – Hard regime labels. If None, the most probable regime at each step is used (argmax of the probability columns).
- Returns:
DataFrame with columns:
- current_regime: integer label of the current regime
- regime_duration: number of consecutive periods in the current regime
- regime_change: binary indicator (1 if regime changed)
- transition_prob_w{w}: rolling mean of regime changes for w in [5, 10, 21]
- one column per regime probability from the input
- Return type:
Example
>>> import pandas as pd, numpy as np
>>> np.random.seed(42)
>>> p = np.random.dirichlet([5, 2], size=100)  # single draw, so each row sums to 1
>>> probs = pd.DataFrame({'bull': p[:, 0], 'bear': p[:, 1]})
>>> result = regime_features(probs)
>>> 'current_regime' in result.columns
True
>>> 'regime_duration' in result.columns
True
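Regime duration and the rolling transition estimate follow from a run-length calculation on the hard labels. The sketch below is a hypothetical helper (`regime_sketch`), not the library code, and treats the first observation as "no change" by assumption.

```python
import numpy as np
import pandas as pd


def regime_sketch(probs: pd.DataFrame, windows=(5, 10, 21)) -> pd.DataFrame:
    """Hypothetical helper: derive duration and change features from probabilities."""
    # Hard label = index of the most probable regime column.
    current = pd.Series(probs.to_numpy().argmax(axis=1), index=probs.index)
    change = (current != current.shift()).astype(int)
    change.iloc[0] = 0  # assumption: the first bar is not counted as a switch
    # Each regime run gets its own id; counting within a run gives the duration.
    duration = current.groupby(change.cumsum()).cumcount() + 1
    out = pd.DataFrame({
        "current_regime": current,
        "regime_duration": duration,
        "regime_change": change,
    })
    for w in windows:
        out[f"transition_prob_w{w}"] = change.rolling(w).mean()
    return out.join(probs)


bull = pd.Series([0.9, 0.8, 0.3, 0.2, 0.1, 0.7])
probs = pd.DataFrame({"bull": bull, "bear": 1.0 - bull})
regime_feats = regime_sketch(probs, windows=(3,))
```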
Preprocessing¶
Purged CV, fractional differentiation, and correlation matrix denoising.
Financial data preprocessing utilities.
Implements purged cross-validation, fractional differentiation, and random-matrix-theory denoising – all central to the Advances in Financial Machine Learning workflow (Lopez de Prado).
- purged_kfold(X, y, n_splits=5, embargo_pct=0.01)[source]¶
Purged K-Fold cross-validation.
Use purged K-fold instead of standard K-fold whenever your labels overlap in time (e.g., forward returns computed over a window). Standard K-fold leaks future information because a training sample’s label may depend on prices that appear in the test set. Purging removes an embargo zone after each test fold to break this leakage.
Ensures that training observations that immediately follow a test observation are removed (embargo) so that information cannot leak through overlapping labels.
- Parameters:
X (DataFrame | ndarray) – Feature matrix (only its length is used).
y (Series | ndarray) – Target vector (only its length is used).
n_splits (int, default: 5) – Number of folds.
embargo_pct (float, default: 0.01) – Fraction of total samples to embargo after each test fold. For daily data with 5-day forward labels, 0.01 embargoes ~2.5 days on a 252-sample dataset.
- Yields:
tuple[np.ndarray, np.ndarray] – (train_indices, test_indices) for each fold.
- Return type:
Example
>>> import numpy as np
>>> X = np.random.randn(500, 3)
>>> y = np.random.randn(500)
>>> folds = list(purged_kfold(X, y, n_splits=5, embargo_pct=0.02))
>>> len(folds)
5
>>> train_idx, test_idx = folds[0]
>>> len(train_idx) + len(test_idx) < 500  # embargo removes some samples
True
References
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 7
See also
combinatorial_purged_kfold: Generates all C(n, k) purged splits.
wraquant.ml.pipeline.FinancialPipeline: Pipeline that uses purged K-fold.
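The embargo mechanics can be illustrated with an index-only generator. This is a minimal sketch assuming contiguous, equally sized test folds and an embargo of `int(n_samples * embargo_pct)` observations; `purged_kfold_sketch` is a hypothetical name and the library's rounding may differ.

```python
import numpy as np


def purged_kfold_sketch(n_samples: int, n_splits: int = 5, embargo_pct: float = 0.01):
    """Hypothetical helper: contiguous folds with an embargo after each test fold."""
    indices = np.arange(n_samples)
    embargo = int(n_samples * embargo_pct)
    for test_idx in np.array_split(indices, n_splits):
        test_end = test_idx[-1] + 1
        train_mask = np.ones(n_samples, dtype=bool)
        train_mask[test_idx] = False
        # Drop the observations immediately after the test fold.
        train_mask[test_end : test_end + embargo] = False
        yield indices[train_mask], test_idx


pk_folds = list(purged_kfold_sketch(500, n_splits=5, embargo_pct=0.02))
first_train, first_test = pk_folds[0]
last_train, last_test = pk_folds[-1]
```

With 500 samples and a 2 % embargo, the first fold trains on 390 observations (500 minus 100 test minus 10 embargoed), while the last fold keeps all 400 because nothing follows its test block.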
- combinatorial_purged_kfold(X, y, n_splits=5, n_test_splits=2, embargo_pct=0.01)[source]¶
Combinatorial purged K-Fold cross-validation.
Use combinatorial purged K-fold when you need more backtest paths than standard purged K-fold provides. By choosing n_test_splits groups as the test set from n_splits total groups, this generates C(n_splits, n_test_splits) distinct train/test splits – each with an embargo to prevent information leakage.

Generates all C(n_splits, n_test_splits) train/test combinations, applying an embargo after each test group to prevent leakage.
- Parameters:
- Yields:
tuple[np.ndarray, np.ndarray] – (train_indices, test_indices) for each combination.
- Return type:
Example
>>> import numpy as np
>>> X = np.random.randn(500, 3)
>>> y = np.random.randn(500)
>>> folds = list(combinatorial_purged_kfold(X, y, n_splits=5, n_test_splits=2))
>>> len(folds)  # C(5, 2) = 10
10
References
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 12
See also
purged_kfold: Simpler purged K-fold with n_splits folds.
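Enumerating the test-group combinations is a direct use of `itertools.combinations`. The following sketch is illustrative only: `cpcv_sketch` is a hypothetical name, and the embargo length is assumed to be `int(n_samples * embargo_pct)` applied after every test group.

```python
from itertools import combinations

import numpy as np


def cpcv_sketch(n_samples: int, n_splits: int = 5, n_test_splits: int = 2, embargo_pct: float = 0.01):
    """Hypothetical helper: all C(n_splits, n_test_splits) train/test splits."""
    groups = np.array_split(np.arange(n_samples), n_splits)
    embargo = int(n_samples * embargo_pct)
    for combo in combinations(range(n_splits), n_test_splits):
        train_mask = np.ones(n_samples, dtype=bool)
        for g in combo:
            group_end = groups[g][-1] + 1
            train_mask[groups[g]] = False
            # Embargo the samples that directly follow this test group.
            train_mask[group_end : group_end + embargo] = False
        test_idx = np.concatenate([groups[g] for g in combo])
        yield np.flatnonzero(train_mask), test_idx


cpcv_folds = list(cpcv_sketch(500, n_splits=5, n_test_splits=2))
```

When two test groups are adjacent, the embargo after the first falls inside the second, so the number of training samples varies slightly between combinations.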
- fractional_differentiation(series, d=0.5, threshold=1e-05)[source]¶
Fractionally differentiate a time series.
Use fractional differentiation to make a price or factor series stationary (required by many ML models) while retaining as much memory (long-range dependence) as possible. Standard first differencing (d=1) makes the series stationary but destroys all memory. Fractional differencing with d=0.3-0.5 achieves stationarity while preserving most of the signal.
Applies the fractional differentiation operator of order d (Hosking, 1981) to obtain a (near-)stationary series while preserving long-range memory.
The operator is defined as:
(1 - B)^d = sum_{k=0}^{inf} C(d,k) * (-B)^k
where B is the backshift operator and C(d,k) are the binomial-like weights.
- Parameters:
series (Series) – Input time series (e.g., log prices).
d (float, default: 0.5) – Fractional differentiation order (0 < d < 1 for partial differentiation; d = 1 is the standard first difference). Start with d=0.5: if the ADF test rejects the unit-root null at the desired significance level, try smaller values and keep the smallest d that still rejects; if it does not reject, increase d.
threshold (float, default: 1e-05) – Minimum absolute weight to retain. Smaller values use more lagged observations but increase computational cost.
- Returns:
Fractionally differentiated series (initial rows where the full convolution is not available are dropped). Test stationarity with an ADF test; if non-stationary, increase d.
- Return type:
Example
>>> import pandas as pd, numpy as np
>>> np.random.seed(42)
>>> prices = pd.Series(100 + np.cumsum(np.random.randn(300) * 0.5),
...                    name='close')
>>> frac_diff = fractional_differentiation(prices, d=0.4)
>>> len(frac_diff) < len(prices)  # initial rows dropped
True
>>> frac_diff.std() > 0  # non-trivial output
True
References
Hosking (1981), “Fractional Differencing”
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 5
See also
denoised_correlation: Random Matrix Theory denoising.
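The weights of the operator (1 - B)^d satisfy the recurrence w_0 = 1, w_k = -w_{k-1} (d - k + 1) / k, which makes a fixed-width implementation short. The sketch below is an assumption about how the truncation works (weights are dropped once their magnitude falls below `threshold`); the function names are hypothetical and this is not the library source.

```python
import numpy as np
import pandas as pd


def frac_diff_weights(d: float, threshold: float = 1e-5, max_lags: int = 10_000) -> np.ndarray:
    """Hypothetical helper: binomial-like weights, truncated below `threshold`."""
    weights = [1.0]
    for k in range(1, max_lags):
        w = -weights[-1] * (d - k + 1) / k
        if abs(w) < threshold:
            break
        weights.append(w)
    return np.array(weights)


def frac_diff_sketch(series: pd.Series, d: float = 0.5, threshold: float = 1e-5) -> pd.Series:
    """Hypothetical helper: fixed-width fractional differencing."""
    w = frac_diff_weights(d, threshold)
    width = len(w)
    values = series.to_numpy(dtype=float)
    # Dot the weights with the most recent `width` values, newest first.
    out = [
        np.dot(w, values[t - width + 1 : t + 1][::-1])
        for t in range(width - 1, len(values))
    ]
    return pd.Series(out, index=series.index[width - 1 :])


rng = np.random.default_rng(42)
prices = pd.Series(100 + np.cumsum(rng.normal(0.0, 0.5, 300)))
fd_half = frac_diff_sketch(prices, d=0.5, threshold=1e-3)
fd_one = frac_diff_sketch(prices, d=1.0)
```

Setting d = 1 collapses the weights to [1, -1], so the sketch reduces to an ordinary first difference. A looser threshold is used for d = 0.5 here because the weights decay slowly and a tight threshold can require more lags than a short series provides.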
- denoised_correlation(returns, n_components=None)[source]¶
Denoise a correlation matrix using Random Matrix Theory.
Use denoised correlation before portfolio optimization or clustering to remove noise eigenvalues that arise from finite-sample estimation. When T/N (observations/assets) is not large, the sample correlation matrix contains substantial noise. RMT denoising replaces eigenvalues consistent with random noise (Marchenko-Pastur distribution) with their average, producing a cleaner matrix that leads to more stable portfolio weights.
Eigenvalues that fall within the Marchenko-Pastur distribution are replaced by their average, shrinking noise while preserving signal.
- Parameters:
- Returns:
Denoised correlation matrix of shape (N, N). The matrix is symmetric, positive semi-definite, and has unit diagonal. Use it in place of returns.corr() for portfolio optimization.
- Return type:
Example
>>> import pandas as pd, numpy as np
>>> np.random.seed(42)
>>> returns = pd.DataFrame(np.random.randn(252, 10) * 0.01)
>>> clean_corr = denoised_correlation(returns)
>>> clean_corr.shape
(10, 10)
>>> np.allclose(np.diag(clean_corr), 1.0)  # unit diagonal
True
Notes
The Marchenko-Pastur upper bound is:
lambda_+ = sigma^2 * (1 + sqrt(N/T))^2
Eigenvalues above this threshold are retained as “signal”; those below are replaced.
References
Laloux et al. (1999), “Noise dressing of financial correlation matrices”
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 2
See also
detoned_correlation: Remove the market mode from a correlation matrix.
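The eigenvalue-replacement step can be sketched directly with NumPy. This illustration assumes sigma^2 = 1 in the Marchenko-Pastur bound and rescales the result to a unit diagonal; `denoise_corr_sketch` is a hypothetical name and the library may fit sigma^2 or choose the cut-off differently.

```python
import numpy as np


def denoise_corr_sketch(returns: np.ndarray) -> np.ndarray:
    """Hypothetical helper: average the eigenvalues below the MP upper bound."""
    T, N = returns.shape
    corr = np.corrcoef(returns, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    lambda_plus = (1.0 + np.sqrt(N / T)) ** 2  # assumes sigma^2 = 1
    noise = eigvals <= lambda_plus
    cleaned = eigvals.copy()
    if noise.any():
        # Replace noise eigenvalues by their average (preserves the trace).
        cleaned[noise] = eigvals[noise].mean()
    denoised = (eigvecs * cleaned) @ eigvecs.T
    # Rescale so the diagonal is exactly one again.
    scale = np.sqrt(np.diag(denoised))
    return denoised / np.outer(scale, scale)


rng = np.random.default_rng(42)
market = rng.normal(size=(252, 1))  # one common factor gives a clear signal eigenvalue
factor_returns = 0.01 * (market + rng.normal(size=(252, 10)))
clean = denoise_corr_sketch(factor_returns)
```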
- detoned_correlation(corr, n_components=1)[source]¶
Remove the first n_components eigenvectors (market mode) from a correlation matrix.
Use detoned correlation when you want to uncover residual co-movement structure after removing the dominant market factor. The first eigenvector of asset returns typically represents the “market mode” (all assets moving together). Removing it reveals sector, style, or idiosyncratic clustering that is hidden when the market factor dominates. This is particularly useful before hierarchical clustering or community detection.
- Parameters:
- Returns:
De-toned correlation matrix of shape (N, N). The matrix is symmetric with unit diagonal but is not positive definite (some eigenvalues are set to zero).
- Return type:
Example
>>> import numpy as np
>>> np.random.seed(42)
>>> corr = np.corrcoef(np.random.randn(5, 252))
>>> detoned = detoned_correlation(corr, n_components=1)
>>> detoned.shape
(5, 5)
>>> np.allclose(np.diag(detoned), 1.0)
True
References
Lopez de Prado (2020), “Machine Learning for Asset Managers”, Ch. 2
See also
denoised_correlation: Remove noise eigenvalues from a correlation matrix.
wraquant.ml.clustering.correlation_clustering: Cluster assets by correlation.
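Detoning amounts to dropping the leading eigen-components and renormalising the diagonal. The sketch below is one plausible reading of that procedure, with a hypothetical helper name (`detone_sketch`); it is not taken from the library.

```python
import numpy as np


def detone_sketch(corr: np.ndarray, n_components: int = 1) -> np.ndarray:
    """Hypothetical helper: remove the top eigen-components, restore unit diagonal."""
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    kept_vals = eigvals[n_components:]
    kept_vecs = eigvecs[:, n_components:]
    residual = (kept_vecs * kept_vals) @ kept_vecs.T
    scale = np.sqrt(np.diag(residual))
    return residual / np.outer(scale, scale)


rng = np.random.default_rng(0)
market = rng.normal(size=(252, 1))
sample_corr = np.corrcoef(market + rng.normal(size=(252, 6)), rowvar=False)
detoned = detone_sketch(sample_corr, n_components=1)

off_diag = ~np.eye(6, dtype=bool)
mean_before = sample_corr[off_diag].mean()
mean_after = detoned[off_diag].mean()
```

Because one eigenvalue is zeroed out, the result is singular, which is why it suits clustering better than direct inversion in an optimiser.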
Models¶
Walk-forward training, ensembles, and feature importance.
Model wrappers for financial machine-learning workflows.
Functions that require scikit-learn are guarded by the
@requires_extra('ml') decorator so that the rest of the package can
be imported without it.
- walk_forward_train(model, X, y, train_size=252, test_size=21, step_size=21)[source]¶
Walk-forward (expanding or rolling window) analysis.
Use walk-forward analysis to evaluate a model under realistic conditions where only past data is available for training at each step. This is the standard time-series cross-validation approach in quantitative finance, avoiding the look-ahead bias inherent in random K-fold splits.
At each step the model is cloned (via scikit-learn’s clone), fitted on the training window, and used to predict the test window.
- Parameters:
model (Any) – A scikit-learn-compatible estimator that implements fit and predict.
train_size (int, default: 252) – Number of training observations in the first window (default 252, approximately one trading year).
test_size (int, default: 21) – Number of test observations per fold (default 21, approximately one trading month).
step_size (int, default: 21) – Number of observations to step forward between folds.
- Returns:
predictions (np.ndarray): Concatenated out-of-sample predictions across all folds.
actuals (np.ndarray): Corresponding true values. Compare with predictions to measure forecast accuracy.
test_indices (np.ndarray): Original row indices for each prediction, useful for aligning results back to a DatetimeIndex.
n_folds (int): Number of walk-forward folds executed.
- Return type:
Example
>>> from sklearn.linear_model import Ridge
>>> import numpy as np, pandas as pd
>>> np.random.seed(42)
>>> X = pd.DataFrame(np.random.randn(500, 3), columns=['mom', 'vol', 'size'])
>>> y = X['mom'] * 0.5 + np.random.randn(500) * 0.1
>>> result = walk_forward_train(Ridge(), X, y, train_size=252, test_size=21)
>>> result['n_folds'] > 0
True
>>> len(result['predictions']) == len(result['actuals'])
True
Notes
The window is expanding (all data from the start up to the current train end is used). For a rolling window, see wraquant.ml.pipeline.walk_forward_backtest, which supports both modes.
See also
wraquant.ml.pipeline.walk_forward_backtest: Full walk-forward backtest with PnL.
wraquant.ml.preprocessing.purged_kfold: Purged K-fold cross-validation.
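The expanding-window schedule can be made concrete with an index generator and a refit inside the loop. In this sketch, ordinary least squares via `np.linalg.lstsq` stands in for a scikit-learn estimator, and `walk_forward_windows` is a hypothetical helper, so the fold boundaries are an assumption about how the documented parameters interact.

```python
import numpy as np


def walk_forward_windows(n_samples: int, train_size: int = 252, test_size: int = 21, step_size: int = 21):
    """Hypothetical helper: expanding train window, fixed-size test window."""
    train_end = train_size
    while train_end + test_size <= n_samples:
        yield np.arange(0, train_end), np.arange(train_end, train_end + test_size)
        train_end += step_size


rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=500)

predictions, actuals, n_folds = [], [], 0
for train_idx, test_idx in walk_forward_windows(len(X)):
    # Refit from scratch on past data only, then predict the next block.
    coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    predictions.append(X[test_idx] @ coef)
    actuals.append(y[test_idx])
    n_folds += 1

predictions = np.concatenate(predictions)
actuals = np.concatenate(actuals)
oos_corr = np.corrcoef(predictions, actuals)[0, 1]
```

With 500 rows and the default sizes this produces 11 folds and 231 out-of-sample predictions; the final 17 rows are never tested because a full 21-row test window no longer fits.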
- ensemble_predict(models, X, method='mean')[source]¶
Generate ensemble predictions from multiple fitted models.
Use ensemble prediction to combine several models (e.g., Ridge, Random Forest, Gradient Boosting) into a single, more robust forecast. Ensembles reduce variance and are standard practice in alpha research and competition-winning pipelines.
- Parameters:
models (Sequence[Any]) – Fitted scikit-learn-compatible estimators. Each must implement predict(X).
method (Literal['mean', 'median', 'vote'], default: 'mean') – Aggregation method. 'mean' and 'median' average the raw predictions (best for regression); 'vote' takes the mode (majority vote, best for classification).
- Returns:
Aggregated predictions. For 'mean' / 'median', the values are continuous. For 'vote', the values are discrete class labels.
- Return type:
Example
>>> from sklearn.linear_model import Ridge, Lasso
>>> import numpy as np
>>> np.random.seed(0)
>>> X_train = np.random.randn(200, 3)
>>> y_train = X_train @ [1, 0.5, 0] + np.random.randn(200) * 0.1
>>> m1 = Ridge().fit(X_train, y_train)
>>> m2 = Lasso(alpha=0.01).fit(X_train, y_train)
>>> X_test = np.random.randn(50, 3)
>>> preds = ensemble_predict([m1, m2], X_test, method='mean')
>>> preds.shape
(50,)
See also
walk_forward_train: Walk-forward evaluation for individual models.
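The three aggregation modes differ only in how the stacked per-model predictions are reduced. The sketch below works on prediction arrays rather than fitted models to stay dependency-free; `aggregate_sketch` is a hypothetical name, and breaking vote ties in favour of the smallest label is an assumption.

```python
import numpy as np


def aggregate_sketch(predictions, method: str = "mean") -> np.ndarray:
    """Hypothetical helper: combine per-model prediction arrays."""
    stacked = np.vstack(predictions)  # shape: (n_models, n_samples)
    if method == "mean":
        return stacked.mean(axis=0)
    if method == "median":
        return np.median(stacked, axis=0)
    if method == "vote":
        voted = []
        for column in stacked.T:
            values, counts = np.unique(column, return_counts=True)
            # argmax returns the first maximum, i.e. the smallest label on ties.
            voted.append(values[counts.argmax()])
        return np.array(voted)
    raise ValueError(f"unknown method: {method!r}")


model_preds = [np.array([1, 0, 1]), np.array([1, 1, 0]), np.array([0, 1, 1])]
vote = aggregate_sketch(model_preds, method="vote")
mean = aggregate_sketch(model_preds, method="mean")
median = aggregate_sketch(model_preds, method="median")
```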
- feature_importance_mdi(model, feature_names)[source]¶
Mean Decrease Impurity (MDI) feature importance.
Use MDI as a fast, first-pass feature ranking after fitting a tree-based model. MDI measures how much each feature contributes to reducing node impurity (Gini for classification, variance for regression) across all trees.
Reads model.feature_importances_ (available on tree-based estimators after fitting) and returns a sorted pd.Series.
- Parameters:
- Returns:
Importance values indexed by feature name, sorted descending. Higher values indicate features that contributed more to splits. Values sum to 1.0 for scikit-learn tree ensembles.
- Return type:
Example
>>> from sklearn.ensemble import RandomForestClassifier
>>> import numpy as np
>>> np.random.seed(42)
>>> X = np.random.randn(300, 4)
>>> y = (X[:, 0] > 0).astype(int)
>>> rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
>>> imp = feature_importance_mdi(rf, ['momentum', 'vol', 'size', 'value'])
>>> imp.index[0]  # most important feature
'momentum'
Notes
MDI is biased toward high-cardinality and continuous features. For an unbiased alternative, use feature_importance_mda (permutation importance).
See also
feature_importance_mda: Permutation-based importance (unbiased).
wraquant.ml.advanced.random_forest_importance: Combined RF fit + importance.
- feature_importance_mda(model, X, y, feature_names, n_repeats=10)[source]¶
Mean Decrease Accuracy (permutation importance).
Use MDA when you need an unbiased estimate of feature importance that accounts for feature interactions and is not affected by cardinality bias. Unlike MDI, MDA evaluates on held-out data and directly measures how much predictive power is lost when a feature is shuffled.
Repeatedly permutes each feature and measures the decrease in the model’s score.
- Parameters:
model (Any) – A fitted scikit-learn-compatible estimator.
X (DataFrame | ndarray) – Feature matrix (test or validation set).
feature_names (Sequence[str]) – Feature names corresponding to columns of X.
n_repeats (int, default: 10) – Number of permutation repeats per feature. More repeats yield more stable estimates but increase runtime linearly.
- Returns:
Mean importance values indexed by feature name, sorted descending. Positive values indicate features whose permutation hurts the model score; negative values suggest noise features.
- Return type:
Example
>>> from sklearn.ensemble import RandomForestClassifier
>>> import numpy as np
>>> np.random.seed(42)
>>> X = np.random.randn(300, 4)
>>> y = (X[:, 0] + 0.3 * X[:, 2] > 0).astype(int)
>>> rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
>>> imp = feature_importance_mda(rf, X, y, ['mom', 'vol', 'size', 'val'])
>>> imp.iloc[0] > 0  # top feature has positive importance
True
Notes
MDA is model-agnostic and works with any estimator that exposes a score method. Correlated features share importance: permuting one leaves its correlated partner to compensate, so both appear less important than they truly are.
References
Breiman (2001), “Random Forests”
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 8
See also
feature_importance_mdi: Faster but biased impurity-based importance.
wraquant.ml.pipeline.feature_importance_shap: SHAP-based importance.
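The permutation procedure itself is model-agnostic and fits in a dozen lines. The sketch below uses a least-squares fit and an R^2 scorer as stand-ins for a scikit-learn estimator and its score method; `permutation_importance_sketch` and `r2_on_holdout` are hypothetical names introduced for illustration.

```python
import numpy as np


def permutation_importance_sketch(score_fn, X, y, n_repeats: int = 10, seed: int = 0) -> np.ndarray:
    """Hypothetical helper: mean drop in score when each column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = score_fn(X, y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between feature j and the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - score_fn(X_perm, y))
        importances[j] = np.mean(drops)
    return importances


rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=300)

# Fit on the first 200 rows, score on the held-out last 100.
coef, *_ = np.linalg.lstsq(X[:200], y[:200], rcond=None)


def r2_on_holdout(X_eval, y_eval):
    resid = y_eval - X_eval @ coef
    return 1.0 - np.sum(resid**2) / np.sum((y_eval - y_eval.mean()) ** 2)


importances = permutation_importance_sketch(r2_on_holdout, X[200:], y[200:])
```

Scoring on rows the model never saw is what distinguishes this from an in-sample ranking: features 1 and 3, which carry no signal, end up with importances close to zero.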
- sequential_feature_selection(model, X, y, n_features=5, direction='forward', cv=5)[source]¶
Sequential (forward / backward) feature selection.
Use sequential feature selection when you want to find a compact subset of features that maximises predictive performance. Forward selection greedily adds the best feature at each step; backward selection starts with all features and removes the least useful.
- Parameters:
model (Any) – A scikit-learn-compatible estimator.
n_features (int, default: 5) – Number of features to select.
direction (Literal['forward', 'backward'], default: 'forward') – Selection direction. Forward is faster when n_features is small relative to total features; backward is faster when you want to drop only a few.
cv (int, default: 5) – Number of cross-validation folds.
- Returns:
Selected feature names (if X is a DataFrame) or column indices.
- Return type:
Example
>>> from sklearn.linear_model import Ridge
>>> import numpy as np, pandas as pd
>>> np.random.seed(42)
>>> X = pd.DataFrame(np.random.randn(200, 6),
...                  columns=['f1','f2','f3','f4','f5','f6'])
>>> y = X['f1'] * 2 + X['f3'] + np.random.randn(200) * 0.1
>>> selected = sequential_feature_selection(Ridge(), X, y, n_features=2)
>>> len(selected)
2
See also
feature_importance_mdi: Impurity-based ranking (faster, less rigorous).
feature_importance_mda: Permutation-based ranking.
Deep Learning¶
LSTM, GRU, Transformer, and autoencoder architectures for time-series forecasting. Requires PyTorch.
Deep learning models for quantitative finance.
Provides PyTorch-based neural network architectures tailored for financial time-series forecasting and feature extraction. All torch imports are guarded so the rest of the package works without PyTorch installed.
Models included:
- LSTM forecasting
- Transformer-based forecasting
- GRU forecasting
- Variational Autoencoder for feature extraction
- lstm_forecast(series, seq_length=20, hidden_dim=64, n_layers=2, dropout=0.1, n_epochs=50, lr=0.001, train_ratio=0.8, batch_size=32)[source]¶
Forecast a financial time series using an LSTM network.
Long Short-Term Memory networks are recurrent neural networks capable of learning long-range dependencies in sequential data. In finance, LSTMs are used to capture complex temporal patterns in price, volume, and return series that linear models miss.
The function auto-creates overlapping input/target sequences from the raw time series, splits into train/test sets chronologically (no shuffle to avoid lookahead bias), trains the model, and returns predictions on the test set.
- When to use:
Use LSTM for multi-step forecasting when you have >1000 observations and suspect non-linear temporal dependencies. Works well for return prediction, volatility forecasting, and spread modeling.
- Mathematical background:
- At each time step t, the LSTM cell computes:
f_t = sigma(W_f [h_{t-1}, x_t] + b_f)    (forget gate)
i_t = sigma(W_i [h_{t-1}, x_t] + b_i)    (input gate)
o_t = sigma(W_o [h_{t-1}, x_t] + b_o)    (output gate)
c_t = f_t * c_{t-1} + i_t * tanh(W_c [h_{t-1}, x_t] + b_c)
h_t = o_t * tanh(c_t)
The cell state c_t acts as a conveyor belt, allowing gradients to flow across many time steps without vanishing.
- Parameters:
series (Series | ndarray) – Univariate time series (e.g., log returns, prices, spreads).
seq_length (int, default: 20) – Number of look-back time steps for each input sequence.
hidden_dim (int, default: 64) – Number of hidden units in each LSTM layer.
n_layers (int, default: 2) – Number of stacked LSTM layers.
dropout (float, default: 0.1) – Dropout probability between LSTM layers (applied only when n_layers > 1).
n_epochs (int, default: 50) – Number of training epochs.
lr (float, default: 0.001) – Learning rate for the Adam optimizer.
train_ratio (float, default: 0.8) – Fraction of data used for training (the rest is used for testing). The split is chronological – no shuffling.
batch_size (int, default: 32) – Mini-batch size for training.
- Returns:
predictions: np.ndarray of test-set predictions
actuals: np.ndarray of actual test values
train_losses: list of per-epoch training losses
model: the trained torch.nn.Module
- Return type:
- Raises:
ImportError – If PyTorch is not installed.
Example
>>> import numpy as np
>>> returns = np.cumsum(np.random.randn(500) * 0.01)
>>> result = lstm_forecast(returns, seq_length=10, n_epochs=20)
>>> result["predictions"].shape
(80,)
Financial time series are notoriously noisy; LSTM is prone to overfitting on noise. Use dropout, early stopping, and validation.
Chronological train/test split is critical to avoid lookahead bias.
Normalisation (handled internally) is essential for gradient stability.
References
Hochreiter & Schmidhuber (1997), “Long Short-Term Memory”
Fischer & Krauss (2018), “Deep learning with long short-term memory networks for financial market predictions”
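Two pieces of the description above lend themselves to a small NumPy sketch: building overlapping look-back windows with a chronological split, and one step of the gate equations from the mathematical background. The helper names (`make_sequences`, `lstm_step`) and the weight layout are assumptions made for illustration; the actual function relies on PyTorch.

```python
import numpy as np


def make_sequences(series, seq_length: int):
    """Hypothetical helper: X[i] = series[i:i+seq_length], y[i] = the next value."""
    values = np.asarray(series, dtype=float)
    n = len(values) - seq_length
    X = np.stack([values[i : i + seq_length] for i in range(n)])
    return X, values[seq_length:]


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell update following the documented gate equations."""
    z = np.concatenate([h_prev, x_t])  # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])  # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])  # input gate
    o_t = sigmoid(W["o"] @ z + b["o"])  # output gate
    c_t = f_t * c_prev + i_t * np.tanh(W["c"] @ z + b["c"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t


X, y = make_sequences(np.arange(10.0), seq_length=3)
split = int(0.8 * len(X))  # chronological split, no shuffling
X_train, X_test = X[:split], X[split:]

# Zero weights make every gate equal 0.5, so the cell state halves each step.
hidden_dim, input_dim = 2, 1
W = {k: np.zeros((hidden_dim, hidden_dim + input_dim)) for k in "fioc"}
b = {k: np.zeros(hidden_dim) for k in "fioc"}
h, c = np.zeros(hidden_dim), np.ones(hidden_dim)
for value in X_train[0]:
    h, c = lstm_step(np.array([value]), h, c, W, b)
```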
- transformer_forecast(series, seq_length=20, d_model=64, n_heads=4, n_encoder_layers=2, dim_feedforward=128, dropout=0.1, n_epochs=50, lr=0.001, train_ratio=0.8, batch_size=32)[source]¶
Forecast a financial time series using a Transformer encoder.
Transformer models use self-attention to capture dependencies at any distance in the input sequence, unlike RNNs which process sequentially. This makes them especially effective at discovering long-range patterns such as seasonality, lead-lag relationships, and regime persistence in financial data.
- When to use:
Use Transformers when you have sufficient data (>2000 observations) and suspect that long-range dependencies matter. They often outperform LSTMs on longer sequences but require more data and compute.
- Mathematical background:
- Self-attention computes:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
where Q, K, V are linear projections of the input. Multi-head attention runs h parallel attention heads and concatenates:
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W_O
- Positional encoding injects order information:
PE(pos, 2i) = sin(pos / 10000^{2i/d_model})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})
- Parameters:
seq_length (int, default: 20) – Number of look-back time steps.
d_model (int, default: 64) – Embedding dimension (must be divisible by n_heads).
n_heads (int, default: 4) – Number of attention heads.
n_encoder_layers (int, default: 2) – Number of Transformer encoder layers.
dim_feedforward (int, default: 128) – Hidden dimension in the feedforward sub-layers.
dropout (float, default: 0.1) – Dropout probability.
n_epochs (int, default: 50) – Number of training epochs.
lr (float, default: 0.001) – Learning rate for Adam.
train_ratio (float, default: 0.8) – Fraction of data for training.
batch_size (int, default: 32) – Mini-batch size.
- Returns:
predictions: np.ndarray of test-set predictions,actuals: np.ndarray of actual test values,train_losses: list of per-epoch training losses,model: the trainedtorch.nn.Module.- Return type:
- Raises:
ImportError – If PyTorch is not installed.
Example
>>> import numpy as np
>>> prices = np.cumsum(np.random.randn(600) * 0.01) + 100
>>> result = transformer_forecast(prices, seq_length=15, n_epochs=10)
>>> len(result["predictions"]) > 0
True
Transformers are data-hungry; on small datasets (<500 obs) they will overfit severely.
Quadratic memory in sequence length: keep seq_length reasonable (< 256 for typical financial data).
No inherent notion of order without positional encoding.
References
Vaswani et al. (2017), “Attention Is All You Need”
Li et al. (2019), “Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting”
- autoencoder_features(X, latent_dim=8, hidden_dim=64, n_epochs=50, lr=0.001, batch_size=32, beta=1.0)[source]¶
Extract latent features using a Variational Autoencoder (VAE).
A VAE learns a compressed, continuous latent representation of high-dimensional input features. In finance, this is valuable for:
Regime detection: Cluster the latent codes to find market states.
Anomaly detection: High reconstruction error flags unusual market conditions (flash crashes, liquidity crises).
Feature compression: Reduce hundreds of technical indicators to a handful of orthogonal latent factors.
- When to use:
Use when you have a wide feature matrix (>20 features) and want to discover latent structure, detect anomalies, or reduce dimensionality in a non-linear way that PCA cannot capture.
- Mathematical background:
- The VAE optimises the Evidence Lower Bound (ELBO):
L = E_q[log p(x|z)] - beta * KL(q(z|x) || p(z))
where q(z|x) = N(mu(x), sigma^2(x)) is the encoder, p(x|z) is the decoder, and p(z) = N(0, I) is the prior. The KL term regularises the latent space to be smooth and continuous.
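As a rough illustration of the objective above, the sketch below evaluates the negative ELBO for a batch in NumPy. It assumes a Gaussian decoder, so the reconstruction term reduces to a squared error up to scale and constants; the function names are illustrative and not part of wraquant.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, sigma^2) || N(0, I)) per sample, with log_var = log(sigma^2)."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)

def reparameterise(mu, log_var, rng):
    """Sample z = mu + sigma * eps so that gradients can flow through mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def negative_elbo(x, x_recon, mu, log_var, beta=1.0):
    """Loss to minimise: reconstruction error + beta * KL, averaged over the batch."""
    recon = np.sum((x - x_recon) ** 2, axis=1)   # stands in for -log p(x|z)
    return float(np.mean(recon + beta * kl_to_standard_normal(mu, log_var)))

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 30))                    # 32 samples, 30 features
mu = rng.normal(scale=0.5, size=(32, 5))         # hypothetical encoder outputs
log_var = rng.normal(scale=0.1, size=(32, 5))
z = reparameterise(mu, log_var, rng)
x_recon = np.zeros_like(x)                       # placeholder decoder output
loss_standard = negative_elbo(x, x_recon, mu, log_var, beta=1.0)
loss_heavy_kl = negative_elbo(x, x_recon, mu, log_var, beta=4.0)
```

Raising beta increases the penalty for latent codes that drift from the N(0, I) prior, which is the trade-off the beta parameter controls.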
- Parameters:
X (DataFrame | ndarray) – Feature matrix of shape (n_samples, n_features).
latent_dim (int, default: 8) – Dimensionality of the latent space.
hidden_dim (int, default: 64) – Hidden layer size in encoder/decoder.
n_epochs (int, default: 50) – Training epochs.
lr (float, default: 0.001) – Learning rate.
batch_size (int, default: 32) – Mini-batch size.
beta (float, default: 1.0) – Weight on the KL divergence term. beta=1 is standard VAE; beta<1 gives more reconstruction accuracy; beta>1 forces more disentangled representations.
- Returns:
latent_features: np.ndarray of shape (n_samples, latent_dim) – the encoded representations
reconstruction_error: np.ndarray of per-sample reconstruction MSE
train_losses: list of per-epoch total losses
model: the trained VAE module
- Return type:
- Raises:
ImportError – If PyTorch is not installed.
Example
>>> import numpy as np
>>> X = np.random.randn(500, 30)  # 30 features
>>> result = autoencoder_features(X, latent_dim=5, n_epochs=20)
>>> result["latent_features"].shape
(500, 5)
Normalise your features before encoding; the VAE assumes roughly standard-normal inputs for stable training.
The latent space is stochastic; for deterministic embeddings, use the mean (mu) which is what this function returns.
Reconstruction error thresholds for anomaly detection should be calibrated on clean training data.
References
Kingma & Welling (2014), “Auto-Encoding Variational Bayes”
An & Cho (2015), “Variational Autoencoder based Anomaly Detection using Reconstruction Probability”
- gru_forecast(series, seq_length=20, hidden_dim=64, n_layers=2, dropout=0.1, n_epochs=50, lr=0.001, train_ratio=0.8, batch_size=32)[source]¶
Forecast a financial time series using a GRU network.
Gated Recurrent Units are a simplified variant of LSTMs that merge the cell and hidden state, resulting in fewer parameters and faster training while achieving comparable performance on many financial forecasting tasks.
- When to use:
Use GRU as a computationally cheaper alternative to LSTM. Preferred when you have moderate-sized datasets (500-5000 observations) or need faster iteration during model development.
- Mathematical background:
- The GRU update equations at time step t:
z_t = sigma(W_z [h_{t-1}, x_t])   (update gate)
r_t = sigma(W_r [h_{t-1}, x_t])   (reset gate)
h_t_hat = tanh(W [r_t * h_{t-1}, x_t])   (candidate)
h_t = (1 - z_t) * h_{t-1} + z_t * h_t_hat
Compared to LSTM, GRU has no separate cell state and uses two gates instead of three, giving ~25% fewer parameters.
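The update equations translate almost line for line into NumPy. The sketch below runs one GRU layer over a short sequence and checks the parameter ratio quoted above; biases are omitted as in the equations, and the weight names and random initialisation are assumptions for illustration only.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU update following the equations above (no bias terms)."""
    concat = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ concat)                                  # update gate
    r = sigmoid(W_r @ concat)                                  # reset gate
    h_hat = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))   # candidate state
    return (1.0 - z) * h_prev + z * h_hat

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_length = 1, 64, 20
shape = (hidden_dim, hidden_dim + input_dim)
W_z, W_r, W_h = (rng.normal(scale=0.1, size=shape) for _ in range(3))

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(seq_length, input_dim)):
    h = gru_step(x_t, h, W_z, W_r, W_h)

# GRU has 3 weight blocks (z, r, candidate); LSTM has 4 (f, i, o, candidate)
gru_params = 3 * shape[0] * shape[1]
lstm_params = 4 * shape[0] * shape[1]
ratio = gru_params / lstm_params
```

Here `ratio` is 0.75, which is the source of the "~25% fewer parameters" figure for equal hidden sizes.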
- Parameters:
seq_length (int, default: 20) – Number of look-back time steps.
hidden_dim (int, default: 64) – Number of hidden units per GRU layer.
n_layers (int, default: 2) – Number of stacked GRU layers.
dropout (float, default: 0.1) – Dropout between layers (only when n_layers > 1).
n_epochs (int, default: 50) – Training epochs.
lr (float, default: 0.001) – Learning rate.
train_ratio (float, default: 0.8) – Fraction of data for training.
batch_size (int, default: 32) – Mini-batch size.
- Returns:
predictions: np.ndarray of test-set predictions
actuals: np.ndarray of actual test values
train_losses: list of per-epoch training losses
model: the trained torch.nn.Module
- Return type:
- Raises:
ImportError – If PyTorch is not installed.
Example
>>> import numpy as np
>>> vol = np.abs(np.random.randn(400)) * 0.02
>>> result = gru_forecast(vol, seq_length=10, n_epochs=15)
>>> result["predictions"].shape[0] > 0
True
Same overfitting risks as LSTM; use dropout and validation.
On very long sequences (>200 steps), Transformers may outperform GRU.
References
Cho et al. (2014), “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation”
- multivariate_lstm_forecast(features, target, seq_length=20, hidden_dim=64, n_layers=2, dropout=0.1, n_epochs=50, lr=0.001, train_ratio=0.8, batch_size=32)[source]¶
Forecast a target series using multiple input features via LSTM.
Multivariate LSTM ingests a DataFrame of features (e.g., returns of correlated assets, macro indicators, technical signals) and learns to predict a single target variable. This outperforms univariate LSTM when cross-asset signals exist – for example, when sector ETF returns lead individual stock returns, when VIX changes anticipate equity moves, or when order-flow imbalance across related instruments carries predictive information for the target.
The function normalises each feature column independently (z-score), creates multivariate look-back sequences, trains the LSTM with a chronological train/test split, and returns predictions on the held-out test set along with train and test MSE metrics.
- Mathematical background:
The LSTM cell equations are the same as in lstm_forecast, but the input dimensionality is now n_features rather than 1:
x_t in R^{n_features}
f_t = sigma(W_f [h_{t-1}, x_t] + b_f)
i_t = sigma(W_i [h_{t-1}, x_t] + b_i)
o_t = sigma(W_o [h_{t-1}, x_t] + b_o)
The weight matrices W_f, W_i, W_o, W_c have input dimension n_features instead of 1, allowing the network to learn cross-feature temporal dependencies.
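The data preparation described above (per-column z-scores, multivariate look-back windows, chronological split) might look roughly like the following. This is a sketch under stated assumptions: the helper name is hypothetical, and fitting the z-score statistics on the training rows only is a choice made here to avoid lookahead, not something the documentation confirms about the library.

```python
import numpy as np

def make_sequences(features, target, seq_length=20, train_ratio=0.8):
    """Build (n_windows, seq_length, n_features) inputs with a chronological split."""
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    n_train_rows = int(len(features) * train_ratio)

    # Assumption: normalise with training-period statistics only
    mean = features[:n_train_rows].mean(axis=0)
    std = features[:n_train_rows].std(axis=0) + 1e-8
    z = (features - mean) / std

    # Window i covers rows i .. i+seq_length-1 and predicts the target at row i+seq_length
    X = np.stack([z[i:i + seq_length] for i in range(len(z) - seq_length)])
    y = target[seq_length:]

    split = n_train_rows - seq_length   # windows whose target row is in the training period
    return X[:split], y[:split], X[split:], y[split:]

rng = np.random.default_rng(42)
feats = rng.normal(size=(500, 3))                     # T=500, n_features=3
tgt = np.cumsum(rng.normal(scale=0.01, size=500))
X_train, y_train, X_test, y_test = make_sequences(feats, tgt, seq_length=10)
```

Each training sample is a (seq_length, n_features) matrix, so the LSTM sees all features at every look-back step.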
- Parameters:
features (DataFrame) – DataFrame of shape (T, n_features) containing the input features. All columns are used as inputs to the LSTM.
target (Series | ndarray) – Target variable of length T to predict.
seq_length (int, default: 20) – Number of look-back time steps for each input sequence.
hidden_dim (int, default: 64) – Number of hidden units in each LSTM layer.
n_layers (int, default: 2) – Number of stacked LSTM layers.
dropout (float, default: 0.1) – Dropout probability between LSTM layers (applied only when n_layers > 1).
n_epochs (int, default: 50) – Number of training epochs.
lr (float, default: 0.001) – Learning rate for the Adam optimizer.
train_ratio (float, default: 0.8) – Fraction of data used for training (chronological split).
batch_size (int, default: 32) – Mini-batch size for training.
- Returns:
predictions: np.ndarray of test-set predictions
actuals: np.ndarray of actual test values
train_losses: list of per-epoch training losses
train_mse: float MSE on the training set
test_mse: float MSE on the test set
model: the trained torch.nn.Module
- Return type:
- Raises:
ImportError – If PyTorch is not installed.
Example
>>> import numpy as np, pandas as pd
>>> np.random.seed(42)
>>> df = pd.DataFrame({
...     'asset_a': np.cumsum(np.random.randn(500) * 0.01),
...     'asset_b': np.cumsum(np.random.randn(500) * 0.01),
...     'vix': np.abs(np.random.randn(500)) * 15 + 15,
... })
>>> target = pd.Series(np.cumsum(np.random.randn(500) * 0.01))
>>> result = multivariate_lstm_forecast(df, target, seq_length=10, n_epochs=5)
>>> result["predictions"].shape[0] > 0
True
References
Hochreiter & Schmidhuber (1997), “Long Short-Term Memory”
Fischer & Krauss (2018), “Deep learning with long short-term memory networks for financial market predictions”
- temporal_fusion_transformer(features, target, seq_length=20, hidden_dim=64, n_heads=4, n_lstm_layers=1, dropout=0.1, n_epochs=50, lr=0.001, train_ratio=0.8, batch_size=32)[source]¶
Simplified Temporal Fusion Transformer for interpretable forecasting.
The Temporal Fusion Transformer (TFT) is among the most promising architectures for interpretable financial forecasting. This implementation provides the core TFT components: a variable selection network that learns which input features matter, an LSTM encoder for temporal processing, multi-head attention for capturing long-range dependencies, and gated residual connections for stable gradient flow.
Unlike black-box models, TFT produces per-feature importance weights that reveal which inputs drive each prediction – critical for building trust in trading signals and satisfying model governance requirements.
- Architecture:
Variable Selection Network (VSN): A soft-attention gate over input features. Each feature is projected to hidden_dim, then a shared softmax gate selects the most relevant ones.
LSTM Encoder: Processes the selected features sequentially to capture local temporal patterns.
Multi-Head Attention: Attends over the LSTM outputs to capture long-range dependencies (e.g., monthly seasonality).
Gated Residual Network (GRN): skip connections with gating for stable training on noisy financial data.
Output layer: Linear projection to produce the forecast.
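To make the variable selection step concrete, here is a heavily simplified NumPy sketch of a softmax gate over features. The weight shapes, the way the gate logits are computed, and the time-averaging used to obtain one importance weight per feature are all assumptions for illustration; the actual network in the library may differ.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def variable_selection(x, W_proj, W_gate):
    """x: (seq_length, n_features) -> selected (seq_length, hidden_dim), weights (seq_length, n_features)."""
    # Project each scalar feature to its own hidden_dim-sized embedding
    embedded = x[:, :, None] * W_proj[None, :, :]          # (seq, n_features, hidden)
    # Shared gate: one softmax weight per feature at each time step
    weights = softmax(x @ W_gate, axis=-1)                 # (seq, n_features)
    # Weighted sum of the feature embeddings
    selected = (weights[:, :, None] * embedded).sum(axis=1)
    return selected, weights

rng = np.random.default_rng(0)
seq_length, n_features, hidden_dim = 10, 3, 16
x = rng.normal(size=(seq_length, n_features))
W_proj = rng.normal(size=(n_features, hidden_dim))     # hypothetical learned weights
W_gate = rng.normal(size=(n_features, n_features))     # hypothetical learned weights
selected, weights = variable_selection(x, W_proj, W_gate)
feature_importance = weights.mean(axis=0)               # one weight per input feature
```

Because the gate weights are non-negative and sum to one, they can be read directly as relative feature importance.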
- Parameters:
features (DataFrame) – DataFrame of shape (T, n_features) containing the input features.
seq_length (int, default: 20) – Number of look-back time steps.
hidden_dim (int, default: 64) – Dimensionality of the hidden representations.
n_heads (int, default: 4) – Number of attention heads (must divide hidden_dim).
n_lstm_layers (int, default: 1) – Number of LSTM layers in the encoder.
dropout (float, default: 0.1) – Dropout probability.
n_epochs (int, default: 50) – Number of training epochs.
lr (float, default: 0.001) – Learning rate for Adam.
train_ratio (float, default: 0.8) – Fraction of data for training (chronological split).
batch_size (int, default: 32) – Mini-batch size.
- Returns:
predictions: np.ndarray of test-set predictions
actuals: np.ndarray of actual test values
train_losses: list of per-epoch training losses
feature_importance: np.ndarray of shape (n_features,) giving the learned importance weight for each input feature (higher = more important)
feature_names: list of feature names from the input DataFrame
model: the trained torch.nn.Module
- Return type:
- Raises:
ImportError – If PyTorch is not installed.
Example
>>> import numpy as np, pandas as pd
>>> np.random.seed(42)
>>> df = pd.DataFrame({
...     'momentum': np.random.randn(500),
...     'volume': np.abs(np.random.randn(500)),
...     'spread': np.random.randn(500) * 0.1,
... })
>>> target = pd.Series(np.cumsum(np.random.randn(500) * 0.01))
>>> result = temporal_fusion_transformer(
...     df, target, seq_length=10, hidden_dim=16, n_heads=2, n_epochs=5
... )
>>> result["predictions"].shape[0] > 0
True
>>> len(result["feature_importance"]) == 3
True
References
Lim et al. (2021), “Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting”
Advanced Models¶
SVM, Random Forest, Gradient Boosting, Gaussian Process, Isolation Forest, PCA factor models.
Advanced scikit-learn models for quantitative finance.
Provides production-ready wrappers around SVM, Random Forest, Gradient Boosting, Gaussian Process, Isolation Forest, and PCA – all with finance-specific defaults, comprehensive docstrings, and clean return interfaces.
All functions guard sklearn imports behind @requires_extra('ml') so the
rest of wraquant works without scikit-learn installed.
- svm_classifier(X_train, y_train, X_test, y_test, kernel='rbf', C_range=(0.1, 1.0, 10.0), gamma_range=('scale', 0.01, 0.1), cv=5)[source]¶
Train an SVM classifier for market regime classification.
Support Vector Machines find the maximum-margin hyperplane separating classes. With the RBF kernel, SVMs can capture non-linear decision boundaries in feature space, making them effective for classifying market regimes (bull/bear/neutral) from derived features like volatility, momentum, and volume profiles.
- When to use:
Use SVM when you have a moderate number of features (5-100), moderate dataset size (500-50k), and need robust classification with good generalisation. SVMs handle high-dimensional spaces well and are resistant to overfitting when C is properly tuned.
- Mathematical background:
- SVM solves:
min_{w,b} (1/2) ||w||^2 + C * sum_i max(0, 1 - y_i(w.x_i + b))
- The RBF kernel maps inputs to infinite-dimensional space:
K(x, x’) = exp(-gamma * ||x - x’||^2)
Grid search over C (regularisation) and gamma (kernel width) selects the best hyperparameters via cross-validation.
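The objective and kernel above are short enough to write out directly. The sketch below evaluates the soft-margin objective for a linear decision function and builds an RBF Gram matrix; the function names are illustrative, and labels are assumed to be coded as -1/+1 as the hinge loss requires.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """K(x, x') = exp(-gamma * ||x - x'||^2) for all pairs of rows."""
    sq_dist = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))   # clip tiny negative round-off

def soft_margin_objective(w, b, X, y, C):
    """(1/2) ||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)), with y in {-1, +1}."""
    margins = y * (X @ w + b)
    return 0.5 * w @ w + C * np.maximum(0.0, 1.0 - margins).sum()

# Two well-separated points: both margins are 2, so the hinge term vanishes
X_toy = np.array([[2.0, 0.0], [-2.0, 0.0]])
y_toy = np.array([1.0, -1.0])
objective = soft_margin_objective(np.array([1.0, 0.0]), 0.0, X_toy, y_toy, C=1.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
K_narrow = rbf_kernel(X, X, gamma=1.0)    # large gamma: similarity decays quickly
K_wide = rbf_kernel(X, X, gamma=0.01)     # small gamma: smoother decision boundary
```

Larger C penalises margin violations more heavily, and larger gamma makes the kernel more local; these are the two quantities the grid search trades off.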
- Parameters:
y_train (Series | ndarray) – Training labels (e.g., 1 = bull, 0 = neutral, -1 = bear).
kernel (Literal['rbf', 'linear', 'poly'], default: 'rbf') – SVM kernel function.
C_range (Sequence[float], default: (0.1, 1.0, 10.0)) – Regularisation parameter values to search.
gamma_range (Sequence[float | str], default: ('scale', 0.01, 0.1)) – Kernel coefficient values to search (ignored for linear kernel).
cv (int, default: 5) – Cross-validation folds for grid search.
- Returns:
model: fitted SVC
predictions: np.ndarray of test predictions
accuracy: float
confusion_matrix: np.ndarray
best_params: dict of best C and gamma
cv_score: float (mean CV accuracy)
- Return type:
Example
>>> import numpy as np
>>> X = np.random.randn(200, 5)
>>> y = (X[:, 0] > 0).astype(int)
>>> result = svm_classifier(X[:150], y[:150], X[150:], y[150:])
>>> result["accuracy"] > 0.5
True
Scale features before training (StandardScaler recommended).
SVMs are O(n^2) in memory and O(n^3) in time – avoid for n > 100k.
For imbalanced classes, set class_weight='balanced' on the SVC.
References
Cortes & Vapnik (1995), “Support-Vector Networks”
- random_forest_importance(X, y, feature_names=None, n_estimators=100, max_depth=5, random_state=42, task='classification')[source]¶
Rank features by importance using a Random Forest.
Random Forests aggregate many decorrelated decision trees and measure each feature’s contribution to reducing impurity (Gini for classification, variance for regression). This produces a natural feature ranking useful for selecting the most predictive signals from a large universe of technical indicators, fundamental factors, or alternative data features.
- When to use:
Use as a first-pass feature selector when you have many candidate features (>20) and want to identify which ones carry signal. Fast, non-parametric, and handles mixed feature types.
- Mathematical background:
- Mean Decrease Impurity (MDI) for feature j:
Imp(j) = sum_{t in T_j} p(t) * Delta_i(t)
where T_j is the set of tree nodes splitting on feature j, p(t) is the fraction of samples reaching node t, and Delta_i(t) is the impurity decrease. MDI is averaged over all trees in the forest.
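For a single split, the term p(t) * Delta_i(t) can be computed by hand. The sketch below does this with Gini impurity; the helper names are illustrative, and a real forest would sum these contributions over every node that splits on feature j and then average over trees.

```python
import numpy as np

def gini(y):
    """Gini impurity 1 - sum_k p_k^2 of a label array."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def weighted_impurity_decrease(y_node, left_mask, n_total):
    """p(t) * Delta_i(t) for one node t split into left/right children."""
    y_left, y_right = y_node[left_mask], y_node[~left_mask]
    n = len(y_node)
    delta = (
        gini(y_node)
        - len(y_left) / n * gini(y_left)
        - len(y_right) / n * gini(y_right)
    )
    return (n / n_total) * delta

# Root node of a toy tree: feature x separates the two classes perfectly
x = np.array([0.1, 0.2, 0.8, 0.9])
y = np.array([0, 0, 1, 1])
contribution = weighted_impurity_decrease(y, x < 0.5, n_total=len(y))
```

The parent impurity is 0.5 and both children are pure, so this split contributes 0.5 to the importance of feature x.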
- Parameters:
feature_names (Optional[Sequence[str]], default: None) – Feature names. If None and X is a DataFrame, column names are used.
n_estimators (int, default: 100) – Number of trees.
max_depth (int | None, default: 5) – Maximum tree depth (None for unlimited).
random_state (int, default: 42) – Random seed for reproducibility.
task (Literal['classification', 'regression'], default: 'classification') – Type of prediction task.
- Returns:
importance: pd.Series of feature importances sorted descending
model: fitted RandomForest estimator
oob_score: float (out-of-bag score if available, else None)
- Return type:
Example
>>> import numpy as np
>>> X = np.random.randn(300, 10)
>>> y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)
>>> result = random_forest_importance(X, y)
>>> result["importance"].index[0]  # top feature is likely 0
0
MDI importance is biased toward high-cardinality features; consider permutation importance (feature_importance_mda) as a complement.
Correlated features share importance, causing both to appear weaker.
References
Breiman (2001), “Random Forests”
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch.8
- gradient_boost_forecast(X_train, y_train, X_test, y_test=None, task='regression', n_estimators=200, max_depth=4, learning_rate=0.1, subsample=0.8, cv=5, feature_names=None)[source]¶
Gradient boosting for forecasting or classification.
Gradient Boosting sequentially fits weak learners (shallow trees) to the residuals of the ensemble, greedily minimising a loss function. It is the workhorse of tabular ML in quant finance – used for return prediction, alpha factor construction, default prediction, and more.
- When to use:
Use gradient boosting as your default tabular model. It handles non-linearities, feature interactions, and missing values naturally. Preferred over linear models when you have >500 samples and >5 features.
- Mathematical background:
- At each stage m, the ensemble is updated with a new tree h_m, scaled by the learning rate nu:
F_m(x) = F_{m-1}(x) + nu * h_m(x)
- where h_m is chosen to reduce the loss, in practice by fitting the negative gradient of the loss at F_{m-1}:
h_m = argmin_h sum_i L(y_i, F_{m-1}(x_i) + h(x_i))
For regression with squared loss, h_m fits the residuals. For classification with log-loss, h_m fits the log-odds residuals.
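The residual-fitting loop for squared loss can be sketched with one-split trees (stumps) in NumPy. This is a toy illustration of the update rule on a single feature, not the scikit-learn estimator the wrapper uses; the function names and the synthetic data are assumptions.

```python
import numpy as np

def fit_stump(x, residual):
    """Return (threshold, left_value, right_value) of the best one-split tree."""
    best_sse, best = np.inf, None
    for thr in np.unique(x)[:-1]:                 # keep both sides non-empty
        left = x <= thr
        lv, rv = residual[left].mean(), residual[~left].mean()
        sse = ((residual[left] - lv) ** 2).sum() + ((residual[~left] - rv) ** 2).sum()
        if sse < best_sse:
            best_sse, best = sse, (thr, lv, rv)
    return best

def boost(x, y, n_estimators=50, nu=0.1):
    """F_m = F_{m-1} + nu * h_m, where h_m is fitted to the current residuals."""
    F = np.full(len(y), y.mean())                 # F_0: constant prediction
    stumps = []
    for _ in range(n_estimators):
        residual = y - F                          # negative gradient of 0.5 * (y - F)^2
        thr, lv, rv = fit_stump(x, residual)
        F = F + nu * np.where(x <= thr, lv, rv)
        stumps.append((thr, lv, rv))
    return F, stumps

rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)
F, stumps = boost(x, y)
mse_start = np.mean((y - y.mean()) ** 2)
mse_end = np.mean((y - F) ** 2)
```

A smaller nu needs more stages to reach the same training error, which is the learning_rate / n_estimators trade-off.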
- Parameters:
y_test (Series | ndarray | None, default: None) – Test target (if provided, test metrics are computed).
task (Literal['classification', 'regression'], default: 'regression') – Prediction task.
n_estimators (int, default: 200) – Number of boosting stages.
max_depth (int, default: 4) – Maximum depth of individual trees.
learning_rate (float, default: 0.1) – Shrinkage applied to each tree’s contribution.
subsample (float, default: 0.8) – Fraction of training samples used per tree (stochastic boosting).
cv (int, default: 5) – Cross-validation folds for reporting training CV score.
feature_names (Optional[Sequence[str]], default: None) – Feature names for importance ranking.
- Returns:
model: fitted GradientBoosting estimator
predictions: np.ndarray of test predictions
feature_importance: pd.Series (sorted descending)
cv_scores: np.ndarray of cross-validation scores
test_score: float or None (R^2 for regression, accuracy for classification)
- Return type:
Example
>>> import numpy as np
>>> X = np.random.randn(300, 5)
>>> y = X[:, 0] * 2 + X[:, 1] + np.random.randn(300) * 0.5
>>> result = gradient_boost_forecast(X[:250], y[:250], X[250:], y[250:])
>>> result["test_score"] > 0
True
Overfits if n_estimators is too large; use early stopping or CV.
Sensitive to learning_rate / n_estimators trade-off.
For >100k samples, consider XGBoost/LightGBM for speed.
References
Friedman (2001), “Greedy Function Approximation: A Gradient Boosting Machine”
- gaussian_process_regression(X_train, y_train, X_test, kernel='rbf', alpha=0.01, n_restarts=5)[source]¶
Gaussian Process regression with uncertainty quantification.
Gaussian Processes (GPs) define a distribution over functions and provide both point predictions and calibrated confidence intervals. In finance, GPs are used for smooth yield-curve fitting, volatility-surface interpolation, and any setting where uncertainty matters as much as the prediction.
- When to use:
Use GP when you need uncertainty estimates (e.g., confidence bands on a yield curve) and have a small-to-moderate dataset (<5000 observations). The cubic complexity makes GPs impractical for large datasets without approximations.
- Mathematical background:
- A GP assumes f(x) ~ GP(m(x), k(x, x’)), where:
m(x) is the mean function (usually 0)
k(x, x’) is the kernel (covariance function)
- Posterior predictive at test point x*:
mu* = k(x*, X) [K + sigma^2 I]^{-1} y
sigma*^2 = k(x*, x*) - k(x*, X) [K + sigma^2 I]^{-1} k(X, x*)
where K_{ij} = k(x_i, x_j) and sigma^2 is the noise variance.
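The posterior formulas can be evaluated directly with a Cholesky factorisation. The sketch below assumes a zero mean function and an RBF kernel with unit length scale; the function names and the noise_var argument are choices for this example and are not the wrapper's parameters.

```python
import numpy as np

def rbf(A, B, length_scale=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 * length_scale^2)) for all row pairs."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale ** 2)

def gp_posterior(X_train, y_train, X_test, noise_var=0.01):
    """Posterior mean and standard deviation of f at the test points."""
    K = rbf(X_train, X_train) + noise_var * np.eye(len(X_train))   # K + sigma^2 I
    K_s = rbf(X_test, X_train)                                      # k(x*, X)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))       # [K + sigma^2 I]^{-1} y
    mu = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    var = np.diag(rbf(X_test, X_test)) - (v ** 2).sum(axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

rng = np.random.default_rng(0)
X_train = np.linspace(0, 10, 50).reshape(-1, 1)
y_train = np.sin(X_train).ravel() + rng.normal(scale=0.1, size=50)
X_test = np.linspace(0, 10, 20).reshape(-1, 1)
mu, std = gp_posterior(X_train, y_train, X_test)
lower, upper = mu - 1.96 * std, mu + 1.96 * std    # approximate 95% band
```

The Cholesky factorisation of the n x n matrix is the O(n^3) step that limits GPs to small and moderate datasets.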
- Parameters:
- Returns:
predictions: np.ndarray of mean predictions
std: np.ndarray of predictive standard deviations
confidence_lower: np.ndarray (mean - 1.96 * std)
confidence_upper: np.ndarray (mean + 1.96 * std)
model: fitted GaussianProcessRegressor
- Return type:
Example
>>> import numpy as np
>>> X_train = np.linspace(0, 10, 50).reshape(-1, 1)
>>> y_train = np.sin(X_train).ravel() + np.random.randn(50) * 0.1
>>> X_test = np.linspace(0, 10, 20).reshape(-1, 1)
>>> result = gaussian_process_regression(X_train, y_train, X_test)
>>> result["predictions"].shape
(20,)
>>> result["std"].shape
(20,)
Complexity is O(n^3) for training and O(n^2) per prediction.
For large datasets, use sparse GP approximations (not included here).
Kernel choice strongly affects results; try multiple kernels.
References
Rasmussen & Williams (2006), “Gaussian Processes for Machine Learning”
- isolation_forest_anomaly(returns, contamination=0.05, n_estimators=200, random_state=42)[source]¶
Detect anomalous days in return data using Isolation Forest.
Isolation Forest detects anomalies by randomly partitioning data and measuring how quickly each observation is isolated. Anomalous points (outlier returns, flash crashes, liquidity events) are isolated in fewer splits because they sit far from the bulk of the distribution.
- When to use:
Use for unsupervised anomaly detection in returns, volumes, or spreads. Works well when you do not have labelled anomalies and want to flag unusual market days for review. Robust to high-dimensional feature spaces.
- Mathematical background:
For a sample x, the anomaly score is based on the average path length E[h(x)] across the isolation trees:
s(x, n) = 2^{-E[h(x)] / c(n)}
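where c(n) is the average path length of an unsuccessful search in a binary search tree of n samples. A score close to 1 indicates an anomaly; a score well below 0.5 indicates a normal observation; a score of 0.5 corresponds to an average path length (E[h(x)] = c(n)), so a sample in which every score is near 0.5 has no distinct anomalies.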
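The score formula needs only c(n), which has a closed form based on the harmonic number, c(n) = 2 H(n-1) - 2 (n-1)/n with H(i) approximated by ln(i) + 0.5772 (Euler's constant), as in Liu, Ting & Zhou (2008). A minimal sketch with illustrative function names:

```python
import math

EULER_GAMMA = 0.5772156649

def average_path_length(n: int) -> float:
    """c(n): average path length of an unsuccessful BST search over n samples."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA      # H(n-1) approximation
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length: float, n: int) -> float:
    """s(x, n) = 2^{-E[h(x)] / c(n)}."""
    return 2.0 ** (-mean_path_length / average_path_length(n))

n = 256                                            # sub-sample size per tree
typical = anomaly_score(average_path_length(n), n) # average depth
isolated_fast = anomaly_score(2.0, n)              # isolated after ~2 splits
deep = anomaly_score(20.0, n)                      # needs many splits
```

A point isolated after about two splits scores close to 1, while one that needs far more than c(n) splits scores well below 0.5.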
- Parameters:
returns (Series | DataFrame | ndarray) – Return data. If 1-D, treated as a single feature; if 2-D, each column is a feature (e.g., return, volume, spread).
contamination (float, default: 0.05) – Expected fraction of anomalies in the dataset (0 < c < 0.5).
n_estimators (int, default: 200) – Number of isolation trees.
random_state (int, default: 42) – Random seed.
- Returns:
anomaly_labels: np.ndarray of -1 (anomaly) / 1 (normal)
anomaly_scores: np.ndarray of continuous anomaly scores (lower = more anomalous)
anomaly_mask: np.ndarray of bool (True for anomalies)
n_anomalies: int
model: fitted IsolationForest
- Return type:
Example
>>> import numpy as np
>>> rets = np.random.randn(500) * 0.01
>>> rets[100] = 0.15  # inject anomaly
>>> result = isolation_forest_anomaly(rets, contamination=0.02)
>>> result["anomaly_mask"][100]
True
The contamination parameter is a prior; misspecification leads to over- or under-detection.
Isolation Forest assumes anomalies are both rare and different; clustered anomalies may be missed.
For time-series anomaly detection, consider adding lagged features.
References
Liu, Ting & Zhou (2008), “Isolation Forest”
- pca_factor_model(returns, n_components=None, explained_variance_threshold=0.9)[source]¶
Build a PCA-based latent factor model from asset returns.
Principal Component Analysis extracts orthogonal linear combinations of asset returns that explain the most variance. The first PC typically captures the market factor, the second often captures a value/growth or sector rotation, and so on.
- When to use:
Use PCA factor models for dimensionality reduction in portfolio construction, risk decomposition, statistical arbitrage (pairs trading on residuals), and understanding co-movement structure.
- Mathematical background:
- Given return matrix R (T x N), PCA decomposes the covariance:
Sigma = V Lambda V^T
where Lambda = diag(lambda_1, …, lambda_N) are eigenvalues and V are eigenvectors (loadings). Factor returns are:
F = R @ V[:, :k] (T x k)
- The fraction of variance explained by the first k components:
sum(lambda_1..k) / sum(lambda_1..N)
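The decomposition can be reproduced with NumPy's symmetric eigensolver. In this sketch the returns are simulated from one common factor plus noise, and they are demeaned before projection (an assumption made here, matching how PCA implementations usually centre the data); variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, k = 252, 20, 3

# Simulated returns: one common "market" factor plus idiosyncratic noise
market = rng.normal(scale=0.01, size=(T, 1))
R = market @ np.ones((1, N)) + rng.normal(scale=0.005, size=(T, N))
R_c = R - R.mean(axis=0)                          # centre each asset

Sigma = np.cov(R_c, rowvar=False)                 # N x N covariance
eigvals, eigvecs = np.linalg.eigh(Sigma)          # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
lam, V = eigvals[order], eigvecs[:, order]        # sort descending

F = R_c @ V[:, :k]                                # factor returns, T x k
explained = lam / lam.sum()                       # per-component variance ratio
cumulative = np.cumsum(explained)
# Smallest number of components reaching 90% cumulative variance
n_for_90 = int(np.searchsorted(cumulative, 0.9) + 1)
```

With a single strong common factor, the first component explains most of the variance, mirroring the "market factor" interpretation of the first PC.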
- Parameters:
returns (DataFrame) – T x N return matrix (rows = observations, columns = assets).
n_components (int | None, default: None) – Number of principal components. If None, selects enough to explain explained_variance_threshold of total variance.
explained_variance_threshold (float, default: 0.9) – Minimum cumulative explained variance ratio when n_components is None.
- Returns:
loadings: pd.DataFrame of shape (N, n_components) – asset loadings on each factor
factor_returns: pd.DataFrame of shape (T, n_components) – time series of factor returns
explained_variance_ratio: np.ndarray of per-component variance ratios
cumulative_variance: np.ndarray of cumulative variance ratios
n_components: int
model: fitted PCA object
- Return type:
Example
>>> import numpy as np, pandas as pd
>>> returns = pd.DataFrame(np.random.randn(252, 20) * 0.01)
>>> result = pca_factor_model(returns, n_components=3)
>>> result["factor_returns"].shape
(252, 3)
PCA is linear; for non-linear dimensionality reduction, use the VAE in wraquant.ml.deep.autoencoder_features.
Eigenvalues from small samples are noisy; use Random Matrix Theory denoising (wraquant.ml.preprocessing.denoised_correlation) first.
Components are not guaranteed to have economic meaning.
References
Jolliffe (2002), “Principal Component Analysis”
Avellaneda & Lee (2010), “Statistical arbitrage in the US equities market”
Clustering¶
Correlation-based clustering, regime clustering, optimal cluster selection.
Financial clustering methods.
Provides correlation-based asset clustering, market-regime detection, and optimal-cluster-count selection.
- correlation_clustering(returns, n_clusters=None, method='hierarchical')[source]¶
Cluster assets by their return correlations.
Use correlation clustering to group assets that move together, which is useful for portfolio diversification (allocate across clusters), risk management (monitor cluster concentration), and statistical arbitrage (trade within-cluster mean-reversion).
The correlation-based distance is d(i,j) = sqrt(0.5 * (1 - rho_ij)), which maps perfect correlation to distance 0 and perfect negative correlation to distance 1.
- Parameters:
returns (DataFrame) – T x N return matrix (rows = observations, columns = assets).
n_clusters (int | None, default: None) – Number of clusters. If None, the optimal number is chosen automatically (silhouette score for hierarchical, or defaults to 3 for spectral).
method (Literal['hierarchical', 'spectral'], default: 'hierarchical') – Clustering algorithm. Hierarchical uses Ward linkage and produces a dendrogram-compatible linkage matrix. Spectral uses the correlation matrix as affinity and finds clusters via eigenvalue decomposition.
- Returns:
labels (np.ndarray) – Cluster assignment for each asset (0-indexed, length N). Assets with the same label belong to the same cluster.
n_clusters (int) – Number of clusters found or specified.
linkage_matrix (np.ndarray or None) – Linkage matrix (hierarchical only). Pass to scipy.cluster.hierarchy.dendrogram for visualization.
- Return type:
Example
>>> import pandas as pd, numpy as np
>>> np.random.seed(42)
>>> # 3 groups of correlated assets
>>> factor = np.random.randn(252, 3)
>>> returns = pd.DataFrame(
...     np.column_stack([factor[:, i % 3] + np.random.randn(252) * 0.5
...                      for i in range(9)]),
...     columns=[f'asset_{i}' for i in range(9)]
... )
>>> result = correlation_clustering(returns, n_clusters=3)
>>> result['n_clusters']
3
>>> len(result['labels']) == 9
True
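The distance d(i,j) = sqrt(0.5 * (1 - rho_ij)) described above is simple to compute on its own, which can help when inspecting what the clustering step sees. A minimal NumPy sketch with an illustrative function name:

```python
import numpy as np

def correlation_distance(returns: np.ndarray) -> np.ndarray:
    """N x N matrix of d(i,j) = sqrt(0.5 * (1 - rho_ij)) from a T x N return matrix."""
    rho = np.corrcoef(returns, rowvar=False)
    rho = np.clip(rho, -1.0, 1.0)          # guard against round-off outside [-1, 1]
    return np.sqrt(0.5 * (1.0 - rho))

rng = np.random.default_rng(42)
a = rng.normal(size=252)
b = rng.normal(size=252)
# Columns: a, a copy of a, the mirror image of a, and an unrelated series
toy_returns = np.column_stack([a, a, -a, b])
dist = correlation_distance(toy_returns)
```

Here `dist[0, 1]` is approximately 0 (perfect correlation), `dist[0, 2]` is approximately 1 (perfect negative correlation), and uncorrelated pairs sit near sqrt(0.5).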
See also
regime_clustering – Cluster time periods into regimes.
optimal_clusters – Determine optimal cluster count.
wraquant.ml.preprocessing.detoned_correlation – Remove market mode before clustering.
- regime_clustering(features, n_regimes=2, method='gmm')[source]¶
Cluster time periods into market regimes.
Use regime clustering when you want to identify distinct market states (e.g., bull/bear, risk-on/risk-off, high/low volatility) from observable features without a pre-defined model. GMM is preferred because it assigns soft probabilities to each regime; KMeans provides hard assignments only.
- Parameters:
features (DataFrame | ndarray) – Feature matrix where each row is a time observation. Common inputs include rolling volatility, returns, spreads, and VIX.
n_regimes (int, default: 2) – Number of regimes to identify (default 2, typical for risk-on/risk-off).
method (Literal['gmm', 'kmeans'], default: 'gmm') – Clustering algorithm. 'gmm' (Gaussian Mixture Model) provides probabilistic assignments; 'kmeans' provides hard assignments and is faster.
- Returns:
labels (np.ndarray) – Regime assignment for each time period (0-indexed).
n_regimes (int) – Number of regimes.
model (object) – Fitted GaussianMixture or KMeans model. For GMM, call model.predict_proba(X) to get regime probabilities.
- Return type:
Example
>>> import numpy as np, pandas as pd
>>> np.random.seed(42)
>>> vol = np.concatenate([np.random.randn(100) * 0.5 + 0.1,
...                       np.random.randn(100) * 0.5 + 0.3])
>>> features = pd.DataFrame({'vol': vol, 'vol_sq': vol ** 2})
>>> result = regime_clustering(features, n_regimes=2)
>>> result['n_regimes']
2
>>> len(result['labels']) == 200
True
See also
correlation_clustering – Cluster assets (cross-sectional).
optimal_clusters – Find the optimal number of clusters/regimes.
wraquant.regimes – HMM and Markov-switching regime detection.
- optimal_clusters(data, max_k=10, method='silhouette')[source]¶
Determine the optimal number of clusters.
Use this function before calling correlation_clustering or regime_clustering to select the number of clusters data-adaptively rather than guessing.
- Parameters:
max_k (int, default: 10) – Maximum number of clusters to evaluate (default 10).
method (Literal['silhouette', 'bic'], default: 'silhouette') – Selection criterion. 'silhouette' uses the silhouette score with KMeans (higher is better, range [-1, 1]); 'bic' uses the Bayesian Information Criterion with a Gaussian Mixture Model (lower is better). Silhouette is faster; BIC is more principled for probabilistic models.
- Returns:
Optimal number of clusters (between 2 and max_k). Use this value as n_clusters in correlation_clustering or n_regimes in regime_clustering.
- Return type:
Example
>>> import numpy as np
>>> np.random.seed(42)
>>> # Generate data with 3 natural clusters
>>> data = np.vstack([np.random.randn(50, 2) + [0, 0],
...                   np.random.randn(50, 2) + [5, 5],
...                   np.random.randn(50, 2) + [10, 0]])
>>> k = optimal_clusters(data, max_k=6)
>>> 2 <= k <= 6
True
See also
correlation_clustering – Cluster assets by correlation.
regime_clustering – Cluster time periods into regimes.
Evaluation¶
Classification metrics, financial metrics, learning curves, and backtest evaluation of predictions.
Model evaluation utilities for financial machine learning.
Provides both standard classification metrics and finance-specific performance measures such as Sharpe ratio from predictions and backtesting with transaction costs.
- classification_metrics(y_true, y_pred, y_prob=None)[source]¶
Compute standard classification metrics.
Use classification metrics to evaluate direction-prediction models (e.g., predicting up/down/flat labels). These metrics assess the statistical quality of the classifier independently of PnL; pair with financial_metrics for economic evaluation.
- Parameters:
- Returns:
accuracy (float) – Fraction of correct predictions.
precision (float) – Macro-averaged precision (how many predicted positives are actually positive).
recall (float) – Macro-averaged recall (how many actual positives are captured).
f1 (float) – Macro-averaged F1 score (harmonic mean of precision and recall).
log_loss (float, only if y_prob given) – Cross-entropy loss. Lower is better; measures calibration quality.
auc (float, only if y_prob given, binary only) – Area under the ROC curve. 0.5 = random, 1.0 = perfect.
- Return type:
Example
>>> import numpy as np
>>> y_true = np.array([1, 0, 1, 1, 0, 1])
>>> y_pred = np.array([1, 0, 0, 1, 0, 1])
>>> metrics = classification_metrics(y_true, y_pred)
>>> metrics['accuracy']
0.8333333333333334
>>> metrics['f1'] > 0.5
True
See also
financial_metrics – PnL-based evaluation of directional predictions.
backtest_predictions – Full backtest with transaction costs.
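The macro-averaged metrics listed above can be reproduced by hand. The sketch below is a NumPy-only illustration, not the library's implementation: the helper name `macro_classification_metrics` is hypothetical, and it assumes that a class with no predictions (or no true samples) contributes 0 to the average.

```python
import numpy as np

def macro_classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1 (illustrative helper)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    precisions, recalls, f1s = [], [], []
    # Macro averaging: compute each metric per class, then take the unweighted mean.
    for cls in np.unique(np.concatenate([y_true, y_pred])):
        tp = np.sum((y_pred == cls) & (y_true == cls))
        predicted = np.sum(y_pred == cls)
        actual = np.sum(y_true == cls)
        p = tp / predicted if predicted else 0.0  # assumption: undefined -> 0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return {
        "accuracy": float(np.mean(y_true == y_pred)),
        "precision": float(np.mean(precisions)),
        "recall": float(np.mean(recalls)),
        "f1": float(np.mean(f1s)),
    }

# Same labels as the example above: 5 of 6 predictions are correct.
metrics = macro_classification_metrics([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 0, 1])
print(metrics)
```

Because every class counts equally in a macro average, a rare class (for example a "flat" label) can move these numbers far more than it moves accuracy.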
- financial_metrics(y_true, y_pred, returns)[source]¶
Compute finance-specific evaluation metrics from predictions.
Use financial metrics to evaluate whether a model’s predictions translate into actual trading profits. A model can have high accuracy but poor financial performance if it is right on small moves and wrong on large moves. These metrics directly measure economic value.
The predicted labels are interpreted as position signals: 1 for long, -1 for short, 0 for flat.
- Parameters:
- Returns:
strategy_return (float) – Cumulative strategy return (sum of signal * return).
sharpe (float) – Annualised Sharpe ratio (252 trading days). Values above 1.0 are generally considered good; above 2.0 is excellent.
hit_rate (float) – Fraction of periods where predicted sign matches actual sign. A hit rate above 0.5 is necessary but not sufficient for profitability.
profit_factor (float) – Gross profit / gross loss. Values above 1.0 indicate a profitable strategy; above 2.0 is strong.
- Return type:
Example
>>> import numpy as np
>>> y_true = np.array([1, -1, 1, 1, -1])
>>> y_pred = np.array([1, -1, -1, 1, 1])
>>> returns = np.array([0.02, -0.01, 0.015, 0.005, -0.02])
>>> metrics = financial_metrics(y_true, y_pred, returns)
>>> metrics['hit_rate']
0.6
>>> metrics['sharpe'] != 0
True
See also
classification_metrics – Standard ML classification metrics.
backtest_predictions – Full backtest with transaction costs.
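To make the "high accuracy, poor PnL" point concrete, the following NumPy sketch computes the four metrics from a signal and a return series. It is an illustration under stated assumptions rather than the library's code: the helper `signal_metrics` is hypothetical, the risk-free rate is taken as zero, the sample standard deviation (`ddof=1`) is used, and the hit rate compares the signal's sign with the realised return's sign.

```python
import numpy as np

def signal_metrics(signals, returns, periods_per_year=252):
    """PnL-based metrics for position signals in {-1, 0, 1} (illustrative helper)."""
    signals = np.asarray(signals, dtype=float)
    returns = np.asarray(returns, dtype=float)
    pnl = signals * returns                      # per-period strategy return
    std = pnl.std(ddof=1)
    sharpe = float(np.sqrt(periods_per_year) * pnl.mean() / std) if std > 0 else 0.0
    gross_profit = pnl[pnl > 0].sum()
    gross_loss = -pnl[pnl < 0].sum()
    return {
        "strategy_return": float(pnl.sum()),
        "sharpe": sharpe,
        "hit_rate": float(np.mean(np.sign(signals) == np.sign(returns))),
        "profit_factor": float(gross_profit / gross_loss) if gross_loss > 0 else float("inf"),
    }

# Right on three small moves, wrong on the two largest ones.
signals = np.array([1, -1, 1, 1, -1])
returns = np.array([0.02, -0.01, -0.03, 0.005, 0.01])
m = signal_metrics(signals, returns)
print(m)
```

Here the hit rate is 0.6, yet the strategy loses money and the profit factor is below 1, because the two misses fall on the larger moves.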
- learning_curve(model, X, y, train_sizes=None, cv=5)[source]¶
Generate a learning curve for a model.
Use learning curves to diagnose whether a model suffers from high bias (underfitting) or high variance (overfitting). If training and test scores converge at a low value, the model is too simple. If there is a large gap between training and test scores, the model is overfitting and more data or regularisation is needed.
- Parameters:
- Returns:
train_sizes (np.ndarray) – Absolute number of training samples at each point.
train_scores (np.ndarray, shape (len(sizes), cv)) – Training scores at each size/fold. Plot the mean across folds to visualize training performance.
test_scores (np.ndarray, shape (len(sizes), cv)) – Test scores at each size/fold. The gap between train and test mean scores indicates overfitting.
- Return type:
Example
>>> from sklearn.linear_model import Ridge
>>> import numpy as np
>>> X = np.random.randn(300, 5)
>>> y = X @ [1, 0.5, 0, 0, 0] + np.random.randn(300) * 0.1
>>> result = learning_curve(Ridge(), X, y, cv=3)
>>> result['train_sizes'].shape[0]  # 10 points by default
10
See also
classification_metrics – Evaluate classification quality.
financial_metrics – Evaluate economic value of predictions.
- backtest_predictions(predictions, returns, cost_bps=10)[source]¶
Backtest a prediction signal against actual returns.
Use backtest_predictions as a quick sanity check of a model’s economic value before building a full backtest. It applies realistic transaction costs (proportional to position changes) and computes key performance metrics including Sharpe, max drawdown, and turnover.
- Parameters:
predictions (Series|ndarray) – Predicted position signals (e.g. 1, 0, -1). The signal is applied as a position: signal * return.
returns (Series|ndarray) – Actual period returns corresponding to each prediction.
cost_bps (float, default:10) – Transaction cost in basis points applied on each position change (default 10 bps). For equities, 5-10 bps is typical; for futures, 1-3 bps.
- Returns:
gross_returns (np.ndarray) – Per-period strategy returns before costs.
net_returns (np.ndarray) – Per-period strategy returns after costs.
cumulative_return (float) – Total cumulative net return. Positive = profitable.
sharpe (float) – Annualised Sharpe ratio of net returns. Above 1.0 is generally good; above 2.0 is excellent.
max_drawdown (float) – Maximum peak-to-trough decline in cumulative PnL. Always negative or zero.
turnover (float) – Mean absolute position change per period. Higher turnover means higher transaction costs.
- Return type:
Example
>>> import numpy as np
>>> preds = np.array([1, 1, -1, 1, -1, 0, 1])
>>> rets = np.array([0.01, -0.005, -0.02, 0.015, 0.01, 0.005, 0.008])
>>> result = backtest_predictions(preds, rets, cost_bps=10)
>>> result['cumulative_return'] != 0
True
>>> result['max_drawdown'] <= 0
True
See also
financial_metrics – Quick financial metrics without transaction costs.
wraquant.ml.pipeline.walk_forward_backtest – Walk-forward backtest.
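How proportional costs enter the calculation can be sketched in a few lines of NumPy. This is a hedged illustration, not the library's implementation: the helper `simple_cost_backtest` is hypothetical, the strategy is assumed to start flat (so the first trade is charged), PnL is accumulated additively rather than compounded, and drawdown is measured against the running peak of that additive equity curve.

```python
import numpy as np

def simple_cost_backtest(signals, returns, cost_bps=10.0, periods_per_year=252):
    """Signal backtest with costs proportional to position changes (illustrative helper)."""
    signals = np.asarray(signals, dtype=float)
    returns = np.asarray(returns, dtype=float)
    gross = signals * returns
    # Absolute position change per period; prepend=0 assumes we start flat.
    trades = np.abs(np.diff(signals, prepend=0.0))
    net = gross - trades * cost_bps / 1e4        # 1 bp = 0.0001
    equity = np.cumsum(net)                      # additive (non-compounded) PnL
    # Running peak includes the starting equity of 0.
    running_peak = np.maximum.accumulate(np.concatenate([[0.0], equity]))[1:]
    std = net.std(ddof=1)
    return {
        "gross_returns": gross,
        "net_returns": net,
        "cumulative_return": float(equity[-1]),
        "sharpe": float(np.sqrt(periods_per_year) * net.mean() / std) if std > 0 else 0.0,
        "max_drawdown": float((equity - running_peak).min()),
        "turnover": float(trades.mean()),
    }

preds = np.array([1, 1, -1, 1, -1, 0, 1])
rets = np.array([0.01, -0.005, -0.02, 0.015, 0.01, 0.005, 0.008])
result = simple_cost_backtest(preds, rets, cost_bps=10)
print(result["cumulative_return"], result["max_drawdown"], result["turnover"])
```

In this example the signal trades 9 units of position over 7 periods, so costs of 10 bps per unit remove 0.009 from a gross return of 0.038.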
Online Learning¶
Incrementally updating models for streaming data.
Online (streaming) machine learning for quantitative finance.
Provides recursive and weighted regression algorithms that update incrementally with each new observation, enabling real-time tracking of time-varying relationships in financial data.
These algorithms require only numpy and pandas – no optional dependencies.
- online_linear_regression(X, y, forgetting_factor=1.0, initial_covariance=100.0)[source]¶
Recursive Least Squares (RLS) online linear regression.
Processes observations one at a time, updating regression coefficients with each new data point. This is the online analogue of ordinary least squares and is fundamental to adaptive signal processing in finance: tracking time-varying betas, hedge ratios, and factor loadings.
- When to use:
Use online regression when you need to:
- Track a hedge ratio that drifts over time (pairs trading).
- Estimate time-varying factor exposures (rolling beta).
- Build adaptive trading signals that respond to regime changes.
- Process streaming tick data without re-estimating from scratch.
- Mathematical background:
- Recursive Least Squares maintains:
P_t = (1/lambda) * (P_{t-1} - K_t x_t^T P_{t-1})
K_t = P_{t-1} x_t / (lambda + x_t^T P_{t-1} x_t)
w_t = w_{t-1} + K_t (y_t - x_t^T w_{t-1})
where:
- w_t is the coefficient vector at time t
- P_t is the inverse of the exponentially weighted input Gram matrix X^T X, which is proportional to the covariance of the coefficient estimates
- K_t is the gain vector (the RLS analogue of the Kalman gain)
- lambda is the forgetting factor (1 = no forgetting, <1 = down-weight old data)
With lambda = 1, RLS reproduces the OLS estimate on the data seen so far, up to the regularising effect of the initial covariance P_0, which vanishes as more data arrive. With lambda < 1, the effective window length is approximately 1 / (1 - lambda) observations.
- Parameters:
X (DataFrame|ndarray) – Feature matrix of shape (T, p) where T is the number of observations and p is the number of features.
forgetting_factor (float, default:1.0) – Forgetting factor lambda in (0, 1]. Values close to 1 give long memory; values like 0.99 give an effective window of ~100 observations. Use 0.95-0.99 for fast-adapting signals.
initial_covariance (float, default:100.0) – Scalar multiplier for the initial covariance matrix P_0 = c * I. Larger values make the filter more responsive early on.
- Returns:
coefficients: np.ndarray of shape (T, p) – the time-varying coefficient vector at each step,
predictions: np.ndarray of shape (T,) – one-step-ahead predictions (each y_hat_t uses coefficients estimated from data up to t-1),
residuals: np.ndarray of shape (T,) – prediction errors,
final_coefficients: np.ndarray of shape (p,) – the coefficients at the last time step.
- Return type:
Example
>>> import numpy as np
>>> np.random.seed(42)
>>> T = 500
>>> X = np.random.randn(T, 2)
>>> # True coefficients shift halfway through
>>> beta_true = np.where(np.arange(T)[:, None] < 250,
...                      [1.0, 0.5], [0.5, 1.0])
>>> y = np.sum(X * beta_true, axis=1) + np.random.randn(T) * 0.1
>>> result = online_linear_regression(X, y, forgetting_factor=0.98)
>>> result["coefficients"].shape
(500, 2)
>>> # After convergence, coefficients should track the true values
>>> np.abs(result["final_coefficients"][0] - 0.5) < 0.3
True
The forgetting factor is critical: too low causes noisy estimates, too high causes slow adaptation to regime changes.
RLS assumes the noise variance is constant; for heteroskedastic data, consider the exponential weighted variant or Kalman filters.
Initial predictions (before the filter converges) should be discarded in any evaluation.
References
Haykin (2002), “Adaptive Filter Theory”, Ch. 13 (RLS)
Montana et al. (2009), “Flexible least squares for temporal data mining and statistical arbitrage”
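The three recursions in the mathematical background translate almost line for line into NumPy. The sketch below is a minimal illustration, not the library's implementation: the function name `rls_sketch` is hypothetical, the initial coefficient vector is assumed to be zero, and P is assumed to stay symmetric so that K x^T P can be written as an outer product of K and P x.

```python
import numpy as np

def rls_sketch(X, y, forgetting_factor=1.0, initial_covariance=100.0):
    """Recursive least squares; returns (coefficient path, one-step-ahead predictions)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    T, p = X.shape
    lam = forgetting_factor
    w = np.zeros(p)                          # assumption: w_0 = 0
    P = initial_covariance * np.eye(p)       # P_0 = c * I
    coefs = np.empty((T, p))
    preds = np.empty(T)
    for t in range(T):
        x = X[t]
        preds[t] = x @ w                     # uses w_{t-1}: no lookahead
        Px = P @ x
        K = Px / (lam + x @ Px)              # gain K_t
        w = w + K * (y[t] - preds[t])        # coefficient update
        P = (P - np.outer(K, Px)) / lam      # P_t update (P symmetric)
        coefs[t] = w
    return coefs, preds

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.0, 0.5]) + 0.1 * rng.normal(size=200)

# With lambda = 1 and a large P_0, the final estimate should be close to OLS.
coefs, preds = rls_sketch(X, y, forgetting_factor=1.0, initial_covariance=1e4)
ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coefs[-1], ols)

# Effective window implied by a forgetting factor of 0.99: about 100 observations.
effective_window = 1.0 / (1.0 - 0.99)
```

Each update costs O(p^2) operations and needs only the previous `w` and `P`, which is what makes the method suitable for streaming data.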
- exponential_weighted_regression(X, y, halflife=63.0, min_periods=30)[source]¶
Exponentially weighted linear regression favouring recent data.
At each time step t, fits a weighted least squares regression where observation weights decay exponentially into the past. This produces smooth, adaptive coefficient estimates that naturally respond to regime changes without the abrupt sensitivity of rolling-window OLS.
- When to use:
Use exponential weighted regression when:
- You want smoother coefficient paths than RLS.
- The halflife of predictive relationships is approximately known (e.g., 63 trading days ~ 3 months).
- You need an interpretable “recency bias” in your factor model.
- Mathematical background:
- At time t, the weight for observation s (where s <= t) is:
w_s = exp(-ln(2) * (t - s) / halflife)
- The weighted regression solves:
beta_t = (X_t^T W_t X_t)^{-1} X_t^T W_t y_t
where W_t = diag(w_0, w_1, …, w_t). This is equivalent to EWMA smoothing of the sufficient statistics X^T X and X^T y.
- Parameters:
halflife (float, default:63.0) – Halflife in observations. After halflife observations, the weight of a past data point has decayed to 50%. Common financial values: 21 (1 month), 63 (1 quarter), 252 (1 year).
min_periods (int, default:30) – Minimum number of observations before producing a coefficient estimate. Earlier entries are filled with NaN.
- Returns:
coefficients: np.ndarray of shape (T, p) – time-varying coefficients (NaN for the first min_periods - 1 rows),
predictions: np.ndarray of shape (T,) – fitted values using contemporaneous coefficients,
residuals: np.ndarray of shape (T,) – prediction errors,
final_coefficients: np.ndarray of shape (p,) – last estimated coefficients.
- Return type:
Example
>>> import numpy as np
>>> np.random.seed(0)
>>> T = 300
>>> X = np.random.randn(T, 2)
>>> beta_true = np.column_stack([
...     np.linspace(1, 0, T),  # drifting coefficient
...     np.full(T, 0.5),       # constant coefficient
... ])
>>> y = np.sum(X * beta_true, axis=1) + np.random.randn(T) * 0.1
>>> result = exponential_weighted_regression(X, y, halflife=60)
>>> result["coefficients"].shape
(300, 2)
Halflife selection is subjective; cross-validate if possible.
For very short halflives (<10), the effective sample size is small and estimates become noisy.
Assumes homoskedastic errors; for heteroskedastic data, consider EWMA-weighted robust regression.
Numerically less stable than RLS for ill-conditioned problems.
References
Ross et al. (2012), “Exponentially weighted moving average charts for detecting concept drift”
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 17
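The weighting scheme and the weighted normal equations from the mathematical background can be checked with a short NumPy sketch. It is an illustration only: the helpers `ew_weights` and `ew_regression_at_end` are hypothetical names, and the fit is computed only at the final time step, whereas the function documented above produces a coefficient for every t.

```python
import numpy as np

def ew_weights(n_obs, halflife):
    """w_s = exp(-ln(2) * (t - s) / halflife) for s = 0..t, with t = n_obs - 1."""
    lags = np.arange(n_obs - 1, -1, -1)      # t - s: oldest observation first
    return np.exp(-np.log(2.0) * lags / halflife)

def ew_regression_at_end(X, y, halflife):
    """Weighted least squares at the last time step (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    sw = np.sqrt(ew_weights(len(y), halflife))
    # Scaling rows by sqrt(w) turns WLS into an ordinary least squares problem.
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

# An observation exactly one halflife old carries half the weight of the newest one.
w = ew_weights(64, halflife=63)

rng = np.random.default_rng(0)
T = 300
X = rng.normal(size=(T, 2))
beta_true = np.column_stack([np.linspace(1, 0, T), np.full(T, 0.5)])
y = np.sum(X * beta_true, axis=1) + 0.1 * rng.normal(size=T)

beta_ew = ew_regression_at_end(X, y, halflife=20)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_ew, beta_ols)
```

The first true coefficient ends at 0. The equally weighted OLS fit averages over the whole drift, while the exponentially weighted fit follows the recent values and should land much closer to 0.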
Pipeline¶
FinancialPipeline, walk-forward backtest, and SHAP integration.
Financial ML pipeline utilities.
Provides chronology-aware pipeline wrappers, walk-forward backtesting with PnL tracking, and SHAP-based feature importance – all designed to prevent data leakage that is rampant in naive ML-for-finance workflows.
- class FinancialPipeline[source]¶
Bases: object
Sklearn Pipeline wrapper that enforces chronological splitting.
A standard sklearn Pipeline evaluated with cross_val_score uses plain K-Fold, which trains on observations that come after each test fold and therefore leaks future information into the training set. FinancialPipeline wraps an sklearn Pipeline and replaces all cross-validation with purged K-fold that respects time ordering and applies an embargo window to prevent information leakage through overlapping labels.
- Parameters:
steps (list[tuple[str, Any]]) – List of (name, transform) tuples defining the pipeline, identical to the steps parameter of sklearn.pipeline.Pipeline.
n_splits (int, default:5) – Number of folds for purged K-fold cross-validation.
embargo_pct (float, default:0.01) – Fraction of total samples to embargo after each test fold, preventing label leakage from overlapping targets.
Example
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.linear_model import Ridge
>>> import numpy as np
>>> X = np.random.randn(500, 5)
>>> y = X @ np.array([1, 0.5, 0, 0, 0]) + np.random.randn(500) * 0.1
>>> pipe = FinancialPipeline(
...     steps=[('scaler', StandardScaler()), ('ridge', Ridge())],
...     n_splits=5,
... )
>>> result = pipe.fit_evaluate(X, y)
>>> len(result['fold_scores']) == 5
True
References
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 7
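The effect of the embargo on the train/test indices can be sketched with NumPy alone. This is a simplified illustration and not the class's internal splitter: the generator `embargoed_kfold` is a hypothetical name, and it applies only the embargo after each test fold. Full purging also drops training samples whose label windows overlap the test fold, which requires label end times that are not modelled here.

```python
import numpy as np

def embargoed_kfold(n_samples, n_splits=5, embargo_pct=0.01):
    """Yield (train_idx, test_idx) for contiguous folds with an embargo after each test fold."""
    embargo = int(n_samples * embargo_pct)   # number of samples to skip after the test fold
    indices = np.arange(n_samples)
    for test_idx in np.array_split(indices, n_splits):
        start, stop = test_idx[0], test_idx[-1] + 1
        # Keep everything before the test fold and everything after the embargo window.
        keep = (indices < start) | (indices >= stop + embargo)
        yield indices[keep], test_idx

folds = list(embargoed_kfold(500, n_splits=5, embargo_pct=0.01))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx[0], test_idx[-1])
```

With 500 samples and a 1% embargo, each fold drops the 5 observations that immediately follow its test block. The last fold has nothing after it, so it keeps all 400 earlier samples.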
- fit(X, y)[source]¶
Fit the pipeline on the full dataset.
- Parameters:
- Returns:
Self, for method chaining.
- Return type:
- walk_forward_backtest(model, X, y, train_size=252, test_size=21, step_size=21, expanding=True)[source]¶
Full walk-forward ML backtest with PnL tracking.
Walk-forward validation is the gold standard for evaluating ML models in finance because it mirrors real trading: train on historical data, predict the next period, observe actual outcome, then advance.
- Why walk-forward instead of standard cross-validation?
Standard K-Fold CV ignores time ordering (and, when shuffling is enabled, scatters observations at random), so most folds train on data that postdate the test set and the model can “peek” at the future. In finance, this creates massive upward bias in performance estimates. Walk-forward enforces strict temporal ordering: the model only ever trains on data that would have been available at the time of prediction.
The function supports both expanding windows (training set grows over time, using all available history) and rolling windows (fixed-size training window that slides forward). Expanding windows are preferred when you believe the data-generating process is stable; rolling windows are better when you expect structural breaks or regime changes.
- Parameters:
model (Any) – A scikit-learn-compatible estimator with fit and predict.
y (Series|ndarray) – Target vector (typically forward returns for PnL calculation).
train_size (int, default:252) – Number of training observations in the initial window.
test_size (int, default:21) – Number of test observations per fold.
step_size (int, default:21) – Number of observations to advance between folds.
expanding (bool, default:True) – If True, the training window expands over time. If False, a rolling window of fixed train_size is used.
- Returns:
predictions: np.ndarray of concatenated out-of-sample predictions,
actuals: np.ndarray of corresponding true values,
pnl: np.ndarray of per-period PnL (prediction * actual, assuming long when prediction > 0),
sharpe: float annualised Sharpe ratio of the PnL series (assuming 252 trading days),
hit_rate: float fraction of periods where prediction sign matches actual sign,
equity_curve: np.ndarray cumulative PnL.
- Return type:
Example
>>> from sklearn.linear_model import Ridge
>>> import numpy as np
>>> np.random.seed(42)
>>> X = np.random.randn(600, 5)
>>> y = X @ np.array([0.5, 0.3, 0, 0, 0]) + np.random.randn(600) * 0.5
>>> result = walk_forward_backtest(Ridge(), X, y, train_size=200, test_size=20)
>>> len(result['predictions']) > 0
True
>>> 'sharpe' in result
True
References
Lopez de Prado (2018), “Advances in Financial Machine Learning”, Ch. 12
Bailey et al. (2014), “The Deflated Sharpe Ratio”
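The difference between expanding and rolling windows is easiest to see in the index sets each fold uses. The sketch below is illustrative only: `walk_forward_splits` is a hypothetical helper whose parameter names mirror the signature above, and it assumes that a final fold with fewer than `test_size` observations is dropped.

```python
import numpy as np

def walk_forward_splits(n_samples, train_size=252, test_size=21, step_size=21, expanding=True):
    """Return a list of (train_idx, test_idx) pairs in chronological order."""
    splits = []
    test_start = train_size
    while test_start + test_size <= n_samples:   # assumption: drop an incomplete last fold
        # Expanding: train on all history. Rolling: train on the last train_size points.
        train_start = 0 if expanding else test_start - train_size
        train_idx = np.arange(train_start, test_start)
        test_idx = np.arange(test_start, test_start + test_size)
        splits.append((train_idx, test_idx))
        test_start += step_size
    return splits

expanding = walk_forward_splits(600, train_size=200, test_size=20, step_size=20, expanding=True)
rolling = walk_forward_splits(600, train_size=200, test_size=20, step_size=20, expanding=False)

print(len(expanding), len(expanding[0][0]), len(expanding[-1][0]))
print(len(rolling), len(rolling[0][0]), len(rolling[-1][0]))
```

In both modes every training index precedes every test index of the same fold, which is the property that rules out lookahead. Only the amount of history used differs.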
- feature_importance_shap(model, X, feature_names=None, max_samples=500)[source]¶
Compute SHAP-based feature importance for any sklearn model.
SHAP (SHapley Additive exPlanations) values provide a theoretically grounded decomposition of each prediction into per-feature contributions. Unlike impurity-based importance (MDI), SHAP values are consistent and account for feature interactions.
- Parameters:
model (Any) – A fitted scikit-learn-compatible estimator.
X (DataFrame|ndarray) – Feature matrix to explain (typically the test set).
feature_names (Optional[Sequence[str]], default:None) – Feature names. If None and X is a DataFrame, column names are used.
max_samples (int, default:500) – Maximum number of samples to use for computing SHAP values. Subsampled if X has more rows than this.
- Returns:
shap_values: np.ndarray of shape (n_samples, n_features) containing per-sample SHAP values,
feature_importance: np.ndarray of shape (n_features,) giving mean absolute SHAP value per feature (sorted descending),
feature_names: list of feature names ordered by importance.
- Return type:
- Raises:
MissingDependencyError – If shap is not installed.
Example
>>> from sklearn.ensemble import RandomForestRegressor
>>> import numpy as np
>>> np.random.seed(42)
>>> X = np.random.randn(200, 5)
>>> y = X[:, 0] * 2 + X[:, 1] + np.random.randn(200) * 0.1
>>> model = RandomForestRegressor(n_estimators=50, random_state=42)
>>> model.fit(X, y)
RandomForestRegressor(n_estimators=50, random_state=42)
>>> result = feature_importance_shap(model, X)
>>> result["shap_values"].shape[1] == 5
True
References
Lundberg & Lee (2017), “A Unified Approach to Interpreting Model Predictions”
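For intuition about what the returned arrays contain, SHAP values have a closed form for a linear model: feature i contributes w_i * (x_i - mean(x_i)) to each prediction. The sketch below uses that special case with NumPy only. It does not call the shap package or the function documented above, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)

# Fit a linear model with intercept by ordinary least squares.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w = coef[1:]
predictions = A @ coef

# Closed-form SHAP values for a linear model, shape (n_samples, n_features).
shap_values = w * (X - X.mean(axis=0))

# Global importance: mean absolute SHAP value per feature, sorted descending.
importance = np.abs(shap_values).mean(axis=0)
order = np.argsort(importance)[::-1]
print(order, importance[order])
```

The per-sample values sum to the prediction minus the average prediction, which is the additivity property that makes SHAP a decomposition of each individual forecast. Here features 0 and 1 rank first and second, matching how `y` was generated.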