Data (wraquant.data)

Data fetching, cleaning, validation, transformation, and caching. Supports yfinance, FRED, and NASDAQ Data Link as data sources, with comprehensive cleaning pipelines for handling missing values, outliers, corporate actions, and calendar alignment.

Quick Example

from wraquant.data import fetch_prices

# Fetch daily closing prices for a single symbol (fetch_ohlcv returns full OHLCV bars)
prices = fetch_prices("AAPL", start="2020-01-01")
print(prices.head())

# Data is automatically cleaned: forward-filled, split-adjusted,
# with trading calendar alignment.

API Reference

Data fetching, cleaning, validation, transforms, and caching for financial data.

Provides a unified API for fetching prices, macroeconomic data, and other financial time series from multiple sources (Yahoo Finance, FRED, NASDAQ Data Link, CSV files), plus cleaning, validation, and transformation utilities. This module is the primary entry point for getting data into wraquant and preparing it for downstream analysis.

Key sub-modules:

  • Fetching (loaders) – fetch_prices retrieves closing prices, fetch_ohlcv returns full OHLCV bars, and fetch_macro retrieves macroeconomic indicators. A pluggable ProviderRegistry allows custom data sources.

  • Cleaning (cleaning, cleaning_advanced) – Handle missing values (fill_missing), outliers (detect_outliers, remove_outliers, winsorize), duplicates, split/dividend adjustments, OHLCV resampling, fuzzy merges, and date parsing.

  • Validation (validation, validation_advanced) – Check data quality with validate_ohlcv, validate_returns, check_completeness, check_staleness, and data_quality_report. Schema-based validation via pandera is available through pandera_validate.

  • Transforms (transforms) – Convert between prices and returns (to_returns, to_prices), compute excess returns, normalize prices, and calculate rolling/expanding z-scores.

Example

>>> from wraquant.data import fetch_prices, validate_ohlcv, to_returns
>>> prices = fetch_prices("AAPL", start="2020-01-01")
>>> report = validate_ohlcv(prices)
>>> returns = to_returns(prices["close"])

Use wraquant.data for all data ingestion and preparation. For file- and database-based I/O (Parquet, HDF5, SQL, cloud storage), use wraquant.io instead. The data module feeds cleaned data into wraquant.stats, wraquant.ts, wraquant.backtest, and all other analytical modules.

class DataProvider[source]

Bases: ABC

Abstract base class for all data providers.

Subclasses must implement fetch_prices and declare their name.

name: str = ''
abstractmethod fetch_prices(symbol, start=None, end=None, **kwargs)[source]

Fetch closing prices for a symbol.

Parameters:
Return type:

Series

Returns:

Price series with DatetimeIndex.

fetch_ohlcv(symbol, start=None, end=None, **kwargs)[source]

Fetch OHLCV data for a symbol.

Parameters:
Return type:

DataFrame

Returns:

DataFrame with open, high, low, close, volume columns.

fetch_macro(series_id, start=None, end=None, **kwargs)[source]

Fetch macroeconomic data series.

Parameters:
Return type:

Series

Returns:

Macro data series with DatetimeIndex.

class ProviderRegistry[source]

Bases: object

Registry for data providers.

Allows registering, retrieving, and listing data providers by name.

__init__()[source]
Return type:

None

register(provider, *, default=False)[source]

Register a data provider.

Parameters:
  • provider (DataProvider) – DataProvider instance to register.

  • default (bool, default: False) – If True, make this the default provider.

Return type:

None

get(name=None)[source]

Get a provider by name, or the default.

Parameters:

name (str | None, default: None) – Provider name. None returns the default.

Return type:

DataProvider

Returns:

The requested DataProvider.

Raises:

KeyError – If the provider is not registered.

list_providers()[source]

List all registered provider names.

Return type:

list[str]

Returns:

List of provider name strings.

property default: str | None

Name of the default provider.
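
The provider pattern described above can be sketched in plain Python. This is a minimal stand-in, not wraquant's implementation: SketchProvider, SketchRegistry, and ConstantProvider are hypothetical names, prices are returned as a dict rather than a Series, and making the first registered provider the default is an assumption.

```python
from abc import ABC, abstractmethod


class SketchProvider(ABC):
    """Stand-in for DataProvider: subclasses declare a name and fetch_prices."""

    name: str = ""

    @abstractmethod
    def fetch_prices(self, symbol, start=None, end=None, **kwargs):
        ...


class SketchRegistry:
    """Stand-in for ProviderRegistry with the same public members."""

    def __init__(self):
        self._providers = {}
        self._default = None

    def register(self, provider, *, default=False):
        self._providers[provider.name] = provider
        # Assumption: the first registered provider becomes the default.
        if default or self._default is None:
            self._default = provider.name

    def get(self, name=None):
        key = self._default if name is None else name
        if key not in self._providers:
            raise KeyError(key)
        return self._providers[key]

    def list_providers(self):
        return list(self._providers)

    @property
    def default(self):
        return self._default


class ConstantProvider(SketchProvider):
    """Hypothetical provider that returns a fixed price for any symbol."""

    name = "constant"

    def fetch_prices(self, symbol, start=None, end=None, **kwargs):
        return {"2020-01-02": 100.0, "2020-01-03": 100.0}


registry = SketchRegistry()
registry.register(ConstantProvider(), default=True)
prices = registry.get().fetch_prices("AAPL")
```

Calling get() with no argument resolves to the default provider, while an unknown name raises KeyError, matching the behaviour documented for ProviderRegistry.get.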

fetch_prices(symbol, start=None, end=None, source=None, **kwargs)[source]

Fetch closing prices for a symbol.

Retrieves a daily close price series from the specified data provider. The default provider is determined by the registry (typically Yahoo Finance for equities).

Parameters:
  • symbol (str) – Ticker symbol (e.g., 'AAPL', 'EURUSD=X', 'BTC-USD').

  • start (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – Start date (string, datetime, or pandas Timestamp). None fetches from the earliest available date.

  • end (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – End date. None fetches up to today.

  • source (str | None, default: None) – Provider name (e.g., 'yahoo', 'fred'). None uses the default provider.

  • **kwargs (Any) – Additional keyword arguments forwarded to the provider’s fetch_prices method.

Returns:

Price series with a DatetimeIndex.

Return type:

Series

Raises:

DataFetchError – If the provider fails to fetch the data.

Example

>>> prices = fetch_prices("AAPL", start="2020-01-01")

See also

  • fetch_ohlcv – Fetch full OHLCV data.

  • fetch_macro – Fetch macroeconomic series from FRED.

fetch_ohlcv(symbol, start=None, end=None, source=None, **kwargs)[source]

Fetch OHLCV (Open, High, Low, Close, Volume) data for a symbol.

Returns a DataFrame with standard column names suitable for backtesting, technical analysis, and charting.

Parameters:
  • symbol (str) – Ticker symbol (e.g., 'AAPL').

  • start (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – Start date. None fetches from the earliest available date.

  • end (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – End date. None fetches up to today.

  • source (str | None, default: None) – Provider name. None uses the default provider.

  • **kwargs (Any) – Additional keyword arguments forwarded to the provider.

Returns:

DataFrame with columns open, high, low, close, volume and a DatetimeIndex.

Return type:

DataFrame

Raises:

DataFetchError – If the provider fails to fetch the data.

Example

>>> df = fetch_ohlcv("AAPL", start="2020-01-01")

See also

  • fetch_prices – Fetch close prices only (lighter weight).

  • fetch_macro – Fetch macroeconomic series.

fetch_macro(series_id, start=None, end=None, source='fred', **kwargs)[source]

Fetch macroeconomic data series.

Retrieves economic indicators from FRED (Federal Reserve Economic Data) or other macro data providers. Common series include GDP, unemployment rate (UNRATE), federal funds rate (DFF), CPI, and Treasury yields.

Parameters:
  • series_id (str) – Series identifier (e.g., 'GDP', 'UNRATE', 'DFF', 'T10Y2Y').

  • start (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – Start date. None fetches the full history.

  • end (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – End date. None fetches up to the latest available release.

  • source (str, default: 'fred') – Provider name (default 'fred').

  • **kwargs (Any) – Additional keyword arguments forwarded to the provider.

Returns:

Macro data series with a DatetimeIndex.

Return type:

Series

Raises:

DataFetchError – If the provider fails to fetch the data.

Example

>>> gdp = fetch_macro("GDP", source="fred")

See also

  • fetch_prices – Fetch asset prices.

  • fetch_ohlcv – Fetch OHLCV bar data.

list_providers()[source]

List all available data providers.

Returns the names of all registered providers (e.g., 'yahoo', 'fred', 'nasdaq', 'csv'). The list depends on which optional dependencies are installed.

Returns:

List of registered provider names.

Return type:

list[str]

Example

>>> providers = list_providers()
>>> "yahoo" in providers
True

align_series(*series, method='inner')[source]

Align multiple series to a common index.

Parameters:
  • *series (Series) – Two or more series to align.

  • method (Literal['inner', 'outer'], default: 'inner') – Join method. 'inner' keeps only dates present in all series; 'outer' keeps all dates (filling gaps with NaN).

Returns:

Aligned series sharing the same index.

Return type:

tuple[Series, ...]
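
The difference between the two join methods can be illustrated without pandas. The sketch below models each series as a dict mapping ISO date strings to values; align_sketch is a hypothetical helper, not the library function.

```python
def align_sketch(*series, method="inner"):
    """Align dict-based series (date -> value) to one shared, sorted index."""
    key_sets = [set(s) for s in series]
    if method == "inner":
        index = set.intersection(*key_sets)
    elif method == "outer":
        index = set.union(*key_sets)
    else:
        raise ValueError(f"unknown method: {method!r}")
    ordered = sorted(index)
    # Dates missing from a series become NaN, as in an outer join.
    return tuple({d: s.get(d, float("nan")) for d in ordered} for s in series)


a = {"2020-01-02": 1.0, "2020-01-03": 2.0}
b = {"2020-01-03": 20.0, "2020-01-06": 30.0}

inner_a, inner_b = align_sketch(a, b, method="inner")
outer_a, outer_b = align_sketch(a, b, method="outer")
```

Here the inner join keeps only 2020-01-03, the one date both series share, while the outer join keeps all three dates and fills the gaps with NaN.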

detect_outliers(data, method='zscore', threshold=3.0)[source]

Flag rows that contain outlier values.

Parameters:
  • data (DataFrame | Series) – Input data.

  • method (Literal['zscore', 'iqr', 'mad'], default: 'zscore') – Detection method.

  • threshold (float, default: 3.0) – Sensitivity threshold.

Returns:

Boolean series with True for outlier rows.

Return type:

Series

fill_missing(data, method='ffill', limit=None)[source]

Fill or remove missing values.

Parameters:
  • data (DataFrame | Series) – Input data possibly containing NaN values.

  • method (Literal['ffill', 'bfill', 'interpolate', 'drop'], default: 'ffill') – Strategy for handling missing values.

  • limit (int | None, default: None) – Maximum number of consecutive NaN values to fill. Only used with 'ffill', 'bfill', and 'interpolate'.

Returns:

Data with missing values handled.

Return type:

DataFrame | Series
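
To show how limit interacts with 'ffill', here is a plain-Python sketch of a limited forward fill. It assumes limit follows the pandas convention of filling only the first limit values of each gap; ffill_sketch is a hypothetical helper.

```python
import math


def ffill_sketch(values, limit=None):
    """Forward-fill NaN values, filling at most `limit` consecutive gaps."""
    filled = []
    last_valid = math.nan
    run = 0  # consecutive NaN values seen since the last valid observation
    for v in values:
        if math.isnan(v):
            run += 1
            if limit is None or run <= limit:
                filled.append(last_valid)
            else:
                filled.append(math.nan)
        else:
            last_valid = v
            run = 0
            filled.append(v)
    return filled


raw = [1.0, math.nan, math.nan, math.nan, 5.0]
unlimited = ffill_sketch(raw)
limited = ffill_sketch(raw, limit=2)
```

With limit=2 the third missing value in the gap stays NaN, which keeps a long outage from being papered over with a stale price.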

handle_splits_dividends(prices, splits=None, dividends=None)[source]

Adjust a price series for stock splits and dividends.

Parameters:
  • prices (Series) – Raw (unadjusted) price series indexed by date.

  • splits (Series | None, default: None) – Split ratios indexed by date. A 2-for-1 split is represented as 2.0. Dates not present in prices are ignored.

  • dividends (Series | None, default: None) – Cash dividend amounts indexed by ex-date.

Returns:

Adjusted price series.

Return type:

Series
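
As an illustration of the split ratio convention, the sketch below back-adjusts a price series for a 2-for-1 split. It assumes the common convention of dividing every price before the split date by the ratio; whether handle_splits_dividends adjusts backwards or forwards is not stated here. Dividends are left out, and the prices are made up.

```python
def adjust_for_splits_sketch(prices, splits):
    """Back-adjust prices: divide every price before a split date by its ratio."""
    adjusted = dict(prices)
    for split_date, ratio in splits.items():
        if split_date not in prices:
            continue  # mirrors "dates not present in prices are ignored"
        for d in adjusted:
            if d < split_date:  # ISO date strings sort chronologically
                adjusted[d] /= ratio
    return adjusted


raw = {"2021-03-01": 100.0, "2021-03-02": 102.0, "2021-03-03": 52.0}
adjusted = adjust_for_splits_sketch(raw, {"2021-03-03": 2.0})
```

After adjustment the series reads 50.0, 51.0, 52.0, so the apparent 49 % drop on the split date disappears.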

remove_duplicates(data, keep='last')[source]

Remove duplicate index entries.

Parameters:
  • data (DataFrame) – Data whose index may contain duplicates.

  • keep (Literal['first', 'last', False], default: 'last') – Which duplicate to keep.

Returns:

Data with unique index values.

Return type:

DataFrame

remove_outliers(data, method='zscore', threshold=3.0)[source]

Remove rows containing outlier values from the data.

Parameters:
  • data (DataFrame | Series) – Input data with a DatetimeIndex.

  • method (Literal['zscore', 'iqr', 'mad'], default: 'zscore') – Outlier detection method.

  • threshold (float, default: 3.0) – Sensitivity threshold. For z-score and MAD this is the number of standard deviations; for IQR it is the multiplier applied to the interquartile range.

Returns:

Data with outlier rows removed.

Return type:

DataFrame | Series
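
The role of threshold under the 'zscore' and 'iqr' methods can be sketched with the standard library. The formulas below (sample standard deviation, inclusive quartiles) are assumptions about reasonable defaults, not a copy of the library's code, and the return values are made up.

```python
import statistics


def zscore_flags(values, threshold=3.0):
    """Flag values more than `threshold` sample standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [abs(v - mean) / std > threshold for v in values]


def iqr_flags(values, threshold=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] with k = threshold."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - threshold * spread, q3 + threshold * spread
    return [v < low or v > high for v in values]


returns = [0.01, -0.02, 0.00, 0.015, -0.01, 0.005, -0.005, 0.02, -0.015, 0.60]
z_flags = zscore_flags(returns, threshold=2.0)
i_flags = iqr_flags(returns, threshold=1.5)
cleaned = [v for v, bad in zip(returns, i_flags) if not bad]
```

Both calls flag only the 0.60 return. In this ten-point sample a z-score threshold of 3.0 would flag nothing, because the extreme value inflates the standard deviation it is measured against; the quartile-based rule is less sensitive to that effect.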

resample_ohlcv(ohlcv, freq='W')[source]

Resample OHLCV data to a lower frequency.

The aggregation follows standard financial conventions:

  • open – first value in the period

  • high – maximum value in the period

  • low – minimum value in the period

  • close – last value in the period

  • volume – sum over the period

Parameters:
  • ohlcv (DataFrame) – DataFrame with columns open, high, low, close, and volume (case-insensitive) indexed by date.

  • freq (str, default: 'W') – Target frequency (any pandas offset alias).

Returns:

Resampled OHLCV data.

Return type:

DataFrame
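
The aggregation rules listed above can be sketched in plain Python. The example groups daily bars by ISO week as a stand-in for the 'W' frequency; resample_weekly_sketch and the bar values are illustrative only.

```python
from datetime import date


def resample_weekly_sketch(bars):
    """Aggregate daily bars into one bar per ISO week.

    `bars` is a date-sorted list of (date, open, high, low, close, volume).
    """
    weeks = {}
    for day, o, h, l, c, v in bars:
        key = tuple(day.isocalendar())[:2]  # (ISO year, ISO week)
        if key not in weeks:
            weeks[key] = {"open": o, "high": h, "low": l, "close": c, "volume": v}
        else:
            bar = weeks[key]
            bar["high"] = max(bar["high"], h)
            bar["low"] = min(bar["low"], l)
            bar["close"] = c  # the last bar seen in the week wins
            bar["volume"] += v
    return weeks


daily = [
    (date(2024, 1, 2), 10.0, 11.0, 9.5, 10.5, 100),
    (date(2024, 1, 3), 10.5, 12.0, 10.0, 11.5, 150),
    (date(2024, 1, 8), 11.5, 11.8, 11.0, 11.2, 80),
]
weekly = resample_weekly_sketch(daily)
```

The first two bars fall in the same week and collapse into a single bar with open 10.0, high 12.0, low 9.5, close 11.5, and volume 250.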

winsorize(data, limits=(0.01, 0.01))[source]

Clip extreme values at the given percentile limits.

Parameters:
  • data (DataFrame | Series) – Input data.

  • limits (tuple[float, float], default: (0.01, 0.01)) – Lower and upper percentile fractions to clip. (0.01, 0.01) clips the bottom 1 % and top 1 % of values.

Returns:

Winsorized data with the same shape as the input.

Return type:

DataFrame | Series
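
A minimal sketch of percentile clipping follows. It assumes linear-interpolation quantiles; the quantile rule the library actually uses is not documented here, and both helper names are hypothetical.

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile of pre-sorted values."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def winsorize_sketch(values, limits=(0.01, 0.01)):
    """Clip values to the [lower, upper] quantiles implied by `limits`."""
    ordered = sorted(values)
    lower = quantile(ordered, limits[0])
    upper = quantile(ordered, 1.0 - limits[1])
    return [min(max(v, lower), upper) for v in values]


data = [float(i) for i in range(101)]  # 0.0 .. 100.0
data[0], data[100] = -500.0, 900.0     # plant two extreme values
clipped = winsorize_sketch(data, limits=(0.01, 0.01))
```

Unlike remove_outliers, winsorizing keeps every row: the two planted extremes are pulled in to roughly 1.0 and 99.0, and the output has the same length as the input.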

expanding_zscore(data)[source]

Compute an expanding-window z-score.

Parameters:

data (Series) – Input time series.

Returns:

Z-scores computed using all data up to and including each point.

Return type:

Series

normalize_prices(prices, base=100.0)[source]

Rebase a price series so that it starts at base.

Parameters:
  • prices (Series | DataFrame) – Price series.

  • base (float, default: 100.0) – Desired starting value.

Returns:

Rebased price series.

Return type:

Series | DataFrame

percentile_rank(data, window=252)[source]

Compute a rolling percentile rank.

For each date the value is ranked within the preceding window observations and expressed as a percentile (0–1).

Parameters:
  • data (Series) – Input time series.

  • window (int, default: 252) – Rolling window size.

Returns:

Rolling percentile ranks.

Return type:

Series

rank_transform(data)[source]

Apply a cross-sectional rank transform.

Values are replaced with their rank divided by the count of non-NaN values, producing output in the range (0, 1].

Parameters:

data (Series | DataFrame) – Input data.

Returns:

Rank-transformed data.

Return type:

Series | DataFrame
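
The rank-over-count rule can be sketched as follows. Averaging the ranks of tied values is an assumption (it is the pandas default), and rank_transform_sketch is a hypothetical helper.

```python
import math


def rank_transform_sketch(values):
    """Replace each non-NaN value with rank / count, giving results in (0, 1]."""
    valid = sorted(v for v in values if not math.isnan(v))
    count = len(valid)
    out = []
    for v in values:
        if math.isnan(v):
            out.append(math.nan)
            continue
        # Average rank for ties: mean of the first and last 1-based positions.
        first = valid.index(v) + 1
        last = first + valid.count(v) - 1
        out.append((first + last) / 2 / count)
    return out


scores = [0.3, math.nan, 0.1, 0.2, 0.2]
ranks = rank_transform_sketch(scores)
```

The largest value maps to exactly 1.0 and the smallest to 1 / count (0.25 here), which is why the output range is open at 0 and closed at 1.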

rolling_zscore(data, window=252)[source]

Compute a rolling-window z-score.

Parameters:
  • data (Series) – Input time series.

  • window (int, default: 252) – Rolling window size.

Returns:

Z-scores computed over the trailing window observations.

Return type:

Series
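
The rolling z-score and its expanding counterpart differ only in which observations feed the mean and standard deviation. The sketch below assumes the sample standard deviation and returns NaN where there is not enough history; both function names are hypothetical.

```python
import math
import statistics


def rolling_zscore_sketch(values, window):
    """z-score of each point against the trailing `window` observations."""
    if window < 2:
        raise ValueError("window must be at least 2")
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(math.nan)  # not enough history yet
            continue
        chunk = values[i + 1 - window : i + 1]
        std = statistics.stdev(chunk)
        out.append((values[i] - statistics.fmean(chunk)) / std if std else math.nan)
    return out


def expanding_zscore_sketch(values):
    """z-score of each point against all observations up to and including it."""
    out = []
    for i in range(len(values)):
        if i == 0:
            out.append(math.nan)  # one observation has no standard deviation
            continue
        chunk = values[: i + 1]
        std = statistics.stdev(chunk)
        out.append((values[i] - statistics.fmean(chunk)) / std if std else math.nan)
    return out


series = [1.0, 2.0, 3.0, 4.0, 10.0]
rolling = rolling_zscore_sketch(series, window=3)
expanding = expanding_zscore_sketch(series)
```

Because each window includes the current point, the statistics use no future data, but a jump such as the final 10.0 also raises the standard deviation it is scored against.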

to_excess_returns(returns, risk_free_rate)[source]

Compute excess returns above a risk-free rate.

Parameters:
  • returns (Series) – Asset return series.

  • risk_free_rate (Series | float) – Risk-free rate. If a pd.Series, it is aligned to returns by index.

Returns:

Excess return series.

Return type:

Series

to_prices(returns, initial_price=100.0, method='simple')[source]

Convert a return series back to prices.

Parameters:
  • returns (Series | DataFrame) – Return series (may contain a leading NaN).

  • initial_price (float, default: 100.0) – Starting price level.

  • method (Literal['simple', 'log'], default: 'simple') – Must match the method used to compute the returns.

Returns:

Reconstructed price series beginning at initial_price.

Return type:

Series | DataFrame

to_returns(prices, method='simple')[source]

Convert a price series to returns.

Parameters:
  • prices (Series | DataFrame) – Price series indexed by date.

  • method (Literal['simple', 'log'], default: 'simple') – 'simple' computes arithmetic returns (P_t / P_{t-1}) - 1. 'log' computes logarithmic returns ln(P_t / P_{t-1}).

Returns:

Return series. The first row will be NaN.

Return type:

Series | DataFrame
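
The two return definitions, and the reason to_prices must be called with the same method, can be checked with a short round trip. The helpers below are plain-Python stand-ins that assume the return series carries a leading NaN.

```python
import math


def to_returns_sketch(prices, method="simple"):
    """Simple returns P_t / P_{t-1} - 1 or log returns ln(P_t / P_{t-1})."""
    out = [math.nan] if prices else []  # no previous price for the first row
    for prev, cur in zip(prices, prices[1:]):
        out.append(cur / prev - 1.0 if method == "simple" else math.log(cur / prev))
    return out


def to_prices_sketch(returns, initial_price=100.0, method="simple"):
    """Compound returns back into prices, skipping the leading NaN."""
    prices = [initial_price]
    for r in returns[1:]:
        growth = 1.0 + r if method == "simple" else math.exp(r)
        prices.append(prices[-1] * growth)
    return prices


prices = [100.0, 110.0, 99.0]
simple = to_returns_sketch(prices, "simple")
log = to_returns_sketch(prices, "log")
rebuilt = to_prices_sketch(log, initial_price=100.0, method="log")
mismatched = to_prices_sketch(log, initial_price=100.0, method="simple")
```

Compounding the log returns with method="log" recovers the original prices, whereas treating them as simple returns ends about one point below the true final price of 99.0.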

check_completeness(data, expected_freq='B')[source]

Report on data completeness relative to an expected frequency.

Parameters:
  • data (Series | DataFrame) – Time-series data with a DatetimeIndex.

  • expected_freq (str, default: 'B') – Expected frequency (e.g. 'B' for business days, 'D' for calendar days).

Returns:

Dictionary containing:

  • expected_count – number of expected periods

  • actual_count – number of actual observations

  • missing_count – number of missing periods

  • missing_dates – list of missing dates

  • completeness_pct – percentage of expected dates present

Return type:

dict[str, Any]
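
The report fields can be reproduced for the 'B' frequency with the standard library. This sketch treats every Monday-to-Friday date as expected (no holiday calendar) and expresses completeness_pct on a 0–100 scale; both choices are assumptions.

```python
from datetime import date, timedelta


def business_days(start, end):
    """All Monday-to-Friday dates from start to end inclusive."""
    days, current = [], start
    while current <= end:
        if current.weekday() < 5:
            days.append(current)
        current += timedelta(days=1)
    return days


def completeness_sketch(observed):
    """Compare observed dates with the business days spanning their range."""
    expected = business_days(min(observed), max(observed))
    present = set(observed)
    missing = [d for d in expected if d not in present]
    covered = len(expected) - len(missing)
    return {
        "expected_count": len(expected),
        "actual_count": len(present),
        "missing_count": len(missing),
        "missing_dates": missing,
        "completeness_pct": 100.0 * covered / len(expected) if expected else 0.0,
    }


observed = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 4), date(2024, 1, 5)]
report = completeness_sketch(observed)
```

For this week the sketch expects five business days, finds four, and reports 2024-01-03 as the single missing date.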

check_staleness(data, max_unchanged=5)[source]

Detect stale (stuck/unchanged) values in a time series.

Parameters:
  • data (Series | DataFrame) – Time-series data.

  • max_unchanged (int, default: 5) – Number of consecutive identical values before flagging as stale.

Returns:

Dictionary containing:

  • stale_periods – list of (start, end, length) tuples for each run of identical values exceeding max_unchanged.

  • total_stale_rows – total number of rows within stale periods.

Return type:

dict[str, Any]
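
Stale-run detection amounts to run-length encoding the series. The sketch below reports runs by integer position rather than by date and treats a run as stale when its length exceeds max_unchanged; staleness_sketch is a hypothetical helper.

```python
def staleness_sketch(values, max_unchanged=5):
    """Find runs of identical consecutive values longer than `max_unchanged`."""
    stale_periods = []
    start = 0
    for i in range(1, len(values) + 1):
        # Close the current run at the end of the data or when the value changes.
        if i == len(values) or values[i] != values[start]:
            length = i - start
            if length > max_unchanged:
                stale_periods.append((start, i - 1, length))
            start = i
    return {
        "stale_periods": stale_periods,
        "total_stale_rows": sum(length for _, _, length in stale_periods),
    }


quotes = [10.0, 10.1, 10.1, 10.1, 10.1, 10.2, 10.3, 10.3]
result = staleness_sketch(quotes, max_unchanged=3)
```

Only the four-value run of 10.1 is reported; the two-value run of 10.3 at the end is within the tolerance.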

data_quality_report(data, freq='B')[source]

Generate a comprehensive data quality report.

Combines completeness, staleness, and value-range checks into a single report dictionary.

Parameters:
  • data (DataFrame) – Time-series data with a DatetimeIndex.

  • freq (str, default: 'B') – Expected frequency for completeness checking.

Returns:

Dictionary containing:

  • completeness – output of check_completeness()

  • staleness – output of check_staleness()

  • missing_values – NaN counts per column

  • duplicated_dates – number of duplicate index entries

  • date_range – (first_date, last_date)

  • shape – (rows, cols)

  • dtypes – column data types

Return type:

dict[str, Any]

validate_ohlcv(df)[source]

Validate OHLCV data for common issues.

Checks performed:

  • high_lt_low – rows where high < low

  • close_outside_range – rows where close is outside [low, high]

  • negative_volume – rows with negative volume

  • missing_values – count of NaN values per column

  • gaps – missing business days in the index

Parameters:

df (DataFrame) – DataFrame with columns open, high, low, close, and volume (case-insensitive).

Returns:

Dictionary keyed by check name with details of any issues found.

Return type:

dict[str, Any]
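
Three of the checks listed above reduce to simple row-wise comparisons. The sketch below runs them over a list of dicts and records the offending row positions; the exact structure of the dictionary returned by validate_ohlcv may differ.

```python
def validate_ohlcv_sketch(rows):
    """Run three of the documented checks on a list of OHLCV dicts."""
    issues = {"high_lt_low": [], "close_outside_range": [], "negative_volume": []}
    for i, row in enumerate(rows):
        if row["high"] < row["low"]:
            issues["high_lt_low"].append(i)
        if not row["low"] <= row["close"] <= row["high"]:
            issues["close_outside_range"].append(i)
        if row["volume"] < 0:
            issues["negative_volume"].append(i)
    return issues


bars = [
    {"open": 10.0, "high": 11.0, "low": 9.0, "close": 10.5, "volume": 1000},
    {"open": 10.5, "high": 10.0, "low": 10.8, "close": 10.6, "volume": 900},
    {"open": 10.6, "high": 11.2, "low": 10.4, "close": 11.5, "volume": -5},
]
issues = validate_ohlcv_sketch(bars)
```

A row with high below low necessarily fails the close-range check as well, so row 1 appears under both keys.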

validate_returns(returns, max_abs=0.5)[source]

Validate a return series for suspicious values.

Parameters:
  • returns (Series | DataFrame) – Return series (simple or log).

  • max_abs (float, default: 0.5) – Returns with absolute value greater than this are flagged.

Returns:

Dictionary containing:

  • suspicious – indices where |return| > max_abs

  • has_nan – whether any NaN values exist

  • nan_count – number of NaN values

  • min – minimum return value

  • max – maximum return value

Return type:

dict[str, Any]

janitor_clean_names(df)[source]

Clean DataFrame column names using pyjanitor.

Converts column names to lowercase snake_case, strips whitespace, and replaces special characters with underscores.

Parameters:

df (DataFrame) – DataFrame with messy column names.

Returns:

DataFrame with cleaned column names.

Return type:

DataFrame

janitor_remove_empty(df)[source]

Remove empty rows and columns using pyjanitor.

Drops rows and columns that are entirely NaN or empty.

Parameters:

df (DataFrame) – DataFrame possibly containing empty rows/columns.

Returns:

DataFrame with empty rows and columns removed.

Return type:

DataFrame

fuzzy_merge(df1, df2, left_col, right_col, threshold=80.0)[source]

Merge two DataFrames using fuzzy string matching via rapidfuzz.

For each value in left_col of df1, the best match above threshold in right_col of df2 is found. Matched rows are joined; unmatched rows from df1 are retained with NaN for df2 columns.

Parameters:
  • df1 (DataFrame) – Left DataFrame.

  • df2 (DataFrame) – Right DataFrame.

  • left_col (str) – Column name in df1 to match on.

  • right_col (str) – Column name in df2 to match on.

  • threshold (float, default: 80.0) – Minimum similarity score (0–100) to consider a match.

Returns:

Merged DataFrame with an additional match_score column indicating the similarity score for each matched pair.

Return type:

DataFrame
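
The matching logic can be sketched with the standard library. Note the swap: this example scores similarity with difflib.SequenceMatcher instead of rapidfuzz, so its scores will not match the library's, and it treats a score equal to the threshold as a match. Unmatched left rows are kept without the right-hand columns, where the real function fills them with NaN.

```python
from difflib import SequenceMatcher


def fuzzy_merge_sketch(left, right, left_col, right_col, threshold=80.0):
    """Left-join two lists of dicts on the best fuzzy match at or above threshold."""
    merged = []
    for lrow in left:
        best_row, best_score = None, 0.0
        for rrow in right:
            score = 100.0 * SequenceMatcher(
                None, lrow[left_col].lower(), rrow[right_col].lower()
            ).ratio()
            if score > best_score:
                best_row, best_score = rrow, score
        if best_row is not None and best_score >= threshold:
            merged.append({**lrow, **best_row, "match_score": best_score})
        else:
            merged.append({**lrow, "match_score": None})  # unmatched left row is kept
    return merged


holdings = [{"issuer": "Apple Inc."}, {"issuer": "Zeta Holdings"}]
reference = [{"company": "Apple Inc"}, {"company": "Microsoft Corp"}]
merged = fuzzy_merge_sketch(holdings, reference, "issuer", "company")
```

"Apple Inc." pairs with "Apple Inc" at a score of about 95, while "Zeta Holdings" finds nothing above the threshold and is carried through unmatched.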

parse_dates_flexible(series)[source]

Parse mixed-format date strings using dateparser.

Handles a wide variety of date formats and natural language dates (e.g. 'yesterday', '3 days ago').

Parameters:

series (Series) – Series of date strings in potentially mixed formats.

Returns:

Series of datetime objects. Values that cannot be parsed are set to NaT.

Return type:

Series

parse_prices(series)[source]

Parse price strings into numeric amounts and currencies.

Uses the price-parser library to extract amounts and currency codes from strings like '$1,234.56' or 'EUR 99.99'.

Parameters:

series (Series) – Series of price strings.

Returns:

DataFrame with columns:

  • amount – extracted numeric price (float, NaN if unparseable).

  • currency – extracted currency code (str or None).

Return type:

DataFrame

normalize_countries(series)[source]

Standardise country names and codes using country-converter.

Parameters:

series (Series) – Series of country names, ISO codes, or other country identifiers in various formats.

Returns:

DataFrame with columns:

  • name_short – standardised short country name.

  • iso3 – ISO 3166-1 alpha-3 code.

  • iso2 – ISO 3166-1 alpha-2 code.

Return type:

DataFrame

fix_text(series)[source]

Fix text encoding issues using ftfy and unidecode.

Repairs mojibake, normalises Unicode, and transliterates non-ASCII characters to their closest ASCII equivalents.

Parameters:

series (Series) – Series of strings that may contain encoding artefacts.

Returns:

Series with fixed text encoding. NaN values are preserved.

Return type:

Series

pandera_validate(df, schema)[source]

Validate a DataFrame against a pandera schema.

Wraps schema.validate() and returns the validated DataFrame (which may include coerced dtypes).

Parameters:
  • df (DataFrame) – DataFrame to validate.

  • schema (Any) – Pandera schema defining the expected structure, dtypes, and value constraints.

Returns:

The validated (and potentially coerced) DataFrame.

Return type:

DataFrame

Raises:

pandera.errors.SchemaError – If validation fails.

create_ohlcv_schema(strict=False, coerce=True)[source]

Create a pandera schema for OHLCV financial data.

The schema enforces:

  • Columns open, high, low, close are positive floats.

  • Column volume is a non-negative integer or float.

  • high >= low for every row.

  • close is within [low, high] for every row.

Parameters:
  • strict (bool, default: False) – If True, extra columns not in the schema cause validation to fail.

  • coerce (bool, default: True) – If True, attempt to coerce column dtypes before validation.

Returns:

Schema suitable for passing to pandera_validate().

Return type:

Any

create_returns_schema(max_abs_return=1.0, allow_nan=False, strict=False, coerce=True)[source]

Create a pandera schema for financial return data.

The schema enforces:

  • All return columns are float type.

  • Return values are within [-max_abs_return, max_abs_return].

Parameters:
  • max_abs_return (float, default: 1.0) – Maximum allowed absolute return value. Values outside [-max_abs_return, max_abs_return] fail validation.

  • allow_nan (bool, default: False) – Whether NaN values are allowed in return columns.

  • strict (bool, default: False) – If True, extra columns cause failure.

  • coerce (bool, default: True) – If True, attempt dtype coercion before validation.

Returns:

Schema suitable for passing to pandera_validate().

Return type:

Any

Loaders

High-level data loading API.

Convenience functions that delegate to the provider registry.

fetch_prices(symbol, start=None, end=None, source=None, **kwargs)[source]

Fetch closing prices for a symbol.

Retrieves a daily close price series from the specified data provider. The default provider is determined by the registry (typically Yahoo Finance for equities).

Parameters:
  • symbol (str) – Ticker symbol (e.g., 'AAPL', 'EURUSD=X', 'BTC-USD').

  • start (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – Start date (string, datetime, or pandas Timestamp). None fetches from the earliest available date.

  • end (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – End date. None fetches up to today.

  • source (str | None, default: None) – Provider name (e.g., 'yahoo', 'fred'). None uses the default provider.

  • **kwargs (Any) – Additional keyword arguments forwarded to the provider’s fetch_prices method.

Returns:

Price series with a DatetimeIndex.

Return type:

Series

Raises:

DataFetchError – If the provider fails to fetch the data.

Example

>>> prices = fetch_prices("AAPL", start="2020-01-01")

See also

  • fetch_ohlcv – Fetch full OHLCV data.

  • fetch_macro – Fetch macroeconomic series from FRED.

fetch_ohlcv(symbol, start=None, end=None, source=None, **kwargs)[source]

Fetch OHLCV (Open, High, Low, Close, Volume) data for a symbol.

Returns a DataFrame with standard column names suitable for backtesting, technical analysis, and charting.

Parameters:
  • symbol (str) – Ticker symbol (e.g., 'AAPL').

  • start (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – Start date. None fetches from the earliest available date.

  • end (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – End date. None fetches up to today.

  • source (str | None, default: None) – Provider name. None uses the default provider.

  • **kwargs (Any) – Additional keyword arguments forwarded to the provider.

Returns:

DataFrame with columns open, high, low, close, volume and a DatetimeIndex.

Return type:

DataFrame

Raises:

DataFetchError – If the provider fails to fetch the data.

Example

>>> df = fetch_ohlcv("AAPL", start="2020-01-01")

See also

  • fetch_prices – Fetch close prices only (lighter weight).

  • fetch_macro – Fetch macroeconomic series.

fetch_macro(series_id, start=None, end=None, source='fred', **kwargs)[source]

Fetch macroeconomic data series.

Retrieves economic indicators from FRED (Federal Reserve Economic Data) or other macro data providers. Common series include GDP, unemployment rate (UNRATE), federal funds rate (DFF), CPI, and Treasury yields.

Parameters:
  • series_id (str) – Series identifier (e.g., 'GDP', 'UNRATE', 'DFF', 'T10Y2Y').

  • start (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – Start date. None fetches the full history.

  • end (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – End date. None fetches up to the latest available release.

  • source (str, default: 'fred') – Provider name (default 'fred').

  • **kwargs (Any) – Additional keyword arguments forwarded to the provider.

Returns:

Macro data series with a DatetimeIndex.

Return type:

Series

Raises:

DataFetchError – If the provider fails to fetch the data.

Example

>>> gdp = fetch_macro("GDP", source="fred")

See also

  • fetch_prices – Fetch asset prices.

  • fetch_ohlcv – Fetch OHLCV bar data.

list_providers()[source]

List all available data providers.

Returns the names of all registered providers (e.g., 'yahoo', 'fred', 'nasdaq', 'csv'). The list depends on which optional dependencies are installed.

Returns:

List of registered provider names.

Return type:

list[str]

Example

>>> providers = list_providers()
>>> "yahoo" in providers
True

Cleaning

Data cleaning utilities for financial time series.

remove_outliers(data, method='zscore', threshold=3.0)[source]

Remove rows containing outlier values from the data.

Parameters:
  • data (DataFrame | Series) – Input data with a DatetimeIndex.

  • method (Literal['zscore', 'iqr', 'mad'], default: 'zscore') – Outlier detection method.

  • threshold (float, default: 3.0) – Sensitivity threshold. For z-score and MAD this is the number of standard deviations; for IQR it is the multiplier applied to the interquartile range.

Returns:

Data with outlier rows removed.

Return type:

DataFrame | Series

winsorize(data, limits=(0.01, 0.01))[source]

Clip extreme values at the given percentile limits.

Parameters:
  • data (DataFrame | Series) – Input data.

  • limits (tuple[float, float], default: (0.01, 0.01)) – Lower and upper percentile fractions to clip. (0.01, 0.01) clips the bottom 1 % and top 1 % of values.

Returns:

Winsorized data with the same shape as the input.

Return type:

DataFrame | Series

fill_missing(data, method='ffill', limit=None)[source]

Fill or remove missing values.

Parameters:
  • data (DataFrame | Series) – Input data possibly containing NaN values.

  • method (Literal['ffill', 'bfill', 'interpolate', 'drop'], default: 'ffill') – Strategy for handling missing values.

  • limit (int | None, default: None) – Maximum number of consecutive NaN values to fill. Only used with 'ffill', 'bfill', and 'interpolate'.

Returns:

Data with missing values handled.

Return type:

DataFrame | Series

detect_outliers(data, method='zscore', threshold=3.0)[source]

Flag rows that contain outlier values.

Parameters:
  • data (DataFrame | Series) – Input data.

  • method (Literal['zscore', 'iqr', 'mad'], default: 'zscore') – Detection method.

  • threshold (float, default: 3.0) – Sensitivity threshold.

Returns:

Boolean series with True for outlier rows.

Return type:

Series

handle_splits_dividends(prices, splits=None, dividends=None)[source]

Adjust a price series for stock splits and dividends.

Parameters:
  • prices (Series) – Raw (unadjusted) price series indexed by date.

  • splits (Series | None, default: None) – Split ratios indexed by date. A 2-for-1 split is represented as 2.0. Dates not present in prices are ignored.

  • dividends (Series | None, default: None) – Cash dividend amounts indexed by ex-date.

Returns:

Adjusted price series.

Return type:

Series

remove_duplicates(data, keep='last')[source]

Remove duplicate index entries.

Parameters:
  • data (DataFrame) – Data whose index may contain duplicates.

  • keep (Literal['first', 'last', False], default: 'last') – Which duplicate to keep.

Returns:

Data with unique index values.

Return type:

DataFrame

align_series(*series, method='inner')[source]

Align multiple series to a common index.

Parameters:
  • *series (Series) – Two or more series to align.

  • method (Literal['inner', 'outer'], default: 'inner') – Join method. 'inner' keeps only dates present in all series; 'outer' keeps all dates (filling gaps with NaN).

Returns:

Aligned series sharing the same index.

Return type:

tuple[Series, ...]

resample_ohlcv(ohlcv, freq='W')[source]

Resample OHLCV data to a lower frequency.

The aggregation follows standard financial conventions:

  • open – first value in the period

  • high – maximum value in the period

  • low – minimum value in the period

  • close – last value in the period

  • volume – sum over the period

Parameters:
  • ohlcv (DataFrame) – DataFrame with columns open, high, low, close, and volume (case-insensitive) indexed by date.

  • freq (str, default: 'W') – Target frequency (any pandas offset alias).

Returns:

Resampled OHLCV data.

Return type:

DataFrame

Advanced Cleaning

Advanced data cleaning integrations using optional packages.

Provides wrappers around pyjanitor, rapidfuzz, dateparser, price-parser, country-converter, ftfy, and unidecode for column name cleaning, fuzzy merging, flexible date parsing, price parsing, country normalisation, and text encoding fixes.

janitor_clean_names(df)[source]

Clean DataFrame column names using pyjanitor.

Converts column names to lowercase snake_case, strips whitespace, and replaces special characters with underscores.

Parameters:

df (DataFrame) – DataFrame with messy column names.

Returns:

DataFrame with cleaned column names.

Return type:

DataFrame

janitor_remove_empty(df)[source]

Remove empty rows and columns using pyjanitor.

Drops rows and columns that are entirely NaN or empty.

Parameters:

df (DataFrame) – DataFrame possibly containing empty rows/columns.

Returns:

DataFrame with empty rows and columns removed.

Return type:

DataFrame

fuzzy_merge(df1, df2, left_col, right_col, threshold=80.0)[source]

Merge two DataFrames using fuzzy string matching via rapidfuzz.

For each value in left_col of df1, the best match above threshold in right_col of df2 is found. Matched rows are joined; unmatched rows from df1 are retained with NaN for df2 columns.

Parameters:
  • df1 (DataFrame) – Left DataFrame.

  • df2 (DataFrame) – Right DataFrame.

  • left_col (str) – Column name in df1 to match on.

  • right_col (str) – Column name in df2 to match on.

  • threshold (float, default: 80.0) – Minimum similarity score (0–100) to consider a match.

Returns:

Merged DataFrame with an additional match_score column indicating the similarity score for each matched pair.

Return type:

DataFrame

parse_dates_flexible(series)[source]

Parse mixed-format date strings using dateparser.

Handles a wide variety of date formats and natural language dates (e.g. 'yesterday', '3 days ago').

Parameters:

series (Series) – Series of date strings in potentially mixed formats.

Returns:

Series of datetime objects. Values that cannot be parsed are set to NaT.

Return type:

Series

parse_prices(series)[source]

Parse price strings into numeric amounts and currencies.

Uses the price-parser library to extract amounts and currency codes from strings like '$1,234.56' or 'EUR 99.99'.

Parameters:

series (Series) – Series of price strings.

Returns:

DataFrame with columns:

  • amount – extracted numeric price (float, NaN if unparseable).

  • currency – extracted currency code (str or None).

Return type:

DataFrame

normalize_countries(series)[source]

Standardise country names and codes using country-converter.

Parameters:

series (Series) – Series of country names, ISO codes, or other country identifiers in various formats.

Returns:

DataFrame with columns:

  • name_short – standardised short country name.

  • iso3 – ISO 3166-1 alpha-3 code.

  • iso2 – ISO 3166-1 alpha-2 code.

Return type:

DataFrame

fix_text(series)[source]

Fix text encoding issues using ftfy and unidecode.

Repairs mojibake, normalises Unicode, and transliterates non-ASCII characters to their closest ASCII equivalents.

Parameters:

series (Series) – Series of strings that may contain encoding artefacts.

Returns:

Series with fixed text encoding. NaN values are preserved.

Return type:

Series

Validation

Data quality checks and validation for financial time series.

validate_ohlcv(df)[source]

Validate OHLCV data for common issues.

Checks performed:

  • high_lt_low – rows where high < low

  • close_outside_range – rows where close is outside [low, high]

  • negative_volume – rows with negative volume

  • missing_values – count of NaN values per column

  • gaps – missing business days in the index

Parameters:

df (DataFrame) – DataFrame with columns open, high, low, close, and volume (case-insensitive).

Returns:

Dictionary keyed by check name with details of any issues found.

Return type:

dict[str, Any]
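
The five checks listed above can be approximated in plain pandas. The sketch below is not the library's code: the helper name basic_ohlcv_checks and the exact shape of each dictionary value are assumptions made for illustration.

```python
import pandas as pd


def basic_ohlcv_checks(df):
    """Hypothetical helper mirroring the checks listed for validate_ohlcv."""
    cols = {c.lower(): c for c in df.columns}  # case-insensitive column lookup
    high, low = df[cols["high"]], df[cols["low"]]
    close, volume = df[cols["close"]], df[cols["volume"]]
    # Every business day between the first and last observation.
    expected = pd.bdate_range(df.index.min(), df.index.max())
    outside = (close < low) | (close > high)
    return {
        "high_lt_low": df.index[(high < low).to_numpy()].tolist(),
        "close_outside_range": df.index[outside.to_numpy()].tolist(),
        "negative_volume": df.index[(volume < 0).to_numpy()].tolist(),
        "missing_values": df.isna().sum().to_dict(),
        "gaps": expected.difference(df.index).tolist(),
    }


idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04"])  # 2024-01-03 is absent
bars = pd.DataFrame(
    {
        "Open": [10.0, 11.0, 12.0],
        "High": [11.0, 10.5, 13.0],
        "Low": [9.5, 10.8, 11.5],  # second row has high < low
        "Close": [10.5, 10.9, 14.0],  # third row closes above the high
        "Volume": [1000, -5, 1200],
    },
    index=idx,
)
report = basic_ohlcv_checks(bars)
print(report["gaps"])
```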

validate_returns(returns, max_abs=0.5)[source]

Validate a return series for suspicious values.

Parameters:
  • returns (Series | DataFrame) – Return series (simple or log).

  • max_abs (float, default: 0.5) – Returns with absolute value greater than this are flagged.

Returns:

Dictionary containing:

  • suspicious – indices where |return| > max_abs

  • has_nan – whether any NaN values exist

  • nan_count – number of NaN values

  • min – minimum return value

  • max – maximum return value

Return type:

dict[str, Any]
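
A minimal sketch of the same report for a single Series, assuming the keys described above. The helper name flag_returns is hypothetical.

```python
import pandas as pd


def flag_returns(returns, max_abs=0.5):
    """Hypothetical helper: flag returns whose absolute value exceeds max_abs."""
    too_large = returns.abs() > max_abs  # NaN compares as False, so it is not flagged
    return {
        "suspicious": returns.index[too_large.to_numpy()].tolist(),
        "has_nan": bool(returns.isna().any()),
        "nan_count": int(returns.isna().sum()),
        "min": float(returns.min()),  # min/max skip NaN by default
        "max": float(returns.max()),
    }


rets = pd.Series([float("nan"), 0.01, -0.62, 0.03])
summary = flag_returns(rets)
print(summary)
```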

check_completeness(data, expected_freq='B')[source]

Report on data completeness relative to an expected frequency.

Parameters:
  • data (Series | DataFrame) – Time-series data with a DatetimeIndex.

  • expected_freq (str, default: 'B') – Expected frequency (e.g. 'B' for business days, 'D' for calendar days).

Returns:

Dictionary containing:

  • expected_count – number of expected periods

  • actual_count – number of actual observations

  • missing_count – number of missing periods

  • missing_dates – list of missing dates

  • completeness_pct – percentage of expected dates present

Return type:

dict[str, Any]
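
One way to compute such a report is to build the expected index with pandas.date_range and compare it with the observed index. The sketch assumes the expected range runs from the first to the last observation, which the source does not state. The helper name completeness is hypothetical.

```python
import pandas as pd


def completeness(data, expected_freq="B"):
    """Hypothetical helper: compare the observed index with an expected one."""
    expected = pd.date_range(data.index.min(), data.index.max(), freq=expected_freq)
    missing = expected.difference(data.index)
    present = len(expected) - len(missing)
    return {
        "expected_count": len(expected),
        "actual_count": len(data),
        "missing_count": len(missing),
        "missing_dates": missing.tolist(),
        "completeness_pct": 100.0 * present / len(expected),
    }


series = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04", "2024-01-05"]),
)
result = completeness(series)
print(result["missing_dates"], result["completeness_pct"])
```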

check_staleness(data, max_unchanged=5)[source]

Detect stale (stuck/unchanged) values in a time series.

Parameters:
  • data (Series | DataFrame) – Time-series data.

  • max_unchanged (int, default: 5) – Number of consecutive identical values before flagging as stale.

Returns:

Dictionary containing:

  • stale_periods – list of (start, end, length) tuples for each run of identical values exceeding max_unchanged.

  • total_stale_rows – total number of rows within stale periods.

Return type:

dict[str, Any]
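
Runs of identical values can be found by labelling each change point and grouping on the label. The sketch below handles a single Series and assumes a run is stale only when its length is strictly greater than max_unchanged. The helper name stale_runs is hypothetical.

```python
import pandas as pd


def stale_runs(series, max_unchanged=5):
    """Hypothetical helper: find runs of identical values longer than max_unchanged."""
    # A new run id starts whenever the value differs from the previous one.
    run_id = (series != series.shift()).cumsum()
    periods = [
        (run.index[0], run.index[-1], len(run))
        for _, run in series.groupby(run_id)
        if len(run) > max_unchanged
    ]
    return {
        "stale_periods": periods,
        "total_stale_rows": sum(length for _, _, length in periods),
    }


quotes = pd.Series([1.0, 1.0, 1.0, 2.0, 3.0, 3.0, 3.0, 3.0])
stale = stale_runs(quotes, max_unchanged=3)
print(stale)
```

With max_unchanged=3, the run of three 1.0 values is not flagged, while the run of four 3.0 values is.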

data_quality_report(data, freq='B')[source]

Generate a comprehensive data quality report.

Combines completeness, staleness, and value-range checks into a single report dictionary.

Parameters:
  • data (DataFrame) – Time-series data with a DatetimeIndex.

  • freq (str, default: 'B') – Expected frequency for completeness checking.

Returns:

Dictionary containing:

  • completeness – output of check_completeness()

  • staleness – output of check_staleness()

  • missing_values – NaN counts per column

  • duplicated_dates – number of duplicate index entries

  • date_range – (first_date, last_date)

  • shape – (rows, cols)

  • dtypes – column data types

Return type:

dict[str, Any]

Advanced Validation

Advanced data validation using pandera.

Provides pandera schema validation for DataFrames and pre-built schemas for common financial data formats (OHLCV, returns).

pandera_validate(df, schema)[source]

Validate a DataFrame against a pandera schema.

Wraps schema.validate() and returns the validated DataFrame (which may include coerced dtypes).

Parameters:
  • df (DataFrame) – DataFrame to validate.

  • schema (Any) – Pandera schema defining the expected structure, dtypes, and value constraints.

Returns:

The validated (and potentially coerced) DataFrame.

Return type:

DataFrame

Raises:

pandera.errors.SchemaError – If validation fails.

create_ohlcv_schema(strict=False, coerce=True)[source]

Create a pandera schema for OHLCV financial data.

The schema enforces:

  • Columns open, high, low, close are positive floats.

  • Column volume is a non-negative integer or float.

  • high >= low for every row.

  • close is within [low, high] for every row.

Parameters:
  • strict (bool, default: False) – If True, extra columns not in the schema cause validation to fail.

  • coerce (bool, default: True) – If True, attempt to coerce column dtypes before validation.

Returns:

Schema suitable for passing to pandera_validate().

Return type:

Any

create_returns_schema(max_abs_return=1.0, allow_nan=False, strict=False, coerce=True)[source]

Create a pandera schema for financial return data.

The schema enforces:

  • All return columns are float type.

  • Return values are within [-max_abs_return, max_abs_return].

Parameters:
  • max_abs_return (float, default: 1.0) – Maximum allowed absolute return value. Values outside [-max_abs_return, max_abs_return] fail validation.

  • allow_nan (bool, default: False) – Whether NaN values are allowed in return columns.

  • strict (bool, default: False) – If True, extra columns cause failure.

  • coerce (bool, default: True) – If True, attempt dtype coercion before validation.

Returns:

Schema suitable for passing to pandera_validate().

Return type:

Any

Transforms

Data transformations for financial time series.

to_returns(prices, method='simple')[source]

Convert a price series to returns.

Parameters:
  • prices (Series | DataFrame) – Price series indexed by date.

  • method (Literal['simple', 'log'], default: 'simple') – 'simple' computes arithmetic returns (P_t / P_{t-1}) - 1. 'log' computes logarithmic returns ln(P_t / P_{t-1}).

Returns:

Return series. The first row will be NaN.

Return type:

Series | DataFrame
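
The two formulas above translate directly into pandas. This is a sketch of the arithmetic on made-up prices, not a call into wraquant.

```python
import numpy as np
import pandas as pd

prices = pd.Series(
    [100.0, 110.0, 121.0],
    index=pd.date_range("2024-01-01", periods=3, freq="B"),
)

# Simple returns: (P_t / P_{t-1}) - 1
simple = prices / prices.shift(1) - 1

# Log returns: ln(P_t / P_{t-1})
log = np.log(prices / prices.shift(1))

print(simple.round(4).tolist())  # first entry is NaN, then about 0.1 twice
print(log.round(4).tolist())
```

The two are linked by log = ln(1 + simple), so log returns are slightly smaller than simple returns for positive moves.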

to_prices(returns, initial_price=100.0, method='simple')[source]

Convert a return series back to prices.

Parameters:
  • returns (Series | DataFrame) – Return series (may contain a leading NaN).

  • initial_price (float, default: 100.0) – Starting price level.

  • method (Literal['simple', 'log'], default: 'simple') – Must match the method used to compute the returns.

Returns:

Reconstructed price series beginning at initial_price.

Return type:

Series | DataFrame
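
The inverse transform compounds returns from the starting level. The sketch assumes a leading NaN is treated as a zero return, so the first reconstructed price equals initial_price. The helper name rebuild_prices is hypothetical.

```python
import numpy as np
import pandas as pd


def rebuild_prices(returns, initial_price=100.0, method="simple"):
    """Hypothetical helper: compound returns back into a price path."""
    filled = returns.fillna(0.0)  # assumption: a leading NaN means "no change"
    if method == "simple":
        return initial_price * (1.0 + filled).cumprod()
    if method == "log":
        return initial_price * np.exp(filled.cumsum())
    raise ValueError(f"unknown method: {method!r}")


original = pd.Series([100.0, 110.0, 121.0])
simple_rets = original.pct_change()
log_rets = np.log(original / original.shift(1))

from_simple = rebuild_prices(simple_rets, 100.0, method="simple")
from_log = rebuild_prices(log_rets, 100.0, method="log")
print(from_simple.round(6).tolist(), from_log.round(6).tolist())
```

Mixing methods (for example, compounding log returns with the simple formula) does not reproduce the original prices, which is why the method arguments must match.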

to_excess_returns(returns, risk_free_rate)[source]

Compute excess returns above a risk-free rate.

Parameters:
  • returns (Series) – Asset return series.

  • risk_free_rate (Series | float) – Risk-free rate. If a pd.Series, it is aligned to returns by index.

Returns:

Excess return series.

Return type:

Series
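
Index alignment is the part worth illustrating. The sketch assumes the risk-free rate is already expressed per period at the same frequency as the returns (daily here, not annualised). The data is made up.

```python
import pandas as pd

returns = pd.Series(
    [0.010, -0.020, 0.015],
    index=pd.date_range("2024-01-01", periods=3, freq="B"),
)

# A risk-free series covering a wider date range than the returns.
risk_free = pd.Series(
    [0.0002, 0.0002, 0.0003, 0.0003],
    index=pd.date_range("2023-12-29", periods=4, freq="B"),
)

# Align the rate to the return dates before subtracting.
excess_from_series = returns - risk_free.reindex(returns.index)

# A scalar rate is simply broadcast across every period.
excess_from_scalar = returns - 0.0002

print(excess_from_series)
```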

normalize_prices(prices, base=100.0)[source]

Rebase a price series so that it starts at base.

Parameters:
  • prices (Series | DataFrame) – Price series.

  • base (float, default: 100.0) – Desired starting value.

Returns:

Rebased price series.

Return type:

Series | DataFrame
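
Rebasing divides each series by its first value and scales by base. A minimal sketch on made-up prices, assuming the first row contains no NaN:

```python
import pandas as pd

prices = pd.DataFrame({"A": [50.0, 55.0, 60.0], "B": [200.0, 190.0, 210.0]})
base = 100.0

# Divide every column by its own first value, then scale to the base level.
rebased = prices / prices.iloc[0] * base
print(rebased.round(2))
```

Both columns now start at 100, which makes their relative performance directly comparable.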

rank_transform(data)[source]

Apply a cross-sectional rank transform.

Values are replaced with their rank divided by the count of non-NaN values, producing output in the range (0, 1].

Parameters:

data (Series | DataFrame) – Input data.

Returns:

Rank-transformed data.

Return type:

Series | DataFrame
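
For a single Series, the described transform is the rank divided by the number of non-NaN values. The sketch below shows only that case. How a DataFrame is ranked (along rows or columns) is not stated in the source and is not assumed here.

```python
import pandas as pd

values = pd.Series([10.0, 30.0, 20.0, float("nan")])

# rank() skips NaN; count() is the number of non-NaN values.
ranked = values.rank() / values.count()
print(ranked.tolist())
```

The largest value maps to exactly 1.0 and the smallest to 1/n, so the output lies in (0, 1]. NaN inputs stay NaN.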

percentile_rank(data, window=252)[source]

Compute a rolling percentile rank.

For each date the value is ranked within the preceding window observations and expressed as a percentile (0–1).

Parameters:
  • data (Series) – Input time series.

  • window (int, default: 252) – Rolling window size.

Returns:

Rolling percentile ranks.

Return type:

Series
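
A rolling percentile rank can be sketched with rolling().apply(). The sketch assumes the window includes the current observation and that ties count as "less than or equal"; the library may define either detail differently. The helper name rolling_pct_rank is hypothetical.

```python
import pandas as pd


def rolling_pct_rank(data, window=252):
    """Hypothetical helper: share of the trailing window that is <= the latest value."""
    # raw=True passes each window as a NumPy array, so w[-1] is the latest value.
    return data.rolling(window).apply(lambda w: (w <= w[-1]).mean(), raw=True)


levels = pd.Series([1.0, 2.0, 3.0, 2.0])
pct = rolling_pct_rank(levels, window=3)
print(pct.tolist())
```

The first window - 1 entries are NaN because the window is not yet full.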

expanding_zscore(data)[source]

Compute an expanding-window z-score.

Parameters:

data (Series) – Input time series.

Returns:

Z-scores computed using all data up to and including each point.

Return type:

Series
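
An expanding z-score uses only information available at each point, which avoids look-ahead bias. A minimal sketch, assuming the sample standard deviation (pandas' default, ddof=1):

```python
import pandas as pd

data = pd.Series([1.0, 2.0, 3.0, 4.0])

# Mean and standard deviation of all observations up to and including each point.
exp_mean = data.expanding().mean()
exp_std = data.expanding().std()  # NaN for the first point (one observation)

exp_z = (data - exp_mean) / exp_std
print(exp_z.round(4).tolist())
```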

rolling_zscore(data, window=252)[source]

Compute a rolling-window z-score.

Parameters:
  • data (Series) – Input time series.

  • window (int, default: 252) – Rolling window size.

Returns:

Z-scores computed over the trailing window observations.

Return type:

Series
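
The rolling variant replaces the full history with a trailing window. The sketch again assumes the sample standard deviation (ddof=1) and that the window includes the current observation.

```python
import pandas as pd

data = pd.Series([1.0, 2.0, 3.0, 4.0])
window = 3

roll = data.rolling(window)
roll_z = (data - roll.mean()) / roll.std()
print(roll_z.tolist())
```

On a steadily rising series the latest value always sits one standard deviation above the window mean, so each full window yields 1.0.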

Calendar

Trading calendar utilities.

Wraps exchange-calendars and pandas-market-calendars for trading day schedules, market hours, and holiday detection.

get_trading_calendar(exchange='XNYS')[source]

Get a trading calendar for an exchange.

Parameters:

exchange (str, default: 'XNYS') – Exchange MIC code. The default, 'XNYS', is the NYSE.

Return type:

Any

Returns:

exchange_calendars.ExchangeCalendar instance.

Example

>>> cal = get_trading_calendar("XNYS")

trading_days(start, end, exchange='XNYS')[source]

Get trading days between two dates.

Parameters:
  • start – Start date.

  • end – End date.

  • exchange (str, default: 'XNYS') – Exchange MIC code.

Return type:

DatetimeIndex

Returns:

DatetimeIndex of valid trading days.

is_business_day(dt)[source]

Check if a date is a business day (Mon-Fri).

Parameters:

dt (Union[str, date, datetime, Timestamp, datetime64]) – Date to check.

Return type:

bool

Returns:

True if weekday, False if weekend.
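
A weekday-only check ignores exchange holidays, which is the main difference from a trading-calendar lookup. A minimal sketch of the Mon-Fri rule follows. The helper name is_weekday is hypothetical.

```python
import pandas as pd


def is_weekday(dt):
    """Hypothetical helper: True for Monday-Friday, regardless of holidays."""
    return bool(pd.Timestamp(dt).weekday() < 5)  # Monday is 0, Sunday is 6


print(is_weekday("2024-01-05"))  # a Friday
print(is_weekday("2024-01-06"))  # a Saturday
print(is_weekday("2023-12-25"))  # a Monday, although NYSE is closed that day
```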

Cache

Caching infrastructure for data fetches.

Provides both in-memory TTL cache and optional disk caching via diskcache.

class MemoryCache[source]

Bases: object

Simple in-memory cache with TTL.

Parameters:

ttl (int | None, default: None)

__init__(ttl=None)[source]

Parameters:

ttl (int | None, default: None)

Return type:

None

property ttl: int

get(key)[source]

Get a cached value if it exists and hasn’t expired.

Parameters:

key (str)

Return type:

Any | None

set(key, value)[source]

Store a value in the cache.

Parameters:
  • key – Cache key.

  • value – Value to store.

Return type:

None

clear()[source]

Clear all cached entries.

Return type:

None
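
The get/set/clear contract with expiry can be illustrated with a small stand-alone class. This is not MemoryCache itself: the class name TTLCache, the default of 300 seconds, and the injectable clock are assumptions made so the example can run without wraquant.

```python
import time


class TTLCache:
    """Hypothetical stand-in showing get/set/clear with time-based expiry."""

    def __init__(self, ttl=300, clock=time.monotonic):
        self._ttl = ttl  # seconds an entry stays valid (assumed unit)
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self._clock() - stored_at > self._ttl:
            del self._store[key]  # expired: drop the entry and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self._clock())

    def clear(self):
        self._store.clear()


now = [0.0]  # fake clock, advanced by hand
cache = TTLCache(ttl=60, clock=lambda: now[0])
cache.set("AAPL", [1, 2, 3])
hit = cache.get("AAPL")
now[0] = 61.0  # move past the TTL
miss = cache.get("AAPL")
print(hit, miss)
```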

class DiskCache[source]

Bases: object

Disk-based cache for persisting fetched data across sessions.

Falls back to a no-op if diskcache is not installed.

Parameters:

cache_dir (Path | None, default: None)

__init__(cache_dir=None)[source]

Parameters:

cache_dir (Path | None, default: None)

Return type:

None

get(key)[source]

Retrieve cached data from disk.

Parameters:

key (str)

Return type:

Series | DataFrame | None

set(key, value, ttl=None)[source]

Store data to disk cache.

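Parameters:
  • key – Cache key.

  • value – Data to store.

  • ttl (default: None) – Optional time-to-live for the entry.
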
Return type:

None

clear()[source]

Clear the disk cache.

Return type:

None

Base

Abstract base classes and provider registry for data sources.

class DataProvider[source]

Bases: ABC

Abstract base class for all data providers.

Subclasses must implement fetch_prices and declare their name.

name: str = ''

abstractmethod fetch_prices(symbol, start=None, end=None, **kwargs)[source]

Fetch closing prices for a symbol.

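Parameters:
  • symbol – Ticker symbol.

  • start (default: None) – Start date.

  • end (default: None) – End date.

  • **kwargs – Additional keyword arguments.
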
Return type:

Series

Returns:

Price series with DatetimeIndex.

fetch_ohlcv(symbol, start=None, end=None, **kwargs)[source]

Fetch OHLCV data for a symbol.

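Parameters:
  • symbol – Ticker symbol.

  • start (default: None) – Start date.

  • end (default: None) – End date.

  • **kwargs – Additional keyword arguments.
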
Return type:

DataFrame

Returns:

DataFrame with open, high, low, close, volume columns.

fetch_macro(series_id, start=None, end=None, **kwargs)[source]

Fetch macroeconomic data series.

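Parameters:
  • series_id – Identifier of the macroeconomic series.

  • start (default: None) – Start date.

  • end (default: None) – End date.

  • **kwargs – Additional keyword arguments.
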
Return type:

Series

Returns:

Macro data series with DatetimeIndex.

class ProviderRegistry[source]

Bases: object

Registry for data providers.

Allows registering, retrieving, and listing data providers by name.

__init__()[source]

Return type:

None

register(provider, *, default=False)[source]

Register a data provider.

Parameters:
  • provider (DataProvider) – DataProvider instance to register.

  • default (bool, default: False) – If True, make this the default provider.

Return type:

None

get(name=None)[source]

Get a provider by name, or the default.

Parameters:

name (str | None, default: None) – Provider name. None returns the default.

Return type:

DataProvider

Returns:

The requested DataProvider.

Raises:

KeyError – If the provider is not registered.

list_providers()[source]

List all registered provider names.

Return type:

list[str]

Returns:

List of provider name strings.

property default: str | None

Name of the default provider.
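
The provider/registry pattern described in this section can be sketched without importing wraquant. Every class below (Provider, Registry, ConstantProvider) is a hypothetical stand-in. Treating the first registered provider as the default is also an assumption, not documented behaviour.

```python
from abc import ABC, abstractmethod

import pandas as pd


class Provider(ABC):
    """Hypothetical stand-in for DataProvider."""

    name: str = ""

    @abstractmethod
    def fetch_prices(self, symbol, start=None, end=None, **kwargs):
        """Return a price Series with a DatetimeIndex."""


class Registry:
    """Hypothetical stand-in for ProviderRegistry."""

    def __init__(self):
        self._providers = {}
        self._default = None

    def register(self, provider, *, default=False):
        self._providers[provider.name] = provider
        if default or self._default is None:  # assumption: first one becomes default
            self._default = provider.name

    def get(self, name=None):
        key = self._default if name is None else name
        if key not in self._providers:
            raise KeyError(f"provider not registered: {key!r}")
        return self._providers[key]

    def list_providers(self):
        return list(self._providers)


class ConstantProvider(Provider):
    """Toy provider that returns a flat price of 100 for any symbol."""

    name = "constant"

    def fetch_prices(self, symbol, start=None, end=None, **kwargs):
        idx = pd.bdate_range(start or "2024-01-01", periods=3)
        return pd.Series(100.0, index=idx, name=symbol)


registry = Registry()
registry.register(ConstantProvider(), default=True)
series = registry.get().fetch_prices("AAPL")
print(registry.list_providers(), series.tolist())
```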

Utilities

Data utility functions — date parsing, symbol cleaning, etc.

parse_date(d)[source]

Parse a date-like value into a pd.Timestamp.

Parameters:

d (Union[str, date, datetime, Timestamp, datetime64, None]) – Date string, date, datetime, Timestamp, or None.

Return type:

Timestamp | None

Returns:

pd.Timestamp or None if input is None.

clean_symbol(symbol)[source]

Normalize a ticker symbol.

Parameters:

symbol (str) – Raw ticker symbol.

Return type:

str

Returns:

Cleaned, uppercase ticker symbol.

infer_frequency(index)[source]

Attempt to infer the frequency of a DatetimeIndex.

Parameters:

index (DatetimeIndex) – DatetimeIndex to analyze.

Return type:

str | None

Returns:

Frequency string, or None if it cannot be determined.
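
pandas offers pandas.infer_freq for this kind of inference. It raises ValueError on fewer than three timestamps, so a wrapper that returns None in that case matches the described contract. Whether infer_frequency is built this way is an assumption. The helper name safe_infer_freq is hypothetical.

```python
import pandas as pd


def safe_infer_freq(index):
    """Hypothetical helper: infer a frequency string, or None if not possible."""
    try:
        return pd.infer_freq(index)
    except ValueError:  # raised when fewer than 3 timestamps are supplied
        return None


daily = pd.date_range("2024-01-01", periods=5, freq="D")
business = pd.bdate_range("2024-01-01", periods=10)
too_short = pd.to_datetime(["2024-01-01", "2024-01-02"])
irregular = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-10", "2024-03-07"])

print(safe_infer_freq(daily))
print(safe_infer_freq(business))
print(safe_infer_freq(too_short))
print(safe_infer_freq(irregular))
```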