Data (wraquant.data)¶

Returns:

Price series with DatetimeIndex.

fetch_ohlcv(symbol, start=None, end=None, **kwargs)[source]¶

Fetch OHLCV data for a symbol.

Parameters:

symbol (str) – Ticker or identifier.
start (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – Start date.
end (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – End date.
kwargs (Any)

Return type:

Returns:

DataFrame with open, high, low, close, volume columns.

fetch_macro(series_id, start=None, end=None, **kwargs)[source]¶

Fetch macroeconomic data series.

Parameters:

series_id (str) – Series identifier (e.g., ‘GDP’, ‘UNRATE’).
start (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – Start date.
end (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – End date.
kwargs (Any)

Return type:

Returns:

Macro data series with DatetimeIndex.

class ProviderRegistry[source]¶

Bases: object

Registry for data providers.

Allows registering, retrieving, and listing data providers by name.

__init__()[source]¶

Return type: None

register(provider, *, default=False)[source]¶

Parameters:

provider (DataProvider) – DataProvider instance to register.
default (bool, default: False) – If True, make this the default provider.

Return type:

get(name=None)[source]¶

Get a provider by name, or the default.

Parameters: name (str | None, default: None) – Provider name. None returns the default.
Return type: DataProvider
Returns: The requested DataProvider.
Raises: KeyError – If the provider is not registered.

list_providers()[source]¶

List all registered provider names.

Return type: list[str]
Returns: List of provider name strings.

property default: str | None¶

Name of the default provider.
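
The register/get/list behaviour described for ProviderRegistry can be sketched in plain Python. This is a hypothetical stand-in, not wraquant's implementation: SketchRegistry, DummyProvider, and the name attribute are assumptions made for illustration, as is the rule that the first registered provider becomes the default until another is registered with default=True.

```python
from typing import Optional


class DummyProvider:
    """Hypothetical provider; the sketch only needs a `name` attribute."""

    def __init__(self, name: str) -> None:
        self.name = name


class SketchRegistry:
    """Illustrative registry keyed by provider name (not wraquant's code)."""

    def __init__(self) -> None:
        self._providers = {}
        self._default: Optional[str] = None

    def register(self, provider, *, default: bool = False) -> None:
        self._providers[provider.name] = provider
        # Assumption: the first provider registered becomes the default
        # unless a later registration passes default=True.
        if default or self._default is None:
            self._default = provider.name

    def get(self, name: Optional[str] = None):
        key = self._default if name is None else name
        if key is None or key not in self._providers:
            raise KeyError(f"provider not registered: {key!r}")
        return self._providers[key]

    def list_providers(self) -> list:
        return list(self._providers)

    @property
    def default(self) -> Optional[str]:
        return self._default


registry = SketchRegistry()
registry.register(DummyProvider("csv"))
registry.register(DummyProvider("yahoo"), default=True)
print(registry.list_providers(), registry.default)
```

Calling get() with no argument falls back to the default name, and an unknown name raises KeyError, matching the Raises entry above.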

fetch_prices(symbol, start=None, end=None, source=None, **kwargs)[source]¶

Fetch closing prices for a symbol.

Retrieves a daily close price series from the specified data provider. The default provider is determined by the registry (typically Yahoo Finance for equities).

Parameters:

symbol (str) – Ticker symbol (e.g., 'AAPL', 'EURUSD=X', 'BTC-USD').
start (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – Start date (string, datetime, or pandas Timestamp). None fetches from the earliest available date.
end (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – End date. None fetches up to today.
source (str | None, default: None) – Provider name (e.g., 'yahoo', 'fred'). None uses the default provider.
**kwargs (Any) – Additional keyword arguments forwarded to the provider’s fetch_prices method.

Returns:

Price series with a DatetimeIndex.

Return type:

Raises:

DataFetchError – If the provider fails to fetch the data.

Example

>>> prices = fetch_prices("AAPL", start="2020-01-01")

See also

fetch_ohlcv: Fetch full OHLCV data.
fetch_macro: Fetch macroeconomic series from FRED.

fetch_ohlcv(symbol, start=None, end=None, source=None, **kwargs)[source]¶

Fetch OHLCV (Open, High, Low, Close, Volume) data for a symbol.

Returns a DataFrame with standard column names suitable for backtesting, technical analysis, and charting.

Parameters:

symbol (str) – Ticker symbol (e.g., 'AAPL').
start (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – Start date. None fetches from the earliest available date.
end (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – End date. None fetches up to today.
source (str | None, default: None) – Provider name. None uses the default provider.
**kwargs (Any) – Additional keyword arguments forwarded to the provider.

Returns:

DataFrame with columns open, high, low, close, volume and a DatetimeIndex.

Return type:

DataFrame

Raises:

DataFetchError – If the provider fails to fetch the data.

Example

>>> df = fetch_ohlcv("AAPL", start="2020-01-01")

See also

fetch_prices: Fetch close prices only (lighter weight).
fetch_macro: Fetch macroeconomic series.

fetch_macro(series_id, start=None, end=None, source='fred', **kwargs)[source]¶

Fetch macroeconomic data series.

Retrieves economic indicators from FRED (Federal Reserve Economic Data) or other macro data providers. Common series include GDP, unemployment rate (UNRATE), federal funds rate (DFF), CPI, and Treasury yields.

Parameters:

series_id (str) – Series identifier (e.g., 'GDP', 'UNRATE', 'DFF', 'T10Y2Y').
start (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – Start date. None fetches the full history.
end (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – End date. None fetches up to the latest available release.
source (str, default: 'fred') – Provider name (default 'fred').
**kwargs (Any) – Additional keyword arguments forwarded to the provider.

Returns:

Macro data series with a DatetimeIndex.

Return type:

Raises:

DataFetchError – If the provider fails to fetch the data.

Example

>>> gdp = fetch_macro("GDP", source="fred")

See also

fetch_prices: Fetch asset prices.
fetch_ohlcv: Fetch OHLCV bar data.

list_providers()[source]¶

List all available data providers.

Returns the names of all registered providers (e.g., 'yahoo', 'fred', 'nasdaq', 'csv'). The list depends on which optional dependencies are installed.

Returns: List of registered provider names.
Return type: list[str]

Example

>>> providers = list_providers()
>>> "yahoo" in providers
True

align_series(*series, method='inner')[source]¶

Align multiple series to a common index.

Parameters:

*series (Series) – Two or more series to align.
method (Literal['inner', 'outer'], default: 'inner') – Join method. 'inner' keeps only dates present in all series; 'outer' keeps all dates (filling gaps with NaN).

Returns:

Aligned series sharing the same index.

Return type:

tuple[Series, ...]
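
The difference between the two join methods can be sketched without pandas by treating each series as a dict keyed by date. The helper name align_sketch is hypothetical, and the dict representation is an assumption made to keep the example self-contained.

```python
def align_sketch(*series, method="inner"):
    """Align date-keyed dicts to a common, sorted set of keys."""
    if method not in ("inner", "outer"):
        raise ValueError(f"unknown method: {method!r}")
    key_sets = [set(s) for s in series]
    # 'inner' keeps dates present in every series; 'outer' keeps all dates.
    if method == "inner":
        keys = set.intersection(*key_sets)
    else:
        keys = set.union(*key_sets)
    index = sorted(keys)
    nan = float("nan")
    # Missing dates are filled with NaN (only possible for 'outer').
    return tuple({k: s.get(k, nan) for k in index} for s in series)


a = {"2024-01-01": 1.0, "2024-01-02": 2.0, "2024-01-03": 3.0}
b = {"2024-01-02": 20.0, "2024-01-03": 30.0, "2024-01-04": 40.0}

inner_a, inner_b = align_sketch(a, b)
outer_a, outer_b = align_sketch(a, b, method="outer")
print(list(inner_a))  # the two shared dates
print(list(outer_a))  # all four dates
```

With 'inner', both results cover only the two shared dates; with 'outer', each result covers all four dates and holds NaN where the original series had no observation.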

detect_outliers(data, method='zscore', threshold=3.0)[source]¶

Flag rows that contain outlier values.

Parameters:

data (DataFrame | Series) – Input data.
method (Literal['zscore', 'iqr', 'mad'], default: 'zscore') – Detection method.
threshold (float, default: 3.0) – Sensitivity threshold.

Returns:

Boolean series with True for outlier rows.

Return type:

fill_missing(data, method='ffill', limit=None)[source]¶

Fill or remove missing values.

Parameters:

data (DataFrame | Series) – Input data possibly containing NaN values.
method (Literal['ffill', 'bfill', 'interpolate', 'drop'], default: 'ffill') – Strategy for handling missing values.
limit (int | None, default: None) – Maximum number of consecutive NaN values to fill. Only used with 'ffill', 'bfill', and 'interpolate'.

Returns:

Data with missing values handled.

Return type:

handle_splits_dividends(prices, splits=None, dividends=None)[source]¶

Adjust a price series for stock splits and dividends.

Parameters:

prices (Series) – Raw (unadjusted) price series indexed by date.
splits (Series | None, default: None) – Split ratios indexed by date. A 2-for-1 split is represented as 2.0. Dates not present in prices are ignored.
dividends (Series | None, default: None) – Cash dividend amounts indexed by ex-date.

Returns:

Adjusted price series.

Return type:

remove_duplicates(data, keep='last')[source]¶

Remove duplicate index entries.

Parameters:

data (DataFrame) – Data whose index may contain duplicates.
keep (Literal['first', 'last', False], default: 'last') – Which duplicate to keep.

Returns:

Data with unique index values.

Return type:

remove_outliers(data, method='zscore', threshold=3.0)[source]¶

Remove rows containing outlier values from the data.

Parameters:

data (DataFrame | Series) – Input data with a DatetimeIndex.
method (Literal['zscore', 'iqr', 'mad'], default: 'zscore') – Outlier detection method.
threshold (float, default: 3.0) – Sensitivity threshold. For z-score and MAD this is the number of standard deviations; for IQR it is the multiplier applied to the interquartile range.

Returns:

Data with outlier rows removed.

Return type:
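
How threshold is interpreted by each method can be sketched with the standard library. This is an illustration rather than wraquant's code: outlier_flags is a hypothetical helper, the 1.4826 factor that rescales the MAD to standard-deviation units is an assumption, and the z-score uses the sample standard deviation.

```python
import statistics


def outlier_flags(values, method="zscore", threshold=3.0):
    """Return one boolean per value, True where the value looks like an outlier."""
    if method == "zscore":
        mean = statistics.fmean(values)
        std = statistics.stdev(values)  # sample standard deviation
        return [abs(v - mean) / std > threshold for v in values]
    if method == "iqr":
        q1, _, q3 = statistics.quantiles(values, n=4)
        spread = q3 - q1
        low, high = q1 - threshold * spread, q3 + threshold * spread
        return [v < low or v > high for v in values]
    if method == "mad":
        median = statistics.median(values)
        mad = statistics.median(abs(v - median) for v in values)
        # Assumption: scale the MAD by 1.4826 so that the threshold is
        # comparable to a number of standard deviations for normal data.
        return [abs(v - median) / (1.4826 * mad) > threshold for v in values]
    raise ValueError(f"unknown method: {method!r}")


values = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.3, 10.1, 9.9, 25.0]
z_flags = outlier_flags(values, "zscore")
iqr_flags = outlier_flags(values, "iqr")
mad_flags = outlier_flags(values, "mad")

# Removing outliers is then a filter on the flags.
cleaned = [v for v, bad in zip(values, mad_flags) if not bad]
print(sum(z_flags), sum(iqr_flags), sum(mad_flags), len(cleaned))
```

On this ten-point sample the z-score method flags nothing at a threshold of 3.0, because the extreme value inflates the standard deviation it is measured against, while the IQR and MAD variants both flag the final value. That is one reason to prefer the robust methods on short series.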

resample_ohlcv(ohlcv, freq='W')[source]¶

Resample OHLCV data to a lower frequency.

The aggregation follows standard financial conventions:

open – first value in the period
high – maximum value in the period
low – minimum value in the period
close – last value in the period
volume – sum over the period

Parameters:

ohlcv (DataFrame) – DataFrame with columns open, high, low, close, and volume (case-insensitive) indexed by date.
freq (str, default: 'W') – Target frequency (any pandas offset alias).

Returns:

Resampled OHLCV data.

Return type:
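
The aggregation rules listed above can be sketched in plain Python for the weekly case. The helper resample_weekly_sketch and the tuple layout of the rows are hypothetical. Weeks here are ISO weeks (Monday to Sunday), which is an assumption about how a 'W' bucket lines up.

```python
from datetime import date


def resample_weekly_sketch(rows):
    """Aggregate daily (date, open, high, low, close, volume) rows into weekly bars.

    Rows must be sorted by date.
    """
    bars = {}
    for day, open_, high, low, close, volume in rows:
        key = tuple(day.isocalendar()[:2])  # (ISO year, ISO week number)
        bar = bars.get(key)
        if bar is None:
            # First row of the week sets the open and seeds the other fields.
            bars[key] = {"open": open_, "high": high, "low": low,
                         "close": close, "volume": volume}
        else:
            bar["high"] = max(bar["high"], high)
            bar["low"] = min(bar["low"], low)
            bar["close"] = close        # last row of the week wins
            bar["volume"] += volume     # volume is summed
    return bars


daily = [
    (date(2024, 1, 1), 100.0, 102.0, 99.0, 101.0, 1000),
    (date(2024, 1, 2), 101.0, 105.0, 100.0, 104.0, 1500),
    (date(2024, 1, 5), 104.0, 104.5, 98.0, 99.0, 2000),
    (date(2024, 1, 8), 99.0, 101.0, 97.0, 100.0, 1200),
]
weekly = resample_weekly_sketch(daily)
print(weekly[(2024, 1)])
```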

winsorize(data, limits=(0.01, 0.01))[source]¶

Clip extreme values at the given percentile limits.

Parameters:

data (DataFrame | Series) – Input data.
limits (tuple[float, float], default: (0.01, 0.01)) – Lower and upper percentile fractions to clip. (0.01, 0.01) clips the bottom 1% and top 1% of values.

Returns:

Winsorized data with the same shape as the input.

Return type:
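
Clipping at percentile limits can be sketched as follows. The helper winsorize_sketch is hypothetical, and the linearly interpolated quantile is an assumption; the library may compute its cut-offs differently.

```python
def winsorize_sketch(values, limits=(0.01, 0.01)):
    """Clip values to the [lower, upper] quantiles implied by `limits`."""
    ordered = sorted(values)
    n = len(ordered)

    def quantile(q):
        # Linear interpolation between the two nearest order statistics.
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        return ordered[lo] * (1 - frac) + ordered[hi] * frac

    lower = quantile(limits[0])
    upper = quantile(1 - limits[1])
    # Values are clipped, not dropped, so the output length is unchanged.
    return [min(max(v, lower), upper) for v in values]


values = [float(i) for i in range(1, 102)]  # 1.0, 2.0, ..., 101.0
clipped = winsorize_sketch(values, limits=(0.1, 0.1))
print(min(clipped), max(clipped), len(clipped))
```

Unlike outlier removal, winsorizing keeps every row and only pulls the tails in to the cut-off values.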

expanding_zscore(data)[source]¶

Compute an expanding-window z-score.

Parameters: data (Series) – Input time series.
Returns: Z-scores computed using all data up to and including each point.
Return type: Series
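
A minimal sketch of the expanding calculation, assuming the sample standard deviation (ddof=1) and NaN wherever fewer than two observations are available. The helper name expanding_zscore_sketch is hypothetical.

```python
import statistics


def expanding_zscore_sketch(values):
    """Z-score of each value against all values up to and including it."""
    nan = float("nan")
    out = []
    for i, value in enumerate(values):
        window = values[: i + 1]
        if len(window) < 2:
            out.append(nan)  # a single point has no standard deviation
            continue
        std = statistics.stdev(window)
        mean = statistics.fmean(window)
        out.append((value - mean) / std if std > 0 else nan)
    return out


z = expanding_zscore_sketch([1.0, 2.0, 3.0])
print(z)
```

Because each score uses only past and current data, the transform introduces no look-ahead, at the cost of noisy values early in the series.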

normalize_prices(prices, base=100.0)[source]¶

Rebase a price series so that it starts at base.

Parameters:

prices (Series | DataFrame) – Price series.
base (float, default: 100.0) – Desired starting value.

Returns:

Rebased price series.

Return type:

percentile_rank(data, window=252)[source]¶

Compute a rolling percentile rank.

For each date the value is ranked within the preceding window observations and expressed as a percentile (0–1).

Parameters:

data (Series) – Input time series.
window (int, default: 252) – Rolling window size.

Returns:

Rolling percentile ranks.

Return type:

rank_transform(data)[source]¶

Apply a cross-sectional rank transform.

Values are replaced with their rank divided by the count of non-NaN values, producing output in the range (0, 1].

Parameters: data (Series | DataFrame) – Input data.
Returns: Rank-transformed data.
Return type: Series | DataFrame
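
The rank-over-count rule can be sketched as below. The helper rank_transform_sketch is hypothetical, and giving tied values their average rank is an assumption; the source does not say how ties are handled.

```python
def rank_transform_sketch(values):
    """Replace each value with rank / count of non-NaN values; NaN stays NaN."""
    nan = float("nan")
    valid = [v for v in values if v == v]  # NaN is the only value != itself
    count = len(valid)
    out = []
    for v in values:
        if v != v:
            out.append(nan)
            continue
        below = sum(1 for w in valid if w < v)
        equal = sum(1 for w in valid if w == v)
        # Assumption: ties share the average of the ranks they span.
        rank = below + (equal + 1) / 2
        out.append(rank / count)
    return out


ranks = rank_transform_sketch([10.0, float("nan"), 30.0, 20.0, 20.0])
print(ranks)
```

The smallest value maps to 1/count and the largest to 1.0, which is why the output range is (0, 1].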

rolling_zscore(data, window=252)[source]¶

Compute a rolling-window z-score.

Parameters:

data (Series) – Input time series.
window (int, default: 252) – Rolling window size.

Returns:

Z-scores computed over the trailing window observations.

Return type:

to_excess_returns(returns, risk_free_rate)[source]¶

Compute excess returns above a risk-free rate.

Parameters:

returns (Series) – Asset return series.
risk_free_rate (Series | float) – Risk-free rate. If a pd.Series, it is aligned to returns by index.

Returns:

Excess return series.

Return type:

to_prices(returns, initial_price=100.0, method='simple')[source]¶

Convert a return series back to prices.

Parameters:

returns (Series | DataFrame) – Return series (may contain a leading NaN).
initial_price (float, default: 100.0) – Starting price level.
method (Literal['simple', 'log'], default: 'simple') – Must match the method used to compute the returns.

Returns:

Reconstructed price series beginning at initial_price.

Return type:

to_returns(prices, method='simple')[source]¶

Convert a price series to returns.

Parameters:

prices (Series | DataFrame) – Price series indexed by date.
method (Literal['simple', 'log'], default: 'simple') – 'simple' computes arithmetic returns (P_t / P_{t-1}) - 1. 'log' computes logarithmic returns ln(P_t / P_{t-1}).

Returns:

Return series. The first row will be NaN.

Return type:
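
The two return definitions, and why to_prices must use the same method, can be sketched in plain Python. The helper names are hypothetical and operate on lists rather than pandas objects.

```python
import math


def to_returns_sketch(prices, method="simple"):
    """Simple: P_t / P_{t-1} - 1. Log: ln(P_t / P_{t-1}). First entry is NaN."""
    if method not in ("simple", "log"):
        raise ValueError(f"unknown method: {method!r}")
    returns = [float("nan")]
    for prev, cur in zip(prices, prices[1:]):
        ratio = cur / prev
        returns.append(ratio - 1.0 if method == "simple" else math.log(ratio))
    return returns


def to_prices_sketch(returns, initial_price=100.0, method="simple"):
    """Compound returns back into a price path starting at initial_price."""
    prices = [initial_price]
    for r in returns:
        if math.isnan(r):
            continue  # skip NaN entries (here, the leading one)
        growth = 1.0 + r if method == "simple" else math.exp(r)
        prices.append(prices[-1] * growth)
    return prices


prices = [100.0, 110.0, 99.0]
simple = to_returns_sketch(prices)
log = to_returns_sketch(prices, method="log")
rebuilt = to_prices_sketch(log, initial_price=100.0, method="log")
print(simple, log, rebuilt)
```

Rebuilding from log returns with method="simple" would compound 1 + ln(ratio) instead of the ratio itself and drift away from the original prices, which is why the methods must match.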

check_completeness(data, expected_freq='B')[source]¶

Report on data completeness relative to an expected frequency.

Parameters:

data (Series | DataFrame) – Time-series data with a DatetimeIndex.
expected_freq (str, default: 'B') – Expected frequency (e.g. 'B' for business days, 'D' for calendar days).

Returns:

Dictionary containing:

expected_count – number of expected periods
actual_count – number of actual observations
missing_count – number of missing periods
missing_dates – list of missing dates
completeness_pct – percentage of expected dates present

Return type:
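
For the business-day case, the report can be sketched with the standard library. The helper completeness_sketch is hypothetical; it treats every Monday to Friday between the first and last observation as expected and ignores exchange holidays, which is an assumption.

```python
from datetime import date, timedelta


def completeness_sketch(observed):
    """Compare observed dates with all weekdays between the first and last date."""
    present = set(observed)
    start, end = min(present), max(present)
    expected = []
    day = start
    while day <= end:
        if day.weekday() < 5:  # Monday=0 ... Friday=4
            expected.append(day)
        day += timedelta(days=1)
    missing = [d for d in expected if d not in present]
    actual = len(expected) - len(missing)  # observations that fall on expected dates
    return {
        "expected_count": len(expected),
        "actual_count": actual,
        "missing_count": len(missing),
        "missing_dates": missing,
        "completeness_pct": 100.0 * actual / len(expected),
    }


observed = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 4), date(2024, 1, 5)]
report = completeness_sketch(observed)
print(report)
```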

check_staleness(data, max_unchanged=5)[source]¶

Detect stale (stuck/unchanged) values in a time series.

Parameters:

data (Series | DataFrame) – Time-series data.
max_unchanged (int, default: 5) – Number of consecutive identical values before flagging as stale.

Returns:

Dictionary containing:

stale_periods – list of (start, end, length) tuples for each run of identical values exceeding max_unchanged.
total_stale_rows – total number of rows within stale periods.

Return type:
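
Run detection of this kind can be sketched with itertools.groupby. The helper staleness_sketch is hypothetical and reports positions rather than dates. It flags runs whose length is strictly greater than max_unchanged, following the word "exceeding" in the description above.

```python
from itertools import groupby


def staleness_sketch(values, max_unchanged=5):
    """Find runs of identical consecutive values longer than max_unchanged."""
    periods = []
    position = 0
    for _, run in groupby(values):
        length = len(list(run))
        if length > max_unchanged:
            # (start position, end position, run length)
            periods.append((position, position + length - 1, length))
        position += length
    return {
        "stale_periods": periods,
        "total_stale_rows": sum(length for _, _, length in periods),
    }


result = staleness_sketch([1.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0], max_unchanged=3)
print(result)
```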

data_quality_report(data, freq='B')[source]¶

Generate a comprehensive data quality report.

Combines completeness, staleness, and value-range checks into a single report dictionary.

Parameters:

data (DataFrame) – Time-series data with a DatetimeIndex.
freq (str, default: 'B') – Expected frequency for completeness checking.

Returns:

Dictionary containing:

completeness – output of check_completeness()
staleness – output of check_staleness()
missing_values – NaN counts per column
duplicated_dates – number of duplicate index entries
date_range – (first_date, last_date)
shape – (rows, cols)
dtypes – column data types

Return type:

validate_ohlcv(df)[source]¶

Validate OHLCV data for common issues.

Checks performed:

high_lt_low – rows where high < low
close_outside_range – rows where close is outside [low, high]
negative_volume – rows with negative volume
missing_values – count of NaN values per column
gaps – missing business days in the index

Parameters: df (DataFrame) – DataFrame with columns open, high, low, close, and volume (case-insensitive).
Returns: Dictionary keyed by check name with details of any issues found.
Return type: dict[str, Any]

validate_returns(returns, max_abs=0.5)[source]¶

Validate a return series for suspicious values.

Parameters:

returns (Series | DataFrame) – Return series (simple or log).
max_abs (float, default: 0.5) – Returns with absolute value greater than this are flagged.

Returns:

Dictionary containing:

suspicious – indices where |return| > max_abs
has_nan – whether any NaN values exist
nan_count – number of NaN values
min – minimum return value
max – maximum return value

Return type:

janitor_clean_names(df)[source]¶

Clean DataFrame column names using pyjanitor.

Converts column names to lowercase snake_case, strips whitespace, and replaces special characters with underscores.

Parameters: df (DataFrame) – DataFrame with messy column names.
Returns: DataFrame with cleaned column names.
Return type: DataFrame

janitor_remove_empty(df)[source]¶

Remove empty rows and columns using pyjanitor.

Drops rows and columns that are entirely NaN or empty.

Parameters: df (DataFrame) – DataFrame possibly containing empty rows/columns.
Returns: DataFrame with empty rows and columns removed.
Return type: DataFrame

fuzzy_merge(df1, df2, left_col, right_col, threshold=80.0)[source]¶

Merge two DataFrames using fuzzy string matching via rapidfuzz.

For each value in left_col of df1, the best match above threshold in right_col of df2 is found. Matched rows are joined; unmatched rows from df1 are retained with NaN for df2 columns.

Parameters:

df1 (DataFrame) – Left DataFrame.
df2 (DataFrame) – Right DataFrame.
left_col (str) – Column name in df1 to match on.
right_col (str) – Column name in df2 to match on.
threshold (float, default: 80.0) – Minimum similarity score (0–100) to consider a match.

Returns:

Merged DataFrame with an additional match_score column indicating the similarity score for each matched pair.

Return type:

parse_dates_flexible(series)[source]¶

Parse mixed-format date strings using dateparser.

Handles a wide variety of date formats and natural language dates (e.g. 'yesterday', '3 days ago').

Parameters: series (Series) – Series of date strings in potentially mixed formats.
Returns: Series of datetime objects. Values that cannot be parsed are set to NaT.
Return type: Series

parse_prices(series)[source]¶

Parse price strings into numeric amounts and currencies.

Uses the price-parser library to extract amounts and currency codes from strings like '$1,234.56' or 'EUR 99.99'.

Parameters:

series (Series) – Series of price strings.

Returns:

DataFrame with columns:

amount – extracted numeric price (float, NaN if unparseable).
currency – extracted currency code (str or None).

Return type:

normalize_countries(series)[source]¶

Standardise country names and codes using country-converter.

Parameters:

series (Series) – Series of country names, ISO codes, or other country identifiers in various formats.

Returns:

DataFrame with columns:

name_short – standardised short country name.
iso3 – ISO 3166-1 alpha-3 code.
iso2 – ISO 3166-1 alpha-2 code.

Return type:

fix_text(series)[source]¶

Fix text encoding issues using ftfy and unidecode.

Repairs mojibake, normalises Unicode, and transliterates non-ASCII characters to their closest ASCII equivalents.

Parameters: series (Series) – Series of strings that may contain encoding artefacts.
Returns: Series with fixed text encoding. NaN values are preserved.
Return type: Series

pandera_validate(df, schema)[source]¶

Validate a DataFrame against a pandera schema.

Wraps schema.validate() and returns the validated DataFrame (which may include coerced dtypes).

Parameters:

df (DataFrame) – DataFrame to validate.
schema (Any) – Pandera schema defining the expected structure, dtypes, and value constraints.

Returns:

The validated (and potentially coerced) DataFrame.

Return type:

Raises:

pandera.errors.SchemaError – If validation fails.

create_ohlcv_schema(strict=False, coerce=True)[source]¶

Create a pandera schema for OHLCV financial data.

The schema enforces:

Columns open, high, low, close are positive floats.
Column volume is a non-negative integer or float.
high >= low for every row.
close is within [low, high] for every row.

Parameters:

strict (bool, default: False) – If True, extra columns not in the schema cause validation to fail.
coerce (bool, default: True) – If True, attempt to coerce column dtypes before validation.

Returns:

Schema suitable for passing to pandera_validate().

Return type:

create_returns_schema(max_abs_return=1.0, allow_nan=False, strict=False, coerce=True)[source]¶

Create a pandera schema for financial return data.

The schema enforces:

All return columns are float type.
Return values are within [-max_abs_return, max_abs_return].

Parameters:

max_abs_return (float, default: 1.0) – Maximum allowed absolute return value. Values outside [-max_abs_return, max_abs_return] fail validation.
allow_nan (bool, default: False) – Whether NaN values are allowed in return columns.
strict (bool, default: False) – If True, extra columns cause failure.
coerce (bool, default: True) – If True, attempt dtype coercion before validation.

Returns:

Schema suitable for passing to pandera_validate().

Return type:

Loaders¶

High-level data loading API.

Convenience functions that delegate to the provider registry.

Cleaning¶

Data cleaning utilities for financial time series.

Advanced Cleaning¶

Advanced data cleaning integrations using optional packages.

Provides wrappers around pyjanitor, rapidfuzz, dateparser, price-parser, country-converter, ftfy, and unidecode for column name cleaning, fuzzy merging, flexible date parsing, price parsing, country normalisation, and text encoding fixes.

Validation¶

Data quality checks and validation for financial time series.

Advanced Validation¶

Advanced data validation using pandera.

Provides pandera schema validation for DataFrames and pre-built schemas for common financial data formats (OHLCV, returns).

pandera_validate(df, schema)[source]¶

Validate a DataFrame against a pandera schema.

Wraps schema.validate() and returns the validated DataFrame (which may include coerced dtypes).

Parameters:

df (DataFrame) – DataFrame to validate.
schema (Any) – Pandera schema defining the expected structure, dtypes, and value constraints.

Returns:

The validated (and potentially coerced) DataFrame.

Return type:

Raises:

pandera.errors.SchemaError – If validation fails.

create_ohlcv_schema(strict=False, coerce=True)[source]¶

Create a pandera schema for OHLCV financial data.

The schema enforces:

Columns open, high, low, close are positive floats.
Column volume is a non-negative integer or float.
high >= low for every row.
close is within [low, high] for every row.
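To make the row-level rules concrete, the same constraints can be expressed as boolean masks in plain pandas. This is an illustration of the rules only, not the pandera schema that the function returns; the helper `ohlcv_violations` and the sample bars are hypothetical.

```python
import pandas as pd


def ohlcv_violations(df: pd.DataFrame) -> pd.Series:
    """Return True for every row that breaks at least one listed rule."""
    price_cols = ["open", "high", "low", "close"]
    positive = (df[price_cols] > 0).all(axis=1)
    volume_ok = df["volume"] >= 0
    ordered = df["high"] >= df["low"]
    close_in_range = df["close"].between(df["low"], df["high"])
    return ~(positive & volume_ok & ordered & close_in_range)


bars = pd.DataFrame(
    {
        "open": [10.0, 11.0, 12.0],
        "high": [10.5, 11.5, 11.0],   # third row: high is below low
        "low": [9.5, 10.5, 11.5],
        "close": [10.2, 12.0, 11.2],  # second row: close is above high
        "volume": [1000, 1500, 900],
    }
)
bad = ohlcv_violations(bars)
print(bad.tolist())
```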

Parameters:

strict (bool, default: False) – If True, extra columns not in the schema cause validation to fail.
coerce (bool, default: True) – If True, attempt to coerce column dtypes before validation.

Returns:

Schema suitable for passing to pandera_validate().

Return type:

create_returns_schema(max_abs_return=1.0, allow_nan=False, strict=False, coerce=True)[source]¶

Create a pandera schema for financial return data.

The schema enforces:

All return columns are float type.
Return values are within [-max_abs_return, max_abs_return].

Parameters:

max_abs_return (float, default: 1.0) – Maximum allowed absolute return value. Values outside [-max_abs_return, max_abs_return] fail validation.
allow_nan (bool, default: False) – Whether NaN values are allowed in return columns.
strict (bool, default: False) – If True, extra columns cause failure.
coerce (bool, default: True) – If True, attempt dtype coercion before validation.

Returns:

Schema suitable for passing to pandera_validate().

Return type:

Transforms¶

Data transformations for financial time series.

to_returns(prices, method='simple')[source]¶

Convert a price series to returns.

Parameters:

prices (Series | DataFrame) – Price series indexed by date.
method (Literal['simple', 'log'], default: 'simple') – 'simple' computes arithmetic returns (P_t / P_{t-1}) - 1. 'log' computes logarithmic returns ln(P_t / P_{t-1}).

Returns:

Return series. The first row will be NaN.
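The two formulas can be checked directly with pandas and NumPy. The snippet below is a sketch of the arithmetic on made-up prices, not a call into the library.

```python
import numpy as np
import pandas as pd

prices = pd.Series(
    [100.0, 110.0, 99.0],
    index=pd.date_range("2024-01-01", periods=3, freq="D"),
)

# Arithmetic returns: (P_t / P_{t-1}) - 1
simple = prices / prices.shift(1) - 1.0

# Logarithmic returns: ln(P_t / P_{t-1})
log = np.log(prices / prices.shift(1))

print(simple.round(4).tolist())
print(log.round(4).tolist())
```

The first row is NaN in both cases because there is no prior price, and the two conventions are linked by log = ln(1 + simple).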

Return type:

to_prices(returns, initial_price=100.0, method='simple')[source]¶

Convert a return series back to prices.

Parameters:

returns (Series | DataFrame) – Return series (may contain a leading NaN).
initial_price (float, default: 100.0) – Starting price level.
method (Literal['simple', 'log'], default: 'simple') – Must match the method used to compute the returns.

Returns:

Reconstructed price series beginning at initial_price.
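A sketch of the inverse transform, assuming NaN returns are treated as zero so that a leading NaN row maps to `initial_price`. The helper name `prices_from_returns` is hypothetical and the library's handling of missing values may differ.

```python
import numpy as np
import pandas as pd


def prices_from_returns(returns: pd.Series, initial_price: float = 100.0,
                        method: str = "simple") -> pd.Series:
    """Hypothetical stand-in that compounds returns into a price path."""
    r = returns.fillna(0.0)  # assumption: NaN means "no change"
    if method == "simple":
        growth = (1.0 + r).cumprod()
    elif method == "log":
        growth = np.exp(r.cumsum())
    else:
        raise ValueError(f"unknown method: {method!r}")
    return initial_price * growth


simple_returns = pd.Series([np.nan, 0.10, -0.10])
log_returns = pd.Series([np.nan, np.log(1.10), np.log(0.90)])

from_simple = prices_from_returns(simple_returns, method="simple")
from_log = prices_from_returns(log_returns, method="log")
print(from_simple.round(2).tolist(), from_log.round(2).tolist())
```

Both calls rebuild the same path (100, 110, 99), which shows why `method` has to match the convention used when the returns were computed.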

Return type:

to_excess_returns(returns, risk_free_rate)[source]¶

Compute excess returns above a risk-free rate.

Parameters:

returns (Series) – Asset return series.
risk_free_rate (Series | float) – Risk-free rate. If a pd.Series, it is aligned to returns by index.

Returns:

Excess return series.

Return type:

normalize_prices(prices, base=100.0)[source]¶

Rebase a price series so that it starts at base.

Parameters:

prices (Series | DataFrame) – Price series.
base (float, default: 100.0) – Desired starting value.

Returns:

Rebased price series.
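Rebasing amounts to dividing by the first observation and scaling by `base`. The sketch below assumes the first row contains no NaN values; the tickers and prices are made up.

```python
import pandas as pd

prices = pd.DataFrame(
    {"AAA": [50.0, 55.0, 60.0], "BBB": [200.0, 190.0, 210.0]},
    index=pd.date_range("2024-01-01", periods=3, freq="D"),
)
base = 100.0

# Dividing by the first row aligns on column labels, so each column is
# rebased against its own starting price.
rebased = prices / prices.iloc[0] * base
print(rebased.round(2))
```

After rebasing, both columns start at 100 and can be compared on one chart regardless of their original price levels.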

Return type:

rank_transform(data)[source]¶

Apply a cross-sectional rank transform.

Values are replaced with their rank divided by the count of non-NaN values, producing output in the range (0, 1].

Parameters:: data (Series | DataFrame) – Input data.
Returns:: Rank-transformed data.
Return type:: Series | DataFrame
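For a single Series, the description above corresponds to the following arithmetic. This is a sketch, not the library's code; how ties are broken and which axis is ranked for a DataFrame are not stated in this reference.

```python
import numpy as np
import pandas as pd

values = pd.Series([3.0, 1.0, np.nan, 2.0], index=list("abcd"))

# rank() leaves NaN in place and count() ignores NaN, so the result
# lies in (0, 1] for every non-missing entry.
ranked = values.rank() / values.count()
print(ranked.round(4).to_dict())
```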

percentile_rank(data, window=252)[source]¶

Compute a rolling percentile rank.

For each date, the value is ranked within the preceding window observations and expressed as a fraction between 0 and 1.

Parameters:

data (Series) – Input time series.
window (int, default: 252) – Rolling window size.

Returns:

Rolling percentile ranks.
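One plausible reading of the rolling rank is the share of observations in the trailing window that are less than or equal to the latest value. The sketch below assumes the window includes the current observation, which this reference does not confirm; the helper name `rolling_percentile_rank` is hypothetical.

```python
import numpy as np
import pandas as pd


def rolling_percentile_rank(data: pd.Series, window: int = 252) -> pd.Series:
    """Hypothetical stand-in: share of window values <= the latest value."""
    # raw=True passes each window as a NumPy array, so w[-1] is the latest value.
    return data.rolling(window).apply(lambda w: np.mean(w <= w[-1]), raw=True)


series = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0])
ranks = rolling_percentile_rank(series, window=3)
print(ranks.round(4).tolist())
```

The first `window - 1` entries are NaN because the window is not yet full.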

Return type:

expanding_zscore(data)[source]¶

Compute an expanding-window z-score.

Parameters:: data (Series) – Input time series.
Returns:: Z-scores computed using all data up to and including each point.
Return type:: Series

rolling_zscore(data, window=252)[source]¶

Compute a rolling-window z-score.

Parameters:

data (Series) – Input time series.
window (int, default: 252) – Rolling window size.

Returns:

Z-scores computed over the trailing window observations.
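Both z-score variants subtract a running mean and divide by a running standard deviation; only the window differs. The sketch below assumes the sample standard deviation (pandas' default, `ddof=1`), which this reference does not specify. The helper names are hypothetical.

```python
import numpy as np
import pandas as pd


def rolling_z(data: pd.Series, window: int) -> pd.Series:
    """Z-score against the trailing `window` observations."""
    roll = data.rolling(window)
    return (data - roll.mean()) / roll.std()


def expanding_z(data: pd.Series) -> pd.Series:
    """Z-score against all observations up to and including each point."""
    exp = data.expanding()
    return (data - exp.mean()) / exp.std()


series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
rz = rolling_z(series, window=3)
ez = expanding_z(series)
print(rz.round(4).tolist())
print(ez.round(4).tolist())
```

Because each statistic uses only data available at that point, neither variant looks ahead. The first entries are NaN until enough observations exist to estimate a standard deviation.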

Return type:

Calendar¶

Trading calendar utilities.

Wraps exchange-calendars and pandas-market-calendars for trading day schedules, market hours, and holiday detection.

get_trading_calendar(exchange='XNYS')[source]¶

Get a trading calendar for an exchange.

Parameters:: exchange (str, default: 'XNYS') – Exchange MIC code. The default, 'XNYS', is the New York Stock Exchange.
Return type:: Any
Returns:: exchange_calendars.ExchangeCalendar instance.

Example

>>> cal = get_trading_calendar("XNYS")

trading_days(start, end, exchange='XNYS')[source]¶

Get trading days between two dates.

Parameters:

start (Union[str, date, datetime, Timestamp, datetime64]) – Start date.
end (Union[str, date, datetime, Timestamp, datetime64]) – End date.
exchange (str, default: 'XNYS') – Exchange MIC code.

Return type:

DatetimeIndex

Returns:

DatetimeIndex of valid trading days.

is_business_day(dt)[source]¶

Check whether a date falls on a weekday (Monday to Friday). Exchange holidays are not taken into account.

Parameters:: dt (Union[str, date, datetime, Timestamp, datetime64]) – Date to check.
Return type:: bool
Returns:: True if weekday, False if weekend.
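The weekday test can be sketched with the standard library alone. The helper `is_weekday` is a hypothetical stand-in that accepts only `datetime.date` objects, whereas the documented function accepts several date-like types.

```python
from datetime import date


def is_weekday(d: date) -> bool:
    """Hypothetical stand-in: True for Monday to Friday."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return d.weekday() < 5


thursday_holiday = date(2024, 7, 4)  # US Independence Day, a Thursday
saturday = date(2024, 7, 6)

print(is_weekday(thursday_holiday), is_weekday(saturday))
```

A weekday check treats a market holiday such as 4 July as a business day, so trading_days() is the better fit when actual exchange sessions matter.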

Cache¶

Caching infrastructure for data fetches.

Provides both an in-memory TTL cache and optional disk caching via diskcache.

class MemoryCache[source]¶

Bases: object

Simple in-memory cache with TTL.

Parameters:: ttl (int | None, default: None)

__init__(ttl=None)[source]¶

Parameters:: ttl (int | None, default: None)
Return type:: None

property ttl: int¶

get(key)[source]¶

Get a cached value if it exists and hasn’t expired.

Parameters:: key (str)
Return type:: Any | None

set(key, value)[source]¶

Store a value in the cache.

Parameters:

key (str)
value (Any)

Return type:

clear()[source]¶

Clear all cached entries.

Return type:: None
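The get/set/clear contract of a TTL cache can be illustrated with a small standalone class. This is a sketch, not the library's `MemoryCache`: the class name `TTLCacheSketch`, the default TTL, and the use of seconds as the unit are all assumptions.

```python
import time
from typing import Any, Optional


class TTLCacheSketch:
    """Hypothetical in-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl: float = 60.0) -> None:
        self._ttl = ttl
        self._store: dict = {}  # key -> (stored_at, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        # A monotonic clock is unaffected by system clock adjustments.
        if time.monotonic() - stored_at > self._ttl:
            del self._store[key]  # drop the expired entry
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (time.monotonic(), value)

    def clear(self) -> None:
        self._store.clear()


cache = TTLCacheSketch(ttl=60.0)
cache.set("AAPL:2024", [189.0, 190.5])
hit = cache.get("AAPL:2024")
miss = cache.get("MSFT:2024")
cache.clear()
after_clear = cache.get("AAPL:2024")
print(hit, miss, after_clear)
```

Expiry is checked lazily on read, so stale entries stay in memory until they are requested again or the cache is cleared.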

class DiskCache[source]¶

Bases: object

Disk-based cache for persisting fetched data across sessions.

Falls back to a no-op if diskcache is not installed.

Parameters:: cache_dir (Path | None, default: None)

__init__(cache_dir=None)[source]¶

Parameters:: cache_dir (Path | None, default: None)
Return type:: None

get(key)[source]¶

Retrieve cached data from disk.

Parameters:: key (str)
Return type:: Series | DataFrame | None

set(key, value, ttl=None)[source]¶

Store data to disk cache.

Parameters:

key (str)
value (Series | DataFrame)
ttl (int | None, default: None)

Return type:

clear()[source]¶

Clear the disk cache.

Return type:: None

Base¶

Abstract base classes and provider registry for data sources.

class DataProvider[source]¶

Bases: ABC

Abstract base class for all data providers.

Subclasses must implement fetch_prices and set the name class attribute.

name: str = ''¶

abstractmethod fetch_prices(symbol, start=None, end=None, **kwargs)[source]¶

Fetch closing prices for a symbol.

Parameters:

symbol (str) – Ticker or identifier.
start (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – Start date.
end (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – End date.
kwargs (Any)

Return type:

Returns:

Price series with DatetimeIndex.

fetch_ohlcv(symbol, start=None, end=None, **kwargs)[source]¶

Fetch OHLCV data for a symbol.

Parameters:

symbol (str) – Ticker or identifier.
start (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – Start date.
end (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – End date.
kwargs (Any)

Return type:

Returns:

DataFrame with open, high, low, close, volume columns.

fetch_macro(series_id, start=None, end=None, **kwargs)[source]¶

Fetch macroeconomic data series.

Parameters:

series_id (str) – Series identifier (e.g., ‘GDP’, ‘UNRATE’).
start (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – Start date.
end (Union[str, date, datetime, Timestamp, datetime64, None], default: None) – End date.
kwargs (Any)

Return type:

Returns:

Macro data series with DatetimeIndex.

class ProviderRegistry[source]¶

Bases: object

Registry for data providers.

Allows registering, retrieving, and listing data providers by name.
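The register-then-look-up pattern can be sketched with standalone stand-ins. Nothing below imports the library: `ProviderSketch`, `StaticProvider`, and `RegistrySketch` are hypothetical names, the returned price data is made up, and the choice to set a default only when `default=True` is an assumption.

```python
from abc import ABC, abstractmethod
from typing import Optional


class ProviderSketch(ABC):
    """Hypothetical stand-in for an abstract data provider."""

    name: str = ""

    @abstractmethod
    def fetch_prices(self, symbol, start=None, end=None, **kwargs):
        """Return closing prices for `symbol`."""


class StaticProvider(ProviderSketch):
    name = "static"

    def fetch_prices(self, symbol, start=None, end=None, **kwargs):
        # Made-up data; a real provider would query a data source.
        return {"2024-01-02": 100.0, "2024-01-03": 101.0}


class RegistrySketch:
    """Hypothetical registry keyed by provider name."""

    def __init__(self) -> None:
        self._providers: dict = {}
        self._default: Optional[str] = None

    def register(self, provider: ProviderSketch, *, default: bool = False) -> None:
        self._providers[provider.name] = provider
        if default:
            self._default = provider.name

    def get(self, name: Optional[str] = None) -> ProviderSketch:
        key = name if name is not None else self._default
        if key not in self._providers:
            raise KeyError(f"provider not registered: {key!r}")
        return self._providers[key]

    def list_providers(self) -> list:
        return list(self._providers)


registry = RegistrySketch()
registry.register(StaticProvider(), default=True)
prices = registry.get().fetch_prices("AAA")
print(registry.list_providers(), prices)
```

Keying the registry by the provider's `name` lets callers select a source with a plain string while the default covers the common case.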

__init__()[source]¶

Return type:: None

register(provider, *, default=False)[source]¶

Parameters:

provider (DataProvider) – DataProvider instance to register.
default (bool, default: False) – If True, make this the default provider.

Return type: