
How We Validate Macro Data Accuracy Before Serving It

A look inside the multi-stage data validation pipeline that ensures every macro indicator served by FXMacroData is accurate, timely, and consistent — from initial ingest and schema checks to outlier filtering, cross-source reconciliation, and business-day integrity rules.

Engineering

Macro data is only as useful as it is accurate. Here is how every data point served by FXMacroData passes through a five-stage validation pipeline before it reaches your application.

When a central bank releases a policy rate decision or a statistics bureau publishes a new inflation print, the raw announcement often arrives with noise: encoding artifacts, partial pages, missing fields, or revisions that contradict the prior month. Serving that data naively — as many aggregators do — passes the problem straight to the developer.

At FXMacroData we treat data quality as a first-class product concern. Every indicator from every currency goes through a deterministic validation pipeline before it is written to Firestore and exposed through the API. This post walks through that pipeline layer by layer — from the moment a fetcher downloads a raw response to the moment a value becomes queryable at endpoints like /v1/announcements/{currency}/{indicator}.

Pipeline at a Glance

① Ingest ② Schema Check ③ Range & Outlier Filter ④ Cross-Source Reconciliation ⑤ Business-Day Integrity

Stage 1 — Ingest: Structured Source Fetchers

Validation starts before a single value is extracted. Every currency has a dedicated fetcher class that targets an official primary source — the central bank website, the national statistics bureau, or a government data portal. We deliberately avoid secondary aggregators in the ingest path: their lag, licensing terms, and occasional silent revisions introduce uncertainty we cannot control.

Fetchers are asynchronous Python classes implementing the asynchronous context manager protocol (async with). On entry they open an aiohttp.ClientSession with a realistic User-Agent and controlled timeouts; on exit they close cleanly regardless of whether the fetch succeeded or raised. Inside each fetcher, parsing is strict: HTML is parsed with lxml or BeautifulSoup using exact element selectors rather than regex fallbacks against raw markup, and JSON APIs are accessed via typed accessor keys that raise immediately if a field is absent or renamed upstream.

Fetcher contract — required output keys

{
    "date": "2026-03-31",          # ISO-8601 date string
    "val": 3.5,                    # float — never string
    "announcement_datetime": "..." # UTC ISO-8601 if available
}

The output contract is enforced at the fetcher boundary: any record missing date or val is discarded before it reaches the next stage. announcement_datetime is optional at ingest but required for publication-facing endpoints that expose event timing to API consumers.
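As a minimal sketch, the boundary enforcement might look like the following filter. The function name and record shapes here are illustrative, not the production code:

```python
from typing import Any

# Keys every fetcher record must carry; announcement_datetime stays optional.
REQUIRED_KEYS = ("date", "val")

def enforce_fetcher_contract(records: list) -> list:
    """Discard any record missing a required key before it reaches Stage 2."""
    valid = []
    for rec in records:
        if all(k in rec and rec[k] is not None for k in REQUIRED_KEYS):
            valid.append(rec)
    return valid
```

Anything dropped here would also be logged in practice; the sketch keeps only the filtering behaviour.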


Stage 2 — Schema Check: Type and Completeness Validation

Raw fetcher output is handed to a schema validator that applies four checks on every record:

Date format

Parsed as an ISO-8601 date. Non-parseable strings, future dates beyond a two-day grace window, and dates before 1960 are all rejected.

Value type

val must coerce to a finite Python float. NaN, Inf, and non-numeric strings (e.g. "n/a", empty strings) are rejected rather than coerced to zero.

Duplicate detection

If two records share the same (currency, indicator, date) key, the pipeline keeps the most recently ingested one and logs the collision, so no record is overwritten silently and every collision remains auditable.

Currency–indicator pairing

Every record is validated against the published indicator catalogue. A fetcher mistakenly writing unemployment for a currency that does not expose that indicator raises an error and halts the batch.
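The first two of these checks can be sketched in a few lines, assuming the rejection rules above; check_date and check_value are illustrative names, not the production API:

```python
import math
from datetime import date, timedelta
from typing import Optional

GRACE = timedelta(days=2)  # future-date grace window from the rules above

def check_date(raw) -> Optional[date]:
    """Reject non-parseable strings, far-future dates, and pre-1960 dates."""
    try:
        d = date.fromisoformat(raw)
    except (TypeError, ValueError):
        return None
    if d.year < 1960 or d > date.today() + GRACE:
        return None
    return d

def check_value(raw) -> Optional[float]:
    """Coerce to a finite float; NaN, Inf, and non-numeric strings are rejected."""
    try:
        v = float(raw)
    except (TypeError, ValueError):
        return None
    return v if math.isfinite(v) else None
```

Returning None (rather than coercing to zero) lets the caller discard the record and log a structured failure.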

Schema failures are surfaced as structured Cloud Logging entries tagged with severity=ERROR, stage=schema_check, and the originating fetcher name. This makes cross-run diffing straightforward in the GCP console.


Stage 3 — Range & Outlier Filter

Structural validity is necessary but not sufficient. A year-over-year USD CPI reading of 250.0 percent is syntactically valid but obviously wrong. Stage 3 applies two complementary checks to catch these semantic errors.

Hard range bounds

Each indicator has a catalogue entry that includes optional min_val and max_val bounds derived from plausible historical ranges plus a generous safety margin. Policy rates, for example, are bounded between -5.0 and 30.0 percent. Year-over-year inflation is bounded between -30.0 and 300.0 percent — wide enough to accommodate hyperinflationary episodes without constraining legitimate emerging-market data. Values outside these bounds are quarantined pending manual review.
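A sketch of the bounds check under these rules; the BOUNDS table and indicator keys are illustrative stand-ins for the real catalogue entries:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IndicatorBounds:
    min_val: Optional[float] = None  # None means unbounded below
    max_val: Optional[float] = None  # None means unbounded above

# Illustrative entries mirroring the examples in the text
BOUNDS = {
    "interest_rate": IndicatorBounds(-5.0, 30.0),
    "inflation": IndicatorBounds(-30.0, 300.0),
}

def within_bounds(indicator: str, val: float) -> bool:
    """False means the value should be quarantined for manual review."""
    b = BOUNDS.get(indicator, IndicatorBounds())
    if b.min_val is not None and val < b.min_val:
        return False
    if b.max_val is not None and val > b.max_val:
        return False
    return True
```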

Rolling z-score outlier detection

For indicators with at least 24 months of history in Firestore, the pipeline computes a 36-month rolling mean and standard deviation and flags any new record whose absolute z-score exceeds 4.0. Unlike hard bounds, z-score flags do not automatically discard records — they create a review entry and attach an outlier_flag: true field to the Firestore document so that API consumers can optionally filter outlier-flagged records in their own workflows.
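The flag-but-do-not-discard behaviour can be sketched as follows; this is a simplified, non-rolling version, with window management and Firestore wiring omitted:

```python
import statistics

def zscore_flag(history, new_val, threshold=4.0) -> bool:
    """Return True when a new reading should carry outlier_flag: true.

    `history` stands in for the trailing 36-month window described above.
    """
    if len(history) < 24:  # insufficient history: never flag
        return False
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return False
    return abs(new_val - mean) / stdev > threshold
```

Note the function only flags; the record is still written, matching the review-entry semantics described above.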

Why 4σ rather than 3σ?

Macro indicators genuinely exhibit fat tails. COVID-19 supply shocks, the 2022 energy crisis, and rapid central bank hiking cycles all produced statistically rare but real readings. A 3σ threshold would quarantine legitimate data during regime changes, exactly when accurate readings are most important.


Stage 4 — Cross-Source Reconciliation

For a subset of high-importance indicators — central bank policy rates, headline CPI, and unemployment — the pipeline maintains a secondary source to cross-reference against. This is not a live fallback at request time (all data served to users comes exclusively from Firestore); it is an ingest-time consistency check.

When the primary and secondary values for the same (currency, indicator, date) diverge by more than a configurable tolerance, an alert is raised and the primary value is held pending investigation. For policy rates the tolerance is 5 basis points; for CPI it is 0.1 percentage points. The tolerances are intentionally narrow for these indicators because even small discrepancies often indicate a parsing error, a reporting lag, or a preliminary-vs-final revision conflict.
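A sketch of the tolerance comparison; the indicator keys, tolerance table, and status strings are illustrative:

```python
TOLERANCES = {
    "interest_rate": 0.05,  # 5 basis points, expressed in percentage points
    "inflation": 0.1,       # 0.1 percentage points
}

def reconcile(indicator: str, primary: float, secondary: float) -> str:
    """Compare primary and secondary sources for one (currency, indicator, date)."""
    tol = TOLERANCES.get(indicator)
    if tol is None:
        return "no_cross_check"  # indicator not in the high-importance subset
    return "ok" if abs(primary - secondary) <= tol else "hold_for_review"
```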

Primary sources

  • Central bank official releases
  • National statistics bureaus
  • Government data portals

Cross-reference sources

  • Parallel official endpoints (e.g. BIS)
  • Revision-flagged historical records
  • Internal prior-period consistency check

Beyond per-record cross-checks, the pipeline also runs a month-over-month continuity check: if a new record represents a change of more than N standard deviations from the trailing 12-month average change, it is treated as a candidate revision conflict. Preliminary releases frequently differ from final revisions; the pipeline logs both values and exposes a revised flag when a date's value is updated after initial publication.
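The continuity check might be sketched like this, with N fixed at 3.0 purely for illustration (the pipeline leaves N configurable):

```python
import statistics

def continuity_conflict(series, new_val, n_sigma=3.0) -> bool:
    """Candidate revision conflict when the newest month-over-month change is
    more than n_sigma standard deviations from the trailing 12 changes."""
    if len(series) < 13:  # need 12 trailing changes
        return False
    changes = [b - a for a, b in zip(series[-13:-1], series[-12:])]
    mean = statistics.fmean(changes)
    stdev = statistics.stdev(changes)
    if stdev == 0:
        return False
    new_change = new_val - series[-1]
    return abs(new_change - mean) / stdev > n_sigma
```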


Stage 5 — Business-Day Integrity

The final validation stage addresses a subtle but important constraint: every announcement_datetime must fall on a valid business day in the market timezone of the currency being published. Statistics bureaus and central banks do not publish announcements on weekends or public holidays — so if the pipeline produces a timestamp that lands on a Saturday in Tokyo or a bank holiday in Sydney, something went wrong upstream.

The validator calls is_valid_announcement_date(currency, local_date), which checks the date against per-currency timezone definitions and complete holiday calendars maintained in the codebase. Every currency served by the API — AUD, EUR, GBP, JPY, USD, CAD, CHF, NZD, and all others — has its own independent timezone and holiday table. Currencies do not inherit from their FX session; a Friday in New York can be Saturday in Sydney, and the validator handles this precisely.

Business-day validation (simplified)

def is_valid_announcement_date(currency: str, local_date: date) -> bool:
    # Lookup doubles as a guard: an unknown currency raises KeyError.
    # local_date is assumed to already be expressed in this market timezone.
    tz = CURRENCY_TIMEZONE[currency]
    # Reject weekends (Monday == 0, so Saturday and Sunday are 5 and 6)
    if local_date.weekday() >= 5:
        return False
    # Reject public holidays
    if local_date in _build_holiday_set(currency, local_date.year):
        return False
    return True

When a computed date fails this check, next_valid_announcement_date advances it to the next business day — for example, rolling a Christmas announcement to the following Monday. This ensures the release calendar endpoint served to API consumers always contains dates that can be used directly in trading calendars without manual cleaning. These business-day rules are also enforced by a CI test suite that fails the build if any currency in the catalogue is missing timezone or holiday data.
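A sketch of the roll-forward helper under the same weekend and holiday rules. The one-entry holiday table here is a toy (USD, Christmas 2026, which falls on a Friday), whereas the production pipeline maintains complete per-currency calendars:

```python
from datetime import date, timedelta

# Toy holiday table for illustration only
HOLIDAYS = {
    "USD": {date(2026, 12, 25)},
}

def _is_business_day(currency: str, d: date) -> bool:
    return d.weekday() < 5 and d not in HOLIDAYS.get(currency, set())

def next_valid_announcement_date(currency: str, d: date) -> date:
    """Advance a computed date forward until it lands on a business day."""
    while not _is_business_day(currency, d):
        d += timedelta(days=1)
    return d
```

With this toy table, a Christmas 2026 announcement rolls past the Saturday and Sunday to Monday 2026-12-28.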

Release calendar accuracy: Upcoming event dates from the release calendar endpoint — such as the next Fed meeting or RBA rate decision — are guaranteed to fall on valid business days in the currency's market timezone. The endpoint at /api/v1/calendar/{currency} reflects this validated schedule directly.


Continuous Monitoring and Alerting

Passing the validation pipeline once is not enough. The pipeline runs on a schedule — triggered by Cloud Tasks and backfill workflows — and each run produces structured telemetry that feeds a monitoring layer.

Stage alerts

Failures at any pipeline stage emit a Cloud Logging entry immediately for triage.

Content hash

Every Firestore write includes a content_hash to detect and surface silent upstream revisions.

Staleness check

Readers detect when stored data falls more than N days behind the requested range and surface a gap signal rather than silently returning stale values.
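As an illustration of the content-hash idea, one common approach (an assumption here, not necessarily the production scheme) is to hash a canonical JSON serialisation of the record, so that key order never affects the digest:

```python
import hashlib
import json

def content_hash(record: dict) -> str:
    """Stable digest of a record payload, for detecting silent upstream revisions."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A re-ingested record whose digest differs from the stored one has been revised upstream, even if no revision was announced.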

When a fetcher fails to return data — network timeout, upstream site change, or response structure change — the pipeline does not fall back to live upstream calls at request time. Instead it emits a validation failure and returns an empty result or a structured DataUnavailableError to the caller. This prevents stale or partially-validated data from reaching the API layer, even temporarily.


Handling Revisions and Restatements

Macro data revisions are a fact of life. Initial GDP estimates are revised two or three times. Payrolls get significantly restated. The pipeline handles revisions explicitly rather than silently overwriting:

  • First-print storage: The pipeline stores the first value for a given (currency, indicator, date) with a revised: false flag.
  • Revision detection: On subsequent ingest runs, if the value for a date has changed by more than the indicator's revision threshold, the document is updated and revised: true is set.
  • History preservation: The original first-print value is preserved in a prior_val field for auditing and comparison purposes.
  • API transparency: The revised field is exposed in API responses so consuming applications can distinguish preliminary from final readings.
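The four rules above can be condensed into one transition function; apply_ingest and its document shape are illustrative, not the production Firestore schema:

```python
from typing import Optional

def apply_ingest(doc: Optional[dict], new_val: float, threshold: float) -> dict:
    """Apply one ingest run's value to the stored document for a given date.

    `doc` is None the first time a (currency, indicator, date) is seen;
    `threshold` is the indicator's revision threshold.
    """
    if doc is None:
        # First print
        return {"val": new_val, "revised": False}
    if abs(new_val - doc["val"]) > threshold:
        # Revision: preserve the original first print in prior_val
        return {
            "val": new_val,
            "revised": True,
            "prior_val": doc.get("prior_val", doc["val"]),
        }
    return doc  # change below threshold: keep the stored document
```

Note that prior_val always carries the first print, even across multiple revisions, which is what makes the audit trail meaningful.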

This matters most for indicators like Non-Farm Payrolls, where the preliminary print and the subsequent revision can differ by tens of thousands of jobs — a meaningful signal in its own right for FX traders tracking the USD employment narrative via the non-farm payrolls endpoint.


What This Means for API Consumers

The practical outcome of this pipeline for anyone querying the API:

  • No NaN or null values in the series — records with invalid values are excluded at Stage 2 rather than passed through as holes.
  • Dates you can trust — every date in a response is a valid calendar date on a business day for that currency's market, suitable for direct use in trading calendars or backtesting engines.
  • Announcement timestamps at second precision — where available, announcement_datetime reflects the precise UTC second of the official release, not a midnight placeholder.
  • Revision flags — the revised field lets you distinguish whether you are working with a preliminary or final reading.
  • Consistent indicator units — rate indicators are consistently in percent, not decimal (e.g. 5.25 not 0.0525), matching the representation on official central bank websites.

Query any indicator — policy rate, headline CPI, unemployment — and the response you receive has already passed all five stages. The indicator catalogue documents exactly which sources feed each series, so you can verify the provenance of any data point independently.

Explore the data

Every indicator in the catalogue has passed through this pipeline. Browse the full endpoint reference to see the series available for your currency of interest.

API Reference →