Backtest Your Agent Logic, Not Just Your Strategy
Author: FXMacroData Team
Published: May 21, 2026
Most AI trading teams still backtest only one layer: signal-to-PnL. That misses the highest-risk component in modern systems: the agent itself. If your model misreads a macro print, drifts from schema, or violates policy under pressure, a good strategy can still produce bad trades.
Agent logic backtesting solves this by replaying historical contexts and scoring decision quality before any order reaches your broker. In FX, this matters most around event-heavy windows on pairs like USD/JPY and EUR/USD.
Why Strategy-Only Backtests Miss Real Failure Modes
When you only evaluate PnL, you hide three critical failure classes:
- Interpretation errors: the model misreads a release such as NFP and builds a thesis on the wrong direction.
- Contract errors: output breaks your schema during high-volatility periods.
- Risk-policy bypass: the model recommends oversizing or ignores invalidation criteria.
These issues often appear before PnL degradation becomes obvious. Agent-level backtesting catches them earlier.
The Four-Layer Agent Backtest Framework
Layer 1: Context Replay
Reconstruct each timestamp as the model would have seen it in real time. Pull only data available up to decision time from FXMacroData endpoints and calendar snapshots from the release calendar.
curl "https://fxmacrodata.com/api/v1/announcements/usd/core_pce?api_key=YOUR_API_KEY"
curl "https://fxmacrodata.com/api/v1/announcements/eur/inflation?api_key=YOUR_API_KEY"
curl "https://fxmacrodata.com/api/v1/forex?base=EUR&quote=USD&api_key=YOUR_API_KEY"
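One way to enforce the "only data available up to decision time" rule is to filter every fetched series on its publication timestamp before the context is assembled. The sketch below is a minimal illustration, not the FXMacroData response format: the `released_at` field name and the sample records are assumptions made for the example, so adapt them to whatever your stored snapshots actually contain.

```python
from datetime import datetime, timezone


def parse_utc(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp and normalize it to UTC."""
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def point_in_time(records: list[dict], decision_ts: str) -> list[dict]:
    """Keep only records published at or before the decision time.

    `released_at` is a hypothetical field name used for illustration.
    """
    cutoff = parse_utc(decision_ts)
    return [r for r in records if parse_utc(r["released_at"]) <= cutoff]


# Made-up records: the values are placeholders, not real prints.
records = [
    {"indicator": "core_pce", "released_at": "2024-03-29T12:30:00Z", "value": 1.0},
    {"indicator": "core_pce", "released_at": "2024-04-26T12:30:00Z", "value": 2.0},
]
visible = point_in_time(records, "2024-04-01T00:00:00Z")
```

Filtering on publication time rather than on the period the data describes is the important design choice: a March figure released in late April must not appear in an early-April context.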
Layer 2: Decision Replay
Run the agent on each context with the exact production prompt and constraints. Store raw output plus parsed output so you can audit both reasoning and structure.
{
  "pair": "EUR/USD",
  "action": "long|short|flat",
  "confidence": 0.0,
  "thesis": "string",
  "invalidation": "string",
  "size_pct": 0.0
}
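A strict parser for this contract might look like the sketch below. It treats any deviation as a schema failure and returns `None` instead of trying to repair the output. The bounds are assumptions for illustration: confidence is taken to lie in [0, 1] and `size_pct` to be non-negative, neither of which the schema above states explicitly.

```python
import json
from typing import Optional

ALLOWED_ACTIONS = {"long", "short", "flat"}
REQUIRED = {
    "pair": str,
    "action": str,
    "confidence": (int, float),
    "thesis": str,
    "invalidation": str,
    "size_pct": (int, float),
}


def parse_decision(raw: str) -> Optional[dict]:
    """Return the parsed decision, or None if it violates the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Require exactly the expected keys: no missing and no extra fields.
    if not isinstance(data, dict) or set(data) != set(REQUIRED):
        return None
    for key, expected in REQUIRED.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(data[key], bool) or not isinstance(data[key], expected):
            return None
    if data["action"] not in ALLOWED_ACTIONS:
        return None
    # Assumed bounds: confidence in [0, 1], size_pct >= 0.
    if not 0.0 <= data["confidence"] <= 1.0 or data["size_pct"] < 0.0:
        return None
    return data
```

Keeping the raw string next to the parsed result means a `None` here can still be audited later.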
Layer 3: Policy Simulation
Replay the same gatekeeper rules you use live: max risk, event-window lockouts, confidence floors, and concentration constraints.
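As a rough sketch, the four rule types can be expressed as one pure function that returns every violated rule. All thresholds, field names, and the `Policy` class itself are hypothetical placeholders; in a real replay you would import the same rule definitions your live gatekeeper uses rather than re-implementing them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    # Illustrative thresholds only, not recommended values.
    max_size_pct: float = 1.0
    min_confidence: float = 0.6
    event_lockout_min: int = 15
    max_pair_exposure_pct: float = 3.0


def check_policy(decision: dict, minutes_to_event: float,
                 pair_exposure_pct: float,
                 policy: Policy = Policy()) -> list[str]:
    """Return the violated rules; an empty list means the trade is allowed."""
    if decision["action"] == "flat":
        return []  # staying flat adds no risk
    violations = []
    if decision["size_pct"] > policy.max_size_pct:
        violations.append("max_risk")
    if decision["confidence"] < policy.min_confidence:
        violations.append("confidence_floor")
    # Lock out trades on either side of a scheduled release.
    if abs(minutes_to_event) < policy.event_lockout_min:
        violations.append("event_lockout")
    if pair_exposure_pct + decision["size_pct"] > policy.max_pair_exposure_pct:
        violations.append("concentration")
    return violations
```

Returning the full list of violations, rather than stopping at the first one, makes it easier to see which rules do most of the blocking in each regime.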
Layer 4: Outcome Attribution
Separate outcomes into buckets:
- Correct thesis, good policy compliance, profitable.
- Correct thesis, poor execution quality.
- Incorrect thesis, policy should have blocked.
- Schema or process failure independent of market direction.
This tells you whether to improve prompts, policies, or execution plumbing.
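The bucketing can be sketched as a small classifier. This is one possible mapping, not a definitive one: it assumes `thesis_correct` has already been labeled (for example, by comparing the stated direction with the realized move), it splits the incorrect-thesis bucket by whether the gatekeeper actually blocked the trade, and it treats every other correct-thesis case as an execution problem. The bucket names are invented for the example.

```python
from collections import Counter


def attribute(parsed_ok: bool, thesis_correct: bool,
              policy_ok: bool, pnl_r: float) -> str:
    """Map one replayed decision to an outcome bucket."""
    if not parsed_ok:
        # Schema or process failure, independent of market direction.
        return "process_failure"
    if not thesis_correct:
        # Incorrect thesis: did the policy layer catch it or let it through?
        return "bad_thesis_passed_policy" if policy_ok else "bad_thesis_blocked"
    if policy_ok and pnl_r > 0:
        return "good_thesis_profitable"
    # Simplified catch-all for a correct thesis that did not pay off.
    return "good_thesis_poor_execution"


# Made-up rows: (parsed_ok, thesis_correct, policy_ok, pnl_r)
rows = [
    (True, True, True, 1.2),
    (True, True, True, -0.4),
    (True, False, True, -1.0),
    (False, False, False, 0.0),
]
tally = Counter(attribute(*row) for row in rows)
```

A growing `bad_thesis_passed_policy` count points at policy gaps, while `process_failure` points at prompts or parsing.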
Designing a High-Quality Replay Dataset
Most replay pipelines fail because the dataset is too clean or too narrow. Build your dataset from mixed regimes, not just recent months.
A practical split:
- 40% normal sessions: low-vol, trend-following and range-bound mixes.
- 35% event windows: high-impact releases such as Core PCE and policy-rate days.
- 25% stress windows: broad risk-off days with unusually high spread and latency noise.
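A minimal sketch of sampling to that split, assuming you have already tagged candidate timestamps by regime (the tagging itself is the hard part and is not shown). The `pool` structure and regime keys are hypothetical names.

```python
import random

# Target shares from the split above.
SPLIT = {"normal": 0.40, "event": 0.35, "stress": 0.25}


def sample_replay_set(pool: dict[str, list], n: int,
                      seed: int = 7) -> dict[str, list]:
    """Draw about n timestamps stratified by regime, reproducibly."""
    rng = random.Random(seed)  # fixed seed keeps the benchmark stable
    out = {}
    for regime, share in SPLIT.items():
        k = round(n * share)
        if k > len(pool[regime]):
            raise ValueError(f"not enough {regime} samples: need {k}")
        out[regime] = rng.sample(pool[regime], k)
    return out


# Placeholder pool: integers stand in for tagged timestamps.
pool = {
    "normal": list(range(1000)),
    "event": list(range(1000, 1500)),
    "stress": list(range(1500, 1700)),
}
replay_set = sample_replay_set(pool, 100)
```

Because each share is rounded separately, the total can differ from `n` by a row or two for sizes that do not divide cleanly. Failing loudly when a regime is under-populated is deliberate: silently backfilling stress windows with normal sessions would defeat the purpose of the split.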
For each timestamp, capture only what was known then. That includes release schedule context from the calendar, current spot path, and any policy context from central-bank communication archives.
Replay row fields (recommended):
- ts_utc
- pair
- context_payload_hash
- prompt_version
- model_version
- raw_output
- parsed_output
- policy_decision
- simulated_execution
- realized_outcome
Hashing context payloads helps detect accidental future-data leakage during refactors.
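A hash is only useful here if identical payloads always serialize identically. One common approach, sketched below, is to serialize to canonical JSON (sorted keys, fixed separators) and hash the result with SHA-256. The function name and sample payload are illustrative, and the sketch assumes the payload is JSON-serializable.

```python
import hashlib
import json


def context_hash(payload: dict) -> str:
    """SHA-256 of a canonical JSON serialization of the context payload."""
    canonical = json.dumps(
        payload,
        sort_keys=True,         # key order must not change the hash
        separators=(",", ":"),  # no whitespace differences
        ensure_ascii=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same content in a different key order produces the same hash.
a = context_hash({"pair": "EUR/USD", "ts_utc": "2024-04-01T00:00:00Z"})
b = context_hash({"ts_utc": "2024-04-01T00:00:00Z", "pair": "EUR/USD"})
```

If a refactor changes the hash for a historical timestamp, something in the context changed, and an unexpected extra field is a prime leakage suspect.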
How to Grade Reasoning, Not Just Direction
Direction-only scoring hides important degradation. Add a simple reasoning rubric scored by deterministic checks plus light human audit:
- Causal correctness: does thesis reference the right macro driver?
- Constraint awareness: does recommendation reflect risk rules?
- Uncertainty calibration: does confidence match context quality?
- Action discipline: does the model choose flat when evidence is weak?
Track this as ReasoningConsistency so you can compare models and prompts beyond PnL.
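The deterministic half of the rubric could be approximated with simple checks like the ones below. This is a deliberately crude sketch: keyword matching is a weak proxy for causal correctness, and every context field (`expected_drivers`, `max_size_pct`, `low_quality`, `weak_evidence`) and the 0.5 confidence cap are hypothetical labels you would have to produce in your own replay dataset.

```python
def score_reasoning(decision: dict, ctx: dict) -> dict[str, bool]:
    """Run one deterministic check per rubric item."""
    thesis = decision["thesis"].lower()
    return {
        # Does the thesis mention at least one expected macro driver?
        "causal_correctness": any(
            d.lower() in thesis for d in ctx["expected_drivers"]
        ),
        # Is the requested size within the stated risk limit?
        "constraint_awareness": decision["size_pct"] <= ctx["max_size_pct"],
        # Low-quality context should cap confidence (0.5 is an assumed cap).
        "uncertainty_calibration": (
            not ctx["low_quality"] or decision["confidence"] <= 0.5
        ),
        # Weak evidence should lead to a flat recommendation.
        "action_discipline": (
            not ctx["weak_evidence"] or decision["action"] == "flat"
        ),
    }


def reasoning_consistency(checks: dict[str, bool]) -> float:
    """Fraction of rubric checks that passed."""
    return sum(checks.values()) / len(checks)
```

Cases where these checks disagree with a human reviewer are good candidates for the light audit sample.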
Scoring Agent Quality (Beyond Hit Rate)
A robust scorecard should track at least these metrics:
- Schema pass rate: percent of outputs that parse cleanly.
- Policy compliance rate: percent of outputs that satisfy hard constraints.
- Reasoning consistency: how often thesis aligns with supplied context.
- Latency distribution: p50/p95 decision time in realistic pipeline conditions.
- Regime stability: score drift across trending, range-bound, and event-shock windows.
Example weighted score:
AgentScore = 0.30 * SchemaPass
           + 0.25 * PolicyCompliance
           + 0.20 * ReasoningConsistency
           + 0.15 * RegimeStability
           + 0.10 * LatencyScore
If you run a safety-first workflow, increase the weights on schema and policy compliance. If you run an event-speed workflow, increase the weight on latency and event-window behavior.
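In code, the weighted score is a dot product with two sanity checks. The sketch assumes every component has already been normalized to the range [0, 1]; how you map latency or regime drift onto that range is a separate design decision not covered here. The metric key names are illustrative.

```python
# Weights from the example above; they must sum to 1.
WEIGHTS = {
    "schema_pass": 0.30,
    "policy_compliance": 0.25,
    "reasoning_consistency": 0.20,
    "regime_stability": 0.15,
    "latency_score": 0.10,
}


def agent_score(metrics: dict[str, float],
                weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of normalized component scores."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    for name in weights:
        if not 0.0 <= metrics[name] <= 1.0:
            raise ValueError(f"{name} must be normalized to [0, 1]")
    return sum(weights[name] * metrics[name] for name in weights)


# Made-up component scores for illustration.
example = agent_score({
    "schema_pass": 1.0,
    "policy_compliance": 0.8,
    "reasoning_consistency": 0.5,
    "regime_stability": 0.6,
    "latency_score": 1.0,
})  # 0.30 + 0.20 + 0.10 + 0.09 + 0.10 = 0.79
```

To shift toward a safety-first or event-speed profile, pass a different `weights` dict. The sum check catches the common mistake of raising one weight without lowering another.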
Minimal Replay Harness
Use a replay runner that logs each decision and score component.
from dataclasses import dataclass


@dataclass
class ReplayResult:
    ts: str
    parsed_ok: bool
    policy_ok: bool
    reasoning_ok: bool
    latency_ms: int
    pnl_r: float


def evaluate_one(ctx, agent, gatekeeper) -> ReplayResult:
    # Run the agent on the replayed context and parse its raw output.
    raw = agent.run(ctx)
    parsed = agent.parse(raw)
    parsed_ok = parsed is not None
    if not parsed_ok:
        # Schema failure: nothing downstream can be scored.
        return ReplayResult(ctx["ts"], False, False, False, agent.last_latency_ms, 0.0)

    # Apply the same gatekeeper rules used live.
    gate = gatekeeper.validate(parsed, ctx)
    policy_ok = gate.allowed
    reasoning_ok = gate.reasoning_consistent
    # Blocked decisions contribute no simulated PnL.
    pnl_r = gate.simulated_r if policy_ok else 0.0

    return ReplayResult(
        ts=ctx["ts"],
        parsed_ok=parsed_ok,
        policy_ok=policy_ok,
        reasoning_ok=reasoning_ok,
        latency_ms=agent.last_latency_ms,
        pnl_r=pnl_r,
    )
The key is deterministic replay: same input context, same prompt version, same validation rules.
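Once a run has produced a list of results, they can be rolled up into the scorecard metrics. The sketch below redeclares `ReplayResult` so it runs on its own and uses a nearest-rank percentile for latency; the metric key names and the choice of percentile method are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class ReplayResult:
    ts: str
    parsed_ok: bool
    policy_ok: bool
    reasoning_ok: bool
    latency_ms: int
    pnl_r: float


def percentile(values: list[int], p: float) -> int:
    """Nearest-rank percentile: smallest value covering p% of the samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(results: list[ReplayResult]) -> dict:
    """Aggregate per-decision results into run-level metrics."""
    n = len(results)
    if n == 0:
        raise ValueError("no replay results to summarize")
    latencies = [r.latency_ms for r in results]
    return {
        "schema_pass": sum(r.parsed_ok for r in results) / n,
        "policy_compliance": sum(r.policy_ok for r in results) / n,
        "reasoning_consistency": sum(r.reasoning_ok for r in results) / n,
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "total_r": sum(r.pnl_r for r in results),
    }
```

Running `summarize` separately per regime bucket, rather than once over the whole run, keeps averages from hiding event-window failures.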
From Replay Results to Deployment Decisions
Do not promote model or prompt changes directly from point metrics. Use explicit deployment gates:
- Gate 1: schema pass rate must not regress.
- Gate 2: policy compliance must remain above threshold in event windows.
- Gate 3: reasoning consistency must improve or remain stable.
- Gate 4: latency p95 must stay within operational budget.
Only if all gates pass should you begin paper-trading shadow mode. Then require a minimum shadow sample size before live deployment.
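The four gates translate naturally into a comparison between a baseline run and a candidate run. In this sketch the threshold values (`min_event_policy`, `p95_budget_ms`) and the metric key names are placeholders chosen for illustration, and "must not regress" is read strictly, with no tolerance band.

```python
def check_gates(baseline: dict, candidate: dict,
                min_event_policy: float = 0.99,
                p95_budget_ms: int = 2000) -> dict[str, bool]:
    """Evaluate each deployment gate; promote only if all are True."""
    return {
        # Gate 1: schema pass rate must not regress.
        "schema": candidate["schema_pass"] >= baseline["schema_pass"],
        # Gate 2: policy compliance in event windows stays above threshold.
        "event_policy": candidate["event_policy_compliance"] >= min_event_policy,
        # Gate 3: reasoning consistency improves or holds.
        "reasoning": (
            candidate["reasoning_consistency"]
            >= baseline["reasoning_consistency"]
        ),
        # Gate 4: latency p95 within the operational budget.
        "latency": candidate["latency_p95_ms"] <= p95_budget_ms,
    }


# Made-up metrics for illustration.
baseline = {"schema_pass": 0.98, "reasoning_consistency": 0.80}
candidate = {
    "schema_pass": 0.99,
    "event_policy_compliance": 0.995,
    "reasoning_consistency": 0.80,
    "latency_p95_ms": 1500,
}
gates = check_gates(baseline, candidate)
promote_to_shadow = all(gates.values())
```

Reporting each gate separately, instead of a single pass/fail flag, shows reviewers exactly which dimension blocked a promotion.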
Promotion policy example:
- Replay pass: required
- Shadow mode: 3 weeks minimum
- Live rollout: 20% traffic for 5 days, then full
- Auto-rollback: any schema fail burst or policy breach cluster
This prevents the classic cycle where teams overfit to replay and under-test operational behavior.
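The auto-rollback trigger needs a concrete definition of "burst" or "cluster". One simple interpretation, sketched here, is a count of failures within a sliding window of recent decisions. The class name, window size, and failure threshold are all assumptions to tune against your own traffic.

```python
from collections import deque


class BurstDetector:
    """Flags a rollback when too many failures occur in recent decisions."""

    def __init__(self, window: int = 50, max_failures: int = 3):
        # deque(maxlen=...) drops the oldest entry automatically.
        self.events = deque(maxlen=window)
        self.max_failures = max_failures

    def record(self, failed: bool) -> bool:
        """Record one decision; return True if rollback should trigger."""
        self.events.append(failed)
        return sum(self.events) >= self.max_failures


# Tiny window for illustration: 2 failures within the last 5 decisions.
detector = BurstDetector(window=5, max_failures=2)
signals = [detector.record(f) for f in (False, True, False, True)]
```

In practice you would likely run one detector for schema failures and another for policy breaches, so a rollback log shows which kind of failure fired.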
Common Testing Mistakes
- Leakage: accidentally including future fields in context.
- Prompt drift: backtesting with one prompt and trading live with another.
- No regime segmentation: averaging results across very different volatility states.
- No policy replay: treating all model outputs as tradable.
How This Improves Live Trading Reliability
Agent-logic backtesting improves reliability in ways classic backtests cannot:
- Finds failure clusters around central-bank days, from the Federal Reserve to the Bank of England.
- Reveals which errors are prompt-related versus policy-related.
- Supports safer model upgrades because you can compare decision behavior across versions before deployment.
- Creates a reusable audit trail for every accepted or rejected trade candidate.
If you already track PnL, this adds the missing observability layer that keeps AI automation from degrading silently.
Bottom Line
Backtesting strategy logic is necessary. Backtesting agent logic is what makes AI trading workflows durable. The strongest systems evaluate both: market edge and decision integrity.
Next step: build a monthly replay benchmark and require every prompt or model change to pass it before reaching live mode. Add positioning context from COT (Commitments of Traders) data and session filters from FX sessions to stress-test behavior under different market states.