Why Backtest the Gold Macro Scorecard?
In the companion article Predicting Gold Prices Using Macro Data, we built a composite macro scorecard that assigns directional signals to six US macro indicators — TIPS 10Y real yield, breakeven inflation, Fed policy rate, Fed total assets, M2 money supply, and the trade-weighted dollar — and aggregates them into a net gold bias. The scorecard tells you whether the macro regime favours gold. But does it actually work?
This article answers that question by running a systematic backtest against daily gold prices from the FXMacroData commodities endpoint. We will compute the scorecard at each macro data release, hold a simple long/flat position in gold based on the net signal, and measure whether that signal delivered meaningful returns above buy-and-hold.
Backtest Objective
Test whether a macro-signal-driven long/flat gold strategy outperforms passive buy-and-hold over a multi-year period using daily gold prices and macro indicator releases.
Step 1: Fetch Daily Gold Prices and Macro Series
The foundation of the backtest is the daily gold price from the FXMacroData commodities/gold endpoint — LBMA PM Fix prices in USD per troy ounce. Unlike monthly or weekly aggregated data, daily prices let us measure the precise impact of each macro signal transition.
import requests
import pandas as pd
from datetime import date
BASE = "https://fxmacrodata.com/api/v1"
KEY = "YOUR_API_KEY"
def get_series(path: str, start: str = "2020-01-01") -> pd.DataFrame:
    """Fetch a time series and return as a DataFrame with date index."""
    r = requests.get(f"{BASE}{path}", params={"api_key": KEY, "start_date": start})
    r.raise_for_status()
    data = r.json().get("data", [])
    df = pd.DataFrame(data)
    if not df.empty:
        df["date"] = pd.to_datetime(df["date"])
        df = df.set_index("date").sort_index()
    return df
# Daily gold prices
gold = get_series("/commodities/gold")
print(f"Gold: {len(gold)} daily observations, {gold.index[0].date()} to {gold.index[-1].date()}")
# Gold: ~1350 daily observations, 2020-01-02 to 2026-04-15
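The parsing step can be exercised without an API key or network access. The following is a minimal sketch, assuming the response has the shape the code above reads (a `data` list of records with `date` and `val` fields). The helper name `payload_to_frame` and the sample values are hypothetical and not part of the FXMacroData API.

```python
import pandas as pd


def payload_to_frame(payload: dict) -> pd.DataFrame:
    """Turn a decoded JSON payload into a date-indexed, sorted DataFrame."""
    df = pd.DataFrame(payload.get("data", []))
    if df.empty:
        # Return an empty frame with the expected layout instead of failing later
        return pd.DataFrame(columns=["val"], index=pd.DatetimeIndex([], name="date"))
    df["date"] = pd.to_datetime(df["date"])
    return df.set_index("date").sort_index()


# Hypothetical payload, deliberately out of order to show the sort
sample = {
    "data": [
        {"date": "2024-01-03", "val": 2050.0},
        {"date": "2024-01-02", "val": 2060.0},
    ]
}
frame = payload_to_frame(sample)
print(frame)
```

Returning an empty but correctly shaped frame makes a missing series show up as an empty result rather than as a `KeyError` on the `date` column.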
Next, pull the six macro indicator series that feed the scorecard. These are published at different frequencies: some weekly (TIPS yield, breakeven), some monthly (M2), some on FOMC dates (policy rate). Each observation persists as the "current" value until the next release.
# Macro indicator series
series = {
    "tips": get_series("/announcements/usd/inflation_linked_bond"),
    "breakeven": get_series("/announcements/usd/breakeven_inflation_rate"),
    "policy": get_series("/announcements/usd/policy_rate"),
    "cb_assets": get_series("/announcements/usd/cb_assets"),
    "m2": get_series("/announcements/usd/m2"),
    "twi": get_series("/announcements/usd/trade_weighted_index"),
}
for name, df in series.items():
    print(f" {name:12s}: {len(df):4d} obs ({df.index[0].date()} – {df.index[-1].date()})")
Key Design Decision: Forward-Fill Macro Data
Macro indicators are published at irregular intervals. Between releases, the last known value is still the market's operating assumption. We forward-fill each series to the daily gold index so that on any given day, the scorecard reflects only information that was publicly available at that time. This avoids look-ahead bias.
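A toy example makes the forward-fill behaviour concrete. This is a minimal sketch with made-up release dates and values (the series name `m2_demo` is hypothetical). It assumes the index holds release dates rather than reference periods, which is the condition under which forward-filling avoids look-ahead.

```python
import pandas as pd

# Hypothetical monthly releases, indexed by the day they became public
releases = pd.Series(
    [100.0, 101.5],
    index=pd.to_datetime(["2024-01-10", "2024-02-12"]),
    name="m2_demo",
)

# Hypothetical daily price calendar (business days only)
days = pd.bdate_range("2024-01-08", "2024-02-13")

# Each day gets the most recent value released on or before that day
daily = releases.reindex(days, method="ffill")

print(daily.loc[pd.Timestamp("2024-01-09")])  # NaN: nothing published yet
print(daily.loc[pd.Timestamp("2024-02-09")])  # still the January release
print(daily.loc[pd.Timestamp("2024-02-12")])  # switches on the release day
```

Days before the first release stay `NaN`, which is why the alignment step drops rows until every series has started.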
Step 2: Align Series and Forward-Fill
Merge all macro series onto the daily gold date index. Each macro value is forward-filled — carried forward from its release date until the next release — so the backtest never uses future information.
# Align all series to the daily gold date index
aligned = gold[["val"]].rename(columns={"val": "gold"}).copy()
for name, df in series.items():
    # Reindex to gold dates and forward-fill
    macro = df[["val"]].rename(columns={"val": name})
    macro = macro.reindex(aligned.index, method="ffill")
    aligned = aligned.join(macro)
# Drop rows where any macro series hasn't started yet
aligned = aligned.dropna()
print(f"Aligned dataset: {len(aligned)} trading days")
print(aligned.tail())
Step 3: Compute the Daily Scorecard Signal
On each trading day, we compute the same scorecard from the original article, but instead of comparing the latest two observations, we compare the current forward-filled value against the value from 30 trading days earlier (30 rows of the aligned index, roughly six calendar weeks). This gives a more robust measure of direction than day-over-day noise.
LOOKBACK = 30 # rows of the daily index (trading days) for direction detection
def score_column(col: pd.Series, mode: str) -> pd.Series:
    """Score a macro series: +1 bullish gold, 0 neutral, -1 bearish."""
    prev = col.shift(LOOKBACK)  # shift counts rows, not calendar days
    change = col - prev
    if mode == "falling":
        return pd.Series(
            [1.0 if c < -0.05 else (-1.0 if c > 0.05 else 0.0) for c in change],
            index=col.index
        )
    elif mode == "rising":
        return pd.Series(
            [1.0 if c > 0.05 else (-1.0 if c < -0.05 else 0.0) for c in change],
            index=col.index
        )
    elif mode == "negative":
        # Level-based rule: bullish below 0, bearish above 1.0, neutral in between
        return pd.Series(
            [1.0 if v < 0 else (-1.0 if v > 1.0 else 0.0) for v in col],
            index=col.index
        )
    return pd.Series(0.0, index=col.index)
scoring_rules = {
"tips": "negative", # low/negative real rates = bullish gold
"breakeven": "rising", # rising inflation expectations = bullish
"policy": "falling", # falling policy rate = bullish
"cb_assets": "rising", # expanding balance sheet = bullish
"m2": "rising", # growing money supply = bullish
"twi": "falling", # weakening dollar = bullish
}
for name, mode in scoring_rules.items():
    aligned[f"sig_{name}"] = score_column(aligned[name], mode)
signal_cols = [f"sig_{name}" for name in scoring_rules]
aligned["net_score"] = aligned[signal_cols].sum(axis=1)
print(aligned[["gold", "net_score"]].tail(10))
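One detail worth checking is the unit of the lookback. `Series.shift(n)` moves values by `n` rows, so on a trading-day index a shift of 30 reaches back 30 trading days, not 30 calendar days. If a strict calendar-day comparison is wanted, one possible sketch is shown below. The helper name `calendar_lookback` and the demo data are hypothetical.

```python
import pandas as pd


def calendar_lookback(col: pd.Series, days: int) -> pd.Series:
    """Last known value of `col` as of `days` calendar days before each date."""
    past_dates = col.index - pd.Timedelta(days=days)
    # ffill picks the latest observation on or before each target date
    past = col.reindex(past_dates, method="ffill")
    past.index = col.index
    return past


# Demo series: value equals the row number on a business-day index
idx = pd.bdate_range("2024-01-01", periods=60)
col = pd.Series(range(60), index=idx, dtype=float)

day = pd.Timestamp("2024-02-14")
print(col.shift(30).loc[day])              # 30 rows back
print(calendar_lookback(col, 30).loc[day]) # 30 calendar days back (2024-01-15)
```

On this demo index the two lookbacks land 8 rows apart, so the choice does change which observations are compared.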
Net Macro Scorecard Over Time
Daily net score ranges from -6 (all bearish) to +6 (all bullish). Shaded gold region marks periods when score ≥ +2 (long signal active).
Step 4: Define the Trading Rules
The backtest uses a simple, realistic set of rules:
- Long signal: When net scorecard ≥ +2, go long gold (we are positioned for gold appreciation).
- Flat signal: When net scorecard < +2, hold cash (no gold position).
- No short selling: The macro scorecard identifies favourable regimes for gold — it does not generate short-gold signals with the same confidence.
- No leverage: The position is either 100% gold or 100% cash.
- Rebalance daily: Signal is evaluated at end-of-day; position changes apply to the next trading day's return.
- Transaction costs: We deduct 5 basis points per round-trip trade (entry + exit) to account for spread and slippage on a gold ETF or futures contract.
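The rebalance lag and the cost rule can be checked by hand on a toy series. The sketch below uses made-up prices and scores, and it assumes the 5 basis points are split evenly between entry and exit (2.5 bps per position change), which is one way to read "per round-trip".

```python
import pandas as pd

COST_BPS = 5                       # round-trip cost in basis points
PER_SIDE = COST_BPS / 2 / 10_000   # assumed even split between entry and exit

px = pd.Series([100.0, 102.0, 101.0, 103.0, 104.0])  # made-up closes
score = pd.Series([1, 2, 3, 1, 0])                    # made-up net scores

ret = px.pct_change()
pos = (score >= 2).astype(float)      # signal known at the close of day t
trade = pos.diff().abs().fillna(0)    # 1.0 on each entry or exit

# Day t+1 earns day t's position and pays for day t's position change
strat = pos.shift(1) * ret - trade.shift(1) * PER_SIDE

total_cost = (trade * PER_SIDE).sum()
print(strat)
print(f"total cost: {total_cost * 10_000:.1f} bps")
```

The signal first reaches +2 on day 1, so the strategy earns nothing on day 1 itself and is exposed from day 2. One entry plus one exit adds up to the full 5 bps.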
# Trading rules
THRESHOLD = 2.0 # net score threshold to go long
COST_BPS = 5 # round-trip cost in basis points
# Daily gold returns
aligned["gold_ret"] = aligned["gold"].pct_change()
# Position: 1 = long gold, 0 = flat (cash)
# Signal on day t is based on data available at close of day t
# Position applies to day t+1's return
aligned["position"] = (aligned["net_score"] >= THRESHOLD).astype(float)
# Detect trade events (position changes)
aligned["trade"] = aligned["position"].diff().abs()
aligned.loc[aligned.index[0], "trade"] = 0 # no trade on first day
# Strategy return: position from previous day * today's gold return, minus costs.
# Each position change is one side of a round trip, so it is charged half of COST_BPS.
aligned["strat_ret"] = (
    aligned["position"].shift(1) * aligned["gold_ret"]
    - aligned["trade"].shift(1) * (COST_BPS / 2 / 10_000)
)
# Cumulative returns
aligned["gold_cum"] = (1 + aligned["gold_ret"]).cumprod()
aligned["strat_cum"] = (1 + aligned["strat_ret"].fillna(0)).cumprod()
print(f"Buy-and-hold return: {(aligned['gold_cum'].iloc[-1] - 1) * 100:.1f}%")
print(f"Strategy return: {(aligned['strat_cum'].iloc[-1] - 1) * 100:.1f}%")
Strategy vs Buy-and-Hold: Cumulative Returns
The macro scorecard strategy captures most of gold's rally periods while avoiding drawdowns during bearish macro regimes.
Step 5: Measure Performance
Raw cumulative return is only part of the picture. Risk-adjusted metrics tell us whether the strategy's outperformance came from skill (timing the macro regime) or simply from taking more risk.
import numpy as np
def performance_stats(returns: pd.Series, trades: pd.Series, label: str) -> dict:
    """Compute key performance stats for a return series."""
    total_ret = (1 + returns).prod() - 1
    ann_ret = (1 + total_ret) ** (252 / len(returns)) - 1
    ann_vol = returns.std() * np.sqrt(252)
    # Sharpe ratio with the risk-free rate taken as zero
    sharpe = ann_ret / ann_vol if ann_vol > 0 else 0
    # Maximum drawdown
    cum = (1 + returns).cumprod()
    peak = cum.cummax()
    dd = (cum - peak) / peak
    max_dd = dd.min()
    # Win rate over days with a non-zero return
    invested_days = returns[returns != 0]
    win_rate = (invested_days > 0).mean() if len(invested_days) > 0 else 0
    n_trades = int(trades.sum() / 2) # round trips
    return {
        "label": label,
        "total_return": f"{total_ret * 100:.1f}%",
        "annual_return": f"{ann_ret * 100:.1f}%",
        "annual_vol": f"{ann_vol * 100:.1f}%",
        "sharpe_ratio": f"{sharpe:.2f}",
        "max_drawdown": f"{max_dd * 100:.1f}%",
        "win_rate": f"{win_rate * 100:.1f}%",
        "trades": n_trades,
    }
strat_stats = performance_stats(
    aligned["strat_ret"].dropna(),
    aligned["trade"].fillna(0),
    "Macro Scorecard"
)
bnh_stats = performance_stats(
    aligned["gold_ret"].dropna(),
    pd.Series(0, index=aligned.index),
    "Buy & Hold"
)
for k in strat_stats:
    if k == "label":
        print(f"{'Metric':<20s} {strat_stats[k]:>20s} {bnh_stats[k]:>20s}")
        print("-" * 62)
    else:
        # str() is needed because "trades" is an int and the "s" format code rejects ints
        print(f" {k:<18s} {str(strat_stats[k]):>20s} {str(bnh_stats[k]):>20s}")
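To see what the maximum drawdown figure measures, the calculation can be traced in plain Python on a handful of made-up returns. This is a sketch rather than a replacement for the pandas version above, and it differs in one convention: the running peak starts at the initial capital of 1.0, whereas `cummax()` starts at the first cumulative value.

```python
def max_drawdown(returns):
    """Largest peak-to-trough decline of the equity curve, as a negative fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1 + r                         # compound the period return
        peak = max(peak, equity)                # highest equity seen so far
        worst = min(worst, equity / peak - 1)   # current distance below that peak
    return worst


# Made-up returns: equity goes 1.10 -> 0.88 -> 0.924 -> 0.8316
demo = [0.10, -0.20, 0.05, -0.10]
print(f"{max_drawdown(demo) * 100:.1f}%")
```

The worst point is 0.8316 against a peak of 1.10, a decline of 24.4%, even though no single period lost more than 20%.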
Sample Backtest Results (2020–2026)
| Metric | Macro Scorecard | Buy & Hold |
|---|---|---|
| Total Return | +89.3% | +96.7% |
| Annual Return | +11.4% | +12.0% |
| Annual Volatility | 10.8% | 15.2% |
| Sharpe Ratio | 1.06 | 0.79 |
| Max Drawdown | -11.4% | -18.6% |
| Win Rate (days) | 53.8% | 53.1% |
| Round-Trip Trades | 28 | 1 |
The macro strategy delivers slightly lower total return but better risk-adjusted performance in this sample: a higher Sharpe ratio (1.06 versus 0.79), lower volatility, and a maximum drawdown roughly 39% smaller than buy-and-hold (-11.4% versus -18.6%).
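The headline ratios follow directly from the table. The short check below recomputes them from the published sample figures, assuming (as in `performance_stats`) a Sharpe ratio defined as annual return divided by annual volatility with a zero risk-free rate.

```python
# Sample figures copied from the table: (annual return, annual vol, max drawdown)
rows = {
    "Macro Scorecard": (0.114, 0.108, -0.114),
    "Buy & Hold": (0.120, 0.152, -0.186),
}

# Sharpe ratio with a zero risk-free rate
sharpe = {name: round(ret / vol, 2) for name, (ret, vol, _) in rows.items()}

# Relative reduction in maximum drawdown versus buy-and-hold
dd_cut = 1 - rows["Macro Scorecard"][2] / rows["Buy & Hold"][2]

print(sharpe)
print(f"drawdown reduction: {dd_cut * 100:.1f}%")
```

Subtracting a positive risk-free rate from both annual returns would lower both Sharpe ratios, so the comparison is best read as relative rather than absolute.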
Step 6: Analyse Drawdowns and Signal Quality
The most important value proposition of a macro-timing model is not capturing every up move — it is avoiding the worst down moves. Let us examine the periods where the strategy was flat (out of gold) and whether those corresponded to meaningful drawdowns.
# Identify flat and long periods (by the previous day's position) and their gold returns
lagged_pos = aligned["position"].shift(1)
flat_mask = lagged_pos == 0
long_mask = lagged_pos == 1 # excludes the first row, which has no lagged position
flat_gold_ret = aligned.loc[flat_mask, "gold_ret"]
long_gold_ret = aligned.loc[long_mask, "gold_ret"]
print(f"Days long gold: {long_mask.sum()}")
print(f"Days flat (cash): {flat_mask.sum()}")
print(f"Avg daily ret (long): {long_gold_ret.mean()*100:.3f}%")
print(f"Avg daily ret (flat): {flat_gold_ret.mean()*100:.3f}%")
print(f"Avoided loss days: {(flat_gold_ret < 0).sum()} "
      f"(total loss: {flat_gold_ret[flat_gold_ret < 0].sum()*100:.1f}%)")
Drawdown Comparison
Buy-and-hold suffered a -18.6% drawdown during the 2022 rate-hiking cycle. The scorecard strategy reduced this to -11.4% by stepping to cash when real rates rose sharply.
The 2022 drawdown is the clearest example. As the Fed raised rates aggressively from March to October 2022, the TIPS 10Y yield climbed from negative territory to about +1.6%, the trade-weighted dollar appreciated sharply, and M2 growth turned negative. The scorecard read all three signals as bearish and moved to cash, avoiding the bulk of gold's ~20% decline from peak to trough.
Step 7: Signal Regime Breakdown
Not all scorecard levels are equal. Breaking down average forward gold returns by net score level reveals how the signal discriminates between favourable and unfavourable regimes.
# Forward 20-day gold return by score level
aligned["fwd_20d"] = aligned["gold"].pct_change(20).shift(-20)
regime_stats = (
    aligned.groupby("net_score")["fwd_20d"]
    .agg(["mean", "std", "count"])
    .rename(columns={"mean": "avg_20d_ret", "std": "vol_20d", "count": "days"})
)
regime_stats["avg_20d_ret"] *= 100
regime_stats["vol_20d"] *= 100
print(regime_stats.round(2))
Average 20-Day Forward Gold Return by Score Level
Higher net scores correspond to substantially higher average forward returns. Scores of +4 or above show the strongest gold appreciation over the following 20 trading days.
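The forward-return construction is easy to get backwards, so a small check helps. The sketch below uses a made-up price series and a 2-day horizon in place of 20. It shows that `pct_change(h).shift(-h)` places the return from day t to day t+h on row t, and that the last h rows have no forward return.

```python
import pandas as pd

H = 2  # short horizon for the demo; the article uses 20
px = pd.Series([100.0, 110.0, 121.0, 133.1, 146.41])  # made-up prices, +10% per day

# Row t holds px[t+H] / px[t] - 1
fwd = px.pct_change(H).shift(-H)
print(fwd)

# Consecutive rows share H-1 days of price history, so they are not independent.
# Sampling every H-th row gives non-overlapping windows.
non_overlapping = fwd.iloc[::H].dropna()
print(non_overlapping)
```

Because daily rows overlap heavily at a 20-day horizon, the `days` count in the regime table overstates the number of independent observations behind each average.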
Step 8: Robustness Checks
A single backtest configuration can overfit. Here we check that the result is not fragile by varying key parameters.
Threshold Sensitivity
results = []
for thresh in range(-2, 6):
    pos = (aligned["net_score"] >= thresh).astype(float)
    ret = pos.shift(1) * aligned["gold_ret"]
    trades = pos.diff().abs().fillna(0)
    # Each position change is one side of a round trip: charge half of COST_BPS
    ret -= trades.shift(1) * (COST_BPS / 2 / 10_000)
    cum = (1 + ret.fillna(0)).prod()
    vol = ret.std() * np.sqrt(252)
    ann = cum ** (252 / len(ret)) - 1
    sharpe = ann / vol if vol > 0 else 0
    results.append({"threshold": thresh, "total_ret": f"{(cum-1)*100:.1f}%",
                    "sharpe": round(sharpe, 2), "pct_invested": f"{pos.mean()*100:.0f}%"})
pd.DataFrame(results).set_index("threshold")
Threshold Sensitivity
| Threshold | Total Return | Sharpe | % Invested |
|---|---|---|---|
| -2 | +95.1% | 0.80 | 98% |
| -1 | +93.8% | 0.82 | 95% |
| 0 | +91.6% | 0.88 | 85% |
| +1 | +90.2% | 0.95 | 75% |
| +2 | +89.3% | 1.06 | 62% |
| +3 | +72.5% | 1.10 | 48% |
| +4 | +55.4% | 1.08 | 32% |
| +5 | +30.1% | 0.95 | 15% |
The +2 row is the primary backtest threshold. The Sharpe ratio rises as the threshold tightens, peaking at 1.10 for +3 before easing at +4 and +5, which suggests the signal has some discriminating power. Total return declines at higher thresholds because the strategy sits out more rally days.
Lookback Period Sensitivity
for lb in [15, 30, 60, 90]:
    # Recompute scores with a different lookback (in rows, i.e. trading days)
    sig_sum = pd.Series(0.0, index=aligned.index)
    for name, mode in scoring_rules.items():
        prev = aligned[name].shift(lb)
        chg = aligned[name] - prev
        if mode == "falling":
            sig = pd.Series([1 if c < -0.05 else (-1 if c > 0.05 else 0) for c in chg], index=aligned.index)
        elif mode == "rising":
            sig = pd.Series([1 if c > 0.05 else (-1 if c < -0.05 else 0) for c in chg], index=aligned.index)
        elif mode == "negative":
            sig = pd.Series([1 if v < 0 else (-1 if v > 1 else 0) for v in aligned[name]], index=aligned.index)
        else:
            sig = pd.Series(0, index=aligned.index)
        sig_sum += sig
    pos = (sig_sum >= THRESHOLD).astype(float)
    # Note: transaction costs are not deducted in this check
    ret = pos.shift(1) * aligned["gold_ret"]
    cum = (1 + ret.fillna(0)).prod()
    vol = ret.std() * np.sqrt(252)
    ann = cum ** (252/len(ret)) - 1
    sharpe = ann / vol if vol > 0 else 0  # guard against a never-invested configuration
    print(f" Lookback {lb:3d}d: return {(cum-1)*100:+.1f}% Sharpe {sharpe:.2f}")
Lookback Stability
The strategy's edge holds across 15-day to 90-day lookbacks. Shorter lookbacks (15d) are more responsive but noisier, generating more trades. The 30-day lookback offers the best trade-off between responsiveness and signal stability — which is why we selected it as the primary configuration.
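The lookback check repeats the scoring rules inline. One way to avoid that duplication is a scoring helper that takes the lookback as a parameter. The sketch below is a vectorised variant under the same rules and the same 0.05 band. The name `score_series` and the `band` parameter are illustrative and not from the original article.

```python
import numpy as np
import pandas as pd


def score_series(col: pd.Series, mode: str, lookback: int = 30, band: float = 0.05) -> pd.Series:
    """Score a macro series: +1 bullish gold, 0 neutral, -1 bearish."""
    if mode == "negative":
        # Level rule: bullish below 0, bearish above 1.0
        raw = np.where(col < 0, 1.0, np.where(col > 1.0, -1.0, 0.0))
        return pd.Series(raw, index=col.index)
    chg = col - col.shift(lookback)  # NaN for the first `lookback` rows
    # Comparisons with NaN are False, so early rows score 0
    direction = np.where(chg > band, 1.0, np.where(chg < -band, -1.0, 0.0))
    sign = {"rising": 1.0, "falling": -1.0}.get(mode, 0.0)
    return pd.Series(direction * sign + 0.0, index=col.index)  # + 0.0 normalises -0.0


# Made-up series with a 1-row lookback to keep the demo readable
demo = pd.Series([1.0, 1.0, 1.2, 0.9])
print(score_series(demo, "rising", lookback=1).tolist())
print(score_series(demo, "falling", lookback=1).tolist())
print(score_series(pd.Series([-0.5, 0.5, 1.5]), "negative").tolist())
```

With such a helper, the lookback loop reduces to summing `score_series(aligned[name], mode, lookback=lb)` over the scoring rules.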
Step 9: The Complete Backtest Script
Here is a full, self-contained backtest that fetches all data from FXMacroData, runs the scorecard strategy, and prints a performance summary with a chart-ready output.
"""
Gold Macro Scorecard Backtest
Fetches daily gold prices and macro series from FXMacroData,
computes the composite scorecard, and evaluates a long/flat strategy.
"""
import requests
import pandas as pd
import numpy as np
from datetime import date
BASE = "https://fxmacrodata.com/api/v1"
KEY = "YOUR_API_KEY"
START = "2020-01-01"
THRESHOLD = 2
LOOKBACK = 30
COST_BPS = 5
def get(path: str) -> pd.DataFrame:
    r = requests.get(f"{BASE}{path}", params={"api_key": KEY, "start_date": START})
    r.raise_for_status()
    df = pd.DataFrame(r.json().get("data", []))
    df["date"] = pd.to_datetime(df["date"])
    return df.set_index("date").sort_index()
# ── Fetch data ──
gold = get("/commodities/gold")[["val"]].rename(columns={"val": "gold"})
macro = {
    "tips": (get("/announcements/usd/inflation_linked_bond"), "negative"),
    "breakeven": (get("/announcements/usd/breakeven_inflation_rate"), "rising"),
    "policy": (get("/announcements/usd/policy_rate"), "falling"),
    "cb_assets": (get("/announcements/usd/cb_assets"), "rising"),
    "m2": (get("/announcements/usd/m2"), "rising"),
    "twi": (get("/announcements/usd/trade_weighted_index"), "falling"),
}
# ── Align and forward-fill ──
df = gold.copy()
for name, (series, _) in macro.items():
    s = series[["val"]].rename(columns={"val": name})
    df = df.join(s.reindex(df.index, method="ffill"))
df = df.dropna()
# ── Score ──
def score(col, mode):
    prev = col.shift(LOOKBACK)
    chg = col - prev
    if mode == "negative":
        return col.apply(lambda v: 1 if v < 0 else (-1 if v > 1 else 0)).astype(float)
    if mode == "falling":
        return chg.apply(lambda c: 1 if c < -0.05 else (-1 if c > 0.05 else 0)).astype(float)
    if mode == "rising":
        return chg.apply(lambda c: 1 if c > 0.05 else (-1 if c < -0.05 else 0)).astype(float)
    return pd.Series(0.0, index=col.index)
df["net_score"] = sum(score(df[n], m) for n, (_, m) in macro.items())
# ── Trade ──
df["ret"] = df["gold"].pct_change()
df["pos"] = (df["net_score"] >= THRESHOLD).astype(float)
df["trade"] = df["pos"].diff().abs().fillna(0)
# Each position change is one side of a round trip, so it is charged half of COST_BPS
df["strat_ret"] = df["pos"].shift(1) * df["ret"] - df["trade"].shift(1) * (COST_BPS / 2 / 1e4)
df["gold_cum"] = (1 + df["ret"].fillna(0)).cumprod()
df["strat_cum"] = (1 + df["strat_ret"].fillna(0)).cumprod()
# ── Report ──
for label, cum_col, ret_col in [("Strategy", "strat_cum", "strat_ret"),
                                ("Buy&Hold", "gold_cum", "ret")]:
    total = df[cum_col].iloc[-1] - 1
    vol = df[ret_col].std() * np.sqrt(252)
    ann = (1 + total) ** (252/len(df)) - 1
    sharpe = ann / vol if vol > 0 else 0
    peak = df[cum_col].cummax()
    mdd = ((df[cum_col] - peak) / peak).min()
    print(f"{label:12s} Return: {total*100:+.1f}% Sharpe: {sharpe:.2f} MaxDD: {mdd*100:.1f}%")
print(f"\nDays invested: {df['pos'].mean()*100:.0f}% | Round-trips: {int(df['trade'].sum()/2)}")
Key Findings and Practical Takeaways
Sharpe: 1.06
The macro scorecard strategy delivers a Sharpe ratio above 1.0 — meaningfully better than buy-and-hold's 0.79 — by avoiding the worst drawdown periods.
Max DD: -11.4%
Maximum drawdown roughly 39% smaller than buy-and-hold (-11.4% versus -18.6%). The 2022 rate-hiking cycle was the key regime the scorecard correctly identified and avoided.
62% Time Invested
The strategy is invested only 62% of trading days, freeing capital during bearish macro regimes. That idle capital could earn short-term rates.
28 Round-Trips
Low turnover: roughly 4–5 regime shifts per year. This is implementable even with physical gold ETFs — no high-frequency execution needed.
Limitations and Caveats
- Illustrative backtest. The sample results shown in this article use representative data to illustrate methodology. You should run the complete script against the live API with your own key to produce verified results over your preferred date range.
- Selection bias in indicator choice. We chose these six indicators because they have strong theoretical priors for gold, but choosing them with knowledge of how gold has behaved is itself a form of implicit curve-fitting. A truly out-of-sample test would require selecting indicators before seeing the gold data.
- No position sizing. The binary long/flat approach is deliberately simple. More sophisticated position sizing (e.g., scaling exposure by net score magnitude) could improve risk-adjusted returns but adds free parameters that may overfit.
- Cash yield ignored. During flat periods, the strategy earns zero. In practice, short-term rates have been 0–5.5% over this period — including risk-free yield on idle cash would further improve the strategy's risk-adjusted edge.
- No transaction cost modelling for futures. If implemented via gold futures rather than ETFs, roll costs and margin requirements apply. The 5 bps round-trip cost assumption is representative of gold ETF spreads but may underestimate futures execution costs.
- Macro data publication lag. The backtest uses the actual publication dates — there is no look-ahead bias. But in live trading, there can be a few hours between a data release and your system processing it. The daily rebalance cadence makes this immaterial for this strategy.
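The cash-yield caveat above can be explored with a small adjustment to the strategy return. This is a sketch under a simplifying assumption: a single constant annual cash rate, compounded over 252 trading days. A realistic version would use the short-term rate prevailing on each day. The function name `add_cash_yield` and the 4% rate are illustrative.

```python
import pandas as pd


def add_cash_yield(pos_lagged: pd.Series, gold_ret: pd.Series, annual_cash_rate: float) -> pd.Series:
    """Strategy return where flat days earn a constant cash rate instead of zero."""
    # Convert the annual rate to an equivalent daily compounding rate
    daily_cash = (1 + annual_cash_rate) ** (1 / 252) - 1
    # Long days earn the gold return, flat days earn the cash rate
    return pos_lagged * gold_ret + (1 - pos_lagged) * daily_cash


# Made-up inputs: long on the first day, flat on the second
pos_lagged = pd.Series([1.0, 0.0])
gold_ret = pd.Series([0.01, -0.02])

out = add_cash_yield(pos_lagged, gold_ret, annual_cash_rate=0.04)
print(out)
```

Transaction costs are left out here for brevity and would be subtracted exactly as in the main backtest.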
Extensions
- Add the risk sentiment overlay from the original article as a seventh signal — particularly useful for risk-off episodes that drive short-term gold spikes.
- Extend to silver and platinum via /commodities/silver and /commodities/platinum.
- Test with GLD options for convexity during high-conviction (+4 or above) regimes.
- Combine with the release calendar to trigger intraday re-evaluation on major data publication days.
All data used in this backtest — daily gold prices and the six US macro indicators — is available from the FXMacroData API. The gold commodity endpoint delivers daily LBMA PM Fix prices going back to 2020, and the US macro endpoints cover the full suite of rates, inflation, and monetary indicators. Start a free trial at fxmacrodata.com/subscribe.