01 What a Trading System Is
A system is not an indicator, a signal, or a feeling. It is a complete set of rules that answers every decision a trade demands — and answers them the same way every time. This section defines the object the rest of the handbook builds, tests, and operates.
The definition that matters
A trading system is a repeatable, fully specified procedure that converts market data into trading decisions with no decision left to in-the-moment judgement. Given the same inputs, it produces the same output — whether executed by you, a colleague, or a machine. That property is what separates a system from a "style".
The practical test: could a stranger trade your account identically from your written rules, without phoning you? If the honest answer is no, you have intuitions, not a system — and intuitions cannot be backtested, sized, audited, or improved methodically.
Discretionary, systematic, and hybrid
These are points on a spectrum of how much judgement enters the loop, not a moral hierarchy. Each can be profitable; each fails differently.
| Dimension | Discretionary | Systematic | Hybrid |
|---|---|---|---|
| Decision source | Trader judgement in the moment | Pre-defined rules, mechanically applied | Rules generate candidates; trader vetoes/confirms |
| Backtestable? | No — judgement isn't reproducible | Yes — fully | Partially — the rule layer only |
| Scales with capital/markets? | Poorly (operator is the bottleneck) | Well | Limited by the human step |
| Primary failure mode | Emotion, inconsistency, fatigue | Regime change the rules didn't anticipate | Selective override that quietly destroys the edge |
| Best for | Reading context, news, anomalies | Repeatable, measurable edges | Edges that need human context but rule discipline |
Most successful discretionary traders are, in fact, undocumented systematic traders: they apply a consistent internal procedure they have never written down. The work of building a system is largely the work of extracting that procedure into explicit rules — which is precisely where contradictions and gaps surface.
Why systematise an edge you already trade
- Measurability. You cannot improve what you cannot measure. A specified system has an expectancy, a drawdown, a sample size — numbers you can act on.
- Falsifiability. Rules can be proven wrong on history before they cost you live money. A feeling cannot.
- Consistency. The system trades the same on your best day and your worst. Most blow-ups are not bad rules — they are good rules abandoned under stress.
- Compounding of knowledge. Every trade becomes a labelled data point, not a vague memory. The edge sharpens with sample size.
- Leverage. A specified system can be automated, monitored, and scaled. An intuition lives and dies with your attention.
What a system is not
- A system is not a signal service — signals are outputs; a system is the full procedure that also defines size, risk, and exit.
- A system is not a guarantee — a positive expectancy is a long-run statistical claim, not a promise about the next trade or the next month.
- A system is not an edge — it is the container for one. A perfectly specified system with no edge loses money with great consistency.
The two things every viable system needs
Strip everything else away and a tradeable system rests on two independent pillars. Lose either and the account dies — just on different timelines.
1 · A real edge
Positive expectancy after costs, demonstrated over a sample large enough to distinguish skill from luck. Without it, better risk control only slows the bleed.
2 · Survival
Risk and position-sizing rules that guarantee you are still solvent after the inevitable losing streak. Without it, a real edge is wiped out before it can pay you.
The rest of this handbook is the engineering of those two pillars: Sections 02–06 build the edge and its rules; Sections 03–04 and 10–14 build and protect survival.
02 Anatomy of a Complete System
A complete system answers eight questions, in order. Skip one and you have left a decision to chance. The order is not cosmetic — each component constrains the next, and getting the sequence wrong (sizing before stops, entry before regime) is one of the most common structural errors.
The eight components
| # | Component | The question it answers | Concrete forms |
|---|---|---|---|
| 01 | Universe | What instruments are we even allowed to trade? | A fixed list (e.g. major FX pairs), or a screen (liquidity, spread, ATR floor) |
| 02 | Regime filter | Is the system permitted to act right now? | Trend filter (price vs 200-EMA), volatility band, session window, news blackout |
| 03 | Setup | What recurring condition defines an opportunity? | Pullback to a level, range break, oversold extreme, momentum cross |
| 04 | Entry | What exact event triggers the order, and how do we get in? | Trigger candle close + market/limit/stop order at a defined price |
| 05 | Initial stop | Where is the idea proven wrong? (This defines 1R.) | Beyond structure, k × ATR, or a fixed distance |
| 06 | Position size | How much, given the stop and the risk budget? | Fixed-fractional %, volatility-targeted, fractional Kelly |
| 07 | Exit logic | How and when do we take profit or cut? | Fixed R-target, trailing stop, time stop, opposite signal |
| 08 | Manage | What happens to the trade while it is open? | Move to breakeven at +1R, scale out in tranches, pyramid |
A worked specification
The same idea, written first as a discretionary "style" and then as a system, makes the difference concrete.
As a style (untradeable)
"I buy GBPUSD pullbacks in an uptrend when it looks like the dip is done, and I take profit into resistance."
As a system (tradeable)
Every word below is checkable and reproducible — and therefore backtestable.
- Universe: GBPUSD only, 1-hour bars.
- Regime filter: close > 200-EMA and 50-EMA > 200-EMA (uptrend confirmed). No trades in the 30 minutes around high-impact GBP/USD news.
- Setup: price pulls back and the low touches the 20-EMA while the regime filter holds.
- Entry: buy-stop 2 pips above the high of the first bar that closes back above the 20-EMA.
- Initial stop: 1.5 × ATR(14) below entry. This distance defines 1R.
- Position size: risk 0.5% of equity; size = (equity × 0.5%) ÷ (stop distance in pips × pip value).
- Exit: take half at +1R, trail the remainder under each new swing low; time-stop the trade if +1R is not reached within 24 bars.
- Manage: move stop to breakeven once +1R is filled.
Notice that the system version exposes decisions the style hid: how far is a valid pullback, which EMA, what confirms "the dip is done", where exactly is the stop. Those hidden decisions are where discretionary edges silently drift — and where a system makes the drift impossible.
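A minimal Python sketch of how the worked specification might be pinned down as data, so that every number lives in one place and nothing is left to memory. The class name `PullbackSpec`, its field names, and the two helper functions are illustrative assumptions; the news blackout and the swing-low trail are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the rules cannot be changed mid-trade
class PullbackSpec:
    symbol: str = "GBPUSD"
    bar_minutes: int = 60
    trend_ema: int = 200           # regime filter, slow average
    mid_ema: int = 50              # regime filter, fast average
    pullback_ema: int = 20         # setup reference
    entry_buffer_pips: float = 2.0
    atr_period: int = 14
    stop_atr_mult: float = 1.5     # stop distance = 1.5 x ATR(14) = 1R
    risk_pct: float = 0.005        # 0.5% of equity per trade
    time_stop_bars: int = 24


def regime_ok(close, ema_mid, ema_trend):
    """Uptrend confirmed: close above the 200-EMA and 50-EMA above the 200-EMA."""
    return close > ema_trend and ema_mid > ema_trend


def entry_and_stop(spec, trigger_bar_high, atr, pip=0.0001):
    """Buy-stop price and initial stop for a long trade."""
    entry = trigger_bar_high + spec.entry_buffer_pips * pip
    stop = entry - spec.stop_atr_mult * atr
    return entry, stop


spec = PullbackSpec()
entry, stop = entry_and_stop(spec, trigger_bar_high=1.2700, atr=0.0020)
print(f"entry {entry:.4f}, stop {stop:.4f}, 1R = {entry - stop:.4f}")
```

Holding the parameters in one immutable object is a design choice rather than a requirement: it makes the stranger test mechanical, because a backtest and a live engine can both be handed the same `spec`.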
03 Edge & Expectancy
A system makes money for exactly one reason: positive expectancy realised over a large enough sample. Not a high win rate, not a good feeling, not a clever indicator. This section is the arithmetic of edge — and the traps that arithmetic exposes.
Expectancy: the master number
Expectancy is the average profit or loss per trade you can expect over many trades. It combines how often you win with how much you win versus how much you lose.
Expectancy = (W% × avgWin) − (L% × avgLoss)
# W% = win rate, L% = loss rate = 1 − W%
# avgWin / avgLoss in currency or pips
To compare systems across instruments and account sizes, normalise everything to R — the initial risk per trade. One R is the distance from entry to your initial stop. A trade that makes twice its risk is +2R; a trade stopped out is −1R.
Expectancy[R] = (W% × avgWinR) − (L% × avgLossR)
# > 0 means the system is profitable per unit of risk, before frequency
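As a rough illustration, expectancy in R can be computed directly from a list of realised R-multiples. The function name `expectancy_r` and the sample trade list are made up for this sketch; break-even trades are treated as losses of size zero, which is one possible convention.

```python
def expectancy_r(r_multiples):
    """Expectancy[R] = (W% x avgWinR) - (L% x avgLossR)."""
    if not r_multiples:
        raise ValueError("need at least one trade")
    wins = [r for r in r_multiples if r > 0]
    losses = [-r for r in r_multiples if r <= 0]  # stored as positive sizes
    win_rate = len(wins) / len(r_multiples)
    loss_rate = 1.0 - win_rate
    avg_win_r = sum(wins) / len(wins) if wins else 0.0
    avg_loss_r = sum(losses) / len(losses) if losses else 0.0
    return win_rate * avg_win_r - loss_rate * avg_loss_r


# Hypothetical sample: 4 winners, 6 full-stop losers.
trades = [2.0, -1.0, -1.0, 3.0, -1.0, 1.5, -1.0, -1.0, 2.5, -1.0]
print(f"expectancy = {expectancy_r(trades):+.2f}R per trade")
```

With a 40% win rate and an average winner of 2.25R, the sample works out to roughly +0.3R per trade, which is the same as the plain average of the list.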
Win rate is a vanity metric
Win rate alone tells you nothing about profitability, because it ignores the size of wins versus losses. The payoff ratio b = avgWin ÷ avgLoss couples them. The win rate you need merely to break even falls as your payoff ratio rises:
| Payoff ratio (b) | Break-even win rate | Win rate for healthy edge | Typical archetype |
|---|---|---|---|
| 0.5 : 1 | 66.7% | > 75% | Mean reversion / scalping |
| 1 : 1 | 50.0% | > 55% | Range / oscillator systems |
| 2 : 1 | 33.3% | > 40% | Swing / breakout |
| 3 : 1 | 25.0% | > 33% | Trend-pullback |
| 5 : 1 | 16.7% | > 25% | Trend-following / momentum |
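The break-even column follows from setting expectancy to zero with a 1R loss: W × b − (1 − W) = 0, so W = 1 ÷ (1 + b). A short sketch (the function name is an assumption for illustration):

```python
def break_even_win_rate(payoff_ratio):
    """Win rate at which expectancy is exactly zero for payoff ratio b."""
    # 0 = W * b - (1 - W)  =>  W = 1 / (1 + b)
    return 1.0 / (1.0 + payoff_ratio)


for b in (0.5, 1, 2, 3, 5):
    print(f"b = {b}: break-even win rate = {break_even_win_rate(b):.1%}")
```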
Profit factor — a second lens
Profit factor is gross profit divided by gross loss. It is closely related to expectancy but reads more intuitively as "how many dollars I make per dollar I lose".
Profit factor = gross profit ÷ gross loss
# 1.0 = break-even · 1.3–1.6 = solid · > 2.0 = excellent (and worth double-checking for look-ahead bias)
Frequency: expectancy is per trade, growth is per year
Per-trade expectancy alone does not grow an account — expectancy multiplied by trade frequency does. A smaller edge taken often can dominate a larger edge taken rarely.
System A
+0.2R per trade × 200 trades/year = +40R/year. Low edge, high frequency.
System B
+0.5R per trade × 30 trades/year = +15R/year. High edge, low frequency.
System A compounds faster and reaches statistical significance sooner — but only if its costs per trade (Section 13) do not eat the thinner edge. Frequency amplifies both your edge and your friction.
Expectancy is a claim about samples, not trades
A positive expectancy is a statement about the long-run average, and every long-run average hides brutal short-run variance. A robust system will produce long losing streaks — they are a feature of randomness around a positive mean, not evidence the edge is gone.
Longest expected losing streak ≈ ln(N) ÷ ln(1 ÷ L%)
# N = number of trades, L% = loss rate
# e.g. L% = 60%, N = 500 → ln(500) ÷ ln(1.667) ≈ 12 consecutive losses are normal
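The streak estimate above divides ln(N) by ln(1 ÷ L%). A small sketch of that rule of thumb follows; the function name is an assumption, and the result is an approximation for independent trades, not a hard bound.

```python
import math


def typical_longest_losing_streak(loss_rate, n_trades):
    """Rule-of-thumb longest losing run over n_trades independent trades."""
    if not 0.0 < loss_rate < 1.0:
        raise ValueError("loss_rate must be strictly between 0 and 1")
    return math.log(n_trades) / math.log(1.0 / loss_rate)


streak = typical_longest_losing_streak(loss_rate=0.60, n_trades=500)
print(f"plan for about {streak:.0f} consecutive losses")
```

Feeding this number into the sizing rules of Section 04 gives a quick sanity check: the account has to absorb that many full-R losses in a row without breaching its drawdown limit.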
04 Position Sizing & Risk of Ruin
Edge tells you whether to play; sizing tells you whether you survive long enough to collect. Most accounts are not killed by bad systems — they are killed by good systems sized too aggressively to outlast a normal losing streak. This is the survival pillar, and it is pure arithmetic.
The job of position sizing
Sizing has one job: convert a risk budget and a stop distance into a quantity. Everything flows from the stop you already defined in Section 02 — which is why stops come first.
Position size = Risk per trade ($) ÷ (stop distance × value per unit move)
# FX: lots = (Equity × risk%) ÷ (stopPips × pipValuePerLot)
Equity $10,000 · risk 0.5% · GBPUSD stop 25 pips · pip value ≈ $10/standard lot.
- Risk per trade = 10,000 × 0.5% = $50
- Lots = 50 ÷ (25 × 10) = 0.20 lots (20,000 units)
Widen the stop to 50 pips and the size halves to 0.10 lots — same dollar risk. The market decides the stop; your budget decides the dollars; size is whatever reconciles them.
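The worked example can be reproduced in a few lines. This is a sketch only: the function name `fx_lots` is an assumption, and a real implementation would also round down to the broker's minimum lot step.

```python
def fx_lots(equity, risk_pct, stop_pips, pip_value_per_lot):
    """Lots = (equity x risk%) / (stop in pips x pip value per lot)."""
    if stop_pips <= 0 or pip_value_per_lot <= 0:
        raise ValueError("stop distance and pip value must be positive")
    risk_dollars = equity * risk_pct
    return risk_dollars / (stop_pips * pip_value_per_lot)


# Same inputs as the example: $10,000 equity, 0.5% risk, $10 per pip per lot.
for stop in (25, 50):
    lots = fx_lots(10_000, 0.005, stop, 10.0)
    print(f"stop {stop} pips -> {lots:.2f} lots")
```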
Sizing methods, ranked
| Method | Idea | Strength | Weakness | Verdict |
|---|---|---|---|---|
| Fixed lot | Same size every trade | Trivial | Ignores stop distance and account size; risk varies wildly per trade | Avoid |
| Fixed fractional | Risk a constant % of equity per trade | Auto-scales up in wins, down in losses; bounds drawdown | Slow recovery after deep drawdown | Default. Start here. |
| Volatility targeting | Size so each trade contributes equal volatility (size ∝ 1/ATR) | Normalises risk across instruments and regimes | Needs reliable volatility estimate; reacts to vol spikes | Excellent for multi-instrument |
| Fixed ratio | Increase size after a fixed profit increment (Δ) | Aggressive growth for small accounts | Risk grows non-linearly; punishing in drawdown | Niche |
| Kelly / fractional Kelly | Bet the growth-optimal fraction (or a fraction of it) | Mathematically maximises long-run growth | Assumes you know edge exactly; full Kelly is brutally volatile | Use a fraction, as a ceiling — never raw |
The Kelly criterion — and why nobody trades full Kelly
Kelly gives the fraction of capital that maximises long-run geometric growth.
f* = (p × (b + 1) − 1) ÷ b = (b × p − q) ÷ b
# p = win prob, q = 1 − p, b = payoff ratio (avgWin ÷ avgLoss)
# example: p = 0.40, b = 3 → f* = (0.40 × 4 − 1) ÷ 3 = 0.20 → 20% of equity per trade
Twenty percent per trade is correct in theory and insane in practice. Full Kelly produces gut-wrenching drawdowns (a 50%+ drawdown is routine), and it assumes you know p and b exactly — you do not; you estimated them from a finite, noisy backtest. Over-estimate the edge and Kelly over-bets you straight into ruin.
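A short sketch of the Kelly arithmetic and of how sensitive it is to the estimated win probability. The function names and the quarter-Kelly default are assumptions chosen for illustration, not a recommendation from the text.

```python
def kelly_fraction(p, b):
    """Growth-optimal fraction f* = (p * (b + 1) - 1) / b."""
    if b <= 0:
        raise ValueError("payoff ratio must be positive")
    return (p * (b + 1) - 1) / b


def fractional_kelly(p, b, fraction=0.25):
    """A fraction of Kelly, floored at zero (no edge means no bet)."""
    return max(0.0, fraction * kelly_fraction(p, b))


estimated = kelly_fraction(0.40, 3)   # the backtest's estimate of the edge
actual = kelly_fraction(0.30, 3)      # what if the true win rate is lower?
print(f"estimated f* = {estimated:.1%}, true f* = {actual:.1%}")
print(f"quarter Kelly on the estimate = {fractional_kelly(0.40, 3):.1%}")
```

If the true win rate is 30% rather than the estimated 40%, full Kelly on the estimate bets about three times the true optimum, while the quarter-Kelly stake stays below it.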
Risk of ruin
Risk of ruin (RoR) is the probability of losing enough capital to be unable (or unwilling) to continue. For a simplified 1:1 system risking a fixed unit per trade, with capital expressed as N units (N = 1 ÷ risk%):
RoR ≈ (q ÷ p)^N
# p = win rate, q = 1 − p; for p ≤ 0.5 ruin is eventually certain
# N = number of "units" of risk in your account = 1 ÷ risk-per-trade
| Risk per trade | Units (N) | RoR at 55% win (1:1) | RoR at 50% win (no edge) |
|---|---|---|---|
| 1% | 100 | ≈ 0% | 100% (eventually certain) |
| 2% | 50 | ≈ 0.004% | 100% |
| 5% | 20 | ≈ 1.8% | 100% |
| 10% | 10 | ≈ 13.4% | 100% |
| 20% | 5 | ≈ 36.7% | 100% |
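The table can be approximated with the classical gambler's-ruin result for an even-money game, RoR = (q ÷ p)^N. That formula is an assumption of this sketch (the simplified 1:1, fixed-unit case described above), so its output should land close to the tabulated figures but may differ in the last digit.

```python
def risk_of_ruin(win_rate, risk_per_trade):
    """Probability of losing all N = 1 / risk_per_trade units in a 1:1 game."""
    if not 0.0 < risk_per_trade <= 1.0:
        raise ValueError("risk_per_trade must be in (0, 1]")
    if win_rate <= 0.5:
        return 1.0  # no edge: ruin is eventually certain
    n_units = 1.0 / risk_per_trade
    return ((1.0 - win_rate) / win_rate) ** n_units


for risk in (0.01, 0.02, 0.05, 0.10, 0.20):
    print(f"risk {risk:.0%}: RoR at 55% win = {risk_of_ruin(0.55, risk):.4%}")
```

The point the numbers make is the steepness: with the same 55% edge, moving from 5% to 20% risk per trade raises the chance of ruin by more than an order of magnitude.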
Drawdown is the binding constraint
Losses and the gains needed to undo them are not symmetric. A drawdown of depth d requires a gain of d ÷ (1 − d) just to get back to even — and that gap explodes as the hole deepens.
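The asymmetry is easy to tabulate. A minimal sketch of the d ÷ (1 − d) relationship, with an illustrative function name:

```python
def gain_to_recover(drawdown):
    """Gain needed to return to the prior peak after a fractional drawdown."""
    if not 0.0 <= drawdown < 1.0:
        raise ValueError("drawdown must be in [0, 1)")
    return drawdown / (1.0 - drawdown)


for d in (0.10, 0.20, 0.50, 0.75, 0.90):
    print(f"drawdown {d:.0%} -> gain needed {gain_to_recover(d):.0%}")
```

A 10% drawdown needs about 11% to recover; a 50% drawdown needs 100%; a 90% drawdown needs 900%.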
Portfolio heat & correlation
Per-trade risk is not enough; you must cap portfolio heat — total open risk across all positions at once. Correlated positions are the trap: long EURUSD and long GBPUSD are not two 0.5% bets, they are closer to one 1% bet on a falling dollar.
- Cap aggregate open risk (e.g. total heat ≤ 2–3% of equity), not just per-trade risk.
- Treat correlated instruments as one position for the heat calculation; size the cluster, not each leg.
- Cap correlated clusters so a single macro move (a dollar spike, a risk-off day) cannot hit every open trade at full size simultaneously.
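One way to sketch the heat rules above: sum open risk per correlation cluster and check both caps before admitting a new trade. The cluster labels, the 1% cluster cap, and the function names are hypothetical, and every position is assumed to be long for simplicity.

```python
from collections import defaultdict


def portfolio_heat(open_risks, cluster_of):
    """Return (total open risk, risk per cluster) as fractions of equity."""
    per_cluster = defaultdict(float)
    for symbol, risk in open_risks.items():
        # A symbol without a cluster label forms its own cluster.
        per_cluster[cluster_of.get(symbol, symbol)] += risk
    return sum(per_cluster.values()), dict(per_cluster)


def can_open(symbol, risk, open_risks, cluster_of,
             max_heat=0.03, max_cluster=0.01, eps=1e-12):
    """True if the new trade keeps total and cluster risk within the caps."""
    total, per_cluster = portfolio_heat(open_risks, cluster_of)
    cluster = cluster_of.get(symbol, symbol)
    within_total = total + risk <= max_heat + eps
    within_cluster = per_cluster.get(cluster, 0.0) + risk <= max_cluster + eps
    return within_total and within_cluster


clusters = {"EURUSD": "short_usd", "GBPUSD": "short_usd",
            "AUDUSD": "short_usd", "USDJPY": "long_usd"}
open_risks = {"EURUSD": 0.005, "GBPUSD": 0.005}

print(can_open("AUDUSD", 0.005, open_risks, clusters))  # third short-dollar leg
print(can_open("USDJPY", 0.005, open_risks, clusters))  # different cluster
```

In this example the third short-dollar leg is refused because the cluster is already at its cap, while the long-dollar trade is admitted.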
05 Strategy Archetypes
Every edge is a bet that a specific market behaviour will repeat. Each archetype below works in one regime and bleeds in its opposite — there is no all-weather edge. The meta-skill is knowing which archetype you are running, which regime it needs, and how to tell when that regime has left.
The two parents: trend and mean reversion
Almost every system descends from one of two opposing beliefs: that moves continue (momentum/trend), or that moves revert (mean reversion). They are negatively correlated by construction, which is also why running both can smooth an equity curve.
The archetype map
| Archetype | The bet | Typical win% / payoff | Needs this regime | Bleeds when |
|---|---|---|---|---|
| Trend-following | Strong moves persist; let winners run | 30–45% / 2–5R | Sustained directional trends | Choppy, range-bound, mean-reverting markets (death by a thousand cuts) |
| Mean reversion | Extremes overshoot and snap back | 60–75% / 0.5–1R | Range-bound, stationary, high-noise | A trend or regime break runs through your fade (picking up pennies in front of a steamroller) |
| Breakout | Range expansion births a new trend | 35–50% / 2–3R | Volatility compression resolving | False breakouts in chop; getting whipsawed at the edges |
| Momentum (cross-sectional) | Recent winners keep outperforming losers | ~50% / varies | Dispersion across a basket; persistent leadership | Sharp reversals / correlation spikes flip the rankings |
| Carry (FX) | Earn the interest-rate differential (swap) | High win% / small, steady | Calm, risk-on, low-volatility | Risk-off: "up the stairs, down the elevator" — slow gains, violent reversals |
| Pairs / stat-arb | A cointegrated spread reverts to its mean | High win% / small | A stable statistical relationship | The relationship structurally breaks (the spread never returns) |
| Session / time-based | Intraday seasonality (e.g. London open, NY overlap) | Varies | Repeatable liquidity/volatility windows | The seasonality decays or shifts with market structure |
Multi-timeframe: a structure, not an archetype
Multi-timeframe (MTF) is not a separate edge — it is a way of combining the above. The standard pattern: a higher timeframe sets the bias (which direction the regime filter permits), and a lower timeframe provides the trigger (a precise, cheaper entry). MTF tightens stops and improves reward:risk, but it cannot manufacture an edge that the underlying bet does not have.
06 From Idea to Specification
Most systems do not fail in the backtest — they fail because the rules were never truly nailed down, so the trader ran a slightly different system every week and never knew it. Specification is the unglamorous discipline that turns a belief into something you can test, size, audit, and improve. It is also the single hardest step.
Hypothesis before indicators
Start from a market behaviour you can state in one sentence, not from an indicator you find interesting. The hypothesis is the edge; the indicator is merely how you measure the hypothesis.
Indicator-first (backwards)
"The RSI looks useful — let me find settings that would have worked." This is curve-fitting with extra steps.
Hypothesis-first (correct)
"Liquid FX trends resume intraday after a shallow pullback to the mean." Now pick the cheapest indicator that captures that.
A hypothesis is testable and can be wrong. "RSI is good" cannot be wrong because it says nothing. If you cannot state, in plain language, what the market is doing and why your rules profit from it, you do not yet have a system to specify.
The specification: zero ambiguity
To specify is to answer all eight components (Section 02) such that two people — or a machine — would execute identically. Re-run the stranger test against every clause: could someone else act on this word without asking you what you meant?
| Vague (a style) | Specified (a system) |
|---|---|
| "Strong uptrend" | close > 200-EMA and 50-EMA slope > 0 over last 10 bars |
| "Near support" | within 0.25 × ATR(14) of a level touched ≥ 2× in the last 50 bars |
| "Wait for confirmation" | a bar that closes back above the 20-EMA |
| "Don't trade the news" | no entries in the 30 min before/after a high-impact GBP or USD release |
| "Take profit into resistance" | limit at the nearest level above; else exit at +2R |
Contradictions and gaps: a spec must be total
A real specification is total — it defines an action for every state the market can present. Two failure modes hide here, and both are invisible until they cost money:
- Contradictions — two rules that fire at once and disagree. "Buy when RSI < 30" and "never buy below the 200-EMA" both trigger when an oversold market is also below trend. Which wins? If the spec doesn't say, you'll improvise — differently each time.
- Gaps — states the rules never anticipated. The setup forms on the exact bar the regime filter flips. A target and a stop are both hit inside one bar. Price gaps past your entry. An undefined state is a coin-flip you didn't know you were making.
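As an illustration of closing one such gap, the sketch below fixes what happens to a long trade when a bar gaps through the stop, or touches both stop and target. The pessimistic choices (stop wins a tie, no price improvement on the target) are assumptions of this example; what matters is that the spec states some rule.

```python
def resolve_long_exit(bar_open, bar_high, bar_low, stop, target):
    """Return (reason, fill_price) for a long trade on this bar, or None."""
    if bar_open <= stop:
        return ("stop", bar_open)      # gapped through the stop: fill at the open
    if bar_open >= target:
        return ("target", target)      # conservative: no price improvement
    if bar_low <= stop:
        return ("stop", stop)          # if both were touched, assume stop first
    if bar_high >= target:
        return ("target", target)
    return None                        # trade stays open


# Both levels inside one bar: the rule, not the trader, decides.
print(resolve_long_exit(1.2700, 1.2760, 1.2640, stop=1.2650, target=1.2750))
# Weekend gap below the stop: the loss is larger than 1R.
print(resolve_long_exit(1.2600, 1.2620, 1.2580, stop=1.2650, target=1.2750))
```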
Structure versus parameters
Separate the structural logic (the bet: "buy pullbacks in an uptrend") from the tunable parameters (the 20 in 20-EMA, the 1.5 in 1.5×ATR). Structure encodes your hypothesis; parameters are knobs. Every free parameter is a degree of freedom you can accidentally fit to noise (Section 09).
- Minimise free parameters. Three robust ones beat ten finely-tuned ones. Each knob should earn its place with an economic reason, not a backtest improvement.
- Prefer parameters that generalise. A 200-EMA trend filter is a broad, well-understood concept; a 187-period filter that tested 0.3% better is a red flag.
- Fix what you can justify; only tune what you must. The fewer things you optimise, the less you can overfit.
07 Data Foundations
A backtest is only as honest as the data underneath it. Bad data does not produce obvious errors — it manufactures plausible, profitable-looking edges that evaporate the moment real money is on the line. Most "my backtest lied to me" stories are data stories.
The raw material: bars and OHLCV
A bar aggregates price action over an interval into five numbers: Open, High, Low, Close, Volume. The interval can be time-based (1H), tick-based (every 500 ticks), or volume-based. The critical limitation: a bar tells you the range but not the path within it.
Tick vs bar data
| Aspect | Tick data | Bar (OHLCV) data |
|---|---|---|
| Granularity | Every quote/trade | Summary per interval |
| Resolves intrabar order? | Yes | No |
| Models spread/slippage? | Yes (bid/ask) | Approximation only |
| Size & cost | Huge, expensive to store/process | Compact, cheap |
| Use for | Scalping, intrabar logic, realistic fills | Swing/position, higher-TF research |
Data quality — clean before you trust
- Spikes & bad ticks: erroneous prints that trigger phantom signals. Filter outliers.
- Gaps & missing sessions: holidays, outages, weekend gaps in FX. Decide how each is handled, consistently.
- Duplicate / misaligned bars: repeated timestamps or off-by-one alignment silently shift signals.
- Timezone drift: the deadliest quiet bug — a "daily" bar means something different at broker-server time vs UTC vs your local time, and a session filter built on the wrong zone is wrong on every bar.
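A small sketch of the timezone problem: the same tick belongs to different "daily" bars depending on the clock used to bucket it. The UTC+2 server offset is a hypothetical example, not a statement about any particular broker.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical broker-server clock, two hours ahead of UTC.
SERVER_TZ = timezone(timedelta(hours=2))


def daily_bar_date(ts, tz):
    """Calendar date of the daily bar that a timezone-aware timestamp falls into."""
    return ts.astimezone(tz).date()


tick = datetime(2024, 3, 4, 22, 30, tzinfo=timezone.utc)
print("UTC daily bar:   ", daily_bar_date(tick, timezone.utc))
print("server daily bar:", daily_bar_date(tick, SERVER_TZ))
```

Storing timestamps as timezone-aware UTC and converting only at the point of bucketing is one way to keep this explicit; naive timestamps hide the choice.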
FX-specific realities
Foreign exchange has no central exchange, which changes the data picture in ways equity traders often miss:
- No single price: each broker/liquidity provider quotes slightly differently. There is no canonical tape.
- Variable spread: spread widens at session opens, around news, and in thin liquidity (Asian session, Friday close). A fixed-spread assumption flatters the backtest.
- Swap / rollover: holding overnight earns or pays interest on the rate differential. For multi-day holds, unmodelled swap can flip a carry trade's sign.
- Sessions & gaps: the market is continuous Mon–Fri but liquidity rotates across Tokyo/London/New York; the weekend gap can leap past stops.
Feed parity: one source, end to end
| Tier | Use | Trade-off |
|---|---|---|
| Free tick (e.g. Dukascopy) | Research, spikes, archetype validation | Great granularity, but not your execution venue — for exploration only |
| Broker API history + live (single provider) | Production backtest & live | Parity across the whole pipeline; the configuration you actually trade |
| Mixed vendors | — | Avoid: divergence you cannot attribute to the strategy |
08 Backtesting
A backtest is a laboratory for trying to prove your hypothesis false on history before risking capital. Its job is not a pretty equity curve — it is an honest estimate of expectancy and of the conditions under which that expectancy holds. The moment you start trying to make it look good, you have stopped being the scientist and become the mark.
Vectorised vs event-driven
| Aspect | Vectorised | Event-driven |
|---|---|---|
| How it runs | Computes signals across the whole series at once | Steps through time bar-by-bar as if live |
| Speed | Very fast — ideal for scanning many ideas | Slow — one finalist at a time |
| Path-dependent logic | Awkward (trailing stops, partial fills, pyramiding) | Natural — mirrors how live execution works |
| Look-ahead risk | Easy to introduce accidentally (whole-series ops) | Structurally harder — you only see the past |
| Best for | Research, parameter sweeps, idea triage | Validating the finalist; matching the live engine |
The event-driven loop (the mental model)
Everything hinges on one phrase: data available up to now. For each bar, in order:
1. Update indicators using only bars that have already closed.
2. Evaluate the regime filter, then the setup, then the entry trigger.
3. Simulate the fill with costs (spread, commission, slippage).
4. Manage open trades (stops, targets, trails) against this bar's range.
5. Mark equity and record the trade's R-multiple.
If step 1 ever peeks at the current or a future bar's close to make a decision you'd act on now, you have introduced look-ahead bias — and your results are fiction (Section 09).
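The loop can be sketched in a few lines. Everything below is a toy under stated assumptions: the entry rule (last closed bar above a simple moving average), the fixed stop distance, the flat cost, and the bar format are placeholders standing in for a real specification.

```python
def run_backtest(bars, lookback=3, stop_dist=1.0, target_r=2.0, cost=0.05):
    """bars: list of dicts with open/high/low/close. Returns R-multiples."""
    results = []
    position = None  # holds fill/stop/target while a long trade is open
    for i in range(lookback, len(bars)):
        bar = bars[i]
        # Indicators use CLOSED bars only: bars[i-lookback] .. bars[i-1].
        closed = bars[i - lookback:i]
        sma = sum(b["close"] for b in closed) / lookback
        prev_close = closed[-1]["close"]
        # Signal from the last closed bar; fill at this bar's open plus cost.
        if position is None and prev_close > sma:
            ref = bar["open"]
            position = {"fill": ref + cost,
                        "stop": ref - stop_dist,
                        "target": ref + target_r * stop_dist}
        # Manage the open trade against this bar's range (stop checked first).
        if position is not None:
            exit_price = None
            if bar["low"] <= position["stop"]:
                exit_price = position["stop"]
            elif bar["high"] >= position["target"]:
                exit_price = position["target"]
            if exit_price is not None:
                # Record the result in R, net of the entry cost.
                results.append((exit_price - position["fill"]) / stop_dist)
                position = None
    return results


bars = [
    {"open": 100, "high": 101.0, "low": 99.0, "close": 100},
    {"open": 100, "high": 101.0, "low": 99.0, "close": 100},
    {"open": 100, "high": 101.5, "low": 99.5, "close": 101},
    {"open": 101, "high": 101.5, "low": 100.5, "close": 101},
    {"open": 101, "high": 103.5, "low": 100.8, "close": 103},
]
print(run_backtest(bars))
```

The slice `bars[i - lookback:i]` is the whole defence against look-ahead here: bar `i` is never part of the data that decides what to do on bar `i`. On the sample bars the single trade reaches its 2R target and books slightly less than 2R because of the entry cost.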
Costs: the quiet edge-killer
Costs scale with frequency, and they attack thin edges first. Model them pessimistically — optimism here is self-deception with a spreadsheet.
cost per trade [R] ≈ (spread + commission + slippage) ÷ (stop distance) # all in the same units
Gross edge +0.20R/trade, 200 trades/year → +40R gross.
- Costs 0.10R/trade → net +0.10R → +20R (half the edge gone to friction).
- Costs 0.20R/trade → net 0.00R → break-even — a real edge, fully consumed by costs.
The thinner and faster the edge, the more a credible cost model decides whether it is real. Stress your costs upward and see if the edge survives.
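A quick sketch of the cost arithmetic, including a simple upward stress on costs. The pip figures and function names are illustrative assumptions.

```python
def cost_in_r(spread, commission, slippage, stop_distance):
    """Round-trip cost as a fraction of 1R; all inputs in the same units."""
    return (spread + commission + slippage) / stop_distance


def net_annual_r(gross_edge_r, cost_r, trades_per_year):
    """Net R per year after costs."""
    return (gross_edge_r - cost_r) * trades_per_year


# Hypothetical: 1.5 pip spread, 0.5 pip commission, 0.5 pip slippage, 25 pip stop.
base_cost = cost_in_r(1.5, 0.5, 0.5, stop_distance=25)
for stress in (1.0, 1.5, 2.0):
    net = net_annual_r(0.20, base_cost * stress, trades_per_year=200)
    print(f"costs x{stress}: {base_cost * stress:.2f}R/trade -> net {net:+.0f}R/year")
```

At the base cost of 0.10R the example nets +20R a year; doubling costs to 0.20R takes the same gross edge to break-even.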
In-sample and out-of-sample
Split your history. Develop, optimise, and iterate freely on the in-sample (IS) period. Reserve an out-of-sample (OOS) period that you test against once.
What a credible backtest reports
- A long period spanning multiple regimes (trending, ranging, high- and low-volatility, at least one crisis).
- Costs included; IS and OOS shown separately; the trade count stated.
- The distribution of outcomes, not just the total — equity curve, drawdown depth and duration, and the full metric set (Section 10).
- Sensitivity to small parameter changes (Section 09) — robustness, not a single hero result.
09 Overfitting & the Bias Catalog
Overfitting is the reason a beautiful backtest becomes a losing live system. It is fitting the noise in your historical sample instead of the signal that will repeat — and it is seductive precisely because it always looks like progress. This is the single most dangerous failure in system development.
What overfitting actually is
A market history contains both a (possibly real) pattern and a large amount of random noise specific to that sample. Overfitting is when your rules and parameters memorise the noise — the exact wiggles that will never recur — rather than the generalisable structure. The more free parameters you have and the more variations you try, the easier it becomes to fit noise perfectly.
The cruelty is the asymmetry: overfitting is invisible in the backtest (where it looks superb) and only revealed in live trading (where it costs money). You cannot detect it by admiring results — only by methodology applied before you see the results you were hoping for.
Symptoms
- Too many parameters relative to the number of trades — degrees of freedom that let you fit anything.
- Fragility: performance collapses when a parameter moves slightly — a peak, not a plateau.
- Too-good metrics: Sharpe > 3, 80%+ win rate and large payoff, a near-straight equity line. Real edges are noisier than this.
- Great in-sample, poor out-of-sample — the textbook signature.
- Works on one instrument only and breaks on similar ones — a genuine behavioural edge usually generalises at least somewhat.
The bias catalog
Overfitting is the headline, but it travels with a family of biases that all inflate backtest results. Know each by name so you can hunt for it deliberately.
| Bias | What it is | Defence |
|---|---|---|
| Look-ahead | Using information not available at decision time (a bar's close, a future value, a revised figure) | Event-driven loop; only closed bars; lag any revised data |
| Survivorship | Testing only instruments that still exist; the failures were deleted | Use point-in-time universes that include delisted/dead names |
| Data-snooping / multiple testing | Trying many ideas and keeping the best — which looks good by luck alone | Count your trials; raise the bar; deflate the result (below) |
| Optimisation bias | Tuning parameters to the in-sample period's specific noise | Few parameters; walk-forward; demand plateaus |
| Selection bias | Cherry-picking the test window, instrument, or start date that flatters | Fixed, pre-declared test period across regimes |
| Hindsight in rule design | Adding rules that "explain" past losses you already saw | Pre-register the hypothesis before looking; resist post-hoc patches |
| Cost omission | Ignoring spread, slippage, swap | Pessimistic cost model (Section 08) |
The multiple-comparisons problem
If you test enough strategies, the best one will look brilliant even if none has any edge — the maximum of many random results is large by chance. The more configurations you search, the higher your performance bar must rise to mean anything.
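A small simulation makes the effect concrete. Each "strategy" below is a pure coin flip at 1:1 with zero true edge; the counts and the seed are arbitrary choices for this sketch.

```python
import random


def best_of_random_strategies(n_strategies=200, n_trades=100, seed=7):
    """Return (best, average) expectancy in R across zero-edge strategies."""
    rng = random.Random(seed)
    expectancies = []
    for _ in range(n_strategies):
        # Every trade is +1R or -1R with equal probability: no edge at all.
        trades = [rng.choice((1.0, -1.0)) for _ in range(n_trades)]
        expectancies.append(sum(trades) / n_trades)
    return max(expectancies), sum(expectancies) / n_strategies


best, average = best_of_random_strategies()
print(f"average expectancy: {average:+.3f}R, best of 200: {best:+.3f}R")
```

The average stays near zero, as it should, while the best of the batch shows a clearly positive expectancy purely by selection. Raising `n_strategies` pushes that best result higher still, which is why the performance bar must rise with the number of trials.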
The parsimony toolkit
- Minimise parameters and justify each one economically, not by backtest gain.
- Out-of-sample testing, used sparingly (Section 08).
- Walk-forward analysis and Monte Carlo — the core robustness tools (Section 11).
- Demand plateaus, not peaks, in parameter space.
- Hold out a final, untouched dataset for one last sanity check before going live.
- Pre-register the hypothesis and count trials honestly. The discipline must precede the results.
10 Performance Metrics
No single number describes a system. CAGR ignores risk; win rate ignores payoff; Sharpe hides drawdown duration. A system is a profile across four families — return, risk-adjusted return, drawdown, and trade quality — and every individual metric is gameable in isolation. Read them as a set.
Return, risk-adjusted return, and the equity picture
CAGR compounds the growth rate; risk-adjusted ratios divide return by some measure of pain.
Sharpe = (Rp − Rf) ÷ σ_p # annualised ≈ daily Sharpe × √252; penalises ALL volatility
Sortino = (Rp − Rf) ÷ σ_downside # penalises only downside; fairer to asymmetric systems
Calmar = CAGR ÷ |max drawdown| # return per unit of worst pain
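The three ratios can be sketched with the standard library alone. Conventions vary between sources; this version assumes daily returns, a zero target for downside deviation, and 252 periods per year, and the sample returns are invented.

```python
import math
import statistics


def sharpe(returns, rf=0.0, periods=252):
    """Annualised Sharpe: mean excess return over total volatility."""
    excess = [r - rf for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(periods)


def sortino(returns, rf=0.0, periods=252):
    """Annualised Sortino: only returns below zero excess count as risk."""
    excess = [r - rf for r in returns]
    downside = math.sqrt(sum(min(e, 0.0) ** 2 for e in excess) / len(excess))
    if downside == 0.0:
        raise ValueError("no downside observations in the sample")
    return statistics.mean(excess) / downside * math.sqrt(periods)


def calmar(cagr, max_drawdown):
    """Return per unit of worst peak-to-trough decline."""
    return cagr / abs(max_drawdown)


daily = [0.01, -0.005, 0.02, -0.01, 0.015]  # hypothetical daily returns
print(f"Sharpe {sharpe(daily):.2f}, Sortino {sortino(daily):.2f}")
print(f"Calmar {calmar(cagr=0.20, max_drawdown=-0.10):.1f}")
```

Five observations are far too few for any of these numbers to mean anything; the sample is only there to make the arithmetic traceable.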
Drawdown and trade quality
Maximum drawdown is the largest peak-to-trough equity decline; its duration — how long you stay underwater — is often the more punishing number. At the trade level, MAE/MFE (Maximum Adverse / Favourable Excursion) measure how far each trade ran against and for you before closing — invaluable for calibrating stops (are you getting stopped just before reversals?) and targets (are you leaving most of the move on the table?).
| Metric | Definition | Healthy range | The gotcha |
|---|---|---|---|
| CAGR | Compound annual growth rate | Context-dependent | Says nothing about risk taken to earn it |
| Sharpe | Excess return ÷ total volatility | > 1 acceptable, > 2 strong | Penalises upside; assumes near-normal returns; smoothable |
| Sortino | Excess return ÷ downside volatility | > 2 good | Fairer to asymmetric systems, but noisier to estimate |
| Calmar / MAR | CAGR ÷ |max drawdown| | > 0.5 ok, > 1 strong | Hostage to the single worst DD and the length of the test |
| Max drawdown | Largest peak-to-trough decline | < 20% comfortable for most | One number hides duration and frequency |
| DD duration | Longest time underwater | Shorter is better | The metric that actually breaks discipline |
| Profit factor | Gross profit ÷ gross loss | 1.3–1.6 solid | > 2 — verify it isn't look-ahead |
| Expectancy [R] | Average R per trade | > 0; > 0.1 good | Meaningless below ~100 trades |
| Win rate | Wins ÷ total | Only with payoff context | Pure vanity in isolation |
| MAE / MFE | Worst / best excursion per trade | Used to tune stops & targets | Needs trade-by-trade path data |
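Maximum drawdown and its duration can both be read off an equity series in one pass. A minimal sketch, with duration measured in bars since the last equity peak and an invented equity curve:

```python
def max_drawdown(equity):
    """Return (deepest fractional drawdown, longest bars spent below a peak)."""
    peak = equity[0]
    peak_index = 0
    deepest = 0.0
    longest = 0
    for i, value in enumerate(equity):
        if value >= peak:
            peak, peak_index = value, i       # new high: the clock resets
        else:
            deepest = max(deepest, (peak - value) / peak)
            longest = max(longest, i - peak_index)
    return deepest, longest


curve = [100, 110, 99, 104, 112, 108]
depth, duration = max_drawdown(curve)
print(f"max drawdown {depth:.1%}, longest underwater stretch {duration} bars")
```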
11 Robustness & Validation
A single backtest — even out-of-sample — is one path through one history with one set of parameters. Robustness testing asks the harder question: would this edge have survived different data, different parameters, and different luck? Here you stop admiring the system and start trying to break it on purpose.
Walk-forward analysis — the gold standard
Walk-forward analysis (WFA) mimics how you'd actually run a system: optimise on a window, trade the next unseen window with those settings, then roll the window forward and repeat. The concatenated out-of-sample segments form a realistic equity curve that was never optimised in hindsight.
Anchored WFA
In-sample window expands from a fixed start. Uses all history; adapts slowly. Good when more data always helps.
Rolling WFA
In-sample window is a fixed length that slides. Adapts to changing regimes; discards old data. Good when markets evolve.
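Both variants come down to how the window boundaries are generated. The sketch below returns index ranges only; the optimisation and trading steps are left out, and the function name and half-open `(start, end)` convention are assumptions.

```python
def walk_forward_windows(n_bars, is_len, oos_len, anchored=False):
    """Return [((is_start, is_end), (oos_start, oos_end)), ...], ends exclusive."""
    windows = []
    start = 0
    while start + is_len + oos_len <= n_bars:
        is_start = 0 if anchored else start   # anchored: always from bar 0
        is_end = start + is_len
        windows.append(((is_start, is_end), (is_end, is_end + oos_len)))
        start += oos_len                      # roll forward by one OOS block
    return windows


print("rolling :", walk_forward_windows(10, is_len=4, oos_len=2))
print("anchored:", walk_forward_windows(10, is_len=4, oos_len=2, anchored=True))
```

Each out-of-sample block starts exactly where its in-sample block ends, so the out-of-sample segments tile the history without overlap and can be concatenated into one equity curve.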
Monte Carlo — the range of luck
Your backtest's max drawdown is a single sample of what randomness could deal you; the next one could be worse. Monte Carlo simulation reshuffles or resamples your trade results thousands of times to reveal the distribution of outcomes — especially the drawdowns you didn't happen to get but easily could.
- Trade-order shuffling: reorder the same results — same total return, very different drawdown paths.
- Bootstrap resampling: draw trades with replacement to build the outcome distribution.
- Randomised skipping: drop a fraction of trades at random — does the edge survive missing some signals?
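The three resampling methods above can be sketched in a few lines. This assumes trades are recorded as R-multiples and that drawdown is measured on the additive cumulative-R curve (a simplification that ignores compounding); the function names and the default of 2,000 runs are illustrative choices.

```python
import random
from typing import List, Sequence


def max_drawdown_r(r_multiples: Sequence[float]) -> float:
    """Deepest peak-to-trough decline of the cumulative-R curve, in R."""
    equity = peak = worst = 0.0
    for r in r_multiples:
        equity += r
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst


def monte_carlo_drawdowns(
    r_multiples: Sequence[float],
    runs: int = 2000,
    method: str = "shuffle",
    skip_prob: float = 0.0,
    seed: int = 0,
) -> List[float]:
    """Return the sorted max drawdowns of `runs` resampled trade sequences."""
    rng = random.Random(seed)  # seeded so the study is reproducible
    trades = list(r_multiples)
    results = []
    for _ in range(runs):
        if method == "shuffle":      # same trades, new order
            path = rng.sample(trades, len(trades))
        elif method == "bootstrap":  # draw with replacement
            path = rng.choices(trades, k=len(trades))
        else:
            raise ValueError("method must be 'shuffle' or 'bootstrap'")
        if skip_prob > 0:            # randomly miss a fraction of signals
            path = [r for r in path if rng.random() >= skip_prob]
        results.append(max_drawdown_r(path))
    return sorted(results)


def percentile(sorted_values: Sequence[float], q: float) -> float:
    """Nearest-rank style percentile of an already sorted list (0 <= q <= 1)."""
    index = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[index]
```

Reading `percentile(drawdowns, 0.95)` rather than the single backtest figure gives a more honest number to size against.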
Sensitivity, regime, and stress
- Parameter sensitivity: nudge every parameter ±10–20%; performance should degrade gracefully (a plateau), not collapse (Section 09).
- Regime slicing: break results out by trending / ranging / high- vs low-volatility periods. A robust system needn't excel everywhere, but it must not be catastrophic in its off-regime — and its regime filter should keep it largely flat there.
- Stress testing: replay the worst historical windows, double your slippage, widen spreads, and gap price through a stop. If the system only survives benign conditions, it isn't validated.
Treat the system as validated only when all of the following hold:
- Positive expectancy after pessimistic costs, over 100+ trades spanning multiple regimes.
- Out-of-sample and walk-forward results hold (walk-forward efficiency does not fall off a cliff).
- Parameter plateau, not a peak; survives ±20% nudges.
- Monte Carlo worst-case drawdown is one your sizing and psychology can survive.
- No single regime, instrument, or year carries the entire result.
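The plateau-versus-peak check can be automated with a one-at-a-time parameter sweep. The sketch below assumes a user-supplied `evaluate` function that runs a backtest and returns a single positive score (for example expectancy in R); the `floor` of 0.7 is an arbitrary illustrative threshold, not a recommendation from this handbook.

```python
from typing import Callable, Dict, Sequence


def sensitivity_sweep(
    evaluate: Callable[[Dict[str, float]], float],
    params: Dict[str, float],
    nudges: Sequence[float] = (-0.2, -0.1, 0.1, 0.2),
) -> Dict[str, float]:
    """Nudge each parameter alone and report its worst score relative to baseline.

    A value near 1.0 means performance barely moves (a plateau);
    a value near 0 means the result collapses (a peak).
    """
    baseline = evaluate(params)
    if baseline <= 0:
        raise ValueError("baseline score must be positive to form a ratio")
    report = {}
    for name, value in params.items():
        ratios = []
        for nudge in nudges:
            trial = dict(params)
            trial[name] = value * (1 + nudge)  # change one parameter, hold the rest
            ratios.append(evaluate(trial) / baseline)
        report[name] = min(ratios)
    return report


def is_plateau(report: Dict[str, float], floor: float = 0.7) -> bool:
    """True if no single nudge drops the score below `floor` times baseline."""
    return all(ratio >= floor for ratio in report.values())
```

Integer parameters such as lookback lengths would need rounding inside `evaluate`, which this sketch leaves to the caller.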
12From Backtest to Live
The gap between a validated backtest and a profitable live account is where most edges quietly die — not from a flawed system, but from the implementation gap. Crossing it is a deliberate, staged process, and the first thing you measure live is not profit.
Three testing modes
| Mode | What it tests | Blind spot |
|---|---|---|
| Paper / simulation | Logic and operational bugs, with idealised fills | Real slippage and the operator's nerves |
| Forward test (real-time data) | The truest data preview — same feed, no peeking ahead | Fills if paper; psychology if not live |
| Live micro-size | Real fills and real psychology — the only test that includes you | Costs real money (kept tiny on purpose) |
The implementation gap
These are the backtest assumptions that break in contact with a live venue. Each one widens the gap between hypothetical and realised expectancy:
- Slippage worse than modelled, especially on stops in fast markets.
- Latency between signal and fill — the price you saw isn't always the price you get.
- Partial or rejected fills, and requotes in thin liquidity.
- Spread spikes at session opens and news that your fixed-spread backtest never charged you.
- The operator — hesitating on a valid signal, overriding a loss, sizing up after a win.
Incubate, then scale in
Do not jump from simulation to full size. Run the system forward — paper or micro — for a window long enough to span a few dozen trades and at least one regime shift, then ramp capital in stages, each stage gated on the live edge continuing to track the backtest within tolerance.
# advance a stage only if: live expectancy ≈ backtest expectancy (within tolerance)
# AND drawdown is inside the limit AND execution stats match the cost model
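The stage gate above can be written as an explicit, testable function. This is a sketch under stated assumptions: the field names, the 30-trade minimum (standing in for "a few dozen trades"), and both tolerances are hypothetical placeholders to be replaced with your own pre-committed values.

```python
from dataclasses import dataclass


@dataclass
class StageStats:
    live_expectancy_r: float      # average R per trade observed live
    backtest_expectancy_r: float  # average R per trade from the validated backtest
    live_drawdown: float          # current drawdown as a fraction of equity
    drawdown_limit: float         # pre-committed limit, same units
    live_cost_r: float            # observed spread + slippage per trade, in R
    modelled_cost_r: float        # cost per trade assumed in the backtest, in R
    trades: int                   # live trades taken at this stage


def may_advance(
    s: StageStats,
    min_trades: int = 30,
    expectancy_tol_r: float = 0.15,
    cost_tol: float = 0.25,
) -> bool:
    """Advance a stage only if every gate passes."""
    enough_sample = s.trades >= min_trades
    edge_tracks = abs(s.live_expectancy_r - s.backtest_expectancy_r) <= expectancy_tol_r
    drawdown_ok = s.live_drawdown <= s.drawdown_limit
    costs_ok = s.live_cost_r <= s.modelled_cost_r * (1 + cost_tol)
    return enough_sample and edge_tracks and drawdown_ok and costs_ok
```

Note that the expectancy gate is two-sided: live results far better than the backtest also fail, since they suggest the live system is not the one that was validated.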
Before the first live order, confirm each of the following:
- Feed parity verified: same data source as backtest (Section 07).
- Costs modelled and matching observed spread/slippage.
- Position-size formula re-derived and unit-tested against hand calculations.
- Max-drawdown limit and a kill switch coded, not just intended.
- Logging of every decision and fill; alerting on anomalies; reconciliation of system state vs broker state.
- A written, pre-committed pause rule (Section 14) — decided in calm, not in drawdown.
13Execution & Operations
A validated edge can still bleed out through execution. This is the engineering layer — how orders actually reach the market, the FX-specific frictions, and the fail-safes that matter most precisely when a system is handling real money and something goes wrong.
Order types
| Order | Behaviour | Certainty | Use for |
|---|---|---|---|
| Market | Fills now at best available price | Fill certain, price uncertain | When speed > price; risky in thin liquidity |
| Limit | Fills at your price or better | Price certain, fill uncertain | Entries at a level, taking profit |
| Stop (stop-market) | Becomes a market order when the level trades | Fill certain once triggered, price uncertain (slippage) | Stop-losses, breakout entries |
| Stop-limit | Becomes a limit order when triggered | Price certain, fill uncertain | Careful entries — dangerous for stop-losses, as it can leave you unprotected in a fast move |
| Trailing stop | A stop that follows price by a set distance | As for a stop: fill certain once triggered, price uncertain | Locking in gains progressively while letting winners run |
FX execution realities
- Spread on every round trip: you pay it entering and it's baked into your exit. It is the most certain cost you have — model it on the bid/ask you actually trade.
- Swap / rollover applies at the daily rollover (around 17:00 New York); triple swap is typically charged once a week (commonly at Wednesday's rollover for most pairs) to cover the weekend. Material for any multi-day hold.
- Liquidity windows: the London/New York overlap is deepest and cheapest; the Asian session and the Friday close are thin and wide; the Sunday open can gap.
- The weekend gap: price can open Monday far from Friday's close, leaping over stops — size and hold with that in mind.
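A short worked example shows why a gap turns a planned 1R loss into something larger, and why a stop-limit can fail to protect at all. The sketch assumes the stop is triggered at the open after a gap and ignores spread; the prices are made up for illustration.

```python
def stop_market_fill(stop: float, open_price: float, is_long: bool) -> float:
    """Fill price of a triggered stop-loss: the stop level, or the open if price gapped past it."""
    return min(stop, open_price) if is_long else max(stop, open_price)


def stop_limit_fills(limit: float, open_price: float, is_long: bool) -> bool:
    """A stop-limit exit fills at the open only if the open is no worse than the limit."""
    return open_price >= limit if is_long else open_price <= limit


def result_in_r(entry: float, stop: float, fill: float, is_long: bool) -> float:
    """Trade result in units of the initial risk (entry-to-stop distance)."""
    risk = abs(entry - stop)
    pnl = (fill - entry) if is_long else (entry - fill)
    return pnl / risk


# Hypothetical long: entry 1.2500, stop 1.2450 (50 pips = 1R), Monday open 1.2400.
fill = stop_market_fill(stop=1.2450, open_price=1.2400, is_long=True)
print(fill)                                        # the open, not the stop level
print(result_in_r(1.2500, 1.2450, fill, True))     # roughly -2R instead of -1R
print(stop_limit_fills(1.2450, 1.2400, True))      # False: the position stays open
```

This is one way to implement the "gap price through a stop" stress test from Section 11: replay stops with an open price forced beyond the stop level.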
The decision / risk split
Fail-safes & operational hardening
In trading, bugs rarely throw a clean exception — they lose money silently. Engineer the system to fail loudly and fail flat.
- Kill switch: a global halt triggered by a daily-loss limit, an error-rate spike, or a data/broker disconnect — flatten and stop, don't "keep trying".
- Circuit breakers: max-daily-loss and max-drawdown limits that automatically halt new entries.
- Idempotency: idempotent order keys so a retry after a timeout never double-submits a position.
- State reconciliation: continuously verify the system's view of open positions against the broker's truth; alert on any mismatch.
- Connectivity & heartbeat: detect disconnects fast and behave safely — never leave orphaned orders or unmanaged positions.
- Deterministic runtime: freeze inputs at decision time (point-in-time data, cached values), so the same bar always produces the same action — no surprise recomputation.
- Full audit log: every signal, order, fill, and rejection recorded — both for debugging and for honest performance attribution.
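Three of the fail-safes above (kill switch, idempotent order keys, state reconciliation) are small enough to sketch together. Everything here is illustrative: `OrderGateway` is an in-memory stand-in rather than any real broker API, the key fields are an assumption about what uniquely identifies an order, and flattening positions is left out.

```python
import hashlib
from typing import Dict, Set, Tuple


class KillSwitch:
    """Latching global halt: once tripped it stays tripped until a human resets it."""

    def __init__(self, daily_loss_limit: float, max_errors: int):
        self.daily_loss_limit = daily_loss_limit
        self.max_errors = max_errors
        self.halted = False
        self.reason = ""

    def check(self, daily_pnl: float, error_count: int, connected: bool) -> bool:
        if self.halted:
            return True
        if daily_pnl <= -self.daily_loss_limit:
            self.halted, self.reason = True, "daily loss limit"
        elif error_count >= self.max_errors:
            self.halted, self.reason = True, "error-rate spike"
        elif not connected:
            self.halted, self.reason = True, "disconnect"
        return self.halted


def order_key(strategy: str, symbol: str, bar_time: str, side: str) -> str:
    """Deterministic key: the same signal always maps to the same key."""
    raw = f"{strategy}|{symbol}|{bar_time}|{side}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]


class OrderGateway:
    """Hypothetical in-memory adapter that refuses duplicate keys."""

    def __init__(self) -> None:
        self.sent: Set[str] = set()

    def submit(self, key: str) -> bool:
        if key in self.sent:  # a retry after a timeout lands here
            return False
        self.sent.add(key)
        return True


def reconcile(system: Dict[str, float], broker: Dict[str, float]) -> Dict[str, Tuple[float, float]]:
    """Return {symbol: (system size, broker size)} for every mismatch."""
    symbols = set(system) | set(broker)
    return {
        s: (system.get(s, 0.0), broker.get(s, 0.0))
        for s in symbols
        if system.get(s, 0.0) != broker.get(s, 0.0)
    }
```

A non-empty result from `reconcile` is the kind of condition that should raise an alert and, depending on your rules, trip the kill switch.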
14Monitoring & Edge Decay
A live system is not "set and forget". Edges decay, regimes shift, and markets adapt to the inefficiencies you're exploiting. The job after launch is to know — quantitatively — whether the system is still the one you validated, and to have decided in advance what you'll do when it isn't.
Track live against backtest, continuously
Maintain rolling live statistics — expectancy, profit factor, win rate, average R — and compare them to the distribution your backtest and Monte Carlo produced (Section 11). Inside those bounds is normal variance; persistently outside them is a signal worth investigating.
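One way to build such bounds is to bootstrap the backtest trades at the same window length as the live sample. The sketch below is an assumption about how a control band might be constructed, not a method prescribed by this handbook; the 95% coverage and 5,000 runs are illustrative defaults.

```python
import random
from typing import Sequence, Tuple


def control_band(
    backtest_r: Sequence[float],
    window: int,
    runs: int = 5000,
    coverage: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Bootstrap the range of average R you would expect over `window` trades."""
    rng = random.Random(seed)
    trades = list(backtest_r)
    means = sorted(sum(rng.choices(trades, k=window)) / window for _ in range(runs))
    tail = (1 - coverage) / 2
    low = means[int(tail * runs)]
    high = means[min(runs - 1, int((1 - tail) * runs))]
    return low, high


def in_band(live_r: Sequence[float], band: Tuple[float, float]) -> bool:
    """True if the live average R sits inside the band."""
    mean = sum(live_r) / len(live_r)
    return band[0] <= mean <= band[1]
```

A typical use is `in_band(last_50_live_trades, control_band(backtest_trades, window=50))`, re-evaluated after each closed trade.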
Variance or decay? — the hard distinction
The central difficulty of monitoring is telling a normal losing streak (variance around a still-positive mean, which Section 03 proved is inevitable) from genuine edge decay (the inefficiency is gone). Over-react and you abandon good systems in normal drawdowns; under-react and you feed a dead one. The only defence is pre-defined, quantitative thresholds set while calm.
| Cause of decay | What happened | Tell |
|---|---|---|
| Regime change | Your archetype's regime left (trend turned to chop) | Underperformance concentrated in one regime; filter no longer firing |
| Crowding | Others found and arbitraged the same edge | Slow, persistent erosion of expectancy across regimes |
| Structural change | Market microstructure, spreads, or participants shifted | Costs/slippage drift up; fills worsen vs the model |
| Parameter drift | The world moved; your fixed parameters didn't | Walk-forward would now pick very different values |
Pre-committed pause and retire rules
Decide these in calm and write them down — because in a live drawdown your judgement is compromised by the very situation it's judging.
- Pause when a hard limit is breached: max drawdown hit, or daily-loss cap reached. Stop new entries, reassess.
- Review when live expectancy sits outside its control band for a pre-set number of trades — investigate before deciding.
- Retire when the thesis is invalidated — the market behaviour the system bets on demonstrably no longer holds. A dead edge doesn't deserve loyalty.
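Writing the three rules as a single decision function makes them hard to reinterpret mid-drawdown. This is a minimal sketch; the parameter names are hypothetical, and `thesis_valid` is assumed to be a flag set by a human after a structured review rather than something the code can infer.

```python
from enum import Enum


class Action(Enum):
    RUN = "run"
    REVIEW = "review"
    PAUSE = "pause"
    RETIRE = "retire"


def decide(
    drawdown: float,
    max_drawdown: float,
    daily_loss: float,
    daily_loss_cap: float,
    trades_outside_band: int,
    review_after: int,
    thesis_valid: bool = True,
) -> Action:
    """Apply the pre-committed rules in order of severity."""
    if not thesis_valid:
        return Action.RETIRE  # the behaviour the system bets on no longer holds
    if drawdown >= max_drawdown or daily_loss >= daily_loss_cap:
        return Action.PAUSE   # hard limit breached: stop new entries, reassess
    if trades_outside_band >= review_after:
        return Action.REVIEW  # persistent deviation: investigate before deciding
    return Action.RUN
```

Here `drawdown` and `daily_loss` are positive magnitudes in the same units as their limits.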
15Psychology & Adherence
The system is the easy part. The hard part is the human operating it. The most common reason a profitable system loses money is not a flaw in the rules — it is a failure to follow them. Discipline is not a personality trait you either have or lack; it is infrastructure you build so that in-the-moment judgement can't quietly destroy the edge.
How a profitable system becomes a losing one
Every item below converts a positive-expectancy system into a negative one — without changing a single rule:
- Overriding a valid signal because it "feels wrong", usually just before the next winner arrives.
- Skipping trades after a losing streak — abandoning the system at the bottom, missing the recovery.
- Sizing up after wins — the most dangerous one; the biggest losses tend to follow the biggest, most confident bets.
- Revenge trading after a loss — taking setups outside the system to "make it back".
- Moving stops to avoid being wrong — converting a defined 1R loss into an undefined disaster.
The biases doing the damage
| Bias | Mechanism | Damage to the system |
|---|---|---|
| Loss aversion | Losses hurt ~2× as much as equivalent gains feel good | Cutting winners early, holding losers past the stop |
| Recency bias | Over-weighting the last few trades | Abandoning the system after a normal losing streak |
| Outcome bias | Judging a decision by its single result | Distrusting a good system after an unlucky loss |
| Gambler's fallacy | Believing you're "due" for a win | Sizing up to recover, breaking risk rules |
| Post-win overconfidence | Recent success inflates perceived skill | Sizing up into the next, larger loss |
| Confirmation bias | Seeking evidence for what you want to do | Rationalising past the regime filter's "no" |
Systematic does not mean emotionless
Even a fully automated system leaves you one decision: whether to keep it running through a drawdown. That single choice — made under maximum emotional pressure — is where most automated edges die too. The answer is not to "be more disciplined"; it is to remove the moment of weakness from the loop wherever you can.
The journal: making adherence measurable
Log every trade — entry, exit, size, R-multiple — and whether you followed the system, and if not, why. Then separate system P&L (what the rules would have made) from deviation P&L (the cost of your overrides). This turns discipline from a vague aspiration into a number you can confront.
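The split between system P&L and deviation P&L is simple arithmetic once each journal row records both numbers. The sketch below assumes results are logged in R and that a skipped signal is recorded with an actual result of 0; the `JournalEntry` fields are illustrative.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class JournalEntry:
    system_r: float   # what the rules would have made on this signal
    actual_r: float   # what the account actually made (0.0 if the trade was skipped)
    followed: bool    # did you execute exactly as the system specified?
    note: str = ""    # if not, why


def split_pnl(entries: Sequence[JournalEntry]) -> Tuple[float, float, float]:
    """Return (system P&L, deviation P&L, adherence rate)."""
    system = sum(e.system_r for e in entries)
    actual = sum(e.actual_r for e in entries)
    deviation = actual - system  # negative means overrides cost money
    adherence = sum(e.followed for e in entries) / len(entries)
    return system, deviation, adherence


journal = [
    JournalEntry(2.0, 2.0, True),
    JournalEntry(-1.0, -1.0, True),
    JournalEntry(3.0, 0.0, False, "skipped after two losses"),
    JournalEntry(-1.0, -2.5, False, "moved the stop"),
]
print(split_pnl(journal))  # (3.0, -4.5, 0.5)
```

In this made-up journal the rules earned +3R while the overrides cost 4.5R, so the account lost money on a profitable system.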
Building discipline infrastructure
- Pre-commit the rules in calm — including pause/retire thresholds (Section 14) — when you are not in a trade.
- Automate what you can — automation removes the moment of weakness entirely; a rule a machine executes can't be overridden in a panic.
- Size so you can sleep (Section 04) — most overrides are driven by positions that are simply too large.
- Use a pre-trade checklist — force every entry through the same gate, every time.
16Worked Example & Pre-Launch Checklist
Every preceding section, applied once, end to end, to a single concrete system. This is the full lifecycle — from a one-sentence hypothesis to a monitored live system with pre-committed exit rules — run as one continuous workflow.
The lifecycle loop
End to end: the GBPUSD trend-pullback
Taking the system specified in Section 02 through every stage:
- Hypothesise. "Liquid FX trends persist intraday; after a shallow pullback to the mean within an established uptrend, continuation is more likely than reversal." One sentence, falsifiable.
- Specify. The full eight-component spec from Section 02 — universe (GBPUSD 1H), regime filter (200/50-EMA + news blackout), setup (pullback to 20-EMA), entry (buy-stop above the reclaim bar), stop (1.5×ATR = 1R), size (0.5% fixed-fractional), exit (half at +1R, trail remainder, 24-bar time stop), manage (breakeven at +1R). The stranger test passes.
- Data. Several years of 1H bars from a single feed, the same one the live broker will supply, spanning trending, ranging, and at least one volatile crisis period; tick data reserved to resolve intrabar stop-vs-target order (Section 07).
- Backtest. Fast vectorised triage to confirm the idea has a pulse, then an event-driven re-test mirroring the live engine, with pessimistic spread + slippage + swap (Section 08).
- Validate. In-sample / out-of-sample split; rolling walk-forward; Monte Carlo on the trade sequence; ±20% parameter sweep for a plateau; regime slicing (Section 11).
- Forward & micro-live. Run forward on real-time data, then micro-size live, tracking live expectancy against the backtest distribution (Section 12).
- Scale & monitor. Ramp size only while live tracks backtest; maintain a control band; obey the pre-committed pause/retire rules (Sections 12 & 14).
| Metric | In-sample | Out-of-sample | Read |
|---|---|---|---|
| Trades | 420 | 180 | Ample sample both windows |
| Expectancy | +0.28R | +0.24R | Holds OOS — encouraging |
| Win rate / payoff | 42% / 2.6 | 40% / 2.5 | Low win rate, high payoff — consistent with a trend system |
| Profit factor | 1.55 | 1.48 | Solid, not suspiciously high |
| Max drawdown | 14% | 16% | Survivable; check Monte Carlo tail |
| Walk-forward efficiency | 0.86 (OOS ÷ IS) | | OOS keeps most of IS edge; robust, not a cliff |
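Walk-forward efficiency is the ratio of out-of-sample to in-sample performance. The sketch below assumes expectancy in R is the performance measure; that choice is an assumption, though with the table's figures it reproduces the stated value.

```python
def walk_forward_efficiency(in_sample_perf: float, out_of_sample_perf: float) -> float:
    """Out-of-sample performance divided by in-sample performance."""
    if in_sample_perf <= 0:
        raise ValueError("in-sample performance must be positive for the ratio to mean anything")
    return out_of_sample_perf / in_sample_perf


# Expectancy from the table: +0.28R in-sample, +0.24R out-of-sample.
wfe = walk_forward_efficiency(0.28, 0.24)
print(round(wfe, 2))  # 0.86
```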
The master pre-launch checklist
- Edge: positive expectancy after pessimistic costs, 100+ trades, multiple regimes (§03, §08).
- Specification: total, contradiction-free, passes the stranger test; few justified parameters (§06).
- Data: single feed, parity backtest↔live, gaps and timezones handled (§07).
- Validation: OOS holds, walk-forward efficiency healthy, parameter plateau, Monte Carlo worst-case survivable (§11).
- Sizing: risk-per-trade and portfolio heat set backwards from a survivable drawdown; well under Kelly (§04).
- Operations: kill switch, daily-loss cap, idempotent orders, reconciliation, logging — coded, not intended (§13).
- Discipline: pre-committed pause/retire thresholds; a journal that separates system P&L from deviation P&L (§14, §15).
Failure-mode map
| How systems blow up | Prevented by |
|---|---|
| Trading with no real edge | §03 expectancy · §08 honest backtest · §11 validation |
| Sizing too large → ruin / abandonment | §04 sizing & risk of ruin |
| Overfitting a gorgeous backtest | §09 parsimony · §11 walk-forward & Monte Carlo |
| Look-ahead / data bias inflating results | §07 data hygiene · §08 event-driven loop |
| Costs quietly eating the edge | §08 cost model · §13 execution |
| No regime awareness (right system, wrong market) | §02 filter · §05 archetypes · §14 monitoring |
| Non-deterministic logic in the risk path | §13 decision/risk split & fail-safes |
| Abandoning a good system / clinging to a dead one | §14 pre-committed rules · §15 adherence |
17Glossary
The core vocabulary of system development, in plain terms. Each definition is the working sense used throughout this handbook.
Edge
A statistical advantage that produces positive expectancy after costs over a large sample.
Expectancy
Average profit/loss per trade: (win% × avg win) − (loss% × avg loss). The master profitability number.
R-multiple
Profit/loss expressed in units of initial risk. A trade making twice its risk is +2R; a full stop-out is −1R.
Payoff ratio
Average winning trade ÷ average losing trade (reward:risk). Couples with win rate to determine the edge.
Profit factor
Gross profit ÷ gross loss. Above 1.0 is profitable; 1.3–1.6 is solid.
Win rate
Proportion of trades that profit. Meaningless without the payoff ratio.
Drawdown
A decline in equity from a prior peak, measured in percent or currency.
Max drawdown
The largest peak-to-trough equity decline over a period; its duration often matters more than its depth.
CAGR
Compound annual growth rate — the smoothed annualised return. Ignores risk in isolation.
Sharpe ratio
Excess return ÷ total volatility. Penalises all volatility and assumes near-normal returns.
Sortino ratio
Excess return ÷ downside volatility — fairer to systems with asymmetric (upside-skewed) returns.
Calmar / MAR
CAGR ÷ |max drawdown| — return per unit of worst pain.
Position sizing
Translating a risk budget and stop distance into a trade quantity. The survival lever.
Fixed fractional
Risking a constant percentage of equity per trade. The sensible default.
Kelly criterion
The growth-optimal bet fraction. Used only fractionally and as a ceiling, never raw.
Risk of ruin
The probability of losing enough capital to be unable or unwilling to continue.
Portfolio heat
Total open risk across all positions at once; correlated positions count as one.
Regime
The prevailing market behaviour (trending, ranging, high/low volatility). Every edge needs a specific one.
Backtest
Simulating a system on historical data to estimate its expectancy and conditions of success.
In-sample / out-of-sample
Data used for development (IS) versus data reserved for honest, one-shot testing (OOS).
Walk-forward analysis
Repeatedly optimising on one window and testing on the next unseen window, rolling forward.
Monte Carlo
Reshuffling/resampling trade results many times to reveal the distribution of outcomes and drawdowns.
Overfitting
Fitting the noise in a historical sample rather than the signal; looks great in backtest, fails live.
Look-ahead bias
Using information not available at decision time. The deadliest silent inflator of results.
Survivorship bias
Testing only instruments that still exist, ignoring those that failed and were removed.
Slippage
The difference between the expected fill price and the actual one; worst on stops in fast markets.
Spread
The bid/ask gap — a cost paid on entry and embedded in exit. Widens in thin liquidity and around news.
Swap / rollover
Interest earned or paid for holding an FX position overnight, based on the rate differential.
MAE / MFE
Maximum Adverse / Favourable Excursion — how far a trade ran against / for you before closing.
Kill switch
A global halt that flattens positions and stops trading on a defined dangerous condition.
Idempotency
Designing order submission so a retry never duplicates a position — critical for safe automation.
Walk-forward efficiency
Out-of-sample performance ÷ in-sample performance; a measure of how well an edge generalises.
Keep reading
The Technical Analysis Handbook covers the entry-and-structure layer this handbook assumes — candlesticks, market structure, indicators, patterns, confluence, and order flow — each with exact rules.