Multi-window burn-rate thresholds explained
Nines uses two burn-rate window pairs: 1h+5m at 14.4× (fast) and 6h+30m at 6× (slow). Both windows in a pair must exceed the threshold for the pair to fire.
Definitions
- Burn rate
- The ratio of actual error consumption to the SLO budget refill rate: burn_rate = error_fraction / (1 - slo_target). A 1× burn is sustainable indefinitely.
- Time to exhaustion
- window_length / burn_rate. At 14.4× a 7-day budget exhausts in ~12 hours; a 30-day budget in ~50 hours.
- Window pair
- A long window plus a short window with a shared burn-rate threshold. Both windows must exceed the threshold simultaneously for the pair to fire.
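The two formulas above can be checked with a few lines of Python. This is a minimal sketch, not Nines code: the function names `burn_rate` and `hours_to_exhaustion` are hypothetical, and `window_length` is taken to mean the SLO window length expressed in hours.

```python
def burn_rate(error_fraction: float, slo_target: float) -> float:
    """Observed error fraction divided by the error fraction the SLO allows."""
    budget_fraction = 1.0 - slo_target
    if budget_fraction <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_fraction / budget_fraction


def hours_to_exhaustion(slo_window_hours: float, rate: float) -> float:
    """Hours until the whole budget is spent at a constant burn rate."""
    if rate <= 0:
        return float("inf")  # nothing is being burned
    return slo_window_hours / rate


# 1% errors against a 99.5% target burn at about 2x.
print(round(burn_rate(0.01, 0.995), 2))
# 14.4x against a 7-day and a 30-day budget.
print(round(hours_to_exhaustion(7 * 24, 14.4), 1))   # 11.7
print(round(hours_to_exhaustion(30 * 24, 14.4), 1))  # 50.0
```

A burn rate of zero is mapped to an infinite time to exhaustion here; that is a choice made for the sketch rather than documented behavior.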
Thresholds
| Pair | Long window | Short window | Threshold | Time to exhaust 7d budget |
|---|---|---|---|---|
| Fast | 1h | 5m | 14.4× | ~12h |
| Slow | 6h | 30m | 6× | ~28h |
The 14.4× and 6× constants are fixed in Nines and not user-tunable. They are not affected by the configured SLO window: 14.4× always means "burning 14.4× the sustainable rate" regardless of whether the SLO window is 7 or 30 days.
Derivation
The Google SRE Workbook (Chapter 5) derives each burn-rate threshold from the fraction of the SLO budget consumed within the alerting window:
burn_rate_threshold = budget_pct_to_alert_on / (alert_window / slo_window)
Workbook reference values, against a 30-day SLO window:
- Fast: 2% of budget consumed in 1h → 0.02 / (1h / 30d) = 14.4×.
- Slow: 5% of budget consumed in 6h → 0.05 / (6h / 30d) = 6×.
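The derivation can be reproduced numerically. The sketch below is illustrative only: `burn_rate_threshold` mirrors the formula above, and `budget_fraction_consumed` is a hypothetical helper that inverts it to show what share of the budget a fixed threshold represents when the SLO window is 7 days instead of 30.

```python
from datetime import timedelta


def burn_rate_threshold(budget_fraction: float, alert_window: timedelta,
                        slo_window: timedelta) -> float:
    """Burn rate that consumes budget_fraction of the budget within alert_window."""
    return budget_fraction / (alert_window / slo_window)


def budget_fraction_consumed(rate: float, alert_window: timedelta,
                             slo_window: timedelta) -> float:
    """Inverse view: share of the budget spent by burning at rate for alert_window."""
    return rate * (alert_window / slo_window)


thirty_days = timedelta(days=30)
seven_days = timedelta(days=7)

fast = burn_rate_threshold(0.02, timedelta(hours=1), thirty_days)
slow = burn_rate_threshold(0.05, timedelta(hours=6), thirty_days)
print(round(fast, 1), round(slow, 1))  # 14.4 6.0

# The same fixed thresholds spend a larger share of a 7-day budget.
print(round(budget_fraction_consumed(14.4, timedelta(hours=1), seven_days), 3))  # 0.086
print(round(budget_fraction_consumed(6.0, timedelta(hours=6), seven_days), 3))   # 0.214
```

Because the constants are fixed, a shorter SLO window means each pair fires after a larger share of the budget is gone: roughly 8.6% and 21.4% for a 7-day window, versus 2% and 5% for 30 days.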
Why two windows per pair
- Long window alone
- A 1h rolling average ramps from zero. A service that goes 100% down at 09:00 doesn't trip 14.4× on the 1h window until ~09:25. Detection lag is unacceptable.
- Short window alone
- A 5m window can be pushed past 14.4× by a single bad minute. False-positive rate is unacceptable.
- Both windows
- The long window establishes that the degradation is sustained; the short window confirms it is currently active. Both must exceed the threshold to fire.
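The both-windows rule amounts to a logical AND over two burn-rate readings. The following is a minimal sketch under stated assumptions: `WindowPair` and `fires` are hypothetical names, and "exceed" is read as a strict greater-than comparison, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowPair:
    name: str
    long_window_minutes: int
    short_window_minutes: int
    threshold: float

    def fires(self, long_burn: float, short_burn: float) -> bool:
        """Fire only when both windows are above the shared threshold."""
        # Assumption: strict comparison; the source only says "exceed".
        return long_burn > self.threshold and short_burn > self.threshold


FAST = WindowPair("fast", long_window_minutes=60, short_window_minutes=5, threshold=14.4)
SLOW = WindowPair("slow", long_window_minutes=360, short_window_minutes=30, threshold=6.0)

# Sustained and still active: both windows are hot.
print(FAST.fires(long_burn=20.0, short_burn=40.0))  # True
# Brief spike: the short window is hot, the long window has barely moved.
print(FAST.fires(long_burn=2.0, short_burn=30.0))   # False
# Already recovered: the long window is still hot, the short window is clean.
print(FAST.fires(long_burn=20.0, short_burn=0.0))   # False
```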
Why two pairs
The fast pair catches outages within minutes. The slow pair catches degradations too gradual to push the 1h window past 14.4× but still on track to violate the SLO. Neither pair alone covers both ranges:
- Fast pair only: an 8× degradation running for half a day never fires.
- Slow pair only: a total outage takes ~4 hours to fire, instead of ~5 minutes.
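The coverage gap between the two pairs is easiest to see at steady state, when a constant burn rate has persisted long enough that every window reports the same value. The sketch below only compares that value against the two thresholds from the table; `PAIR_THRESHOLDS` and `pairs_firing` are hypothetical names, and the sketch says nothing about how long each pair takes to fire.

```python
# Thresholds from the table above; the dict itself is illustrative.
PAIR_THRESHOLDS = {"fast": 14.4, "slow": 6.0}


def pairs_firing(steady_burn: float) -> list[str]:
    """Pairs whose threshold a constant burn rate exceeds once every window has filled."""
    return [name for name, threshold in PAIR_THRESHOLDS.items()
            if steady_burn > threshold]


for burn in (2.0, 8.0, 40.0):
    print(burn, pairs_firing(burn))
# 2.0 []
# 8.0 ['slow']
# 40.0 ['fast', 'slow']
```

An 8× degradation sits between the two thresholds, so only the slow pair can catch it.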
Worked examples
All examples assume a 99.5% availability SLO over 7 days.
- Persistent 1-of-5 region failure
- 20% error rate, 40× burn. Region-failure detector silent (1 of 5 is not a majority). Both fast windows cross 14.4× within minutes; fast pair fires.
- Free-plan 2-region monitor, 1 region permanently broken
- 50% error rate, 100× burn. Region-failure detector silent (1 of 2 is exactly half, not a majority). Fast pair fires within minutes and stays open.
- 1-in-200 errors
- 0.5% error rate, 1× burn. Sustainable. No alert.
- 1-in-100 errors
- 1% error rate, 2× burn. Below 6× threshold. No incident; budget-remaining graph trends down. 7-day budget exhausts in ~3.5 days.
- Single 5-second blip
- Short window spikes briefly; long window barely moves. Pair does not fire.
- Three short outages in a day, each auto-resolved
- Each region-failure incident opens and closes, but the errors remain in the rolling burn windows and keep counting against the 7-day budget. If the accumulated errors push both the 6h and 30m windows past 6×, the slow pair fires.
Warmup gate
A pair only evaluates once the monitor's age is at least the long window. Before then the pair returns Unknown and cannot fire. If both pairs are ineligible (monitor < 1h old) the detector returns Unknown for that monitor — see Detectors for full Unknown semantics.
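The warmup gate can be pictured as a guard that runs before any threshold comparison. This is a sketch under assumptions, not the Nines implementation: only the Unknown state comes from the text above, while the `OK` and `FIRING` state names, the function names, and the rule for combining one Unknown pair with one evaluated pair are all hypothetical.

```python
from datetime import timedelta
from enum import Enum


class State(Enum):
    UNKNOWN = "unknown"
    OK = "ok"          # hypothetical name for "evaluated, not firing"
    FIRING = "firing"  # hypothetical name for "evaluated, firing"


def evaluate_pair(monitor_age: timedelta, long_window: timedelta,
                  threshold: float, long_burn: float, short_burn: float) -> State:
    """Apply the warmup gate first, then the both-windows rule."""
    if monitor_age < long_window:
        return State.UNKNOWN  # not enough history to fill the long window
    if long_burn > threshold and short_burn > threshold:
        return State.FIRING
    return State.OK


def evaluate_detector(pair_states: list[State]) -> State:
    """Assumed combination rule: Unknown only when every pair is Unknown."""
    if any(s is State.FIRING for s in pair_states):
        return State.FIRING
    if all(s is State.UNKNOWN for s in pair_states):
        return State.UNKNOWN
    return State.OK


age = timedelta(hours=2)  # old enough for the fast pair, too young for the slow pair
fast = evaluate_pair(age, timedelta(hours=1), 14.4, long_burn=20.0, short_burn=20.0)
slow = evaluate_pair(age, timedelta(hours=6), 6.0, long_burn=20.0, short_burn=20.0)
print(fast, slow, evaluate_detector([fast, slow]))
```

With a 2-hour-old monitor the fast pair is eligible and fires, while the slow pair stays Unknown until the monitor is 6 hours old.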
Tuning
The 14.4× and 6× thresholds and the 1h/5m and 6h/30m windows are not user-tunable. What you can tune (Business and Founder plans) are the SLO inputs that determine what counts as a 1× rate:
- Availability SLO target percentage
- Rolling window length
- Latency target percentile
- Latency-excluded regions
The latency threshold (latency_threshold_ms) is tunable per monitor on every plan.
See also
- Incident detectors: region-failure and burn-rate
- Burn-rate incidents — incident lifecycle.
- Error budgets — budget math.
- Latency SLOs vs availability SLOs — both SLI types use these thresholds.