Multi-window burn-rate thresholds explained

Nines uses two burn-rate window pairs: 1h+5m at 14.4× (fast) and 6h+30m at 6× (slow). Both windows in a pair must exceed the threshold for the pair to fire.

Definitions

Burn rate: How fast errors consume the error budget, relative to the rate that would spend exactly the full budget over the SLO window. burn_rate = error_fraction / (1 - slo_target), with slo_target written as a fraction (0.995 for 99.5%). A 1× burn is sustainable indefinitely.
Time to exhaustion: slo_window / burn_rate, starting from a full budget. At 14.4× a 7-day budget exhausts in ~12 hours; a 30-day budget in ~50 hours.
Window pair: A long window plus a short window with a shared burn-rate threshold. Both windows must exceed the threshold simultaneously for the pair to fire.
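
The two formulas above can be checked with a few lines of Python. This is a minimal sketch, not Nines code: the function names are hypothetical, and slo_target is passed as a fraction.

```python
def burn_rate(error_fraction: float, slo_target: float) -> float:
    """Observed error fraction divided by the error fraction the SLO allows."""
    return error_fraction / (1.0 - slo_target)


def hours_to_exhaustion(slo_window_days: float, rate: float) -> float:
    """Hours for a constant burn rate to spend a full budget."""
    return slo_window_days * 24.0 / rate


# 1% errors against a 99.5% target.
print(round(burn_rate(0.01, 0.995), 2))
# The fast threshold (14.4x) against a 7-day and a 30-day budget.
print(round(hours_to_exhaustion(7, 14.4), 1))
print(round(hours_to_exhaustion(30, 14.4), 1))
```

The three printed values are 2.0, 11.7 and 50.0, matching the 2× burn and the ~12 hour and ~50 hour figures quoted in this page.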

Thresholds

Pair	Long window	Short window	Threshold	Time to exhaust 7d budget
Fast	1h	5m	14.4×	~12h
Slow	6h	30m	6×	~28h

The 14.4× and 6× constants are fixed in Nines and not user-tunable. They are not affected by the configured SLO window: 14.4× always means "burning 14.4× the sustainable rate" regardless of whether the SLO window is 7 or 30 days.

Derivation

The Google SRE Workbook (Chapter 5) derives each burn-rate threshold from the fraction of SLO budget consumed within the alerting window:

burn_rate_threshold = budget_pct_to_alert_on / (alert_window / slo_window)

Workbook reference values, against a 30-day SLO window:

Fast: 2% of budget consumed in 1h → 0.02 / (1h / 30d) = 14.4×.
Slow: 5% of budget consumed in 6h → 0.05 / (6h / 30d) = 6×.
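
The derivation can be reproduced numerically. The sketch below uses hypothetical helper names. It also inverts the formula to show what the fixed constants mean on a 7-day window, which follows from the constants not scaling with the SLO window.

```python
def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_window_days: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget in one alert window."""
    return budget_fraction / (alert_window_hours / (slo_window_days * 24.0))


def budget_fraction_at(threshold: float, alert_window_hours: float,
                       slo_window_days: float) -> float:
    """Inverse: share of the budget spent in one alert window at `threshold`."""
    return threshold * alert_window_hours / (slo_window_days * 24.0)


# Workbook reference values against a 30-day window.
fast = burn_rate_threshold(0.02, 1, 30)
slow = burn_rate_threshold(0.05, 6, 30)
print(round(fast, 1), round(slow, 1))

# The same constants applied to a 7-day window.
print(round(budget_fraction_at(14.4, 1, 7), 3),
      round(budget_fraction_at(6.0, 6, 7), 3))
```

The first line prints 14.4 and 6.0. The second prints 0.086 and 0.214: on a 7-day window the fast threshold corresponds to about 8.6% of the budget spent in 1h, and the slow threshold to about 21.4% spent in 6h.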

Why two windows per pair

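Long window alone: A 1h rolling average is slow to clear. At a 99.5% SLO, a service that goes 100% down at 09:00 crosses 14.4× on the 1h window at about 09:04, but the outage minutes stay in the window for an hour, so the alert keeps firing for most of the hour after recovery. Reset lag is unacceptable.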
Short window alone: A 5m window can be pushed past 14.4× by a single bad minute. False-positive rate is unacceptable.
Both windows: The long window establishes that the degradation is sustained; the short window confirms it is currently active. Both must exceed the threshold to fire.
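
The interaction of the two windows can be simulated on a per-minute error series. This is a rough sketch under stated assumptions, not how Nines evaluates windows internally: one sample per minute, a 99.5% SLO target, the fast-pair threshold, and hypothetical function names.

```python
from typing import Dict, List

SLO_TARGET = 0.995        # assumption: the 99.5% target from the worked examples
THRESHOLD = 14.4          # fast-pair threshold
LONG_MIN, SHORT_MIN = 60, 5


def window_burn(errors: List[float], now: int, window: int) -> float:
    """Burn rate over the `window` minutes ending at minute `now` (inclusive).

    Minutes before the start of the series count as error-free.
    """
    samples = errors[max(0, now - window + 1):now + 1]
    return (sum(samples) / window) / (1.0 - SLO_TARGET)


def firing_minutes(errors: List[float]) -> Dict[str, List[int]]:
    """Minutes at which each rule is above the threshold."""
    result: Dict[str, List[int]] = {"long": [], "short": [], "pair": []}
    for now in range(len(errors)):
        long_hit = window_burn(errors, now, LONG_MIN) > THRESHOLD
        short_hit = window_burn(errors, now, SHORT_MIN) > THRESHOLD
        if long_hit:
            result["long"].append(now)
        if short_hit:
            result["short"].append(now)
        if long_hit and short_hit:
            result["pair"].append(now)
    return result


# One value per minute: 1.0 = every check failed, 0.0 = every check passed.
outage = [0.0] * 60 + [1.0] * 10 + [0.0] * 120   # 10-minute total outage
blip = [0.0] * 60 + [1.0] * 1 + [0.0] * 120      # a single bad minute

for name, series in (("outage", outage), ("blip", blip)):
    hits = firing_minutes(series)
    print(name, {rule: (m[0], m[-1]) if m else None for rule, m in hits.items()})
```

For the outage series (bad minutes 60 to 69), the pair is above threshold from minute 64 to minute 73, while the long window on its own stays above threshold until minute 124. For the single bad minute, the short window trips for minutes 60 to 64 but the long window never does, so the pair stays quiet.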

Why two pairs

The fast pair catches outages within minutes. The slow pair catches degradations too gradual to push the 1h window past 14.4× but still on track to violate the SLO. Neither pair alone covers both ranges:

Fast pair only: an 8× degradation running for half a day never fires.
Slow pair only: a total outage takes ~11 minutes to fire at a 99.5% SLO, instead of under 5 minutes, and every burn above 14.4× is detected 2.5× more slowly.
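
How long each pair needs before it can fire follows from the window arithmetic. The sketch below assumes a constant burn rate that starts from clean windows and is evaluated continuously. The function name and the PAIRS table are illustrative only.

```python
import math
from typing import Dict, Tuple

# Pair name -> (long window in minutes, burn-rate threshold).
PAIRS: Dict[str, Tuple[float, float]] = {"fast": (60.0, 14.4), "slow": (360.0, 6.0)}


def minutes_to_fire(burn: float, long_window_minutes: float, threshold: float) -> float:
    """Minutes of constant `burn` before the long window exceeds `threshold`.

    After t minutes the window averages burn * t / window, so it crosses at
    t = window * threshold / burn. The short window crosses earlier, so the
    long window decides when the pair fires. Returns infinity if it never does.
    """
    if burn <= threshold:
        return math.inf
    return long_window_minutes * threshold / burn


for burn in (8.0, 40.0):
    times = {name: round(minutes_to_fire(burn, w, t), 1) for name, (w, t) in PAIRS.items()}
    print(burn, times)
```

At 8× the fast pair never fires and the slow pair needs 270 minutes. At 40× the fast pair needs 21.6 minutes and the slow pair 54 minutes.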

Worked examples

All examples assume a 99.5% availability SLO over 7 days.

Persistent 1-of-5 region failure: 20% error rate, 40× burn. Region-failure detector silent (1 of 5 is not a majority). Both fast windows cross 14.4× within minutes; fast pair fires.
Free-plan 2-region monitor, 1 region permanently broken: 50% error rate, 100× burn. Region-failure silent (1 of 2 is exactly half). Fast pair fires within minutes and stays open.
1-in-200 errors: 0.5% error rate, 1× burn. Sustainable. No alert.
1-in-100 errors: 1% error rate, 2× burn. Below 6× threshold. No incident; budget-remaining graph trends down. 7-day budget exhausts in ~3.5 days.
Single 5-second blip: Short window spikes briefly; long window barely moves. Pair does not fire.
Three short outages in a day, each auto-resolved: Each region-failure incident opens and closes, but the errors still count against the rolling 7-day budget. If enough of them land inside the same 6h window to push it past 6× while the 30m window is also above 6×, slow pair fires.
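
The constant-error-rate examples above can be classified mechanically. This sketch covers only the steady state, where the error rate has persisted long enough to fill every window. The function name is hypothetical and the default target is the 99.5% assumed for these examples.

```python
from typing import Optional

FAST_THRESHOLD = 14.4
SLOW_THRESHOLD = 6.0


def steady_state_pair(error_fraction: float, slo_target: float = 0.995) -> Optional[str]:
    """Pair that fires once a constant error rate has filled every window.

    In steady state all four windows report the same burn rate, so only the
    thresholds matter. A burn above 14.4x exceeds both thresholds; this
    sketch reports "fast" as the more urgent of the two. Returns None when
    neither pair fires.
    """
    burn = error_fraction / (1.0 - slo_target)
    if burn > FAST_THRESHOLD:
        return "fast"
    if burn > SLOW_THRESHOLD:
        return "slow"
    return None


examples = {
    "1-of-5 regions failing": 0.20,
    "1-of-2 regions failing": 0.50,
    "1-in-200 errors": 0.005,
    "1-in-100 errors": 0.01,
}
for label, fraction in examples.items():
    print(label, steady_state_pair(fraction))
```

The two region-failure cases report "fast"; the 1-in-200 and 1-in-100 cases report None.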

Warmup gate

A pair only evaluates once the monitor's age is at least the long window. Before then the pair returns Unknown and cannot fire. If both pairs are ineligible (monitor < 1h old) the detector returns Unknown for that monitor — see Detectors for full Unknown semantics.
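
The gate can be pictured as a three-state evaluation. The sketch below is illustrative only: the State enum and both function names are hypothetical, and the handling of one eligible pair plus one Unknown pair is an assumption, since the full Unknown semantics are described under Detectors rather than here.

```python
from enum import Enum


class State(Enum):
    UNKNOWN = "unknown"
    OK = "ok"
    FIRING = "firing"


def evaluate_pair(age_minutes: float, long_window_minutes: float, threshold: float,
                  long_burn: float, short_burn: float) -> State:
    """Apply the warmup gate, then the both-windows rule."""
    if age_minutes < long_window_minutes:
        return State.UNKNOWN          # too young: the pair cannot fire
    if long_burn > threshold and short_burn > threshold:
        return State.FIRING
    return State.OK


def evaluate_detector(fast: State, slow: State) -> State:
    """Combine the two pairs into one detector result."""
    if fast is State.UNKNOWN and slow is State.UNKNOWN:
        return State.UNKNOWN          # stated in the text: both pairs ineligible
    # Assumption: otherwise any firing pair fires the detector.
    if State.FIRING in (fast, slow):
        return State.FIRING
    return State.OK


# A 90-minute-old monitor burning 40x on every window.
fast = evaluate_pair(90, 60, 14.4, long_burn=40.0, short_burn=40.0)
slow = evaluate_pair(90, 360, 6.0, long_burn=40.0, short_burn=40.0)
print(fast, slow, evaluate_detector(fast, slow))
```

Here the fast pair is eligible and firing while the slow pair is still Unknown. Under the assumption above, the detector fires.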

Tuning

The 14.4× and 6× thresholds and the 1h/5m and 6h/30m windows are not user-tunable. On Business and Founder plans you can tune the SLO inputs that determine what counts as a 1× rate:

Availability SLO target percentage
Rolling window length
Latency target percentile
Latency-excluded regions

The latency threshold (latency_threshold_ms) is tunable per monitor on every plan.