Multi-window burn-rate thresholds explained

Nines uses two burn-rate window pairs: 1h+5m at 14.4× (fast) and 6h+30m at 6× (slow). Both windows in a pair must exceed the threshold for the pair to fire.

Definitions

Burn rate
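How fast errors consume the error budget, relative to the rate that would spend exactly the whole budget over the SLO window. burn_rate = error_fraction / (1 - slo_target). A 1× burn spends budget exactly as fast as it refills, so it is sustainable indefinitely.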
Time to exhaustion
slo_window_length / burn_rate, starting from a full budget and holding the burn rate constant. At 14.4×, a 7-day budget exhausts in ~12 hours and a 30-day budget in ~50 hours.
Window pair
A long window plus a short window with a shared burn-rate threshold. Both windows must exceed the threshold simultaneously for the pair to fire.
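The two formulas above can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative and are not part of any Nines API.

```python
def burn_rate(error_fraction: float, slo_target: float) -> float:
    """Observed error fraction divided by the error budget fraction."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_fraction / budget


def hours_to_exhaustion(slo_window_days: float, rate: float) -> float:
    """Hours until a full budget is spent at a constant burn rate."""
    if rate <= 0:
        return float("inf")  # nothing is being burned
    return slo_window_days * 24 / rate


print(round(burn_rate(0.01, 0.995), 2))         # 2.0
print(round(hours_to_exhaustion(7, 14.4), 1))   # 11.7
print(round(hours_to_exhaustion(30, 14.4), 1))  # 50.0
```

The 11.7 and 50.0 figures are the "~12 hours" and "~50 hours" quoted above.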

Thresholds

Pair   Long window   Short window   Threshold   Time to exhaust 7d budget
Fast   1h            5m             14.4×       ~12h
Slow   6h            30m            6×          ~28h

The 14.4× and 6× constants are fixed in Nines and not user-tunable. They are not affected by the configured SLO window: 14.4× always means "burning 14.4× the sustainable rate" regardless of whether the SLO window is 7 or 30 days.

Derivation

The Google SRE Workbook (Chapter 5) derives each burn-rate threshold from the fraction of SLO budget that may be consumed within the alerting window before an alert should fire:

burn_rate_threshold = budget_pct_to_alert_on / (alert_window / slo_window)

Workbook reference values, against a 30-day SLO window:

  • Fast: 2% of budget consumed in 1h → 0.02 / (1h / 30d) = 14.4×.
  • Slow: 5% of budget consumed in 6h → 0.05 / (6h / 30d) = 6×.
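The derivation can be reproduced numerically. The sketch below uses illustrative function names (not a Nines API). Its second helper inverts the formula to show what share of a 7-day budget the same fixed constants correspond to. That share is plain arithmetic from the formula above, not a figure taken from Nines or the Workbook.

```python
def burn_rate_threshold(budget_fraction: float,
                        alert_window_hours: float,
                        slo_window_days: float) -> float:
    """budget_pct_to_alert_on / (alert_window / slo_window)."""
    return budget_fraction / (alert_window_hours / (slo_window_days * 24))


def budget_fraction_consumed(threshold: float,
                             alert_window_hours: float,
                             slo_window_days: float) -> float:
    """Inverse: share of budget spent by burning at `threshold` for one alert window."""
    return threshold * alert_window_hours / (slo_window_days * 24)


# Workbook reference values against a 30-day SLO window.
print(round(burn_rate_threshold(0.02, 1, 30), 2))  # 14.4
print(round(burn_rate_threshold(0.05, 6, 30), 2))  # 6.0

# The same constants applied to a 7-day SLO window.
print(round(budget_fraction_consumed(14.4, 1, 7), 3))  # 0.086
print(round(budget_fraction_consumed(6.0, 6, 7), 3))   # 0.214
```

Because the constants are fixed, a shorter SLO window means each pair fires after a larger share of the budget has been spent (roughly 8.6% and 21.4% at 7 days, against 2% and 5% at 30 days).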

Why two windows per pair

Long window alone
A 1h rolling average ramps from zero. A service that goes 100% down at 09:00 doesn't trip 14.4× on the 1h window until ~09:25. Detection lag is unacceptable.
Short window alone
A 5m window can be pushed past 14.4× by a single bad minute. False-positive rate is unacceptable.
Both windows
The long window establishes that the degradation is sustained; the short window confirms it is currently active. Both must exceed the threshold to fire.
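The both-windows rule reduces to a logical AND. A minimal sketch follows; the function name is hypothetical, and the strict ">" comparison is an assumption based on the word "exceed" (this page does not say whether a burn rate exactly at the threshold fires).

```python
def pair_fires(long_burn: float, short_burn: float, threshold: float) -> bool:
    """A pair fires only when both of its windows exceed the shared threshold."""
    return long_burn > threshold and short_burn > threshold


FAST_THRESHOLD = 14.4

# Brief blip: the short window spikes, the long window barely moves.
print(pair_fires(long_burn=0.5, short_burn=20.0, threshold=FAST_THRESHOLD))  # False

# Incident already over: the long window is still elevated, the short window is clean.
print(pair_fires(long_burn=20.0, short_burn=0.0, threshold=FAST_THRESHOLD))  # False

# Sustained and still active: both windows are above the threshold.
print(pair_fires(long_burn=40.0, short_burn=40.0, threshold=FAST_THRESHOLD))  # True
```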

Why two pairs

The fast pair catches outages within minutes. The slow pair catches degradations too gradual to push the 1h window past 14.4× but still on track to violate the SLO. Neither pair alone covers both ranges:

  • Fast pair only: an 8× degradation running for half a day never fires.
  • Slow pair only: a total outage takes ~4 hours to fire, instead of ~5 minutes.
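To see which burn rates each pair covers, the sketch below classifies a steady burn rate. It assumes the rate has been held long enough for every window to converge on it and that the monitor is past warmup. The names are illustrative, not a Nines API.

```python
PAIR_THRESHOLDS = {"fast": 14.4, "slow": 6.0}


def pairs_firing_at_steady_burn(burn: float) -> list[str]:
    """Pairs whose threshold a constant burn rate eventually exceeds."""
    return [name for name, threshold in PAIR_THRESHOLDS.items() if burn > threshold]


print(pairs_firing_at_steady_burn(2.0))    # []
print(pairs_firing_at_steady_burn(8.0))    # ['slow']
print(pairs_firing_at_steady_burn(200.0))  # ['fast', 'slow']
```

The 8× case is the degradation from the first bullet: it stays below the fast threshold, so only the slow pair reports it.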

Worked examples

All examples assume a 99.5% availability SLO over 7 days.

Persistent 1-of-5 region failure
20% error rate, 40× burn. Region-failure detector silent (1 of 5 is not a majority). Both fast windows cross 14.4× within minutes; fast pair fires.
Free-plan 2-region monitor, 1 region permanently broken
50% error rate, 100× burn. Region-failure silent (1 of 2 is exactly half). Fast pair fires within minutes and stays open.
1-in-200 errors
0.5% error rate, 1× burn. Sustainable. No alert.
1-in-100 errors
1% error rate, 2× burn. Below 6× threshold. No incident; budget-remaining graph trends down. 7-day budget exhausts in ~3.5 days.
Single 5-second blip
Short window spikes briefly; long window barely moves. Pair does not fire.
Three short outages in a day, each auto-resolved
Each region-failure incident opens and closes on its own, but the errors still count against the rolling 7-day budget. If the outages fall close enough together that both the 6h and 30m windows exceed 6×, the slow pair fires.
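The steady-error-rate examples above can be recomputed as follows. This is a sketch under the same assumptions (99.5% SLO, 7-day window, constant error rate); the helper name and the returned labels are illustrative only.

```python
SLO_TARGET = 0.995
SLO_WINDOW_DAYS = 7
FAST_THRESHOLD, SLOW_THRESHOLD = 14.4, 6.0


def summarize(error_rate: float) -> tuple[float, float, str]:
    """Return (burn rate, days to exhaust a full budget, most urgent pair that fires)."""
    burn = round(error_rate / (1 - SLO_TARGET), 6)  # rounding hides float noise
    days = round(SLO_WINDOW_DAYS / burn, 3)
    if burn > FAST_THRESHOLD:
        pair = "fast"
    elif burn > SLOW_THRESHOLD:
        pair = "slow"
    else:
        pair = "none"
    return burn, days, pair


print(summarize(0.20))   # (40.0, 0.175, 'fast')   1-of-5 regions failing
print(summarize(0.50))   # (100.0, 0.07, 'fast')   1-of-2 regions failing
print(summarize(0.005))  # (1.0, 7.0, 'none')      1-in-200 errors
print(summarize(0.01))   # (2.0, 3.5, 'none')      1-in-100 errors
```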

Warmup gate

A pair only evaluates once the monitor's age is at least the long window. Before then the pair returns Unknown and cannot fire. If both pairs are ineligible (monitor < 1h old) the detector returns Unknown for that monitor — see Detectors for full Unknown semantics.
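A sketch of the gate as a three-state evaluation. The enum, the function names, and the way the two pair verdicts are combined when only one pair is eligible are assumptions for illustration. This page only specifies that an ineligible pair returns Unknown and that the detector returns Unknown when both pairs are ineligible.

```python
from enum import Enum


class Verdict(Enum):
    UNKNOWN = "unknown"
    OK = "ok"
    FIRING = "firing"


def evaluate_pair(monitor_age_hours: float, long_window_hours: float,
                  long_burn: float, short_burn: float, threshold: float) -> Verdict:
    """A pair is only eligible once the monitor is at least as old as its long window."""
    if monitor_age_hours < long_window_hours:
        return Verdict.UNKNOWN
    if long_burn > threshold and short_burn > threshold:
        return Verdict.FIRING
    return Verdict.OK


def evaluate_detector(monitor_age_hours: float,
                      fast_burns: tuple[float, float],
                      slow_burns: tuple[float, float]) -> Verdict:
    """Combine both pairs; each tuple is (long_burn, short_burn)."""
    verdicts = [
        evaluate_pair(monitor_age_hours, 1, *fast_burns, 14.4),
        evaluate_pair(monitor_age_hours, 6, *slow_burns, 6.0),
    ]
    if Verdict.FIRING in verdicts:
        return Verdict.FIRING
    if all(v is Verdict.UNKNOWN for v in verdicts):
        return Verdict.UNKNOWN
    return Verdict.OK  # assumption: one eligible, quiet pair is enough for OK


print(evaluate_detector(0.5, (40, 40), (40, 40)))  # Verdict.UNKNOWN
print(evaluate_detector(2.0, (40, 40), (40, 40)))  # Verdict.FIRING
print(evaluate_detector(2.0, (1, 1), (1, 1)))      # Verdict.OK
```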

Tuning

The 14.4× and 6× thresholds and the 1h/5m and 6h/30m windows are not user-tunable. On the Business and Founder plans you can tune the SLO inputs that determine what counts as a 1× rate:

  • Availability SLO target percentage
  • Rolling window length
  • Latency target percentile
  • Latency-excluded regions

The latency threshold (latency_threshold_ms) is tunable per monitor on every plan.

See also