Burn-Rate Incidents
Nines automatically opens an incident when your error budget is burning significantly faster than your SLO window allows — before the budget is exhausted. This gives you time to act while there is still headroom.
Why burn-rate alerting?
A simple threshold alert on budget-remaining fires too late: by the time the budget reaches zero, the damage is done. Burn-rate alerting fires early, when the rate of consumption indicates you will run out of budget before the window ends — even if plenty of budget remains right now.
This approach is described in the Google SRE Workbook (Chapter 5) and uses two pairs of windows to balance precision and recall.
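To make "rate of consumption" concrete, here is a minimal Python sketch. It assumes the usual SRE Workbook definition, where burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The function names and the 99.9% target are illustrative assumptions, not part of the Nines API.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target).

    A value of 1.0 means the budget lasts exactly as long as the SLO window.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def hours_to_exhaustion(rate: float, window_hours: float) -> float:
    """Hours until a full budget is spent if `rate` is sustained."""
    if rate <= 0:
        return float("inf")
    return window_hours / rate


# Assumed example: 1.44% of checks failing against a 99.9% SLO, 7-day window.
rate = burn_rate(error_ratio=0.0144, slo_target=0.999)
print(round(rate, 1))                               # 14.4
print(round(hours_to_exhaustion(rate, 7 * 24), 1))  # 11.7
```

At that rate the budget would be gone in under half a day, even though most of it is still unspent when the burn starts.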
Multi-window burn-rate detection
Nines evaluates two burn-rate conditions continuously for each monitor that has an SLO configured:
- Fast burn — both the 1-hour window and the 5-minute window exceed 14.4× the sustainable burn rate. A 14.4× rate over a 7-day window consumes the entire error budget in roughly 12 hours. The short 5-minute confirmation window prevents a single spike from triggering a false alarm.
- Slow burn — both the 6-hour window and the 30-minute window exceed 6× the sustainable rate. A 6× rate over 7 days exhausts the budget in about 28 hours. The longer windows are necessary to detect a slower drain that the 1-hour window might miss.
An incident is opened if either condition fires. Once open, it resolves automatically when the burn rate falls below the threshold and stays below it for a cooldown period.
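The two conditions above can be sketched as a small Python function. This is an illustration under stated assumptions, not Nines' actual implementation: the `WindowRates` shape and the function name are hypothetical, and the burn rates are assumed to be precomputed for each lookback window.

```python
from dataclasses import dataclass
from typing import Optional

FAST_THRESHOLD = 14.4  # applied to the 1-hour and 5-minute windows
SLOW_THRESHOLD = 6.0   # applied to the 6-hour and 30-minute windows


@dataclass
class WindowRates:
    """Burn rate per lookback window (hypothetical shape)."""
    m5: float
    m30: float
    h1: float
    h6: float


def firing_condition(r: WindowRates) -> Optional[str]:
    """Return "fast", "slow", or None when neither condition holds."""
    # Both windows of a pair must exceed the threshold.
    if r.h1 > FAST_THRESHOLD and r.m5 > FAST_THRESHOLD:
        return "fast"
    if r.h6 > SLOW_THRESHOLD and r.m30 > SLOW_THRESHOLD:
        return "slow"
    return None


# A brief spike: the 5-minute window is hot, but the long windows are not.
print(firing_condition(WindowRates(m5=40.0, m30=8.0, h1=3.0, h6=1.0)))    # None
# A sustained outage: both the 1-hour and 5-minute windows exceed 14.4x.
print(firing_condition(WindowRates(m5=20.0, m30=20.0, h1=16.0, h6=3.0)))  # fast
# A slower drain: below 14.4x, but above 6x on the 6-hour and 30-minute windows.
print(firing_condition(WindowRates(m5=7.0, m30=7.0, h1=7.0, h6=6.5)))     # slow
```

Requiring both windows of a pair is what filters out short spikes: the long window shows the burn is significant, and the short window shows it is still happening.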
Availability and latency burn-rate incidents
Burn-rate incidents are raised independently for the two SLO types:
- Availability burn (availability_burn): triggered when the ratio of failed checks is consuming the availability error budget too fast.
- Latency burn (latency_burn): triggered when the fraction of slow checks (those exceeding the configured threshold) is consuming the latency error budget too fast. Only available for http_check monitors with a latency SLO configured.
A monitor can have both types of burn-rate incident open simultaneously if both SLOs are degrading at the same time.
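As a rough Python illustration of how the two signals differ, the sketch below computes the "bad" fraction for each SLO type from the same set of check results. The `CheckResult` shape and function names are hypothetical, and counting slow checks over all checks (rather than only successful ones) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class CheckResult:
    """One check outcome (hypothetical shape)."""
    ok: bool
    latency_ms: float


def availability_bad_ratio(checks: Sequence[CheckResult]) -> Optional[float]:
    """Fraction of failed checks, or None when there is no data."""
    if not checks:
        return None
    return sum(1 for c in checks if not c.ok) / len(checks)


def latency_bad_ratio(checks: Sequence[CheckResult],
                      threshold_ms: float) -> Optional[float]:
    """Fraction of checks slower than the threshold, or None when there is no data."""
    if not checks:
        return None
    return sum(1 for c in checks if c.latency_ms > threshold_ms) / len(checks)


checks = [
    CheckResult(ok=True, latency_ms=120),
    CheckResult(ok=True, latency_ms=900),
    CheckResult(ok=True, latency_ms=700),
    CheckResult(ok=False, latency_ms=100),
]
print(availability_bad_ratio(checks))                  # 0.25
print(latency_bad_ratio(checks, threshold_ms=500))     # 0.5
```

Because the two ratios are computed independently and measured against separate budgets, one can burn while the other stays healthy, or both can burn at once.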
Owner-only visibility
Burn-rate incidents are an internal reliability signal. They are visible only to the account owner on the monitor detail page and the incidents list. They do not appear on your public status page — your users see only region-failure incidents, not SLO budget health. This separation ensures that SLO math stays internal to your team.
Behavior when data is unavailable
If burn-rate data is briefly unavailable (for example, because a monitor has just been created and has not yet collected enough history), Nines treats the burn rate as unknown rather than assuming it is zero or above the threshold:
- If no incident is open, none is opened. No data is not evidence of a problem.
- If an incident is already open, it stays open and the cooldown clock is paused. Nines does not resolve the incident on missing data — it waits for confirmed recovery.
This prevents spurious resolutions during monitoring infrastructure blips, and avoids opening false alarms when a monitor first starts collecting data.
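The open, resolve, and missing-data rules can be sketched as a small state machine in Python. This is a hedged illustration only: the 10-minute cooldown, the `IncidentState` shape, and the `step` function are assumptions for the example, not documented Nines values or APIs. A `firing` value of `None` stands for "burn rate unknown".

```python
from dataclasses import dataclass
from typing import Optional

COOLDOWN_SECONDS = 600.0  # assumed value; the real cooldown is not stated here


@dataclass
class IncidentState:
    is_open: bool = False
    healthy_seconds: float = 0.0  # cooldown progress while below threshold


def step(state: IncidentState, firing: Optional[bool],
         elapsed: float) -> IncidentState:
    """Advance one evaluation; `elapsed` is seconds since the previous one."""
    if firing is None:
        # Unknown data: never open, never resolve, cooldown stays paused.
        return state
    if firing:
        # Burn confirmed: open (or keep open) and reset the cooldown.
        return IncidentState(is_open=True, healthy_seconds=0.0)
    if not state.is_open:
        return state
    healthy = state.healthy_seconds + elapsed
    if healthy >= COOLDOWN_SECONDS:
        return IncidentState(is_open=False, healthy_seconds=0.0)
    return IncidentState(is_open=True, healthy_seconds=healthy)


s = IncidentState()
s = step(s, None, 60)    # no data on a new monitor: nothing opens
s = step(s, True, 60)    # burn detected: incident opens
s = step(s, False, 300)  # below threshold: 300 s of cooldown accrued
s = step(s, None, 300)   # data gap: still open, cooldown paused at 300 s
s = step(s, False, 300)  # confirmed healthy again: cooldown completes
print(s.is_open)         # False
```

Tracking accumulated healthy time rather than a wall-clock deadline is what lets the cooldown pause during a gap instead of silently expiring.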
Relationship to region-failure incidents
Region-failure incidents (opened when all regions agree the target is unreachable) are separate from burn-rate incidents. A region-failure incident can coexist with a burn-rate incident on the same monitor. The two types are distinguishable in the incidents list by the SLI type label shown on each card.