Incident detectors: region-failure and burn-rate

Nines runs two incident detectors per monitor. Both run on every plan.

Detectors

Region-failure detector: Opens an incident when a strict majority of probe regions report down or error on the most recent check. Auto-resolves when all regions report up.
Burn-rate detector: Opens an incident when SLO error budget is consumed faster than it refills, evaluated across two time-window pairs (see Multi-window thresholds). It covers two SLI types: availability_burn and latency_burn. latency_burn runs only on http_check monitors.

Region-failure detector

Each tick, Nines reads the most recent result per probe region and counts regions reporting down or error. The threshold is strict majority: down × 2 > total_regions.

10-region monitor: 6 down to fire (5 is exactly half).
5-region monitor: 3 down to fire.
2-region monitor: 2 down to fire (1-of-2 is exactly half, never a strict majority).
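The strict-majority rule above can be sketched in a few lines. This is only an illustration: the function name and the status strings "up", "down", and "error" are assumptions for this example, not Nines API identifiers.

```python
def region_failure_fires(statuses: list[str]) -> bool:
    """Return True when a strict majority of regions report down or error.

    `statuses` holds the most recent result per probe region
    (hypothetical string values: "up", "down", "error").
    """
    down = sum(1 for s in statuses if s in ("down", "error"))
    # Strict majority: exactly half is not enough.
    return down * 2 > len(statuses)


if __name__ == "__main__":
    print(region_failure_fires(["down"] * 5 + ["up"] * 5))  # False: 5 of 10 is exactly half
    print(region_failure_fires(["down"] * 6 + ["up"] * 4))  # True: 6 of 10
    print(region_failure_fires(["down", "up"]))             # False: 1 of 2 is exactly half
```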

An open region-failure incident auto-resolves the moment all regions report up. The incident title is {monitor name}: down.

Burn-rate detector

Burn rate is the ratio of the observed error rate to the error rate the SLO allows (1 − target), which is the rate at which the budget refills. A 1× burn is sustainable indefinitely; a 14.4× burn exhausts a 7-day budget in roughly 12 hours (168 h / 14.4 ≈ 11.7 h).
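As a rough illustration of the arithmetic, the sketch below computes a burn rate from an observed error rate and an SLO target, and the time a constant burn would take to exhaust a rolling budget. The function names are hypothetical; the 99.5% target and 7-day window are the defaults this page lists.

```python
def burn_rate(error_rate: float, slo_target: float = 0.995) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed = 1.0 - slo_target  # e.g. 0.5% for a 99.5% target
    return error_rate / allowed


def hours_to_exhaustion(rate: float, window_days: float = 7.0) -> float:
    """Hours until the whole budget is gone at a constant burn rate."""
    return window_days * 24.0 / rate


if __name__ == "__main__":
    # A 7.2% error rate against a 99.5% target is a 14.4x burn.
    print(round(burn_rate(0.072), 2))
    # At 14.4x, a 7-day budget lasts a little under 12 hours.
    print(round(hours_to_exhaustion(14.4), 1))
```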

Each tick (1 minute), the detector evaluates two window pairs per monitor and per SLI type:

Pair	Long window	Short window	Threshold
Fast	1h	5m	14.4×
Slow	6h	30m	6×

Both windows in a pair must exceed the threshold for the pair to fire. If either pair fires, an incident opens. The incident auto-resolves when all evaluated pairs return below threshold and stay below for a 5-minute cooldown.
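The firing rule can be sketched as follows. The `WindowPair` type and the `burn_by_window` mapping (window label to measured burn rate) are assumptions made for this example; only the window sizes and thresholds come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowPair:
    name: str
    long_window: str
    short_window: str
    threshold: float


# Values taken from the table above.
PAIRS = (
    WindowPair("fast", "1h", "5m", 14.4),
    WindowPair("slow", "6h", "30m", 6.0),
)


def pair_fires(pair: WindowPair, burn_by_window: dict[str, float]) -> bool:
    """A pair fires only when BOTH of its windows exceed the threshold."""
    return (
        burn_by_window[pair.long_window] > pair.threshold
        and burn_by_window[pair.short_window] > pair.threshold
    )


def should_open_incident(burn_by_window: dict[str, float]) -> bool:
    """An incident opens if EITHER pair fires."""
    return any(pair_fires(p, burn_by_window) for p in PAIRS)
```

Requiring the short window as well as the long one means a burst that has already ended does not keep the pair firing once the short window has cleared.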

Warmup gate

A burn-rate window pair only evaluates once the monitor's age is at least the long window:

Fast pair (1h+5m): evaluated once the monitor is 1h old.
Slow pair (6h+30m): evaluated once the monitor is 6h old.

Until the monitor reaches that age, the pair returns Unknown. No incident opens, and an open incident is not auto-resolved. If no pair is eligible (monitor under 1h old), the entire detector returns Unknown for that monitor. The region-failure detector is not gated by warmup.
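A minimal sketch of the warmup gate, assuming monitor age is available as a `timedelta`. The names `LONG_WINDOWS` and `eligible_pairs` are illustrative only.

```python
from datetime import timedelta

# Long window of each pair, as listed above.
LONG_WINDOWS = {
    "fast": timedelta(hours=1),
    "slow": timedelta(hours=6),
}


def eligible_pairs(monitor_age: timedelta) -> list[str]:
    """Pairs whose long window is fully covered by the monitor's age.

    An empty list means the whole detector reports Unknown.
    """
    return [name for name, window in LONG_WINDOWS.items() if monitor_age >= window]


if __name__ == "__main__":
    print(eligible_pairs(timedelta(minutes=30)))  # []
    print(eligible_pairs(timedelta(minutes=90)))  # ['fast']
    print(eligible_pairs(timedelta(hours=6)))     # ['fast', 'slow']
```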

Unknown state

The burn-rate detector treats VM query errors, missing data, and warmup ineligibility identically:

If no incident is open, none is opened.
If an incident is open, it stays open and the cooldown clock is paused.

This prevents spurious resolution during transient datasource issues.
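One way to picture these rules is as a small state machine stepped once per tick. The sketch below is an assumption-laden illustration, not the product's implementation: the `Verdict` and `IncidentState` types are hypothetical, and the 5-minute cooldown is modeled as five below-threshold ticks at one tick per minute.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    FIRING = "firing"    # at least one evaluated pair is above threshold
    OK = "ok"            # all evaluated pairs are below threshold
    UNKNOWN = "unknown"  # query error, missing data, or warmup


@dataclass(frozen=True)
class IncidentState:
    is_open: bool = False
    ok_ticks: int = 0  # below-threshold ticks counted toward the cooldown


COOLDOWN_TICKS = 5  # assumption: 5-minute cooldown at one tick per minute


def step(state: IncidentState, verdict: Verdict) -> IncidentState:
    if verdict is Verdict.UNKNOWN:
        # Never opens, never resolves; the cooldown count is kept as-is (paused).
        return state
    if verdict is Verdict.FIRING:
        # Opens the incident, or restarts the cooldown on an open one.
        return IncidentState(is_open=True, ok_ticks=0)
    if not state.is_open:
        return state
    ok_ticks = state.ok_ticks + 1
    if ok_ticks >= COOLDOWN_TICKS:
        return IncidentState(is_open=False, ok_ticks=0)
    return IncidentState(is_open=True, ok_ticks=ok_ticks)
```

In this model, a run of Unknown ticks in the middle of a cooldown neither resolves the incident nor discards the below-threshold ticks already counted.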

Detector coordination

The two detectors run independently with one suppression rule:

While a region-failure incident is open on a monitor, the availability_burn evaluation for that monitor is skipped, so no duplicate "service down" incident card appears.
latency_burn evaluation is not suppressed and runs on every tick regardless of region-failure state.
When the region-failure incident resolves, availability_burn evaluation resumes on the next tick.
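The suppression rule, combined with the restriction of latency_burn to http_check monitors, might be sketched like this. The function name is hypothetical, and "tcp_check" in the usage lines is an invented monitor type used only to show the non-HTTP case.

```python
def slis_to_evaluate(monitor_type: str, region_failure_open: bool) -> list[str]:
    """SLI types the burn-rate detector evaluates on this tick."""
    slis = []
    # availability_burn is skipped while a region-failure incident is open.
    if not region_failure_open:
        slis.append("availability_burn")
    # latency_burn is never suppressed, but exists only for http_check monitors.
    if monitor_type == "http_check":
        slis.append("latency_burn")
    return slis


if __name__ == "__main__":
    print(slis_to_evaluate("http_check", region_failure_open=False))
    print(slis_to_evaluate("http_check", region_failure_open=True))
    print(slis_to_evaluate("tcp_check", region_failure_open=True))  # hypothetical type
```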

Failure modes burn-rate catches that region-failure misses

Persistent minority failure: 1 of 5 regions fails every check, a 20% error rate that never reaches a majority. Against a 99.5% availability SLO that is a 40× burn rate (20% / 0.5%). The fast pair fires within minutes.
Sub-majority flapping: Different regions fail at different times, never simultaneously a majority. Snapshot-based region-failure stays clear; rolling-window burn rate sees continuous degradation.
Free-plan 2-region monitor with 1 region down: 1 of 2 is exactly half, not a strict majority, so region-failure never fires. The error rate is 50%, a 100× burn against the default 99.5% target, and the fast pair fires.
Multi-region blip between ticks: A 30-second outage that hits all regions but recovers before the next tick is invisible to the snapshot-based detector. The rolling window still records the failed checks.
Cumulative damage from short, resolved outages: Three 90-second outages in a day, each opening a region-failure incident that auto-resolves. The 7-day SLO window remembers all of them; if the cumulative pattern pushes both slow-pair windows above 6×, the slow pair fires.
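To give a feel for how quickly the fast pair reacts to a minority failure, here is a rough simulation. It rests on several assumptions that this page does not state: every region is checked once per minute, the monitor has a clean history and is past warmup, and each window's error rate is the plain average of per-minute error fractions. The function name is hypothetical.

```python
from collections import deque
from typing import Optional

SLO_TARGET = 0.995      # default availability target
FAST_THRESHOLD = 14.4   # fast pair threshold (1h + 5m windows)


def minutes_until_fast_pair_fires(failing: int, total: int) -> Optional[int]:
    """Minutes after failure onset until both fast-pair windows exceed 14.4x."""
    allowed = 1.0 - SLO_TARGET
    # Per-minute error fractions; start from a fully clean history.
    long_w = deque([0.0] * 60, maxlen=60)  # 1h window
    short_w = deque([0.0] * 5, maxlen=5)   # 5m window
    per_minute_error = failing / total
    for minute in range(1, 61):
        long_w.append(per_minute_error)
        short_w.append(per_minute_error)
        long_burn = (sum(long_w) / len(long_w)) / allowed
        short_burn = (sum(short_w) / len(short_w)) / allowed
        if long_burn > FAST_THRESHOLD and short_burn > FAST_THRESHOLD:
            return minute
    return None


if __name__ == "__main__":
    print(minutes_until_fast_pair_fires(1, 5))
    print(minutes_until_fast_pair_fires(1, 2))
```

Under these assumptions the sketch returns 22 minutes for 1 of 5 regions failing and 9 minutes for 1 of 2; real timing depends on check frequency and on how the windows are actually computed.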

Plan tier matrix

Capability	Free	Pro	Business	Founder
Region-failure detector	✓	✓	✓	✓
Burn-rate detector (availability + latency)	✓	✓	✓	✓
SLO panels visible	✓	✓	✓	✓
`latency_threshold_ms` tunable per monitor	✓	✓	✓	✓
Availability target / rolling window / latency percentile / excluded regions	—	—	✓	✓

Defaults applied to monitors that don't set custom values: 99.5% availability over 7 days, 500 ms p95 over 7 days. Free and Pro monitors use these defaults; the latency threshold can still be tuned per monitor on Free and Pro.

Incident visibility

Burn-rate incidents are visible to the account owner on the monitor detail page and the incidents list. They never appear on a public status page. Region-failure incidents appear both in those owner-facing views and on public status pages.