
Why Your Uptime Monitor's Alerting Is Broken (and What Burn-Rate Alerting Fixes)

It's the third of the month and the SLO report is open on your screen. 99.5% availability. Your target is 99.9%. You're in the red, deploys are frozen, and the engineering channel wants to know what happened. You scroll back through the incident log. Two pages, both resolved in under ten minutes. A handful of brief blips nobody followed up on. Nothing that looks like it should add up to the 43 minutes of budget a 99.9% target allows for the month, let alone the three-plus hours the report says you actually spent.

The monitor isn't lying. The percentage it computed is real. What's broken is the alerting that sits on top of it. It was built to answer a different question than the one your SLO is asking, and the gap between those questions is where your error budget went.

This isn't a tuning problem. The traditional alerting model (count consecutive failures, fire at a threshold) is twenty years old and was designed for a world where the only real question was "is the box up?" If you have an SLO, you need a different model. Google's SRE Workbook calls it multi-window, multi-burn-rate alerting. It's what runs in front of services with hundreds of millions of users, and it fits in a handful of constants and a query loop.

Two Failure Modes, One Bad Model

Most uptime monitors give you exactly one alerting knob: how many consecutive checks must fail before a pager fires. Three is the default everywhere I've looked. Five is what you set after the third false alarm. Vendors dress this up with extras like "confirm from a second region" or "wait 90 seconds," but the underlying model never changes. Count failures, compare to a threshold, alert.

That model fails in two opposite directions at the same time.

It pages you for a two-second blip

Your service hiccups for two seconds. Maybe a GC pause, maybe a flaky transit link. From a monitor polling every 30 seconds out of three regions, you get one or two failed checks bunched in a small window. If your rule is three consecutive failures from a single region at 30-second intervals, the bar is only 90 seconds of trouble, well within reach of a blip that resolved before any user noticed.

Your phone buzzes at 3 AM. You SSH in, the dashboards are green, the logs say nothing. You go back to bed angry. By Friday you've been paged six times for events that together cost maybe 30 seconds of real user impact. The next time the pager goes off you reach for snooze without reading it. That habit is the actual cost of false positives, not the lost sleep.

It stays silent through a four-hour slow burn

Now flip it. A bad deploy ships and your service starts returning 500s on about 8% of requests. At 8%, three consecutive failures from one prober is roughly a one-in-two-thousand event per check, so the failures stay scattered across regions and minutes, never three in a row from anyone. Your alert never trips. Your status page shows green because "consecutive failures" is the only thing it knows how to color red.

Four hours later you've burned through 40% of your monthly error budget. You find out from a support ticket. The pattern was intermittent, and intermittent is exactly the case this model is blind to.

Of the two, this is the worse one. False positives train you to ignore the pager. False negatives let real damage compound until a customer tells you about it.

It's the same bug both times

These aren't two settings you can tune independently. They're the same defect. The alert is counting raw failures in a fixed window, and raw failure counts carry no information about how much of your budget is being consumed. A two-second blip and a four-hour 8% degradation can look similar in raw counts if the window is the wrong size. They have nothing in common as far as a 99.9% SLO is concerned. The model has no way to tell them apart, so whatever threshold you pick will be wrong for one of them and usually wrong for both.

What an SLO Alert Should Actually Ask

The question your alert should answer is not "did N checks fail in a row." It's: at the current failure rate, am I going to blow my error budget before the month resets?

That's what burn rate means. It's a unitless multiplier on how fast you're consuming budget. A burn rate of 1.0 means you're spending it exactly as fast as it accrues; sustainable forever. A burn rate of 2.0 exhausts your monthly budget in half a month. 14.4 exhausts it in about 50 hours. 100 means you should already be in the incident channel.

The math is one division. For a 99.9% SLO the allowed error rate is 0.001. If you observe an error rate E over a window, your burn rate is E / 0.001. A few reference points:

  • Error rate 0.1% gives burn rate 1.0 (sustainable)
  • Error rate 1% gives burn rate 10x
  • Error rate 10% gives burn rate 100x
  • Error rate 100% (a full outage) gives burn rate 1000x

What you want to alert on isn't a count of failures. It's a burn rate that's been high enough, for long enough, that your budget is in trouble.
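
If it helps to see the division as code, here's a throwaway Go helper. It's my sketch, not the Nines worker; the function name and arguments are made up for illustration.

// burnRate turns an observed error rate into a burn-rate multiplier for a
// given SLO target. For a 99.9% target the allowed error rate is 0.001.
func burnRate(errorRate, sloTarget float64) float64 {
    return errorRate / (1 - sloTarget)
}

// burnRate(0.001, 0.999) -> 1     sustainable forever
// burnRate(0.01, 0.999)  -> 10
// burnRate(0.10, 0.999)  -> 100
// burnRate(1.00, 0.999)  -> 1000  full outage
// At burn rate B, a 30-day budget lasts 720/B hours: 14.4x -> about 50 hours.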

Multi-Window, Multi-Burn-Rate

The cleanest writeup of this lives in chapter 5 of Google's SRE Workbook (O'Reilly, 2018, free to read online). The pattern has two properties that matter:

  1. It uses two burn-rate thresholds, not one: a high threshold for fast burns and a lower one for slow burns.
  2. For each threshold, it requires two windows to agree before firing. A long window confirms the burn is sustained. A short window confirms it's still happening, not trailing residue from an incident that already resolved.

The two pairs Nines runs are taken straight from the workbook:

  • Fast pair. 1-hour window AND 5-minute window, both at or above 14.4x burn rate. This is the wake-someone-up alert. It catches "we're losing roughly 2% of the monthly budget per hour."
  • Slow pair. 6-hour window AND 30-minute window, both at or above 6x burn rate. This catches the milder sustained degradations that slip under the fast pair's threshold. Worth a business-hours look before it eats the rest of the month.

The thresholds aren't arbitrary. 14.4x over an hour means you've burned about 2% of a 30-day budget by the time the alert fires. You still have most of the month to recover. 6x over six hours means roughly 5% burned at fire time. Slower, but caught well before the budget is gone. The full derivation is in the workbook. The short version is that these numbers trade detection speed against alert volume, and they've been pressure-tested against real Google services.
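
You can sanity-check those percentages with one more line of arithmetic: budget consumed is burn rate times window length divided by the SLO period. A quick sketch, again mine rather than the worker's:

// budgetConsumed is the fraction of a 30-day error budget spent by holding
// a given burn rate for a window of the given length (both in hours).
func budgetConsumed(burn, windowHours float64) float64 {
    const monthHours = 30 * 24 // 720
    return burn * windowHours / monthHours
}

// budgetConsumed(14.4, 1) -> 0.02  (2% of the month, the fast pair)
// budgetConsumed(6, 6)    -> 0.05  (5%, the slow pair)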

Why the AND matters

The two-window AND is the part that trips people up. Why not just alert when the long window crosses threshold?

Because long windows are slow to clear. Suppose you have a 30-minute outage at 100x burn. Those 30 bad minutes stay inside the trailing 1-hour window long after the service is healthy again, so the window keeps reading above 14.4x for most of the following hour. Alert on the long window alone and you page repeatedly on an incident that's already resolved.

The short window fixes that. The 5-minute window in the fast pair clears within five minutes of burn dropping below threshold. With the AND in place, the alert fires only when the burn is both sustained and current, and it clears without waiting for the long window to fully flush the incident out of its history. The long window suppresses noise. The short window keeps you honest. Drop either one and you don't have the property you wanted.
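
If it helps, the condition for a single pair is just this. A sketch with names I made up; the worker's real signature will differ.

// The fire condition for one pair: the long window says the burn is
// sustained, the short window says it's still happening right now.
func pairFiring(longBurn, shortBurn, threshold float64) bool {
    return longBurn >= threshold && shortBurn >= threshold
}

// For the 30-minute outage above: half an hour after recovery the trailing
// 1-hour window still reads about 50x, but the 5-minute window has been
// quiet for 25 minutes, so the pair is no longer firing.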

Why two pairs

Run only the fast pair (1h+5m at 14.4x) and you miss slow burns. A sustained 1% error rate is a 10x burn on a 99.9% SLO, comfortably under 14.4, yet left alone for a day it eats a third of the monthly budget. The fast pair simply doesn't see it.

Run only the slow pair (6h+30m at 6x) and serious burns wait too long for a page. The 6-hour window dilutes fresh damage: at a 10% error rate (100x burn) it needs about 22 minutes of 500s before it crosses 6x, while the fast pair's 1-hour window clears 14.4x in about 9. For something severe enough to wake a human, that gap matters.

Two pairs cover both regimes. The fast pair catches catastrophic-but-transient. The slow pair catches survivable-but-sustained. Together they're the smallest configuration that catches both failure modes the traditional model misses.
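
Both of those paragraphs come from one formula: a trailing window of length W crosses a burn threshold T once it has held T/B of its length at burn rate B. Here's a small helper to check the numbers; it's mine, for illustration only.

// timeToCross returns how long a constant burn rate must persist before a
// trailing window of the given length reads at or above the threshold,
// and false if it never will.
func timeToCross(threshold, burn float64, window time.Duration) (time.Duration, bool) {
    if burn < threshold {
        return 0, false
    }
    return time.Duration(threshold / burn * float64(window)), true
}

// timeToCross(14.4, 10, time.Hour)    -> never: a 1% error rate is invisible to the fast pair
// timeToCross(6, 10, 6*time.Hour)     -> ~3h36m: the slow pair gets it
// timeToCross(6, 100, 6*time.Hour)    -> ~22m
// timeToCross(14.4, 100, time.Hour)   -> ~9m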

The Rule, in Code

The whole rule is small enough to read in one sitting. Here's the configuration block from the Nines burn-rate worker, copied verbatim:

const (
    fastBurnThreshold = 14.4
    slowBurnThreshold = 6.0
)

var allBurnPairs = []burnWindowPair{
    {time.Hour, 5 * time.Minute, fastBurnThreshold},
    {6 * time.Hour, 30 * time.Minute, slowBurnThreshold},
}

Two thresholds, two pairs, four windows. Every minute, for each active monitor, the worker computes the burn rate in each window and applies the AND. If any pair has both windows over threshold, an incident opens. If both windows have been under threshold for a five-minute cooldown, the incident auto-resolves.
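
Here's a sketch of that loop. The struct, the monitor fields, and the errorRate query hook are my guesses at the shape the post describes, not the Nines source, and the warm-up gating and region-failure dedup covered next are omitted.

type burnWindowPair struct {
    long, short time.Duration
    threshold   float64
}

type monitor struct {
    sloTarget         float64   // e.g. 0.999
    incidentOpen      bool
    lastOverThreshold time.Time // last tick on which any window was hot
}

// errorRate stands in for the metrics-store query: the fraction of failed
// checks for this monitor over the trailing window ending at now.
var errorRate func(m *monitor, now time.Time, window time.Duration) float64

// evaluate runs once a minute per active monitor.
func evaluate(m *monitor, now time.Time) {
    allowed := 1 - m.sloTarget // 0.001 for a 99.9% SLO
    firing := false
    for _, p := range allBurnPairs {
        longBurn := errorRate(m, now, p.long) / allowed
        shortBurn := errorRate(m, now, p.short) / allowed
        if longBurn >= p.threshold && shortBurn >= p.threshold {
            firing = true // both windows agree: sustained and still happening
        }
        if longBurn >= p.threshold || shortBurn >= p.threshold {
            m.lastOverThreshold = now // something is still hot; hold the cooldown
        }
    }
    if firing {
        m.incidentOpen = true
        return
    }
    // Auto-resolve once every window has stayed under threshold for the cooldown.
    if m.incidentOpen && now.Sub(m.lastOverThreshold) >= 5*time.Minute {
        m.incidentOpen = false
    }
}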

Two parts of the implementation are worth calling out, because they're where the workbook example meets the messiness of running this in production.

Warm-up gating

A 1-hour window has nothing useful to say about a monitor that was created five minutes ago. The data isn't there. A naive implementation either reports zero, which silently clears the alert, or reports a number computed from a partial window, which is worse because it looks real.

The worker filters out any pair whose long window hasn't fully elapsed since the monitor was created. New monitors don't page for the first hour, or the first six hours for the slow pair. The cooldown clock stays paused until they're warm. The same logic applies when the metrics store is briefly unavailable. Burn state is unknown, so no state change fires and no false signal goes out.
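rm-up, the gate is a filter over the pairs. Roughly this, with field and function names assumed:

// eligiblePairs drops any pair whose long window hasn't fully elapsed since
// the monitor was created, so a partial window can never page or clear.
func eligiblePairs(createdAt, now time.Time) []burnWindowPair {
    var out []burnWindowPair
    for _, p := range allBurnPairs {
        if now.Sub(createdAt) >= p.long {
            out = append(out, p)
        }
    }
    return out
}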

Don't double-page on a hard outage

Burn-rate alerting and region-failure alerting answer different questions, but they fire on overlapping symptoms. A service that's hard-down should produce one incident, not two. So the worker checks for an open region-failure incident before it evaluates availability burn. If the region detector already owns the "service is down" signal, burn-rate evaluation skips that monitor for the cycle. You see one card on the incidents page, not two redundant ones pointing at the same outage.

Region-failure detection uses majority voting. An incident only opens when a strict majority of active regions report failure, so a single Cloudflare PoP having a bad minute doesn't trip it. (Why strict-majority beats both single-region and unanimous voting is its own post; I'll get to it.)
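
Strict majority, for what it's worth, is a one-line predicate. The name below is mine, not the Nines source:

// regionMajorityDown reports whether a strict majority of active regions
// currently see the monitor as failing: 1 of 3 is noise, 2 of 3 opens an incident.
func regionMajorityDown(failedRegions, activeRegions int) bool {
    return failedRegions*2 > activeRegions
}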

The Two Failure Modes, Replayed

Run the two scenarios from the top of this post against the burn-rate rule.

The two-second blip. Two failed checks in a 30-second span. With checks every 30 seconds from three regions, the 5-minute short window holds about 30 checks, so that's roughly a 6.7% error rate, or 67x burn on a 99.9% SLO. The 1-hour long window holds about 360 checks, so the same two failures read 0.56%, about 5.6x burn, nowhere near 14.4x. The AND fails. No page. (The slow pair is even further from its thresholds.) The blip shaves a few hundredths of a point off your raw daily uptime and nobody loses sleep.

The four-hour 8% degradation. A sustained 8% error rate is 80x burn on a 99.9% SLO. The 5-minute window reads 80x almost immediately, and about 11 minutes after the deploy the 1-hour window crosses 14.4x, so the fast pair fires. A milder burn that slips under the fast pair, say a sustained 1%, still trips the slow one: the 30-minute window crosses 6x within 20 minutes and the 6-hour window follows after about three and a half hours. Either way the incident opens from the monitor, not from a customer tweet.
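
Plugged into the hypothetical burnRate helper from earlier, the two replays are four divisions (fmt import assumed; the check counts are the ones used above):

func main() {
    // The blip: two failed checks at 30-second intervals from three regions.
    fmt.Println(burnRate(2.0/30, 0.999))  // 5-minute window:  ~67x
    fmt.Println(burnRate(2.0/360, 0.999)) // 1-hour window:    ~5.6x, under 14.4 -> no page

    // The 8% degradation, 11 minutes after the deploy.
    fmt.Println(burnRate(0.08, 0.999))         // 5-minute window: 80x
    fmt.Println(burnRate(0.08*11.0/60, 0.999)) // 1-hour window:   ~14.7x -> fast pair fires
}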

One rule, both cases handled correctly. That's the whole pitch.

Latency Burns the Same Way

Everything above is about availability burn: what fraction of checks failed. Your service can also be fully up and unusable, with every request taking eight seconds. Latency needs the same treatment. Pick a target (say p95 under 500ms 99% of the time), define a budget for slow requests, compute a burn rate against it, alert with the same multi-window pattern.

The Nines latency path uses the same 1h+5m and 6h+30m pairs at the same 14.4x and 6x thresholds, against a latency SLI instead of an availability one. The math is identical. The input changes from fraction of failed checks to fraction of requests over the latency target. Picking a defensible latency target is harder than it sounds, and you can't average p95s across regions. That's another post.
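
As a sketch, the only thing that changes is the input and the budget. The 1% allowance below comes from the "99% of requests under 500ms" example in this section, not necessarily the Nines default:

// latencyBurn treats "request over the latency target" the way availability
// burn treats "failed check". slowFraction is the share of requests over the
// target in the window; 0.01 is the budget for a 99%-under-target objective.
func latencyBurn(slowFraction float64) float64 {
    return slowFraction / 0.01
}

// latencyBurn(0.05) -> 5x: one request in twenty is over target.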

When This Pattern Is the Wrong Tool

Multi-window burn-rate alerting is the right tool when:

  • You have an SLO. If your only question is "is the box up?" and you don't care about percentages, plain consecutive-failure alerts are fine. Burn rate is overkill.
  • You have enough check volume for the math to stabilize. A monitor at 5-minute intervals has 12 data points per hour and the percentages get jittery. 1-minute or shorter intervals give you a much cleaner signal. (My post on check interval and SLO precision goes deeper on why.)
  • You can live with slow burns taking a few hours to page. A burn that only the slow pair catches (above 6x but under 14.4x) needs hours of sustained damage before the 6-hour window crosses its threshold. For most services that's fine; slow burns aren't 3 AM emergencies. If yours are, you'll want a tighter signal alongside this one.

Burn-rate alerting also doesn't replace synthetic checks for "did the deploy break the checkout page?" or smoke tests for "does login still work?" Those answer different questions. Burn rate answers exactly one: am I on track to blow my SLO?

What This Looks Like in Nines

Most uptime monitoring SaaS doesn't ship multi-window burn-rate alerts. The ones that do are enterprise SRE platforms that cost upwards of a thousand dollars a month and expect you to staff an SRE team to configure them.

Nines does this on the free tier. Every active monitor gets multi-window availability burn-rate alerting on the same 14.4x and 6x thresholds Google uses. HTTP monitors also get latency burn-rate alerts at the same thresholds. Region failures go through strict-majority voting. The whole rule is a handful of constants and a query loop, and it runs on every monitor by default. There's no SLO config to fill out.

If you're running a service with an SLO and you're tired of pages that don't matter and silences that do, try Nines free. Add a monitor, leave it alone for an hour while the long windows warm up, and you'll have multi-window burn-rate alerting tuned to your SLO. The rule that pages Google's SREs should be paging yours too.