madhadron, a programmer's miscellany

When should a metric trigger an alert?

Episode Summary

How do you pick when an alert should fire? Not what needs an alert or what metric to alert on. What actual condition should trigger the alert?

Episode Transcription

How do you pick when an alert should fire? Not, what problems should I be woken up for, or, what metric will tell me that, but how do you actually pick the condition under which that metric triggers an alert?

So, we have a metric. It could be error rates or CPU utilization or a bunch of other things. Is a given value acceptable or not? You have to compare it to something.

There are three things you can compare it to: the metric's own history, a cohort of similar metrics from similar things, or a theoretical model of what its value should be.

You can mix and match these in various ways, but in the end, no matter how much obfuscation or fancy machine learning you throw at it, this is what it reduces to.

So, what do you need to consider when you're doing each of these?

Let's start with comparing to its own history. We're going to take the history of this metric and calculate an acceptable range of values from it, for example, have the values of this metric in the last 15 minutes been below the 90th percentile of values of this metric over the last 6 hours?
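To make that concrete, here's a minimal sketch of that check in Python. The function name and the idea that you can pull the metric as a list of (timestamp, value) pairs are assumptions made for illustration, not any particular monitoring system's API.

    import time
    from statistics import quantiles

    def history_alert(samples, now=None):
        # Alert if any value from the last 15 minutes exceeds the 90th
        # percentile of the last 6 hours. `samples` is a list of
        # (unix_timestamp, value) pairs from your metrics store.
        now = now if now is not None else time.time()
        recent = [v for t, v in samples if now - t <= 15 * 60]
        baseline = [v for t, v in samples if now - t <= 6 * 3600]
        if not recent or len(baseline) < 20:
            return False  # not enough history to judge yet
        p90 = quantiles(baseline, n=10)[-1]  # 90th percentile cut point
        return max(recent) > p90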

That's fine, but if you look at something like traffic to a website, it varies throughout the day. What was the 90th percentile overnight might be well below normal traffic when everyone reaches their desks at 9AM, and we would trigger a false alert every morning.

If we want to compare to a metric's history, we need a stationary metric. "Stationary" here is a term from mathematics. It means that the properties of our metric don't depend on time. And if the metric we're using isn't stationary, we need to try to correct it until it is.

To make a metric stationary, we have two options. We can find another metric that moves together with it under normal conditions and divide by it or subtract it off to remove the varying part. Or, if we believe the non-stationary variations we'll see in the future will be like those in the past, we can calculate a prediction of them from the metric's history and subtract that off.
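Here's a rough sketch of both corrections in Python. The choice of dividing errors by requests, and the idea of a fixed seasonal period measured in samples, are assumptions made just for the sketch.

    def ratio_correct(errors, requests):
        # Divide by a metric that moves with it under normal conditions,
        # e.g. turn an error count into an error rate per request.
        return [e / r if r else 0.0 for e, r in zip(errors, requests)]

    def deseasonalize(values, period):
        # Subtract a prediction built from the metric's own history: the
        # average of the values seen at the same phase of earlier cycles.
        out = []
        for i, v in enumerate(values):
            earlier = values[i % period::period][: i // period]
            prediction = sum(earlier) / len(earlier) if earlier else v
            out.append(v - prediction)
        return out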

This kind of calculation was first done by statisticians dealing with seasonal changes in their data, so it goes under the name "seasonality."

Seasonality can get quite involved. Consider the effect of something like Black Friday in the US or Singles Day in East Asia on traffic to a shopping website. Predicting that seasonality isn't possible without years of data. If you find yourself worrying about that level of history for your alerts, or more than the last couple of days or a week of seasonality, you should probably back away and try a different approach.

The second comparison you can make is to a collection of similarly behaved metrics. This makes sense when you have a cohort of similar things that you have reason to believe behave the same under normal conditions, and that won't all react the same way to whatever you're alerting on.

If I'm trying to detect when a disk is failing in a fleet of 50 identical servers, the cohort is useful. It's implausible that the disks are all failing identically on all 50 servers. It's probably only going to happen on one or two, and the rest of them provide a safe comparison.
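Here's a sketch of that kind of cohort check in Python, assuming you have the current value of the metric for each server in a dictionary. The cutoff of 3 is an arbitrary illustration, not a recommendation.

    from statistics import median

    def cohort_outliers(value_by_server, cutoff=3.0):
        # Flag servers whose metric sits far from the rest of the cohort,
        # using the median and median absolute deviation so that one or
        # two misbehaving servers can't drag the baseline along with them.
        values = list(value_by_server.values())
        mid = median(values)
        mad = median(abs(v - mid) for v in values) or 1e-9
        return [name for name, v in value_by_server.items()
                if abs(v - mid) / mad > cutoff]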

If I deploy a new version of my program on just one of the 50 servers, which doesn't affect the others, the cohort is again useful.

On the other hand, if my 50 servers are under a denial of service attack and the load balancer is doing its job and spreading the load evenly across them, the cohort doesn't help. They're all suffering the same.

Or if I deploy a code change to one server that corrupts shared data in a way that causes it and the other servers to hang when trying to read it, the cohort is useless. They all hang together.

The last comparison you can make is to a theoretical model.

Setting an alert based on a theoretical model can be as simple as eyeballing the history of the metric and saying, this should never go above 80% for more than 10 seconds. If it does, let me know.
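A rule like that is easy to write down. Here's a minimal sketch, assuming you feed it one utilization sample and a timestamp per scrape; the class and its parameters are made up for illustration.

    class ThresholdAlert:
        # Fire once a metric has been above a threshold continuously
        # for longer than the allowed duration.
        def __init__(self, threshold=80.0, max_seconds=10.0):
            self.threshold = threshold
            self.max_seconds = max_seconds
            self.breach_started = None

        def check(self, value, now):
            if value <= self.threshold:
                self.breach_started = None  # back to normal
                return False
            if self.breach_started is None:
                self.breach_started = now   # breach just began
            return now - self.breach_started > self.max_seconds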

This isn't a very sophisticated theoretical model, but understanding what a sophisticated theoretical model is trying to tell you at 2AM when it wakes you up is painful. Simple is good when you can manage it.

That being said, you can have a very sophisticated understanding that leads to a very simple theoretical model. For example, the Cassandra database's storage engine, in the worst case, can double the footprint of its data stored on disk while doing compaction. A generic alert needs to wake you up when Cassandra's data disk is approaching half full.

On the other hand, that worst case is irrelevant to almost every real Cassandra cluster out there. If you know your workload, you can estimate how much extra space compaction will really use in your case and set a threshold much higher.
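The arithmetic behind that is simple enough to sketch. The 25% overhead below is a purely made-up figure for illustration; the only number the worst case actually gives you is the factor of two.

    def disk_alert_threshold(compaction_overhead):
        # Alert before the data plus its temporary compaction copy
        # could fill the disk.
        return 1.0 / (1.0 + compaction_overhead)

    disk_alert_threshold(1.0)   # worst case doubles the data: alert at 50% full
    disk_alert_threshold(0.25)  # estimated 25% overhead: alert at 80% full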

It's still a simple alert, but it's driven by very sophisticated understanding.

You may end up combining these, such as using a cohort to correct a metric so that you can compare it to a theoretical model, but in the end these are your three tools: comparing to history, comparing to cohorts, and comparing to models.