Bomb Squad: Automatic Detection and Suppression of Prometheus Cardinality Explosions

Cody Boggs
FreshTracks.io
Sep 18, 2018


Hello and welcome to another exciting episode of “Things That Can Go Horribly Wrong With Your Monitoring Infrastructure”!

Today we’ll be talking about cardinality explosions in Prometheus, and what you can do about them.

The Situation

Prometheus is running like a top, and your engineering teams are instrumenting their code with scrape-able metrics just as you’d hope. Then your Prometheus instance catches fire and falls down faster than a shaky Jenga tower. Everyone goes blind due to lack of metrics, and the one tool that could tell you what happened (Prometheus) is also the service that just crashed and burned.

Wait… What?

The situation described above could be brought about by a number of events, but the one that seems more likely than the rest is a cardinality explosion¹. This post aims to help operators learn more about what these “explosions” are, and how they can be mitigated.

Why Did My Cardinality Explode?

There’s a good chance that your cardinality explosion is the result of a code deploy that began stuffing high-cardinality² data into one or more series labels. This causes a rapid and sustained inflation of unique series, and while Prometheus is great at handling stable high-cardinality data, cardinality explosions are… not that. Instead, they are volatile, onerous, and frustrating to pinpoint.
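If you want to pinpoint the offender by hand, one rough approach (a sketch only; the rule and label names here are my own, and Bomb Squad automates detection differently, as described below) is to count live series per metric name and watch for the one that keeps climbing:

```yaml
groups:
  - name: cardinality-pinpoint
    rules:
      # Live series per metric name. This touches every series in the head
      # block, so it is expensive; give it a generous evaluation interval.
      # label_replace copies __name__ into a plain label so the recorded
      # results don't all collapse into a single labelset.
      - record: prometheus:series_count:by_metric_name
        expr: count by (metric_name) (label_replace({__name__=~".+"}, "metric_name", "$1", "__name__", "(.*)"))
```

Graphing the recorded series (or just running the inner expression ad hoc) usually makes the exploding metric obvious within a few evaluation intervals.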

Why Does This Hurt?

A cardinality explosion causes problems along several dimensions:

  1. Severe memory inflation in both Prometheus and the offending application(s)
  2. Increased scrape durations
  3. Querying becomes effectively impossible
  4. Prometheus remote_write destinations struggle or crash

Let’s imagine a typical web service running on a Kubernetes cluster with 3 pods receiving approximately 900 requests per second in aggregate (900 because I can only math with easy numbers). Now imagine that we push a code change that inserts the timestamp of each web request into a single label of a single metric. Maybe it was intentional (though ill-advised), or maybe it’s a weird library-generated thing. Either way, we are now exposing all of our usual metrics, plus roughly 900 brand-new series per second across the deployment (about 300 per pod, per second). (If you’re not sufficiently unnerved, pretend that we’re inserting timestamps into multiple labels of multiple metrics… shudder.)

What happens in this case? Well, for starters our application’s internal Prometheus client registry is keeping track of every one of our series as they get created, and continuing to render them upon each request to our /metrics endpoint. The count of said series is now increasing at a rate of ~300 per-pod per-second. This doesn’t bode well for the application.

Meanwhile, Prometheus is dutifully scraping our pods and reading an ever-growing list of samples from them. As it does, the sample ingestion rate keeps climbing, more and more redundant data is shipped across the wire on every scrape, and more and more computation is needed just to process a payload that will only be larger (with interest) on the next scrape interval.

Pics or it Didn’t Happen

In case you’re wondering how this manifests, here’s a set of graphs from a single Prometheus instance scraping 10 pods of a toy app that emits 100 exploding series per pod per second (simulating ~1000 requests per second for a moderately busy web service):

[Image: four time-series line graphs of key Prometheus performance metrics, all trending up and to the right. This is a super bad thing.]
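The graphs themselves aren’t reproduced here, but signals like the ones below, all from Prometheus’ own self-instrumentation, tell the same story. Here is a sketch of recording rules you could keep around for them (the rule names are my own invention; the underlying metrics are standard Prometheus 2.x self-metrics), with the nice side effect that the trends stay cheap to query even when the server is struggling:

```yaml
groups:
  - name: cardinality-watch
    rules:
      # Total series currently held in the TSDB head block
      - record: prometheus:tsdb_head_series
        expr: prometheus_tsdb_head_series
      # Ingestion rate: samples appended per second over the last 5 minutes
      - record: prometheus:head_samples_appended:rate5m
        expr: rate(prometheus_tsdb_head_samples_appended_total[5m])
      # Slowest scrape per job
      - record: job:scrape_duration_seconds:max
        expr: max by (job) (scrape_duration_seconds)
      # Samples returned by each job's targets on their most recent scrape
      - record: job:scrape_samples_scraped:sum
        expr: sum by (job) (scrape_samples_scraped)
```

During an explosion you’d expect all of these to trend sharply up and to the right, just like the graphs above.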

Who Ya Gonna Call? Bomb Squad!

Without some extra tooling, there’s not much you can do except roast marshmallows over the smoldering remains of your Prometheus instance. Alternatively, you could try a small tool we built called Bomb Squad.

Bomb Squad is a sidecar to Prometheus (K8s only for now, but not forever) that follows a particular routine:

  1. Bootstrap recording rules into Prometheus
  2. Monitor for exploding metrics
  3. When found, identify exploding label
  4. Create a new metric to indicate a metric+label pair is exploding
  5. Insert “silencing rule” relabel config(s)
  6. Once the issue causing the explosion is resolved, remove the silencing rules upon request and reset the Bomb Squad indicator metric (from step 4) to 0

It basically uses Prometheus to save Prometheus, by way of recording rules (for detection) and relabel configs (for suppression).
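To make that concrete, here is a rough sketch of both mechanisms. This is not Bomb Squad’s actual generated configuration; the job, metric, and label names are invented for illustration. Detection comes down to a recording rule that tracks how fast each target’s per-scrape sample count is growing:

```yaml
groups:
  - name: bomb-squad-style-detection
    rules:
      # Per-second trend in each target's per-scrape sample count.
      # A large, sustained positive value is the signature of an explosion.
      - record: job_instance:scrape_samples_scraped:deriv5m
        expr: deriv(scrape_samples_scraped[5m])
```

Suppression comes down to a metric_relabel_configs entry that stops the offending label from ever reaching the TSDB, for example by pinning its value to a constant on the exploding metric:

```yaml
scrape_configs:
  - job_name: my-web-service          # hypothetical job name
    metric_relabel_configs:
      # A "silencing rule" sketch: when the exploding metric comes through,
      # overwrite the runaway label's value with a constant so the scrape
      # collapses back to a handful of stable series.
      - source_labels: [__name__]
        regex: http_requests_total    # hypothetical exploding metric
        target_label: request_ts      # hypothetical exploding label
        replacement: silenced
        action: replace
```

Bomb Squad writes and removes rules like these for you, and its indicator metric (step 4 above) tells you that suppression is in effect so the silenced label isn’t forgotten.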

Bomb Squad is still an alpha project, but it seems like it might be helpful in some shops. Plans are underway to expand its capabilities and support more than just K8s deployments, as detailed in the GitHub repo: https://github.com/open-fresh/bomb-squad. The repo’s README gives a bit more detail on how Bomb Squad operates and some rough steps for trying it out locally (without making your metrics team scream). There’s also a live demo that was given at PromCon 2018 in Munich.

Contributions are very welcome!

  1. A cardinality explosion is the sudden, rapid creation of new series due to one or more labels on one or more metrics being populated with high-cardinality data
  2. High-cardinality data is any data that, when deduplicated into a set, has a large number of distinct elements. In this context, we care about cardinalities in the tens of thousands and up
