Load Testing Prometheus Metric Ingestion
Prometheus is a popular open-source project for time-series ingestion, storage, and querying. The Prometheus ecosystem is expanding rapidly, with alternative implementations such as Cortex and Thanos emerging to meet the diverse use cases of adopting organizations. These implementations share a common exposition format, which is in the process of being formalized as the OpenMetrics standard.
It is important for organizations to understand their series-throughput limits in order to scale a metrics ingestion system properly, but alternative implementations have different performance and scaling characteristics. Avalanche is a simple metrics generation tool that can be used to load test metric ingestion throughput: it serves a metrics endpoint with configurable series generation, seen in the example below.
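For illustration, the generated endpoint output looks roughly like the following Prometheus exposition-format text (the metric and label names here are representative placeholders, not Avalanche's exact naming scheme):

```
# TYPE avalanche_metric_0 gauge
avalanche_metric_0{label_key_0="label_val_0",series_id="0"} 42
avalanche_metric_0{label_key_0="label_val_0",series_id="1"} 17
```

Each configured metric name is rendered with one line per series, distinguished by label values such as a series identifier.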
Running Avalanche
The easiest way to run an Avalanche server is with the public Docker image.
docker run -p 9001:9001 quay.io/freshtracks.io/avalanche
Avalanche supports several flags for configuring the generated series:
metric-count
Configure the number of metric names exposed in the endpoint. This option is useful for testing how the number of series present on the endpoint impacts ingestor performance.
series-count
Configure the number of series per metric. The total number of series rendered to the endpoint equals (metric-count * series-count). This option is useful for testing how the (metric-count/series-count) ratio impacts ingestor performance.
label-count
Configure the number of labels present in each series.
value-interval
Update series values every {interval} seconds.
series-interval
Update the ‘series_id’ label every {interval} seconds, cycling series. This option is helpful for testing series creation and termination performance.
metric-interval
Update metric names every {interval} seconds, cycling series. This option is helpful for testing series creation and termination performance.
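Putting the flags above together, an invocation might look like the following sketch (the `--flag=value` syntax and these particular values are assumptions for illustration; this configuration would render 500 * 10 = 5,000 total series):

```
docker run -p 9001:9001 quay.io/freshtracks.io/avalanche \
  --metric-count=500 \
  --series-count=10 \
  --label-count=5 \
  --value-interval=30 \
  --series-interval=60 \
  --metric-interval=120
```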
Measuring Ingestion Performance
Configure Prometheus to scrape itself, plus add the following scrape config for Avalanche:
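A minimal scrape config for Avalanche might look like the sketch below (the job name, scrape interval, and target address are placeholders; point the target at wherever the Avalanche container is exposed):

```yaml
scrape_configs:
  - job_name: 'avalanche'
    scrape_interval: 15s
    static_configs:
      # Avalanche container mapped to port 9001 on the local host
      - targets: ['localhost:9001']
```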
Some interesting Prometheus metrics to watch while scraping Avalanche include:
- scrape_duration_seconds
- prometheus_target_sync_*
- prometheus_target_scrape_*
- prometheus_tsdb_*
- prometheus_local_storage_*
- prometheus_rule_*
Clustering
Real-life Prometheus instances are likely to scrape multiple hosts and endpoints. This behavior can be easily replicated with a Kubernetes deployment. An example deployment configuration to deploy 5 instances of Avalanche looks like this:
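A sketch of such a Deployment is shown below; the resource names, labels, and the `prometheus.io/*` discovery annotations are illustrative conventions, not part of Avalanche itself:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: avalanche
  labels:
    app: avalanche
spec:
  replicas: 5          # five Avalanche instances to scrape
  selector:
    matchLabels:
      app: avalanche
  template:
    metadata:
      labels:
        app: avalanche
      annotations:
        # Common convention for annotation-based Prometheus discovery
        prometheus.io/scrape: "true"
        prometheus.io/port: "9001"
    spec:
      containers:
        - name: avalanche
          image: quay.io/freshtracks.io/avalanche
          ports:
            - containerPort: 9001
```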
Next, configure Prometheus to run as a pod in your cluster and update the Prometheus scrape config to use Kubernetes service discovery to find each exposed Avalanche endpoint with a scrape target config as shown in the example config below:
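One way to sketch that scrape config is with pod-role service discovery and a relabel rule keyed off the scrape annotation (the job name and annotation convention here are assumptions matching the example Deployment above):

```yaml
scrape_configs:
  - job_name: 'avalanche-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```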
Once Prometheus is running in a pod and scraping the Avalanche endpoints, Node Exporter and cAdvisor metrics can provide insight into its performance and resource utilization. Resource usage generally correlates with the total series/second ingested, and "prometheus_target_interval_length_seconds" will exceed the requested scrape interval when Prometheus is under-provisioned.
Alternative Implementations
Avalanche is useful for guiding parameter tuning, scaling, and development of other metrics collection systems that adopt the OpenMetrics standard. One example is Cortex, which has much different performance characteristics than Prometheus and exposes a host of metrics useful for performance investigations, prefixed with cortex_.
FreshTracks uses Avalanche to test and scale our Cortex system, and we’ve found that operations such as rule evaluation, chunk flushing, and series termination are particularly useful to track because certain metric workload combinations can significantly degrade ingestion performance. This data has been useful to help us proactively avoid and tune for those combinations.