What high cardinality metrics are and why they get expensive

High cardinality metrics are monitoring measurements whose labels can take a very large number of unique value combinations, potentially millions, as happens when you track every individual user ID or request ID. They give you detailed insight into problems but can sharply raise your monitoring costs. This is especially relevant for AI agents, where every call can carry unique session or trace data.

When you monitor a service, you attach labels to your measurements, which gives you views like 'error count by region' or 'response time by endpoint'. Each unique combination of label values has to be tracked separately. Add labels that change for every single user or request (like a unique user ID), and the number of combinations explodes. That explosion is called high cardinality, and most standard monitoring tools struggle to handle it efficiently.
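To make the explosion concrete, here is a minimal sketch of the arithmetic. The label names and counts are made-up assumptions for illustration, not figures from any real system. It multiplies the number of possible values per label to get an upper bound on the number of combinations.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on distinct label combinations for one metric."""
    return prod(label_cardinalities.values())


# Hypothetical label sets: the counts are assumptions, not measurements.
bounded = {"region": 5, "endpoint": 20, "status": 5}
with_user_id = {**bounded, "user_id": 100_000}

print(series_count(bounded))       # 500
print(series_count(with_user_id))  # 50000000
```

In this example, one extra per-user label turns 500 combinations into 50 million. Real systems usually see fewer than the upper bound, because not every combination occurs, but the growth pattern is the same.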

Tools like Prometheus keep a separate time series for each unique label combination and were not designed for this scale, so teams either hit performance walls or face large storage bills. The observability community actively debates which specialized tools (such as Honeycomb or ClickHouse) handle it better, and how sampling (recording only a fraction of events) can cut costs without losing visibility. If you are building AI agents that log per-request traces, designing your labels carefully from the start can save significant money later.
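As an illustration of the sampling idea, the sketch below keeps a fixed fraction of events by hashing a trace ID. This is one common approach, not the method of any particular tool named above. The function name keep_event and the trace ID format are hypothetical.

```python
import hashlib


def keep_event(trace_id: str, sample_rate: float) -> bool:
    """Deterministically keep roughly `sample_rate` of all trace IDs.

    Hashing the ID (instead of calling a random generator) means every
    service makes the same keep/drop decision for the same trace.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # integer in [0, 2**64)
    threshold = int(sample_rate * 2**64)         # cut-off for kept IDs
    return bucket < threshold


# Hypothetical trace IDs, sampled at 10%.
events = [f"trace-{i}" for i in range(10_000)]
kept = [t for t in events if keep_event(t, 0.10)]

# Each kept event stands in for about 1 / sample_rate originals.
estimated_total = len(kept) / 0.10
print(len(kept), round(estimated_total))
```

The trade-off is that rare events can be missed entirely at low sample rates, which is part of why the right rate is debated.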

Key points

  • High cardinality means a metric has an enormous number of unique label combinations.
  • Common causes: attaching user IDs, request IDs, or session tokens as metric labels.
  • Standard tools like Prometheus slow down or become very costly at high cardinality.
  • Specialized tools (Honeycomb, ClickHouse) or sampling techniques are common solutions.
  • AI agent builders should plan metric label design early to avoid runaway monitoring costs.
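
One way to act on the last point is to map raw request details onto a small, fixed set of label values before recording a metric. The sketch below is an assumption-laden example, not a feature of any specific tool. The region list, the label names, and the helper bounded_labels are all hypothetical.

```python
import re

# Hypothetical allow-list: anything else is grouped under "other".
ALLOWED_REGIONS = {"us-east", "us-west", "eu-west"}


def bounded_labels(region: str, path: str, status: int) -> dict[str, str]:
    """Turn raw request attributes into labels with few possible values."""
    # Replace numeric path segments such as /users/12345 with a placeholder.
    route = re.sub(r"/\d+(?=/|$)", "/:id", path)
    return {
        "region": region if region in ALLOWED_REGIONS else "other",
        "route": route,
        # Collapse individual status codes into classes like "4xx".
        "status_class": f"{status // 100}xx",
    }


print(bounded_labels("eu-west", "/users/12345/orders/987", 404))
# {'region': 'eu-west', 'route': '/users/:id/orders/:id', 'status_class': '4xx'}
```

Per-user or per-request identifiers are left out of the labels here. They are better kept in logs or traces, where each event is stored once instead of creating a new label combination.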

Quick term guide

high cardinality
A dataset where the number of unique value combinations is very large, making it hard and expensive to store or query.
monitoring
Watching a system to see if it is working well or having problems.
AI agent
An AI program that can inspect information, suggest what to do next, and carry out steps toward a goal, not just answer once.
monitoring tool
Software that checks whether an app, website, or server is working normally.
Prometheus
A popular open-source tool that collects and stores numbers about how well a service is running.
observability
The ability to monitor and understand what's happening inside a running system by looking at its outputs and logs.
sampling
Recording only a small fraction of events instead of all of them, to reduce storage and cost while keeping a useful picture.