
Time Series Data at Scale: TSDB Architecture Guide

Metasphere Engineering · 11 min read

You start with PostgreSQL because you already run it. The monitoring data goes into a metrics table with a timestamp column, a metric name, a few label columns, and a value. Works great for the first month when you’re tracking 500 time series from 10 services. Then the Kubernetes migration happens. Suddenly you have 200 pods emitting 30 metrics each at 15-second intervals. The metrics table hits 400 million rows. VACUUM takes 45 minutes. Your dashboard query that used to return in 200ms now takes 12 seconds. The B-tree indexes are fighting for I/O with the write load. Your DBA sends a pointed email about the metrics table consuming 60% of the database’s I/O budget.

Every team hits this wall. The PostgreSQL experience teaches you, viscerally, what makes time series workloads fundamentally different from everything else you’ve stored in a relational database. The good news: the architectural principles for handling them are well-understood and the tooling is mature. And the choice of which TSDB to use matters far less than understanding the design decisions that make any of them perform well.

[Figure: Query Latency, PostgreSQL vs Purpose-Built TSDB. Line chart comparing query latency over 15 days of data ingestion: PostgreSQL climbs from 50ms to timeout as B-tree indexes degrade under high-cardinality write load, while the purpose-built TSDB stays flat at roughly 80-100ms.]

The Write Pattern That Breaks General-Purpose Databases

Let’s do the math on a real production environment. A Kubernetes cluster with 100 nodes, each running 50 pods, each emitting 20 metrics at 15-second intervals. That is 100 x 50 x 20 = 100,000 metric points every 15 seconds. 400,000 per minute. 576 million per day. A medium-sized production infrastructure generates this volume from monitoring alone, before you add application-level business metrics.
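The arithmetic above is worth making explicit, since these are the numbers you plug your own cluster shape into. A minimal sketch using the hypothetical cluster dimensions from the example:

```python
# Write-volume math for the example cluster above (all sizing numbers are
# the article's illustrative figures, not measurements).
nodes = 100
pods_per_node = 50
metrics_per_pod = 20
scrape_interval_s = 15

points_per_scrape = nodes * pods_per_node * metrics_per_pod        # 100,000
points_per_minute = points_per_scrape * (60 // scrape_interval_s)  # 400,000
points_per_day = points_per_minute * 60 * 24                       # 576,000,000

print(points_per_scrape, points_per_minute, points_per_day)
```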

PostgreSQL B-tree index maintenance under this write load creates continuous I/O contention. Queries and writes fight for the same disk bandwidth. VACUUM runs compete with ingestion. You can tune work_mem, partition tables by time range, and increase max_wal_size. These help. They do not fix the fundamental problem: B-tree indexes were never designed for append-only, time-ordered, high-throughput writes. You are fighting the storage engine itself.

Purpose-built TSDBs solve this through three mechanisms that work together. Time-based partitioning shards data by time range, so recent writes always go to the active chunk. In-memory write buffers flush to disk in batches, eliminating per-write disk I/O. And compression algorithms exploit temporal locality: delta encoding on timestamps (successive values differ by a constant interval) and XOR encoding on metric values (successive readings are often similar) together produce 10-20x better storage efficiency than a relational schema. A metric that occupies 16 bytes per sample in PostgreSQL occupies 1-2 bytes per sample in VictoriaMetrics. That is not a marginal improvement. That is a different category of storage.
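To make the compression intuition concrete, here is a toy sketch of the two encodings, in the spirit of Gorilla-style TSDB compression. This is a conceptual illustration, not the actual VictoriaMetrics implementation: real engines add variable-length bit packing on top of these transforms.

```python
import struct

def delta_encode(timestamps):
    """Store the first timestamp, then successive differences.
    Regular scrape intervals make every delta the same small constant,
    which is highly compressible."""
    return [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]

def xor_encode(values):
    """XOR the IEEE-754 bit patterns of consecutive float samples.
    Similar readings share most bits, so the XOR result is mostly zeros."""
    bits = [struct.unpack(">Q", struct.pack(">d", v))[0] for v in values]
    return [bits[0]] + [a ^ b for a, b in zip(bits, bits[1:])]

# A 15-second scrape interval collapses to a run of identical deltas...
print(delta_encode([1700000000, 1700000015, 1700000030, 1700000045]))
# ...and a stable metric value collapses to a run of zeros.
print(xor_encode([0.25, 0.25, 0.25]))
```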

Cardinality: The Silent TSDB Killer

Cardinality management is the single most important operational discipline for anyone running a TSDB. Every label you add to a metric multiplies cardinality by the number of unique values that label can take. This multiplicative effect is the thing that catches every team eventually.

Here is how it escalates. You start with http_request_duration_seconds. One time series. Add a service label with 10 services: 10 series. Add endpoint with 50 unique paths: 500 series. Add status_code with 5 values: 2,500 series. All fine. Then someone adds user_id with 80,000 active users: 200 million series. Your TSDB falls over. This exact scenario plays out routinely.
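The escalation is pure multiplication, which is why it sneaks up on teams. The same numbers as arithmetic:

```python
# Cardinality is the product of unique values across all labels.
# Label counts below are the illustrative figures from the example above.
from functools import reduce
from operator import mul

labels = {"service": 10, "endpoint": 50, "status_code": 5}
series = reduce(mul, labels.values(), 1)
print(series)  # 2500 -- manageable

labels["user_id"] = 80_000  # one unbounded label slips into a code review
series = reduce(mul, labels.values(), 1)
print(series)  # 200000000 -- the TSDB falls over
```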

The dangerous label types are the ones with unbounded cardinality: user_id, session_id, request_id, trace_id. Adding any of these to a per-request metric creates a new time series per unique value, potentially millions per day. Prometheus enforces a configurable per-scrape limit (typically 100,000) to prevent this from taking down the entire TSDB. VictoriaMetrics handles higher cardinality more gracefully but still degrades above 50-100 million active series per cluster.

The correct alternative for request-level granularity is distributed tracing, not metrics. Do not try to make metrics do what traces are designed for. Traces handle high-cardinality identifiers. Metrics do not. This is a fundamental architectural boundary: use metrics for aggregated trends (p99 latency by service), use traces for individual request investigation (why was this specific request slow?). Understanding how metrics, traces, and logs fit together is core to any observability monitoring practice.

Query Optimization for Time Series

Cardinality is the write-side problem. Query design is the read-side problem, and it is just as capable of taking down your monitoring stack. A TSDB can ingest millions of points per second and still deliver terrible dashboard performance if queries are poorly constructed. The most common failure: a Grafana panel running SELECT * equivalent queries over 30 days of raw data with no aggregation function. This forces the TSDB to scan billions of data points, decompress them, and stream them to the client. The dashboard times out, the user refreshes, and now two identical expensive queries are running simultaneously. Congratulations, you just DDoS’d your own monitoring system.

Query patterns that perform well at scale share three characteristics. First, they specify an aggregation function appropriate to the visualization. A line chart showing request latency over 7 days does not need 40 million raw data points. It needs avg_over_time or quantile_over_time at a resolution the human eye can actually distinguish. Second, they use label matchers that reduce scan scope early. Querying http_request_duration_seconds{service="checkout", region="us-east-1"} scans a fraction of the data compared to http_request_duration_seconds with no label filters. Third, they align the query step interval to the dashboard time range. A reasonable baseline: 15-second steps for 1-hour views, 5-minute steps for 24-hour views, 1-hour steps for 7-day views, and 6-hour steps for 30-day views. Every Grafana panel should have its $__interval or min step configured to match this discipline. Panels with no step interval configured default to computing one based on pixel width, which often produces unnecessarily fine resolution.
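The step-alignment discipline is easy to codify. A minimal step-picker following the baseline table above; the thresholds are this article's suggestions, not a Grafana default:

```python
def step_for_range(range_seconds: int) -> int:
    """Return a query step (in seconds) matched to the dashboard time range,
    per the baseline in the text."""
    if range_seconds <= 3600:            # 1-hour view
        return 15
    if range_seconds <= 24 * 3600:       # 24-hour view
        return 300                       # 5 minutes
    if range_seconds <= 7 * 24 * 3600:   # 7-day view
        return 3600                      # 1 hour
    return 6 * 3600                      # 6 hours for 30-day views

print(step_for_range(3600))            # 15
print(step_for_range(30 * 24 * 3600))  # 21600
```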

Recording rules are the most underused optimization in Prometheus and VictoriaMetrics deployments, and it is baffling how often teams overlook them. A recording rule pre-computes an expensive query on the write path and stores the result as a new time series. Instead of computing histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) across 50 services every time a dashboard panel loads, you define a recording rule that computes it once per evaluation interval and writes the result as job:http_request_duration_seconds:p99_5m. The dashboard queries the pre-computed series, which is a single time series per job instead of thousands of histogram buckets.
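For reference, a recording rule like the one described is a few lines of Prometheus rule-file YAML. This is a sketch, the metric and rule names come from the example above; the 30-second evaluation interval is an illustrative choice:

```yaml
# Prometheus rule file: pre-compute p99 latency once per evaluation interval
# instead of on every dashboard load. The rule name follows the common
# level:metric:operation naming convention.
groups:
  - name: latency-recording-rules
    interval: 30s
    rules:
      - record: job:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
```

Dashboards then query `job:http_request_duration_seconds:p99_5m` directly, one series per job instead of thousands of buckets.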

The performance difference is dramatic. One observability team running SLA dashboards for 120 microservices reported that their primary SLA dashboard took 18 seconds to load because every panel computed histogram quantiles from raw bucket data. After migrating the five most expensive queries to recording rules, load time dropped to 1.8 seconds. A 90% reduction in query latency from a configuration change that took an afternoon. Recording rules also reduce load on the TSDB query engine during incidents, which is precisely when fast dashboards matter most. If your dashboards are slow when things are on fire, they are not serving their purpose.

Beyond individual query optimization, audit your Grafana dashboards quarterly. Hunt for panels that query more data than they display, dashboards that auto-refresh at 5-second intervals when 30 seconds would suffice, and queries that scan high-cardinality label dimensions unnecessarily. A single poorly written dashboard used by 20 engineers generates more query load than the entire alerting pipeline. Integrating query performance reviews into your observability monitoring practice prevents dashboard sprawl from degrading the monitoring system itself.

With cardinality under control and queries performing well, the full time series architecture connects producers through a collector layer to the TSDB with downsampling rules, retention tiers, and long-term archival. Here is what that looks like end to end.

Downsampling Strategy

Raw metric data is valuable for recent troubleshooting and ruinously expensive to retain indefinitely. The numbers tell the story: one metric at 15-second resolution produces 2.1 million data points per year. Multiply by 10,000 active time series and you are storing 21 billion data points per year at raw resolution. Terabytes of storage for metrics alone.

The standard retention strategy uses three tiers:

Raw resolution (15 seconds to 1 minute) retained for 15-30 days. Enough for incident investigation at full granularity. When someone asks “what happened right before the alert fired?”, you need second-level precision.

Medium resolution (5 minutes) retained for 6-12 months. Sufficient for capacity planning, trend analysis, and SLA reporting. The difference between a 15-second data point and a 5-minute average is invisible on a weekly graph.

Low resolution (1 hour or 1 day) retained for 2-5 years. For year-over-year comparisons and long-term infrastructure planning. “How much did traffic grow last year?” doesn’t need 15-second precision.
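The storage math behind the three tiers is straightforward to verify:

```python
# Points stored per series per year at each retention tier above.
SECONDS_PER_YEAR = 365 * 24 * 3600

def points_per_year(resolution_s: int) -> int:
    """One sample per resolution interval, held for a full year."""
    return SECONDS_PER_YEAR // resolution_s

print(points_per_year(15))    # 2102400 -- raw 15-second resolution
print(points_per_year(300))   # 105120  -- 5-minute medium tier
print(points_per_year(3600))  # 8760    -- hourly low tier
```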

Configure downsampling rules before the TSDB needs them, not after the storage bill arrives. This is the mistake that catches every team. VictoriaMetrics supports downsampling natively with -downsampling.period flags. Thanos uses a compactor component that runs as a separate process against object storage. Data engineering teams should plan for downsampling compute capacity at setup time. The computation consumes CPU proportional to the number of active time series. Teams building retention pipelines will find the patterns in data engineering pipelines practice directly applicable.

Time series data management at scale comes down to three disciplines: choosing a storage engine built for append-only temporal writes, controlling label cardinality before it becomes a crisis, and configuring downsampling tiers that balance query precision against storage cost. Get all three right and your monitoring infrastructure scales with your platform. Skip any one of them and you will discover the limits of your database at 3 AM during an incident.

The next layer is where that data turns into operational action.

Alerting Architecture on Time Series Data

The value of a TSDB is not the data it stores. It is the decisions that data enables. Alerting is where time series data translates into operational action, and poor alerting design wastes everything you invested in the layers above.

The foundational principle is simple and non-negotiable: alert on symptoms, not causes. An alert that fires when error rate exceeds 1% of total requests tells an on-call engineer something is broken and customers are affected. An alert that fires when a specific log message appears tells them almost nothing actionable. The error rate alert captures every failure mode, including ones nobody anticipated. The log-based alert captures only the exact failure someone thought to write a rule for. Symptom-based alerting built on aggregated time series metrics catches categories of failure rather than individual instances.

Multi-window alerting adds precision to this approach. A short burn-rate window of 5 minutes detects acute spikes. If error rate jumps from 0.1% to 15% in 5 minutes, something broke suddenly and needs immediate attention. A long burn-rate window of 1 hour detects sustained degradation. If error rate creeps from 0.1% to 0.8% over 60 minutes, a slow-building problem is consuming the error budget. Requiring both windows to breach before paging reduces false positives dramatically. Google’s SRE practices formalize this as multi-burn-rate alerting, and teams that adopt it report 50-70% fewer false pages compared to simple threshold alerts.
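The two-window condition reduces to a simple conjunction. A toy sketch of the paging decision, with illustrative thresholds (real burn-rate thresholds are derived from your SLO error budget, per the SRE workbook approach):

```python
def should_page(short_window_error_rate: float,
                long_window_error_rate: float,
                short_threshold: float = 0.14,   # acute-spike threshold (illustrative)
                long_threshold: float = 0.006) -> bool:
    """Page only when BOTH windows breach: the short window proves the spike
    is happening now, the long window proves it is not a momentary blip."""
    return (short_window_error_rate > short_threshold
            and long_window_error_rate > long_threshold)

print(should_page(0.15, 0.008))  # True: sudden spike, sustained over the long window
print(should_page(0.15, 0.001))  # False: brief blip, long window still healthy
```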

Alert fatigue is the operational failure that undermines even well-designed monitoring. Research on incident response teams consistently shows that organizations with more than 50 active alerts per on-call rotation see investigation rates drop below 30%. Engineers stop reading alerts when most of them are noise. They miss the critical ones because they have learned to ignore the channel. The fix is ruthless pruning. Every alert must have a documented response procedure. If no one would take action on a specific alert, it is a dashboard metric, not a page. Review alert volume monthly. If an alert fires more than 10 times per week without requiring human intervention, auto-remediate it or eliminate it. There is no third option.

Routing completes the alerting pipeline. Not every alert deserves the same response channel. Critical alerts indicating customer-facing impact route to PagerDuty or equivalent for immediate page. Warning-level alerts indicating degradation that will become critical within hours route to a dedicated Slack channel for the responsible team. Informational alerts that track trends feed dashboards only. Severity-based routing ensures that PagerDuty escalations remain rare and meaningful. Teams that route everything to the same Slack channel have effectively created an expensive, unstructured log file that nobody reads. For organizations building out their alerting infrastructure alongside broader platform investments, the patterns described in performance and capacity engineering provide complementary guidance on turning metric thresholds into capacity decisions before they become incidents.

Design a Time Series Architecture That Won't Break at Scale

High-cardinality telemetry will destroy a general-purpose database. Metasphere designs time series data platforms that sustain millions of metric points per second, enforce tiered retention policies, and keep query performance fast at monitoring scale.

Architect Your TSDB

Frequently Asked Questions

What makes time series data architecturally different from relational data?

Three properties separate time series from relational workloads. Writes arrive in high-volume, time-ordered bursts (thousands of points per second) rather than random inserts. Queries are almost always range scans over time windows, not point lookups. Data has explicit retention tiers: raw data for 15-30 days, 5-minute averages for 12 months, hourly aggregates for 5+ years. PostgreSQL works at low scale but degrades on both write throughput and cardinality at production monitoring volumes.

What is the cardinality explosion problem in time series databases?

Cardinality is the count of unique time series: every combination of metric name plus label values. A metric like http_request_duration with labels for service, endpoint, status_code, and region can produce hundreds of thousands of combinations. Adding unbounded labels like user_id or request_id creates millions of series per day. Most TSDBs degrade or stop indexing under that load. Prometheus supports a configurable per-scrape sample limit, commonly set around 100,000, to prevent this.

What is downsampling and why is it required for retention?

Downsampling replaces raw data with statistical summaries at coarser resolutions. Raw 15-second metrics for one year produce about 2 million points per series. Downsampled to 5-minute averages: 105,000 points. Hourly averages for 5 years: 43,800 points. Tiered retention with raw, medium, and coarse resolution gives full precision for incidents and trend visibility for planning without unbounded storage growth.

What is the difference between VictoriaMetrics and InfluxDB?

VictoriaMetrics is a Prometheus-compatible TSDB optimized for storage efficiency and query performance. It uses MetricsQL (a Prometheus superset) and handles high cardinality better than Prometheus itself. InfluxDB uses Flux, supports multiple field values per timestamp, and fits IoT and financial use cases better. For infrastructure monitoring at scale, VictoriaMetrics is the default choice. For IoT or custom time series applications, InfluxDB is the stronger fit.

When is TimescaleDB sufficient vs. a dedicated TSDB?

TimescaleDB works when your stack already runs PostgreSQL, write volume stays under 100K points per second per node, cardinality is bounded, and you need SQL joins alongside time series queries. Switch to a dedicated TSDB when throughput exceeds TimescaleDB’s ceiling, cardinality climbs above 5 million unique series, or you need native Prometheus integration. PostgreSQL operational familiarity has real value. Don’t switch unless volume or cardinality forces it.