Prometheus

Pull-based monitoring, time-series database, and alerting toolkit

01 Overview

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud in 2012. It joined the Cloud Native Computing Foundation (CNCF) in 2016 and graduated in 2018 as the second graduated project after Kubernetes. Prometheus 3.0, released in November 2024, is the current major version; it brought native OTLP ingestion, UTF-8 metric and label name support, a new UI, and continued work on native histograms (declared stable in v3.8). Prometheus collects metrics by pulling (scraping) HTTP endpoints at configured intervals, stores the samples in a local time-series database (TSDB), and evaluates alert rules against that data.

Pull-Based Model

Prometheus scrapes targets rather than waiting for them to push data. Each target exposes a /metrics HTTP endpoint in a text-based format. This means Prometheus controls the collection rate, can detect when targets are down (failed scrape), and requires no client-side queuing or state.
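
The target side of this contract is just an HTTP handler that prints current counter values as plain text. A minimal sketch using only the Python standard library (the metric name app_requests_total is illustrative; real applications would normally use an official client library):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS = {"GET": 0}  # toy counter state, incremented by the application

def render_metrics():
    """Render current state in the Prometheus text exposition format."""
    return (
        "# HELP app_requests_total Total requests handled\n"
        "# TYPE app_requests_total counter\n"
        f'app_requests_total{{method="GET"}} {REQUESTS["GET"]}\n'
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode()
        self.send_response(200)
        # Standard content type for the text exposition format
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("0.0.0.0", 8000), MetricsHandler).serve_forever()
```

Prometheus would then scrape this endpoint via a static_configs target pointing at the host and port.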

Time-Series Database

All data is stored as time series: streams of timestamped values identified by a metric name and a set of key-value labels. For example: http_requests_total{method="GET", status="200"}. The TSDB is optimized for append-heavy workloads and compresses samples efficiently; series cardinality (the number of distinct label combinations) is the main driver of memory and storage cost.
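
The identity of a series is exactly this pair: metric name plus sorted label set. A toy Python model of that idea (not how the real TSDB is implemented):

```python
def series_key(name, labels):
    """Identity of a series: metric name + sorted label pairs."""
    return (name, tuple(sorted(labels.items())))

class ToyTSDB:
    """Append-only store: one list of (timestamp, value) per series."""
    def __init__(self):
        self.series = {}

    def append(self, name, labels, ts, value):
        self.series.setdefault(series_key(name, labels), []).append((ts, value))

    def samples(self, name, labels):
        return self.series.get(series_key(name, labels), [])
```

Reordering labels does not create a new series; changing any label value does, which is why unbounded label values (user IDs, raw URLs) explode cardinality.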

PromQL

Prometheus ships with a powerful functional query language for selecting, aggregating, and transforming time-series data. PromQL powers dashboards (Grafana), alert rules, and recording rules.

Alertmanager

A separate component that handles alert deduplication, grouping, routing, silencing, and notification delivery (email, Slack, PagerDuty, webhook). Prometheus evaluates alert rules and sends firing alerts to Alertmanager.

What Prometheus is NOT

  • Not a log aggregator — use Loki, Elasticsearch, or Fluentd/Fluent Bit for logs. Prometheus handles numeric metrics only.
  • Not for long-term storage by default — local TSDB retention is typically 15–30 days. For long-term storage, use Thanos or Mimir.
  • Not 100% accurate — it is designed for operational monitoring with slight data loss tolerance, not for billing or financial-grade precision.

02 Architecture

The Prometheus ecosystem consists of several components that work together. The Prometheus server is the central piece that scrapes, stores, and queries metrics. Supporting components handle service discovery, short-lived job metrics, alerting, and visualization.

+-----------------+          +---------------------+          +------------------+
|     Targets     |  scrape  |  Prometheus Server  |  alert   |   Alertmanager   |
|  (apps, nodes,  |<---------|  +---------------+  |--------->|  (dedup, route,  |
|   exporters)    |          |  |     TSDB      |  |          |     notify)      |
|    /metrics     |          |  |  (local disk) |  |          +--------+---------+
+-----------------+          |  +---------------+  |                   |
                             |  +---------------+  |                   v
+-----------------+          |  |  Rule Engine  |  |          +------------------+
|     Service     |          |  +---------------+  |          |      Slack       |
|    Discovery    |--------->+----------+----------+          |      Email       |
|  (K8s, Consul,  |                     |                     |    PagerDuty     |
|   file, DNS)    |                     | PromQL              +------------------+
+-----------------+                     v
                             +---------------------+
+-----------------+          |       Grafana       |
|   Pushgateway   |          |   (visualization)   |
|  (short-lived   |          +---------------------+
|   batch jobs)   |
+-----------------+

Prometheus Server

The main binary. Handles scraping targets, storing time series in the local TSDB, evaluating recording and alerting rules, and serving the PromQL query API. Runs as a single stateful process.

TSDB

The embedded time-series database. Data is organized into 2-hour blocks on disk, compacted over time. Designed for high ingestion rates with efficient compression. Not replicated — a single Prometheus instance is a single point of failure.
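
The 2-hour alignment means a sample's block window can be computed directly from its timestamp. A sketch of the alignment rule (real blocks are directories named by ULID and are merged into larger ranges by compaction):

```python
BLOCK_RANGE_MS = 2 * 60 * 60 * 1000  # 2-hour head blocks

def block_window(ts_ms, block_range_ms=BLOCK_RANGE_MS):
    """Return the [start, end) window of the block a timestamp falls into."""
    start = ts_ms - ts_ms % block_range_ms
    return start, start + block_range_ms
```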

Service Discovery

Prometheus dynamically discovers scrape targets via integrations with Kubernetes, Consul, DNS, EC2, file-based configs, and more. No need to hard-code every target IP.

Pushgateway (edge case)

For short-lived batch jobs that cannot be scraped (they exit before Prometheus can pull). Jobs push metrics to the Pushgateway; Prometheus scrapes the gateway. Use sparingly — it breaks the pull model and can become a single point of failure.

Exporters

Third-party agents that expose metrics from systems that don't natively speak Prometheus. Node Exporter (Linux hardware/OS), Blackbox Exporter (probing), cAdvisor (containers), database exporters, and hundreds more.

Grafana

The standard visualization layer. Grafana queries Prometheus via PromQL and renders dashboards. Not part of Prometheus itself, but virtually every Prometheus deployment uses Grafana.

03 Configuration

Prometheus is configured via a YAML file, typically prometheus.yml. The configuration defines global settings, scrape targets, alerting rules, and remote storage endpoints.

Full prometheus.yml example

# prometheus.yml
global:
  scrape_interval: 15s          # How often to scrape targets
  evaluation_interval: 15s      # How often to evaluate rules
  scrape_timeout: 10s           # Per-scrape timeout
  external_labels:
    cluster: production
    region: us-east-1

# Alert rules and recording rules
rule_files:
  - "rules/*.yml"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Scrape configurations
scrape_configs:
  # Prometheus monitors itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Node Exporter on all hosts
  - job_name: "node"
    file_sd_configs:
      - files:
          - "targets/nodes.yml"
        refresh_interval: 5m

  # Application with relabeling
  - job_name: "my-app"
    metrics_path: /metrics
    scheme: https
    tls_config:
      insecure_skip_verify: false
    static_configs:
      - targets:
          - "app1.example.com:8443"
          - "app2.example.com:8443"
        labels:
          env: production
          team: backend

  # Kubernetes service discovery
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port, __meta_kubernetes_pod_ip]
        action: replace
        target_label: __address__
        regex: (.+);(.+)
        replacement: $2:$1

Prometheus 3.0 change

Prometheus v3 no longer automatically adds default ports (:80 / :443) to scrape targets based on scheme. Targets must include an explicit port. The scrape_classic_histograms option was also renamed to always_scrape_classic_histograms.

Key configuration blocks

global

Sets defaults for all scrape configs: scrape_interval, evaluation_interval, scrape_timeout. Also defines external_labels that are attached to all time series and alerts when communicating with external systems (federation, remote write, Alertmanager).

scrape_configs

A list of jobs. Each job defines how to discover and scrape a set of targets. Contains static_configs, file_sd_configs, kubernetes_sd_configs, and other SD mechanisms. Each job can override global settings.

file_sd_configs

Reads target lists from JSON or YAML files on disk. Files are re-read at refresh_interval. Useful when targets are managed by an external tool (Ansible, Terraform) that writes target files.

relabel_configs

Powerful label manipulation rules applied before scraping. Can drop targets, rewrite labels, extract metadata from service discovery, and set the scrape endpoint. Essential for Kubernetes SD.

Target file for file-based service discovery

# targets/nodes.yml
- targets:
    - "node1.example.com:9100"
    - "node2.example.com:9100"
    - "node3.example.com:9100"
  labels:
    datacenter: dc1
    env: production

- targets:
    - "staging-node1.example.com:9100"
  labels:
    datacenter: dc1
    env: staging

04 PromQL

PromQL (Prometheus Query Language) is a functional expression language for querying time-series data. It powers Grafana dashboards, alert rules, and recording rules. Understanding PromQL is essential for effective Prometheus usage.

Selectors and vector types

# Instant vector: current value of all time series matching the selector
http_requests_total{job="my-app", status="200"}

# Range vector: values over a time window (needed for rate/increase)
http_requests_total{job="my-app"}[5m]

# Label matchers
http_requests_total{method="GET"}                  # exact match
http_requests_total{method!="GET"}                 # not equal
http_requests_total{status=~"5.."}                 # regex match
http_requests_total{status!~"2.."}                 # negative regex

Essential functions and operators

# rate(): per-second rate of increase for counters (use with [range])
rate(http_requests_total{job="my-app"}[5m])

# irate(): instant rate based on last two samples (more spiky)
irate(http_requests_total{job="my-app"}[5m])

# increase(): total increase over a time range
increase(http_requests_total{job="my-app"}[1h])

# sum(): aggregate across label dimensions
sum(rate(http_requests_total[5m])) by (method, status)

# avg(): average across instances
avg(node_cpu_seconds_total{mode="idle"}) by (instance)

# histogram_quantile(): calculate percentiles from histograms
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# topk(): top N time series by value
topk(5, rate(http_requests_total[5m]))

# absent(): returns 1 if the metric does not exist (useful for alerts)
absent(up{job="my-app"})

# predict_linear(): linear regression prediction
predict_linear(node_filesystem_avail_bytes[6h], 24*3600) < 0

# double_exponential_smoothing(): smoothed prediction (renamed from holt_winters in v3.0)
double_exponential_smoothing(node_memory_MemAvailable_bytes[1h], 0.3, 0.7)
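
For intuition, the core of rate() and histogram_quantile() can be approximated in a few lines of Python. This sketch ignores what real PromQL handles carefully (counter resets, extrapolation at window edges, and the +Inf bucket edge case):

```python
def simple_rate(samples):
    """Per-second increase over [(ts_seconds, value), ...].
    No counter-reset handling or window-edge extrapolation."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

def simple_histogram_quantile(q, buckets):
    """buckets: cumulative [(upper_bound, count), ...] sorted by bound,
    ending with the +Inf bucket. Assumes the quantile lands in a finite bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # linear interpolation inside the first bucket whose
            # cumulative count reaches the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
```

PromQL's histogram_quantile uses the same within-bucket linear interpolation, which is why the choice of bucket boundaries determines quantile accuracy.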

Common dashboard queries

HTTP request rate

sum(rate(http_requests_total[5m]))
  by (method, status)

P99 response time

histogram_quantile(0.99,
  sum(rate(
    http_request_duration_seconds_bucket[5m]
  )) by (le)
)

CPU usage per instance

100 - (avg by (instance) (
  irate(node_cpu_seconds_total
    {mode="idle"}[5m])
) * 100)

Filesystem full prediction

predict_linear(
  node_filesystem_avail_bytes
    {mountpoint="/"}[6h],
  24 * 3600
) < 0

Recording rules

Recording rules precompute frequently used or expensive PromQL expressions and save them as new time series. This speeds up dashboards and prevents query timeouts at scale.

# rules/recording.yml
groups:
  - name: http_rules
    interval: 15s
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      - record: job:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
          )

      - record: instance:node_cpu_utilization:ratio
        expr: |
          1 - avg by (instance) (
            irate(node_cpu_seconds_total{mode="idle"}[5m])
          )

05 Exporters

Exporters are agents that translate metrics from third-party systems into the Prometheus exposition format. They run alongside the monitored system and expose a /metrics HTTP endpoint that Prometheus scrapes.

| Exporter | Purpose | Default Port |
| --- | --- | --- |
| Node Exporter | Linux hardware and OS metrics (CPU, memory, disk, network) | 9100 |
| Blackbox Exporter | Probe endpoints via HTTP, HTTPS, DNS, TCP, ICMP | 9115 |
| cAdvisor | Container resource usage and performance (CPU, memory, I/O per container) | 8080 |
| mysqld_exporter | MySQL server metrics (queries, connections, replication lag) | 9104 |
| postgres_exporter | PostgreSQL metrics (connections, locks, replication, query stats) | 9187 |
| redis_exporter | Redis server metrics (memory, keys, connections, commands) | 9121 |
| nginx_exporter | NGINX stub status metrics (connections, requests) | 9113 |
| process_exporter | Per-process metrics (CPU, memory, file descriptors) | 9256 |

The /metrics endpoint format

All exporters (and instrumented applications) expose metrics in the Prometheus exposition format — a simple text-based format:

# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1542
http_requests_total{method="GET",status="404"} 23
http_requests_total{method="POST",status="201"} 89

# HELP http_request_duration_seconds Request latency in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.01"} 320
http_request_duration_seconds_bucket{le="0.05"} 1100
http_request_duration_seconds_bucket{le="0.1"} 1350
http_request_duration_seconds_bucket{le="0.5"} 1500
http_request_duration_seconds_bucket{le="1"} 1540
http_request_duration_seconds_bucket{le="+Inf"} 1542
http_request_duration_seconds_sum 78.42
http_request_duration_seconds_count 1542

# HELP node_memory_MemAvailable_bytes Memory available in bytes
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 4.294967296e+09
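
This format is simple enough to parse by hand. The sketch below extracts (name, labels, value) triples from sample lines like those above; it deliberately ignores escaping, timestamps, and exemplars, which the official client-library parsers handle:

```python
import re

# metric name, optional {label="value",...} block, then the sample value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')

def parse_exposition(text):
    """Yield (name, labels_dict, value) for each sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE and blanks
            continue
        m = SAMPLE_RE.match(line)
        if not m:
            continue
        name, labelstr, value = m.groups()
        labels = {}
        if labelstr:
            for key, val in re.findall(r'(\w+)="([^"]*)"', labelstr):
                labels[key] = val
        yield name, labels, float(value)
```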

Writing a custom exporter (Python)

from prometheus_client import start_http_server, Counter, Gauge, Histogram
import time, random

# Define metrics
REQUEST_COUNT = Counter(
    'myapp_requests_total',
    'Total requests processed',
    ['method', 'endpoint']
)
QUEUE_SIZE = Gauge(
    'myapp_queue_size',
    'Current items in processing queue'
)
REQUEST_LATENCY = Histogram(
    'myapp_request_duration_seconds',
    'Request latency in seconds',
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)

if __name__ == '__main__':
    start_http_server(8000)  # Expose /metrics on port 8000
    while True:
        REQUEST_COUNT.labels(method='GET', endpoint='/api').inc()
        QUEUE_SIZE.set(random.randint(0, 100))
        REQUEST_LATENCY.observe(random.random())
        time.sleep(1)

Recommendation

Prefer instrumentation (embedding metrics directly in your application code) over exporters. Official client libraries exist for Go, Java/JVM, Python, Ruby, and Rust, with popular community-maintained libraries for .NET, C++, Node.js, PHP, Elixir, and more. Only use exporters for third-party software you cannot modify; instrumented applications produce more meaningful, application-specific metrics.

06 Service Discovery

In dynamic environments (Kubernetes, cloud, containers), targets come and go. Service discovery lets Prometheus automatically find scrape targets without manual configuration changes.

| SD Type | Use Case | Config Key |
| --- | --- | --- |
| Static | Fixed, known targets | static_configs |
| File-based | External tool writes target files | file_sd_configs |
| Kubernetes | Pods, services, endpoints, nodes in K8s | kubernetes_sd_configs |
| Consul | Services registered in Consul | consul_sd_configs |
| DNS | SRV or A records | dns_sd_configs |
| EC2 | AWS EC2 instances | ec2_sd_configs |
| GCE | Google Compute Engine instances | gce_sd_configs |
| Azure | Azure VMs | azure_sd_configs |

Kubernetes service discovery

scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - default
            - production
    relabel_configs:
      # Only scrape pods with annotation prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

      # Use custom metrics path from annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

      # Use custom port from annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

      # Map pod labels to Prometheus labels
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)

      # Add namespace and pod name labels
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

Consul service discovery

scrape_configs:
  - job_name: "consul-services"
    consul_sd_configs:
      - server: "consul.example.com:8500"
        services: []  # Discover all services
    relabel_configs:
      # Use Consul service name as job label
      - source_labels: [__meta_consul_service]
        target_label: job
      # Add datacenter label
      - source_labels: [__meta_consul_dc]
        target_label: datacenter
      # Only scrape services tagged with "prometheus"
      - source_labels: [__meta_consul_tags]
        regex: .*,prometheus,.*
        action: keep

Relabeling explained

Relabeling is the mechanism that transforms metadata labels from service discovery into the final labels Prometheus uses. The most important actions:

  • keep — only keep targets where the source label matches the regex
  • drop — discard targets where the source label matches
  • replace — set a target label to a regex-transformed value of source labels
  • labelmap — copy labels matching a regex to new label names
  • labeldrop — remove labels matching a regex from all targets
  • hashmod — used for sharding: assign targets to a specific Prometheus instance based on hash
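
The semantics of keep, drop, replace, and labelmap can be modeled compactly. A simplified Python sketch (real relabeling also supports separator, modulus for hashmod, and $1-style replacement syntax; Python's re uses \1 instead):

```python
import re

def relabel(labels, rules):
    """Apply simplified relabel rules to a target's label dict.
    Returns the transformed labels, or None if the target is dropped."""
    labels = dict(labels)
    for rule in rules:
        src = ";".join(labels.get(name, "") for name in rule.get("source_labels", []))
        regex = re.compile(rule.get("regex", "(.*)"))
        action = rule.get("action", "replace")
        m = regex.fullmatch(src)
        if action == "keep" and not m:
            return None                      # target discarded
        elif action == "drop" and m:
            return None
        elif action == "replace" and m:
            # Prometheus writes $1 in replacements; re.Match.expand uses \1
            labels[rule["target_label"]] = m.expand(rule.get("replacement", r"\1"))
        elif action == "labelmap":
            for name in list(labels):        # snapshot: we mutate while iterating
                mm = regex.fullmatch(name)
                if mm:
                    labels[mm.group(1)] = labels[name]
    return labels
```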

07 Alerting

Prometheus alerting is a two-step process. Prometheus evaluates alert rules (PromQL expressions) and sends firing alerts to Alertmanager, which handles deduplication, grouping, silencing, inhibition, and routing to notification channels.

Alert rules

# rules/alerts.yml
groups:
  - name: instance_alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

      - alert: HighCPUUsage
        expr: instance:node_cpu_utilization:ratio > 0.90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU utilization is above 90% for 10 minutes on {{ $labels.instance }}."

      - alert: DiskWillFillIn24Hours
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk space critical on {{ $labels.instance }}"
          description: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} will be full within 24 hours."

      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "More than 5% of requests are failing with 5xx errors."

      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
          ) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency on {{ $labels.job }}"
          description: "P99 request latency is above 1 second for 10 minutes."
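
The for: clause deserves emphasis: an alert is pending from the first evaluation where its expression returns results, and only becomes firing once it has stayed true for the whole duration; a single empty evaluation resets it. A toy model of that state machine:

```python
class Alert:
    """Toy pending -> firing tracker for one alert with a for: duration."""
    def __init__(self, for_seconds):
        self.for_seconds = for_seconds
        self.active_since = None  # timestamp the expression first became true

    def evaluate(self, now, expr_true):
        if not expr_true:
            self.active_since = None  # any empty evaluation resets the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"
```

Prometheus exposes these states in the synthetic ALERTS series, with an alertstate label of pending or firing.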

Alertmanager configuration

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/T00/B00/XXXX"
  smtp_smarthost: "smtp.example.com:587"
  smtp_from: "alertmanager@example.com"
  smtp_auth_username: "alertmanager"
  smtp_auth_password: "secret"

route:
  receiver: slack-default
  group_by: [alertname, cluster, service]
  group_wait: 30s          # Wait before sending first notification for a group
  group_interval: 5m       # Wait between notifications for the same group
  repeat_interval: 4h      # Re-notify after this interval if alert still firing

  routes:
    - matchers:
        - severity = critical
      receiver: pagerduty-critical
      continue: true        # Also send to next matching route

    - matchers:
        - severity = critical
      receiver: slack-critical

    - matchers:
        - severity = warning
      receiver: slack-warnings

receivers:
  - name: slack-default
    slack_configs:
      - channel: "#monitoring"
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        send_resolved: true

  - name: slack-critical
    slack_configs:
      - channel: "#incidents"
        title: 'CRITICAL: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        send_resolved: true

  - name: slack-warnings
    slack_configs:
      - channel: "#monitoring"
        send_resolved: true

  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: "your-pagerduty-events-v2-key"
        severity: critical

  - name: email-fallback
    email_configs:
      - to: "oncall@example.com"
        send_resolved: true

inhibit_rules:
  # If a critical alert is firing, suppress warnings for the same alertname
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: [alertname, cluster, service]

Key concepts

Grouping combines related alerts into a single notification (e.g., all InstanceDown alerts in a cluster). Inhibition suppresses less severe alerts when a more severe alert is already firing. Silences are temporary mutes for known issues or maintenance windows, managed via the Alertmanager web UI.
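
Inhibition logic reduces to: mute a target-matching alert whenever some source-matching alert is firing and the two agree on every label listed in equal. A simplified sketch (real Alertmanager matchers also support regex and negation):

```python
def is_inhibited(target, firing_alerts, rule):
    """rule: {'source': {...}, 'target': {...}, 'equal': [...]},
    with exact-match label matchers only."""
    def matches(alert, matchers):
        return all(alert.get(k) == v for k, v in matchers.items())

    if not matches(target, rule["target"]):
        return False
    return any(
        matches(src, rule["source"])
        and all(src.get(label) == target.get(label) for label in rule["equal"])
        for src in firing_alerts
    )
```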

08 Storage & Retention

Prometheus stores all data in its local TSDB on disk. The TSDB is highly optimized for time-series workloads but is limited to a single node. For long-term storage and high availability, you need external solutions.

Local TSDB configuration

# Start Prometheus with storage flags
prometheus \
  --config.file=prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/data \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=50GB \
  --web.enable-lifecycle \
  --web.enable-admin-api \
  --web.config.file=web-config.yml

Note

Only set --storage.tsdb.min-block-duration=2h and --storage.tsdb.max-block-duration=2h if using Thanos Sidecar (it requires compaction to be disabled). For standalone Prometheus, leave these at their defaults so the TSDB can compact blocks normally. The old --storage.tsdb.retention flag (without .time) was removed in Prometheus 3.0.

Time-based retention

--storage.tsdb.retention.time=30d — keep data for 30 days. Default is 15 days. Older blocks are deleted automatically. Set based on your operational needs and disk budget.

Size-based retention

--storage.tsdb.retention.size=50GB — cap total TSDB size. When the limit is reached, oldest blocks are removed first. Useful for preventing disk exhaustion.

Remote write / read

Prometheus supports remote write (send data to external storage) and remote read (query external storage). This enables long-term retention and global querying without keeping all data locally.

# prometheus.yml - remote write to Thanos/Mimir/Cortex
remote_write:
  - url: "http://mimir:9009/api/v1/push"
    queue_config:
      max_samples_per_send: 1000
      batch_send_deadline: 5s
      max_shards: 20
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "go_.*"
        action: drop           # Don't send Go runtime metrics

remote_read:
  - url: "http://mimir:9009/prometheus/api/v1/read"
    read_recent: false         # Only query remote for data beyond local retention

Long-term storage solutions

| Solution | Model | Storage Backend |
| --- | --- | --- |
| Thanos | Sidecar + store gateway. Uploads TSDB blocks to object storage. Global query layer. | S3, GCS, Azure Blob, MinIO |
| Grafana Mimir | Horizontally scalable, multi-tenant. Receives via remote write. Built on Cortex. | S3, GCS, Azure Blob, MinIO |
| Cortex | Original horizontally scalable Prometheus. Mimir is its successor. | S3, GCS, DynamoDB, Cassandra |
| VictoriaMetrics | High-performance TSDB, drop-in Prometheus replacement. Single binary or clustered. | Local disk, S3 (enterprise) |

Recommendation

For most new deployments, use Grafana Mimir (via remote write) or Thanos (via sidecar) for long-term storage. Keep local retention at 2–7 days for fast queries. Both solutions store data in object storage (S3/GCS), which is cheap and durable. Thanos is simpler if you want to keep the sidecar model; Mimir is better if you want a fully centralized, multi-tenant system.

09 Federation & Scaling

A single Prometheus server can handle millions of active time series, but eventually you need to scale horizontally. Prometheus offers federation for aggregation, and projects like Thanos and Mimir provide true horizontal scaling.

Hierarchical federation

A global Prometheus server scrapes aggregated metrics from lower-level Prometheus servers. Each lower-level server scrapes its own targets and runs recording rules to pre-aggregate data.

# Global Prometheus scraping federated endpoints
scrape_configs:
  - job_name: "federate-dc1"
    honor_labels: true
    metrics_path: /federate
    params:
      match[]:
        - '{job=~".+"}'                    # All job-level metrics
        - 'job:http_requests_total:rate5m' # Recording rules only
    static_configs:
      - targets:
          - "prometheus-dc1.example.com:9090"
        labels:
          datacenter: dc1

  - job_name: "federate-dc2"
    honor_labels: true
    metrics_path: /federate
    params:
      match[]:
        - '{job=~".+"}'
        - 'job:http_requests_total:rate5m'
    static_configs:
      - targets:
          - "prometheus-dc2.example.com:9090"
        labels:
          datacenter: dc2

Warning

Federation with match[]={__name__=~".+"} (all metrics) does not scale. Only federate pre-aggregated recording rules and essential metrics. Federating raw high-cardinality metrics will overwhelm the global instance. Use Thanos or Mimir for global querying of raw data.

Scaling strategies

Functional sharding (recommended)

Split scraping by job or team. One Prometheus instance scrapes infrastructure metrics, another scrapes application metrics. Each instance is independent. Use Thanos Querier to provide a unified query layer across all instances.

Hashmod sharding (advanced)

Use hashmod relabeling to split targets across N Prometheus instances. Each instance only scrapes targets where hash(labels) % N == shard_id. Requires coordination and a global query layer.
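
The assignment rule every shard must agree on can be sketched as hashing the joined source-label values and taking the remainder (hashlib.md5 here is illustrative; what matters is that all instances compute identical assignments):

```python
import hashlib

def shard_of(source_label_values, modulus):
    """Deterministic shard assignment: hash of joined values, mod N."""
    joined = ";".join(source_label_values).encode()
    return int.from_bytes(hashlib.md5(joined).digest()[-8:], "big") % modulus

def my_targets(targets, modulus, shard_id, key=lambda t: [t["__address__"]]):
    """The subset of targets this Prometheus shard should scrape."""
    return [t for t in targets if shard_of(key(t), modulus) == shard_id]
```

In a real config this is a hashmod relabel rule writing the hash into a temporary label, followed by a keep action matching the shard's own number.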

Thanos (recommended)

Run a Thanos sidecar alongside each Prometheus. Thanos Querier fans out queries to all sidecars and deduplicates results. Thanos Store Gateway provides access to long-term data in object storage. No need to federate.

Grafana Mimir (recommended)

All Prometheus instances remote-write to a centralized Mimir cluster. Mimir handles ingestion, storage, compaction, and querying. Multi-tenant, horizontally scalable. No sidecars needed.

Recording rules for performance

Recording rules are critical at scale. Pre-aggregating metrics reduces query-time cardinality and prevents expensive PromQL from timing out.

# rules/recording-aggregate.yml
groups:
  - name: aggregated_metrics
    interval: 30s
    rules:
      # Reduce cardinality by dropping instance label
      - record: job:http_requests_total:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, method, status)

      # Pre-compute error ratio per job
      - record: job:http_errors:ratio5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)

      # Aggregate node metrics for federation
      - record: cluster:node_cpu:avg_utilization
        expr: 1 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m]))

10 Docker Deployment

A production-ready Docker Compose setup for Prometheus with Node Exporter for host metrics and Alertmanager for alert routing.

Docker Compose stack

# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:v3.10.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/rules:/etc/prometheus/rules:ro
      - ./prometheus/targets:/etc/prometheus/targets:ro
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--storage.tsdb.retention.size=20GB"
      - "--web.enable-lifecycle"
      - "--web.enable-admin-api"
      - "--web.external-url=https://prometheus.example.com"
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.28.0
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    command:
      - "--config.file=/etc/alertmanager/alertmanager.yml"
      - "--storage.path=/alertmanager"
      - "--web.external-url=https://alertmanager.example.com"
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.9.0
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--path.rootfs=/rootfs"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
    pid: host
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.52.1
    container_name: cadvisor
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    devices:
      - /dev/kmsg
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:12.3.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      GF_SECURITY_ADMIN_PASSWORD: "changeme"
      GF_USERS_ALLOW_SIGN_UP: "false"
    networks:
      - monitoring

volumes:
  prometheus_data:
  alertmanager_data:
  grafana_data:

networks:
  monitoring:
    driver: bridge

Directory structure

monitoring/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   ├── rules/
│   │   ├── alerts.yml
│   │   └── recording.yml
│   └── targets/
│       └── nodes.yml
└── alertmanager/
    └── alertmanager.yml

Recommendation

Use --web.enable-lifecycle to allow configuration reloads via curl -X POST http://localhost:9090/-/reload without restarting the container. Mount config files as :ro (read-only) for security. Always pin image versions.

11 Kubernetes Monitoring

The standard way to deploy Prometheus on Kubernetes is the kube-prometheus-stack Helm chart (formerly prometheus-operator). It deploys Prometheus, Alertmanager, Grafana, Node Exporter, kube-state-metrics, and pre-configured dashboards and alert rules.

Installing with Helm

# Add the Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set grafana.adminPassword="changeme" \
  --set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.resources.requests.storage=5Gi

Prometheus Operator and CRDs

The Prometheus Operator introduces Custom Resource Definitions (CRDs) that let you configure monitoring declaratively using Kubernetes manifests:

CRD ServiceMonitor

Defines how to scrape a Kubernetes service. The Operator reads ServiceMonitors and automatically generates the Prometheus scrape config. Teams can create their own ServiceMonitors in their namespace.

CRD PodMonitor

Like ServiceMonitor but targets pods directly (without a Service object). Useful for sidecar proxies, DaemonSets, and pods that don't have a Service.

CRD PrometheusRule

Defines alerting and recording rules. The Operator syncs these to the Prometheus instance. Teams can manage their own rules without editing the central config.

CRD AlertmanagerConfig

Namespace-scoped alerting configuration. Lets teams define their own routes and receivers without access to the global Alertmanager config.

ServiceMonitor example

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: production
  labels:
    release: monitoring    # Must match the Helm release label selector
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: http-metrics   # Name of the port in the Service spec
      path: /metrics
      interval: 15s
  namespaceSelector:
    matchNames:
      - production
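A PrometheusRule follows the same pattern. The sketch below is illustrative: the alert name, expression, threshold, and runbook URL are examples, not part of the chart's defaults:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: production
  labels:
    release: monitoring    # Must match the Helm release label selector
spec:
  groups:
    - name: my-app
      rules:
        - alert: HighErrorRate
          # Ratio of 5xx responses to all responses over 5 minutes
          expr: |
            sum(rate(http_requests_total{app="my-app", status=~"5.."}[5m]))
              / sum(rate(http_requests_total{app="my-app"}[5m])) > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "my-app 5xx error rate above 5%"
            runbook_url: "https://wiki.example.com/runbooks/my-app"  # hypothetical URL
```

The Operator syncs this rule into the Prometheus instance automatically, so teams can ship alerts alongside their application manifests.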

Common Kubernetes metrics

| Metric | Source | Description |
|---|---|---|
| kube_pod_status_phase | kube-state-metrics | Pod lifecycle phase (Pending, Running, Failed) |
| kube_deployment_spec_replicas | kube-state-metrics | Desired replica count for a deployment |
| kube_deployment_status_replicas_available | kube-state-metrics | Currently available replicas |
| container_cpu_usage_seconds_total | cAdvisor/kubelet | Cumulative CPU time consumed per container |
| container_memory_working_set_bytes | cAdvisor/kubelet | Current memory usage (what K8s uses for OOM decisions) |
| node_cpu_seconds_total | Node Exporter | CPU time per mode (idle, system, user, iowait) |
| node_memory_MemAvailable_bytes | Node Exporter | Available memory on the node |
| kubelet_volume_stats_used_bytes | kubelet | PersistentVolume disk usage |
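A few illustrative PromQL queries built from these metrics (label matchers and thresholds are examples, not recommendations):

```
# Pods currently stuck in Pending
kube_pod_status_phase{phase="Pending"} == 1

# Per-container CPU usage in cores, averaged over 5 minutes
rate(container_cpu_usage_seconds_total{container!=""}[5m])

# Deployments with fewer available replicas than desired
kube_deployment_spec_replicas - kube_deployment_status_replicas_available > 0

# Node CPU utilization (fraction of time not idle) per instance
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```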
12

Production Checklist

  • Set retention limits — configure both --storage.tsdb.retention.time and --storage.tsdb.retention.size to prevent disk exhaustion.
  • Use recording rules — pre-aggregate expensive queries. Dashboards should query recording rules, not raw high-cardinality metrics.
  • Monitor Prometheus itself — scrape Prometheus's own /metrics. Alert on prometheus_tsdb_head_series growth, scrape failures (up == 0), and rule evaluation latency.
  • Set up long-term storage — use Thanos or Mimir for data beyond local retention. Object storage (S3/GCS) is cheap and durable.
  • Use persistent volumes — never run Prometheus with ephemeral storage. Use named Docker volumes or Kubernetes PVCs with fast SSDs.
  • Alert on the alert pipeline — test that Alertmanager actually delivers notifications. Use the always-firing Watchdog alert to verify the pipeline end to end.
  • Use service discovery — avoid static_configs in dynamic environments. Use file-based SD, Kubernetes SD, or Consul SD.
  • Control cardinality — high-cardinality labels (user IDs, request IDs, IP addresses) will blow up TSDB memory. Use metric_relabel_configs to drop or aggregate.
  • Enable lifecycle API — --web.enable-lifecycle allows hot config reloads without restarts.
  • Secure Prometheus — Prometheus supports built-in TLS and basic authentication via --web.config.file (bcrypt-hashed passwords). For advanced auth (OAuth2, RBAC), put it behind a reverse proxy (nginx, OAuth2 Proxy, or service mesh).
  • Run Alertmanager in HA — deploy at least 2 Alertmanager instances in a cluster. They use gossip protocol for deduplication. A single Alertmanager is a SPOF for your alerting pipeline.
  • Pin exporter versions — use specific image tags for all exporters. An exporter upgrade can change metric names and break dashboards and alerts.
  • Document alert runbooks — every alert should have a runbook_url annotation linking to a runbook with investigation and remediation steps.
  • Test alerts with promtool — validate configuration and unit test alert rules before deploying: promtool check config prometheus.yml and promtool test rules test.yml.
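A minimal promtool unit-test file might look like the sketch below, run with promtool test rules test.yml. It assumes a hypothetical InstanceDown alert exists in rules/alerts.yml with expr up == 0, for: 5m, and a severity: critical label:

```yaml
# test.yml — unit test for a hypothetical InstanceDown alert
rule_files:
  - rules/alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Simulate a node exporter target that is down for six minutes
      - series: 'up{job="node", instance="host1:9100"}'
        values: "0 0 0 0 0 0"
    alert_rule_test:
      - eval_time: 5m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              job: node
              instance: host1:9100
              severity: critical
```

promtool reports a diff between expected and actual alerts, so rule regressions are caught in CI before deployment.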