Prometheus
Pull-based monitoring, time-series database, and alerting toolkit
Overview
Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud in 2012. It joined the Cloud Native Computing Foundation (CNCF) in 2016 and graduated in 2018, becoming the second graduated project after Kubernetes. Prometheus 3.0, released in November 2024, is the current major version; it introduced native OTLP ingestion, UTF-8 metric and label name support, and a new UI, alongside continued work on native histograms. Prometheus collects metrics by pulling (scraping) HTTP endpoints at configured intervals, stores the data in a local time-series database (TSDB), and evaluates alert rules against that data.
Core Pull-Based Model
Prometheus scrapes targets rather than waiting for them to push data. Each target exposes a /metrics HTTP endpoint in a text-based format. This means Prometheus controls the collection rate, can detect when targets are down (failed scrape), and requires no client-side queuing or state.
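The core of the pull model can be sketched in a few lines of Python. This is an illustration of the idea, not Prometheus's implementation; the `fetch` callable and the `up` bookkeeping are assumptions for the example:

```python
def scrape_once(fetch):
    """One scrape attempt: returns (up, payload).

    `fetch` is any callable that performs the HTTP GET against the
    target's /metrics endpoint and returns the response body as text.
    A failed scrape (exception) yields up=0 and no samples, which is
    how Prometheus detects that a target is down.
    """
    try:
        body = fetch()
        return 1, body
    except Exception:
        return 0, None

# Simulated targets: one healthy, one down
def healthy():
    return 'http_requests_total{method="GET"} 42\n'

def down():
    raise ConnectionError("connection refused")

print(scrape_once(healthy))  # (1, 'http_requests_total{method="GET"} 42\n')
print(scrape_once(down))     # (0, None)
```

Because the server initiates every scrape, a down target is observable as a synthetic `up == 0` sample rather than as silence.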
Core Time-Series Database
All data is stored as time series — streams of timestamped values identified by a metric name and a set of key-value labels. For example: http_requests_total{method="GET", status="200"}. The TSDB is optimized for append-heavy workloads; high label cardinality, by contrast, is its main scaling cost.
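A series' identity is the metric name plus the full label set; changing any label value creates a new series. A toy sketch of that identity model (illustrative only):

```python
# A time series is identified by (metric name, label set);
# its data is a stream of (timestamp, value) samples.
series = {}

def append_sample(name, labels, timestamp, value):
    key = (name, frozenset(labels.items()))
    series.setdefault(key, []).append((timestamp, value))

append_sample("http_requests_total", {"method": "GET", "status": "200"}, 1000, 1542)
append_sample("http_requests_total", {"method": "GET", "status": "200"}, 1015, 1544)
append_sample("http_requests_total", {"method": "GET", "status": "404"}, 1000, 23)

# Two distinct label sets -> two distinct series
print(len(series))  # 2
```

This is also why high-cardinality labels (user IDs, request IDs) are dangerous: every new value mints a whole new series.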
Query PromQL
Prometheus ships with a powerful functional query language for selecting, aggregating, and transforming time-series data. PromQL powers dashboards (Grafana), alert rules, and recording rules.
Alert Alertmanager
A separate component that handles alert deduplication, grouping, routing, silencing, and notification delivery (email, Slack, PagerDuty, webhook). Prometheus evaluates alert rules and sends firing alerts to Alertmanager.
What Prometheus is NOT
- Not a log aggregator — use Loki, Elasticsearch, or Fluentd/Fluent Bit for logs. Prometheus handles numeric metrics only.
- Not for long-term storage by default — local TSDB retention is typically 15–30 days. For long-term storage, use Thanos or Mimir.
- Not 100% accurate — it is designed for operational monitoring and tolerates slight data loss (e.g. a missed scrape); do not rely on it for billing or other financial-grade precision.
Architecture
The Prometheus ecosystem consists of several components that work together. The Prometheus server is the central piece that scrapes, stores, and queries metrics. Supporting components handle service discovery, short-lived job metrics, alerting, and visualization.
Core Prometheus Server
The main binary. Handles scraping targets, storing time series in the local TSDB, evaluating recording and alerting rules, and serving the PromQL query API. Runs as a single stateful process.
Core TSDB
The embedded time-series database. Data is organized into 2-hour blocks on disk, compacted over time. Designed for high ingestion rates with efficient compression. Not replicated — a single Prometheus instance is a single point of failure.
Discovery Service Discovery
Prometheus dynamically discovers scrape targets via integrations with Kubernetes, Consul, DNS, EC2, file-based configs, and more. No need to hard-code every target IP.
Edge case Pushgateway
For short-lived batch jobs that cannot be scraped (they exit before Prometheus can pull). Jobs push metrics to the Pushgateway; Prometheus scrapes the gateway. Use sparingly — it breaks the pull model and can become a single point of failure.
Ecosystem Exporters
Third-party agents that expose metrics from systems that don't natively speak Prometheus. Node Exporter (Linux hardware/OS), Blackbox Exporter (probing), cAdvisor (containers), database exporters, and hundreds more.
Ecosystem Grafana
The standard visualization layer. Grafana queries Prometheus via PromQL and renders dashboards. Not part of Prometheus itself, but virtually every Prometheus deployment uses Grafana.
Configuration
Prometheus is configured via a YAML file, typically prometheus.yml. The configuration defines global settings, scrape targets, alerting rules, and remote storage endpoints.
Full prometheus.yml example
# prometheus.yml
global:
  scrape_interval: 15s       # How often to scrape targets
  evaluation_interval: 15s   # How often to evaluate rules
  scrape_timeout: 10s        # Per-scrape timeout
  external_labels:
    cluster: production
    region: us-east-1

# Alert rules and recording rules
rule_files:
  - "rules/*.yml"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Scrape configurations
scrape_configs:
  # Prometheus monitors itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Node Exporter on all hosts
  - job_name: "node"
    file_sd_configs:
      - files:
          - "targets/nodes.yml"
        refresh_interval: 5m

  # Application with relabeling
  - job_name: "my-app"
    metrics_path: /metrics
    scheme: https
    tls_config:
      insecure_skip_verify: false
    static_configs:
      - targets:
          - "app1.example.com:8443"
          - "app2.example.com:8443"
        labels:
          env: production
          team: backend

  # Kubernetes service discovery
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port, __meta_kubernetes_pod_ip]
        action: replace
        target_label: __address__
        regex: (.+);(.+)
        replacement: $2:$1
Prometheus v3 no longer automatically adds default ports (:80 / :443) to scrape targets based on scheme. Targets must include an explicit port. The scrape_classic_histograms option was also renamed to always_scrape_classic_histograms.
Key configuration blocks
Global global
Sets defaults for all scrape configs: scrape_interval, evaluation_interval, scrape_timeout. Also defines external_labels that are attached to all time series and alerts when communicating with external systems (federation, remote write, Alertmanager).
Scrape scrape_configs
A list of jobs. Each job defines how to discover and scrape a set of targets. Contains static_configs, file_sd_configs, kubernetes_sd_configs, and other SD mechanisms. Each job can override global settings.
Discovery file_sd_configs
Reads target lists from JSON or YAML files on disk. Files are re-read at refresh_interval. Useful when targets are managed by an external tool (Ansible, Terraform) that writes target files.
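An external inventory tool can emit such a target file with a few lines of Python. A sketch under assumptions (the host list and label values are made up):

```python
import json

# Hypothetical inventory; in practice this would come from Ansible,
# Terraform state, a CMDB, etc.
hosts = ["node1.example.com", "node2.example.com"]

target_groups = [{
    "targets": [f"{h}:9100" for h in hosts],
    "labels": {"datacenter": "dc1", "env": "production"},
}]

# file_sd accepts JSON (or YAML); Prometheus re-reads the file at refresh_interval
payload = json.dumps(target_groups, indent=2)
print(payload)
```

Write `payload` atomically to the path listed under `files:` (e.g. write to a temp file and rename) so Prometheus never reads a half-written file.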
Advanced relabel_configs
Powerful label manipulation rules applied before scraping. Can drop targets, rewrite labels, extract metadata from service discovery, and set the scrape endpoint. Essential for Kubernetes SD.
File-based service discovery target file
# targets/nodes.yml
- targets:
    - "node1.example.com:9100"
    - "node2.example.com:9100"
    - "node3.example.com:9100"
  labels:
    datacenter: dc1
    env: production
- targets:
    - "staging-node1.example.com:9100"
  labels:
    datacenter: dc1
    env: staging
PromQL
PromQL (Prometheus Query Language) is a functional expression language for querying time-series data. It powers Grafana dashboards, alert rules, and recording rules. Understanding PromQL is essential for effective Prometheus usage.
Selectors and vector types
# Instant vector: current value of all time series matching the selector
http_requests_total{job="my-app", status="200"}
# Range vector: values over a time window (needed for rate/increase)
http_requests_total{job="my-app"}[5m]
# Label matchers
http_requests_total{method="GET"} # exact match
http_requests_total{method!="GET"} # not equal
http_requests_total{status=~"5.."} # regex match
http_requests_total{status!~"2.."} # negative regex
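Note that PromQL regex matchers are fully anchored: `status=~"5.."` must match the entire label value, not a substring. Python's `re.fullmatch` mimics this behavior (illustrative sketch):

```python
import re

def promql_regex_match(pattern, value):
    # PromQL anchors the regex on both ends, equivalent to ^(?:pattern)$
    return re.fullmatch(pattern, value) is not None

print(promql_regex_match("5..", "503"))   # True
print(promql_regex_match("5..", "1503"))  # False: anchored, no substring match
print(promql_regex_match("2..", "200"))   # True
```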
Essential functions and operators
# rate(): per-second rate of increase for counters (use with [range])
rate(http_requests_total{job="my-app"}[5m])
# irate(): instant rate based on last two samples (more spiky)
irate(http_requests_total{job="my-app"}[5m])
# increase(): total increase over a time range
increase(http_requests_total{job="my-app"}[1h])
# sum(): aggregate across label dimensions
sum(rate(http_requests_total[5m])) by (method, status)
# avg(): average across instances
avg(node_cpu_seconds_total{mode="idle"}) by (instance)
# histogram_quantile(): calculate percentiles from histograms
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# topk(): top N time series by value
topk(5, rate(http_requests_total[5m]))
# absent(): returns 1 if the metric does not exist (useful for alerts)
absent(up{job="my-app"})
# predict_linear(): linear regression prediction
predict_linear(node_filesystem_avail_bytes[6h], 24*3600) < 0
# double_exponential_smoothing(): smoothed prediction (renamed from holt_winters in v3.0)
double_exponential_smoothing(node_memory_MemAvailable_bytes[1h], 0.3, 0.7)
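`rate()` is counter-reset aware: when a target restarts and its counter drops back toward zero, the drop is treated as a reset and the post-reset value is counted as new increase. A simplified sketch of that logic (it omits the boundary extrapolation the real `rate()` performs):

```python
def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first.

    Sums increases between consecutive samples, compensating for counter
    resets, then divides by the time span. Prometheus additionally
    extrapolates to the edges of the range window; this sketch does not.
    """
    increase = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if v1 >= v0:
            increase += v1 - v0
        else:
            increase += v1  # counter reset: count from zero
    span = samples[-1][0] - samples[0][0]
    return increase / span

# Counter climbs 100 -> 160, the process restarts, then it reaches 30:
# total increase is 60 + 10 + 20 = 90 over 300s
print(simple_rate([(0, 100), (100, 160), (200, 10), (300, 30)]))  # 0.3
```

This is also why `rate()` must always be applied to raw counters, never to values that can legitimately decrease (gauges).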
Common dashboard queries
HTTP Request rate
sum(rate(http_requests_total[5m]))
by (method, status)
Latency P99 response time
histogram_quantile(0.99,
sum(rate(
http_request_duration_seconds_bucket[5m]
)) by (le)
)
CPU Usage per instance
100 - (avg by (instance) (
irate(node_cpu_seconds_total
{mode="idle"}[5m])
) * 100)
Disk Filesystem full prediction
predict_linear(
node_filesystem_avail_bytes
{mountpoint="/"}[6h],
24 * 3600
) < 0
Recording rules
Recording rules precompute frequently used or expensive PromQL expressions and save them as new time series. This speeds up dashboards and prevents query timeouts at scale.
# rules/recording.yml
groups:
  - name: http_rules
    interval: 15s
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
          )
      - record: instance:node_cpu_utilization:ratio
        expr: |
          1 - avg by (instance) (
            irate(node_cpu_seconds_total{mode="idle"}[5m])
          )
Exporters
Exporters are agents that translate metrics from third-party systems into the Prometheus exposition format. They run alongside the monitored system and expose a /metrics HTTP endpoint that Prometheus scrapes.
| Exporter | Purpose | Default Port |
|---|---|---|
| Node Exporter | Linux hardware and OS metrics (CPU, memory, disk, network) | 9100 |
| Blackbox Exporter | Probe endpoints via HTTP, HTTPS, DNS, TCP, ICMP | 9115 |
| cAdvisor | Container resource usage and performance (CPU, memory, I/O per container) | 8080 |
| mysqld_exporter | MySQL server metrics (queries, connections, replication lag) | 9104 |
| postgres_exporter | PostgreSQL metrics (connections, locks, replication, query stats) | 9187 |
| redis_exporter | Redis server metrics (memory, keys, connections, commands) | 9121 |
| nginx_exporter | NGINX stub status metrics (connections, requests) | 9113 |
| process_exporter | Per-process metrics (CPU, memory, file descriptors) | 9256 |
The /metrics endpoint format
All exporters (and instrumented applications) expose metrics in the Prometheus exposition format — a simple text-based format:
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1542
http_requests_total{method="GET",status="404"} 23
http_requests_total{method="POST",status="201"} 89
# HELP http_request_duration_seconds Request latency in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.01"} 320
http_request_duration_seconds_bucket{le="0.05"} 1100
http_request_duration_seconds_bucket{le="0.1"} 1350
http_request_duration_seconds_bucket{le="0.5"} 1500
http_request_duration_seconds_bucket{le="1"} 1540
http_request_duration_seconds_bucket{le="+Inf"} 1542
http_request_duration_seconds_sum 78.42
http_request_duration_seconds_count 1542
# HELP node_memory_MemAvailable_bytes Memory available in bytes
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 4.294967296e+09
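The format is simple enough to parse by hand. A minimal sketch (it ignores escaping inside label values, which a real parser such as the one in `prometheus_client` must handle):

```python
def parse_exposition(text):
    """Parse 'name{labels} value' lines; returns {metric_string: float}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        metric, value = line.rsplit(None, 1)  # split off the trailing value
        samples[metric] = float(value)
    return samples

text = """\
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1542
http_requests_total{method="GET",status="404"} 23
"""
samples = parse_exposition(text)
print(samples['http_requests_total{method="GET",status="200"}'])  # 1542.0
```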
Writing a custom exporter (Python)
from prometheus_client import start_http_server, Counter, Gauge, Histogram
import time, random

# Define metrics
REQUEST_COUNT = Counter(
    'myapp_requests_total',
    'Total requests processed',
    ['method', 'endpoint']
)
QUEUE_SIZE = Gauge(
    'myapp_queue_size',
    'Current items in processing queue'
)
REQUEST_LATENCY = Histogram(
    'myapp_request_duration_seconds',
    'Request latency in seconds',
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)

if __name__ == '__main__':
    start_http_server(8000)  # Expose /metrics on port 8000
    while True:
        # Simulated workload; replace with real instrumentation points
        REQUEST_COUNT.labels(method='GET', endpoint='/api').inc()
        QUEUE_SIZE.set(random.randint(0, 100))
        REQUEST_LATENCY.observe(random.random())
        time.sleep(1)
Prefer instrumentation (embedding metrics directly in your application code) over exporters. Official client libraries exist for Go, Java/JVM, Python, Ruby, .NET, Rust, and C++, with popular community libraries for Node.js, PHP, Elixir, and more. Only use exporters for third-party software you cannot modify. Instrumented applications produce more meaningful, application-specific metrics.
Service Discovery
In dynamic environments (Kubernetes, cloud, containers), targets come and go. Service discovery lets Prometheus automatically find scrape targets without manual configuration changes.
| SD Type | Use Case | Config Key |
|---|---|---|
| Static | Fixed, known targets | static_configs |
| File-based | External tool writes target files | file_sd_configs |
| Kubernetes | Pods, services, endpoints, nodes in K8s | kubernetes_sd_configs |
| Consul | Services registered in Consul | consul_sd_configs |
| DNS | SRV or A records | dns_sd_configs |
| EC2 | AWS EC2 instances | ec2_sd_configs |
| GCE | Google Compute Engine instances | gce_sd_configs |
| Azure | Azure VMs | azure_sd_configs |
Kubernetes service discovery
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - default
            - production
    relabel_configs:
      # Only scrape pods with annotation prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use custom metrics path from annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Use custom port from annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Map pod labels to Prometheus labels
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      # Add namespace and pod name labels
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
Consul service discovery
scrape_configs:
  - job_name: "consul-services"
    consul_sd_configs:
      - server: "consul.example.com:8500"
        services: []   # Discover all services
    relabel_configs:
      # Use Consul service name as job label
      - source_labels: [__meta_consul_service]
        target_label: job
      # Add datacenter label
      - source_labels: [__meta_consul_dc]
        target_label: datacenter
      # Only scrape services tagged with "prometheus"
      - source_labels: [__meta_consul_tags]
        regex: .*,prometheus,.*
        action: keep
Relabeling explained
Relabeling is the mechanism that transforms metadata labels from service discovery into the final labels Prometheus uses. The most important actions:
- keep — keep only targets where the source label matches the regex
- drop — discard targets where the source label matches
- replace — set a target label to a regex-transformed value of the source labels
- labelmap — copy labels matching a regex to new label names
- labeldrop — remove labels matching a regex from all targets
- hashmod — used for sharding: assign targets to a specific Prometheus instance based on a hash
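The replace action's semantics can be reproduced with ordinary regular expressions: source label values are joined with `;`, fully matched against the (anchored) regex, and `$1`, `$2` refer to capture groups. A sketch using Python's `re` (with `\1` standing in for `$1`):

```python
import re

def relabel_replace(values, regex, replacement):
    """Mimic a relabel `replace`: join source label values with ';',
    fully match against the regex, and expand capture groups.
    Returns None if the regex does not match (target label unchanged)."""
    joined = ";".join(values)
    m = re.fullmatch(regex, joined)
    if m is None:
        return None
    return m.expand(replacement)

# The Kubernetes port-annotation rewrite:
# __address__ "10.0.0.5:8080" + annotation port "9102" -> "10.0.0.5:9102"
new_addr = relabel_replace(
    ["10.0.0.5:8080", "9102"],
    r"([^:]+)(?::\d+)?;(\d+)",
    r"\1:\2",
)
print(new_addr)  # 10.0.0.5:9102
```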
Alerting
Prometheus alerting is a two-step process. Prometheus evaluates alert rules (PromQL expressions) and sends firing alerts to Alertmanager, which handles deduplication, grouping, silencing, inhibition, and routing to notification channels.
Alert rules
# rules/alerts.yml
groups:
  - name: instance_alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

      - alert: HighCPUUsage
        expr: instance:node_cpu_utilization:ratio > 0.90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU utilization is above 90% for 10 minutes on {{ $labels.instance }}."

      - alert: DiskWillFillIn24Hours
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk space critical on {{ $labels.instance }}"
          description: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} will be full within 24 hours."

      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
            /
          sum(rate(http_requests_total[5m])) by (job)
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "More than 5% of requests are failing with 5xx errors."

      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
          ) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency on {{ $labels.job }}"
          description: "P99 request latency is above 1 second for 10 minutes."
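The for clause keeps an alert in the pending state until its expression has held continuously for the given duration; only then does it fire. A sketch of that state machine (simplified; real evaluation happens on the rule group's evaluation interval):

```python
def alert_state(eval_results, for_seconds, interval_seconds):
    """eval_results: list of booleans, one per evaluation tick.
    Returns the final state: 'inactive', 'pending', or 'firing'."""
    active_since = None
    state = "inactive"
    for tick, result in enumerate(eval_results):
        now = tick * interval_seconds
        if not result:
            active_since = None   # any false evaluation resets the clock
            state = "inactive"
        else:
            if active_since is None:
                active_since = now
            # fire once the condition has held for the full `for` duration
            state = "firing" if now - active_since >= for_seconds else "pending"
    return state

# True for 4 ticks 60s apart = 180s elapsed; with for: 5m it is still pending
print(alert_state([True, True, True, True], 300, 60))   # pending
print(alert_state([True] * 6, 300, 60))                 # firing
print(alert_state([True, False, True, True], 300, 60))  # pending (clock reset)
```

The reset behavior is why a flapping expression may never fire: each false evaluation restarts the `for` countdown.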
Alertmanager configuration
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/T00/B00/XXXX"
  smtp_smarthost: "smtp.example.com:587"
  smtp_from: "alertmanager@example.com"
  smtp_auth_username: "alertmanager"
  smtp_auth_password: "secret"

route:
  receiver: slack-default
  group_by: [alertname, cluster, service]
  group_wait: 30s        # Wait before sending first notification for a group
  group_interval: 5m     # Wait between notifications for the same group
  repeat_interval: 4h    # Re-notify after this interval if alert still firing
  routes:
    - matchers:
        - severity = critical
      receiver: pagerduty-critical
      continue: true     # Also send to next matching route
    - matchers:
        - severity = critical
      receiver: slack-critical
    - matchers:
        - severity = warning
      receiver: slack-warnings

receivers:
  - name: slack-default
    slack_configs:
      - channel: "#monitoring"
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        send_resolved: true
  - name: slack-critical
    slack_configs:
      - channel: "#incidents"
        title: 'CRITICAL: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        send_resolved: true
  - name: slack-warnings
    slack_configs:
      - channel: "#monitoring"
        send_resolved: true
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: "your-pagerduty-events-v2-key"
        severity: critical
  - name: email-fallback
    email_configs:
      - to: "oncall@example.com"
        send_resolved: true

inhibit_rules:
  # If a critical alert is firing, suppress warnings for the same alertname
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: [alertname, cluster, service]
Grouping combines related alerts into a single notification (e.g., all InstanceDown alerts in a cluster). Inhibition suppresses less severe alerts when a more severe alert is already firing. Silences are temporary mutes for known issues or maintenance windows, managed via the Alertmanager web UI.
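Grouping can be pictured as bucketing alerts by the values of the group_by labels, with one notification per bucket. An illustrative sketch (not Alertmanager's implementation):

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels, like
    Alertmanager's route-level grouping; one notification per bucket."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(l, "") for l in group_by)
        groups[key].append(alert)
    return dict(groups)

alerts = [
    {"labels": {"alertname": "InstanceDown", "cluster": "prod", "instance": "n1"}},
    {"labels": {"alertname": "InstanceDown", "cluster": "prod", "instance": "n2"}},
    {"labels": {"alertname": "HighErrorRate", "cluster": "prod", "instance": "n1"}},
]
groups = group_alerts(alerts, ["alertname", "cluster"])
# Two notifications: one covering both InstanceDown alerts, one for HighErrorRate
print(len(groups))  # 2
```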
Storage & Retention
Prometheus stores all data in its local TSDB on disk. The TSDB is highly optimized for time-series workloads but is limited to a single node. For long-term storage and high availability, you need external solutions.
Local TSDB configuration
# Start Prometheus with storage flags
prometheus \
  --config.file=prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/data \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=50GB \
  --web.enable-lifecycle \
  --web.enable-admin-api \
  --web.config.file=web-config.yml
Only set --storage.tsdb.min-block-duration=2h and --storage.tsdb.max-block-duration=2h if using Thanos Sidecar (it requires compaction to be disabled). For standalone Prometheus, leave these at their defaults so the TSDB can compact blocks normally. The old --storage.tsdb.retention flag (without .time) was removed in Prometheus 3.0.
Retention Time-based
--storage.tsdb.retention.time=30d — keep data for 30 days. Default is 15 days. Older blocks are deleted automatically. Set based on your operational needs and disk budget.
Retention Size-based
--storage.tsdb.retention.size=50GB — cap total TSDB size. When the limit is reached, oldest blocks are removed first. Useful for preventing disk exhaustion.
Remote write / read
Prometheus supports remote write (send data to external storage) and remote read (query external storage). This enables long-term retention and global querying without keeping all data locally.
# prometheus.yml - remote write to Thanos/Mimir/Cortex
remote_write:
  - url: "http://mimir:9009/api/v1/push"
    queue_config:
      max_samples_per_send: 1000
      batch_send_deadline: 5s
      max_shards: 20
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "go_.*"
        action: drop   # Don't send Go runtime metrics

remote_read:
  - url: "http://mimir:9009/prometheus/api/v1/read"
    read_recent: false   # Only query remote for data beyond local retention
Long-term storage solutions
| Solution | Model | Storage Backend |
|---|---|---|
| Thanos | Sidecar + store gateway. Uploads TSDB blocks to object storage. Global query layer. | S3, GCS, Azure Blob, MinIO |
| Grafana Mimir | Horizontally scalable, multi-tenant. Receives via remote write. Built on Cortex. | S3, GCS, Azure Blob, MinIO |
| Cortex | The original horizontally scalable, Prometheus-compatible backend; Mimir is its successor. | S3, GCS, DynamoDB, Cassandra |
| VictoriaMetrics | High-performance TSDB, drop-in Prometheus replacement. Single binary or clustered. | Local disk, S3 (enterprise) |
For most new deployments, use Grafana Mimir (via remote write) or Thanos (via sidecar) for long-term storage. Keep local retention at 2–7 days for fast queries. Both solutions store data in object storage (S3/GCS), which is cheap and durable. Thanos is simpler if you want to keep the sidecar model; Mimir is better if you want a fully centralized, multi-tenant system.
Federation & Scaling
A single Prometheus server can handle millions of active time series, but eventually you need to scale horizontally. Prometheus offers federation for aggregation, and projects like Thanos and Mimir provide true horizontal scaling.
Hierarchical federation
A global Prometheus server scrapes aggregated metrics from lower-level Prometheus servers. Each lower-level server scrapes its own targets and runs recording rules to pre-aggregate data.
# Global Prometheus scraping federated endpoints
scrape_configs:
  - job_name: "federate-dc1"
    honor_labels: true
    metrics_path: /federate
    params:
      match[]:
        - '{job=~".+"}'                     # All job-level metrics
        - 'job:http_requests_total:rate5m'  # Recording rules only
    static_configs:
      - targets:
          - "prometheus-dc1.example.com:9090"
        labels:
          datacenter: dc1

  - job_name: "federate-dc2"
    honor_labels: true
    metrics_path: /federate
    params:
      match[]:
        - '{job=~".+"}'
        - 'job:http_requests_total:rate5m'
    static_configs:
      - targets:
          - "prometheus-dc2.example.com:9090"
        labels:
          datacenter: dc2
Federation with match[]={__name__=~".+"} (all metrics) does not scale. Only federate pre-aggregated recording rules and essential metrics. Federating raw high-cardinality metrics will overwhelm the global instance. Use Thanos or Mimir for global querying of raw data.
Scaling strategies
Recommended Functional sharding
Split scraping by job or team. One Prometheus instance scrapes infrastructure metrics, another scrapes application metrics. Each instance is independent. Use Thanos Querier to provide a unified query layer across all instances.
Advanced Hashmod sharding
Use hashmod relabeling to split targets across N Prometheus instances. Each instance only scrapes targets where hash(labels) % N == shard_id. Requires coordination and a global query layer.
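The shard assignment can be sketched as follows. This is illustrative only: Prometheus's hashmod action uses an MD5-based hash of the joined source label values, but the exact byte handling below is an assumption, so do not expect identical shard assignments:

```python
import hashlib

def shard_of(source_values, modulus):
    """Hash the joined source label values and take the modulus,
    approximating the `hashmod` relabel action."""
    joined = ";".join(source_values)
    digest = hashlib.md5(joined.encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus

targets = [f"node{i}.example.com:9100" for i in range(6)]
n_shards = 3

# Instance 0 keeps only targets whose shard matches its own id;
# in Prometheus this is a hashmod rule followed by a keep rule.
shard_0 = [t for t in targets if shard_of([t], n_shards) == 0]
assert all(0 <= shard_of([t], n_shards) < n_shards for t in targets)
```

Because the hash is deterministic, every instance computes the same assignment independently; no coordination service is needed for scraping, only for querying.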
Recommended Thanos
Run a Thanos sidecar alongside each Prometheus. Thanos Querier fans out queries to all sidecars and deduplicates results. Thanos Store Gateway provides access to long-term data in object storage. No need to federate.
Recommended Grafana Mimir
All Prometheus instances remote-write to a centralized Mimir cluster. Mimir handles ingestion, storage, compaction, and querying. Multi-tenant, horizontally scalable. No sidecars needed.
Recording rules for performance
Recording rules are critical at scale. Pre-aggregating metrics reduces query-time cardinality and prevents expensive PromQL from timing out.
# rules/recording-aggregate.yml
groups:
  - name: aggregated_metrics
    interval: 30s
    rules:
      # Reduce cardinality by dropping the instance label
      - record: job:http_requests_total:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, method, status)
      # Pre-compute error ratio per job
      - record: job:http_errors:ratio5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
            /
          sum(rate(http_requests_total[5m])) by (job)
      # Aggregate node metrics for federation
      - record: cluster:node_cpu:avg_utilization
        expr: 1 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m]))
Docker Deployment
A production-ready Docker Compose setup for Prometheus with Node Exporter for host metrics and Alertmanager for alert routing.
Docker Compose stack
# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:v3.10.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/rules:/etc/prometheus/rules:ro
      - ./prometheus/targets:/etc/prometheus/targets:ro
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--storage.tsdb.retention.size=20GB"
      - "--web.enable-lifecycle"
      - "--web.enable-admin-api"
      - "--web.external-url=https://prometheus.example.com"
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.28.0
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    command:
      - "--config.file=/etc/alertmanager/alertmanager.yml"
      - "--storage.path=/alertmanager"
      - "--web.external-url=https://alertmanager.example.com"
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.9.0
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--path.rootfs=/rootfs"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
    pid: host
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.52.1
    container_name: cadvisor
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    devices:
      - /dev/kmsg
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:12.3.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      GF_SECURITY_ADMIN_PASSWORD: "changeme"
      GF_USERS_ALLOW_SIGN_UP: "false"
    networks:
      - monitoring

volumes:
  prometheus_data:
  alertmanager_data:
  grafana_data:

networks:
  monitoring:
    driver: bridge
Directory structure
monitoring/
├── docker-compose.yml
├── prometheus/
│ ├── prometheus.yml
│ ├── rules/
│ │ ├── alerts.yml
│ │ └── recording.yml
│ └── targets/
│ └── nodes.yml
└── alertmanager/
└── alertmanager.yml
Use --web.enable-lifecycle to allow configuration reloads via curl -X POST http://localhost:9090/-/reload without restarting the container. Mount config files as :ro (read-only) for security. Always pin image versions.
Kubernetes Monitoring
The standard way to deploy Prometheus on Kubernetes is the kube-prometheus-stack Helm chart (formerly prometheus-operator). It deploys Prometheus, Alertmanager, Grafana, Node Exporter, kube-state-metrics, and pre-configured dashboards and alert rules.
Installing with Helm
# Add the Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set grafana.adminPassword="changeme" \
  --set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.resources.requests.storage=5Gi
Prometheus Operator and CRDs
The Prometheus Operator introduces Custom Resource Definitions (CRDs) that let you configure monitoring declaratively using Kubernetes manifests:
CRD ServiceMonitor
Defines how to scrape a Kubernetes service. The Operator reads ServiceMonitors and automatically generates the Prometheus scrape config. Teams can create their own ServiceMonitors in their namespace.
CRD PodMonitor
Like ServiceMonitor but targets pods directly (without a Service object). Useful for sidecar proxies, DaemonSets, and pods that don't have a Service.
CRD PrometheusRule
Defines alerting and recording rules. The Operator syncs these to the Prometheus instance. Teams can manage their own rules without editing the central config.
CRD AlertmanagerConfig
Namespace-scoped alerting configuration. Lets teams define their own routes and receivers without access to the global Alertmanager config.
ServiceMonitor example
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: production
  labels:
    release: monitoring   # Must match the Helm release label selector
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: http-metrics  # Name of the port in the Service spec
      path: /metrics
      interval: 15s
  namespaceSelector:
    matchNames:
      - production
Common Kubernetes metrics
| Metric | Source | Description |
|---|---|---|
| `kube_pod_status_phase` | kube-state-metrics | Pod lifecycle phase (Pending, Running, Failed) |
| `kube_deployment_spec_replicas` | kube-state-metrics | Desired replica count for a deployment |
| `kube_deployment_status_replicas_available` | kube-state-metrics | Currently available replicas |
| `container_cpu_usage_seconds_total` | cAdvisor/kubelet | Cumulative CPU time consumed per container |
| `container_memory_working_set_bytes` | cAdvisor/kubelet | Current memory usage (what K8s uses for OOM decisions) |
| `node_cpu_seconds_total` | Node Exporter | CPU time per mode (idle, system, user, iowait) |
| `node_memory_MemAvailable_bytes` | Node Exporter | Available memory on the node |
| `kubelet_volume_stats_used_bytes` | kubelet | PersistentVolume disk usage |
Production Checklist
- Set retention limits — configure both `--storage.tsdb.retention.time` and `--storage.tsdb.retention.size` to prevent disk exhaustion.
- Use recording rules — pre-aggregate expensive queries. Dashboards should query recording rules, not raw high-cardinality metrics.
- Monitor Prometheus itself — scrape Prometheus's own `/metrics`. Alert on `prometheus_tsdb_head_series` growth, scrape failures (`up == 0`), and rule evaluation latency.
- Set up long-term storage — use Thanos or Mimir for data beyond local retention. Object storage (S3/GCS) is cheap and durable.
- Use persistent volumes — never run Prometheus with ephemeral storage. Use named Docker volumes or Kubernetes PVCs with fast SSDs.
- Alert on the alert pipeline — test that Alertmanager actually delivers notifications. Use a `Watchdog` alert (always firing) to verify the pipeline is working.
- Use service discovery — avoid `static_configs` in dynamic environments. Use file-based SD, Kubernetes SD, or Consul SD.
- Control cardinality — high-cardinality labels (user IDs, request IDs, IP addresses) will blow up TSDB memory. Use `metric_relabel_configs` to drop or aggregate them.
- Enable the lifecycle API — `--web.enable-lifecycle` allows hot config reloads without restarts.
- Secure Prometheus — Prometheus supports built-in TLS and basic authentication via `--web.config.file` (bcrypt-hashed passwords). For advanced auth (OAuth2, RBAC), put it behind a reverse proxy (nginx, OAuth2 Proxy, or a service mesh).
- Run Alertmanager in HA — deploy at least 2 Alertmanager instances in a cluster; they use a gossip protocol for deduplication. A single Alertmanager is a SPOF for your alerting pipeline.
- Pin exporter versions — use specific image tags for all exporters. An exporter upgrade can change metric names and break dashboards and alerts.
- Document alert runbooks — every alert should have a `runbook_url` annotation linking to a runbook with investigation and remediation steps.
- Test alerts with `promtool` — validate configuration and unit test alert rules before deploying: `promtool check config prometheus.yml` and `promtool test rules test.yml`.