Grafana
Unified observability dashboards — metrics, logs, and traces in one place
Overview
Grafana is the open-source observability platform for visualizing metrics, logs, and traces. It does not store data itself — instead, it connects to data sources like Prometheus, Loki, Elasticsearch, InfluxDB, and dozens more, then lets you build dashboards, set alerts, and explore data through a unified interface. Grafana is the visualization layer of the modern observability stack.
Edition Grafana OSS
The fully open-source core (AGPL v3). Includes dashboards, alerting, data source plugins, provisioning, and the Explore view. Sufficient for most deployments. Self-hosted.
Edition Grafana Enterprise
Commercial self-hosted edition. Adds RBAC with fine-grained permissions, SAML/team sync, reporting (PDF export on schedule), enhanced data sources (Oracle, Splunk, ServiceNow), audit logging, and data source caching.
Edition Grafana Cloud
Fully managed SaaS. Includes hosted Grafana, Mimir (metrics), Loki (logs), Tempo (traces), synthetic monitoring, and on-call incident management. Generous free tier (10k metrics, 50 GB logs, 50 GB traces, 3 users, 14-day retention).
Core value Unified Observability
Grafana's core value proposition: one dashboard platform for all your data, regardless of where it lives. Correlate Prometheus metrics with Loki logs and Tempo traces in a single pane. No vendor lock-in — swap backends freely.
Why Grafana dominates
Grafana became the de facto visualization tool because it decouples the dashboard from the data store. Unlike vendor-specific UIs (CloudWatch console, Datadog, Kibana), Grafana lets you query any backend through a pluggable data source system. This means you can run Prometheus for infrastructure metrics, Loki for logs, PostgreSQL for business data, and Elasticsearch for full-text search — all visualized in a single dashboard.
Architecture
Grafana is a stateless web application written in Go (backend) and React/TypeScript (frontend). It needs a database for its own metadata (dashboards, users, alerts) but does not store observability data.
Key components
Core Grafana Server
Single Go binary. Serves the web UI, REST API, data source proxy, and alerting engine. Default port 3000. Stateless — all state lives in the database.
Core Database Backend
SQLite (default, file-based, fine for single instances), PostgreSQL (recommended for production and HA), or MySQL/MariaDB. Stores dashboards, users, organizations, alert definitions, and preferences.
Plugin Data Source Plugins
Grafana queries observability backends through plugins. Built-in: Prometheus, Loki, Elasticsearch, InfluxDB, PostgreSQL, MySQL, CloudWatch, Azure Monitor. Community plugins add hundreds more.
Core Alerting Engine
Since v8 (opt-in) and default since v9, Grafana uses unified alerting — a single alerting system that evaluates rules against any data source. Replaces the legacy per-panel alerting (removed entirely in v11). Manages alert rules, contact points, notification policies, and silences.
Deployment models
- Single instance — one Grafana server with SQLite. Simple, good for small teams. No HA.
- HA cluster — multiple Grafana instances behind a load balancer, sharing a PostgreSQL database. Requires session affinity or shared session store.
- Kubernetes — deploy via the official Helm chart (grafana/grafana). StatefulSet or Deployment with external PostgreSQL. ConfigMaps for provisioning.
- Grafana Cloud — fully managed. No infrastructure to operate.
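For the Kubernetes model, a minimal values sketch for the grafana/grafana Helm chart might look like the following. The hostnames, secret names, and ConfigMap names are placeholders, and field names should be checked against your chart version:

```yaml
# values.yaml — illustrative sketch for the grafana/grafana Helm chart
replicas: 2

env:
  GF_DATABASE_TYPE: postgres
  GF_DATABASE_HOST: postgres.observability.svc:5432   # placeholder host
  GF_DATABASE_NAME: grafana
  GF_DATABASE_USER: grafana

envValueFrom:
  GF_DATABASE_PASSWORD:
    secretKeyRef:
      name: grafana-db        # placeholder Secret name
      key: password

# Mount provisioning files from a ConfigMap
extraConfigmapMounts:
  - name: datasources
    configMap: grafana-datasources   # placeholder ConfigMap name
    mountPath: /etc/grafana/provisioning/datasources
    readOnly: true
```

Deploy with helm install grafana grafana/grafana -f values.yaml. With replicas above 1, point all instances at the same external PostgreSQL as described under High Availability.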
Data Sources
Data sources are the connection layer between Grafana and your observability backends. Grafana does not store time series or logs — it queries external systems in real time. Each data source plugin knows how to speak a specific query language (PromQL, LogQL, SQL, etc.) and translate the results into Grafana's internal data frame format.
| Data Source | Type | Query Language | Use Case |
|---|---|---|---|
| Prometheus | Metrics | PromQL | Infrastructure & application metrics. The most common Grafana data source. |
| Loki | Logs | LogQL | Log aggregation. Designed to pair with Grafana. Labels mirror Prometheus. |
| Elasticsearch | Logs / Search | Lucene / KQL | Full-text log search, APM data, document-oriented queries. |
| InfluxDB | Metrics | InfluxQL / SQL | Time series DB popular for IoT and custom metrics. Flux is deprecated; InfluxDB 3 uses SQL and InfluxQL. |
| PostgreSQL | SQL | SQL | Business metrics, application data, custom reporting. |
| MySQL | SQL | SQL | Same as PostgreSQL. Grafana supports time series and table queries. |
| CloudWatch | Metrics / Logs | CloudWatch Metrics Insights | AWS infrastructure monitoring. EC2, RDS, Lambda, ELB metrics. |
| Azure Monitor | Metrics / Logs | KQL | Azure resource metrics, Log Analytics, Application Insights. |
How data source plugins work
When a dashboard panel executes a query, Grafana's data source proxy forwards the request to the configured backend. The plugin handles authentication, query translation, and response parsing. Plugins can be:
- Built-in — shipped with Grafana (Prometheus, Loki, Elasticsearch, etc.)
- Grafana Labs plugins — maintained by Grafana Labs but not bundled; installed separately
- Community plugins — third-party, installed via grafana-cli plugins install or the Grafana UI
# Install a community plugin (example: Zabbix data source)
grafana-cli plugins install alexanderzobnin-zabbix-app
# Install a specific version
grafana-cli plugins install alexanderzobnin-zabbix-app 4.4.0
# List installed plugins
grafana-cli plugins ls
# Remove a plugin
grafana-cli plugins remove alexanderzobnin-zabbix-app
Adding a Prometheus data source via API
curl -X POST http://admin:admin@localhost:3000/api/datasources \
-H 'Content-Type: application/json' \
-d '{
"name": "Prometheus",
"type": "prometheus",
"url": "http://prometheus:9090",
"access": "proxy",
"isDefault": true
}'
Dashboards & Panels
Dashboards are the heart of Grafana. Each dashboard is a JSON document containing panels, variables, annotations, and layout information. Panels are individual visualizations — each one queries a data source and renders the result.
Panel types
Viz Time Series
The default panel. Line, bar, or point charts over time. Supports multiple queries, overrides, thresholds, and tooltip linking. Used for metrics like CPU, memory, request rate.
Viz Stat
Single large value with optional sparkline. Perfect for KPIs: total requests, error count, uptime percentage. Color changes based on thresholds.
Viz Gauge
Circular or bar gauge showing a value against min/max. Great for disk usage, CPU saturation, SLA percentages. Threshold colors show green/yellow/red zones.
Viz Table
Tabular data with sorting, filtering, and cell coloring. Useful for top-N queries, inventory lists, and SQL result sets.
Viz Logs
Displays log lines from Loki or Elasticsearch. Supports search, filtering, context view, and linking to trace IDs. The primary panel for log exploration.
Viz Heatmap
Color-coded matrix showing distribution over time. Ideal for latency histograms (e.g., request duration buckets from Prometheus histograms).
Variables and templating
Dashboard variables make dashboards reusable. A variable creates a dropdown at the top of the dashboard that dynamically changes all panel queries. Common patterns:
- Query variable — populated from a data source query, e.g., label_values(up, instance) to list all Prometheus instances
- Custom variable — static list of values, e.g., production, staging, development
- Interval variable — time intervals like 1m, 5m, 15m, 1h for rate() window control
- Ad hoc filters — let users add arbitrary label filters without editing queries
# Using variables in PromQL queries
rate(http_requests_total{instance=~"$instance", job="$job"}[${interval}])
# Multi-value variable with regex match
node_cpu_seconds_total{cpu=~"$cpu", mode!="idle"}
Annotations
Annotations overlay events on time series panels — deployments, incidents, config changes. They can be added manually, via the API, or queried from a data source. Annotations help correlate metric changes with real-world events.
# Create an annotation via the API (e.g., from a CI/CD pipeline)
curl -X POST http://admin:admin@localhost:3000/api/annotations \
-H 'Content-Type: application/json' \
-d '{
"dashboardUID": "abc123",
"time": 1700000000000,
"tags": ["deploy", "v2.1.0"],
"text": "Deployed version 2.1.0 to production"
}'
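The same endpoint also creates region annotations, which span a time window instead of marking a single point. A sketch of the payload (timestamps illustrative) — adding timeEnd is what turns the annotation into a region:

```json
{
  "dashboardUID": "abc123",
  "time": 1700000000000,
  "timeEnd": 1700000600000,
  "tags": ["maintenance"],
  "text": "Database failover window"
}
```

Regions render as shaded bands on time series panels, which is useful for maintenance windows and incident durations.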
Dashboard JSON model
Every dashboard is a JSON document. You can export it from the UI (Share → Export), store it in Git, and provision it automatically. The JSON includes panel definitions, queries, variables, layout coordinates, and metadata.
{
"dashboard": {
"title": "Node Exporter Full",
"uid": "rYdddlPWk",
"tags": ["linux", "prometheus"],
"timezone": "browser",
"panels": [
{
"type": "timeseries",
"title": "CPU Usage",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
"targets": [
{
"expr": "100 - (avg by(instance)(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "{{instance}}"
}
]
}
],
"templating": {
"list": [
{
"name": "instance",
"type": "query",
"query": "label_values(node_cpu_seconds_total, instance)"
}
]
}
}
}
Alerting
Grafana v8 introduced unified alerting as an opt-in feature, and v9 made it the default, replacing the legacy panel-based alerting (removed entirely in v11) with a centralized system that evaluates rules against any data source. Alert rules are evaluated by the Grafana server on a schedule, independent of dashboards, and notifications are routed through contact points and notification policies.
Core Alert Rules
Define a query, a condition (threshold, no-data, error), and an evaluation interval. Rules can query any data source. Grouped into rule groups within folders for organization.
Core Contact Points
Where notifications go: email, Slack, PagerDuty, OpsGenie, Microsoft Teams, webhooks, Alertmanager, and many more. Each contact point configures a specific integration.
Routing Notification Policies
A routing tree that matches alerts to contact points based on labels. Define default routes, label matchers, grouping, group wait/interval, and repeat intervals. Similar to Alertmanager's routing tree.
Control Silences & Mute Timings
Silences suppress notifications for a specific time window (e.g., during maintenance). Mute timings are recurring schedules (e.g., suppress non-critical alerts outside business hours).
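Mute timings can be provisioned from YAML like other alerting resources. A sketch that suppresses notifications outside business hours (the timing name is illustrative; verify the schema against your Grafana version):

```yaml
# provisioning/alerting/mute-timings.yml — illustrative
apiVersion: 1
muteTimes:
  - orgId: 1
    name: outside-business-hours
    time_intervals:
      # Weekday evenings and early mornings
      - times:
          - start_time: '18:00'
            end_time: '23:59'
          - start_time: '00:00'
            end_time: '08:00'
        weekdays: ['monday:friday']
      # All day on weekends
      - weekdays: ['saturday', 'sunday']
```

Reference it from a notification policy route with mute_time_intervals: ['outside-business-hours'] to apply it to non-critical alerts.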
Alert rule example (provisioned YAML)
# provisioning/alerting/rules.yml
apiVersion: 1
groups:
- orgId: 1
name: infrastructure-alerts
folder: Infrastructure
interval: 1m
rules:
- uid: high-cpu-alert
title: High CPU Usage
condition: C
data:
- refId: A
relativeTimeRange:
from: 300
to: 0
datasourceUid: prometheus-uid
model:
expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
intervalMs: 1000
maxDataPoints: 43200
- refId: B
relativeTimeRange:
from: 300
to: 0
datasourceUid: __expr__
model:
type: reduce
expression: A
reducer: last
- refId: C
datasourceUid: __expr__
model:
type: threshold
expression: B
conditions:
- evaluator:
type: gt
params: [85]
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: "CPU usage above 85% on {{ $labels.instance }}"
description: "CPU has been above 85% for 5 minutes."
Notification policy example
# provisioning/alerting/notification-policies.yml
apiVersion: 1
policies:
- orgId: 1
receiver: email-default
group_by: ['grafana_folder', 'alertname']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
- receiver: pagerduty-critical
matchers:
- severity = critical
continue: false
- receiver: slack-warnings
matchers:
- severity = warning
group_wait: 1m
repeat_interval: 1h
Grafana's built-in alerting uses an embedded Alertmanager for routing and grouping. For large-scale deployments already using Prometheus Alertmanager, you can configure Grafana to forward alerts to your existing Alertmanager instead. This avoids duplicating routing configuration.
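Forwarding to an external Alertmanager is configured by provisioning it as a data source and enabling it as a target for Grafana-managed alerts. A sketch (URL is a placeholder; the jsonData keys follow the Alertmanager data source options but should be verified for your version):

```yaml
# provisioning/datasources/alertmanager.yml — illustrative
apiVersion: 1
datasources:
  - name: Alertmanager
    type: alertmanager
    access: proxy
    url: http://alertmanager:9093
    jsonData:
      implementation: prometheus
      # Send Grafana-managed alerts to this Alertmanager
      # instead of the embedded one
      handleGrafanaManagedAlerts: true
```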
Provisioning
Provisioning lets you configure Grafana declaratively through YAML files instead of the UI. On startup, Grafana reads provisioning files and applies them. This enables GitOps workflows — store your entire Grafana configuration in Git and deploy it automatically.
Provisioning data sources
# /etc/grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: false
jsonData:
timeInterval: 15s
httpMethod: POST
- name: Loki
type: loki
access: proxy
url: http://loki:3100
editable: false
jsonData:
maxLines: 1000
derivedFields:
- datasourceUid: tempo-uid
matcherRegex: "traceID=(\\w+)"
name: TraceID
url: "$${__value.raw}"
- name: PostgreSQL
type: postgres
url: pg-host:5432
database: app_metrics
user: grafana_reader
secureJsonData:
password: "${PG_PASSWORD}"
jsonData:
sslmode: require
maxOpenConns: 10
connMaxLifetime: 14400
Provisioning dashboards
# /etc/grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
- name: default
orgId: 1
type: file
disableDeletion: false
updateIntervalSeconds: 30
allowUiUpdates: false
options:
path: /var/lib/grafana/dashboards
foldersFromFilesStructure: true
Place your dashboard JSON files in /var/lib/grafana/dashboards/. Organize them into subdirectories — when foldersFromFilesStructure is true, directory names become Grafana folders.
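For example, with foldersFromFilesStructure enabled, a layout like this (directory and file names are illustrative) produces matching Grafana folders:

```text
/var/lib/grafana/dashboards/
├── Infrastructure/
│   ├── node-exporter.json
│   └── docker.json
└── Applications/
    └── api-latency.json
```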
Provisioning contact points
# /etc/grafana/provisioning/alerting/contact-points.yml
apiVersion: 1
contactPoints:
- orgId: 1
name: slack-warnings
receivers:
- uid: slack-warn-1
type: slack
settings:
recipient: "#alerts-warning"
token: "${SLACK_BOT_TOKEN}"
title: |
{{ len .Alerts.Firing }} firing | {{ len .Alerts.Resolved }} resolved
text: |
{{ range .Alerts }}
*{{ .Labels.alertname }}* - {{ .Annotations.summary }}
{{ end }}
- orgId: 1
name: pagerduty-critical
receivers:
- uid: pd-critical-1
type: pagerduty
settings:
integrationKey: "${PD_INTEGRATION_KEY}"
severity: critical
GitOps workflow
Workflow Grafana as Code
The production pattern for managing Grafana configuration:
- Store all provisioning YAML and dashboard JSON files in a Git repository
- Review changes via pull requests — dashboard and alerting changes get the same review as application code
- Deploy via CI/CD pipeline that copies files into the Grafana container or mounts them as ConfigMaps in Kubernetes
- Enforce allowUiUpdates: false to prevent ad-hoc UI changes that drift from Git
- Tools: Grafonnet (Jsonnet library for generating dashboards), Grizzly (CLI for Grafana resources), Terraform Grafana provider
Set editable: false on provisioned data sources and allowUiUpdates: false on dashboard providers. This prevents users from making UI changes that will be overwritten on next restart. All changes go through Git.
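A CI pipeline can also lint dashboard JSON before it ever reaches Grafana. A minimal sketch of such a check (a hypothetical helper script, not part of Grafana) that verifies every file parses and carries the uid and title fields Grafana expects:

```python
"""Hypothetical CI helper: validate dashboard JSON files before deployment."""
import json
from pathlib import Path


def validate_dashboards(root: str) -> list[str]:
    """Walk root recursively and return human-readable problems found."""
    problems = []
    for path in sorted(Path(root).rglob("*.json")):
        try:
            dash = json.loads(path.read_text())
        except json.JSONDecodeError as exc:
            problems.append(f"{path.name}: invalid JSON ({exc})")
            continue
        # Provisioned files store the dashboard model at the top level
        for field in ("uid", "title"):
            if not dash.get(field):
                problems.append(f"{path.name}: missing required field '{field}'")
    return problems
```

Wire it into CI by failing the build whenever validate_dashboards("dashboards") returns a non-empty list.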
Authentication & RBAC
Grafana supports multiple authentication providers and a hierarchical permission model. In OSS, permissions are organization- and folder-level. Grafana Enterprise adds fine-grained RBAC with custom roles.
Authentication providers
| Provider | Edition | Notes |
|---|---|---|
| Built-in (username/password) | OSS | Local accounts stored in the Grafana database. Fine for small teams. |
| LDAP | OSS | Bind to Active Directory or OpenLDAP. Map LDAP groups to Grafana orgs/roles. |
| OAuth 2.0 / OIDC | OSS | GitHub, GitLab, Google, Azure AD, Okta, Keycloak, generic OIDC. Most common for SSO. |
| SAML | Enterprise | Enterprise SSO standard. IdP-initiated and SP-initiated flows. Attribute mapping for roles. |
| Grafana Cloud SSO | Cloud | Managed by Grafana Labs. Includes team sync with identity providers. |
OAuth example (Keycloak)
# grafana.ini
[auth.generic_oauth]
enabled = true
name = Keycloak
client_id = grafana
client_secret = ${KEYCLOAK_CLIENT_SECRET}
scopes = openid email profile
auth_url = https://keycloak.example.com/realms/corp/protocol/openid-connect/auth
token_url = https://keycloak.example.com/realms/corp/protocol/openid-connect/token
api_url = https://keycloak.example.com/realms/corp/protocol/openid-connect/userinfo
role_attribute_path = contains(groups[*], 'grafana-admins') && 'Admin' || contains(groups[*], 'grafana-editors') && 'Editor' || 'Viewer'
allow_sign_up = true
Organization and team model
- Organizations — top-level tenants. Each org has its own dashboards, data sources, and users. Users can belong to multiple orgs with different roles. Useful for multi-tenant setups.
- Teams — groups of users within an org. Assign folder and dashboard permissions to teams instead of individual users.
- Roles (OSS) — Viewer (read-only), Editor (create/edit dashboards), Admin (full org control). Assigned per-organization.
- Folder permissions — dashboards are organized in folders. Each folder can have specific viewer/editor permissions per user or team.
RBAC (Enterprise)
Grafana Enterprise adds fine-grained role-based access control with custom roles and permissions on specific resources:
- Create custom roles with granular permissions (e.g., can edit dashboards in folder X but only view data source Y)
- Permissions on data sources, folders, dashboards, service accounts, and alerting resources
- Role assignment via the API or LDAP/SAML attribute mapping
- Audit logging of all permission changes
Loki Integration
Grafana Loki is a log aggregation system designed to work seamlessly with Grafana. Unlike Elasticsearch, Loki does not index log contents — it only indexes metadata labels (like Prometheus). This makes it dramatically cheaper to operate at scale, at the cost of slower full-text search.
LogQL basics
LogQL is Loki's query language, inspired by PromQL. It has two types of queries: log queries (return log lines) and metric queries (return computed values from logs).
# Stream selector - required, selects log streams by label
{job="nginx", env="production"}
# Filter expressions - narrow down log lines
{job="nginx"} |= "error" # contains "error"
{job="nginx"} !~ "healthcheck|readiness" # does not match regex
{job="nginx"} |= "error" != "timeout" # contains "error" but not "timeout"
# Parser - extract fields from log lines
{job="nginx"} | json # parse JSON logs
{job="nginx"} | logfmt # parse logfmt logs
{job="nginx"} | pattern `<ip> - - <_> "<method> <uri> <_>" <status> <size>`
# Label filter after parsing
{job="nginx"} | json | status >= 500
# Metric queries - aggregate log data into numbers
rate({job="nginx"} |= "error" [5m]) # errors per second
sum by (status) (count_over_time({job="nginx"} | json [1h])) # count by status code
quantile_over_time(0.95, {job="nginx"} | json | unwrap response_time [5m]) # p95 latency
Deploying Loki with Grafana
# docker-compose.yml (Loki + Alloy + Grafana)
services:
loki:
image: grafana/loki:3.6.1
ports:
- "3100:3100"
command: -config.file=/etc/loki/local-config.yaml
volumes:
- loki-data:/loki
- ./loki-config.yml:/etc/loki/local-config.yaml
restart: unless-stopped
alloy:
image: grafana/alloy:1.6.0
volumes:
- /var/log:/var/log:ro
- ./alloy-config.alloy:/etc/alloy/config.alloy
command: run /etc/alloy/config.alloy
restart: unless-stopped
grafana:
image: grafana/grafana:12.3.0
ports:
- "3000:3000"
environment:
GF_SECURITY_ADMIN_PASSWORD: admin
volumes:
- grafana-data:/var/lib/grafana
restart: unless-stopped
volumes:
loki-data:
grafana-data:
Loki configuration (minimal production)
# loki-config.yml
auth_enabled: false
server:
http_listen_port: 3100
common:
path_prefix: /loki
storage:
filesystem:
chunks_directory: /loki/chunks
rules_directory: /loki/rules
replication_factor: 1
ring:
kvstore:
store: inmemory
schema_config:
configs:
- from: 2024-01-01
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
limits_config:
retention_period: 744h # 31 days
max_query_length: 721h
max_query_parallelism: 32
compactor:
working_directory: /loki/compactor
compaction_interval: 10m
retention_enabled: true
retention_delete_delay: 2h
Loki uses 10-20x less storage and compute than Elasticsearch for the same log volume because it does not build full-text indexes. The trade-off: grep-style queries on unindexed fields are slower. Loki excels when you filter by labels first, then search within a narrow stream. If you need millisecond full-text search across all logs, Elasticsearch is still the better choice.
High Availability
Running Grafana in HA ensures the dashboard platform stays available if a single instance fails. Since Grafana is stateless (all state lives in the database), horizontal scaling is straightforward.
Required Shared Database
All Grafana instances must connect to the same PostgreSQL or MySQL database. SQLite does not support concurrent access from multiple instances. PostgreSQL is the recommended backend for HA.
Required Load Balancer
Place a load balancer (Nginx, HAProxy, ALB) in front of Grafana instances. Use sticky sessions (session affinity) to route a user to the same backend, or configure a shared session store (Redis, database).
Alert HA Unified Alerting HA
When running multiple instances, alert rule evaluation must be coordinated to avoid duplicate notifications. Enable HA alerting with the ha_peers setting so instances form a gossip cluster (via the memberlist protocol) and deduplicate alert state.
Optional Shared File Storage
If using file-based provisioning or image rendering, all instances need access to the same files. Use NFS, EFS, or mount ConfigMaps in Kubernetes.
HA configuration
# grafana.ini - HA settings
[database]
type = postgres
host = pg-primary.example.com:5432
name = grafana
user = grafana
password = ${GF_DATABASE_PASSWORD}
ssl_mode = require
max_open_conn = 50
max_idle_conn = 25
conn_max_lifetime = 14400
[unified_alerting]
enabled = true
# HA: list all peer addresses (each Grafana instance)
ha_listen_address = "0.0.0.0:9094"
ha_peers = "grafana-0:9094,grafana-1:9094,grafana-2:9094"
ha_peer_timeout = 15s
[live]
# Required for HA - use Redis as pubsub for Grafana Live
ha_engine = redis
ha_engine_address = redis:6379
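A corresponding load balancer sketch in Nginx (hostnames are illustrative; ip_hash gives simple session affinity if you are not using a shared session store, and the WebSocket headers are needed for Grafana Live):

```nginx
# nginx.conf fragment — illustrative
upstream grafana {
    ip_hash;  # session affinity: same client -> same backend
    server grafana-0:3000;
    server grafana-1:3000;
    server grafana-2:3000;
}

server {
    listen 443 ssl;
    server_name grafana.example.com;

    location / {
        proxy_pass http://grafana;
        proxy_set_header Host $host;
        # Upgrade headers required for Grafana Live (WebSocket) connections
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```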
Docker Deployment
The most common way to self-host Grafana is with Docker Compose alongside Prometheus, Loki, and Grafana Alloy (the unified telemetry collector). This gives you a complete observability stack with metrics, logs, and dashboards.
Full observability stack (Docker Compose)
# docker-compose.yml - Grafana + Prometheus + Loki + Alloy
services:
grafana:
image: grafana/grafana:12.3.0
container_name: grafana
ports:
- "3000:3000"
environment:
GF_SECURITY_ADMIN_USER: admin
GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_ADMIN_PASSWORD}"
GF_USERS_ALLOW_SIGN_UP: "false"
GF_SERVER_ROOT_URL: "https://grafana.example.com"
GF_DATABASE_TYPE: postgres
GF_DATABASE_HOST: postgres:5432
GF_DATABASE_NAME: grafana
GF_DATABASE_USER: grafana
GF_DATABASE_PASSWORD: "${POSTGRES_PASSWORD}"
GF_DATABASE_SSL_MODE: disable
GF_INSTALL_PLUGINS: grafana-clock-panel
volumes:
- grafana-data:/var/lib/grafana
- ./provisioning:/etc/grafana/provisioning
- ./dashboards:/var/lib/grafana/dashboards
depends_on:
postgres:
condition: service_healthy
restart: unless-stopped
prometheus:
image: prom/prometheus:v3.10.0
container_name: prometheus
ports:
- "9090:9090"
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
- '--web.enable-lifecycle'
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
- prometheus-data:/prometheus
restart: unless-stopped
loki:
image: grafana/loki:3.6.1
container_name: loki
ports:
- "3100:3100"
command: -config.file=/etc/loki/local-config.yaml
volumes:
- ./loki-config.yml:/etc/loki/local-config.yaml:ro
- loki-data:/loki
restart: unless-stopped
alloy:
image: grafana/alloy:1.6.0
container_name: alloy
volumes:
- /var/log:/var/log:ro
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- ./alloy-config.alloy:/etc/alloy/config.alloy:ro
command: run /etc/alloy/config.alloy
restart: unless-stopped
node-exporter:
image: prom/node-exporter:v1.10.0
container_name: node-exporter
pid: host
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--path.rootfs=/rootfs'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
restart: unless-stopped
postgres:
image: postgres:16-alpine
container_name: grafana-postgres
environment:
POSTGRES_DB: grafana
POSTGRES_USER: grafana
POSTGRES_PASSWORD: "${POSTGRES_PASSWORD}"
volumes:
- postgres-data:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U grafana"]
interval: 5s
timeout: 3s
retries: 5
restart: unless-stopped
volumes:
grafana-data:
prometheus-data:
loki-data:
postgres-data:
Prometheus configuration
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'node-exporter'
static_configs:
- targets: ['node-exporter:9100']
- job_name: 'grafana'
static_configs:
- targets: ['grafana:3000']
Environment variables
Grafana supports configuration via environment variables. Every grafana.ini setting can be overridden with GF_<SECTION>_<KEY> in uppercase. This is the preferred approach for Docker deployments.
# .env file
GRAFANA_ADMIN_PASSWORD=changeme-strong-password
POSTGRES_PASSWORD=another-strong-password
GF_SMTP_ENABLED=true
GF_SMTP_HOST=smtp.example.com:587
GF_SMTP_USER=grafana@example.com
GF_SMTP_PASSWORD=smtp-password
GF_SMTP_FROM_ADDRESS=grafana@example.com
The default admin password is admin. Always set GF_SECURITY_ADMIN_PASSWORD via environment variable or secret. After first login, change the password immediately. In production, disable the built-in admin and use OAuth/LDAP instead.
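One way to keep the password out of environment variables entirely is Grafana's file-based secret convention: appending __FILE to a GF_ variable makes Grafana read the value from the named file. A Docker Compose sketch (secret file path is a placeholder):

```yaml
# docker-compose.yml fragment — illustrative
services:
  grafana:
    image: grafana/grafana:12.3.0
    environment:
      # The __FILE suffix tells Grafana to read the value from this path
      GF_SECURITY_ADMIN_PASSWORD__FILE: /run/secrets/grafana_admin_password
    secrets:
      - grafana_admin_password

secrets:
  grafana_admin_password:
    file: ./secrets/grafana_admin_password.txt   # placeholder path
```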
Best Practices
Dashboard design
- Follow the USE method — for every resource, show Utilization, Saturation, and Errors. This gives a complete picture of system health.
- Top-to-bottom, left-to-right flow — place the most critical metrics at the top. Start with high-level stats, drill down into details below.
- Use variables for everything — instance, job, namespace, environment. One dashboard should serve all environments.
- Set meaningful thresholds — color-code values (green/yellow/red) so operators can spot problems at a glance without reading numbers.
- Link dashboards together — use data links to drill from a high-level overview dashboard to detailed per-service dashboards.
- Avoid dashboard sprawl — 5 great dashboards beat 50 mediocre ones. Start with the Grafana community dashboards (grafana.com/dashboards) and customize.
Folder organization
Structure Recommended folder layout
- Infrastructure/ — node exporter, Docker, Kubernetes cluster dashboards
- Applications/ — per-service application metrics (request rate, latency, errors)
- Databases/ — PostgreSQL, MySQL, Redis, MongoDB dashboards
- Networking/ — Nginx, HAProxy, DNS, VPN dashboards
- Logs/ — Loki-based log exploration dashboards
- Business/ — KPIs, SLOs, revenue metrics from SQL data sources
Backup strategies
- Database backups — back up the PostgreSQL/MySQL database regularly. This captures dashboards, users, alert rules, and all Grafana metadata. Use pg_dump on a schedule.
- Export dashboards to Git — use the Grafana API or grizzly to export all dashboards as JSON and commit to a Git repository. This is your disaster recovery plan.
- Provisioning files — if you use provisioning (and you should), the YAML files in Git are your backup. A fresh Grafana instance with the same provisioning files will reconstruct everything.
# Export all dashboards via the API
for uid in $(curl -s -H "Authorization: Bearer $GRAFANA_API_KEY" \
http://localhost:3000/api/search | jq -r '.[].uid'); do
curl -s -H "Authorization: Bearer $GRAFANA_API_KEY" \
"http://localhost:3000/api/dashboards/uid/$uid" | \
jq '.dashboard' > "dashboards/${uid}.json"
done
# Backup Grafana PostgreSQL database
pg_dump -h pg-host -U grafana grafana | gzip > grafana-backup-$(date +%Y%m%d).sql.gz
Upgrade procedures
- Read the changelog — Grafana publishes detailed release notes with breaking changes. Always read them before upgrading.
- Backup the database first — Grafana runs database migrations on startup. If the migration fails, you need the backup.
- Upgrade one minor version at a time — do not skip major versions (e.g., 9.x → 11.x). Go 9.x → 10.x → 11.x to ensure migrations run correctly.
- Test in staging — deploy the new version against a copy of the production database first.
- Pin the image tag — use grafana/grafana:12.3.0, not :latest. Explicit upgrades only.
Production Checklist
- Use PostgreSQL for the database — never use SQLite in production or HA deployments. Set max_open_conn and conn_max_lifetime.
- Set a strong admin password — change the default admin/admin immediately. Better yet, disable built-in auth and use OAuth/LDAP.
- Enable HTTPS — terminate TLS at a reverse proxy (Nginx, Traefik, Caddy) or configure Grafana's built-in TLS. Set GF_SERVER_ROOT_URL to the HTTPS URL.
- Configure authentication — set up OAuth, LDAP, or SAML. Disable anonymous access (GF_AUTH_ANONYMOUS_ENABLED=false). Disable user signup (GF_USERS_ALLOW_SIGN_UP=false).
- Provision everything from files — data sources, dashboards, alerting rules, contact points. Store in Git. Set editable: false and allowUiUpdates: false.
- Set up alerting — configure contact points (Slack, PagerDuty, email). Define notification policies with proper routing. Test that alerts fire and resolve correctly.
- Configure log rotation — Grafana logs to stdout by default in Docker. Ensure your Docker log driver rotates logs. Set GF_LOG_MODE=console and GF_LOG_LEVEL=warn in production.
- Set resource limits — in Docker or Kubernetes, set memory and CPU limits. Grafana typically needs 256 MB–1 GB RAM depending on dashboard complexity and concurrent users.
- Enable HA if critical — run 2+ instances behind a load balancer with shared PostgreSQL. Configure ha_peers for unified alerting.
- Back up the database — schedule daily pg_dump backups. Test restores regularly. Keep 30 days of backups.
- Pin the Grafana version — use specific image tags. Upgrade deliberately after reading the changelog and testing in staging.
- Restrict plugin installation — only install plugins from trusted sources. Review community plugins before deploying. Set GF_PLUGINS_ALLOW_LOADING_UNSIGNED_PLUGINS only when necessary.
- Monitor Grafana itself — Grafana exposes Prometheus metrics on /metrics. Scrape it with Prometheus and create a meta-dashboard for Grafana health (API latency, active users, alerting evaluation time).
- Set up image rendering — for alert notifications with images and PDF reporting, deploy the grafana/grafana-image-renderer container alongside Grafana.