Grafana
Unified observability dashboards — metrics, logs, and traces in one place
Overview
Grafana is the open-source observability platform for visualizing metrics, logs, and traces. It does not store data itself — instead, it connects to data sources like Prometheus, Loki, Elasticsearch, InfluxDB, and dozens more, then lets you build dashboards, set alerts, and explore data through a unified interface. Grafana is the visualization layer of the modern observability stack.
Edition Grafana OSS
The fully open-source core (AGPL v3). Includes dashboards, alerting, data source plugins, provisioning, and the Explore view. Sufficient for most deployments. Self-hosted.
Edition Grafana Enterprise
Commercial self-hosted edition. Adds RBAC with fine-grained permissions, SAML/team sync, reporting (PDF export on schedule), enhanced data sources (Oracle, Splunk, ServiceNow), audit logging, and data source caching.
Edition Grafana Cloud
Fully managed SaaS. Includes hosted Grafana, Mimir (metrics), Loki (logs), Tempo (traces), synthetic monitoring, and on-call incident management. Generous free tier (10k metrics, 50 GB logs, 50 GB traces, 3 users, 14-day retention).
Core value Unified Observability
Grafana's core value proposition: one dashboard platform for all your data, regardless of where it lives. Correlate Prometheus metrics with Loki logs and Tempo traces in a single pane. No vendor lock-in — swap backends freely.
Why Grafana dominates
Grafana became the de facto visualization tool because it decouples the dashboard from the data store. Unlike vendor-specific UIs (CloudWatch console, Datadog, Kibana), Grafana lets you query any backend through a pluggable data source system. This means you can run Prometheus for infrastructure metrics, Loki for logs, PostgreSQL for business data, and Elasticsearch for full-text search — all visualized in a single dashboard.
Architecture
Grafana is a stateless web application written in Go (backend) and React/TypeScript (frontend). It needs a database for its own metadata (dashboards, users, alerts) but does not store observability data.
Key components
Core Grafana Server
Single Go binary. Serves the web UI, REST API, data source proxy, and alerting engine. Default port 3000. Stateless — all state lives in the database.
Core Database Backend
SQLite (default, file-based, fine for single instances), PostgreSQL (recommended for production and HA), or MySQL/MariaDB. Stores dashboards, users, organizations, alert definitions, and preferences.
Plugin Data Source Plugins
Grafana queries observability backends through plugins. Built-in: Prometheus, Loki, Elasticsearch, InfluxDB, PostgreSQL, MySQL, CloudWatch, Azure Monitor. Community plugins add hundreds more.
Core Alerting Engine
Since v8 (opt-in) and default since v9, Grafana uses unified alerting — a single alerting system that evaluates rules against any data source. Replaces the legacy per-panel alerting (removed entirely in v11). Manages alert rules, contact points, notification policies, and silences.
Deployment models
- Single instance — one Grafana server with SQLite. Simple, good for small teams. No HA.
- HA cluster — multiple Grafana instances behind a load balancer, sharing a PostgreSQL database. Requires session affinity or shared session store.
- Kubernetes — deploy via the official Helm chart (grafana/grafana). StatefulSet or Deployment with external PostgreSQL. ConfigMaps for provisioning.
- Grafana Cloud — fully managed. No infrastructure to operate.
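For the Kubernetes model, a minimal values sketch for the grafana/grafana Helm chart might look like the following. The hostnames, secret names, and ConfigMap names are placeholders, and field names should be checked against your chart version:

```yaml
# values.yaml — illustrative sketch for the grafana/grafana Helm chart
replicas: 2

env:
  GF_DATABASE_TYPE: postgres
  GF_DATABASE_HOST: postgres.observability.svc:5432   # placeholder host
  GF_DATABASE_NAME: grafana
  GF_DATABASE_USER: grafana

envValueFrom:
  GF_DATABASE_PASSWORD:
    secretKeyRef:
      name: grafana-db        # placeholder Secret name
      key: password

# Mount provisioning files from a ConfigMap
extraConfigmapMounts:
  - name: datasources
    configMap: grafana-datasources   # placeholder ConfigMap name
    mountPath: /etc/grafana/provisioning/datasources
    readOnly: true
```

Deploy with helm install grafana grafana/grafana -f values.yaml. With replicas above 1, point all instances at the same external PostgreSQL as described under High Availability.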
Data Sources
Data sources are the connection layer between Grafana and your observability backends. Grafana does not store time series or logs — it queries external systems in real time. Each data source plugin knows how to speak a specific query language (PromQL, LogQL, SQL, etc.) and translate the results into Grafana's internal data frame format.
| Data Source | Type | Query Language | Use Case |
|---|---|---|---|
| Prometheus | Metrics | PromQL | Infrastructure & application metrics. The most common Grafana data source. |
| Loki | Logs | LogQL | Log aggregation. Designed to pair with Grafana. Labels mirror Prometheus. |
| Elasticsearch | Logs / Search | Lucene / KQL | Full-text log search, APM data, document-oriented queries. |
| InfluxDB | Metrics | InfluxQL / SQL | Time series DB popular for IoT and custom metrics. Flux is deprecated; InfluxDB 3 uses SQL and InfluxQL. |
| PostgreSQL | SQL | SQL | Business metrics, application data, custom reporting. |
| MySQL | SQL | SQL | Same as PostgreSQL. Grafana supports time series and table queries. |
| CloudWatch | Metrics / Logs | CloudWatch Metrics Insights | AWS infrastructure monitoring. EC2, RDS, Lambda, ELB metrics. |
| Azure Monitor | Metrics / Logs | KQL | Azure resource metrics, Log Analytics, Application Insights. |
How data source plugins work
When a dashboard panel executes a query, Grafana's data source proxy forwards the request to the configured backend. The plugin handles authentication, query translation, and response parsing. Plugins can be:
- Built-in — shipped with Grafana (Prometheus, Loki, Elasticsearch, etc.)
- Grafana Labs plugins — maintained by Grafana Labs but not bundled; installed separately
- Community plugins — third-party, installed via grafana-cli plugins install or the Grafana UI
# Install a community plugin (example: Zabbix data source)
grafana-cli plugins install alexanderzobnin-zabbix-app
# Install a specific version
grafana-cli plugins install alexanderzobnin-zabbix-app 4.4.0
# List installed plugins
grafana-cli plugins ls
# Remove a plugin
grafana-cli plugins remove alexanderzobnin-zabbix-app
Adding a Prometheus data source via API
curl -X POST http://admin:admin@localhost:3000/api/datasources \
-H 'Content-Type: application/json' \
-d '{
"name": "Prometheus",
"type": "prometheus",
"url": "http://prometheus:9090",
"access": "proxy",
"isDefault": true
}'
Dashboards & Panels
Dashboards are the heart of Grafana. Each dashboard is a JSON document containing panels, variables, annotations, and layout information. Panels are individual visualizations — each one queries a data source and renders the result.
Panel types
Viz Time Series
The default panel. Line, bar, or point charts over time. Supports multiple queries, overrides, thresholds, and tooltip linking. Used for metrics like CPU, memory, request rate.
Viz Stat
Single large value with optional sparkline. Perfect for KPIs: total requests, error count, uptime percentage. Color changes based on thresholds.
Viz Gauge
Circular or bar gauge showing a value against min/max. Great for disk usage, CPU saturation, SLA percentages. Threshold colors show green/yellow/red zones.
Viz Table
Tabular data with sorting, filtering, and cell coloring. Useful for top-N queries, inventory lists, and SQL result sets.
Viz Logs
Displays log lines from Loki or Elasticsearch. Supports search, filtering, context view, and linking to trace IDs. The primary panel for log exploration.
Viz Heatmap
Color-coded matrix showing distribution over time. Ideal for latency histograms (e.g., request duration buckets from Prometheus histograms).
Variables and templating
Dashboard variables make dashboards reusable. A variable creates a dropdown at the top of the dashboard that dynamically changes all panel queries. Common patterns:
- Query variable — populated from a data source query, e.g., label_values(up, instance) to list all Prometheus instances
- Custom variable — static list of values, e.g., production, staging, development
- Interval variable — time intervals like 1m, 5m, 15m, 1h for rate() window control
- Ad hoc filters — let users add arbitrary label filters without editing queries
# Using variables in PromQL queries
rate(http_requests_total{instance=~"$instance", job="$job"}[${interval}])
# Multi-value variable with regex match
node_cpu_seconds_total{cpu=~"$cpu", mode!="idle"}
Annotations
Annotations overlay events on time series panels — deployments, incidents, config changes. They can be added manually, via the API, or queried from a data source. Annotations help correlate metric changes with real-world events.
# Create an annotation via the API (e.g., from a CI/CD pipeline)
curl -X POST http://admin:admin@localhost:3000/api/annotations \
-H 'Content-Type: application/json' \
-d '{
"dashboardUID": "abc123",
"time": 1700000000000,
"tags": ["deploy", "v2.1.0"],
"text": "Deployed version 2.1.0 to production"
}'
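The same endpoint also creates region annotations, which span a time window instead of marking a single point. A sketch of the payload (timestamps illustrative) — adding timeEnd is what turns the annotation into a region:

```json
{
  "dashboardUID": "abc123",
  "time": 1700000000000,
  "timeEnd": 1700000600000,
  "tags": ["maintenance"],
  "text": "Database failover window"
}
```

Regions render as shaded bands on time series panels, which is useful for maintenance windows and incident durations.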
Dashboard JSON model
Every dashboard is a JSON document. You can export it from the UI (Share → Export), store it in Git, and provision it automatically. The JSON includes panel definitions, queries, variables, layout coordinates, and metadata.
{
"dashboard": {
"title": "Node Exporter Full",
"uid": "rYdddlPWk",
"tags": ["linux", "prometheus"],
"timezone": "browser",
"panels": [
{
"type": "timeseries",
"title": "CPU Usage",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
"targets": [
{
"expr": "100 - (avg by(instance)(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "{{instance}}"
}
]
}
],
"templating": {
"list": [
{
"name": "instance",
"type": "query",
"query": "label_values(node_cpu_seconds_total, instance)"
}
]
}
}
}
Alerting
Grafana v8 introduced unified alerting as an opt-in feature, and v9 made it the default, replacing the legacy panel-based alerting (removed entirely in v11) with a centralized system that evaluates rules against any data source. Alert rules are evaluated by the Grafana server on a schedule, independent of dashboards, and notifications are routed through contact points and notification policies.
Core Alert Rules
Define a query, a condition (threshold, no-data, error), and an evaluation interval. Rules can query any data source. Grouped into rule groups within folders for organization.
Core Contact Points
Where notifications go: email, Slack, PagerDuty, OpsGenie, Microsoft Teams, webhooks, Alertmanager, and many more. Each contact point configures a specific integration.
Routing Notification Policies
A routing tree that matches alerts to contact points based on labels. Define default routes, label matchers, grouping, group wait/interval, and repeat intervals. Similar to Alertmanager's routing tree.
Control Silences & Mute Timings
Silences suppress notifications for a specific time window (e.g., during maintenance). Mute timings are recurring schedules (e.g., suppress non-critical alerts outside business hours).
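Mute timings can be provisioned from YAML like other alerting resources. A sketch that suppresses notifications outside business hours (the timing name is illustrative; verify the schema against your Grafana version):

```yaml
# provisioning/alerting/mute-timings.yml — illustrative
apiVersion: 1
muteTimes:
  - orgId: 1
    name: outside-business-hours
    time_intervals:
      # Weekday evenings and early mornings
      - times:
          - start_time: '18:00'
            end_time: '23:59'
          - start_time: '00:00'
            end_time: '08:00'
        weekdays: ['monday:friday']
      # All day on weekends
      - weekdays: ['saturday', 'sunday']
```

Reference it from a notification policy route with mute_time_intervals: ['outside-business-hours'] to apply it to non-critical alerts.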
Alert rule example (provisioned YAML)
# provisioning/alerting/rules.yml
apiVersion: 1
groups:
- orgId: 1
name: infrastructure-alerts
folder: Infrastructure
interval: 1m
rules:
- uid: high-cpu-alert
title: High CPU Usage
condition: C
data:
- refId: A
relativeTimeRange:
from: 300
to: 0
datasourceUid: prometheus-uid
model:
expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
intervalMs: 1000
maxDataPoints: 43200
- refId: B
relativeTimeRange:
from: 300
to: 0
datasourceUid: __expr__
model:
type: reduce
expression: A
reducer: last
- refId: C
datasourceUid: __expr__
model:
type: threshold
expression: B
conditions:
- evaluator:
type: gt
params: [85]
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: "CPU usage above 85% on {{ $labels.instance }}"
description: "CPU has been above 85% for 5 minutes."
Notification policy example
# provisioning/alerting/notification-policies.yml
apiVersion: 1
policies:
- orgId: 1
receiver: email-default
group_by: ['grafana_folder', 'alertname']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
- receiver: pagerduty-critical
matchers:
- severity = critical
continue: false
- receiver: slack-warnings
matchers:
- severity = warning
group_wait: 1m
repeat_interval: 1h
Grafana's built-in alerting uses an embedded Alertmanager for routing and grouping. For large-scale deployments already using Prometheus Alertmanager, you can configure Grafana to forward alerts to your existing Alertmanager instead. This avoids duplicating routing configuration.
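Forwarding to an external Alertmanager is configured by provisioning it as a data source and enabling it as a target for Grafana-managed alerts. A sketch (URL is a placeholder; the jsonData keys follow the Alertmanager data source options but should be verified for your version):

```yaml
# provisioning/datasources/alertmanager.yml — illustrative
apiVersion: 1
datasources:
  - name: Alertmanager
    type: alertmanager
    access: proxy
    url: http://alertmanager:9093
    jsonData:
      implementation: prometheus
      # Send Grafana-managed alerts to this Alertmanager
      # instead of the embedded one
      handleGrafanaManagedAlerts: true
```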
Provisioning
Provisioning lets you configure Grafana declaratively through YAML files instead of the UI. On startup, Grafana reads provisioning files and applies them. This enables GitOps workflows — store your entire Grafana configuration in Git and deploy it automatically.
Provisioning data sources
# /etc/grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: false
jsonData:
timeInterval: 15s
httpMethod: POST
- name: Loki
type: loki
access: proxy
url: http://loki:3100
editable: false
jsonData:
maxLines: 1000
derivedFields:
- datasourceUid: tempo-uid
matcherRegex: "traceID=(\\w+)"
name: TraceID
url: "$${__value.raw}"
- name: PostgreSQL
type: postgres
url: pg-host:5432
database: app_metrics
user: grafana_reader
secureJsonData:
password: "${PG_PASSWORD}"
jsonData:
sslmode: require
maxOpenConns: 10
connMaxLifetime: 14400
Provisioning dashboards
# /etc/grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
- name: default
orgId: 1
type: file
disableDeletion: false
updateIntervalSeconds: 30
allowUiUpdates: false
options:
path: /var/lib/grafana/dashboards
foldersFromFilesStructure: true
Place your dashboard JSON files in /var/lib/grafana/dashboards/. Organize them into subdirectories — when foldersFromFilesStructure is true, directory names become Grafana folders.
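For example, with foldersFromFilesStructure enabled, a layout like this (directory and file names are illustrative) produces matching Grafana folders:

```text
/var/lib/grafana/dashboards/
├── Infrastructure/
│   ├── node-exporter.json
│   └── docker.json
└── Applications/
    └── api-latency.json
```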
Provisioning contact points
# /etc/grafana/provisioning/alerting/contact-points.yml
apiVersion: 1
contactPoints:
- orgId: 1
name: slack-warnings
receivers:
- uid: slack-warn-1
type: slack
settings:
recipient: "#alerts-warning"
token: "${SLACK_BOT_TOKEN}"
title: |
{{ len .Alerts.Firing }} firing | {{ len .Alerts.Resolved }} resolved
text: |
{{ range .Alerts }}
*{{ .Labels.alertname }}* - {{ .Annotations.summary }}
{{ end }}
- orgId: 1
name: pagerduty-critical
receivers:
- uid: pd-critical-1
type: pagerduty
settings:
integrationKey: "${PD_INTEGRATION_KEY}"
severity: critical
GitOps workflow
Workflow Grafana as Code
The production pattern for managing Grafana configuration:
- Store all provisioning YAML and dashboard JSON files in a Git repository
- Review changes via pull requests — dashboard and alerting changes get the same review as application code
- Deploy via CI/CD pipeline that copies files into the Grafana container or mounts them as ConfigMaps in Kubernetes
- Enforce allowUiUpdates: false to prevent ad-hoc UI changes that drift from Git
- Tools: Grafonnet (Jsonnet library for generating dashboards), Grizzly (CLI for Grafana resources), Terraform Grafana provider
Set editable: false on provisioned data sources and allowUiUpdates: false on dashboard providers. This prevents users from making UI changes that will be overwritten on next restart. All changes go through Git.
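A CI pipeline can also lint dashboard JSON before it ever reaches Grafana. A minimal sketch of such a check (a hypothetical helper script, not part of Grafana) that verifies every file parses and carries the uid and title fields Grafana expects:

```python
"""Hypothetical CI helper: validate dashboard JSON files before deployment."""
import json
from pathlib import Path


def validate_dashboards(root: str) -> list[str]:
    """Walk root recursively and return human-readable problems found."""
    problems = []
    for path in sorted(Path(root).rglob("*.json")):
        try:
            dash = json.loads(path.read_text())
        except json.JSONDecodeError as exc:
            problems.append(f"{path.name}: invalid JSON ({exc})")
            continue
        # Provisioned files store the dashboard model at the top level
        for field in ("uid", "title"):
            if not dash.get(field):
                problems.append(f"{path.name}: missing required field '{field}'")
    return problems
```

Wire it into CI by failing the build whenever validate_dashboards("dashboards") returns a non-empty list.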
Authentication & RBAC
Grafana supports multiple authentication providers and a hierarchical permission model. In OSS, permissions are organization- and folder-level. Grafana Enterprise adds fine-grained RBAC with custom roles.
Authentication providers
| Provider | Edition | Notes |
|---|---|---|
| Built-in (username/password) | OSS | Local accounts stored in the Grafana database. Fine for small teams. |
| LDAP | OSS | Bind to Active Directory or OpenLDAP. Map LDAP groups to Grafana orgs/roles. |
| OAuth 2.0 / OIDC | OSS | GitHub, GitLab, Google, Azure AD, Okta, Keycloak, generic OIDC. Most common for SSO. |
| SAML | Enterprise | Enterprise SSO standard. IdP-initiated and SP-initiated flows. Attribute mapping for roles. |
| Grafana Cloud SSO | Cloud | Managed by Grafana Labs. Includes team sync with identity providers. |
OAuth example (Keycloak)
# grafana.ini
[auth.generic_oauth]
enabled = true
name = Keycloak
client_id = grafana
client_secret = ${KEYCLOAK_CLIENT_SECRET}
scopes = openid email profile
auth_url = https://keycloak.example.com/realms/corp/protocol/openid-connect/auth
token_url = https://keycloak.example.com/realms/corp/protocol/openid-connect/token
api_url = https://keycloak.example.com/realms/corp/protocol/openid-connect/userinfo
role_attribute_path = contains(groups[*], 'grafana-admins') && 'Admin' || contains(groups[*], 'grafana-editors') && 'Editor' || 'Viewer'
allow_sign_up = true
Organization and team model
- Organizations — top-level tenants. Each org has its own dashboards, data sources, and users. Users can belong to multiple orgs with different roles. Useful for multi-tenant setups.
- Teams — groups of users within an org. Assign folder and dashboard permissions to teams instead of individual users.
- Roles (OSS) — Viewer (read-only), Editor (create/edit dashboards), Admin (full org control). Assigned per-organization.
- Folder permissions — dashboards are organized in folders. Each folder can have specific viewer/editor permissions per user or team.
RBAC (Enterprise)
Grafana Enterprise adds fine-grained role-based access control with custom roles and permissions on specific resources:
- Create custom roles with granular permissions (e.g., can edit dashboards in folder X but only view data source Y)
- Permissions on data sources, folders, dashboards, service accounts, and alerting resources
- Role assignment via the API or LDAP/SAML attribute mapping
- Audit logging of all permission changes
Loki Integration
Grafana Loki is a log aggregation system designed to work seamlessly with Grafana. Unlike Elasticsearch, Loki does not index log contents — it only indexes metadata labels (like Prometheus). This makes it dramatically cheaper to operate at scale, at the cost of slower full-text search.
LogQL basics
LogQL is Loki's query language, inspired by PromQL. It has two types of queries: log queries (return log lines) and metric queries (return computed values from logs).
# Stream selector - required, selects log streams by label
{job="nginx", env="production"}
# Filter expressions - narrow down log lines
{job="nginx"} |= "error" # contains "error"
{job="nginx"} !~ "healthcheck|readiness" # does not match regex
{job="nginx"} |= "error" != "timeout" # contains "error" but not "timeout"
# Parser - extract fields from log lines
{job="nginx"} | json # parse JSON logs
{job="nginx"} | logfmt # parse logfmt logs
{job="nginx"} | pattern `<ip> - - <_> "<method> <uri> <_>" <status> <size>`
# Label filter after parsing
{job="nginx"} | json | status >= 500
# Metric queries - aggregate log data into numbers
rate({job="nginx"} |= "error" [5m]) # errors per second
sum by (status) (count_over_time({job="nginx"} | json [1h])) # count by status code
quantile_over_time(0.95, {job="nginx"} | json | unwrap response_time [5m]) # p95 latency
Deploying Loki with Grafana
# docker-compose.yml (Loki + Alloy + Grafana)
services:
loki:
image: grafana/loki:3.6.1
ports:
- "3100:3100"
command: -config.file=/etc/loki/local-config.yaml
volumes:
- loki-data:/loki
- ./loki-config.yml:/etc/loki/local-config.yaml
restart: unless-stopped
alloy:
image: grafana/alloy:1.6.0
volumes:
- /var/log:/var/log:ro
- ./alloy-config.alloy:/etc/alloy/config.alloy
command: run /etc/alloy/config.alloy
restart: unless-stopped
grafana:
image: grafana/grafana:12.3.0
ports:
- "3000:3000"
environment:
GF_SECURITY_ADMIN_PASSWORD: admin
volumes:
- grafana-data:/var/lib/grafana
restart: unless-stopped
volumes:
loki-data:
grafana-data:
Loki configuration (minimal production)
# loki-config.yml
auth_enabled: false
server:
http_listen_port: 3100
common:
path_prefix: /loki
storage:
filesystem:
chunks_directory: /loki/chunks
rules_directory: /loki/rules
replication_factor: 1
ring:
kvstore:
store: inmemory
schema_config:
configs:
- from: 2024-01-01
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
limits_config:
retention_period: 744h # 31 days
max_query_length: 721h
max_query_parallelism: 32
compactor:
working_directory: /loki/compactor
compaction_interval: 10m
retention_enabled: true
retention_delete_delay: 2h
Loki uses 10-20x less storage and compute than Elasticsearch for the same log volume because it does not build full-text indexes. The trade-off: grep-style queries on unindexed fields are slower. Loki excels when you filter by labels first, then search within a narrow stream. If you need millisecond full-text search across all logs, Elasticsearch is still the better choice.
High Availability
Running Grafana in HA ensures the dashboard platform stays available if a single instance fails. Since Grafana is stateless (all state lives in the database), horizontal scaling is straightforward.
Required Shared Database
All Grafana instances must connect to the same PostgreSQL or MySQL database. SQLite does not support concurrent access from multiple instances. PostgreSQL is the recommended backend for HA.
Required Load Balancer
Place a load balancer (Nginx, HAProxy, ALB) in front of Grafana instances. Use sticky sessions (session affinity) to route a user to the same backend, or configure a shared session store (Redis, database).
Alert HA Unified Alerting HA
When running multiple instances, alert rule evaluation must be coordinated to avoid duplicate notifications. Enable HA alerting with the ha_peers setting so instances form a gossip cluster (via the memberlist protocol) and deduplicate alert state.
Optional Shared File Storage
If using file-based provisioning or image rendering, all instances need access to the same files. Use NFS, EFS, or mount ConfigMaps in Kubernetes.
HA configuration
# grafana.ini - HA settings
[database]
type = postgres
host = pg-primary.example.com:5432
name = grafana
user = grafana
password = ${GF_DATABASE_PASSWORD}
ssl_mode = require
max_open_conn = 50
max_idle_conn = 25
conn_max_lifetime = 14400
[unified_alerting]
enabled = true
# HA: list all peer addresses (each Grafana instance)
ha_listen_address = "0.0.0.0:9094"
ha_peers = "grafana-0:9094,grafana-1:9094,grafana-2:9094"
ha_peer_timeout = 15s
[live]
# Required for HA - use Redis as pubsub for Grafana Live
ha_engine = redis
ha_engine_address = redis:6379
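A corresponding load balancer sketch in Nginx (hostnames are illustrative; ip_hash gives simple session affinity if you are not using a shared session store, and the WebSocket headers are needed for Grafana Live):

```nginx
# nginx.conf fragment — illustrative
upstream grafana {
    ip_hash;  # session affinity: same client -> same backend
    server grafana-0:3000;
    server grafana-1:3000;
    server grafana-2:3000;
}

server {
    listen 443 ssl;
    server_name grafana.example.com;

    location / {
        proxy_pass http://grafana;
        proxy_set_header Host $host;
        # Upgrade headers required for Grafana Live (WebSocket) connections
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```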
Docker Deployment
The most common way to self-host Grafana is with Docker Compose alongside Prometheus, Loki, and Grafana Alloy (the unified telemetry collector). This gives you a complete observability stack with metrics, logs, and dashboards.
Full observability stack (Docker Compose)
# docker-compose.yml - Grafana + Prometheus + Loki + Alloy
services:
grafana:
image: grafana/grafana:12.3.0
container_name: grafana
ports:
- "3000:3000"
environment:
GF_SECURITY_ADMIN_USER: admin
GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_ADMIN_PASSWORD}"
GF_USERS_ALLOW_SIGN_UP: "false"
GF_SERVER_ROOT_URL: "https://grafana.example.com"
GF_DATABASE_TYPE: postgres
GF_DATABASE_HOST: postgres:5432
GF_DATABASE_NAME: grafana
GF_DATABASE_USER: grafana
GF_DATABASE_PASSWORD: "${POSTGRES_PASSWORD}"
GF_DATABASE_SSL_MODE: disable
GF_INSTALL_PLUGINS: grafana-clock-panel
volumes:
- grafana-data:/var/lib/grafana
- ./provisioning:/etc/grafana/provisioning
- ./dashboards:/var/lib/grafana/dashboards
depends_on:
postgres:
condition: service_healthy
restart: unless-stopped
prometheus:
image: prom/prometheus:v3.10.0
container_name: prometheus
ports:
- "9090:9090"
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
- '--web.enable-lifecycle'
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
- prometheus-data:/prometheus
restart: unless-stopped
loki:
image: grafana/loki:3.6.1
container_name: loki
ports:
- "3100:3100"
command: -config.file=/etc/loki/local-config.yaml
volumes:
- ./loki-config.yml:/etc/loki/local-config.yaml:ro
- loki-data:/loki
restart: unless-stopped
alloy:
image: grafana/alloy:1.6.0
container_name: alloy
volumes:
- /var/log:/var/log:ro
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- ./alloy-config.alloy:/etc/alloy/config.alloy:ro
command: run /etc/alloy/config.alloy
restart: unless-stopped
node-exporter:
image: prom/node-exporter:v1.10.0
container_name: node-exporter
pid: host
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--path.rootfs=/rootfs'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
restart: unless-stopped
postgres:
image: postgres:16-alpine
container_name: grafana-postgres
environment:
POSTGRES_DB: grafana
POSTGRES_USER: grafana
POSTGRES_PASSWORD: "${POSTGRES_PASSWORD}"
volumes:
- postgres-data:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U grafana"]
interval: 5s
timeout: 3s
retries: 5
restart: unless-stopped
volumes:
grafana-data:
prometheus-data:
loki-data:
postgres-data:
Prometheus configuration
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'node-exporter'
static_configs:
- targets: ['node-exporter:9100']
- job_name: 'grafana'
static_configs:
- targets: ['grafana:3000']
Environment variables
Grafana supports configuration via environment variables. Every grafana.ini setting can be overridden with GF_<SECTION>_<KEY> in uppercase. This is the preferred approach for Docker deployments.
# .env file
GRAFANA_ADMIN_PASSWORD=changeme-strong-password
POSTGRES_PASSWORD=another-strong-password
GF_SMTP_ENABLED=true
GF_SMTP_HOST=smtp.example.com:587
GF_SMTP_USER=grafana@example.com
GF_SMTP_PASSWORD=smtp-password
GF_SMTP_FROM_ADDRESS=grafana@example.com
The default admin password is admin. Always set GF_SECURITY_ADMIN_PASSWORD via environment variable or secret. After first login, change the password immediately. In production, disable the built-in admin and use OAuth/LDAP instead.
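One way to keep the password out of environment variables entirely is Grafana's file-based secret convention: appending __FILE to a GF_ variable makes Grafana read the value from the named file. A Docker Compose sketch (secret file path is a placeholder):

```yaml
# docker-compose.yml fragment — illustrative
services:
  grafana:
    image: grafana/grafana:12.3.0
    environment:
      # The __FILE suffix tells Grafana to read the value from this path
      GF_SECURITY_ADMIN_PASSWORD__FILE: /run/secrets/grafana_admin_password
    secrets:
      - grafana_admin_password

secrets:
  grafana_admin_password:
    file: ./secrets/grafana_admin_password.txt   # placeholder path
```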
Best Practices
Dashboard design
- Follow the USE method — for every resource, show Utilization, Saturation, and Errors. This gives a complete picture of system health.
- Top-to-bottom, left-to-right flow — place the most critical metrics at the top. Start with high-level stats, drill down into details below.
- Use variables for everything — instance, job, namespace, environment. One dashboard should serve all environments.
- Set meaningful thresholds — color-code values (green/yellow/red) so operators can spot problems at a glance without reading numbers.
- Link dashboards together — use data links to drill from a high-level overview dashboard to detailed per-service dashboards.
- Avoid dashboard sprawl — 5 great dashboards beat 50 mediocre ones. Start with the Grafana community dashboards (grafana.com/dashboards) and customize.
Folder organization
Structure Recommended folder layout
- Infrastructure/ — node exporter, Docker, Kubernetes cluster dashboards
- Applications/ — per-service application metrics (request rate, latency, errors)
- Databases/ — PostgreSQL, MySQL, Redis, MongoDB dashboards
- Networking/ — Nginx, HAProxy, DNS, VPN dashboards
- Logs/ — Loki-based log exploration dashboards
- Business/ — KPIs, SLOs, revenue metrics from SQL data sources
Backup strategies
- Database backups — back up the PostgreSQL/MySQL database regularly. This captures dashboards, users, alert rules, and all Grafana metadata. Use pg_dump on a schedule.
- Export dashboards to Git — use the Grafana API or grizzly to export all dashboards as JSON and commit to a Git repository. This is your disaster recovery plan.
- Provisioning files — if you use provisioning (and you should), the YAML files in Git are your backup. A fresh Grafana instance with the same provisioning files will reconstruct everything.
# Export all dashboards via the API
for uid in $(curl -s -H "Authorization: Bearer $GRAFANA_API_KEY" \
http://localhost:3000/api/search | jq -r '.[].uid'); do
curl -s -H "Authorization: Bearer $GRAFANA_API_KEY" \
"http://localhost:3000/api/dashboards/uid/$uid" | \
jq '.dashboard' > "dashboards/${uid}.json"
done
# Backup Grafana PostgreSQL database
pg_dump -h pg-host -U grafana grafana | gzip > grafana-backup-$(date +%Y%m%d).sql.gz
Upgrade procedures
- Read the changelog — Grafana publishes detailed release notes with breaking changes. Always read them before upgrading.
- Backup the database first — Grafana runs database migrations on startup. If the migration fails, you need the backup.
- Upgrade one minor version at a time — do not skip major versions (e.g., 9.x → 11.x). Go 9.x → 10.x → 11.x to ensure migrations run correctly.
- Test in staging — deploy the new version against a copy of the production database first.
- Pin the image tag — use grafana/grafana:12.3.0, not :latest. Explicit upgrades only.
Production Checklist
- Use PostgreSQL for the database — never use SQLite in production or HA deployments. Set max_open_conn and conn_max_lifetime.
- Set a strong admin password — change the default admin/admin immediately. Better yet, disable built-in auth and use OAuth/LDAP.
- Enable HTTPS — terminate TLS at a reverse proxy (Nginx, Traefik, Caddy) or configure Grafana's built-in TLS. Set GF_SERVER_ROOT_URL to the HTTPS URL.
- Configure authentication — set up OAuth, LDAP, or SAML. Disable anonymous access (GF_AUTH_ANONYMOUS_ENABLED=false). Disable user signup (GF_USERS_ALLOW_SIGN_UP=false).
- Provision everything from files — data sources, dashboards, alerting rules, contact points. Store in Git. Set editable: false and allowUiUpdates: false.
- Set up alerting — configure contact points (Slack, PagerDuty, email). Define notification policies with proper routing. Test that alerts fire and resolve correctly.
- Configure log rotation — Grafana logs to stdout by default in Docker. Ensure your Docker log driver rotates logs. Set GF_LOG_MODE=console and GF_LOG_LEVEL=warn in production.
- Set resource limits — in Docker or Kubernetes, set memory and CPU limits. Grafana typically needs 256 MB–1 GB RAM depending on dashboard complexity and concurrent users.
- Enable HA if critical — run 2+ instances behind a load balancer with shared PostgreSQL. Configure ha_peers for unified alerting.
- Back up the database — schedule daily pg_dump backups. Test restores regularly. Keep 30 days of backups.
- Pin the Grafana version — use specific image tags. Upgrade deliberately after reading the changelog and testing in staging.
- Restrict plugin installation — only install plugins from trusted sources. Review community plugins before deploying. Set GF_PLUGINS_ALLOW_LOADING_UNSIGNED_PLUGINS only when necessary.
- Monitor Grafana itself — Grafana exposes Prometheus metrics on /metrics. Scrape it with Prometheus and create a meta-dashboard for Grafana health (API latency, active users, alerting evaluation time).
- Set up image rendering — for alert notifications with images and PDF reporting, deploy the grafana/grafana-image-renderer container alongside Grafana.