Consul

Service discovery, service mesh, and distributed configuration by HashiCorp

01

Overview

HashiCorp Consul is a distributed system for service networking. It provides service discovery (find where services are running), a service mesh (secure service-to-service communication with mTLS), a key/value store (distributed configuration), and health checking (know which instances are healthy). Consul is designed for multi-datacenter, multi-cloud environments and integrates tightly with the rest of the HashiCorp ecosystem (Vault, Nomad, Terraform).

Service Discovery

Services register themselves with Consul. Other services query Consul via DNS or HTTP API to find healthy instances. No hardcoded IPs — services find each other dynamically.

Service Mesh (Connect)

Consul Connect provides service-to-service authorization and encryption via mutual TLS. Sidecar proxies (Envoy) handle traffic transparently — applications don't need to implement TLS themselves.

Key/Value Store

A hierarchical, distributed KV store for dynamic configuration. Applications read config from Consul KV instead of files. Supports watches for real-time config updates, sessions for distributed locking, and leader election.

Health Checking

Consul runs health checks against registered services and nodes. Unhealthy instances are automatically removed from DNS responses and service catalogs. Supports HTTP, TCP, script, TTL, and gRPC check types.

Multi-Datacenter

Consul natively supports multiple datacenters. Each DC runs its own Consul cluster. DCs communicate via WAN gossip and mesh gateways, enabling cross-DC service discovery and config replication without exposing internal networks.

HashiCorp Stack

Consul integrates with Vault (secrets management, automatic TLS certificate rotation), Nomad (workload orchestration with native Consul service registration), and Terraform (provision Consul infrastructure as code).

02

Architecture

Consul uses a client-server model. Every node in the infrastructure runs a Consul agent, either as a server or a client. Servers maintain the cluster state, and clients forward requests to servers.

Server Agents

Server agents participate in the Raft consensus protocol to maintain a consistent, replicated state. They store the service catalog, KV data, ACL policies, and config entries. Run 3 or 5 servers per datacenter for fault tolerance (Raft requires a quorum: majority must be alive).

  • 3 servers — tolerates 1 failure
  • 5 servers — tolerates 2 failures
  • Never run an even number (split-brain risk)
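The arithmetic behind these bullets is simple majority quorum. A quick sketch in plain Python (illustrative only, not a Consul API):

```python
def raft_tolerance(servers: int) -> tuple:
    """Return (quorum size, failures tolerated) for a Raft cluster.

    Raft needs a majority of servers alive to elect a leader and
    commit writes, so tolerance = servers - quorum.
    """
    quorum = servers // 2 + 1
    return quorum, servers - quorum

# 3 servers -> quorum 2, tolerates 1 failure
# 4 servers -> quorum 3, still tolerates only 1 (even counts add risk, not safety)
# 5 servers -> quorum 3, tolerates 2 failures
```

This is also why an even server count is pointless: going from 3 to 4 servers raises the quorum without raising fault tolerance.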

Client Agents

Client agents run on every node that hosts services. They are lightweight — they register local services, run health checks, and forward queries to servers. Clients participate in the LAN gossip pool for membership and failure detection but do not store state.

Datacenter topology

Datacenter "dc1"
+-----------------------------------------------+
|  Server 1 (leader) <---Raft---> Server 2      |
|      ^                              ^         |
|      |             Raft             |         |
|      v                              v         |
|                 Server 3                      |
|                                               |
|  --- LAN Gossip (Serf) across all agents ---  |
|                                               |
|  Client A    Client B    Client C    Client D |
|  (web-01)    (web-02)    (api-01)    (db-01)  |
+-----------------------------------------------+
                 | WAN Gossip
+-----------------------------------------------+
| Datacenter "dc2"                              |
|  Server 4 (leader) <--Raft--> Server 5        |
|                 Server 6                      |
|  Client E    Client F    Client G             |
+-----------------------------------------------+

Gossip protocol (Serf)

Consul uses the Serf gossip protocol for two distinct pools:

  • LAN gossip — all agents (servers + clients) within a single datacenter. Used for membership, failure detection, and event broadcasting. Operates on port 8301 (TCP + UDP).
  • WAN gossip — only server agents across datacenters. Enables cross-DC communication and service discovery. Operates on port 8302 (TCP + UDP).

Anti-entropy

Consul clients periodically synchronize their local state (registered services, health checks) with the server catalog. This anti-entropy mechanism ensures that if a client restarts or a registration is lost, the catalog converges back to the correct state. The sync interval scales with cluster size: 1 minute for 1–128 nodes, 2 minutes for 129–256 nodes, 3 minutes for 257–512 nodes, and 4 minutes for 513–1024 nodes. Each agent staggers its start time randomly within the interval window.
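The interval schedule above can be captured as a small lookup. This is an illustrative sketch of the stated schedule only (the real agent implements this internally; sizes beyond 1024 nodes are not covered by the schedule, so the sketch just keeps returning the largest value):

```python
import random

def full_sync_interval_minutes(cluster_size: int) -> int:
    """Base anti-entropy sync interval per the scaling schedule above."""
    for limit, minutes in [(128, 1), (256, 2), (512, 3), (1024, 4)]:
        if cluster_size <= limit:
            return minutes
    return 4  # schedule unspecified past 1024 nodes; assume the cap

def staggered_start_seconds(cluster_size: int) -> float:
    # each agent picks a random offset within its interval window
    return random.uniform(0, full_sync_interval_minutes(cluster_size) * 60)
```

The random stagger matters: without it, every agent in the cluster would hit the servers at the same moment on each sync cycle.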

Key concept

Consul's architecture separates the data plane (gossip, health checks, DNS) from the control plane (Raft consensus, catalog, KV store). This means even if the server cluster is temporarily unavailable, clients can still serve cached DNS responses and continue running health checks locally.

03

Service Discovery

Service discovery is Consul's foundational feature. Services register with the local Consul agent, and consumers find them via DNS or the HTTP API. Only healthy instances are returned in queries.

Service registration

Services can be registered via a JSON/HCL config file loaded by the agent, or dynamically via the HTTP API.

// /etc/consul.d/web-service.json
{
  "service": {
    "name": "web",
    "port": 8080,
    "tags": ["production", "v2"],
    "meta": {
      "version": "2.1.0",
      "team": "platform"
    },
    "check": {
      "http": "http://localhost:8080/health",
      "interval": "10s",
      "timeout": "3s",
      "deregister_critical_service_after": "90s"
    }
  }
}

DNS interface

Consul exposes a DNS interface on port 8600 (by default). Services are queryable at <service>.service.consul.

# Query for healthy instances of the "web" service
dig @127.0.0.1 -p 8600 web.service.consul

# Query for a specific tag (format: <tag>.<service>.service.consul)
dig @127.0.0.1 -p 8600 v2.web.service.consul

# SRV records (returns port information)
dig @127.0.0.1 -p 8600 web.service.consul SRV

# Query a service in another datacenter
dig @127.0.0.1 -p 8600 web.service.dc2.consul

HTTP API

# List all healthy instances of "web"
curl http://localhost:8500/v1/health/service/web?passing

# Register a service via API
curl --request PUT --data @web-service.json \
  http://localhost:8500/v1/agent/service/register

# Deregister a service
curl --request PUT \
  http://localhost:8500/v1/agent/service/deregister/web-1

# List all services in the catalog
curl http://localhost:8500/v1/catalog/services

Health check types

  • http — Consul sends an HTTP GET to the specified URL: 2xx = passing, 429 = warning, anything else = critical. Use for web services with a /health endpoint.
  • tcp — Consul opens a TCP connection: success = passing, refused/timeout = critical. Use for databases, message queues, and any TCP service.
  • script — Consul executes a command: exit 0 = passing, exit 1 = warning, any other exit = critical. Use for custom checks (disk space, external deps).
  • ttl — the service must call the TTL update endpoint within the specified interval; no update = critical. Use for services that push status rather than being polled.
  • grpc — Consul calls the standard gRPC health checking protocol. Use for gRPC services implementing the standard health API.
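The http and script status mappings are easy to mis-remember. A minimal sketch of the classification rules described above (plain Python, not Consul code):

```python
def http_check_status(status_code: int) -> str:
    """Classify an HTTP health-check response per the rules above."""
    if 200 <= status_code <= 299:
        return "passing"
    if status_code == 429:          # Too Many Requests -> degraded, not dead
        return "warning"
    return "critical"

def script_check_status(exit_code: int) -> str:
    """Classify a script check result by exit code per the rules above."""
    return {0: "passing", 1: "warning"}.get(exit_code, "critical")
```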

Catalog vs agent checks

Agent-level checks

Registered and executed by the local Consul agent. The agent monitors the service and reports status to the server. This is the standard approach — checks run where the service runs.

Catalog-level registration

Registered directly via the catalog API, bypassing the local agent. No health checks are executed — the entry is static. Used for external services (e.g., a third-party API) that you want to include in Consul's service discovery but can't install an agent on.

04

Key/Value Store

Consul includes a distributed, hierarchical key/value store replicated across all server nodes via Raft consensus. It's used for dynamic configuration, feature flags, coordination primitives (leader election, distributed locks), and service metadata.

KV API

# Set a key
consul kv put config/web/max-connections 100

# Get a key
consul kv get config/web/max-connections

# Get all keys under a prefix
consul kv get -recurse config/web/

# Delete a key
consul kv delete config/web/max-connections

# Delete a prefix recursively
consul kv delete -recurse config/web/

# Export all KV data as JSON
consul kv export "" > consul-kv-backup.json

# Import KV data from JSON
consul kv import @consul-kv-backup.json

HTTP API for KV

# PUT a value (GET responses return values base64-encoded)
curl --request PUT --data 'database-primary.dc1.consul' \
  http://localhost:8500/v1/kv/config/db/host

# GET a value
curl http://localhost:8500/v1/kv/config/db/host

# GET with raw value (no JSON wrapper)
curl http://localhost:8500/v1/kv/config/db/host?raw

# CAS (Check-And-Set) — only update if ModifyIndex matches
curl --request PUT --data 'new-value' \
  "http://localhost:8500/v1/kv/config/db/host?cas=42"
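Note that a plain GET (without ?raw) wraps results in JSON with the Value field base64-encoded. A small decoding sketch, using a locally constructed payload rather than a live API call:

```python
import base64
import json

def decode_kv(body: str) -> dict:
    """Map each Key in a KV GET response body to its decoded Value."""
    return {
        entry["Key"]: base64.b64decode(entry["Value"]).decode("utf-8")
        for entry in json.loads(body)
    }

# Build a response body shaped like the API would return for the PUT above
encoded = base64.b64encode(b"database-primary.dc1.consul").decode("ascii")
body = json.dumps([{"Key": "config/db/host", "Value": encoded, "ModifyIndex": 42}])

assert decode_kv(body) == {"config/db/host": "database-primary.dc1.consul"}
```

Using ?raw skips the JSON wrapper entirely and is usually simpler when reading a single key from a script.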

Watches

Consul watches monitor KV keys (or services, nodes, etc.) and invoke a handler when the data changes. Useful for dynamic configuration reloading.

// /etc/consul.d/watch-config.json
{
  "watches": [
    {
      "type": "key",
      "key": "config/web/max-connections",
      "handler_type": "script",
      "args": ["/usr/local/bin/reload-config.sh"]
    },
    {
      "type": "keyprefix",
      "prefix": "config/web/",
      "handler_type": "http",
      "http_handler_config": {
        "path": "http://localhost:8080/consul-callback",
        "method": "POST"
      }
    }
  ]
}

Sessions & distributed locking

Consul sessions provide a mechanism for building distributed locks and leader election. A session is tied to a node's health check — if the node fails, the session is invalidated and locks are released.

# Create a session
SESSION_ID=$(curl -s --request PUT \
  --data '{"Name": "my-lock", "TTL": "15s", "Behavior": "release"}' \
  http://localhost:8500/v1/session/create | jq -r '.ID')

# Acquire a lock on a key
curl --request PUT --data 'lock-holder-1' \
  "http://localhost:8500/v1/kv/locks/my-resource?acquire=$SESSION_ID"

# Release the lock
curl --request PUT \
  "http://localhost:8500/v1/kv/locks/my-resource?release=$SESSION_ID"

# Renew a session (reset TTL)
curl --request PUT \
  "http://localhost:8500/v1/session/renew/$SESSION_ID"

Leader election pattern

Multiple service instances race to acquire a lock on a well-known KV key. The winner becomes the leader. All instances watch the key. When the leader's session expires (crash, network failure), the lock is released and another instance acquires it. This is how many HA systems implement leader election with Consul.
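The acquire/release semantics can be modeled in a few lines. This is an in-memory illustration of the pattern (hypothetical class names, not a client for the real API):

```python
class KVLock:
    """In-memory model of session-based locking on a single KV key."""

    def __init__(self):
        self.holder = None  # session currently holding the lock

    def acquire(self, session: str) -> bool:
        # ?acquire succeeds only if the key is unheld (or already held
        # by the same session); Consul returns true/false the same way
        if self.holder in (None, session):
            self.holder = session
            return True
        return False

    def release(self, session: str) -> bool:
        if self.holder == session:
            self.holder = None
            return True
        return False

    def session_expired(self, session: str):
        # models node failure: the session is invalidated, the lock freed
        if self.holder == session:
            self.holder = None

lock = KVLock()
assert lock.acquire("instance-1")        # instance-1 becomes leader
assert not lock.acquire("instance-2")    # instance-2 keeps watching the key
lock.session_expired("instance-1")       # leader crashes, session invalidated
assert lock.acquire("instance-2")        # failover: instance-2 takes over
```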

05

Service Mesh (Connect)

Consul Connect is Consul's service mesh feature. It provides mutual TLS (mTLS) encryption between services and intention-based authorization (which service can talk to which). Sidecar proxies (Envoy) handle the encryption and authorization transparently — application code doesn't change.

How it works

Service A            Sidecar Proxy A            Sidecar Proxy B            Service B
(localhost:8080) --> (Envoy :21000)  ==mTLS==>  (Envoy :21001)  -->  (localhost:9090)
                          |                          |
                          +--- Intentions checked ---+
                          |                          |
                          +--- TLS certs from Consul CA ---+

Sidecar proxy registration

// /etc/consul.d/web-with-sidecar.json
{
  "service": {
    "name": "web",
    "port": 8080,
    "connect": {
      "sidecar_service": {
        "proxy": {
          "upstreams": [
            {
              "destination_name": "api",
              "local_bind_port": 5000
            },
            {
              "destination_name": "cache",
              "local_bind_port": 6379
            }
          ]
        }
      }
    },
    "check": {
      "http": "http://localhost:8080/health",
      "interval": "10s"
    }
  }
}

With this configuration, the web service connects to the api service by hitting localhost:5000. The Envoy sidecar proxy intercepts, establishes an mTLS connection to the api service's sidecar, and forwards the traffic.

Intentions (allow/deny)

Intentions define which services may communicate. They are evaluated at the sidecar proxy layer using mTLS identity.

# Allow "web" to talk to "api" (legacy CLI, deprecated since v1.9.0)
consul intention create -allow web api

# Deny "web" from talking to "database" (legacy CLI)
consul intention create -deny web database

# Preferred: use config entries via consul config write
consul config write service-intentions-api.hcl

# List all intentions
consul intention list

# Delete an intention
consul intention delete web api

Intentions should be managed as service-intentions config entries (the preferred approach since Consul v1.9.0):

# service-intentions.hcl
Kind = "service-intentions"
Name = "api"
Sources = [
  {
    Name   = "web"
    Action = "allow"
  },
  {
    Name   = "monitoring"
    Action = "allow"
  },
  {
    Name   = "*"
    Action = "deny"
  }
]
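The Sources list above is evaluated by specificity rather than listing order: an exact source name takes precedence over the * wildcard. A simplified model in Python (hypothetical helper, not part of Consul):

```python
def intention_allows(source: str, sources: list) -> bool:
    """Decide whether `source` may call the destination service.

    Simplified precedence model: an exact-name entry wins over the
    '*' wildcard; with no match at all this falls back to deny,
    matching an ACL default_policy of "deny".
    """
    exact = {name: action for name, action in sources if name != "*"}
    if source in exact:
        return exact[source] == "allow"
    wildcard = next((action for name, action in sources if name == "*"), "deny")
    return wildcard == "allow"

rules = [("web", "allow"), ("monitoring", "allow"), ("*", "deny")]
assert intention_allows("web", rules)
assert not intention_allows("billing", rules)   # caught by the wildcard deny
```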

Transparent proxy

In transparent proxy mode (default on Kubernetes), all outbound traffic from a service is automatically redirected through the sidecar proxy via iptables rules. The application connects to services by their normal DNS names — no need to configure upstream local_bind_port values. Consul's Envoy sidecar intercepts the traffic and applies mTLS and intentions automatically.

Recommendation

Use transparent proxy mode when deploying on Kubernetes. It removes the need for applications to be aware of the mesh. For VM-based deployments, explicitly configure upstreams in the sidecar proxy registration.

06

ACL System

Consul's ACL system controls access to the service catalog, KV store, agent APIs, intentions, and all other Consul resources. It uses tokens, policies, and roles. ACLs should be enabled in any production deployment.

Bootstrapping ACLs

# /etc/consul.d/consul.hcl (server config)
acl {
  enabled        = true
  default_policy = "deny"
  down_policy    = "extend-cache"
  enable_token_persistence = true

  tokens {
    initial_management = "b1gs33cr3t-0000-0000-0000-000000000001"
  }
}

# Bootstrap the ACL system (run once on a server)
consul acl bootstrap

# Output includes the initial management token:
# AccessorID:  a1b2c3d4-...
# SecretID:    b1gs33cr3t-0000-0000-0000-000000000001
# Description: Bootstrap Token (Global Management)
# Policies:    global-management

Policies

ACL policies define rules as HCL or JSON. Each rule grants read, write, list, or deny permissions on resources.

# web-service-policy.hcl
# Allow the web service to register itself and read other services
service "web" {
  policy = "write"
}
service_prefix "" {
  policy = "read"
}

# Allow reading KV config for web
key_prefix "config/web/" {
  policy = "read"
}

# Allow the node to register
node_prefix "" {
  policy = "write"
}

# Allow reading health checks
health_prefix "" {
  policy = "read"
}

# Create the policy
consul acl policy create \
  -name "web-service" \
  -description "Policy for web service agents" \
  -rules @web-service-policy.hcl

# Create a token with this policy
consul acl token create \
  -description "Token for web service" \
  -policy-name "web-service"
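Prefix rules like the service_prefix and key_prefix entries above resolve by specificity: when several prefixes match, the longest one wins. A simplified model (illustrative helper, assuming default_policy = "deny"):

```python
def key_permission(key: str, prefix_rules: dict) -> str:
    """Resolve a KV permission by longest-prefix match.

    `prefix_rules` maps a key prefix to a policy string ("read",
    "write", "deny"); a key matching no prefix falls back to deny.
    """
    matches = [prefix for prefix in prefix_rules if key.startswith(prefix)]
    if not matches:
        return "deny"  # assumes default_policy = "deny"
    return prefix_rules[max(matches, key=len)]

rules = {"": "deny", "config/": "read", "config/web/": "write"}
assert key_permission("config/web/max-connections", rules) == "write"
assert key_permission("config/db/host", rules) == "read"
```

This is why the policy above can grant service_prefix "" read access while still granting write only on the specific "web" service: the more specific rule takes precedence.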

Roles and auth methods

Roles (grouping policies)

Roles bundle multiple policies together. Assign a role to a token instead of individual policies. Easier to manage when many services share the same access patterns.

# Create a role
consul acl role create \
  -name "backend-services" \
  -policy-name "service-read" \
  -policy-name "kv-config-read"

# Create a token with this role
consul acl token create \
  -role-name "backend-services"

Auth methods

Auth methods allow external identity providers (Kubernetes, JWT/OIDC, AWS IAM) to automatically generate Consul ACL tokens. In Kubernetes, the consul-k8s injector uses a Kubernetes auth method so pods get Consul tokens automatically based on their service account.

Critical

Always set default_policy = "deny" in production. With allow (the default value of default_policy), any unauthenticated request has full access to every Consul resource. A deny default means every agent, service, and operator needs an explicit token.

07

DNS & Networking

Consul provides a built-in DNS server that makes service discovery as simple as a DNS lookup. The .consul domain is the default top-level domain for all Consul queries.

DNS query format

  • <service>.service.consul — A/AAAA records for healthy instances
  • <service>.service.consul SRV — SRV records with port and node info
  • <tag>.<service>.service.consul — healthy instances filtered by service tag
  • <service>.service.<dc>.consul — the service in a specific datacenter
  • <node>.node.consul — A record for a specific node
  • <query>.query.consul — prepared query result
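These names compose mechanically from their parts. A tiny builder (hypothetical helper, shown for illustration):

```python
def consul_dns_name(service: str, tag: str = None,
                    dc: str = None, domain: str = "consul") -> str:
    """Build a Consul service-lookup DNS name from its parts."""
    parts = []
    if tag:
        parts.append(tag)           # tag goes in front of the service
    parts.append(service)
    parts.append("service")
    if dc:
        parts.append(dc)            # datacenter slots before the domain
    parts.append(domain)
    return ".".join(parts)

assert consul_dns_name("web") == "web.service.consul"
assert consul_dns_name("web", tag="production") == "production.web.service.consul"
assert consul_dns_name("web", dc="dc2") == "web.service.dc2.consul"
```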

DNS forwarding setup

Consul DNS runs on port 8600 by default. To use it transparently, configure your system's DNS resolver (systemd-resolved, dnsmasq, or BIND) to forward .consul queries to the Consul agent.

# Using dnsmasq (add to /etc/dnsmasq.d/consul.conf)
server=/consul/127.0.0.1#8600

# Using systemd-resolved
# /etc/systemd/resolved.conf.d/consul.conf
[Resolve]
DNS=127.0.0.1:8600
Domains=~consul

# Using iptables to redirect port 53 to 8600
iptables -t nat -A PREROUTING -p udp -m udp --dport 53 -j REDIRECT --to-ports 8600
iptables -t nat -A PREROUTING -p tcp -m tcp --dport 53 -j REDIRECT --to-ports 8600

# Or configure Consul to bind DNS to port 53 directly (requires root or CAP_NET_BIND_SERVICE)
# In consul.hcl:
# ports { dns = 53 }

Prepared queries

Prepared queries are stored, parameterized service queries with failover logic. They enable cross-datacenter failover and geo-routing at the DNS level.

# Create a prepared query with DC failover
curl --request POST --data '{
  "Name": "web-failover",
  "Service": {
    "Service": "web",
    "Tags": ["production"],
    "Failover": {
      "NearestN": 2,
      "Datacenters": ["dc2", "dc3"]
    }
  }
}' http://localhost:8500/v1/query

# Query via DNS: web-failover.query.consul
# Returns local DC results first, fails over to dc2/dc3 if no healthy instances

Network segments & mesh gateways

Network Segments

Enterprise feature. Allows partitioning the LAN gossip pool into isolated segments. Useful when network ACLs prevent full mesh connectivity between all agents. Each segment has its own gossip pool with a dedicated port.

Mesh Gateways

Mesh gateways enable Consul Connect traffic to cross network boundaries (datacenters, partitions, VPCs) without requiring direct connectivity between all services. Gateway nodes proxy mTLS traffic through a single, well-known endpoint.

08

Multi-Datacenter

Consul is built for multi-datacenter deployments. Each datacenter runs an independent Consul cluster with its own Raft quorum. Datacenters are connected via WAN gossip and can optionally use mesh gateways for service mesh traffic.

WAN federation

Server agents from different datacenters join a shared WAN gossip pool. This enables cross-DC service discovery and RPC forwarding.

# Server config for dc1
datacenter         = "dc1"
primary_datacenter = "dc1"
server             = true
bootstrap_expect   = 3

# Join WAN with dc2 servers
retry_join_wan = ["10.10.2.11", "10.10.2.12", "10.10.2.13"]

# Server config for dc2
datacenter         = "dc2"
primary_datacenter = "dc1"
server             = true
bootstrap_expect   = 3

retry_join_wan = ["10.10.1.11", "10.10.1.12", "10.10.1.13"]

Cross-DC service discovery

# Query a service in a remote datacenter
dig @127.0.0.1 -p 8600 web.service.dc2.consul

# Via HTTP API
curl "http://localhost:8500/v1/health/service/web?dc=dc2&passing"

# List all known datacenters
curl http://localhost:8500/v1/catalog/datacenters

Replication

The primary_datacenter is the authoritative source for certain data. The following are replicated from the primary to secondary DCs:

  • ACL policies, tokens, and roles — managed centrally in the primary DC and replicated
  • Config entries (intentions, service-defaults, proxy-defaults) — replicated for consistent service mesh behavior
  • CA certificates — the root CA is in the primary DC; secondary DCs get intermediate CAs

Warning

KV data is not replicated across datacenters by default. Each DC has its own KV store. If you need shared configuration across DCs, use consul-replicate (HashiCorp tool, largely unmaintained) or manage config entries through a CI/CD pipeline that writes to each DC.

Mesh gateways for DC peering

WAN federation requires direct server-to-server connectivity across DCs. Cluster peering (newer approach) uses mesh gateways instead, requiring only a single gateway endpoint to be reachable. This is simpler for cloud environments where opening multiple ports between VPCs is complex.

# Generate a peering token in dc1
consul peering generate-token -name dc2

# Establish peering from dc2
consul peering establish -name dc1 -peering-token <token>

# List peerings
consul peering list

# After peering, export services to make them discoverable
# Config entry in dc1:
# Kind = "exported-services"
# Name = "default"
# Services = [{ Name = "web", Consumers = [{ Peer = "dc2" }] }]

09

Kubernetes Integration

Consul integrates deeply with Kubernetes via the official Helm chart and the consul-k8s CLI. Since Consul 1.14 / consul-k8s v1.0, Kubernetes deployments use Consul Dataplane instead of per-node client agents. Consul Dataplane runs as a sidecar alongside Envoy, communicates with servers over gRPC (no gossip protocol needed), and simplifies networking, upgrades, and ACL token management. The Helm chart deploys Consul servers, injects Envoy sidecar proxies into application pods, and can sync the Kubernetes service catalog with Consul.

Helm chart deployment

# Add the HashiCorp Helm repository
helm repo add hashicorp https://helm.releases.hashicorp.com
helm repo update

# Install Consul with default values
helm install consul hashicorp/consul \
  --namespace consul --create-namespace \
  --values consul-values.yaml

# consul-values.yaml
global:
  name: consul
  datacenter: dc1
  image: "hashicorp/consul:1.22"  # Update to latest stable; check releases.hashicorp.com
  tls:
    enabled: true
  acls:
    manageSystemACLs: true

server:
  replicas: 3
  storageClass: gp3
  storage: 10Gi
  resources:
    requests:
      memory: "200Mi"
      cpu: "100m"
    limits:
      memory: "1Gi"
      cpu: "1000m"

connectInject:
  enabled: true
  transparentProxy:
    defaultEnabled: true
  default: false  # opt-in per pod with annotation

syncCatalog:
  enabled: true
  toConsul: true
  toK8S: true

ui:
  enabled: true
  service:
    type: LoadBalancer

Connect-inject (sidecar injection)

The connectInject controller watches for pods with the annotation consul.hashicorp.com/connect-inject: "true" and automatically injects an Envoy sidecar proxy.

# Example pod with Consul Connect sidecar injection
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
      annotations:
        consul.hashicorp.com/connect-inject: "true"
        consul.hashicorp.com/connect-service-upstreams: "api:5000,cache:6379"
    spec:
      serviceAccountName: web
      containers:
        - name: web
          image: myorg/web:v2.1
          ports:
            - containerPort: 8080
          env:
            - name: API_URL
              value: "http://localhost:5000"
            - name: CACHE_URL
              value: "redis://localhost:6379"

Sync catalog

Catalog sync keeps Kubernetes services and Consul services in sync. Services registered in Consul appear as Kubernetes ExternalName services, and Kubernetes services appear in the Consul catalog.

CRDs and gateways

Custom resources (CRDs)

Consul on Kubernetes uses CRDs for managing service mesh configuration: ServiceIntentions, ServiceDefaults, ServiceRouter, ServiceSplitter, ProxyDefaults, IngressGateway, TerminatingGateway, and more. This allows GitOps workflows via kubectl apply.

Ingress & terminating gateways

Ingress gateway — exposes mesh services to external traffic (like an ingress controller). Terminating gateway — allows mesh services to connect to external, non-mesh services while maintaining mTLS within the mesh.

# ServiceIntentions CRD
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceIntentions
metadata:
  name: api
spec:
  destination:
    name: api
  sources:
    - name: web
      action: allow
    - name: "*"
      action: deny

10

Docker Deployment

A Docker Compose setup for running a 3-server Consul cluster with client agents. Suitable for development, testing, and small production deployments.

Consul server configuration

# config/server.hcl
datacenter         = "dc1"
data_dir           = "/consul/data"
log_level          = "INFO"
server             = true
bootstrap_expect   = 3

ui_config {
  enabled = true
}

client_addr    = "0.0.0.0"
bind_addr      = "0.0.0.0"

addresses {
  http = "0.0.0.0"
}

retry_join = ["consul-server-1", "consul-server-2", "consul-server-3"]

connect {
  enabled = true
}

acl {
  enabled        = true
  default_policy = "deny"
  down_policy    = "extend-cache"
  enable_token_persistence = true
}

performance {
  raft_multiplier = 1
}

Consul client configuration

# config/client.hcl
datacenter  = "dc1"
data_dir    = "/consul/data"
log_level   = "INFO"
server      = false

client_addr = "0.0.0.0"
bind_addr   = "0.0.0.0"

retry_join = ["consul-server-1", "consul-server-2", "consul-server-3"]

connect {
  enabled = true
}

ports {
  grpc     = 8502  # plaintext gRPC (xDS for Envoy)
  grpc_tls = 8503  # gRPC with TLS (default on servers since v1.14)
}

Docker Compose (3 servers + 2 clients)

# docker-compose.yml
services:
  consul-server-1:
    image: hashicorp/consul:1.22
    container_name: consul-server-1
    command: agent -server -node=server-1
    volumes:
      - ./config/server.hcl:/consul/config/server.hcl:ro
      - consul-data-1:/consul/data
    ports:
      - "8500:8500"   # HTTP API + UI
      - "8600:8600/udp" # DNS
      - "8600:8600/tcp"
    networks:
      - consul-net
    restart: unless-stopped

  consul-server-2:
    image: hashicorp/consul:1.22
    container_name: consul-server-2
    command: agent -server -node=server-2
    volumes:
      - ./config/server.hcl:/consul/config/server.hcl:ro
      - consul-data-2:/consul/data
    networks:
      - consul-net
    restart: unless-stopped

  consul-server-3:
    image: hashicorp/consul:1.22
    container_name: consul-server-3
    command: agent -server -node=server-3
    volumes:
      - ./config/server.hcl:/consul/config/server.hcl:ro
      - consul-data-3:/consul/data
    networks:
      - consul-net
    restart: unless-stopped

  consul-client-1:
    image: hashicorp/consul:1.22
    container_name: consul-client-1
    command: agent -node=client-1
    volumes:
      - ./config/client.hcl:/consul/config/client.hcl:ro
    networks:
      - consul-net
    depends_on:
      - consul-server-1
      - consul-server-2
      - consul-server-3
    restart: unless-stopped

  consul-client-2:
    image: hashicorp/consul:1.22
    container_name: consul-client-2
    command: agent -node=client-2
    volumes:
      - ./config/client.hcl:/consul/config/client.hcl:ro
    networks:
      - consul-net
    depends_on:
      - consul-server-1
      - consul-server-2
      - consul-server-3
    restart: unless-stopped

volumes:
  consul-data-1:
  consul-data-2:
  consul-data-3:

networks:
  consul-net:
    driver: bridge

# Start the cluster
docker compose up -d

# Check cluster members
docker exec consul-server-1 consul members

# Bootstrap ACLs (run once after first start)
docker exec consul-server-1 consul acl bootstrap

# Access the UI at http://localhost:8500

11

Observability

Monitoring a Consul cluster is essential for maintaining reliability. Consul exposes rich telemetry, provides a built-in UI dashboard, supports audit logging (Enterprise), and includes a snapshot mechanism for backups.

Telemetry & Prometheus

# consul.hcl — enable Prometheus metrics
telemetry {
  prometheus_retention_time = "60s"
  disable_hostname          = true
}

# Metrics are then available at:
# http://localhost:8500/v1/agent/metrics?format=prometheus

# prometheus.yml scrape config
scrape_configs:
  - job_name: 'consul'
    metrics_path: '/v1/agent/metrics'
    params:
      format: ['prometheus']
    static_configs:
      - targets:
          - 'consul-server-1:8500'
          - 'consul-server-2:8500'
          - 'consul-server-3:8500'

Key metrics to monitor

  • consul.raft.leader.lastContact — time since the leader last contacted followers. Alert above 200ms (leader instability).
  • consul.raft.commitTime — time to commit a new log entry. Alert above 500ms (slow commits).
  • consul.serf.member.flap — membership flaps (join/leave churn). Alert on sustained nonzero values (network issues).
  • consul.catalog.service.count — total services registered. Alert on sudden drops (deregistration storm).
  • consul.health.service.critical — number of critical health checks. Alert when nonzero for key services.
  • consul.rpc.request — RPC request rate to servers. Spikes may indicate a thundering herd.

UI dashboard

Consul includes a built-in web UI (enabled with ui_config { enabled = true }) that shows services, nodes, KV store, intentions, and cluster health. Access it at http://<consul-addr>:8500/ui. The UI supports filtering by datacenter, namespace (Enterprise), and partition.

Audit logging (Enterprise)

Consul Enterprise supports audit logging that records every API request, including the token used, the operation, and the result. Essential for compliance and security forensics.

# Enterprise only
audit {
  enabled = true
  sink "file" {
    type   = "file"
    format = "json"
    path   = "/consul/audit/audit.json"
    delivery_guarantee = "best-effort"
    rotate_duration    = "24h"
    rotate_max_files   = 15
  }
}

Snapshots (backup & restore)

# Take a snapshot (includes KV, catalog, ACLs, sessions, etc.)
consul snapshot save consul-backup-$(date +%Y%m%d).snap

# Restore from a snapshot
consul snapshot restore consul-backup-20260320.snap

# Inspect a snapshot
consul snapshot inspect consul-backup-20260320.snap

# Automated snapshot agent (Enterprise, or use cron with OSS)
# Cron example for OSS:
# 0 */6 * * * consul snapshot save /backups/consul-$(date +\%Y\%m\%d-\%H\%M).snap

Recommendation

Take snapshots at least every 6 hours and before any cluster maintenance (upgrades, node replacement). Snapshots are the only way to recover from a total cluster loss. Store them off-cluster in S3, GCS, or another durable location.

12

Production Checklist

  • Run 3 or 5 server agents — never 1 (no fault tolerance), never an even number (split-brain risk). 3 tolerates 1 failure, 5 tolerates 2.
  • Enable ACLs with default deny — set default_policy = "deny". Bootstrap ACLs and create granular tokens for every agent and service. Never use the management token for regular operations.
  • Enable TLS everywhere — encrypt RPC, HTTP, and gossip traffic. Use auto_encrypt for automatic client TLS certificate distribution from servers.
  • Enable gossip encryption — generate a gossip key with consul keygen and set encrypt in the config. All agents must share the same key.
  • Enable Connect (service mesh) — even if you don't need mTLS today, enabling Connect allows incremental adoption. Start with intentions in allow-all mode, then tighten.
  • Pin the Consul version — use specific image tags (hashicorp/consul:1.22.1), never :latest. Upgrade deliberately with tested rollout plans.
  • Set raft_multiplier = 1 — default is 5 (development-friendly). Production should use 1 for tighter leader election timeouts and faster failover.
  • Use persistent storage — server data_dir must be on persistent volumes. Losing Raft data means losing quorum state.
  • Automate snapshots — schedule regular consul snapshot save and store backups off-cluster. This is your disaster recovery mechanism.
  • Monitor key metrics — alert on raft.leader.lastContact, serf.member.flap, and health.service.critical. Set up Prometheus scraping and Grafana dashboards.
  • Configure DNS forwarding — set up dnsmasq or systemd-resolved to forward .consul queries to the local Consul agent. Applications should resolve services via DNS.
  • Set deregister_critical_service_after — on health checks, auto-deregister services that stay critical for too long (e.g., "90s"). Prevents stale entries from accumulating.
  • Use retry_join with multiple addresses — never hardcode a single server IP. Use cloud auto-join (provider=aws tag_key=consul tag_value=server) or multiple IPs/DNS names.
  • Separate server and client configs — servers and clients have different resource requirements and config. Don't run a one-size-fits-all config.
  • Plan datacenter naming — datacenter names are permanent and used in DNS (service.dc1.consul). Choose meaningful, stable names from the start.