Consul
Service discovery, service mesh, and distributed configuration by HashiCorp
Overview
HashiCorp Consul is a distributed system for service networking. It provides service discovery (find where services are running), a service mesh (secure service-to-service communication with mTLS), a key/value store (distributed configuration), and health checking (know which instances are healthy). Consul is designed for multi-datacenter, multi-cloud environments and integrates tightly with the rest of the HashiCorp ecosystem (Vault, Nomad, Terraform).
Service Discovery
Services register themselves with Consul. Other services query Consul via DNS or HTTP API to find healthy instances. No hardcoded IPs — services find each other dynamically.
Service Mesh (Connect)
Consul Connect provides service-to-service authorization and encryption via mutual TLS. Sidecar proxies (Envoy) handle traffic transparently — applications don't need to implement TLS themselves.
Key/Value Store
A hierarchical, distributed KV store for dynamic configuration. Applications read config from Consul KV instead of files. Supports watches for real-time config updates, sessions for distributed locking, and leader election.
Health Checking
Consul runs health checks against registered services and nodes. Unhealthy instances are automatically removed from DNS responses and service catalogs. Supports HTTP, TCP, script, TTL, and gRPC check types.
Multi-Datacenter
Consul natively supports multiple datacenters. Each DC runs its own Consul cluster. DCs communicate via WAN gossip and mesh gateways, enabling cross-DC service discovery and config replication without exposing internal networks.
HashiCorp Stack
Consul integrates with Vault (secrets management, automatic TLS certificate rotation), Nomad (workload orchestration with native Consul service registration), and Terraform (provision Consul infrastructure as code).
Architecture
Consul uses a client-server model. Every node in the infrastructure runs a Consul agent, either as a server or a client. Servers maintain the cluster state, and clients forward requests to servers.
Server Agents
Server agents participate in the Raft consensus protocol to maintain a consistent, replicated state. They store the service catalog, KV data, ACL policies, and config entries. Run 3 or 5 servers per datacenter for fault tolerance (Raft requires a quorum: majority must be alive).
- 3 servers — tolerates 1 failure
- 5 servers — tolerates 2 failures
- Avoid even counts — a 4th server adds no fault tolerance (quorum becomes 3, so the cluster still tolerates only 1 failure) while adding one more machine that can fail
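The quorum arithmetic behind these numbers can be sketched in a few lines (Python, for illustration):

```python
def fault_tolerance(servers: int) -> int:
    """How many server failures a Raft cluster survives."""
    quorum = servers // 2 + 1  # a majority must stay alive
    return servers - quorum

# An even count buys nothing: 4 servers still tolerate only 1 failure,
# same as 3, while adding one more machine that can fail.
for n in (1, 3, 4, 5):
    print(f"{n} servers -> quorum {n // 2 + 1}, tolerates {fault_tolerance(n)}")
```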
Client Agents
Client agents run on every node that hosts services. They are lightweight — they register local services, run health checks, and forward queries to servers. Clients participate in the LAN gossip pool for membership and failure detection but do not store state.
Datacenter topology
Gossip protocol (Serf)
Consul uses the Serf gossip protocol for two distinct pools:
- LAN gossip — all agents (servers + clients) within a single datacenter. Used for membership, failure detection, and event broadcasting. Operates on port 8301 (TCP + UDP).
- WAN gossip — only server agents across datacenters. Enables cross-DC communication and service discovery. Operates on port 8302 (TCP + UDP).
Anti-entropy
Consul clients periodically synchronize their local state (registered services, health checks) with the server catalog. This anti-entropy mechanism ensures that if a client restarts or a registration is lost, the catalog converges back to the correct state. The sync interval scales with cluster size: 1 minute for 1–128 nodes, 2 minutes for 129–256 nodes, 3 minutes for 257–512 nodes, and 4 minutes for 513–1024 nodes. Each agent staggers its start time randomly within the interval window.
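The tiered schedule above can be expressed as a small function — a sketch of the documented tiers, not Consul's actual implementation:

```python
import random

def sync_interval_seconds(cluster_size: int) -> int:
    """Anti-entropy sync interval per the documented size tiers."""
    if cluster_size <= 128:
        return 60
    if cluster_size <= 256:
        return 120
    if cluster_size <= 512:
        return 180
    return 240  # 513-1024 nodes

def staggered_delay(cluster_size: int) -> float:
    """Each agent starts at a random offset within its interval window."""
    interval = sync_interval_seconds(cluster_size)
    return interval + random.uniform(0, interval)

print(sync_interval_seconds(100), sync_interval_seconds(300))
```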
Consul's architecture separates the data plane (gossip, health checks, DNS) from the control plane (Raft consensus, catalog, KV store). This means even if the server cluster is temporarily unavailable, clients can still serve cached DNS responses and continue running health checks locally.
Service Discovery
Service discovery is Consul's foundational feature. Services register with the local Consul agent, and consumers find them via DNS or the HTTP API. Only healthy instances are returned in queries.
Service registration
Services can be registered via a JSON/HCL config file loaded by the agent, or dynamically via the HTTP API.
// /etc/consul.d/web-service.json
{
"service": {
"name": "web",
"port": 8080,
"tags": ["production", "v2.1"],
"meta": {
"version": "2.1.0",
"team": "platform"
},
"check": {
"http": "http://localhost:8080/health",
"interval": "10s",
"timeout": "3s",
"deregister_critical_service_after": "90s"
}
}
}
DNS interface
Consul exposes a DNS interface on port 8600 (by default). Services are queryable at <service>.service.consul.
# Query for healthy instances of the "web" service
dig @127.0.0.1 -p 8600 web.service.consul
# Query for a specific tag (format: <tag>.<service>.service.consul)
dig @127.0.0.1 -p 8600 v2.web.service.consul
# SRV records (returns port information)
dig @127.0.0.1 -p 8600 web.service.consul SRV
# Query a service in another datacenter
dig @127.0.0.1 -p 8600 web.service.dc2.consul
HTTP API
# List all healthy instances of "web"
curl http://localhost:8500/v1/health/service/web?passing
# Register a service via API
curl --request PUT --data @web-service.json \
http://localhost:8500/v1/agent/service/register
# Deregister a service
curl --request PUT \
http://localhost:8500/v1/agent/service/deregister/web-1
# List all services in the catalog
curl http://localhost:8500/v1/catalog/services
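The health endpoint returns a JSON array of node/service entries; a consumer typically reduces it to address:port pairs. The sketch below works on an abbreviated sample payload (real responses carry many more fields):

```python
import json

# Abbreviated sample of a /v1/health/service/web?passing response
sample = json.dumps([
    {"Node": {"Address": "10.0.0.5"},
     "Service": {"Service": "web", "Address": "", "Port": 8080}},
    {"Node": {"Address": "10.0.0.6"},
     "Service": {"Service": "web", "Address": "10.1.0.6", "Port": 8080}},
])

def endpoints(payload):
    result = []
    for entry in json.loads(payload):
        # A service-level Address overrides the node address when set
        addr = entry["Service"]["Address"] or entry["Node"]["Address"]
        result.append((addr, entry["Service"]["Port"]))
    return result

print(endpoints(sample))  # [('10.0.0.5', 8080), ('10.1.0.6', 8080)]
```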
Health check types
| Type | How it works | Use case |
|---|---|---|
| http | Consul sends an HTTP GET to the specified URL. 2xx = passing, 429 = warning, anything else = critical. | Web services with a /health endpoint |
| tcp | Consul opens a TCP connection. Success = passing, refused/timeout = critical. | Databases, message queues, any TCP service |
| script | Consul executes a command. Exit 0 = passing, exit 1 = warning, any other exit = critical. | Custom checks (disk space, external deps) |
| ttl | Service must call the TTL update endpoint within the specified interval. No update = critical. | Services that push status rather than being polled |
| grpc | Consul calls the gRPC health checking protocol. | gRPC services implementing the standard health API |
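For example, a TTL check is registered like any other check but with a ttl field instead of a probe target (the check ID here is illustrative):

```json
{
  "check": {
    "id": "worker-heartbeat",
    "name": "Worker heartbeat",
    "ttl": "30s"
  }
}
```

The service must then report in before the TTL lapses, e.g. curl --request PUT http://localhost:8500/v1/agent/check/pass/worker-heartbeat, or the check turns critical.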
Catalog vs agent checks
Agent-level checks
Registered and executed by the local Consul agent. The agent monitors the service and reports status to the server. This is the standard approach — checks run where the service runs.
Catalog-level registration
Registered directly via the catalog API, bypassing the local agent. No health checks are executed — the entry is static. Used for external services (e.g., a third-party API) that you want to include in Consul's service discovery but can't install an agent on.
Key/Value Store
Consul includes a distributed, hierarchical key/value store replicated across all server nodes via Raft consensus. It's used for dynamic configuration, feature flags, coordination primitives (leader election, distributed locks), and service metadata.
KV API
# Set a key
consul kv put config/web/max-connections 100
# Get a key
consul kv get config/web/max-connections
# Get all keys under a prefix
consul kv get -recurse config/web/
# Delete a key
consul kv delete config/web/max-connections
# Delete a prefix recursively
consul kv delete -recurse config/web/
# Export all KV data as JSON
consul kv export "" > consul-kv-backup.json
# Import KV data from JSON
consul kv import @consul-kv-backup.json
HTTP API for KV
# PUT a value (stored as-is; GET responses return it base64-encoded)
curl --request PUT --data 'database-primary.dc1.consul' \
http://localhost:8500/v1/kv/config/db/host
# GET a value
curl http://localhost:8500/v1/kv/config/db/host
# GET with raw value (no JSON wrapper)
curl http://localhost:8500/v1/kv/config/db/host?raw
# CAS (Check-And-Set) — only update if ModifyIndex matches
curl --request PUT --data 'new-value' \
"http://localhost:8500/v1/kv/config/db/host?cas=42"
Watches
Consul watches monitor KV keys (or services, nodes, etc.) and invoke a handler when the data changes. Useful for dynamic configuration reloading.
// /etc/consul.d/watch-config.json
{
"watches": [
{
"type": "key",
"key": "config/web/max-connections",
"handler_type": "script",
"args": ["/usr/local/bin/reload-config.sh"]
},
{
"type": "keyprefix",
"prefix": "config/web/",
"handler_type": "http",
"http_handler_config": {
"path": "http://localhost:8080/consul-callback",
"method": "POST"
}
}
]
}
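A script handler for a key watch receives the watched data as JSON on stdin, with the same base64-encoded Value as the KV API. A handler might decode it like this (payload shape assumed from the KV API; the sample value is illustrative):

```python
import base64
import json

def parse_key_watch(payload):
    """Decode a 'key' watch payload into (key, decoded value)."""
    data = json.loads(payload)
    if data is None:  # the watch fires with null before the key exists
        return None
    return data["Key"], base64.b64decode(data["Value"]).decode()

# Sample payload as a handler script would read it from stdin:
sample = '{"Key": "config/web/max-connections", "Flags": 0, "Value": "MTAw"}'
print(parse_key_watch(sample))  # ('config/web/max-connections', '100')
```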
Sessions & distributed locking
Consul sessions provide a mechanism for building distributed locks and leader election. A session is tied to a node's health check — if the node fails, the session is invalidated and locks are released.
# Create a session
SESSION_ID=$(curl -s --request PUT \
--data '{"Name": "my-lock", "TTL": "15s", "Behavior": "release"}' \
http://localhost:8500/v1/session/create | jq -r '.ID')
# Acquire a lock on a key
curl --request PUT --data 'lock-holder-1' \
"http://localhost:8500/v1/kv/locks/my-resource?acquire=$SESSION_ID"
# Release the lock
curl --request PUT \
"http://localhost:8500/v1/kv/locks/my-resource?release=$SESSION_ID"
# Renew a session (reset TTL)
curl --request PUT \
"http://localhost:8500/v1/session/renew/$SESSION_ID"
Multiple service instances race to acquire a lock on a well-known KV key. The winner becomes the leader. All instances watch the key. When the leader's session expires (crash, network failure), the lock is released and another instance acquires it. This is how many HA systems implement leader election with Consul.
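The race can be modeled in a few lines — an in-memory toy model of the acquire semantics, not a client for the real API:

```python
class LockModel:
    """Toy model: a KV lock key held by at most one live session."""
    def __init__(self):
        self.holder = None

    def acquire(self, session):
        # ?acquire=<session> succeeds only when no session holds the key
        if self.holder is None:
            self.holder = session
            return True
        return False

    def invalidate(self, session):
        # Session expiry (crash, TTL lapse) releases the lock
        if self.holder == session:
            self.holder = None

lock = LockModel()
print(lock.acquire("instance-1"))  # True  -> instance-1 is leader
print(lock.acquire("instance-2"))  # False -> instance-2 keeps watching
lock.invalidate("instance-1")      # leader crashes, session invalidated
print(lock.acquire("instance-2"))  # True  -> failover complete
```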
Service Mesh (Connect)
Consul Connect is Consul's service mesh feature. It provides mutual TLS (mTLS) encryption between services and intention-based authorization (which service can talk to which). Sidecar proxies (Envoy) handle the encryption and authorization transparently — application code doesn't change.
How it works
Consul runs a built-in certificate authority (or delegates certificate issuance to Vault) that gives each service a TLS certificate encoding its identity. Each instance gets a sidecar proxy; the proxies establish mutual TLS with each other, verify the peer's identity from its certificate, and consult intentions before allowing traffic. The application itself only ever talks to its local proxy over localhost.
Sidecar proxy registration
// /etc/consul.d/web-with-sidecar.json
{
"service": {
"name": "web",
"port": 8080,
"connect": {
"sidecar_service": {
"proxy": {
"upstreams": [
{
"destination_name": "api",
"local_bind_port": 5000
},
{
"destination_name": "cache",
"local_bind_port": 6379
}
]
}
}
},
"check": {
"http": "http://localhost:8080/health",
"interval": "10s"
}
}
}
With this configuration, the web service connects to the api service by hitting localhost:5000. The Envoy sidecar proxy intercepts, establishes an mTLS connection to the api service's sidecar, and forwards the traffic.
Intentions (allow/deny)
Intentions define which services may communicate. They are evaluated at the sidecar proxy layer using mTLS identity.
# Allow "web" to talk to "api" (legacy CLI, deprecated since v1.9.0)
consul intention create -allow web api
# Deny "web" from talking to "database" (legacy CLI)
consul intention create -deny web database
# Preferred: use config entries via consul config write
consul config write service-intentions.hcl
# List all intentions
consul intention list
# Delete an intention
consul intention delete web api
Intentions should be managed as service-intentions config entries (the preferred approach since Consul v1.9.0):
# service-intentions.hcl
Kind = "service-intentions"
Name = "api"
Sources = [
{
Name = "web"
Action = "allow"
},
{
Name = "monitoring"
Action = "allow"
},
{
Name = "*"
Action = "deny"
}
]
Transparent proxy
In transparent proxy mode (default on Kubernetes), all outbound traffic from a service is automatically redirected through the sidecar proxy via iptables rules. The application connects to services by their normal DNS names — no need to configure upstream local_bind_port values. Consul's Envoy sidecar intercepts the traffic and applies mTLS and intentions automatically.
Use transparent proxy mode when deploying on Kubernetes. It removes the need for applications to be aware of the mesh. For VM-based deployments, explicitly configure upstreams in the sidecar proxy registration.
ACL System
Consul's ACL system controls access to the service catalog, KV store, agent APIs, intentions, and all other Consul resources. It uses tokens, policies, and roles. ACLs should be enabled in any production deployment.
Bootstrapping ACLs
# /etc/consul.d/consul.hcl (server config)
acl {
enabled = true
default_policy = "deny"
down_policy = "extend-cache"
enable_token_persistence = true
tokens {
initial_management = "b1gs33cr3t-0000-0000-0000-000000000001"
}
}
# Bootstrap the ACL system (run once on a server)
consul acl bootstrap
# Output includes the initial management token:
# AccessorID: a1b2c3d4-...
# SecretID: b1gs33cr3t-0000-0000-0000-000000000001
# Description: Bootstrap Token (Global Management)
# Policies: global-management
Policies
ACL policies define rules as HCL or JSON. Each rule grants read, write, list, or deny permissions on resources.
# web-service-policy.hcl
# Allow the web service to register itself and read other services
service "web" {
policy = "write"
}
service_prefix "" {
policy = "read"
}
# Allow reading KV config for web
key_prefix "config/web/" {
policy = "read"
}
# Allow the node to register
node_prefix "" {
policy = "write"
}
# Allow reading health checks
health_prefix "" {
policy = "read"
}
# Create the policy
consul acl policy create \
-name "web-service" \
-description "Policy for web service agents" \
-rules @web-service-policy.hcl
# Create a token with this policy
consul acl token create \
-description "Token for web service" \
-policy-name "web-service"
Roles and auth methods
Roles — grouping policies
Roles bundle multiple policies together. Assign a role to a token instead of individual policies. Easier to manage when many services share the same access patterns.
# Create a role
consul acl role create \
-name "backend-services" \
-policy-name "service-read" \
-policy-name "kv-config-read"
# Create a token with this role
consul acl token create \
-role-name "backend-services"
Auth methods
Auth methods allow external identity providers (Kubernetes, JWT/OIDC, AWS IAM) to automatically generate Consul ACL tokens. In Kubernetes, the consul-k8s injector uses a Kubernetes auth method so pods get Consul tokens automatically based on their service account.
Always set default_policy = "deny" in production. With the default of allow, any request without a token has full access to every Consul resource. A deny default means every agent, service, and operator needs an explicit token.
DNS & Networking
Consul provides a built-in DNS server that makes service discovery as simple as a DNS lookup. The .consul domain is the default top-level domain for all Consul queries.
DNS query format
| Query | Returns |
|---|---|
| <service>.service.consul | A/AAAA records for healthy instances |
| <service>.service.consul SRV | SRV records with port and node info |
| <tag>.<service>.service.consul | Healthy instances filtered by service tag |
| <service>.service.<dc>.consul | Service in a specific datacenter |
| <node>.node.consul | A record for a specific node |
| <query>.query.consul | Prepared query result |
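The naming scheme composes mechanically; a small helper (illustrative only, assuming the default .consul domain) makes the pattern explicit:

```python
def consul_dns_name(service, tag=None, dc=None, domain="consul"):
    """Build a Consul DNS query name: [<tag>.]<service>.service[.<dc>].<domain>"""
    parts = []
    if tag:
        parts.append(tag)
    parts += [service, "service"]
    if dc:
        parts.append(dc)
    parts.append(domain)
    return ".".join(parts)

print(consul_dns_name("web"))                    # web.service.consul
print(consul_dns_name("web", tag="production"))  # production.web.service.consul
print(consul_dns_name("web", dc="dc2"))          # web.service.dc2.consul
```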
DNS forwarding setup
Consul DNS runs on port 8600 by default. To use it transparently, configure your system's DNS resolver (systemd-resolved, dnsmasq, or BIND) to forward .consul queries to the Consul agent.
# Using dnsmasq (add to /etc/dnsmasq.d/consul.conf)
server=/consul/127.0.0.1#8600
# Using systemd-resolved
# /etc/systemd/resolved.conf.d/consul.conf
[Resolve]
DNS=127.0.0.1:8600
Domains=~consul
# Using iptables to redirect all port 53 traffic to 8600
# (Consul then serves all DNS — set "recursors" in consul.hcl so non-.consul names still resolve)
iptables -t nat -A PREROUTING -p udp -m udp --dport 53 -j REDIRECT --to-ports 8600
iptables -t nat -A PREROUTING -p tcp -m tcp --dport 53 -j REDIRECT --to-ports 8600
# Or configure Consul to bind DNS to port 53 directly (requires root or CAP_NET_BIND_SERVICE)
# In consul.hcl:
# ports { dns = 53 }
Prepared queries
Prepared queries are stored, parameterized service queries with failover logic. They enable cross-datacenter failover and geo-routing at the DNS level.
# Create a prepared query with DC failover
curl --request POST --data '{
"Name": "web-failover",
"Service": {
"Service": "web",
"Tags": ["production"],
"Failover": {
"NearestN": 2,
"Datacenters": ["dc2", "dc3"]
}
}
}' http://localhost:8500/v1/query
# Query via DNS: web-failover.query.consul
# Returns local DC results first, fails over to dc2/dc3 if no healthy instances
Network segments & mesh gateways
Network Segments
Enterprise feature. Allows partitioning the LAN gossip pool into isolated segments. Useful when network ACLs prevent full mesh connectivity between all agents. Each segment has its own gossip pool with a dedicated port.
Mesh Gateways
Mesh gateways enable Consul Connect traffic to cross network boundaries (datacenters, partitions, VPCs) without requiring direct connectivity between all services. Gateway nodes proxy mTLS traffic through a single, well-known endpoint.
Multi-Datacenter
Consul is built for multi-datacenter deployments. Each datacenter runs an independent Consul cluster with its own Raft quorum. Datacenters are connected via WAN gossip and can optionally use mesh gateways for service mesh traffic.
WAN federation
Server agents from different datacenters join a shared WAN gossip pool. This enables cross-DC service discovery and RPC forwarding.
# Server config for dc1
datacenter = "dc1"
primary_datacenter = "dc1"
server = true
bootstrap_expect = 3
# Join WAN with dc2 servers
retry_join_wan = ["10.10.2.11", "10.10.2.12", "10.10.2.13"]
# Server config for dc2
datacenter = "dc2"
primary_datacenter = "dc1"
server = true
bootstrap_expect = 3
retry_join_wan = ["10.10.1.11", "10.10.1.12", "10.10.1.13"]
Cross-DC service discovery
# Query a service in a remote datacenter
dig @127.0.0.1 -p 8600 web.service.dc2.consul
# Via HTTP API
curl "http://localhost:8500/v1/health/service/web?dc=dc2&passing"
# List all known datacenters
curl http://localhost:8500/v1/catalog/datacenters
Replication
The primary_datacenter is the authoritative source for certain data. The following are replicated from the primary to secondary DCs:
- ACL policies, tokens, and roles — managed centrally in the primary DC and replicated
- Config entries (intentions, service-defaults, proxy-defaults) — replicated for consistent service mesh behavior
- CA certificates — the root CA is in the primary DC; secondary DCs get intermediate CAs
KV data is not replicated across datacenters by default. Each DC has its own KV store. If you need shared configuration across DCs, use consul-replicate (HashiCorp tool, largely unmaintained) or manage config entries through a CI/CD pipeline that writes to each DC.
Mesh gateways for DC peering
WAN federation requires direct server-to-server connectivity across DCs. Cluster peering (newer approach) uses mesh gateways instead, requiring only a single gateway endpoint to be reachable. This is simpler for cloud environments where opening multiple ports between VPCs is complex.
# Generate a peering token in dc1
consul peering generate-token -name dc2
# Establish peering from dc2
consul peering establish -name dc1 -peering-token <token>
# List peerings
consul peering list
# After peering, export services to make them discoverable
# Config entry in dc1:
# Kind = "exported-services"
# Name = "default"
# Services = [{ Name = "web", Consumers = [{ Peer = "dc2" }] }]
Kubernetes Integration
Consul integrates deeply with Kubernetes via the official Helm chart and the consul-k8s CLI. Since Consul 1.14 / consul-k8s v1.0, Kubernetes deployments use Consul Dataplane instead of per-node client agents. Consul Dataplane runs as a sidecar alongside Envoy, communicates with servers over gRPC (no gossip protocol needed), and simplifies networking, upgrades, and ACL token management. The Helm chart deploys Consul servers, injects Envoy sidecar proxies into application pods, and can sync the Kubernetes service catalog with Consul.
Helm chart deployment
# Add the HashiCorp Helm repository
helm repo add hashicorp https://helm.releases.hashicorp.com
helm repo update
# Install Consul with default values
helm install consul hashicorp/consul \
--namespace consul --create-namespace \
--values consul-values.yaml
# consul-values.yaml
global:
name: consul
datacenter: dc1
image: "hashicorp/consul:1.22" # Update to latest stable; check releases.hashicorp.com
tls:
enabled: true
acls:
manageSystemACLs: true
server:
replicas: 3
storageClass: gp3
storage: 10Gi
resources:
requests:
memory: "200Mi"
cpu: "100m"
limits:
memory: "1Gi"
cpu: "1000m"
connectInject:
enabled: true
transparentProxy:
defaultEnabled: true
default: false # opt-in per pod with annotation
syncCatalog:
enabled: true
toConsul: true
toK8S: true
ui:
enabled: true
service:
type: LoadBalancer
Connect-inject (sidecar injection)
The connectInject controller watches for pods with the annotation consul.hashicorp.com/connect-inject: "true" and automatically injects an Envoy sidecar proxy.
# Example pod with Consul Connect sidecar injection
apiVersion: apps/v1
kind: Deployment
metadata:
name: web
spec:
replicas: 3
selector:
matchLabels:
app: web
template:
metadata:
labels:
app: web
annotations:
consul.hashicorp.com/connect-inject: "true"
consul.hashicorp.com/connect-service-upstreams: "api:5000,cache:6379"
spec:
serviceAccountName: web
containers:
- name: web
image: myorg/web:v2.1
ports:
- containerPort: 8080
env:
- name: API_URL
value: "http://localhost:5000"
- name: CACHE_URL
value: "redis://localhost:6379"
Sync catalog
Catalog sync keeps Kubernetes services and Consul services in sync. Services registered in Consul appear as Kubernetes ExternalName services, and Kubernetes services appear in the Consul catalog.
CRDs and gateways
Custom resources (CRDs)
Consul on Kubernetes uses CRDs for managing service mesh configuration: ServiceIntentions, ServiceDefaults, ServiceRouter, ServiceSplitter, ProxyDefaults, IngressGateway, TerminatingGateway, and more. This allows GitOps workflows via kubectl apply.
Ingress & terminating gateways
Ingress gateway — exposes mesh services to external traffic (like an ingress controller). Terminating gateway — allows mesh services to connect to external, non-mesh services while maintaining mTLS within the mesh.
# ServiceIntentions CRD
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceIntentions
metadata:
name: api
spec:
destination:
name: api
sources:
- name: web
action: allow
- name: "*"
action: deny
Docker Deployment
A Docker Compose setup for running a 3-server Consul cluster with client agents. Suitable for development and testing — all containers share one host, so this topology has no real fault tolerance and shouldn't be treated as production.
Consul server configuration
# config/server.hcl
datacenter = "dc1"
data_dir = "/consul/data"
log_level = "INFO"
server = true
bootstrap_expect = 3
ui_config {
enabled = true
}
client_addr = "0.0.0.0"
bind_addr = "0.0.0.0"
addresses {
http = "0.0.0.0"
}
retry_join = ["consul-server-1", "consul-server-2", "consul-server-3"]
connect {
enabled = true
}
acl {
enabled = true
default_policy = "deny"
down_policy = "extend-cache"
enable_token_persistence = true
}
performance {
raft_multiplier = 1
}
Consul client configuration
# config/client.hcl
datacenter = "dc1"
data_dir = "/consul/data"
log_level = "INFO"
server = false
client_addr = "0.0.0.0"
bind_addr = "0.0.0.0"
retry_join = ["consul-server-1", "consul-server-2", "consul-server-3"]
connect {
enabled = true
}
ports {
grpc = 8502 # plaintext gRPC (xDS for Envoy)
grpc_tls = 8503 # gRPC with TLS (default on servers since v1.14)
}
Docker Compose (3 servers + 2 clients)
# docker-compose.yml
services:
consul-server-1:
image: hashicorp/consul:1.22
container_name: consul-server-1
command: agent -server -node=server-1
volumes:
- ./config/server.hcl:/consul/config/server.hcl:ro
- consul-data-1:/consul/data
ports:
- "8500:8500" # HTTP API + UI
- "8600:8600/udp" # DNS
- "8600:8600/tcp"
networks:
- consul-net
restart: unless-stopped
consul-server-2:
image: hashicorp/consul:1.22
container_name: consul-server-2
command: agent -server -node=server-2
volumes:
- ./config/server.hcl:/consul/config/server.hcl:ro
- consul-data-2:/consul/data
networks:
- consul-net
restart: unless-stopped
consul-server-3:
image: hashicorp/consul:1.22
container_name: consul-server-3
command: agent -server -node=server-3
volumes:
- ./config/server.hcl:/consul/config/server.hcl:ro
- consul-data-3:/consul/data
networks:
- consul-net
restart: unless-stopped
consul-client-1:
image: hashicorp/consul:1.22
container_name: consul-client-1
command: agent -node=client-1
volumes:
- ./config/client.hcl:/consul/config/client.hcl:ro
networks:
- consul-net
depends_on:
- consul-server-1
- consul-server-2
- consul-server-3
restart: unless-stopped
consul-client-2:
image: hashicorp/consul:1.22
container_name: consul-client-2
command: agent -node=client-2
volumes:
- ./config/client.hcl:/consul/config/client.hcl:ro
networks:
- consul-net
depends_on:
- consul-server-1
- consul-server-2
- consul-server-3
restart: unless-stopped
volumes:
consul-data-1:
consul-data-2:
consul-data-3:
networks:
consul-net:
driver: bridge
# Start the cluster
docker compose up -d
# Check cluster members
docker exec consul-server-1 consul members
# Bootstrap ACLs (run once after first start)
docker exec consul-server-1 consul acl bootstrap
# Access the UI at http://localhost:8500
Observability
Monitoring a Consul cluster is essential for maintaining reliability. Consul exposes rich telemetry, provides a built-in UI dashboard, supports audit logging (Enterprise), and includes a snapshot mechanism for backups.
Telemetry & Prometheus
# consul.hcl — enable Prometheus metrics
telemetry {
prometheus_retention_time = "60s"
disable_hostname = true
}
# Metrics are then available at:
# http://localhost:8500/v1/agent/metrics?format=prometheus
# prometheus.yml scrape config
scrape_configs:
- job_name: 'consul'
metrics_path: '/v1/agent/metrics'
params:
format: ['prometheus']
static_configs:
- targets:
- 'consul-server-1:8500'
- 'consul-server-2:8500'
- 'consul-server-3:8500'
Key metrics to monitor
| Metric | What it tells you | Alert threshold |
|---|---|---|
| consul.raft.leader.lastContact | Time since the leader last contacted followers | > 200ms (leader instability) |
| consul.raft.commitTime | Time to commit a new log entry | > 500ms (slow commits) |
| consul.serf.member.flap | Number of membership flaps (join/leave churn) | > 0 sustained (network issues) |
| consul.catalog.service.count | Total services registered | Sudden drops (deregistration storm) |
| consul.health.service.critical | Critical health checks | > 0 for key services |
| consul.rpc.request | RPC request rate to servers | Spikes may indicate thundering herd |
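As a starting point, the thresholds above can be encoded as Prometheus alerting rules. The rules below are a sketch: the exact metric names and labels depend on your Consul version's Prometheus exposition (timers are exported as summaries with quantile labels), so verify them against your agent's /v1/agent/metrics output first.

```yaml
groups:
  - name: consul
    rules:
      - alert: ConsulLeaderContactSlow
        # consul.raft.leader.lastContact, 90th percentile over 200ms
        expr: consul_raft_leader_lastContact{quantile="0.9"} > 200
        for: 5m
        labels:
          severity: page
      - alert: ConsulMemberFlapping
        # sustained membership churn usually means network trouble
        expr: rate(consul_serf_member_flap[5m]) > 0
        for: 10m
        labels:
          severity: warn
```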
UI dashboard
Consul includes a built-in web UI (enabled with ui_config { enabled = true }) that shows services, nodes, KV store, intentions, and cluster health. Access it at http://<consul-addr>:8500/ui. The UI supports filtering by datacenter, namespace (Enterprise), and partition.
Audit logging (Enterprise)
Consul Enterprise supports audit logging that records every API request, including the token used, the operation, and the result. Essential for compliance and security forensics.
# Enterprise only
audit {
enabled = true
sink "file" {
type = "file"
format = "json"
path = "/consul/audit/audit.json"
delivery_guarantee = "best-effort"
rotate_duration = "24h"
rotate_max_files = 15
}
}
Snapshots (backup & restore)
# Take a snapshot (includes KV, catalog, ACLs, sessions, etc.)
consul snapshot save consul-backup-$(date +%Y%m%d).snap
# Restore from a snapshot
consul snapshot restore consul-backup-20260320.snap
# Inspect a snapshot
consul snapshot inspect consul-backup-20260320.snap
# Automated snapshot agent (Enterprise, or use cron with OSS)
# Cron example for OSS:
# 0 */6 * * * consul snapshot save /backups/consul-$(date +\%Y\%m\%d-\%H\%M).snap
Take snapshots at least every 6 hours and before any cluster maintenance (upgrades, node replacement). Snapshots are the only way to recover from a total cluster loss. Store them off-cluster in S3, GCS, or another durable location.
Production Checklist
- Run 3 or 5 server agents — never 1 (no fault tolerance), never an even number (a 4th server adds no tolerance). 3 tolerates 1 failure, 5 tolerates 2.
- Enable ACLs with default deny — set default_policy = "deny". Bootstrap ACLs and create granular tokens for every agent and service. Never use the management token for regular operations.
- Enable TLS everywhere — encrypt RPC, HTTP, and gossip traffic. Use auto_encrypt for automatic client TLS certificate distribution from servers.
- Enable gossip encryption — generate a gossip key with consul keygen and set encrypt in the config. All agents must share the same key.
- Enable Connect (service mesh) — even if you don't need mTLS today, enabling Connect allows incremental adoption. Start with intentions in allow-all mode, then tighten.
- Pin the Consul version — use specific image tags (hashicorp/consul:1.22.1), never :latest. Upgrade deliberately with tested rollout plans.
- Set raft_multiplier = 1 — the default of 5 is development-friendly. Production should use 1 for tighter leader election timeouts and faster failover.
- Use persistent storage — server data_dir must be on persistent volumes. Losing Raft data means losing quorum state.
- Automate snapshots — schedule regular consul snapshot save runs and store backups off-cluster. This is your disaster recovery mechanism.
- Monitor key metrics — alert on raft.leader.lastContact, serf.member.flap, and health.service.critical. Set up Prometheus scraping and Grafana dashboards.
- Configure DNS forwarding — set up dnsmasq or systemd-resolved to forward .consul queries to the local Consul agent. Applications should resolve services via DNS.
- Set deregister_critical_service_after — on health checks, auto-deregister services that stay critical for too long (e.g., "90s"). Prevents stale entries from accumulating.
- Use retry_join with multiple addresses — never hardcode a single server IP. Use cloud auto-join (provider=aws tag_key=consul tag_value=server) or multiple IPs/DNS names.
- Separate server and client configs — servers and clients have different resource requirements and config. Don't run a one-size-fits-all config.
- Plan datacenter naming — datacenter names are permanent and used in DNS (<service>.service.dc1.consul). Choose meaningful, stable names from the start.