PostgreSQL

Production architecture, high availability, performance tuning & operational best practices

01

Overview

PostgreSQL is an advanced, open-source object-relational database management system with nearly 40 years of active development (origins dating to 1986 at UC Berkeley). It has earned a reputation for reliability, data integrity, and correctness, making it the default choice for most greenfield projects that need a relational database. As of early 2026, PostgreSQL 18 is the current major release (September 2025), with PostgreSQL 17 widely deployed in production.

What sets PostgreSQL apart from other relational databases is its extensibility. You can define your own data types, build custom functions in multiple languages (PL/pgSQL, PL/Python, PL/Perl, C), create custom index types, and even write extensions that fundamentally change how the database behaves. This extensibility is why PostgreSQL has become the foundation for specialized databases like TimescaleDB (time-series), Citus (distributed), and pgvector (AI/ML embeddings).

Core Strengths

MVCC Concurrency

With Multi-Version Concurrency Control, readers never block writers and writers never block readers. Each transaction sees a consistent snapshot of the database. No read locks, no write starvation — this is fundamental to PostgreSQL's performance under concurrent workloads.
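A minimal two-session sketch of snapshot isolation in practice (the accounts table and values are illustrative):

```sql
-- Session A
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE id = 1;   -- returns 100

-- Session B, concurrently: the writer is not blocked by A's read
UPDATE accounts SET balance = 200 WHERE id = 1;
COMMIT;

-- Session A: still sees the snapshot taken at its first statement
SELECT balance FROM accounts WHERE id = 1;   -- still returns 100
COMMIT;
```

After Session A commits, a fresh transaction sees the updated value.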

SQL Standards Compliance

PostgreSQL implements a large portion of the SQL:2023 standard. CTEs, window functions, lateral joins, MERGE, JSON path queries, generated columns — features that other databases either lack or implement partially.
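For instance, a CTE combined with a window function (against a hypothetical orders table) reads naturally:

```sql
-- Rank each customer's orders from the last 30 days by total
WITH recent AS (
    SELECT customer_id, id, total
    FROM orders
    WHERE created_at > now() - interval '30 days'
)
SELECT customer_id, id, total,
       rank() OVER (PARTITION BY customer_id ORDER BY total DESC) AS rnk
FROM recent;
```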

ACID Data Integrity

Full ACID compliance with Write-Ahead Logging (WAL). Serializable isolation level available. Foreign keys, check constraints, exclusion constraints, and triggers enforce data integrity at the database level.

EXTENSIBLE Ecosystem

Over 1,000 extensions available. PostGIS (geospatial), pgvector (embeddings), TimescaleDB (time-series), pg_cron (scheduling), pgAudit (audit logging). Extensions can add data types, operators, index methods, and even custom storage engines.

When to Choose PostgreSQL

PostgreSQL is the right choice for most workloads where you need a relational database. It excels at OLTP, handles analytical queries well (especially with partitioning and parallel query), supports JSON/JSONB for semi-structured data, and has native full-text search. The only cases where you might look elsewhere: extremely high write throughput with relaxed consistency (consider Cassandra/ScyllaDB), pure key-value at massive scale (consider Redis/DynamoDB), or when your team is deeply invested in another ecosystem (SQL Server/.NET, Oracle/Java EE).

Recommendation

If you're starting a new project and need a relational database, default to PostgreSQL unless you have a specific, documented reason not to. The ecosystem, community, and operational tooling are unmatched in the open-source world.

02

Architecture

Understanding PostgreSQL's process model and memory architecture is essential for proper tuning and troubleshooting. Unlike MySQL's thread-per-connection model, PostgreSQL uses a process-per-connection model — each client connection gets its own operating system process. (Note: as of PG 15, the stats collector process was removed and statistics are now stored in shared memory.)

Process Model

+-------------------------------------------------------------+
|                         Postmaster                          |
|                    (PID 1 / main process)                   |
|        Listens on port 5432, forks backend processes        |
+------------+------------------------------------------------+
             |  fork()
   +---------+------------+---------------------+
   v                      v                     v
+------------------+ +------------------+ +------------------+
| Backend Process  | | Backend Process  | | Backend Process  |
| (client conn 1)  | | (client conn 2)  | | (client conn N)  |
+------------------+ +------------------+ +------------------+

Background Workers (always running):
+--------------+ +--------------+ +--------------+ +--------------+
|  WAL Writer  | | Checkpointer | |  Autovacuum  | |  BG Writer   |
|              | |              | |   Launcher   | |              |
+--------------+ +--------------+ +--------------+ +--------------+
+--------------+ +--------------+
| Logical Rep  | |  WAL Sender  |
|    Worker    | | (per replica)|
+--------------+ +--------------+

The postmaster is the parent process. It listens for incoming connections and forks a new backend process for each one. This model is simple and robust — a crash in one backend doesn't bring down others — but it means each connection consumes real OS memory (typically 5-10 MB per idle connection, more under load). This is why connection pooling is critical in production.
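You can see the one-process-per-connection mapping directly: every row in pg_stat_activity carries the OS process ID of its backend.

```sql
-- Each row is a real OS process; pid matches what ps/top shows
SELECT pid, usename, state, backend_type
FROM pg_stat_activity
ORDER BY backend_type;
```

Background workers (checkpointer, walwriter, autovacuum launcher) show up here too, under their own backend_type values.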

Shared Memory

All backend processes share a region of memory managed by PostgreSQL:

  • Shared Buffers — The main data cache. Pages read from disk are cached here. Default is 128 MB, but production systems typically set this to 25% of total RAM (up to about 8-16 GB, beyond which returns diminish).
  • WAL Buffers — Buffer for Write-Ahead Log entries before they're flushed to disk. Default auto-sized to 1/32 of shared_buffers, usually 16 MB is sufficient.
  • Lock Tables — Tracking all active locks across transactions.
  • Proc Array — Information about all active backends, used for MVCC visibility checks.

Write-Ahead Logging (WAL)

Every change to data is first written to the WAL before being applied to the actual data files. This ensures durability — if the server crashes, PostgreSQL replays the WAL to recover to a consistent state. WAL is also the foundation for replication (streaming replication sends WAL records to replicas) and Point-in-Time Recovery (PITR).

# WAL files live in pg_wal/ (formerly pg_xlog/)
# Each segment is 16 MB by default
ls -la /var/lib/postgresql/18/main/pg_wal/
# Watch WAL generation rate (pg_stat_wal view, PostgreSQL 14+)
sudo -u postgres psql -c "SELECT * FROM pg_stat_wal;"

MVCC Internals

PostgreSQL's MVCC implementation stores old row versions directly in the table (heap). When a row is updated, the old version is marked with the transaction ID that invalidated it, and a new version is inserted. This means:

  • Updates are essentially delete + insert (table bloat is a real concern)
  • VACUUM is required to reclaim space from dead tuples
  • Long-running transactions prevent VACUUM from cleaning up, leading to table bloat
  • The visibility map tracks which pages contain only all-visible tuples (enables index-only scans)
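The hidden system columns xmin and xmax expose this versioning (example against a hypothetical orders table):

```sql
-- xmin = transaction that created this row version, xmax = transaction
-- that invalidated it (0 while live); ctid = physical location in the heap
SELECT ctid, xmin, xmax, id FROM orders WHERE id = 1;

UPDATE orders SET status = 'shipped' WHERE id = 1;

-- The row now has a new ctid and xmin; the old version stays in the
-- heap as a dead tuple until VACUUM reclaims it
SELECT ctid, xmin, xmax, id FROM orders WHERE id = 1;
```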

Warning

Long-running transactions are one of the most common causes of production incidents in PostgreSQL. A single forgotten BEGIN without a COMMIT can prevent autovacuum from cleaning up dead tuples across the entire database, eventually leading to table bloat, degraded performance, and even transaction ID wraparound.
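A query along these lines (the 10-minute threshold is arbitrary) surfaces offending transactions:

```sql
-- Transactions open longer than 10 minutes
SELECT pid, usename, state, xact_start,
       now() - xact_start AS xact_age,
       left(query, 60) AS current_query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '10 minutes'
ORDER BY xact_start;

-- Terminate an offender if necessary:
-- SELECT pg_terminate_backend(<pid>);
```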

Tablespaces

Tablespaces allow you to place specific tables or indexes on different physical storage. Useful for putting hot indexes on fast NVMe while keeping large, infrequently accessed tables on cheaper storage. In practice, most cloud deployments use a single tablespace since the storage layer handles tiering.

-- Create a tablespace on fast storage
CREATE TABLESPACE fast_ssd LOCATION '/mnt/nvme/pg_data';

-- Move a table to the fast tablespace
ALTER TABLE orders SET TABLESPACE fast_ssd;

-- Move an index
ALTER INDEX orders_created_idx SET TABLESPACE fast_ssd;

03

Deployment Models

How you deploy PostgreSQL has a massive impact on operational complexity, performance, and cost. There's no single right answer — it depends on your team's capabilities, compliance requirements, and budget.

Deployment Comparison

  • Bare Metal
      Pros: maximum performance, full control, no hypervisor overhead
      Cons: hardware procurement, no live migration, manual capacity planning
      Best for: high-performance OLTP, financial systems, latency-critical workloads
  • VMs
      Pros: familiar, snapshotting, live migration, good isolation
      Cons: some overhead, storage I/O can be a bottleneck, right-sizing is tricky
      Best for: most production workloads, traditional enterprise environments
  • Containers (K8s)
      Pros: declarative, GitOps-friendly, easy dev/staging parity
      Cons: stateful workloads on K8s are hard, storage performance, networking complexity
      Best for: dev/staging, teams already deep in K8s, with operators like CloudNativePG
  • Managed (RDS, CloudSQL)
      Pros: automated backups, patching, HA, monitoring out of the box
      Cons: higher cost, limited extension support, less tuning control, vendor lock-in
      Best for: small teams, startups, when you don't have a dedicated DBA

Self-Hosted vs Managed

The decision to self-host vs use a managed service comes down to team expertise and control requirements. If your team has strong PostgreSQL operational knowledge and you need full control over extensions, configuration, and upgrade timing, self-host. If you're a small team without a DBA, managed services like AWS RDS, Google CloudSQL, or Azure Database for PostgreSQL handle the undifferentiated heavy lifting.

Info

Managed services often lag behind upstream PostgreSQL by several months. As of early 2026, RDS supports PostgreSQL through version 18.3, but new major versions typically take a few months to become available after upstream release. If you need day-one access to the latest features or specific extensions not yet supported by your provider, self-hosting is the only option.

Kubernetes Considerations

Running PostgreSQL on Kubernetes is increasingly viable but requires careful planning. The key challenges are storage performance, data durability, and operational tooling. Use a purpose-built operator:

Recommended CloudNativePG

CNCF Sandbox project. Declarative PostgreSQL clusters, automated failover, backup to S3/GCS, rolling updates. The most mature K8s-native operator. Doesn't rely on Patroni; leader election is built on the Kubernetes API (backed by etcd) rather than an external DCS.

Alternative Crunchy PGO

By Crunchy Data (v5). Uses Patroni for HA. Supports pgBackRest for backups, pgBouncer sidecar for connection pooling. Well-documented, production-tested. Open source (Apache 2.0).

Warning

Never run production PostgreSQL on Kubernetes with emptyDir or hostPath volumes. Always use persistent volumes backed by a proper storage class (e.g., gp3 on AWS, pd-ssd on GCP). Ensure your storage class has allowVolumeExpansion: true — you will need to grow volumes.

Installation on Ubuntu/Debian

Always use the official PGDG (PostgreSQL Global Development Group) repository for the latest stable packages:

# Add the official PostgreSQL repository
sudo apt install -y postgresql-common
sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh

# Install PostgreSQL 18
sudo apt install -y postgresql-18

# Verify
pg_lsclusters
sudo -u postgres psql -c "SELECT version();"

04

High Availability

PostgreSQL does not have built-in automatic failover. It provides the replication primitives — streaming replication and logical replication — but you need external tooling to detect failures and promote a standby. This is by design: the database doesn't try to be a distributed consensus system.

Streaming Replication

Physical (streaming) replication ships WAL records from the primary to one or more standby servers. The standby continuously applies these records, maintaining an exact byte-for-byte copy of the primary. Standby servers can serve read-only queries (hot standby).

# postgresql.conf on primary
wal_level = replica                # minimal, replica, or logical
max_wal_senders = 10               # max number of standby connections
wal_keep_size = 1GB                # retain WAL for slow standbys (PG 13+)
synchronous_standby_names = ''     # empty = async, set for sync replication

# pg_hba.conf on primary
host replication replicator 10.0.0.0/24 scram-sha-256

# On the standby server
sudo -u postgres pg_basebackup \
  -h primary.example.com \
  -D /var/lib/postgresql/18/main \
  -U replicator -Fp -Xs -P -R

# -R creates standby.signal and sets primary_conninfo in postgresql.auto.conf
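Once the standby is up, replication health can be checked from both sides:

```sql
-- On the primary: one row per connected standby
SELECT client_addr, state, sync_state, replay_lag
FROM pg_stat_replication;

-- On the standby: confirm WAL is being received and applied
SELECT status, flushed_lsn, last_msg_receipt_time
FROM pg_stat_wal_receiver;
SELECT pg_is_in_recovery(), pg_last_wal_replay_lsn();
```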

Synchronous vs Asynchronous Replication

  • Data loss risk
      Asynchronous: possible (transactions committed on primary but not yet shipped)
      Synchronous: zero data loss (RPO = 0)
  • Commit latency
      Asynchronous: local disk only
      Synchronous: must wait for standby acknowledgment (network RTT added)
  • Availability impact
      Asynchronous: none (primary doesn't depend on standby)
      Synchronous: if all sync standbys are down, writes block on primary
  • Use case
      Asynchronous: most production deployments, read replicas
      Synchronous: financial systems, regulatory requirements for zero data loss

Critical

With synchronous replication, if all synchronous standbys go offline, the primary will stop accepting writes entirely. Always configure at least two synchronous standbys or use synchronous_standby_names = 'ANY 1 (standby1, standby2, standby3)' to require only one of multiple standbys to acknowledge.

HA Tooling Comparison

Recommended Patroni

The industry standard for PostgreSQL HA. Uses a distributed consensus store (etcd, ZooKeeper, or Consul) for leader election. Handles automatic failover, planned switchover, and configuration management. Created by Zalando, battle-tested at massive scale.

  • REST API for management & monitoring
  • Supports synchronous replication management
  • Configurable failover conditions
  • Integrates with HAProxy or PgBouncer for connection routing
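Day-to-day operations go through patronictl; the config path and cluster name below are illustrative:

```shell
# Show cluster members, their roles, state, and replication lag
patronictl -c /etc/patroni/patroni.yml list

# Planned switchover (prompts for the candidate and confirmation)
patronictl -c /etc/patroni/patroni.yml switchover

# Apply configuration changes across the cluster
patronictl -c /etc/patroni/patroni.yml reload mycluster
```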

Alternative pg_auto_failover

By Citus/Microsoft. Simpler than Patroni — uses a "monitor" node instead of an external DCS. Good for smaller deployments where running etcd is overkill. Supports synchronous replication and read replicas. Limited to primary-standby topology.

Legacy repmgr

Mature tool, widely deployed. Handles replication setup, monitoring, and manual/automatic failover. Uses repmgrd daemon for automatic failover. Less opinionated than Patroni — requires more manual configuration. Community support has declined as Patroni gained dominance.

Cloud Managed HA

RDS Multi-AZ, CloudSQL HA, Azure HA: all handle failover automatically using storage-level replication. Failover takes 30-120 seconds. DNS-based endpoint switching. No topology control — you get what the provider gives you.

Patroni Architecture

+--------------+    +--------------+    +--------------+
|    etcd 1    |    |    etcd 2    |    |    etcd 3    |
|   (node 1)   |    |   (node 2)   |    |   (node 3)   |
+------+-------+    +------+-------+    +------+-------+
       |                   |                   |
       +---------+---------+---------+---------+
                 |    DCS Cluster    |
                 +---------+---------+
                           |
            +--------------+--------------+
            v                             v
+--------------+                  +--------------+
|   Patroni +  |  streaming rep   |   Patroni +  |
|  PostgreSQL  | ---------------> |  PostgreSQL  |
|   (primary)  |                  |   (standby)  |
+------+-------+                  +------+-------+
       |                                 |
       +---------------+-----------------+
                       v
               +--------------+
               |   HAProxy    |
               |  :5000 (rw)  |
               |  :5001 (ro)  |
               +--------------+

Split-Brain Prevention

Split-brain occurs when two nodes both believe they are the primary. Patroni prevents this through its DCS (distributed consensus store) — the leader key has a TTL, and the primary must continuously renew its lease. If it can't reach the DCS, it demotes itself. Additionally, configure watchdog support in Patroni to ensure the old primary's PostgreSQL is actually stopped before the new primary starts accepting writes.

Recommendation

Always run an odd number of DCS nodes (3 or 5) to maintain quorum. Place them in different failure domains (different racks, different AZs). The DCS cluster is the foundation of your HA — if it goes down, Patroni can't perform failover.

05

Connection Pooling

PostgreSQL's process-per-connection model means each connection consumes real memory and OS resources. A typical backend process uses 5-10 MB at idle and can grow to hundreds of MB under load. Without connection pooling, applications that open hundreds or thousands of connections will overwhelm the database — not through query load, but through sheer resource consumption.

Why You Need a Pooler

Most applications don't need to hold a dedicated database connection for every request. A web application serving 1,000 concurrent requests might only need 50-100 actual database connections if queries are fast. A connection pooler multiplexes application connections onto a smaller pool of real PostgreSQL connections.

# Check current connection count
sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity;"

# See connection limits
sudo -u postgres psql -c "SHOW max_connections;"
# Default: 100. Many production systems set 200-500.
# Beyond that, you almost certainly need a pooler.

Pooler Comparison

  • PgBouncer
      Architecture: lightweight, single-process, event-driven
      Memory usage: ~2 KB per connection
      Connection modes: transaction, session, statement
      Load balancing: no (pair with HAProxy)
      Prepared statements: protocol-level only (PgBouncer 1.21+)
      Best for: most production deployments
  • Pgpool-II
      Architecture: multi-process, feature-rich middleware
      Memory usage: ~100 KB per connection
      Connection modes: session only
      Load balancing: yes (read query routing)
      Prepared statements: full support
      Best for: read query load balancing
  • Odyssey
      Architecture: multi-threaded, advanced routing
      Memory usage: moderate (multi-threaded)
      Connection modes: transaction, session
      Load balancing: no
      Prepared statements: full support
      Best for: high-concurrency, multi-threaded workloads

PgBouncer Configuration

# /etc/pgbouncer/pgbouncer.ini
[databases]
myapp = host=127.0.0.1 port=5432 dbname=myapp

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt

# Pool mode: transaction is recommended for most apps
pool_mode = transaction

# Pool sizing
default_pool_size = 25        # connections per user/database pair
min_pool_size = 5             # pre-warm this many connections
reserve_pool_size = 5         # extra connections for burst traffic
reserve_pool_timeout = 3      # seconds before using reserve pool

max_client_conn = 1000        # total client connections PgBouncer accepts
max_db_connections = 100      # max real connections to PostgreSQL

# Timeouts
server_idle_timeout = 600     # close idle server connections after 10 min
client_idle_timeout = 0       # 0 = no timeout for idle clients
query_timeout = 0             # 0 = no query timeout (handle in app)

# Logging
log_connections = 0
log_disconnections = 0
stats_period = 60

Pooling Modes Explained

Recommended Transaction Pooling

Connection is assigned to a client for the duration of a transaction, then returned to the pool. Most efficient mode. However, session-level features like SET, LISTEN/NOTIFY, prepared statements (without protocol-level support), and advisory locks won't work as expected.

Caution Session Pooling

Connection is held for the entire client session. No multiplexing benefit — you still need one real connection per client. Only useful for limiting total connections and providing a stable endpoint.

Warning

In transaction pooling mode, SET search_path, SET statement_timeout, and other session-level settings will leak between clients. Use SET LOCAL (which resets at transaction end) or pass settings per-query. This is the number one source of bugs when adopting PgBouncer.
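A sketch of the safe pattern; heavy_report() stands in for any application query:

```sql
BEGIN;
-- SET LOCAL reverts automatically at COMMIT/ROLLBACK, so nothing
-- leaks to the next client that gets this pooled connection
SET LOCAL statement_timeout = '5s';
SET LOCAL search_path = app_schema, public;
SELECT heavy_report();   -- hypothetical application query
COMMIT;
```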

06

Backup & Recovery

Backups are your last line of defense. A PostgreSQL deployment without tested, automated backups is a disaster waiting to happen. The strategy you choose depends on your RPO (Recovery Point Objective), RTO (Recovery Time Objective), and database size.

Backup Methods Comparison

  • pg_dump
      Type: logical (SQL or custom format)
      RPO: time of last dump
      RTO: hours (proportional to DB size)
      Database size: < 100 GB
  • pg_basebackup
      Type: physical (full cluster copy)
      RPO: time of last backup
      RTO: minutes to hours
      Database size: any
  • PITR (WAL archiving)
      Type: physical + continuous WAL
      RPO: near-zero (last WAL segment)
      RTO: minutes to hours
      Database size: any
  • pgBackRest
      Type: physical + incremental + PITR
      RPO: near-zero
      RTO: minutes
      Database size: any (optimized for large DBs)

pgBackRest — The Production Standard

pgBackRest is the go-to backup tool for production PostgreSQL. It supports full, differential, and incremental backups, parallel backup and restore, backup verification, and S3/GCS/Azure storage. If you're running PostgreSQL in production and not using pgBackRest, you should switch.

# /etc/pgbackrest/pgbackrest.conf
[global]
repo1-type=s3
repo1-s3-bucket=mycompany-pg-backups
repo1-s3-region=us-east-1
repo1-s3-endpoint=s3.amazonaws.com
repo1-path=/pgbackrest
repo1-retention-full=4
repo1-retention-diff=14
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=your-encryption-passphrase

process-max=4
compress-type=zst
compress-level=3

[mydb]
pg1-path=/var/lib/postgresql/18/main
pg1-port=5432

# Create stanza (one-time setup)
sudo -u postgres pgbackrest --stanza=mydb stanza-create

# Full backup
sudo -u postgres pgbackrest --stanza=mydb --type=full backup

# Incremental backup (daily)
sudo -u postgres pgbackrest --stanza=mydb --type=incr backup

# Verify backups (checks backup integrity without full restore)
sudo -u postgres pgbackrest --stanza=mydb verify

# Restore to a point in time
sudo -u postgres pgbackrest --stanza=mydb \
  --type=time "--target=2026-03-19 14:30:00" \
  --target-action=promote restore

WAL Archiving for PITR

# postgresql.conf
archive_mode = on
archive_command = 'pgbackrest --stanza=mydb archive-push %p'
# Or without pgbackrest:
# archive_command = 'test ! -f /archive/%f && cp %p /archive/%f'

Critical

A backup you haven't tested restoring is not a backup — it's a hope. Schedule regular restore tests. At minimum, do a monthly automated restore to a staging environment and verify data integrity. pgBackRest's verify command checks backup consistency but does not validate that the restored database is functionally correct.

Backup Automation

# Example cron schedule
# Full backup every Sunday at 2 AM
0 2 * * 0 postgres pgbackrest --stanza=mydb --type=full backup
# Incremental backup Monday through Saturday at 2 AM
0 2 * * 1-6 postgres pgbackrest --stanza=mydb --type=incr backup
# Verify backup integrity daily at 6 AM
0 6 * * * postgres pgbackrest --stanza=mydb verify

Recommendation

Always encrypt backups at rest, especially when storing them in object storage. pgBackRest supports AES-256-CBC encryption natively. Store the encryption passphrase in a secrets manager (Vault, AWS Secrets Manager), not in the pgbackrest.conf file.

07

Performance Tuning

PostgreSQL's default configuration is intentionally conservative — it's designed to run on a system with 512 MB of RAM. For production, you need to tune the key memory parameters, understand query execution plans, choose the right indexes, and keep autovacuum healthy.

Essential Memory Parameters

  • shared_buffers
      Default: 128 MB
      Production guideline: 25% of RAM (max ~16 GB)
      Notes: main data cache; beyond ~16 GB the OS page cache is usually more efficient
  • effective_cache_size
      Default: 4 GB
      Production guideline: 50-75% of RAM
      Notes: not allocated; tells the planner how much cache is likely available (shared_buffers + OS cache)
  • work_mem
      Default: 4 MB
      Production guideline: 32-256 MB (careful!)
      Notes: per sort/hash operation, per query; a complex query can use multiple work_mem allocations, so set it too high and you risk OOM
  • maintenance_work_mem
      Default: 64 MB
      Production guideline: 512 MB - 2 GB
      Notes: for VACUUM, CREATE INDEX, ALTER TABLE; higher means faster maintenance operations
  • wal_buffers
      Default: -1 (auto-sized to 1/32 of shared_buffers)
      Production guideline: 64 MB
      Notes: 64 MB is safe for write-heavy workloads
  • max_connections
      Default: 100
      Production guideline: 200-500 (with pooler)
      Notes: each connection uses ~5-10 MB; use PgBouncer and keep this modest

# postgresql.conf - Example for a 64 GB RAM server
shared_buffers = 16GB
effective_cache_size = 48GB
work_mem = 128MB
maintenance_work_mem = 2GB
wal_buffers = 64MB
max_connections = 300
max_worker_processes = 16
max_parallel_workers_per_gather = 4
max_parallel_workers = 16

EXPLAIN ANALYZE

The single most important skill for PostgreSQL performance troubleshooting. EXPLAIN ANALYZE executes the query and shows the actual execution plan with real timing and row counts.

-- Always use ANALYZE to see actual vs estimated rows
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT o.id, o.total, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.created_at > '2026-01-01'
  AND o.status = 'completed'
ORDER BY o.total DESC
LIMIT 50;

-- Key things to look for:
-- 1. Seq Scan on large tables (missing index?)
-- 2. actual rows vs estimated rows mismatch (stale statistics?)
-- 3. Sort with external merge (work_mem too low?)
-- 4. Nested Loop with high loop count (missing index on join key?)

Index Strategies

Default B-tree

The default index type. Supports equality and range queries. Covers 90% of use cases. Supports multicolumn indexes (put high-selectivity columns first). Use INCLUDE to create covering indexes for index-only scans.

Specialized GIN

Generalized Inverted Index. For full-text search, JSONB operators (@>, ?), arrays, and trigram similarity. Slower to update but very fast for lookups on multi-valued columns.

Specialized GiST

Generalized Search Tree. For geometric data, range types, full-text search (tsvector), and PostGIS geospatial queries. Supports nearest-neighbor searches (ORDER BY <->).

Large Tables BRIN

Block Range Index. Tiny index size for naturally ordered data (timestamps, sequential IDs). A BRIN index on a 100 GB table might be only a few MB. Perfect for time-series append-only tables.

-- Covering index for index-only scans
CREATE INDEX idx_orders_status_created
ON orders (status, created_at DESC)
INCLUDE (total, customer_id);

-- Partial index (only index what you query)
CREATE INDEX idx_orders_active
ON orders (created_at DESC)
WHERE status = 'active';

-- BRIN for time-series data
CREATE INDEX idx_events_ts_brin
ON events USING brin (created_at)
WITH (pages_per_range = 32);

-- GIN for JSONB queries
CREATE INDEX idx_metadata_gin
ON documents USING gin (metadata jsonb_path_ops);

Vacuum & Autovacuum Tuning

VACUUM is not optional — it's a critical maintenance operation that reclaims space from dead tuples and updates visibility maps. Autovacuum runs this automatically, but the defaults are often too conservative for busy tables.

# postgresql.conf - Aggressive autovacuum for busy systems
autovacuum_max_workers = 6                  # default 3
autovacuum_naptime = 15s                    # default 1min
autovacuum_vacuum_threshold = 25            # default 50
autovacuum_vacuum_scale_factor = 0.01       # default 0.2 (triggers at 1% dead tuples)
autovacuum_analyze_threshold = 25
autovacuum_analyze_scale_factor = 0.005
autovacuum_vacuum_cost_delay = 2ms          # default 2ms (PG 12+)
autovacuum_vacuum_cost_limit = 1000         # default -1 (uses vacuum_cost_limit = 200)
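Global settings apply cluster-wide; for a handful of very hot tables it is common to override thresholds per table instead (values here are illustrative):

```sql
-- Vacuum the orders table after ~0.5% dead tuples plus a fixed floor,
-- regardless of the global scale factor
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.005,
    autovacuum_vacuum_threshold = 1000
);
```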

Warning

If autovacuum falls behind on a high-write table, dead tuples accumulate, table bloat grows, and eventually you risk transaction ID wraparound — which forces PostgreSQL into single-user mode for emergency VACUUM. Monitor pg_stat_user_tables.n_dead_tup and pg_stat_user_tables.last_autovacuum for every table.
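A monitoring query along the lines the warning suggests:

```sql
-- Dead tuples and last autovacuum time per table
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Distance to transaction ID wraparound per database
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```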

Table Partitioning

Declarative partitioning (PostgreSQL 10+) splits a large table into smaller physical tables based on a partition key. This improves query performance through partition pruning, makes maintenance operations faster (VACUUM on a smaller table), and enables efficient data lifecycle management (drop old partitions instead of deleting rows).

-- Range partitioning by date
CREATE TABLE events (
    id bigint GENERATED ALWAYS AS IDENTITY,
    event_type text NOT NULL,
    payload jsonb,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

-- Create monthly partitions
CREATE TABLE events_2026_01 PARTITION OF events
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
CREATE TABLE events_2026_02 PARTITION OF events
    FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');
CREATE TABLE events_2026_03 PARTITION OF events
    FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');

-- Automate with pg_partman
CREATE EXTENSION pg_partman;
SELECT partman.create_parent(
    p_parent_table := 'public.events',
    p_control := 'created_at',
    p_interval := '1 month',
    p_premake := 3
);
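To confirm pruning is working, EXPLAIN with constant predicates should show only the matching partition being scanned:

```sql
EXPLAIN (COSTS OFF)
SELECT count(*) FROM events
WHERE created_at >= '2026-03-01' AND created_at < '2026-03-10';
-- Expect a scan on events_2026_03 only, not the other partitions
```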

08

Security

PostgreSQL security operates at multiple layers: network access control (pg_hba.conf), authentication, encryption in transit (SSL/TLS), authorization (roles and privileges), row-level security, and audit logging. A production deployment should address all of these.

Authentication — pg_hba.conf

The pg_hba.conf file controls who can connect, from where, and how they authenticate. Rules are evaluated top-to-bottom; the first match wins.

# pg_hba.conf - Production example
# TYPE  DATABASE  USER       ADDRESS          METHOD

# Local socket connections
local   all       postgres                    peer
local   all       all                         scram-sha-256

# SSL-only remote connections
hostssl myapp     app_user   10.0.0.0/24      scram-sha-256
hostssl all       admin      10.0.1.0/24      scram-sha-256

# Replication (SSL required)
hostssl replication replicator 10.0.0.0/24    scram-sha-256

# Reject everything else
host    all       all        0.0.0.0/0        reject

Critical

Never use trust or md5 authentication in production. trust allows connections without a password. md5 is vulnerable to pass-the-hash attacks. Use scram-sha-256 (default for password_encryption since PostgreSQL 14). Set password_encryption = scram-sha-256 in postgresql.conf and update pg_hba.conf entries from md5 to scram-sha-256.

SSL/TLS Configuration

# postgresql.conf
ssl = on
ssl_cert_file = '/etc/postgresql/server.crt'
ssl_key_file = '/etc/postgresql/server.key'
ssl_ca_file = '/etc/postgresql/ca.crt'       # for client certificate verification
ssl_min_protocol_version = 'TLSv1.3'         # PG 12+
ssl_ciphers = 'HIGH:!aNULL:!MD5'

Roles & Privileges

Follow the principle of least privilege. Create separate roles for applications, read-only users, and administrators. Never let applications connect as the postgres superuser.

-- Create application role with minimal privileges
CREATE ROLE app_user WITH LOGIN PASSWORD 'strong-password-here';
GRANT CONNECT ON DATABASE myapp TO app_user;
GRANT USAGE ON SCHEMA public TO app_user;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app_user;
GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO app_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public
    GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO app_user;

-- Read-only role for reporting
CREATE ROLE readonly WITH LOGIN PASSWORD 'another-strong-password';
GRANT CONNECT ON DATABASE myapp TO readonly;
GRANT USAGE ON SCHEMA public TO readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO readonly;
ALTER DEFAULT PRIVILEGES IN SCHEMA public
    GRANT SELECT ON TABLES TO readonly;

Row-Level Security (RLS)

RLS allows you to restrict which rows a user can see or modify based on their role or session properties. Essential for multi-tenant applications.

-- Enable RLS on a multi-tenant table
ALTER TABLE customer_data ENABLE ROW LEVEL SECURITY;

-- Policy: users can only see their own tenant's data
CREATE POLICY tenant_isolation ON customer_data
    USING (tenant_id = current_setting('app.tenant_id')::int);

-- Force RLS even for table owners
ALTER TABLE customer_data FORCE ROW LEVEL SECURITY;

-- Set tenant context in application
SET app.tenant_id = '42';

Audit Logging with pgAudit

# postgresql.conf
shared_preload_libraries = 'pgaudit'
pgaudit.log = 'ddl, role, write'       # log DDL, role changes, and write operations
pgaudit.log_catalog = off               # don't log system catalog queries
pgaudit.log_relation = on               # log table names, not just commands
pgaudit.log_parameter = on              # log query parameters
Recommendation

For compliance (SOC 2, HIPAA, PCI-DSS), pgAudit is a must-have. Ship audit logs to a centralized logging system (ELK, Loki, Splunk) where they can't be tampered with by database administrators. Include pgaudit.log_parameter = on carefully — it logs query parameters which may contain PII.

09

Upgrades

PostgreSQL releases a new major version annually (typically September) and minor versions quarterly (February, May, August, November). Major upgrades (e.g., 17 to 18) require a migration — the on-disk data format changes. Minor upgrades (e.g., 18.1 to 18.3) are always in-place and binary-compatible.

Upgrade Methods

Recommended pg_upgrade (In-Place)

The standard upgrade method. Copies or hard-links data files and rewrites system catalogs. Downtime is typically minutes to hours depending on database size (with --link mode, it's minutes even for terabyte databases). Run pg_upgrade --check first to validate compatibility.

  • Fast with --link mode (hard links instead of copying)
  • Requires both old and new PostgreSQL binaries installed
  • All extensions must be compatible with the new version
  • Run ANALYZE on all databases after upgrade to rebuild statistics

Zero-Downtime Logical Replication

Set up logical replication from the old version to a new-version instance. Once it has caught up, switch application connections over. Near-zero downtime, but more setup — and not everything replicates: sequences, large objects, and DDL changes must be handled manually, so test thoroughly.

  • Supports cross-version replication (PG 14+ to 18)
  • Application cutover in seconds
  • More complex to set up and validate
  • Need to handle sequences and schema changes manually
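
The cutover above can be sketched in SQL. This assumes a database named myapp, a replication user named replicator, and that the schema has already been restored on the new instance (e.g. via pg_dump --schema-only) — all illustrative names:

```sql
-- On the old (publisher) server:
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the new (subscriber) server: initial copy runs automatically,
-- then changes stream continuously.
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=old-server dbname=myapp user=replicator'
    PUBLICATION upgrade_pub;

-- Before cutover: sequences do not replicate, so advance each one
-- manually on the new server, e.g.:
SELECT setval('orders_id_seq', (SELECT max(id) FROM orders));
```

After verifying row counts and sequence values, point the application at the new server and drop the subscription.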

pg_upgrade Workflow

# 1. Install new PostgreSQL version alongside old
sudo apt install postgresql-18

# 2. Stop both clusters
sudo systemctl stop postgresql@17-main
sudo systemctl stop postgresql@18-main

# 3. Run upgrade check (dry run)
sudo -u postgres /usr/lib/postgresql/18/bin/pg_upgrade \
  --old-datadir=/var/lib/postgresql/17/main \
  --new-datadir=/var/lib/postgresql/18/main \
  --old-bindir=/usr/lib/postgresql/17/bin \
  --new-bindir=/usr/lib/postgresql/18/bin \
  --check

# 4. Run actual upgrade (--link for speed)
sudo -u postgres /usr/lib/postgresql/18/bin/pg_upgrade \
  --old-datadir=/var/lib/postgresql/17/main \
  --new-datadir=/var/lib/postgresql/18/main \
  --old-bindir=/usr/lib/postgresql/17/bin \
  --new-bindir=/usr/lib/postgresql/18/bin \
  --link

# 5. Start new cluster
sudo systemctl start postgresql@18-main

# 6. Rebuild optimizer statistics (critical!)
sudo -u postgres /usr/lib/postgresql/18/bin/vacuumdb \
  --all --analyze-in-stages

# 7. After verifying, remove old cluster
sudo pg_dropcluster 17 main

Warning

With --link mode, the old cluster becomes unusable after upgrade — the data files are shared. If the upgrade fails or you need to rollback, you must restore from backup. Always take a full backup before running pg_upgrade.

Minor Version Upgrades

Minor upgrades are straightforward and should be applied promptly — they contain security fixes and bug fixes. No data migration is needed.

# Minor upgrade (e.g., 18.1 → 18.3)
sudo apt update
sudo apt upgrade postgresql-18
sudo systemctl restart postgresql@18-main

# Verify
sudo -u postgres psql -c "SELECT version();"

Critical

Never skip minor version upgrades for extended periods. PostgreSQL minor releases fix bugs that can cause data corruption, security vulnerabilities, and crashes. Subscribe to the PostgreSQL security mailing list and apply patches within your maintenance window.

10

Monitoring

PostgreSQL exposes a wealth of internal statistics through its catalog views. The key is knowing which metrics matter, setting up proper alerting, and correlating database metrics with application-level observability.

Essential Statistics Views

Queries pg_stat_statements

The most important extension for performance analysis. Tracks execution statistics for every distinct query: call count, total/mean/min/max time, rows returned, block hits/reads. Requires shared_preload_libraries = 'pg_stat_statements'.

Sessions pg_stat_activity

Shows currently running queries, idle connections, waiting processes, and lock info. Essential for debugging slow queries and connection issues. Check state, wait_event_type, query_start.

Tables pg_stat_user_tables

Per-table stats: sequential scans, index scans, rows inserted/updated/deleted, dead tuples, last vacuum/analyze time. Use this to find tables missing indexes (high seq_scan count) or needing vacuum.

Replication pg_stat_replication

Shows connected standbys, their WAL positions, and replication lag. The replay_lag column (PG 10+) gives human-readable lag. Alert on any lag exceeding your RPO.

Key Metrics to Monitor

-- Cache hit ratio (should be > 99%)
SELECT
  sum(heap_blks_hit) / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS ratio
FROM pg_statio_user_tables;

-- Active connections by state
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state;

-- Replication lag (on primary)
SELECT client_addr, state, replay_lag
FROM pg_stat_replication;

-- Top 10 slowest queries (requires pg_stat_statements)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Tables with most dead tuples (vacuum candidates)
SELECT schemaname, relname, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Transaction ID age (wraparound risk)
SELECT datname, age(datfrozenxid) AS xid_age,
  current_setting('autovacuum_freeze_max_age') AS freeze_max
FROM pg_database
ORDER BY age(datfrozenxid) DESC;

Prometheus + Grafana Stack

The postgres_exporter by Prometheus Community is the standard way to expose PostgreSQL metrics to Prometheus. Combine with Grafana dashboards for visualization and alerting.

# docker-compose.yml snippet for postgres_exporter
postgres-exporter:
  image: prometheuscommunity/postgres-exporter:latest
  environment:
    DATA_SOURCE_NAME: "postgresql://monitor:password@postgres:5432/myapp?sslmode=require"
  ports:
    - "9187:9187"

-- Create a monitoring role with minimal privileges
CREATE ROLE monitor WITH LOGIN PASSWORD 'monitor-password';
GRANT pg_monitor TO monitor;  -- PG 10+ built-in monitoring role
GRANT CONNECT ON DATABASE myapp TO monitor;

Critical Alerts

| Metric                 | Warning threshold        | Critical threshold             |
|------------------------|--------------------------|--------------------------------|
| Replication lag        | > 30 seconds             | > 5 minutes                    |
| Connection utilization | > 70% of max_connections | > 90% of max_connections       |
| Cache hit ratio        | < 99%                    | < 95%                          |
| Transaction ID age     | > 500M                   | > 1B (wraparound imminent)     |
| Dead tuples per table  | > 10M                    | > 100M (autovacuum is stuck)   |
| Disk space             | < 20% free               | < 10% free                     |
| Long-running queries   | > 5 minutes              | > 30 minutes                   |

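
Two of these thresholds expressed as Prometheus alerting rules, as a sketch. Metric names vary by postgres_exporter version — verify against your exporter's /metrics output before deploying:

```yaml
# prometheus alerting rules (sketch; metric names assumed from postgres_exporter)
groups:
  - name: postgresql
    rules:
      - alert: PgReplicationLagCritical
        expr: pg_replication_lag > 300          # seconds; critical threshold
        for: 2m
        labels:
          severity: critical
      - alert: PgConnectionUtilizationCritical
        expr: sum(pg_stat_activity_count) / pg_settings_max_connections > 0.9
        for: 5m
        labels:
          severity: critical
```
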
Recommendation

Use the community Grafana dashboard ID 9628 (PostgreSQL Database) as a starting point. It covers connection stats, transaction rates, tuple operations, cache ratios, and replication lag out of the box. Customize alerting rules based on your SLOs.

11

Extensions & Ecosystem

PostgreSQL's extension system is one of its greatest strengths. Extensions can add new data types, operators, index access methods, functions, and even background workers — all without forking the core project. Some extensions effectively turn PostgreSQL into a specialized database.

Essential Extensions

Geospatial PostGIS

The gold standard for geospatial data in any database. Adds geometry/geography types, spatial indexes (GiST/SP-GiST), and hundreds of spatial functions. If you need geospatial queries, PostGIS is why you choose PostgreSQL over every other database.
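
A minimal PostGIS sketch — a stores table (hypothetical) with a spatial index and a radius search:

```sql
CREATE EXTENSION IF NOT EXISTS postgis;

CREATE TABLE stores (
    id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name     text NOT NULL,
    location geography(Point, 4326)
);
CREATE INDEX stores_location_idx ON stores USING gist (location);

-- All stores within 5 km of a point (note: longitude, latitude order).
SELECT name
FROM stores
WHERE ST_DWithin(location, ST_MakePoint(-122.42, 37.77)::geography, 5000);
```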

AI/ML pgvector

Vector similarity search for AI/ML embeddings. Supports exact and approximate nearest neighbor search (IVFFlat, HNSW indexes). Store embeddings alongside relational data — no need for a separate vector database. Critical for RAG applications.
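
A minimal pgvector sketch, using 3-dimensional embeddings for readability (real models produce hundreds to thousands of dimensions):

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    content   text,
    embedding vector(3)
);

-- HNSW index for approximate nearest-neighbor search (cosine distance).
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

-- Five most similar documents to a query embedding.
SELECT content
FROM documents
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
LIMIT 5;
```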

Time-Series TimescaleDB

Turns PostgreSQL into a time-series database. Automatic partitioning (hypertables), compression, continuous aggregates, and data retention policies. Core is Apache 2.0; community features under Timescale License (TSL, free for non-DBaaS use). Company rebranded to TigerData in 2025.
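
A hypertable sketch, assuming a hypothetical metrics table:

```sql
CREATE EXTENSION IF NOT EXISTS timescaledb;

CREATE TABLE metrics (
    time   timestamptz NOT NULL,
    device text        NOT NULL,
    value  double precision
);

-- Convert to a hypertable, automatically partitioned by time.
SELECT create_hypertable('metrics', 'time');

-- Retention policy: drop chunks older than 90 days.
SELECT add_retention_policy('metrics', INTERVAL '90 days');
```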

Distributed Citus

Distributed PostgreSQL by Microsoft. Shards tables across multiple nodes for horizontal scaling. Supports distributed queries, reference tables, and columnar storage. Open source (AGPLv3). Now available as an extension, not just a fork.
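
A Citus sketch: shard a multi-tenant table by tenant_id and replicate a small lookup table to every node (table names are illustrative):

```sql
CREATE EXTENSION IF NOT EXISTS citus;

CREATE TABLE events (
    tenant_id  int NOT NULL,
    event_type text,
    created_at timestamptz DEFAULT now()
);
CREATE TABLE plans (
    plan_id int PRIMARY KEY,
    name    text
);

-- Shard events across workers; co-locate rows by tenant.
SELECT create_distributed_table('events', 'tenant_id');
-- Replicate the small lookup table to all nodes for local joins.
SELECT create_reference_table('plans');
```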

Operations pg_cron

Cron-based job scheduler running inside PostgreSQL. Schedule SQL commands, VACUUM, ANALYZE, partition maintenance — without external cron jobs. Simple, reliable, and runs as a background worker.
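
A pg_cron sketch (requires shared_preload_libraries = 'pg_cron'; the job name is illustrative):

```sql
CREATE EXTENSION IF NOT EXISTS pg_cron;

-- Run ANALYZE nightly at 03:00.
SELECT cron.schedule('nightly-analyze', '0 3 * * *', 'ANALYZE');

-- Inspect and remove jobs.
SELECT jobid, jobname, schedule FROM cron.job;
SELECT cron.unschedule('nightly-analyze');
```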

Audit pgAudit

Provides detailed session and object audit logging. Required for compliance frameworks (SOC 2, HIPAA, PCI-DSS). Logs to the standard PostgreSQL log, so integrate with your existing log pipeline.

Partitioning pg_partman

Automates partition creation and retention for time-based and ID-based partitioning. Pre-creates future partitions and optionally drops or detaches old ones. Works with native declarative partitioning.
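
A pg_partman sketch against a natively range-partitioned parent, assuming the pg_partman 5.x API (parameter names may differ in older versions):

```sql
CREATE SCHEMA partman;
CREATE EXTENSION pg_partman SCHEMA partman;

CREATE TABLE public.events (
    id         bigint GENERATED ALWAYS AS IDENTITY,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

-- Daily partitions, pre-created ahead of time; retention is configured
-- afterwards in partman.part_config.
SELECT partman.create_parent(
    p_parent_table => 'public.events',
    p_control      => 'created_at',
    p_interval     => '1 day'
);
```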

CDC Debezium

Change Data Capture using PostgreSQL's logical decoding. Streams row-level changes to Kafka, Pulsar, or other sinks. Requires wal_level = logical and the pgoutput or wal2json plugin. Essential for event-driven architectures.

Foreign Data Wrappers (FDW)

FDWs let you query external data sources as if they were PostgreSQL tables. The postgres_fdw extension (included in core) enables cross-server queries. Third-party FDWs exist for MySQL, MongoDB, Redis, S3/Parquet files, Elasticsearch, and more.

-- Query a remote PostgreSQL server
CREATE EXTENSION postgres_fdw;

CREATE SERVER remote_server
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remote.example.com', port '5432', dbname 'analytics');

CREATE USER MAPPING FOR app_user
    SERVER remote_server
    OPTIONS (user 'remote_reader', password 'secret');

IMPORT FOREIGN SCHEMA public
    FROM SERVER remote_server
    INTO remote_schema;

-- Now query remote tables as if they were local
SELECT * FROM remote_schema.sales WHERE region = 'EMEA';

Logical Decoding & CDC

# postgresql.conf for logical decoding
wal_level = logical                    # required (not just 'replica')
max_replication_slots = 10             # one per consumer
max_wal_senders = 10

-- Create a logical replication slot for Debezium
SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput');

-- Create a publication for CDC
CREATE PUBLICATION my_publication FOR TABLE orders, customers, products;
Warning

Logical replication slots that are not consumed will cause WAL to accumulate indefinitely, eventually filling your disk. Always monitor pg_replication_slots for inactive slots and drop any that are no longer in use: SELECT pg_drop_replication_slot('unused_slot');
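
A monitoring query for slot WAL retention — alert when the retained WAL for any slot grows large. On PG 13+, max_wal_keep settings aside, consider setting max_slot_wal_keep_size as a hard safety valve:

```sql
-- How much WAL each slot is pinning on disk.
SELECT slot_name,
       active,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
```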

12

Licensing

PostgreSQL is released under the PostgreSQL License, a liberal open-source license similar to the BSD and MIT licenses. It permits use, modification, and distribution for any purpose — commercial or otherwise — with no copyleft requirements. There is no "enterprise edition," no dual licensing, and no feature gating. Every feature is available to everyone.

License Comparison

| Database   | License                                      | Enterprise edition?                          | Key restrictions                                                                            |
|------------|----------------------------------------------|----------------------------------------------|---------------------------------------------------------------------------------------------|
| PostgreSQL | PostgreSQL License (BSD-like)                | No — all features included                   | None. Fully permissive.                                                                     |
| MySQL      | GPLv2 (Community) / Commercial (Enterprise)  | Yes — Thread Pool, Audit, Encryption, Backup | GPL requires source sharing for derivative works, or a commercial license from Oracle.      |
| MariaDB    | GPLv2 (Server) / BSL (MaxScale, some tools)  | Sort of — MaxScale and some features under BSL | BSL converts to GPL after a set period but restricts production use initially.            |
| MongoDB    | SSPL                                         | Yes (Atlas features)                         | SSPL requires open-sourcing the entire service stack if offered as a service.               |

Commercial Support Options

While PostgreSQL itself is free, commercial support is available from several companies. This is relevant for enterprises that need SLA-backed support, consulting, or managed services.

Enterprise EDB (EnterpriseDB)

The largest PostgreSQL company. Offers EDB Postgres Advanced Server (Oracle compatibility layer), 24/7 support, managed services, and migration tooling. Acquired 2ndQuadrant in 2020. Major contributor to PostgreSQL core.

Enterprise Crunchy Data

Kubernetes-native PostgreSQL platform. Crunchy Certified PostgreSQL, PGO operator, pgBackRest, pgBouncer, pgMonitor. Strong in US government and regulated industries. Open-source-first approach.

Support Percona

Database-agnostic support company. Offers Percona Distribution for PostgreSQL (packaged with extensions and tools), monitoring (PMM), backup, and 24/7 support. Good choice if you also run MySQL/MongoDB.

Managed Cloud Providers

AWS RDS/Aurora, Google CloudSQL/AlloyDB, Azure Database for PostgreSQL. Fully managed with automated HA, backups, and patching. Higher cost, less control, but zero operational burden for the database layer.

Recommendation

The PostgreSQL License means you will never face a licensing surprise, an audit, or a forced migration due to license changes. This is a strategic advantage over MySQL (Oracle), MongoDB (SSPL), and Redis (relicensed to RSALv2/SSPLv1 at 7.4, then AGPLv3 added in 8.0 — two license changes in as many years). When presenting database options to customers, lead with this.

13

Consultant's Checklist

Before deploying PostgreSQL for a customer, work through these questions. They'll surface requirements that drive architecture decisions and prevent post-deployment surprises.

  1. Data volume & growth rate? — Current database size, expected growth per month/year. This drives storage, partitioning, and backup strategy decisions. Terabyte databases need different tooling than 50 GB ones.
  2. HA requirements? — What's the acceptable downtime? RPO (data loss tolerance) and RTO (recovery time). Zero RPO requires synchronous replication. Sub-minute RTO requires Patroni with automated failover.
  3. Connection patterns? — Peak concurrent connections, connection duration (short-lived HTTP handlers vs long-lived batch jobs). This determines whether you need PgBouncer and how to size pools.
  4. Read/write ratio? — Read-heavy workloads benefit from read replicas. Write-heavy workloads need careful WAL tuning, connection pooling, and potentially Citus for horizontal scaling.
  5. Compliance requirements? — SOC 2, HIPAA, PCI-DSS, GDPR. These drive encryption at rest, SSL enforcement, pgAudit, row-level security, and data residency decisions.
  6. Backup & recovery targets? — RPO/RTO numbers, retention requirements, offsite storage mandates. pgBackRest to S3 with PITR is the standard, but some customers need backup to specific providers.
  7. Deployment platform? — Bare metal, VMs, Kubernetes, or managed service? Team expertise matters more than technical merits. If they don't have K8s skills, don't deploy on K8s.
  8. Upgrade strategy? — How will you handle major version upgrades? pg_upgrade with downtime window, or logical replication for near-zero downtime? How often will you apply minor patches?
  9. Extensions needed? — PostGIS, pgvector, TimescaleDB, pg_cron? This affects deployment model — managed services may not support all extensions. Verify compatibility before committing.
  10. Monitoring stack? — What's the existing observability platform? Prometheus/Grafana, Datadog, New Relic? Ensure postgres_exporter or the provider's integration is in place from day one.
  11. Team expertise? — Does the team have PostgreSQL operational experience? If not, budget for training, runbooks, and potentially commercial support (EDB, Crunchy Data, Percona).
  12. Migration source? — Coming from MySQL, Oracle, SQL Server? Plan for schema differences, stored procedure rewrites, and data type mapping. EDB has Oracle compatibility tools. pgLoader handles MySQL migrations.

Info

The most common mistake in PostgreSQL deployments is underestimating operational complexity. PostgreSQL is not "install and forget." Autovacuum, bloat management, connection pooling, backup testing, and upgrade planning all require ongoing attention. Budget for operations, not just initial deployment.