Proxmox VE Production Architecture

Customer deployment reference — clustering, storage, networking, HA, backups & operations

01

Overview

Proxmox Virtual Environment (PVE) is an open-source server virtualization platform built on Debian Linux. It combines KVM for full virtualization and LXC for lightweight containers, managed through a web UI and REST API. It competes with VMware vSphere, Microsoft Hyper-V, and Nutanix AHV.

Since VMware's acquisition by Broadcom (2023) and subsequent licensing changes, Proxmox has become a serious contender for customers looking to exit VMware. The migration wave is real — many engagements now are VMware-to-Proxmox transitions.

Strengths

  • Truly open source (AGPL v3) — no feature gating
  • Integrated clustering, HA, live migration, backup
  • Ceph storage integration built-in
  • Web UI that covers 95% of operations
  • REST API for automation
  • No per-CPU or per-VM licensing

Weaknesses

  • No equivalent to VMware DRS (automatic VM load balancing across nodes) — live migration exists, but rebalancing is manual
  • Ecosystem is smaller — fewer third-party integrations
  • Enterprise support is good but not VMware/Microsoft tier
  • No native NSX-equivalent SDN (basic SDN exists)
  • GPU passthrough works but is less polished than VMware
  • Windows guest tooling less mature than VMware Tools
Positioning

Proxmox is not a 1:1 VMware replacement. It's a different philosophy — Linux-native, CLI-friendly, built on standard open-source components (KVM, LXC, Ceph, ZFS, Corosync). For customers who are comfortable with Linux, it's arguably better. For customers who expect a Windows-centric, GUI-everything experience, set expectations early.

02

Architecture

Each Proxmox node is a standalone Debian server that can join a cluster. Understanding the component stack matters for troubleshooting and capacity planning.

+------------------------------------------------------+
|                  Proxmox Web UI                      |
|               (port 8006, HTTPS)                     |
+-----------------------+------------------------------+
                        |
+-----------------------v------------------------------+
|              Proxmox API (pveproxy)                  |
|         REST API + authentication + ACLs             |
+--+-------------+--------------+--------------+-------+
   |             |              |              |
+--v---+    +----v----+    +----v----+   +-----v-----+
| QEMU |    |   LXC   |    |  Ceph   |   | Corosync  |
| /KVM |    |         |    | Client  |   | + pmxcfs  |
|      |    |         |    |         |   | (cluster) |
+--+---+    +----+----+    +----+----+   +-----+-----+
   |             |              |              |
+--v-------------v--------------v--------------v------+
|                   Debian Linux                      |
|        (kernel, networking, storage, ZFS)           |
+-----------------------------------------------------+
  • QEMU/KVM: Full virtualization. Hardware-accelerated VMs. Supports live migration, snapshots, CPU pinning.
  • LXC: OS-level containers. Lightweight, shared kernel. Not Docker — full OS containers. Great for services that don't need a full VM.
  • Corosync: Cluster communication. Handles cluster membership, quorum, and node heartbeats. Totem protocol over UDP.
  • pmxcfs: Cluster filesystem. FUSE filesystem backed by a SQLite DB replicated via Corosync. Stores cluster config (VMs, storage, users, ACLs).
  • pveproxy: API & web UI. HTTPS reverse proxy on port 8006. Handles authentication, serves the web UI, exposes the REST API.
  • pvedaemon: Node management. Local daemon for VM/container operations, storage management, task execution.
  • Ceph: Distributed storage. Optional. Built-in Ceph deployment for hyper-converged storage. OSD, MON, MDS, MGR.
  • ZFS: Local storage. Optional. Advanced filesystem with snapshots, compression, checksums, replication.
  • Open vSwitch: Virtual networking. Optional. Software-defined networking with VLANs, bonds, and SDN zones.
03

Clustering

A Proxmox cluster is a group of nodes managed as a single entity. Clustering enables live migration, HA, shared configuration, and centralized management. Minimum 3 nodes for production.

Creating a cluster

# On the first node:
pvecm create my-cluster

# On subsequent nodes:
pvecm add 10.0.0.1    # IP of an existing cluster node

# Verify cluster status
pvecm status
pvecm nodes

Quorum

Proxmox uses Corosync's voting system for quorum. A cluster needs a majority of votes to operate:

Nodes   Quorum Requires        Tolerates Failures
2       2 (both must be up)    0 — never do this without a QDevice
3       2                      1
4       3                      1
5       3                      2
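The majority rule behind this table is simple integer arithmetic. A quick sketch for planning cluster sizes (quorum_info is a hypothetical helper, not a Proxmox tool):

```shell
#!/bin/sh
# Votes needed for quorum and failures tolerated, for N voting nodes.
# majority = floor(N/2) + 1; tolerated failures = N - majority
quorum_info() {
    nodes=$1
    majority=$(( nodes / 2 + 1 ))
    tolerated=$(( nodes - majority ))
    echo "$nodes nodes: quorum=$majority, tolerates=$tolerated failures"
}

quorum_info 2    # 2 nodes: quorum=2, tolerates=0 failures
quorum_info 3    # 3 nodes: quorum=2, tolerates=1 failures
quorum_info 5    # 5 nodes: quorum=3, tolerates=2 failures
```

Note that even node counts buy nothing: 4 nodes tolerate the same single failure as 3.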
Two-Node Clusters

A 2-node cluster has no fault tolerance by default — losing one node loses quorum, and the surviving node won't start HA services. Fix this with a QDevice (Corosync Quorum Device) — a lightweight third-party witness running on a small VM or Raspberry Pi that provides the tiebreaker vote.

# Set up QDevice (on a separate machine):
apt install corosync-qdevice corosync-qnetd

# On a cluster node:
pvecm qdevice setup 10.0.0.100    # IP of the QDevice host

# Verify
pvecm status

Corosync network

  • Since PVE 6.0+, Corosync 3 uses Kronosnet (knet) for transport, which is unicast only. Multicast was used in Corosync 2.x (PVE 5.x and earlier) and is no longer supported.
  • Dedicate a separate NIC/VLAN for cluster traffic (Corosync + Ceph). Don't share with VM traffic.
  • Configure redundant links (knet supports up to 8 separate network links) for Corosync. If the cluster network fails, you lose quorum and all HA stops.
  • Latency between nodes must stay low (Proxmox recommends under 5 ms, i.e. LAN-grade). Proxmox clusters cannot span WANs or high-latency links.
# /etc/pve/corosync.conf (managed by pvecm, don't edit directly)
# Verify link status:
pvecm status
# Check for link errors:
corosync-cfgtool -s
Cluster Breakup

Removing a node from a cluster is destructive. All VMs/CTs on that node must be migrated first. The node is wiped of cluster config and must be reinstalled to join a different cluster. Plan cluster membership carefully — it's not something you casually change.
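The removal procedure itself is short; a sketch of the steps (VM/CT IDs and node names are illustrative):

```shell
# 1. Migrate everything off the node being removed:
qm migrate 100 node1     # repeat per VM
pct migrate 200 node1    # repeat per container

# 2. Power the node off, then — from a REMAINING node — remove it:
pvecm delnode node3

# 3. The removed node must be reinstalled before it can join
#    this or any other cluster again.
```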

04

Storage

Storage architecture is the most consequential decision in a Proxmox deployment. It affects performance, HA capability, backup speed, and operational complexity.

Hyper-Converged Ceph

Distributed storage built into Proxmox. Each node contributes disks to a shared pool. VMs can run on any node and access their storage over the network. Enables live migration and HA.

  • Pros: No external storage needed, scales linearly, self-healing
  • Cons: Needs 3+ nodes, dedicated network, CPU/RAM overhead, complex to tune
  • Best for: 3+ node clusters needing shared storage without a SAN

Local ZFS

Advanced local filesystem. Snapshots, compression, checksums, send/receive replication. Best local storage option for Proxmox.

  • Pros: Excellent data integrity, fast snapshots, built-in compression
  • Cons: Local only (no live migration without Ceph/NFS), RAM-hungry (1 GB ARC per 1 TB of storage is a common rule of thumb; must limit ARC on VM hosts)
  • Best for: Single nodes, or combined with Ceph (ZFS for local, Ceph for shared)

External NFS / iSCSI / FC

Traditional shared storage from a NAS/SAN. NFS is simplest. iSCSI and Fibre Channel for higher performance.

  • Pros: Well-understood, existing investment, enables live migration
  • Cons: Single point of failure (unless HA SAN), separate infrastructure to manage
  • Best for: Customers with existing SAN/NAS infrastructure
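Attaching existing NFS storage is a one-liner; a sketch with illustrative storage ID, server, and export path:

```shell
# Add an NFS share as cluster-wide storage (server/export are examples):
pvesm add nfs san-nfs \
    --server 10.0.0.50 \
    --export /export/proxmox \
    --content images,backup,iso

# Verify it's active on all nodes:
pvesm status
```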

Simple LVM / LVM-Thin / Directory

Basic local storage. LVM-Thin supports thin provisioning and snapshots. Directory storage uses the filesystem directly (ext4/xfs).

  • Pros: Zero overhead, simple, fast
  • Cons: No checksums, limited snapshots, no replication
  • Best for: Dev/test, ephemeral workloads, boot drives

Ceph deployment

Proxmox has a built-in Ceph installer — you don't need to deploy Ceph separately:

# Install Ceph on each node (from the Proxmox UI or CLI):
pveceph install

# Create monitors (one per node, need 3+ for quorum):
pveceph mon create

# Create managers:
pveceph mgr create

# Create OSDs (one per disk):
pveceph osd create /dev/sdb
pveceph osd create /dev/sdc

# Create a storage pool:
pveceph pool create vm-storage --pg_autoscale_mode on

# Pool is now available as a storage backend in Proxmox
Ceph Networking

Ceph needs a dedicated network with at least 10 Gbps between nodes. 1 Gbps will work for small deployments but becomes a bottleneck quickly. For production, use 25 Gbps. Separate the Ceph public network (client access) from the Ceph cluster network (OSD replication) for best performance.
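The public/cluster split is set at Ceph initialization time; a sketch with example subnets (match them to your storage VLANs):

```shell
# Initialize Ceph with separate public and cluster networks
# (subnets are illustrative):
pveceph init --network 10.10.0.0/24 --cluster-network 10.10.1.0/24
```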

Ceph sizing rules of thumb

  • Minimum 3 nodes with at least 2 OSDs each
  • Don't fill beyond 70-80% — Ceph performance degrades and recovery becomes dangerous above 80%
  • RAM: BlueStore's default osd_memory_target is 4 GB per OSD, plus roughly 1 GB per monitor and the OS baseline. A node with 8 NVMe OSDs needs ~32 GB for the OSDs alone.
  • CPU: 1 core per OSD for HDD, 2+ cores per OSD for NVMe (NVMe saturates CPU faster)
  • Journal/WAL: Use a fast NVMe for the OSD WAL/DB if your OSDs are SATA SSDs or HDDs. This dramatically improves write latency.
  • Replication: Default is 3x (3 copies). For NVMe-only clusters, consider erasure coding for better space efficiency on cold data.
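Usable capacity falls directly out of the replication factor and the fill ceiling. A quick sketch (usable_tb is a hypothetical helper; the raw-capacity figure is an example):

```shell
#!/bin/sh
# Usable Ceph capacity (TB, integer) from raw TB, replica count,
# and fill ceiling: usable = raw * fill% / replicas / 100
usable_tb() {
    raw=$1; replicas=$2; fill_pct=$3
    echo $(( raw * fill_pct / replicas / 100 ))
}

# 3 nodes x 8 x 4 TB NVMe = 96 TB raw, 3x replication, 80% ceiling:
usable_tb 96 3 80    # -> 25 TB usable
```

The result surprises customers: 96 TB of raw NVMe yields about 25 TB of safely usable space at 3x replication.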

ZFS configuration

# Create a mirrored ZFS pool (recommended over RAIDZ for VMs):
zpool create -f rpool mirror /dev/sda /dev/sdb

# Enable compression (always):
zfs set compression=lz4 rpool

# Set ARC (cache) limits to leave RAM for VMs:
# In /etc/modprobe.d/zfs.conf:
options zfs zfs_arc_max=8589934592    # 8GB max ARC

# Add as Proxmox storage:
pvesm add zfspool local-zfs -pool rpool/data -content images,rootdir
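The zfs_arc_max value is plain bytes; a small sketch for computing it instead of hand-typing the number (arc_bytes is a hypothetical helper):

```shell
#!/bin/sh
# zfs_arc_max is specified in bytes: GiB * 1024^3
arc_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

arc_bytes 8    # -> 8589934592

# Then write it out, e.g.:
# echo "options zfs zfs_arc_max=$(arc_bytes 8)" > /etc/modprobe.d/zfs.conf
```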
05

Networking

Proxmox networking is Linux networking. If you understand bridges, bonds, VLANs, and routing on Linux, you understand Proxmox networking. There's no proprietary abstraction layer.

Network architecture for production

A production node should have at minimum 3 network segments:

Management: Proxmox UI / API / SSH

The management network carries web UI, API, and SSH traffic. Corosync often shares it in small deployments, but production clusters should give Corosync its own dedicated link (see Clustering). Dedicated NIC or VLAN. This is your control plane: if it goes down, you can't manage the cluster.

VM Traffic: Guest Networks

VLANs for VM/CT traffic. Trunk the VLANs to the Proxmox bridge and assign VLAN tags per VM NIC. Use LACP bonds for bandwidth and redundancy.

Storage: Ceph / iSCSI / NFS

Dedicated high-bandwidth network for storage traffic. 10/25 Gbps minimum. Jumbo frames (MTU 9000) recommended for Ceph. This must be low-latency and reliable.

Optional: Live Migration

Separate network for VM memory transfer during live migration. Shares with storage network in smaller deployments. Dedicated in large ones to avoid migration storms saturating storage I/O.

Bridge and bond configuration

# /etc/network/interfaces (typical production node)

# Management bond (LACP)
auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4

# Management bridge
auto vmbr0
iface vmbr0 inet static
    address 10.0.0.10/24
    gateway 10.0.0.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0

# Storage/Ceph bond (LACP, jumbo frames)
auto bond1
iface bond1 inet manual
    bond-slaves ens1f0 ens1f1
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    mtu 9000

# Storage bridge (no gateway - isolated network)
auto vmbr1
iface vmbr1 inet static
    address 10.10.0.10/24
    bridge-ports bond1
    bridge-stp off
    bridge-fd 0
    mtu 9000

# VM traffic bridge (VLAN-aware)
auto vmbr2
iface vmbr2 inet manual
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 100-200
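Proxmox ships ifupdown2, so changes to /etc/network/interfaces can be applied live rather than via reboot:

```shell
# Apply interface changes without a reboot (ifupdown2):
ifreload -a

# Verify bond and bridge state:
cat /proc/net/bonding/bond0
bridge link show
ip -br addr
```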

SDN (Software-Defined Networking)

Proxmox includes an SDN module for managing VNets, zones, and subnets across the cluster. It supports VLAN, VXLAN, and EVPN zones with BGP-based routing and fabric automation. PVE 8+ improved SDN significantly with DHCP integration and subnet management. It's functional for multi-tenancy but not as feature-rich as NSX or Cilium.

  • Use SDN if you need to define networks centrally and have them auto-configured on all nodes
  • Skip SDN if you're comfortable managing bridges/VLANs in /etc/network/interfaces directly — it's more transparent and easier to debug
06

VMs & Containers

KVM virtual machines

Full hardware virtualization. Each VM gets its own kernel, full OS, emulated or paravirtualized hardware. Use for:

  • Windows guests
  • Workloads that need kernel modules or specific kernel versions
  • Security isolation (separate kernel per workload)
  • Anything that needs GPU passthrough, USB passthrough, or specific hardware emulation

VM best practices

# Create a VM with virtio devices (best performance):
qm create 100 \
  --name my-vm \
  --memory 4096 \
  --cores 4 \
  --scsihw virtio-scsi-single \
  --scsi0 local-zfs:32,iothread=1 \
  --net0 virtio,bridge=vmbr0,tag=100 \
  --ostype l26 \
  --boot order=scsi0 \
  --agent enabled=1
  • Always use VirtIO for disk and network — dramatically faster than IDE/E1000 emulation
  • Enable QEMU Guest Agent (--agent enabled=1) — required for proper shutdown, freeze/thaw for backups, IP reporting
  • Use virtio-scsi-single with iothread=1 per disk for best I/O performance
  • CPU type: Use host for maximum performance (exposes real CPU features). Use x86-64-v2-AES or similar if you need live migration between different CPU generations.
  • Ballooning: Enabled by default. Allows the VM to return unused RAM to the host. Disable for latency-sensitive workloads (databases, real-time).
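The tuning bullets above map to qm set flags on an existing VM (VM ID 100 is illustrative):

```shell
# Pin the CPU model for maximum performance (migration then limited
# to identical CPUs):
qm set 100 --cpu host

# Or a portable model for mixed-generation clusters:
qm set 100 --cpu x86-64-v2-AES

# Disable ballooning for latency-sensitive workloads:
qm set 100 --balloon 0
```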

LXC containers

OS-level containers sharing the host kernel. Not Docker — these are full OS containers (think systemd, SSH, the full userspace). Use for:

  • Linux-only services that don't need a custom kernel
  • Lightweight infrastructure services (DNS, monitoring agents, web servers)
  • Dev/test environments
  • Anything where VM overhead is unnecessary

Privileged vs. Unprivileged

Unprivileged (default, recommended): Container UIDs are mapped to a high range on the host. Root inside the container is not root on the host. Much safer.

Privileged: Container root = host root (mapped 1:1). Required for some operations (NFS mounts, certain device access). Use sparingly and only when unprivileged doesn't work.
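Creating an unprivileged container looks like this (CT ID, hostname, and the template filename are illustrative — list available templates with pveam):

```shell
# Unprivileged is the default, but being explicit doesn't hurt:
pct create 200 local:vztmpl/debian-12-standard_12.2-1_amd64.tar.zst \
    --hostname svc01 \
    --memory 2048 --cores 2 \
    --rootfs local-zfs:8 \
    --net0 name=eth0,bridge=vmbr0,ip=dhcp \
    --unprivileged 1
```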

Resource Limits

Set CPU, RAM, and I/O limits per container. Unlike VMs, containers share the host kernel and scheduler — one runaway container can affect others without proper limits.

pct set 200 -memory 2048
pct set 200 -cores 2
pct set 200 -swap 512
LXC Limitations

LXC containers can run Docker with features: nesting=1,keyctl=1 on unprivileged containers, and this works well for many workloads. However, not all Docker images or complex stacks are guaranteed to work due to kernel namespace and AppArmor constraints. For maximum compatibility and isolation, a VM with Docker inside remains the safest choice for production. PVE 9.1+ also added native OCI container support, allowing you to pull and run OCI images directly without Docker or a full VM.

07

High Availability

Proxmox HA automatically restarts VMs/CTs on another node if a node fails. It requires a cluster with quorum and shared storage (Ceph, NFS, iSCSI).

How HA works

  1. The HA manager (pve-ha-lrm + pve-ha-crm) runs on each node
  2. Nodes are monitored via Corosync heartbeats
  3. If a node is fenced (declared dead after missing heartbeats), the cluster requests HA resources be restarted elsewhere
  4. The CRM (Cluster Resource Manager) picks a target node and starts the VM/CT
  5. The VM boots fresh on the new node — this is not live migration, it's a cold restart
HA is Not Live Migration

HA restarts VMs after a node failure. The VM is down during the failover (typically 1-5 minutes). Live migration (zero-downtime) is a manual or planned operation, not part of HA. Don't promise customers "zero downtime HA" with Proxmox — that's not what it does.

Fencing

Fencing is how the cluster ensures a failed node is truly dead before restarting its VMs elsewhere. Without proper fencing, you risk split-brain — two copies of the same VM running simultaneously, corrupting data.

  • Watchdog (default): Linux software watchdog (softdog). If the HA manager loses contact with the cluster, the watchdog reboots the local node. This self-fencing is what the current PVE HA stack actually implements, and it works for most deployments.
  • Hardware watchdogs (IPMI/iLO/iDRAC BMCs): More reliable than softdog. Enable by loading the watchdog module for your hardware via /etc/default/pve-ha-manager.
  • STONITH: "Shoot The Other Node In The Head" is the generic cluster term for the same concept. Proxmox does not ship external fence agents that power off nodes out-of-band; watchdog self-fencing fills that role.
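Hardware watchdogs are selected in /etc/default/pve-ha-manager. The module name depends on your platform; iTCO_wdt (Intel chipsets) is shown purely as an example:

```shell
# /etc/default/pve-ha-manager
# Load a hardware watchdog module instead of the softdog default:
WATCHDOG_MODULE=iTCO_wdt

# After a reboot, verify the watchdog is active:
# journalctl -u watchdog-mux
```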

HA groups & resource configuration

# Add a VM to HA management:
ha-manager add vm:100

# Assign the VM to an HA group:
ha-manager set vm:100 --group my-group

# Create an HA group (restrict which nodes can run this VM;
# node priority syntax node:prio, higher = preferred):
ha-manager groupadd my-group --nodes node1:2,node2:1 --nofailback 1

# List HA resources:
ha-manager status
  • nofailback: When the original node comes back, don't automatically migrate the VM back. Set this to avoid unnecessary migrations and potential disruption.
  • max_restart: Maximum restart attempts before giving up (default: 1). Increase for flaky workloads, keep at 1 for workloads where repeated restarts could cause data corruption.
  • max_relocate: Maximum times to try a different node (default: 1).
08

Backups

Built-in backup (vzdump)

Proxmox includes vzdump for VM and container backups. Three modes:

  • Snapshot: no downtime (live). Crash-consistent (application-consistent with the QEMU agent). The default choice for production VMs.
  • Suspend: brief downtime (seconds to minutes). Memory state saved. Use when snapshot mode doesn't work.
  • Stop: full downtime (VM is stopped). Clean shutdown, fully consistent. For maintenance windows and critical databases.
Recommendation

Use snapshot mode with the QEMU Guest Agent enabled. The agent triggers fsfreeze inside the guest before the snapshot, making it application-consistent for most workloads (equivalent to taking a snapshot of a cleanly-paused filesystem). Without the agent, you get crash-consistent backups — fine for most Linux workloads, risky for databases.
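Whether the agent is actually running inside a given guest is easy to verify (VM ID is illustrative):

```shell
# Agent enabled in the VM config?
qm config 100 | grep agent

# Agent responding inside the guest?
qm agent 100 ping && echo "agent OK"
```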

Proxmox Backup Server (PBS)

Dedicated backup appliance from Proxmox. Strongly recommended over storing backups on local/NFS storage:

  • Deduplication: Client-side dedup with fixed-size chunks (for VM disk images) and variable-size chunks (for file archives, using a rolling hash for better dedup ratios). Second backup of a 100 GB VM that changed 1 GB only transfers ~1 GB.
  • Incremental forever: Every backup after the first is incremental. No periodic full backups needed.
  • Encryption: Client-side AES-256-GCM. The PBS server never sees plaintext data.
  • Verification: Scheduled verify jobs that check backup integrity without restoring.
  • Garbage collection: Automatic cleanup of unreferenced chunks.
  • Sync & offsite: Native sync to a remote PBS for offsite copies.
# Schedule backups in Proxmox UI: Datacenter → Backup → Add
# Or via CLI:
vzdump 100 --storage pbs-backup --mode snapshot --compress zstd

# Backup all VMs on a node:
vzdump --all --storage pbs-backup --mode snapshot --compress zstd

# Note: --mailnotification and --mailto are deprecated in PVE 8+.
# Use the notification system instead: Datacenter → Notifications
# to configure targets, matchers, and notification policies.

Backup strategy

  • Daily backups of all VMs/CTs to PBS (snapshot mode, off-hours)
  • Retention: 7 daily, 4 weekly, 3 monthly minimum. PBS handles retention policies natively.
  • Offsite: Sync PBS to a remote PBS or push to S3-compatible storage. The 3-2-1 rule applies: 3 copies, 2 media types, 1 offsite.
  • Test restores quarterly — restore a VM to a temporary name and verify it boots and works.
  • Backup the Proxmox config itself: /etc/pve/ contains cluster config. It's small — back it up separately.
Don't Forget

Back up /etc/pve/ (cluster config, VM configs, user database, ACLs, storage definitions). It's not included in VM backups. Losing this means you can recreate VMs from backups but not the cluster configuration, users, permissions, or HA settings.
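A minimal sketch of a host-config backup job (paths per the note above; the destination and extra paths are illustrative — adapt to your environment):

```shell
#!/bin/sh
# Nightly host-config backup. /etc/pve is the live pmxcfs mountpoint,
# so tar reads the current cluster configuration directly.
DEST=/root/pve-config-$(hostname)-$(date +%F).tar.gz
tar czf "$DEST" \
    /etc/pve \
    /etc/network/interfaces \
    /etc/modprobe.d \
    2>/dev/null
echo "wrote $DEST"
```

Ship the archive off the node (to PBS, a fileserver, anywhere) — a config backup stored only on the host it describes is not a backup.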

09

Upgrades

Proxmox follows Debian releases. Major version upgrades coincide with the underlying Debian upgrade (PVE 7/Bullseye → PVE 8/Bookworm → PVE 9/Trixie). PVE 9.0 was released August 2025 on Debian 13 "Trixie" with kernel 6.14, QEMU 10.0, Ceph Squid 19.2, and ZFS 2.3. Minor updates are regular apt upgrades.

Minor updates (within a version)

# Standard apt upgrade, one node at a time:
apt update
apt dist-upgrade

# Reboot if kernel was updated:
# Check: running kernel vs. installed kernel
uname -r
ls /boot/vmlinuz-* | tail -1
  • Upgrade one node at a time in a cluster
  • Migrate or shut down VMs on the node before rebooting (or rely on HA for automatic failover)
  • Verify the node rejoins the cluster after reboot: pvecm status
  • Wait for Ceph to rebalance (if using Ceph) before upgrading the next node: ceph status should show HEALTH_OK

Major version upgrades

Major upgrades are in-place Debian upgrades. Proxmox provides a checklist tool:

# Run the pre-upgrade checklist:
pve8to9 --full    # (or pve7to8 for older upgrades)

# This checks for:
# - Unsupported packages
# - Deprecated configurations
# - Ceph version compatibility
# - Kernel version
# - Repository configuration
Major Upgrade Strategy

Major upgrades are not reversible (Debian doesn't support downgrades). Take a full backup of the node (ideally a bare-metal backup or at minimum /etc/ and /var/lib/pve-cluster/) before starting. Upgrade one node at a time. If it fails catastrophically, reinstall from scratch and rejoin the cluster. VMs on shared storage are unaffected.

Ceph upgrades

If running Ceph, it has its own upgrade path that must be coordinated with the PVE upgrade:

  • Ceph upgrades are version-locked to the PVE major version (PVE 9 ships Ceph Squid 19.x, PVE 8 shipped Ceph Quincy/Reef/Squid, PVE 7 shipped Ceph Pacific/Quincy)
  • Upgrade Ceph monitors first, then OSDs, then MDS (if using CephFS)
  • Set noout flag before rebooting OSD nodes to prevent unnecessary rebalancing: ceph osd set noout
  • Unset after upgrade: ceph osd unset noout
10

Monitoring

Proxmox has basic built-in monitoring (web UI graphs) but production deployments need external monitoring.

What to monitor

Metric            Alert Threshold            Why
Cluster quorum    votes < expected           Quorum loss = HA stops, no management operations
Node CPU          > 85% sustained            VMs compete for cycles, latency increases
Node RAM          > 90% (incl. ZFS ARC)      OOM killer will start killing VMs
Storage usage     > 80% (Ceph: > 70%)        Ceph degrades severely above 80%; near-full OSD = cluster emergency
Ceph health       != HEALTH_OK               Reduced redundancy; one more failure could lose data
Ceph OSD latency  commit_latency_ms > 20     Slow disk or overloaded OSD
ZFS pool health   != ONLINE                  Pool running on reduced redundancy
Disk SMART        any reallocated sectors    Early warning for disk failure
Network bond      degraded (lost a link)     Running without redundancy
Backup status     failed or stale            No backup = no recovery

Monitoring stack

  • Prometheus + PVE Exporter: The prometheus-pve-exporter scrapes the Proxmox API and exposes metrics. Community-maintained, works well.
  • Ceph built-in: ceph status, ceph health detail, Ceph Manager's Prometheus module (ceph mgr module enable prometheus)
  • SMART monitoring: smartmontools + smartd on every node. Alert on any SMART errors.
  • Node Exporter: Standard Prometheus node_exporter for OS-level metrics (CPU, RAM, disk I/O, network)
# Enable Ceph's Prometheus module:
ceph mgr module enable prometheus
# Scrape at http://ceph-mgr-node:9283/metrics

# Install PVE exporter (on a monitoring host):
pip install prometheus-pve-exporter
# Config: point at https://pve-node:8006 with API token
11

Security Hardening

Proxmox runs as root on bare metal. The hypervisor is the highest-privilege layer in the stack — if it's compromised, every VM is compromised.

Web UI & API Access

  • Restrict port 8006 to management network only (firewall or bind address)
  • Use API tokens instead of username/password for automation
  • Enable 2FA (TOTP) for all admin accounts
  • Disable root login; create named admin accounts with appropriate roles

SSH Hardening

  • Key-only authentication (disable password auth)
  • Restrict SSH to management network
  • Use fail2ban for brute-force protection
  • Disable root SSH if using sudo-capable admin accounts

Network Isolation

  • Management, storage, and VM traffic on separate networks/VLANs
  • Corosync traffic never on an untrusted network
  • Proxmox built-in firewall for VM-level rules
  • No VMs should be able to reach the management network

Updates & Patching

  • Subscribe to Proxmox security advisories
  • Patch monthly at minimum, critical CVEs immediately
  • Kernel updates require reboot — schedule maintenance windows
  • Don't skip Debian security updates (it's a full Debian system)

RBAC & permissions

Proxmox has a granular permission system with users, groups, roles, and path-based ACLs:

# Create a user (Proxmox realm):
pveum user add admin@pve --comment "Node admin"

# Create a role with specific privileges:
pveum role add VMOperator -privs "VM.Audit,VM.Console,VM.PowerMgmt"

# Assign role on a path:
pveum acl modify /vms/100 --users admin@pve --roles VMOperator

# API tokens (for automation):
pveum user token add admin@pve automation --privsep 1
# privsep=1 means the token gets its own permissions, not the user's
  • Use LDAP/AD integration for user authentication in enterprise environments
  • Map AD groups to Proxmox groups, then assign roles to groups
  • Use API tokens with privsep for Terraform, Ansible, and other automation
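API tokens authenticate with a single header, no ticket/cookie dance. A sketch (host, token ID, and the secret UUID are placeholders):

```shell
# List cluster nodes via the REST API using a token:
curl -k \
    -H "Authorization: PVEAPIToken=admin@pve!automation=<SECRET-UUID>" \
    https://10.0.0.10:8006/api2/json/nodes
```

The same header works for Terraform's and Ansible's Proxmox integrations, so one token scheme covers all automation.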
12

Licensing & Support

Proxmox VE is fully open source (AGPL v3). Every feature works without a subscription. The subscription buys you access to the enterprise repository and support. Pricing is per physical CPU socket per year (not per core).

  • No subscription (free): full software, no-subscription repo (slightly less tested packages), community forum support only.
  • Community (€115): enterprise repo access, community-based support (no professional tickets).
  • Basic (€355): enterprise repo, 3 support tickets/year, next-business-day response.
  • Standard (€530): enterprise repo, 10 support tickets/year, 4-hour response during business hours.
  • Premium (€1,060): enterprise repo, unlimited tickets, 2-hour response within a business day.

Enterprise repo vs. no-subscription repo

  • The enterprise repo (pve-enterprise) requires a valid subscription key. Packages are held back slightly for extra testing.
  • The no-subscription repo (pve-no-subscription) is free. Same packages, slightly less testing. Completely usable for production — many companies run it without issues.
  • The test repo (pvetest) has bleeding-edge packages. Never use in production.
# Switch to no-subscription repo (if no subscription):
# Remove enterprise repo:
rm /etc/apt/sources.list.d/pve-enterprise.list

# Add no-subscription repo (use your Debian codename: trixie for PVE 9, bookworm for PVE 8):
echo "deb http://download.proxmox.com/debian/pve trixie pve-no-subscription" \
  > /etc/apt/sources.list.d/pve-no-subscription.list

apt update
Recommendation

For production customer deployments, buy at least the Community subscription (€115/socket/year) for enterprise repo access, or Basic (€355/socket/year) if you want professional support tickets. The enterprise repo is more stable, and having vendor support as a safety net matters for customer confidence. The cost is negligible compared to VMware licensing — often 10-50x cheaper. For internal/lab use, the no-subscription repo is perfectly fine.

VMware comparison (for customer conversations)

  • License model: Proxmox is free or per-socket/year (€115-€1,060); vSphere is a per-core subscription (post-Broadcom).
  • Hypervisor: KVM (Type 1, Linux-based) vs. ESXi (Type 1, proprietary).
  • Live migration: both yes; Proxmox is manual or API-driven, vSphere adds DRS for automatic placement on top of vMotion.
  • HA: both cold-restart VMs on node failure; vSphere adds DRS rebalancing.
  • Distributed storage: Ceph (built-in) vs. vSAN (licensed separately).
  • Containers: LXC (native) vs. none (requires VMs).
  • SDN: basic (VLAN, VXLAN, EVPN) vs. NSX (advanced, very expensive).
  • Automation: REST API, Terraform, Ansible vs. vSphere API, Terraform, PowerCLI.
  • GPU passthrough: works (vfio-pci) vs. works (better vGPU support with NVIDIA).
13

Consultant's Checklist

Before proposing a Proxmox deployment:

  1. How many hosts? — Determines cluster size and quorum strategy (2-node needs QDevice)
  2. Storage strategy? — Ceph (hyper-converged), ZFS (local), NFS/iSCSI (external SAN), or a mix
  3. Network infrastructure? — How many NICs, 10G/25G availability, VLAN support, jumbo frames
  4. Workload types? — VMs vs. LXC, Windows vs. Linux, GPU needs, real-time requirements
  5. HA requirements? — Needs shared storage. What's the acceptable failover time? (Proxmox HA = cold restart, 1-5 min)
  6. Backup strategy? — PBS recommended. Offsite target? Retention requirements? RTO for restore?
  7. Migration from VMware? — How many VMs? OVA export possible? VMDK conversion plan? V2V tooling?
  8. Linux competency? — Proxmox is Linux. If the team isn't comfortable with CLI, networking config, and apt, budget for training.
  9. Subscription? — Enterprise repo access and support level. Even Basic is worth it for production.
  10. Automation plans? — Terraform (proxmox provider), Ansible (community modules), Packer for templates