Keycloak Production Architecture
Customer deployment reference — architecture, HA, DR, integrations, hardening & licensing
Overview
Keycloak is an open-source Identity and Access Management (IAM) solution that provides single sign-on (SSO), identity brokering, user federation, and fine-grained authorization for modern applications. Originally created by Red Hat, it became a CNCF incubating project in April 2023.
Protocols Standards-Based
OpenID Connect (OIDC), SAML 2.0, and OAuth 2.0. Tokens are issued as JWTs. Supports authorization code, client credentials, device code, and token exchange flows out of the box.
Architecture Stateless Application
Keycloak is a stateless Java application built on Quarkus (since KC 17). All persistent state lives in the database. Session caches are distributed across nodes via embedded Infinispan. This makes horizontal scaling straightforward.
Core Concepts Realms & Clients
Realms are isolated tenants — each has its own users, roles, clients, and configuration. Clients represent applications that delegate authentication to Keycloak. A typical deployment has one realm per environment or tenant.
Extensible SPIs & Themes
Almost everything is customizable via Service Provider Interfaces (SPIs): authentication flows, user storage, event listeners, token mappers. Themes control the look of login and email pages using Freemarker templates. The admin and account consoles are React SPAs (since KC 22+).
What Keycloak replaces
Keycloak is a self-hosted alternative to SaaS identity providers like Auth0, Okta, and Azure AD B2C. It provides equivalent functionality — SSO, MFA, social login, user federation, RBAC — without per-user pricing. Common migration drivers: cost (large user bases), data sovereignty, customization depth, and vendor lock-in avoidance.
Key capabilities
- Single Sign-On & Single Sign-Out — across all applications in a realm, with session management
- Identity Brokering — delegate auth to external IdPs (SAML, OIDC, social providers like Google, GitHub, Microsoft)
- User Federation — sync users from LDAP/AD, custom user storage SPIs
- Multi-Factor Authentication — TOTP, WebAuthn/FIDO2, configurable per-realm authentication flows
- Fine-Grained Authorization — resource-based permissions using UMA 2.0, policies, and scopes
- Admin Console & REST API — full management UI and comprehensive Admin API for automation
- Account Console — self-service portal for users to manage profile, sessions, credentials, and linked accounts
This guide covers Keycloak 25+ (Quarkus-based). If the customer is still on Wildfly-based Keycloak (pre-KC 17), prioritize migration — the Wildfly distribution was removed at KC 20 and receives no security patches. The latest release is KC 26.5.x.
Deployment Target: Kubernetes vs. VMs
The first architectural decision is where Keycloak will run. Both paths are well-supported, but they carry different operational trade-offs that ripple into clustering, upgrades, scaling, and configuration management.
| Factor | Kubernetes | Virtual Machines |
|---|---|---|
| Best for | Teams with K8s maturity, cloud-native stacks | No K8s platform, strict compliance, ops preference |
| Lifecycle | Keycloak Operator handles rolling upgrades, scaling | Manual or Ansible/Terraform managed |
| Scaling | HPA / replica count — trivial horizontal scaling | Add nodes behind LB — manual provisioning |
| Complexity | Ingress, TLS certs, operator CRDs, namespace isolation | systemd, reverse proxy, firewall rules, config mgmt |
| Cluster discovery | DNS_PING / KUBE_PING — automatic | TCPPING / JDBC_PING — requires manual config or DB table |
Kubernetes is the preferred path for most new deployments. Keycloak is inherently stateless — all persistent state lives in the database. Use the official Keycloak Operator (Quarkus-based).
When VMs are the right call
- No existing K8s platform — introducing K8s just for Keycloak creates more risk than it solves.
- Regulatory constraints — some industries require IAM on dedicated, isolated hosts.
- Ops team preference — Keycloak on systemd behind Nginx/HAProxy is well-understood.
- Air-gapped environments — container registry and operator lifecycle overhead is significant.
For VM deployments, use Ansible or Terraform for repeatable provisioning. Place 2+ nodes behind a load balancer with sticky sessions.
Container & K8s deployment patterns
Two main approaches on K8s:
- Keycloak Operator (recommended) — manages Keycloak and KeycloakRealmImport CRDs. Handles pod lifecycle, DB migration coordination, and health checks.
- Helm chart (community) — Bitnami chart is popular but not officially maintained by the project. More granular control but no operator lifecycle management.
Build a custom container image with themes and SPIs baked in:
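# Pin KC_VERSION to a specific release tag for production builds; latest is only a placeholder default.
ARG KC_VERSION=latest
FROM quay.io/keycloak/keycloak:${KC_VERSION} AS builder
COPY themes/custom-theme /opt/keycloak/themes/custom-theme
COPY providers/custom-spi.jar /opt/keycloak/providers/
# Run the build step so the custom providers and themes are baked into the image
RUN /opt/keycloak/bin/kc.sh build
FROM quay.io/keycloak/keycloak:${KC_VERSION}
COPY --from=builder /opt/keycloak/ /opt/keycloak/
ENTRYPOINT ["/opt/keycloak/bin/kc.sh"]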
High Availability Architecture
Keycloak is in the critical authentication path for every application. If it goes down, users can't log in and tokens can't be issued. HA is not optional — it's the baseline.
Required Load Balancer
Enable sticky sessions on AUTH_SESSION_ID cookie. Without it, mid-login users can be bounced to a node without their auth session, causing failures. Alternatively, configure fully distributed sessions and skip stickiness — but sticky is simpler.
Clustering Infinispan / JGroups
Nodes cluster via embedded Infinispan using JGroups for transport. Caches: user sessions, auth sessions, offline tokens, action tokens, login failure counters.
- K8s: DNS_PING or KUBE_PING
- VMs: TCPPING or JDBC_PING
Port 7800 (JGroups) must be open between all KC nodes.
Persistent user sessions (KC 26+)
Since KC 26, persistent user sessions are enabled by default. All user sessions are now stored in both the database and Infinispan caches. Users remain logged in even after all Keycloak nodes are restarted or upgraded — a major improvement for HA. This is also a requirement for the multi-site architecture.
Cache topology & session ownership
- Distributed caches (owners=2) — user sessions and offline sessions. Each session replicated to 2 nodes. If one dies, session survives on the other owner.
- Replicated caches — realm metadata, client sessions, authorization data. Every node has a full copy.
- Local caches — realm and user caches have local layers invalidated via cluster events.
For 100k+ sessions, consider tuning owners, eviction policies, or externalizing Infinispan.
Node failure behavior
What happens when a node dies:
- Active user sessions — with owners=2, the session migrates to surviving nodes. No user impact on next token refresh. With persistent sessions (KC 26+ default), sessions also survive full cluster restarts.
- In-flight auth sessions — may be lost. User restarts login flow (typically invisible — they just see the login page again).
- Rebalancing — Infinispan redistributes cache entries across survivors. Brief CPU/memory spike on remaining nodes.
Keycloak does not have read-only or non-voting nodes like OpenBao/Vault. All nodes in a cluster are equal — every node can handle any request. There is no leader/follower distinction.
Cold spare nodes
Keycloak doesn't have a native cold-spare mode, but you can achieve this operationally. Deploy additional KC pods/nodes that are part of the cluster but have zero weight in the load balancer. They participate in cache replication (increasing data redundancy) but receive no traffic until you shift load to them. On K8s, keep extra replicas and adjust endpoint weighting. On VMs, configure the LB to mark them as backup/standby.
Deploy minimum 2 nodes (3+ recommended). With 3+ you tolerate a failure during a rolling upgrade.
Replication & Multi-Site
Replication in Keycloak happens at two layers: the cache layer (Infinispan) and the database layer (PostgreSQL etc.). Understanding both is essential for HA and DR design.
Within a Cluster Intra-site Replication
Within a single Keycloak cluster (single site/datacenter), replication is handled entirely by embedded Infinispan. Session data is distributed across nodes using consistent hashing with configurable owner counts. Realm/client metadata is fully replicated to every node. Database state is shared — all nodes read/write to the same DB instance. No application-level DB replication is needed within a single site.
Across Clusters Cross-site Replication
For multi-datacenter deployments, you need replication at both layers: an external Infinispan cluster with XSITE (cross-site) replication for cache data, and synchronous database replication between sites. Keycloak's official multi-site guide supports exactly two sites — more than two is explicitly unsupported due to latency amplification and split-brain complexity.
Cross-site Infinispan architecture
The official Keycloak multi-site setup (documented since KC 24, with significant improvements in KC 26 including true active-active support) uses external Infinispan clusters — one per site — connected via XSITE replication:
- Each site runs its own Infinispan Data Grid cluster (3+ nodes) as a separate deployment from Keycloak.
- The two Infinispan clusters connect via RELAY2/XSITE protocol over a dedicated network link (JGroups bridge stack). Communication uses TLS with mutual authentication.
- Keycloak nodes connect to their local Infinispan cluster via remote-store (Hot Rod protocol), not embedded Infinispan.
- When data changes in Site A's Infinispan, the XSITE backup replicates it to Site B's Infinispan synchronously.
- This is how cache invalidation messages propagate — when a user's session is updated in Site A, Site B's cache is invalidated or updated via this XSITE channel.
Important: The Red Hat build of Keycloak requires Red Hat Data Grid (the commercial Infinispan product) for multi-site. Community Keycloak uses upstream Infinispan Server.
Synchronous vs. async XSITE: Keycloak's official guidance strongly recommends synchronous cross-site replication. Async replication can lead to stale caches — e.g., a user changes their password on Site A, but Site B still has the old password hash cached, allowing login with the old password until the cache is invalidated. The trade-off is that synchronous replication adds latency to every write (requires low-latency link, e.g., same region, different AZs).
Multi-site DR patterns
Option A No DR Infrastructure (Cold Provision)
The database replicates to the paired region, but there are no Keycloak VMs sitting there. If the primary region goes down, you spin up VMs from IaC (Terraform), deploy Keycloak via Ansible, promote the database replica, and point DNS at the new Application Gateway.
- RTO: 30–60 minutes — provisioning infrastructure from scratch during an outage.
- Cost: Cheapest option. No idle compute in DR region.
- Risk: IaC and Ansible must be tested regularly. Cloud capacity in the DR region is not guaranteed during a regional outage — you may not be able to provision the VMs you need.
Option B Cold Spare VMs
VMs exist in the DR region and Keycloak is installed, but the service is stopped. No Application Gateway is routing traffic to them.
- On failover: Promote database, start Keycloak services, update Application Gateway or DNS.
- RTO: 10–15 minutes — infrastructure is already there, just starting services and cutting over.
- Cost: Paying for stopped VMs (minimal compute cost, still paying for disks).
Option C Warm Standby
Keycloak is running in the DR region, connected to the read replica, but not receiving traffic. On failover: promote database from read replica to primary, shift Application Gateway or DNS.
- RTO: ~5 minutes — fastest of the passive options.
- Catch: Keycloak tries to write session data on startup, which fails against a read-only database. You'd need to keep Keycloak stopped or in a degraded state anyway.
- Cost: Paying for 3 running VMs that do nothing most of the time.
- In practice: Option C often collapses into Option B because of the read-only DB limitation.
Keycloak is not a read-only application. On every request it writes session data, login failure counters, event logs, user last-login timestamps, and brute-force detection state. A Keycloak instance connected to a read-only database replica will fail to start or crash on the first login attempt. This is why Option C (Warm Standby) rarely works as advertised — you cannot keep Keycloak "warm" against a read replica without it erroring out on writes.
Option D — Active-Active (maximum complexity)
Both sites serve traffic simultaneously via a global load balancer. This is the only pattern that provides near-zero RTO (no failover needed — traffic just shifts), but it comes with the highest cost and operational complexity. KC 26 introduced official active-active multi-site support with persistent user sessions and improved cache invalidation.
- Requires external Infinispan clusters with XSITE synchronous replication at both sites.
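- Database must be synchronously replicated between sites (for example, a multi-AZ Amazon Aurora cluster, CockroachDB, or PostgreSQL with BDR configured for synchronous commit). Aurora Global Database replicates asynchronously across regions, so on its own it does not meet this requirement.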
- Split-brain handling: if sites lose connectivity, the global LB must route all traffic to one site. Keycloak has no built-in split-brain resolution — Infinispan XSITE handles cache conflicts, but DB-level conflicts require the database's own conflict resolution.
- Only officially supported with exactly 2 sites.
DR options comparison
| Option | RTO | Cost | Complexity | Catch |
|---|---|---|---|---|
| A — Cold Provision | 30–60 min | Lowest | Medium | IaC must be tested; DR region capacity not guaranteed |
| B — Cold Spare VMs | 10–15 min | Low | Low | Paying for idle disks; Keycloak version must be kept in sync |
| C — Warm Standby | ~5 min | Medium | Medium | Read-only DB breaks Keycloak — collapses into Option B in practice |
| D — Active-Active | ~0 | Highest | Very High | Requires Infinispan XSITE, sync DB replication, 2 sites max |
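As a rough aid for reading the comparison, the table can be turned into a small selection helper. This is only a sketch: `pick_dr_option` is a hypothetical name, the thresholds are the worst-case RTO figures from the table, and it assumes cost strictly increases from A to D.

```shell
# Print the cheapest DR option (A, B, C or D) whose worst-case RTO from the
# comparison table fits within the target RTO, given in whole minutes.
pick_dr_option() {
  target=$1
  if   [ "$target" -ge 60 ]; then echo A   # cold provision, 30-60 min
  elif [ "$target" -ge 15 ]; then echo B   # cold spare VMs, 10-15 min
  elif [ "$target" -ge 5  ]; then echo C   # warm standby, ~5 min
  else echo D                              # active-active, ~0
  fi
}
```

A result of C should be read with the read-only database caveat in mind: in practice it tends to mean either accepting Option B timings or moving to Option D.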
Limitations & what Keycloak doesn't do
- No read-only replicas — unlike databases, Keycloak has no concept of a read-replica site. Every active site is a full read-write participant.
- No non-voting nodes — unlike Consul/OpenBao/etcd, there are no "voter" vs. "non-voter" roles. All nodes are equal peers in the Infinispan cluster.
- Two sites max — the official multi-site architecture is tested and supported with exactly two sites. Adding a third site increases write latency and the likelihood of split-brain scenarios.
- Low-latency required for sync XSITE — the two sites should be in the same region (different AZs), not across continents.
- XSITE state transfer — if one site goes offline and comes back, you need to perform a manual state transfer to resynchronize Infinispan caches. This involves clearing the offline site's caches and doing a full push from the active site.
Database
The database stores all persistent state: realm configuration, clients, users, credentials, roles, groups, events. It is the most critical component.
The database is the real single point of failure. All KC nodes connect to the same DB. If the DB goes down, every node goes down. The database must be independently HA.
| Database | Status | Notes |
|---|---|---|
| PostgreSQL | Recommended | Best tested, widest support. Patroni, RDS, CloudSQL, Azure DB for HA. |
| MySQL / MariaDB | Supported | InnoDB required. Galera or managed services for HA. |
| Oracle | Supported | Only when customer has existing Oracle licensing/DBA expertise. |
| MS SQL | Supported | Less common. Always On AG for HA. Also works with Azure SQL Database. |
Note: The Keycloak project considers PostgreSQL its primary target database. MySQL, MariaDB, Oracle, and MS SQL are supported but receive less testing focus. The project has indicated plans to narrow database support over time.
PostgreSQL HA patterns
- Patroni + etcd — de facto standard for self-managed PostgreSQL HA. Automatic leader election and failover.
- Streaming replication — synchronous recommended for RPO=0. Async acceptable if some data loss is tolerable.
- Connection pooling — PgBouncer between KC and PostgreSQL. KC opens many connections under load.
Managed services (RDS, CloudSQL, Azure DB) provide built-in HA with multi-AZ, automated backups, and PITR.
Connection pool tuning
KC_DB_POOL_INITIAL_SIZE=25
KC_DB_POOL_MIN_SIZE=25
KC_DB_POOL_MAX_SIZE=100
Verify your DB can handle (KC nodes × max pool size) total connections. Monitor agroal.active.count, agroal.available.count, agroal.awaiting.count (should be zero — if not, pool is too small).
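The connection check described above is simple arithmetic, sketched here as two shell helpers. The function names and the default reserve of 10 connections (for admin sessions, replication and monitoring) are assumptions for illustration, not Keycloak settings.

```shell
# Worst-case number of connections Keycloak can open: nodes x max pool size.
kc_total_connections() {
  nodes=$1; pool_max=$2
  echo $(( nodes * pool_max ))
}

# Succeeds when the database connection limit covers that worst case plus a
# reserve for non-Keycloak sessions (the reserve defaults to 10, an assumption).
kc_db_can_handle() {
  nodes=$1; pool_max=$2; db_max=$3; reserve=${4:-10}
  [ $(( nodes * pool_max + reserve )) -le "$db_max" ]
}
```

For example, `kc_db_can_handle 3 100 200 || echo "raise max_connections or add PgBouncer"` prints the warning, because 3 nodes with a pool of 100 can open 300 connections. PostgreSQL's default `max_connections` is 100, so the pool settings above need a raised limit or a pooler in front of the database.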
On Kubernetes, do not run the database inside the same cluster as Keycloak for production. A K8s failure would take down both.
Backup & Restore
Keycloak's persistent state lives almost entirely in the database. Backups are therefore primarily a database concern — but there are other artifacts to include. The most important thing about backups is that they're tested. An untested backup is not a backup.
Primary Database Backups
The database contains everything: realm config, users, hashed credentials, client registrations, roles, groups, events, offline sessions. Two complementary strategies:
- Logical backups — pg_dump (or equivalent). Full point-in-time snapshot. Good for portability and selective restore.
- Continuous archival — PostgreSQL WAL archival for PITR. Enables restore to any point in time, not just the last dump. Essential for minimising data loss.
Supplemental What Else to Back Up
- Custom themes & SPI JARs — should be in Git; also baked into container images.
- TLS certs & keystores — store in Vault / secrets manager.
- Keycloak config — keycloak.conf, env vars, Helm values, operator CRDs. Version control.
- Infinispan config — custom cache-ispn.xml if used.
- Realm exports (JSON) — useful for config-as-code but do not include user credentials or client secrets.
Automated backup strategy
Schedule and retention:
- WAL archival — continuous, to object storage (S3, GCS, Azure Blob). Use pgBackRest, barman, or wal-g. This is your primary recovery mechanism.
- Logical dumps — daily pg_dump --format=custom. Retain 30 days minimum. Store off-site (different region/account).
- Managed services — RDS/CloudSQL automated backups provide both snapshot and PITR. Enable with appropriate retention (default is often only 7 days — increase to 30+).
Automation tools:
- pgBackRest — best-in-class for PostgreSQL. Supports full/incremental/differential backups, parallel compression, encryption at rest, and S3/GCS storage.
- CronJob on K8s — for logical dumps, run a K8s CronJob that executes pg_dump and uploads to object storage. Include a verification step that restores to a temp DB.
- K8s Velero — can back up PVCs, but this is a storage-level backup, not application-consistent. Don't rely on Velero alone for DB backups.
#!/bin/bash
# Example: automated pg_dump to S3
# Assumes DB_HOST and DB_USER are exported and the password comes from ~/.pgpass or PGPASSWORD.
set -euo pipefail
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
FILENAME="keycloak_backup_${TIMESTAMP}.dump"
pg_dump --host="$DB_HOST" --username="$DB_USER" \
  --format=custom --file="/tmp/${FILENAME}" keycloak
aws s3 cp "/tmp/${FILENAME}" \
  "s3://backups-bucket/keycloak/${FILENAME}" \
  --storage-class STANDARD_IA
rm "/tmp/${FILENAME}"
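A cheap sanity check can run right after the dump is written, before it is uploaded. The sketch below only confirms that `pg_restore --list` can read the archive and that Keycloak's `user_entity` table appears in its table of contents. The function name is hypothetical, and this check is lighter than the full restore into a temporary database recommended above, so it complements that step rather than replacing it.

```shell
# Return success only if the custom-format dump is readable and its table of
# contents lists Keycloak's user table.
verify_dump() {
  dump=$1
  # pg_restore --list prints the archive's table of contents without restoring
  toc=$(pg_restore --list "$dump") || return 1
  printf '%s\n' "$toc" | grep -q 'TABLE .*user_entity'
}
```

In the backup script this could be called as `verify_dump "/tmp/${FILENAME}"` before the upload, so that an unreadable dump fails the job instead of being stored.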
Restoring from backup
Full restore procedure:
- 1. Stop all Keycloak nodes. No KC instance should be writing to the database during restore.
- 2. Restore the database. For pg_dump backups: pg_restore --clean --create --dbname=postgres backup.dump (with --create, connect to a maintenance database such as postgres; pg_restore then drops and recreates the keycloak database named in the archive). For PITR: restore the base backup and replay WALs to the desired timestamp.
- 3. Verify the DB. Connect directly and spot-check: realm exists, user count is correct, client registrations are present.
- 4. Start Keycloak. KC will connect to the restored DB. Infinispan caches will rebuild from the DB on startup (this is automatic). First startup after restore may be slower as caches warm up.
- 5. Validate. Test login flows, token issuance, LDAP sync, admin console access. With persistent user sessions (KC 26+ default), user sessions survive in the DB and users may not need to re-authenticate. On older versions, users will need to re-authenticate since session caches were lost.
Partial / selective restore: Keycloak doesn't support restoring a single realm from a database backup — it's all or nothing at the DB level. For realm-level recovery, realm JSON exports are more useful. You can import a realm JSON to recreate the config, clients, and roles — but users will need to reset passwords (credentials aren't in the export).
Backups during upgrades
Upgrades are the most critical time to have a reliable backup, because Keycloak applies irreversible Liquibase schema migrations on startup.
- Always take a fresh backup immediately before upgrading — not a day-old scheduled backup.
- Use a consistent snapshot — ensure no KC nodes are writing during the backup. Shut down all KC nodes, take the backup, then start the upgrade.
- Label the backup — tag it clearly as a pre-upgrade backup with the current KC version and the target version.
- Test the restore path first — before upgrading production, restore the backup to a staging DB, run the upgrade against it, and verify.
If the upgrade fails and you need to roll back: Stop all KC nodes immediately. Restore the DB from the pre-upgrade backup. Redeploy the previous KC version. There is no Liquibase rollback — schema changes are forward-only. The only rollback path is restoring the DB.
Restore testing cadence
Schedule restore tests quarterly at minimum:
- Restore to a staging environment (isolated DB + KC instance).
- Verify: realm config loads, users can log in, tokens are issued, LDAP sync runs, custom themes render, admin console works.
- Measure actual RTO (time from "start restore" to "first successful login") and compare against the customer's target.
- Document the procedure as a runbook with exact commands, expected timings, and verification steps.
- Rotate the person running the test — don't let it be single-threaded knowledge.
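Measuring RTO during a restore test comes down to two timestamps and a comparison against the target. A minimal sketch, with hypothetical function names and timestamps given as Unix epoch seconds (for example from `date +%s`):

```shell
# Seconds elapsed between the start of the restore and the first successful login.
rto_seconds() {
  start_epoch=$1; login_epoch=$2
  echo $(( login_epoch - start_epoch ))
}

# Succeeds when the measured RTO (seconds) is within the target (minutes).
rto_within_target() {
  measured_s=$1; target_min=$2
  [ "$measured_s" -le $(( target_min * 60 )) ]
}
```

During a drill, record `start=$(date +%s)` when the restore begins and `end=$(date +%s)` at the first successful login, then run `rto_within_target "$(rto_seconds "$start" "$end")" 15` for a 15-minute target and note the result in the runbook.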
# Export all realms
/opt/keycloak/bin/kc.sh export \
--dir /tmp/realm-exports --users realm_file
# Export specific realm
/opt/keycloak/bin/kc.sh export \
--dir /tmp/realm-exports --realm my-realm \
--users realm_file
Upgrade Strategy
Staying current is important for security, but upgrades need careful planning due to irreversible database schema changes.
Release Cadence
Community Keycloak targets 4 minor releases per year (roughly quarterly) and a major release every 2–3 years. Starting with KC 26, backwards compatibility is guaranteed for fully supported features and APIs within a major version — breaking changes in minors are opt-in. Preview features and non-public APIs may change at any time.
Only the latest release gets security patches. There is no LTS for community Keycloak. If a critical CVE drops, you must upgrade to the current release to get the fix.
Support Lifecycle
Community: no long-term support. Only the latest major.minor gets patches.
Red Hat build: minimum 2-year support lifecycle for RHBK 26.x (3 years for 27.x onwards). Full support until next major ships, then 6+ months maintenance. Red Hat skips some upstream versions, cherry-picking stable releases.
If the customer cannot upgrade frequently, the Red Hat build is strongly recommended for its backported security patches.
Database Migrations
Keycloak runs Liquibase changelogs on startup. First pod applies the migration; others wait. Always back up before upgrading. There is no schema downgrade — rollback = restore DB from backup.
Breakage Themes & SPIs
Custom themes and SPI JARs are the most common breakage. Freemarker templates and SPI interfaces change between majors. Pin to specific KC versions and test thoroughly.
Step-by-step upgrade runbook
Pre-upgrade (1–2 weeks):
- Read release notes and migration guide for every version between current and target.
- Audit custom themes and SPIs for compatibility.
- Update custom container image to new KC base. Run kc.sh build.
Staging (1 week):
- Restore production DB copy into staging.
- Deploy new KC version. Verify Liquibase migration completes.
- Test: login flows, token issuance, themes, SPIs, admin console, LDAP sync.
Production:
- Fresh DB backup immediately before starting.
- K8s: rolling update (first pod runs migration, others detect schema is current).
- VMs: blue-green deployment.
- Monitor 1–2 hours: login rates, errors, latency, sessions.
Rolling back failed upgrades
Keycloak's Liquibase migrations are forward-only. There is no kc.sh rollback command. If an upgrade fails:
- Stop all KC nodes immediately. Don't let them keep trying to start against a partially-migrated DB.
- Assess the failure. Check logs for the specific Liquibase error. Common causes: custom schema modifications conflicting with changelogs, insufficient DB permissions, unexpected column types.
- Option A: Fix forward. If the failure is a known issue with a workaround, apply the fix and restart KC. Liquibase tracks which changelogs have run and will resume from where it failed.
- Option B: Full rollback. Restore the DB from the pre-upgrade backup. Redeploy the previous KC version. Guaranteed to work if you have a good backup.
- Never manually edit Liquibase tracking tables (DATABASECHANGELOG, DATABASECHANGELOGLOCK) unless you deeply understand the consequences.
Lock table stuck: If KC was killed mid-migration, the Liquibase lock table may be stuck. Clear it:
UPDATE DATABASECHANGELOGLOCK
SET LOCKED = FALSE, LOCKGRANTED = NULL, LOCKEDBY = NULL
WHERE ID = 1;
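Before clearing the lock it is worth confirming that it is actually held and by which host, and that no Keycloak node is still running a migration. The sketch below wraps `psql` for that inspection. The function names are hypothetical, and connection details are assumed to come from the standard `PGHOST`, `PGUSER` and `PGDATABASE` environment variables.

```shell
# Print the Liquibase lock row as "locked|lockgranted|lockedby".
show_liquibase_lock() {
  psql --no-psqlrc --tuples-only --no-align \
    --command "SELECT locked, lockgranted, lockedby FROM databasechangeloglock WHERE id = 1;"
}

# Exit status 0 if the lock is held, 1 if it is free, 2 if the query failed.
liquibase_lock_held() {
  row=$(show_liquibase_lock) || return 2
  case $row in
    "t|"*) return 0 ;;   # PostgreSQL prints boolean true as "t"
    *)     return 1 ;;
  esac
}
```

If `liquibase_lock_held` succeeds and the host shown in `lockedby` is confirmed stopped, the `UPDATE` above is the next step; otherwise leave the lock alone.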
Skipping versions & legacy migration
Skipping versions: Liquibase changelogs are cumulative. Going from v22→v25 applies all intermediate changelogs in sequence. Read migration guides for every skipped version. Large jumps take longer and carry more risk.
Wildfly → Quarkus migration: The Wildfly distribution was removed at KC 20 (2022). Migration involves rewriting standalone-ha.xml to keycloak.conf/env vars, replacing Wildfly-specific SPIs, updating custom themes, and changing deployment tooling (no more WARs). Note that the /auth context path was also removed by default in the Quarkus distribution. Prioritise this if the customer is still on Wildfly.
LDAP / Active Directory Integration
Almost every enterprise deployment involves LDAP/AD. Keycloak's User Federation provider handles this, but several design decisions significantly affect the architecture.
Design Federation Mode
On-demand (default): users imported to KC's DB on first login. LDAP stays source of truth for credentials.
Periodic batch sync: full or changed-user sync on a schedule. Pre-populates the user list for admin visibility.
Decision Read vs. Write
Read-only (most common): KC reads users/groups, never writes back.
Writable: password and profile changes propagate to LDAP. Only enable if explicitly needed.
Kerberos / SPNEGO for Windows SSO
Requirements: SPN registered in AD (HTTP/keycloak.example.com@EXAMPLE.COM), keytab file, browser config (Group Policy), and correct DNS (forward + reverse). Kerberos is extremely DNS-sensitive — hostname mismatch is the #1 failure cause. NTP sync is critical (5-minute clock skew tolerance).
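The clock skew requirement is easy to check once timestamps from the Keycloak host and a domain controller are available. A minimal sketch, assuming both times have already been collected as Unix epoch seconds; how they are gathered is left open, and the function names are hypothetical.

```shell
# Absolute difference in seconds between two epoch timestamps.
clock_skew_seconds() {
  d=$(( $1 - $2 ))
  if [ "$d" -lt 0 ]; then d=$(( 0 - d )); fi
  echo "$d"
}

# Succeeds when the skew is within the tolerance (default 300 s, i.e. 5 minutes).
within_kerberos_skew() {
  [ "$(clock_skew_seconds "$1" "$2")" -le "${3:-300}" ]
}
```

For example, `within_kerberos_skew "$(date +%s)" "$kdc_epoch" || echo "fix NTP before debugging SPNEGO"`, where `kdc_epoch` is a placeholder for the domain controller's time.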
LDAP mappers, groups & multi-directory
Mappers: User Attribute, Group, Role, Hardcoded Role, MSAD User Account Control. Plan mapping strategy early — it directly affects token claims.
Multiple directories per realm supported: different providers with independent connection settings, mappers, sync schedules, and priority ordering.
Common LDAP pitfalls
- Bind credentials — dedicated service account, minimum permissions.
- LDAPS — always encrypt. Import CA cert into JVM truststore.
- Pagination — AD defaults to 1000 result limit. KC handles paging but verify with ldapsearch.
- Initial sync — 100k+ users takes time and memory. Run during maintenance window.
- Username/email uniqueness — conflicts can block imports.
- Referrals — multi-domain AD forests may return referrals. Configure handling correctly.
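The pagination check mentioned in the list can be sketched with OpenLDAP's `ldapsearch`, using the `-E pr=SIZE/noprompt` extension to request paged results. The function name, the default filter, the page size of 500 and the `LDAP_PW_FILE` variable (a file holding the bind password) are assumptions for illustration.

```shell
# Count entries under a base DN using paged results, so the count is not cut
# off at the directory's default 1000-entry limit.
ldap_count_users() {
  uri=$1; bind_dn=$2; base_dn=$3; filter=${4:-'(objectClass=user)'}
  # -x simple bind, -y password file, -LLL plain LDIF, 1.1 requests no attributes
  ldapsearch -x -LLL -H "$uri" -D "$bind_dn" -y "$LDAP_PW_FILE" \
    -b "$base_dn" -E pr=500/noprompt "$filter" 1.1 | grep -c '^dn:'
}
```

Comparing this count with the number of users Keycloak reports after a full sync is a quick way to spot truncated imports.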
Kubernetes-Specific Guidance
Operator Keycloak Operator
Official Quarkus-based operator. Manages Keycloak and KeycloakRealmImport CRDs. Dedicated namespace with scoped RBAC.
Ingress Proxy Headers
Set KC_PROXY_HEADERS=xforwarded and KC_HTTP_ENABLED=true. Missing X-Forwarded-Proto causes redirect loops. Don't use path rewriting. Set KC_HOSTNAME to the full public URL (hostname v2, the default since KC 25).
Sizing Resources
Start: 2 replicas, 1–2 CPU, 1–2 GB RAM per pod. CPU-heavy during RSA signing and password hashing. Load test to tune.
Resilience PDB & Affinity
PodDisruptionBudget with minAvailable: 1. Anti-affinity across nodes/zones.
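A PodDisruptionBudget with that setting can be generated from a small shell helper and piped to `kubectl`. This is a sketch only: `render_pdb` is a hypothetical name, and the `app: keycloak` selector label is an assumption that must match the labels on the actual Keycloak pods.

```shell
# Print a PodDisruptionBudget manifest that keeps at least one Keycloak pod
# available during voluntary disruptions such as node drains.
render_pdb() {
  name=$1; namespace=$2
  cat <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ${name}-pdb
  namespace: ${namespace}
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: keycloak
EOF
}
```

Usage could look like `render_pdb keycloak iam | kubectl apply -f -`, where `iam` stands in for the dedicated namespace.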
Ingress & TLS termination patterns
Edge (most common): TLS at Ingress, HTTP to KC. Passthrough: TLS direct to KC. Re-encrypt: TLS at Ingress + new TLS to KC.
# Hostname v2 configuration (default since KC 25; v1 options removed in KC 26)
KC_PROXY_HEADERS=xforwarded
KC_HTTP_ENABLED=true
KC_HOSTNAME=https://keycloak.example.com
# Hostname v1 options such as KC_HOSTNAME_URL and KC_HOSTNAME_STRICT_HTTPS were removed in KC 26; use KC_HOSTNAME with a full URL instead
Health probes & startup
Enable KC_HEALTH_ENABLED=true. Since KC 25, health and metrics endpoints are served on the management port 9000 (not the main HTTP port 8080). Use /health/started for startup probe (KC can take 30–90s during migrations), /health/ready for readiness, /health/live for liveness.
startupProbe:
httpGet: { path: /health/started, port: 9000 }
failureThreshold: 30
periodSeconds: 5
readinessProbe:
httpGet: { path: /health/ready, port: 9000 }
periodSeconds: 10
livenessProbe:
httpGet: { path: /health/live, port: 9000 }
periodSeconds: 15
failureThreshold: 3
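The same endpoints are useful outside Kubernetes, for example in a VM deployment script that waits for a node before adding it to the load balancer. The sketch below mirrors the startup probe timing (30 attempts, 5 seconds apart). The function names are hypothetical, and it assumes the management interface is reachable over plain HTTP on port 9000.

```shell
# Succeeds when the given health check (started, ready or live) returns 2xx.
kc_health() {
  host=$1; check=${2:-ready}
  curl --fail --silent --show-error --max-time 5 \
    "http://${host}:9000/health/${check}" >/dev/null
}

# Poll /health/started until it succeeds or the attempts run out.
wait_for_kc() {
  host=$1; tries=${2:-30}
  while [ "$tries" -gt 0 ]; do
    if kc_health "$host" started; then return 0; fi
    tries=$(( tries - 1 ))
    sleep 5
  done
  return 1
}
```

A deployment step might run `wait_for_kc kc-node-1.internal && kc_health kc-node-1.internal ready`, with the host name as a placeholder.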
Namespace, RBAC & network policies
Dedicated namespace. Scoped RBAC (no cluster-admin). NetworkPolicies restricting ingress/egress. External secrets for credentials. Run as non-root with read-only root filesystem.
Security Hardening
Keycloak is your IdP — if compromised, every downstream app is compromised. Many settings are not enabled by default.
Admin Console
Restrict /admin and master realm to internal networks. Never expose publicly.
Default Off Brute Force
Enable per realm. Configure max failures, wait increment, lockout.
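Brute-force detection can be switched on from the command line with `kcadm.sh`, which makes it easy to apply the same settings to every realm. The values below (5 failures, 60-second wait increment, 15-minute maximum wait) are illustrative assumptions, not recommendations from this guide, and `enable_brute_force` is a hypothetical helper name.

```shell
# Path to the admin CLI; override KCADM if Keycloak is installed elsewhere.
KCADM=${KCADM:-/opt/keycloak/bin/kcadm.sh}

# Enable brute-force detection on one realm with example thresholds.
enable_brute_force() {
  realm=$1
  "$KCADM" update "realms/${realm}" \
    -s bruteForceProtected=true \
    -s failureFactor=5 \
    -s waitIncrementSeconds=60 \
    -s maxFailureWaitSeconds=900
}
```

The CLI needs an authenticated session first, for example via `kcadm.sh config credentials --server <url> --realm master --user <admin>`, after which `enable_brute_force my-realm` applies the settings.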
Tokens Lifespans
Access: 5 min. Refresh: 30 min. SSO idle: 30 min. SSO max: 10 hrs. Shorter is better.
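Realm settings store these lifespans in seconds, so a small conversion avoids mistakes when scripting them. The sketch below prints the values above as `key=value` pairs using the realm fields `accessTokenLifespan`, `ssoSessionIdleTimeout` and `ssoSessionMaxLifespan`. The refresh token lifetime is left out on the assumption that it follows the SSO idle timeout; verify this for the Keycloak version in use. The helper names are hypothetical.

```shell
# Convert minutes or hours to seconds.
minutes() { echo $(( $1 * 60 )); }
hours()   { echo $(( $1 * 3600 )); }

# Print the suggested lifespans as realm attribute assignments, one per line.
lifespan_settings() {
  echo "accessTokenLifespan=$(minutes 5)"
  echo "ssoSessionIdleTimeout=$(minutes 30)"
  echo "ssoSessionMaxLifespan=$(hours 10)"
}
```

Each printed line can be passed to `kcadm.sh update realms/<realm>` as a `-s key=value` argument.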
Passwords Hashing
argon2id (default since KC 25; KC 24 uses PBKDF2-SHA512 210K iterations). Min 12 chars. History, complexity rules.
TLS everywhere
Encrypt every hop: Client→LB (TLS 1.2+, HSTS), LB→KC (re-encrypt if policy requires), KC→DB (sslmode=verify-full), KC→KC (JGroups SYM_ENCRYPT), KC→LDAP (LDAPS port 636).
Admin API & service accounts
Don't use master admin for automation. Dedicated service accounts with minimal roles. Client credentials grant for S2S. Enable adminEventsEnabled + adminEventsDetailsEnabled.
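For service-to-service automation, a dedicated client can obtain a token with the client credentials grant against the realm's standard OpenID Connect token endpoint. A minimal sketch with `curl`: the function names and the `KC_CLIENT_SECRET` environment variable are assumptions, and the secret should come from a secrets manager rather than a script.

```shell
# Build the token endpoint URL for a realm, tolerating a trailing slash.
token_url() {
  echo "${1%/}/realms/${2}/protocol/openid-connect/token"
}

# Request an access token for a service account; prints the JSON response.
fetch_service_token() {
  base=$1; realm=$2; client_id=$3
  curl --fail --silent --show-error \
    --data-urlencode "grant_type=client_credentials" \
    --data-urlencode "client_id=${client_id}" \
    --data-urlencode "client_secret=${KC_CLIENT_SECRET}" \
    "$(token_url "$base" "$realm")"
}
```

For example, `fetch_service_token https://keycloak.example.com my-realm automation-client` returns a JSON document whose `access_token` field is then sent as a bearer token to the Admin API. The client name here is a placeholder.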
Key rotation & token signing
RS256 default. Consider ES256 for shorter tokens. No auto-rotation — automate via Admin API. Keycloak recommends rotating every 3–6 months (annually at absolute minimum). Keep old key passive until all tokens signed with it expire. Clients cache JWKS — most re-fetch on kid mismatch.
Monitoring & Observability
Enable KC_METRICS_ENABLED=true. Scrape /metrics on management port 9000 (since KC 25) with Prometheus. Build Grafana dashboards.
Key Metrics
- Login success/failure rates — brute-force detection, IdP outages
- Token endpoint latency — p50/p95/p99
- Active sessions — capacity planning
- DB connection pool — alert at 80% saturation
- JGroups cluster size — should match expected node count
- JVM heap / GC — memory pressure signals
Alerting rules
- Login failure rate > 50/min (5 min sustained) → possible attack
- DB pool > 80% → increase pool or investigate slow queries
- JGroups members ≠ expected → node left cluster
- Token p99 > 2s → performance degradation
- 5xx rate > 1% → check logs
- JVM heap > 85% for 10 min → memory pressure
Events, logging & SIEM
User events (login, logout, register) and admin events (every admin API change). Store in DB with configurable expiry or forward to SIEM via custom Event Listener SPI. Ship logs to ELK/Loki/Datadog. INFO for prod, selective DEBUG for troubleshooting.
Licensing & Open Source vs. Enterprise
Keycloak is proper open source, not open core. It is licensed under Apache License 2.0 — one of the most permissive open-source licenses available. Every feature in Keycloak is available to everyone. There are no features gated behind a commercial license, no "enterprise edition" binary with extra capabilities, and no feature flags that unlock with a paid key.
Since April 2023, Keycloak has been a CNCF incubating project (Cloud Native Computing Foundation), which strengthens its independence and long-term governance. Red Hat remains the primary contributor but does not control the project unilaterally.
Free Community Keycloak
Full-featured, no cost, Apache 2.0 license. This is the upstream project from keycloak.org / GitHub. All features included: SSO, OIDC, SAML, user federation, fine-grained authorization, admin console, account console, themes, SPIs — everything.
Support comes from the community: GitHub issues, Keycloak forum, CNCF Slack. No SLA, no guaranteed response times, no backported security patches to older versions.
Paid Red Hat build of Keycloak
Same codebase, different binary, with support. Red Hat takes specific Keycloak versions, certifies them, applies additional QA/testing, and provides long-term support with backported security patches and bug fixes.
This replaced the older "Red Hat SSO" (RH-SSO) product in November 2023. It is not sold separately — it's included with Red Hat Runtimes, Red Hat Application Foundations, or OpenShift subscriptions.
Is it a different binary or just a license key?
It's a different binary — similar to the GitLab CE/EE model, but with an important distinction: there are no extra features in the Red Hat build. The differences are:
- Build & packaging — Red Hat builds from a specific Keycloak commit, applies their build pipeline, and produces container images hosted on registry.redhat.io. The community build comes from quay.io/keycloak.
- Certified dependencies — Red Hat pins and tests specific versions of Quarkus, Infinispan, and other dependencies. Community Keycloak uses latest upstream versions.
- Long-term support — Red Hat backports security fixes to their supported version streams for 2–3 years. Community Keycloak only patches the latest release.
- Support SLAs — Red Hat provides 24/7 support, SLA-backed response times, and access to Red Hat's engineering team for critical issues.
You cannot just apply a license key to community Keycloak to get Red Hat support. You need to deploy the Red Hat build of Keycloak binary/image to be covered by their support contract. It's a swap of the container image, not a license toggle.
Other commercial Keycloak vendors
Beyond Red Hat, a growing ecosystem of managed Keycloak providers exists. These are third-party companies — not affiliated with the Keycloak project — that offer hosted or managed Keycloak with their own support and SLAs. Examples include Phase Two, Skycloak, and Inteca, among others.
Some of these vendors add proprietary extensions (e.g., custom UIs, enhanced multi-tenancy, advanced analytics). These extensions are not part of upstream Keycloak and vary by vendor. Evaluate carefully whether their additions create vendor lock-in or are built as standard Keycloak SPIs that you could replace.
Community vs. Red Hat build — when does it matter?
The functional capabilities are identical. The decision comes down to operational and contractual needs:
- Choose community Keycloak if: the customer has a strong internal platform team, is comfortable staying on the latest release, can respond to CVEs by upgrading promptly, and doesn't need vendor-backed SLAs for procurement/compliance.
- Choose Red Hat build if: the customer needs long-term support on a pinned version (2–3 year lifecycle), requires vendor-backed security patch SLAs for compliance (SOC2, PCI, ISO 27001), needs someone to call at 2am when auth is down, or procurement requires a commercial support contract.
- Licensing model — Red Hat build is priced per CPU core (as part of their Runtimes/RHAF/OCP subscription), not per user. This is favorable for large user bases where per-user SaaS pricing (Auth0, Okta) becomes very expensive.
The migration path between community and Red Hat build is straightforward — same DB schema, same realm config, same API — it's essentially a container image swap.
Keycloak is 100% open source, Apache 2.0, no features behind a paywall. It is not open core. The Red Hat build adds long-term support, certified builds, and SLAs — but no extra features. It's a different binary (container image swap), not a license key applied to community Keycloak.
Consultant's Checklist
Before proposing a Keycloak deployment:
- How many users? — Determines sizing, DB capacity, and whether you need clustering. 10k vs 500k are very different architectures.
- Authentication sources? — LDAP/AD, social login, SAML IdPs, Kerberos/SPNEGO? Each adds complexity and testing surface.
- How many realms and clients? — Multi-tenant (realm per tenant) vs single realm with client scoping. Realm count affects admin overhead and resource consumption.
- Protocol requirements? — OIDC, SAML 2.0, or both? Legacy apps often need SAML. Determine token format needs (JWT claims mapping).
- HA requirements? — Keycloak is an authentication gateway — downtime means nobody can log in. Plan for multi-node with load balancing. Define RPO/RTO.
- Deployment target? — Kubernetes (Operator), VMs, or bare metal? Each has different operational patterns for upgrades, scaling, and monitoring.
- Database choice? — PostgreSQL (recommended), MariaDB, MySQL, Oracle, MSSQL. Managed vs self-hosted. Connection pooling strategy.
- Custom themes or SPIs? — Custom login pages, email templates, or Java extensions? These are the #1 upgrade blocker — budget for maintenance.
- Backup & restore plan? — DB-level backups (pg_dump/snapshot). Realm JSON exports for config portability. Test restores quarterly.
- Upgrade cadence? — Community KC has no LTS — only the latest release gets patches. Can the team upgrade quarterly? If not, consider Red Hat build of Keycloak.
- Network & security? — TLS everywhere, admin console access restrictions, brute-force protection, token lifespans, CORS policies, CSP headers.
- Monitoring? — Prometheus metrics, Grafana dashboards, alerting on login failures, DB pool saturation, JGroups cluster size, event forwarding to SIEM.