Keycloak Production Architecture Guide

Overview

Keycloak is an open-source Identity and Access Management (IAM) solution that provides single sign-on (SSO), identity brokering, user federation, and fine-grained authorization for modern applications. Originally created by Red Hat, it became a CNCF incubating project in April 2023.

Protocols: Standards-Based

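Keycloak supports OpenID Connect (OIDC), SAML 2.0, and OAuth 2.0. Tokens are issued as JWTs. Authorization code, client credentials, and device code flows work out of the box; token exchange is also available, but was a preview feature before KC 26.2, so check its status on the version you run.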

Architecture: Stateless Application

Keycloak is a stateless Java application built on Quarkus (since KC 17). All persistent state lives in the database. Session caches are distributed across nodes via embedded Infinispan. This makes horizontal scaling straightforward.

Core Concepts: Realms & Clients

Realms are isolated tenants — each has its own users, roles, clients, and configuration. Clients represent applications that delegate authentication to Keycloak. A typical deployment has one realm per environment or tenant.

Extensible: SPIs & Themes

Almost everything is customizable via Service Provider Interfaces (SPIs): authentication flows, user storage, event listeners, token mappers. Themes control the look of login and email pages using Freemarker templates. The admin and account consoles are React SPAs (since KC 22).

What Keycloak replaces

Keycloak is a self-hosted alternative to SaaS identity providers like Auth0, Okta, and Azure AD B2C. It provides equivalent functionality — SSO, MFA, social login, user federation, RBAC — without per-user pricing. Common migration drivers: cost (large user bases), data sovereignty, customization depth, and vendor lock-in avoidance.

Key capabilities

Single Sign-On & Single Sign-Out — across all applications in a realm, with session management
Identity Brokering — delegate auth to external IdPs (SAML, OIDC, social providers like Google, GitHub, Microsoft)
User Federation — sync users from LDAP/AD, custom user storage SPIs
Multi-Factor Authentication — TOTP, WebAuthn/FIDO2, configurable per-realm authentication flows
Fine-Grained Authorization — resource-based permissions using UMA 2.0, policies, and scopes
Admin Console & REST API — full management UI and comprehensive Admin API for automation
Account Console — self-service portal for users to manage profile, sessions, credentials, and linked accounts

Version Note

This guide covers Keycloak 25+ (Quarkus-based). If the customer is still on Wildfly-based Keycloak (pre-KC 17), prioritize migration — the Wildfly distribution was removed at KC 20 and receives no security patches. The latest release is KC 26.5.x.

Deployment Target: Kubernetes vs. VMs

The first architectural decision is where Keycloak will run. Both paths are well-supported, but they carry different operational trade-offs that ripple into clustering, upgrades, scaling, and configuration management.

Factor	Kubernetes	Virtual Machines
Best for	Teams with K8s maturity, cloud-native stacks	No K8s platform, strict compliance, ops preference
Lifecycle	Keycloak Operator handles rolling upgrades, scaling	Manual or Ansible/Terraform managed
Scaling	HPA / replica count — trivial horizontal scaling	Add nodes behind LB — manual provisioning
Complexity	Ingress, TLS certs, operator CRDs, namespace isolation	systemd, reverse proxy, firewall rules, config mgmt
Cluster discovery	DNS_PING / KUBE_PING — automatic	TCPPING / JDBC_PING — requires manual config or DB table

Recommendation

Kubernetes is the preferred path for most new deployments. Keycloak is inherently stateless — all persistent state lives in the database. Use the official Keycloak Operator (Quarkus-based).

When VMs are the right call

No existing K8s platform — introducing K8s just for Keycloak creates more risk than it solves.
Regulatory constraints — some industries require IAM on dedicated, isolated hosts.
Ops team preference — Keycloak on systemd behind Nginx/HAProxy is well-understood.
Air-gapped environments — container registry and operator lifecycle overhead is significant.

For VM deployments, use Ansible or Terraform for repeatable provisioning. Place 2+ nodes behind a load balancer with sticky sessions.

Container & K8s deployment patterns

Two main approaches on K8s:

Keycloak Operator (recommended) — manages Keycloak and KeycloakRealmImport CRDs. Handles pod lifecycle, DB migration coordination, and health checks.
Helm chart (community) — Bitnami chart is popular but not officially maintained by the project. More granular control but no operator lifecycle management.

Build a custom container image with themes and SPIs baked in:

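# Build stage: add themes and providers, then run the optimized build.
# For reproducible builds, replace "latest" with a pinned version tag in both stages.
FROM quay.io/keycloak/keycloak:latest AS builder
COPY themes/custom-theme /opt/keycloak/themes/custom-theme
COPY providers/custom-spi.jar /opt/keycloak/providers/
RUN /opt/keycloak/bin/kc.sh build

# Runtime stage: copy the pre-built server into a clean image.
FROM quay.io/keycloak/keycloak:latest
COPY --from=builder /opt/keycloak/ /opt/keycloak/
ENTRYPOINT ["/opt/keycloak/bin/kc.sh"]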

High Availability Architecture

Keycloak is in the critical authentication path for every application. If it goes down, users can't log in and tokens can't be issued. HA is not optional — it's the baseline.

Required: Load Balancer

Enable sticky sessions on AUTH_SESSION_ID cookie. Without it, mid-login users can be bounced to a node without their auth session, causing failures. Alternatively, configure fully distributed sessions and skip stickiness — but sticky is simpler.

Clustering: Infinispan / JGroups

Nodes cluster via embedded Infinispan using JGroups for transport. Caches: user sessions, auth sessions, offline tokens, action tokens, login failure counters.

K8s: DNS_PING or KUBE_PING
VMs: TCPPING or JDBC_PING

Port 7800 (JGroups) must be open between all KC nodes.

Persistent user sessions (KC 26+)

Since KC 26, persistent user sessions are enabled by default. All user sessions are now stored in both the database and Infinispan caches. Users remain logged in even after all Keycloak nodes are restarted or upgraded — a major improvement for HA. This is also a requirement for the multi-site architecture.

Cache topology & session ownership

Distributed caches (owners=2): user sessions, client sessions, offline sessions, authentication sessions, login failures, and action tokens. Each entry is stored on 2 nodes, so it survives the loss of one owner.
Replicated cache: the work cache, which carries invalidation messages between nodes. Every node has a full copy.
Local caches: realms, users, authorization data, and keys. Each node caches what it has read from the database and evicts entries when an invalidation message arrives.

For 100k+ sessions, consider tuning owners, eviction policies, or externalizing Infinispan.

Node failure behavior

What happens when a node dies:

Active user sessions — with owners=2, the session migrates to surviving nodes. No user impact on next token refresh. With persistent sessions (KC 26+ default), sessions also survive full cluster restarts.
In-flight auth sessions — may be lost. User restarts login flow (typically invisible — they just see the login page again).
Rebalancing — Infinispan redistributes cache entries across survivors. Brief CPU/memory spike on remaining nodes.

Keycloak does not have read-only or non-voting nodes like OpenBao/Vault. All nodes in a cluster are equal — every node can handle any request. There is no leader/follower distinction.

Cold spare nodes

Keycloak doesn't have a native cold-spare mode, but you can achieve this operationally. Deploy additional KC pods/nodes that are part of the cluster but have zero weight in the load balancer. They participate in cache replication (increasing data redundancy) but receive no traffic until you shift load to them. On K8s, keep extra replicas and adjust endpoint weighting. On VMs, configure the LB to mark them as backup/standby.

Minimum Topology

Deploy minimum 2 nodes (3+ recommended). With 3+ you tolerate a failure during a rolling upgrade.
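
The arithmetic behind the 3-node recommendation can be sketched in a few lines of Python. This is only an illustration: the function name and the assumption that one node restarts at a time are hypothetical and are not Keycloak settings.

```python
def nodes_still_serving(total_nodes: int, restarting: int = 1, failed: int = 1) -> int:
    """Nodes left to serve traffic while some restart and others fail.

    Assumes a rolling restart takes `restarting` nodes out at a time
    (an illustrative assumption, not a Keycloak setting).
    """
    return max(total_nodes - restarting - failed, 0)


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, "nodes ->", nodes_still_serving(n), "serving")
```

With 2 nodes, one restart plus one unplanned failure leaves nothing serving; with 3 nodes, one node remains.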

Replication & Multi-Site

Replication in Keycloak happens at two layers: the cache layer (Infinispan) and the database layer (PostgreSQL etc.). Understanding both is essential for HA and DR design.

Within a Cluster: Intra-site Replication

Within a single Keycloak cluster (single site/datacenter), replication is handled entirely by embedded Infinispan. Session data is distributed across nodes using consistent hashing with configurable owner counts. Realm and client metadata is cached locally on each node and kept consistent through cluster-wide invalidation messages. Database state is shared: all nodes read/write to the same DB instance. No application-level DB replication is needed within a single site.

Across Clusters: Cross-site Replication

For multi-datacenter deployments, you need replication at both layers: an external Infinispan cluster with XSITE (cross-site) replication for cache data, and synchronous database replication between sites. Keycloak's official multi-site guide supports exactly two sites — more than two is explicitly unsupported due to latency amplification and split-brain complexity.

Cross-site Infinispan architecture

The official Keycloak multi-site setup (documented since KC 24, with significant improvements in KC 26 including true active-active support) uses external Infinispan clusters — one per site — connected via XSITE replication:

Each site runs its own Infinispan Data Grid cluster (3+ nodes) as a separate deployment from Keycloak.
The two Infinispan clusters connect via RELAY2/XSITE protocol over a dedicated network link (JGroups bridge stack). Communication uses TLS with mutual authentication.
Keycloak nodes connect to their local Infinispan cluster via remote-store (Hot Rod protocol), not embedded Infinispan.
When data changes in Site A's Infinispan, the XSITE backup replicates it to Site B's Infinispan synchronously.
This is how cache invalidation messages propagate — when a user's session is updated in Site A, Site B's cache is invalidated or updated via this XSITE channel.

Important: The Red Hat build of Keycloak requires Red Hat Data Grid (the commercial Infinispan product) for multi-site. Community Keycloak uses upstream Infinispan Server.

Synchronous vs. async XSITE: Keycloak's official guidance strongly recommends synchronous cross-site replication. Async replication can lead to stale caches — e.g., a user changes their password on Site A, but Site B still has the old password hash cached, allowing login with the old password until the cache is invalidated. The trade-off is that synchronous replication adds latency to every write (requires low-latency link, e.g., same region, different AZs).

Multi-site DR patterns

Option A: No DR Infrastructure (Cold Provision)

The database replicates to the paired region, but there are no Keycloak VMs sitting there. If the primary region goes down, you spin up VMs from IaC (Terraform), deploy Keycloak via Ansible, promote the database replica, and point DNS at the new Application Gateway.

RTO: 30–60 minutes — provisioning infrastructure from scratch during an outage.
Cost: Cheapest option. No idle compute in DR region.
Risk: IaC and Ansible must be tested regularly. Cloud capacity in the DR region is not guaranteed during a regional outage — you may not be able to provision the VMs you need.

Option B: Cold Spare VMs

VMs exist in the DR region and Keycloak is installed, but the service is stopped. No Application Gateway is routing traffic to them.

On failover: Promote database, start Keycloak services, update Application Gateway or DNS.
RTO: 10–15 minutes — infrastructure is already there, just starting services and cutting over.
Cost: Paying for stopped VMs (minimal compute cost, still paying for disks).

Option C: Warm Standby

Keycloak is running in the DR region, connected to the read replica, but not receiving traffic. On failover: promote database from read replica to primary, shift Application Gateway or DNS.

RTO: ~5 minutes — fastest of the passive options.
Catch: Keycloak tries to write session data on startup, which fails against a read-only database. You'd need to keep Keycloak stopped or in a degraded state anyway.
Cost: Paying for 3 running VMs that do nothing most of the time.
In practice: Option C often collapses into Option B because of the read-only DB limitation.

Key Insight — Read-Only DB Limitation

Keycloak is not a read-only application. During normal operation it writes session data, login failure counters, event logs, user last-login timestamps, and brute-force detection state. A Keycloak instance connected to a read-only database replica will fail to start or fail on the first login attempt. This is why Option C (Warm Standby) rarely works as advertised: you cannot keep Keycloak "warm" against a read replica without it erroring out on writes.

Option D: Active-Active (maximum complexity)

Both sites serve traffic simultaneously via a global load balancer. This is the only pattern that provides near-zero RTO (no failover needed — traffic just shifts), but it comes with the highest cost and operational complexity. KC 26 introduced official active-active multi-site support with persistent user sessions and improved cache invalidation.

Requires external Infinispan clusters with XSITE synchronous replication at both sites.
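Database must be synchronously replicated between sites. Keycloak's reference architecture uses a multi-AZ Amazon Aurora PostgreSQL cluster. Aurora Global Database replicates asynchronously across regions, so it does not meet this requirement on its own. CockroachDB and PostgreSQL with BDR should be checked against Keycloak's supported-database list and configured for synchronous commit before you rely on them.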
Split-brain handling: if sites lose connectivity, the global LB must route all traffic to one site. Keycloak has no built-in split-brain resolution — Infinispan XSITE handles cache conflicts, but DB-level conflicts require the database's own conflict resolution.
Only officially supported with exactly 2 sites.

DR options comparison

Option	RTO	Cost	Complexity	Catch
A — Cold Provision	30–60 min	Lowest	Medium	IaC must be tested; DR region capacity not guaranteed
B — Cold Spare VMs	10–15 min	Low	Low	Paying for idle disks; Keycloak version must be kept in sync
C — Warm Standby	~5 min	Medium	Medium	Read-only DB breaks Keycloak — collapses into Option B in practice
D — Active-Active	~0	Highest	Very High	Requires Infinispan XSITE, sync DB replication, 2 sites max
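
As a rough aid for discussing the table with a customer, the selection logic can be sketched in Python. The RTO values are the upper bounds from the table and the cost ranks simply encode its Lowest-to-Highest ordering; the class, function, and parameter names are hypothetical.

```python
from typing import NamedTuple, Optional


class DrOption(NamedTuple):
    name: str
    worst_case_rto_min: int  # upper bound of the RTO range in the table
    cost_rank: int           # 1 = lowest cost, 4 = highest


# Values transcribed from the comparison table above.
OPTIONS = [
    DrOption("A: Cold Provision", 60, 1),
    DrOption("B: Cold Spare VMs", 15, 2),
    DrOption("C: Warm Standby", 5, 3),
    DrOption("D: Active-Active", 0, 4),
]


def cheapest_option(target_rto_min: int,
                    allow_warm_standby: bool = False) -> Optional[DrOption]:
    """Return the lowest-cost option whose worst-case RTO meets the target.

    Option C is skipped by default because of the read-only DB limitation.
    """
    candidates = [
        o for o in OPTIONS
        if o.worst_case_rto_min <= target_rto_min
        and (allow_warm_standby or not o.name.startswith("C"))
    ]
    return min(candidates, key=lambda o: o.cost_rank) if candidates else None


if __name__ == "__main__":
    for target in (60, 20, 5):
        print(target, "min ->", cheapest_option(target).name)
```

A 20-minute target lands on Option B. A 5-minute target jumps straight to Option D unless the warm-standby caveat has been resolved.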

Limitations & what Keycloak doesn't do

No read-only replicas — unlike databases, Keycloak has no concept of a read-replica site. Every active site is a full read-write participant.
No non-voting nodes — unlike Consul/OpenBao/etcd, there are no "voter" vs. "non-voter" roles. All nodes are equal peers in the Infinispan cluster.
Two sites max: the official multi-site architecture is tested and supported with exactly two sites. Each additional site adds write latency and more split-brain scenarios to handle.
Low-latency required for sync XSITE — the two sites should be in the same region (different AZs), not across continents.
XSITE state transfer — if one site goes offline and comes back, you need to perform a manual state transfer to resynchronize Infinispan caches. This involves clearing the offline site's caches and doing a full push from the active site.

Database

The database stores all persistent state: realm configuration, clients, users, credentials, roles, groups, events. It is the most critical component.

Critical

The database is the real single point of failure. All KC nodes connect to the same DB. If the DB goes down, every node goes down. The database must be independently HA.

Database	Status	Notes
PostgreSQL	Recommended	Best tested, widest support. Patroni, RDS, CloudSQL, Azure DB for HA.
MySQL / MariaDB	Supported	InnoDB required. Galera or managed services for HA.
Oracle	Supported	Only when customer has existing Oracle licensing/DBA expertise.
MS SQL	Supported	Less common. Always On AG for HA. Also works with Azure SQL Database.

Note: The Keycloak project considers PostgreSQL as its primary target database. MySQL, MariaDB, Oracle, and MS SQL are supported but receive less testing focus. The project has indicated plans to narrow database support over time.

PostgreSQL HA patterns

Patroni + etcd — de facto standard for self-managed PostgreSQL HA. Automatic leader election and failover.
Streaming replication — synchronous recommended for RPO=0. Async acceptable if some data loss is tolerable.
Connection pooling — PgBouncer between KC and PostgreSQL. KC opens many connections under load.

Managed services (RDS, CloudSQL, Azure DB) provide built-in HA with multi-AZ, automated backups, and PITR.

Connection pool tuning

KC_DB_POOL_INITIAL_SIZE=25
KC_DB_POOL_MIN_SIZE=25
KC_DB_POOL_MAX_SIZE=100

Verify your DB can handle (KC nodes × max pool size) total connections. Monitor agroal.active.count, agroal.available.count, agroal.awaiting.count (should be zero — if not, pool is too small).
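
The capacity check above is simple multiplication, sketched here in Python. The `reserved` headroom and the example limit of 200 connections are assumptions for illustration, not Keycloak or PostgreSQL defaults.

```python
def total_db_connections(kc_nodes: int, pool_max_size: int) -> int:
    """Worst case: every node fills its pool at the same time."""
    return kc_nodes * pool_max_size


def fits_database(kc_nodes: int, pool_max_size: int,
                  db_max_connections: int, reserved: int = 10) -> bool:
    """True if the worst case still leaves `reserved` connections free.

    `reserved` is an assumed allowance for admin, monitoring and backup
    sessions; it is not a Keycloak or PostgreSQL default.
    """
    return total_db_connections(kc_nodes, pool_max_size) <= db_max_connections - reserved


if __name__ == "__main__":
    # 3 nodes with KC_DB_POOL_MAX_SIZE=100 against an assumed limit of 200.
    print(total_db_connections(3, 100))
    print(fits_database(3, 100, 200))
```

If the check fails, either lower the per-node pool maximum, raise the database connection limit, or put a pooler such as PgBouncer in between.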

Warning

On Kubernetes, do not run the database inside the same cluster as Keycloak for production. A K8s failure would take down both.

Backup & Restore

Keycloak's persistent state lives almost entirely in the database. Backups are therefore primarily a database concern — but there are other artifacts to include. The most important thing about backups is that they're tested. An untested backup is not a backup.

Primary: Database Backups

The database contains everything: realm config, users, hashed credentials, client registrations, roles, groups, events, offline sessions. Two complementary strategies:

Logical backups — pg_dump (or equivalent). Full point-in-time snapshot. Good for portability and selective restore.
Continuous archival — PostgreSQL WAL archival for PITR. Enables restore to any point in time, not just the last dump. Essential for minimising data loss.

Supplemental: What Else to Back Up

Custom themes & SPI JARs — should be in Git; also baked into container images.
TLS certs & keystores — store in Vault / secrets manager.
Keycloak config — keycloak.conf, env vars, Helm values, operator CRDs. Version control.
Infinispan config — custom cache-ispn.xml if used.
Realm exports (JSON): useful for config-as-code. Admin console exports omit users and mask client secrets; kc.sh export can include users and their hashed credentials, so treat those files as sensitive.

Automated backup strategy

Schedule and retention:

WAL archival — continuous, to object storage (S3, GCS, Azure Blob). Use pgBackRest, barman, or wal-g. This is your primary recovery mechanism.
Logical dumps — daily pg_dump --format=custom. Retain 30 days minimum. Store off-site (different region/account).
Managed services — RDS/CloudSQL automated backups provide both snapshot and PITR. Enable with appropriate retention (default is often only 7 days — increase to 30+).

Automation tools:

pgBackRest — best-in-class for PostgreSQL. Supports full/incremental/differential backups, parallel compression, encryption at rest, and S3/GCS storage.
CronJob on K8s — for logical dumps, run a K8s CronJob that executes pg_dump and uploads to object storage. Include a verification step that restores to a temp DB.
K8s Velero — can back up PVCs, but this is a storage-level backup, not application-consistent. Don't rely on Velero alone for DB backups.

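#!/bin/bash
# Example: automated pg_dump to S3
set -euo pipefail  # stop on the first failed command or unset variable

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
FILENAME="keycloak_backup_${TIMESTAMP}.dump"
trap 'rm -f "/tmp/${FILENAME}"' EXIT  # remove the local copy even if a step fails

# DB_HOST and DB_USER must be set; the password comes from ~/.pgpass or PGPASSWORD
pg_dump --host="$DB_HOST" --username="$DB_USER" \
  --format=custom --file="/tmp/${FILENAME}" keycloak
aws s3 cp "/tmp/${FILENAME}" \
  "s3://backups-bucket/keycloak/${FILENAME}" \
  --storage-class STANDARD_IA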
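
The 30-day retention rule can be enforced by parsing the timestamp that the script embeds in each file name. The sketch below is a minimal Python illustration; the function names are hypothetical, and listing or deleting objects in the bucket is deliberately left out.

```python
from datetime import datetime, timedelta
from typing import Iterable, List

PREFIX = "keycloak_backup_"
SUFFIX = ".dump"
STAMP_FORMAT = "%Y%m%d_%H%M%S"  # same layout as `date +%Y%m%d_%H%M%S`


def backup_time(filename: str) -> datetime:
    """Parse the timestamp embedded in a backup file name."""
    if not (filename.startswith(PREFIX) and filename.endswith(SUFFIX)):
        raise ValueError(f"not a backup file name: {filename}")
    return datetime.strptime(filename[len(PREFIX):-len(SUFFIX)], STAMP_FORMAT)


def expired_backups(filenames: Iterable[str], now: datetime,
                    retention_days: int = 30) -> List[str]:
    """Return the names that are older than the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return sorted(f for f in filenames if backup_time(f) < cutoff)
```

In practice an object-storage lifecycle policy can do the same job; the point here is only to show how the naming scheme makes retention decisions mechanical.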

Restoring from backup

Full restore procedure:

1. Stop all Keycloak nodes. No KC instance should be writing to the database during restore.
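2. Restore the database. For pg_dump backups: pg_restore --clean --create --dbname=postgres backup.dump. With --create, connect to a maintenance database such as postgres; pg_restore then drops and recreates the keycloak database named in the dump, which it cannot do while connected to that database. For PITR: restore the base backup and replay WALs to the desired timestamp.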
3. Verify the DB. Connect directly and spot-check: realm exists, user count is correct, client registrations are present.
4. Start Keycloak. KC will connect to the restored DB. Infinispan caches will rebuild from the DB on startup (this is automatic). First startup after restore may be slower as caches warm up.
5. Validate. Test login flows, token issuance, LDAP sync, admin console access. With persistent user sessions (KC 26+ default), user sessions survive in the DB and users may not need to re-authenticate. On older versions, users will need to re-authenticate since session caches were lost.

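Partial / selective restore: Keycloak doesn't support restoring a single realm from a database backup. It's all or nothing at the DB level. For realm-level recovery, realm JSON exports are more useful. You can import a realm JSON to recreate the config, clients, and roles. Whether users come back depends on how the export was taken: an admin console export contains no users and masks client secrets, so users must be re-synced or reset their passwords, while a kc.sh export with --users includes users and their hashed credentials.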

Backups during upgrades

Upgrades are the most critical time to have a reliable backup, because Keycloak applies irreversible Liquibase schema migrations on startup.

Always take a fresh backup immediately before upgrading — not a day-old scheduled backup.
Use a consistent snapshot — ensure no KC nodes are writing during the backup. Shut down all KC nodes, take the backup, then start the upgrade.
Label the backup — tag it clearly as a pre-upgrade backup with the current KC version and the target version.
Test the restore path first — before upgrading production, restore the backup to a staging DB, run the upgrade against it, and verify.

If the upgrade fails and you need to roll back: Stop all KC nodes immediately. Restore the DB from the pre-upgrade backup. Redeploy the previous KC version. There is no Liquibase rollback — schema changes are forward-only. The only rollback path is restoring the DB.

Restore testing cadence

Schedule restore tests quarterly at minimum:

Restore to a staging environment (isolated DB + KC instance).
Verify: realm config loads, users can log in, tokens are issued, LDAP sync runs, custom themes render, admin console works.
Measure actual RTO (time from "start restore" to "first successful login") and compare against the customer's target.
Document the procedure as a runbook with exact commands, expected timings, and verification steps.
Rotate the person running the test — don't let it be single-threaded knowledge.

Realm Export Commands

# Export all realms
/opt/keycloak/bin/kc.sh export \
  --dir /tmp/realm-exports --users realm_file

# Export specific realm
/opt/keycloak/bin/kc.sh export \
  --dir /tmp/realm-exports --realm my-realm \
  --users realm_file

Upgrade Strategy

Staying current is important for security, but upgrades need careful planning due to irreversible database schema changes.

Release Cadence

Community Keycloak targets 4 minor releases per year (roughly quarterly) and a major release every 2–3 years. Starting with KC 26, backwards compatibility is guaranteed for fully supported features and APIs within a major version — breaking changes in minors are opt-in. Preview features and non-public APIs may change at any time.

Only the latest release gets security patches. There is no LTS for community Keycloak. If a critical CVE drops, you must upgrade to the current release to get the fix.

Support Lifecycle

Community: no long-term support. Only the latest major.minor gets patches.

Red Hat build: minimum 2-year support lifecycle for RHBK 26.x (3 years for 27.x onwards). Full support until next major ships, then 6+ months maintenance. Red Hat skips some upstream versions, cherry-picking stable releases.

If the customer cannot upgrade frequently, the Red Hat build is strongly recommended for its backported security patches.

Database Migrations

Keycloak runs Liquibase changelogs on startup. First pod applies the migration; others wait. Always back up before upgrading. There is no schema downgrade — rollback = restore DB from backup.

Breakage: Themes & SPIs

Custom themes and SPI JARs are the most common breakage. Freemarker templates and SPI interfaces change between majors. Pin to specific KC versions and test thoroughly.

Step-by-step upgrade runbook

Pre-upgrade (1–2 weeks):

Read release notes and migration guide for every version between current and target.
Audit custom themes and SPIs for compatibility.
Update custom container image to new KC base. Run kc.sh build.

Staging (1 week):

Restore production DB copy into staging.
Deploy new KC version. Verify Liquibase migration completes.
Test: login flows, token issuance, themes, SPIs, admin console, LDAP sync.

Production:

Fresh DB backup immediately before starting.
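K8s: let the Operator roll out the new image. By default it stops all pods before starting the new version when the image changes, because mixed versions in one cluster are only safe when Keycloak reports them as compatible (typically patch releases). The first pod to start runs the migration; the others detect that the schema is current.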
VMs: blue-green deployment.
Monitor 1–2 hours: login rates, errors, latency, sessions.

Rolling back failed upgrades

Keycloak's Liquibase migrations are forward-only. There is no kc.sh rollback command. If an upgrade fails:

Stop all KC nodes immediately. Don't let them keep trying to start against a partially-migrated DB.
Assess the failure. Check logs for the specific Liquibase error. Common causes: custom schema modifications conflicting with changelogs, insufficient DB permissions, unexpected column types.
Option A: Fix forward. If the failure is a known issue with a workaround, apply the fix and restart KC. Liquibase tracks which changelogs have run and will resume from where it failed.
Option B: Full rollback. Restore the DB from the pre-upgrade backup. Redeploy the previous KC version. Guaranteed to work if you have a good backup.
Never manually edit Liquibase tracking tables (DATABASECHANGELOG, DATABASECHANGELOGLOCK) unless you deeply understand the consequences.

Lock table stuck: If KC was killed mid-migration, the Liquibase lock table may be stuck. Clear it:

UPDATE DATABASECHANGELOGLOCK
SET LOCKED = FALSE, LOCKGRANTED = NULL, LOCKEDBY = NULL
WHERE ID = 1;

Skipping versions & legacy migration

Skipping versions: Liquibase changelogs are cumulative. Going from v22→v25 applies all intermediate changelogs in sequence. Read migration guides for every skipped version. Large jumps take longer and carry more risk.

Wildfly → Quarkus migration: The Wildfly distribution was removed at KC 20 (2022). Migration involves rewriting standalone-ha.xml to keycloak.conf/env vars, replacing Wildfly-specific SPIs, updating custom themes, and changing deployment tooling (no more WARs). Note that the /auth context path was also removed by default in the Quarkus distribution. Prioritise this if the customer is still on Wildfly.

LDAP / Active Directory Integration

Almost every enterprise deployment involves LDAP/AD. Keycloak's User Federation provider handles this, but several design decisions significantly affect the architecture.

Design: Federation Mode

On-demand (default): users imported to KC's DB on first login. LDAP stays source of truth for credentials.

Periodic batch sync: full or changed-user sync on a schedule. Pre-populates the user list for admin visibility.

Decision: Read vs. Write

Read-only (most common): KC reads users/groups, never writes back.

Writable: password and profile changes propagate to LDAP. Only enable if explicitly needed.

Kerberos / SPNEGO for Windows SSO

Requirements: SPN registered in AD (HTTP/keycloak.example.com@EXAMPLE.COM), keytab file, browser config (Group Policy), and correct DNS (forward + reverse). Kerberos is extremely DNS-sensitive — hostname mismatch is the #1 failure cause. NTP sync is critical (5-minute clock skew tolerance).

LDAP mappers, groups & multi-directory

Mappers: User Attribute, Group, Role, Hardcoded Role, MSAD User Account Control. Plan mapping strategy early — it directly affects token claims.

Multiple directories per realm supported: different providers with independent connection settings, mappers, sync schedules, and priority ordering.

Common LDAP pitfalls

Bind credentials — dedicated service account, minimum permissions.
LDAPS — always encrypt. Import CA cert into JVM truststore.
Pagination — AD defaults to 1000 result limit. KC handles paging but verify with ldapsearch.
Initial sync — 100k+ users takes time and memory. Run during maintenance window.
Username/email uniqueness — conflicts can block imports.
Referrals — multi-domain AD forests may return referrals. Configure handling correctly.

Kubernetes-Specific Guidance

Operator: Keycloak Operator

Official Quarkus-based operator. Manages Keycloak and KeycloakRealmImport CRDs. Dedicated namespace with scoped RBAC.

Ingress: Proxy Headers

Set KC_PROXY_HEADERS=xforwarded and KC_HTTP_ENABLED=true. Missing X-Forwarded-Proto causes redirect loops. Don't use path rewriting. Set KC_HOSTNAME to the full public URL (hostname v2, the default since KC 25).

Sizing: Resources

Start: 2 replicas, 1–2 CPU, 1–2 GB RAM per pod. CPU-heavy during RSA signing and password hashing. Load test to tune.

Resilience: PDB & Affinity

PodDisruptionBudget with minAvailable: 1. Anti-affinity across nodes/zones.

Ingress & TLS termination patterns

Edge (most common): TLS at Ingress, HTTP to KC. Passthrough: TLS direct to KC. Re-encrypt: TLS at Ingress + new TLS to KC.

# Hostname v2 configuration (default since KC 25)
KC_PROXY_HEADERS=xforwarded
KC_HTTP_ENABLED=true
KC_HOSTNAME=https://keycloak.example.com
# KC_HOSTNAME_URL and KC_HOSTNAME_STRICT_HTTPS are hostname v1 options and are not used by hostname v2; set KC_HOSTNAME to a full URL instead

Health probes & startup

Enable KC_HEALTH_ENABLED=true. Since KC 25, health and metrics endpoints are served on the management port 9000 (not the main HTTP port 8080). Use /health/started for startup probe (KC can take 30–90s during migrations), /health/ready for readiness, /health/live for liveness.

startupProbe:
  httpGet: { path: /health/started, port: 9000 }
  failureThreshold: 30
  periodSeconds: 5
readinessProbe:
  httpGet: { path: /health/ready, port: 9000 }
  periodSeconds: 10
livenessProbe:
  httpGet: { path: /health/live, port: 9000 }
  periodSeconds: 15
  failureThreshold: 3
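
A quick way to sanity-check these probe values is to compute how long each probe waits before Kubernetes acts. The Python sketch below multiplies failureThreshold by periodSeconds, which is the usual approximation; the function names are hypothetical.

```python
def probe_budget_seconds(failure_threshold: int, period_seconds: int,
                         initial_delay_seconds: int = 0) -> int:
    """Approximate time a container gets before the probe gives up."""
    return initial_delay_seconds + failure_threshold * period_seconds


def covers_startup(failure_threshold: int, period_seconds: int,
                   worst_case_startup_seconds: int) -> bool:
    """True if the probe budget is at least the worst-case startup time."""
    return probe_budget_seconds(failure_threshold, period_seconds) >= worst_case_startup_seconds


if __name__ == "__main__":
    # Startup probe above: 30 failures x 5 s, against a 90 s worst case.
    print(probe_budget_seconds(30, 5), covers_startup(30, 5, 90))
```

The startup probe above allows about 150 seconds, which covers the 30 to 90 second range quoted for migrations. If large schema migrations take longer in staging, raise failureThreshold rather than periodSeconds.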

Namespace, RBAC & network policies

Dedicated namespace. Scoped RBAC (no cluster-admin). NetworkPolicies restricting ingress/egress. External secrets for credentials. Run as non-root with read-only root filesystem.

Security Hardening

Keycloak is your IdP — if compromised, every downstream app is compromised. Many settings are not enabled by default.

Admin Console

Restrict /admin and master realm to internal networks. Never expose publicly.

Default Off: Brute Force

Enable per realm. Configure max failures, wait increment, lockout.

Tokens: Lifespans

Access: 5 min. Refresh: 30 min. SSO idle: 30 min. SSO max: 10 hrs. Shorter is better.
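
Realm settings store these lifespans in seconds, so it helps to convert once and check that the values nest sensibly. The Python sketch below assumes the field names accessTokenLifespan, ssoSessionIdleTimeout, and ssoSessionMaxLifespan from the Admin REST API realm representation; verify them against the Keycloak version in use. The helper names are hypothetical.

```python
MINUTE = 60
HOUR = 60 * MINUTE


def realm_token_settings() -> dict:
    """Lifespans from the guidance above, expressed in seconds."""
    return {
        "accessTokenLifespan": 5 * MINUTE,
        "ssoSessionIdleTimeout": 30 * MINUTE,
        "ssoSessionMaxLifespan": 10 * HOUR,
    }


def check_ordering(settings: dict) -> list:
    """Return a list of problems; an empty list means the values nest correctly."""
    problems = []
    if settings["accessTokenLifespan"] > settings["ssoSessionIdleTimeout"]:
        problems.append("access token outlives the SSO idle timeout")
    if settings["ssoSessionIdleTimeout"] > settings["ssoSessionMaxLifespan"]:
        problems.append("SSO idle timeout exceeds the SSO max lifespan")
    return problems


if __name__ == "__main__":
    settings = realm_token_settings()
    print(settings, check_ordering(settings))
```

The resulting dictionary could be merged into a realm JSON or sent through whatever automation already manages the realm; how it is applied is outside this sketch.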

Passwords: Hashing

argon2id (default since KC 25; KC 24 uses PBKDF2-SHA512 210K iterations). Min 12 chars. History, complexity rules.

TLS everywhere

Encrypt every hop: Client→LB (TLS 1.2+, HSTS), LB→KC (re-encrypt if policy requires), KC→DB (sslmode=verify-full), KC→KC (JGroups SYM_ENCRYPT), KC→LDAP (LDAPS port 636).

Admin API & service accounts

Don't use master admin for automation. Dedicated service accounts with minimal roles. Client credentials grant for S2S. Enable adminEventsEnabled + adminEventsDetailsEnabled.

Key rotation & token signing

RS256 default. Consider ES256 for shorter tokens. No auto-rotation — automate via Admin API. Keycloak recommends rotating every 3–6 months (annually at absolute minimum). Keep old key passive until all tokens signed with it expire. Clients cache JWKS — most re-fetch on kid mismatch.
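
The "re-fetch on kid mismatch" behavior that makes rotation safe can be illustrated with a small Python sketch. This is not the code of any particular client library; JwksCache and the injected fetch callable are hypothetical, and a real client would fetch the realm's JWKS endpoint over HTTPS.

```python
from typing import Callable, Dict, Optional

Jwk = Dict[str, str]


class JwksCache:
    """Caches signing keys by kid and reloads once when a kid is unknown."""

    def __init__(self, fetch: Callable[[], Dict[str, list]]):
        self._fetch = fetch          # returns a JWKS document: {"keys": [...]}
        self._keys: Dict[str, Jwk] = {}

    def _reload(self) -> None:
        self._keys = {k["kid"]: k for k in self._fetch().get("keys", [])}

    def get(self, kid: str) -> Optional[Jwk]:
        """Return the key for `kid`, re-fetching the JWKS at most once."""
        if kid not in self._keys:
            self._reload()
        return self._keys.get(kid)
```

A token signed with a newly rotated key triggers exactly one reload, after which both the old (passive) and new keys are served from the cache. Production clients usually also rate-limit reloads so that tokens with random kid values cannot force a fetch on every request.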

Monitoring & Observability

Enable KC_METRICS_ENABLED=true. Scrape /metrics on management port 9000 (since KC 25) with Prometheus. Build Grafana dashboards.

Key Metrics

Login success/failure rates — brute-force detection, IdP outages
Token endpoint latency — p50/p95/p99
Active sessions — capacity planning
DB connection pool — alert at 80% saturation
JGroups cluster size — should match expected node count
JVM heap / GC — memory pressure signals

Alerting rules

Login failure rate > 50/min (5 min sustained) → possible attack
DB pool > 80% → increase pool or investigate slow queries
JGroups members ≠ expected → node left cluster
Token p99 > 2s → performance degradation
5xx rate > 1% → check logs
JVM heap > 85% for 10 min → memory pressure
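
To make the thresholds concrete, the rules above can be expressed as predicates over a metrics snapshot. The Python sketch below uses hypothetical snapshot keys rather than real Keycloak metric names, and it ignores the "sustained for N minutes" conditions, which an alerting system such as Prometheus would handle.

```python
from typing import Callable, Dict, List, Tuple

Snapshot = Dict[str, float]

# (alert name, predicate) pairs mirroring the thresholds listed above.
RULES: List[Tuple[str, Callable[[Snapshot], bool]]] = [
    ("possible attack", lambda m: m["login_failures_per_min"] > 50),
    ("DB pool saturation", lambda m: m["db_pool_used"] / m["db_pool_max"] > 0.80),
    ("node left cluster", lambda m: m["jgroups_members"] != m["expected_members"]),
    ("token latency degraded", lambda m: m["token_p99_seconds"] > 2.0),
    ("elevated 5xx rate", lambda m: m["http_5xx_ratio"] > 0.01),
    ("memory pressure", lambda m: m["jvm_heap_ratio"] > 0.85),
]


def firing(snapshot: Snapshot) -> List[str]:
    """Return the names of all rules whose threshold is exceeded."""
    return [name for name, check in RULES if check(snapshot)]


if __name__ == "__main__":
    healthy = {
        "login_failures_per_min": 3, "db_pool_used": 20, "db_pool_max": 100,
        "jgroups_members": 3, "expected_members": 3,
        "token_p99_seconds": 0.4, "http_5xx_ratio": 0.001, "jvm_heap_ratio": 0.55,
    }
    print(firing(healthy))
```

Keeping the rules in one table like this makes it easy to review thresholds with the customer before translating them into the alerting tool's own rule syntax.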

Events, logging & SIEM

User events (login, logout, register) and admin events (every admin API change). Store in DB with configurable expiry or forward to SIEM via custom Event Listener SPI. Ship logs to ELK/Loki/Datadog. INFO for prod, selective DEBUG for troubleshooting.

Licensing & Open Source vs. Enterprise

Keycloak is proper open source, not open core. It is licensed under Apache License 2.0 — one of the most permissive open-source licenses available. Every feature in Keycloak is available to everyone. There are no features gated behind a commercial license, no "enterprise edition" binary with extra capabilities, and no feature flags that unlock with a paid key.

Since April 2023, Keycloak is a CNCF incubating project (Cloud Native Computing Foundation), which further solidifies its independence and long-term governance. Red Hat remains the primary contributor but does not control the project unilaterally.

Free Community Keycloak

Full-featured, no cost, Apache 2.0 license. This is the upstream project from keycloak.org / GitHub. All features included: SSO, OIDC, SAML, user federation, fine-grained authorization, admin console, account console, themes, SPIs — everything.

Support comes from the community: GitHub issues, Keycloak forum, CNCF Slack. No SLA, no guaranteed response times, no backported security patches to older versions.

Paid Red Hat build of Keycloak

Same codebase, different binary, with support. Red Hat takes specific Keycloak versions, certifies them, applies additional QA/testing, and provides long-term support with backported security patches and bug fixes.

This replaced the older "Red Hat SSO" (RH-SSO) product in November 2023. It is not sold separately — it's included with Red Hat Runtimes, Red Hat Application Foundations, or OpenShift subscriptions.

Is it a different binary or just a license key?

It's a different binary — similar to the GitLab CE/EE model, but with an important distinction: there are no extra features in the Red Hat build. The differences are:

Build & packaging — Red Hat builds from a specific Keycloak commit, applies their build pipeline, and produces container images hosted on registry.redhat.io. The community build comes from quay.io/keycloak.
Certified dependencies — Red Hat pins and tests specific versions of Quarkus, Infinispan, and other dependencies. Community Keycloak uses latest upstream versions.
Long-term support — Red Hat backports security fixes to their supported version streams for 2–3 years. Community Keycloak only patches the latest release.
Support SLAs — Red Hat provides 24/7 support, SLA-backed response times, and access to Red Hat's engineering team for critical issues.

You cannot just apply a license key to community Keycloak to get Red Hat support. You need to deploy the Red Hat build of Keycloak binary/image to be covered by their support contract. It's a swap of the container image, not a license toggle.

Other commercial Keycloak vendors

Beyond Red Hat, a growing ecosystem of managed Keycloak providers exists. These are third-party companies — not affiliated with the Keycloak project — that offer hosted or managed Keycloak with their own support and SLAs. Examples include Phase Two, Skycloak, and Inteca, among others.

Some of these vendors add proprietary extensions (e.g., custom UIs, enhanced multi-tenancy, advanced analytics). These extensions are not part of upstream Keycloak and vary by vendor. Evaluate carefully whether their additions create vendor lock-in or are built as standard Keycloak SPIs that you could replace.

Community vs. Red Hat build — when does it matter?

The functional capabilities are identical. The decision comes down to operational and contractual needs:

Choose community Keycloak if: the customer has a strong internal platform team, is comfortable staying on the latest release, can respond to CVEs by upgrading promptly, and doesn't need vendor-backed SLAs for procurement/compliance.
Choose Red Hat build if: the customer needs long-term support on a pinned version (2–3 year lifecycle), requires vendor-backed security patch SLAs for compliance (SOC2, PCI, ISO 27001), needs someone to call at 2am when auth is down, or procurement requires a commercial support contract.
Licensing model — Red Hat build is priced per CPU core (as part of their Runtimes/RHAF/OCP subscription), not per user. This is favorable for large user bases where per-user SaaS pricing (Auth0, Okta) becomes very expensive.

The migration path between community and Red Hat build is straightforward — same DB schema, same realm config, same API — it's essentially a container image swap.

Summary

Keycloak is 100% open source, Apache 2.0, no features behind a paywall. It is not open core. The Red Hat build adds long-term support, certified builds, and SLAs — but no extra features. It's a different binary (container image swap), not a license key applied to community Keycloak.

Consultant's Checklist

Before proposing a Keycloak deployment:

How many users? — Determines sizing, DB capacity, and whether you need clustering. 10k vs 500k are very different architectures.
Authentication sources? — LDAP/AD, social login, SAML IdPs, Kerberos/SPNEGO? Each adds complexity and testing surface.
How many realms and clients? — Multi-tenant (realm per tenant) vs single realm with client scoping. Realm count affects admin overhead and resource consumption.
Protocol requirements? — OIDC, SAML 2.0, or both? Legacy apps often need SAML. Determine token format needs (JWT claims mapping).
HA requirements? — Keycloak is an authentication gateway — downtime means nobody can log in. Plan for multi-node with load balancing. Define RPO/RTO.
Deployment target? — Kubernetes (Operator), VMs, or bare metal? Each has different operational patterns for upgrades, scaling, and monitoring.
Database choice? — PostgreSQL (recommended), MariaDB, MySQL, Oracle, MSSQL. Managed vs self-hosted. Connection pooling strategy.
Custom themes or SPIs? — Custom login pages, email templates, or Java extensions? These are the #1 upgrade blocker — budget for maintenance.
Backup & restore plan? — DB-level backups (pg_dump/snapshot). Realm JSON exports for config portability. Test restores quarterly.
Upgrade cadence? — Community KC has no LTS — only the latest release gets patches. Can the team upgrade quarterly? If not, consider Red Hat build of Keycloak.
Network & security? — TLS everywhere, admin console access restrictions, brute-force protection, token lifespans, CORS policies, CSP headers.
Monitoring? — Prometheus metrics, Grafana dashboards, alerting on login failures, DB pool saturation, JGroups cluster size, event forwarding to SIEM.

JGroups

Reliable group communication for Java applications

What is JGroups?

JGroups is a Java library for reliable messaging between processes in a cluster. It handles group membership (who's in the cluster), failure detection (who left), message ordering, and message delivery guarantees. Think of it as TCP but for groups of nodes instead of point-to-point.

Keycloak doesn't use JGroups directly — it uses Infinispan for caching, and Infinispan uses JGroups as its transport layer. So JGroups is the network plumbing underneath Keycloak's distributed caches.

How it works in Keycloak

When a Keycloak node starts, JGroups runs a discovery protocol to find other nodes and form a cluster. Once joined, JGroups handles:

Cluster membership — detecting when nodes join or leave (including crashes)
Message transport — sending cache updates between nodes (session created, session invalidated, realm config changed)
Failure detection — heartbeat-based detection of unresponsive nodes, triggering Infinispan rebalancing
Flow control — preventing fast senders from overwhelming slow receivers

Discovery protocols

The discovery protocol determines how nodes find each other on startup:

KUBE_PING — queries the Kubernetes API for pods with matching labels. Requires RBAC permissions to list pods.
DNS_PING — resolves a DNS name (headless Service in K8s) to find peers. Needs no Kubernetes API access, and it is what Keycloak's built-in kubernetes cache stack uses. Recommended for K8s.
TCPPING — static list of IP:port pairs. Works everywhere but requires manual config updates when nodes change.
JDBC_PING — nodes register themselves in a shared database table. Useful for VM deployments where the DB is already shared.

Key configuration

Keycloak exposes JGroups config via environment variables:

# Kubernetes — use DNS_PING with headless service
KC_CACHE_STACK=kubernetes
# Discovery also needs the headless service name, e.g.
# JAVA_OPTS_APPEND=-Djgroups.dns.query=keycloak-headless.<namespace>.svc.cluster.local

# VMs — use JDBC_PING (auto-discovers via shared DB)
KC_CACHE_STACK=jdbc-ping

# VMs — use TCPPING (static node list)
KC_CACHE_STACK=tcp
# Requires custom cache-ispn.xml with TCPPING initial_hosts

Ports & networking

Port 7800 (TCP) — JGroups cluster communication. Must be open between all KC nodes.
Port 57800 (TCP) — socket-based failure detection (FD_SOCK2) in recent Keycloak releases. The port is derived from the bind port by a configurable offset, so verify it against the cache stack of the version you deploy.
For cross-site (XSITE), a separate RELAY2 bridge channel connects Infinispan clusters across sites, also over JGroups.

In Kubernetes, pod-to-pod traffic on these ports normally works without extra configuration unless NetworkPolicies restrict it. On VMs, ensure firewall rules allow TCP 7800 and the failure-detection port between all Keycloak hosts.

Encryption

JGroups traffic is unencrypted by default. For production, enable SYM_ENCRYPT (all nodes share a secret key from a keystore) or ASYM_ENCRYPT (the shared key is distributed to joining nodes using per-node public/private key pairs). Keycloak's Quarkus distribution supports this via the cache-ispn.xml configuration file.

Troubleshooting tip: If nodes aren't clustering, check: 1) firewall rules on port 7800, 2) correct discovery protocol for your environment, 3) nodes can resolve each other's hostnames, 4) JGroups bind address is correct (not 127.0.0.1).
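
For the first item in that tip, a quick reachability probe can be run from one Keycloak host against another. A minimal sketch using only the standard library; the function name is illustrative and the default port simply mirrors the 7800 mentioned above.

```python
import socket

def can_connect(host: str, port: int = 7800, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

A `False` result only shows that the TCP handshake failed from this host. It does not distinguish a firewall drop from a node that is not listening, so check the peer's bind address as well.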

Infinispan

Distributed in-memory data grid for caching & state

What is Infinispan?

Infinispan is a distributed in-memory key/value data store written in Java. It provides caching, clustering, and data replication. Originally created by Red Hat (with a commercial version called Red Hat Data Grid), it's the backbone of Keycloak's session management and state distribution.

How Keycloak uses Infinispan

Keycloak embeds Infinispan to manage several types of cached data across the cluster:

User sessions — active login sessions. Distributed cache (owners=2). Losing one node doesn't lose sessions.
Authentication sessions — in-progress login flows. Short-lived. Lost if the owning node dies mid-login.
Offline sessions — persistent refresh tokens surviving restarts. Distributed.
Login failure counters — brute-force protection state. Distributed.
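Realm & client caches — configuration data. Held in a local cache on every node; when an entry changes, an invalidation message is sent through the replicated work cache so that other nodes re-read it from the database.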
Action tokens — email verification, password reset links. Distributed.

Embedded vs. External Infinispan

Embedded (default, single-site): Infinispan runs inside the Keycloak JVM process. No separate infrastructure needed. JGroups handles inter-node communication. This is sufficient for most single-datacenter deployments.

External (required for multi-site): A standalone Infinispan Server cluster runs separately from Keycloak. KC connects via the Hot Rod protocol. Required for XSITE (cross-site) replication because the Infinispan clusters at each site need to communicate independently of Keycloak.

Cache modes

Distributed — data split across nodes with N owners (default 2). Scalable. Used for session data.
Replicated — full copy on every node. Fast reads, expensive writes. Used for the work cache that carries invalidation messages between nodes.
Local — not shared. Used for the realm, user, and authorization caches that sit in front of the database.
Invalidation — local caches that broadcast invalidation messages. Node re-reads from DB on miss.
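
To make "N owners" concrete, here is a deliberately simplified placement sketch: each key is hashed to a starting node and stored on that node plus the next `owners - 1` nodes. This is not Infinispan's actual algorithm (which assigns hash segments to nodes); it only illustrates why losing a single node with owners=2 leaves one copy of every entry. Node names are placeholders.

```python
import hashlib

def owners_for(key: str, nodes: list, owners: int = 2) -> list:
    """Pick `owners` distinct nodes for a key (toy placement, not Infinispan's)."""
    ring = sorted(nodes)
    digest = hashlib.sha256(key.encode()).digest()
    start = int.from_bytes(digest[:8], "big") % len(ring)
    # Never ask for more copies than there are nodes.
    count = min(owners, len(ring))
    return [ring[(start + i) % len(ring)] for i in range(count)]
```

For any key in a three-node cluster, removing the first owner still leaves the second owner among the survivors, which is the property the distributed session caches rely on.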

Tuning for large deployments

<!-- cache-ispn.xml — increase owners for higher redundancy -->
<distributed-cache name="sessions" owners="3">
  <memory max-count="10000"/>
</distributed-cache>

owners — number of copies per entry. 2 is default. 3 for critical deployments. More owners = more memory + network overhead.
Eviction — set memory bounds to prevent OOM. Evicted entries are re-read from DB (sessions) or trigger re-login (auth sessions).
State transfer — when a node joins/leaves, Infinispan rebalances entries. This causes a brief CPU/memory spike.

Key decision: For single-site deployments with fewer than 100k active sessions, embedded Infinispan is fine. Only externalize Infinispan when you need multi-site (XSITE) replication or when cache memory pressure requires dedicating separate infrastructure.

Quarkus

Supersonic Subatomic Java framework

What is Quarkus?

Quarkus is a Java framework designed for cloud-native applications. It's built on established standards (CDI, JAX-RS, JPA) but optimized for fast startup, low memory, and container-first deployment. Created by Red Hat, it's the runtime that powers Keycloak since version 17.

Why Keycloak moved to Quarkus

Keycloak originally ran on Wildfly (formerly JBoss Application Server) — a full-featured Java EE application server. The move to Quarkus (completed in KC 17, Wildfly support dropped at KC 20) brought:

Faster startup — Quarkus performs build-time optimization. Keycloak starts in 5–15 seconds vs. 30–90 seconds on Wildfly.
Lower memory — roughly 50% less heap and metaspace usage for equivalent workloads.
Simpler configuration — keycloak.conf or environment variables replace the complex standalone-ha.xml.
Container-optimized — smaller images, better suited for K8s with predictable resource usage.
Build-time augmentation — kc.sh build pre-compiles configuration, themes, and providers into an optimized runtime image.

The build step

Quarkus-based Keycloak has a distinct build phase that's different from traditional Java apps:

# Build phase — pre-compiles config, discovers providers
/opt/keycloak/bin/kc.sh build \
  --db=postgres \
  --features=docker,token-exchange

# Start phase — uses the optimized build output
/opt/keycloak/bin/kc.sh start --optimized \
  --hostname=keycloak.example.com \
  --db-url=jdbc:postgresql://db:5432/keycloak

The build step resolves providers, optimizes class loading, and locks in certain configuration. This is why Keycloak Dockerfiles use a two-stage pattern: run kc.sh build in a builder stage, then copy the result into the final runtime stage.

Configuration model

All Keycloak configuration maps to environment variables or keycloak.conf:

KC_DB → database vendor
KC_HOSTNAME → public hostname
KC_PROXY_HEADERS → proxy mode (xforwarded or forwarded)
KC_CACHE_STACK → JGroups discovery protocol
KC_LOG_LEVEL → logging verbosity

Build-time options (features, DB type, cache stack) require a rebuild. Runtime options (hostname, credentials, log level) can change on restart without rebuilding.
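
The mapping between CLI options and environment variables follows one rule: prefix with KC_, uppercase, and turn dashes into underscores. A tiny sketch of that rule; the helper name is made up, and SPI-specific options follow their own naming scheme that is not covered here.

```python
def env_var_for(option: str) -> str:
    """Translate a kc.sh option such as --db-url into its KC_* variable name."""
    name = option.lstrip("-")
    return "KC_" + name.replace("-", "_").upper()
```

For example, `--db-url` becomes `KC_DB_URL` and `--proxy-headers` becomes `KC_PROXY_HEADERS`, matching the list above.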

Migration note: If the customer is still on Wildfly-based Keycloak (pre-KC 17), migrating to Quarkus involves: rewriting standalone-ha.xml to env vars, updating custom SPIs to Quarkus-compatible APIs, rebuilding custom themes (Freemarker login templates mostly unchanged), changing deployment tooling (no more WAR deployments), and updating all client URLs — the /auth context path is removed by default in the Quarkus distribution (can be re-added via http-relative-path=/auth).

Liquibase

Java-based database schema migration and changelog tool

What is Liquibase?

Liquibase is an open-source database schema change management tool written in Java. It tracks, versions, and applies database schema changes (called changelogs) in a deterministic, repeatable way. Think of it as "Git for your database schema" — every change is recorded, ordered, and applied exactly once.

Liquibase supports multiple changelog formats (XML, YAML, JSON, SQL) and works with virtually every relational database. Keycloak uses XML-formatted changelogs bundled inside the Keycloak JAR.

How Keycloak uses Liquibase

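Every time Keycloak starts, it checks whether the database schema is up to date. If a new version introduces schema changes, Keycloak by default runs the corresponding Liquibase changelogs automatically on startup. The schema upgrade itself cannot be skipped: the new version will not run against an old schema. Keycloak's upgrade documentation does describe a manual migration strategy, in which the pending changes are written to a SQL file for a DBA to review and apply, so confirm which strategy is configured before an upgrade.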

Changelog execution — Keycloak bundles all migration changelogs inside its JARs. On startup, Liquibase compares what has been applied (tracked in the DB) against what exists in the changelog files, and executes any unapplied changesets.
First-pod wins — in a multi-node cluster, the first pod to start acquires a Liquibase lock and runs the migration. All other pods wait for the lock to release before proceeding with their own startup. This is why the first pod in a rolling upgrade takes longer to become ready.
Idempotent checks — Liquibase tracks each changeset by ID, author, and filepath. It will not re-run a changeset that has already been applied, even if you restart Keycloak multiple times.
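
The "applied exactly once" behaviour can be sketched with an in-memory SQLite database. This is a toy model of the bookkeeping, not Liquibase's code: the table has far fewer columns than the real DATABASECHANGELOG, and the changesets in the usage note are invented.

```python
import sqlite3

def apply_pending(conn: sqlite3.Connection, changesets: list) -> list:
    """Run changesets not yet recorded; return the IDs that were executed.

    Each changeset is a tuple (id, author, filename, sql).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS DATABASECHANGELOG ("
        "ID TEXT, AUTHOR TEXT, FILENAME TEXT, ORDEREXECUTED INTEGER, "
        "PRIMARY KEY (ID, AUTHOR, FILENAME))"
    )
    applied = {tuple(row) for row in
               conn.execute("SELECT ID, AUTHOR, FILENAME FROM DATABASECHANGELOG")}
    executed = []
    for cid, author, filename, sql in changesets:
        if (cid, author, filename) in applied:
            continue  # already recorded, so it is never re-run
        conn.execute(sql)
        order = conn.execute("SELECT COUNT(*) FROM DATABASECHANGELOG").fetchone()[0] + 1
        conn.execute("INSERT INTO DATABASECHANGELOG VALUES (?, ?, ?, ?)",
                     (cid, author, filename, order))
        executed.append(cid)
    conn.commit()
    return executed
```

Calling `apply_pending` twice with the same list runs every changeset on the first call and nothing on the second, which mirrors a Keycloak restart on an already-migrated database.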

Tracking tables

Liquibase creates two metadata tables in the Keycloak database:

DATABASECHANGELOG — records every changeset that has been applied: ID, author, filename, date executed, checksum, execution status. This is how Liquibase knows what has already run.
DATABASECHANGELOGLOCK — a single-row lock table that prevents multiple nodes from running migrations simultaneously. Contains a boolean LOCKED flag, the lock timestamp, and the identity of the node holding the lock.

-- Inspect applied changelogs
SELECT id, author, filename, dateexecuted, orderexecuted
FROM DATABASECHANGELOG
ORDER BY orderexecuted DESC
LIMIT 20;

-- Check lock status
SELECT * FROM DATABASECHANGELOGLOCK;
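
The single-row lock can be pictured as a compare-and-set on the LOCKED flag. The sketch below uses SQLite and a cut-down table to show the idea; it is an illustration under those assumptions, not the locking code Keycloak or Liquibase actually run, and the node names are placeholders.

```python
import sqlite3

def init_lock_table(conn: sqlite3.Connection) -> None:
    """Create a simplified lock table with its single unlocked row."""
    conn.execute("CREATE TABLE DATABASECHANGELOGLOCK ("
                 "ID INTEGER PRIMARY KEY, LOCKED INTEGER, LOCKEDBY TEXT)")
    conn.execute("INSERT INTO DATABASECHANGELOGLOCK VALUES (1, 0, NULL)")
    conn.commit()

def try_acquire_lock(conn: sqlite3.Connection, node: str) -> bool:
    """Set LOCKED only if it is currently clear; True means this node won."""
    cur = conn.execute(
        "UPDATE DATABASECHANGELOGLOCK SET LOCKED = 1, LOCKEDBY = ? "
        "WHERE ID = 1 AND LOCKED = 0", (node,))
    conn.commit()
    return cur.rowcount == 1

def release_lock(conn: sqlite3.Connection) -> None:
    """Clear the flag so the next node can proceed."""
    conn.execute("UPDATE DATABASECHANGELOGLOCK "
                 "SET LOCKED = 0, LOCKEDBY = NULL WHERE ID = 1")
    conn.commit()
```

If the node holding the lock dies before calling `release_lock`, the flag stays set and every later `try_acquire_lock` returns False. That is the stuck-lock situation in miniature.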

Forward-only migrations

This is the single most important thing to understand about Liquibase in Keycloak: migrations are forward-only. There is no kc.sh rollback command. Keycloak does not ship rollback changesets. Once a migration runs, the only way to undo it is to restore the database from a pre-upgrade backup.

This has critical implications for upgrade planning:

Always take a consistent database backup immediately before upgrading — not a day-old scheduled backup.
Stop all Keycloak nodes before taking the backup to ensure no writes are in flight.
Test the upgrade against a copy of production data in staging first. This validates both the migration itself and its execution time.
Large version jumps (e.g., v22 to v25) apply all intermediate changelogs sequentially. This can take minutes on large databases and carries cumulative risk.

When migrations fail

Migration failures are rare but high-impact. Common causes:

Custom schema modifications — if someone manually altered Keycloak's tables (added columns, changed types, added constraints), a changelog may fail because the expected pre-condition doesn't match.
Insufficient DB permissions — the Keycloak DB user needs DDL privileges (CREATE, ALTER, DROP on tables and indexes).
Lock table stuck — if Keycloak was killed mid-migration (OOMKilled, node crash, manual kill), the lock row stays in a locked state. All subsequent startups will hang waiting for a lock that will never release.

-- Fix a stuck Liquibase lock (ONLY when you are certain no migration is running)
UPDATE DATABASECHANGELOGLOCK
SET LOCKED = FALSE, LOCKGRANTED = NULL, LOCKEDBY = NULL
WHERE ID = 1;

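After clearing a stuck lock, investigate why the migration was interrupted. Check whether any partial DDL was applied by comparing the most recent rows in DATABASECHANGELOG with the changeset named in the Keycloak startup log. A changeset that failed partway is normally not recorded as applied, so Liquibase will attempt it again on the next start and may collide with objects that already exist. You may need to manually complete or revert the partial changes, or restore from backup.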

Operational tips

Monitor first-pod startup time — during upgrades, the migration pod will take significantly longer. Set your Kubernetes startupProbe failureThreshold high enough to accommodate this (5+ minutes for large DBs).
Never manually edit DATABASECHANGELOG unless you fully understand the consequences. Deleting rows will cause Liquibase to re-run those changesets, which will almost certainly fail on a non-empty database.
Checksum validation — Liquibase stores an MD5 checksum for each changeset. If someone modifies a changelog file after it has been applied, Liquibase will fail with a checksum mismatch on next startup. This is a safety mechanism — don't override it.
Migration duration — for databases with millions of rows in user/session tables, migrations involving schema changes on those tables can take several minutes. Plan maintenance windows accordingly.

Bottom line: Liquibase makes Keycloak upgrades irreversible at the database level. Your only rollback path is a pre-upgrade database backup. This is why the backup-before-upgrade discipline is non-negotiable — it is the only safety net.

OpenID Connect (OIDC)

Identity layer built on top of OAuth 2.0

What is OIDC?

OpenID Connect is an authentication protocol built as an identity layer on top of OAuth 2.0. Where OAuth 2.0 is an authorization framework (delegating access to resources), OIDC adds standardized identity — it answers "who is this user?" in a structured, verifiable way.

OIDC is the dominant modern authentication protocol for web, mobile, and API applications. It replaced older approaches (SAML for SPAs, custom token schemes, session cookies across domains) with a JSON/REST-based standard that works natively with JavaScript frontends and mobile apps.

Key flows

OIDC and the OAuth 2.0 specifications it builds on define several grant types (flows) for different use cases:

Authorization Code — the standard flow for web apps. User is redirected to Keycloak, authenticates, and is redirected back with an authorization code. The app exchanges the code for tokens server-side. Always use with PKCE (Proof Key for Code Exchange) — even for confidential clients, PKCE is now best practice.
Client Credentials — machine-to-machine (M2M) authentication. No user involved. The client authenticates directly with its client ID and secret to get an access token. Used for service-to-service communication, batch jobs, and API integrations.
Device Code — for input-constrained devices (smart TVs, CLI tools, IoT). The device displays a code, the user enters it on a separate device with a browser, and the device polls for the resulting token.
Token Exchange — exchanging one token for another. Used for delegation scenarios (user A's token exchanged for a token allowing service B to act on behalf of A). Must be explicitly enabled in Keycloak (--features=token-exchange).

Deprecated flows: Implicit flow (tokens in URL fragment — insecure) and Resource Owner Password Credentials (ROPC — sends username/password directly to the app). Both are still supported in Keycloak but should not be used in new applications.
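
Since PKCE is recommended for every Authorization Code client, here is a short sketch of the client-side half using the S256 method from RFC 7636. The function names are illustrative; the client sends `code_challenge` (with `code_challenge_method=S256`) in the authorization request and the original `code_verifier` in the token request.

```python
import base64
import hashlib
import secrets

def new_code_verifier() -> str:
    """Random, URL-safe verifier (43 characters from 32 random bytes)."""
    return base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()

def code_challenge_s256(verifier: str) -> str:
    """S256 challenge: Base64URL(SHA-256(verifier)) without padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
```

The server recomputes the challenge from the verifier at the token endpoint, so an attacker who intercepts only the authorization code cannot redeem it.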

The three tokens

OIDC issues three types of tokens, each serving a distinct purpose:

ID Token — contains identity claims about the authenticated user (sub, name, email, groups). Consumed by the client application to know who logged in. Short-lived (minutes). Never send the ID token to an API — it's for the client, not for resource servers.
Access Token — authorizes API requests. Sent in the Authorization: Bearer header. Contains scopes and permissions. Resource servers validate this token (via signature verification or introspection). Short-lived (5–15 minutes recommended).
Refresh Token — used to obtain new access tokens without re-authenticating the user. Longer-lived (30 minutes to hours). Stored securely by the client. Rotation is recommended — Keycloak supports refresh token rotation where each use invalidates the old token and issues a new one.

JWT structure and signing

Keycloak issues tokens as JSON Web Tokens (JWTs) — a compact, URL-safe format consisting of three Base64URL-encoded parts separated by dots:

header.payload.signature

# Header — algorithm and key ID
{"alg": "RS256", "typ": "JWT", "kid": "abc123"}

# Payload — claims (identity + authorization data)
{"sub": "user-uuid", "iss": "https://keycloak.example.com/realms/myrealm",
 "aud": "my-client", "exp": 1711234567, "iat": 1711234267,
 "realm_access": {"roles": ["admin", "user"]},
 "name": "Jane Doe", "email": "jane@example.com"}

# Signature — RS256 or ES256 over header+payload

RS256 (RSA + SHA-256) — default in Keycloak. 2048-bit key. Widely supported. Produces larger tokens: a 2048-bit RSA signature is 256 bytes, which is 342 characters once Base64URL-encoded.
ES256 (ECDSA + SHA-256) — 64-byte signatures (86 characters encoded) at comparable security. Good choice when token size matters (e.g., headers hitting size limits).

Resource servers validate tokens by fetching Keycloak's public key from the JWKS endpoint and verifying the signature locally. No network call to Keycloak is needed for each request — this is what makes JWT-based auth scalable.
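
When troubleshooting, it helps to look inside a token. The sketch below splits a JWT and decodes the header and payload. It performs no signature verification, so it is for inspection only and must not be used to make authorization decisions. The function names are illustrative.

```python
import base64
import json

def b64url_decode(part: str) -> bytes:
    """Decode Base64URL text, restoring the padding that JWTs strip."""
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))

def decode_unverified(token: str) -> tuple:
    """Return (header, payload) dicts from a JWT WITHOUT checking the signature."""
    header_b64, payload_b64, _signature_b64 = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    payload = json.loads(b64url_decode(payload_b64))
    return header, payload
```

The `kid` in the header tells you which realm key signed the token, and `iss`, `aud`, and `exp` in the payload are usually the first claims to check when a resource server rejects it.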

OIDC vs. SAML

Both are SSO protocols, but they serve different eras and use cases:

OIDC — JSON-based, REST-native, compact tokens (JWTs), works well with SPAs/mobile/APIs, modern.
SAML 2.0 — XML-based, SOAP-era, verbose assertions, designed for server-rendered web apps, mature enterprise standard.
When to use SAML — legacy enterprise apps that only support SAML (e.g., older ServiceNow, Salesforce classic, on-prem apps). Many SaaS providers support both but default to SAML for enterprise SSO.
When to use OIDC — everything new. SPAs, mobile apps, microservices, APIs, any greenfield development.

Keycloak supports both protocols simultaneously. A single realm can have OIDC clients and SAML clients side by side, sharing the same user pool, sessions, and authentication flows.

Keycloak's well-known endpoint

Every OIDC-compliant provider publishes a discovery document at a standard URL:

https://keycloak.example.com/realms/{realm}/.well-known/openid-configuration

This returns a JSON document containing all the endpoints a client needs: authorization endpoint, token endpoint, userinfo endpoint, JWKS URI, supported scopes, supported grant types, and more. Client libraries use this for auto-configuration — you typically only need to provide the issuer URL and the library discovers everything else.

The JWKS endpoint (/realms/{realm}/protocol/openid-connect/certs) publishes the realm's public signing keys. Resource servers fetch and cache these keys to validate token signatures. When Keycloak rotates keys, a new kid (key ID) appears in tokens, and clients re-fetch the JWKS on the next verification attempt.
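
The re-fetch-on-unknown-kid behaviour can be sketched as a small cache. The fetch function is injected so the example stays free of network code; in practice it would download and parse the JWKS document. The class name is hypothetical, and a production version should also rate-limit refreshes so that tokens with random kid values cannot trigger a fetch per request.

```python
from typing import Callable

class JwksCache:
    """Cache signing keys by kid and refresh once when a kid is unknown."""

    def __init__(self, fetch: Callable[[], dict]):
        self._fetch = fetch
        self._keys = {}

    def get(self, kid: str) -> dict:
        if kid not in self._keys:
            # Unknown kid: the realm may have rotated keys, so reload the set.
            self._keys = {key["kid"]: key for key in self._fetch()["keys"]}
        if kid not in self._keys:
            raise KeyError(f"unknown kid: {kid}")
        return self._keys[kid]
```

Repeated lookups for a known kid are served from memory, and the first token signed with a rotated key causes exactly one reload.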

Consultant tip: When onboarding a new application, always start with the well-known endpoint. Use curl to fetch it and verify the issuer URL, supported flows, and signing algorithms match what the application expects. Mismatched issuer URLs are the #1 cause of "token validation failed" errors.
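
The checks in that tip can be scripted once the discovery document has been fetched and parsed. A sketch under a few assumptions: the function names are made up, and the caller supplies the issuer, grant type, and signing algorithm the application expects. The metadata field names are the standard ones from OIDC Discovery.

```python
def discovery_url(base_url: str, realm: str) -> str:
    """Build the well-known URL for a realm."""
    return f"{base_url.rstrip('/')}/realms/{realm}/.well-known/openid-configuration"

def check_discovery(doc: dict, expected_issuer: str,
                    grant_type: str, signing_alg: str) -> list:
    """Return a list of mismatches between the document and expectations."""
    problems = []
    # The issuer must match exactly, including scheme, port, and trailing slash.
    if doc.get("issuer") != expected_issuer:
        problems.append(f"issuer mismatch: {doc.get('issuer')!r} != {expected_issuer!r}")
    if grant_type not in doc.get("grant_types_supported", []):
        problems.append(f"grant type not supported: {grant_type}")
    if signing_alg not in doc.get("id_token_signing_alg_values_supported", []):
        problems.append(f"signing algorithm not supported: {signing_alg}")
    return problems
```

An empty list means the three expectations line up. A typical first finding is an issuer that differs only by hostname or port because the application reaches Keycloak through a different address than the one Keycloak advertises.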

Kerberos / SPNEGO

Network authentication protocol and Windows SSO mechanism

What is Kerberos?

Kerberos is a network authentication protocol that uses symmetric-key cryptography and a trusted third party (the Key Distribution Center, or KDC) to authenticate users and services to each other without transmitting passwords over the network. It was developed at MIT and is the default authentication protocol in Active Directory environments.

In enterprise environments, Kerberos is the mechanism behind Windows desktop SSO — users log into their workstation once (domain login) and are transparently authenticated to internal web applications without entering credentials again.

What is SPNEGO?

SPNEGO (Simple and Protected GSSAPI Negotiation Mechanism) is a negotiation wrapper that allows Kerberos to work over HTTP. Browsers and web servers use SPNEGO to negotiate which authentication mechanism to use — in practice, it almost always resolves to Kerberos.

The flow works like this:

User's browser sends a request to the Keycloak login page.
Keycloak responds with HTTP 401 and header WWW-Authenticate: Negotiate.
The browser obtains a Kerberos service ticket for Keycloak's SPN from the KDC (using the user's existing TGT from their domain login).
The browser resends the request with Authorization: Negotiate <base64-encoded-ticket>.
Keycloak validates the ticket using the keytab file containing the service's secret key.
If valid, the user is authenticated — no password prompt, no redirect, completely transparent.
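
The header handling in steps 2 and 4 can be sketched as follows. This only parses the Authorization header; validating the ticket itself requires a GSSAPI/Kerberos library and the keytab, which is out of scope here. The function name is illustrative.

```python
import base64
from typing import Optional

def extract_negotiate_token(authorization: Optional[str]) -> Optional[bytes]:
    """Return the raw SPNEGO token, or None if the server should challenge."""
    if not authorization:
        return None
    scheme, _, value = authorization.partition(" ")
    if scheme.lower() != "negotiate" or not value.strip():
        return None
    try:
        return base64.b64decode(value.strip(), validate=True)
    except ValueError:
        # Malformed Base64 is treated the same as a missing token.
        return None
```

A `None` result corresponds to step 2: respond with HTTP 401 and `WWW-Authenticate: Negotiate`. A bytes result is what gets handed to the Kerberos layer for validation in step 5.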

How Kerberos authentication works (detailed)

Understanding the ticket-based flow helps with troubleshooting:

TGT (Ticket-Granting Ticket) — when the user logs into their Windows workstation, the machine contacts the KDC (domain controller) and obtains a TGT. This TGT is cached in the user's credential cache and proves their identity to the KDC for subsequent requests.
Service Ticket — when the user needs to access a Kerberos-protected service (like Keycloak), their machine presents the TGT to the KDC and requests a service ticket for the specific SPN. The KDC issues a ticket encrypted with the service's secret key.
Validation — the service (Keycloak) decrypts the ticket using its keytab file, which contains the same secret key. If decryption succeeds and the ticket is not expired, the user is authenticated.

The entire flow happens without the user's password ever being sent to Keycloak or transmitted over the network after the initial domain login.

Requirements in Keycloak

Setting up Kerberos/SPNEGO in Keycloak requires careful preparation across multiple systems:

SPN (Service Principal Name) — register in Active Directory: HTTP/keycloak.example.com@EXAMPLE.COM. The hostname must exactly match what users type in their browsers. Use setspn -S on the domain controller; unlike -A, it checks for duplicate SPNs before adding.
Keytab file — generated from AD using ktpass (Windows) or msktutil (Linux). Contains the service's secret key. Mount as a file in the Keycloak container or place on the VM filesystem. Protect it like a private key.
DNS (forward + reverse) — Kerberos is extremely DNS-sensitive. The hostname in the SPN must resolve via forward DNS to the correct IP, and that IP must reverse-resolve back to the same hostname. Mismatched DNS is the #1 cause of Kerberos failures.
NTP synchronization — Kerberos has a default 5-minute clock skew tolerance. If the clocks on the client, KDC, and Keycloak server drift beyond this, tickets are rejected silently. NTP is non-negotiable.
Keycloak configuration — configure an LDAP User Federation provider with Kerberos integration enabled. Specify the keytab path, the Kerberos realm (e.g., EXAMPLE.COM), and the server principal.

# Generate keytab on Windows domain controller (cmd.exe, where ^ continues a line)
ktpass -princ HTTP/keycloak.example.com@EXAMPLE.COM ^
  -mapuser keycloak_svc@EXAMPLE.COM ^
  -pass * -crypto AES256-SHA1 ^
  -ptype KRB5_NT_PRINCIPAL ^
  -out keycloak.keytab

# Verify keytab on Linux
klist -kt keycloak.keytab
# Should show: HTTP/keycloak.example.com@EXAMPLE.COM

# Test ticket acquisition
kinit -kt keycloak.keytab HTTP/keycloak.example.com@EXAMPLE.COM
klist

Browser configuration

Browsers do not send Kerberos tickets by default — they must be configured to trust the Keycloak domain. This is typically managed via Group Policy in enterprise environments:

Internet Explorer / Edge — add the Keycloak URL to the Local Intranet zone. Enable "Integrated Windows Authentication" in Internet Options > Advanced.
Chrome — inherits IE settings on Windows. On Linux/macOS, set the --auth-server-allowlist flag or use policy: AuthServerAllowlist = "*.example.com".
Firefox — set network.negotiate-auth.trusted-uris to https://keycloak.example.com in about:config, or deploy via Group Policy / policies.json.

Use Group Policy Objects (GPOs) to push browser settings to all domain-joined workstations. This ensures consistent Kerberos behavior without requiring user action.

Common failure causes

Kerberos troubleshooting follows a consistent pattern — these are the usual suspects, in order of frequency:

DNS mismatch (#1) — the hostname in the URL doesn't match the SPN, or reverse DNS doesn't match forward DNS. Always verify with nslookup in both directions.
Clock skew — clocks differ by more than 5 minutes between client, KDC, and Keycloak. Check with w32tm /query /status (Windows) or timedatectl (Linux).
Missing or wrong SPN — duplicate SPNs in AD, SPN registered on the wrong account, or hostname casing mismatch. Check with setspn -Q HTTP/keycloak.example.com.
Browser not configured — browser doesn't trust the domain, so it never attempts Negotiate auth. User sees a login form instead of transparent SSO.
Keytab mismatch — keytab was regenerated (e.g., password reset on the service account) but not updated on the Keycloak server. The old keytab can't decrypt tickets encrypted with the new key.
Encryption type mismatch — AD and Keycloak disagree on supported encryption types (e.g., AD only allows AES256 but the keytab was generated with RC4). Align encryption types in AD security policy and keytab generation.
Firewall blocking KDC — the Keycloak server (or the user's workstation) cannot reach the KDC on port 88 (TCP/UDP). Kerberos requires direct connectivity to the domain controller.
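
The first two checks in the list above (DNS agreement and clock skew) are easy to script. The following is a minimal sketch, not a Keycloak tool: the function names are hypothetical, the resolver arguments exist only so the logic can be exercised without network access, and the 300-second limit is the 5-minute tolerance mentioned above.

```python
import socket
from typing import Callable, Tuple


def dns_round_trip(
    hostname: str,
    forward: Callable[[str], str] = socket.gethostbyname,
    reverse: Callable[[str], tuple] = socket.gethostbyaddr,
) -> Tuple[bool, str]:
    """Resolve hostname -> IP -> hostname and report whether the names agree."""
    ip = forward(hostname)
    reverse_name = reverse(ip)[0]
    # Compare case-insensitively and ignore a trailing dot on the PTR result.
    matches = reverse_name.rstrip(".").lower() == hostname.rstrip(".").lower()
    return matches, f"{hostname} -> {ip} -> {reverse_name}"


def clock_skew_ok(local_epoch: float, remote_epoch: float, max_skew_s: int = 300) -> bool:
    """True when two clocks are within the 5-minute Kerberos tolerance."""
    return abs(local_epoch - remote_epoch) <= max_skew_s
```

Calling dns_round_trip("keycloak.example.com") with the default resolvers performs real lookups; run it from a workstation and from the Keycloak host, since the two may see different DNS views.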

Kerberos + OIDC: the bridge pattern

A common enterprise pattern: users authenticate to Keycloak via Kerberos/SPNEGO (transparent Windows SSO), and Keycloak issues OIDC tokens to the downstream application. This bridges the legacy Windows authentication world with modern OIDC-based apps. The application never needs to understand Kerberos — it only sees standard OIDC tokens.

Troubleshooting checklist: When Kerberos SSO fails, check in this order: 1) DNS forward + reverse, 2) setspn -Q for SPN, 3) clock skew on all parties, 4) keytab validity with kinit -kt, 5) browser trusted sites config, 6) Keycloak logs at DEBUG level for org.keycloak.federation.kerberos.

SAML 2.0

Security Assertion Markup Language — the XML-based SSO standard that dominates enterprise identity federation

What is SAML?

SAML 2.0 (Security Assertion Markup Language) is an XML-based open standard for exchanging authentication and authorization data between parties — specifically between an Identity Provider (IdP) and a Service Provider (SP). Published in 2005 by OASIS, it remains the dominant SSO protocol in enterprise environments, especially for web-based applications.

Keycloak can act as both a SAML IdP (authenticating users and issuing assertions to applications) and a SAML SP (federating identity from an upstream IdP like ADFS or Okta via identity brokering).

How the SAML flow works

The most common flow is SP-initiated SSO, with the Response returned over the HTTP-POST binding:

User visits the application (SP). The app sees no session.
SP generates an AuthnRequest (XML) and redirects the user's browser to Keycloak (IdP)
Keycloak authenticates the user (login form, Kerberos, MFA, etc.)
Keycloak generates a SAML Response containing an Assertion — a signed XML document with the user's identity, attributes, and conditions
Keycloak POSTs the Response back to the SP's Assertion Consumer Service (ACS) URL via the user's browser
SP validates the XML signature, checks conditions (audience, timestamps), extracts user attributes, and creates a session
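
Step 2 of the flow above can be sketched as follows, assuming the SP sends its AuthnRequest over the HTTP-Redirect binding (raw DEFLATE, then Base64, then URL-encoding). This is an illustration only: the request is unsigned, the function names are hypothetical, and a real SP should use a maintained SAML library rather than hand-built XML.

```python
import base64
import uuid
import zlib
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlencode, urlparse


def build_authn_request(sp_entity_id: str, acs_url: str, idp_sso_url: str) -> str:
    """Return a minimal, unsigned AuthnRequest. Values are assumed XML-safe."""
    issue_instant = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return (
        '<samlp:AuthnRequest xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol" '
        'xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion" '
        f'ID="_{uuid.uuid4().hex}" Version="2.0" IssueInstant="{issue_instant}" '
        f'Destination="{idp_sso_url}" AssertionConsumerServiceURL="{acs_url}" '
        'ProtocolBinding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-POST">'
        f"<saml:Issuer>{sp_entity_id}</saml:Issuer>"
        "</samlp:AuthnRequest>"
    )


def redirect_url(idp_sso_url: str, authn_request_xml: str, relay_state: str) -> str:
    """Encode the request for the HTTP-Redirect binding and build the URL."""
    compressor = zlib.compressobj(wbits=-15)  # negative wbits = raw DEFLATE
    deflated = compressor.compress(authn_request_xml.encode("utf-8")) + compressor.flush()
    query = urlencode({
        "SAMLRequest": base64.b64encode(deflated).decode("ascii"),
        "RelayState": relay_state,
    })
    return f"{idp_sso_url}?{query}"


def decode_saml_request(url: str) -> str:
    """Reverse the encoding; handy when inspecting a captured redirect."""
    encoded = parse_qs(urlparse(url).query)["SAMLRequest"][0]
    return zlib.decompress(base64.b64decode(encoded), wbits=-15).decode("utf-8")
```

decode_saml_request is useful in its own right: pasting a redirect URL captured from browser developer tools into it shows exactly which Issuer and ACS URL the SP sent.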

Key SAML concepts

Assertion — the core payload. Contains authentication statements (who the user is, when they authenticated), attribute statements (email, groups, roles), and authorization decision statements (rarely used).
Entity ID — a globally unique identifier for each party (IdP and SP). Usually a URL but doesn't have to resolve. Must match exactly between IdP and SP configurations.
Metadata — XML document describing the entity's endpoints, certificates, supported bindings, and capabilities. Exchange metadata between IdP and SP to automate configuration.
Bindings — how SAML messages are transported. HTTP-Redirect (for AuthnRequests via URL query params), HTTP-POST (for Responses via auto-submitting forms), and SOAP/Artifact (for back-channel communication).
Name ID — the identifier for the user in the assertion. Can be email, username, persistent (opaque), or transient (random per session).
Relay State — an opaque value passed through the flow to maintain the user's original destination URL after authentication.

Keycloak SAML configuration

# Keycloak SAML IdP metadata endpoint:
https://keycloak.example.com/realms/myrealm/protocol/saml/descriptor

# Key settings when creating a SAML client in Keycloak:
Client ID:            https://app.example.com/saml/metadata  (the SP Entity ID)
Root URL:             https://app.example.com
Valid Redirect URIs:  https://app.example.com/*
Master SAML Processing URL:  https://app.example.com/saml/acs
Name ID Format:       email (or persistent, username, etc.)
Sign Assertions:      ON  (always)
Sign Documents:       ON
Encrypt Assertions:   OFF (unless the SP requires it)
Force POST Binding:   ON
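
When wiring up an SP by hand, it helps to pull the entity ID, SSO endpoints, and signing certificate out of the IdP metadata descriptor. The sketch below parses a hand-written stand-in for that document; SAMPLE_METADATA is an assumption made for illustration (with a placeholder certificate), not output captured from Keycloak, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "md": "urn:oasis:names:tc:SAML:2.0:metadata",
    "ds": "http://www.w3.org/2000/09/xmldsig#",
}

# Hand-written stand-in for the descriptor; the certificate is a placeholder.
SAMPLE_METADATA = """<md:EntityDescriptor xmlns:md="urn:oasis:names:tc:SAML:2.0:metadata"
    xmlns:ds="http://www.w3.org/2000/09/xmldsig#"
    entityID="https://keycloak.example.com/realms/myrealm">
  <md:IDPSSODescriptor protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol">
    <md:KeyDescriptor use="signing">
      <ds:KeyInfo><ds:X509Data>
        <ds:X509Certificate>MIIC...placeholder</ds:X509Certificate>
      </ds:X509Data></ds:KeyInfo>
    </md:KeyDescriptor>
    <md:SingleSignOnService Binding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-POST"
        Location="https://keycloak.example.com/realms/myrealm/protocol/saml"/>
    <md:SingleSignOnService Binding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-Redirect"
        Location="https://keycloak.example.com/realms/myrealm/protocol/saml"/>
  </md:IDPSSODescriptor>
</md:EntityDescriptor>"""


def summarize_idp_metadata(xml_text: str) -> dict:
    """Extract the fields an SP administrator usually needs from IdP metadata."""
    root = ET.fromstring(xml_text)
    sso = {
        # Keep only the short binding name, e.g. "HTTP-POST".
        el.get("Binding").rsplit(":", 1)[-1]: el.get("Location")
        for el in root.findall("md:IDPSSODescriptor/md:SingleSignOnService", NS)
    }
    certs = [
        (el.text or "").strip()
        for el in root.findall(".//md:KeyDescriptor[@use='signing']//ds:X509Certificate", NS)
    ]
    return {"entity_id": root.get("entityID"), "sso_endpoints": sso, "signing_certs": certs}
```

In practice the XML would come from the descriptor endpoint shown above rather than from an embedded string.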

SAML vs. OIDC — when to use which

Aspect	SAML 2.0	OIDC
Format	XML (verbose)	JSON / JWT (compact)
Best for	Enterprise web SSO, legacy apps	Modern web, mobile, SPAs, APIs
Token size	Large (XML + signature)	Small (JWT)
API auth	Not designed for it	Built for it (Bearer tokens)
Mobile	Poor (XML parsing, redirects)	Good (JSON, native flows)
Enterprise support	Ubiquitous (ADFS, Okta, Ping)	Growing rapidly

Consultant guidance: Default to OIDC for new applications. Use SAML only when the application doesn't support OIDC (enterprise SaaS products such as Salesforce and ServiceNow are often integrated via SAML, and many legacy Java apps support nothing else), or when federating with an existing SAML IdP that can't speak OIDC. Keycloak can bridge the two: accept SAML from an upstream IdP and issue OIDC tokens to downstream apps.

OAuth 2.0

The authorization framework that underpins modern API security and delegated access

What is OAuth 2.0?

OAuth 2.0 is an authorization framework (RFC 6749) that enables applications to obtain limited access to user resources without exposing credentials. It was designed to solve a specific problem: how can a user grant a third-party application access to their data on another service, without sharing their password?

A critical distinction: OAuth 2.0 is about authorization (what can you access?), not authentication (who are you?). OIDC was built on top of OAuth 2.0 to add the authentication layer. Keycloak implements both.

Core roles

Resource Owner — the user who owns the data (e.g., a GitLab user who owns their repositories)
Client — the application requesting access (e.g., a CI tool that wants to read GitLab repos)
Authorization Server — issues tokens after authenticating the user and obtaining consent (this is Keycloak)
Resource Server — the API that hosts the protected resources (e.g., GitLab's API). Validates access tokens.

Grant types (flows)

Authorization Code — the standard flow for web apps. User is redirected to Keycloak, authenticates, Keycloak returns an authorization code to the client, the client exchanges it for tokens via a back-channel call. Most secure for web apps.
Authorization Code + PKCE — same flow but with Proof Key for Code Exchange. Required for public clients (SPAs, mobile apps) that can't safely store a client secret. PKCE prevents authorization code interception attacks.
Client Credentials — machine-to-machine. No user involved. The client authenticates directly with its client ID and secret to get an access token. Used for service-to-service communication.
Device Authorization — for devices with limited input (smart TVs, CLI tools). User authorizes on a separate device by entering a code displayed on the device.
Token Exchange — exchange one token for another (e.g., impersonation, delegation). Keycloak supports this for advanced multi-service architectures.

Token types

Access Token — short-lived (minutes). Presented to resource servers to authorize API requests. In Keycloak, this is a signed JWT containing claims about the user and their permissions.
Refresh Token — longer-lived (hours/days). Used to obtain new access tokens without re-authenticating. Should be stored securely and is only sent to the authorization server, never to resource servers.
ID Token — an OIDC addition (not part of core OAuth 2.0). A JWT containing user identity claims. Only consumed by the client application, not sent to resource servers.

Scopes and consent

Scopes define the boundaries of access a client is requesting. The standard OIDC scopes include openid, profile, and email; Keycloak additionally ships a roles client scope that controls whether role claims are added to the token. Custom scopes can be created in Keycloak to map to specific API permissions. When a client requests scopes, Keycloak can optionally show a consent screen asking the user to approve the requested access.
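
On the resource-server side, enforcing scopes usually means comparing the space-separated scope claim against what an endpoint requires. A minimal sketch; the function name and the orders:read scope are hypothetical examples of a custom scope:

```python
def missing_scopes(scope_claim: str, required: set) -> set:
    """Return the required scopes that the token's scope claim does not grant."""
    granted = set(scope_claim.split())
    return set(required) - granted
```

An empty result means the request may proceed; anything else is typically answered with HTTP 403 and the list of missing scopes.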

Keycloak OAuth 2.0 endpoints

# Discovery document (lists all endpoints):
https://keycloak.example.com/realms/myrealm/.well-known/openid-configuration

# Key endpoints:
Authorization: /realms/{realm}/protocol/openid-connect/auth
Token:         /realms/{realm}/protocol/openid-connect/token
UserInfo:      /realms/{realm}/protocol/openid-connect/userinfo
Introspect:    /realms/{realm}/protocol/openid-connect/token/introspect
Revoke:        /realms/{realm}/protocol/openid-connect/revoke
JWKS:          /realms/{realm}/protocol/openid-connect/certs
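
As an illustration of how these paths are used, the sketch below builds the URL for a named endpoint and prepares (without sending) a client credentials request against the token endpoint. The helper names are hypothetical, and passing the secret in the form body is one of several client authentication methods.

```python
import urllib.request
from urllib.parse import urlencode

# Path suffixes under /realms/{realm}/protocol/openid-connect/, as listed above.
ENDPOINT_PATHS = {
    "authorization": "auth",
    "token": "token",
    "userinfo": "userinfo",
    "introspect": "token/introspect",
    "revoke": "revoke",
    "jwks": "certs",
}


def realm_endpoint(base_url: str, realm: str, name: str) -> str:
    """Build the full URL of one OIDC endpoint for a realm."""
    return f"{base_url.rstrip('/')}/realms/{realm}/protocol/openid-connect/{ENDPOINT_PATHS[name]}"


def client_credentials_request(base_url: str, realm: str, client_id: str,
                               client_secret: str) -> urllib.request.Request:
    """Prepare a client_credentials token request; nothing is sent here."""
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    return urllib.request.Request(
        realm_endpoint(base_url, realm, "token"),
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen sends the request; the JSON response carries the token in its access_token field. In real deployments, prefer reading the endpoint URLs from the discovery document over hard-coding paths.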

Key insight: OAuth 2.0 alone doesn't tell you who the user is — it only grants access. That's why OIDC exists: it adds an ID token and a UserInfo endpoint on top of OAuth 2.0. When someone says they're "using OAuth for login," they almost always mean OIDC. If you're configuring authentication (SSO), configure OIDC. If you're configuring API authorization (scopes, permissions), you're working with the OAuth 2.0 layer underneath.

JSON Web Token (JWT)

The compact, self-contained token format that carries identity and authorization claims

What is a JWT?

A JSON Web Token (RFC 7519) is a compact, URL-safe way to represent claims between two parties. It consists of three Base64URL-encoded parts separated by dots: header.payload.signature. JWTs are the token format used by Keycloak for access tokens, ID tokens, and refresh tokens.

The key property of a JWT is that it is self-contained: the resource server can validate the token and extract user information without contacting the authorization server. This is what makes JWTs scalable — no back-channel token introspection needed for every API request.

JWT structure

# Header (algorithm + type)
{
  "alg": "RS256",
  "typ": "JWT",
  "kid": "key-id-from-keycloak-jwks"
}

# Payload (claims)
{
  "iss": "https://keycloak.example.com/realms/myrealm",
  "sub": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "aud": "my-app",
  "exp": 1711234567,
  "iat": 1711230967,
  "auth_time": 1711230960,
  "azp": "my-app",
  "scope": "openid profile email",
  "email": "user@example.com",
  "name": "Jane Smith",
  "preferred_username": "jsmith",
  "realm_access": {
    "roles": ["admin", "user"]
  },
  "resource_access": {
    "my-app": {
      "roles": ["app-admin"]
    }
  }
}

# Signature
RSASHA256(base64url(header) + "." + base64url(payload), privateKey)
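
To make the three-part structure concrete, the sketch below splits a token and decodes the header and payload. It deliberately does not verify the signature, so it is suitable for inspection only; the function names are hypothetical.

```python
import base64
import json
from typing import Tuple


def b64url_decode(segment: str) -> bytes:
    """Decode Base64URL, restoring the '=' padding that JWTs strip."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def b64url_encode(raw: bytes) -> str:
    """Encode as Base64URL without padding, as used in JWT segments."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def decode_unverified(token: str) -> Tuple[dict, dict]:
    """Return (header, payload) WITHOUT checking the signature."""
    header_b64, payload_b64, _signature_b64 = token.split(".")
    return json.loads(b64url_decode(header_b64)), json.loads(b64url_decode(payload_b64))
```

The padding step matters: a segment whose length is not a multiple of four fails to decode with strict Base64 decoders unless the padding is restored first.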

Key claims

iss (issuer) — who created the token. Must match the expected Keycloak realm URL.
sub (subject) — unique user identifier. In Keycloak, this is the user's UUID.
aud (audience) — who the token is intended for. Resource servers should reject tokens not addressed to them.
exp (expiration) — Unix timestamp. Reject tokens past this time. No exceptions.
iat (issued at) — when the token was created.
azp (authorized party) — the client that requested the token.
realm_access / resource_access — Keycloak-specific claims containing realm-level and client-level role mappings.
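
Because realm_access and resource_access are nested, applications typically flatten them before making authorization decisions. A minimal sketch, assuming the claim layout shown in the payload example above; the function name is hypothetical:

```python
def collect_roles(claims: dict, client_id: str) -> dict:
    """Split Keycloak role claims into realm-level and client-level sets."""
    realm_roles = claims.get("realm_access", {}).get("roles", [])
    client_roles = claims.get("resource_access", {}).get(client_id, {}).get("roles", [])
    return {"realm": set(realm_roles), "client": set(client_roles)}
```

Keeping the two sets separate avoids treating a realm role and a client role with the same name as interchangeable.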

Token validation

Every resource server must validate JWTs before trusting them. The validation steps:

Verify the signature — fetch Keycloak's public keys from the JWKS endpoint (/realms/{realm}/protocol/openid-connect/certs), match the kid from the token header, verify the signature using the public key
Check expiration — reject if exp is in the past (allow a small clock-skew leeway, typically 30-60 seconds)
Check issuer — iss must match your expected Keycloak realm URL exactly
Check audience — aud must include your service's client ID
Extract claims — read roles, scopes, user attributes as needed
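
The claim checks in the steps above (expiration, issuer, audience) can be sketched with the standard library alone. Signature verification is intentionally left out: RS256 verification needs a cryptography library, so this function assumes the signature has already been verified. The function name and the 30-second default leeway are assumptions.

```python
import time
from typing import Optional


def claim_problems(claims: dict, expected_issuer: str, expected_audience: str,
                   leeway_s: int = 30, now: Optional[float] = None) -> list:
    """Return a list of validation failures; an empty list means the claims pass.

    Assumes the token signature was already verified by the caller.
    """
    now = time.time() if now is None else now
    problems = []
    exp = claims.get("exp")
    if exp is None:
        problems.append("missing exp")
    elif now > exp + leeway_s:
        problems.append("token expired")
    if claims.get("iss") != expected_issuer:
        problems.append("unexpected issuer")
    aud = claims.get("aud", [])
    audiences = [aud] if isinstance(aud, str) else list(aud)  # aud may be a string or a list
    if expected_audience not in audiences:
        problems.append("audience mismatch")
    return problems
```

Returning every failure rather than stopping at the first makes log messages more useful when a misconfigured client trips several checks at once.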

Signing algorithms

RS256 (RSA + SHA-256) — asymmetric. Keycloak signs with a private key, resource servers verify with the public key from JWKS. This is the default and recommended algorithm. Resource servers never need the private key.
ES256 (ECDSA + SHA-256) — asymmetric, smaller keys and signatures than RSA. Good alternative if performance matters.
HS256 (HMAC + SHA-256) — symmetric. Both parties share the same secret. Simpler but less secure for distributed systems — every service that validates tokens needs the secret, creating key distribution challenges.

JWT pitfalls

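JWTs can't be revoked locally — once issued, a JWT passes offline validation until it expires. If a user is deactivated, resource servers that only check the signature and claims keep accepting that user's existing tokens. Mitigation: use short expiration times (5-15 minutes), rely on refresh token revocation, and call the introspection endpoint where immediate revocation matters.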
Token size — JWTs with many roles/groups can become large (several KB). They're sent in every API request as a Bearer token. Watch for HTTP header size limits in proxies and load balancers.
Don't store secrets in JWTs — the payload is Base64URL-encoded, not encrypted. Anyone holding the token can decode and read the claims. Never put passwords, API keys, or PII that shouldn't be visible to the client in a token.
Clock skew — if the issuer and validator clocks are out of sync, tokens may be rejected prematurely or accepted after expiry. Use NTP and allow 30-60 seconds of leeway.

Keycloak token customization

Keycloak allows extensive JWT customization via protocol mappers:

User Attribute Mapper — add custom user attributes (department, employee ID) to the token
Group Membership Mapper — include group paths in the token
Audience Mapper — add additional audiences for multi-service architectures
Hardcoded Claim Mapper — add static claims (environment, tenant ID)
Script Mapper — custom JavaScript logic to compute claim values (since KC 18, scripts must be deployed as JAR providers — admin console upload is disabled for security)

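Debugging tip: Use jwt.io to decode and inspect JWTs during development. Paste the token to see the header and payload and to verify the signature. In production, check Keycloak's token endpoint response directly: curl -s ... | jq -r '.access_token' | cut -d. -f2 | tr '_-' '/+' | base64 -d | jq to decode the payload. JWT segments are Base64URL-encoded without padding, so the tr step maps the URL-safe characters back to standard Base64; if base64 -d still reports invalid input, append = padding until the segment length is a multiple of four.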