SUSE Rancher Production Guide

Multi-cluster Kubernetes management — provisioning, GitOps, RBAC, monitoring & operations

01

Overview

Rancher is an open-source multi-cluster Kubernetes management platform that provides a single pane of glass for deploying, managing, and securing Kubernetes clusters anywhere — on-premises, in the cloud, or at the edge. Originally created by Rancher Labs, the project was acquired by SUSE in December 2020, and is now developed and commercially supported under the SUSE umbrella as Rancher Prime.

The core problem Rancher solves is Kubernetes sprawl. As organizations adopt Kubernetes, they inevitably end up with multiple clusters — dev, staging, production, edge locations, different cloud providers. Managing each cluster independently becomes unsustainable. Rancher centralizes cluster lifecycle management, authentication, policy enforcement, monitoring, and application deployment across all of them from a single UI and API.

What problems does Rancher solve?

  • Cluster proliferation — Manage tens or hundreds of clusters from one place instead of juggling kubeconfigs and switching kubectl contexts by hand
  • Consistent RBAC — Enforce authentication and authorization policies across every cluster using a single identity provider
  • Cluster provisioning — Spin up new RKE2 or K3s clusters on bare metal, VMs, or cloud providers with a few clicks
  • GitOps at scale — Deploy workloads across clusters using Fleet, the built-in GitOps engine
  • Unified observability — Centralized monitoring and logging across all clusters
  • Edge computing — Manage thousands of lightweight K3s clusters at edge locations

Strengths

  • Single UI/API for all clusters regardless of location or provider
  • Supports any CNCF-conformant Kubernetes distribution
  • Built-in GitOps with Fleet
  • Integrated monitoring (Prometheus/Grafana) and logging
  • Strong RBAC model with external IdP integration
  • Free open-source core — commercial support via Rancher Prime
  • Scales from a handful of clusters to thousands (edge use cases)

Considerations

  • Rancher server itself needs a dedicated, well-maintained K8s cluster
  • Adds an operational layer — another system to upgrade and maintain
  • Agent-based model means downstream clusters must reach Rancher server
  • UI can be slow when managing very large numbers of resources
  • Feature velocity is high — breaking changes between major versions
  • Some advanced features require Rancher Prime (paid) subscription

02

Architecture

Rancher follows a hub-and-spoke model. The Rancher server (the hub) runs on a dedicated Kubernetes cluster and manages one or more downstream clusters (the spokes). Communication between the Rancher server and downstream clusters is handled by the Rancher Agent, which runs on each managed cluster.

Core components

Rancher Server

A set of pods deployed via Helm on a dedicated Kubernetes cluster. It hosts the Rancher API, the web UI, the authentication proxy, and the cluster controllers. It stores all state as Kubernetes Custom Resources in the host cluster’s etcd — no separate external database is required.

Downstream Clusters

Any Kubernetes cluster that Rancher manages. These can be provisioned by Rancher (RKE2, K3s, EKS, AKS, GKE) or imported (existing clusters you register with Rancher). Each downstream cluster runs a Rancher Agent.

Rancher Agent

A deployment on each downstream cluster that establishes a WebSocket tunnel back to the Rancher server. This tunnel carries API requests, monitoring data, and cluster events. The agent initiates the connection outbound, so downstream clusters do not need to expose any ports to Rancher.

Authentication Proxy

All kubectl and API requests to downstream clusters are proxied through the Rancher server. Rancher authenticates the user (via its configured IdP), maps them to Rancher roles, and then forwards the request to the downstream cluster’s API server with the appropriate impersonation headers.
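In practice, a kubeconfig downloaded from the Rancher UI points kubectl at the Rancher proxy rather than at the downstream API server directly. A hedged sketch of what such a kubeconfig looks like (the cluster ID and token are placeholders — the exact values are generated by Rancher):

```yaml
# Sketch of a Rancher-generated kubeconfig (placeholder values).
# The server URL targets the Rancher proxy, which authenticates the token
# and forwards requests to the downstream API server with impersonation headers.
apiVersion: v1
kind: Config
clusters:
- name: prod-cluster
  cluster:
    server: https://rancher.example.com/k8s/clusters/c-m-abc123   # proxied endpoint
users:
- name: prod-cluster
  user:
    token: kubeconfig-user-xyz:sometokenvalue   # Rancher API token
contexts:
- name: prod-cluster
  context:
    cluster: prod-cluster
    user: prod-cluster
current-context: prod-cluster
```

Because the token is a Rancher credential rather than a downstream cluster certificate, revoking access in Rancher immediately cuts off kubectl access through the proxy.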

Communication flow

  Downstream Cluster                        Rancher Server (Hub)
+---------------------+                    +------------------------+
| Rancher Agent       | --- WebSocket ---> | Rancher API / UI       |
| (cattle-system ns)  |     (outbound)     | Auth Proxy             |
|                     |                    | Cluster Controllers    |
| Workloads           |                    | Fleet Manager          |
| Monitoring Stack    |                    | Backup Operator        |
+---------------------+                    +------------------------+
                                                       |
                                              +--------+--------+
                                              |   etcd (K8s)    |
                                              | (Rancher state) |
                                              +-----------------+

HA deployment

For production, Rancher should run on a 3-node RKE2 cluster dedicated solely to Rancher. This gives you:

  • etcd quorum — 3 etcd members tolerate 1 node failure
  • Rancher pod replicas — The Rancher Helm chart defaults to 3 replicas, spread across nodes via preferred anti-affinity rules (configurable to required via the antiAffinity Helm value)
  • Load balancer — A Layer 4 load balancer (or DNS round-robin) in front of the 3 nodes distributes traffic to the Rancher ingress

Recommendation

Never run user workloads on the Rancher server cluster. Dedicate it entirely to Rancher. If the Rancher server cluster becomes unstable, you lose management access to all downstream clusters. Downstream clusters continue to run independently — you just lose the centralized UI/API.

# Install Rancher on a 3-node RKE2 cluster
# Open-source Rancher (use rancher-stable for production)
helm repo add rancher-stable https://releases.rancher.com/server-charts/stable
# For Rancher Prime, use the authenticated repo URL from SUSE Customer Center (SCC)
helm repo update

helm install rancher rancher-stable/rancher \
  --namespace cattle-system \
  --create-namespace \
  --set hostname=rancher.example.com \
  --set replicas=3 \
  --set bootstrapPassword="admin" \
  --set ingress.tls.source=letsEncrypt \
  --set letsEncrypt.email=ops@example.com

03

Cluster Management

Rancher provides two primary ways to bring clusters under management: importing existing clusters and provisioning new ones. Both result in a cluster that appears in the Rancher UI with full lifecycle management capabilities.

Importing existing clusters

Any CNCF-conformant Kubernetes cluster can be imported into Rancher. The process is straightforward:

  1. In the Rancher UI, click Import Existing
  2. Rancher generates a kubectl apply command containing a manifest that deploys the Rancher Agent
  3. Run the command on the target cluster
  4. The agent connects back to Rancher, and the cluster appears in the dashboard

Imported clusters retain their original provisioner — Rancher does not take over the cluster lifecycle (upgrades, node scaling). It only adds management capabilities (RBAC, monitoring, app deployment).

Provisioning new clusters

RKE2 / K3s

Rancher can provision RKE2 or K3s clusters on infrastructure you provide. You define node drivers (for cloud VMs) or bring your own nodes. Rancher handles the full lifecycle: install, upgrade, scale, and teardown.

Hosted (EKS/AKS/GKE)

Rancher integrates with cloud provider APIs to provision managed Kubernetes services. You provide cloud credentials, and Rancher creates and manages EKS, AKS, or GKE clusters through their respective APIs. Rancher manages the node pools and Kubernetes version upgrades.

Cluster templates

Cluster templates allow platform teams to define standardized cluster configurations that developers can use for self-service provisioning. Templates enforce organizational policies such as:

  • Kubernetes version constraints
  • Required CNI plugin (Calico, Cilium, Canal)
  • Node pool sizing and instance types
  • Network and security policies
  • Monitoring and logging stack enablement
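The constraints a template enforces end up as fields on the provisioned cluster object. A hedged sketch of what a Rancher-provisioned RKE2 cluster definition looks like (provisioning.cattle.io/v1 is Rancher's provisioning API; the name, version, and CNI values here are illustrative):

```yaml
# Sketch of a Rancher-provisioned RKE2 cluster (illustrative values).
apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
  name: team-alpha-dev
  namespace: fleet-default
spec:
  kubernetesVersion: v1.30.4+rke2r1     # version pinned by the template
  rkeConfig:
    machineGlobalConfig:
      cni: cilium                        # CNI required by the template
```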

Node drivers and machine drivers

Node drivers are plugins that allow Rancher to provision VMs on various infrastructure providers. Built-in drivers include AWS EC2, Azure, DigitalOcean, Harvester, and vSphere. Custom drivers can be added for other providers. Machine drivers use Rancher Machine (a maintained fork of the now-deprecated Docker Machine) under the hood to create VMs and prepare them for Kubernetes installation.

Cluster API (CAPI)

Rancher’s provisioning is moving toward Cluster API (CAPI) as the underlying framework, integrated via the Rancher Turtles operator. CAPI provides a Kubernetes-native, declarative way to create, configure, and manage clusters. This complements and will eventually replace the older node driver approach, aligning Rancher with the broader CNCF ecosystem. Node drivers remain fully supported for backward compatibility.

04

Authentication & RBAC

Rancher provides a centralized authentication and authorization layer that sits in front of all managed clusters. Instead of configuring RBAC independently on each cluster, you define roles and bindings once in Rancher and they are enforced everywhere.

Identity provider integration

Rancher supports multiple authentication backends:

  • Keycloak / OIDC — The recommended approach for enterprise environments. Rancher acts as an OIDC client, delegating authentication to Keycloak (or any OIDC provider)
  • SAML — Integration with ADFS, Okta, PingFederate, Shibboleth, and other SAML 2.0 providers
  • LDAP / Active Directory — Direct LDAP bind for organizations that haven’t adopted OIDC/SAML
  • GitHub / Google / Microsoft Entra ID (Azure AD) — OAuth-based authentication for development environments
  • Local — Built-in user database for bootstrap and emergency access

Rancher’s role model

Rancher defines roles at three hierarchical scopes:

Scope               | Description                                        | Example Roles
--------------------|----------------------------------------------------|-----------------------------------------------
Global              | Applies across the entire Rancher installation     | Administrator, Restricted Admin, Standard User
Cluster             | Applies to a specific cluster                      | Cluster Owner, Cluster Member, Cluster Viewer
Project / Namespace | Applies to a group of namespaces within a cluster  | Project Owner, Project Member, Read-Only

Mapping external groups to Rancher roles

When an external IdP is configured, you can map IdP groups directly to Rancher roles. For example:

  • LDAP group cn=platform-admins → Global Restricted Admin
  • LDAP group cn=team-alpha → Cluster Member on prod-cluster + Project Owner on alpha-project
  • OIDC group claim devops → Cluster Owner on all dev clusters
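Group mappings can also be declared as Rancher Custom Resources rather than clicked together in the UI. A hedged sketch of binding an LDAP group to a global role (management.cattle.io/v3 is Rancher's management API; the principal ID format varies by auth provider, so treat the value below as an example):

```yaml
# Sketch: bind an IdP group to a Rancher global role (illustrative values).
apiVersion: management.cattle.io/v3
kind: GlobalRoleBinding
metadata:
  name: platform-admins-restricted-admin
globalRoleName: restricted-admin
groupPrincipalName: openldap_group://cn=platform-admins,ou=groups,dc=example,dc=com
```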

Important

Rancher’s RBAC is enforced at the Rancher proxy layer. If users bypass Rancher and connect directly to a downstream cluster’s API server with a valid kubeconfig, Rancher RBAC is not enforced. For full security, restrict direct API server access via network policies and use Rancher as the sole entry point.

05

Catalogs & Apps

Rancher provides an app marketplace that simplifies deploying Helm charts across managed clusters. This is the primary mechanism for installing both infrastructure components (monitoring, logging, ingress controllers) and user applications.

Helm chart repositories

Rancher uses standard Helm chart repositories as its catalog backend. There are three types:

  • Built-in — Rancher ships with curated charts for monitoring, logging, Istio, OPA Gatekeeper, CIS benchmarks, and more
  • Partner — Charts from SUSE partners available through the Rancher marketplace
  • Custom — Add any Helm repository URL (public or private) to make its charts available in the Rancher UI

Deploying apps across clusters

From the Rancher UI, you can install a Helm chart on any managed cluster in a few clicks. Rancher proxies the Helm install through its API, so you don’t need direct kubectl access. For deploying the same app across multiple clusters, Rancher integrates with Fleet for GitOps-driven multi-cluster deployment.

Fleet for GitOps at scale

Fleet is Rancher’s built-in GitOps engine designed for managing deployments across large numbers of clusters. Instead of manually installing charts on each cluster, you define your desired state in a Git repository and Fleet ensures every targeted cluster converges to that state. See the next section for a deep dive.

Best Practice

Use Fleet for production workloads rather than manual Helm installs through the UI. The UI-based approach is convenient for one-off deployments and experimentation, but Fleet provides auditability, reproducibility, and drift detection that are essential for production operations.

06

Fleet

Fleet is a GitOps engine built into Rancher that enables continuous delivery across a large number of clusters. It watches Git repositories for changes and automatically deploys workloads to targeted clusters. Fleet was designed from the ground up for the multi-cluster use case — it can scale to manage thousands of clusters, making it ideal for edge deployments.

Core concepts

GitRepo

A Custom Resource that points to a Git repository (URL, branch, paths). Fleet watches this repo for changes. When a commit is detected, Fleet processes the contents and creates Bundles.

Bundle

The unit of deployment in Fleet. A Bundle contains the Kubernetes manifests, Helm charts, or Kustomize overlays that Fleet will apply to target clusters. Bundles are created automatically from GitRepo resources.

Cluster Groups & Labels

Fleet targets clusters using label selectors. You label your clusters (e.g., env=prod, region=eu) and then use selectors in your GitRepo to specify which clusters should receive the deployment.

BundleDeployment

Created by Fleet for each Bundle/cluster combination. It tracks the deployment status on each individual cluster — whether it’s ready, in progress, or has errors.

Example: multi-cluster deployment

# fleet.yaml - placed in your Git repo
defaultNamespace: my-app
helm:
  releaseName: my-app
  chart: ./charts/my-app
  values:
    replicaCount: 3

targetCustomizations:
- name: staging
  clusterSelector:
    matchLabels:
      env: staging
  helm:
    values:
      replicaCount: 1

- name: production
  clusterSelector:
    matchLabels:
      env: production
  helm:
    values:
      replicaCount: 5
      resources:
        requests:
          memory: "512Mi"
          cpu: "500m"

# GitRepo resource - registered in the Rancher local cluster
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: my-app
  namespace: fleet-default
spec:
  repo: https://github.com/org/my-app-deploy
  branch: main
  paths:
  - /
  targets:
  - clusterSelector:
      matchLabels:
        env: production
  - clusterSelector:
      matchLabels:
        env: staging

Scale

Fleet was designed with extreme scale in mind — SUSE has published scale tests simulating management of up to one million clusters for edge computing scenarios. It achieves this by batching operations and using an efficient reconciliation loop that avoids per-cluster API calls from the management plane.

07

Monitoring & Logging

Rancher provides integrated monitoring and logging stacks that can be deployed to any managed cluster with a single click from the Rancher UI. These are based on well-established open-source projects and are packaged as Helm charts maintained by the Rancher team.

Monitoring stack

Rancher’s monitoring solution is based on the Prometheus Operator and includes:

  • Prometheus — Metrics collection and storage with pre-configured scrape targets for Kubernetes components, nodes, and common workloads
  • Grafana — Dashboards for cluster health, node resources, pod metrics, and workload performance. Rancher ships with curated dashboards out of the box
  • Alertmanager — Alert routing and notification via email, Slack, PagerDuty, webhooks, and more
  • Node Exporter & kube-state-metrics — Exporters for OS-level and Kubernetes object metrics

Per-cluster vs global dashboards

Monitoring is deployed per cluster — each managed cluster gets its own Prometheus and Grafana instance. This ensures metrics stay local and avoids cross-cluster data transfer. For a global view, you can:

  • Use Thanos or Cortex to aggregate Prometheus metrics across clusters into a central query layer
  • Configure remote_write on each cluster’s Prometheus to push metrics to a central TSDB
  • Use the Rancher UI’s cluster switcher to navigate between per-cluster Grafana instances
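The remote_write option is configured through the monitoring Helm values. A hedged sketch (the chart structure follows kube-prometheus-stack conventions; the endpoint URL and Secret name are placeholders):

```yaml
# Sketch of rancher-monitoring Helm values enabling remote_write
# to a central TSDB (placeholder endpoint and credentials).
prometheus:
  prometheusSpec:
    remoteWrite:
    - url: https://central-tsdb.example.com/api/v1/write
      basicAuth:
        username:
          name: remote-write-creds    # Secret in the same namespace as Prometheus
          key: username
        password:
          name: remote-write-creds
          key: password
```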

Logging integration

Rancher’s logging integration is based on the Logging Operator (originally developed by Banzai Cloud, now maintained under the Kube Logging project). It deploys Fluent Bit as a DaemonSet on each node to collect and enrich logs with Kubernetes metadata, then forwards them to Fluentd (or syslog-ng) for filtering and routing. Supported outputs include:

  • Elasticsearch / OpenSearch
  • Splunk
  • Amazon CloudWatch
  • Azure Log Analytics
  • Syslog
  • Kafka
  • Custom HTTP endpoints

Alerting

Rancher exposes Alertmanager configuration through the UI, allowing operators to define alert rules and notification channels without editing YAML directly. Pre-built alert rules cover common scenarios:

  • Node not ready, high CPU/memory, disk pressure
  • Pod crash loops, OOMKills, pending pods
  • etcd health, API server latency, scheduler failures
  • Certificate expiration warnings
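Custom rules can be added alongside the pre-built ones. Since the stack is built on the Prometheus Operator, a PrometheusRule resource is picked up automatically — a hedged sketch (the rule name, namespace, and thresholds are examples; node_filesystem metrics come from the bundled Node Exporter):

```yaml
# Sketch of a custom alert rule deployed next to Rancher monitoring.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: extra-node-alerts
  namespace: cattle-monitoring-system
spec:
  groups:
  - name: node-disk
    rules:
    - alert: NodeRootDiskAlmostFull
      expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Root filesystem below 10% free on {{ $labels.instance }}"
```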

Resource Planning

Prometheus is memory-intensive. For production clusters with many pods, expect Prometheus to use 4-8 GB RAM with default retention (15 days). Adjust retention and retentionSize in the monitoring Helm values to control resource usage. Consider using Thanos with object storage for long-term retention instead of increasing local Prometheus storage.
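The retention knobs mentioned above live in the monitoring Helm values. A hedged sketch (names follow kube-prometheus-stack conventions; the sizes are examples to adapt to your cluster):

```yaml
# Sketch of monitoring Helm values tuning Prometheus retention and memory.
prometheus:
  prometheusSpec:
    retention: 7d            # time-based retention
    retentionSize: 40GiB     # size-based cap; whichever limit hits first wins
    resources:
      requests:
        memory: 4Gi
      limits:
        memory: 8Gi
```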

08

Backup & Disaster Recovery

The Rancher Backup Operator (rancher-backup) provides a Kubernetes-native way to back up and restore the Rancher server’s state. This is critical because losing the Rancher server means losing centralized management of all downstream clusters (though the clusters themselves continue to operate independently).

What gets backed up

  • All Rancher Custom Resources (clusters, projects, users, roles, tokens, settings)
  • Rancher-managed namespaces and their contents
  • Catalog/app configurations
  • Fleet GitRepo and Bundle resources

The backup operator does not back up downstream cluster workloads — those need their own backup strategy (e.g., Velero).
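For the downstream workloads themselves, a scheduled Velero backup is a common complement. A hedged sketch (velero.io/v1 Schedule; the namespaces, timing, and retention are examples):

```yaml
# Sketch of a Velero schedule for downstream workload backups (example values).
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-workloads
  namespace: velero
spec:
  schedule: "0 3 * * *"      # daily at 3 AM, offset from the Rancher backup
  template:
    includedNamespaces:
    - "*"
    excludedNamespaces:
    - cattle-system           # Rancher agent namespace; covered by rancher-backup
    ttl: 240h                 # keep backups for 10 days
```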

Backup configuration

apiVersion: resources.cattle.io/v1
kind: Backup
metadata:
  name: rancher-daily-backup
spec:
  resourceSetName: rancher-resource-set
  schedule: "0 2 * * *"           # Daily at 2 AM
  retentionCount: 10              # Keep last 10 backups
  storageLocation:
    s3:
      bucketName: rancher-backups
      region: us-east-1
      endpoint: s3.amazonaws.com
      credentialSecretName: s3-creds
      credentialSecretNamespace: cattle-system

Restoring Rancher

To restore onto a new cluster, provision a fresh RKE2 cluster, install the backup operator, and apply a Restore CR pointing to the backup location; then install Rancher via Helm using the original hostname. The operator reconciles all Rancher CRDs and resources, and downstream clusters reconnect automatically (agents reconnect via the same URL).

apiVersion: resources.cattle.io/v1
kind: Restore
metadata:
  name: rancher-restore
spec:
  backupFilename: rancher-daily-backup-2026-03-18T02-00-00Z.tar.gz
  storageLocation:
    s3:
      bucketName: rancher-backups
      region: us-east-1
      endpoint: s3.amazonaws.com
      credentialSecretName: s3-creds
      credentialSecretNamespace: cattle-system

Migrating Rancher to a new cluster

The backup/restore process doubles as a migration strategy. To move Rancher to a new cluster:

  1. Take a backup on the old Rancher server
  2. Provision a new RKE2 cluster and install the backup operator
  3. Apply the Restore CR pointing to the backup location
  4. Install Rancher via Helm with the same hostname and chart version
  5. Update DNS to point the Rancher hostname to the new cluster’s load balancer
  6. Downstream cluster agents will reconnect automatically

Critical

Always test your restore procedure in a non-production environment before you need it. A backup that has never been tested is not a backup. Schedule quarterly restore drills to validate your DR process end-to-end.

09

Upgrades

Upgrading a Rancher environment involves two independent concerns: upgrading the Rancher server itself and upgrading the downstream Kubernetes clusters it manages.

Upgrading Rancher server

Rancher server is deployed via Helm, so upgrades are a standard helm upgrade:

# 1. Back up Rancher before upgrading
kubectl apply -f rancher-backup.yaml

# 2. Update the Helm repo
helm repo update

# 3. Upgrade Rancher
helm upgrade rancher rancher-stable/rancher \
  --namespace cattle-system \
  --set hostname=rancher.example.com \
  --set replicas=3 \
  --version 2.13.3

  • Always take a backup before upgrading (use the Rancher Backup Operator)
  • Read the release notes — Rancher publishes detailed upgrade notes with known issues and breaking changes
  • Upgrade sequentially — Rancher recommends upgrading through each consecutive minor version. Skipping minor versions is not supported and increases the risk of issues due to accumulated changes
  • Upgrade the underlying RKE2 cluster if needed — check the Rancher support matrix for compatible Kubernetes versions

Upgrading downstream clusters

For Rancher-provisioned clusters (RKE2, K3s), upgrades are initiated from the Rancher UI or API:

  • Select the target Kubernetes version from the available list
  • Rancher orchestrates a rolling upgrade of control plane nodes first, then workers
  • For RKE2, the upgrade uses the System Upgrade Controller (SUC) to coordinate node-by-node upgrades
  • Configure drain settings (max unavailable, drain timeout) to control upgrade speed vs availability
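The drain and concurrency settings live on the provisioned cluster object. A hedged sketch of the relevant fragment (field names follow provisioning.cattle.io/v1; the values are examples):

```yaml
# Sketch of upgrade/drain settings on a Rancher-provisioned RKE2 cluster.
spec:
  rkeConfig:
    upgradeStrategy:
      controlPlaneConcurrency: "1"     # one control plane node at a time
      workerConcurrency: "10%"         # percentage of workers upgraded in parallel
      workerDrainOptions:
        enabled: true
        ignoreDaemonSets: true
        deleteEmptyDirData: true
        timeout: 300                    # seconds before the drain gives up
```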

For hosted clusters (EKS/AKS/GKE), Rancher calls the cloud provider’s API to trigger the managed upgrade process.

Upgrade strategy

Recommended Order

  1. Upgrade Rancher server to the latest patch
  2. Upgrade dev/staging downstream clusters
  3. Validate applications and monitoring
  4. Upgrade production downstream clusters during maintenance window
  5. Upgrade monitoring and logging stacks if needed

Common Pitfalls

  • Upgrading downstream clusters to a K8s version not yet supported by the current Rancher version
  • Not checking PodDisruptionBudgets before draining nodes
  • Skipping the backup step — rollback without a backup is painful
  • Upgrading all clusters simultaneously instead of rolling

10

Licensing

Rancher has a straightforward licensing model that separates the free open-source project from the commercial offering.

Rancher (Open Source)

The core Rancher project is 100% open source under the Apache 2.0 license. You can download, deploy, and use Rancher in production without any license key or subscription. All core features — multi-cluster management, RBAC, Fleet, monitoring integration, app catalog — are available for free.

Rancher Prime (SUSE Commercial)

Rancher Prime is SUSE’s commercially supported distribution of Rancher. As of 2025, it is priced per CPU core / vCPU (charged per pair of physical cores or per four vCPUs), replacing the previous per-node model. What you get with Rancher Prime:

  • Enterprise support — 24/7 support with SLAs from SUSE’s Kubernetes experts
  • Hardened images — SUSE-built, FIPS 140-2 validated container images with BoringCrypto
  • Extended lifecycle — Longer support windows for Rancher and RKE2 versions
  • Security patches — Priority access to CVE fixes and security advisories
  • SUSE Registry — Access to SUSE’s private container registry with verified images
  • UI extensions — Additional UI capabilities for enterprise governance
  • Integrations — Certified integrations with SUSE Linux Enterprise, NeuVector, and Longhorn
Feature                       | Open Source    | Rancher Prime
------------------------------|----------------|--------------
Multi-cluster management      | Yes            | Yes
Fleet GitOps                  | Yes            | Yes
RBAC & IdP integration        | Yes            | Yes
Enterprise support (24/7 SLA) | Community only | Yes
FIPS-validated images         | No             | Yes
Extended lifecycle support    | No             | Yes
NeuVector integration         | Community      | Certified

Pricing Note

Rancher Prime licensing changed in 2025 from a per-node model to a per-CPU/vCPU model (per pair of physical cores or per four vCPUs). This can significantly increase costs for deployments with high core counts. The Rancher server cluster itself is not counted. Contact SUSE for current pricing tiers, volume discounts, and academic/government programs.
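As a back-of-the-envelope illustration (the unit definition below is an assumption derived from the model described above — confirm the actual terms with SUSE), the subscription unit count for a fleet can be estimated as:

```shell
# Illustrative only: assumes 1 unit = 2 physical cores = 4 vCPUs, rounding up.
vcpus=96          # total vCPUs across managed nodes
phys_cores=48     # or count physical cores instead
units_by_vcpu=$(( (vcpus + 3) / 4 ))
units_by_core=$(( (phys_cores + 1) / 2 ))
echo "units (vCPU basis): $units_by_vcpu"
echo "units (core basis): $units_by_core"
```

For 96 vCPUs (or the equivalent 48 physical cores), both bases land on 24 units — a useful sanity check that the two counting methods are meant to be equivalent.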

11

Consultant’s Checklist

Use this checklist when planning, deploying, or auditing a Rancher environment.

Infrastructure

  • Dedicated 3-node RKE2 cluster for Rancher server (no user workloads)
  • Layer 4 load balancer in front of Rancher server nodes
  • Valid TLS certificate for rancher.example.com
  • DNS configured for Rancher hostname
  • Network connectivity: downstream clusters must reach Rancher on port 443
  • etcd backup schedule configured on the RKE2 cluster

Authentication

  • External IdP configured (Keycloak/OIDC recommended)
  • IdP group-to-role mappings defined and tested
  • Local admin account secured with strong password
  • Emergency break-glass procedure documented
  • Session timeout configured appropriately
  • API tokens have expiration dates set

Operations

  • Rancher Backup Operator installed and scheduled (daily)
  • Backup storage configured (S3-compatible recommended)
  • Restore procedure tested and documented
  • Upgrade runbook created with rollback steps
  • Monitoring deployed on all managed clusters
  • Alert channels configured (Slack, PagerDuty, email)

Security

  • Direct API server access restricted (force traffic through Rancher proxy)
  • Network policies isolating Rancher system namespaces
  • CIS benchmark scan run on RKE2 clusters
  • Pod Security Standards enforced (restricted profile for workloads)
  • Audit logging enabled on Rancher server
  • RBAC follows least-privilege principle

GitOps & Fleet

  • Fleet configured for application deployment (not manual UI installs)
  • Git repositories structured with fleet.yaml for target customizations
  • Cluster labels defined and documented (env, region, tier)
  • Fleet drift detection reviewed and remediation policy set
  • Secrets management strategy defined (Sealed Secrets, External Secrets Operator, or Vault)

Sizing Guide

Small (1-10 downstream clusters, < 50 nodes): 3-node RKE2, 4 vCPU / 8 GB RAM per node, 100 GB SSD.
Medium (10-50 clusters, 50-500 nodes): 3-node RKE2, 8 vCPU / 16 GB RAM per node, 200 GB SSD.
Large (50+ clusters, 500+ nodes): 3-node RKE2, 16 vCPU / 32 GB RAM per node, 500 GB SSD. Consider dedicated etcd nodes.