SUSE Rancher Production Guide

Multi-cluster Kubernetes management — provisioning, GitOps, RBAC, monitoring & operations

01

Overview

Rancher is an open-source multi-cluster Kubernetes management platform that provides a single pane of glass for deploying, managing, and securing Kubernetes clusters anywhere — on-premises, in the cloud, or at the edge. Originally created by Rancher Labs, the project was acquired by SUSE in December 2020, and is now developed and commercially supported under the SUSE umbrella as Rancher Prime.

The core problem Rancher solves is Kubernetes sprawl. As organizations adopt Kubernetes, they inevitably end up with multiple clusters — dev, staging, production, edge locations, different cloud providers. Managing each cluster independently becomes unsustainable. Rancher centralizes cluster lifecycle management, authentication, policy enforcement, monitoring, and application deployment across all of them from a single UI and API.

What problems does Rancher solve?

  • Cluster proliferation — Manage tens or hundreds of clusters from one place instead of juggling kubeconfigs and switching kubectl contexts by hand
  • Consistent RBAC — Enforce authentication and authorization policies across every cluster using a single identity provider
  • Cluster provisioning — Spin up new RKE2 or K3s clusters on bare metal, VMs, or cloud providers with a few clicks
  • GitOps at scale — Deploy workloads across clusters using Fleet, the built-in GitOps engine
  • Unified observability — Centralized monitoring and logging across all clusters
  • Edge computing — Manage thousands of lightweight K3s clusters at edge locations

Strengths

  • Single UI/API for all clusters regardless of location or provider
  • Supports any CNCF-conformant Kubernetes distribution
  • Built-in GitOps with Fleet
  • Integrated monitoring (Prometheus/Grafana) and logging
  • Strong RBAC model with external IdP integration
  • Free open-source core — commercial support via Rancher Prime
  • Scales from a handful of clusters to thousands (edge use cases)

Considerations

  • Rancher server itself needs a dedicated, well-maintained K8s cluster
  • Adds an operational layer — another system to upgrade and maintain
  • Agent-based model means downstream clusters must reach Rancher server
  • UI can be slow when managing very large numbers of resources
  • Feature velocity is high — breaking changes between major versions
  • Some advanced features require Rancher Prime (paid) subscription

02

Architecture

Rancher follows a hub-and-spoke model. The Rancher server (the hub) runs on a dedicated Kubernetes cluster and manages one or more downstream clusters (the spokes). Communication between the Rancher server and downstream clusters is handled by the Rancher Agent, which runs on each managed cluster.

Core components

Rancher Server

A set of pods deployed via Helm on a dedicated Kubernetes cluster. It hosts the Rancher API, the web UI, the authentication proxy, and the cluster controllers. It stores all state as Kubernetes Custom Resources in the host cluster’s etcd — no separate external database is required.

Downstream Clusters

Any Kubernetes cluster that Rancher manages. These can be provisioned by Rancher (RKE2, K3s, EKS, AKS, GKE) or imported (existing clusters you register with Rancher). Each downstream cluster runs a Rancher Agent.

Rancher Agent

A deployment on each downstream cluster that establishes a WebSocket tunnel back to the Rancher server. This tunnel carries API requests, monitoring data, and cluster events. The agent initiates the connection outbound, so downstream clusters do not need to expose any ports to Rancher.

Authentication Proxy

All kubectl and API requests to downstream clusters are proxied through the Rancher server. Rancher authenticates the user (via its configured IdP), maps them to Rancher roles, and then forwards the request to the downstream cluster’s API server with the appropriate impersonation headers.
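In practice, a kubeconfig downloaded from the Rancher UI points kubectl at the Rancher proxy rather than at the downstream API server directly. A hedged sketch of what such a kubeconfig looks like (the cluster ID and token are placeholders — the exact values are generated by Rancher):

```yaml
# Sketch of a Rancher-generated kubeconfig (placeholder values).
# The server URL targets the Rancher proxy, which authenticates the token
# and forwards requests to the downstream API server with impersonation headers.
apiVersion: v1
kind: Config
clusters:
- name: prod-cluster
  cluster:
    server: https://rancher.example.com/k8s/clusters/c-m-abc123   # proxied endpoint
users:
- name: prod-cluster
  user:
    token: kubeconfig-user-xyz:sometokenvalue   # Rancher API token
contexts:
- name: prod-cluster
  context:
    cluster: prod-cluster
    user: prod-cluster
current-context: prod-cluster
```

Because the token is a Rancher credential rather than a downstream cluster certificate, revoking access in Rancher immediately cuts off kubectl access through the proxy.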

Communication flow

  Downstream Cluster                        Rancher Server (Hub)
+---------------------+                    +------------------------+
| Rancher Agent       | --- WebSocket ---> | Rancher API / UI       |
| (cattle-system ns)  |     (outbound)     | Auth Proxy             |
|                     |                    | Cluster Controllers    |
| Workloads           |                    | Fleet Manager          |
| Monitoring Stack    |                    | Backup Operator        |
+---------------------+                    +------------------------+
                                                       |
                                              +--------+--------+
                                              |   etcd (K8s)    |
                                              | (Rancher state) |
                                              +-----------------+

HA deployment

For production, Rancher should run on a 3-node RKE2 cluster dedicated solely to Rancher. This gives you:

  • etcd quorum — 3 etcd members tolerate 1 node failure
  • Rancher pod replicas — The Rancher Helm chart defaults to 3 replicas, spread across nodes via preferred anti-affinity rules (configurable to required via the antiAffinity Helm value)
  • Load balancer — A Layer 4 load balancer (or DNS round-robin) in front of the 3 nodes distributes traffic to the Rancher ingress

Recommendation

Never run user workloads on the Rancher server cluster. Dedicate it entirely to Rancher. If the Rancher server cluster becomes unstable, you lose management access to all downstream clusters. Downstream clusters continue to run independently — you just lose the centralized UI/API.

# Install Rancher on a 3-node RKE2 cluster
# Open-source Rancher (use rancher-stable for production)
helm repo add rancher-stable https://releases.rancher.com/server-charts/stable
# For Rancher Prime, use the authenticated repo URL from SUSE Customer Center (SCC)
helm repo update

helm install rancher rancher-stable/rancher \
  --namespace cattle-system \
  --create-namespace \
  --set hostname=rancher.example.com \
  --set replicas=3 \
  --set bootstrapPassword="admin" \
  --set ingress.tls.source=letsEncrypt \
  --set letsEncrypt.email=ops@example.com

03

Cluster Management

Rancher provides two primary ways to bring clusters under management: importing existing clusters and provisioning new ones. Both result in a cluster that appears in the Rancher UI with full lifecycle management capabilities.

Importing existing clusters

Any CNCF-conformant Kubernetes cluster can be imported into Rancher. The process is straightforward:

  1. In the Rancher UI, click Import Existing
  2. Rancher generates a kubectl apply command containing a manifest that deploys the Rancher Agent
  3. Run the command on the target cluster
  4. The agent connects back to Rancher, and the cluster appears in the dashboard

Imported clusters retain their original provisioner — Rancher does not take over the cluster lifecycle (upgrades, node scaling). It only adds management capabilities (RBAC, monitoring, app deployment).

Provisioning new clusters

RKE2 / K3s

Rancher can provision RKE2 or K3s clusters on infrastructure you provide. You define node drivers (for cloud VMs) or bring your own nodes. Rancher handles the full lifecycle: install, upgrade, scale, and teardown.

Hosted (EKS/AKS/GKE)

Rancher integrates with cloud provider APIs to provision managed Kubernetes services. You provide cloud credentials, and Rancher creates and manages EKS, AKS, or GKE clusters through their respective APIs. Rancher manages the node pools and Kubernetes version upgrades.

Cluster templates

Cluster templates allow platform teams to define standardized cluster configurations that developers can use for self-service provisioning. Templates enforce organizational policies such as:

  • Kubernetes version constraints
  • Required CNI plugin (Calico, Cilium, Canal)
  • Node pool sizing and instance types
  • Network and security policies
  • Monitoring and logging stack enablement
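The constraints a template enforces end up as fields on the provisioned cluster object. A hedged sketch of what a Rancher-provisioned RKE2 cluster definition looks like (provisioning.cattle.io/v1 is Rancher's provisioning API; the name, version, and CNI values here are illustrative):

```yaml
# Sketch of a Rancher-provisioned RKE2 cluster (illustrative values).
apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
  name: team-alpha-dev
  namespace: fleet-default
spec:
  kubernetesVersion: v1.30.4+rke2r1     # version pinned by the template
  rkeConfig:
    machineGlobalConfig:
      cni: cilium                        # CNI required by the template
```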

Node drivers and machine drivers

Node drivers are plugins that allow Rancher to provision VMs on various infrastructure providers. Built-in drivers include AWS EC2, Azure, DigitalOcean, Harvester, and vSphere. Custom drivers can be added for other providers. Machine drivers use Rancher Machine (a maintained fork of the now-deprecated Docker Machine) under the hood to create VMs and prepare them for Kubernetes installation.

Cluster API (CAPI)

Rancher’s provisioning is moving toward Cluster API (CAPI) as the underlying framework, integrated via the Rancher Turtles operator. CAPI provides a Kubernetes-native, declarative way to create, configure, and manage clusters. This complements and will eventually replace the older node driver approach, aligning Rancher with the broader CNCF ecosystem. Node drivers remain fully supported for backward compatibility.

04

Authentication & RBAC

Rancher provides a centralized authentication and authorization layer that sits in front of all managed clusters. Instead of configuring RBAC independently on each cluster, you define roles and bindings once in Rancher and they are enforced everywhere.

Identity provider integration

Rancher supports multiple authentication backends:

  • Keycloak / OIDC — The recommended approach for enterprise environments. Rancher acts as an OIDC client, delegating authentication to Keycloak (or any OIDC provider)
  • SAML — Integration with ADFS, Okta, PingFederate, Shibboleth, and other SAML 2.0 providers
  • LDAP / Active Directory — Direct LDAP bind for organizations that haven’t adopted OIDC/SAML
  • GitHub / Google / Microsoft Entra ID (Azure AD) — OAuth-based authentication for development environments
  • Local — Built-in user database for bootstrap and emergency access

Rancher’s role model

Rancher defines roles at three hierarchical scopes:

Scope               | Description                                        | Example Roles
--------------------|----------------------------------------------------|-----------------------------------------------
Global              | Applies across the entire Rancher installation     | Administrator, Restricted Admin, Standard User
Cluster             | Applies to a specific cluster                      | Cluster Owner, Cluster Member, Cluster Viewer
Project / Namespace | Applies to a group of namespaces within a cluster  | Project Owner, Project Member, Read-Only

Mapping external groups to Rancher roles

When an external IdP is configured, you can map IdP groups directly to Rancher roles. For example:

  • LDAP group cn=platform-admins → Global Restricted Admin
  • LDAP group cn=team-alpha → Cluster Member on prod-cluster + Project Owner on alpha-project
  • OIDC group claim devops → Cluster Owner on all dev clusters
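Group mappings can also be declared as Rancher Custom Resources rather than clicked together in the UI. A hedged sketch of binding an LDAP group to a global role (management.cattle.io/v3 is Rancher's management API; the principal ID format varies by auth provider, so treat the value below as an example):

```yaml
# Sketch: bind an IdP group to a Rancher global role (illustrative values).
apiVersion: management.cattle.io/v3
kind: GlobalRoleBinding
metadata:
  name: platform-admins-restricted-admin
globalRoleName: restricted-admin
groupPrincipalName: openldap_group://cn=platform-admins,ou=groups,dc=example,dc=com
```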

Important

Rancher’s RBAC is enforced at the Rancher proxy layer. If users bypass Rancher and connect directly to a downstream cluster’s API server with a valid kubeconfig, Rancher RBAC is not enforced. For full security, restrict direct API server access via network policies and use Rancher as the sole entry point.

05

Catalogs & Apps

Rancher provides an app marketplace that simplifies deploying Helm charts across managed clusters. This is the primary mechanism for installing both infrastructure components (monitoring, logging, ingress controllers) and user applications.

Helm chart repositories

Rancher uses standard Helm chart repositories as its catalog backend. There are three types:

  • Built-in — Rancher ships with curated charts for monitoring, logging, Istio, OPA Gatekeeper, CIS benchmarks, and more
  • Partner — Charts from SUSE partners available through the Rancher marketplace
  • Custom — Add any Helm repository URL (public or private) to make its charts available in the Rancher UI

Deploying apps across clusters

From the Rancher UI, you can install a Helm chart on any managed cluster in a few clicks. Rancher proxies the Helm install through its API, so you don’t need direct kubectl access. For deploying the same app across multiple clusters, Rancher integrates with Fleet for GitOps-driven multi-cluster deployment.

Fleet for GitOps at scale

Fleet is Rancher’s built-in GitOps engine designed for managing deployments across large numbers of clusters. Instead of manually installing charts on each cluster, you define your desired state in a Git repository and Fleet ensures every targeted cluster converges to that state. See the next section for a deep dive.

Best Practice

Use Fleet for production workloads rather than manual Helm installs through the UI. The UI-based approach is convenient for one-off deployments and experimentation, but Fleet provides auditability, reproducibility, and drift detection that are essential for production operations.

06

Fleet

Fleet is a GitOps engine built into Rancher that enables continuous delivery across a large number of clusters. It watches Git repositories for changes and automatically deploys workloads to targeted clusters. Fleet was designed from the ground up for the multi-cluster use case — it can scale to manage thousands of clusters, making it ideal for edge deployments.

Core concepts

GitRepo

A Custom Resource that points to a Git repository (URL, branch, paths). Fleet watches this repo for changes. When a commit is detected, Fleet processes the contents and creates Bundles.

Bundle

The unit of deployment in Fleet. A Bundle contains the Kubernetes manifests, Helm charts, or Kustomize overlays that Fleet will apply to target clusters. Bundles are created automatically from GitRepo resources.

Cluster Groups & Labels

Fleet targets clusters using label selectors. You label your clusters (e.g., env=prod, region=eu) and then use selectors in your GitRepo to specify which clusters should receive the deployment.

BundleDeployment

Created by Fleet for each Bundle/cluster combination. It tracks the deployment status on each individual cluster — whether it’s ready, in progress, or has errors.

Example: multi-cluster deployment

# fleet.yaml - placed in your Git repo
defaultNamespace: my-app
helm:
  releaseName: my-app
  chart: ./charts/my-app
  values:
    replicaCount: 3

targetCustomizations:
- name: staging
  clusterSelector:
    matchLabels:
      env: staging
  helm:
    values:
      replicaCount: 1

- name: production
  clusterSelector:
    matchLabels:
      env: production
  helm:
    values:
      replicaCount: 5
      resources:
        requests:
          memory: "512Mi"
          cpu: "500m"

# GitRepo resource - registered in the Rancher local cluster
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: my-app
  namespace: fleet-default
spec:
  repo: https://github.com/org/my-app-deploy
  branch: main
  paths:
  - /
  targets:
  - clusterSelector:
      matchLabels:
        env: production
  - clusterSelector:
      matchLabels:
        env: staging

Scale

Fleet was designed with extreme scale in mind — SUSE has published scale tests simulating management of up to one million clusters for edge computing scenarios. It achieves this by batching operations and using an efficient reconciliation loop that avoids per-cluster API calls from the management plane.

07

Monitoring & Logging

Rancher provides integrated monitoring and logging stacks that can be deployed to any managed cluster with a single click from the Rancher UI. These are based on well-established open-source projects and are packaged as Helm charts maintained by the Rancher team.

Monitoring stack

Rancher’s monitoring solution is based on the Prometheus Operator and includes:

  • Prometheus — Metrics collection and storage with pre-configured scrape targets for Kubernetes components, nodes, and common workloads
  • Grafana — Dashboards for cluster health, node resources, pod metrics, and workload performance. Rancher ships with curated dashboards out of the box
  • Alertmanager — Alert routing and notification via email, Slack, PagerDuty, webhooks, and more
  • Node Exporter & kube-state-metrics — Exporters for OS-level and Kubernetes object metrics

Per-cluster vs global dashboards

Monitoring is deployed per cluster — each managed cluster gets its own Prometheus and Grafana instance. This ensures metrics stay local and avoids cross-cluster data transfer. For a global view, you can:

  • Use Thanos or Cortex to aggregate Prometheus metrics across clusters into a central query layer
  • Configure remote_write on each cluster’s Prometheus to push metrics to a central TSDB
  • Use the Rancher UI’s cluster switcher to navigate between per-cluster Grafana instances
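The remote_write option is configured through the monitoring Helm values. A hedged sketch (the chart structure follows kube-prometheus-stack conventions; the endpoint URL and Secret name are placeholders):

```yaml
# Sketch of rancher-monitoring Helm values enabling remote_write
# to a central TSDB (placeholder endpoint and credentials).
prometheus:
  prometheusSpec:
    remoteWrite:
    - url: https://central-tsdb.example.com/api/v1/write
      basicAuth:
        username:
          name: remote-write-creds    # Secret in the same namespace as Prometheus
          key: username
        password:
          name: remote-write-creds
          key: password
```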

Logging integration

Rancher’s logging integration is based on the Logging Operator (originally developed by Banzai Cloud, now maintained under the Kube Logging project). It deploys Fluent Bit as a DaemonSet on each node to collect and enrich logs with Kubernetes metadata, then forwards them to Fluentd (or syslog-ng) for filtering and routing. Supported outputs include:

  • Elasticsearch / OpenSearch
  • Splunk
  • Amazon CloudWatch
  • Azure Log Analytics
  • Syslog
  • Kafka
  • Custom HTTP endpoints

Alerting

Rancher exposes Alertmanager configuration through the UI, allowing operators to define alert rules and notification channels without editing YAML directly. Pre-built alert rules cover common scenarios:

  • Node not ready, high CPU/memory, disk pressure
  • Pod crash loops, OOMKills, pending pods
  • etcd health, API server latency, scheduler failures
  • Certificate expiration warnings
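Custom rules can be added alongside the pre-built ones. Since the stack is built on the Prometheus Operator, a PrometheusRule resource is picked up automatically — a hedged sketch (the rule name, namespace, and thresholds are examples; node_filesystem metrics come from the bundled Node Exporter):

```yaml
# Sketch of a custom alert rule deployed next to Rancher monitoring.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: extra-node-alerts
  namespace: cattle-monitoring-system
spec:
  groups:
  - name: node-disk
    rules:
    - alert: NodeRootDiskAlmostFull
      expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Root filesystem below 10% free on {{ $labels.instance }}"
```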

Resource Planning

Prometheus is memory-intensive. For production clusters with many pods, expect Prometheus to use 4-8 GB RAM with default retention (15 days). Adjust retention and retentionSize in the monitoring Helm values to control resource usage. Consider using Thanos with object storage for long-term retention instead of increasing local Prometheus storage.
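The retention knobs mentioned above live in the monitoring Helm values. A hedged sketch (names follow kube-prometheus-stack conventions; the sizes are examples to adapt to your cluster):

```yaml
# Sketch of monitoring Helm values tuning Prometheus retention and memory.
prometheus:
  prometheusSpec:
    retention: 7d            # time-based retention
    retentionSize: 40GiB     # size-based cap; whichever limit hits first wins
    resources:
      requests:
        memory: 4Gi
      limits:
        memory: 8Gi
```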

08

Backup & Disaster Recovery

The Rancher Backup Operator (rancher-backup) provides a Kubernetes-native way to back up and restore the Rancher server’s state. This is critical because losing the Rancher server means losing centralized management of all downstream clusters (though the clusters themselves continue to operate independently).

What gets backed up

  • All Rancher Custom Resources (clusters, projects, users, roles, tokens, settings)
  • Rancher-managed namespaces and their contents
  • Catalog/app configurations
  • Fleet GitRepo and Bundle resources

The backup operator does not back up downstream cluster workloads — those need their own backup strategy (e.g., Velero).
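For the downstream workloads themselves, a scheduled Velero backup is a common complement. A hedged sketch (velero.io/v1 Schedule; the namespaces, timing, and retention are examples):

```yaml
# Sketch of a Velero schedule for downstream workload backups (example values).
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-workloads
  namespace: velero
spec:
  schedule: "0 3 * * *"      # daily at 3 AM, offset from the Rancher backup
  template:
    includedNamespaces:
    - "*"
    excludedNamespaces:
    - cattle-system           # Rancher agent namespace; covered by rancher-backup
    ttl: 240h                 # keep backups for 10 days
```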

Backup configuration

apiVersion: resources.cattle.io/v1
kind: Backup
metadata:
  name: rancher-daily-backup
spec:
  resourceSetName: rancher-resource-set
  schedule: "0 2 * * *"           # Daily at 2 AM
  retentionCount: 10              # Keep last 10 backups
  storageLocation:
    s3:
      bucketName: rancher-backups
      region: us-east-1
      endpoint: s3.amazonaws.com
      credentialSecretName: s3-creds
      credentialSecretNamespace: cattle-system

Restoring Rancher

To restore onto a new cluster, provision a fresh RKE2 cluster, install the backup operator, and apply a Restore CR pointing to the backup location; then install Rancher via Helm using the original hostname. The operator reconciles all Rancher CRDs and resources, and downstream clusters reconnect automatically (agents reconnect via the same URL).

apiVersion: resources.cattle.io/v1
kind: Restore
metadata:
  name: rancher-restore
spec:
  backupFilename: rancher-daily-backup-2026-03-18T02-00-00Z.tar.gz
  storageLocation:
    s3:
      bucketName: rancher-backups
      region: us-east-1
      endpoint: s3.amazonaws.com
      credentialSecretName: s3-creds
      credentialSecretNamespace: cattle-system

Migrating Rancher to a new cluster

The backup/restore process doubles as a migration strategy. To move Rancher to a new cluster:

  1. Take a backup on the old Rancher server
  2. Provision a new RKE2 cluster and install the backup operator
  3. Apply the Restore CR pointing to the backup location
  4. Install Rancher via Helm with the same hostname and chart version
  5. Update DNS to point the Rancher hostname to the new cluster’s load balancer
  6. Downstream cluster agents will reconnect automatically

Critical

Always test your restore procedure in a non-production environment before you need it. A backup that has never been tested is not a backup. Schedule quarterly restore drills to validate your DR process end-to-end.

09

Upgrades

Upgrading a Rancher environment involves two independent concerns: upgrading the Rancher server itself and upgrading the downstream Kubernetes clusters it manages.

Upgrading Rancher server

Rancher server is deployed via Helm, so upgrades are a standard helm upgrade:

# 1. Back up Rancher before upgrading
kubectl apply -f rancher-backup.yaml

# 2. Update the Helm repo
helm repo update

# 3. Upgrade Rancher
helm upgrade rancher rancher-stable/rancher \
  --namespace cattle-system \
  --set hostname=rancher.example.com \
  --set replicas=3 \
  --version 2.13.3

  • Always take a backup before upgrading (use the Rancher Backup Operator)
  • Read the release notes — Rancher publishes detailed upgrade notes with known issues and breaking changes
  • Upgrade sequentially — Rancher recommends upgrading through each consecutive minor version. Skipping minor versions is not supported and increases the risk of issues due to accumulated changes
  • Upgrade the underlying RKE2 cluster if needed — check the Rancher support matrix for compatible Kubernetes versions

Upgrading downstream clusters

For Rancher-provisioned clusters (RKE2, K3s), upgrades are initiated from the Rancher UI or API:

  • Select the target Kubernetes version from the available list
  • Rancher orchestrates a rolling upgrade of control plane nodes first, then workers
  • For RKE2, the upgrade uses the System Upgrade Controller (SUC) to coordinate node-by-node upgrades
  • Configure drain settings (max unavailable, drain timeout) to control upgrade speed vs availability
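The drain and concurrency settings live on the provisioned cluster object. A hedged sketch of the relevant fragment (field names follow provisioning.cattle.io/v1; the values are examples):

```yaml
# Sketch of upgrade/drain settings on a Rancher-provisioned RKE2 cluster.
spec:
  rkeConfig:
    upgradeStrategy:
      controlPlaneConcurrency: "1"     # one control plane node at a time
      workerConcurrency: "10%"         # percentage of workers upgraded in parallel
      workerDrainOptions:
        enabled: true
        ignoreDaemonSets: true
        deleteEmptyDirData: true
        timeout: 300                    # seconds before the drain gives up
```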

For hosted clusters (EKS/AKS/GKE), Rancher calls the cloud provider’s API to trigger the managed upgrade process.

Upgrade strategy

Recommended Order

  1. Upgrade Rancher server to the latest patch
  2. Upgrade dev/staging downstream clusters
  3. Validate applications and monitoring
  4. Upgrade production downstream clusters during maintenance window
  5. Upgrade monitoring and logging stacks if needed

Common Pitfalls

  • Upgrading downstream clusters to a K8s version not yet supported by the current Rancher version
  • Not checking PodDisruptionBudgets before draining nodes
  • Skipping the backup step — rollback without a backup is painful
  • Upgrading all clusters simultaneously instead of rolling

10

Licensing

Rancher has a straightforward licensing model that separates the free open-source project from the commercial offering.

Rancher (Open Source)

The core Rancher project is 100% open source under the Apache 2.0 license. You can download, deploy, and use Rancher in production without any license key or subscription. All core features — multi-cluster management, RBAC, Fleet, monitoring integration, app catalog — are available for free.

Rancher Prime (SUSE Commercial)

Rancher Prime is SUSE’s commercially supported distribution of Rancher. As of 2025, it is priced per CPU core / vCPU (charged per pair of physical cores or per four vCPUs), replacing the previous per-node model. What you get with Rancher Prime:

  • Enterprise support — 24/7 support with SLAs from SUSE’s Kubernetes experts
  • Hardened images — SUSE-built, FIPS 140-2 validated container images with BoringCrypto
  • Extended lifecycle — Longer support windows for Rancher and RKE2 versions
  • Security patches — Priority access to CVE fixes and security advisories
  • SUSE Registry — Access to SUSE’s private container registry with verified images
  • UI extensions — Additional UI capabilities for enterprise governance
  • Integrations — Certified integrations with SUSE Linux Enterprise, NeuVector, and Longhorn
Feature                       | Open Source    | Rancher Prime
------------------------------|----------------|--------------
Multi-cluster management      | Yes            | Yes
Fleet GitOps                  | Yes            | Yes
RBAC & IdP integration        | Yes            | Yes
Enterprise support (24/7 SLA) | Community only | Yes
FIPS-validated images         | No             | Yes
Extended lifecycle support    | No             | Yes
NeuVector integration         | Community      | Certified

Pricing Note

Rancher Prime licensing changed in 2025 from a per-node model to a per-CPU/vCPU model (per pair of physical cores or per four vCPUs). This can significantly increase costs for deployments with high core counts. The Rancher server cluster itself is not counted. Contact SUSE for current pricing tiers, volume discounts, and academic/government programs.
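As a back-of-the-envelope illustration (the unit definition below is an assumption derived from the model described above — confirm the actual terms with SUSE), the subscription unit count for a fleet can be estimated as:

```shell
# Illustrative only: assumes 1 unit = 2 physical cores = 4 vCPUs, rounding up.
vcpus=96          # total vCPUs across managed nodes
phys_cores=48     # or count physical cores instead
units_by_vcpu=$(( (vcpus + 3) / 4 ))
units_by_core=$(( (phys_cores + 1) / 2 ))
echo "units (vCPU basis): $units_by_vcpu"
echo "units (core basis): $units_by_core"
```

For 96 vCPUs (or the equivalent 48 physical cores), both bases land on 24 units — a useful sanity check that the two counting methods are meant to be equivalent.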

11

Consultant’s Checklist

Use this checklist when planning, deploying, or auditing a Rancher environment.

Infrastructure

  • Dedicated 3-node RKE2 cluster for Rancher server (no user workloads)
  • Layer 4 load balancer in front of Rancher server nodes
  • Valid TLS certificate for rancher.example.com
  • DNS configured for Rancher hostname
  • Network connectivity: downstream clusters must reach Rancher on port 443
  • etcd backup schedule configured on the RKE2 cluster

Authentication

  • External IdP configured (Keycloak/OIDC recommended)
  • IdP group-to-role mappings defined and tested
  • Local admin account secured with strong password
  • Emergency break-glass procedure documented
  • Session timeout configured appropriately
  • API tokens have expiration dates set

Operations

  • Rancher Backup Operator installed and scheduled (daily)
  • Backup storage configured (S3-compatible recommended)
  • Restore procedure tested and documented
  • Upgrade runbook created with rollback steps
  • Monitoring deployed on all managed clusters
  • Alert channels configured (Slack, PagerDuty, email)

Security

  • Direct API server access restricted (force traffic through Rancher proxy)
  • Network policies isolating Rancher system namespaces
  • CIS benchmark scan run on RKE2 clusters
  • Pod Security Standards enforced (restricted profile for workloads)
  • Audit logging enabled on Rancher server
  • RBAC follows least-privilege principle

GitOps & Fleet

  • Fleet configured for application deployment (not manual UI installs)
  • Git repositories structured with fleet.yaml for target customizations
  • Cluster labels defined and documented (env, region, tier)
  • Fleet drift detection reviewed and remediation policy set
  • Secrets management strategy defined (Sealed Secrets, External Secrets Operator, or Vault)

Sizing Guide

Small (1-10 downstream clusters, < 50 nodes): 3-node RKE2, 4 vCPU / 8 GB RAM per node, 100 GB SSD.
Medium (10-50 clusters, 50-500 nodes): 3-node RKE2, 8 vCPU / 16 GB RAM per node, 200 GB SSD.
Large (50+ clusters, 500+ nodes): 3-node RKE2, 16 vCPU / 32 GB RAM per node, 500 GB SSD. Consider dedicated etcd nodes.