SUSE AI Production Guide
Private AI on Kubernetes — LLM inference, RAG, model management & GPU orchestration
Overview
SUSE AI is an enterprise platform for deploying and running generative AI and LLM workloads on Kubernetes with full data sovereignty. It packages battle-tested open-source AI tools — Ollama, vLLM, Open WebUI, Milvus, and more — into a curated, hardened stack that deploys via Helm on any SUSE-supported Kubernetes cluster.
The core problem SUSE AI solves is the private AI infrastructure gap. Organizations want to run LLMs on their own hardware for data privacy, regulatory compliance, and cost control, but stitching together GPU drivers, inference engines, vector databases, model serving, and observability on Kubernetes is complex. SUSE AI provides an opinionated, pre-integrated stack with enterprise support, hardened container images, and air-gapped deployment capability.
SUSE AI 1.0 was announced and made generally available at KubeCon NA 2024 (November 2024). Significant updates were added at SUSECON 2025 (MLflow, PyTorch, Pipelines) and KubeCon NA 2025 (vLLM, MCP Universal Proxy tech preview, virtual clusters GA).
What problems does SUSE AI solve?
- Data sovereignty — Run LLMs entirely on-premises or in your own cloud. No data leaves your infrastructure, meeting compliance requirements that rule out SaaS AI services
- Infrastructure complexity — Pre-integrates GPU drivers, inference engines, vector databases, chat UI, observability, and TLS into a single deployable stack
- Model management — Pull, serve, and switch between open-source models (Llama, Gemma, Mistral, etc.) without building custom serving infrastructure
- RAG at scale — Built-in vector database integration for Retrieval Augmented Generation — ground LLM responses in your organization’s documents
- GPU orchestration — Handles NVIDIA GPU scheduling, sharing, and monitoring on Kubernetes through the GPU Operator
- Supply chain trust — SUSE Application Collection provides signed, SBOM-tracked, SLSA Level 3 compliant container images with daily patches
Strengths
- Fully private — on-premises, air-gapped, or your cloud with no data egress
- Open-source core (Ollama, vLLM, Open WebUI, Milvus are all OSS projects)
- Pre-integrated stack — components are tested together and deployed via a single meta chart
- Hardened container images with signatures, SBOMs, daily CVE patches
- Built on the Rancher Prime ecosystem — leverages existing K8s management tooling
- Air-gapped deployment fully supported with offline mirroring scripts
- OpenTelemetry-native observability with AI-specific dashboards
Considerations
- NVIDIA GPUs only — no AMD or Intel GPU support in current release
- Requires SUSE subscription (Rancher Prime + SUSE AI entitlements)
- Young product (GA since Nov 2024) — some features still in tech preview
- Ollama API has no built-in authentication (ingress disabled by default)
- Milvus requires StorageClass with volume expansion — fails silently without it
- GPU hardware is expensive — sizing and cost planning essential
- MCP Universal Proxy and Liz AI assistant are tech preview only
Architecture
SUSE AI is a multi-layered stack built on top of the SUSE Rancher Prime ecosystem. Each layer builds on the one below it, from the operating system up through Kubernetes to the AI workloads themselves.
The four layers
Layer 1 — Operating System
SLES 15 SP6 (general purpose) or SLE Micro 6.1 (immutable, transactional). NVIDIA GPU drivers (G06 generation) are installed at this layer. SLE Micro is preferred for GPU worker nodes due to its smaller attack surface and atomic updates.
Layer 2 — Kubernetes
RKE2 (recommended) or K3s for the Kubernetes runtime, managed by Rancher Manager. RKE2’s containerd runtime is required for the NVIDIA GPU Operator’s device plugin to mount GPUs into containers.
Layer 3 — Infrastructure Services
NVIDIA GPU Operator for GPU scheduling, cert-manager (v1.17.2) for TLS, CSI storage drivers (Longhorn, NFS CSI, or cloud provider), NeuVector for runtime security, and SUSE Observability for monitoring.
Component interaction
Deployment model
All components are packaged as Helm charts distributed through the SUSE Application Collection OCI registry (dp.apps.rancher.io). A meta Helm chart called suse-ai-deployer orchestrates deployment of all components together. Individual charts can also be installed independently for customized deployments.
The deployer chart has six required dependencies: Milvus, Ollama, Open WebUI, Open WebUI MCPO, PyTorch, and vLLM. The deployer chart defaults to the suse-private-ai namespace; the manual per-component installs shown later in this guide use suseai.
For production, use a dedicated RKE2 cluster (or dedicated GPU node pool) for SUSE AI workloads. AI inference is resource-intensive and can starve other workloads of GPU, memory, and I/O. Separate the AI workload plane from general application workloads.
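One way to enforce that separation, sketched here with illustrative names, is to taint the GPU nodes and give only the AI charts a matching toleration. The taint key and node label below are assumptions, not values mandated by SUSE AI:

```yaml
# Keep general workloads off GPU nodes (node name is illustrative):
#   kubectl taint nodes gpu-worker-1 suse-ai/workload=ai:NoSchedule
# Then, in the AI charts' Helm values (key placement varies per chart):
tolerations:
  - key: suse-ai/workload
    operator: Equal
    value: ai
    effect: NoSchedule
nodeSelector:
  nvidia.com/gpu: "1"   # GPU node label used elsewhere in this guide
```

Most charts in the stack expose `tolerations` and `nodeSelector` values; check each component's values.yaml for the exact path.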
Inference Engines
SUSE AI bundles two inference engines that serve different use cases. You can run both simultaneously and route requests through LiteLLM as a unified gateway.
Ollama — Simplicity & Flexibility
Local LLM inference engine that handles model downloading, loading, and serving. Serves on port 11434. Optimized for single-user or low-concurrency workloads. Supports GGUF (quantized) and Safetensors model formats. Easy model management with `ollama pull` and `ollama run`.
- Best for: development, experimentation, small teams, edge deployments
- Model formats: GGUF (quantized models), Safetensors
- GPU: NVIDIA only in SUSE AI bundle
vLLM — High-Performance Serving
High-throughput inference engine using PagedAttention for efficient GPU memory management. Claims up to 80% reduction in GPU memory waste and up to 24x throughput improvement under high concurrency. Provides OpenAI-compatible API.
- Best for: production chatbots, copilots, high-concurrency API serving
- Continuous batching for dynamic request handling
- Native OpenAI API compatibility
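Because vLLM speaks the OpenAI chat-completions format, any OpenAI-style client works against it. A minimal sketch of the request shape, assuming a hypothetical in-cluster service URL and served model name:

```python
import json

# In-cluster endpoint and model name are illustrative assumptions
url = "http://vllm.suseai.svc.cluster.local:8000/v1/chat/completions"

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # hypothetical served model
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize our RKE2 upgrade policy."},
    ],
    "max_tokens": 256,
    "temperature": 0.2,
}

# POST this body with any HTTP client (curl, requests, or the openai SDK)
body = json.dumps(payload)
```

The same payload works unchanged against LiteLLM or any other OpenAI-compatible gateway, which is what makes vLLM a drop-in backend for existing tooling.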
Choosing between Ollama and vLLM
| Criteria | Ollama | vLLM |
|---|---|---|
| Primary use case | Dev/test, small teams, model experimentation | Production serving, high-concurrency APIs |
| Throughput | Good for single/few concurrent users | Optimized for many concurrent requests |
| Memory efficiency | Standard allocation | PagedAttention — up to 80% less waste |
| API compatibility | Ollama API + partial OpenAI compat | Full OpenAI-compatible API |
| Model management | Built-in pull/run/list commands | Requires pre-downloaded models |
| Batching | Sequential processing | Continuous batching |
| Setup complexity | Simple — single binary | More configuration required |
LiteLLM — unified API gateway
LiteLLM sits in front of both engines and provides a single OpenAI-compatible API endpoint that can route requests to Ollama, vLLM, or 100+ external LLM providers. It adds cost tracking, per-key/per-team access control, guardrails, load balancing, and logging.
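A minimal LiteLLM proxy configuration routing one alias to Ollama and one to vLLM might look like the sketch below. The model names and service URLs are illustrative; the `ollama/` and `hosted_vllm/` provider prefixes follow the LiteLLM documentation:

```yaml
# litellm config.yaml sketch — aliases and URLs are illustrative
model_list:
  - model_name: dev-llama              # served by Ollama
    litellm_params:
      model: ollama/llama3.1
      api_base: http://open-webui-ollama.suseai.svc.cluster.local:11434
  - model_name: prod-llama             # served by vLLM's OpenAI-compatible API
    litellm_params:
      model: hosted_vllm/meta-llama/Llama-3.1-8B-Instruct
      api_base: http://vllm.suseai.svc.cluster.local:8000/v1
```

Clients then call the LiteLLM endpoint with `model: dev-llama` or `model: prod-llama` and never need to know which engine answers.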
Supported models
Any model supported by Ollama or vLLM can be used. Common models documented in SUSE AI examples:
- Llama 3.1 / 3.2 (Meta) — `llama3.1`, `llama3.2:3b`
- Gemma 2B (Google) — `gemma:2b`
- Mistral / Mixtral (Mistral AI)
- Phi (Microsoft)
- Any GGUF-format model from Hugging Face or the Ollama library
RAG & Vector Databases
Retrieval Augmented Generation (RAG) allows the LLM to ground its responses in your organization’s documents. SUSE AI provides two vector database options for storing and searching document embeddings.
How RAG works in SUSE AI
- Document ingestion — Users upload documents through Open WebUI. Documents are chunked and passed through an embedding model
- Embedding — The default embedding model `sentence-transformers/all-MiniLM-L6-v2` converts text chunks into numerical vectors
- Storage — Vectors are stored in Milvus or Qdrant for efficient similarity search
- Query — When a user asks a question, the query is embedded and the vector DB finds the most semantically similar document chunks
- Augmented prompt — Retrieved chunks are injected into the LLM prompt as context, grounding the response in actual data
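The retrieve-then-augment flow above can be sketched in a few lines of plain Python. The bag-of-words "embedding" and cosine similarity below are toy stand-ins for all-MiniLM-L6-v2 and Milvus, used only to make the data flow concrete:

```python
import math
from collections import Counter

# Toy stand-in for the embedding model: a bag-of-words vector
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Steps 1-2: chunk documents and embed each chunk
chunks = [
    "RKE2 is the recommended Kubernetes distribution for SUSE AI.",
    "Milvus stores document embeddings for similarity search.",
    "The cafeteria opens at nine.",
]
index = [(c, embed(c)) for c in chunks]

# Steps 3-4: embed the query and retrieve the most similar chunk
question = "Which Kubernetes distribution is recommended?"
best = max(index, key=lambda item: cosine(embed(question), item[1]))[0]

# Step 5: inject the retrieved chunk into the LLM prompt as context
prompt = f"Context: {best}\n\nQuestion: {question}"
```

In the real stack, `embed` is the sentence-transformers model, `index` lives in Milvus, and the augmented `prompt` is what Open WebUI sends to Ollama or vLLM.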
Vector database options
Milvus (Primary)
Open-source vector database purpose-built for similarity search at scale. Deployed in cluster mode with etcd (1 replica), MinIO (4 replicas, distributed mode), and Kafka (3 brokers, 8 Gi storage each). Helm chart version 4.2.2. Serves on port 19530.
Pulsar is disabled by default. Milvus requires a StorageClass with `allowVolumeExpansion: true`, or the deployment will fail silently.
Qdrant (Alternative)
Lightweight vector database alternative. Simpler deployment footprint than Milvus (no Kafka/MinIO dependencies). Good for smaller deployments or when you want fewer moving parts. Trades some scalability for operational simplicity.
RAG configuration
# Open WebUI values.yaml for RAG with Milvus
open-webui:
  env:
    VECTOR_DB: "milvus"
    MILVUS_URI: "http://milvus.suseai.svc.cluster.local:19530"
    RAG_EMBEDDING_MODEL: "sentence-transformers/all-MiniLM-L6-v2"
Milvus with Longhorn storage requires a custom longhorn-xfs StorageClass. The Kafka brokers used by Milvus require XFS filesystem — the default Ext4 is incompatible and will cause data corruption. Create the StorageClass with mkfsParams: "-f" and fsType: "xfs" before deploying Milvus.
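A minimal sketch of such a StorageClass, assuming the standard Longhorn CSI provisioner and the `fsType`/`mkfsParams` parameters named above (verify replica counts and timeouts against your environment):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-xfs
provisioner: driver.longhorn.io
allowVolumeExpansion: true            # required by Milvus
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "30"
  fsType: "xfs"                       # Kafka brokers require XFS, not Ext4
  mkfsParams: "-f"
```

Apply this before deploying Milvus, and reference it from the Milvus chart's persistence settings.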
Open WebUI
Open WebUI is the user-facing chat interface for SUSE AI. It’s a self-hosted web application that provides a ChatGPT-like experience connecting to your local Ollama or vLLM backends. Exposed via HTTPS Ingress with TLS certificates managed by cert-manager.
Key capabilities
- Multi-model chat — Switch between available models mid-conversation. Configure default models per deployment
- RAG document upload — Upload PDFs, text files, and other documents directly in the chat UI for RAG-powered Q&A
- System prompts — Customize model behavior with system-level instructions
- Pipelines — Chain AI models with APIs and external tools for multi-step workflows (added at SUSECON 2025)
- User management — Built-in user roles (admin, user), configurable default role for new signups
- API access — Programmatic access in addition to the web UI
Configuration
# Open WebUI Helm values
open-webui:
  ollamaUrls:
    - http://open-webui-ollama.suseai.svc.cluster.local:11434
  ingress:
    enabled: true
    host: ai.example.com
    tls: true
  env:
    WEBUI_NAME: "SUSE AI"
    DEFAULT_MODELS: "gemma:2b"
    DEFAULT_USER_ROLE: "user"
    GLOBAL_LOG_LEVEL: INFO
Pipelines
Open WebUI Pipelines enable chaining models with external tools and APIs for agentic workflows. Combined with MCPO (the MCP-to-OpenAPI proxy), this allows the LLM to call external REST APIs, query databases, or invoke custom business logic as part of its reasoning chain.
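A pipeline is a small Python class that Open WebUI loads and exposes as a selectable model. The sketch below follows the upstream open-webui/pipelines scaffold; treat the exact `pipe()` signature as an assumption to check against your chart version, and the ticket lookup as purely hypothetical business logic:

```python
from typing import Iterator, List, Union

class Pipeline:
    def __init__(self):
        # Name shown in the Open WebUI model picker
        self.name = "Ticket Lookup"

    def pipe(
        self, user_message: str, model_id: str, messages: List[dict], body: dict
    ) -> Union[str, Iterator[str]]:
        # A real pipeline would call an external API here (for example via
        # MCPO) and feed the result back into the conversation
        if user_message.startswith("#ticket"):
            ticket_id = user_message.split()[-1]
            return f"Looked up {ticket_id} in the ticket system."
        return user_message
```

Pipelines run in a separate container, so heavyweight dependencies (database drivers, SDKs) stay out of the Open WebUI image.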
Open WebUI Helm chart version in current documentation: 5.16.0 from oci://dp.apps.rancher.io/charts/open-webui. The chart bundles Ollama as a subchart — you can enable/disable Ollama within the same release.
GPU & Hardware
LLM inference is GPU-intensive. SUSE AI currently supports NVIDIA GPUs only, using the NVIDIA GPU Operator to bridge host GPU drivers into Kubernetes containers.
Supported GPUs
- Data center — A100, H100, H200, A10, L40S, V100, Tesla
- Consumer/workstation — RTX 30 series and newer
- Driver generation: G06 (driver version 550.x+, CUDA 12.3+)
Infrastructure requirements
| Component | Minimum | Recommended |
|---|---|---|
| Control plane CPU | 4 cores | 8+ cores (16+ for HA) |
| Control plane RAM | 8 GB | 16 GB+ (32 GB+ for HA) |
| GPU worker RAM | 16 GB | 32 GB+ for larger models |
| Disk | 50 GB SSD | 100 GB+ NVMe SSD |
| Kubernetes | 1.18+ | RKE2 latest stable |
| Nodes (production) | 3 control plane + 1 GPU worker | 3 CP + 2+ GPU workers |
| Model storage | 4 GB (small models) | 100 GB+ (multiple large models) |
GPU Operator setup on RKE2
# Install NVIDIA GPU drivers on the host OS (SLES 15 SP6)
sudo zypper install nvidia-open-driver-G06-signed-kmp-default
# Verify GPU is visible
nvidia-smi
# Should show: Driver Version 550.x, CUDA 12.3+
# Install GPU Operator on the RKE2 cluster
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
-n gpu-operator --create-namespace \
--set driver.enabled=false \
--set toolkit.env[0].name=CONTAINERD_CONFIG \
--set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl \
--set toolkit.env[1].name=CONTAINERD_SOCKET \
--set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
--set toolkit.env[2].value=nvidia
# Verify GPU nodes are labeled
kubectl get nodes -l nvidia.com/gpu=1
GPU drivers must be installed on the host OS, not inside containers. The GPU Operator then makes those drivers available to containers via the device plugin. Set driver.enabled=false in the GPU Operator Helm values since the driver is already on the host. The CONTAINERD_CONFIG path is specific to RKE2 — it differs from standard containerd installations.
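To confirm end-to-end GPU access after the Operator settles, a throwaway pod that runs nvidia-smi through the nvidia runtime class is a quick smoke test. The CUDA image tag below is illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia            # matches CONTAINERD_RUNTIME_CLASS
  containers:
    - name: nvidia-smi
      image: nvcr.io/nvidia/cuda:12.3.2-base-ubuntu22.04   # illustrative tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

If `kubectl logs gpu-smoke-test` shows the same driver table you saw on the host, the device plugin, runtime class, and containerd wiring are all working.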
Virtual clusters for GPU sharing
SUSE AI supports virtual clusters (GA) for sharing GPU resources across teams. Multiple virtual clusters run on the same physical host cluster, with a single GPU scheduler managing allocation. On Harvester, you can also use PCI passthrough for dedicated GPU access or vGPU with Multi-Instance GPU (MIG) partitioning on A100/H100/H200 GPUs.
Set the CPU scaling governor to performance on GPU worker nodes for optimal AI workload throughput. The default powersave governor can significantly reduce inference speed.
Installation
SUSE AI is deployed via Helm charts from the SUSE Application Collection OCI registry. You need an active SUSE subscription with Rancher Prime and SUSE AI entitlements, and service account credentials from SUSE Customer Center (SCC).
Prerequisites
- RKE2 cluster with Ingress controller and GPU worker nodes
- NVIDIA GPU Operator installed (see GPU section above)
- Helm 3 CLI
- SCC service account credentials for `dp.apps.rancher.io`
- StorageClass with volume expansion support (for Milvus)
- DNS-resolvable hostname for the Open WebUI Ingress
Step-by-step deployment
# 1. Create namespace and registry secret
kubectl create namespace suseai
kubectl create secret docker-registry application-collection \
--docker-server=dp.apps.rancher.io \
--docker-username=<SCC_USERNAME> \
--docker-password=<SCC_TOKEN> \
-n suseai
helm registry login dp.apps.rancher.io \
-u <SCC_USERNAME> -p <SCC_TOKEN>
# 2. Install cert-manager (if not already installed)
kubectl create namespace cert-manager
kubectl create secret docker-registry application-collection \
--docker-server=dp.apps.rancher.io \
--docker-username=<SCC_USERNAME> \
--docker-password=<SCC_TOKEN> \
-n cert-manager
helm upgrade --install cert-manager \
oci://dp.apps.rancher.io/charts/cert-manager \
-n cert-manager \
--set "global.imagePullSecrets[0].name=application-collection" \
--set crds.enabled=true
# 3. Install Milvus (vector database)
helm upgrade --install milvus \
oci://dp.apps.rancher.io/charts/milvus \
-n suseai --version 4.2.2 \
-f customvalues-milvus.yaml
# 4. Install Open WebUI + Ollama
helm upgrade --install open-webui \
oci://dp.apps.rancher.io/charts/open-webui \
-n suseai --version 5.16.0 \
-f customvalues-owui.yaml
# 5. Verify deployment
kubectl get pods -n suseai
kubectl get ingress -n suseai
Meta deployer chart (alternative)
Instead of installing components individually, use the suse-ai-deployer meta chart to deploy everything at once:
# Deploy the full SUSE AI stack
helm upgrade --install suse-ai \
oci://dp.apps.rancher.io/charts/suse-ai-deployer \
--namespace suse-private-ai --create-namespace \
--values ./custom-overrides.yaml
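Helm nests overrides for chart dependencies under each dependency's name, so custom-overrides.yaml generally carries one block per component. The keys below are a sketch to verify against the deployer chart's own values.yaml before use:

```yaml
# custom-overrides.yaml sketch — confirm exact keys in the chart's values.yaml
global:
  imagePullSecrets:
    - name: application-collection
open-webui:
  ingress:
    host: ai.example.com
ollama:
  enabled: true
vllm:
  enabled: false      # skip vLLM if Ollama alone is sufficient
```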
TLS options
| Option | global.tls.source | cert-manager needed |
|---|---|---|
| Self-signed (default) | suse-private-ai | Yes |
| Let’s Encrypt | letsEncrypt | Yes |
| Bring your own cert | secret | No |
Air-gapped deployment
SUSE AI fully supports air-gapped environments. Three scripts handle the offline workflow:
- `SUSE-AI-mirror-nvidia.sh` — Mirrors NVIDIA RPM packages from a connected host
- `SUSE-AI-get-images.sh` — Downloads all SUSE AI container images
- `SUSE-AI-load-images.sh` — Loads images into a private local registry
Registry secrets must be created per namespace. Kubernetes cannot reference image pull secrets from other namespaces. If you deploy cert-manager and SUSE AI in separate namespaces, create the application-collection secret in both.
Observability
SUSE AI uses OpenTelemetry for instrumentation and integrates with SUSE Observability for unified metrics, logs, and traces. The OpenTelemetry Operator enables auto-instrumentation of Python, Java, and Go services with zero code changes.
What’s monitored out of the box
- LLM metrics — Token usage (input, output, reasoning), cost tracking, latency per model, throughput (tokens/sec)
- GPU metrics — Utilization, temperature, power draw, memory usage via NVIDIA DCGM Exporter
- Inference engine health — Ollama and vLLM request rates, error rates, queue depth
- Vector database — Milvus query latency, index size, memory consumption
- Application traces — End-to-end request tracing from Open WebUI through inference to vector search
Pre-built dashboards
LLM Cost & Tokens
Token consumption by model, user, and team. Input vs output vs reasoning token breakdown. Cost estimation based on configurable per-token rates.
LLM Performance
Inference latency (p50, p95, p99), throughput, time to first token, queue wait times. Comparison across models.
GPU Performance
Per-GPU utilization, memory usage, temperature, power draw. NVIDIA DCGM Exporter metrics. Helps right-size GPU allocation.
VectorDB Performance
Milvus/Qdrant query latency, index operations, memory consumption, segment health.
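The cost figures on the LLM Cost & Tokens dashboard are simple arithmetic over token counts. A worked example with hypothetical per-million-token rates (the dashboard makes these configurable):

```python
# Hypothetical rates in currency units per 1M tokens
RATES = {"input": 0.50, "output": 1.50}

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (
        input_tokens * RATES["input"] + output_tokens * RATES["output"]
    ) / 1_000_000

# 2,000 prompt tokens + 500 completion tokens
print(round(request_cost(2_000, 500), 6))  # → 0.00175
```

Summing this per request, grouped by model, user, or team, yields the dashboard's cost breakdowns.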
LLM drift detection
The observability stack includes a drift dashboard that tracks changes in model response patterns over time. This helps detect when model behavior shifts due to updates, prompt changes, or RAG document modifications — critical for compliance-sensitive deployments.
The observability integration uses SUSE Observability’s Time Machine feature for historical analysis. When investigating an incident, you can “travel back in time” to see the exact state of the AI stack at the moment the issue occurred — including GPU metrics, model load, and active queries.
Security & Authentication
SUSE AI provides multiple authentication mechanisms for Open WebUI and includes guardrail capabilities for responsible AI use.
Authentication methods
Built-in Password Auth
Default authentication method. First user to sign up becomes admin. Subsequent users get the role configured in DEFAULT_USER_ROLE.
LDAP / Active Directory
Integration via ENABLE_LDAP and ENABLE_LDAP_GROUP_MANAGEMENT environment variables. Maps LDAP groups to Open WebUI roles.
OAuth 2.0 / OIDC SSO
Any OIDC-compatible identity provider (Keycloak, Azure AD, Okta, etc.). Supports multi-tenant SSO with per-tenant provider configuration.
SAML & Trusted Headers
SAML 2.0 support for enterprise IdPs. Trusted header auth for reverse proxy deployments where authentication is handled upstream.
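As an example, LDAP is wired up through environment variables in the Open WebUI Helm values. The two ENABLE_* flags are the ones named above; the connection settings are illustrative names to verify against the Open WebUI documentation for your chart version:

```yaml
open-webui:
  env:
    ENABLE_LDAP: "true"
    ENABLE_LDAP_GROUP_MANAGEMENT: "true"
    # Connection settings below are illustrative — confirm exact variable
    # names in the Open WebUI docs before deploying
    LDAP_SERVER_HOST: "ldap.example.com"
    LDAP_SERVER_PORT: "636"
    LDAP_SEARCH_BASE: "ou=users,dc=example,dc=com"
```

When SSO or LDAP is active, consider disabling default password signup so all accounts flow through the identity provider.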
RBAC & multi-tenancy
- Role-based access control — Model-level permissions control which users/groups can access which models
- Workspace isolation — Multi-tenant workspace separation within a single Open WebUI instance
- API keys — Per-user API keys for programmatic access
- LiteLLM access control — Per-key and per-team access control, budget caps, and token limits at the API gateway level
Guardrails
SUSE AI includes a blueprint for implementing guardrail technology and has partnered with Infosys for their Responsible AI framework (“Scan, Shield, Steer”). Guardrails can enforce content filtering, prompt moderation, and compliance policies on both inputs and outputs.
Supply chain security
- All container images from SUSE Application Collection are signed and include SBOMs
- SLSA Level 3 build provenance compliance
- Daily CVE patches to container images
- NeuVector integration for runtime network monitoring (L2-3 and L7), vulnerability scanning, and automated security policy generation
Ollama’s API endpoint has no built-in authentication. By default, SUSE AI disables the Ollama Ingress to prevent unauthenticated external access. If you need to expose Ollama directly, place an authenticating reverse proxy in front of it. All user-facing access should go through Open WebUI, which handles authentication.
SUSE Ecosystem Integration
SUSE AI is designed to plug into the broader SUSE product portfolio. It leverages existing infrastructure rather than requiring a greenfield deployment.
Rancher Manager
SUSE AI components are available in the Rancher Apps catalog. Rancher UI shows GPU node labels, manages cluster lifecycle, and provides app deployment workflows. Observability integrates as a Rancher UI extension.
SUSE Application Collection
All Helm charts and container images distributed via dp.apps.rancher.io OCI registry. Curated, hardened, and signed. The single source of truth for SUSE AI artifacts.
Harvester (SUSE Virtualization)
Supports PCI passthrough for dedicated GPU access to VMs, vGPU with MIG partitioning (A100, H100, H200), and SR-IOV for GPU sharing. Enables running SUSE AI in VMs with direct GPU access on Harvester HCI clusters.
Longhorn (SUSE Storage)
Tested as persistent storage backend. Requires a custom longhorn-xfs StorageClass for Kafka/Milvus components. XFS filesystem required — Ext4 is incompatible.
NeuVector (SUSE Security)
Network monitoring, CVE scanning, and compliance enforcement for the AI stack. Auto-discovers container communication patterns and generates network policies.
Fleet (GitOps)
GitOps-driven deployment and lifecycle management for SUSE AI components across multiple clusters. Define your AI stack as code and deploy consistently.
Liz — AI-powered Kubernetes assistant
Liz is a tech preview AI agent that runs as a Rancher Prime UI extension. It provides context-aware Kubernetes management assistance — proactive issue detection, performance optimization suggestions, and natural language cluster operations. Powered by Ollama (on-prem) or Amazon Bedrock (AWS). Available as a tech preview since KubeCon NA 2025.
Licensing
SUSE AI requires an active SUSE subscription. While the underlying components are all open-source projects, the SUSE packaging, hardened images, registry access, and enterprise support require a paid subscription.
What you need
- Rancher Prime entitlement — For the Kubernetes management layer
- SUSE AI entitlement — For access to AI-specific charts and images from the Application Collection
- Both entitlements are managed through SUSE Customer Center (SCC)
Subscription model
- Consumption-based billing — SUSE invoices based on consumption of “Units” as defined in the SUSE AI Supplemental Terms
- Available in 1-year, 3-year, and 5-year terms
- Governed by SUSE AI Supplemental Terms (separate from the main subscription agreement)
Open-source components
| Component | License | Upstream project |
|---|---|---|
| Ollama | MIT | ollama/ollama |
| vLLM | Apache 2.0 | vllm-project/vllm |
| Open WebUI | MIT | open-webui/open-webui |
| Milvus | Apache 2.0 | milvus-io/milvus |
| LiteLLM | MIT | BerriAI/litellm |
| MLflow | Apache 2.0 | mlflow/mlflow |
| suse-ai-deployer | Apache 2.0 | SUSE/suse-ai-deployer |
You can run all the upstream open-source components yourself without a SUSE subscription. What the subscription buys you is: hardened container images with daily patches, SLSA Level 3 supply chain compliance, tested component integration, air-gapped deployment support, SUSE enterprise support, and the AI-specific observability dashboards.
Deployment Checklist
Pre-deployment
- Confirm SUSE subscription includes both Rancher Prime and SUSE AI entitlements
- Create SCC service account credentials for `dp.apps.rancher.io` access
- Provision GPU worker nodes with NVIDIA GPUs (A100/H100 for production, consumer GPUs for dev)
- Install NVIDIA GPU drivers (G06 generation) on host OS
- Deploy RKE2 cluster with 3+ control plane nodes and 1+ GPU worker nodes
- Verify `nvidia-smi` works on GPU nodes
- Install NVIDIA GPU Operator with RKE2-specific containerd paths
- Confirm nodes labeled with `nvidia.com/gpu=1`
- Ensure StorageClass supports volume expansion (critical for Milvus)
- If using Longhorn: create `longhorn-xfs` StorageClass for Kafka/Milvus
- Set CPU scaling governor to `performance` on GPU workers
- Configure DNS for Open WebUI hostname
Deployment
- Create namespaces and `application-collection` registry secrets in each namespace
- Install cert-manager with CRDs enabled
- Choose TLS strategy: self-signed, Let’s Encrypt, or bring-your-own
- Deploy Milvus (or Qdrant) for RAG — verify pods are running and PVCs are bound
- Deploy Open WebUI + Ollama via Helm — verify Ingress is accessible
- Pull at least one model: `kubectl exec` into the Ollama pod and run `ollama pull gemma:2b`
- Test chat through Open WebUI — verify model responds
- Test RAG: upload a document, ask a question about it, verify grounded response
Post-deployment
- Configure authentication (LDAP, OIDC, or SAML) — disable default password signup if using SSO
- Set up RBAC and model-level permissions
- Deploy OpenTelemetry Operator for observability
- Verify GPU metrics flowing to monitoring stack (DCGM Exporter)
- Set up LiteLLM if you need a unified API gateway or multiple backends
- Configure backup strategy for Milvus data and Open WebUI state
- Document model selection rationale and GPU sizing decisions
- Plan upgrade strategy — SUSE AI charts receive regular updates