SUSE AI Production Guide

Private AI on Kubernetes — LLM inference, RAG, model management & GPU orchestration

01

Overview

SUSE AI is an enterprise platform for deploying and running generative AI and LLM workloads on Kubernetes with full data sovereignty. It packages battle-tested open-source AI tools — Ollama, vLLM, Open WebUI, Milvus, and more — into a curated, hardened stack that deploys via Helm on any SUSE-supported Kubernetes cluster.

The core problem SUSE AI solves is the private AI infrastructure gap. Organizations want to run LLMs on their own hardware for data privacy, regulatory compliance, and cost control, but stitching together GPU drivers, inference engines, vector databases, model serving, and observability on Kubernetes is complex. SUSE AI provides an opinionated, pre-integrated stack with enterprise support, hardened container images, and air-gapped deployment capability.

SUSE AI 1.0 was announced and made generally available at KubeCon NA 2024 (November 2024). Significant updates were added at SUSECON 2025 (MLflow, PyTorch, Pipelines) and KubeCon NA 2025 (vLLM, MCP Universal Proxy tech preview, virtual clusters GA).

What problems does SUSE AI solve?

  • Data sovereignty — Run LLMs entirely on-premises or in your own cloud. No data leaves your infrastructure, meeting compliance requirements that rule out SaaS AI services
  • Infrastructure complexity — Pre-integrates GPU drivers, inference engines, vector databases, chat UI, observability, and TLS into a single deployable stack
  • Model management — Pull, serve, and switch between open-source models (Llama, Gemma, Mistral, etc.) without building custom serving infrastructure
  • RAG at scale — Built-in vector database integration for Retrieval Augmented Generation — ground LLM responses in your organization’s documents
  • GPU orchestration — Handles NVIDIA GPU scheduling, sharing, and monitoring on Kubernetes through the GPU Operator
  • Supply chain trust — SUSE Application Collection provides signed, SBOM-tracked, SLSA Level 3 compliant container images with daily patches

Strengths

  • Fully private — on-premises, air-gapped, or your cloud with no data egress
  • Open-source core (Ollama, vLLM, Open WebUI, Milvus are all OSS projects)
  • Pre-integrated stack — components are tested together and deployed via a single meta chart
  • Hardened container images with signatures, SBOMs, daily CVE patches
  • Built on the Rancher Prime ecosystem — leverages existing K8s management tooling
  • Air-gapped deployment fully supported with offline mirroring scripts
  • OpenTelemetry-native observability with AI-specific dashboards

Considerations

  • NVIDIA GPUs only — no AMD or Intel GPU support in current release
  • Requires SUSE subscription (Rancher Prime + SUSE AI entitlements)
  • Young product (GA since Nov 2024) — some features still in tech preview
  • Ollama API has no built-in authentication (ingress disabled by default)
  • Milvus requires StorageClass with volume expansion — fails silently without it
  • GPU hardware is expensive — sizing and cost planning essential
  • MCP Universal Proxy and Liz AI assistant are tech preview only

02

Architecture

SUSE AI is a multi-layered stack built on top of the SUSE Rancher Prime ecosystem. Each layer builds on the one below it, from the operating system up through Kubernetes to the AI workloads themselves.

The four layers

Layer 1 — Operating System

SLES 15 SP6 (general purpose) or SLE Micro 6.1 (immutable, transactional). NVIDIA GPU drivers (G06 generation) are installed at this layer. SLE Micro is preferred for GPU worker nodes due to its smaller attack surface and atomic updates.

Layer 2 — Kubernetes

RKE2 (recommended) or K3s for the Kubernetes runtime, managed by Rancher Manager. RKE2’s containerd runtime is required for the NVIDIA GPU Operator’s device plugin to mount GPUs into containers.

Layer 3 — Infrastructure Services

NVIDIA GPU Operator for GPU scheduling, cert-manager (v1.17.2) for TLS, CSI storage drivers (Longhorn, NFS CSI, or cloud provider), NeuVector for runtime security, and SUSE Observability for monitoring.

Layer 4 — AI Workloads

The AI components themselves, deployed as Helm charts: Ollama, vLLM, Open WebUI, Milvus, Qdrant, LiteLLM, MCPO, MLflow, PyTorch, and OpenSearch.

Component interaction

User (Browser)
      |
      |  HTTPS (TLS via cert-manager)
      v
+------------------+      +-------------------+      +-------------------+
|    Open WebUI    |----->|   Ollama / vLLM   |      |      Milvus       |
|    (Chat UI)     |      |  (LLM Inference)  |      |    (Vector DB)    |
|    Port 8080     |      |    Port 11434     |      |    Port 19530     |
+------------------+      +-------------------+      +-------------------+
      |                            |                           ^
      |          GPU access via    |       Embedding vectors   |
      |    NVIDIA Device Plugin    |       for RAG queries     |
      |                            v                           |
      |                  +-------------------+        +----------------+
      |                  |   NVIDIA GPU      |        |   Embedding    |
      |                  |   Operator        |        |   Model        |
      |                  +-------------------+        +----------------+
      |
      +-----> LiteLLM (API Gateway) ---------> Multiple LLM backends
      +-----> MCPO (MCP-to-OpenAPI proxy) ---> External tool APIs

Deployment model

All components are packaged as Helm charts distributed through the SUSE Application Collection OCI registry (dp.apps.rancher.io). A meta Helm chart called suse-ai-deployer orchestrates deployment of all components together. Individual charts can also be installed independently for customized deployments.

The deployer chart has six required dependencies: Milvus, Ollama, Open WebUI, Open WebUI MCPO, PyTorch, and vLLM. Default namespace is suse-private-ai or suseai.
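
For customized deployments, component-level settings flow through a custom-overrides.yaml passed to the deployer chart. A hedged sketch — the top-level keys mirror the subchart names, but the exact value paths below are assumptions to verify against the chart's default values:

```yaml
# Hypothetical overrides for the suse-ai-deployer meta chart.
# Key names follow the subchart names; verify each path against
# the chart's default values before use.
global:
  imagePullSecrets:
    - application-collection
open-webui:
  ingress:
    host: ai.example.com
ollama:
  models:
    pull:
      - gemma:2b     # pre-pull a small model at deploy time
vllm:
  enabled: true
milvus:
  enabled: true
```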

Recommendation

For production, use a dedicated RKE2 cluster (or dedicated GPU node pool) for SUSE AI workloads. AI inference is resource-intensive and can starve other workloads of GPU, memory, and I/O. Separate the AI workload plane from general application workloads.

03

Inference Engines

SUSE AI bundles two inference engines that serve different use cases. You can run both simultaneously and route requests through LiteLLM as a unified gateway.

Ollama — Simplicity & Flexibility

Local LLM inference engine that handles model downloading, loading, and serving. Serves on port 11434. Optimized for single-user or low-concurrency workloads. Supports GGUF (quantized) and Safetensors model formats. Easy model management with ollama pull and ollama run.

  • Best for: development, experimentation, small teams, edge deployments
  • Model formats: GGUF (quantized models), Safetensors
  • GPU: NVIDIA only in SUSE AI bundle

vLLM — High-Performance Serving

High-throughput inference engine using PagedAttention for efficient GPU memory management. Claims up to 80% reduction in GPU memory waste and up to 24x throughput improvement under high concurrency. Provides OpenAI-compatible API.

  • Best for: production chatbots, copilots, high-concurrency API serving
  • Continuous batching for dynamic request handling
  • Native OpenAI API compatibility
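
Because vLLM exposes an OpenAI-compatible API, any OpenAI client can talk to it by pointing at the vLLM Service. A minimal stdlib-only sketch of the request shape — the in-cluster hostname and port 8000 are assumptions for your deployment:

```python
import json
import urllib.request

# Assumed in-cluster vLLM Service hostname; adjust to your deployment.
VLLM_URL = "http://vllm.suseai.svc.cluster.local:8000/v1/chat/completions"

def build_chat_request(model: str, user_message: str) -> bytes:
    """Build an OpenAI-style chat completion payload for vLLM."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 256,
    }
    return json.dumps(body).encode("utf-8")

def chat(model: str, user_message: str) -> str:
    """POST the payload and return the first completion's text."""
    req = urllib.request.Request(
        VLLM_URL,
        data=build_chat_request(model, user_message),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

The same payload works against LiteLLM or any other OpenAI-compatible endpoint by changing the URL.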

Choosing between Ollama and vLLM

Criteria | Ollama | vLLM
Primary use case | Dev/test, small teams, model experimentation | Production serving, high-concurrency APIs
Throughput | Good for single/few concurrent users | Optimized for many concurrent requests
Memory efficiency | Standard allocation | PagedAttention — up to 80% less waste
API compatibility | Ollama API + partial OpenAI compat | Full OpenAI-compatible API
Model management | Built-in pull/run/list commands | Requires pre-downloaded models
Batching | Sequential processing | Continuous batching
Setup complexity | Simple — single binary | More configuration required

LiteLLM — unified API gateway

LiteLLM sits in front of both engines and provides a single OpenAI-compatible API endpoint that can route requests to Ollama, vLLM, or 100+ external LLM providers. It adds cost tracking, per-key/per-team access control, guardrails, load balancing, and logging.
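
A hedged sketch of a LiteLLM proxy config routing one model alias to Ollama and another to vLLM; the in-cluster hostnames and the vLLM model name are assumptions for your deployment:

```yaml
# Hypothetical LiteLLM proxy config: two aliases, two backends.
# Verify provider prefixes and hostnames against your environment.
model_list:
  - model_name: chat-dev          # routed to Ollama
    litellm_params:
      model: ollama/llama3.1
      api_base: http://open-webui-ollama.suseai.svc.cluster.local:11434
  - model_name: chat-prod         # routed to vLLM's OpenAI endpoint
    litellm_params:
      model: hosted_vllm/meta-llama/Llama-3.1-8B-Instruct
      api_base: http://vllm.suseai.svc.cluster.local:8000/v1
```

Clients then call the LiteLLM endpoint with model set to chat-dev or chat-prod, and the gateway handles routing, logging, and budget enforcement.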

Supported models

Any model supported by Ollama or vLLM can be used. Common models documented in SUSE AI examples:

  • Llama 3.1 / 3.2 (Meta) — llama3.1, llama3.2:3b
  • Gemma 2B (Google) — gemma:2b
  • Mistral / Mixtral (Mistral AI)
  • Phi (Microsoft)
  • Any GGUF-format model from Hugging Face or Ollama library

04

RAG & Vector Databases

Retrieval Augmented Generation (RAG) allows the LLM to ground its responses in your organization’s documents. SUSE AI provides two vector database options for storing and searching document embeddings.

How RAG works in SUSE AI

  1. Document ingestion — Users upload documents through Open WebUI. Documents are chunked and passed through an embedding model
  2. Embedding — The default embedding model sentence-transformers/all-MiniLM-L6-v2 converts text chunks into numerical vectors
  3. Storage — Vectors are stored in Milvus or Qdrant for efficient similarity search
  4. Query — When a user asks a question, the query is embedded and the vector DB finds the most semantically similar document chunks
  5. Augmented prompt — Retrieved chunks are injected into the LLM prompt as context, grounding the response in actual data
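
The five steps can be sketched end to end. This toy version is stdlib-only and replaces the real embedding model and Milvus with a hash-based embedding and an in-memory list, purely to illustrate the retrieve-then-augment control flow:

```python
import hashlib
import math

def embed(text: str, dim: int = 16) -> list[float]:
    """Toy stand-in for all-MiniLM-L6-v2: hash character trigrams into a unit vector."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))

# Steps 1-3. Ingest: chunk documents and store (chunk, vector) pairs.
store = []
for chunk in ["Milvus serves on port 19530.", "Ollama serves on port 11434."]:
    store.append((chunk, embed(chunk)))

# Step 4. Query: embed the question, rank stored chunks by similarity.
question = "Which port does Milvus use?"
qv = embed(question)
best_chunk = max(store, key=lambda item: cosine(qv, item[1]))[0]

# Step 5. Augment: inject the retrieved chunk into the LLM prompt as context.
prompt = f"Context: {best_chunk}\n\nQuestion: {question}"
```

In the real stack, embed() is the sentence-transformers model, the store is a Milvus collection queried over port 19530, and the augmented prompt goes to Ollama or vLLM.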

Vector database options

Milvus (Primary)

Open-source vector database purpose-built for similarity search at scale. Deployed in cluster mode with etcd (1 replica), MinIO (4 replicas, distributed mode), and Kafka (3 brokers, 8 Gi storage each). Helm chart version 4.2.2. Serves on port 19530.

Pulsar is disabled by default. Requires a StorageClass with allowVolumeExpansion enabled, or the deployment will fail silently.

Qdrant (Alternative)

Lightweight vector database alternative. Simpler deployment footprint than Milvus (no Kafka/MinIO dependencies). Good for smaller deployments or when you want fewer moving parts. Trades some scalability for operational simplicity.

RAG configuration

# Open WebUI values.yaml for RAG with Milvus
open-webui:
  env:
    VECTOR_DB: "milvus"
    MILVUS_URI: "http://milvus.suseai.svc.cluster.local:19530"
    RAG_EMBEDDING_MODEL: "sentence-transformers/all-MiniLM-L6-v2"

Warning

Milvus with Longhorn storage requires a custom longhorn-xfs StorageClass. The Kafka brokers used by Milvus require XFS filesystem — the default Ext4 is incompatible and will cause data corruption. Create the StorageClass with mkfsParams: "-f" and fsType: "xfs" before deploying Milvus.
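
A sketch of that longhorn-xfs StorageClass; the fsType and mkfsParams values come from the warning above, while numberOfReplicas is an illustrative choice to adjust for your cluster:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-xfs
provisioner: driver.longhorn.io
allowVolumeExpansion: true    # required by Milvus
parameters:
  numberOfReplicas: "3"       # illustrative; size to your cluster
  fsType: "xfs"               # Kafka brokers require XFS, not Ext4
  mkfsParams: "-f"
```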

05

Open WebUI

Open WebUI is the user-facing chat interface for SUSE AI. It’s a self-hosted web application that provides a ChatGPT-like experience connecting to your local Ollama or vLLM backends. Exposed via HTTPS Ingress with TLS certificates managed by cert-manager.

Key capabilities

  • Multi-model chat — Switch between available models mid-conversation. Configure default models per deployment
  • RAG document upload — Upload PDFs, text files, and other documents directly in the chat UI for RAG-powered Q&A
  • System prompts — Customize model behavior with system-level instructions
  • Pipelines — Chain AI models with APIs and external tools for multi-step workflows (added at SUSECON 2025)
  • User management — Built-in user roles (admin, user), configurable default role for new signups
  • API access — Programmatic access in addition to the web UI

Configuration

# Open WebUI Helm values
open-webui:
  ollamaUrls:
    - http://open-webui-ollama.suseai.svc.cluster.local:11434
  ingress:
    enabled: true
    host: ai.example.com
    tls: true
  env:
    WEBUI_NAME: "SUSE AI"
    DEFAULT_MODELS: "gemma:2b"
    DEFAULT_USER_ROLE: "user"
    GLOBAL_LOG_LEVEL: INFO

Pipelines

Open WebUI Pipelines enable chaining models with external tools and APIs for agentic workflows. Combined with MCPO (the MCP-to-OpenAPI proxy), this allows the LLM to call external REST APIs, query databases, or invoke custom business logic as part of its reasoning chain.

Note

Open WebUI Helm chart version in current documentation: 5.16.0 from oci://dp.apps.rancher.io/charts/open-webui. The chart bundles Ollama as a subchart — you can enable/disable Ollama within the same release.

06

GPU & Hardware

LLM inference is GPU-intensive. SUSE AI currently supports NVIDIA GPUs only, using the NVIDIA GPU Operator to bridge host GPU drivers into Kubernetes containers.

Supported GPUs

  • Data center — A100, H100, H200, A10, L40S, V100, Tesla
  • Consumer/workstation — RTX 30 series and newer
  • Driver generation: G06 (driver version 550.x+, CUDA 12.3+)

Infrastructure requirements

Component | Minimum | Recommended
Control plane CPU | 4 cores | 8+ cores (16+ for HA)
Control plane RAM | 8 GB | 16 GB+ (32 GB+ for HA)
GPU worker RAM | 16 GB | 32 GB+ for larger models
Disk | 50 GB SSD | 100 GB+ NVMe SSD
Kubernetes | 1.18+ | RKE2 latest stable
Nodes (production) | 3 control plane + 1 GPU worker | 3 CP + 2+ GPU workers
Model storage | 4 GB (small models) | 100 GB+ (multiple large models)
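
As a rough sizing aid, GPU memory for a model is approximately parameter count times bytes per weight, plus overhead for the KV cache and runtime. A back-of-the-envelope sketch — the 20% overhead factor is an assumption for illustration, not a SUSE figure:

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int,
                     overhead: float = 0.20) -> float:
    """Rough VRAM estimate: weight bytes plus a flat overhead factor
    for KV cache and runtime (the 0.20 default is an assumption)."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead) / 1e9

# An 8B model quantized to 4 bits needs ~4 GB of weights,
# ~4.8 GB with overhead; the same model at FP16 needs ~19.2 GB.
q4 = estimate_vram_gb(8, 4)
fp16 = estimate_vram_gb(8, 16)
```

Real usage also grows with context length and concurrency (the KV cache scales with both), so treat this as a floor, not a budget.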

GPU Operator setup on RKE2

# Install NVIDIA GPU drivers on the host OS (SLES 15 SP6)
sudo zypper install nvidia-open-driver-G06-signed-kmp-default

# Verify GPU is visible
nvidia-smi
# Should show: Driver Version 550.x, CUDA 12.3+

# Install GPU Operator on the RKE2 cluster
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.enabled=false \
  --set toolkit.env[0].name=CONTAINERD_CONFIG \
  --set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl \
  --set toolkit.env[1].name=CONTAINERD_SOCKET \
  --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
  --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
  --set toolkit.env[2].value=nvidia

# Verify GPU nodes are labeled
kubectl get nodes -l nvidia.com/gpu=1

Important

GPU drivers must be installed on the host OS, not inside containers. The GPU Operator then makes those drivers available to containers via the device plugin. Set driver.enabled=false in the GPU Operator Helm values since the driver is already on the host. The CONTAINERD_CONFIG path is specific to RKE2 — it differs from standard containerd installations.

Virtual clusters for GPU sharing

SUSE AI supports virtual clusters (GA) for sharing GPU resources across teams. Multiple virtual clusters run on the same physical host cluster, with a single GPU scheduler managing allocation. On Harvester, you can also use PCI passthrough for dedicated GPU access or vGPU with Multi-Instance GPU (MIG) partitioning on A100/H100/H200 GPUs.

Recommendation

Set the CPU scaling governor to performance on GPU worker nodes for optimal AI workload throughput. The default powersave governor can significantly reduce inference speed.

07

Installation

SUSE AI is deployed via Helm charts from the SUSE Application Collection OCI registry. You need an active SUSE subscription with Rancher Prime and SUSE AI entitlements, and service account credentials from SUSE Customer Center (SCC).

Prerequisites

  • RKE2 cluster with Ingress controller and GPU worker nodes
  • NVIDIA GPU Operator installed (see GPU section above)
  • Helm 3 CLI
  • SCC service account credentials for dp.apps.rancher.io
  • StorageClass with volume expansion support (for Milvus)
  • DNS-resolvable hostname for the Open WebUI Ingress

Step-by-step deployment

# 1. Create namespace and registry secret
kubectl create namespace suseai
kubectl create secret docker-registry application-collection \
  --docker-server=dp.apps.rancher.io \
  --docker-username=<SCC_USERNAME> \
  --docker-password=<SCC_TOKEN> \
  -n suseai

helm registry login dp.apps.rancher.io \
  -u <SCC_USERNAME> -p <SCC_TOKEN>

# 2. Install cert-manager (if not already installed)
kubectl create namespace cert-manager
kubectl create secret docker-registry application-collection \
  --docker-server=dp.apps.rancher.io \
  --docker-username=<SCC_USERNAME> \
  --docker-password=<SCC_TOKEN> \
  -n cert-manager

helm upgrade --install cert-manager \
  oci://dp.apps.rancher.io/charts/cert-manager \
  -n cert-manager \
  --set "global.imagePullSecrets[0].name=application-collection" \
  --set crds.enabled=true

# 3. Install Milvus (vector database)
helm upgrade --install milvus \
  oci://dp.apps.rancher.io/charts/milvus \
  -n suseai --version 4.2.2 \
  -f customvalues-milvus.yaml

# 4. Install Open WebUI + Ollama
helm upgrade --install open-webui \
  oci://dp.apps.rancher.io/charts/open-webui \
  -n suseai --version 5.16.0 \
  -f customvalues-owui.yaml

# 5. Verify deployment
kubectl get pods -n suseai
kubectl get ingress -n suseai

Meta deployer chart (alternative)

Instead of installing components individually, use the suse-ai-deployer meta chart to deploy everything at once:

# Deploy the full SUSE AI stack
helm upgrade --install suse-ai \
  oci://dp.apps.rancher.io/charts/suse-ai-deployer \
  --namespace suse-private-ai --create-namespace \
  --values ./custom-overrides.yaml

TLS options

Option | global.tls.source | cert-manager needed
Self-signed (default) | suse-private-ai | Yes
Let’s Encrypt | letsEncrypt | Yes
Bring your own cert | secret | No
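
In Helm values terms, the choice above maps to a single key. A hedged sketch, assuming the deployer chart reads it as shown:

```yaml
# Option: Let's Encrypt via cert-manager
global:
  tls:
    source: letsEncrypt

# Option: bring your own certificate (no cert-manager needed);
# the referenced TLS secret must exist in the target namespace.
# global:
#   tls:
#     source: secret
```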

Air-gapped deployment

SUSE AI fully supports air-gapped environments. Three scripts handle the offline workflow:

  • SUSE-AI-mirror-nvidia.sh — Mirrors NVIDIA RPM packages from a connected host
  • SUSE-AI-get-images.sh — Downloads all SUSE AI container images
  • SUSE-AI-load-images.sh — Loads images into a private local registry

Note

Registry secrets must be created per namespace. Kubernetes cannot reference image pull secrets from other namespaces. If you deploy cert-manager and SUSE AI in separate namespaces, create the application-collection secret in both.

08

Observability

SUSE AI uses OpenTelemetry for instrumentation and integrates with SUSE Observability for unified metrics, logs, and traces. The OpenTelemetry Operator enables auto-instrumentation of Python, Java, and Go services with zero code changes.

What’s monitored out of the box

  • LLM metrics — Token usage (input, output, reasoning), cost tracking, latency per model, throughput (tokens/sec)
  • GPU metrics — Utilization, temperature, power draw, memory usage via NVIDIA DCGM Exporter
  • Inference engine health — Ollama and vLLM request rates, error rates, queue depth
  • Vector database — Milvus query latency, index size, memory consumption
  • Application traces — End-to-end request tracing from Open WebUI through inference to vector search

Pre-built dashboards

LLM Cost & Tokens

Token consumption by model, user, and team. Input vs output vs reasoning token breakdown. Cost estimation based on configurable per-token rates.
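
Cost estimation in the dashboard comes down to multiplying token counts by configured per-token rates. A minimal sketch — the rates below are illustrative placeholders, not SUSE defaults:

```python
# Illustrative per-1K-token rates; real rates are configured in the dashboard.
RATES = {"input": 0.0005, "output": 0.0015, "reasoning": 0.0015}

def estimate_cost(tokens: dict[str, int]) -> float:
    """Sum cost across token categories at per-1K-token rates."""
    return sum(tokens.get(k, 0) / 1000 * rate for k, rate in RATES.items())

# 10k input + 2k output tokens: 0.005 + 0.003 = 0.008
cost = estimate_cost({"input": 10_000, "output": 2_000})
```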

LLM Performance

Inference latency (p50, p95, p99), throughput, time to first token, queue wait times. Comparison across models.

GPU Performance

Per-GPU utilization, memory usage, temperature, power draw. NVIDIA DCGM Exporter metrics. Helps right-size GPU allocation.

VectorDB Performance

Milvus/Qdrant query latency, index operations, memory consumption, segment health.

LLM drift detection

The observability stack includes a drift dashboard that tracks changes in model response patterns over time. This helps detect when model behavior shifts due to updates, prompt changes, or RAG document modifications — critical for compliance-sensitive deployments.

Recommendation

The observability integration uses SUSE Observability’s Time Machine feature for historical analysis. When investigating an incident, you can “travel back in time” to see the exact state of the AI stack at the moment the issue occurred — including GPU metrics, model load, and active queries.

09

Security & Authentication

SUSE AI provides multiple authentication mechanisms for Open WebUI and includes guardrail capabilities for responsible AI use.

Authentication methods

Built-in Password Auth

Default authentication method. First user to sign up becomes admin. Subsequent users get the role configured in DEFAULT_USER_ROLE.

LDAP / Active Directory

Integration via ENABLE_LDAP and ENABLE_LDAP_GROUP_MANAGEMENT environment variables. Maps LDAP groups to Open WebUI roles.
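
A hedged values sketch for LDAP: only ENABLE_LDAP and ENABLE_LDAP_GROUP_MANAGEMENT are named in this guide — the remaining variable names follow upstream Open WebUI conventions and should be verified against your chart version:

```yaml
open-webui:
  env:
    ENABLE_LDAP: "true"
    ENABLE_LDAP_GROUP_MANAGEMENT: "true"
    # Variable names below are assumptions from upstream Open WebUI docs:
    LDAP_SERVER_HOST: "ldap.example.com"
    LDAP_SERVER_PORT: "389"
    LDAP_APP_DN: "cn=svc-webui,ou=services,dc=example,dc=com"
    LDAP_SEARCH_BASE: "ou=people,dc=example,dc=com"
```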

OAuth 2.0 / OIDC SSO

Any OIDC-compatible identity provider (Keycloak, Azure AD, Okta, etc.). Supports multi-tenant SSO with per-tenant provider configuration.

SAML & Trusted Headers

SAML 2.0 support for enterprise IdPs. Trusted header auth for reverse proxy deployments where authentication is handled upstream.

RBAC & multi-tenancy

  • Role-based access control — Model-level permissions control which users/groups can access which models
  • Workspace isolation — Multi-tenant workspace separation within a single Open WebUI instance
  • API keys — Per-user API keys for programmatic access
  • LiteLLM access control — Per-key and per-team access control, budget caps, and token limits at the API gateway level

Guardrails

SUSE AI includes a blueprint for implementing guardrail technology and has partnered with Infosys for their Responsible AI framework (“Scan, Shield, Steer”). Guardrails can enforce content filtering, prompt moderation, and compliance policies on both inputs and outputs.

Supply chain security

  • All container images from SUSE Application Collection are signed and include SBOMs
  • SLSA Level 3 build provenance compliance
  • Daily CVE patches to container images
  • NeuVector integration for runtime network monitoring (L2-3 and L7), vulnerability scanning, and automated security policy generation

Critical

Ollama’s API endpoint has no built-in authentication. By default, SUSE AI disables the Ollama Ingress to prevent unauthenticated external access. If you need to expose Ollama directly, place an authenticating reverse proxy in front of it. All user-facing access should go through Open WebUI, which handles authentication.
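
If Ollama must be exposed directly, one option is basic auth at the Ingress layer. A hedged sketch using NGINX Ingress annotations — the hostname is a placeholder, and the ollama-basic-auth secret (an htpasswd file under the key auth) must be created first:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama
  namespace: suseai
  annotations:
    # NGINX Ingress basic-auth; the secret must exist in this namespace.
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: ollama-basic-auth
spec:
  rules:
    - host: ollama.example.com     # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: open-webui-ollama
                port:
                  number: 11434
```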

10

SUSE Ecosystem Integration

SUSE AI is designed to plug into the broader SUSE product portfolio. It leverages existing infrastructure rather than requiring a greenfield deployment.

Rancher Manager

SUSE AI components are available in the Rancher Apps catalog. Rancher UI shows GPU node labels, manages cluster lifecycle, and provides app deployment workflows. Observability integrates as a Rancher UI extension.

SUSE Application Collection

All Helm charts and container images distributed via dp.apps.rancher.io OCI registry. Curated, hardened, and signed. The single source of truth for SUSE AI artifacts.

Harvester (SUSE Virtualization)

Supports PCI passthrough for dedicated GPU access to VMs, vGPU with MIG partitioning (A100, H100, H200), and SR-IOV for GPU sharing. Enables running SUSE AI in VMs with direct GPU access on Harvester HCI clusters.

Longhorn (SUSE Storage)

Tested as persistent storage backend. Requires a custom longhorn-xfs StorageClass for Kafka/Milvus components. XFS filesystem required — Ext4 is incompatible.

NeuVector (SUSE Security)

Network monitoring, CVE scanning, and compliance enforcement for the AI stack. Auto-discovers container communication patterns and generates network policies.

Fleet (GitOps)

GitOps-driven deployment and lifecycle management for SUSE AI components across multiple clusters. Define your AI stack as code and deploy consistently.

Liz — AI-powered Kubernetes assistant

Liz is a tech preview AI agent that runs as a Rancher Prime UI extension. It provides context-aware Kubernetes management assistance — proactive issue detection, performance optimization suggestions, and natural language cluster operations. Powered by Ollama (on-prem) or Amazon Bedrock (AWS). Available as a tech preview since KubeCon NA 2025.

11

Licensing

SUSE AI requires an active SUSE subscription. While the underlying components are all open-source projects, the SUSE packaging, hardened images, registry access, and enterprise support require a paid subscription.

What you need

  • Rancher Prime entitlement — For the Kubernetes management layer
  • SUSE AI entitlement — For access to AI-specific charts and images from the Application Collection
  • Both entitlements are managed through SUSE Customer Center (SCC)

Subscription model

  • Consumption-based billing — SUSE invoices based on consumption of “Units” as defined in the SUSE AI Supplemental Terms
  • Available in 1-year, 3-year, and 5-year terms
  • Governed by SUSE AI Supplemental Terms (separate from the main subscription agreement)

Open-source components

Component | License | Upstream project
Ollama | MIT | ollama/ollama
vLLM | Apache 2.0 | vllm-project/vllm
Open WebUI | MIT | open-webui/open-webui
Milvus | Apache 2.0 | milvus-io/milvus
LiteLLM | MIT | BerriAI/litellm
MLflow | Apache 2.0 | mlflow/mlflow
suse-ai-deployer | Apache 2.0 | SUSE/suse-ai-deployer

Note

You can run all the upstream open-source components yourself without a SUSE subscription. What the subscription buys you is: hardened container images with daily patches, SLSA Level 3 supply chain compliance, tested component integration, air-gapped deployment support, SUSE enterprise support, and the AI-specific observability dashboards.

12

Deployment Checklist

Pre-deployment

  • Confirm SUSE subscription includes both Rancher Prime and SUSE AI entitlements
  • Create SCC service account credentials for dp.apps.rancher.io access
  • Provision GPU worker nodes with NVIDIA GPUs (A100/H100 for production, consumer GPUs for dev)
  • Install NVIDIA GPU drivers (G06 generation) on host OS
  • Deploy RKE2 cluster with 3+ control plane nodes and 1+ GPU worker nodes
  • Verify nvidia-smi works on GPU nodes
  • Install NVIDIA GPU Operator with RKE2-specific containerd paths
  • Confirm nodes labeled with nvidia.com/gpu=1
  • Ensure StorageClass supports volume expansion (critical for Milvus)
  • If using Longhorn: create longhorn-xfs StorageClass for Kafka/Milvus
  • Set CPU scaling governor to performance on GPU workers
  • Configure DNS for Open WebUI hostname

Deployment

  • Create namespaces and application-collection registry secrets in each namespace
  • Install cert-manager with CRDs enabled
  • Choose TLS strategy: self-signed, Let’s Encrypt, or bring-your-own
  • Deploy Milvus (or Qdrant) for RAG — verify pods are running and PVCs are bound
  • Deploy Open WebUI + Ollama via Helm — verify Ingress is accessible
  • Pull at least one model: kubectl exec into Ollama pod and run ollama pull gemma:2b
  • Test chat through Open WebUI — verify model responds
  • Test RAG: upload a document, ask a question about it, verify grounded response

Post-deployment

  • Configure authentication (LDAP, OIDC, or SAML) — disable default password signup if using SSO
  • Set up RBAC and model-level permissions
  • Deploy OpenTelemetry Operator for observability
  • Verify GPU metrics flowing to monitoring stack (DCGM Exporter)
  • Set up LiteLLM if you need a unified API gateway or multiple backends
  • Configure backup strategy for Milvus data and Open WebUI state
  • Document model selection rationale and GPU sizing decisions
  • Plan upgrade strategy — SUSE AI charts receive regular updates