SUSE AI Production Guide
Private AI on Kubernetes — LLM inference, RAG, model management & GPU orchestration
Overview
SUSE AI is an enterprise platform for deploying and running generative AI and LLM workloads on Kubernetes with full data sovereignty. It packages battle-tested open-source AI tools — Ollama, vLLM, Open WebUI, Milvus, and more — into a curated, hardened stack that deploys via Helm on any SUSE-supported Kubernetes cluster.
The core problem SUSE AI solves is the private AI infrastructure gap. Organizations want to run LLMs on their own hardware for data privacy, regulatory compliance, and cost control, but stitching together GPU drivers, inference engines, vector databases, model serving, and observability on Kubernetes is complex. SUSE AI provides an opinionated, pre-integrated stack with enterprise support, hardened container images, and air-gapped deployment capability.
SUSE AI 1.0 was announced and made generally available at KubeCon NA 2024 (November 2024). Significant updates were added at SUSECON 2025 (MLflow, PyTorch, Pipelines) and KubeCon NA 2025 (vLLM, MCP Universal Proxy tech preview, virtual clusters GA).
What problems does SUSE AI solve?
- Data sovereignty — Run LLMs entirely on-premises or in your own cloud. No data leaves your infrastructure, meeting compliance requirements that rule out SaaS AI services
- Infrastructure complexity — Pre-integrates GPU drivers, inference engines, vector databases, chat UI, observability, and TLS into a single deployable stack
- Model management — Pull, serve, and switch between open-source models (Llama, Gemma, Mistral, etc.) without building custom serving infrastructure
- RAG at scale — Built-in vector database integration for Retrieval Augmented Generation — ground LLM responses in your organization’s documents
- GPU orchestration — Handles NVIDIA GPU scheduling, sharing, and monitoring on Kubernetes through the GPU Operator
- Supply chain trust — SUSE Application Collection provides signed, SBOM-tracked, SLSA Level 3 compliant container images with daily patches
Strengths
- Fully private — on-premises, air-gapped, or your cloud with no data egress
- Open-source core (Ollama, vLLM, Open WebUI, Milvus are all OSS projects)
- Pre-integrated stack — components are tested together and deployed via a single meta chart
- Hardened container images with signatures, SBOMs, daily CVE patches
- Built on the Rancher Prime ecosystem — leverages existing K8s management tooling
- Air-gapped deployment fully supported with offline mirroring scripts
- OpenTelemetry-native observability with AI-specific dashboards
Considerations
- NVIDIA GPUs only — no AMD or Intel GPU support in current release
- Requires SUSE subscription (Rancher Prime + SUSE AI entitlements)
- Young product (GA since Nov 2024) — some features still in tech preview
- Ollama API has no built-in authentication (ingress disabled by default)
- Milvus requires StorageClass with volume expansion — fails silently without it
- GPU hardware is expensive — sizing and cost planning essential
- MCP Universal Proxy and Liz AI assistant are tech preview only
Architecture
SUSE AI is a multi-layered stack built on top of the SUSE Rancher Prime ecosystem. Each layer builds on the one below it, from the operating system up through Kubernetes to the AI workloads themselves.
The four layers
Layer 1 — Operating System
SLES 15 SP6 (general purpose) or SLE Micro 6.1 (immutable, transactional). NVIDIA GPU drivers (G06 generation) are installed at this layer. SLE Micro is preferred for GPU worker nodes due to its smaller attack surface and atomic updates.
Layer 2 — Kubernetes
RKE2 (recommended) or K3s for the Kubernetes runtime, managed by Rancher Manager. RKE2’s containerd runtime is required for the NVIDIA GPU Operator’s device plugin to mount GPUs into containers.
Layer 3 — Infrastructure Services
NVIDIA GPU Operator for GPU scheduling, cert-manager (v1.17.2) for TLS, CSI storage drivers (Longhorn, NFS CSI, or cloud provider), NeuVector for runtime security, and SUSE Observability for monitoring.
Component interaction
Deployment model
All components are packaged as Helm charts distributed through the SUSE Application Collection OCI registry (dp.apps.rancher.io). A meta Helm chart called suse-ai-deployer orchestrates deployment of all components together. Individual charts can also be installed independently for customized deployments.
The deployer chart has six required dependencies: Milvus, Ollama, Open WebUI, Open WebUI MCPO, PyTorch, and vLLM. The deployer chart defaults to the suse-private-ai namespace; the manual per-component installs shown later in this guide use suseai.
For production, use a dedicated RKE2 cluster (or dedicated GPU node pool) for SUSE AI workloads. AI inference is resource-intensive and can starve other workloads of GPU, memory, and I/O. Separate the AI workload plane from general application workloads.
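One way to enforce that separation, sketched here with illustrative names, is to taint the GPU nodes and give only the AI charts a matching toleration. The taint key and node label below are assumptions, not values mandated by SUSE AI:

```yaml
# Keep general workloads off GPU nodes (node name is illustrative):
#   kubectl taint nodes gpu-worker-1 suse-ai/workload=ai:NoSchedule
# Then, in the AI charts' Helm values (key placement varies per chart):
tolerations:
  - key: suse-ai/workload
    operator: Equal
    value: ai
    effect: NoSchedule
nodeSelector:
  nvidia.com/gpu: "1"   # GPU node label used elsewhere in this guide
```

Most charts in the stack expose `tolerations` and `nodeSelector` values; check each component's values.yaml for the exact path.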
Inference Engines
SUSE AI bundles two inference engines that serve different use cases. You can run both simultaneously and route requests through LiteLLM as a unified gateway.
Ollama — Simplicity & Flexibility
Local LLM inference engine that handles model downloading, loading, and serving. Serves on port 11434. Optimized for single-user or low-concurrency workloads. Supports GGUF (quantized) and Safetensors model formats. Easy model management with `ollama pull` and `ollama run`.
- Best for: development, experimentation, small teams, edge deployments
- Model formats: GGUF (quantized models), Safetensors
- GPU: NVIDIA only in SUSE AI bundle
vLLM — High-Performance Serving
High-throughput inference engine using PagedAttention for efficient GPU memory management. Claims up to 80% reduction in GPU memory waste and up to 24x throughput improvement under high concurrency. Provides OpenAI-compatible API.
- Best for: production chatbots, copilots, high-concurrency API serving
- Continuous batching for dynamic request handling
- Native OpenAI API compatibility
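Because vLLM speaks the OpenAI chat-completions format, any OpenAI-style client works against it. A minimal sketch of the request shape, assuming a hypothetical in-cluster service URL and served model name:

```python
import json

# In-cluster endpoint and model name are illustrative assumptions
url = "http://vllm.suseai.svc.cluster.local:8000/v1/chat/completions"

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # hypothetical served model
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize our RKE2 upgrade policy."},
    ],
    "max_tokens": 256,
    "temperature": 0.2,
}

# POST this body with any HTTP client (curl, requests, or the openai SDK)
body = json.dumps(payload)
```

The same payload works unchanged against LiteLLM or any other OpenAI-compatible gateway, which is what makes vLLM a drop-in backend for existing tooling.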
Choosing between Ollama and vLLM
| Criteria | Ollama | vLLM |
|---|---|---|
| Primary use case | Dev/test, small teams, model experimentation | Production serving, high-concurrency APIs |
| Throughput | Good for single/few concurrent users | Optimized for many concurrent requests |
| Memory efficiency | Standard allocation | PagedAttention — up to 80% less waste |
| API compatibility | Ollama API + partial OpenAI compat | Full OpenAI-compatible API |
| Model management | Built-in pull/run/list commands | Requires pre-downloaded models |
| Batching | Sequential processing | Continuous batching |
| Setup complexity | Simple — single binary | More configuration required |
LiteLLM — unified API gateway
LiteLLM sits in front of both engines and provides a single OpenAI-compatible API endpoint that can route requests to Ollama, vLLM, or 100+ external LLM providers. It adds cost tracking, per-key/per-team access control, guardrails, load balancing, and logging.
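A minimal LiteLLM proxy configuration routing one alias to Ollama and one to vLLM might look like the sketch below. The model names and service URLs are illustrative; the `ollama/` and `hosted_vllm/` provider prefixes follow the LiteLLM documentation:

```yaml
# litellm config.yaml sketch — aliases and URLs are illustrative
model_list:
  - model_name: dev-llama              # served by Ollama
    litellm_params:
      model: ollama/llama3.1
      api_base: http://open-webui-ollama.suseai.svc.cluster.local:11434
  - model_name: prod-llama             # served by vLLM's OpenAI-compatible API
    litellm_params:
      model: hosted_vllm/meta-llama/Llama-3.1-8B-Instruct
      api_base: http://vllm.suseai.svc.cluster.local:8000/v1
```

Clients then call the LiteLLM endpoint with `model: dev-llama` or `model: prod-llama` and never need to know which engine answers.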
Supported models
Any model supported by Ollama or vLLM can be used. Common models documented in SUSE AI examples:
- Llama 3.1 / 3.2 (Meta) — `llama3.1`, `llama3.2:3b`
- Gemma 2B (Google) — `gemma:2b`
- Mistral / Mixtral (Mistral AI)
- Phi (Microsoft)
- Any GGUF-format model from Hugging Face or the Ollama library
RAG & Vector Databases
Retrieval Augmented Generation (RAG) allows the LLM to ground its responses in your organization’s documents. SUSE AI provides two vector database options for storing and searching document embeddings.
How RAG works in SUSE AI
- Document ingestion — Users upload documents through Open WebUI. Documents are chunked and passed through an embedding model
- Embedding — The default embedding model `sentence-transformers/all-MiniLM-L6-v2` converts text chunks into numerical vectors
- Storage — Vectors are stored in Milvus or Qdrant for efficient similarity search
- Query — When a user asks a question, the query is embedded and the vector DB finds the most semantically similar document chunks
- Augmented prompt — Retrieved chunks are injected into the LLM prompt as context, grounding the response in actual data
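The retrieve-then-augment flow above can be sketched in a few lines of plain Python. The bag-of-words "embedding" and cosine similarity below are toy stand-ins for all-MiniLM-L6-v2 and Milvus, used only to make the data flow concrete:

```python
import math
from collections import Counter

# Toy stand-in for the embedding model: a bag-of-words vector
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Steps 1-2: chunk documents and embed each chunk
chunks = [
    "RKE2 is the recommended Kubernetes distribution for SUSE AI.",
    "Milvus stores document embeddings for similarity search.",
    "The cafeteria opens at nine.",
]
index = [(c, embed(c)) for c in chunks]

# Steps 3-4: embed the query and retrieve the most similar chunk
question = "Which Kubernetes distribution is recommended?"
best = max(index, key=lambda item: cosine(embed(question), item[1]))[0]

# Step 5: inject the retrieved chunk into the LLM prompt as context
prompt = f"Context: {best}\n\nQuestion: {question}"
```

In the real stack, `embed` is the sentence-transformers model, `index` lives in Milvus, and the augmented `prompt` is what Open WebUI sends to Ollama or vLLM.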
Vector database options
Milvus (Primary)
Open-source vector database purpose-built for similarity search at scale. Deployed in cluster mode with etcd (1 replica), MinIO (4 replicas, distributed mode), and Kafka (3 brokers, 8 Gi storage each). Helm chart version 4.2.2. Serves on port 19530.
Pulsar is disabled by default. Milvus requires a StorageClass with `allowVolumeExpansion: true`, or the deployment will fail silently.
Qdrant (Alternative)
Lightweight vector database alternative. Simpler deployment footprint than Milvus (no Kafka/MinIO dependencies). Good for smaller deployments or when you want fewer moving parts. Trades some scalability for operational simplicity.
RAG configuration
# Open WebUI values.yaml for RAG with Milvus
open-webui:
  env:
    VECTOR_DB: "milvus"
    MILVUS_URI: "http://milvus.suseai.svc.cluster.local:19530"
    RAG_EMBEDDING_MODEL: "sentence-transformers/all-MiniLM-L6-v2"
Milvus with Longhorn storage requires a custom longhorn-xfs StorageClass. The Kafka brokers used by Milvus require XFS filesystem — the default Ext4 is incompatible and will cause data corruption. Create the StorageClass with mkfsParams: "-f" and fsType: "xfs" before deploying Milvus.
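A minimal sketch of such a StorageClass, assuming the standard Longhorn CSI provisioner and the `fsType`/`mkfsParams` parameters named above (verify replica counts and timeouts against your environment):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-xfs
provisioner: driver.longhorn.io
allowVolumeExpansion: true            # required by Milvus
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "30"
  fsType: "xfs"                       # Kafka brokers require XFS, not Ext4
  mkfsParams: "-f"
```

Apply this before deploying Milvus, and reference it from the Milvus chart's persistence settings.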
Open WebUI
Open WebUI is the user-facing chat interface for SUSE AI. It’s a self-hosted web application that provides a ChatGPT-like experience connecting to your local Ollama or vLLM backends. Exposed via HTTPS Ingress with TLS certificates managed by cert-manager.
Key capabilities
- Multi-model chat — Switch between available models mid-conversation. Configure default models per deployment
- RAG document upload — Upload PDFs, text files, and other documents directly in the chat UI for RAG-powered Q&A
- System prompts — Customize model behavior with system-level instructions
- Pipelines — Chain AI models with APIs and external tools for multi-step workflows (added at SUSECON 2025)
- User management — Built-in user roles (admin, user), configurable default role for new signups
- API access — Programmatic access in addition to the web UI
Configuration
# Open WebUI Helm values
open-webui:
  ollamaUrls:
    - http://open-webui-ollama.suseai.svc.cluster.local:11434
  ingress:
    enabled: true
    host: ai.example.com
    tls: true
  env:
    WEBUI_NAME: "SUSE AI"
    DEFAULT_MODELS: "gemma:2b"
    DEFAULT_USER_ROLE: "user"
    GLOBAL_LOG_LEVEL: INFO
Pipelines
Open WebUI Pipelines enable chaining models with external tools and APIs for agentic workflows. Combined with MCPO (the MCP-to-OpenAPI proxy), this allows the LLM to call external REST APIs, query databases, or invoke custom business logic as part of its reasoning chain.
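A pipeline is a small Python class that Open WebUI loads and exposes as a selectable model. The sketch below follows the upstream open-webui/pipelines scaffold; treat the exact `pipe()` signature as an assumption to check against your chart version, and the ticket lookup as purely hypothetical business logic:

```python
from typing import Iterator, List, Union

class Pipeline:
    def __init__(self):
        # Name shown in the Open WebUI model picker
        self.name = "Ticket Lookup"

    def pipe(
        self, user_message: str, model_id: str, messages: List[dict], body: dict
    ) -> Union[str, Iterator[str]]:
        # A real pipeline would call an external API here (for example via
        # MCPO) and feed the result back into the conversation
        if user_message.startswith("#ticket"):
            ticket_id = user_message.split()[-1]
            return f"Looked up {ticket_id} in the ticket system."
        return user_message
```

Pipelines run in a separate container, so heavyweight dependencies (database drivers, SDKs) stay out of the Open WebUI image.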
Open WebUI Helm chart version in current documentation: 5.16.0 from oci://dp.apps.rancher.io/charts/open-webui. The chart bundles Ollama as a subchart — you can enable/disable Ollama within the same release.
GPU & Hardware
LLM inference is GPU-intensive. SUSE AI currently supports NVIDIA GPUs only, using the NVIDIA GPU Operator to bridge host GPU drivers into Kubernetes containers.
Supported GPUs
- Data center — A100, H100, H200, A10, L40S, V100, Tesla
- Consumer/workstation — RTX 30 series and newer
- Driver generation: G06 (driver version 550.x+, CUDA 12.3+)
Infrastructure requirements
| Component | Minimum | Recommended |
|---|---|---|
| Control plane CPU | 4 cores | 8+ cores (16+ for HA) |
| Control plane RAM | 8 GB | 16 GB+ (32 GB+ for HA) |
| GPU worker RAM | 16 GB | 32 GB+ for larger models |
| Disk | 50 GB SSD | 100 GB+ NVMe SSD |
| Kubernetes | 1.18+ | RKE2 latest stable |
| Nodes (production) | 3 control plane + 1 GPU worker | 3 CP + 2+ GPU workers |
| Model storage | 4 GB (small models) | 100 GB+ (multiple large models) |
GPU Operator setup on RKE2
# Install NVIDIA GPU drivers on the host OS (SLES 15 SP6)
sudo zypper install nvidia-open-driver-G06-signed-kmp-default
# Verify GPU is visible
nvidia-smi
# Should show: Driver Version 550.x, CUDA 12.3+
# Install GPU Operator on the RKE2 cluster
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
-n gpu-operator --create-namespace \
--set driver.enabled=false \
--set toolkit.env[0].name=CONTAINERD_CONFIG \
--set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl \
--set toolkit.env[1].name=CONTAINERD_SOCKET \
--set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
--set toolkit.env[2].value=nvidia
# Verify GPU nodes are labeled
kubectl get nodes -l nvidia.com/gpu=1
GPU drivers must be installed on the host OS, not inside containers. The GPU Operator then makes those drivers available to containers via the device plugin. Set driver.enabled=false in the GPU Operator Helm values since the driver is already on the host. The CONTAINERD_CONFIG path is specific to RKE2 — it differs from standard containerd installations.
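To confirm end-to-end GPU access after the Operator settles, a throwaway pod that runs nvidia-smi through the nvidia runtime class is a quick smoke test. The CUDA image tag below is illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia            # matches CONTAINERD_RUNTIME_CLASS
  containers:
    - name: nvidia-smi
      image: nvcr.io/nvidia/cuda:12.3.2-base-ubuntu22.04   # illustrative tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

If `kubectl logs gpu-smoke-test` shows the same driver table you saw on the host, the device plugin, runtime class, and containerd wiring are all working.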
Virtual clusters for GPU sharing
SUSE AI supports virtual clusters (GA) for sharing GPU resources across teams. Multiple virtual clusters run on the same physical host cluster, with a single GPU scheduler managing allocation. On Harvester, you can also use PCI passthrough for dedicated GPU access or vGPU with Multi-Instance GPU (MIG) partitioning on A100/H100/H200 GPUs.
Set the CPU scaling governor to performance on GPU worker nodes for optimal AI workload throughput. The default powersave governor can significantly reduce inference speed.
Installation
SUSE AI is deployed via Helm charts from the SUSE Application Collection OCI registry. You need an active SUSE subscription with Rancher Prime and SUSE AI entitlements, and service account credentials from SUSE Customer Center (SCC).
Prerequisites
- RKE2 cluster with Ingress controller and GPU worker nodes
- NVIDIA GPU Operator installed (see GPU section above)
- Helm 3 CLI
- SCC service account credentials for `dp.apps.rancher.io`
- StorageClass with volume expansion support (for Milvus)
- DNS-resolvable hostname for the Open WebUI Ingress
Step-by-step deployment
# 1. Create namespace and registry secret
kubectl create namespace suseai
kubectl create secret docker-registry application-collection \
--docker-server=dp.apps.rancher.io \
--docker-username=<SCC_USERNAME> \
--docker-password=<SCC_TOKEN> \
-n suseai
helm registry login dp.apps.rancher.io \
-u <SCC_USERNAME> -p <SCC_TOKEN>
# 2. Install cert-manager (if not already installed)
kubectl create namespace cert-manager
kubectl create secret docker-registry application-collection \
--docker-server=dp.apps.rancher.io \
--docker-username=<SCC_USERNAME> \
--docker-password=<SCC_TOKEN> \
-n cert-manager
helm upgrade --install cert-manager \
oci://dp.apps.rancher.io/charts/cert-manager \
-n cert-manager \
--set "global.imagePullSecrets[0].name=application-collection" \
--set crds.enabled=true
# 3. Install Milvus (vector database)
helm upgrade --install milvus \
oci://dp.apps.rancher.io/charts/milvus \
-n suseai --version 4.2.2 \
-f customvalues-milvus.yaml
# 4. Install Open WebUI + Ollama
helm upgrade --install open-webui \
oci://dp.apps.rancher.io/charts/open-webui \
-n suseai --version 5.16.0 \
-f customvalues-owui.yaml
# 5. Verify deployment
kubectl get pods -n suseai
kubectl get ingress -n suseai
Meta deployer chart (alternative)
Instead of installing components individually, use the suse-ai-deployer meta chart to deploy everything at once:
# Deploy the full SUSE AI stack
helm upgrade --install suse-ai \
oci://dp.apps.rancher.io/charts/suse-ai-deployer \
--namespace suse-private-ai --create-namespace \
--values ./custom-overrides.yaml
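Helm nests overrides for chart dependencies under each dependency's name, so custom-overrides.yaml generally carries one block per component. The keys below are a sketch to verify against the deployer chart's own values.yaml before use:

```yaml
# custom-overrides.yaml sketch — confirm exact keys in the chart's values.yaml
global:
  imagePullSecrets:
    - name: application-collection
open-webui:
  ingress:
    host: ai.example.com
ollama:
  enabled: true
vllm:
  enabled: false      # skip vLLM if Ollama alone is sufficient
```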
TLS options
| Option | global.tls.source | cert-manager needed |
|---|---|---|
| Self-signed (default) | suse-private-ai | Yes |
| Let’s Encrypt | letsEncrypt | Yes |
| Bring your own cert | secret | No |
Air-gapped deployment
SUSE AI fully supports air-gapped environments. Three scripts handle the offline workflow:
- `SUSE-AI-mirror-nvidia.sh` — Mirrors NVIDIA RPM packages from a connected host
- `SUSE-AI-get-images.sh` — Downloads all SUSE AI container images
- `SUSE-AI-load-images.sh` — Loads images into a private local registry
Registry secrets must be created per namespace. Kubernetes cannot reference image pull secrets from other namespaces. If you deploy cert-manager and SUSE AI in separate namespaces, create the application-collection secret in both.
Observability
SUSE AI uses OpenTelemetry for instrumentation and integrates with SUSE Observability for unified metrics, logs, and traces. The OpenTelemetry Operator enables auto-instrumentation of Python, Java, and Go services with zero code changes.
What’s monitored out of the box
- LLM metrics — Token usage (input, output, reasoning), cost tracking, latency per model, throughput (tokens/sec)
- GPU metrics — Utilization, temperature, power draw, memory usage via NVIDIA DCGM Exporter
- Inference engine health — Ollama and vLLM request rates, error rates, queue depth
- Vector database — Milvus query latency, index size, memory consumption
- Application traces — End-to-end request tracing from Open WebUI through inference to vector search
Pre-built dashboards
LLM Cost & Tokens
Token consumption by model, user, and team. Input vs output vs reasoning token breakdown. Cost estimation based on configurable per-token rates.
LLM Performance
Inference latency (p50, p95, p99), throughput, time to first token, queue wait times. Comparison across models.
GPU Performance
Per-GPU utilization, memory usage, temperature, power draw. NVIDIA DCGM Exporter metrics. Helps right-size GPU allocation.
VectorDB Performance
Milvus/Qdrant query latency, index operations, memory consumption, segment health.
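The cost figures on the LLM Cost & Tokens dashboard are simple arithmetic over token counts. A worked example with hypothetical per-million-token rates (the dashboard makes these configurable):

```python
# Hypothetical rates in currency units per 1M tokens
RATES = {"input": 0.50, "output": 1.50}

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (
        input_tokens * RATES["input"] + output_tokens * RATES["output"]
    ) / 1_000_000

# 2,000 prompt tokens + 500 completion tokens
print(round(request_cost(2_000, 500), 6))  # → 0.00175
```

Summing this per request, grouped by model, user, or team, yields the dashboard's cost breakdowns.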
LLM drift detection
The observability stack includes a drift dashboard that tracks changes in model response patterns over time. This helps detect when model behavior shifts due to updates, prompt changes, or RAG document modifications — critical for compliance-sensitive deployments.
The observability integration uses SUSE Observability’s Time Machine feature for historical analysis. When investigating an incident, you can “travel back in time” to see the exact state of the AI stack at the moment the issue occurred — including GPU metrics, model load, and active queries.
Security & Authentication
SUSE AI provides multiple authentication mechanisms for Open WebUI and includes guardrail capabilities for responsible AI use.
Authentication methods
Built-in Password Auth
Default authentication method. First user to sign up becomes admin. Subsequent users get the role configured in DEFAULT_USER_ROLE.
LDAP / Active Directory
Integration via ENABLE_LDAP and ENABLE_LDAP_GROUP_MANAGEMENT environment variables. Maps LDAP groups to Open WebUI roles.
OAuth 2.0 / OIDC SSO
Any OIDC-compatible identity provider (Keycloak, Azure AD, Okta, etc.). Supports multi-tenant SSO with per-tenant provider configuration.
SAML & Trusted Headers
SAML 2.0 support for enterprise IdPs. Trusted header auth for reverse proxy deployments where authentication is handled upstream.
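As an example, LDAP is wired up through environment variables in the Open WebUI Helm values. The two ENABLE_* flags are the ones named above; the connection settings are illustrative names to verify against the Open WebUI documentation for your chart version:

```yaml
open-webui:
  env:
    ENABLE_LDAP: "true"
    ENABLE_LDAP_GROUP_MANAGEMENT: "true"
    # Connection settings below are illustrative — confirm exact variable
    # names in the Open WebUI docs before deploying
    LDAP_SERVER_HOST: "ldap.example.com"
    LDAP_SERVER_PORT: "636"
    LDAP_SEARCH_BASE: "ou=users,dc=example,dc=com"
```

When SSO or LDAP is active, consider disabling default password signup so all accounts flow through the identity provider.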
RBAC & multi-tenancy
- Role-based access control — Model-level permissions control which users/groups can access which models
- Workspace isolation — Multi-tenant workspace separation within a single Open WebUI instance
- API keys — Per-user API keys for programmatic access
- LiteLLM access control — Per-key and per-team access control, budget caps, and token limits at the API gateway level
Guardrails
SUSE AI includes a blueprint for implementing guardrail technology and has partnered with Infosys for their Responsible AI framework (“Scan, Shield, Steer”). Guardrails can enforce content filtering, prompt moderation, and compliance policies on both inputs and outputs.
Supply chain security
- All container images from SUSE Application Collection are signed and include SBOMs
- SLSA Level 3 build provenance compliance
- Daily CVE patches to container images
- NeuVector integration for runtime network monitoring (L2-3 and L7), vulnerability scanning, and automated security policy generation
Ollama’s API endpoint has no built-in authentication. By default, SUSE AI disables the Ollama Ingress to prevent unauthenticated external access. If you need to expose Ollama directly, place an authenticating reverse proxy in front of it. All user-facing access should go through Open WebUI, which handles authentication.
SUSE Ecosystem Integration
SUSE AI is designed to plug into the broader SUSE product portfolio. It leverages existing infrastructure rather than requiring a greenfield deployment.
Rancher Manager
SUSE AI components are available in the Rancher Apps catalog. Rancher UI shows GPU node labels, manages cluster lifecycle, and provides app deployment workflows. Observability integrates as a Rancher UI extension.
SUSE Application Collection
All Helm charts and container images distributed via dp.apps.rancher.io OCI registry. Curated, hardened, and signed. The single source of truth for SUSE AI artifacts.
Harvester (SUSE Virtualization)
Supports PCI passthrough for dedicated GPU access to VMs, vGPU with MIG partitioning (A100, H100, H200), and SR-IOV for GPU sharing. Enables running SUSE AI in VMs with direct GPU access on Harvester HCI clusters.
Longhorn (SUSE Storage)
Tested as persistent storage backend. Requires a custom longhorn-xfs StorageClass for Kafka/Milvus components. XFS filesystem required — Ext4 is incompatible.
NeuVector (SUSE Security)
Network monitoring, CVE scanning, and compliance enforcement for the AI stack. Auto-discovers container communication patterns and generates network policies.
Fleet (GitOps)
GitOps-driven deployment and lifecycle management for SUSE AI components across multiple clusters. Define your AI stack as code and deploy consistently.
Liz — AI-powered Kubernetes assistant
Liz is a tech preview AI agent that runs as a Rancher Prime UI extension. It provides context-aware Kubernetes management assistance — proactive issue detection, performance optimization suggestions, and natural language cluster operations. Powered by Ollama (on-prem) or Amazon Bedrock (AWS). Available as a tech preview since KubeCon NA 2025.
Licensing
SUSE AI requires an active SUSE subscription. While the underlying components are all open-source projects, the SUSE packaging, hardened images, registry access, and enterprise support require a paid subscription.
What you need
- Rancher Prime entitlement — For the Kubernetes management layer
- SUSE AI entitlement — For access to AI-specific charts and images from the Application Collection
- Both entitlements are managed through SUSE Customer Center (SCC)
Subscription model
- Consumption-based billing — SUSE invoices based on consumption of “Units” as defined in the SUSE AI Supplemental Terms
- Available in 1-year, 3-year, and 5-year terms
- Governed by SUSE AI Supplemental Terms (separate from the main subscription agreement)
Open-source components
| Component | License | Upstream project |
|---|---|---|
| Ollama | MIT | ollama/ollama |
| vLLM | Apache 2.0 | vllm-project/vllm |
| Open WebUI | MIT | open-webui/open-webui |
| Milvus | Apache 2.0 | milvus-io/milvus |
| LiteLLM | MIT | BerriAI/litellm |
| MLflow | Apache 2.0 | mlflow/mlflow |
| suse-ai-deployer | Apache 2.0 | SUSE/suse-ai-deployer |
You can run all the upstream open-source components yourself without a SUSE subscription. What the subscription buys you is: hardened container images with daily patches, SLSA Level 3 supply chain compliance, tested component integration, air-gapped deployment support, SUSE enterprise support, and the AI-specific observability dashboards.
Deployment Checklist
Pre-deployment
- Confirm SUSE subscription includes both Rancher Prime and SUSE AI entitlements
- Create SCC service account credentials for `dp.apps.rancher.io` access
- Provision GPU worker nodes with NVIDIA GPUs (A100/H100 for production, consumer GPUs for dev)
- Install NVIDIA GPU drivers (G06 generation) on host OS
- Deploy RKE2 cluster with 3+ control plane nodes and 1+ GPU worker nodes
- Verify `nvidia-smi` works on GPU nodes
- Install NVIDIA GPU Operator with RKE2-specific containerd paths
- Confirm nodes labeled with `nvidia.com/gpu=1`
- Ensure StorageClass supports volume expansion (critical for Milvus)
- If using Longhorn: create `longhorn-xfs` StorageClass for Kafka/Milvus
- Set CPU scaling governor to `performance` on GPU workers
- Configure DNS for Open WebUI hostname
Deployment
- Create namespaces and `application-collection` registry secrets in each namespace
- Install cert-manager with CRDs enabled
- Choose TLS strategy: self-signed, Let’s Encrypt, or bring-your-own
- Deploy Milvus (or Qdrant) for RAG — verify pods are running and PVCs are bound
- Deploy Open WebUI + Ollama via Helm — verify Ingress is accessible
- Pull at least one model: `kubectl exec` into the Ollama pod and run `ollama pull gemma:2b`
- Test chat through Open WebUI — verify model responds
- Test RAG: upload a document, ask a question about it, verify grounded response
Post-deployment
- Configure authentication (LDAP, OIDC, or SAML) — disable default password signup if using SSO
- Set up RBAC and model-level permissions
- Deploy OpenTelemetry Operator for observability
- Verify GPU metrics flowing to monitoring stack (DCGM Exporter)
- Set up LiteLLM if you need a unified API gateway or multiple backends
- Configure backup strategy for Milvus data and Open WebUI state
- Document model selection rationale and GPU sizing decisions
- Plan upgrade strategy — SUSE AI charts receive regular updates