SUSE Observability

Topology-powered Kubernetes observability with time-travel debugging — formerly StackState

01

Overview & History

SUSE Observability is an enterprise observability platform for Kubernetes and cloud-native infrastructure. It unifies metrics, logs, traces, and topology into a single platform, built around a unique 4T data model (Topology, Telemetry, Traces, Time) that correlates all observability signals against a real-time dependency map of your infrastructure. The current version is v2.8.1 (released 17 March 2026).

The product was originally developed by StackState, a Dutch observability company founded in 2015 by Mark Bakker and Lodewijk Bogaards (with Remco Beckers joining as a third co-founder). StackState was built around the insight that traditional monitoring creates data silos, and that mapping IT topology — the relationships and dependencies between components — is essential for understanding complex distributed systems.

History StackState Origins (2014–2017)

Born from a consulting engagement at a major Dutch bank in 2014, where the founders discovered that performance issues persisted despite abundant monitoring data — the problem was not lack of data but lack of insight. They spent 3 years building a custom versioned graph database from scratch because no existing graph database supported time-travel capabilities.

History StackState Launch (2017–2024)

Launched in 2017 as the first observability platform with a time-traveling topology. Recognized by Gartner as a Cool Vendor in Performance Analysis (2019) and as a representative vendor in the Gartner Market Guide for AIOps Platforms (2021). Customers included KPN, Vodafone, and Accenture. The company grew to 50+ employees.

Acquisition SUSE Acquires StackState

June 18, 2024 — announced at SUSECON Berlin. SUSE acquired StackState to add full-stack observability to its Rancher ecosystem. Financial terms were not disclosed. SUSE announced its intention to open-source StackState in the future to foster broader adoption.

Rebranding SUSE Observability (2024–present)

Integrated into Rancher Prime 3.1 on September 5, 2024. Rebranded from StackState to "SUSE Observability." Version 2.0.0 (11 Sep 2024) was the first release under the SUSE brand. The documentation site moved from docs.stackstate.com to documentation.suse.com. The development organization on GitHub remains StackVista.

Deployment Models

  • SUSE Observability (Self-Hosted) — deployed on your own Kubernetes cluster via Helm. Included with SUSE Rancher Prime subscriptions. Full control over data and infrastructure.
  • SUSE Cloud Observability (SaaS) — fully managed SaaS platform launched November 2024. Available on AWS Marketplace. Setup in under 5 minutes. Supports EKS, on-premises, and Rancher-managed clusters. This was SUSE's first SaaS-based product.
02

The 4T Data Model

The core differentiator of SUSE Observability is the 4T data model, introduced in StackState v4.6. Traditional observability tools treat metrics, logs, and traces as separate concerns. SUSE Observability correlates Topology, Telemetry, and Traces at every moment in Time, providing a unified context for troubleshooting that no individual signal can offer alone.

T1 Topology

A real-time map of all infrastructure components and their dependencies (relationships). In Kubernetes, this includes clusters, nodes, namespaces, deployments, pods, services, persistent volumes, and their connections. Topology is auto-discovered from the Kubernetes API and enriched by eBPF-based network observation. Stored in a custom versioned graph database (StackGraph) that preserves every historical state.

T2 Telemetry

Metrics, events, and logs collected from observed infrastructure. Metrics are stored in VictoriaMetrics, logs in Elasticsearch. Telemetry is automatically bound to topology components, so you always see metrics in the context of what they belong to — not just as isolated time-series.

T3 Traces

Distributed traces that show how requests flow across services. Collected via OpenTelemetry or SUSE Observability's own eBPF-based request tracing. Traces are stored in ClickHouse and are correlated with topology to show request paths across the dependency map.

T4 Time

The temporal dimension that binds the other three. Every topology snapshot, metric, log entry, and trace span is precisely timestamped. This enables time-travel debugging — the ability to reconstruct the exact state of your infrastructure at any point in the past and see all associated observability data. This is the foundational innovation built on the versioned graph database.

Why Topology Changes Everything

In traditional monitoring (Prometheus + Grafana), you have metrics and dashboards but no automatic understanding of what depends on what. When a database goes down, you see the database alert, but you have to manually figure out which applications are affected. With the 4T model:

  • Context is automatic — every metric, log, and trace is tied to a component in the topology map
  • Impact analysis is built-in — if a component becomes unhealthy, you instantly see all dependent components that are affected via health propagation
  • Root cause analysis follows the graph — problems propagate through the dependency chain, and SUSE Observability identifies the unhealthy component at the bottom of the chain as the probable root cause
  • Ephemeral resources are preserved — even after a pod is deleted, you can travel back in time and see how it was connected, its logs, events, and related resources
03

Architecture

SUSE Observability consists of three primary architectural components: the Server (on-premises or SaaS), the Agent (deployed on observed clusters), and the optional Rancher Prime UI Extension.

[ Observed Cluster(s) ]
    Node Agent (eBPF)              -- DaemonSet, one per node
    Cluster Agent + Checks Agent
    kube-state-metrics
            |
            |  HTTPS (agent data)
            v
[ SUSE Observability Server Cluster ]
    Router (Envoy) --> Receiver (base, logs, process) | API Server | UI (React)
            |
            v
    Kafka (message bus)
            |
            v
    Processing services: Sync, Health-Sync, State, Checks, Correlate
            |
            v
    Data stores: StackGraph (HBase/HDFS), VictoriaMetrics, ClickHouse,
                 Elasticsearch, ZooKeeper

Server Components (Distributed Mode)

In HA production deployments, the server runs in distributed mode with separate pods for each function. In non-HA setups, all functions consolidate into a single suse-observability-server pod.

Ingestion Receivers

In HA mode, receivers are split into three types: base (agent telemetry), logs (log data), and process-agent (process-level data). An OpenTelemetry Collector (suse-observability-otel-collector-0) handles OTLP data from instrumented applications.

Processing Processing Services

Individual services handle specific functions: Sync (topology synchronization), Health-Sync (health state computation), State (state management), Checks (monitor evaluation), Correlate (event correlation and problem grouping), Notification (alert delivery), and Slicing (data partitioning).

Serving API & UI

The API server handles all PromQL and topology queries. The UI is a static React application. The Router is an Envoy-based proxy that routes requests to the appropriate backend service. Default port: 8080.

Optional Anomaly Detection

Spotlight-based anomaly detection is available but disabled by default. It uses machine learning to detect deviations from normal metric patterns. Requires a separate anomaly detection chart (v5.2.0-snapshot.179). An AI Assistant and MCP Server are also included for natural-language querying.

04

Backing Services & Data Stores

SUSE Observability runs on six major backing services, plus optional MinIO for backups, all deployed as part of the Helm chart. There is no external dependency on managed databases: everything runs inside the Kubernetes cluster.

Service | Purpose | Chart Version | Pod Pattern
StackGraph (HBase + HDFS) | Topology & configuration storage (versioned graph database) | v0.2.128 | *-hbase-stackgraph-0 (non-HA) or name-nodes, region servers, data-nodes, Tephra (HA)
VictoriaMetrics | Metrics storage & query | v0.8.53-stackstate.45 | *-victoria-metrics-0-0, *-vmagent-0
ClickHouse | Trace & OpenTelemetry data storage | v3.6.9-suse-observability.21 | *-clickhouse-shard0-N
Elasticsearch | Events & logs storage | v8.19.4-stackstate.18 | *-elasticsearch-master-N
Kafka | Message bus for in-transit topology & telemetry updates | v19.1.3-suse-observability.20 | *-kafka-N
ZooKeeper | Service discovery, orchestration & failover coordination | v8.1.2-suse-observability.18 | *-zookeeper-N
MinIO | S3-compatible object storage for backups | v8.0.10-stackstate.25 | Optional, for backup/restore

Backup Architecture

Backups are handled through a MinIO gateway that supports three storage backends: AWS S3, Azure Blob Storage, or Kubernetes PersistentVolumes.

Data Store | Backup Type | Default Schedule | Default Retention
StackGraph | Full (single .graph file) | Daily at 03:00 | 30 days
VictoriaMetrics | Incremental | Hourly (staggered 25/35 min past the hour) | ~14 days
Elasticsearch | Incremental snapshots | Daily at 03:00 | 30 days
ClickHouse | Full + incremental | Full daily at 00:45, incremental hourly | ~14 days

Not Backed Up

Kafka and ZooKeeper data are not backed up. Kafka holds only in-transit data that has temporary value. ZooKeeper holds master node negotiation state that is automatically recreated.

05

Agent Architecture

The SUSE Observability Agent is deployed on each observed cluster (not the server cluster) via Helm. It consists of four components that work together to collect topology, metrics, events, logs, traces, and network data.

DaemonSet Node Agent

Deployed as a DaemonSet on every node. Runs with hostNetwork: true to scrape metrics endpoints from all pods, and hostPID: true to map processes to containers via cgroups. Injects eBPF programs into network namespaces to monitor workload communication, tracking TCP connections and decoding L7 protocols (HTTP/1.0, HTTP/1.1, TLS, Redis). Reads conntrack tables across all network namespaces for connection tracking. Requires securityContext.privileged: true.

Deployment Cluster Agent

A single instance per cluster. Communicates with the Kubernetes API to discover topology: clusters, nodes, namespaces, deployments, statefulsets, daemonsets, pods, services, configmaps, persistent volumes, ingresses, and their relationships. Requires ClusterRole and ClusterRoleBinding for API access.

Deployment Checks Agent

Runs health and diagnostic checks against the cluster. Evaluates the health of Kubernetes resources and reports status back to the SUSE Observability server. Works in conjunction with the monitors configured on the server side.

Dependency kube-state-metrics

Deployed as part of the agent Helm chart. Exposes Kubernetes object state as Prometheus-format metrics (pod status, deployment replicas, resource requests/limits, etc.). The Node Agent scrapes these metrics and forwards them to SUSE Observability.
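
For a sense of what the Node Agent scrapes, here are two standard kube-state-metrics series in Prometheus exposition format (the namespace, pod, and deployment label values are hypothetical):

```
# HELP kube_pod_status_phase The pods current phase.
kube_pod_status_phase{namespace="default",pod="web-7f9c",phase="Running"} 1
# HELP kube_deployment_status_replicas_available The number of available replicas per deployment.
kube_deployment_status_replicas_available{namespace="default",deployment="web"} 3
```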

Request Tracing (Cross-Service)

For tracing requests across service boundaries, load balancers, and service meshes, SUSE Observability can inject a sidecar proxy via a mutating webhook. The sidecar injects an X-Request-ID header into all HTTP traffic. This header is observed at both client and server endpoints, allowing SUSE Observability to map service dependencies across cluster boundaries.

  • Supported protocols: HTTP/1.0, HTTP/1.1 with keepAlive, unencrypted traffic, OpenSSL-encrypted traffic
  • Supported integrations: LinkerD service mesh, Envoy proxy, Istio EnvoyFilters
  • Resource overhead: 25–40 MB memory per pod for the sidecar proxy, plus variable CPU based on request volume
  • Annotation: http-header-injector.stackstate.io/inject: enabled
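
Putting the annotation to use, a Deployment's pod template might look like the following sketch (the workload name and image are placeholders; only the annotation key comes from the documentation):

```yaml
# Hypothetical Deployment fragment: opting a workload into sidecar injection
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout                  # example workload name
spec:
  template:
    metadata:
      annotations:
        # The mutating webhook sees this annotation and injects the
        # X-Request-ID header-injector sidecar proxy into the pod
        http-header-injector.stackstate.io/inject: enabled
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:1.0   # placeholder image
```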

Supported Container Runtimes

  • containerd
  • CRI-O

06

Topology & Health Model

The topology-based health model is how SUSE Observability turns raw observability data into actionable insights. Every component in your infrastructure has a health state, and health propagates through the dependency graph to enable automatic root cause analysis.

Components & Relations

  • Component — any discrete element in your infrastructure: a pod, a node, a service, a deployment, a namespace, a PV, etc. Each has properties, telemetry bindings, and a health state.
  • Relation — a directed dependency between two components. The arrow indicates dependency direction: app → db means "app depends on db."

Health States

Each component has a computed health state based on monitors that evaluate metrics, topology, and metadata:

  • CLEAR (green) — component is healthy, all monitors pass
  • DEVIATING (orange) — component is deviating from expected behavior
  • CRITICAL (red) — component has a critical issue
  • UNKNOWN (gray) — no health data available

Health Propagation & Root Cause Analysis

Health propagates in the opposite direction to dependency arrows. If app → db and the database turns red, the app component's outer color turns red to indicate potential impact from a failing dependency. The inner color shows the component's own health; the outer color shows health propagated from the components it depends on.

How Root Cause Analysis Works

A problem groups related unhealthy components. The root cause is the unhealthy element at the bottom of the dependency chain. All other unhealthy elements that depend on the root cause are contributing causes. When health states change, root cause identification is automatically updated. A problem is considered resolved when all contributing and root cause elements return to CLEAR.

Out-of-the-Box Monitors

SUSE Observability ships with pre-configured monitors for common Kubernetes failure modes. Each monitor includes remediation guides that appear directly in the UI with step-by-step troubleshooting instructions. Monitors can be:

  • Metric-based — threshold and dynamic threshold monitors on metrics
  • Topology-based — validate topology structure and component properties (unique to SUSE Observability's 4T Monitors)
  • Derived state — monitors that derive health from related components
  • Custom — user-defined monitors via the UI or CLI, can target Prometheus metrics ingested via remote_write
07

Time-Travel Debugging

Time-travel is SUSE Observability's signature capability, built on the versioned graph database that preserves every topology state change. It operates on two independent time dimensions that can be controlled separately.

Dimension Topology Time

A specific moment in time for which you fetch a snapshot of your Kubernetes resources. When you select a topology time in the past, the interface reconstructs the exact infrastructure state at that moment — which pods existed, how they were connected, their configurations, and their health states. Even deleted pods are visible at their historical topology time.

Dimension Telemetry Interval

The time range for which you want to see telemetry data (metrics, events, logs, traces). This is independent of topology time. Maximum window is 6 months. Telemetry shown is filtered to only data related to components that existed at the selected topology time.

How It Works in Practice

  1. Incident occurs at 2:00 AM — you arrive at 9:00 AM to investigate
  2. Set topology time to 2:00 AM — the topology perspective reconstructs the exact state of your infrastructure at that time, including pods that may have been killed and restarted since then
  3. Set telemetry interval around 2:00 AM — see metrics, logs, events, and traces from that window
  4. Navigate the topology — follow the dependency graph from affected services to the root cause, seeing all associated telemetry for each component at that point in time
  5. Scrub through time — use the timeline at the bottom of the UI to move forward and backward, watching how the topology and health states changed
Key Insight

Traditional monitoring tools lose context when Kubernetes resources are ephemeral. A CrashLooping pod that was killed and replaced has its logs and metrics scattered or lost. SUSE Observability preserves the complete picture — the pod's topology position, its relationships, its logs, events, and metrics — accessible through time-travel up to the configured data retention period (default 30 days for production).

08

Installation & Deployment

SUSE Observability is deployed via Helm charts to a dedicated Kubernetes cluster (or namespace on an existing cluster). Installation takes approximately 30 minutes. Helm v3.13.1 or higher is required.

Step 1: Add Helm Repository

# Add the SUSE Observability Helm repo
helm repo add suse-observability \
  https://charts.rancher.com/server-charts/prime/suse-observability
helm repo update

Step 2: Create Namespace

kubectl create namespace suse-observability

Step 3: Create values.yaml

# values.yaml - Core configuration
global:
  suseObservability:
    license: "YOUR-LICENSE-KEY"          # From SUSE Customer Center
    baseUrl: "https://observability.example.com"  # External access URL
    adminPassword: "your-admin-password"  # Plain text or bcrypt hash
    sizing:
      profile: "150-ha"                  # See sizing profiles below
  # imageRegistry: "registry.example.com" # Optional: custom registry
  # storageClass: "gp3"                   # Optional: override default

Step 4: Deploy

# Install SUSE Observability
helm upgrade --install \
  --namespace suse-observability \
  --values values.yaml \
  suse-observability \
  suse-observability/suse-observability

# Verify installation
helm list --namespace suse-observability
kubectl get pods --namespace suse-observability

# Port-forward for local access
kubectl port-forward \
  service/suse-observability-suse-observability-router 8080:8080 \
  --namespace suse-observability

Step 5: Deploy Agent on Observed Clusters

After the server is running, navigate to StackPacks > Integrations > Kubernetes in the SUSE Observability UI. Create a new instance with a cluster identifier. The UI will generate a Helm command with pre-filled configuration:

# Generated by SUSE Observability UI (example)
helm upgrade --install \
  --namespace suse-observability \
  --create-namespace \
  --set-string 'stackstate.apiKey=YOUR-API-KEY' \
  --set-string 'stackstate.cluster.name=my-cluster' \
  --set-string 'stackstate.url=https://observability.example.com/receiver/stsAgent' \
  suse-observability-agent \
  suse-observability/suse-observability-agent

Sizing Profiles

Profile | Observed Nodes | HA | Use Case
trial | Up to 10 | No | Evaluation only
10-nonha | 10 | No | Small / testing
20-nonha | 20 | No | Small / testing
50-nonha | 50 | No | Small / testing
100-nonha | 100 | No | Small production
150-ha | 150 | Yes (3x replicas) | Production
250-ha | 250 | Yes | Production
500-ha | 500 | Yes | Large production
4000-ha | 4,000 | Yes | Enterprise

Node Counting

An "observed node" is defined as 4 vCPUs + 16 GB memory. If your actual nodes are larger, they count as multiples. For example, a node with 12 vCPU / 48 GB counts as 3 observed nodes.
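
The arithmetic above can be sketched in shell. Note that the ceiling-and-maximum rule below is an assumption; the documentation only defines the 4 vCPU / 16 GB unit and gives the one worked example:

```shell
# Sketch: how many "observed nodes" a physical node counts as,
# using the 4 vCPU / 16 GB unit. Rounding behavior is assumed.
observed_units() {
  local vcpu=$1 mem_gb=$2
  # Divide each dimension by the unit size, rounding up...
  local by_cpu=$(( (vcpu + 3) / 4 ))
  local by_mem=$(( (mem_gb + 15) / 16 ))
  # ...and let the larger dimension determine the count.
  if [ "$by_cpu" -gt "$by_mem" ]; then echo "$by_cpu"; else echo "$by_mem"; fi
}

observed_units 12 48   # 12 vCPU / 48 GB -> 3 observed nodes
observed_units 4 16    # baseline node   -> 1
```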

Air-Gapped Installation

For disconnected environments, pull all container images to a local registry and provide a local-docker-registry.yaml with global.imageRegistry set to your internal registry.
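
A minimal local-docker-registry.yaml might look like this (the registry hostname is a placeholder; global.imageRegistry is the key named above):

```yaml
# local-docker-registry.yaml - point all chart images at the internal registry
global:
  imageRegistry: "registry.internal.example.com"
```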

helm upgrade --install \
  --namespace suse-observability \
  --values local-docker-registry.yaml \
  --values values.yaml \
  suse-observability \
  suse-observability/suse-observability
09

Requirements & Sizing

Compute Requirements (Server Cluster)

Profile | CPU Requests | CPU Limits | Memory Requests | Memory Limits | Storage
trial | 7.0 cores | 15.1 cores | 22.7 Gi | 23.3 Gi | 163 GB
10-nonha | 7.0 cores | 15.1 cores | 22.7 Gi | 23.3 Gi | 358 GB
50-nonha | 14.0 cores | 28.8 cores | 30.9 Gi | 31.0 Gi | ~450 GB
100-nonha | 23.6 cores | 47.9 cores | 47.0 Gi | 47.2 Gi | 562 GB
150-ha | 49.6 cores | 105.2 cores | 127.0 Gi | 131.8 Gi | 2.8 TB
500-ha | 85.1 cores | 176.2 cores | 166.4 Gi | 171.2 Gi | 3.9 TB
4000-ha | 212.1 cores | 281.0 cores | 263.9 Gi | 321.7 Gi | 7.5 TB

Minimum Node Specifications

Deployment Type | Min vCPU/Node | Min Memory/Node
Non-HA (testing/small) | 4 vCPU | 8 GB
HA (up to 500 nodes) | 8 vCPU | 16 GB
HA (4000 nodes) | 16 vCPU | 32 GB

Kubernetes Compatibility

Platform | Supported Versions
Kubernetes | 1.25 through 1.33
OpenShift | 4.14 through 4.19
Rancher 2.11.x | RKE2 v1.30.11+rke2r1
Rancher 2.12.x | RKE2 v1.30.11+rke2r1
Rancher 2.13.x | RKE2 v1.30.11, v1.31.13, v1.32.10 (+rke2r1)

Supported Kubernetes Distributions

  • Cloud managed: Amazon EKS, Azure AKS, Google GKE, Alibaba Cloud ACK
  • On-premises: RKE2, K3s, vanilla Kubernetes
  • Enterprise: OpenShift (4.14–4.19)
Storage Warning

NFS is not supported for storage provisioning due to the risk of data corruption. Use SSD/flash-based storage for production deployments. The default storage class is used unless global.storageClass is specified in values.yaml. ResourceQuota is not recommended as it may interfere with resource allocation.

Data Retention Defaults

  • Trial: 3 days
  • Production profiles: 30 days
  • SaaS (Cloud Observability): ~1 day for events/logs/metrics, ~12 hours for traces (default tier)

Other Requirements

  • Helm: v3.13.1 or higher (Helm 4 supported as of v2.8.0)
  • Ingress: An ingress controller or load balancer for external HTTPS access
  • Browsers: Chrome and Firefox
  • Authentication: OIDC, KeyCloak, Microsoft Entra ID, LDAP, file-based, or single-password
10

Integrations & Data Sources

SUSE Observability extends its functionality through StackPacks — plugin packages that provide automated integration with external systems. StackPacks come in two types: Add-ons (extend platform capabilities) and Integrations (connect to external data sources).

OpenTelemetry (Native)

SUSE Observability is OpenTelemetry-native. It includes an OpenTelemetry Collector (v0.108.0-stackstate.21) as a built-in component and accepts OTLP data (traces, metrics, logs) at dedicated API endpoints. The recommended architecture:

  1. Instrument applications with OpenTelemetry SDKs
  2. Deploy the OpenTelemetry Collector near instrumented applications to preprocess data (enrich with K8s labels, implement sampling)
  3. Forward to SUSE Observability's OTLP endpoints
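
Steps 2 and 3 can be sketched as a collector configuration. The exporter endpoint and authorization header below are assumptions for illustration, not documented values; check the SUSE Observability OTLP documentation for the real ones:

```yaml
# Hypothetical OpenTelemetry Collector config: enrich with K8s metadata,
# batch, and forward OTLP data to SUSE Observability.
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  k8sattributes: {}   # enrich telemetry with Kubernetes labels
  batch: {}           # batch before export (sampling could be added here)
exporters:
  otlp:
    endpoint: observability.example.com:443        # assumed OTLP endpoint
    headers:
      Authorization: "SUSEObservability <API-KEY>" # assumed auth scheme
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [otlp]
```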

Out-of-the-box capabilities include monitors for span error rates and duration metrics, metric bindings for span metrics, .NET and JVM memory metrics, and service overview pages.

Prometheus Integration

SUSE Observability exposes a Prometheus remote_write endpoint to mirror metrics from existing Prometheus instances:

# Add to your Prometheus config
remote_write:
  - url: https://<base-url>/receiver/prometheus/api/v1/write
    headers:
      sts-api-key: "<API-KEY>"
    # Or use basic_auth:
    # basic_auth:
    #   username: apikey
    #   password: "<API-KEY>"

This enables using existing Prometheus metrics in SUSE Observability's monitors and topology context without replacing your existing Prometheus setup.

Kubernetes StackPack

The core integration. Provides auto-discovery of all Kubernetes topology (clusters, nodes, namespaces, workloads, pods, services, etc.), pre-built monitors for common Kubernetes issues, and the agent deployment configuration. Multi-instance support allows monitoring multiple clusters from a single SUSE Observability server.

Other Integrations

  • Cloud providers: AWS StackPack (supports multiple AWS accounts), Azure, GCP
  • Alerting: Slack, Jira, custom webhooks
  • CI/CD: Integration with CI/CD pipelines for deployment correlation
  • Custom: StackPacks can be extended or new ones created for custom data sources
  • Splunk: Integration for log forwarding (v2.8.0 added improvements)
  • 40+ prebuilt dashboards for common Kubernetes monitoring scenarios
11

Rancher Integration

SUSE Observability is tightly integrated with SUSE Rancher Prime through a UI extension and shared RBAC. The observability license is included with Rancher Prime subscriptions.

Rancher Prime UI Extension

A Rancher Manager extension that integrates SUSE Observability health signals directly into the Rancher UI. Installation:

  1. Enable UI extensions from the Rancher UI
  2. Navigate to Extensions > Available
  3. Install the Observability extension
  4. Navigate to SUSE Observability > Configurations in the left panel
  5. Add the SUSE Observability server URL and credentials

Once configured, Rancher displays health indicators on every resource (cluster, node, workload, pod). Clicking a health indicator provides a direct link to SUSE Observability's detailed investigation view for that resource.

RBAC Integration

SUSE Observability supports Rancher RBAC, allowing you to map Rancher roles and permissions to SUSE Observability access levels. This means Rancher users see only the clusters and resources they have permission to view.

Complementing Existing Prometheus + Grafana

SUSE Observability does not replace Rancher's built-in Prometheus + Grafana monitoring stack. Instead, it complements it:

  • Prometheus + Grafana (Rancher Monitoring) — provides detailed metrics dashboards, PromQL queries, and alerting rules for specific metrics
  • SUSE Observability — adds topology awareness, cross-cluster correlation, root cause analysis, time-travel debugging, and the 4T data model
  • Connect them via Prometheus remote_write to feed Prometheus metrics into SUSE Observability's topology-correlated view
How They Work Together

Think of Prometheus + Grafana as your microscope (deep metrics analysis) and SUSE Observability as your map (understanding what is connected to what, what broke, and why). Rancher is the control plane that ties them together with unified RBAC and a single management interface.

12

Comparison & Licensing

SUSE Observability vs. Alternatives

Capability | SUSE Observability | Datadog | Dynatrace | Prometheus + Grafana
Topology-based monitoring | Core differentiator: auto-discovered versioned topology graph | Service maps exist but not versioned/time-travel enabled | Smartscape topology, AI-driven | No built-in topology
Time-travel debugging | Full infrastructure state reconstruction at any past moment | Historical dashboards, no topology time-travel | Session replay for user sessions, not infra topology | Historical PromQL queries only
Root cause analysis | Automatic via dependency graph traversal | Watchdog AI-based correlation | Davis AI engine (patented) | Manual investigation
Deployment model | Self-hosted (K8s) or SaaS | SaaS only | SaaS or Managed (on-prem available) | Self-hosted
Open source | Planned (SUSE committed to open-sourcing) | No (agent is open-source) | No | Fully open-source (Apache 2.0)
Kubernetes-native | Primary focus; deep K8s topology | Strong K8s support, broader scope | Strong K8s support, broader scope | Excellent K8s integration
Pricing model | Included with Rancher Prime, or SaaS per-host | Per-host + per-feature add-ons | Host Units (tied to RAM), complex | Free (operational costs only)
OpenTelemetry | Native OTLP support + built-in collector | OTLP ingestion supported | OTLP ingestion supported | Via OTLP remote_write or Alloy
eBPF monitoring | Built-in for L7 protocol decoding & network topology | Yes (network monitoring) | OneAgent uses eBPF | Separate tools (Cilium, Pixie)

Unique Selling Points

  • Versioned topology — the only platform with a custom-built versioned graph database that stores every topology state change, enabling true time-travel debugging of infrastructure
  • 4T Monitors — monitors that can validate topology structure and properties, not just metric thresholds
  • Rancher-native — deep integration with the Rancher ecosystem, shared RBAC, included in Rancher Prime subscription
  • Self-hosted option — full on-premises deployment for organizations with data sovereignty requirements, unlike SaaS-only competitors
  • Open-source commitment — SUSE has committed to open-sourcing the platform

Licensing & Pricing

Included SUSE Rancher Prime

SUSE Observability is included with SUSE Rancher Prime subscriptions. The license key is available in the SUSE Customer Center under the Subscription tab, shown as "SUSE Observability" Registration Code. Valid for the duration of your Rancher Prime subscription.

SaaS SUSE Cloud Observability

Available on AWS Marketplace with pay-as-you-go pricing:

  • 10–100 hosts: $9.99/host/month (hourly billing, 10-host minimum = $99/mo base)
  • 100+ hosts: $8.99/host/month ($899/mo base)
  • Included: 5 GB logs + 5 GB metrics + 5 GB traces
  • Overage: $0.15/GB

Add-on Platform Optimization

"SUSE Platform Optimization" is a separate add-on that requires its own license. It provides cost optimization recommendations for Kubernetes workloads. Not included in the base Observability license.

Future Open Source

SUSE announced plans to open-source StackState/SUSE Observability. As of March 2026, this has not yet occurred, but SUSE has been contributing to CNCF observability projects (including a case study on Longhorn). No timeline for the open-source release has been published.


Version History

Version | Date | Notable Changes
v2.8.1 | 17 Mar 2026 | Latest release (patch)
v2.8.0 | 03 Mar 2026 | Helm 4 support, simplified installation, Traefik ingress docs
v2.7.0 | 14 Jan 2026 | Feature release
v2.6.0 | 29 Sep 2025 | HBase 2.6.3 upgrade, global commonLabels, editable service monitors. Breaking: ClickHouse/ZooKeeper StatefulSet labels immutable
v2.5.0 | 08 Sep 2025 | Feature release
v2.4.0 | 25 Aug 2025 | Feature release
v2.3.0 | 30 Jan 2025 | Feature release (7 patch releases through v2.3.7)
v2.2.0 | 09 Dec 2024 | Feature release
v2.1.0 | 29 Oct 2024 | Feature release
v2.0.0 | 11 Sep 2024 | First SUSE-branded release, integrated with Rancher Prime 3.1