Kubernetes Production Guide
Container orchestration — distributions, networking, storage, security & operations
Overview
Kubernetes (K8s) is an open-source container orchestration platform originally designed by Google and now maintained by the Cloud Native Computing Foundation (CNCF). It automates the deployment, scaling, and management of containerized applications across clusters of machines.
At its core, Kubernetes follows a declarative model: you describe the desired state of your workloads (how many replicas, what image, what resources, what networking), and Kubernetes continuously reconciles the actual state to match. This is fundamentally different from imperative scripting where you tell the system what to do step-by-step.
Architecture
Key concepts
- Control Plane — The brain of the cluster. The API Server is the single entry point for all operations. The Scheduler places pods on nodes. The Controller Manager runs reconciliation loops. etcd stores all cluster state.
- Worker Nodes — Machines that run your workloads. Each node runs a kubelet (agent that talks to the API server), kube-proxy (networking rules), and a container runtime (containerd, CRI-O).
- Pods — The smallest deployable unit. A pod contains one or more containers that share networking and storage. Pods are ephemeral by design.
- Deployments — Declarative way to manage ReplicaSets and pods. You specify the desired number of replicas and the update strategy, and the Deployment controller handles the rest.
- Services — Stable network endpoints that abstract away pod IPs. Services provide load balancing across pods that match a label selector.
- Namespaces — Virtual clusters within a physical cluster. Used for multi-tenancy, environment separation (dev/staging/prod), and resource quota boundaries.
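Namespaces pair naturally with ResourceQuota objects to enforce those boundaries. A minimal sketch, with illustrative names and limits:

```yaml
# A tenant namespace plus a quota capping what it may consume.
apiVersion: v1
kind: Namespace
metadata:
  name: team-a            # illustrative tenant namespace
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "4"     # total CPU requests across all pods
    requests.memory: 8Gi  # total memory requests
    pods: "20"            # max pod count in the namespace
```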
Declarative vs imperative
Declarative
You write YAML manifests that describe the desired state. Kubernetes controllers continuously reconcile actual state to match. If a pod crashes, it gets recreated. If a node dies, pods get rescheduled. This is the correct approach for production.
kubectl apply -f deployment.yaml
Imperative
You issue one-off commands that directly modify cluster state. Useful for debugging and quick experiments, but not suitable for production because changes are not tracked or reproducible.
kubectl create deployment nginx --image=nginx
kubectl scale deployment nginx --replicas=3
Kubernetes does not run containers. It orchestrates them. The actual container execution is handled by the container runtime (containerd or CRI-O). Kubernetes manages the lifecycle, scheduling, networking, and storage for those containers. Think of Kubernetes as the operating system for your datacenter — it abstracts away individual machines and lets you treat a cluster as a single compute surface.
Distributions
Kubernetes is a set of components, not a single binary you install. Distributions package those components with opinionated defaults for networking, storage, ingress, and container runtime. The three most common lightweight/edge distributions are MicroK8s, K3s, and RKE2.
Comparison table
| Feature | MicroK8s | K3s | RKE2 |
|---|---|---|---|
| Maintainer | Canonical | Rancher Labs (SUSE) | Rancher Labs (SUSE) |
| Packaging | Snap package | Single binary | RPM / tarball |
| Default CNI | Calico | Flannel | Canal (Flannel + Calico) |
| Default Ingress | None (addon available) | Traefik | Nginx Ingress Controller |
| Default Storage | hostpath-storage (addon) | Local-path provisioner | None (manual setup) |
| Container Runtime | containerd | containerd | containerd |
| Datastore | Dqlite (default) / etcd | Embedded SQLite (single) / etcd (HA) | Embedded etcd |
| Security Hardening | Manual | Manual | CIS hardened by default |
| Best For | Dev, IoT, single-node, Ubuntu | Edge, IoT, resource-constrained | Production, gov, air-gapped |
| HA Support | Yes (3+ nodes) | Yes (embedded etcd or external DB) | Yes (embedded etcd) |
| Addon System | Yes (microk8s enable) | No (use Helm/manifests) | No (use Helm/manifests) |
When to use which
MicroK8s
- Developer workstations (especially Ubuntu / WSL)
- Single-node clusters for testing
- IoT and edge with snap-based infrastructure
- Quick enablement of common addons (dns, dashboard, registry, gpu, istio)
K3s
- Edge computing and resource-constrained environments
- CI/CD pipelines needing a quick cluster
- ARM devices (Raspberry Pi)
- When you need the smallest possible footprint (~2GB RAM minimum recommended; ~512MB technically possible but impractical for real workloads)
RKE2
- Production clusters where security compliance is required (FedRAMP, STIG, CIS)
- Government and defense environments
- Air-gapped deployments (designed for it)
- When you need FIPS-validated cryptography (currently FIPS 140-2; plan for 140-3 transition by Sept 2026)
- Rancher-managed multi-cluster environments
For production workloads that require security hardening, RKE2 is the default recommendation. It ships CIS-hardened out of the box, which saves weeks of manual hardening. For dev/test and edge, K3s is the go-to choice for its simplicity and minimal resource footprint. MicroK8s is best when the client is heavily invested in the Ubuntu/Canonical ecosystem and wants snap-based management.
kubectl & Kubeconfig
kubectl is the primary CLI for interacting with Kubernetes clusters. It communicates with the API server using configuration stored in a kubeconfig file (default: ~/.kube/config).
Kubeconfig structure
A kubeconfig file has three main sections (clusters, users, contexts) plus a current-context pointer:
apiVersion: v1
kind: Config
clusters:
- name: production
  cluster:
    server: https://10.0.1.100:6443
    certificate-authority-data: <base64-ca-cert>
users:
- name: admin
  user:
    client-certificate-data: <base64-client-cert>
    client-key-data: <base64-client-key>
contexts:
- name: prod-admin
  context:
    cluster: production
    user: admin
    namespace: default
current-context: prod-admin
- clusters — Define API server endpoints and CA certificates
- users — Define authentication credentials (certs, tokens, OIDC)
- contexts — Bind a cluster + user + optional namespace into a named context
- current-context — The active context that kubectl uses by default
Merging kubeconfigs
When managing multiple clusters, you can merge kubeconfigs using the KUBECONFIG environment variable:
# Merge multiple kubeconfig files
export KUBECONFIG=~/.kube/config:~/.kube/cluster2.yaml:~/.kube/cluster3.yaml
# Flatten into a single file
kubectl config view --flatten > ~/.kube/merged-config
export KUBECONFIG=~/.kube/merged-config
# Switch between contexts
kubectl config get-contexts
kubectl config use-context prod-admin
kubectl config use-context staging-dev
Common kubectl commands
| Command | Purpose |
|---|---|
| kubectl get pods -A | List all pods across all namespaces |
| kubectl describe pod <name> | Detailed info including events |
| kubectl logs <pod> -f | Stream logs from a pod |
| kubectl exec -it <pod> -- /bin/sh | Shell into a running container |
| kubectl apply -f manifest.yaml | Declaratively apply a resource |
| kubectl delete -f manifest.yaml | Delete resources defined in a file |
| kubectl get events --sort-by=.lastTimestamp | View recent cluster events |
| kubectl top pods | Resource usage (requires metrics-server) |
| kubectl port-forward svc/myapp 8080:80 | Forward local port to a service |
| kubectl drain <node> --ignore-daemonsets | Safely evict pods before node maintenance |
TLS SAN warnings
When connecting to a cluster, you may encounter a certificate error like:
Unable to connect to the server: x509: certificate is valid for 10.0.1.100,
127.0.0.1, not 192.168.1.50
Why this happens: The Kubernetes API server generates a TLS certificate during cluster initialization. That certificate includes a list of Subject Alternative Names (SANs) — the hostnames and IP addresses the certificate is valid for. If you connect to the API server using a hostname or IP that is not in the SAN list, TLS verification fails because the client cannot verify it is talking to the correct server.
This commonly occurs when:
- Accessing a cluster from outside the network (the external IP is not in the cert)
- Using a load balancer IP or DNS name that was not included at install time
- Connecting via a VPN or bastion host with a different IP
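Before changing cluster config, it helps to confirm which SANs a certificate actually carries. The sketch below generates a throwaway certificate with explicit SANs and reads the list back with the same x509 inspection you would run against the API server's certificate; the hostnames and IPs are illustrative:

```shell
# Generate a throwaway certificate with explicit SANs (OpenSSL >= 1.1.1).
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/san-demo.key -out /tmp/san-demo.crt -days 1 \
  -subj "/CN=kube-apiserver" \
  -addext "subjectAltName=DNS:k8s.example.com,IP:192.168.1.50,IP:127.0.0.1"

# Read the SAN list back from the certificate.
openssl x509 -in /tmp/san-demo.crt -noout -ext subjectAltName

# Against a live API server, the same check is:
#   echo | openssl s_client -connect <host>:6443 2>/dev/null \
#     | openssl x509 -noout -ext subjectAltName
```

If the address you connect with is missing from that output, kubectl will fail TLS verification until the SAN is added.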
Fixing SAN issues per distribution
RKE2 --tls-san flag
Add SANs at install time or in the config file:
# /etc/rancher/rke2/config.yaml
tls-san:
- "k8s.example.com"
- "192.168.1.50"
- "10.0.0.100"
Restart the RKE2 server after modifying. The API server certificate will be regenerated with the new SANs.
K3s --tls-san flag
Pass SANs during install or in the config:
# During install
curl -sfL https://get.k3s.io | sh -s - server \
  --tls-san k8s.example.com \
  --tls-san 192.168.1.50

# Or in /etc/rancher/k3s/config.yaml
tls-san:
  - "k8s.example.com"
  - "192.168.1.50"
MicroK8s CSR config modification
MicroK8s requires editing the CSR configuration template and refreshing certificates:
# Edit the CSR config
sudo nano /var/snap/microk8s/current/certs/csr.conf.template
# Add your SANs under [alt_names]
# IP.3 = 192.168.1.50
# DNS.4 = k8s.example.com
# Refresh the certificates
sudo microk8s refresh-certs --cert server.crt
Workaround: skip TLS verification
Skipping TLS verification should only be used for debugging, never in production. It disables certificate validation, which means you cannot verify the identity of the API server (man-in-the-middle risk).
# One-off command
kubectl --insecure-skip-tls-verify get nodes
# Set in kubeconfig context permanently
kubectl config set-cluster my-cluster \
--insecure-skip-tls-verify=true
Ingress & Load Balancing
Ingress is a Kubernetes API object that manages external access to services within a cluster, typically HTTP/HTTPS. It provides URL-based routing, TLS termination, and virtual hosting. An Ingress resource is useless without an Ingress Controller — a pod that reads Ingress objects and configures the underlying proxy (Nginx, Traefik, HAProxy, etc.).
The Gateway API is the official successor to the Ingress API, offering richer routing (header-based, multi-protocol), role-oriented RBAC, and better extensibility. The community Ingress NGINX controller is being retired (March 2026). While the Ingress API itself is not deprecated, new projects should evaluate Gateway API first. All major controllers (Traefik, Cilium, Envoy Gateway, Kong, Istio) support Gateway API.
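For comparison, a basic host-and-path route expressed in Gateway API terms looks roughly like this; the Gateway name and the service details are illustrative:

```yaml
# HTTPRoute attaches to a Gateway (typically managed by a platform team)
# and routes matching traffic to a backend Service.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: myapp-route
spec:
  parentRefs:
  - name: external-gateway   # illustrative Gateway object
  hostnames:
  - myapp.example.com
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: myapp-svc
      port: 80
```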
Ingress controllers
| Controller | Pros | Cons | Default In |
|---|---|---|---|
| Nginx Ingress | Mature, widely used, extensive annotations, good docs, supports gRPC via backend-protocol annotation | Config via annotations can get messy; community Ingress NGINX controller is being retired March 2026 — migrate to Gateway API or NGINX's own controller | RKE2 |
| Traefik | Auto-discovery, middlewares, IngressRoute CRD, built-in dashboard, Gateway API support | Less familiar to ops teams, v1 to v2 migration was painful | K3s |
| HAProxy Ingress | High performance, TCP/UDP support, enterprise support available | Smaller community, fewer examples online | — |
Ingress example
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - myapp.example.com
    secretName: myapp-tls
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-svc
            port:
              number: 80
Service type: LoadBalancer
In cloud environments, creating a Service of type LoadBalancer automatically provisions a cloud load balancer (AWS ELB, GCP LB, Azure LB). On bare-metal, there is no cloud API to call, so the Service stays in Pending state forever — unless you install MetalLB.
MetalLB for bare-metal
MetalLB provides LoadBalancer service support for bare-metal clusters. It operates in two modes:
Layer 2 Mode
MetalLB responds to ARP requests for the service IP on the local network. Simple to set up, no router configuration needed. The downside is that all traffic for a given service IP goes through a single node (no true load balancing at the network level).
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.1.200-192.168.1.250
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default
  namespace: metallb-system
BGP Mode
MetalLB peers with your network router via BGP and announces service IPs as routes. Provides true multi-path load balancing (ECMP). Requires BGP-capable routers and network team coordination.
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: router
  namespace: metallb-system
spec:
  myASN: 64500
  peerASN: 64501
  peerAddress: 10.0.0.1
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: default
  namespace: metallb-system
Most on-prem and homelab deployments use MetalLB in Layer 2 mode because it requires zero router configuration. The single-node bottleneck is rarely an issue for small-to-medium clusters. BGP mode is worth the effort when you have a proper network infrastructure with BGP-capable switches (e.g., Cisco, Arista, or even a FRRouting-based software router).
TLS & Certificate Management
cert-manager is the standard way to manage TLS certificates in Kubernetes. It automates the issuance, renewal, and rotation of certificates from various sources including Let's Encrypt, HashiCorp Vault, and self-signed CAs.
Issuer vs ClusterIssuer
Issuer
Namespace-scoped. Can only issue certificates for resources in the same namespace. Use when you want to isolate certificate management per team or environment.
ClusterIssuer
Cluster-scoped. Can issue certificates for any namespace. The most common choice for production because you typically have one certificate authority for the entire cluster.
Let's Encrypt with cert-manager
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
    - http01:
        ingress:
          class: nginx
ACME challenge types
| Challenge | How it works | When to use |
|---|---|---|
| HTTP-01 | cert-manager creates a temporary pod/ingress that serves a token at /.well-known/acme-challenge/. Let's Encrypt hits that URL to verify domain ownership. | Standard web-facing services. Requires port 80 to be publicly reachable. |
| DNS-01 | cert-manager creates a TXT record in your DNS zone (e.g., _acme-challenge.example.com). Let's Encrypt queries DNS to verify ownership. | Wildcard certificates (*.example.com). Works even if the cluster is not publicly accessible. Requires DNS provider API integration (Route53, Cloudflare, etc.). |
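A DNS-01 solver block, sketched here for Cloudflare (the Secret name is illustrative; other providers follow the same pattern with their own credential fields):

```yaml
# Replaces the http01 solver in the ClusterIssuer spec.
solvers:
- dns01:
    cloudflare:
      apiTokenSecretRef:
        name: cloudflare-api-token   # illustrative Secret holding the API token
        key: api-token
  selector:
    dnsZones:
    - example.com                    # only solve challenges for this zone
```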
Using cert-manager with Ingress annotations
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  tls:
  - hosts:
    - myapp.example.com
    secretName: myapp-tls   # cert-manager creates this Secret
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp
            port:
              number: 80
When this Ingress is created, cert-manager detects the cert-manager.io/cluster-issuer annotation, requests a certificate from Let's Encrypt, completes the ACME challenge, and stores the resulting certificate in the myapp-tls Secret. The Ingress controller then uses that Secret for TLS termination. Renewal happens automatically before expiry (default: 2/3 through the certificate's duration, which is ~30 days before expiry for standard 90-day Let's Encrypt certificates). You can customize this with spec.renewBefore or spec.renewBeforePercentage.
Self-signed CA
For internal services, air-gapped environments, or development, you can use a self-signed CA:
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-issuer
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: internal-ca
  namespace: cert-manager
spec:
  isCA: true
  commonName: internal-ca
  secretName: internal-ca-secret
  issuerRef:
    name: selfsigned-issuer
    kind: ClusterIssuer
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: internal-ca-issuer
spec:
  ca:
    secretName: internal-ca-secret
Always use letsencrypt-staging for testing to avoid hitting rate limits. The staging server issues untrusted certificates but has much higher rate limits. Switch to letsencrypt-prod only when you have confirmed the flow works end-to-end.
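A staging ClusterIssuer is identical to the production one except for the ACME server URL, a sketch:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    # Staging endpoint: untrusted certs, but far higher rate limits.
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-staging-key
    solvers:
    - http01:
        ingress:
          class: nginx
```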
GitOps
GitOps is a paradigm where Git is the single source of truth for your infrastructure and application state. A GitOps operator watches a Git repository and automatically synchronizes the cluster state to match what is committed. Changes are made through pull requests, which provides an audit trail, code review, and easy rollback (just revert the commit).
How GitOps works
- Developer pushes a change to a Git repository (e.g., updates an image tag in a Deployment manifest)
- The GitOps operator detects the change (via polling or webhook)
- The operator compares the desired state (Git) with the actual state (cluster)
- If there is drift, the operator applies the changes to the cluster
- Health checks verify the deployment succeeded
ArgoCD vs FluxCD
| Feature | ArgoCD | FluxCD |
|---|---|---|
| UI | Rich web UI with app visualization, diff view, sync status | No built-in UI (use Weave GitOps or CLI) |
| Architecture | Centralized server with API | Decentralized controllers (source, kustomize, helm, notification) |
| CRDs | Application, ApplicationSet, AppProject | GitRepository, Kustomization, HelmRelease, etc. |
| Multi-cluster | Built-in (register external clusters) | Via Flux on each cluster or Cluster API |
| Helm support | Native (renders Helm charts as manifests) | Native (HelmRelease CRD) |
| Kustomize support | Native | Native (first-class citizen) |
| RBAC | Built-in with SSO integration | Kubernetes-native RBAC |
| Image automation | Argo CD Image Updater (separate component) | Built-in (image-reflector-controller + image-automation-controller) |
| Notifications | Built-in (Slack, webhook, etc.) | notification-controller (Slack, Teams, etc.) |
| Community | CNCF Graduated, very large community | CNCF Graduated, strong but smaller community |
When to use which
ArgoCD
- Teams that want a visual dashboard for deployments
- Multi-cluster management from a single pane of glass
- Organizations that need SSO-integrated RBAC for GitOps
- When you want to demo deployment state to stakeholders
FluxCD
- Teams that prefer CLI-first, no-UI workflows
- When you want tighter integration with Kustomize
- Automated image updates as a first-class feature
- When you want each cluster to be self-contained (no central server)
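A minimal FluxCD setup pairs a GitRepository source with a Kustomization that applies a path from it; names, intervals, and the repo URL below are illustrative:

```yaml
# Flux polls the repo, then reconciles the given path into the cluster.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: k8s-manifests
  namespace: flux-system
spec:
  interval: 1m                 # how often to check Git for changes
  url: https://github.com/org/k8s-manifests.git
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: myapp
  namespace: flux-system
spec:
  interval: 10m                # how often to re-reconcile cluster state
  sourceRef:
    kind: GitRepository
    name: k8s-manifests
  path: ./apps/myapp/overlays/production
  prune: true                  # delete resources removed from Git
```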
ArgoCD Application example
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/k8s-manifests.git
    targetRevision: main
    path: apps/myapp/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp
  syncPolicy:
    automated:
      prune: true      # Delete resources removed from Git
      selfHeal: true   # Revert manual changes in cluster
    syncOptions:
    - CreateNamespace=true
For most clients, ArgoCD is the default recommendation because the web UI is a massive operational advantage. Being able to see at a glance which apps are synced, out-of-sync, degraded, or healthy is invaluable. FluxCD is the better choice when the team is deeply CLI-native and does not want to manage the ArgoCD server component.
Helm vs Kustomize
Helm and Kustomize are the two primary tools for managing Kubernetes manifests at scale. They solve overlapping but different problems, and many teams use them together.
Comparison
| Aspect | Helm | Kustomize |
|---|---|---|
| Approach | Templating (Go templates) | Patching (overlay-based) |
| Package format | Charts (packaged, versioned, shareable) | Directories of plain YAML |
| Value injection | values.yaml + --set flags | Patches, JSON merge patches, strategic merge patches |
| Repository | Helm chart repositories (Artifact Hub) | Git repositories or local directories |
| Release management | Built-in (helm install/upgrade/rollback) | None (uses kubectl apply) |
| Learning curve | Higher (Go templates, chart structure, hooks) | Lower (just YAML patching) |
| 3rd-party software | Standard distribution format for OSS | Rarely used by upstream projects |
| Built into kubectl | No (separate binary) | Yes (kubectl apply -k) |
Helm basics
# Add a chart repository
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
# Install a chart
helm install my-postgres bitnami/postgresql \
--namespace databases --create-namespace \
--values custom-values.yaml
# Upgrade a release
helm upgrade my-postgres bitnami/postgresql \
--values custom-values.yaml
# List releases
helm list -A
# Rollback
helm rollback my-postgres 1
Kustomize basics
# base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- deployment.yaml
- service.yaml
# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base
namespace: production
patches:
- target:
    kind: Deployment
    name: myapp
  patch: |-
    - op: replace
      path: /spec/replicas
      value: 5
images:
- name: myapp
  newTag: v2.1.0
# Build and apply
kubectl apply -k overlays/production/
# Preview rendered output
kubectl kustomize overlays/production/
Using them together
A common pattern is to use Helm for third-party software (databases, monitoring, ingress controllers) and Kustomize for your own applications. You can also render Helm charts into plain YAML and manage them with Kustomize:
# Render a Helm chart to plain YAML
helm template my-release bitnami/postgresql \
--values values.yaml > base/postgresql.yaml
# Then manage with Kustomize overlays for env-specific tweaks
Do not fight the ecosystem. Install third-party charts with Helm — it is how they are designed to be consumed. For your own application manifests, Kustomize is often simpler because you avoid the complexity of Go templates and can keep manifests as valid, readable YAML. If using ArgoCD or FluxCD, both support Helm and Kustomize natively.
KubeVirt
KubeVirt is a Kubernetes add-on that allows you to run traditional virtual machines alongside containers on the same cluster. It extends Kubernetes with custom resource definitions (CRDs) for managing VM lifecycle using the same kubectl tooling.
Why it matters
- Converged infrastructure — Run VMs and containers side-by-side. No need for separate VMware/Proxmox infrastructure and a separate Kubernetes cluster.
- Migration path — Move legacy workloads that cannot be containerized (Windows apps, kernel-dependent software, legacy databases) into the Kubernetes platform without rewriting them.
- Unified tooling — Use the same CI/CD pipelines, monitoring, networking, and storage for both VMs and containers.
- Harvester — Rancher's Harvester HCI platform is built on KubeVirt, providing a complete hyperconverged infrastructure solution on top of Kubernetes.
Key CRDs
| CRD | Purpose |
|---|---|
| VirtualMachine | Persistent VM definition. Survives restarts. Analogous to a Deployment for containers. |
| VirtualMachineInstance | A running VM instance. Analogous to a Pod. Created by the VirtualMachine controller. |
| DataVolume | Declarative way to import VM disk images (from URL, registry, or PVC clone) using CDI (Containerized Data Importer). |
Basic VM example
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: ubuntu-vm
spec:
  running: true
  template:
    metadata:
      labels:
        kubevirt.io/vm: ubuntu-vm
    spec:
      domain:
        cpu:
          cores: 2
        memory:
          guest: 4Gi
        devices:
          disks:
          - name: rootdisk
            disk:
              bus: virtio
          - name: cloudinit
            disk:
              bus: virtio
          interfaces:
          - name: default
            masquerade: {}
      networks:
      - name: default
        pod: {}
      volumes:
      - name: rootdisk
        dataVolume:
          name: ubuntu-dv
      - name: cloudinit
        cloudInitNoCloud:
          userData: |
            #cloud-config
            password: changeme
            chpasswd: { expire: false }
---
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: ubuntu-dv
spec:
  source:
    http:
      url: "https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img"
  pvc:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 20Gi
KubeVirt vs traditional virtualization
| Aspect | KubeVirt | VMware / Proxmox |
|---|---|---|
| Platform | Runs on Kubernetes | Standalone hypervisor |
| Management | kubectl, GitOps, Kubernetes APIs | vCenter, Proxmox UI, proprietary APIs |
| Networking | CNI plugins (Calico, Cilium, etc.) | vSphere networking, OVS |
| Storage | CSI drivers (Longhorn, Ceph, etc.) | VMFS, NFS, vSAN |
| Container co-location | Native — VMs and containers on same nodes | Separate platform |
| Maturity | CNCF Incubating, growing rapidly | Decades of production use |
| Licensing | Apache 2.0 (free) | vSphere is expensive; Proxmox is AGPL (free + paid support) |
KubeVirt is not a VMware replacement for enterprise clients with thousands of VMs and deep VMware integration. It is ideal for organizations that are Kubernetes-first and need to run a handful of VMs alongside their containerized workloads. The sweet spot is running legacy apps, Windows servers, or network appliances as VMs within the same platform that runs the container workloads. Harvester (built on KubeVirt + Longhorn) is worth evaluating for clients who want a full HCI solution without the VMware licensing cost.
Storage
Kubernetes storage is built around three key abstractions: StorageClasses define how storage is provisioned, PersistentVolumes (PVs) represent actual storage resources, and PersistentVolumeClaims (PVCs) are requests for storage by pods. The Container Storage Interface (CSI) is the standard plugin API that connects Kubernetes to storage backends.
Storage flow
Dynamic provisioning
With dynamic provisioning, you do not need to pre-create PVs. When a PVC is created that references a StorageClass, the CSI driver automatically provisions the underlying storage and creates the PV:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-volume
spec:
  storageClassName: longhorn
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
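A pod consumes the claim by name; the CSI driver provisions the backing volume when the claim is bound. Pod and image names here are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: data-consumer
spec:
  containers:
  - name: app
    image: nginx:1.27          # illustrative image
    volumeMounts:
    - name: data
      mountPath: /var/lib/data # where the volume appears in the container
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data-volume   # references the PVC above
```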
Storage solutions comparison
| Solution | Type | Replication | Best For |
|---|---|---|---|
| Local-path | Local disk (node-bound) | None | Development, single-node, CI/CD. Default in K3s. |
| Longhorn | Distributed block storage | Configurable (2-3 replicas) | Production bare-metal clusters. Easy to deploy, built-in backup/restore, Rancher integration. |
| Ceph / Rook | Distributed (block, file, object) | Configurable | Large-scale production. High performance, mature, but complex to operate. |
| NFS | Network file system | Depends on backend | Shared storage (ReadWriteMany). Simple but not performant. |
| Cloud CSI | Cloud disks (EBS, PD, Azure Disk) | Provider-managed | Cloud-hosted clusters. Automatic provisioning. |
Access modes
- ReadWriteOnce (RWO) — Mounted as read-write by a single node. Most common for databases and stateful apps.
- ReadOnlyMany (ROX) — Mounted as read-only by many nodes. Good for shared configuration or static content.
- ReadWriteMany (RWX) — Mounted as read-write by many nodes. Required for shared storage across pods. NFS, CephFS, and Longhorn (via built-in NFSv4 share-manager since v1.1) support this.
For on-prem bare-metal clusters, Longhorn is the recommended starting point. It is simple to install (single Helm chart), provides replicated storage with automatic failover, has a built-in UI, supports backups to S3-compatible targets, and integrates natively with Rancher. Rook/Ceph is more powerful but significantly more complex to operate — only use it when you need the scale (100+ TB) or need object storage (S3 API).
Networking
Kubernetes networking follows a flat model: every pod gets its own IP address, and all pods can communicate with each other without NAT. This is implemented by Container Network Interface (CNI) plugins. The choice of CNI affects performance, security policy support, and operational complexity.
CNI plugins
| CNI | Mode | Network Policy | Notes |
|---|---|---|---|
| Calico | BGP, VXLAN, IPIP | Full support | Most popular CNI. Excellent Network Policy support. Default in MicroK8s. Can run in eBPF mode for performance. |
| Flannel | VXLAN, host-gw | None | Simplest CNI. Default in K3s. No Network Policy support — pair with Calico (Canal) if needed. |
| Canal | Flannel networking + Calico policy | Full support | Combines Flannel's simplicity with Calico's policy engine. Default in RKE2. |
| Cilium | eBPF-based | Full + L7 policies | Most advanced CNI. eBPF-based dataplane bypasses iptables. L7 visibility and policy (HTTP, gRPC, Kafka). Hubble for observability. |
Network Policies
Network Policies are Kubernetes-native firewall rules that control pod-to-pod traffic. By default, all pods can talk to all other pods. Network Policies restrict this based on labels, namespaces, and ports.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
  namespace: production
spec:
  podSelector: {}   # Applies to all pods in namespace
  policyTypes:
  - Ingress
  ingress: []       # Empty = deny all ingress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - port: 8080
      protocol: TCP
Service types
| Type | Scope | Use Case |
|---|---|---|
| ClusterIP | Internal only | Default. Internal service discovery. Pods within the cluster can reach the service via its DNS name (svc-name.namespace.svc.cluster.local). |
| NodePort | External (via node IP:port) | Exposes the service on a static port (30000-32767) on every node. Simple but not production-grade for web traffic. |
| LoadBalancer | External (via LB IP) | Provisions an external load balancer (cloud LB or MetalLB on bare-metal). The standard way to expose services externally. |
| ExternalName | DNS alias | Maps a service to an external DNS name (CNAME). No proxying. Used to reference external services from within the cluster. |
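The Service types above share the same core shape; only spec.type changes. A minimal ClusterIP Service selecting pods by label (names and ports are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp-svc
spec:
  type: ClusterIP     # swap for NodePort or LoadBalancer to expose externally
  selector:
    app: myapp        # routes to pods carrying this label
  ports:
  - port: 80          # port the Service listens on
    targetPort: 8080  # port the container actually serves
    protocol: TCP
```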
DNS (CoreDNS)
CoreDNS runs as a Deployment in the kube-system namespace and provides DNS-based service discovery for the cluster. Every Service gets a DNS entry:
- my-service.my-namespace.svc.cluster.local — Fully qualified domain name
- my-service.my-namespace — Short form (from any namespace)
- my-service — Shortest form (from same namespace only)
If the client needs Network Policies (and they should for any production cluster), ensure the CNI supports them. Flannel alone does not. The easiest path is Canal (Flannel + Calico policy), which is why RKE2 defaults to it. For advanced use cases (L7 policies, observability, service mesh replacement), Cilium is the future, but it requires kernel 5.10+ (as of Cilium 1.19; v1.20 will require 6.1+) and has a steeper learning curve.
Security
Kubernetes security is a broad topic that spans authentication, authorization, workload isolation, secrets management, and supply chain security. The fundamental principle is defense in depth — no single mechanism is sufficient; you need layers.
RBAC (Role-Based Access Control)
RBAC controls who can do what in the cluster. It uses four resource types:
- Role — Namespace-scoped permissions (e.g., "can read pods in namespace X")
- ClusterRole — Cluster-scoped permissions (e.g., "can read nodes", "can create namespaces")
- RoleBinding — Binds a Role to a user/group/ServiceAccount within a namespace
- ClusterRoleBinding — Binds a ClusterRole to a user/group/ServiceAccount cluster-wide
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: production
  name: read-pods
subjects:
- kind: User
  name: jane
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
ServiceAccounts
Every pod runs as a ServiceAccount. If not specified, it uses the default ServiceAccount in its namespace. Best practice: create dedicated ServiceAccounts for each workload with only the permissions it needs.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: myapp-sa
  namespace: production
automountServiceAccountToken: false  # Don't mount token unless needed
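A workload then opts into its dedicated ServiceAccount explicitly. The following Deployment is a sketch assuming the `myapp-sa` account shown here; the name and image are illustrative:

```yaml
# Pin the workload to its dedicated ServiceAccount instead of "default".
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      serviceAccountName: myapp-sa         # dedicated SA, least privilege
      automountServiceAccountToken: false  # this workload never calls the API
      containers:
      - name: myapp
        image: registry.example.com/myapp:1.2.3  # illustrative image
```

If the pod does need to talk to the API server, set `automountServiceAccountToken: true` on the pod and grant `myapp-sa` only the verbs it requires via a RoleBinding.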
Pod Security Standards
Pod Security Standards (PSS) replaced the deprecated PodSecurityPolicy (PSP). They are enforced via the built-in Pod Security Admission controller using namespace labels:
| Level | Description |
|---|---|
| Privileged | No restrictions. For system-level workloads (CNI, storage drivers). |
| Baseline | Prevents known privilege escalations. Allows most workloads. Good starting point. |
| Restricted | Strict security. Requires running as non-root, dropping all capabilities (except NET_BIND_SERVICE), a seccomp profile, and disallowing privilege escalation. A read-only root filesystem is a recommended best practice but not enforced by PSS. The target for production workloads. |
# Apply restricted security to a namespace
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
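For a pod to be admitted into such a namespace, its spec must satisfy the Restricted requirements listed in the table. A minimal sketch (pod name and image are illustrative):

```yaml
# Pod spec that satisfies the Restricted profile: non-root, no privilege
# escalation, all capabilities dropped, RuntimeDefault seccomp profile.
# The read-only root filesystem is a best practice, not a PSS requirement.
apiVersion: v1
kind: Pod
metadata:
  name: myapp
  namespace: production
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: myapp
    image: registry.example.com/myapp:1.2.3  # illustrative
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true           # recommended, not enforced by PSS
      capabilities:
        drop: ["ALL"]
```

With `warn` and `audit` labels set as above, non-compliant pods generate warnings and audit log entries before you flip `enforce` on, which makes a staged rollout practical.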
Secrets management
Kubernetes Secrets are base64-encoded (not encrypted) by default. For production:
- Enable encryption at rest — Configure the API server to encrypt Secrets in etcd using AES-GCM (preferred) or AES-CBC. AES-GCM is faster and provides authenticated encryption, but requires regular key rotation; a KMS provider offloads key management entirely
- External secrets management — Use the External Secrets Operator to sync secrets from HashiCorp Vault, AWS Secrets Manager, Azure Key Vault, or GCP Secret Manager
- Sealed Secrets — Bitnami's Sealed Secrets controller allows you to store encrypted secrets in Git. Only the controller in the cluster can decrypt them
- SOPS + age/GPG — Encrypt secret values in YAML files using Mozilla SOPS. Works well with FluxCD's native SOPS decryption
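The encryption-at-rest option is configured via a file passed to the API server with `--encryption-provider-config`. A sketch, with a placeholder key (generate a real one with something like `head -c 32 /dev/urandom | base64`):

```yaml
# EncryptionConfiguration consumed by kube-apiserver.
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources: ["secrets"]
  providers:
  - aesgcm:
      keys:
      - name: key1
        secret: <base64-encoded 32-byte key>  # placeholder, never commit real keys
  - identity: {}  # fallback: reads Secrets written before encryption was enabled
```

Provider order matters: the first provider encrypts new writes, while the rest are only tried for decryption. After enabling it, rewrite existing Secrets (e.g. `kubectl get secrets -A -o json | kubectl replace -f -`) so they are stored encrypted.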
Image scanning and supply chain
- Scan images in CI — Use Trivy, Grype, or Snyk to scan container images during the build pipeline, before they reach the cluster
- Admission control — Use a policy engine to block deployment of unscanned or vulnerable images
- Image signing — Sign images with Cosign and verify signatures at admission time
OPA / Gatekeeper
Open Policy Agent (OPA) Gatekeeper is an admission controller that enforces custom policies on Kubernetes resources. It uses Rego (a policy language) to define constraints:
- Require all images to come from an approved registry
- Block containers running as root
- Require resource limits on all pods
- Enforce label standards across all resources
- Prevent use of the `latest` image tag
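As a sketch of how Gatekeeper expresses such rules, here is the required-labels example in abbreviated form: a ConstraintTemplate defines the Rego logic and a Constraint applies it to resources. The template follows Gatekeeper's canonical `K8sRequiredLabels` example; the constraint name and `team` label are illustrative assumptions:

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
  - target: admission.k8s.gatekeeper.sh
    rego: |
      package k8srequiredlabels
      violation[{"msg": msg}] {
        provided := {label | input.review.object.metadata.labels[label]}
        required := {label | label := input.parameters.labels[_]}
        missing := required - provided
        count(missing) > 0
        msg := sprintf("missing required labels: %v", [missing])
      }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-team-label   # illustrative
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Namespace"]
  parameters:
    labels: ["team"]         # illustrative required label
```

The same template/constraint split applies to the other policies listed: write the Rego once, then instantiate it with different parameters and match rules per environment.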
At minimum, every production cluster must have: (1) RBAC enabled and configured (no wildcard ClusterRoleBindings), (2) Network Policies to restrict pod-to-pod traffic, (3) Pod Security Standards at baseline or restricted level, (4) Secrets encrypted at rest, (5) Container images scanned for vulnerabilities. Everything else is defense in depth.
Consultant's Checklist
Use this checklist when assessing, deploying, or auditing a Kubernetes cluster.
Cluster Foundation
- Distribution selected (K3s/RKE2/MicroK8s/managed)
- HA control plane (3+ control plane nodes)
- etcd backup strategy configured and tested
- Node OS hardened and patched
- Container runtime configured (containerd)
- Kubeconfig access controlled and distributed securely
- TLS SANs configured for all access paths
Networking
- CNI plugin selected and deployed
- Network Policies enforced (default deny + allow rules)
- Ingress controller deployed and configured
- LoadBalancer solution in place (MetalLB for bare-metal)
- DNS resolution working (CoreDNS health)
- TLS certificates automated (cert-manager)
- External DNS configured if needed
Storage
- StorageClass configured with dynamic provisioning
- Storage backend deployed (Longhorn/Ceph/cloud CSI)
- Backup solution for persistent data
- Volume snapshot support if needed
- Storage capacity monitoring and alerting
- Reclaim policy set appropriately (Retain for production)
Security
- RBAC configured (no default admin bindings)
- Pod Security Standards enforced
- Secrets encrypted at rest
- External secrets management in place
- Image scanning in CI pipeline
- Admission controller for policy enforcement
- Audit logging enabled
- ServiceAccount tokens not auto-mounted
GitOps & Deployment
- GitOps operator deployed (ArgoCD or FluxCD)
- Git repository structure defined (monorepo vs multi-repo)
- Helm charts or Kustomize overlays for all environments
- Image update automation configured
- Rollback procedure documented and tested
- Sync policies configured (auto-sync, prune, self-heal)
Operations
- Monitoring stack deployed (Prometheus + Grafana)
- Alerting rules configured for critical conditions
- Logging aggregation (Loki, EFK, or cloud logging)
- Resource requests and limits set on all workloads
- Horizontal Pod Autoscaler configured where appropriate
- Node upgrade procedure documented (drain, upgrade, uncordon)
- Disaster recovery plan documented and tested
When building a new cluster from scratch, work through these areas in order: (1) Cluster foundation + HA, (2) Networking + Ingress + TLS, (3) Storage, (4) Security hardening, (5) GitOps setup, (6) Monitoring + alerting. Do not skip ahead — each layer depends on the one before it.