
Kubernetes Multi-Tenancy: Beyond Namespaces

Metasphere Engineering

Six months ago, your team built a shared Kubernetes platform. The first few internal teams onboarded smoothly. Namespaces per team, RBAC configured, NetworkPolicy deployed. Everyone felt good about it.

Then the payments team asks to run PCI-scoped workloads. Your security team runs an isolation assessment and asks one question: “Can the marketing team’s pods reach the payment namespace over the network?” You check. With your current Cilium configuration and the default-allow egress policy nobody thought to restrict, the answer is yes.

Separate floors. All the doors unlocked. Marketing just walked into finance.

The PCI DSS requirements are unambiguous on network segmentation. The Kubernetes Multi-Tenancy Working Group defines the isolation models. But most teams discover the gaps the same way you just did.

Key takeaways
  • Default-allow egress means every tenant can reach every other tenant. Marketing pods touching the payment namespace is a PCI violation. Every door in the building unlocked. Designing isolation after a dozen tenants are running takes weeks. Before the first tenant: days.
  • Three isolation models exist: namespace (soft), virtual cluster (medium), and dedicated cluster (hard). The right choice depends on compliance requirements, not team preference.
  • NetworkPolicy default-deny is the single most impactful isolation control. Lock all doors before the first tenant moves in. Retrofitting it across existing workloads breaks things.
  • ResourceQuota prevents noisy neighbors. Without CPU/memory limits per namespace, one team’s memory leak takes down the cluster. One tenant’s industrial equipment shakes the building.
  • RBAC alone is not isolation. RBAC controls API access. Network, storage, and resource boundaries need separate enforcement. The access badge opens the right doors. It doesn’t stop someone from yelling through the wall.

The Three Isolation Models

The isolation model sets the security ceiling for your entire platform. Choose it before the first tenant onboards. Changing it later means re-architecting under load. Remodeling the building with tenants already in it.

| Model | Isolation | Cost | Complexity | Best For |
| --- | --- | --- | --- | --- |
| Namespace per tenant | Soft (shared nodes + control plane) | 1x | Low | Internal teams with mutual trust |
| vCluster (virtual cluster) | Medium (virtual control plane, shared nodes) | Low overhead above 1x | Medium | SaaS with moderate isolation needs |
| Cluster per tenant | Hard (dedicated everything) | Several times 1x | High | PCI/HIPAA, zero-trust between tenants |

Namespace isolation works for internal teams that trust each other. Pods share nodes and the control plane. Separate floors, same building, same hallways. If compromise in one namespace triggers breach notification to another tenant, this model is not enough.

vCluster occupies the sweet spot that most teams overlook. Each tenant gets a virtual API server, scheduler, and controller manager running as pods in the host cluster. A building-within-a-building. Own reception, own mailroom, shared parking lot. Shared nodes keep costs down. Each virtual cluster adds only a few hundred megabytes of memory overhead, so a single host cluster can run dozens of them affordably. Tenants can install their own CRDs and get cluster-admin on their virtual cluster without affecting neighbors.

Cluster per tenant provides the strongest isolation but the highest operational cost. Separate buildings. Beyond a dozen tenants, manual cluster management breaks down. Cloud-native automation for cluster lifecycle becomes mandatory.

| When namespace isolation works | When it doesn’t |
| --- | --- |
| All tenants are internal teams with aligned security posture | Tenants have different compliance needs (PCI in one, HIPAA in another) |
| No tenant needs cluster-admin access or custom CRDs | A tenant compromise triggers breach notification to other tenants |
| Shared resource pools are acceptable | Workloads need dedicated node pools with specific hardware |
| A single operations team manages all tenant workloads | Tenants expect independent upgrade schedules or Kubernetes versions |

| Dimension | Namespace Isolation | Virtual Cluster (vCluster) | Cluster per Tenant |
| --- | --- | --- | --- |
| Control plane | Shared. All tenants on same API server | Virtual API server per tenant. Syncs to host cluster | Dedicated. Full isolation |
| Node sharing | Shared nodes. Noisy neighbor risk | Shared nodes (better isolation via virtual kubelet) | Dedicated nodes. No sharing |
| Network isolation | NetworkPolicy (if CNI supports it). Default: no isolation | NetworkPolicy + virtual network | Physical network isolation |
| RBAC complexity | Grows quadratically with tenants. Unauditable at 50+ | Scoped per virtual cluster. Simpler | Trivial. Each cluster is independent |
| Cost | Lowest. Shared everything | Medium. Virtual cluster overhead is small | Highest. Full cluster per tenant |
| Blast radius | Namespace escape = access to all tenants | Virtual cluster escape = host cluster (contained) | Cluster compromise = one tenant only |
| Best for | Trusted internal teams, dev/staging environments | Multi-team platform, moderate isolation needs | Regulated workloads, untrusted tenants, compliance requirements |

NetworkPolicy Gaps the Docs Won’t Warn You About

[Figure: Isolation model comparison, attack propagation. In the namespace model, a breach in Tenant A bleeds through the shared kernel into Tenant B. In the vCluster model, the attack is blocked at the virtual control plane boundary. In cluster-per-tenant, it cannot reach the other tenant at all. Isolation increases left to right, and so does cost: roughly 1x, 2-3x, and Nx for per-tenant clusters.]

Four gaps catch teams after they’ve already shipped what they believed was production-grade network isolation. Four unlocked doors that look locked.

Your CNI might not enforce policies at all. NetworkPolicy is an API object. Enforcement is a CNI feature. Flannel does not enforce NetworkPolicy. If you’re running Flannel and you have NetworkPolicy objects deployed, those objects are decoration. Every last one. Locks installed but nobody connected the wiring. They click but don’t lock. Clusters have run carefully crafted policies for months with zero enforcement. Calico and Cilium enforce. Verify yours does before you trust a single rule. The second gap hides inside the fix itself: a default-deny policy that forgets DNS breaks name resolution for every workload in the namespace. The policy below denies all ingress and egress, then explicitly allows DNS to kube-system on port 53.

# Default-deny + DNS allow - deploy to every tenant namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: tenant-commerce
spec:
  podSelector: {}  # Applies to all pods
  policyTypes: ["Ingress", "Egress"]
  egress:
    - to:  # Allow DNS only
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53

The third and fourth gaps sit outside your policy objects entirely. Egress to 169.254.169.254 has to be blocked explicitly, or a compromised pod can reach the cloud provider's instance metadata service and harvest credentials over SSRF. And host-networked pods, which includes most daemonsets, bypass NetworkPolicy altogether: they use the node's network namespace, outside the locks you installed.

RBAC Complexity Doesn't Scale Linearly

[Figure: RBAC complexity growth. At 5 teams: 15 roles and 30 bindings, manual review feasible. At 20 teams: 120 roles and 400+ bindings, manual review painful, automation needed. At 50 teams: 500+ roles and 2,000+ bindings, manual audit impossible. RBAC that can't be audited can't be trusted. Automate or drown.]

Past 20 tenants, the only sustainable answer is admission controllers. OPA Gatekeeper or Kyverno policies validate RBAC resources on creation, preventing overly broad bindings, enforcing naming conventions, and requiring ownership labels. The security guard who checks every badge at the door. Same principle as [automated remediation](/technology/reliability-operations/automated-remediation/) in reliability engineering: when the system grows past what humans can review, enforcement must be automatic.

A practical starting policy set: block any RoleBinding referencing `cluster-admin` outside the platform namespace, require a `team` label on every ServiceAccount, deny ClusterRoleBindings from tenant namespaces, and require all Roles to specify explicit resource names rather than wildcards.

Example OPA Gatekeeper constraint: block wildcard resource access

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sBlockWildcardResources
metadata:
  name: deny-wildcard-resources
spec:
  match:
    kinds:
      - apiGroups: ["rbac.authorization.k8s.io"]
        kinds: ["Role", "ClusterRole"]
    excludedNamespaces: ["kube-system", "platform-system"]
  parameters:
    message: "Roles must specify explicit resource names. Wildcards are not allowed in tenant namespaces."

This constraint prevents tenants from creating Roles with resources: ["*"], which would grant access to every resource type in the namespace. A master key for the floor. Explicit resource lists make RBAC auditable and prevent privilege creep.
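
The constraint above only works if a matching ConstraintTemplate defines the K8sBlockWildcardResources kind and its Rego logic. The template isn't shown here, so the following is a minimal sketch of what it could look like, not the exact template in use.

# Sketch: ConstraintTemplate backing the constraint above (not the production template)
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sblockwildcardresources
spec:
  crd:
    spec:
      names:
        kind: K8sBlockWildcardResources
      validation:
        openAPIV3Schema:
          type: object
          properties:
            message:
              type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sblockwildcardresources

        # Flag any rule in the reviewed Role/ClusterRole that uses a wildcard resource
        violation[{"msg": msg}] {
          rule := input.review.object.rules[_]
          rule.resources[_] == "*"
          msg := input.parameters.message
        }

Whatever the real template looks like, the property that matters is the same: wildcard resources are rejected at admission time, before the Role ever exists.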

Resource Isolation: Preventing Noisy Neighbors

RBAC governs who can do what. Resource isolation governs how much they can consume. Without both, the platform is only half-secured. The badge opens the right doors. But one tenant’s industrial equipment is shaking the entire building.

ResourceQuota sets hard limits per namespace: maximum CPU requests, maximum memory limits, maximum pod count, maximum storage claims. LimitRange sets defaults and maxima for individual containers, preventing pods without resource specs from consuming unbounded node resources. You need both. Without LimitRange, a single pod can eat an entire node. Without ResourceQuota, a namespace can spread pods across every node in the pool. Without desk size limits, one person brings a 12-foot desk. Without floor space limits, one company takes over the building.
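
As a concrete sketch, here is what both objects might look like for the tenant-commerce namespace used earlier; the numbers are illustrative placeholders to profile against, not recommendations.

# Namespace-wide caps (ResourceQuota) plus per-container defaults and maxima (LimitRange)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: tenant-commerce
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
    pods: "100"
    persistentvolumeclaims: "20"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-defaults
  namespace: tenant-commerce
spec:
  limits:
    - type: Container
      defaultRequest:   # applied when a container omits resource requests
        cpu: 100m
        memory: 128Mi
      default:          # applied when a container omits resource limits
        cpu: 500m
        memory: 512Mi
      max:              # ceiling for any single container
        cpu: "4"
        memory: 8Gi

The LimitRange defaults also keep the quota enforceable: once a quota constrains CPU and memory, pods without explicit requests and limits are rejected unless something fills them in.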

Prerequisites
  1. LimitRange deployed in every tenant namespace with default CPU and memory limits
  2. ResourceQuota deployed in every tenant namespace with hard caps on CPU, memory, pod count, and storage
  3. PriorityClass definitions separate critical platform workloads from tenant workloads (see the sketch after this list)
  4. Node pool taints and tolerations isolate workloads requiring dedicated hardware
  5. Actual consumption profiled over 2-4 weeks before setting production quotas
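
For item 3, a minimal sketch of a two-tier scheme (the class names and values are illustrative): platform components get the higher priority so that, under node pressure, the scheduler preempts tenant pods before platform pods.

# Two-tier priorities: platform workloads outrank tenant workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: platform-critical
value: 100000              # higher value wins during scheduling and preemption
globalDefault: false
description: "Platform components: ingress, monitoring, admission webhooks."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tenant-default
value: 1000
globalDefault: true        # tenant pods fall back to this unless they specify a class
description: "Default priority for tenant workloads."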

Profile actual consumption for 2-4 weeks, set quotas at the 95th percentile plus 20% headroom, and review quarterly. Starting tight is always easier than tightening after the fact, because reducing quotas on running workloads triggers pod evictions and scheduling failures that generate urgent tickets. Shrinking someone’s office while they’re sitting in it.

Taints and tolerations carve out dedicated node pools for workloads that need specific hardware. Platform engineering starts tight and loosens deliberately. The reverse never ends.
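
A minimal sketch of the pattern, assuming the dedicated pool already carries a hypothetical tenant=payments:NoSchedule taint and a pool=payments label (both set through your node pool configuration): the toleration lets the pod onto the tainted nodes, and the nodeSelector keeps it off everything else.

# Pod pinned to a dedicated, tainted node pool (taint, label, and image are illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: payments-worker
  namespace: tenant-payments
spec:
  nodeSelector:
    pool: payments          # assumed label on the dedicated node pool
  tolerations:
    - key: tenant
      operator: Equal
      value: payments
      effect: NoSchedule
  containers:
    - name: worker
      image: registry.example.com/payments/worker:1.0   # placeholder image
      resources:
        requests:
          cpu: 250m
          memory: 256Mi
        limits:
          cpu: "1"
          memory: 1Gi

The taint keeps other tenants off the pool; the toleration alone does not pin the pod there, which is why the nodeSelector is needed as well.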

The Isolation Illusion
The false sense of security that comes from having separate Kubernetes namespaces for each tenant. Separate floors in the office building. The API boundary exists. The network boundary does not (without NetworkPolicy). The resource boundary does not (without ResourceQuotas). The storage boundary does not (without StorageClass restrictions). Three of four isolation layers are missing by default. Separate floors, but the doors are unlocked, the vents are connected, and anyone can press any elevator button.
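
The storage boundary can be closed with the same primitive as the resource boundary. A hedged sketch, assuming a hypothetical premium-ssd StorageClass this tenant should not use: a per-class quota of zero blocks any claim against it in the namespace.

# Block a specific StorageClass in a tenant namespace by quotaing it to zero
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-restrictions
  namespace: tenant-commerce
spec:
  hard:
    premium-ssd.storageclass.storage.k8s.io/requests.storage: "0"
    premium-ssd.storageclass.storage.k8s.io/persistentvolumeclaims: "0"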

What the Industry Gets Wrong About Kubernetes Multi-Tenancy

“Namespaces provide isolation.” Namespaces provide API-level separation. They don’t provide network isolation, storage isolation, or resource isolation by default. A pod in namespace A can reach a pod in namespace B over the network unless explicit NetworkPolicy denies it. Separate offices with unlocked doors. Namespaces are organizational boundaries, not security boundaries.

“Start permissive, tighten later.” Tightening network policies after a dozen tenants are running production workloads takes weeks of careful rollout and change management. Default-deny before the first tenant takes days. The cost ratio is roughly 10:1. Install the locks before the tenants move in. Every team that starts permissive wishes they hadn’t.

Our take: Deploy default-deny NetworkPolicy and ResourceQuotas before the first tenant onboards. Not after the security review. Not after the compliance audit. Before. A few days of upfront design prevents weeks of retrofitting with a dozen tenants watching. vCluster is the right default for SaaS multi-tenancy unless you have a specific regulatory requirement for physical isolation.

Can the marketing team’s pods reach the payment namespace? With default-deny network policies, namespace-scoped RBAC, and admission controllers enforced from day one, the answer is no before the security team even asks. Separate floors. Locked doors. Keys for the right rooms only. The building is quiet for the right reason.

Your Shared Cluster Has Isolation Gaps You Haven't Found

The wrong multi-tenancy model creates isolation gaps that take weeks to close after tenants onboard. Kubernetes tenant architecture must match real security requirements, compliance constraints, and operational capacity before the first workload lands.


Frequently Asked Questions

What isolation does a Kubernetes namespace actually provide?

Namespaces provide soft isolation only. Pods in different namespaces share the same kernel and control plane API, and they share node resources unless ResourceQuota and LimitRange constrain them. A container escape in one namespace compromises every namespace on that node. CVE-2022-0185 allowed a single unprivileged container to gain root on the host. For tenants with regulatory isolation needs, namespace isolation alone is not enough.

What is vCluster and when should you use it?

vCluster runs a virtual Kubernetes cluster inside a namespace of the host cluster, with its own API server, scheduler, and controller manager while pods run on shared host nodes. A single host cluster can run 50+ virtual clusters at roughly 256 MB memory overhead per instance. Use it when tenants need CRD installation or cluster-admin access but dedicated clusters are too expensive.

How do you prevent noisy neighbor resource exhaustion in a shared cluster?

Both LimitRange and ResourceQuota are required. Without LimitRange, a single pod can eat an entire node’s 64 GB of memory. Without ResourceQuota, a namespace can schedule pods across every node in the pool. Set quotas by profiling actual usage over 2-4 weeks, starting at the 95th percentile plus 20% headroom. Review and adjust quarterly based on real consumption patterns.

What are the network policy gaps teams commonly miss?

NetworkPolicy objects only work if your CNI plugin enforces them. Calico and Cilium enforce policies. Flannel does not. Policies must explicitly allow DNS traffic to kube-dns on port 53, or workloads lose name resolution. Egress to 169.254.169.254 must be blocked to prevent SSRF to cloud IMDS. Host-networked pods like most daemonsets bypass NetworkPolicy entirely.
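
To make the IMDS point concrete, a sketch of the egress shape (namespace and policy name are illustrative): because NetworkPolicies are additive allow-lists, a namespace that starts from the default-deny policy shown earlier can add a rule that permits general egress while carving out the metadata address.

# Allow general egress but exclude the cloud metadata endpoint
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: egress-block-imds
  namespace: tenant-commerce
spec:
  podSelector: {}
  policyTypes: ["Egress"]
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 169.254.169.254/32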

At what scale does cluster-per-tenant become necessary?

Cluster-per-tenant is warranted when tenants need conflicting Kubernetes versions, cluster-admin access, or regulatory-mandated physical isolation. The operational break-even depends on your automation maturity. Below roughly a dozen tenants, manual cluster management works. Beyond that, you need cluster lifecycle automation, which takes months of dedicated engineering to build well.