
Kubernetes Multi-Tenancy: Beyond Namespaces

Metasphere Engineering

Six months ago, your team built a shared Kubernetes platform. The first few internal teams onboarded smoothly. Namespaces per team, RBAC configured, NetworkPolicy deployed. Everyone felt good about it.

Then the payments team asks to run PCI-scoped workloads. Your security team runs an isolation assessment and asks one question: “Can the marketing team’s pods reach the payment namespace over the network?” You check. With your current Cilium configuration and the default-allow egress policy nobody thought to restrict, the answer is yes.

Separate floors. All the doors unlocked. Marketing just walked into finance.

The PCI DSS requirements are unambiguous on network segmentation. The Kubernetes Multi-Tenancy Working Group defines the isolation models. But most teams discover the gaps the same way you just did.

Key takeaways
  • Default-allow egress means every tenant can reach every other tenant. Marketing pods touching the payment namespace is a PCI violation. Every door in the building unlocked. Designing isolation after a dozen tenants are running takes weeks. Before the first tenant: days.
  • Three isolation models exist: namespace (soft), virtual cluster (medium), and dedicated cluster (hard). The right choice depends on compliance requirements, not team preference.
  • NetworkPolicy default-deny is the single most impactful isolation control. Lock all doors before the first tenant moves in. Retrofitting it across existing workloads breaks things.
  • ResourceQuota prevents noisy neighbors. Without CPU/memory limits per namespace, one team’s memory leak takes down the cluster. One tenant’s industrial equipment shakes the building.
  • RBAC alone is not isolation. RBAC controls API access. Network, storage, and resource boundaries need separate enforcement. The access badge opens the right doors. It doesn’t stop someone from yelling through the wall.

The Three Isolation Models

The isolation model sets the security ceiling for your entire platform. Choose it before the first tenant onboards. Changing it later means re-architecting under load. Remodeling the building with tenants already in it.

| Model | Isolation | Cost | Complexity | Best For |
| --- | --- | --- | --- | --- |
| Namespace per tenant | Soft (shared nodes + control plane) | 1x | Low | Internal teams with mutual trust |
| vCluster (virtual cluster) | Medium (virtual control plane, shared nodes) | Low overhead above 1x | Medium | SaaS with moderate isolation needs |
| Cluster per tenant | Hard (dedicated everything) | Several times 1x | High | PCI/HIPAA, zero-trust between tenants |

Namespace isolation works for internal teams that trust each other. Pods share nodes and the control plane. Separate floors, same building, same hallways. If compromise in one namespace triggers breach notification to another tenant, this model is not enough.

vCluster occupies the sweet spot that most teams overlook. Each tenant gets a virtual API server, scheduler, and controller manager running as pods in the host cluster. A building-within-a-building. Own reception, own mailroom, shared parking lot. Shared nodes keep costs down. Each virtual cluster adds only a few hundred megabytes of memory overhead, so a single host cluster can run dozens of them affordably. Tenants can install their own CRDs and get cluster-admin on their virtual cluster without affecting neighbors.

Cluster per tenant provides the strongest isolation but the highest operational cost. Separate buildings. Beyond a dozen tenants, manual cluster management breaks down. Cloud-native automation for cluster lifecycle becomes mandatory.

| When namespace isolation works | When it doesn’t |
| --- | --- |
| All tenants are internal teams with aligned security posture | Tenants have different compliance needs (PCI in one, HIPAA in another) |
| No tenant needs cluster-admin access or custom CRDs | A tenant compromise triggers breach notification to other tenants |
| Shared resource pools are acceptable | Workloads need dedicated node pools with specific hardware |
| A single operations team manages all tenant workloads | Tenants expect independent upgrade schedules or Kubernetes versions |

| Dimension | Namespace Isolation | Virtual Cluster (vCluster) | Cluster per Tenant |
| --- | --- | --- | --- |
| Control plane | Shared. All tenants on same API server | Virtual API server per tenant. Syncs to host cluster | Dedicated. Full isolation |
| Node sharing | Shared nodes. Noisy neighbor risk | Shared nodes (better isolation via virtual kubelet) | Dedicated nodes. No sharing |
| Network isolation | NetworkPolicy (if CNI supports it). Default: no isolation | NetworkPolicy + virtual network | Physical network isolation |
| RBAC complexity | Grows quadratically with tenants. Unauditable at 50+ | Scoped per virtual cluster. Simpler | Trivial. Each cluster is independent |
| Cost | Lowest. Shared everything | Medium. Virtual cluster overhead is small | Highest. Full cluster per tenant |
| Blast radius | Namespace escape = access to all tenants | Virtual cluster escape = host cluster (contained) | Cluster compromise = one tenant only |
| Best for | Trusted internal teams, dev/staging environments | Multi-team platform, moderate isolation needs | Regulated workloads, untrusted tenants, compliance requirements |

NetworkPolicy Gaps the Docs Won’t Warn You About

[Figure: Isolation model comparison, attack propagation. In the namespace model, a breach in Tenant A bleeds through the shared kernel into Tenant B. In the vCluster model, the attack is blocked at the virtual control plane boundary. In cluster-per-tenant, it cannot reach the other tenant at all. Isolation increases left to right, and so does cost: roughly 1x, 2-3x, and Nx for per-tenant clusters.]

Four gaps catch teams after they’ve already shipped what they believed was production-grade network isolation. Four unlocked doors that look locked.

Your CNI might not enforce policies at all. NetworkPolicy is an API object. Enforcement is a CNI feature. Flannel does not enforce NetworkPolicy. If you’re running Flannel and you have NetworkPolicy objects deployed, those objects are decoration. Every last one. Locks installed but nobody connected the wiring. They click but don’t lock. Clusters have run carefully crafted policies for months with zero enforcement. Calico and Cilium enforce. Verify yours does before you trust a single rule. The second gap hides inside the fix itself: a default-deny policy that forgets DNS breaks name resolution for every workload in the namespace. The policy below denies all ingress and egress, then explicitly allows DNS to kube-system on port 53.

# Default-deny + DNS allow - deploy to every tenant namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: tenant-commerce
spec:
  podSelector: {}  # Applies to all pods
  policyTypes: ["Ingress", "Egress"]
  egress:
    - to:  # Allow DNS only
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53

The third and fourth gaps sit outside your policy objects entirely. Egress to 169.254.169.254 has to be blocked explicitly, or a compromised pod can reach the cloud provider's instance metadata service and harvest credentials over SSRF. And host-networked pods, which includes most daemonsets, bypass NetworkPolicy altogether: they use the node's network namespace, outside the locks you installed.

RBAC Complexity Doesn't Scale Linearly

[Figure: RBAC complexity growth. At 5 teams: 15 roles and 30 bindings, manual review feasible. At 20 teams: 120 roles and 400+ bindings, manual review painful, automation needed. At 50 teams: 500+ roles and 2,000+ bindings, manual audit impossible. RBAC that can't be audited can't be trusted. Automate or drown.]

Past 20 tenants, the only sustainable answer is admission controllers. OPA Gatekeeper or Kyverno policies validate RBAC resources on creation, preventing overly broad bindings, enforcing naming conventions, and requiring ownership labels. The security guard who checks every badge at the door. Same principle as [automated remediation](/technology/reliability-operations/automated-remediation/) in reliability engineering: when the system grows past what humans can review, enforcement must be automatic.

A practical starting policy set: block any RoleBinding referencing `cluster-admin` outside the platform namespace, require a `team` label on every ServiceAccount, deny ClusterRoleBindings from tenant namespaces, and require all Roles to specify explicit resource names rather than wildcards.

Example OPA Gatekeeper constraint: block wildcard resource access

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sBlockWildcardResources
metadata:
  name: deny-wildcard-resources
spec:
  match:
    kinds:
      - apiGroups: ["rbac.authorization.k8s.io"]
        kinds: ["Role", "ClusterRole"]
    excludedNamespaces: ["kube-system", "platform-system"]
  parameters:
    message: "Roles must specify explicit resource names. Wildcards are not allowed in tenant namespaces."

This constraint prevents tenants from creating Roles with resources: ["*"], which would grant access to every resource type in the namespace. A master key for the floor. Explicit resource lists make RBAC auditable and prevent privilege creep.
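
The constraint above only works if a matching ConstraintTemplate defines the K8sBlockWildcardResources kind and its Rego logic. The template isn't shown here, so the following is a minimal sketch of what it could look like, not the exact template in use.

# Sketch: ConstraintTemplate backing the constraint above (not the production template)
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sblockwildcardresources
spec:
  crd:
    spec:
      names:
        kind: K8sBlockWildcardResources
      validation:
        openAPIV3Schema:
          type: object
          properties:
            message:
              type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sblockwildcardresources

        # Flag any rule in the reviewed Role/ClusterRole that uses a wildcard resource
        violation[{"msg": msg}] {
          rule := input.review.object.rules[_]
          rule.resources[_] == "*"
          msg := input.parameters.message
        }

Whatever the real template looks like, the property that matters is the same: wildcard resources are rejected at admission time, before the Role ever exists.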

Resource Isolation: Preventing Noisy Neighbors

RBAC governs who can do what. Resource isolation governs how much they can consume. Without both, the platform is only half-secured. The badge opens the right doors. But one tenant’s industrial equipment is shaking the entire building.

ResourceQuota sets hard limits per namespace: maximum CPU requests, maximum memory limits, maximum pod count, maximum storage claims. LimitRange sets defaults and maxima for individual containers, preventing pods without resource specs from consuming unbounded node resources. You need both. Without LimitRange, a single pod can eat an entire node. Without ResourceQuota, a namespace can spread pods across every node in the pool. Without desk size limits, one person brings a 12-foot desk. Without floor space limits, one company takes over the building.
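
As a concrete sketch, here is what both objects might look like for the tenant-commerce namespace used earlier; the numbers are illustrative placeholders to profile against, not recommendations.

# Namespace-wide caps (ResourceQuota) plus per-container defaults and maxima (LimitRange)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: tenant-commerce
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
    pods: "100"
    persistentvolumeclaims: "20"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-defaults
  namespace: tenant-commerce
spec:
  limits:
    - type: Container
      defaultRequest:   # applied when a container omits resource requests
        cpu: 100m
        memory: 128Mi
      default:          # applied when a container omits resource limits
        cpu: 500m
        memory: 512Mi
      max:              # ceiling for any single container
        cpu: "4"
        memory: 8Gi

The LimitRange defaults also keep the quota enforceable: once a quota constrains CPU and memory, pods without explicit requests and limits are rejected unless something fills them in.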

Prerequisites
  1. LimitRange deployed in every tenant namespace with default CPU and memory limits
  2. ResourceQuota deployed in every tenant namespace with hard caps on CPU, memory, pod count, and storage
  3. PriorityClass definitions separate critical platform workloads from tenant workloads (see the sketch after this list)
  4. Node pool taints and tolerations isolate workloads requiring dedicated hardware
  5. Actual consumption profiled over 2-4 weeks before setting production quotas
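
For item 3, a minimal sketch of a two-tier scheme (the class names and values are illustrative): platform components get the higher priority so that, under node pressure, the scheduler preempts tenant pods before platform pods.

# Two-tier priorities: platform workloads outrank tenant workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: platform-critical
value: 100000              # higher value wins during scheduling and preemption
globalDefault: false
description: "Platform components: ingress, monitoring, admission webhooks."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tenant-default
value: 1000
globalDefault: true        # tenant pods fall back to this unless they specify a class
description: "Default priority for tenant workloads."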

Profile actual consumption for 2-4 weeks, set quotas at the 95th percentile plus 20% headroom, and review quarterly. Starting tight is always easier than tightening after the fact, because reducing quotas on running workloads triggers pod evictions and scheduling failures that generate urgent tickets. Shrinking someone’s office while they’re sitting in it.

Taints and tolerations carve out dedicated node pools for workloads that need specific hardware. Platform engineering starts tight and loosens deliberately. The reverse never ends.
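
A minimal sketch of the pattern, assuming the dedicated pool already carries a hypothetical tenant=payments:NoSchedule taint and a pool=payments label (both set through your node pool configuration): the toleration lets the pod onto the tainted nodes, and the nodeSelector keeps it off everything else.

# Pod pinned to a dedicated, tainted node pool (taint, label, and image are illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: payments-worker
  namespace: tenant-payments
spec:
  nodeSelector:
    pool: payments          # assumed label on the dedicated node pool
  tolerations:
    - key: tenant
      operator: Equal
      value: payments
      effect: NoSchedule
  containers:
    - name: worker
      image: registry.example.com/payments/worker:1.0   # placeholder image
      resources:
        requests:
          cpu: 250m
          memory: 256Mi
        limits:
          cpu: "1"
          memory: 1Gi

The taint keeps other tenants off the pool; the toleration alone does not pin the pod there, which is why the nodeSelector is needed as well.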

The Isolation Illusion
The false sense of security that comes from having separate Kubernetes namespaces for each tenant. Separate floors in the office building. The API boundary exists. The network boundary does not (without NetworkPolicy). The resource boundary does not (without ResourceQuotas). The storage boundary does not (without StorageClass restrictions). Three of four isolation layers are missing by default. Separate floors, but the doors are unlocked, the vents are connected, and anyone can press any elevator button.
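
The storage boundary can be closed with the same primitive as the resource boundary. A hedged sketch, assuming a hypothetical premium-ssd StorageClass this tenant should not use: a per-class quota of zero blocks any claim against it in the namespace.

# Block a specific StorageClass in a tenant namespace by quotaing it to zero
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-restrictions
  namespace: tenant-commerce
spec:
  hard:
    premium-ssd.storageclass.storage.k8s.io/requests.storage: "0"
    premium-ssd.storageclass.storage.k8s.io/persistentvolumeclaims: "0"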

What the Industry Gets Wrong About Kubernetes Multi-Tenancy

“Namespaces provide isolation.” Namespaces provide API-level separation. They don’t provide network isolation, storage isolation, or resource isolation by default. A pod in namespace A can reach a pod in namespace B over the network unless explicit NetworkPolicy denies it. Separate offices with unlocked doors. Namespaces are organizational boundaries, not security boundaries.

“Start permissive, tighten later.” Tightening network policies after a dozen tenants are running production workloads takes weeks of careful rollout and change management. Default-deny before the first tenant takes days. The cost ratio is roughly 10:1. Install the locks before the tenants move in. Every team that starts permissive wishes they hadn’t.

Our take: Deploy default-deny NetworkPolicy and ResourceQuotas before the first tenant onboards. Not after the security review. Not after the compliance audit. Before. A few days of upfront design prevents weeks of retrofitting with a dozen tenants watching. vCluster is the right default for SaaS multi-tenancy unless you have a specific regulatory requirement for physical isolation.

Can the marketing team’s pods reach the payment namespace? With default-deny network policies, namespace-scoped RBAC, and admission controllers enforced from day one, the answer is no before the security team even asks. Separate floors. Locked doors. Keys for the right rooms only. The building is quiet for the right reason.

Your Shared Cluster Has Isolation Gaps You Haven't Found

The wrong multi-tenancy model creates isolation gaps that take weeks to close after tenants onboard. Kubernetes tenant architecture must match real security requirements, compliance constraints, and operational capacity before the first workload lands.


Frequently Asked Questions

What isolation does a Kubernetes namespace actually provide?

Namespaces provide soft isolation only. Pods in different namespaces share the same kernel and control plane API, and they share node resources unless ResourceQuota and LimitRange constrain them. A container escape in one namespace compromises every namespace on that node. CVE-2022-0185 allowed a single unprivileged container to gain root on the host. For tenants with regulatory isolation needs, namespace isolation alone is not enough.

What is vCluster and when should you use it?

vCluster runs a virtual Kubernetes cluster inside a namespace of the host cluster, with its own API server, scheduler, and controller manager while pods run on shared host nodes. A single host cluster can run 50+ virtual clusters at roughly 256 MB memory overhead per instance. Use it when tenants need CRD installation or cluster-admin access but dedicated clusters are too expensive.

How do you prevent noisy neighbor resource exhaustion in a shared cluster?

Both LimitRange and ResourceQuota are required. Without LimitRange, a single pod can eat an entire node’s 64 GB of memory. Without ResourceQuota, a namespace can schedule pods across every node in the pool. Set quotas by profiling actual usage over 2-4 weeks, starting at the 95th percentile plus 20% headroom. Review and adjust quarterly based on real consumption patterns.

What are the network policy gaps teams commonly miss?

NetworkPolicy objects only work if your CNI plugin enforces them. Calico and Cilium enforce policies. Flannel does not. Policies must explicitly allow DNS traffic to kube-dns on port 53, or workloads lose name resolution. Egress to 169.254.169.254 must be blocked to prevent SSRF to cloud IMDS. Host-networked pods like most daemonsets bypass NetworkPolicy entirely.
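
To make the IMDS point concrete, a sketch of the egress shape (namespace and policy name are illustrative): because NetworkPolicies are additive allow-lists, a namespace that starts from the default-deny policy shown earlier can add a rule that permits general egress while carving out the metadata address.

# Allow general egress but exclude the cloud metadata endpoint
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: egress-block-imds
  namespace: tenant-commerce
spec:
  podSelector: {}
  policyTypes: ["Egress"]
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 169.254.169.254/32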

At what scale does cluster-per-tenant become necessary?

Cluster-per-tenant is warranted when tenants need conflicting Kubernetes versions, cluster-admin access, or regulatory-mandated physical isolation. The operational break-even depends on your automation maturity. Below roughly a dozen tenants, manual cluster management works. Beyond that, you need cluster lifecycle automation, which takes months of dedicated engineering to build well.