Six months ago, your team built a shared Kubernetes platform. The first few internal teams onboarded smoothly. Namespaces per team, RBAC configured, NetworkPolicy deployed. Everyone felt good about it.
Then the payments team asks to run PCI-scoped workloads. Your security team runs an isolation assessment and asks one question: “Can the marketing team’s pods reach the payment namespace over the network?” You check. With your current Cilium configuration and the default-allow egress policy nobody thought to restrict, the answer is yes.
Separate floors. All the doors unlocked. Marketing just walked into finance.
The PCI DSS requirements are unambiguous on network segmentation. The Kubernetes Multi-Tenancy Working Group defines the isolation models. But most teams discover the gaps the same way you just did.
Key takeaways
- Default-allow egress means every tenant can reach every other tenant. Marketing pods touching the payment namespace is a PCI violation. Every door in the building unlocked.
- Designing isolation after a dozen tenants are running takes weeks. Before the first tenant: days.
- Three isolation models exist: namespace (soft), virtual cluster (medium), and dedicated cluster (hard). The right choice depends on compliance requirements, not team preference.
- NetworkPolicy default-deny is the single most impactful isolation control. Lock all doors before the first tenant moves in. Retrofitting it across existing workloads breaks things.
- ResourceQuota prevents noisy neighbors. Without CPU/memory limits per namespace, one team's memory leak takes down the cluster. One tenant's industrial equipment shakes the building.
- RBAC alone is not isolation. RBAC controls API access. Network, storage, and resource boundaries need separate enforcement. The access badge opens the right doors. It doesn't stop someone from yelling through the wall.
The Three Isolation Models
The isolation model sets the security ceiling for your entire platform. Choose it before the first tenant onboards. Changing it later means re-architecting under load. Remodeling the building with tenants already in it.
| Model | Isolation | Cost | Complexity | Best For |
|---|---|---|---|---|
| Namespace per tenant | Soft (shared nodes + control plane) | 1x | Low | Internal teams with mutual trust |
| vCluster (virtual cluster) | Medium (virtual control plane, shared nodes) | Low overhead above 1x | Medium | SaaS with moderate isolation needs |
| Cluster per tenant | Hard (dedicated everything) | Several times 1x | High | PCI/HIPAA, zero-trust between tenants |
Namespace isolation works for internal teams that trust each other. Pods share nodes and the control plane. Separate floors, same building, same hallways. If compromise in one namespace triggers breach notification to another tenant, this model is not enough.
vCluster occupies the sweet spot that most teams overlook. Each tenant gets a virtual API server, scheduler, and controller manager running as pods in the host cluster. A building-within-a-building. Own reception, own mailroom, shared parking lot. Shared nodes keep costs down. Each virtual cluster adds only a few hundred megabytes of memory overhead, so a single host cluster can run dozens of them affordably. Tenants can install their own CRDs and get cluster-admin on their virtual cluster without affecting neighbors.
Cluster per tenant provides the strongest isolation but the highest operational cost. Separate buildings. Beyond a dozen tenants, manual cluster management breaks down. Cloud-native automation for cluster lifecycle becomes mandatory.
| When namespace isolation works | When it doesn't |
|---|---|
| All tenants are internal teams with aligned security posture | Tenants have different compliance needs (PCI in one, HIPAA in another) |
| No tenant needs cluster-admin access or custom CRDs | A tenant compromise triggers breach notification to other tenants |
| Shared resource pools are acceptable | Workloads need dedicated node pools with specific hardware |
| A single operations team manages all tenant workloads | Tenants expect independent upgrade schedules or Kubernetes versions |
| Dimension | Namespace Isolation | Virtual Cluster (vCluster) | Cluster per Tenant |
|---|---|---|---|
| Control plane | Shared. All tenants on same API server | Virtual API server per tenant. Syncs to host cluster | Dedicated. Full isolation |
| Node sharing | Shared nodes. Noisy neighbor risk | Shared nodes (better isolation via virtual kubelet) | Dedicated nodes. No sharing |
| Network isolation | NetworkPolicy (if CNI supports it). Default: no isolation | NetworkPolicy + virtual network | Physical network isolation |
| RBAC complexity | Grows quadratically with tenants. Unauditable at 50+ | | |
Four gaps catch teams after they’ve already shipped what they believed was production-grade network isolation. Four unlocked doors that look locked.
Your CNI might not enforce policies at all. NetworkPolicy is an API object. Enforcement is a CNI feature. Flannel does not enforce NetworkPolicy. If you’re running Flannel and you have NetworkPolicy objects deployed, those objects are decoration. Every last one. Locks installed but nobody connected the wiring. They click but don’t lock. Clusters have run carefully crafted policies for months with zero enforcement. Calico and Cilium enforce. Verify yours does before you trust a single rule.
```yaml
# Default-deny + DNS allow - deploy to every tenant namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: tenant-commerce
spec:
  podSelector: {}  # Applies to all pods
  policyTypes: ["Ingress", "Egress"]
  egress:
    - to:  # Allow DNS only
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```
RBAC complexity does not scale linearly: 5 teams are manageable, 20 teams painful, 50 teams unauditable. At 5 teams you have 15 roles and 30 bindings, manually reviewable. At 20 teams it grows to 120 roles and 400+ bindings, and manual review becomes painful. At 50 teams, 500+ roles and 2000+ bindings make manual audit impossible without automation. RBAC that can't be audited can't be trusted. Automate or drown.
Past 20 tenants, the only sustainable answer is admission controllers. OPA Gatekeeper or Kyverno policies validate RBAC resources on creation, preventing overly broad bindings, enforcing naming conventions, and requiring ownership labels. The security guard who checks every badge at the door. Same principle as [automated remediation](/technology/reliability-operations/automated-remediation/) in reliability engineering: when the system grows past what humans can review, enforcement must be automatic.

A practical starting policy set: block any RoleBinding referencing `cluster-admin` outside the platform namespace, require a `team` label on every ServiceAccount, deny ClusterRoleBindings from tenant namespaces, and require all Roles to specify explicit resource names rather than wildcards.

<details>
<summary>Example OPA Gatekeeper constraint: block wildcard resource access</summary>

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sBlockWildcardResources
metadata:
  name: deny-wildcard-resources
spec:
  match:
    kinds:
      - apiGroups: ["rbac.authorization.k8s.io"]
        kinds: ["Role", "ClusterRole"]
    excludedNamespaces: ["kube-system", "platform-system"]
  parameters:
    message: "Roles must specify explicit resource names. Wildcards are not allowed in tenant namespaces."
```

</details>
This constraint prevents tenants from creating Roles with `resources: ["*"]`, which would grant access to every resource type in the namespace. A master key for the floor. Explicit resource lists make RBAC auditable and prevent privilege creep.
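A constraint kind like `K8sBlockWildcardResources` only exists once a ConstraintTemplate defines it. The following is a minimal sketch of what such a template could look like, not the actual template behind the constraint above; the Rego package name and violation logic are illustrative assumptions.

```yaml
# Hypothetical ConstraintTemplate backing a K8sBlockWildcardResources kind.
# Flags any Role/ClusterRole rule whose resources list contains "*".
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sblockwildcardresources
spec:
  crd:
    spec:
      names:
        kind: K8sBlockWildcardResources
      validation:
        openAPIV3Schema:
          type: object
          properties:
            message:
              type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sblockwildcardresources

        violation[{"msg": msg}] {
          rule := input.review.object.rules[_]
          rule.resources[_] == "*"
          msg := input.parameters.message
        }
```

The same check could equally be written as a Kyverno `validate` policy; the point is that the rejection happens at admission time, before the wildcard Role ever exists.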
Resource Isolation: Preventing Noisy Neighbors
RBAC governs who can do what. Resource isolation governs how much they can consume. Without both, the platform is only half-secured. The badge opens the right doors. But one tenant’s industrial equipment is shaking the entire building.
ResourceQuota sets hard limits per namespace: maximum CPU requests, maximum memory limits, maximum pod count, maximum storage claims. LimitRange sets defaults and maxima for individual containers, preventing pods without resource specs from consuming unbounded node resources. You need both. Without LimitRange, a single pod can eat an entire node. Without ResourceQuota, a namespace can spread pods across every node in the pool. Without desk size limits, one person brings a 12-foot desk. Without floor space limits, one company takes over the building.
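A minimal sketch of both objects for a single tenant namespace. The namespace name and all numeric limits here are illustrative assumptions, not recommended values; size them from profiled consumption.

```yaml
# Hypothetical per-tenant guardrails: namespace-wide caps plus per-container defaults
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: tenant-commerce
spec:
  hard:
    requests.cpu: "20"            # namespace-wide CPU request ceiling
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
    pods: "200"                   # cap pod count
    persistentvolumeclaims: "50"  # cap storage claims
---
apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-defaults
  namespace: tenant-commerce
spec:
  limits:
    - type: Container
      default:              # applied when a container declares no limits
        cpu: "500m"
        memory: 512Mi
      defaultRequest:       # applied when a container declares no requests
        cpu: "250m"
        memory: 256Mi
      max:                  # hard per-container ceiling
        cpu: "4"
        memory: 8Gi
```

Note that once a ResourceQuota covers CPU and memory, pods without resource specs are rejected outright unless the LimitRange supplies defaults, which is another reason the two must ship together.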
Prerequisites
- LimitRange deployed in every tenant namespace with default CPU and memory limits
- ResourceQuota deployed in every tenant namespace with hard caps on CPU, memory, pod count, and storage
- PriorityClass definitions separate critical platform workloads from tenant workloads
- Node pool taints and tolerations isolate workloads requiring dedicated hardware
- Actual consumption profiled over 2-4 weeks before setting production quotas
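The PriorityClass prerequisite above can be sketched as a pair of classes, one for platform components and one as the tenant default. Names and priority values are illustrative assumptions.

```yaml
# Hypothetical PriorityClass pair: platform workloads outrank (and can preempt) tenant workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: platform-critical
value: 1000000          # high value: scheduled first, preempts lower classes under pressure
globalDefault: false
description: "Platform components: ingress, DNS, observability"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tenant-default
value: 1000             # low value: tenant pods yield to platform pods
globalDefault: true     # applied to any pod that sets no priorityClassName
description: "Default priority for tenant workloads"
```

With this in place, a resource crunch evicts tenant pods before it touches the ingress controller or cluster DNS.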
Profile actual consumption for 2-4 weeks, set quotas at the 95th percentile plus 20% headroom, and review quarterly. Starting tight is always easier than tightening after the fact, because reducing quotas on running workloads triggers pod evictions and scheduling failures that generate urgent tickets. Shrinking someone’s office while they’re sitting in it.
[Diagram: egress paths in a shared cluster. Tenant A and Tenant B pods can reach each other by default and must be denied explicitly. Egress to kube-dns on port 53 must be allowed explicitly or DNS breaks. Egress to the cloud IMDS (an SSRF target) must be blocked explicitly. Pods with hostNetwork: true bypass NetworkPolicy and are not filtered.]
Taint-and-toleration scheduling carves out dedicated node pools for workloads that need them. Platform engineering starts tight and loosens deliberately. The reverse never ends.
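A sketch of the taint-and-toleration pattern, assuming a hypothetical `tenant=payments` taint and a matching node label; the namespace, image, and resource numbers are illustrative.

```yaml
# Nodes in the dedicated pool carry a taint, e.g.:
#   kubectl taint nodes <node-name> tenant=payments:NoSchedule
# and a matching label:
#   kubectl label nodes <node-name> tenant=payments
apiVersion: v1
kind: Pod
metadata:
  name: payments-worker
  namespace: tenant-payments
spec:
  tolerations:
    - key: "tenant"           # tolerate the taint so the pod is admitted to the pool
      operator: "Equal"
      value: "payments"
      effect: "NoSchedule"
  nodeSelector:
    tenant: payments          # pin the pod to the pool; toleration alone only permits, not requires
  containers:
    - name: worker
      image: registry.example.com/payments/worker:1.0   # hypothetical image
      resources:
        requests: {cpu: "500m", memory: 512Mi}
        limits: {cpu: "1", memory: 1Gi}
```

The taint keeps other tenants out; the nodeSelector keeps this tenant in. Both halves are needed, since a toleration by itself still lets the scheduler place the pod on shared nodes.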
The Isolation Illusion
The false sense of security that comes from having separate Kubernetes namespaces for each tenant. Separate floors in the office building. The API boundary exists. The network boundary does not (without NetworkPolicy). The resource boundary does not (without ResourceQuotas). The storage boundary does not (without StorageClass restrictions). Three of four isolation layers are missing by default. Separate floors, but the doors are unlocked, the vents are connected, and anyone can press any elevator button.
What the Industry Gets Wrong About Kubernetes Multi-Tenancy
“Namespaces provide isolation.” Namespaces provide API-level separation. They don’t provide network isolation, storage isolation, or resource isolation by default. A pod in namespace A can reach a pod in namespace B over the network unless explicit NetworkPolicy denies it. Separate offices with unlocked doors. Namespaces are organizational boundaries, not security boundaries.
“Start permissive, tighten later.” Tightening network policies after a dozen tenants are running production workloads takes weeks of careful rollout and change management. Default-deny before the first tenant takes days. The cost ratio is roughly 10:1. Install the locks before the tenants move in. Every team that starts permissive wishes they hadn’t.
Our take
Deploy default-deny NetworkPolicy and ResourceQuotas before the first tenant onboards. Not after the security review. Not after the compliance audit. Before. A few days of upfront design prevents weeks of retrofitting with a dozen tenants watching. vCluster is the right default for SaaS multi-tenancy unless you have a specific regulatory requirement for physical isolation.
Can the marketing team’s pods reach the payment namespace? With default-deny network policies, namespace-scoped RBAC, and admission controllers enforced from day one, the answer is no before the security team even asks. Separate floors. Locked doors. Keys for the right rooms only. The building is quiet for the right reason.
Your Shared Cluster Has Isolation Gaps You Haven't Found
The wrong multi-tenancy model creates isolation gaps that take weeks to close after tenants onboard. Kubernetes tenant architecture must match real security requirements, compliance constraints, and operational capacity before the first workload lands.
What isolation does a Kubernetes namespace actually provide?
Namespaces provide soft isolation only. Pods in different namespaces share the same kernel, control plane API, and node resources unless ResourceQuota and LimitRange are set up. A container escape in one namespace compromises every namespace on that node. CVE-2022-0185 allowed a single unprivileged container to gain root on the host. For tenants with regulatory isolation needs, namespace isolation alone is not enough.
What is vCluster and when should you use it?
vCluster runs a virtual Kubernetes cluster inside a namespace of the host cluster, with its own API server, scheduler, and controller manager while pods run on shared host nodes. A single host cluster can run 50+ virtual clusters at roughly 256 MB memory overhead per instance. Use it when tenants need CRD installation or cluster-admin access but dedicated clusters are too expensive.
How do you prevent noisy neighbor resource exhaustion in a shared cluster?
Both LimitRange and ResourceQuota are required. Without LimitRange, a single pod can eat an entire node’s 64 GB of memory. Without ResourceQuota, a namespace can schedule pods across every node in the pool. Set quotas by profiling actual usage over 2-4 weeks, starting at the 95th percentile plus 20% headroom. Review and adjust quarterly based on real consumption patterns.
What are the network policy gaps teams commonly miss?
NetworkPolicy objects only work if your CNI plugin enforces them. Calico and Cilium enforce policies. Flannel does not. Policies must explicitly allow DNS traffic to kube-dns on port 53, or workloads lose name resolution. Egress to 169.254.169.254 must be blocked to prevent SSRF to cloud IMDS. Host-networked pods like most daemonsets bypass NetworkPolicy entirely.
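The IMDS block can be expressed with an `ipBlock` carve-out. This is a sketch under assumptions: the namespace and policy name are illustrative, and how `ipBlock` interacts with cluster-internal traffic varies by CNI, so verify enforcement with your plugin.

```yaml
# Hypothetical policy: permit broad egress but carve out the cloud metadata endpoint
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-except-imds
  namespace: tenant-commerce
spec:
  podSelector: {}               # applies to all pods in the namespace
  policyTypes: ["Egress"]
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 169.254.169.254/32   # cloud IMDS: SSRF target, never reachable from tenants
```

Remember that this still does nothing for hostNetwork pods, which never traverse NetworkPolicy; those need node-level controls such as IMDSv2 enforcement or host firewall rules.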
At what scale does cluster-per-tenant become necessary?
Cluster-per-tenant is warranted when tenants need conflicting Kubernetes versions, cluster-admin access, or regulatory-mandated physical isolation. The operational break-even depends on your automation maturity. Below roughly a dozen tenants, manual cluster management works. Beyond that, you need cluster lifecycle automation, which takes months of dedicated engineering to build well.