
Multi-Cloud Networking: Connectivity Without Lock-in

Metasphere Engineering · 13 min read

You run workloads on two cloud providers. Compute was straightforward. Kubernetes clusters in both, Terraform modules per provider, deployment pipelines that target each independently. Then someone asks: “Why can’t Service A in AWS talk to Service B in Azure?” And suddenly you discover that multi-cloud compute was never the hard part. Multi-cloud networking is. And it’s not even close.

The networking layer is where multi-cloud stops being an architectural diagram and starts being an engineering problem that makes people reconsider their career choices. Different providers use fundamentally different networking models. AWS VPCs, Azure VNets, and GCP VPCs have incompatible assumptions about routing, DNS resolution, firewall rules, and load balancing. Connecting them is not plugging cables together. It is building and maintaining a translation layer that handles every inconsistency between three distinct networking stacks.

The multi-cloud strategy conversation usually focuses on compute portability and vendor lock-in avoidance. That’s the easy conversation. The networking conversation is where the actual operational cost lives.

Transit Gateway Architectures

Every major cloud provider has a transit routing service: AWS Transit Gateway (TGW), Azure Virtual WAN (vWAN), and GCP Network Connectivity Center (NCC). These are the backbone for hub-and-spoke network topologies within a single provider. The problem is that each operates differently, and connecting them requires a separate interconnection layer. Here is what you’re actually working with.

AWS TGW operates as a regional router. You attach VPCs, VPN connections, and Direct Connect gateways to it, and TGW handles route propagation. A single TGW supports up to 5,000 attachments and 10,000 routes, which is sufficient for most enterprises. The critical nuance most architects miss at first: TGW is regional. Cross-region traffic requires TGW peering, which adds another hop and another set of route tables to manage.
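The propagation-plus-peering behavior can be sketched as a toy route table. This is purely illustrative (real TGWs are managed through the AWS API, and every CIDR and attachment name below is made up): VPC attachments propagate their CIDRs automatically, while cross-region destinations need static routes pointing at a peering attachment.

```python
# Toy model of TGW route-table population. VPC attachments propagate
# their CIDRs; cross-region CIDRs require static routes via a peering
# attachment (the extra hop mentioned above).

def build_route_table(vpc_attachments, peering_routes):
    """Return {cidr: next-hop attachment} for one regional TGW."""
    table = {}
    for attachment_id, cidr in vpc_attachments.items():
        table[cidr] = attachment_id            # propagated route
    for cidr, peering_id in peering_routes.items():
        table[cidr] = peering_id               # static route over TGW peering
    return table

us_east_table = build_route_table(
    vpc_attachments={"tgw-attach-vpc-a": "10.0.0.0/16",
                     "tgw-attach-vpc-b": "10.1.0.0/16"},
    peering_routes={"10.2.0.0/16": "tgw-attach-peering-west"},
)
```

The point of the model: reaching `10.2.0.0/16` from us-east traverses the peering attachment, which means a second route table (in the peer region) that also has to be correct.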

Azure vWAN abstracts the hub router behind a managed service. It handles VPN, ExpressRoute, and inter-VNet routing through a single control plane. The trade-off versus raw Azure hub VNets is flexibility. vWAN simplifies operations but limits the custom routing configurations that complex architectures sometimes require.

GCP NCC is the newest of the three and takes a different approach by treating both cloud and on-premises locations as spokes connected through Google’s backbone. For organizations with a significant GCP footprint, NCC provides tight integration with Google’s private network, which reduces cross-region latency within GCP.

Connecting these transit gateways to each other is where the real engineering starts. A cloud exchange like Equinix Fabric or Megaport provides a single physical point of presence where you can establish private connections to all three providers. This avoids the public internet entirely and gives you consistent, low-latency cross-cloud paths. The alternative, site-to-site VPN tunnels between providers, works but adds encryption overhead, is limited to ~1.25 Gbps per tunnel on AWS, and routes over the public internet with unpredictable latency. For production traffic, that’s a non-starter.

Service Mesh Across Clouds

Connectivity is just the foundation. When services in different clouds need to communicate, you also need service discovery, mutual TLS, and traffic management that work across provider boundaries. This is where service mesh becomes relevant for multi-cloud specifically.

Istio multi-cluster federation connects meshes running in different clouds through east-west gateways. Each cluster maintains its own control plane (istiod), but workloads can resolve and call services in remote clusters. The certificate authority must be shared or federated across clusters for mTLS to work. The common approach is running a root CA externally (Vault or cert-manager with a shared root) and issuing intermediate CAs per cluster.

HashiCorp Consul provides a simpler alternative for cross-cloud service mesh through mesh gateways. Consul’s gossip protocol handles service discovery, and mesh gateways proxy traffic between datacenters without requiring direct pod-to-pod connectivity. The operational model is lighter than Istio multi-cluster but provides fewer L7 traffic management features. For organizations that need basic cross-cloud mTLS and service discovery without Istio’s full feature set, Consul mesh gateways are the pragmatic choice.

The latency cost of cross-cloud mesh is real and you cannot engineer it away. Each east-west gateway hop adds 1-3ms. Combined with the base cross-cloud latency of 2-5ms on dedicated interconnect (or 10-30ms over public internet), a request that touches services in both clouds accumulates measurable overhead. Design service boundaries so that hot paths stay within a single provider. Cross-cloud calls should be reserved for cold-path operations like batch processing, failover, or data synchronization. If your service mesh architecture is routing hot-path traffic across clouds, the problem is not the mesh. The problem is which services live where.
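The accumulation is easy to put numbers on. A minimal budget calculator using the figures above (1-3 ms per east-west gateway hop, 2-5 ms cross-cloud on dedicated interconnect, 10-30 ms over public internet); the specific hop counts and per-hop values below are illustrative midpoints, not measurements:

```python
# Back-of-envelope latency overhead for a request that crosses cloud
# boundaries. Each crossing pays the transport latency plus an east-west
# gateway hop on each side (~2 ms per hop, per the ranges in this article).

def cross_cloud_overhead_ms(crossings, transport_ms,
                            gateway_hops_per_crossing=2, gateway_hop_ms=2):
    per_crossing = transport_ms + gateway_hops_per_crossing * gateway_hop_ms
    return crossings * per_crossing

# A path that crosses clouds twice on dedicated interconnect (~3 ms base):
interconnect = cross_cloud_overhead_ms(crossings=2, transport_ms=3)   # 14 ms
# The same path over public internet at ~20 ms base:
internet = cross_cloud_overhead_ms(crossings=2, transport_ms=20)      # 48 ms
```

Fourteen added milliseconds is tolerable on a cold path; forty-eight on a hot path that runs per-request is the kind of number that shows up in user-facing latency dashboards.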

DNS-Based Cross-Cloud Routing

DNS is the simplest and most underappreciated tool for multi-cloud traffic management. Don’t overlook it. Weighted DNS routing, health-check-based failover, and latency-based routing all work across provider boundaries without any cross-cloud networking infrastructure at all.

Route 53 (AWS), Azure Traffic Manager, and Cloud DNS (GCP) all support health-checked DNS records. Front your services with provider-specific load balancers, and DNS routes traffic to whichever provider is healthy and responsive. The failover is not instant. DNS TTLs mean propagation takes 30-60 seconds with aggressive settings. But for disaster recovery scenarios where the RTO is measured in minutes, DNS-based failover is simple and it works.

The important design decision is what sits behind the DNS record. A bare IP address pointing to a single instance is fragile. Don’t do this. A load balancer endpoint with its own health checks provides the resilience you need. For cross-cloud failover, each provider should run the critical services independently with their own data layer. Active-active across clouds requires data replication, which brings its own complexity. Active-passive with DNS failover is operationally simpler and sufficient for most disaster recovery requirements.
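The weighted-plus-health-check behavior can be simulated in a few lines. This is a sketch of the resolution logic, not any provider's API; the endpoints and weights are hypothetical, and real resolution happens inside Route 53, Traffic Manager, or Cloud DNS:

```python
import random

# Sketch of weighted, health-checked DNS answer selection. Unhealthy
# endpoints effectively drop to zero weight, which is how failover works.

def resolve(records):
    """Pick one healthy endpoint at random, proportional to weight."""
    healthy = [(r["endpoint"], r["weight"]) for r in records if r["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy endpoints")
    endpoints, weights = zip(*healthy)
    return random.choices(endpoints, weights=weights, k=1)[0]

records = [
    {"endpoint": "alb.aws.example.com",  "weight": 70, "healthy": True},
    {"endpoint": "agw.azure.example.com", "weight": 30, "healthy": True},
]
primary = resolve(records)            # usually AWS, sometimes Azure

# Simulate the AWS health check failing: all answers shift to Azure.
records[0]["healthy"] = False
failover = resolve(records)           # always the Azure endpoint now
```

Clients still cache the old answer for up to one TTL, which is where the 30-60 second failover window comes from.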

IP Address Management at Scale

Overlapping CIDR ranges between cloud providers is the networking equivalent of duplicate primary keys. Everything works until you try to connect the networks, and then nothing routes correctly. This mistake is expensive to fix because re-addressing a production VPC or VNet requires downtime or complex migration. Get this wrong on day one and you will pay for it for years.

The discipline is straightforward: allocate non-overlapping supernets per provider before provisioning anything. Reserve 10.0.0.0/12 for AWS, 10.16.0.0/12 for Azure, 10.32.0.0/12 for GCP, and 10.48.0.0/12 for on-premises. Within each supernet, subdivide by environment and region. This gives each provider 1,048,576 addresses, which is more than sufficient for most enterprises.
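The allocation plan above can be expressed and verified with the standard library's `ipaddress` module. The environment subdivision at the end is illustrative; the supernets match the plan in this article:

```python
import ipaddress

# Non-overlapping /12 supernets per provider, per the plan above.
supernets = {
    "aws":     ipaddress.ip_network("10.0.0.0/12"),
    "azure":   ipaddress.ip_network("10.16.0.0/12"),
    "gcp":     ipaddress.ip_network("10.32.0.0/12"),
    "on-prem": ipaddress.ip_network("10.48.0.0/12"),
}

# Verify the blocks are disjoint before anyone provisions against them.
nets = list(supernets.values())
assert all(not a.overlaps(b)
           for i, a in enumerate(nets) for b in nets[i + 1:])

# Subdivide a provider's supernet, e.g. one /16 per VPC. A /12 yields
# sixteen /16s, each with 65,536 addresses.
aws_vpcs = list(supernets["aws"].subnets(new_prefix=16))
prod_us_east = aws_vpcs[0]      # 10.0.0.0/16
staging_us_east = aws_vpcs[1]   # 10.1.0.0/16
```

Running the overlap check in CI against the IPAM source of truth turns a years-long re-addressing project into a failed pull request.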

[Figure: Cross-cloud routing with IPAM allocation and DNS-based failover. Three providers with non-overlapping CIDR blocks (AWS 10.0.0.0/12, Azure 10.16.0.0/12, GCP 10.32.0.0/12) connect through a cloud exchange (Equinix/Megaport) via Direct Connect, ExpressRoute, and Interconnect at 2-5ms latency each; DNS health checks weight client traffic across providers (e.g. AWS 70% / Azure 30%) and shift to 0%/100% automatically when a provider's health check fails, with 30-60s of TTL propagation and no cross-cloud networking required for failover.]

A centralized IPAM tool (Infoblox for enterprise, NetBox for teams comfortable with open source) enforces allocation rules and prevents drift. Without centralized IPAM, the third team provisioning infrastructure in the second cloud provider will pick a CIDR range that overlaps with something in the first provider. It’s not a question of if. The fix at that point is NAT translation at the transit layer, which technically works but makes cross-cloud debugging significantly harder because source IPs no longer map to actual workloads.
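The rule a centralized IPAM enforces is small enough to sketch. A real deployment would do this inside Infoblox or NetBox; this just shows the check that prevents the overlap scenario, with hypothetical registrations:

```python
import ipaddress

# Minimal allocation guard: reject any proposed CIDR that overlaps an
# existing registration. Registrations below are illustrative.

registry = [ipaddress.ip_network("10.0.0.0/16"),    # AWS prod
            ipaddress.ip_network("10.16.0.0/16")]   # Azure prod

def allocate(cidr):
    proposed = ipaddress.ip_network(cidr)
    for existing in registry:
        if proposed.overlaps(existing):
            raise ValueError(f"{proposed} overlaps {existing}")
    registry.append(proposed)
    return proposed

allocate("10.1.0.0/16")           # fine: next free AWS block
try:
    allocate("10.0.128.0/17")     # inside AWS prod -> rejected
    rejected = False
except ValueError:
    rejected = True
```

Without this gate, the overlap ships, and the fix becomes NAT at the transit layer.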

The Egress Cost Trap

This is where multi-cloud networking costs become genuinely surprising. Every cloud provider charges for data leaving their network. AWS charges per GB for data transfer out to the internet and to other cloud providers. Azure and GCP follow similar models. The specific rates vary by region and volume tier, but the pattern is consistent: ingress is free, egress is not.

This creates a gravitational pull that most architecture diagrams ignore. Data wants to stay where it is because moving it costs money. A pipeline that processes 10 TB per day in AWS and sends results to a service in Azure incurs meaningful daily egress charges. Over a year, those charges will exceed the compute cost of the workload itself. Read that again.

The architectural response is straightforward: keep compute close to data. If the data lives in AWS S3, the processing that reads that data should also run in AWS. Only the results, which are typically orders of magnitude smaller, should cross the cloud boundary. A 10 TB raw dataset might produce 50 MB of aggregated results. Moving 50 MB across clouds is negligible. Moving 10 TB is a financial decision that someone will have to justify.
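The arithmetic behind "keep compute close to data" is worth writing out. The $0.09/GB rate below is an assumption for illustration; real egress pricing varies by region, destination, and volume tier, so check your provider's current rate card:

```python
# Annual egress cost for a daily cross-cloud transfer.
# NOTE: $0.09/GB is an assumed illustrative rate, not a quoted price.
RATE_PER_GB = 0.09

def annual_egress_cost(gb_per_day, rate=RATE_PER_GB):
    return gb_per_day * rate * 365

raw_data = annual_egress_cost(10 * 1024)   # ship 10 TB/day raw
aggregates = annual_egress_cost(0.05)      # ship 50 MB/day of results
```

At the assumed rate, shipping the raw data runs to six figures per year while shipping the aggregates costs less than a cup of coffee, which is the whole argument for moving compute instead of data.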

For cases where cross-cloud data movement is genuinely necessary, compression and delta synchronization reduce transfer volumes. Note that native S3 Replication only targets other S3 buckets; synchronizing S3 to Azure Blob Storage requires a transfer tool such as rclone or Azure Data Factory, and you should verify the egress implications before running it against large buckets.

Network Policy Consistency with Cilium

Running Kubernetes clusters in multiple clouds introduces a network policy consistency problem that bites teams hard. Kubernetes NetworkPolicy objects are per-cluster. A default-deny policy in your AWS cluster does not automatically exist in your Azure cluster. If one cluster enforces strict microsegmentation and the other allows all intra-cluster traffic, you have a security gap. And you probably won’t discover it until an audit or an incident.

Cilium solves this at the CNI level. As an eBPF-based networking layer, Cilium enforces network policies with lower overhead than sidecar-based approaches. More importantly for multi-cloud, Cilium Cluster Mesh connects multiple clusters into a single flat service discovery layer with unified policy enforcement. Services in Cluster A discover and call services in Cluster B with identity-based policies applied consistently.

The practical advantage over Istio for multi-cloud network policy is operational simplicity. Cilium policies are Kubernetes-native CRDs. No separate control plane to manage. No certificate rotation to debug. No sidecar memory overhead. The trade-off is fewer L7 traffic management features. If you need weighted canary routing across clouds, Istio is still the more capable choice. If you need consistent network policy and service discovery, Cilium Cluster Mesh is lighter to operate. Effective cloud-native architecture practice evaluates this trade-off against actual requirements rather than defaulting to the most feature-rich option.
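Because Cilium policies are plain Kubernetes objects, they can be generated and version-controlled like any other manifest. A sketch of a minimal allow-rule expressed as data — field names follow the `cilium.io/v2` CiliumNetworkPolicy CRD as commonly documented, but verify against the Cilium docs for your version before applying, and note the labels here are hypothetical:

```python
import json

# Allow `frontend` workloads to reach `api` on TCP 8080; everything else
# is denied by omission once a default-deny posture is in place.
policy = {
    "apiVersion": "cilium.io/v2",
    "kind": "CiliumNetworkPolicy",
    "metadata": {"name": "api-allow-frontend"},
    "spec": {
        "endpointSelector": {"matchLabels": {"app": "api"}},
        "ingress": [{
            "fromEndpoints": [{"matchLabels": {"app": "frontend"}}],
            "toPorts": [{"ports": [{"port": "8080", "protocol": "TCP"}]}],
        }],
    },
}

manifest = json.dumps(policy, indent=2)  # kubectl apply accepts JSON manifests
```

The multi-cloud payoff is that the same manifest applies verbatim to the AWS cluster and the Azure cluster, closing the consistency gap described above.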

VPN vs Dedicated Interconnect vs Cloud Exchange

The physical connectivity between clouds breaks down into three tiers, and the right choice depends on your traffic patterns.

Site-to-site VPN tunnels run over the public internet with IPsec encryption. Cheapest option, fastest to set up. AWS supports up to 1.25 Gbps per tunnel (two tunnels per VPN connection for redundancy). Azure and GCP support similar throughputs. VPN is fine for development environments, low-bandwidth workloads, and initial multi-cloud testing. It is not appropriate for production workloads that need consistent latency or high throughput. Don’t put production traffic on VPN tunnels and expect predictable performance.

Dedicated interconnects (AWS Direct Connect, Azure ExpressRoute, Google Cloud Interconnect) provide private connections that bypass the public internet. Latency drops from 10-30ms to 2-5ms for same-metro connections. Bandwidth ranges from 1 Gbps to 100 Gbps depending on the port type. The cost is a monthly port fee plus per-GB data transfer charges. Provisioning takes 2-8 weeks depending on the provider and location.

Cloud exchanges like Equinix Fabric and Megaport provide a multiplexed alternative. A single physical connection at a colocation facility gives you access to multiple cloud providers through virtual cross-connects. This is the most cost-effective approach when connecting three or more providers because you pay for one physical port instead of three separate dedicated connections. Megaport’s SDN-based provisioning can establish new cross-connects in minutes rather than weeks.
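The throughput numbers above translate directly into transfer time, which is often the deciding factor. A quick calculator using the nominal figures from this section (1.25 Gbps per VPN tunnel, 1-100 Gbps interconnect ports); real-world throughput will be lower than line rate:

```python
# Hours to move a dataset at a given nominal link speed.
def hours_to_transfer(terabytes, gbps):
    bits = terabytes * 8 * 1000**4        # decimal TB -> bits
    return bits / (gbps * 1000**3) / 3600

vpn_single_tunnel = hours_to_transfer(10, 1.25)   # ~17.8 hours for 10 TB
interconnect_10g = hours_to_transfer(10, 10)      # ~2.2 hours
```

A migration window that fits comfortably in an evening on a 10 Gbps port becomes a multi-day exercise over a single VPN tunnel, before accounting for IPsec overhead and internet variability.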

For organizations with workloads across multiple providers and data movement requirements that justify dedicated connectivity, a cloud exchange is almost always the right answer. One physical presence, multiple provider connections, no separate circuits to manage.

SD-WAN for Hybrid Connectivity

When multi-cloud networking extends to on-premises locations, branch offices, or edge sites, SD-WAN enters the picture. Traditional hub-and-spoke WAN architectures that backhaul all traffic through a central data center add unnecessary latency for cloud-bound traffic. SD-WAN provides direct cloud on-ramps from each site. This is a meaningful architecture shift.

Vendors like Cisco Viptela, VMware VeloCloud, and Palo Alto Prisma SD-WAN integrate directly with AWS TGW, Azure vWAN, and GCP NCC. Branch office traffic destined for cloud workloads takes the shortest path rather than detouring through a central data center. The latency improvement is significant for geographically distributed organizations.

The multi-cloud angle is that SD-WAN appliances route traffic to different cloud providers based on application policy. SaaS traffic goes direct to the internet. AWS workload traffic goes through the AWS on-ramp. Azure workload traffic goes through the Azure on-ramp. This application-aware routing is what differentiates SD-WAN from traditional VPN concentrators. The challenge is maintaining consistent security policy across all those paths when traffic no longer funnels through a single inspection point. That’s a real trade-off, not just a footnote.
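The routing decision reduces to a policy table keyed on application category. A sketch of that lookup — the categories and path names are illustrative, and in practice this policy lives in the vendor's controller (Viptela, VeloCloud, Prisma SD-WAN), not in application code:

```python
# Application-aware path selection at an SD-WAN edge (illustrative).
POLICY = {
    "saas":  "direct-internet",   # e.g. SaaS suites, no backhaul
    "aws":   "aws-onramp",        # Direct Connect on-ramp
    "azure": "azure-onramp",      # ExpressRoute on-ramp
}

def select_path(app_category):
    # Unclassified traffic falls back to the central inspection point,
    # which is the conservative default for the security trade-off above.
    return POLICY.get(app_category, "datacenter-backhaul")
```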

When the Complexity Exceeds the Benefit

Multi-cloud networking is not free. Every component described in this article (transit gateways, cross-cloud mesh, DNS failover, IPAM, policy enforcement, dedicated interconnects) requires engineering time to build, maintain, and debug. The honest question, and it deserves an honest answer, is whether the business requirements justify that investment.

If you are running workloads in two clouds because of a genuine best-of-breed requirement (GCP for ML training, AWS for everything else), the networking investment is justified. If you are running workloads in two clouds because someone decided “we should not be locked in” without a specific migration scenario in mind, the networking complexity is pure cost with no corresponding benefit. Be direct about this in your architecture reviews.

The decision framework is concrete. Count the workloads that genuinely require a specific provider’s capabilities. If that number is fewer than three per secondary provider, a second region in your primary provider gives you geographic redundancy at a fraction of the networking complexity. Solid infrastructure architecture practice starts with this assessment before committing to cross-cloud networking infrastructure that carries permanent operational cost. The technology capabilities documented in multi-cloud and hybrid cloud engineering become relevant only after the business case clears this bar.
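The framework fits in a checklist function. The threshold of three differentiated workloads per secondary provider comes straight from this article; treat it as a heuristic to argue with in an architecture review, not a rule:

```python
# Decision heuristic from this article: count workloads that genuinely
# require a secondary provider's capabilities.
def recommend(differentiated_workloads_per_secondary):
    if differentiated_workloads_per_secondary >= 3:
        return "multi-cloud networking investment is justified"
    return "prefer a second region in the primary provider"
```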

Multi-cloud networking is a solvable engineering problem. The question was never whether you can connect your clouds. It is whether the business value justifies the permanent cost of keeping them connected.

Connect Your Clouds Without the Complexity Tax

Multi-cloud networking done wrong costs more in egress fees and operational overhead than running two separate environments. Metasphere engineers transit architectures, cross-cloud service mesh, and DNS-based failover that actually work at production scale.

Architect Your Network

Frequently Asked Questions

What is the latency penalty for cross-cloud traffic versus same-cloud?

Cross-cloud traffic over the public internet typically adds 10-30ms of round-trip latency compared to same-region, same-provider communication. Dedicated interconnects through Equinix or Megaport reduce that to 2-5ms by bypassing public peering. For latency-sensitive workloads, that 25ms difference compounds across chained service calls. A request touching 4 services adds 100ms of cumulative cross-cloud penalty on public internet versus 8-20ms on dedicated interconnect.

How do egress costs compare between VPN, dedicated interconnect, and cloud exchange?

All three incur per-GB egress charges from the source provider, typically in the range of a few cents per GB. The difference is in the transport cost itself. Site-to-site VPN uses the public internet, so transport is essentially free. Dedicated interconnects (AWS Direct Connect, Azure ExpressRoute) add a fixed monthly port fee plus per-GB data transfer charges. Cloud exchanges like Equinix Fabric or Megaport add a port fee and cross-connect charge but let you reach multiple providers from one physical connection, reducing total cost for multi-provider setups.

Can Istio service mesh span multiple cloud providers?

Yes. Istio multi-cluster federation supports connecting meshes across providers using an east-west gateway per cluster. The control planes remain independent, but workloads get cross-cluster mTLS and traffic management. The operational cost is real: you need consistent certificate authorities across providers, synchronized service discovery, and east-west gateways that add 1-3ms of latency per hop. Consul mesh gateway achieves similar cross-cloud connectivity with a simpler operational model but fewer L7 traffic management features.

How should IP address management work across multiple cloud providers?

Reserve non-overlapping CIDR blocks per provider before provisioning any infrastructure. A common pattern is assigning 10.0.0.0/12 to AWS, 10.16.0.0/12 to Azure, and 10.32.0.0/12 to GCP. Overlapping CIDRs between providers require NAT translation at every transit point, which breaks service discovery and makes tracing requests across clouds nearly impossible. Use a centralized IPAM tool like Infoblox or NetBox to enforce allocation rules.

When does multi-cloud networking complexity exceed the benefit?

When you have fewer than 3 genuinely differentiated workloads per provider and cross-cloud data movement exceeds 5 TB per month. At that point the egress costs, operational overhead of maintaining parallel networking stacks, and engineering time debugging cross-cloud routing issues typically exceed the value of the second provider. A second region in the same provider gives geographic redundancy at 20-30% of the operational complexity.