PARANTAP LAHIRI, VP, NETWORK AND DATACENTER ENGINEERING AT EBAY
Monolithic applications running on dedicated servers have evolved into cloud-native microservices that scale freely in the cloud. In the enterprise space, however, ‘networking’ has more often than not acted as a necessary evil that slows down progress and results in unplanned outages. Used properly, networks can quickly turn from a liability into an asset. Let’s take a deeper look.
Networking in the enterprise domain grew up to provide connectivity between office buildings and access to the Internet and to corporate services within data centers. Since many enterprise applications came as third-party software, physical networks had to facilitate and enforce segmentation and security needs. The networks were complex, inconsistent and fragile, with heavy dependence on in-house support staff as well as dedicated engineers from vendors.
Beyond configuration inconsistency and failures caused by a lack of change rigor, the primary factor contributing to this fragility was the inherent weakness of protocols like “spanning-tree” that were used to ensure a loop-free forwarding path in switching domains. Layer 2 switching domains were needed to support VLANs (Virtual Local Area Networks), which have been an integral part of most enterprise networks for purposes such as ensuring IP mobility and enforcing firewalls as the default gateways. These domains frequently suffered from broadcast storms that melted the networks when loops formed. Moreover, the loop-free requirement created topologies in which redundant links sat blocked, resulting in congestion.
Fast forwarding a little, when companies moved into delivering online services, many of the same enterprise networks were used to host those services. For online services that are expected to be available at least 99.99% of the time, these enterprise network designs have been a fundamental misfit.
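To put the 99.99% figure in perspective, the short sketch below converts an availability target into a yearly downtime budget. The function name and the 365-day year are assumptions made for illustration only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years (assumption)


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the maximum yearly downtime, in minutes, that a target allows."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Compare a few common availability targets.
    for target in (99.0, 99.9, 99.99):
        minutes = downtime_budget_minutes(target)
        print(f"{target}% available -> {minutes:.1f} minutes of downtime per year")
```

At 99.99% the budget works out to roughly 52.6 minutes per year, so a single prolonged network outage can use up the whole allowance.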
Now, to take this discussion one level deeper, let’s analyse the impact of a typical enterprise network on cloud-ready workloads first and then on cloud-native workloads.
Broadly speaking, cloud-ready workloads are monolithic applications that are packed into virtual machines (VMs) as opposed to the original bare metal servers. In environments with a strong dependency on their incumbent physical network, such cloud-ready VMs have simply slid in to replace physical servers. In many cases the VMs simply connect to the VLANs, and the overall online service depends on security and load-balancing services delivered through the underlay network. There have been continuous efforts to use network, firewall and load-balancing automation to cater to the agility needs of such services.
However, cloud-native workloads are challenging this very premise. By definition, cloud-native workloads are formed by decomposing applications into microservices. These microservices are automatically deployed, scaled and managed as containers through orchestration systems like Kubernetes. Now the question becomes: in an environment with a heavy dependency on incumbent physical network services, should cloud-native applications take any dependency on the services provided by the underlay network, or should technology leaders act cautiously and ensure that those dependencies are properly decoupled?
To get deeper into this discussion, it is important to understand how cloud-native workloads and orchestration systems have themselves evolved. Orchestration systems, along with other approaches like service mesh, are continually advancing to take care of many application needs beyond the simple placement of containers on appropriate nodes.
Firstly, they are facilitating a far more granular implementation of security controls that goes beyond simple enforcement on protocol type and ports. Secondly, they are facilitating advanced capabilities to balance and distribute workload sessions. These capabilities are implemented on the servers themselves in a highly scaled-out and well-managed way, so bringing some of those complexities back into the underlay network is somewhat redundant. To put it more bluntly, the traditional enterprise network controls should get out of the cloud-native way. They can certainly manage the legacy environments in a brownfield situation and enforce some base-layer security controls, but that’s where they should draw the line.
Now, to discuss the actual needs of cloud-native services: the workloads, along with supporting services like Hadoop, AI/ML and distributed storage, ideally want unlimited server-to-server east-west capacity from the network. They also need quick correlation of network issues and application issues while diagnosing service impairments.
Thus, in order to cater to the needs of cloud-native applications, many modern cloud-scale data centers have built dense-mesh Layer 3 routed networks using mainly simple and standardized protocols like BGP (Border Gateway Protocol). Typically the platforms used in these networks are based on commoditized chipsets, which provide high bandwidth at very reasonable cost points. Automation focus is placed on producing consistent build standards along with automated isolation and remediation of network degradation. Technology leaders have been guarding these networks from taking on unnecessary complexity to ensure that each domain delivers the right services for the right reasons. Interestingly, building the high-volume interconnect capacity has impacted the networking budget favorably and made the network a lot more robust.
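The capacity argument can be made concrete with a rough sketch. It assumes a hypothetical two-tier fabric in which every leaf switch has one uplink to every spine switch; the class name, switch counts and link speed are illustrative and do not describe any particular deployment. A spanning tree over N switches keeps exactly N - 1 links forwarding, whereas a routed fabric with equal-cost multipath can forward on every link.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeafSpineFabric:
    """Hypothetical two-tier fabric: each leaf has one uplink to each spine."""

    leaves: int
    spines: int
    uplink_gbps: float

    @property
    def physical_links(self) -> int:
        # One link per (leaf, spine) pair.
        return self.leaves * self.spines

    @property
    def spanning_tree_links(self) -> int:
        # A tree spanning N switches has exactly N - 1 edges; all other
        # links are blocked to keep the Layer 2 domain loop-free.
        return self.leaves + self.spines - 1

    def forwarding_gbps(self, routed: bool) -> float:
        # Routed (Layer 3, equal-cost multipath): every link forwards.
        # Spanning tree (Layer 2): only the links on the tree forward.
        links = self.physical_links if routed else self.spanning_tree_links
        return links * self.uplink_gbps


if __name__ == "__main__":
    fabric = LeafSpineFabric(leaves=16, spines=4, uplink_gbps=100)
    print("physical links:        ", fabric.physical_links)
    print("forwarding (L2 tree):  ", fabric.spanning_tree_links)
    print("capacity L2 tree, Gbps:", fabric.forwarding_gbps(routed=False))
    print("capacity routed, Gbps: ", fabric.forwarding_gbps(routed=True))
```

With these example numbers, 64 links are cabled but only 19 would forward under a single spanning tree, while the routed design can use all 64. Real designs vary (multiple spanning-tree instances and link aggregation recover some of the blocked capacity), so the figures are only indicative.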
To summarize, the journey to cloud-native infrastructure entails not only decomposing applications into microservices running on containers but also looking at the infrastructure in a holistic way. Entrusting the orchestration systems to manage the granular security and session controls, and letting the underlay network provide speeds and feeds, could be the winning formula.