In earlier blogs in this series, we covered data center architecture trends, network virtualization and overlays, traditional network automation, advanced SDN automation and data center fabric visibility. The objective of this blog is to discuss how these tools and technologies are pulled together to deliver an agile and resilient data center network infrastructure for enterprise private cloud deployments.
Private Cloud is the Dominant Cloud
Digital transformation of the enterprise has been underway for the last decade and has only accelerated during the pandemic. Research completed in May 2021 reveals some surprising data given the conventional wisdom around the rapid shift to public cloud: only 25 percent of enterprise workloads will move to the public cloud in the next two years, with approximately 75 percent staying on-premises or in a private hosted cloud. Of these non-public cloud workloads, approximately 50 percent will stay on-premises, where on-prem is defined either as a data center owned by the operator, whether an enterprise or a service provider, or as compute and storage deployed in a colocation facility.
Both enterprises and service providers want their on-prem data centers to act like the public cloud. They want to be able to spin up workloads and deploy network services with the same ease as the public cloud, move workloads around to improve performance or to better utilize resources, and, with increasing importance, to have the same concept of availability zones to ensure application availability. In this blog, however, we’ll focus on the enterprise and, in particular, active-active data center architectures to support private cloud.
The recent EMA Research State of DC Network Annual Report 2021, which surveyed more than 260 enterprises in North America and EMEA [warning – form fill required!], found that over 80 percent of IT teams intend to move to an active-active data center architecture in the next two years. With customers interacting digitally with enterprises, applications and data have become the lifeblood of most companies. Outages can be very costly, as shown in the results from the Uptime Institute survey below.
Many enterprises now classify their applications into different levels of criticality to determine which can tolerate a certain amount of downtime and to set the related recovery time objective (RTO – how much time it takes to be back up and running) and recovery point objective (RPO – how much data can be lost) for each application. More and more applications are being moved into the mission-critical or mission-imperative category, hence the drive for active-active or active-hot standby data centers.
Generally, active-active data centers leverage multi-site architectures and replicate data between sites synchronously, while active-hot standby data centers replicate data between sites synchronously or near-synchronously. In an active-active scenario, workloads can run in either data center, and demand is load balanced across the two sites. Should one of the data centers fail, all workloads are pointed at the remaining data center. These active-active data centers are often in close proximity, within the same metro. As a result, budget permitting, a third disaster recovery site is often deployed outside of “the blast zone” of a natural or man-made catastrophe.
One of the drawbacks often cited for active-active architectures is that they require duplicate resources at both sites for any mission-critical application: if one data center fails, the remaining site needs enough capacity to carry the full load. However, many believe this cost now pales in comparison to the revenue loss and reputational damage following a major outage.
The active-hot standby approach can often be more cost effective because the hot standby can serve as a disaster recovery site that can be rapidly activated. All apps run in the primary site, but backup apps are ready to go with recently synchronized data in the hot standby site. If a failure occurs at the primary site, a simple IP address advertisement change by the backup site’s firewall is all that’s required to point all workloads to the hot standby site.
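To make that mechanism concrete, here is a minimal sketch of what the advertisement change might look like in FRR-style BGP syntax. The ASN and prefix are illustrative assumptions, and a real deployment would typically automate this step or use conditional route advertisement rather than a manual change:

```
! On the hot standby site's border router/firewall (illustrative values).
! Enabling this network statement advertises the application prefix,
! drawing all client traffic to the standby data center.
router bgp 65200
 address-family ipv4 unicast
  network 203.0.113.0/24
```

Because only the routing advertisement changes, the backup applications and their synchronized data can begin serving traffic within minutes.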
Both of these scenarios can be contrasted with disaster recovery approaches that will typically take hours, if not days, to get applications up and running. Thus the drive for more resilient data center architectures that mimic the availability zones found in public cloud.
Virtualized Network Overlay Fabrics Help a Lot
The aforementioned EMA Research State of DC Network Annual Report also found that the top impediments to moving to an active-active DC architecture are network architecture complexity and network operations complexity. A well-designed virtualized network overlay can knock down these network challenges.
As discussed in the first blog, network fabrics really need to be thought about in two dimensions. The first is the underlay fabric, where the servers and storage are physically connected to high-performance data center top-of-rack (aka leaf) switches, which in turn connect to the spine switches. The underlay is configured with simple BGP unnumbered peering for simplicity, scale and resiliency.
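As a rough illustration of why BGP unnumbered keeps the underlay simple, the sketch below uses FRR-style syntax; the ASN, router ID and interface names are illustrative, and vendor CLIs, including Netvisor ONE, will differ:

```
router bgp 65101
 bgp router-id 10.0.0.1
 ! BGP unnumbered: peer over each link's IPv6 link-local address,
 ! so no per-link IP addressing plan is required
 neighbor swp1 interface remote-as external
 neighbor swp2 interface remote-as external
 address-family ipv4 unicast
  redistribute connected
```

Because every fabric link uses the same two-line peering statement, adding a leaf or spine is largely a copy-and-paste exercise, which is what makes the underlay easy to scale.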
The second dimension is the virtualized network overlay fabric that abstracts a logical network from the physical network. This abstraction enables network operations, or even application teams, to deploy network services quickly, with agility, at the speed of cloud. To better understand these various network overlay approaches, check out the blog, What to Know About Data Center Overlay Networks.
The virtualized overlay fabric enables the stretching of layer 2 (and layer 3) services across data centers to support storage protocols like vSAN that prefer layer 2 adjacencies. It also supports workload mobility: a workload can move from one data center to another without changing its default gateway IP address and without any service disruption. The overlay delivers the benefits of layer 2 services, but over a segmented and scalable layer 3 underlay – eliminating the issues of directly connecting two or more data centers with a layer 2 underlay and the associated challenges of spanning tree loops and larger failure domains. You can learn some more about this by watching the on demand webinar: Data Center Architectures for Amazing Application Availability [warning: form fill required].
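The “no default gateway change” property typically comes from a distributed anycast gateway: every leaf in both data centers hosts the same gateway IP and virtual MAC, so a workload sees an identical gateway wherever it lands. A sketch in Cumulus Linux-style syntax, where the addresses and the virtual MAC are illustrative assumptions:

```
# SVI for VLAN 10, configured on every leaf in both data centers
# (only the physical SVI address differs per leaf).
# The workload's default gateway (10.1.10.1) is local everywhere it moves.
auto vlan10
iface vlan10
    address 10.1.10.2/24
    address-virtual 00:00:5e:00:01:0a 10.1.10.1/24
    vlan-id 10
    vlan-raw-device bridge
```

Since the gateway is always one hop away on the local leaf, a migrated workload resumes forwarding immediately instead of hairpinning traffic back to its original data center.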
There are many different ways to build and automate data center network fabrics. One approach is to use BGP EVPN [Ethernet virtual private network], which is a protocol-based approach to building a network fabric. The challenge with BGP EVPN is the need to configure tens of lines of CLI on every switch in the fabric, one by one, to deploy a single service. For example, deploying a new VRF or VLAN across a 256-switch fabric takes roughly 5,000 commands with BGP EVPN. This blog provides more detail on BGP EVPN if you are interested.
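To see where that command count comes from, here is a sketch of the kind of per-switch configuration a single layer 2 service requires in an EVPN fabric, shown in Cisco NX-OS-style syntax; the VLAN, VNI and interface values are illustrative assumptions:

```
! Repeated on every switch in the fabric for each new service:
! map the VLAN to a VXLAN VNI
vlan 10
  vn-segment 10010
! attach the VNI to the VTEP interface
interface nve1
  member vni 10010
    ingress-replication protocol bgp
! EVPN route distinguisher and route targets for the VNI
evpn
  vni 10010 l2
    rd auto
    route-target import auto
    route-target export auto
```

At ten to twenty lines per switch, a 256-switch fabric quickly reaches the thousands of commands cited above for a single fabric-wide service.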
Alternatively, as described in this blog, one can take an SDN approach and build automation inherently into the underlay and overlay fabric solution. There are only two available SDN solutions for overlay fabrics – Cisco ACI and the Pluribus Adaptive Cloud Fabric™ powered by Netvisor® ONE. As an example, with Pluribus, using the Adaptive Cloud Fabric to deploy a new service fabric-wide takes just one or two commands, even for a 256-switch fabric. Compared to the 5,000 commands mentioned above, the automation is very meaningful from a cost and resource perspective.
Supporting Private Cloud
Networking is really hard. It’s the third leg of the data center triad, alongside compute and storage, but it has struggled to be virtualized and automated because it’s so complicated. Unfortunately, legacy incumbent vendors have maintained a stronghold on the industry, and it remains in their best interests to keep networking complicated.
Knocking down the networking problem with a virtualized overlay automated via SDN is a very viable approach to achieving an active-active data center that supports workload mobility and full failover in seconds or, at most, a few minutes.
Again, Pluribus is one of just two companies that provide an SDN-automated underlay and overlay network fabric, which can make an active-active or active-hot standby data center architecture substantially easier to deploy and manage. In fact, this case study on the deployment of an active-active DC architecture by Creval, a large banking institution in Italy, details the company’s sub-four-minute failover achieved by leveraging the Pluribus Adaptive Cloud Fabric.