Published by Alessandro Barbieri

Today, Pluribus launched the Unified Cloud Networking architecture, which aims to transform the way CSPs, telcos, and enterprises build and operate cloud networks, delivering radical operational simplification, distributed security services integrated into the network, and significantly lower total cost of ownership (TCO) than existing solutions.

In this blog I discuss the networking and security challenges cloud operators face, then describe how the Pluribus Unified Cloud Fabric™ addresses these challenges with a holistic approach to cloud networking spanning both the switching fabric and the compute virtualization fabric. I then explain how the Pluribus Netvisor® ONE network operating system (OS) integrates with the NVIDIA® BlueField® data processing unit (DPU) hardware architecture to deliver a Unified Cloud Fabric across any workload environment (including ESXi, Hyper-V, Xen, KVM, bare metal, and Kubernetes), provide a zero-trust administration model between compute and network, and radically simplify the networking stack running on the server OS, with better overall performance and lower TCO. Finally, I review the initial set of use cases Pluribus is delivering with the Early Field Trial (EFT) program starting next month.

The State of Cloud Networking: A Tale of Many Fabrics

Outside of the largest public cloud providers, which have the resources to develop highly integrated, closed cloud networking and security solutions, most cloud operators assemble their infrastructure by integrating multiple compute virtualization, networking, and security products from multiple vendors.

When it comes to networking and security, the cloud infrastructure is a composition of many disjointed networking and security products and architectures. On one end is the physical switch fabric with its own control plane protocols and orchestration tools, and on the other end are one or more server-based virtual fabrics, each operating a different control plane and orchestration tool depending on the hypervisor of choice.

The switch fabric is typically expected to deliver reliable, high-bandwidth connectivity, along with macro segmentation capabilities for tenants and applications. Most of the high-value microsegmentation and security services for east-west traffic are pushed deep into the compute-based virtual fabrics.

While data center switch fabrics are designed around well-understood principles, namely leaf-spine topologies and open, multi-vendor, interoperable IP protocols, server-based network virtualization solutions could not be more incoherent: siloed and confined to a single specific virtualization stack. Hypervisors such as VMware ESXi, Microsoft Hyper-V, Xen, and KVM each run their own dedicated virtual networking solution, with non-interoperable orchestration tools and control planes. As a result, cloud builders supporting multiple virtualization stacks face the daunting task of integrating multiple ad-hoc networks, each supporting different services that cannot easily extend outside the confines of a specific hypervisor.

To make matters even more convoluted, supporting cloud-native workloads based on Kubernetes takes the networking challenge to yet another level of complexity: the Kubernetes administrator has no option but to fire up yet another networking fabric for pod-to-pod communication inside the Kubernetes cluster. Moreover, most Kubernetes clusters are deployed on top of virtual machines (VMs), creating a network for containers nested on top of the network for VMs, which in turn runs on top of a physical fabric. These network fabrics, "stacked" on top of each other, operate as "ships in the night," with no ability to coordinate with one another.

This is clearly highly inefficient and a recipe for chaos and complexity. (For more on the challenges of delivering Kubernetes on VMs, I suggest this excellent article.)

To add further challenges, firewall policies and microsegmentation services for east-west traffic are generally delivered with virtual or physical firewall appliances, each with its own separate control plane, "bolted on" to the physical and virtual fabrics. This approach leads to a sprawl of security appliances, which adds more orchestration points and drives up the cost and complexity of the cloud infrastructure.

Moreover, physical security appliances create a highly inefficient "traffic tromboning" effect by forcing all traffic to be diverted to centralized appliances for security inspection before heading to its destination. Virtual appliances, while distributed throughout the server infrastructure, consume compute resources that would otherwise run application workloads.

Both physical and virtual appliance models are highly inefficient, and they increase the cost of operating the cloud infrastructure by tens of thousands of dollars per rack.

A more integrated solution such as VMware's NSX microsegmentation with distributed firewall brings more operational efficiency; however, its license cost and limited support for other hypervisors remain a barrier for many cloud operators.

In summary, delivering end-to-end cloud services across an infrastructure that is a patchwork of many network fabrics and point security appliances is a major operational challenge for most organizations. 

The problems don’t end with service delivery. Application outages force multiple teams to engage to determine whether the network (which one??), the infrastructure services (e.g., DNS), or the application is to blame for the outage. This tweet makes the point with good humor: 

If we look holistically at networking and security in the cloud, it is truly a tale of many fabrics operating like “ships-in-the-night.”

Pluribus Unified Cloud Networking Architecture

The Pluribus Unified Cloud Networking architecture is a holistic approach to cloud networking encompassing both server-based and switch-based networking. The architecture unifies both Ethernet top of rack (ToR) switches and DPUs into a Unified Cloud Fabric with a single control plane and management plane powered by a single network OS, Netvisor ONE.

The objective of the Unified Cloud Networking architecture is to offer a solution that is the most efficient, the most elegant and the simplest to operate for cloud networking and security services.

Until now, DPU technology was the missing link needed to enable the Unified Cloud Networking architecture. Thanks to the DPU, and specifically NVIDIA’s BlueField DPU architecture, it is now possible to consolidate many networks into one common fabric spanning compute and the physical network.

Unlike traditional x86 server-based networking solutions, the DPU hardware architecture enables a clear demarcation between compute and network, because the network OS is isolated from, and managed independently of, the x86 host.

The DPU hardware also unlocks the ability to integrate security services into the fabric, thereby eliminating the sprawl of security appliances for east-west traffic, one of the main drivers of cost and complexity in most cloud infrastructures.

Finally, the DPU hardware accelerators free up at least 20% of the compute resources otherwise held captive by networking and security tasks, boosting both the efficiency and the performance of the overall cloud infrastructure.

The Pluribus Unified Cloud Fabric is the industry’s only solution that tackles cloud networking holistically from the server to the physical network and elegantly addresses the many challenges facing cloud builders:

  1. Unify incoherent network operational domains across multiple hypervisors, Kubernetes clusters, and bare metal workloads, with a single unified network overlay, a single unified control plane, a single unified programmatic API, and one consistent set of cloud services.
  2. Maximize investment protection with a switch+DPU fabric capable of including both DPU-based endpoints and devices that cannot integrate DPUs (e.g., IoT devices, bare metal databases, or simply legacy servers).
  3. Simplify the network stack running on the server/x86 CPU by offloading all switching and routing capabilities to the DPU. The DPU becomes the new “ToR” for VMs and containers.
  4. Enforce a clear zero-trust demarcation between compute and network, thanks to the ability to manage the DPU completely independently of the x86 host.
  5. Distribute security services without an explosion in cost and complexity. Microsegmentation and distributed firewall services for east-west traffic are embedded in the fabric and distributed close to the workloads, eliminating the typical sprawl of virtual or physical firewall appliances required to segment and secure east-west traffic for tenants and applications.
  6. Operate the network with the agility of a public cloud. The overlay services are abstracted from the physical topology, “protocol-free,” with a simple end-to-end provisioning model. Just like in a public cloud, a single command can create a virtual private cloud (VPC)-like construct that isolates tenant traffic across hundreds of DPUs or switch devices.
  7. Boost efficiency and performance. All networking and security services are hardware accelerated, with much better performance than running on a server CPU. Even more importantly, offloading networking and security services to the DPU returns about 20% of server resources to the cloud operator, boosting efficiency and lowering the cost of operating the infrastructure.

How does Netvisor ONE integrate with NVIDIA BlueField DPUs?

The hardware architecture of the BlueField DPU, and the way Netvisor ONE integrates with it, are key to many of the value propositions of the Pluribus Unified Cloud Networking solution.

Almost all network security products integrating with DPUs adopt a “half-in” model: the control and management planes of the application (e.g., a firewall) still run on the x86 host, and the DPU is used for selective offloads.

With Netvisor ONE, on the other hand, Pluribus adopted an “all-in” model: the entire Netvisor ONE stack runs on the DPU, and no code is placed on the hypervisor/x86 CPU. At a glance this may seem like a minor technical nuance, but it is the secret sauce for achieving true zero-trust administration and the cleanest possible demarcation between compute and network.

Let’s now review the three key aspects of Netvisor ONE’s integration with the BlueField DPU architecture.


First, hardware isolation: Netvisor ONE runs completely isolated on the BlueField DPU’s Arm processor, so there is zero Pluribus code (drivers or agents) on the x86 host CPU. As a result, the Unified Cloud Fabric solution is completely agnostic to the host virtualization layer.

Moreover, the “ToR-on-a-NIC” is managed completely independently of the server OS, thanks both to the external out-of-band management interface and to Netvisor ONE’s ability to use the in-band ports for management and control plane traffic.

These capabilities allow the DPU to enable a zero-trust demarcation between compute and network. 

Second, simplification of the server OS network stack: the host OS can easily be configured (with SR-IOV or virtio) to connect VMs directly to virtual interfaces on the DPU, eliminating any complex networking on the server hypervisor. All networking and security services move off the server CPU and into the network edge implemented on the DPUs.
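As a rough illustration of what the host-side SR-IOV configuration looks like on a Linux server, the sketch below creates virtual functions (VFs) on the DPU’s physical function via the standard sysfs interface so they can be handed directly to VMs. The interface name (ens1f0) and VF count are assumptions; adjust them for your system, and note that this requires root privileges and SR-IOV-capable hardware.

```shell
# Create 4 SR-IOV virtual functions on the DPU's physical function
# (interface name ens1f0 is an assumption; yours will differ)
echo 4 > /sys/class/net/ens1f0/device/sriov_numvfs

# Verify the VFs were created
ip link show ens1f0

# Pin a MAC address to VF 0 before passing it through to a VM
ip link set ens1f0 vf 0 mac 02:00:00:00:00:01
```

Each VF then appears to the hypervisor as an independent PCIe device that can be attached to a VM, so guest traffic lands directly on the DPU with no software switching on the host.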

Third, hardware acceleration: the DPU provides hardware accelerators for both networking and security services. Netvisor ONE programs the BlueField DPU hardware with a complete forwarding pipeline replicating the features of a modern data center Ethernet switch, so that virtually every L2/L3/security/QoS construct can take advantage of the BlueField DPU’s hardware accelerators.
Thanks to these accelerators it is possible both to achieve greater performance (compared to running the network functions on an x86 CPU) and to reduce server CPU core utilization by about 20%.

Unified Cloud Networking Use Cases

The DPU architecture unlocks numerous new possibilities to expand the range of Netvisor ONE and Unified Cloud Fabric applications to cloud networking and security.

The first release of Netvisor ONE that enables these new Unified Cloud Fabric capabilities focuses on three use cases:

  1. Multi-hypervisor, bare-metal unified cloud networking
  2. Microsegmentation with distributed firewall services
  3. Distributed visibility services at the server edge

Pluribus Unified Cloud Networking in Perspective

While many networking and security technologies can be successfully implemented in a cloud network given enough skill, time, and money, the operational complexity and cost of operating cloud networks remain a huge challenge for most organizations.

The Pluribus Unified Cloud Networking architecture presents the first holistic solution to the major operational fragmentation challenges and cost drivers impacting the cloud networking and security infrastructure.

The Unified Cloud Fabric, built on the Netvisor ONE NOS, is currently the only product in the industry powering both the physical switch and DPU-based server overlays, while maintaining a zero-trust demarcation between compute and network.

Today Pluribus is redefining cloud networking.