The challenges faced by OpenFlow and SDN
This is the third and final article in this series (Click for Part I and Part II). As promised, let’s look at some of the challenges facing this space and how we are addressing them. Along the way, we also look in more detail at the role of controllers in a network fabric.
Challenge 1 – Which is why Nicira had to get a big partner
I see plenty of articles talking about Nicira being acquired. But there’s one question that no one seems to be asking: if the space is so hot, why did Nicira do the deal so early? The deal size ($1.26B) is not chump change, but if I were them and my stock was rising, I would have held off and kept pushing to change the world. So what was the rush? I believe the answer lies in some of the issues I discussed in Part II of this series: the difference between a server-based and a switch-based approach! The Nicira solution was very dependent on the server and the server hypervisor. The world of server operating systems and hypervisors is so fragmented today that staying independent would have been a very uphill battle for them. So tying up with one of the biggest hypervisor vendors made sense. I believe they did the smart thing to ensure that the technology can keep moving forward. Undoubtedly, VMware is a good, technology-driven company. The moot question now is how long before the VMware/EMC and Cisco relationship comes unhinged?
Challenge 2 – Why Controllers? Or the divide between Control Plane and Data Plane
The current promise of having a standard way of controlling networking and a controller that is platform independent is huge. It provides simplified network management and rapid scale for virtual networks. Yet, the current implementations have become problematic.
Since the switches are dumb and do not have a global view, the current controllers have turned into policy enforcement engines as well. Setting up a new flow requires the controller’s approval, which means every flow now has to go through the controller, which in turn instantiates it on the switch. This, however, raises several issues:
- A controller, which is essentially an application running on a server OS over a 10 Gbps link (with a latency of tens of milliseconds), is in charge of controlling a switch which, in turn, could be switching 1.2 Tbps of traffic at an average latency of under a μs and dealing with 100k flows, with ~30% of them being set up or torn down every second. To put things in perspective, a controller takes tens of milliseconds to set up a flow, while the life of a flow transferring 10 Mb of data (a typical web page) is 10 msec!
- To deal with 100k flows, the switch ASICs need to have that kind of flow capacity. The current (and coming) generation of ASICs has nowhere near such capacity, so one can only use the flow table as a cache. This brings us to the third issue.
- Flow setup rate is anemic at best on the existing hardware. You are lucky if you can get 1000 flows per second.
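To see how far apart these numbers are, here is a back-of-the-envelope sketch in Python. It uses only the figures quoted in the list above. The variable and function names are made up for illustration, and treating every setup or teardown as one flow-table update is a simplifying assumption, not a measurement.

```python
# Figures quoted in the list above; everything else is illustrative.
ACTIVE_FLOWS = 100_000        # flows the switch is dealing with
CHURN_PERCENT = 30            # ~30% set up or torn down every second
HW_SETUP_RATE = 1_000         # flow setups/second on existing hardware ("if you are lucky")
CONTROLLER_SETUP_MS = 10      # low end of "tens of milliseconds"
FLOW_LIFETIME_MS = 10         # lifetime of the short flow in the example


def table_updates_per_second(active_flows: int, churn_percent: int) -> int:
    """Flow-table updates needed per second (assumes one update per setup or teardown)."""
    return active_flows * churn_percent // 100


def shortfall_factor(required_rate: int, available_rate: int) -> float:
    """How many times faster the hardware would need to be."""
    return required_rate / available_rate


required_updates = table_updates_per_second(ACTIVE_FLOWS, CHURN_PERCENT)
shortfall = shortfall_factor(required_updates, HW_SETUP_RATE)
setup_vs_lifetime = CONTROLLER_SETUP_MS / FLOW_LIFETIME_MS

print(f"Required flow-table updates/sec: {required_updates}")
print(f"Hardware shortfall factor:       {shortfall:.0f}x")
print(f"Setup time / flow lifetime:      {setup_vs_lifetime:.1f}")
```

Under these assumptions the switch would need about 30,000 table updates per second against roughly 1,000 available, and even at the low end of the controller latency range, setting up a flow takes as long as the flow itself lives.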
So what is lacking is a Network Operating System on the switch to support the controller app. If you look at the server world, the administrator specifies the policies and it’s the job of the OS (which works very closely with the hardware) to enforce them. The current scenario looks like an application running on bare metal with no operating system support. Since this is a highly specialized application, it needs a specialized operating system: a Network Operating System, which can also be virtualized.
Challenge 3 – The Controller-based Network
For a while, people were just tied to their inflexible networks, which didn’t see any innovation in the last two decades while servers and storage went through a major metamorphosis. That frustration gave birth to OpenFlow/SDN, which has currently morphed into a controller mania. Separating the brain from the body creates something of a split-brain problem, since the body (the switch, in this case) still needs some brain of its own. What we need is a solution that encompasses the entire L2 fabric, where the controller and the fabric work as one while providing easy abstractions for users to meet their virtualization, SLA and monitoring needs.
A Distributed Network Hypervisor or Netvisor to the rescue
So what we (at Pluribus) saw early on is that the world of servers is a very good example. The commoditization of ASICs and the move of value to software are pretty much what’s happening in the world of storage, and are bound to happen in the world of networking. So we decided to do things in the right order, i.e., get the bleeding-edge commodity ASICs and create a Network Operating System with the following properties:
- Network OS – Designed to manage these ASICs which are very specialized and powerful.
- Distributed – Networks have more than one switch, working in tandem to support end-to-end flow.
- Virtualized – Ability to run both Physical and Virtual Networking applications. As I mentioned before, the switch is not the network. We need to deal with all network services in physical and virtual form and the network OS needs to support that.
Hence we created a Distributed Network Hypervisor called Netvisor™, a key component of the nvOS™ operating system. It is designed to run on the network switches and supports both physical and virtual network services. It also runs a controller, where the controller is a policy distribution engine and no longer a policy enforcement engine.
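The difference between a policy enforcement engine and a policy distribution engine can be sketched with a toy model in Python. This is not Netvisor or OpenFlow code. Every class and method name below is hypothetical, and the "policy" is reduced to a set of allowed destination ports purely for illustration. The sketch counts how often a switch has to talk to the controller in each model.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Policy:
    """Toy policy: a flow is permitted if its destination port is allowed."""
    allowed_ports: frozenset

    def permits(self, dst_port: int) -> bool:
        return dst_port in self.allowed_ports


class Controller:
    """Hypothetical controller that counts how often switches contact it."""

    def __init__(self, policy: Policy) -> None:
        self.policy = policy
        self.round_trips = 0

    def decide(self, dst_port: int) -> bool:
        """Enforcement model: the controller rules on each new flow."""
        self.round_trips += 1
        return self.policy.permits(dst_port)

    def distribute(self, switch: "PolicyAwareSwitch") -> None:
        """Distribution model: push the policy to the switch once."""
        self.round_trips += 1
        switch.local_policy = self.policy


class ReactiveSwitch:
    """Dumb switch: every flow-table miss goes to the controller."""

    def __init__(self, controller: Controller) -> None:
        self.controller = controller
        self.flow_table: Dict[int, bool] = {}

    def admit(self, flow_id: int, dst_port: int) -> bool:
        if flow_id not in self.flow_table:
            self.flow_table[flow_id] = self.controller.decide(dst_port)
        return self.flow_table[flow_id]


class PolicyAwareSwitch:
    """Switch with a local OS that enforces a distributed policy itself."""

    def __init__(self) -> None:
        self.local_policy: Optional[Policy] = None
        self.flow_table: Dict[int, bool] = {}

    def admit(self, flow_id: int, dst_port: int) -> bool:
        if self.local_policy is None:
            raise RuntimeError("no policy has been distributed to this switch")
        if flow_id not in self.flow_table:
            self.flow_table[flow_id] = self.local_policy.permits(dst_port)
        return self.flow_table[flow_id]


policy = Policy(allowed_ports=frozenset({80, 443}))
# 1000 new flows, alternating between an allowed and a disallowed port.
flows = [(i, 80 if i % 2 == 0 else 23) for i in range(1000)]

enforcing_controller = Controller(policy)
reactive = ReactiveSwitch(enforcing_controller)
reactive_results = [reactive.admit(fid, port) for fid, port in flows]

distributing_controller = Controller(policy)
local = PolicyAwareSwitch()
distributing_controller.distribute(local)
local_results = [local.admit(fid, port) for fid, port in flows]

print("enforcement round trips: ", enforcing_controller.round_trips)
print("distribution round trips:", distributing_controller.round_trips)
```

Both switches reach the same admit/deny decisions. In this toy model the reactive switch contacts the controller once per new flow, while the policy-aware switch contacts it only when the policy is pushed, which is the scaling argument made above.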
As shown in the left half of the above figure, the current dividing line between the control plane and the data plane is not going to scale or perform. The line we originally drew (and the founding principle of Pluribus Networks), shown in the right half, is what needs to be delivered for SDN and OpenFlow to deliver on their true promise.
