In the last blog I provided a review of data center network virtualization approaches with a focus on the pros and cons of switch-based versus compute-based (host based) overlays and the control plane approaches of SDN and BGP EVPN. The goal of this blog is to take a deeper look at data center network automation including the underlay and overlay networks. As it turns out, the control plane will continue to play a role in the automation discussion.
Fundamental goals of data center network automation are to:
- Increase agility to speed service delivery to internal and external customers
- Reduce human error and increase uptime and application availability
- Improve security with consistent policy deployment and enforcement
- Free the Network Operations (NetOps) team from highly manual and repetitive box-by-box tasks
There is no question that networking is the most challenging leg of the compute, storage and network triad when it comes to virtualization and automation. The complexity of networking, and the risk of taking down one or more server clusters, data center pods or even entire data centers, has been so high that rigid processes have been put in place that favor uptime over service agility. The chart below, based on an analysis of major outages from 2016 to 2018 by the Uptime Institute, a subsidiary of 451 Research, shows that networking is responsible for 25% of all major data center outages.
With the rise of public cloud, IT teams have realized that they need their on-premises data centers to operate with the same level of agility and speed to service as the public clouds. Thus, a software-defined data center, where compute, storage and networking are virtualized, automated and orchestrated to deliver a private cloud, is an imperative.
What is Network Automation and Specifically Data Center Network Automation?
Network automation is the use of software to automate the many tasks required to provision (Day 0) and manage (Day 1, Day 2) network services, in order to increase speed to service, continuously maximize network efficiency, improve network visibility and reduce the manual workload on NetOps teams. Data center network automation is often used in conjunction with network virtualization.
As we talked about in the first blog, modern data centers are typically deployed in a leaf-spine topology, which provides a three-stage Clos fabric where every leaf switch is a single spine hop away from every other leaf switch, delivering consistent bandwidth and performance to support the increasing amount of server-to-server (app-to-app) traffic (also known as East-West traffic). This fabric can be an underlay only, or it can include an overlay as well, where a logical set of VXLAN tunnels is deployed on top of an IP underlay to provide a virtualized data center fabric. These fabrics can range in size from 4 leaf and 2 spine switches to hundreds of leaf switches and multiple spine switches. Every time a new service needs to be deployed across the entire fabric, each of these switches requires a configuration change. In pursuit of cloud operating models, IT departments today seek speed, agility, and consistency in provisioning and managing bare-metal, virtualized (VMs) and cloud-native (containers) applications in their on-prem data centers. A modern network automation platform can achieve these goals by automating networking functions across the entire fabric, such as deploying new network security policies or new network services such as VLANs, subnets or virtual routing and forwarding instances (VRFs).
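To make the box-by-box burden concrete, here is a minimal sketch (with made-up switch names and vendor-neutral CLI syntax, not any specific platform's) of what a single new VLAN implies for even a small 4-leaf, 2-spine fabric: the same change rendered once per switch.

```python
# Sketch: deploying one new service (a VLAN) on a leaf-spine fabric
# means touching every switch. Switch names and CLI syntax are
# illustrative, not any particular vendor's.

def vlan_config(vlan_id: int, name: str) -> list[str]:
    """Render the CLI lines for one VLAN on one switch."""
    return [
        f"vlan {vlan_id}",
        f" name {name}",
        " exit",
    ]

# A small fabric: 4 leaves + 2 spines -> 6 separate config pushes.
fabric = [f"leaf{n}" for n in range(1, 5)] + [f"spine{n}" for n in range(1, 3)]

change_set = {switch: vlan_config(100, "web-tier") for switch in fabric}

print(f"{len(change_set)} switches need the same {len(change_set['leaf1'])}-line change")
```

Scale the leaf count to hundreds and the motivation for fabric-wide automation is obvious.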
Approaches to Network Automation
There are different categories of data center automation that will be examined in this blog:
| Category | Examples |
| --- | --- |
| General Linux-based scripting languages | Python, Bash/Shell, Perl, Ruby |
| Infrastructure automation frameworks | Ansible, Puppet, Chef, SaltStack |
| 3rd-party external automation solutions | Apstra, SolarWinds, Gluware |
| Vendor-based external management solutions | Cisco Prime, Arista CloudVision, Aruba Fabric Composer, Dell Smart Fabric Director |
| Software Defined Networking (SDN) for DC fabrics | Cisco ACI, Pluribus Adaptive Cloud Fabric, VMware NSX* |
*NSX is a control plane for the overlay only whereas Cisco ACI and Pluribus ACF provide control plane automation for the underlay and overlay. Read blog #2 in this series for a deeper dive.
Linux-based Scripting Tools
The traditional method of operating a network has been to perform configuration tasks by logging into each switch, using the CLI and typing in or cutting and pasting a text-based configuration file. The automation approach to improve on this has been to develop scripts that deploy these updates via the CLI, or through an API such as a REST API, to execute changes across multiple switches. This automation still happens switch by switch, but the script is set up to execute across multiple switches in sequence. The idea behind using Linux-based scripting languages is that IT teams have been automating the deployment of multiple servers with these tools for years and this skill set can be leveraged.
There are a number of Linux-based languages that can be used to execute this scripting such as Python, Bash/Shell, Perl, Ruby and a number of others. For example, Linux® operating system administrators can use Bash operators to chain events based on previous commands’ successes (&&) or failures (||). Or, users could compile command lists into text files—known as shell scripts—that can be repeatedly carried out all at once with a single execution command.
Python has become the go-to programming language for network automation. Its popularity is the result of a snowball effect: Python caught on early as a programming language for network automation, has a reputation for being easy to learn, and can claim a growing number of network automation libraries. For example, two projects that have helped automate configuration management, NAPALM and Netmiko, are both Python-based. The bottom line: most scripting for network automation is done using Python.
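A hedged sketch of this sequential, script-driven pattern, using only the standard library: with Netmiko, the stubbed push step below would be a `ConnectHandler(...)` session plus `send_config_set(...)`; the management IPs and commands are made up for illustration.

```python
# Sketch of the script-driven, switch-by-switch pattern described above.
# With Netmiko, push_config() would open a ConnectHandler session and
# call send_config_set(); here it is stubbed so the sequential control
# flow is visible. IPs and commands are hypothetical.

SWITCHES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # made-up management IPs
COMMANDS = ["vlan 200", "name db-tier"]

def push_config(host: str, commands: list[str]) -> bool:
    """Stub for the per-device push (Netmiko/NAPALM would go here)."""
    print(f"{host}: applied {len(commands)} lines")
    return True

results = {}
for host in SWITCHES:  # one switch at a time, in sequence
    results[host] = push_config(host, COMMANDS)

failed = [h for h, ok in results.items() if not ok]
print("all switches updated" if not failed else f"retry needed: {failed}")
```

Even this toy version shows why scripting alone gets painful: the script owns sequencing, error tracking and retries, and must be revised whenever the device CLI changes.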
One of the challenges of scripting is that teams can run into script sprawl, version control issues, unauthorized personnel building and deploying scripts, and a lack of transparency across the IT team. This has given rise to automation frameworks that help manage scripts. Automation framework products can consolidate network tasks into prepackaged programs that can be selected, scheduled and executed from the app's front end, and they include role-based access control and version control.
The Linux tools with the most buzz have historically been Ansible, Puppet, Chef and SaltStack. All of these tools have their roots in server automation. Similar to scripting, the driver behind this approach is the thinking that if the IT team is already provisioning hundreds of servers with one of these tools, it sure would be nice to extend those tools to provision tens of data center switches.
Puppet and Chef
Puppet and Chef received a lot of buzz, primarily due to vendor marketing. However, both are very feature-limited; for example, Puppet only manages switch interfaces and VLANs, and neither has support for virtualized overlays. Additionally, Puppet and Chef require an agent to be deployed on the target device, and they depend on Ruby, which has not seen much traction among NetOps teams. That is fine for a server but not so easy or sensible on a data center switch, and neither Puppet nor Chef is event-driven or data-driven. The bottom line is that Puppet and Chef never really became mainstream for network automation.
SaltStack, or ‘Salt,’ is another open-source automation tool that has predominantly been used for server automation. In fact, it had no network automation features before 2016, so it is a more recent addition to the spectrum of tools available. Salt is also agent-based, so it has the same issues as Puppet and Chef. To address this, proxy minions were developed that enable Salt to control devices that cannot run the standard salt-minion agent. Proxy minions are not an out-of-the-box feature, and if your network device is non-standard you might have to write your own interface in Python. Salt has a few evangelists, such as Cloudflare, but has not been widely adopted.
Ansible is an open-source configuration management and automation framework originally developed in 2012 in response to the perceived inadequacies of the leading tools such as Chef and Puppet; it was purchased by Red Hat in October 2015. Since Python libraries are more common than Ruby's, Ansible's creators developed it in Python. They also had Ansible modules communicate over JSON, so modules can be written in any programming language (playbooks themselves are written in YAML). Ansible is almost unique among automation framework tools in using an agentless architecture: it relies only on tried-and-tested SSH to idempotently deploy modules to all nodes. Of all the Linux-based, server-oriented tools, Ansible has gained the most traction for network automation, and many NetOps teams use it to administer and automate network operations across a wide variety of platforms.
Ansible operates by running a playbook, a blueprint of automation tasks: a series of IT actions executed with limited or no human involvement. Playbooks are executed against a set, group or classification of hosts or devices, which together make up an Ansible inventory. Playbooks are essentially frameworks: prewritten code that developers can use ad hoc or as a starting template.
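As a generic illustration of the playbook-and-inventory model (not one of the Pluribus playbooks discussed below), a minimal playbook that pushes a VLAN to a group of switches might look like the following; the inventory group name `datacenter_leaves` and the use of the generic `cli_config` network module are assumptions for illustration.

```yaml
# Minimal illustrative playbook. "datacenter_leaves" is a hypothetical
# inventory group; cli_config is Ansible's generic network CLI module,
# standing in for whatever vendor-specific modules your platform provides.
- name: Deploy a new VLAN fabric-wide
  hosts: datacenter_leaves
  gather_facts: false
  tasks:
    - name: Ensure VLAN 100 exists
      cli_config:
        config: |
          vlan 100
          name web-tier
```

Running this with `ansible-playbook` against an inventory applies the same task to every device in the group, which is exactly the fabric-wide repetition that scripting otherwise handles by hand.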
In addition to Ansible there is Ansible Tower. The Ansible team describes this as “the easy-to-use UI and dashboard and REST API for Ansible. Centralize your Ansible infrastructure from a modern UI, featuring role-based access control, job scheduling, and graphical inventory management. Tower’s REST API and CLI make it easy to embed Tower into existing tools and processes. Tower now includes real-time output of playbook runs, an all-new dashboard and expanded out-of-the-box cloud support.”
Pluribus is fairly pragmatic about building new features and capabilities. Given the market traction of Ansible for network automation, and with input from customers, we have built a meaningful set of Ansible playbooks. What is great about this combination is that customers already benefit tremendously from the native fabric-wide automation provided by the Pluribus Adaptive Cloud Fabric. However, many customers asked us to implement Ansible as an additional layer of automation so it could mesh with their overall IT automation framework.
Our TME team has created a short but educational video that covers how to use Ansible to automate Pluribus Netvisor ONE and the Adaptive Cloud Fabric software via the CLI. Since that video was recorded we have continued to build playbooks and can now claim over 50; you can view the list on GitHub here. The Pluribus UNUM management platform can also be used like Ansible Tower to launch custom playbooks that automate any specific configuration workflow. As pioneers in SDN, Pluribus understands the importance of simple IT automation, and that's why we support Ansible.
Ultimately it is important for NetOps teams to understand the difference between automating a data center fabric with an SDN control plane versus one with a box-by-box protocol-based control plane. Automating protocol based solutions can require extensive scripting and even with an automation framework tool like Ansible it can be a heavy lift. It also requires the allocation of NetOps resources to continuously update scripts as new features and capabilities are released in the underlying network operating systems. This is quite different from SDN-automation that is built into the network OS itself, as will be discussed later in this blog.
On a related note, it can be hard for network teams making purchasing decisions to cut through the marketing noise around network automation and understand the real capabilities and value of the automation solutions on offer. For example, Cumulus Networks did extensive marketing around network automation using Linux-based tools like Puppet and Chef to achieve DevOps-like agility and speed, which sounds pretty good, doesn't it? But at the end of the day they were just telling IT teams to build their own automation scripts. This often resulted in Cumulus also selling professional services to implement the scripting for their customers, adding 30% or more to the overall cost of the initial purchase. Furthermore, it leaves the NetOps team in a position where they either still need to deeply understand the scripting or must bring the vendor back in every time new scripts are needed.
Third Party Multi-vendor External Automation Solutions
There are a number of third-party multi-vendor external automation offerings in the market, including Apstra, SolarWinds, Gluware and many more. Apstra seems to have received the most marketing buzz over the last couple of years, with a message built around intent-based networking. In the last blog I talked in detail about BGP EVPN network overlay fabrics and what a heavy lift this protocol-based fabric is to manage, especially across more than one site. One of the key value propositions of Apstra was its ability to automate BGP EVPN fabrics. Apstra was recently acquired by Juniper Networks, so it will now move into the category of vendor-oriented external management and automation.
There are a number of benefits, but also some fundamental challenges, with external automation:
- Multi-vendor – if you are looking to automate your data center and campus which are using different networking vendors you can use a third-party tool to automate some basic tasks.
- Staying synchronized with vendor enhancements – multi-vendor implementations have always been challenged to keep their “adapters” in sync with multiple vendors. Every vendor is continuously enhancing its underlying operating system functionality, and the third-party automation platform's development team sits outside the four walls of every vendor it supports. Thus third-party systems always fall out of sync, and if a desired feature is not yet automated, the NetOps team will have to drop into the CLI or build a script themselves to take advantage of it.
- Least common denominator – this is closely related to the synchronization issue above. In a multi-vendor environment, if a NetOps team is automating the deployment of a particular network service, the external automation solution can only deploy the service with features that are available across all of the vendor operating systems involved in the network – the least common denominator.
- Poor vendor support for APIs – because many legacy networking vendors have been around for decades their solutions do not always offer the most robust set of APIs for programmability. This means that the external automation solution may have to revert to CLI programming or other cruder approaches.
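The least-common-denominator effect falls out directly from set intersection; here is a sketch with made-up vendor names and feature sets.

```python
# The least-common-denominator effect, sketched as set intersection.
# Vendor names and feature lists are invented for illustration.

vendor_features = {
    "vendor_a": {"vlan", "vrf", "vxlan", "evpn_multisite"},
    "vendor_b": {"vlan", "vrf", "vxlan"},
    "vendor_c": {"vlan", "vrf"},
}

# A multi-vendor tool can only automate what every network OS supports.
deployable = set.intersection(*vendor_features.values())
print(sorted(deployable))
```

Vendor A's multisite EVPN and vendors A/B's VXLAN support are stranded: the tool can only automate the two features common to all three operating systems.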
Vendor-based external management and automation
Vendor-oriented external automation solutions such as Cisco Prime, Arista CloudVision, Aruba Fabric Composer and Dell Smart Fabric Director are typically going to outperform a third-party solution in a single-vendor environment. Their development teams work closely together and can leverage internal engineering documents and communication tools. However, even these tools can struggle to keep up and stay synchronized, and they are often automating operating systems that are decades old and whose APIs lack functional parity with the CLI. As above, when trying to automate a BGP EVPN fabric, especially across two or more data centers, the automation tool often cannot provide all of the features available in a single data center fabric environment.
Software Defined Networking (SDN) for DC fabrics
With the external automation solutions, whether third-party or vendor-based, automation as a first principle is not architected into the network operating system from the ground up. These external solutions are doing the best they can to automate networking systems that were built around conventional CLI operations and protocol-based fabrics. An alternate approach is to start from a clean sheet of paper and design the operating system and associated networking elements with automation in mind from the beginning; this makes it possible to deliver a much more innovative solution that provides a quantum leap in automation capabilities. That is an entire topic in itself, which I will cover in the next blog in this series, SDN: The Evolution of Data Center Network Automation.