[This post was authored by T. Sridhar and Jesse Gross.]
Earlier this year, VMware, Microsoft, Red Hat and Intel published an IETF draft on Generic Network Virtualization Encapsulation (Geneve). This draft (first published on Valentine’s Day no less) includes authors from the each of the first generation encapsulation protocols — VXLAN, NVGRE, and STT. However, beyond the obvious appeal of unification across hypervisor platforms, the salient feature of Geneve is that it was designed from the ground up to be flexible. Nobody wants an endless cycle of new encapsulation formats as network virtualization designs and controllers mature, certainly not the vendors that have to support the ever growing list of acronyms and RFCs.
Of course press releases, standards bodies and predictions about the future mean little without actual implementations, which is why it is important to consider the “ecosystem” from the beginning of the process. This includes software and silicon implementations in both commercial and open source varieties. This always takes time but since Geneve was designed to accommodate a wide variety of use cases it has seen a relatively quick uptake. Unsurprisingly, the first implementations that landed were open source software — including switches such as Open vSwitch and networking troubleshooting tools like Wireshark. Today the first hardware implementation has arrived, in the form of the 40 Gbps Intel XL710 NIC, previously known as Fortville.
Demo of Geneve hardware acceleration at Intel Developer Forum.
Why is hardware support important? Performance. Everyone likes flexibility, of course, but most of the time that comes with a cost. In the case of a NIC, hardware acceleration enables us to have our cake and eat it too by offloading expensive operations while retaining software control in the CPU. These NICs add encapsulation awareness for classic operations like checksum and TCP segmentation offload to bring Geneve tunnels to performance parity with traditional traffic. For good measure, they also add in support for a few additional Geneve-specific features as well.
Of course, this is just the beginning — it is still only six months after publication of the Geneve specification and much more is still to come. Expect to see further announcements coming soon for both NIC and switch silicon and of course new software to take advantage of the advanced capabilities. Until then, a discussion session as well as a live demo will be at Intel Developer Forum this week to provide a first glimpse of Geneve in action.
As we’ve discussed previously, the vSwitch is a great position to detect elephant, or heavy-hitter flows because it has proximity to the guest OS and can use that position to gather additional context. This context may include the TSO send buffer, or even the guest TCP send buffer. Once an elephant is detected, it can be signaled to the underlay using standard interfaces such as DSCP. The following slide deck provides and overview of a working version of this, showing how such a setup can be used to both dynamically detect elephants and isolate mice from queuing delays they cause. We’ll write about this in more detail in a later post, but for now check out the slides (and in particular the graphs showing the latency of mice with and without detection and handling).
[This post was written by Dinesh Dutt with help from Martin Casado. Dinesh is Chief Scientist at Cumulus Networks. Before that, he was a Cisco Fellow, working on various data center technologies from ASICs to protocols to RFCs. He’s a primary co-author on the TRILL RFC and the VxLAN draft at the IETF. Sudeep Goswami, Shrijeet Mukherjee, Teemu Koponen, Dmitri Kalintsev, and T. Sridhar provided useful feedback along the way.]
In light of the seismic shifts introduced by server and network virtualization, many questions pertaining to the role of end hosts and the networking subsystem have come to the fore. Of the many questions raised by network virtualization, a prominent one is this: what function does the physical network provide in network virtualization? This post considers this question through the lens of the end-to-end argument.
Networking and Modern Data Center Applications
There are a few primary lessons learnt from the large scale data centers run by companies such as Amazon, Google, Facebook and Microsoft. The first such lesson is that a physical network built on L3 with equal-cost multipathing (ECMP) is a good fit for the modern data center. These networks provide predictable latency, scale well, converge quickly when nodes or links change, and provide a very fine-grained failure domain.
Second, historically, throwing bandwidth at the problem leads to far simpler networking compared to using complex features to overcome bandwidth limitations. The cost of building such high capacity networks has dropped dramatically in the last few years. By making networks follow the KISS principle, the networks are more robust and can be built out of simple building blocks.
Finally, there is value in moving functions from the network to the edge where there are better semantics, a richer compute model, and lower performance demands. This is evidenced by the applications that are driving data center. Over time, they have subsumed many of the functions that prior generation applications relied on the network for. For example, Hadoop has its own discovery mechanism instead of assuming that all peers are on the same broadcast medium (L2 network). Failure, security and other such characteristics are often built into the application, the compute harness, or the PaaS layer.
There is no debate about the role of networking for such applications. Yes, networks can attempt to do better load spreading or such, but vendors don’t design Hadoop protocol into networking equipment and debate about the performance benefits of doing so.
The story is much different when discussing virtual datacenters (for brevity, we’ll refer to these virtualized datacenters as “clouds” while acknowledging it is a misuse of the term) that host traditional workloads. Here there is active debate as to where functionality should lie.
Network virtualization is a key cloud-enabling technology. Network virtualization does to networking what server virtualization did to servers. It takes the fundamental pieces that constitute networking – addresses and connectivity(including policies that determine connectivity) – and virtualizes them such that many virtual networks can be multiplexed onto a single physical network.
Unlike software layers within modern data center applications that provide similar functionality (although with different abstractions) there is an ongoing discussion on where the virtual network should be implemented. In what follows, we view this discussion in light of the end-to-end argument.
Network Virtualization and the End-to-End Principle
The end-to-end principle (http://web.mit.edu/Saltzer/www/publications/endtoend/endtoend.pdf) is a fundamental principle defining the design and functioning of the largest network of them all, the Internet. Over the years since its first inception, in 1984, the principle has been revisited and revised, many times, by the authors themselves and by others. But a fundamental idea it postulated remains as relevant today as when it was first formulated.
With regard to the question of where to place a function, in an application or in the communication subsystem, this is what the original paper says (this oft-quoted section comes at the end of a discussion where the application being discussed is reliable file transfer and the function is reliability): “The function in question can completely and correctly be implemented only with the knowledge and help of the application standing at the end points of the communication system. Therefore, providing that questioned function as a feature of the communication system itself is not possible. (Sometimes an incomplete version of the function provided by the communication system may be useful as a performance enhancement.)” [Emphasis is by the authors of this post].
Consider the application of this statement to virtual networking. One of the primary pieces of information required in network virtualization is the virtual network ID (or VNI). Let us consider who can provide the information correctly and completely.
In the current world of server virtualization, the network is completely unaware of when a VM is enabled or disabled and therefore joins or leaves (or creates or destroys) a virtual network. Furthermore, since the VM itself is unaware of the fact that it is running virtualized and that the NIC it sees is really a virtual NIC, there is no information in the packet itself that can help a networking device such as a first hop router or switch identify the virtual network solely on the basis of an incoming frame. The hypervisor on the virtualized server is the only one that is aware of this detail and so it is the only one that can correctly implement this function of associating a packet to a virtual network.
Some solutions to network virtualization concede this point that the hypervisor has to be involved in the decision making of which virtual network a packet belongs to. But they’d like to consider a solution in which the hypervisor signals to the first hop switch the desire for a new virtual network and the first hop switch returns back a tag such as a VLAN for the hypervisor to tag the frame with. The first hop switch then uses this tag to be act as the edge of the network virtualization overlay. Let us consider what this entails to the overall system design. As a co-inventor of VxLAN while at Cisco, I’ve grappled with these consequences during the design of such approaches.
The robustness of a system is determined partly by how many touch points are involved when a function has to be performed. In the case of the network virtualization overlay, the only touch points involved are the ones that are directly involved in the communication: the sending and receiving hypervisors. The state about the bringup and teardown of a virtual network and the connectivity matrix of that virtual network do not involve the physical network. Since fewer touchpoints are involved in the functioning of a virtual network, it is easier to troubleshoot and diagnose problems with the virtual network (decomposing it as discussed in an earlier blog post).
Another data point for this argument comes from James Hamilton’s famous cry of “the network is in my way”. His frustration arose partly from the then in-vogue model of virtualizing networks. VLANs which were the primary construct in virtualizing a network involved touching multiple physical nodes to enable a new VLAN to come up. Since a new VLAN coming up could destabilize the existing network by becoming the straw that broke the camel’s back, a cumbersome, manual and lengthy process was required to add a new VLAN. This constrained the agility with which virtual networks could be spun up and spun down. Furthermore, to scale the solution even mildly, required the reinvention of how the primary L2 control protocol, spanning tree, worked (think MSTP).
Besides the technical merits of the end-to-end principle, another of its key consequences is the effect on innovation. It has been eloquently argued many times that services such as Netflix and VoIP are possible largely because the Internet design has the end-to-end principle as a fundamental rubric. Similarly, by looking at network virtualization as an application best implemented by end stations instead of a tight integration with the communication subsystem, it becomes clear that user choice and innovation become possible with this loose coupling. For example, you can choose between various network virtualization solutions when you separate the physical network from the virtual network overlay. And you can evolve the functions at software time scales. Also, you can use the same physical infrastructure for PaaS and IaaS applications instead of designing different networks for each kind. Lack of choice and control has been a driving factor in the revolution underway in networking today. So, this consequence of the end-to-end principle is not an academic point.
The final issue is performance. The end-to-end principal clearly allows functions to be pushed into the network as a performance optimization. This topic deserves a post in itself (we’ve addressed it in pieces before, but not in its entirety), so we’ll just tee up the basic arguments. Of course, if the virtual network overlay provides sufficient performance, there is nothing additional to do. If not, then the question remains of where to put the functionality to improve performance.
Clearly some functions should be in the physical network, such as packet replication, and enforcing QoS priorities. However, in general, we would argue that it is better to extend the end-host programming model (additional hardware instructions, more efficient memory models, etc.) where all end host applications can take advantage of it, than push a subset of the functions into the network and require an additional mechanism for state synchronization to relay edge semantics. But again, we’ll address these issues in a more focused post later.
At an architectural level, a virtual network overlay is not much different than functionality that already exists in a big data application. In both cases, functionality that applications have traditionally relied on the network for – discovery, handling failures, visibility and security – have been subsumed into the application.
Ivan Pepelnjak said it first, and said it best when comparing network virtualization to Skype. Network virtualization is better compared to the network layer that has organically evolved within applications than to traditional networking functionality found in switches and routers.
If the word “network” was not used in “virtual network overlay” or if it’s origins hadn’t been so intermixed with the discontentment with VLANs, we wonder if the debate would exist at all.
Network virtualization, as others have noted, is now well past the hype stage and in serious production deployments. One factor that has facilitated the adoption of network virtualization is the ease with which it can be incrementally deployed. In a typical data center, the necessary infrastructure is already in place. Servers are interconnected by a physical network that already meets the basic requirements for network virtualization: providing IP connectivity between the physical servers. And the servers are themselves virtualized, providing the ideal insertion point for network virtualization: the vswitch (virtual switch). Because the vswitch is the first hop in the data path for every packet that enters or leaves a VM, it’s the natural place to implement the data plane for network virtualization. This is the approach taken by VMware (and by Nicira before we were part of VMware) to enable network virtualization, and it forms the basis for our current deployments.
In typical data centers, however, not every machine is virtualized. “Bare metal” servers — that is, unvirtualized, or physical machines — are a fact of life in most real data centers. Sometimes they are present because they run software that is not easily virtualized, or because of performance concerns (e.g. highly latency-sensitive applications), or just because there are users of the data center who haven’t felt the need to virtualize. How do we accommodate these bare-metal workloads in virtualized networks if there is no vswitch sitting inside the machine?
Our solution to this issue was to develop gateway capabilities that allow physical devices to be connected to virtual networks. One class of gateway that we’ve been using for a while is a software appliance. It runs on standard x86 hardware and contains an instance of Open vSwitch. Under the control of the NSX controller, it maps physical ports, and VLANs on those ports, to logical networks, so that any physical device can participate in a given logical network, communicating with the VMs that are also connected to that logical network. This is illustrated below.
As an aside, these gateways also address another common use case: traffic that enters and leaves the data center, or “north-south” traffic. The basic functionality is similar enough: the gateway maps traffic from a physical port (in this case, a port connected to a WAN router rather than a server) to logical networks, and vice versa.
Software gateways are a great solution for moderate amounts of physical-to-virtual traffic, but there are inevitably some scenarios where the volume of traffic is too high for a single x86-based appliance, or even a handful of them. Say you had a rack (or more) full of bare-metal database servers and you wanted to connect them to logical networks containing VMs running application and web tiers for multi-tier applications. Ideally you’d like a high-density and high-throughput device that could bridge the traffic to and from the physical servers into the logical networks. This is where hardware gateways enter the picture.
Leveraging VXLAN-capable Switches
Fortunately, there is an emerging class of hardware switch that is readily adaptable to this gateway use case. Switches from several vendors are now becoming available with the ability to terminate VXLAN tunnels. (We’ll call these switches VTEPs — VXLAN Tunnel End Points or, more precisely, hardware VTEPs.) VXLAN tunnel termination addresses the data plane aspects of mapping traffic from the physical world to the virtual. However, there is also a need for a control plane mechanism by which the NSX controller can tell the VTEP everything it needs to know to connect its physical ports to virtual networks. Broadly speaking, this means:
- providing the VTEP with information about the VXLAN tunnels that instantiate a particular logical network (such as the Virtual Network Identifier and destination IP addresses of the tunnels);
- providing mappings between the MAC addresses of VMs and specific VXLAN tunnels (so the VTEP knows how to forward packets to a given VM);
- instructing the VTEP as to which physical ports should be connected to which logical networks.
In return, the VTEP needs to tell the NSX controller what it knows about the physical world — specifically, the physical MAC addresses of devices that it can reach on its physical ports.
There may be other information to be exchanged between the controller and the VTEP to offer more capabilities, but this covers the basics. This information exchange can be viewed as the synchronization of two copies of a database, one of which resides in the controller and one of which is in the VTEP. The NSX controller already implements a database access protocol, OVSDB, for the purposes of configuring and monitoring Open vSwitch instances. We decided to leverage this existing protocol for control of third party VTEPs as well. We designed a new database schema to convey the information outlined above; the OVSDB protocol and the database code are unchanged. That choice has proven very helpful to our hardware partners, as they have been able to leverage the open source implementation of the OVSDB server and client libraries.
The upshot of this work is that we can now build virtual networks that connect relatively large numbers of physical ports to virtual ports, using essentially the same operational model for any type of port, virtual or physical. The NSX controller exposes a northbound API by which physical ports can be attached to logical switches. Virtual ports of VMs are attached in much the same way to build logical networks that span the physical and virtual worlds. The figure below illustrates the approach.
It’s worth noting that this approach has no need for IP multicast in the physical network, and limits the use of flooding within the overlay network. This contrasts with some early VXLAN implementations (and the original VXLAN Internet Draft, which didn’t entirely decouple the data plane from the control plane). The reason we are able to avoid flooding in many cases is that the NSX controller knows the location of all the VMs that it has attached to logical networks — this information is provided by the vswitches to the controller. And the controller shares its knowledge with the hardware VTEPs via OVSDB. Hence, any traffic destined for a VM can be placed on the right VXLAN tunnel from the outset.
In the virtual-to-physical direction, it’s only necessary to flood a packet if there is more than one hardware VTEP. (If there is only one hardware VTEP, we can assume that any unknown destination must be a physical device attached to the VTEP, since we know where all the VMs are). In this case, we use the NSX Service Node to replicate the packet to all the hardware VTEPs (but not to any of the vswitches). Furthermore, once a given hardware VTEP learns about a physical MAC address on one of its ports, it writes that information to the database, so there will be no need to flood subsequent packets. The net result is that the amount of flooded traffic is quite limited.
For a more detailed discussion of the role of VXLAN in the broader landscape of network virtualization, take a look at our earlier post on the topic.
Hardware or Software?
Of course, there are trade-offs between hardware and software gateways. In the NSX software gateway, there is quite a rich set of features that can be enabled (port security, port mirroring, QoS features, etc.) and the set will continue to grow over time. Similarly, the feature set that the hardware VTEPs support, and which NSX can control, will evolve over time. One of the challenges here is that hardware cycles are relatively long. On top of that, we’d like to provide a consistent model across many different hardware platforms, but there are inevitably differences across the various VTEP implementations. For example, we’re currently working with a number of different top-of-rack switch vendors. Some partners use their own custom ASICs for forwarding, while others use merchant silicon from the leading switch silicon vendors. Hence, there is quite a bit of diversity in the features that the hardware can support.
Our expectation is that software gateways will provide the greatest flexibility, feature-richness, and speed of evolution. Hardware VTEPs will be the preferred choice for deployments requiring greater total throughput and density.
A final note: one of the important benefits of network virtualization is that it decouples networking functions from the physical networking hardware. At first glance, it might seem that the use of hardware VTEPs is a step backwards, since we’re depending on specific hardware capabilities to interconnect the physical and virtual worlds. But by developing an approach that works across a wide variety of hardware devices and vendors, we’ve actually managed to achieve a good level of decoupling between the functionality that NSX provides and the hardware that underpins it. And as long as there are significant numbers of physical workloads that need to be managed in the data center, it will be attractive to have a range of options for connecting those workloads to virtual networks.
Two weeks ago I gave a short presentation at the Open Networking Summit. With only 15 minutes allocated per speaker, I wasn’t sure I’d be able to make much of an impact. However, there has been a lot of reaction to the talk – much of it positive – so I’m posting the slides here and including them below. A video of the presentation is also available in the ONS video archive (free registration required).
[This post was written by JR Rivers, Bruce Davie, and Martin Casado]
One of the important characteristics of network virtualization is the decoupling of network services from the underlying physical network. That decoupling is fundamental to the definition of network virtualization: it’s the delivery of network services independent of a physical network that makes those services virtual. Furthermore, many of the benefits of virtualization – such as the ability to move network services along with the workloads that need those services, without touching hardware – follow directly from this decoupling.
In spite of all the benefits that flow from decoupling virtual networks from the underlying physical network, we occasionally hear the concern that something has been lost by not having more direct interaction with the physical network. Indeed, we’ve come across a common intuition that applications would somehow be better off if they could directly control what the physical network is doing. The goal of this post is to explain why we disagree with this view.
It’s worth noting that this idea of getting networks to do something special for certain applications is hardly a novel idea. Consider the history of Voice over IP as an example. It wasn’t that long ago when using Ethernet for phone calls was a research project. Advances in the capacity of both the end points as well as the underlying physical network changed all of that and today VOIP is broadly utilized by consumers and enterprises around the world. Let’s break down the architecture that enabled VOIP.
A call starts with end-points (VOIP phones and computers) interacting with a controller that provisions the connection between them. In this case, provisioning involves authenticating end-points, finding other end-points, and ringing the other end. This process creates a logical connection between the end-points that overlays the physical network(s) that connect them. From there, communication occurs directly between the end-points. The breakthroughs that allowed Voice Over IP were a) ubiquitous end-points with the capacity to encode voice and communicate via IP and b) physical networks with enough capacity to connect the end-points while still carrying their normal workload.
Now, does VOIP need anything special from the network itself? Back in the 1990s, many people believed that to enable VOIP it would be necessary to signal into the network to request bandwidth for each call. Both ATM signalling and RSVP (the Resource Reservation Protocol) were proposed to address this problem. But by the time VOIP really started to gain traction, network bandwidth was becoming so abundant that these explicit communication methods between the endpoints and the network proved un-necessary. Some simple marking of VOIP packets to ensure that they didn’t encounter long queues on bottleneck links was all that was needed in the QoS department. Intelligent behavior at the end-points (such as adaptive bit-rate codecs) made the solution even more robust. Today, of course, you can make a VOIP call between continents without any knowledge of the underlying network.
These same principles have been applied to more interactive use cases such as web-based video conferencing, interactive gaming, tweeting, you name it. The majority of the ways that people interact electronically are based on two fundamental premises: a logical connection between two or more end-points and a high capacity IP network fabric.
Returning to the context of network virtualization, IP fabrics allow system architects to build highly scalable physical networks; the summarization properties of IP and its routing protocols allow the connection of thousands of endpoints without imposing the knowledge of each one on the core of the network. This both reduces the complexity (and cost) of the networking elements, and improves their ability to heal in the event that something goes wrong. IP networks readily support large sets of equal cost paths between end-points, allowing administrators to simultaneously add capacity and redundancy. Path selection can be based on a variety of techniques such as statistical selection (hashing of headers), Valiant Load Balancing, and automated identification of “elephant” flows.
Is anything lost if applications don’t interact directly with the network forwarding elements? In theory, perhaps, an application might be able to get a path that is more well-suited to its precise bandwidth needs if it could talk to the network. In practice, a well-provisioned IP network with rich multipath capabilities is robust, effective, and simple. Indeed, it’s been proven that multipath load-balancing can get very close to optimal utilization, even when the traffic matrix is unknown (which is the normal case). So it’s hard to argue that the additional complexity of providing explicit communication mechanisms for applications to signal their needs to the physical network are worth the cost. In fact, we’ll argue in a future post that trying to carefully engineer traffic is counter-productive in data centers because the traffic patterns are so unpredictable. Combine this with the benefits of decoupling the network services from the physical fabric, and it’s clear that a virtualization overlay on top of a well-provisioned IP network is a great fit for the modern data center.
[This post was written with input from Martin Casado, Ben Pfaff, Justin Pettit and Ben Basler.]
The Open vSwitch (OVS) project is obviously dear to our hearts at Nicira (and now VMware). OVS is a fairly standard open source project, with many dozens of people from companies around the world contributing patches and reviewing them. However, there is more to openness than just open source software; open protocols (with freely accessible specs) are also important. Of course, Open vSwitch is well known as an implementation of the OpenFlow protocol, for which the specs are freely available. But there is another protocol, arguably just as important as OpenFlow, which is used to manage Open vSwitch instances. That protocol is known as the Open vSwitch Data Base management protocol or OVSDB protocol. While the specification of that protocol can be found within the Open vSwitch sources, it’s a bit of an effort to figure out exactly how it works. With that in mind, as well as a desire to be as open as possible, we decided to publish a spec for the OVSDB protocol in a more familiar and accessible format – an Internet draft.
To be clear, anyone can publish an Internet draft, and that does not make something into a standard. There’s a possibility that this Internet draft may be suitable for publication as an informational RFC. That wouldn’t make it a standard either, but it would at least provide an archival publication mechanism for a protocol that is quite widely used. Whether that happens or not depends on the “Independent Stream Editor”, part of the rather complex organization that handles RFC publication. (See http://www.rfc-editor.org/RFCeditor.html.)
So, what is this OVSDB protocol? Obviously, you could just go and read our new Internet draft, but here is the quick summary. While OpenFlow establishes flow state in a switch, there’s a lot more to Open vSwitch – indeed there is more to networking – than just setting up flow (or forwarding) table entries. In Open vSwitch, you can create many virtual switch instances, attach interfaces to those switches, set QOS policies on interfaces, and so on. None of these configuration tasks can be done with OpenFlow, so you need a management/configuration protocol to do them.
The OVSDB protocol has been part of the Open vSwitch implementation for many years. It is essentially a general purpose protocol for interacting with a database, and in Open vSwitch the database in question is a set of tables representing switch configuration data. (Some readers may be familiar with of-config – the OpenFlow config protocol – a more recent effort; we believe that protocol could actually be implemented on top of OVSDB.)
To step back for a moment, networking folks often think of any network device as having a control plane and a data plane. Sadly, the management plane has been all too often neglected. OVSDB is a protocol that was created to address that important but neglected aspect of networking. We think that making networks dramatically easier to manage is in fact one of the major benefits of network virtualization. That’s a topic we’ve discussed elsewhere; for now, I’ll just urge readers of this blog to go take a look at our current approach to managing and configuring Open vSwitch instances.