Accelerating Open vSwitch to “Ludicrous Speed”

[This post was written by OVS core contributors Justin Pettit, Ben Pfaff, and Ethan Jackson.]

The overhead associated with vSwitches has been a hotly debated topic in the networking community. In this blog post, we show how recent changes to OVS have elevated its performance to be on par with the native Linux bridge. Furthermore, CPU utilization of OVS in realistic scenarios can be up to 8x below that of the Linux bridge.  This is the first of a two-part series.  In the next post, we take a peek at the design and performance of the forthcoming port to DPDK, which bypasses the kernel entirely to gain impressive performance.

Open vSwitch is the most popular network back-end for OpenStack deployments and widely accepted as the de facto standard OpenFlow implementation.  Open vSwitch development initially had a narrow focus — supporting novel features necessary for advanced applications such as network virtualization.  However, as we gained experience with production deployments, it became clear these initial goals were not sufficient.  For Open vSwitch to be successful, it not only must be highly programmable and general, it must also be blazingly fast.  For the past several years, our development efforts have focused on precisely this tension — building a software switch that does not compromise on either generality or speed.

To understand how we achieved this feat, one must first understand the OVS architecture.  The forwarding plane consists of two parts: a “slow-path” userspace daemon called ovs-vswitchd and a “fast-path” kernel module.  Most of the complexity of OVS is in ovs-vswitchd; all of the forwarding decisions and network protocol processing are handled there.  The kernel module’s only responsibilities are tunnel termination and handling the traffic that hits its flow cache.

Megaflows

When a packet is received by the kernel module, its cache of flows is consulted.  If a relevant entry is found, then the associated actions (e.g., modify headers or forward the packet) are executed on the packet.  If there is no entry, the packet is passed to ovs-vswitchd to decide the packet’s fate. ovs-vswitchd then executes the OpenFlow pipeline on the packet to compute actions, passes it back to the fast path for forwarding, and installs a flow cache entry so similar packets will not need to take these expensive steps.

[Figure: dataplane]

The life of a new flow through the OVS forwarding elements. The first packet misses the kernel module’s flow cache and is sent to userspace for further processing. Subsequent packets are processed entirely in the kernel.
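
To make this split concrete, here is a minimal Python sketch of the miss-and-cache cycle described above.  It is illustrative only: the class names, the dictionary-based cache, and the exact-match key are our simplifications, not OVS code or APIs.

```python
# Illustrative sketch of the OVS fast-path/slow-path split (not real OVS code).

class SlowPath:
    """Stands in for ovs-vswitchd: runs the full OpenFlow pipeline."""

    def handle_miss(self, packet):
        # Real OVS computes actions by traversing the OpenFlow tables.
        # Here we just "forward" on the destination MAC as a placeholder.
        return ("output", packet["eth_dst"])


class FastPath:
    """Stands in for the kernel module: a flow cache plus action execution."""

    def __init__(self, slow_path):
        self.slow_path = slow_path
        self.cache = {}                      # flow key -> actions

    def receive(self, packet):
        key = tuple(sorted(packet.items()))  # exact-match (microflow-style) key
        actions = self.cache.get(key)
        if actions is None:
            # Cache miss: send the packet to userspace (an "upcall") and
            # install the resulting actions so later packets stay in the cache.
            actions = self.slow_path.handle_miss(packet)
            self.cache[key] = actions
        print("applying", actions)


fast = FastPath(SlowPath())
pkt = {"in_port": 1, "eth_src": "aa:aa:aa:aa:aa:01",
       "eth_dst": "aa:aa:aa:aa:aa:02", "ip_dst": "10.0.0.2", "tcp_dst": 80}
fast.receive(pkt)   # first packet of the flow: misses, goes to the slow path
fast.receive(pkt)   # identical packet: handled entirely by the cache
```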

Until OVS 1.11, this fast-path cache contained exact-match “microflows”.  Each cache entry specified every field of the packet header, and was therefore limited to matching packets with this exact header.  While this approach worked well for most common traffic patterns, unusual workloads, such as port scans or some peer-to-peer rendezvous servers, had very low cache hit rates.  In those cases, many packets had to traverse the slow path, severely limiting performance.

OVS 1.11 introduced megaflows, enabling the single biggest performance improvement to date.  Instead of a simple exact-match cache, the kernel cache now supports arbitrary bitwise wildcarding.  Therefore, it is now possible to specify only those fields that actually affect forwarding.  For example, if OVS is configured simply to be a learning switch, then only the ingress port and L2 fields are relevant and all other fields can be wildcarded.  In previous releases, a port scan would have required a separate cache entry for, e.g., each half of a TCP connection, even though the L3 and L4 fields were not important.
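
The difference can be shown with a small sketch (field-level wildcarding here for brevity; the real kernel cache supports bitwise wildcarding within fields, and these structures are our own, not OVS’s):

```python
# Microflow vs. megaflow matching, simplified to whole-field wildcards.

def microflow_matches(entry_key, packet):
    """Exact match: every header field must be identical."""
    return entry_key == packet

def megaflow_matches(entry, packet):
    """Wildcard match: only the fields listed in the mask must agree."""
    key, mask = entry
    return all(packet.get(field) == key[field] for field in mask)

# A learning-switch megaflow: only the ingress port and L2 fields matter,
# so everything else (IP addresses, TCP ports, ...) is wildcarded.
megaflow = (
    {"in_port": 1, "eth_src": "aa:aa:aa:aa:aa:01", "eth_dst": "aa:aa:aa:aa:aa:02"},
    {"in_port", "eth_src", "eth_dst"},
)

pkt_a = {"in_port": 1, "eth_src": "aa:aa:aa:aa:aa:01",
         "eth_dst": "aa:aa:aa:aa:aa:02", "ip_dst": "10.0.0.2", "tcp_dst": 80}
pkt_b = dict(pkt_a, tcp_dst=443)            # same L2 headers, different TCP port

print(megaflow_matches(megaflow, pkt_a))    # True
print(megaflow_matches(megaflow, pkt_b))    # True: one entry covers both packets
print(microflow_matches(pkt_a, pkt_b))      # False: exact match needs two entries
```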

Multithreading

The introduction of megaflows allowed OVS to drastically reduce the number of packets that traversed the slow path.  This represented a major improvement, but ovs-vswitchd still had a number of other responsibilities, which became the new bottleneck.  These include activities like managing the datapath flow table, running switch protocols (LACP, BFD, STP, etc.), and other general accounting and management tasks.

While the kernel datapath has always been multi-threaded, ovs-vswitchd was a single-threaded process until OVS 2.0.  This architecture was pleasantly simple, but it suffered from several drawbacks.  Most obviously, it could use at most one CPU core.  This was sufficient for hypervisors, but we began to see Open vSwitch used more frequently as a network appliance, in which it is important to fully use machine resources.

Less obviously, it becomes quite difficult to support ovs-vswitchd’s real-time requirements in a single-threaded architecture.  Fast-path misses from the kernel must be processed by ovs-vswitchd as promptly as possible.  In the old single-threaded architecture, miss handling often blocked behind the single thread’s other tasks.  Large OVSDB changes,  OpenFlow table changes, disk writes for logs, and other routine tasks could delay handling misses and degrade forwarding performance.  In a multi-threaded architecture, soft real-time tasks can be separated into their own threads, protected from delay by unrelated maintenance tasks.
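
A rough sketch of that separation, using Python threads (the thread count, the queue, and the names are illustrative; the real ovs-vswitchd thread layout is different and has changed across releases):

```python
# Illustrative only: dedicated handler threads keep cache-miss processing
# from blocking behind maintenance work (not the actual ovs-vswitchd design).
import queue
import threading
import time

miss_queue = queue.Queue()

def miss_handler(worker_id):
    """Soft real-time work: translate misses and install cache entries."""
    while True:
        packet = miss_queue.get()
        if packet is None:               # shutdown sentinel
            break
        # ... run the OpenFlow pipeline and install a megaflow ...
        print(f"handler {worker_id} resolved miss for {packet}")

def main_thread_work():
    """OVSDB updates, OpenFlow table changes, logging: no longer in the way."""
    time.sleep(0.3)                      # stand-in for slow bookkeeping tasks

handlers = [threading.Thread(target=miss_handler, args=(i,)) for i in range(2)]
for t in handlers:
    t.start()

for n in range(4):
    miss_queue.put({"flow": n})          # misses keep arriving...
main_thread_work()                       # ...while maintenance runs elsewhere

for _ in handlers:
    miss_queue.put(None)
for t in handlers:
    t.join()
```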

Since the introduction of multithreading, we’ve continued to fine-tune the number of threads and their responsibilities.  In addition to order-of-magnitude improvements in miss handling performance, this architectural shift has allowed us to increase the size of the kernel cache from 1,000 flows in early versions of OVS to roughly 200,000 in the most recent version.

Classifier Improvements

The introduction of megaflows, together with the larger cache made possible by multithreading, reduced the need for packets to be processed in userspace.  This is important because a packet lookup in userspace can be quite expensive.  For example, a network virtualization application might define an OpenFlow pipeline with dozens of tables that hold hundreds of thousands of rules.  The final set of optimizations we will discuss is to the classifier in ovs-vswitchd, which is responsible for determining which OpenFlow rules apply when processing a packet.

Data structures to allow a flow table to be quickly searched are an active area of research.  OVS uses a tuple space search classifier, which consists of one hash table (tuple) for each kind of match actually in use.  For example, if some flows match on the source IP, that’s represented as one tuple, and if others match on source IP and destination TCP port, that’s a second tuple.  Searching a tuple space search classifier requires searching each tuple, then taking the highest priority match.  In the last few releases, we have introduced a number of novel improvements to the basic tuple search algorithm:
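
As a rough Python sketch of the basic algorithm (the data layout and names are ours, not the OVS classifier’s): each tuple is a hash table keyed by the fields its rules match on, and a lookup probes every tuple and keeps the highest-priority hit.

```python
# Simplified tuple space search classifier (illustrative, not OVS's code).

class Tuple:
    """One hash table for all rules that match on the same set of fields."""

    def __init__(self, fields):
        self.fields = fields            # e.g. ("ip_src",) or ("ip_src", "tcp_dst")
        self.rules = {}                 # key values -> (priority, action)

    def insert(self, match, priority, action):
        key = tuple(match[f] for f in self.fields)
        self.rules[key] = (priority, action)

    def lookup(self, packet):
        key = tuple(packet.get(f) for f in self.fields)
        return self.rules.get(key)      # (priority, action) or None


def classify(tuples, packet):
    """Probe every tuple and return the highest-priority match."""
    best = None
    for t in tuples:
        hit = t.lookup(packet)
        if hit is not None and (best is None or hit[0] > best[0]):
            best = hit
    return best


src_only = Tuple(("ip_src",))
src_only.insert({"ip_src": "10.0.0.1"}, priority=10, action="drop")

src_and_port = Tuple(("ip_src", "tcp_dst"))
src_and_port.insert({"ip_src": "10.0.0.1", "tcp_dst": 80}, priority=20, action="allow")

tuples = [src_only, src_and_port]
print(classify(tuples, {"ip_src": "10.0.0.1", "tcp_dst": 80}))  # (20, 'allow')
print(classify(tuples, {"ip_src": "10.0.0.1", "tcp_dst": 22}))  # (10, 'drop')
```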

  • Priority Sorting – Our simplest optimization is to sort the tuples in descending order by the maximum priority of any flow in the tuple.  A search that finds a matching flow with priority P can then terminate as soon as it reaches a tuple whose maximum priority is P or less (see the sketch after this list).
  • Staged Lookup – In many network applications, a policy may only apply to a subset of the headers.  For example, a firewall policy that requires looking at TCP ports may only apply to a couple of VMs’ interfaces.  With the basic tuple search algorithm, if any rule looks at the TCP ports, then every generated megaflow would have to match all the way up through the L4 headers.  With staged lookup, each tuple is searched in stages, starting with metadata (e.g., the ingress port) and moving further up the stack only when the earlier stages find a potential match, so megaflows only need to match the fields that were actually examined.
  • Prefix Tracking – When processing L3 traffic, a longest prefix match is required for routing.  The tuple-space algorithm works poorly in this case, since it degenerates into matching as many bits as the longest prefix in use.  This means that if one rule matches on 10.x.x.x and another on 192.168.0.x, traffic hitting the 10.x.x.x rule would also be matched on 24 bits instead of 8, which requires keeping more megaflows in the kernel.  With prefix tracking, we consult a trie of the prefixes in use, which tells us how many high-order bits are actually needed to differentiate between rules and lets us skip tuples that cannot match.
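
Building on the tuple-search sketch above, here is a minimal, again illustrative, version of the priority-sorting cutoff from the first bullet; the SortedTuple class and its max_priority field are our own names, not OVS’s.

```python
# Illustrative priority-sorting optimization for the tuple-search sketch above.

class SortedTuple(Tuple):
    """A tuple that tracks the highest priority of any rule it contains."""

    def __init__(self, fields):
        super().__init__(fields)
        self.max_priority = float("-inf")

    def insert(self, match, priority, action):
        super().insert(match, priority, action)
        self.max_priority = max(self.max_priority, priority)


def classify_sorted(tuples, packet):
    """'tuples' must be sorted by max_priority, descending."""
    best = None
    for t in tuples:
        if best is not None and t.max_priority <= best[0]:
            break          # no remaining tuple can hold a higher-priority match
        hit = t.lookup(packet)
        if hit is not None and (best is None or hit[0] > best[0]):
            best = hit
    return best

# Usage: keep the list sorted by maximum priority, then search:
#   tuples.sort(key=lambda t: t.max_priority, reverse=True)
#   classify_sorted(tuples, packet)
```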

These classifier improvements have been shown with practical rule sets to reduce the number of megaflows needed in the kernel from over a million to only dozens.

Evaluation

The preceding changes improve many aspects of performance.  For this post, we’ll just evaluate the performance gains in flow setup, which was the area of greatest concern.  To measure setup performance, we used netperf’s TCP_CRR test, which measures the number of short-lived transactions per second (tps) that can be established between two hosts.  We compared OVS to the Linux bridge, a fixed-function Ethernet switch implemented entirely inside the Linux kernel.

In the simplest configuration, the two switches achieved identical throughput (18.8 Gbps) and similar TCP_CRR connection rates (696,000 tps for OVS, 688,000 for the Linux bridge), although OVS used more CPU (161% vs. 48%). However, when we added one flow to OVS to drop STP BPDU packets and a similar ebtable rule to the Linux bridge, OVS performance and CPU usage remained constant whereas the Linux bridge connection rate dropped to 512,000 tps and its CPU usage increased over 26-fold to 1,279%. This is because the built-in kernel functions have per-packet overhead, whereas OVS’s overhead is generally fixed per-megaflow. We expect that enabling other features, such as routing and a firewall, would similarly add CPU load.

                                      tps       CPU
Linux Bridge       Pass BPDUs     688,000       48%
                   Drop BPDUs     512,000    1,279%
Open vSwitch 1.10                  12,000      100%
Open vSwitch 2.1   Megaflows Off   23,000      754%
                   Megaflows On   696,000      161%

While these performance numbers are useful for benchmarking, they are synthetic.  At the OpenStack Summit last week in Paris, Rackspace engineers described the performance gains they have seen over the past few releases in their production deployment.  Their story begins in the “Dark Ages” (versions prior to 1.11) and proceeds to “Ludicrous Speed” (versions 2.1 and later).

[Figure: openstack]

Andy Hill and Joel Preas from Rackspace discuss OVS performance improvements at the OpenStack Paris Summit.

Future

Now that OVS’s performance is similar to that of a fixed-function switch while maintaining the flexibility demanded by new networking applications, we’re looking forward to broadening our focus.  While we continue to make performance improvements, the next few releases will begin adding new features such as stateful services and support for new platforms such as DPDK and Hyper-V.

If you want to hear more about this live or talk to us in person, we will all be at the Open vSwitch Fall 2014 Conference next week.

 


6 Comments on “Accelerating Open vSwitch to ‘Ludicrous Speed’”

  1. […] At least tangentially related is this post over on Network Heresy about radically improved Open vSwitch performance. […]

  2. Srini says:

    An observation is that the “megaflows” feature improves TPS performance if the OF flows have wildcarded fields.  In some applications – such as a distributed firewall, a distributed load balancer, and even service function scenarios – OVS is configured with a large number of exact-match flows.  In those cases, as we understand it, the megaflows feature does not help improve the connection rate performance.

    It would be good if the OF flow tables themselves were pushed into the kernel datapath so that flow-cache-miss packets could be processed within the kernel.

    Any comments from others with respect to this observation and proposed solution?

    • justindpettit says:

      The megaflows are a cache of what should happen to packets of a certain format. Consulting this cache is much faster than processing the entire OpenFlow table. In addition to the slowdown that would result from processing all the packets in the kernel without a flow cache, the added complexity would likely not be welcomed in the kernel and would slow down feature additions.

      We wrote an NSDI paper that goes into much more detail than this blog post and might be of interest to you:

      nsdi2015.pdf

      As for more stateful services, we’ve been taking the approach of leveraging existing kernel components and using OVS to steer traffic towards those functionality blocks. For example, we used the Linux connection tracker to implement a firewall:

      1030-conntrack_nat.pdf

      We’re still thinking about how best to add features such as NAT and DPI, but they are in our development plans.

      • Srinivasa Addepalli says:

        Yes, it is true that megaflow performance would be better than going through a set of OF tables (sometimes the number of OF tables in the pipeline can be in the double digits).  I was wondering about the efficacy of megaflows when external controllers program highly granular flows.  Hence the suggestion that doing OF table processing within the datapath would be beneficial.

        You gave two main reasons for not pushing OF packet processing into the kernel:

        1. The slowdown resulting from processing all the packets in the kernel without a flow cache.
        2. The increase in the complexity of the kernel datapath, which would slow down feature additions.

        On (1): I was not suggesting removing the flow cache.  The flow cache, I believe, needs to be there even if the entire OF pipeline processing is moved into the kernel.  By moving OF table processing into the kernel for packets that do not match the flow cache table, one could reduce the interaction with ovs-vswitchd even in cases where an external controller programs highly granular flows (possibly 5-tuple-granular flows).

        On (2): That is a good point.  But many in the industry are exploring userspace-based or iNIC-based data paths.  The complexity constraints on those data path implementations are not as strict as on the kernel data path.  Moreover, iNIC hardware accelerates lookups, so going through the OF tables for cache-miss packets is not as costly.  I am wondering whether you see merit in extending DPIF to allow selected data path implementations to do their own OF table processing.

        Thanks
        Srini

  3. What is the net packets-per-second rate typically achievable with OVS 2.1 or 2.3 (in an OpenStack environment, inclusive of the effect of Linux iptables, multiple bridges, etc., on performance)?  The Paris summit talk mentioned 200K pps seen “in the wild” with Intel Xeon CPUs.  What’s a typical number these days?

