On a lark, I compiled a list of “open” OpenFlow software projects that I knew about off-hand, or could find with minimal effort searching online.
You can find the list here.
Unsurprisingly, most of the projects either come directly from or originated at academia or industry research. As I’ve argued before, with respect to standardization, the more design and insight that comes from real code and plugging real holes, the better. And it is still very, very early days in the SDN engineering cycle. So, it is nice to see the diversity in projects and I hope the ecosystem continues to broaden with more controllers and associated projects entering the fray.
If you know of a project that I’ve missed (I’m only listing those that have code or bits available for free online — with the exception of Pica8 which will send you code on request) please mention it in the comments or e-mail me and I’ll add it to the list.
[This post was written with Andrew Lambeth]
Our last post “Networking Doesn’t Need a VMware” made the point that drawing a simple analogy between server and network virtualization can steer the technical discourse on network virtualization in the wrong direction. The sentiment comes from the many partner, analyst, and media meetings we’ve been involved in that persistently focus on relatively uninteresting areas of the network virtualization space, specifically, details of encapsulation formats and lookup pipelines.
In this series of writeups, we take a deeper look and discuss some areas in which network
virtualization would do well to emulate server virtualization. This is a fairly broad topic so we’ll break it up across a couple of posts.
In this part, we’ll focus on address space virtualization.
Quick heads up that the length of this post got a little bit out of hand. For those who don’t have the time/patience/inclination/attention span, the synopsis is as follows:
One of the key strengths of a hypervisor lies in its insertion of a completely new address space below the guest OS’s view of what it believes to be the physical address space. And while there are several possible ways to interpose on network address space to achieve some form of virtualization, encapsulation provides the closest analog to the hierarchical memory virtualization used in compute. It does so by taking advantage of the hierarchy inherent in the physical topology, and allowing both the virtual and physical address spaces to support complete forwarding and addressing models. However, like memory virtualization’s page table, encapsulation requires maintenance of the address mappings (rules to tunnel mappings). The interface for doing so should be open, and a good candidate for that interface is OpenFlow.
Now, onto the detailed argument …
Virtual Memory in Compute
Virtual memory has been a core component of compute virtualization for decades. The basic concept is very simple, multiple (generally sparse) virtual address spaces are multiplexed to a single compact physical address using a page table that contains the between the two.
An OS virtualizes the address space for a process by populating a table with a single level of translations from virtual addresses (VA) to physical addresses (PA). All hypervisors support the ability for a guest OS to continue working in this mode by adding the notion of a third address space called machine addresses (MA) for the true physical addresses.
Since x86 hardware initially supported only a single level of mappings the hypervisor implemented a complete MMU in software to capture and maintain the guest’s VA to PA mappings, and then created a second set of mappings from guest PA to actual hardware MA. What was actually programmed into hardware was VA to MA mappings, in order to not incur overhead during the actual memory references by the guest, but the full heirarchy was maintained so at any time the hypervisor components could easily take an address from any of the three address spaces and map it back to any of the other address spaces.
Having a multi-level hierarchical mapping was so powerful that eventually CPU vendors added support for multi-level page tables in the hardware MMU (called Nested Page Tables on AMD and Extended Page Tables on Intel). Arguably this was the biggest architectural change to CPUs over the last decade.
Benefits of address virtualization in compute
Although this is pretty basic stuff, it is worth enumerating the benefits it provides to see how these can be applied to the networking world.
- It allows the multiplexing of multiple large, sparse address spaces onto a smaller, compact physical address space.
- It supports mobility of a process within a physical address space. This can be used to more efficiently allocated processes to memory, or take advantage of new memory as it is added.
- VMs don’t have to coordinate with other VMs to select their address space. Or more generally, there are no constraints of the virtual address space that can be allocated to a guest VM.
- A virtual memory subsystem provides a basic unit of isolation. A VM cannot mistakingly (or maliciously) address another VM unless another part of the system is busted or compromised.
Address Virtualization for Networking
The point of this post is to explore whether memory virtualization in compute can provide guidance on network virtualization. It would be nice, for example, to retain the same benefits that made memory virtualization so successful.
We’ll start with some common approaches to network virtualization and see how they compare.
Tagging : The basic idea behind tagging is to mark packets at the edge of the network with some bits (the tag) that contains the virtual context. This tag generally encodes a unique identifier for the virtual network and perhaps the virtual ingress port. As the packet traverses the network, the tag is used to segment the forwarding tables to only apply to rules associated with that tag.
A tag does not provide “virtualization” in the same way virtual memory of compute does. Rather it provides segmentation (which was also a phase compute memory went through decades ago). That is, it doesn’t introduce a new address space, but rather segments an existing address space. As a result, the same addresses are used for both physical and virtual purposes. The implications of this can be quite limiting. For example:
- Since the addresses are used to address things in the virtual world *and* the physical world, they have to be exposed to the physical forwarding tables. Therefore the nice property of aggregation that comes with hierarchical address mapping cannot be exploited. Using VLANs as an example, every VM MAC has to be exposed to the hardware putting a lot of pressure on physical L2 tables. If virtual addresses where mapped to a smaller subset of physical addresses instead, the requirement for large L2 tables would go away.
- Due to layering, tags generally only segment a single address space (e.g. L2). Again, because there is no new address space introduced, this means that all of the virtual contexts must all have identical addressing models (e.g. L2) *and* the virtual addresses space must be the same as the physical address space. This unnecessarily couples the virtual and physical worlds. Imagine a case in which the model used to address the virtual domain would not be suitable for physical forwarding. The classic example of this is L2. VMs are given Ethernet NICs and may want to talk using L2 only. However, L2 is generally not a great way of building large fabrics. Introducing an additional address space can provide the desired service model at the virtual realm, and the appropriate forwarding model in the physical realm.
- Another shortcoming of tagging is that you cannot take advantage of mobility through address remapping. In the virtual address realm, you can arbitrarily map a virtual address to a physical address. If the process or VM moves, the mapping just needs to be updated. A tag however does not provide address mobility. The reason that VEPA, VNTAG, etc. support mobility within an L2 domain is by virtue of learned soft state in L2 (which exists whether or not a tag is in use). For layers that don’t support address mobility (say vanilla IP), using a tag won’t somehow enable it.It’s worth pointing out that this is a classic problem within virtualized datacenters today. L2 supports mobility, and L3 often does not. So while both can be segmented using tags, often only the L2 portion allows mobility confining VMs to a given subnet. Most of the approaches to VM mobility across subnets introduce another layer of addressing, whether this is done with LISP, tunnels, etc. However, as soon as encapsulation is introduced, it begs the question of why use tags at all.
There are a number of other differences between tagging and address-level virtualization (like requiring all switches en route to understand the tag), but hopefully you’ve gotten the point. To be more like memory virtualization, a new addressing layer needs to be present, and segmenting the physical address space doesn’t provide the same properties.
Address Mapping : Another method of network virtualization is address mapping. Unlike tagging, a new address space is introduced and mapped onto the physical address space by one or more devices in the network. The most common example of address mapping is NAT, although it doesn’t necessarily have to be limited to L3.
To avoid changing the framing format of the packet, address mapping operates by updating the address in place, meaning that the same field is used for the updated address. However, multiplexing multiple larger address spaces onto a smaller physical address space (for example, a bunch of private IP subnets onto a smaller physical IP subnet) is a lossy operation and therefore requires additional bits to map back from the smaller space to the larger space.
Here in lies much of the complexity (and commensurate shortcoming) of address mapping. Generally, the additional bits are added to a smaller field in the packet, and the full mapping information is stored as state on the device performing translation. This is what NAT does, the original 5 tuple is stashed on the device, and the ephemeral transport port is used to launder a key that points back to that 5 tuple on the return path.
When compared to virtual memory, the model doesn’t hold up particularly well. Within a server, the entire page table is always accessible, so addresses can be mapped back and forth with some additional information to aid in the demultiplexing (generally the pid). With network address translation, the “page table” is created on demand, and stored in a single device. Therefore, in order to support component failover of asymmetric paths, the per-flow state has to be replicated to all other devices that could be on the alternate path. Also, because state is set up during flow initiation from the virtual address, inbound flows cannot be forwarded unless the destination public IP address is effectively “pinned” to a virtual address. As a result, virtual to virtual communication in which the end points are behind different devices have to resort to OOB techniques like NAT punching in order to communicated.
This doesn’t mean that addressing mapping isn’t hugely useful in practice. Clearly for mapping from a virtual address space to a physical address space this is the correct approach. However, if communicating from a virtual address to a virtual address through a physical address space, it it is far more limited than its compute analog.
Tunneling : By tunneling we mean that the payload of a packet is another packet (headers and all).
Like address mapping (and virtual memory) tunneling introduces another address space. However, there are two differences. First, the virtual address space doesn’t have to look anything like the physical. For example, it is possible to have the virtual address space be IPv6 and the physical address space be IPv4. Second, the full address mapping is stored in the packet so that there is no need to create per-flow state within the network to manage the mappings.
Tunneling (or perhaps more broadly encapsulation) is probably the most popular method of doing full network virtualization today. TRILL uses L2 in L2, LISP uses encapsulation, VXLAN (which is LISP as far as I can tell) uses L2 in L3, VCDNI uses L2 in L2, NVGRE uses L2 in L3, etc.
The general approach maps to virtual memory pretty well. The outer header can be likened to a physical address. The inner header can be likened to a virtual address. And often a shim header is included just after the outer header that contains demultiplexing information (the equivalent of a PID). The “page table” consists of on-datapath table table entries which map packets to tunnels.
Take for example an overlay mesh (like NVGRE). The L2 table at the edge that points to the tunnels is effectively mapping from virtual addresses to physical addresses. And there is no reason to limit this lookup to L2, it could provide L2 and L3 in the virtual domain. If a VM moves, this “page table” is updated as it would be in a server of a process is moved.
Because the packets are encapsulated, switches in the “physical domain” only have to deal with the outer header greatly reducing the number of addresses they have to deal with. Switches which contain the “page table” however, have to do lookups in the virtual world (e.g. map from packet to tunnel), map to the physical world (throw a tunnel on the packet) and then forward the packet in the physical world (figure out which port to send the tunneled traffic out on). Note that these are effectively the same steps an MMU takes within a server.
Further, like memory virtualization, tunneling provides nice isolation properties. For example, a VM in the virtual address space cannot address the physical network unless the virtual network is somehow bridged into it.
Also, like hierarchical memory, it is possible for the logical networks to have totally different forwarding stacks. One could be L2-only, the other ipv4, and another ipv6, and all of these could differ from the physical substrate. A common setup has IPv6 run in the virtual domain (for end-to-end addressing), and IPv4 for the physical fabric (where a large address space isn’t needed).
Fortunately, we already have a a multi-level hierarchical topology in the network (in particularly the datacenter). This allows for either the addition of a new layer at the edge which provides a virtual address space without changing any other components, or just changing the components at the outer edge of the hierarchy.
So what are the shortcomings of this approach? The most immediate are performance issues with tunneling, additional overhead overhead in the packet, and the need to maintain the distributed “page table”.
Tunneling performance is no longer the problem it use to be. Even from software in the server, clever tunnel implementations are able to take advantage of LRO and the like and can achieve 10G performance without much CPU. Hardware tunneling on most switching chipsets performs at line rate as well. Header overhead marginally increases transmission delay, and will reduce total throughput if a link is saturated.
Keeping the “page tables” up to date on the other hand is still something the industry is grappling with. Most standards punt on describing how this is done, or even the interface to use to write to the “page table”. Some piggyback on L2 learning (like NVGRE) to construct some of the state but still rely on an out of band mechanism for other state (in the case of NVGRE the VIF to tenantID mapping).
As we’ve said many times in the past, this interface should be standardized if for no other reason than to provide modularity in the architecture. Not doing that would be like limiting a hardware MMU on a CPU to only work with a single Operating System.
Fortunately, Microsoft realizes this need and has opened the interface to their vswitch on Windows Server 8, and XenServer, KVM and other Linux-based hypervisors support Open vSwitch.
For hardware platforms it would also be nice to expose the ability to manage this state. Of course, we would argue that OpenFlow is a good candidate if for no other reason than it has been already used for this purpose successfully.
Wrapping Up ..
If the analogy of memory virtualization in compute holds then encapsulation is the correct way to go about network virtualization. It introduces a new address space that is unrestricted in forwarding model, it takes advantage of hierarchy inherent in the physical topology allowing for address aggregation further up in the hierarchy, and it provides the basic virtualization properties of isolation and transparent mobility.
In the next post of this series, we’ll explore how server virtualization suggests that the way that encapsulation is implemented today isn’t quite right, and some things we may want to do to fix it.
[A lot of the content of this post was drawn from conversations with Juan Lage, Rajiv Ramanathan, and Mohammad Attar]
Some of the more aggressive buzz around OpenFlow has lauded it as a mechanism for re-implementing networking in total. While in some (fairly fanciful) reality, that could be the case, categorical statements of this nature tend hide practical design trade-offs that exist between any set of technologies within a design continuum.
What do I mean by that? Just that there are things that OpenFlow and more broadly SDN are better suited for than others.
In fact, the original work that lead to OpenFlow was not meant to re-implement all of networking. That’s not to say that this isn’t a worthwhile (if not quixotic) goal. Yet our focus was to explore new methods for datapath state whose management was difficult to do with using the traditional approach of full distribution with eventual consistency.
And what sort of state might that be? Clearly destination-based, shortest path forwarding state can be calculated in a distributed fashion. But there is a lot more state in the datapath beyond that used for standard forwarding (filters, tagging, policy routing, QoS policy, etc.). And there are a lot more desired uses for networks than vanilla destination-based forwarding.
Of course, much of this “other” state is not computed algorithmically today. Rather, it is updated manually, or through scripts whose function more closely resembles macro replacement than computation.
Still, our goal was (and still is) to compute this state programmatically.
So the question really boils down to whether the algorithm needed to compute the datapath state is easily distributed. For example, if the state management algorithm has any of the following properties it probably isn’t, and is therefore a potential candidate for SDN.
- It is not amenable to being split up into many smaller pieces (as opposed to fewer, larger instances). The limiting property in these cases is often excessive communication overhead between the control nodes.
- It is not amenable to running on heterogeneous compute environments. For example those with varying processor speeds and available memory.
- It is not amenable to relatively long RTT’s for communication between distributed instances. In the purely distributed case, the upper bound for communicating between any two nodes scales linearly with the longest loop free path.
- The algorithm requires sophisticated distributed coordination between instances (for example distributed locking or leader election)
There are many examples of algorithms that have these properties which can (and are) used in networking. The one I generally use as an example is a runtime policy compiler. One of our earliest SDN implementations was effectively a Datalog compiler that would take a topologically independent network policy and compile it into flows (in the form of ACL and policy routing rules).
Other examples include implementing global solvers to optimize routes for power, cost, security policy, etc., and managing distributed virtual network contexts.
The distribution properties of most of these algorithms are well understood (and have been for decades). And so it is fairly straightforward to put together an argument which demonstrates that SDN offers advantages over traditional approaches in these environments. In fact, outside of OpenFlow, it isn’t uncommon to see elements of the control plane decoupled from the dataplane in security, management, and virtualization products.
However, networking’s raison d’être, it’s killer app, is forwarding. Plain ol’ vanilla forwarding. And as we all know, the networking community long ago developed algorithms to do that which distribute wonderfully across many heterogeneous compute nodes.
So, that begs the question. Does SDN provide any value to the simple problem of forwarding? That is, if the sole purpose of my network is to move traffic between two end-points, should I use a trusty distributed algorithm (like an L3 stack) that is well understood and has matured for the last couple of decades? Or is there some compelling reason to use an SDN approach?
This is the question we’d like to explore in this post. The punchline (as always, for the impatient) is that I find it very difficult to argue that SDN has value when it comes to providing simple connectivity. That’s simply not the point in the design space that OpenFlow was created for. And simple distributed approaches, like L3 + ECMP, tend to work very well. On the other hand, in environments where transport is expensive along some dimension, and global optimization provides value, SDN starts to become attractive.
First, lets take a look at the problem of forwarding:
Forwarding to me simply means find a path between a source and a destination in a network topology. Of course, you don’t want the path to suck, meaning that the algorithm should efficiently use available bandwidth and not choose horribly suboptimal hop counts.
For the purposes of this discussion, I’m going to assume two scenarios in which we want to do forwarding: (a) let’s assume that bandwidth and connectivity are cheap and plentiful and that any path to get between two points are roughly the same (b) lets assume none of these properties.
We’ll start with the latter case.
In many networks, not all paths are equal. Some may be more expensive than others due to costs of third party transit, some may be more congested than others leading to queuing delays and loss, different paths may support different latencies or maximum bandwidth limits, and so on.
In some deployments, the forwarding problem can be further complicated by security constraints (all flows of type X must go through middleboxes) and other policy requirements.
Further, some of these properties change in real time. Take for example the cost of transit. The price of a link could increases dramatically after some threshold of use has been hit. Or consider how a varying traffic matrix may affect the queuing delay and available bandwidth of network paths.
Under such conditions, one can start to make an argument for SDN over traditional distributed routing protocols.
Why? For two reasons. First, the complexity of the computation increases with the number of properties being optimized over. It also increases with the complexity in the policy model, for example a policy that operates over source, protocol and destination is going to be more difficult than one that only considers destination.
Second, as the frequency in which these properties change increases, the amount of information the needs to be disseminated increases. An SDN approach can greatly limit the total amount of information that needs to hit the wire by reducing the distribution of the control nodes. Fully distributed protocols generally flood this information as it isn’t clear which node needs to know about the updates. There have been many proposals in the literature to address these problems, but to my knowledge they’ve seen little or no adoption.
I presume these costs are why a lot of TE engines operate offline where you can throw a lot of compute and memory at the problem.
So, if optimaility is really important (due to the cost of getting it wrong), and the inputs change a lot, and there is a lot of stuff being optimized over, SDN can help.
Now, what about the case in which bandwidth is abundant, relatively cheap and any sensible path between two points will do?
The canonical example for this is the datacenter fabric. In order to accommodate workload placement anywhere, datacenter physical networks are often built using non-oversubscribed topologies. And the cost of equipment to build these is plummeting. If you are comfortable with an extremely raw supply channel, it’s possible to get 48 ports of 10G today for under $5k. That’s pretty damn cheap.
So, say you’re building a datacenter fabric, you’ve purchased piles of cheap 10G gear and you’ve wired up a fat tree of some sort. Do you then configure OSPF with ECMP (or some other multipathing approach) and be done with it? Or does it make sense to attempt to use an SDN approach to compute the forwarding paths?
It turns out that efficiently calculating forwarding paths in a highly connected graph, and converging those paths on failure is something distributed protocols do really, really well. So well, in fact, that it’s hard to find areas to improve.
For example, the common approach of using multipathing approximates Valiant load balancing which effectively means the following: if you send a packet to an arbitrary point in the network, and that point forwards it to the destination, then for a regular topology, and any traffic matrix, you’ll be pretty close to fully using the fabric bandwidth (within a factor of two given some assumptions on flow arrival rates).
That’s a pretty stunning statement. What’s more stunning is that it can be accomplished with local decisions, and without requiring per-flow state, or any additional control overhead in the network. It also obviates the need for a control loop to monitor and respond to changes in the network. This latter property is particularly nice as the care and feeding of control loops to prevent oscillations or divergence can add a ton of complexity to the system.
On the other hand, trying to scale a solution using classic OpenFlow almost certainly won’t work. To begin with the use of n-tuples (say per-flow, or even per host/destination pair) will most likely result in table space exhaustion. Even with very large tables (hundreds of thousands) the solution is unlikely to be suitable for even moderately sized datacenters.
Further, to efficiently use the fabric, multipathing should be done per flow. It’s highly unlikely (in my experience) that having the controller participate in flow setup will have the desired performance and scale characteristics. Therefore the multipathing should be done in the fabric (which is possible in a later version of OpenFlow like 1.1 or the upcoming 1.2).
Given these constraints, an SDN approach would probably look a lot like a traditional routing protocol. That is, the resulting state would most likely be destination IP prefixes (so we can take advantage of aggregation and reduce the table requirements by a factor of N over source-destination pairs). Further, multipathing, and link failure detection would have to be done on the switch.
Another complication of using SDN is establishing connectivity to the controller. This clearly requires each switch to run something to bootstrap communication, like a traditional protocol. In most SDN implementations I know of, L3 is used for this purpose. So now, not only are we effectively mimicking L3 in the controller to manage datapath state, we haven’t been able to get rid of the distributed approach potentially doubling the control complexity network wide.
So, does SDN provide value for forwarding in these environments? Given the previous discussion it is difficult to argue in favor of SDN.
Does SDN provide more functionality? Unlikely, being limited to manipulating destination prefixes with multipathing being carried out by switches robs SDN of much of its value. The controller cannot make decisions on anything but the destination. And since the controller doesn’t know a priori which path a flow will take, it will be difficult to add additional table rules without replicating state all over the network (clearly a scalability issue).
How about operational simplicity? One might argue that in order to scale an IGP one would have to manually configure areas which presumably would be done automatically with SDN. However, it is difficult to argue that building a separate control network, or in-band-configuring is any less complex than a simple ID to switch mapping.
What about scale? I’ll leave the details of that discussion to another post. Juan Lage, Rajiv Ramanathan, and I have had a on-again, off-again e-mail discussion comparing that scaling properties of SDN to that of L3 for building a fabric. The upshot is that there are nice proof-points on both sides, but given today’s switching chipsets, even with SDN, you basically end up populating the same tables that L3 does. So any scale argument revolves around reducing the RTT to the controller through a single-function control network, and reducing the need to flood. Neither of these tricks are likely to produce a significant scale advantage for SDN, but they seem to produce some.
So, what’s the take-away?
It’s worth remembering that SDN, like any technology, is actually a point in the design space, and is not necessarily the best option for all deployment environments.
My mental model for SDN starts with looking at the forwarding state, and asking the question “what algorithm needs to run to compute that state”. The second question I ask is, “can that algorithm be easily distributed amongst many nodes with different compute and memory resources”. In many use cases, the answer to this latter question is “not-really”. The examples I provide general reduce to global solvers used for finding optimal solutions with many (often changing) variables. And in such cases, SDN is a shoo in.
However, in the case of building out networking fabrics in which bandwidth is plentiful and shortest path (or something similar) is sufficient, there are a number of great, well tested distributed algorithms for the job. In such environments, it’s difficult to argue the merits of building out a separate control network, adding additional control nodes, and running vastly less mature software.
But perhaps, that’s just me. Thoughts?
For those of you who are interested, my keynote at last week’s Open Networking Summit provides some background on OpenFlow and SDN in the form of a historical narrative. However, the main point of the talk (which doesn’t come across as well as I would have liked) is that it really is the community, and not necessarily the technology, that makes the SDN movement so special. I believe the technology will work itself out. However, building a diverse community with strong representation across the networking ecosystem (from ODMs, to customers, and everyone in between) is a very difficult undertaking. And now that we have such a community let’s be sure to acknowledge its importance and focus on continuing to cultivate and grow it.
And for some only tangentially related trivia, I’ve been tracking down the origins of the term ‘SDN’. From what I’ve been able to dig up, it was coined by Kate Greene (who had covered software defined radio) while putting together this article. So, thanks Kate.
NetworkWorld recently posted a blog that includes an interview with me (Martin) on OpenFlow’s roots and relevance in the current ecosystem.
While it’s generous to label me as the “inventor” of OpenFlow, the accolade is somewhat misleading. So I wanted to jot off a quick post to clarify things a bit.
To begin with, there is no sole inventor of OpenFlow. It’s been the product of many tens of individuals and organizations, over multiple years, and really, it’s still in its infancy and can be expected to evolve drastically going forward.
Regarding the specifics of its origins. I wrote the first, half-baked, unofficial draft in late 2007 with Nick McKeown (my advisor) as a follow on to research we were doing at Stanford with Scott Shenker, and Justin Pettit (among others). The first contributors to a “full” spec were Ben Pfaff, Justin, Nick, and myself. Ben and Justin did the lion’s share of the work evolving the protocol into something that could actually be used and implemented. And Justin was the primary editor and wrote the bulk of the initial spec text.
Justin, Ben, and I were at Nicira at the time and needed a protocol for remote switch management, so we coordinated with Nick to develop something that could be useful to a wider community.
Within a few months, we handed the spec, along with a working implementation (primarily written by Ben and Justin) to Stanford as there was growing interest in building a community effort around it. Since then, we’ve continued to play a limited role in its development. However, there have been many very influential contributors since (Rajiv Ramanathan, Jean Tourhilles, Glen Gibb, Brandon Heller, and Ed Crabbe just to name a very few).
So there you have it. The largely uninteresting, and somewhat convoluted origins of OpenFlow.
[This post is written by Teemu Koponen, and Martin Casado. Teemu has architected and been involved in the implementation of multiple widely used OpenFlow controllers.]
Over the last year, there has been some discussion within the OpenFlow and broader SDN community about designing and standardizing a controller API in addition to the switch interface.
To be honest, standardizing a controller API sounds like a poor idea. The controller is software, not a protocol. To our knowledge there are 9 controllers in the wild today, and more than half of them are open source. If the community wants an open software platform, it should divert resources from arguing over standards to building open source software.
(yeah ok, this comic is only tangentially relevant … )
In addition to being a waste of time, premature standardization could be detrimental to the SDN effort. Minimally, it will unnecessarily constrain a fledgling software ecosystem. Software innovation should come through development and experience with deployment, not through consensus built on prophecies.
Further, it’s not clear how to design and standardize an interface to a (non-existent) large software system out of whole cloth. Even Unix which started in the early 70s did not get standardized until the mid/late 80s.
For SDN in particular, there is really very little experience building controllers systems in the wild today. We (the authors) have built four controllers, three which have seen production use, and two of which have multiple products built on them from different organizations. And still, we would certainly not argue for standardizing a particular interface. Hell, we’re not even clear if there is a one-size-fits-all interface for controllers targeted at drastically different deployment environments. In fact, our experience suggests the contrary.
In our opinion, if an interface is going to be standardized, it should follow the software model of standardization. Either (a) take a system that has demonstrated broad success and generality over many years and call it a standard, or (b) during the design phase of a particular software project, have the *developers* work to create a standard API with the appropriate community feedback (which is probably product and system developers).
But most importantly, and it really bears repeating, we all win if we let an unbridled software ecosystem flourish within the controller space.
OK, so while we think any discussion about standardizing a controller interface is extremely premature, the question of “what makes a good controller API” is absolutely crucial to SDN and very much worth discussing. However, there has been relatively little discourse on controller APIs. And that’s exactly what we’ll be focusing on in rest of this post. We’ll start by providing a brief look at common controller designs, and then take a closer look at Onix, the controller platform we work on.
Common Controller Types
If you survey the existing controller landscape there appears to be three broad classes of API.
Single Purpose Controllers: These controllers are built for a specific function. So while they may use some common code (like an OpenFlow library) to communicate with the switch, they are otherwise not built with different uses in mind. Or put another way, they lack a control-logic agnostic interface for extension, so there really is little more to talk about.
OpenFlow Controllers: Yeah, that is probably a confusing name, but bear with us. A number of controllers provide a high-level interface to OpenFlow (specifically) and some infrastructure for hosting one or more control “applications”. These are general purpose platforms meant for extension. However, the interface is built around the OpenFlow protocol itself, meaning the interface is a thin wrapper around each message, without offering higher-level abstractions. Therefore, as the protocol changes (e.g., a new message is added), the API necessarily reflects this change. Most controllers (including one we’ve written) fall in this camp.
In our experience, the primary problem with this tight coupling is that it is challenging to evolve any aspect of the control application, switch protocol, or controller state distribution (assuming the controllers form a cluster) without refactoring all the layers of the software stack in tandem.
General SDN Controllers: We’re sort of making these names up as we go along, but a general SDN controller is one in which the control application is fully decoupled from both the underlying protocol(s) communicating with the switches, and the protocol(s) exchanging state between controller instances (assuming the controllers can run in a cluster).
The decoupling is achieved by turning the network control problem into a state management problem, and only exposing the state to be managed to the application, and not the mechanisms by which it arrived.
Applications operate over control traffic, flow table state, configuration state, port state, counters, etc. However, how this data is pulled into the controller, and how any changes are pushed back down to the switch is not the application’s business. It can just happily modify network state as if it were local and the platform will take care of the state dissemination.
The platform we’ve been working on over the last couple of years (Onix) is of this latter category. It supports controller clustering (distribution), multiple controller/switch protocols (including OpenFlow) and provides a number of API design concessions to allow it to scale to very large deployments (tens or hundreds of thousands of ports under control). Since Onix is the controller we’re most familiar with, we’ll focus on it.
So, what does the Onix API look like? It’s extremely simple. In brief, it presents the network to applications as an eventually consistent graph that is shared among the nodes in the controller cluster. In addition, it provides applications with distributed coordination primitives for coordinating nodes’ access to that graph.
Conceptually the interface is as straightforward as it sounds, however understanding its use in practice, especially in a distributed setting, bears more discussion.
So, digging a little further …
The Network as an Eventually Consistent Graph
In Onix, the physical network is represented as a graph for the control application. Elements in the graph may represent physical objects such as switches, or their subparts, such as physical ports, or lookup tables. And it may contain logical entities such as logical ports, tunnels, BFD configuration, etc.
Control applications operate on the graph to find elements of interest and read and write to them. They can register for notifications about state changes in the elements, and they can even extend the existing elements to hold network state specific to a particular control logic.
We happen to call this graph “the NIB” (for Network Information Base).
The NIB abstracts away both the protocols for state synchronization between the controller and the switches and the protocols used for state synchronization between controllers (discussed further below).
Regarding switch state, the control applications operate on the state directly, independent of how it is synchronized with the switch. For example, control logic may traverse the NIB looking for a particular switch, modify some forwarding table entries, and then place a trigger to receive notification on any future changes (such as a port status change event).
Doing this within a single controller node is pretty straightforward. However, for resilience and scale, most production deployments will want to run with multiple controllers. Hence, the platform needs to support clustering.
Within Onix, the NIB is the central mechanism for distributed state sharing. That is, the NIB is actually a shared datastructure among the nodes in the cluster. If a node updates the NIB, eventually that update will propagate to the other nodes interested in that portion of the graph (assuming no network failures).
Under this model, the control logic remains unaware of the specifics of the distribution mechanisms for controller-to-controller state. However, it is the responsibility of the application logic to coordinate which controller instance is responsible for which portion of the graph at any given time. To aid with this, Onix includes a number of standard distributed coordination primitives such as distributed locking, leader election, etc. Using them, control applications can safely partition the work amongst the cluster.
Because the control logic is decoupled from the distribution mechanisms, the Onix may be configured to use different distributed datastores to share different parts of the graph. For more dynamic state, it may use high-performance, memory-only, key/value store, whereas for more static information (such as admin configured parts of the graph) it may prefer storage that is strongly consistent and provides durability.
Of course, a design like this pushes a lot of complexity to the application. For example, the application is responsible for all distributed coordination (though tools for doing so are provided), and if the application chooses to use an eventually consistent storage back-end, it must tolerate eventual consistency in the NIB (including conflicts, data disappearing, etc.)
An alternative approach would have been to constrain the distribution model to something simple (like strongly consistent updates to all data) which would have greatly simplified the API. However, this would come at a huge cost to scalability.
Which begs the question …
Is This the Right Interface?
Perhaps for some environments. Almost certainly not for others. There have been multiple control applications built on Onix, and it is used in large production deployments in the data centers, as well as in the access and core networks. However, it is probably too heavyweight for smaller networks (the home or small enterprise), and it is certainly too complex to use as a basic research tool.
The NIB is also a very low-level interface. Clearly, there is a lot of scaffolding one can build on top to aid in application development. For example, routing libraries, packet classification engines, network discovery components, configuration management frameworks, and what not.
And that is the beauty of software. If you don’t like an API, you fix it, build on it, or you build your own platform. Sure, as a platform matures, binary compatibility and stability of APIs becomes crucial, but that is more a matter of versioning than a priori standardization and design.
So our vote is to keep standards away from the controller design, to support and promote multiple efforts, and to let the ecosystem play out naturally.
Wow, over the last couple of weeks there has been an escalation in the confusion around Openflow. The problem is on both sides of the metaphorical isle. Those behind the hype engine are spinning out of control with fanciful tales about bending the laws of physics (“Openflow will keep the icecream in your refrigerator cold during a power outage!”). And those opposed to OpenFlow (for whatever reason) are using the hype to build non-existant strawmen to attack (“Openflow cannot bend the laws of physics, therefore it isn’t useful”).
So for this post, I figured I’d go through some of the dubious claims and bogus strawmen and try and beat some sense into the discussion. Each of the claims I talk to below were pulled from some article/blog/interview I ran into over the last few weeks.
Claim: “Openflow provides a more powerful interface for managing the network than exists today”
Patently false. APIs expose functionality, not create it. Openflow is a subset of what switching chipsets offer today. If you want more power, sign an NDA with Broadcom and write to their SDK directly.
What Openflow does attempt to do is provide a standard interface above the box. If you believe that SDN is a powerful concept (and I do), then you do need some interface that is sufficiently expressive and hopefully widely supported.
This is almost not worth saying, but clearly Openflow is more expressive than a CLI. CLIs are fine for manual configuration, but they totally blow for building automated systems. The shortcomings are blindingly obvious, for example traditional CLIs have no clear data schema or state semantics and they change all the time because the designers are trying to solve an HCI not an API versioning problem. However, I don’t think there is any real disagreement on this point.
Claim: “Openflow will make hardware simpler”
I highly doubt this. Even with Openflow, a practical network deployment needs a bunch of stuff. It needs to do lookups over common headers (minimally L2/L3/L4), it may need hash-based multi-pathing, it may need port groups, it may need tunneling, it may need fine-grained link timers. If you grab a chip from Broadcom, I’m not exactly sure what you’d throw away if using Openflow.
What Openflow may discourage is stupid shit like creating new tags as an excuse for hardware stickyness (“you want new feature X? It’s implemented using the Suxen tag(tm) and only our new chip supports it.”). This is because an Openflow-like interface can effectively flatten layers. For example, I don’t have to use the MPLs tag for MPLS per se. I can use it for network virtualization (for instance) or for identifying aggregates that correspond to the same security group. However, that doesn’t mean hardware is simpler. Just that the design isn’t redundant to fulfill a business need. (Post facto note: There are some great points regarding this issue in the comments. We’ll work to spin this out into a separate post.)
Or more to the point. There are an awful lot of fields and bits in a header. And an awful lot of lookup capacity in a switch. If you shuffle around what fields mean what, you can almost certainly do what you want without having to change how the hardware parses the packet.
Claim: “Openflow will commoditize the hardware layer”
This is total Kaiser Soze-esque nonsense (yup, that’s the picture at the top of the post). To begin with, networking hardware is already on its way to horizontal integration. The reason that Arista is successful has nothing to do with Openflow (they strictly don’t support it and never have), it has nothing to do with SDN, it’s because Ken Duda, and Andy Bechtolsheim are total bad-asses and have built a great product. Oh, and because merchant silicon has come of age, meaning you don’t need need an ASIC team to be successful in the market.
The difference between OpenFlow and what Arista supports is that with Openflow you choose to build to an industry standard interface, and not a proprietary one. Whether or not you care is, of course, totally up to you.
So Openflow does provide another layer of horizontal integration and once the ecosystem develops that it is, in my opinion, a very good thing. But the ecosystem is still embryonic, so it will take some time before the benefits can be realized.
I think the power of merchant silicon and the rise of the “commodity” fabric are a far greater threat to the crusty scale-up network model. Oddly, Openflow has become a relatively significant distraction. That said, as this area matures, creating a horizontal layer “above the box” will grow in significance.
Claim: “Openflow does not map efficiently to existing hardware”
True! Older versions of Openflow did not map well to existing silicon chips. This is a *major* focus of the existing design effort; to make Openflow more flexible. As with any design effort, the trade-off is between having a future proof roadmap, and having a practical design that can have tangible benefits now. So we’re trying to thread the needle practically but with some foresight. It will take a little patience to see how this evolves.
Claim: “Openflow reduces complexity”
This is a meaningless statement. Openflow is like USB, it doesn’t “do” anything new. A case can be made for SDN to reduce complexity, but it does this by constraining the distribution model and providing a development platform that doesn’t suck. No magic there.
Claim: “Openflow will obviate the need for traditional distributed routing protocols”
I hear this a lot, but I just don’t buy it. There are certainly those within the Openflow community who disagree with me, so please take this for what it’s worth (very little …)
In my opinion, traditional distributed protocols are very good at what they do, scaling, routing packets in complex graphs, converging, etc. If I were to build a fabric, I’d sure as hell do it with an distributed routing protocol (again, not everyone agrees on this point). What traditional protocols suck at is distributing all the other state in the network. What state is that? If you look at the state in a switching chip, a lot of it isn’t traditionally populated through distributed protocols. For example, the ACL tables, the tunnels, many types of tags (e.g. VLAN), etc.
Going forward, it may be the case the distributed routing protocols get pulled into the controller. Why? Because controllers have much higher cpu density and therefore can run the computation for multiple switches. A multicore server will kick the crap out of the standard switch management CPU. Like I described in a previous post, this is no different than the evolution from distance vector to link state, it’s just taking it a step further. However, the protocol certainly is still distributed.
Claim: “Openflow cannot scale because of X”
I’ve addressed this at length in a previous post. Still, I’m going to have to call Doug Gourlay out. And I quote …
“I honestly don’t think it’s going to work,” Arista’s Gourlay said. “The flow setup rates on Stanford’s largest production OpenFlow network are 500 flows per second. We’ve had to deal with networks with a million flows being set up in seconds. I’m not sure it’s going to scale up to that.”
Other than being a basic logical fallacy (Stanford’s largest production Openflow network has absolutely nothing to do with the scaling properties of the Openflow or SDN architecture), there appears to be an implicit assumption that flow-setups are in some way a limiting resource. This clearly isn’t the case if they don’t leave the datapath which is a valid (and popular) deployment model for Openflow. Doug’s a great guy, at a great company, but this is a careless statement, and it doesn’t help further the dialog.
Claim: “Openflow helps virtualize the network”
Again, this is almost a meaningless statement. A compiler helps virtualize the network if it is being used to write code to that effect. The fact is, the pioneers of network virtualization, such as VMWare, Amazon, and Cisco, don’t use Openflow.
So yes, you can use Openflow for virtualization (that’s a help, I guess … ). And open standards are a good thing. But no, you certainly don’t need to.
OK, that’s enough for now.
To be very clear, I’m a huge fan of Openflow. And I’m a huge fan of SDN. Yet, neither is a panacea, and neither is a system or even a solution. One is an effort to provide a standardized interface for building awesome systems (Openflow). The other is a philosophical model for how to build those systems (SDN). It will be up to the systems themselves to validate the approach.