List of OpenFlow Software Projects

On a lark, I compiled a list of “open” OpenFlow software projects that I knew about off-hand, or could find with minimal effort searching online.

You can find the list here.

Unsurprisingly, most of the projects either come directly from or originated at academia or industry research.  As I’ve argued before, with respect to standardization, the more design and insight that comes from real code and plugging real holes, the better.  And it is still very, very early days in the SDN engineering cycle.   So, it is nice to see the diversity in projects and I hope the ecosystem continues to broaden with more controllers and associated projects entering the fray.

If you know of a project that I’ve missed (I’m only listing those that have code or bits available for free online — with the exception of Pica8 which will send you code on request) please mention it in the comments or e-mail me and I’ll add it to the list.

Networking Needs a VMware (Part 1: Address Virtualization)

[This post was written with Andrew Lambeth]

Our last post “Networking Doesn’t Need a VMware” made the point that drawing a simple analogy between server and network virtualization can steer the technical discourse on network virtualization in the wrong direction. The sentiment comes from the many partner, analyst, and media meetings we’ve been involved in that persistently focus on relatively uninteresting areas of the network virtualization space, specifically, details of encapsulation formats and lookup pipelines.

In this series of writeups, we take a deeper look and discuss some areas in which network
virtualization would do well to emulate server virtualization. This is a fairly broad topic so we’ll break it up across a couple of posts.

In this part, we’ll focus on address space virtualization.

Quick heads up that the length of this post got a little bit out of hand. For those who don’t have the time/patience/inclination/attention span, the synopsis is as follows:

One of the key strengths of a hypervisor lies in its insertion of a completely new address space below the guest OS’s view of what it believes to be the physical address space. And while there are several possible ways to interpose on network address space to achieve some form of virtualization, encapsulation provides the closest analog to the hierarchical memory virtualization used in compute. It does so by taking advantage of the hierarchy inherent in the physical topology, and allowing both the virtual and physical address spaces to support complete forwarding and addressing models. However, like memory virtualization’s page table, encapsulation requires maintenance of the address mappings (rules to tunnel mappings). The interface for doing so should be open, and a good candidate for that interface is OpenFlow.

Now, onto the detailed argument …

Virtual Memory in Compute

Virtual memory has been a core component of compute virtualization for decades. The basic concept is very simple, multiple (generally sparse) virtual address spaces are multiplexed to a single compact physical address using a page table that contains the between the two.

An OS virtualizes the address space for a process by populating a table with a single level of translations from virtual addresses (VA) to physical addresses (PA). All hypervisors support the ability for a guest OS to continue working in this mode by adding the notion of a third address space called machine addresses (MA) for the true physical addresses.

Since x86 hardware initially supported only a single level of mappings the hypervisor implemented a complete MMU in software to capture and maintain the guest’s VA to PA mappings, and then created a second set of mappings from guest PA to actual hardware MA. What was actually programmed into hardware was VA to MA mappings, in order to not incur overhead during the actual memory references by the guest, but the full heirarchy was maintained so at any time the hypervisor components could easily take an address from any of the three address spaces and map it back to any of the other address spaces.

Having a multi-level hierarchical mapping was so powerful that eventually CPU vendors added support for multi-level page tables in the hardware MMU (called Nested Page Tables on AMD and Extended Page Tables on Intel). Arguably this was the biggest architectural change to CPUs over the last decade.

Benefits of address virtualization in compute

Although this is pretty basic stuff, it is worth enumerating the benefits it provides to see how these can be applied to the networking world.

  • It allows the multiplexing of multiple large, sparse address spaces onto a smaller, compact physical address space.
  • It supports mobility of a process within a physical address space. This can be used to more efficiently allocated processes to memory, or take advantage of new memory as it is added.
  • VMs don’t have to coordinate with other VMs to select their address space. Or more generally, there are no constraints of the virtual address space that can be allocated to a guest VM.
  • A virtual memory subsystem provides a basic unit of isolation. A VM cannot mistakingly (or maliciously) address another VM unless another part of the system is busted or compromised.

Address Virtualization for Networking

The point of this post is to explore whether memory virtualization in compute can provide guidance on network virtualization. It would be nice, for example, to retain the same benefits that made memory virtualization so successful.

We’ll start with some common approaches to network virtualization and see how they compare.

Tagging : The basic idea behind tagging is to mark packets at the edge of the network with some bits (the tag) that contains the virtual context. This tag generally encodes a unique identifier for the virtual network and perhaps the virtual ingress port. As the packet traverses the network, the tag is used to segment the forwarding tables to only apply to rules associated with that tag.

A tag does not provide “virtualization” in the same way virtual memory of compute does. Rather it provides segmentation (which was also a phase compute memory went through decades ago). That is, it doesn’t introduce a new address space, but rather segments an existing address space. As a result, the same addresses are used for both physical and virtual purposes. The implications of this can be quite limiting. For example:

  • Since the addresses are used to address things in the virtual world *and* the physical world, they have to be exposed to the physical forwarding tables. Therefore the nice property of aggregation that comes with hierarchical address mapping cannot be exploited. Using VLANs as an example, every VM MAC has to be exposed to the hardware putting a lot of pressure on physical L2 tables. If virtual addresses where mapped to a smaller subset of physical addresses instead, the requirement for large L2 tables would go away.
  • Due to layering, tags generally only segment a single address space (e.g. L2). Again, because there is no new address space introduced, this means that all of the virtual contexts must all have identical addressing models (e.g. L2) *and* the virtual addresses space must be the same as the physical address space. This unnecessarily couples the virtual and physical worlds. Imagine a case in which the model used to address the virtual domain would not be suitable for physical forwarding. The classic example of this is L2. VMs are given Ethernet NICs and may want to talk using L2 only. However, L2 is generally not a great way of building large fabrics. Introducing an additional address space can provide the desired service model at the virtual realm, and the appropriate forwarding model in the physical realm.
  • Another shortcoming of tagging is that you cannot take advantage of mobility through address remapping. In the virtual address realm, you can arbitrarily map a virtual address to a physical address. If the process or VM moves, the mapping just needs to be updated.  A tag however does not provide address mobility.  The reason that VEPA, VNTAG, etc. support mobility within an L2 domain is by virtue of learned soft state in L2 (which exists whether or not a tag is in use).  For layers that don’t support address mobility (say vanilla IP), using a tag won’t somehow enable it.It’s worth pointing out that this is a classic problem within virtualized datacenters today. L2 supports mobility, and L3 often does not. So while both can be segmented using tags, often only the L2 portion allows mobility confining VMs to a given subnet. Most of the approaches to VM mobility across subnets introduce another layer of addressing, whether this is done with LISP, tunnels, etc. However, as soon as encapsulation is introduced, it begs the question of why use tags at all.

There are a number of other differences between tagging and address-level virtualization (like requiring all switches en route to understand the tag), but hopefully you’ve gotten the point. To be more like memory virtualization, a new addressing layer needs to be present, and segmenting the physical address space doesn’t provide the same properties.

Address Mapping : Another method of network virtualization is address mapping. Unlike tagging, a new address space is introduced and mapped onto the physical address space by one or more devices in the network. The most common example of address mapping is NAT, although it doesn’t necessarily have to be limited to L3.

To avoid changing the framing format of the packet, address mapping operates by updating the address in place, meaning that the same field is used for the updated address. However, multiplexing multiple larger address spaces onto a smaller physical address space (for example, a bunch of private IP subnets onto a smaller physical IP subnet) is a lossy operation and therefore requires additional bits to map back from the smaller space to the larger space.

Here in lies much of the complexity (and commensurate shortcoming) of address mapping. Generally, the additional bits are added to a smaller field in the packet, and the full mapping information is stored as state on the device performing translation. This is what NAT does, the original 5 tuple is stashed on the device, and the ephemeral transport port is used to launder a key that points back to that 5 tuple on the return path.

When compared to virtual memory, the model doesn’t hold up particularly well. Within a server, the entire page table is always accessible, so addresses can be mapped back and forth with some additional information to aid in the demultiplexing (generally the pid). With network address translation, the “page table” is created on demand, and stored in a single device. Therefore, in order to support component failover of asymmetric paths, the per-flow state has to be replicated to all other devices that could be on the alternate path.  Also, because state is set up during flow initiation from the virtual address, inbound flows cannot be forwarded unless the destination public IP address is effectively “pinned”  to a virtual address.  As a result, virtual to virtual communication in which the end points are behind different devices have to resort to OOB techniques like NAT punching in order to communicated.

This doesn’t mean that addressing mapping isn’t hugely useful in practice. Clearly for mapping from a virtual address space to a physical address space this is the correct approach. However, if communicating from a virtual address to a virtual address through a physical address space, it it is far more limited than its compute analog.

Tunneling : By tunneling we mean that the payload of a packet is another packet (headers and all).

Like address mapping (and virtual memory) tunneling introduces another address space. However, there are two differences. First, the virtual address space doesn’t have to look anything like the physical. For example, it is possible to have the virtual address space be IPv6 and the physical address space be IPv4. Second, the full address mapping is stored in the packet so that there is no need to create per-flow state within the network to manage the mappings.

Tunneling (or perhaps more broadly encapsulation) is probably the most popular method of doing full network virtualization today. TRILL uses L2 in L2, LISP uses encapsulation, VXLAN (which is LISP as far as I can tell) uses L2 in L3, VCDNI uses L2 in L2, NVGRE uses L2 in L3, etc.

The general approach maps to virtual memory pretty well. The outer header can be likened to a physical address. The inner header can be likened to a virtual address. And often a shim header is included just after the outer header that contains demultiplexing information (the equivalent of a PID). The “page table” consists of on-datapath table table entries which map packets to tunnels.

Take for example an overlay mesh (like NVGRE). The L2 table at the edge that points to the tunnels is effectively mapping from virtual addresses to physical addresses. And there is no reason to limit this lookup to L2, it could provide L2 and L3 in the virtual domain. If a VM moves, this “page table” is updated as it would be in a server of a process is moved.

Because the packets are encapsulated, switches in the “physical domain” only have to deal with the outer header greatly reducing the number of addresses they have to deal with. Switches which contain the “page table” however, have to do lookups in the virtual world (e.g. map from packet to tunnel), map to the physical world (throw a tunnel on the packet) and then forward the packet in the physical world (figure out which port to send the tunneled traffic out on). Note that these are effectively the same steps an MMU takes within a server.

Further, like memory virtualization, tunneling provides nice isolation properties. For example, a VM in the virtual address space cannot address the physical network unless the virtual network is somehow bridged into it.

Also, like hierarchical memory, it is possible for the logical networks to have totally different forwarding stacks. One could be L2-only, the other ipv4, and another ipv6, and all of these could differ from the physical substrate. A common setup has IPv6 run in the virtual domain (for end-to-end addressing), and IPv4 for the physical fabric (where a large address space isn’t needed).

Fortunately, we already have a a multi-level hierarchical topology in the network (in particularly the datacenter). This allows for either the addition of a new layer at the edge which provides a virtual address space without changing any other components, or just changing the components at the outer edge of the hierarchy.

So what are the shortcomings of this approach? The most immediate are performance issues with tunneling, additional overhead overhead in the packet, and the need to maintain the distributed “page table”.

Tunneling performance is no longer the problem it use to be. Even from software in the server, clever tunnel implementations are able to take advantage of LRO and the like and can achieve 10G performance without much CPU. Hardware tunneling on most switching chipsets performs at line rate as well. Header overhead marginally increases transmission delay, and will reduce total throughput if a link is saturated.

Keeping the “page tables” up to date on the other hand is still something the industry is grappling with. Most standards punt on describing how this is done, or even the interface to use to write to the “page table”. Some piggyback on L2 learning (like NVGRE) to construct some of the state but still rely on an out of band mechanism for other state (in the case of NVGRE the VIF to tenantID mapping).

As we’ve said many times in the past, this interface should be standardized if for no other reason than to provide modularity in the architecture. Not doing that would be like limiting a hardware MMU on a CPU to only work with a single Operating System.

Fortunately, Microsoft realizes this need and has opened the interface to their vswitch on Windows Server 8, and XenServer, KVM and other Linux-based hypervisors support Open vSwitch.

For hardware platforms it would also be nice to expose the ability to manage this state. Of course, we would argue that OpenFlow is a good candidate if for no other reason than it has been already used for this purpose successfully.

Wrapping Up ..

If the analogy of memory virtualization in compute holds then encapsulation is the correct way to go about network virtualization. It introduces a new address space that is unrestricted in forwarding model, it takes advantage of hierarchy inherent in the physical topology allowing for address aggregation further up in the hierarchy, and it provides the basic virtualization properties of isolation and transparent mobility.

In the next post of this series, we’ll explore how server virtualization suggests that the way that encapsulation is implemented today isn’t quite right, and some things we may want to do to fix it.

Is OpenFlow/SDN Good at Forwarding?

[A lot of the content of this post was drawn from conversations with Juan Lage, Rajiv Ramanathan, and Mohammad Attar]

Some of the more aggressive buzz around OpenFlow has lauded it as a mechanism for re-implementing networking in total. While in some (fairly fanciful) reality, that could be the case, categorical statements of this nature tend hide practical design trade-offs that exist between any set of technologies within a design continuum.

What do I mean by that? Just that there are things that OpenFlow and more broadly SDN are better suited for than others.

In fact, the original work that lead to OpenFlow was not meant to re-implement all of networking. That’s not to say that this isn’t a worthwhile (if not quixotic) goal. Yet our focus was to explore new methods for datapath state whose management was difficult to do with using the traditional approach of full distribution with eventual consistency.

And what sort of state might that be? Clearly destination-based, shortest path forwarding state can be calculated in a distributed fashion. But there is a lot more state in the datapath beyond that used for standard forwarding (filters, tagging, policy routing, QoS policy, etc.). And there are a lot more desired uses for networks than vanilla destination-based forwarding.

Of course, much of this “other” state is not computed algorithmically today. Rather, it is updated manually, or through scripts whose function more closely resembles macro replacement than computation.

Still, our goal was (and still is) to compute this state programmatically.

So the question really boils down to whether the algorithm needed to compute the datapath state is easily distributed. For example, if the state management algorithm has any of the following properties it probably isn’t, and is therefore a potential candidate for SDN.

  •  It is not amenable to being split up into many smaller pieces (as opposed to fewer, larger instances). The limiting property in these cases is often excessive communication overhead between the control nodes.
  • It is not amenable to running on heterogeneous compute environments. For example those with varying processor speeds and available memory.
  •  It is not amenable to relatively long RTT’s for communication between distributed instances. In the purely distributed case, the upper bound for communicating between any two nodes scales linearly with the longest loop free path.
  • The algorithm requires sophisticated distributed coordination between instances (for example distributed locking or leader election)

There are many examples of algorithms that have these properties which can (and are) used in networking. The one I generally use as an example is a runtime policy compiler. One of our earliest SDN implementations was effectively a Datalog compiler that would take a topologically independent network policy and compile it into flows (in the form of ACL and policy routing rules).

Other examples include implementing global solvers to optimize routes for power, cost, security policy, etc., and managing distributed virtual network contexts.

The distribution properties of most of these algorithms are well understood (and have been for decades).  And so it is fairly straightforward to put together an argument which demonstrates that SDN offers advantages over traditional approaches in these environments. In fact, outside of OpenFlow, it isn’t uncommon to see elements of the control plane decoupled from the dataplane in security, management, and virtualization products.

However, networking’s raison d’être, it’s killer app, is forwarding. Plain ol’ vanilla forwarding. And as we all know, the networking community long ago developed algorithms to do that which distribute wonderfully across many heterogeneous compute nodes.

So, that begs the question. Does SDN provide any value to the simple problem of forwarding? That is, if the sole purpose of my network is to move traffic between two end-points, should I use a trusty distributed algorithm (like an L3 stack) that is well understood and has matured for the last couple of decades? Or is there some compelling reason to use an SDN approach?

This is the question we’d like to explore in this post. The punchline (as always, for the impatient) is that I find it very difficult to argue that SDN has value when it comes to providing simple connectivity. That’s simply not the point in the design space that OpenFlow was created for. And simple distributed approaches, like L3 + ECMP, tend to work very well. On the other hand, in environments where transport is expensive along some dimension, and global optimization provides value, SDN starts to become attractive.

First, lets take a look at the problem of forwarding:

Forwarding to me simply means find a path between a source and a destination in a network topology. Of course, you don’t want the path to suck, meaning that the algorithm should efficiently use available bandwidth and not choose horribly suboptimal hop counts.

For the purposes of this discussion, I’m going to assume two scenarios in which we want to do forwarding: (a) let’s assume that bandwidth and connectivity are cheap and plentiful and that any path to get between two points are roughly the same (b) lets assume none of these properties.

We’ll start with the latter case.

In many networks, not all paths are equal. Some may be more expensive than others due to costs of third party transit, some may be more congested than others leading to queuing delays and loss, different paths may support different latencies or maximum bandwidth limits, and so on.

In some deployments, the forwarding problem can be further complicated by security constraints (all flows of type X must go through middleboxes) and other policy requirements.

Further, some of these properties change in real time. Take for example the cost of transit. The price of a link could increases dramatically after some threshold of use has been hit. Or consider how a varying traffic matrix may affect the queuing delay and available bandwidth of network paths.

Under such conditions, one can start to make an argument for SDN over traditional distributed routing protocols.

Why? For two reasons. First, the complexity of the computation increases with the number of properties being optimized over. It also increases with the complexity in the policy model, for example a policy that operates over source, protocol and destination is going to be more difficult than one that only considers destination.

Second, as the frequency in which these properties change increases, the amount of information the needs to be disseminated increases. An SDN approach can greatly limit the total amount of information that needs to hit the wire by reducing the distribution of the control nodes. Fully distributed protocols generally flood this information as it isn’t clear which node needs to know about the updates. There have been many proposals in the literature to address these problems, but to my knowledge they’ve seen little or no adoption.

I presume these costs are why a lot of TE engines operate offline where you can throw a lot of compute and memory at the problem.

So, if optimaility is really important (due to the cost of getting it wrong), and the inputs change a lot, and there is a lot of stuff being optimized over, SDN can help.

Now, what about the case in which bandwidth is abundant, relatively cheap and any sensible path between two points will do?

The canonical example for this is the datacenter fabric. In order to accommodate workload placement anywhere, datacenter physical networks are often built using non-oversubscribed topologies. And the cost of equipment to build these is plummeting. If you are comfortable with an extremely raw supply channel, it’s possible to get 48 ports of 10G today for under $5k. That’s pretty damn cheap.

So, say you’re building a datacenter fabric, you’ve purchased piles of cheap 10G gear and you’ve wired up a fat tree of some sort. Do you then configure OSPF with ECMP (or some other multipathing approach) and be done with it? Or does it make sense to attempt to use an SDN approach to compute the forwarding paths?

It turns out that efficiently calculating forwarding paths in a highly connected graph, and converging those paths on failure is something distributed protocols do really, really well. So well, in fact, that it’s hard to find areas to improve.

For example, the common approach of using multipathing approximates Valiant load balancing which effectively means the following: if you send a packet to an arbitrary point in the network, and that point forwards it to the destination, then for a regular topology, and any traffic matrix, you’ll be pretty close to fully using the fabric bandwidth (within a factor of two given some assumptions on flow arrival rates).

That’s a pretty stunning statement. What’s more stunning is that it can be accomplished with local decisions, and without requiring per-flow state, or any additional control overhead in the network. It also obviates the need for a control loop to monitor and respond to changes in the network. This latter property is particularly nice as the care and feeding of control loops to prevent oscillations or divergence can add a ton of complexity to the system.

On the other hand, trying to scale a solution using classic OpenFlow almost certainly won’t work. To begin with the use of n-tuples (say per-flow, or even per host/destination pair) will most likely result in table space exhaustion. Even with very large tables (hundreds of thousands) the solution is unlikely to be suitable for even moderately sized datacenters.

Further, to efficiently use the fabric, multipathing should be done per flow. It’s highly unlikely (in my experience) that having the controller participate in flow setup will have the desired performance and scale characteristics. Therefore the multipathing should be done in the fabric (which is possible in a later version of OpenFlow like 1.1 or the upcoming 1.2).

Given these constraints, an SDN approach would probably look a lot like a traditional routing protocol. That is, the resulting state would most likely be destination IP prefixes (so we can take advantage of aggregation and reduce the table requirements by a factor of N over source-destination pairs). Further, multipathing, and link failure detection would have to be done on the switch.

Another complication of using SDN is establishing connectivity to the controller. This clearly requires each switch to run something to bootstrap communication, like a traditional protocol. In most SDN implementations I know of, L3 is used for this purpose. So now, not only are we effectively mimicking L3 in the controller to manage datapath state, we haven’t been able to get rid of the distributed approach potentially doubling the control complexity network wide.


So, does SDN provide value for forwarding in these environments? Given the previous discussion it is difficult to argue in favor of SDN.

Does SDN provide more functionality? Unlikely, being limited to manipulating destination prefixes with multipathing being carried out by switches robs SDN of much of its value. The controller cannot make decisions on anything but the destination. And since the controller doesn’t know a priori which path a flow will take, it will be difficult to add additional table rules without replicating state all over the network (clearly a scalability issue).

How about operational simplicity? One might argue that in order to scale an IGP one would have to manually configure areas which presumably would be done automatically with SDN. However, it is difficult to argue that building a separate control network, or in-band-configuring is any less complex than a simple ID to switch mapping.

What about scale? I’ll leave the details of that discussion to another post. Juan Lage, Rajiv Ramanathan, and I have had a on-again, off-again e-mail discussion comparing that scaling properties of SDN to that of L3 for building a fabric. The upshot is that there are nice proof-points on both sides, but given today’s switching chipsets, even with SDN, you basically end up populating the same tables that L3 does. So any scale argument revolves around reducing the RTT to the controller through a single-function control network, and reducing the need to flood. Neither of these tricks are likely to produce a significant scale advantage for SDN, but they seem to produce some.

So, what’s the take-away?

It’s worth remembering that SDN, like any technology, is actually a point in the design space, and is not necessarily the best option for all deployment environments.

My mental model for SDN starts with looking at the forwarding state, and asking the question “what algorithm needs to run to compute that state”. The second question I ask is, “can that algorithm be easily distributed amongst many nodes with different compute and memory resources”. In many use cases, the answer to this latter question is “not-really”. The examples I provide general reduce to global solvers used for finding optimal solutions with many (often changing) variables. And in such cases, SDN is a shoo in.

However, in the case of building out networking fabrics in which bandwidth is plentiful and shortest path (or something similar) is sufficient, there are a number of great, well tested distributed algorithms for the job. In such environments, it’s difficult to argue the merits of building out a separate control network, adding additional control nodes, and running vastly less mature software.

But perhaps, that’s just me. Thoughts?

Origins and Evolution of OpenFlow/SDN

For those of you who are interested, my keynote at last week’s Open Networking Summit provides some background on OpenFlow and SDN in the form of a historical narrative.  However, the main point of the talk (which doesn’t come across as well as I would have liked) is that it really is the community, and not necessarily the technology, that makes the SDN movement so special.  I believe the technology will work itself out.  However, building a diverse community with strong representation across the networking ecosystem (from ODMs, to customers, and everyone in between) is a very difficult undertaking.  And now that we have such a community let’s be sure to acknowledge its importance and focus on continuing to cultivate and grow it.

And for some only tangentially related trivia, I’ve been tracking down the origins of the term ‘SDN’.  From what I’ve been able to dig up, it was coined by Kate Greene (who had covered software defined radio) while putting together this article.  So, thanks Kate.

OpenFlow’s Beginnings

NetworkWorld recently posted a blog that includes an interview with me (Martin) on OpenFlow’s roots and relevance in the current ecosystem.

While it’s generous to label me as the “inventor” of OpenFlow, the accolade is somewhat misleading. So I wanted to jot off a quick post to clarify things a bit.

To begin with, there  is no sole inventor of OpenFlow.  It’s been the product of many tens of individuals and organizations, over multiple years, and really, it’s still in its infancy and can be expected to evolve drastically going forward.

Regarding the specifics of its origins.  I wrote the first, half-baked, unofficial draft in late 2007 with Nick McKeown (my advisor) as a follow on to research we were doing at Stanford with Scott Shenker, and Justin Pettit (among others). The first contributors to a “full” spec were Ben Pfaff, Justin, Nick, and myself. Ben and Justin did the lion’s share of the work evolving the protocol into something that could actually be used and implemented. And Justin was the primary editor and wrote the bulk of the initial spec text.

Justin, Ben, and I were at Nicira at the time and needed a protocol for remote switch management, so we coordinated with Nick to develop something that could be useful to a wider community.

Within a few months, we handed the spec, along with a working implementation (primarily written by Ben and Justin) to Stanford as there was growing interest in building a community effort around it. Since then, we’ve continued to play a limited role in its development.  However, there have been many very influential contributors since (Rajiv Ramanathan, Jean Tourhilles, Glen Gibb, Brandon Heller, and Ed Crabbe just to name a very few).

So there you have it. The largely uninteresting, and somewhat convoluted origins of OpenFlow.

What Might an SDN Controller API Look Like? (and should we standardize it?)

[This post is written by Teemu Koponen, and Martin Casado. Teemu has architected and been involved in the implementation of multiple widely used OpenFlow controllers.]

Over the last year, there has been some discussion within the OpenFlow and broader SDN community about designing and standardizing a controller API in addition to the switch interface.

To be honest, standardizing a controller API sounds like a poor idea. The controller is software, not a protocol. To our knowledge there are 9 controllers in the wild today, and more than half of them are open source. If the community wants an open software platform, it should divert resources from arguing over standards to building open source software.

(yeah ok, this comic is only tangentially relevant … )

In addition to being a waste of time, premature standardization could be detrimental to the SDN effort. Minimally, it will unnecessarily constrain a fledgling software ecosystem. Software innovation should come through development and experience with deployment, not through consensus built on prophecies.

Further, it’s not clear how to design and standardize an interface to a (non-existent) large software system out of whole cloth. Even Unix which started in the early 70s did not get standardized until the mid/late 80s.

For SDN in particular, there is really very little experience building controllers systems in the wild today. We (the authors) have built four controllers, three which have seen production use, and two of which have multiple products built on them from different organizations. And still, we would certainly not argue for standardizing a particular interface. Hell, we’re not even clear if there is a one-size-fits-all interface for controllers targeted at drastically different deployment environments. In fact, our experience suggests the contrary.

In our opinion, if an interface is going to be standardized, it should follow the software model of standardization. Either (a) take a system that has demonstrated broad success and generality over many years and call it a standard, or (b) during the design phase of a particular software project, have the *developers* work to create a standard API with the appropriate community feedback (which is probably product and system developers).

But most importantly, and it really bears repeating, we all win if we let an unbridled software ecosystem flourish within the controller space.

OK, so while we think any discussion about standardizing a controller interface is extremely premature, the question of “what makes a good controller API” is absolutely crucial to SDN and very much worth discussing. However, there has been relatively little discourse on controller APIs. And that’s exactly what we’ll be focusing on in rest of this post. We’ll start by providing a brief look at common controller designs, and then take a closer look at Onix, the controller platform we work on.

Common Controller Types

If you survey the existing controller landscape there appears to be three broad classes of API.

Single Purpose Controllers:  These controllers are built for a specific function. So while they may use some common code (like an OpenFlow library) to communicate with the switch, they are otherwise not built with different uses in mind. Or put another way, they lack a control-logic agnostic interface for extension, so there really is little more to talk about.

OpenFlow Controllers:  Yeah, that is probably a confusing name, but bear with us. A number of controllers provide a high-level interface to OpenFlow (specifically) and some infrastructure for hosting one or more control “applications”. These are general purpose platforms meant for extension. However, the interface is built around the OpenFlow protocol itself, meaning the interface is a thin wrapper around each message, without offering higher-level abstractions. Therefore, as the protocol changes (e.g., a new message is added), the API necessarily reflects this change. Most controllers (including one we’ve written) fall in this camp.

In our experience, the primary problem with this tight coupling is that it is challenging to evolve any aspect of the control application, switch protocol, or controller state distribution (assuming the controllers form a cluster) without refactoring all the layers of the software stack in tandem.

General SDN Controllers: We’re sort of making these names up as we go along, but a general SDN controller is one in which the control application is fully decoupled from both the underlying protocol(s) communicating with the switches, and the protocol(s) exchanging state between controller instances (assuming the controllers can run in a cluster).

The decoupling is achieved by turning the network control problem into a state management problem, and only exposing the state to be managed to the application, and not the mechanisms by which it arrived.

Applications operate over control traffic, flow table state, configuration state, port state, counters, etc. However, how this data is pulled into the controller, and how any changes are pushed back down to the switch is not the application’s business. It can just happily modify network state as if it were local and the platform will take care of the state dissemination.

The platform we’ve been working on over the last couple of years (Onix) is of this latter category. It supports controller clustering (distribution), multiple controller/switch protocols (including OpenFlow) and provides a number of API design concessions to allow it to scale to very large deployments (tens or hundreds of thousands of ports under control). Since Onix is the controller we’re most familiar with, we’ll focus on it.

So, what does the Onix API look like? It’s extremely simple. In brief, it presents the network to applications as an eventually consistent graph that is shared among the nodes in the controller cluster. In addition, it provides applications with distributed coordination primitives for coordinating nodes’ access to that graph.

Conceptually the interface is as straightforward as it sounds, however understanding its use in practice, especially in a distributed setting, bears more discussion.

So, digging a little further …

The Network as an Eventually Consistent Graph

In Onix, the physical network is represented as a graph for the control application. Elements in the graph may represent physical objects such as switches, or their subparts, such as physical ports, or lookup tables.  And it may contain logical entities such as logical ports, tunnels, BFD configuration, etc.

Control applications operate on the graph to find elements of interest and read and write to them. They can register for notifications about state changes in the elements, and they can even extend the existing elements to hold network state specific to a particular control logic.

We happen to call this graph “the NIB” (for Network Information Base).

The NIB abstracts away both the protocols for state synchronization between the controller and the switches and the protocols used for state synchronization between controllers (discussed further below).

Crummy diagram of Onix showing the major components ...

Regarding switch state, the control applications operate on the state directly, independent of how it is synchronized with the switch. For example, control logic may traverse the NIB looking for a particular switch, modify some forwarding table entries, and then place a trigger to receive notification on any future changes (such as a port status change event).

Doing this within a single controller node is pretty straightforward. However, for resilience and scale, most production deployments will want to run with multiple controllers. Hence, the platform needs to support clustering.

Within Onix, the NIB is the central mechanism for distributed state sharing. That is, the NIB is actually a shared datastructure among the nodes in the cluster. If a node updates the NIB, eventually that update will propagate to the other nodes interested in that portion of the graph (assuming no network failures).

Under this model, the control logic remains unaware of the specifics of the distribution mechanisms for controller-to-controller state. However, it is the responsibility of the application logic to coordinate which controller instance is responsible for which portion of the graph at any given time. To aid with this, Onix includes a number of standard distributed coordination primitives such as distributed locking, leader election, etc. Using them, control applications can safely partition the work amongst the cluster.

Because the control logic is decoupled from the distribution mechanisms, the Onix may be configured to use different distributed datastores to share different parts of the graph. For more dynamic state, it may use high-performance, memory-only, key/value store, whereas for more static information (such as admin configured parts of the graph) it may prefer storage that is strongly consistent and provides durability.

Of course, a design like this pushes a lot of complexity to the application. For example, the application is responsible for all distributed coordination (though tools for doing so are provided), and if the application chooses to use an eventually consistent storage back-end, it must tolerate eventual consistency in the NIB (including conflicts, data disappearing, etc.)

An alternative approach would have been to constrain the distribution model to something simple (like strongly consistent updates to all data) which would have greatly simplified the API. However, this would come at a huge cost to scalability.

Which begs the question …

Is This the Right Interface?

Perhaps for some environments.  Almost certainly not for others.  There have been multiple control applications built on Onix, and it is used in large production deployments in the data centers, as well as in the access and core networks. However, it is probably too heavyweight for smaller networks (the home or small enterprise), and it is certainly too complex to use as a basic research tool.

The NIB is also a very low-level interface. Clearly, there is a lot of scaffolding one can build on top to aid in application development. For example, routing libraries, packet classification engines, network discovery components, configuration management frameworks, and what not.

And that is the beauty of software. If you don’t like an API, you fix it, build on it, or you build your own platform. Sure, as a platform matures, binary compatibility and stability of APIs becomes crucial, but that is more a matter of versioning than a priori standardization and design.

So our vote is to keep standards away from the controller design, to support and promote multiple efforts, and to let the ecosystem play out naturally.

Openflow is the answer! (now .. what was the question again?)

Wow, over the last couple of weeks there has been an escalation in the confusion around Openflow. The problem is on both sides of the metaphorical isle.   Those behind the hype engine are spinning out of control with fanciful tales about bending the laws of physics (“Openflow will keep the icecream in your refrigerator cold during a power outage!”). And those opposed to OpenFlow (for whatever reason) are using the hype to build non-existant strawmen to attack (“Openflow cannot bend the laws of physics, therefore it isn’t useful”).

So for this post, I figured I’d go through some of the dubious claims and bogus strawmen and try and beat some sense into the discussion.  Each of the claims I talk to below were pulled from some article/blog/interview I ran into over the last few weeks.

Claim: “Openflow provides a more powerful interface for managing the network than exists today”

Patently false. APIs expose functionality, not create it. Openflow is a subset of what switching chipsets offer today. If you want more power, sign an NDA with Broadcom and write to their SDK directly.

What Openflow does attempt to do is provide a standard interface above the box. If you believe that SDN is a powerful concept (and I do), then you do need some interface that is sufficiently expressive and hopefully widely supported.

This is almost not worth saying, but clearly Openflow is more expressive than a CLI. CLIs are fine for manual configuration, but they totally blow for building automated systems. The shortcomings are blindingly obvious, for example traditional CLIs have no clear data schema or state semantics and they change all the time because the designers are trying to solve an HCI not an API versioning problem. However, I don’t think there is any real disagreement on this point.

Claim: “Openflow will make hardware simpler”

I highly doubt this. Even with Openflow, a practical network deployment needs a bunch of stuff. It needs to do lookups over common headers (minimally L2/L3/L4), it may need hash-based multi-pathing, it may need port groups, it may need tunneling, it may need fine-grained link timers. If you grab a chip from Broadcom, I’m not exactly sure what you’d throw away if using Openflow.

What Openflow may discourage is stupid shit like creating new tags as an excuse for hardware stickyness (“you want new feature X? It’s implemented using the Suxen tag(tm) and only our new chip supports it.”). This is because an Openflow-like interface can  effectively flatten layers. For example, I don’t have to use the MPLs tag for MPLS per se. I can use it for network virtualization (for instance) or for identifying aggregates that correspond to the same security group. However, that doesn’t mean hardware is simpler. Just that the design isn’t redundant to fulfill a business need. (Post facto note:  There are some great points regarding this issue in the comments.  We’ll work to spin this out into a separate post.)

Or more to the point. There are an awful lot of fields and bits in a header. And an awful lot of lookup capacity in a switch. If you shuffle around what fields mean what, you can almost certainly do what you want without having to change how the hardware parses the packet.

Claim: “Openflow will commoditize the hardware layer”

This is total Kaiser Soze-esque nonsense (yup, that’s the picture at the top of the post). To begin with, networking hardware is already on its way to horizontal integration. The reason that Arista is successful has nothing to do with Openflow (they strictly don’t support it and never have), it has nothing to do with SDN, it’s because Ken Duda, and Andy Bechtolsheim are total bad-asses and have built a great product. Oh, and because merchant silicon has come of age, meaning you don’t need need an ASIC team to be successful in the market.

The difference between OpenFlow and what Arista supports is that with Openflow you choose to build to an industry standard interface, and not a proprietary one. Whether or not you care is, of course, totally up to you.

So Openflow does provide another layer of horizontal integration and once the ecosystem develops that it is, in my opinion, a very good thing. But the ecosystem is still embryonic, so it will take some time before the benefits can be realized.

I think the power of merchant silicon and the rise of the “commodity” fabric are a far greater threat to the crusty scale-up network model. Oddly, Openflow has become a relatively significant distraction. That said, as this area matures, creating a horizontal layer “above the box” will grow in significance.

Claim: “Openflow does not map efficiently to existing hardware”

True! Older versions of Openflow did not map well to existing silicon chips. This is a *major* focus of the existing design effort; to make Openflow more flexible. As with any design effort, the trade-off is between having a future proof roadmap, and having a practical design that can have tangible benefits now. So we’re trying to thread the needle practically but with some foresight. It will take a little patience to see how this evolves.

Claim: “Openflow reduces complexity”

This is a meaningless statement. Openflow is like USB, it doesn’t “do” anything new. A case can be made for SDN to reduce complexity, but it does this by constraining the distribution model and providing a development platform that doesn’t suck. No magic there.

Claim: “Openflow will obviate the need for traditional distributed routing protocols”

I hear this a lot, but I just don’t buy it. There are certainly those within the Openflow community who disagree with me, so please take this for what it’s worth (very little …)

In my opinion, traditional distributed protocols are very good at what they do, scaling, routing packets in complex graphs, converging, etc. If I were to build a fabric, I’d sure as hell do it with an distributed routing protocol (again, not everyone agrees on this point). What traditional protocols suck at is distributing all the other state in the network. What state is that? If you look at the state in a switching chip, a lot of it isn’t traditionally populated through distributed protocols. For example, the ACL tables, the tunnels, many types of tags (e.g. VLAN), etc.

Going forward, it may be the case the distributed routing protocols get pulled into the controller. Why? Because controllers have much higher cpu density and therefore can run the computation for multiple switches. A multicore server will kick the crap out of the standard switch management CPU. Like I described in a previous post, this is no different than the evolution from distance vector to link state, it’s just taking it a step further. However, the protocol certainly is still distributed.

Claim: “Openflow cannot scale because of X”

I’ve addressed this at length in a previous post.  Still, I’m going to have to call Doug Gourlay out.  And I quote …

“I honestly don’t think it’s going to work,” Arista’s Gourlay said. “The flow setup rates on Stanford’s largest production OpenFlow network are 500 flows per second. We’ve had to deal with networks with a million flows being set up in seconds. I’m not sure it’s going to scale up to that.”

Other than being a basic logical fallacy (Stanford’s largest production Openflow network has absolutely nothing to do with the scaling properties of the Openflow or SDN architecture),  there appears to be an implicit assumption that flow-setups are in some way a limiting resource.  This clearly isn’t the case if they don’t leave the datapath which is a valid (and popular) deployment model for Openflow.  Doug’s a great guy, at a great company, but this is a careless statement, and it doesn’t help further the dialog.

Claim: “Openflow helps virtualize the network”

Again, this is almost a meaningless statement. A compiler helps virtualize the network if it is being used to write code to that effect. The fact is, the pioneers of network virtualization, such as VMWare, Amazon, and Cisco, don’t use Openflow.

So yes, you can use Openflow for virtualization (that’s a help, I guess … ). And open standards are a good thing. But no, you certainly don’t need to.

OK, that’s enough for now.

To be very clear, I’m a huge fan of Openflow. And I’m a huge fan of SDN.  Yet, neither is a panacea, and neither is a system or even a solution. One is an effort to provide a standardized interface for building awesome systems (Openflow). The other is a philosophical model for how to build those systems (SDN). It will be up to the systems themselves to validate the approach.

The Scaling Implications of SDN

In my experience, it’s rare to have a discussion about SDN without someone getting their panties in a bunch over scaling. Nevermind that SDN networks have already been implemented that scale to tens of thousands of ports. And nevermind that distributed systems are built today that manage hundreds of thousands of entities and petabytes of data. It remains a perennial point of contention.

So here is a hand-wavy – probably not totally convincing – but hopefully offers some intuitive understanding – attempt at an explanation of the scaling limits of SDN.

But first! Some history: Unfortunately, I think I am partially to blame for the sorry state of affairs. In some of the earliest writeups we would describe an SDN controller as “logically centralized”. The intended meaning of this was something along the lines of “it has a centralized programmatic model, but was really distributed”.

Now, what does “logically centralized” actually mean? Nothing. It’s a nonsense term and the result of sloppy thinking. Either you’re centralized, or you’re distributed (thanks to Teemu Koponen for pointing that out). So by logically centralized, I guess we really meant “distributed”, and that is what we should have said. Whoops!

Clearly, a centralized controller cannot scale. Perhaps it can handle a really large network. Hell, I’ve heard claims of single node SDN networks that scale to thousands of switches and tens of thousands of ports. Whether or not that’s true, at some point either CPU or memory will run out.

However, controllers can (and should!) be distributed. And in that case, I would argue that there are no inherent bottlenecks. Trivially, you can distribute controllers such that the total amount of available CPU and memory for the control path is equivalent to the sum total of management cpu’s on the switches themselves. Controllers can distribute state amongst themselves like network nodes do today.

Still not convinced? Let’s deconstruct this …

Latency: How might SDN affect latency? That depends on how it is used. In some implementations, datapath traffic never leaves the hardware. In which case, datapath latency should be identical to traditional networking. Other implementations will forward the first packet of a flow up to the controller and then cache the forward decisions on the datapath. In this case, the flow setup will have to pay an RTT to the controller. The following paragraph will discuss how much that might be.

Whether or not data traffic is sent to the controller, control traffic often is (for example BGP from a neighboring network, or the signalling of a port status change). In this case the additional overhead is the RTT to the controller + the cost of the computation to determine what to do next. How much that is will vary wildlyby deployment. It’s possible to keep this sub millisecond (200-300us is not unusual). I’ve seen as little as 70us claimed, but I doubt that could be maintained under any real load.

So you may not want to pay this latency per flow (though in many cases it probably doesn’t matter). However, there isn’t any appreciable overhead for responding to network events. And as I’ll describe later, it probably will reduce the total time needed to disseminate control traffic.

Datapath Scaling: The datapath is where the original OpenFlow design was broken. This is because OpenFlow used the abstraction of a switch datapath as a single TCAM. Clearly, this can be problematic for some forwarding rulesets. Assume, for example, that you are building an SDN application which would integrate with BGP, and map prefixes to tags with QoS policies. Implemented this within a single flow table would require (RIB * QoS rules) entries. Ouch.

However, later versions of OpenFlow support multiple tables which take care of this Cartesian explosion nightmare. Rather than having to multiply all the rules together, they can be placed in separate tables limiting the maximum size requirements of a single table (by a lot!).

Still, many modern implementations of SDN dispense with the abstraction of a flow altogether and program the switch hardware tables directly. I think we’ll see a lot more of this going forward.

Convergence on failure:  Alright, what happens on link (or node) failure? Traditionally, when a link fails, the information of the failed link is propagated through flooding. Flooding, of course, scales linearly with the distance of the longest loop free path. If you compare this with SDN, instead of being flooded the information goes to one of the controllers, which sends the updated routing tables to the affected switches.

A few things to note:

  • With SDN, the controller should only be updating the switches whose tables have actually changed.
  • With SDN, the total cost is link propagation to the controller + cost of computation + time to update to affected switches.
  • In a distributed SDN case, an implementation may distributed the link update among the controller clusters which then do the computation for their piece of the network.

Unfortunately for SDN, when in-band control is in play, things are a lot uglier. Inband control basically means that the datapath is used both for SDN control traffic and data. So a failure on the network may affect connectivity to the controller which must be patched up somehow before the controller can fix the problem for the rest of the network.

Patching up the control channel, in this case, is generally done with an IGP, so we’re strictly worse off than if we just used an IGP to begin with. Suck.

So bottom line, if convergence time is important, out of band control should be used. It’s simple to build a cheap, reliable out of band control network using traditional gear (the amount of traffic is minimal).

Computation and Memory: It’s easy to see that the latest multi-core server available from your friendly  server vendor can beat the pants off of whatever crap embedded cpu you’ll find in most networking gear (800mhz-1Ghz is common).

I think Nick McKeown said it best.  If we look historically, early routing protocols used distance vector routing algorithms (like RIP) which have very low computational requirements, but suck in a bunch of other ways (convergence times, split infinity etc.). As cpu’s became more powerful, it was possible for each node to compute single source shortest path to all destinations on each failure, which is the standard approach used today.

Calculating more routes (e.g. all pairs shortest path, or multiple source to all destinations) on stronger CPUs really  follows as a natural evolution. The clear “next step”. And a beefy server with lots of memory can compute *a lot* of routes, and quickly. Further, we already know how to distribute the algorithms.

Some parting notes: The goal of SDN is not to scale a simple network fabric. This is something that distributed routing algorithms do just fine. The goal of SDN is to provide a design paradigm which allows the creation of more sophisticated control paradigms: TE, security, virtualization, service interposition, etc. Perhaps that means just manipulating state at the edge of the network while traditional protocols are used in the core. Or perhaps that means using SDN throughout. In eithercase, SDN architecturally should be able to handle the scale. Remember, it’s the same amount of state that’s being passed around.

I think the right question to ask is not “does it scale” but “is it worth the hassle of building networks this way”? Building an SDN network requires some thought. In addition to the physical network nodes, a control network (probably), and controller servers need to be thrown into the mix.

So, is it worth it?

That, is up to you to decide.   Ultimately the answer will rely on what gets built using SDN and who wants to run it.  I think it’s still too early in the development cycle to derive a meaningful prediction.

Presentation on SDN and High-Level Network Abstractions

If you haven’t heard the talk, or seen the slides, these are a must.  Professor Shenker, one of the creators of SDN, put together this talk about why programming to high-level abstractions is better than protocol design, and why SDN helps achieve it.

What OpenFlow is (and more importantly, what it’s not)

There has been a lot of chatter on OpenFlow lately. Unsurprisingly, the vast majority is unguided hype and clueless naysaying. However, there have also been some very thoughtful blogs which, in my opinion, focus on some of the important issues without elevating it to something it isn’t.

Here are some of my favorites:

I would suggesting reading all three if you haven’t already.

In any case, the goal of this post is to provide some background on OpenFlow which seems to be missing outside of the crowd in which it was created and has been used.

Quickly, what is OpenFlow? OpenFlow is a fairly simple protocol for remote communication with a switch datapath. We created it because existing protocols were either too complex or too difficult to use for efficiently managing remote state. The initial draft of OpenFlow had support for reading and writing to the switch forwarding tables, soft forwarding state, asynchronous events (new packets arriving, flow timeouts), and access to switch counters. Later versions added barriers to allow synchronous operations, cookies for more robust state management, support for multiple forwarding tables, and a bunch of other stuff I can’t remember off hand.

Initially, OpenFlow attempted to define what a switch datpath should look like (first as one large TCAM, and then multiple in succession). However, in practice this is often ignored as hardware platforms tend to be irreconcilably different (there isn’t an obvious canonical forwarding pipeline common to the chipsets I’m familiar with).

And just as quickly, what isn’t OpenFlow? It isn’t new or novel, there have been many similar protocols proposed. It isn’t particularly well designed (the initial designers and implementors were more interested in building systems than designing a protocol), although it is continually being improved. And it isn’t a replacement for SNMP or NetConf, as it is myopically focussed on the forwarding path.

So, it’s reasonable to ask, why then is OpenFlow needed? The short answer is, it isn’t. In fact, focussing on OpenFlow misses the point of the rational behind its creation. What *is* needed are standardized and supported APIs for controlling switch forwarding state.

Why is this?

Consider system design in traditional networking versus modern distributed software development.

Traditional Networking (Protocol design):

Adding functionality to a network generally reduces to creating a new protocol. This is a fairly low-level endeavor often consisting of worrying about the wire format and state distribution algorithm (e.g. flooding). In almost all cases, protocols assume a purely distributed model which means it should handle high rates of churn (nodes coming and going), heterogenous node resources (e.g. cpu), and almost certainly means the state across nodes is eventually consistent (often the bane of correctness).

In addition, to follow the socially acceptable route, somewhere during this process you bring your new protocol to a standards body stuffed by the vendors and spend years (likely) arguing with a bunch of lifers on exactly what the spec should look like. But I’ll ignore that for this discussion.

So, what sucks about this? Many things. First, the distribution model is fixed. It’s *much* easier to build a system where you can dictate the distribution model (e.g. a tightly coupled cluster of servers). It involves a low-level fetish with on-the-wire data formats, and there is no standard way to implement your protocol on a given vendors platform (and even if you could, it isn’t clear how you would handle conflicts with the other mechanisms managing switch state on the box).

Developing Distributed Software:

Now, lets consider how this compares to building a modern distributed  system. Say, for example, you wanted to build a system which abstracts a large datacenter consisting of hundreds of switches and tens of thousands of access ports as a single logical switch to the administrator. The administrator should be able to do everything they
can do on a single switch, allocate subnets, configure ACLs and QoS policy, etc. Finally, assume all of the switches supported something like OpenFlow.

The problem then reduces to building a software system that manages the global network state. While it may be possible to implement everything on a single server, for reliability (and probably scale) it’s more likely you’d want to implement this as a tightly coupled cluster of servers. No problem, we know how to do that, there are tons of great tools available for building scale-out distributed systems.

So why is building a tightly coupled distributed system preferable to designing a new protocol (or more likely, set of protocols)?

  • You have fine grained control of the distribution mechanism (you don’t have to do something crude like send all state to all nodes)
  • You have fine grained control of the consistency model. You can use existing (high-level) software for state distribution and coordination (for example, ZooKeeper and Cassandra)
  • You don’t have to worry about low-level data formatting issues. Distributed data stores and packages like thrift and protocol buffers handle that for you.
  • You (hopefully) reduce the debugging problem to a tightly coupled cluster of nodes.

And what might you lose?

  • Scale? Seeing that systems of similar design are used today to orchestrate hundreds of thousands of servers and petabytes of data, if built right, scale should not be an issue.  There are OpenFlow-like solutions deployed today that have been tested to tens of thousands of ports.
  • Multi-vendor support? If the switches support a standard like OpenFlow (and can forward Ethernet) then you should be able to use any vendors’ switch.
  • However, you most likely will not have interoperability at the controller level (unless a standardized software platform was introduced).

It is important to note that in both approaches (traditional networking and building a modern distributed system), the state being managed is the same. It is the forwarding and configuration state of the switching chips network wide. It’s only the method of it’s management that has changed.

So where does OpenFlow fit in all this? It’s really a very minor, unimportant, easy to replace, mechanism in a broader development paradigm. Whether it is OpenFlow, or a simple RPC schema on protocol buffers/thrift/json/etc. just doesn’t matter. What does matter is that the protocol has very clear data consistency semantics so that it can be used to build a robust, remote state management system.

Do I believe that building networks in this manner is a net win? I really do. It may not be suitable for all deployment environments, but for those building networks as systems, it clearly has value.

I realize that the following is a tired example. But it is nevertheless true. Companies like Google, Yahoo, Amazon, and Facebook have changed how we think about infrastructure through kick-ass innovation in compute, storage and databases. The algorithm is simple: bad-ass developers + infrastructure = competitive advantage + general awesomeness.

However, lacking a consistent API across vendors and devices, the network has remained, in contrast, largely untouched. This is what will change.

Blah, blah, blah, Is it actually real? There are a number of production networks outside of research and academia running OpenFlow (or similar mechanism) today. Generally these are built using whitebox hardware since the vendor supported OpenFlow supply chain is extremely immature. Clearly it is very early days and these deployments are limited to the secretive, thrill-seeking fringe. However, the black art is escaping the cloister and I suspect we’ll hear much more about these and new deployments in the near future.

What does this mean to the hardware ecosystem? Certainly less than the tech pundits would like to claim.

Minimally it will enable innovation for those who care, namely those who take pride in infrastructure innovation, and the many startups looking to offer products built on this model.

Maximally? Who the hell knows. Is there going to be an industry cataclysm in which the last companies standing are Broadcom, an ODM (Quanta), an OEM (Dell) and a software company to string them all together? No, of course not. But that doesn’t mean we won’t see some serious shifts in hegemony in the market over the next 5-10 years. And my guess is that today’s incumbents are going to have a hell of a time holding on. Why? for the following two reasons:

  1. The organizational white blood cells are too near sighted (or too stupid) to allow internal efforts build the correct systems. For example, I’m sure “overlay” is a four letter word in Cisco because it robs hardware of value. This addiction to hardware development cycle and margins is a critical hindrance.So while these companies potter around with their 4-7 year ASIC design cycles, others are going to maximize what goes in software and start innovating like Lady Gaga with a sewing machine.
  2. The traditional vendors, to my knowledge, just don’t have the teams to build this type of software. At least not today. This can be bought or built, but it will take time. And it’s not at all clear that the whiteblood cells will let that happen.

So there you have it. OpenFlow is a small piece to a broader puzzle. It has warts, but with community involvement is getting improved, and much more importantly, it is being used.