OVS-on-Hyper-V Design Document
==============================
-There has been an effort in the recent past to develop the Open vSwitch (OVS)
-solution onto multiple hypervisor platforms such as FreeBSD and Microsoft
-Hyper-V. VMware has been working on a OVS solution for Microsoft Hyper-V for
-the past few months and has successfully completed the implementation.
-
-This document provides details of the development effort. We believe this
-document should give enough information to members of the community who are
-curious about the developments of OVS on Hyper-V. The community should also be
-able to get enough information to make plans to leverage the deliverables of
-this effort.
-
-The userspace portion of the OVS has already been ported to Hyper-V and
-committed to the openvswitch repo. So, this document will mostly emphasize on
-the kernel driver, though we touch upon some of the aspects of userspace as
-well.
+There has been a community effort to develop Open vSwitch on Microsoft Hyper-V.
+In this document, we provide details of the development effort. We believe this
+document should give enough information to understand the overall design.
+
+The userspace portion of the OVS has been ported to Hyper-V in a separate
+effort, and committed to the openvswitch repo. So, this document will mostly
+focus on the kernel driver, though we touch upon some of the aspects of
+userspace as well.
We cover the following topics:
1. Background into relevant Hyper-V architecture
physical NIC on the Hyper-V extensible switch is attached via a port. Each port
is on both the ingress path and the egress path of the switch. The ingress path
is used for packets being sent out of a port, and egress is used for packets
-being received on a port. By design, NDIS provides a layered interface, where
-in the ingress path, higher level layers call into lower level layers, and on
-the egress path, it is the other way round. In addition, there is a object
-identifier (OID) interface for control operations Eg. addition of a port. The
-workflow for the calls is similar in nature to the packets, where higher level
-layers call into the lower level layers. A good representational diagram of
-this architecture is in [4].
+being received on a port. By design, NDIS provides a layered interface. In this
+layered interface, higher level layers call into lower level layers on the
+ingress path. On the egress path, it is the other way round. In addition, there
+is an object identifier (OID) interface for control operations, e.g., the
+addition of a port. The workflow for these calls is similar in nature to that
+of the packets: higher level layers call into the lower level layers. A good
+representational diagram of this architecture is in [4].
Windows Filtering Platform (WFP)[5] is a platform implemented on Hyper-V that
provides APIs and services for filtering packets. WFP has been utilized to
| |
+------+ +--------------+ | +-----------+ +------------+ |
| | | | | | | | | |
- | OVS- | | OVS | | | Virtual | | Virtual | |
- | wind | | USERSPACE | | | Machine #1| | Machine #2 | |
- | | | DAEMON/CTL | | | | | | |
+ | ovs- | | OVS- | | | Virtual | | Virtual | |
+ | *ctl | | USERSPACE | | | Machine #1| | Machine #2 | |
+ | | | DAEMON | | | | | | |
+------+-++---+---------+ | +--+------+-+ +----+------++ | +--------+
- | DPIF- | | netdev- | | |VIF #1| |VIF #2| | |Physical|
- | Windows |<=>| Windows | | +------+ +------+ | | NIC |
+ | dpif- | | netdev- | | |VIF #1| |VIF #2| | |Physical|
+ | netlink | | windows | | +------+ +------+ | | NIC |
+---------+ +---------+ | || /\ | +--------+
-User /\ | || *#1* *#4* || | /\
-=========||=======================+------||-------------------||--+ ||
-Kernel || \/ || ||=====/
- \/ +-----+ +-----+ *#5*
+User /\ /\ | || *#1* *#4* || | /\
+=========||=========||============+------||-------------------||--+ ||
+Kernel || || \/ || ||=====/
+ \/ \/ +-----+ +-----+ *#5*
+-------------------------------+ | | | |
| +----------------------+ | | | | |
| | OVS Pseudo Device | | | | | |
- | +----------------+-----+ | | | | |
- | | | I | | |
+ | +----------------------+ | | | | |
+ | | Netlink Impl. | | | | | |
+ | ----------------- | | I | | |
| +------------+ | | N | | E |
| | Flowtable | +------------+ | | G | | G |
| +------------+ | Packet | |*#2*| R | | R |
Figure 2 shows the various blocks involved in the OVS Windows implementation,
along with some of the components available in the NDIS stack, and also the
virtual machines. The workflow of a packet being transmitted from a VIF out and
-into another VIF and to a physical NIC is also shown. New userspace components
-being added as also shown. Later on in this section, we’ll discuss the flow of
-a packet at a high level.
+into another VIF and to a physical NIC is also shown. Later on in this section,
+we will discuss the flow of a packet at a high level.
The figure gives a general idea of where the OVS userspace and the kernel
components fit in, and how they interface with each other.
sub-modules/functionality. Details of each of these sub-components in the
kernel are contained in later sections:
* Interfacing with the NDIS stack
+ * Netlink message parser
+ * Netlink sockets
* Switch/Datapath management
* Interfacing with userspace portion of the OVS solution to implement the
- necessary ioctls that userspace needs
+ necessary functionality that userspace needs
* Port management
* Flowtable/Actions/packet forwarding
* Tunneling
* Interface between the userspace and the kernel module.
* Event notifications are significantly different.
* The communication interface between DPIF and the kernel module need not be
- implemented in the way OVS on Linux does.
+ implemented in the way OVS on Linux does. That said, it would be
+ advantageous to have a similar interface to the kernel module for reasons of
+ readability and maintainability.
* Any licensing issues of using Linux kernel code directly.
Due to these differences, it was a straightforward decision to develop the
datapath for OVS on Hyper-V from scratch rather than porting the one on Linux.
-A re-development focussed on the following goals:
+A re-development focused on the following goals:
 * Adhere to the existing requirements of the userspace portion of OVS (such as
- ovs- vswitchd), to minimize changes in the userspace workflow.
+ ovs-vswitchd), to minimize changes in the userspace workflow.
* Fit well into the typical workflow of a Hyper-V extensible switch forwarding
extension.
The userspace portion of the OVS solution is mostly POSIX code, and not very
-Linux specific. Majority of the code has already been ported and committed to
-the openvswitch repo. Most of the daemons such as ovs-vswitchd or ovsdb-server
-can run on Windows now. One additional daemon that has been implemented is
-called ovs-wind. At a high level ovs-wind manages keeps the ovsdb used by
-userspace in sync with the kernel state. More details in the userspace section.
+Linux specific. The majority of the userspace code does not interface directly
+with the kernel datapath and was ported independently of the kernel datapath
+effort.
As explained in the OVS porting design document [7], DPIF is the portion of
-userspace that interfaces with the kernel portion of the OVS. Each platform can
-have its own implementation of the DPIF provider whose interface is defined in
-dpif-provider.h [3]. For OVS on Hyper-V, we have an implementation of DPIF
-provider for Hyper-V. The communication interface between userspace and the
-kernel is a pseudo device and is different from that of the Linux’s DPIF
-provider which uses netlink. But, as long as the DPIF provider interface is the
-same, the callers should be agnostic of the underlying communication interface.
+userspace that interfaces with the kernel portion of the OVS. The interface
+that each DPIF provider has to implement is defined in dpif-provider.h [3].
+Though each platform is allowed to have its own implementation of the DPIF
+provider, community feedback indicated that code should be shared whenever
+possible. Thus, the DPIF provider for OVS on Hyper-V shares code with the
+DPIF provider on Linux. This interface is implemented in dpif-netlink.c,
+formerly dpif-linux.c.
+
+We'll elaborate more on the kernel-userspace interface in a dedicated section
+below. Here it suffices to say that the DPIF provider implementation for
+Windows is netlink-based and shares code with the Linux one, as sketched
+below.
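+
+To give an idea of what this shared interface looks like, the following is an
+illustrative, much-abridged provider table. The member names here are
+simplified stand-ins, not the actual 'struct dpif_class' layout from
+dpif-provider.h:
+
+    #include <stdbool.h>
+    #include <stddef.h>
+
+    /* Illustrative only; the real 'struct dpif_class' defines many more
+     * operations. Each platform fills this table with its callbacks. */
+    struct dpif_class_sketch {
+        const char *type;        /* Datapath type, e.g. "system". */
+        int (*open)(const char *name, bool create);
+        int (*flow_put)(const void *key, size_t key_len,
+                        const void *actions, size_t actions_len);
+        int (*execute)(const void *packet, size_t len);
+    };
+
+    /* On both Linux and Windows, the same netlink-based table is
+     * registered, so code above the DPIF layer is platform-agnostic. */
+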
2.a) Kernel module (datapath)
-----------------------------
physical adapters are connected as external adapters to the extensible switch.
When the OVS switch extension registers itself as a filter driver, it also
-registers callbacks for the switch management and datapath functions. In other
-words, when a switch is created on the Hyper-V root partition (host), the
+registers callbacks for the switch/port management and datapath functions. In
+other words, when a switch is created on the Hyper-V root partition (host), the
extension gets an activate callback upon which it can initialize the data
structures necessary for OVS to function. Similarly, there are callbacks for
when a port gets added to the Hyper-V switch, and an External Network adapter
As shown in the figures, an extensible switch extension gets to see a packet
sent by the VM (VIF) twice - once on the ingress path and once on the egress
path. Forwarding decisions are to be made on the ingress path. Correspondingly,
-we’ll be hooking onto the following interfaces:
+we will be hooking onto the following interfaces:
 * Ingress send indication: intercept packets for performing flow based
   forwarding. This includes straight forwarding to output ports. Any packet
   modifications that need to be performed are done here either inline or by
Interfacing with OVS userspace
------------------------------
-We’ve implemented a pseudo device interface for letting OVS userspace talk to
+We have implemented a pseudo device interface for letting OVS userspace talk to
the OVS kernel module. This is equivalent to the typical character device
-interface on POSIX platforms. The pseudo device supports a whole bunch of
+interface on POSIX platforms where we can register custom functions for read,
+write and ioctl functionality. The pseudo device supports a number of
ioctls that netdev and DPIF on OVS userspace make use of.
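+
+A minimal sketch of such a pseudo device in WDM style follows. The device name
+and handler here are hypothetical and error handling is trimmed; the actual
+datapath code differs:
+
+    #include <ntddk.h>
+
+    /* Hypothetical device name; the real one is defined by the datapath. */
+    DECLARE_CONST_UNICODE_STRING(ovsDeviceName, L"\\Device\\OvsSketch");
+
+    static DRIVER_DISPATCH OvsDeviceControl;
+
+    NTSTATUS
+    OvsCreatePseudoDevice(PDRIVER_OBJECT driverObject)
+    {
+        PDEVICE_OBJECT deviceObject;
+        NTSTATUS status;
+
+        status = IoCreateDevice(driverObject, 0,
+                                (PUNICODE_STRING)&ovsDeviceName,
+                                FILE_DEVICE_NETWORK, 0, FALSE, &deviceObject);
+        if (!NT_SUCCESS(status)) {
+            return status;
+        }
+        /* Userspace reaches the device through read/write/ioctl entry
+         * points, much like a POSIX character device. */
+        driverObject->MajorFunction[IRP_MJ_DEVICE_CONTROL] = OvsDeviceControl;
+        return STATUS_SUCCESS;
+    }
+
+    static NTSTATUS
+    OvsDeviceControl(PDEVICE_OBJECT deviceObject, PIRP irp)
+    {
+        PIO_STACK_LOCATION irpSp = IoGetCurrentIrpStackLocation(irp);
+
+        UNREFERENCED_PARAMETER(deviceObject);
+
+        switch (irpSp->Parameters.DeviceIoControl.IoControlCode) {
+        /* A real implementation dispatches each OVS command here. */
+        default:
+            irp->IoStatus.Status = STATUS_SUCCESS;
+            irp->IoStatus.Information = 0;
+            break;
+        }
+        IoCompleteRequest(irp, IO_NO_INCREMENT);
+        return STATUS_SUCCESS;
+    }
+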
+Netlink message parser
+----------------------
+The communication between OVS userspace and OVS kernel datapath is in the form
+of Netlink messages [8]. More details about this are provided in section 2.c,
+the kernel-userspace interface. In the kernel, a full-fledged netlink message
+parser has been implemented along the lines of the netlink message parser in
+OVS userspace. In fact, a lot of the code is ported code.
+
+Along the lines of 'struct ofpbuf' in OVS userspace, a managed buffer has been
+implemented in the kernel datapath to make it easier to parse and construct
+netlink messages.
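+
+For reference, the framing being parsed is the standard netlink wire format.
+The following standalone sketch (not the datapath's actual parser) shows the
+header layout and an attribute walk:
+
+    #include <stdint.h>
+    #include <stddef.h>
+
+    /* Standard netlink framing [8]; mirrors 'struct nlmsghdr' and
+     * 'struct nlattr' on Linux. */
+    struct nl_msg_hdr {
+        uint32_t nlmsg_len;    /* Length of message including header. */
+        uint16_t nlmsg_type;   /* Message family/type. */
+        uint16_t nlmsg_flags;  /* NLM_F_REQUEST, NLM_F_DUMP, etc. */
+        uint32_t nlmsg_seq;    /* Sequence number to match replies. */
+        uint32_t nlmsg_pid;    /* Port (socket) identifier. */
+    };
+
+    struct nl_attr_hdr {
+        uint16_t nla_len;      /* Attribute length including header. */
+        uint16_t nla_type;     /* Attribute type. */
+    };
+
+    /* Netlink attributes are aligned to 4-byte boundaries. */
+    #define NL_ALIGN(len) (((len) + 3) & ~3)
+
+    /* Walk each attribute in 'buf'; 'ofs' is the offset of the first
+     * attribute past the fixed headers, 'len' the total buffer length. */
+    static void
+    nl_for_each_attr(const uint8_t *buf, size_t ofs, size_t len,
+                     void (*cb)(const struct nl_attr_hdr *))
+    {
+        while (ofs + sizeof(struct nl_attr_hdr) <= len) {
+            const struct nl_attr_hdr *nla =
+                (const struct nl_attr_hdr *)(buf + ofs);
+            if (nla->nla_len < sizeof *nla || ofs + nla->nla_len > len) {
+                break;                     /* Malformed attribute. */
+            }
+            cb(nla);
+            ofs += NL_ALIGN(nla->nla_len);
+        }
+    }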
+
+Netlink sockets
+---------------
+On Linux, OVS userspace utilizes netlink sockets to pass back and forth netlink
+messages. Since much of the userspace code, including the DPIF provider in
+dpif-netlink.c (formerly dpif-linux.c), has been reused, pseudo-netlink sockets
+have been implemented in OVS userspace. Windows lacks native netlink socket
+support, and the socket family is not extensible either. Hence it is not
+possible to provide a native implementation of netlink sockets. We emulate
+netlink sockets in lib/netlink-socket.c and support all of the nl_* APIs to
+higher levels. The implementation opens a handle to the pseudo device for each
+netlink socket. Some more details on this topic are provided in the userspace
+section on netlink sockets.
+
+Typical netlink semantics of read message, write message, dump, and transaction
+have been implemented so that higher level layers are not affected by the
+netlink implementation not being native.
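+
+A rough userspace-side sketch of this emulation is shown below. The device
+path and ioctl code here are hypothetical placeholders; the real definitions
+live in lib/netlink-socket.c and the datapath interface headers:
+
+    #include <windows.h>
+    #include <stdint.h>
+
+    /* Hypothetical names standing in for the real ones. */
+    #define OVS_DEVICE_PATH "\\\\.\\OvsSketchDevice"
+    #define OVS_IOCTL_TRANSACT_SKETCH \
+        CTL_CODE(FILE_DEVICE_NETWORK, 0x801, METHOD_BUFFERED, FILE_ANY_ACCESS)
+
+    struct nl_sock_emul {
+        HANDLE handle;               /* One device handle per "socket". */
+    };
+
+    /* Emulates socket(AF_NETLINK, ...): open a handle to the pseudo device
+     * instead of creating a kernel socket. */
+    static int
+    nl_sock_emul_create(struct nl_sock_emul *sock)
+    {
+        sock->handle = CreateFileA(OVS_DEVICE_PATH,
+                                   GENERIC_READ | GENERIC_WRITE,
+                                   0, NULL, OPEN_EXISTING, 0, NULL);
+        return sock->handle == INVALID_HANDLE_VALUE ? -1 : 0;
+    }
+
+    /* Emulates a netlink transaction: send a request, await the reply.
+     * A synchronous DeviceIoControl stands in for send()+recv(). */
+    static int
+    nl_sock_emul_transact(struct nl_sock_emul *sock,
+                          const void *request, DWORD requestLen,
+                          void *reply, DWORD replyLen, DWORD *bytesOut)
+    {
+        BOOL ok = DeviceIoControl(sock->handle, OVS_IOCTL_TRANSACT_SKETCH,
+                                  (LPVOID)request, requestLen,
+                                  reply, replyLen, bytesOut, NULL);
+        return ok ? 0 : -1;
+    }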
+
Switch/Datapath management
--------------------------
As explained above, we hook onto the management callback functions in the NDIS
As explained above, we hook onto the management callback functions in the NDIS
interface to know when a port is added/connected to the Hyper-V switch. We use
these callbacks to initialize the port related data structures in OVS. Also,
-some of the ports are tunnel ports that don’t exist on the Hyper-V switch that
-are initiated from OVS userspace.
+some of the ports are tunnel ports that don’t exist on the Hyper-V switch and
+get added from OVS userspace.
+
+In order to identify a Hyper-V port, we use the value of the 'FriendlyName'
+in each Hyper-V port. We call this the "OVS-port-name". The idea is that OVS
+userspace sets 'OVS-port-name' in each Hyper-V port to the same value as the
+'name' field of the 'Interface' table in OVSDB. When OVS userspace calls into
+the kernel datapath to add a port, we match the name of the port with the
+'OVS-port-name' of a Hyper-V port.
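+
+A sketch of the matching step follows, with a hypothetical counted-string type
+standing in for the NDIS UTF-16 'FriendlyName' (whose length is carried in
+bytes):
+
+    #include <stddef.h>
+    #include <wchar.h>
+
+    /* Hypothetical stand-in for the counted UTF-16 friendly name carried
+     * in the Hyper-V port parameters. */
+    struct counted_wstr {
+        unsigned short lengthBytes;
+        wchar_t string[256];
+    };
+
+    /* Returns nonzero if the Hyper-V port's OVS-port-name matches the
+     * 'name' that userspace passed in its add-port request. */
+    static int
+    ovs_port_name_matches(const struct counted_wstr *friendlyName,
+                          const char *name)
+    {
+        size_t n = friendlyName->lengthBytes / sizeof(wchar_t);
+        size_t i;
+
+        for (i = 0; i < n; i++) {
+            if (name[i] == '\0'
+                || (wchar_t)(unsigned char)name[i] != friendlyName->string[i]) {
+                return 0;
+            }
+        }
+        return name[n] == '\0';    /* Same length, same characters. */
+    }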
+
+We maintain separate hash tables, and separate counters for ports that have
+been added from the Hyper-V switch, and for ports that have been added from OVS
+userspace.
Flowtable/Actions/packet forwarding
-----------------------------------
2.b) Userspace components
-------------------------
-A new daemon has been added to userspace to manage the entities in OVSDB, and
-also to keep it in sync with the kernel state, and this include bridges,
-physical NICs, VIFs etc. For example, upon bootup, ovs-wind does a get on the
-kernel to get a list of the bridges, and the corresponding ports and populates
-OVSDB. If a new VIF gets added to the kernel switch because a user powered on a
-Virtual Machine, ovs-wind detects it, and adds a corresponding entry in the
-ovsdb. This implies that ovs-wind has a synchronous as well as an asynchronous
-interface to the OVS kernel driver.
-
+The userspace portion of the OVS solution is mostly POSIX code, and not very
+Linux specific. The majority of the userspace code does not interface directly
+with the kernel datapath and was ported independently of the kernel datapath
+effort.
+
+In this section, we cover the userspace components that interface with the
+kernel datapath.
+
+As explained earlier, OVS on Hyper-V shares the DPIF provider implementation
+with Linux. The DPIF provider on Linux uses netlink sockets and netlink
+messages. Netlink sockets and messages are extensively used on Linux to
+exchange information between userspace and kernel. In order to satisfy these
+dependencies, netlink sockets (pseudo and non-native) and netlink messages
+have been implemented on Hyper-V.
+
+The following are the major advantages of sharing DPIF provider code:
+1. Maintenance is simpler:
+   Any change made to the interface defined in dpif-provider.h need not be
+   propagated to multiple implementations. Also, developers familiar with the
+   Linux implementation of the DPIF provider can easily ramp up on the Hyper-V
+   implementation as well.
+2. Netlink messages provide inherent advantages:
+   Netlink messages are known for their extensibility. Each message is
+   versioned, so the provided data structures offer a mechanism to perform
+   version checking and forward/backward compatibility with the kernel
+   module, as illustrated in the sketch after this list.
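+
+For instance, every generic netlink message carries a command and a version in
+its header. A standalone sketch of how a receiver can use that version for
+compatibility checks (the dispatch function is hypothetical):
+
+    #include <stdint.h>
+
+    /* Standard generic netlink header; follows the netlink header on the
+     * wire. */
+    struct genl_msg_hdr {
+        uint8_t  cmd;         /* Command, e.g. OVS_FLOW_CMD_NEW. */
+        uint8_t  version;     /* Family version, e.g. OVS_FLOW_VERSION. */
+        uint16_t reserved;
+    };
+
+    /* Hypothetical dispatch: an older userspace can keep talking to a
+     * newer kernel as long as its version is still supported. */
+    static int
+    ovs_sketch_dispatch(const struct genl_msg_hdr *genl,
+                        uint8_t maxSupportedVersion)
+    {
+        if (genl->version > maxSupportedVersion) {
+            return -1;        /* Reject: sender is too new. */
+        }
+        /* ... dispatch on genl->cmd ... */
+        return 0;
+    }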
+
+Netlink sockets
+---------------
+As explained in other sections, an emulation of netlink sockets has been
+implemented in lib/netlink-socket.c for Windows. The implementation creates a
+handle to the OVS pseudo device, and emulates netlink socket semantics of
+receive message, send message, dump, and transact. Most of the nl_* functions
+are supported.
+
+The fact that the implementation is non-native manifests in various ways.
+One example is that the PID for the netlink socket is not automatically
+assigned in userspace when a handle is created to the OVS pseudo device.
+There's an extra command (defined in OvsDpInterfaceExt.h) that is used to grab
+the PID generated in the kernel, as sketched below.
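+
+A hypothetical sketch of that extra step (the actual command and its code are
+defined in OvsDpInterfaceExt.h):
+
+    #include <windows.h>
+    #include <stdint.h>
+
+    /* Hypothetical ioctl code standing in for the real command. */
+    #define OVS_IOCTL_GET_PID_SKETCH \
+        CTL_CODE(FILE_DEVICE_NETWORK, 0x802, METHOD_BUFFERED, FILE_ANY_ACCESS)
+
+    /* After opening the device handle (the emulated socket), ask the
+     * kernel for the PID it generated for this socket. */
+    static int
+    nl_sock_emul_get_pid(HANDLE handle, uint32_t *pid)
+    {
+        DWORD bytes;
+
+        if (!DeviceIoControl(handle, OVS_IOCTL_GET_PID_SKETCH, NULL, 0,
+                             pid, sizeof *pid, &bytes, NULL)) {
+            return -1;
+        }
+        return bytes == sizeof *pid ? 0 : -1;
+    }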
+
+DPIF provider
+--------------
+As has been mentioned in earlier sections, the netlink socket and netlink
+message based DPIF provider on Linux has been ported to Windows.
+Correspondingly, the file has been renamed from its former name of
+lib/dpif-linux.c to lib/dpif-netlink.c.
-2.c) Kernel-Userspace interface
--------------------------------
-DPIF-Windows
-------------
-DPIF-Windows is the Windows implementation of the interface defined in dpif-
-provider.h, and provides an interface into the OVS kernel driver. We implement
-most of the callbacks required by the DPIF provider. A quick summary of the
-functionality implemented is as follows:
- * dp_dump, dp_get: dump all datapath information or get information for a
- particular datapath. Currently we only support one datapath.
- * flow_dump, flow_put, flow_get, flow_flush: These functions retrieve all
- flows in the kernel, add a flow to the kernel, get a specific flow and
- delete all the flows in the kernel.
- * recv_set, recv, recv_wait, recv_purge: these poll packets for upcalls.
- * execute: This is used to send packets from userspace to the kernel. The
- packets could be either flow miss packet punted from kernel earlier or
- userspace generated packets.
- * vport_dump, vport_get, ext_info: These functions dump all ports in the
- kernel, get a specific port in the kernel, or get extended information
- about a port.
- * event_subscribe, wait, poll: These functions subscribe, wait and poll the
- events that kernel posts. A typical example is kernel notices a port has
- gone up/down, and would like to notify the userspace.
+Most of the code is common. Some divergence is in the code to receive
+packets. The Linux implementation uses epoll() [9], which is not natively
+supported on Windows.
Netdev-Windows
--------------
-We have a Windows implementation of the the interface defined in lib/netdev-
-provider.h. The implementation provided functionality to get extended
-information about an interface. It is limited in functionality compared to the
-Linux implementation of the netdev provider and cannot be used to add any
-interfaces in the kernel such as a tap interface.
+We have a Windows implementation of the interface defined in
+lib/netdev-provider.h. The implementation provides functionality to get
+extended information about an interface. It is limited in functionality
+compared to the Linux implementation of the netdev provider and cannot be used
+to add any interfaces in the kernel such as a tap interface or to send/receive
+packets. The netdev-windows implementation uses the datapath interface
+extensions defined in:
+datapath-windows/include/OvsDpInterfaceExt.h
+
+Powershell extensions to set "OVS-port-name"
+--------------------------------------------
+As explained in the section on "Port management", each Hyper-V port has a
+'FriendlyName' field, which we call the "OVS-port-name" field. We have
+implemented PowerShell command extensions to be able to set the
+"OVS-port-name" of a Hyper-V port.
+2.c) Kernel-Userspace interface
+-------------------------------
+openvswitch.h and OvsDpInterfaceExt.h
+-------------------------------------
+Since the DPIF provider is shared with Linux, the kernel datapath provides the
+same interface as the Linux datapath. The interface is defined in
+datapath/linux/compat/include/linux/openvswitch.h. Derivatives of this
+interface file are created during OVS userspace compilation. The derivative for
+the kernel datapath on Hyper-V is provided in the following location:
+datapath-windows/include/OvsDpInterface.h
+
+That said, there are Windows-specific extensions that are defined in the
+interface file:
+datapath-windows/include/OvsDpInterfaceExt.h
2.d) Flow of a packet
---------------------
Reference list:
===============
-1: Hyper-V Extensible Switch
+1. Hyper-V Extensible Switch
http://msdn.microsoft.com/en-us/library/windows/hardware/hh598161(v=vs.85).aspx
-2: Hyper-V Extensible Switch Extensions
+2. Hyper-V Extensible Switch Extensions
http://msdn.microsoft.com/en-us/library/windows/hardware/hh598169(v=vs.85).aspx
3. DPIF Provider
http://openvswitch.sourcearchive.com/documentation/1.1.0-1/dpif-
http://msdn.microsoft.com/en-us/library/windows/hardware/ff557015(v=vs.85).aspx
7. How to Port Open vSwitch to New Software or Hardware
http://git.openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=PORTING
+8. Netlink
+http://en.wikipedia.org/wiki/Netlink
+9. epoll
+http://en.wikipedia.org/wiki/Epoll