OVS-on-Hyper-V Design Document
==============================

There has been an effort in the recent past to develop the Open vSwitch (OVS)
solution onto multiple hypervisor platforms such as FreeBSD and Microsoft
Hyper-V. VMware has been working on an OVS solution for Microsoft Hyper-V for
the past few months and has successfully completed the implementation. This
document provides details of the development effort. We believe this document
should give enough information to members of the community who are curious
about the developments of OVS on Hyper-V. The community should also be able to
get enough information to make plans to leverage the deliverables of this
effort.

The userspace portion of OVS has already been ported to Hyper-V and committed
to the openvswitch repo. So, this document will mostly emphasize the kernel
driver, though we touch upon some aspects of userspace as well.

We cover the following topics:
1. Background into relevant Hyper-V architecture
2. Design of the OVS Windows implementation
   a. Kernel module (datapath)
   b. Userspace components
   c. Kernel-Userspace interface
   d. Flow of a packet
3. Build/Deployment environment

For more questions, please contact dev@openvswitch.org

1) Background into relevant Hyper-V architecture
------------------------------------------------

Microsoft's hypervisor solution - Hyper-V[1] - implements a virtual switch
that is extensible and provides opportunities for other vendors to implement
functional extensions[2]. The extensions need to be implemented as NDIS
drivers that bind within the extensible switch driver stack provided. The
extensions can broadly provide the functionality of monitoring, modifying and
forwarding packets to destination ports on the Hyper-V extensible switch.
Correspondingly, the extensions can be categorized into the following types,
providing the functionality noted:
 * Capturing extensions: monitoring packets
 * Filtering extensions: monitoring, modifying packets
 * Forwarding extensions: monitoring, modifying, forwarding packets

As can be expected, the kernel portion (datapath) of the OVS on Hyper-V
solution will be implemented as a forwarding extension.

In Hyper-V, the virtual machine is called the Child Partition. Each VIF or
physical NIC on the Hyper-V extensible switch is attached via a port. Each
port is on both the ingress path and the egress path of the switch. The
ingress path is used for packets being sent out of a port, and the egress path
is used for packets being received on a port. By design, NDIS provides a
layered interface, where in the ingress path, higher level layers call into
lower level layers, and on the egress path, it is the other way round. In
addition, there is an object identifier (OID) interface for control
operations, e.g., the addition of a port. The workflow for these calls is
similar in nature to that of the packets, where higher level layers call into
the lower level layers. A good representational diagram of this architecture
is in [4].

Windows Filtering Platform (WFP)[5] is a platform implemented on Hyper-V that
provides APIs and services for filtering packets. WFP has been utilized to
filter some of the packets that OVS is not equipped to handle directly. More
details on this are in later sections.

IP Helper[6] is a set of APIs available on Hyper-V to retrieve information
related to the network configuration of the host machine. IP Helper has been
used to retrieve some of the configuration information that OVS needs.

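As a flavor of the kind of information the IP Helper family exposes, below is
a minimal user-mode sketch (not OVS code) that enumerates adapter addresses
via GetAdaptersAddresses(); the OVS kernel driver itself uses the kernel-mode
side of IP Helper, so this is only an illustration of the API family.

    /*
     * Minimal user-mode illustration of the IP Helper API family; this is
     * not OVS code, it only shows the kind of network configuration
     * information IP Helper can retrieve.
     */
    #include <winsock2.h>
    #include <iphlpapi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #pragma comment(lib, "iphlpapi.lib")

    int main(void)
    {
        ULONG size = 16 * 1024;
        PIP_ADAPTER_ADDRESSES addrs = malloc(size);
        PIP_ADAPTER_ADDRESSES a;
        ULONG ret;

        if (!addrs) {
            return 1;
        }
        /* The first call may ask for a larger buffer. */
        ret = GetAdaptersAddresses(AF_INET, 0, NULL, addrs, &size);
        if (ret == ERROR_BUFFER_OVERFLOW) {
            free(addrs);
            addrs = malloc(size);
            ret = addrs ? GetAdaptersAddresses(AF_INET, 0, NULL, addrs, &size)
                        : ERROR_NOT_ENOUGH_MEMORY;
        }
        if (ret == NO_ERROR) {
            for (a = addrs; a; a = a->Next) {
                printf("adapter: %ws\n", a->FriendlyName);
            }
        }
        free(addrs);
        return ret == NO_ERROR ? 0 : 1;
    }
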
2) Design of the OVS Windows implementation
-------------------------------------------

[Figure 2, an ASCII block diagram, is not reproduced here. It shows the child
partitions (Virtual Machine #1 and #2 with VIF #1 and VIF #2) and the physical
NIC attached to the Hyper-V extensible switch via its ingress and egress paths
in the NDIS stack; the OVS forwarding extension in the kernel with its OVS
pseudo device, flowtable, packet processing logic and WFP driver; and the OVS
userspace components (ovs-wind, the OVS userspace daemons/ctl tools,
DPIF-Windows and netdev-Windows). The numbered markers *#1* through *#5*
correspond to the packet flow described in section 2.d.]

Fig 2. Various blocks of the OVS Windows implementation

Figure 2 shows the various blocks involved in the OVS Windows implementation,
along with some of the components available in the NDIS stack, and also the
virtual machines. The workflow of a packet being transmitted out of a VIF and
forwarded to another VIF or a physical NIC is also shown. New userspace
components being added are also shown. Later on in this section, we'll discuss
the flow of a packet at a high level.

The figure gives a general idea of where the OVS userspace and the kernel
components fit in, and how they interface with each other.

The kernel portion (datapath) of the OVS on Hyper-V solution has been
implemented as a forwarding extension roughly implementing the following
sub-modules/functionality. Details of each of these sub-components in the
kernel are contained in later sections:
 * Interfacing with the NDIS stack
 * Switch/Datapath management
 * Interfacing with the userspace portion of the OVS solution to implement the
   necessary ioctls that userspace needs
 * Port management
 * Flowtable/Actions/packet forwarding
 * Tunneling
 * Event notifications

The datapath for OVS on Linux is a kernel module, and cannot be directly
ported since there are significant differences in architecture, even though
the end functionality provided would be similar. Some examples of the
differences are:
 * Interfacing with the NDIS stack to hook into the NDIS callbacks for
   functionality such as receiving and sending packets, packet completions,
   and OIDs used for events such as a new port appearing on the virtual
   switch.
 * The interface between userspace and the kernel module.
 * Event notifications are significantly different.
 * The communication interface between DPIF and the kernel module need not be
   implemented in the way OVS on Linux does.
 * Any licensing issues of using Linux kernel code directly.

Due to these differences, it was a straightforward decision to develop the
datapath for OVS on Hyper-V from scratch rather than porting the one on Linux.

The re-development focused on the following goals:
 * Adhere to the existing requirements of the userspace portion of OVS (such
   as ovs-vswitchd), to minimize changes in the userspace workflow.
 * Fit well into the typical workflow of a Hyper-V extensible switch
   forwarding extension.

The userspace portion of the OVS solution is mostly POSIX code, and not very
Linux specific. The majority of the code has already been ported and committed
to the openvswitch repo. Most of the daemons such as ovs-vswitchd or
ovsdb-server can run on Windows now. One additional daemon that has been
implemented is called ovs-wind. At a high level, ovs-wind keeps the OVSDB used
by userspace in sync with the kernel state. More details are in the userspace
section.

As explained in the OVS porting design document [7], DPIF is the portion of
userspace that interfaces with the kernel portion of OVS. Each platform can
have its own implementation of the DPIF provider, whose interface is defined
in dpif-provider.h [3]. For OVS on Hyper-V, we have an implementation of the
DPIF provider for Hyper-V. The communication interface between userspace and
the kernel is a pseudo device, and is different from that of Linux's DPIF
provider, which uses netlink. But, as long as the DPIF provider interface is
the same, the callers should be agnostic of the underlying communication
interface.

2.a) Kernel module (datapath)
-----------------------------

Interfacing with the NDIS stack
-------------------------------

For each virtual switch on Hyper-V, the OVS extensible switch extension can be
enabled/disabled. We support enabling the OVS extension on only one switch.
This is consistent with using a single datapath in the kernel on Linux. All
the physical adapters are connected as external adapters to the extensible
switch.

When the OVS switch extension registers itself as a filter driver, it also
registers callbacks for the switch management and datapath functions. In other
words, when a switch is created on the Hyper-V root partition (host), the
extension gets an activate callback upon which it can initialize the data
structures necessary for OVS to function. Similarly, there are callbacks for
when a port gets added to the Hyper-V switch, and when an External Network
adapter or a VM Network adapter is connected to or disconnected from a port.
There are also callbacks for when a VIF (NIC of a child partition) sends out a
packet, or a packet is received on an external NIC.

As shown in Figure 2, an extensible switch extension gets to see a packet sent
by the VM (VIF) twice - once on the ingress path and once on the egress path.
Forwarding decisions are to be made on the ingress path. Correspondingly,
we'll be hooking onto the following interfaces (a sketch of how these
callbacks are registered follows the list):
 * Ingress send indication: intercept packets for performing flow-based
   forwarding. This includes straight forwarding to output ports. Any packet
   modifications that need to be performed are done here, either inline or by
   creating a new packet. A forwarding action is performed as the flow actions
   dictate.
 * Ingress completion indication: clean up and free packets that we generated
   on the ingress send path; pass-through for packets that we did not
   generate.
 * Egress receive indication: pass-through.
 * Egress completion indication: pass-through.

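To make the NDIS interfacing above more concrete, the following is a heavily
simplified, hypothetical sketch of how a forwarding extension could register
such datapath callbacks in its DriverEntry via NdisFRegisterFilterDriver().
The handler names (OvsExt*) are placeholders rather than the real OVS symbols,
and a real extension also fills in the driver name strings,
switch-management/OID handlers and error handling that are omitted here.

    /*
     * Hypothetical sketch: how a Hyper-V extensible switch forwarding
     * extension (an NDIS filter driver) could register its datapath
     * callbacks.  Handler bodies, the driver name strings (FriendlyName,
     * UniqueName, ServiceName), switch-management/OID handlers and error
     * handling are omitted; all OvsExt* names are placeholders.
     */
    #include <ndis.h>

    /* Callback declarations using the NDIS role types. */
    FILTER_ATTACH                         OvsExtAttach;
    FILTER_DETACH                         OvsExtDetach;
    FILTER_RESTART                        OvsExtRestart;
    FILTER_PAUSE                          OvsExtPause;
    FILTER_SEND_NET_BUFFER_LISTS          OvsExtSendNBL;         /* ingress send */
    FILTER_SEND_NET_BUFFER_LISTS_COMPLETE OvsExtSendNBLComplete; /* ingress completion */
    FILTER_RECEIVE_NET_BUFFER_LISTS       OvsExtReceiveNBL;      /* egress: pass-through */
    FILTER_RETURN_NET_BUFFER_LISTS        OvsExtReturnNBL;       /* egress completion */

    static NDIS_HANDLE gOvsFilterDriverHandle;

    NTSTATUS
    DriverEntry(PDRIVER_OBJECT driverObject, PUNICODE_STRING registryPath)
    {
        NDIS_FILTER_DRIVER_CHARACTERISTICS fChars;

        UNREFERENCED_PARAMETER(registryPath);

        NdisZeroMemory(&fChars, sizeof fChars);
        fChars.Header.Type = NDIS_OBJECT_TYPE_FILTER_DRIVER_CHARACTERISTICS;
        fChars.Header.Revision = NDIS_FILTER_CHARACTERISTICS_REVISION_2;
        fChars.Header.Size = NDIS_SIZEOF_FILTER_DRIVER_CHARACTERISTICS_REVISION_2;
        fChars.MajorNdisVersion = 6;
        fChars.MinorNdisVersion = 30;  /* extensible switch extensions are NDIS 6.30 */

        /* Datapath callbacks: ingress send/completion, egress receive/return. */
        fChars.AttachHandler = OvsExtAttach;
        fChars.DetachHandler = OvsExtDetach;
        fChars.RestartHandler = OvsExtRestart;
        fChars.PauseHandler = OvsExtPause;
        fChars.SendNetBufferListsHandler = OvsExtSendNBL;
        fChars.SendNetBufferListsCompleteHandler = OvsExtSendNBLComplete;
        fChars.ReceiveNetBufferListsHandler = OvsExtReceiveNBL;
        fChars.ReturnNetBufferListsHandler = OvsExtReturnNBL;

        return NdisFRegisterFilterDriver(driverObject, (NDIS_HANDLE)driverObject,
                                         &fChars, &gOvsFilterDriverHandle);
    }
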
Interfacing with OVS userspace
------------------------------

We've implemented a pseudo device interface for letting OVS userspace talk to
the OVS kernel module. This is equivalent to the typical character device
interface on POSIX platforms. The pseudo device supports a number of ioctls
that netdev and DPIF in OVS userspace make use of.

Switch/Datapath management
--------------------------

As explained above, we hook onto the management callback functions in the NDIS
interface to know when to initialize the OVS data structures, flow tables,
etc. Some of this code is also driven by OVS userspace code, which sends down
ioctls for operations such as creating a tunnel port.

Port management
---------------

As explained above, we hook onto the management callback functions in the NDIS
interface to know when a port is added/connected to the Hyper-V switch. We use
these callbacks to initialize the port related data structures in OVS. Also,
some of the ports are tunnel ports that don't exist on the Hyper-V switch and
are instead initiated from OVS userspace.

Flowtable/Actions/packet forwarding
-----------------------------------

The flowtable and flow-actions based packet forwarding is the core of the OVS
datapath functionality. For each packet on the ingress path, we consult the
flowtable and execute the corresponding actions. The actions can be limited to
simple forwarding to a particular destination port(s), or, more commonly,
involve modifying the packet to insert a tunnel context or a VLAN ID, and
thereafter forwarding it to the external port to send the packet to a
destination host.

Tunneling
---------

We make use of the Internal Port on a Hyper-V switch for implementing
tunneling. The Internal Port is a virtual adapter that is exposed on the
Hyper-V host, and connected to the Hyper-V switch. Basically, it is an
interface between the host and the virtual switch. The Internal Port acts as
the tunnel end point for the host (aka VTEP), and holds the VTEP IP address.

Tunneling ports are not actual ports on the Hyper-V switch. These are virtual
ports that OVS maintains, and while executing actions, if the output port is a
tunnel port, we short circuit by performing the encapsulation action based on
the tunnel context. The encapsulated packet gets forwarded to the external
port, and appears to the outside world as though it was sent from the VTEP.
Similarly, when a tunneled packet enters OVS from the external port, we check
whether it is destined to the Internal Port (VTEP), and if so, we short
circuit the path and directly forward the inner packet to the destination port
(mostly a VIF, but dictated by the flow).

We leverage the Windows Filtering Platform (WFP) framework to be able to
receive tunneled packets that cannot be decapsulated by OVS right away.
Currently, fragmented IP packets fall into that category, and we leverage the
code in the host IP stack to reassemble the packet, and perform decapsulation
on the reassembled packet.

We'll also be using the IP Helper library to provide us the IP address and
other information corresponding to the Internal Port.

Event notifications
-------------------

The pseudo device interface described above is also used for providing event
notifications back to OVS userspace. A shared memory/overlapped IO model is
used.

2.b) Userspace components
-------------------------

A new daemon, ovs-wind, has been added to userspace to manage the entities in
OVSDB and also to keep them in sync with the kernel state; these entities
include bridges, physical NICs, VIFs, etc. For example, upon bootup, ovs-wind
queries the kernel for the list of bridges and the corresponding ports, and
populates OVSDB. If a new VIF gets added to the kernel switch because a user
powered on a Virtual Machine, ovs-wind detects it, and adds a corresponding
entry in the OVSDB. This implies that ovs-wind has both a synchronous and an
asynchronous interface to the OVS kernel driver.

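To illustrate what the synchronous half of this interface looks like from
userspace, the sketch below opens a hypothetical OVS pseudo device and issues
a single request via DeviceIoControl(). The device path, ioctl code and reply
structure are invented placeholders; the asynchronous (event notification)
side would additionally open the device with FILE_FLAG_OVERLAPPED and use
overlapped I/O, as described under "Event notifications".

    /*
     * Hypothetical user-mode sketch of talking to the OVS pseudo device.
     * The device path, ioctl code and reply layout below are placeholders,
     * not the actual interface exposed by the OVS kernel driver.
     */
    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    /* Placeholder ioctl: "get datapath info", METHOD_BUFFERED. */
    #define OVS_IOCTL_DP_GET \
        CTL_CODE(FILE_DEVICE_NETWORK, 0x801, METHOD_BUFFERED, FILE_ANY_ACCESS)

    struct dp_info_reply {      /* placeholder reply structure */
        ULONG n_ports;
        ULONG n_flows;
    };

    int main(void)
    {
        struct dp_info_reply reply;
        DWORD bytes = 0;
        /* Placeholder device path; the real name is defined by the driver.
         * The event notification path would instead open the device with
         * FILE_FLAG_OVERLAPPED and use overlapped I/O. */
        HANDLE dev = CreateFileW(L"\\\\.\\OvsIoctl",
                                 GENERIC_READ | GENERIC_WRITE, 0, NULL,
                                 OPEN_EXISTING, 0, NULL);

        if (dev == INVALID_HANDLE_VALUE) {
            fprintf(stderr, "open failed: %lu\n", GetLastError());
            return 1;
        }
        /* Synchronous request/reply, as used by DPIF-Windows and ovs-wind. */
        if (DeviceIoControl(dev, OVS_IOCTL_DP_GET, NULL, 0,
                            &reply, sizeof reply, &bytes, NULL)) {
            printf("ports: %lu, flows: %lu\n", reply.n_ports, reply.n_flows);
        }
        CloseHandle(dev);
        return 0;
    }
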
2.c) Kernel-Userspace interface
-------------------------------

DPIF-Windows
------------

DPIF-Windows is the Windows implementation of the interface defined in
dpif-provider.h, and provides an interface into the OVS kernel driver. We
implement most of the callbacks required by the DPIF provider. A quick summary
of the functionality implemented is as follows:
 * dp_dump, dp_get: dump all datapath information, or get information for a
   particular datapath. Currently we only support one datapath.
 * flow_dump, flow_put, flow_get, flow_flush: these functions retrieve all
   flows in the kernel, add a flow to the kernel, get a specific flow, and
   delete all the flows in the kernel, respectively.
 * recv_set, recv, recv_wait, recv_purge: these functions poll for upcall
   packets.
 * execute: this is used to send packets from userspace to the kernel. The
   packets could either be flow-miss packets punted earlier from the kernel,
   or packets generated by userspace.
 * vport_dump, vport_get, ext_info: these functions dump all ports in the
   kernel, get a specific port in the kernel, or get extended information
   about a port.
 * event_subscribe, wait, poll: these functions subscribe to, wait for, and
   poll the events that the kernel posts. A typical example is the kernel
   noticing that a port has gone up/down, and notifying userspace.

Netdev-Windows
--------------

We have a Windows implementation of the interface defined in
lib/netdev-provider.h. The implementation provides functionality to get
extended information about an interface. It is limited in functionality
compared to the Linux implementation of the netdev provider, and cannot be
used to add any interfaces in the kernel, such as a tap interface.

2.d) Flow of a packet
---------------------

Figure 2 shows the numbered steps in which a packet gets sent out of a VIF and
is forwarded to another VIF or a physical NIC. As mentioned earlier, each VIF
is attached to the switch via a port, and each port is on both the ingress and
egress paths of the switch; depending on whether a packet is being transmitted
or received, one of the paths gets used. In the figure, each step n is
annotated as *#n*.

The steps are as follows:
1. When a packet is sent out of a VIF, a physical NIC or an internal port, the
   packet is part of the ingress path.
2. The OVS kernel driver gets to intercept this packet (a simplified sketch of
   this step follows the list).
   a. OVS looks up the flows in the flowtable for this packet, and executes
      the corresponding actions.
   b. If there is no matching flow, the packet is sent up to OVS userspace to
      examine the packet and figure out the actions.
   c. Userspace executes the packet by specifying the actions, and might also
      insert a flow for such a packet in the future.
   d. The destination ports are added to the packet and sent down to the
      Hyper-V switch.
3. The Hyper-V switch forwards the packet to the destination ports specified
   in the packet, and sends it out on the egress path.
4. The packet gets forwarded to the destination VIF.
5. The packet might also get forwarded to a physical NIC, if the physical NIC
   has been added as a destination port by OVS.

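The sketch below condenses step 2 into code. All of the types and helper
functions (OvsFlowKey, OvsPacket, OvsActions, OvsFlowLookup,
OvsExecuteActions, OvsQueueUpcall) are hypothetical placeholders meant only to
show the lookup/upcall decision on the ingress path, not the actual datapath
implementation.

    /*
     * Simplified sketch of the ingress-path decision in step 2 above.  All
     * types and helpers (OvsFlowKey, OvsPacket, OvsActions, OvsFlowLookup,
     * OvsExecuteActions, OvsQueueUpcall) are hypothetical placeholders.
     */
    struct OvsFlowKey;          /* fields extracted from the packet headers */
    struct OvsPacket;           /* wraps the packet seen on the ingress path */
    struct OvsActions;          /* encoded actions: set tunnel/VLAN, output ports */

    struct OvsFlow {
        const struct OvsActions *actions;
    };

    extern struct OvsFlow *OvsFlowLookup(const struct OvsFlowKey *key);
    extern void OvsExecuteActions(struct OvsPacket *pkt,
                                  const struct OvsActions *actions);
    extern void OvsQueueUpcall(struct OvsPacket *pkt,
                               const struct OvsFlowKey *key);

    void
    OvsProcessIngressPacket(struct OvsPacket *pkt, const struct OvsFlowKey *key)
    {
        struct OvsFlow *flow = OvsFlowLookup(key);

        if (flow) {
            /* 2a: flow hit - modify the packet as the actions dictate and add
             * the destination port(s) before handing it back to the switch. */
            OvsExecuteActions(pkt, flow->actions);
        } else {
            /* 2b: flow miss - punt the packet to OVS userspace, which executes
             * it with explicit actions (2c) and may install a flow. */
            OvsQueueUpcall(pkt, key);
        }
    }
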
3) Build/Deployment
-------------------

The userspace components added as part of the OVS Windows implementation have
been integrated with autoconf, and can be built using the steps mentioned in
the BUILD.Windows file. Additional targets need to be specified to make.

The OVS kernel code is part of a Visual Studio 2013 solution, and is compiled
from the IDE. There are plans in the future to move to a compilation mode such
that we can compile it without an IDE as well.

Once compiled, we have an install script that can be used to load the kernel
driver.

Reference list:
===============
1. Hyper-V Extensible Switch
http://msdn.microsoft.com/en-us/library/windows/hardware/hh598161(v=vs.85).aspx
2. Hyper-V Extensible Switch Extensions
http://msdn.microsoft.com/en-us/library/windows/hardware/hh598169(v=vs.85).aspx
3. DPIF Provider
http://openvswitch.sourcearchive.com/documentation/1.1.0-1/dpif-provider_8h_source.html
4. Hyper-V Extensible Switch Components
http://msdn.microsoft.com/en-us/library/windows/hardware/hh598163(v=vs.85).aspx
5. Windows Filtering Platform
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366510(v=vs.85).aspx
6. IP Helper
http://msdn.microsoft.com/en-us/library/windows/hardware/ff557015(v=vs.85).aspx
7. How to Port Open vSwitch to New Software or Hardware
http://git.openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=PORTING