diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
index 20bd1c69e..00e75bda2 100644
--- a/INSTALL.DPDK.md
+++ b/INSTALL.DPDK.md
@@ -16,7 +16,7 @@ OVS needs a system with 1GB hugepages support.
 
 Building and Installing:
 ------------------------
 
-Required: DPDK 2.0
+Required: DPDK 16.04, libnuma
 
 Optional (if building with vhost-cuse): `fuse`, `fuse-devel` (`libfuse-dev`
 on Debian/Ubuntu)
 
@@ -24,28 +24,18 @@ on Debian/Ubuntu)
 1. Set `$DPDK_DIR`
 
    ```
-   export DPDK_DIR=/usr/src/dpdk-2.0
+   export DPDK_DIR=/usr/src/dpdk-16.04
    cd $DPDK_DIR
    ```
 
-  2. Update `config/common_linuxapp` so that DPDK generate single lib file.
-     (modification also required for IVSHMEM build)
-
-     `CONFIG_RTE_BUILD_COMBINE_LIBS=y`
-
-     Update `config/common_linuxapp` so that DPDK is built with vhost
-     libraries.
-
-     `CONFIG_RTE_LIBRTE_VHOST=y`
-
-     Then run `make install` to build and install the library.
+  2. Then run `make install` to build and install the library.
   For default install without IVSHMEM:
 
-     `make install T=x86_64-native-linuxapp-gcc`
+     `make install T=x86_64-native-linuxapp-gcc DESTDIR=install`
 
   To include IVSHMEM (shared memory):
 
-     `make install T=x86_64-ivshmem-linuxapp-gcc`
+     `make install T=x86_64-ivshmem-linuxapp-gcc DESTDIR=install`
 
   For further details refer to http://dpdk.org/
 
@@ -65,7 +55,7 @@ on Debian/Ubuntu)
   `export DPDK_BUILD=$DPDK_DIR/x86_64-ivshmem-linuxapp-gcc/`
 
   ```
-  cd $(OVS_DIR)/openvswitch
+  cd $(OVS_DIR)/
  ./boot.sh
  ./configure --with-dpdk=$DPDK_BUILD [CFLAGS="-g -O2 -Wno-cast-align"]
  make
@@ -112,7 +102,7 @@ Using the DPDK with ovs-vswitchd:
 3. Bind network device to vfio-pci:
 
   `$DPDK_DIR/tools/dpdk_nic_bind.py --bind=vfio-pci eth1`
 
-3. Mount the hugetable filsystem
+3. Mount the hugetable filesystem
 
  `mount -t hugetlbfs -o pagesize=1G none /dev/hugepages`
 
@@ -148,22 +138,67 @@ Using the DPDK with ovs-vswitchd:
 
 5. Start vswitchd:
 
-   DPDK configuration arguments can be passed to vswitchd via `--dpdk`
-   argument. This needs to be first argument passed to vswitchd process.
-   dpdk arg -c is ignored by ovs-dpdk, but it is a required parameter
-   for dpdk initialization.
+   DPDK configuration arguments can be passed to vswitchd via Open_vSwitch
+   other_config column. The recognized configuration options are listed.
+   Defaults will be provided for all values not explicitly set.
+
+   * dpdk-init
+   Specifies whether OVS should initialize and support DPDK ports. This is
+   a boolean, and defaults to false.
+
+   * dpdk-lcore-mask
+   Specifies the CPU cores on which dpdk lcore threads should be spawned.
+   The DPDK lcore threads are used for DPDK library tasks, such as
+   library internal message processing, logging, etc. Value should be in
+   the form of a hex string (so '0x123') similar to the 'taskset' mask
+   input.
+   If not specified, the value will be determined by choosing the lowest
+   CPU core from initial cpu affinity list. Otherwise, the value will be
+   passed directly to the DPDK library.
+   For performance reasons, it is best to set this to a single core on
+   the system, rather than allow lcore threads to float.
+
+   * dpdk-alloc-mem
+   This sets the total memory to preallocate from hugepages regardless of
+   processor socket. It is recommended to use dpdk-socket-mem instead.
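+
+   If it is used anyway, an illustrative sketch (the 2048 value here is
+   hypothetical and assumed to be in MB, as the dpdk-socket-mem example
+   further below suggests) would be:
+
+   `ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-alloc-mem=2048`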
+ + * dpdk-socket-mem + Comma separated list of memory to pre-allocate from hugepages on specific + sockets. + + * dpdk-hugepage-dir + Directory where hugetlbfs is mounted + + * dpdk-extra + Extra arguments to provide to DPDK EAL, as previously specified on the + command line. Do not pass '--no-huge' to the system in this way. Support + for running the system without hugepages is nonexistent. + + * cuse-dev-name + Option to set the vhost_cuse character device name. + + * vhost-sock-dir + Option to set the path to the vhost_user unix socket files. + + NOTE: Changing any of these options requires restarting the ovs-vswitchd + application. + + Open vSwitch can be started as normal. DPDK will be initialized as long + as the dpdk-init option has been set to 'true'. + ``` export DB_SOCK=/usr/local/var/run/openvswitch/db.sock - ovs-vswitchd --dpdk -c 0x1 -n 4 -- unix:$DB_SOCK --pidfile --detach + ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true + ovs-vswitchd unix:$DB_SOCK --pidfile --detach ``` If allocated more than one GB hugepage (as for IVSHMEM), set amount and use NUMA node 0 memory: ``` - ovs-vswitchd --dpdk -c 0x1 -n 4 --socket-mem 1024,0 \ - -- unix:$DB_SOCK --pidfile --detach + ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="1024,0" + ovs-vswitchd unix:$DB_SOCK --pidfile --detach ``` 6. Add bridge & ports @@ -212,61 +247,274 @@ Using the DPDK with ovs-vswitchd: ./ovs-ofctl add-flow br0 in_port=2,action=output:1 ``` -8. Performance tuning +8. QoS usage example + + Assuming you have a vhost-user port transmitting traffic consisting of + packets of size 64 bytes, the following command would limit the egress + transmission rate of the port to ~1,000,000 packets per second: + + `ovs-vsctl set port vhost-user0 qos=@newqos -- --id=@newqos create qos + type=egress-policer other-config:cir=46000000 other-config:cbs=2048` + + To examine the QoS configuration of the port: + + `ovs-appctl -t ovs-vswitchd qos/show vhost-user0` + + To clear the QoS configuration from the port and ovsdb use the following: + + `ovs-vsctl destroy QoS vhost-user0 -- clear Port vhost-user0 qos` + + For more details regarding egress-policer parameters please refer to the + vswitch.xml. + +9. Ingress Policing Example - With pmd multi-threading support, OVS creates one pmd thread for each - numa node as default. The pmd thread handles the I/O of all DPDK - interfaces on the same numa node. The following two commands can be used - to configure the multi-threading behavior. + Assuming you have a vhost-user port receiving traffic consisting of + packets of size 64 bytes, the following command would limit the reception + rate of the port to ~1,000,000 packets per second: + + `ovs-vsctl set interface vhost-user0 ingress_policing_rate=368000 + ingress_policing_burst=1000` + + To examine the ingress policer configuration of the port: + + `ovs-vsctl list interface vhost-user0` + + To clear the ingress policer configuration from the port use the following: + + `ovs-vsctl set interface vhost-user0 ingress_policing_rate=0` + + For more details regarding ingress-policer see the vswitch.xml. + +Performance Tuning: +------------------- + +1. PMD affinitization + + A poll mode driver (pmd) thread handles the I/O of all DPDK + interfaces assigned to it. A pmd thread will busy loop through + the assigned port/rxq's polling for packets, switch the packets + and send to a tx port if required. 
+   Typically, it is found that
+   a pmd thread is CPU bound, meaning that the greater the CPU
+   occupancy the pmd thread can get, the better the performance. To
+   that end, it is good practice to ensure that a pmd thread has as
+   many cycles on a core available to it as possible. This can be
+   achieved by affinitizing the pmd thread with a core that has no
+   other workload. See section 7 below for a description of how to
+   isolate cores for this purpose also.
+
+   The following command can be used to specify the affinity of the
+   pmd thread(s).
 
    `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=`
 
-   The command above asks for a CPU mask for setting the affinity of pmd
-   threads. A set bit in the mask means a pmd thread is created and pinned
-   to the corresponding CPU core. For more information, please refer to
+   By setting a bit in the mask, a pmd thread is created and pinned
+   to the corresponding CPU core. e.g. to run a pmd thread on core 1
+
+   `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=2`
+
+   For more information, please refer to the Open_vSwitch TABLE section in
+
   `man ovs-vswitchd.conf.db`
 
-   `ovs-vsctl set Open_vSwitch . other_config:n-dpdk-rxqs=`
+   Note, that a pmd thread on a NUMA node is only created if there is
+   at least one DPDK interface from that NUMA node added to OVS.
+
+2. Multiple poll mode driver threads
+
+   With pmd multi-threading support, OVS creates one pmd thread
+   for each NUMA node by default. However, it can be seen that in cases
+   where there are multiple ports/rxq's producing traffic, performance
+   can be improved by creating multiple pmd threads running on separate
+   cores. These pmd threads can then share the workload by each being
+   responsible for different ports/rxq's. Assignment of ports/rxq's to
+   pmd threads is done automatically.
 
-   The command above sets the number of rx queues of each DPDK interface. The
-   rx queues are assigned to pmd threads on the same numa node in round-robin
-   fashion. For more information, please refer to `man ovs-vswitchd.conf.db`
+   The following command can be used to specify the affinity of the
+   pmd threads.
 
-   Ideally for maximum throughput, the pmd thread should not be scheduled out
-   which temporarily halts its execution. The following affinitization methods
-   can help.
+   `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=`
+
+   A set bit in the mask means a pmd thread is created and pinned
+   to the corresponding CPU core. e.g. to run pmd threads on core 1 and 2
 
-   Lets pick core 4,6,8,10 for pmd threads to run on. Also assume a dual 8 core
-   sandy bridge system with hyperthreading enabled where CPU1 has cores 0,...,7
-   and 16,...,23 & CPU2 cores 8,...,15 & 24,...,31. (A different cpu
-   configuration could have different core mask requirements).
+   `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=6`
 
-   To kernel bootline add core isolation list for cores and associated hype cores
-   (e.g. isolcpus=4,20,6,22,8,24,10,26,). Reboot system for isolation to take
-   effect, restart everything.
+   For more information, please refer to the Open_vSwitch TABLE section in
 
-   Configure pmd threads on core 4,6,8,10 using 'pmd-cpu-mask':
+   `man ovs-vswitchd.conf.db`
 
-   `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=00000550`
+   For example, when using dpdk and dpdkvhostuser ports in a bi-directional
+   VM loopback as shown below, spreading the workload over 2 or 4 pmd
+   threads shows significant improvements as there will be more total CPU
+   occupancy available.
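+
+   As an illustrative sketch only (assuming cores 1-4 are free on the host
+   and that the mask is read as hex, as the pmd-cpu-mask=100002 example in
+   the SMT section below suggests), a 4 pmd-thread configuration could be
+   requested with:
+
+   `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=1E`
+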
- You should be able to check that pmd threads are pinned to the correct cores - via: + NIC port0 <-> OVS <-> VM <-> OVS <-> NIC port 1 + + The following command can be used to confirm that the port/rxq assignment + to pmd threads is as required: + + `ovs-appctl dpif-netdev/pmd-rxq-show` + + This can also be checked with: ``` - top -p `pidof ovs-vswitchd` -H -d1 + top -H + taskset -p ``` - Note, the pmd threads on a numa node are only created if there is at least - one DPDK interface from the numa node that has been added to OVS. - - To understand where most of the time is spent and whether the caches are - effective, these commands can be used: + To understand where most of the pmd thread time is spent and whether the + caches are being utilized, these commands can be used: ``` - ovs-appctl dpif-netdev/pmd-stats-clear #To reset statistics + # Clear previous stats + ovs-appctl dpif-netdev/pmd-stats-clear + + # Check current stats ovs-appctl dpif-netdev/pmd-stats-show ``` +3. DPDK port Rx Queues + + `ovs-vsctl set Interface options:n_rxq=` + + The command above sets the number of rx queues for DPDK interface. + The rx queues are assigned to pmd threads on the same NUMA node in a + round-robin fashion. For more information, please refer to the + Open_vSwitch TABLE section in + + `man ovs-vswitchd.conf.db` + +4. Exact Match Cache + + Each pmd thread contains one EMC. After initial flow setup in the + datapath, the EMC contains a single table and provides the lowest level + (fastest) switching for DPDK ports. If there is a miss in the EMC then + the next level where switching will occur is the datapath classifier. + Missing in the EMC and looking up in the datapath classifier incurs a + significant performance penalty. If lookup misses occur in the EMC + because it is too small to handle the number of flows, its size can + be increased. The EMC size can be modified by editing the define + EM_FLOW_HASH_SHIFT in lib/dpif-netdev.c. + + As mentioned above an EMC is per pmd thread. So an alternative way of + increasing the aggregate amount of possible flow entries in EMC and + avoiding datapath classifier lookups is to have multiple pmd threads + running. This can be done as described in section 2. + +5. Compiler options + + The default compiler optimization level is '-O2'. Changing this to + more aggressive compiler optimizations such as '-O3' or + '-Ofast -march=native' with gcc can produce performance gains. + +6. Simultaneous Multithreading (SMT) + + With SMT enabled, one physical core appears as two logical cores + which can improve performance. + + SMT can be utilized to add additional pmd threads without consuming + additional physical cores. Additional pmd threads may be added in the + same manner as described in section 2. If trying to minimize the use + of physical cores for pmd threads, care must be taken to set the + correct bits in the pmd-cpu-mask to ensure that the pmd threads are + pinned to SMT siblings. + + For example, when using 2x 10 core processors in a dual socket system + with HT enabled, /proc/cpuinfo will report 40 logical cores. To use + two logical cores which share the same physical core for pmd threads, + the following command can be used to identify a pair of logical cores. + + `cat /sys/devices/system/cpu/cpuN/topology/thread_siblings_list` + + where N is the logical core number. In this example, it would show that + cores 1 and 21 share the same physical core. The pmd-cpu-mask to enable + two pmd threads running on these two logical cores (one physical core) + is. 
+ + `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=100002` + + Note that SMT is enabled by the Hyper-Threading section in the + BIOS, and as such will apply to the whole system. So the impact of + enabling/disabling it for the whole system should be considered + e.g. If workloads on the system can scale across multiple cores, + SMT may very beneficial. However, if they do not and perform best + on a single physical core, SMT may not be beneficial. + +7. The isolcpus kernel boot parameter + + isolcpus can be used on the kernel bootline to isolate cores from the + kernel scheduler and hence dedicate them to OVS or other packet + forwarding related workloads. For example a Linux kernel boot-line + could be: + + ``` + GRUB_CMDLINE_LINUX_DEFAULT="quiet hugepagesz=1G hugepages=4 + default_hugepagesz=1G 'intel_iommu=off' isolcpus=1-19" + ``` + +8. NUMA/Cluster On Die + + Ideally inter NUMA datapaths should be avoided where possible as packets + will go across QPI and there may be a slight performance penalty when + compared with intra NUMA datapaths. On Intel Xeon Processor E5 v3, + Cluster On Die is introduced on models that have 10 cores or more. + This makes it possible to logically split a socket into two NUMA regions + and again it is preferred where possible to keep critical datapaths + within the one cluster. + + It is good practice to ensure that threads that are in the datapath are + pinned to cores in the same NUMA area. e.g. pmd threads and QEMU vCPUs + responsible for forwarding. If DPDK is built with + CONFIG_RTE_LIBRTE_VHOST_NUMA=y, vHost User ports automatically + detect the NUMA socket of the QEMU vCPUs and will be serviced by a PMD + from the same node provided a core on this node is enabled in the + pmd-cpu-mask. + +9. Rx Mergeable buffers + + Rx Mergeable buffers is a virtio feature that allows chaining of multiple + virtio descriptors to handle large packet sizes. As such, large packets + are handled by reserving and chaining multiple free descriptors + together. Mergeable buffer support is negotiated between the virtio + driver and virtio device and is supported by the DPDK vhost library. + This behavior is typically supported and enabled by default, however + in the case where the user knows that rx mergeable buffers are not needed + i.e. jumbo frames are not needed, it can be forced off by adding + mrg_rxbuf=off to the QEMU command line options. By not reserving multiple + chains of descriptors it will make more individual virtio descriptors + available for rx to the guest using dpdkvhost ports and this can improve + performance. + +10. Packet processing in the guest + + It is good practice whether simply forwarding packets from one + interface to another or more complex packet processing in the guest, + to ensure that the thread performing this work has as much CPU + occupancy as possible. For example when the DPDK sample application + `testpmd` is used to forward packets in the guest, multiple QEMU vCPU + threads can be created. Taskset can then be used to affinitize the + vCPU thread responsible for forwarding to a dedicated core not used + for other general processing on the host system. + +11. DPDK virtio pmd in the guest + + dpdkvhostcuse or dpdkvhostuser ports can be used to accelerate the path + to the guest using the DPDK vhost library. This library is compatible with + virtio-net drivers in the guest but significantly better performance can + be observed when using the DPDK virtio pmd driver in the guest. 
The DPDK + `testpmd` application can be used in the guest as an example application + that forwards packet from one DPDK vhost port to another. An example of + running `testpmd` in the guest can be seen here. + + ``` + ./testpmd -c 0x3 -n 4 --socket-mem 512 -- --burst=64 -i --txqflags=0xf00 + --disable-hw-vlan --forward-mode=io --auto-start + ``` + + See below information on dpdkvhostcuse and dpdkvhostuser ports. + See [DPDK Docs] for more information on `testpmd`. + DPDK Rings : ------------ @@ -315,7 +563,7 @@ the vswitchd. DPDK vhost: ----------- -DPDK 2.0 supports two types of vhost: +DPDK 16.04 supports two types of vhost: 1. vhost-user 2. vhost-cuse @@ -336,7 +584,7 @@ with OVS. DPDK vhost-user Prerequisites: ------------------------- -1. DPDK 2.0 with vhost support enabled as documented in the "Building and +1. DPDK 16.04 with vhost support enabled as documented in the "Building and Installing section" 2. QEMU version v2.1.0+ @@ -350,7 +598,8 @@ Adding DPDK vhost-user ports to the Switch: Following the steps above to create a bridge, you can now add DPDK vhost-user as a port to the vswitch. Unlike DPDK ring ports, DPDK vhost-user ports can -have arbitrary names. +have arbitrary names, except that forward and backward slashes are prohibited +in the names. - For vhost-user, the name of the port type is `dpdkvhostuser` @@ -363,11 +612,12 @@ have arbitrary names. `/usr/local/var/run/openvswitch/vhost-user-1`, which you must provide to your VM on the QEMU command line. More instructions on this can be found in the next section "DPDK vhost-user VM configuration" - Note: If you wish for the vhost-user sockets to be created in a - directory other than `/usr/local/var/run/openvswitch`, you may specify - another location on the ovs-vswitchd command line like so: + - If you wish for the vhost-user sockets to be created in a sub-directory of + `/usr/local/var/run/openvswitch`, you may specify this directory in the + ovsdb like so: - `./vswitchd/ovs-vswitchd --dpdk -vhost_sock_dir /my-dir -c 0x1 ...` + `./utilities/ovs-vsctl --no-wait \ + set Open_vSwitch . other_config:vhost-sock-dir=subdir` DPDK vhost-user VM configuration: --------------------------------- @@ -409,6 +659,41 @@ Follow the steps below to attach vhost-user port(s) to a VM. -numa node,memdev=mem -mem-prealloc ``` +3. Optional: Enable multiqueue support + The vhost-user interface must be configured in Open vSwitch with the + desired amount of queues with: + + ``` + ovs-vsctl set Interface vhost-user-2 options:n_rxq= + ``` + + QEMU needs to be configured as well. + The $q below should match the queues requested in OVS (if $q is more, + packets will not be received). + The $v is the number of vectors, which is '$q x 2 + 2'. + + ``` + -chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2 + -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce,queues=$q + -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mq=on,vectors=$v + ``` + + If one wishes to use multiple queues for an interface in the guest, the + driver in the guest operating system must be configured to do so. It is + recommended that the number of queues configured be equal to '$q'. + + For example, this can be done for the Linux kernel virtio-net driver with: + + ``` + ethtool -L combined <$q> + ``` + + A note on the command above: + + `-L`: Changes the numbers of channels of the specified network device + + `combined`: Changes the number of multi-purpose channels. + DPDK vhost-cuse: ---------------- @@ -418,10 +703,10 @@ with OVS. 
DPDK vhost-cuse Prerequisites: ------------------------- -1. DPDK 2.0 with vhost support enabled as documented in the "Building and +1. DPDK 16.04 with vhost support enabled as documented in the "Building and Installing section" As an additional step, you must enable vhost-cuse in DPDK by setting the - following additional flag in `config/common_linuxapp`: + following additional flag in `config/common_base`: `CONFIG_RTE_LIBRTE_VHOST_USER=n` @@ -480,14 +765,13 @@ DPDK vhost-cuse VM configuration: 1. This step is only needed if using an alternative character device. - The new character device filename must be specified on the vswitchd - commandline: + The new character device filename must be specified in the ovsdb: - `./vswitchd/ovs-vswitchd --dpdk --cuse_dev_name my-vhost-net -c 0x1 ...` + `./utilities/ovs-vsctl --no-wait set Open_vSwitch . \ + other_config:cuse-dev-name=my-vhost-net` - Note that the `--cuse_dev_name` argument and associated string must be the first - arguments after `--dpdk` and come before the EAL arguments. In the example - above, the character device to be used will be `/dev/my-vhost-net`. + In the example above, the character device to be used will be + `/dev/my-vhost-net`. 2. This step is only needed if reusing the standard character device. It will conflict with the kernel vhost character device so the user must first @@ -599,8 +883,8 @@ steps. ``` refers to "vhost-net" if using the `/dev/vhost-net` - device. If you have specificed a different name on the ovs-vswitchd - commandline using the "--cuse_dev_name" parameter, please specify that + device. If you have specificed a different name in the database + using the "other_config:cuse-dev-name" parameter, please specify that filename instead. 2. Disable SELinux or set to permissive mode @@ -721,17 +1005,17 @@ Restrictions: this with smaller page sizes. Platform and Network Interface: - - Currently it is not possible to use an Intel XL710 Network Interface as a - DPDK port type on a platform with more than 64 logical cores. This is - related to how DPDK reports the number of TX queues that may be used by - a DPDK application with an XL710. The maximum number of TX queues supported - by a DPDK application for an XL710 is 64. If a user attempts to add an - XL710 interface as a DPDK port type to a system as described above the - port addition will fail as OVS will attempt to initialize a TX queue greater - than 64. This issue is expected to be resolved in a future DPDK release. - As a workaround a user can disable hyper-threading to reduce the overall - core count of the system to be less than or equal to 64 when using an XL710 - interface with DPDK. + - By default with DPDK 16.04, a maximum of 64 TX queues can be used with an + Intel XL710 Network Interface on a platform with more than 64 logical + cores. If a user attempts to add an XL710 interface as a DPDK port type to + a system as described above, an error will be reported that initialization + failed for the 65th queue. OVS will then roll back to the previous + successful queue initialization and use that value as the total number of + TX queues available with queue locking. If a user wishes to use more than + 64 queues and avoid locking, then the + `CONFIG_RTE_LIBRTE_I40E_QUEUE_NUM_PER_PF` config parameter in DPDK must be + increased to the desired number of queues. Both DPDK and OVS must be + recompiled for this change to take effect. Bug Reporting: --------------