OVS needs a system with 1GB hugepages support.

Building and Installing:
------------------------

Required: DPDK 16.04, libnuma

Optional (if building with vhost-cuse): `fuse`, `fuse-devel` (`libfuse-dev`
on Debian/Ubuntu)

Performance Tuning:
-------------------

1. PMD affinitization

   A poll mode driver (pmd) thread handles the I/O of all DPDK interfaces
   assigned to it. A pmd thread busy-loops over its assigned ports/rxqs,
   polling for packets, switching them and sending them to a tx port if
   required. A pmd thread is typically CPU bound, meaning that the more CPU
   cycles it gets, the better the performance. It is therefore good practice
   to give a pmd thread as many cycles on a core as possible, by
   affinitizing it to a core that carries no other workload. See section 7
   below for how to isolate cores for this purpose.

   The following command can be used to specify the affinity of the
   pmd thread(s):

   `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=<hex mask>`

   Setting a bit in the mask creates a pmd thread pinned to the
   corresponding CPU core, e.g. to run a pmd thread on core 1:

   `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=2`

   For more information, please refer to the Open_vSwitch TABLE section in

   `man ovs-vswitchd.conf.db`

   Note that a pmd thread on a NUMA node is only created if at least one
   DPDK interface from that NUMA node has been added to OVS.
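   As a rough sketch (assuming a bash-compatible shell and that, as in the
   example above, the pmd thread should run on core 1), the hex mask can be
   computed from the core number rather than written by hand:

   ```
   # Illustrative only: build a pmd-cpu-mask with bit 1 set (core 1 -> 0x2)
   # and apply it.
   core=1
   mask=$((1 << core))
   ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=$(printf '%x' "$mask")
   ```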
2. Multiple poll mode driver threads

   With pmd multi-threading support, OVS creates one pmd thread for each
   NUMA node by default. However, when multiple ports/rxqs are producing
   traffic, performance can be improved by creating multiple pmd threads
   running on separate cores. These pmd threads can then share the
   workload, each being responsible for different ports/rxqs; the
   assignment of ports/rxqs to pmd threads is done automatically.

   The following command can be used to specify the affinity of the
   pmd threads:

   `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=<hex mask>`

   A set bit in the mask means a pmd thread is created and pinned to the
   corresponding CPU core, e.g. to run pmd threads on cores 1 and 2:

   `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=6`

   For more information, please refer to the Open_vSwitch TABLE section in

   `man ovs-vswitchd.conf.db`

   For example, when using dpdk and dpdkvhostuser ports in a bi-directional
   VM loopback as shown below, spreading the workload over 2 or 4 pmd
   threads shows significant improvements, as more total CPU occupancy is
   available.

   NIC port0 <-> OVS <-> VM <-> OVS <-> NIC port1

   The following command can be used to confirm that the port/rxq
   assignment to pmd threads is as required:

   `ovs-appctl dpif-netdev/pmd-rxq-show`

   This can also be checked with:

   ```
   top -H
   taskset -p <pid of pmd>
   ```

   To understand where most of the pmd thread time is spent, and whether
   the caches are being utilized, these commands can be used:

   ```
   # Clear previous stats
   ovs-appctl dpif-netdev/pmd-stats-clear

   # Check current stats
   ovs-appctl dpif-netdev/pmd-stats-show
   ```

3. DPDK port Rx Queues

   `ovs-vsctl set Interface <DPDK interface> options:n_rxq=<integer>`

   The command above sets the number of rx queues for a DPDK interface, as
   sketched below. The rx queues are assigned to pmd threads on the same
   NUMA node in a round-robin fashion. For more information, please refer
   to the Open_vSwitch TABLE section in

   `man ovs-vswitchd.conf.db`
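   As an illustrative sketch (the port name "dpdk0" is an assumption; use
   the name of the actual DPDK port), the following requests four rx queues
   and then verifies how they were distributed across pmd threads:

   ```
   # Request 4 rx queues on a hypothetical port "dpdk0".
   ovs-vsctl set Interface dpdk0 options:n_rxq=4

   # Show which pmd thread polls each port/rxq.
   ovs-appctl dpif-netdev/pmd-rxq-show
   ```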
4. Exact Match Cache (EMC)

   Each pmd thread contains one EMC. After initial flow setup in the
   datapath, the EMC contains a single table and provides the lowest level
   (fastest) switching for DPDK ports. If there is a miss in the EMC, the
   next level where switching occurs is the datapath classifier. Missing in
   the EMC and looking up in the datapath classifier incurs a significant
   performance penalty. If lookup misses occur in the EMC because it is too
   small to handle the number of flows, its size can be increased by
   editing the define EM_FLOW_HASH_SHIFT in lib/dpif-netdev.c.

   As mentioned above, an EMC is per pmd thread, so an alternative way of
   increasing the aggregate number of possible EMC flow entries, and of
   avoiding datapath classifier lookups, is to run multiple pmd threads.
   This can be done as described in section 2.

5. Compiler options

   The default compiler optimization level is '-O2'. Changing this to more
   aggressive compiler optimizations such as '-O3' or
   '-Ofast -march=native' with gcc can produce performance gains.
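   A minimal sketch, assuming a gcc toolchain and a build from the OVS
   source tree with DPDK support, where `$DPDK_BUILD` is assumed to point
   at the DPDK build directory:

   ```
   # Rebuild OVS with more aggressive gcc optimizations.
   ./configure --with-dpdk=$DPDK_BUILD CFLAGS="-O3 -march=native"
   make
   ```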
6. Simultaneous Multithreading (SMT)

   With SMT enabled, one physical core appears as two logical cores, which
   can improve performance.

   SMT can be utilized to add additional pmd threads without consuming
   additional physical cores. Additional pmd threads may be added in the
   same manner as described in section 2. If trying to minimize the use of
   physical cores for pmd threads, care must be taken to set the correct
   bits in the pmd-cpu-mask to ensure that the pmd threads are pinned to
   SMT siblings.

   For example, when using 2x 10 core processors in a dual socket system
   with HT enabled, /proc/cpuinfo will report 40 logical cores. To use two
   logical cores which share the same physical core for pmd threads, the
   following command can be used to identify a pair of logical cores:

   `cat /sys/devices/system/cpu/cpuN/topology/thread_siblings_list`

   where N is the logical core number. In this example, it would show that
   cores 1 and 21 share the same physical core. The pmd-cpu-mask to enable
   two pmd threads running on these two logical cores (one physical core)
   is:

   `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=100002`

   Note that SMT is enabled by the Hyper-Threading section in the BIOS and
   as such applies to the whole system, so the impact of enabling/disabling
   it for the whole system should be considered: if workloads on the system
   can scale across multiple cores, SMT may be very beneficial; however, if
   they do not and perform best on a single physical core, SMT may not be
   beneficial.

7. The isolcpus kernel boot parameter

   isolcpus can be used on the kernel boot line to isolate cores from the
   kernel scheduler and hence dedicate them to OVS or other packet
   forwarding related workloads. For example, a Linux kernel boot line
   could be:

   ```
   GRUB_CMDLINE_LINUX_DEFAULT="quiet hugepagesz=1G hugepages=4
   default_hugepagesz=1G 'intel_iommu=off' isolcpus=1-19"
   ```

8. NUMA/Cluster On Die

   Ideally, inter-NUMA datapaths should be avoided where possible, as
   packets will go across QPI and there may be a slight performance penalty
   when compared with intra-NUMA datapaths. On Intel Xeon Processor E5 v3,
   Cluster On Die is introduced on models that have 10 cores or more. This
   makes it possible to logically split a socket into two NUMA regions, and
   again it is preferred where possible to keep critical datapaths within
   one cluster.

   It is good practice to ensure that threads that are in the datapath,
   e.g. pmd threads and the QEMU vCPUs responsible for forwarding, are
   pinned to cores in the same NUMA area; see the sketch below. If DPDK is
   built with CONFIG_RTE_LIBRTE_VHOST_NUMA=y, vHost User ports
   automatically detect the NUMA socket of the QEMU vCPUs and will be
   serviced by a PMD from the same node, provided a core on this node is
   enabled in the pmd-cpu-mask.
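   A rough sketch of how to check the NUMA placement before choosing a
   pmd-cpu-mask (the PCI address 0000:04:00.0 and the node number are
   placeholders for the actual system):

   ```
   # NUMA node the NIC is attached to (hypothetical PCI address).
   cat /sys/bus/pci/devices/0000:04:00.0/numa_node

   # CPU cores belonging to NUMA node 0.
   lscpu | grep "NUMA node0"
   ```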
9. Rx Mergeable buffers

   Rx mergeable buffers is a virtio feature that allows chaining of
   multiple virtio descriptors to handle large packet sizes: large packets
   are handled by reserving and chaining multiple free descriptors
   together. Mergeable buffer support is negotiated between the virtio
   driver and the virtio device and is supported by the DPDK vhost library.
   This feature is typically supported and enabled by default; however, if
   the user knows that rx mergeable buffers are not needed, i.e. jumbo
   frames are not needed, it can be forced off by adding mrg_rxbuf=off to
   the QEMU command line options. Not reserving chains of descriptors
   leaves more individual virtio descriptors available for rx to the guest
   using dpdkvhost ports, which can improve performance.

10. Packet processing in the guest

    Whether simply forwarding packets from one interface to another or
    doing more complex packet processing in the guest, it is good practice
    to ensure that the thread performing this work has as much CPU
    occupancy as possible. For example, when the DPDK sample application
    `testpmd` is used to forward packets in the guest, multiple QEMU vCPU
    threads can be created. taskset can then be used to affinitize the vCPU
    thread responsible for forwarding to a dedicated core not used for
    other general processing on the host system, as sketched below.
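    A minimal sketch, assuming the guest is run by a `qemu-system-x86_64`
    process and that host core 4 has been set aside for the forwarding
    vCPU; the thread ID is read from the `ps` output:

    ```
    # List the QEMU threads (the process name is an assumption); vCPU
    # threads appear as separate thread IDs.
    ps -T -p $(pidof qemu-system-x86_64)

    # Pin the forwarding vCPU thread (TID from the output above) to host
    # core 4.
    taskset -pc 4 <tid of vcpu>
    ```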
11. DPDK virtio pmd in the guest

    dpdkvhostcuse or dpdkvhostuser ports can be used to accelerate the path
    to the guest using the DPDK vhost library. This library is compatible
    with the virtio-net drivers in the guest, but significantly better
    performance can be observed when using the DPDK virtio pmd driver in
    the guest. The DPDK `testpmd` application can be used in the guest as
    an example application that forwards packets from one DPDK vhost port
    to another. An example of running `testpmd` in the guest:

    ```
    ./testpmd -c 0x3 -n 4 --socket-mem 512 -- --burst=64 -i --txqflags=0xf00
    --disable-hw-vlan --forward-mode=io --auto-start
    ```

    See below for information on dpdkvhostcuse and dpdkvhostuser ports.
    See [DPDK Docs] for more information on `testpmd`.

DPDK Rings:
------------