Levente Csikor

Summary

This article is a technical tutorial exploring the performance implications of offloading the Open vSwitch (OvS) datapath to hardware, specifically on the NVIDIA Mellanox Bluefield-2 SmartNIC, with a focus on comparing kernel and DPDK datapath offloading.

Abstract

The article delves into the intricacies of packet processing with OvS on the NVIDIA Mellanox Bluefield-2 SmartNIC, examining the benefits and performance impacts of offloading the OvS datapath to hardware. It details the process of setting up specific Layer-3 flow rules and monitoring the flow cache to determine if hardware offloading is taking place. The author conducts experiments to measure the performance differences between offloaded and non-offloaded OvS datapaths, finding that hardware offloading significantly improves throughput, especially for smaller packet sizes. The tutorial also guides readers through the steps to disable hardware offloading and to configure OvS with DPDK for further performance enhancements, while noting a current limitation with adding DPDK ports on the Bluefield-2 SmartNIC that requires attention from the NVIDIA developer community.

Opinions

  • The author suggests that hardware offloading is beneficial for OvS performance, particularly when using TC flowers, which can achieve line-rate performance with larger packet sizes.
  • There is an indication that the default behavior of OvS on the Bluefield-2 SmartNIC involves hardware offloading, as evidenced by the flow cache entries.
  • The performance of OvS without offloading is significantly lower, with a notable drop in throughput for smaller packet sizes.
  • The author expresses a strong interest in the community's engagement with the issue of adding DPDK ports to OvS on the Bluefield-2 SmartNIC, highlighting the importance of resolving this for continued performance improvements.
  • The tutorial implies that readers should be familiar with advanced networking concepts and command-line operations to follow along and implement the described configurations.

Part VII/A — NVIDIA Mellanox Bluefield-2 SmartNIC Hands-On Tutorial: To Offload or Not To Offload?

Is it beneficial to offload the OvS datapath to the hardware? Does it matter whether the kernel or the DPDK datapath is offloaded? In this episode, I dig a bit deeper into OvS offloading matters, and I also explain how packet processing is done with OvS running on the SmartNIC.

[UPDATE 08/2023]: I started to revise my tutorials here by reproducing them from scratch. The content below has been updated accordingly without explicitly mentioning it at every single instance.

In the previous episodes, I have already dealt with OvS and DPDK (separately) on the Bluefield-2 DPU SmartNIC; however, we did not touch upon hardware offloading at all. Let us continue our journey directly from where we stopped in Part VI, where Host1 and Host2 were running pktgen for sending and receiving packets, respectively, while the Bluefields were running an OVS instance with the default NORMAL flow rule removed and hard-coded L3-based forwarding rules added.

Offload Packet Processing to the BlueField-2 SmartNIC

On the Bluefields, we have added some very specific Layer-3 flow rules to the OVS bridges.

bf2@host1# ovs-ofctl del-flows ovsbr1
bf2@host1# ovs-ofctl -O OpenFlow12 add-flow ovsbr1 arp,actions=FLOOD
bf2@host1# ovs-ofctl -O OpenFlow12 add-flow ovsbr1 ip,in_port=pf0hpf,ip_dst=10.0.0.2,ip_src=10.0.0.1,actions=output:p0
bf2@host1# ovs-ofctl -O OpenFlow12 add-flow ovsbr1 ip,in_port=p0,ip_dst=10.0.0.1,ip_src=10.0.0.2,actions=output:pf0hpf

Bear in mind the IP addresses and the ports. On the first OVS running “beneath” Host1, the host-facing logical interface is pf0hpf, and the packets coming from Host1 have the source IP 10.0.0.1. Accordingly, all such packets should be sent out on port p0 towards Host2. Conversely, I set a similar rule for the reverse direction, as well as on the other OVS running “beneath” Host2.

bf2@host2# ovs-ofctl del-flows ovsbr1
bf2@host2# ovs-ofctl -O OpenFlow12 add-flow ovsbr1 arp,actions=FLOOD
bf2@host2# ovs-ofctl -O OpenFlow12 add-flow ovsbr1 ip,in_port=pf0hpf,ip_dst=10.0.0.1,ip_src=10.0.0.2,actions=output:p0
bf2@host2# ovs-ofctl -O OpenFlow12 add-flow ovsbr1 ip,in_port=p0,ip_dst=10.0.0.2,ip_src=10.0.0.1,actions=output:pf0hpf
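
If you want to double-check that these rules actually landed in the flow tables before generating any traffic (a quick sanity check on my part, not something the walkthrough strictly requires), you can dump the flow table on either Bluefield:

bf2@host1# ovs-ofctl -O OpenFlow12 dump-flows ovsbr1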

So, we saw that pktgen was working and we could achieve pretty good performance. But do we know whether that is because the OVS kernel driver is mysteriously pretty damn good, or because hardware offloading was enabled by default? This is what we are going to figure out.

Offloaded or Not Offloaded?

There is a user-space application for OVS to monitor the flow cache, in particular the MegaFlow Cache of OVS. Recently, there was a study about how this flow cache can be populated in a covert way with only a small number of packets, causing a Denial-of-Service attack. Here, I do not want to dig deeper into the caching architecture of OVS; suffice it to say that flow rules matched in the flow table are cached to make packet processing faster for each subsequent packet belonging to the same flows.
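
As a side note, if you are curious about the state of this caching layer itself, ovs-appctl can print some upcall/cache statistics; this is purely optional for the rest of the tutorial:

bf2@host1# ovs-appctl upcall/show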

Dump the Flow Cache

Let’s check the flow cache of OVS on the Bluefield at Host1. Note that the steps and outcomes detailed below are identical on Host2.

bf2@host1# ovs-dpctl dump-flows
recirc_id(0),in_port(2),eth(src=18:5a:58:0c:c9:42,dst=01:80:c2:00:00:00),eth_type(0/0xffff), packets:80206, bytes:4812360, used:1.628s, actions:userspace(pid=4294967295,controller(reason=7,dont_send=0,continuation=0,recirc_id=1,rule_cookie=0,controller_id=0,max_len=65535))

We see that there is nothing related to the pktgen flows (assuming you have not shut pktgen down yet ;) otherwise, it is clear why there would be no cached flow corresponding to pktgen).

This is already suspicious, and it led me to assume that the flows had already been offloaded to the hardware.

First, let us get the OVS configuration flag to double-check:

bf2@host1# ovs-vsctl --no-wait get Open_vSwitch . other_config:hw-offload
"true"

To confirm, let us see what the hardware flow cache says then. There is another user-space tool for OVS that allows us to dump any flow cache entry that exists in the system.

Let’s use that command and also use grep to quickly find the entries related to our pktgen traffic.

bf2@host1# ovs-appctl dpctl/dump-flows -m |grep 10.0.0.
ufid:f8a3234a-3f74-43ab-9d5d-fe2148d0d4cc, skb_priority(0/0),skb_mark(0/0),
ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),
dp_hash(0/0),in_port(pf0hpf),packet_type(ns=0/0,id=0/0),
eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),
eth_type(0x0800),ipv4(src=10.0.0.1,dst=10.0.0.2,proto=0/0,tos=0/0,ttl=0/0,
frag=no), packets:2075031628, bytes:1054116049076, used:0.220s, 
offloaded:yes, dp:tc, actions:p0

As you can see in the last line of the snippet above, the offloaded flag is set to yes, while the dp (datapath) field is set to tc, i.e., the flows are offloaded via TC flower.
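
Since the dp field says tc, the same offloaded rules should also be visible from the TC side. As a cross-check (the interface name is taken from the setup above), you can list the ingress filters on the host-facing representor:

bf2@host1# tc -s filter show dev pf0hpf ingress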

Performance of the TC Flowers

Since this was the default behavior, we already know the performance: around 17 Gbps with 64B packets, while line rate (i.e., ~100 Gbps) could be reached with 512B or bigger packets.

Performance without offloading

Next, we disable hardware offloading altogether to see how OVS performs on its own. First, stop the pktgen session and let the caches expire; this usually happens 5–10 seconds after you have stopped pktgen.

On the Bluefield, set the OvS-database configuration for hardware offloading to False and restart OvS.

bf2@host1# ovs-vsctl --no-wait set Open_vSwitch . other_config:hw-offload=false
bf2@host1# /etc/init.d/openvswitch-switch restart
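
To make sure the change really took effect after the restart, query the flag again; it should now report false:

bf2@host1# ovs-vsctl --no-wait get Open_vSwitch . other_config:hw-offload
"false"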

Similarly, delete the default NORMAL flow rule and re-add the specific Layer-3 flow rules to ovsbr1.

bf2@host1# ovs-ofctl del-flows ovsbr1
bf2@host1# ovs-ofctl -O OpenFlow12 add-flow ovsbr1 arp,actions=FLOOD
bf2@host1# ovs-ofctl -O OpenFlow12 add-flow ovsbr1 ip,in_port=pf0hpf,ip_dst=10.0.0.2,ip_src=10.0.0.1,actions=output:p0
bf2@host1# ovs-ofctl -O OpenFlow12 add-flow ovsbr1 ip,in_port=p0,ip_dst=10.0.0.1,ip_src=10.0.0.2,actions=output:pf0hpf

Now, if we start pktgen, it works again; however, the corresponding flow cache entries are no longer offloaded to the hardware. We can use the monitoring tools from above to observe this.

First, the OpenFlow flow table counters confirm that the traffic is indeed hitting our rules and being processed by OVS.

bf2@host1# ovs-ofctl dump-flows ovsbr1|grep 10.0.0
 cookie=0x0, duration=196.075s, table=0, n_packets=30675800, n_bytes=1840548000, idle_age=0, ip,in_port=2,nw_src=10.0.0.1,nw_dst=10.0.0.2 actions=output:1
 cookie=0x0, duration=187.524s, table=0, n_packets=0, n_bytes=0, idle_age=187, ip,in_port=1,nw_src=10.0.0.2,nw_dst=10.0.0.1 actions=output:2

If we check the flow caches again, we can see that the flow cache entry corresponding to the pktgen traffic has no offloaded flag anymore, and the dp field is set to ovs (instead of tc).

bf2@host1# ovs-appctl dpctl/dump-flows -m |grep 10.0.0.
ufid:b8d4b1f6-0b68-4f4a-9aa6-2259c4b2678e, recirc_id(0),dp_hash(0/0),
skb_priority(0/0),in_port(pf0hpf),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),
ct_mark(0/0),ct_label(0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,
dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=10.0.0.1,
dst=10.0.0.2,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:60970416, 
bytes:3658224960, used:0.000s, flags:., dp:ovs, actions:p0

Measured numbers

The performance of OvS with the kernel datapath is much worse than when its datapath is offloaded to the hardware (via TC flowers).

In fact, the number of 64B packets per second reaching Host2 from Host1 is just ~470,000, which means around 315 Mbps of throughput. When we set the packet size to 512B, which with TC flower-based offloading resulted in line-rate performance, the pure software OVS switch can only reach ~2 Gbps. With MTU-sized packets of 1500B, the same packets-per-second rate results in ~5.6 Gbps.
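
As a quick sanity check on these numbers (assuming the usual 20B of per-frame overhead on the wire, i.e., preamble plus inter-frame gap), ~470,000 packets per second at 64B indeed works out to roughly 315 Mbps:

bf2@host1# echo $((470000 * (64 + 20) * 8))
315840000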

Offload or Not Offload? — Offload with DPDK

And now we have reached the section that is of utmost interest to most of the readers who ended up here; at least, I suppose so :`).

To use OVS with DPDK, there are further OVS-database configurations we have to carry out before starting OVS itself.

Issue all commands below on the Bluefield.

Initialize OvS with DPDK and Offloading

First, let’s set the hw-offload flag back to True

bf2@host1# ovs-vsctl --no-wait set Open_vSwitch . other_config:hw-offload=true

Initialize DPDK (for more details, refer to Part II.).

This first library export is needed for any DPDK application running on the system to find the corresponding DPDK libraries.

bf2@host1# export LD_LIBRARY_PATH=/opt/mellanox/dpdk/lib/aarch64-linux-gnu/

Ensure that hugepages are allocated and mounted. Nothing bad will happen if you simply redo these steps.

bf2@host1# echo 12288 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
bf2@host1# mountpoint -q /dev/hugepages || mount -t hugetlbfs nodev /dev/hugepages

Or, if you prefer and want to keep track of the hugepages more easily:

bf2@host1# umount /dev/hugepages
bf2@host1# mkdir -p /mnt/huge
bf2@host1# mount -t hugetlbfs nodev /mnt/huge
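
If you also want the hugepage allocation to survive a reboot (optional, and assuming the standard sysctl setup of the Bluefield's Ubuntu image), you can persist it via sysctl; the file name below is just an example:

bf2@host1# echo "vm.nr_hugepages=12288" >> /etc/sysctl.d/99-hugepages.conf
bf2@host1# sysctl -p /etc/sysctl.d/99-hugepages.conf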

Again, check Part II if you want to use 1G hugepages instead of 2M ones.

Set the hugepage configuration for OVS. Note that you should have more hugepages allocated in the system than the amount you assign to OVS. So, if we want OVS to use 4096 MB, make sure you have more than that in total; if you followed the tutorial, we purposely set 12288 (2MB pages) for this reason. You can verify the free hugepages as follows:

bf2@host1# cat /proc/meminfo  |grep -i ^Huge
HugePages_Total:    6797
HugePages_Free:     6669
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:        13920256 kB
bf2@host1# ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="4096"

Set DPDK initialization flag to True

bf2@host1# ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true

Finally, restart OVS

bf2@host1# /etc/init.d/openvswitch-switch restart
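
Before moving on, it is worth verifying that the DPDK EAL actually initialized inside ovs-vswitchd; if everything went well, the first command should return true (the log path is the standard one on the Bluefield's Ubuntu-based image):

bf2@host1# ovs-vsctl get Open_vSwitch . dpdk_initialized
bf2@host1# grep -i "EAL" /var/log/openvswitch/ovs-vswitchd.log | tail -n 5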

Create OVS-DPDK bridge

Now, to make sure only the DPDK OVS bridge will process packets, remove ovsbr1 and ovsbr2 created by default.

bf2@host1# ovs-vsctl --if-exists del-br ovsbr1
bf2@host1# ovs-vsctl --if-exists del-br ovsbr2

Add a new bridge to the OVS system that is DPDK-enabled

bf2@host1# ovs-vsctl --no-wait add-br ovs_dpdk_br0 -- set bridge ovs_dpdk_br0 datapath_type=netdev

We can further restrict OVS to use only the ports we explicitly allow it to operate on. Accordingly, let's assign only port0 (our former p0 port from above), or more precisely port0's PCI ID, to OVS.

bf2@host1# ovs-vsctl set Open_vSwitch . other_config:dpdk-extra="-a 0000:03:00.0,representor=[0,65535]"

Add the two ports to the bridge — one for the physical port (dpdk0) and one for the logical port using the representor.

The latter is actually the VF-PF (virtual function to physical function) mapper. There is a pretty good paragraph here (20.1.1) about VFs and PFs; basically, you can define many VFs that can be assigned to any application/VM/container/whatnot, and they are all connected to the one and only physical function (PF) that represents the physical port itself.
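
If you want to see how the physical port and its representors show up on the DPU before wiring them into OVS (the exact output will of course differ per setup), devlink gives a concise view:

bf2@host1# devlink port show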

Add the PF first

bf2@host1# ovs-vsctl --no-wait add-port ovs_dpdk_br0 dpdk0 -- set Interface dpdk0 type=dpdk -- set Interface dpdk0 options:dpdk-devargs=0000:03:00.0

Then, add the VFs (we add all VFs here without any restriction as we want to send all packets from Host1)

bf2@host1# ovs-vsctl --no-wait add-port ovs_dpdk_br0 dpdk1 -- set Interface dpdk1 type=dpdk -- set Interface dpdk1 options:dpdk-devargs=0000:03:00.0,representor=[0,65535]

Unfortunately, this is the stage where I am stuck now, in 09/2023. OvS fails to add the DPDK ports. In particular, using the OvS and DPDK bundle provided by default on the Bluefield, I managed to add the PF, but when I add the VF, it simply fails without any meaningful error message.
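
When debugging such a silent failure, two places usually contain more detail than the CLI output itself: the error column of the corresponding Interface record and the vswitchd log (standard location assumed):

bf2@host1# ovs-vsctl --columns=name,error list Interface dpdk1
bf2@host1# tail -n 50 /var/log/openvswitch/ovs-vswitchd.log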

Have a look at the following NVIDIA developer forum thread for further updates:

For now, I am temporarily stopping this episode here, and I have removed the old results I obtained years ago.
