Part VII/A — NVIDIA Mellanox Bluefield-2 SmartNIC Hands-On Tutorial: To Offload or Not To Offload?
Is it beneficial to offload the OvS datapath to the hardware? Does it matter whether the kernel or the DPDK datapath is offloaded? In this episode, I dig a bit deeper into OvS offloading, and I also explain how packet processing is done with OvS running on the SmartNIC.
[UPDATE 08/2023]: I started to revise my tutorials here by reproducing them from scratch. The content below has been updated accordingly without explicitly mentioning it at every single instance.
In the previous episodes, I have already been dealing with OvS and DPDK (separately) on the Bluefield-2 DPU SmartNIC; however, we did not touch upon hardware offloading at all. Let us continue our journey directly from where we stopped in Part VI, where Host1 and Host2 were running pktgen for sending and receiving packets, respectively, while the Bluefields were running an OVS instance in which the default NORMAL flow rule had been removed and hard-coded L3-based forwarding rules had been added.
Offload Packet Processing to the BlueField-2 SmartNIC
On the Bluefields, we have added some very specific Layer-3 flow rules to the OVS bridges.
bf2@host1# ovs-ofctl del-flows ovsbr1
bf2@host1# ovs-ofctl -O OpenFlow12 add-flow ovsbr1 arp,actions=FLOOD
bf2@host1# ovs-ofctl -O OpenFlow12 add-flow ovsbr1 ip,in_port=pf0hpf,ip_dst=10.0.0.2,ip_src=10.0.0.1,actions=output:p0
bf2@host1# ovs-ofctl -O OpenFlow12 add-flow ovsbr1 ip,in_port=p0,ip_dst=10.0.0.1,ip_src=10.0.0.2,actions=output:pf0hpf
Bear in mind the IP addresses and the ports. On the first OVS, running “beneath” Host1, the host-facing logical interface is pf0hpf, and the packets coming from Host1 have the IP address 10.0.0.1. Accordingly, all such packets should be sent out on port p0 towards Host2. Conversely, I also set a similar rule for the reverse direction, as well as on the other OVS running “beneath” Host2.
bf2@host2# ovs-ofctl del-flows ovsbr1
bf2@host2# ovs-ofctl -O OpenFlow12 add-flow ovsbr1 arp,actions=FLOOD
bf2@host2# ovs-ofctl -O OpenFlow12 add-flow ovsbr1 ip,in_port=pf0hpf,ip_dst=10.0.0.1,ip_src=10.0.0.2,actions=output:p0
bf2@host2# ovs-ofctl -O OpenFlow12 add-flow ovsbr1 ip,in_port=p0,ip_dst=10.0.0.2,ip_src=10.0.0.1,actions=output:pf0hpf
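If you want to double-check that the rules landed in the flow tables, you can dump them on either Bluefield (the same command is used again later in this episode):
bf2@host1# ovs-ofctl dump-flows ovsbr1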
So, we saw that pktgen was working and we could achieve a pretty good performance. But do we know whether this is because the OVS kernel datapath is mysteriously pretty damn good, or because hardware offloading was enabled by default? This is what we are going to figure out.
Offloaded or Not Offloaded?
There is a user-space application for OVS to monitor the flow cache, in particular, the MegaFlow Cache of OVS. Recently, there was a study about how this flow cache can be populated in a covert way with only a few packets, causing a Denial-of-Service attack. Here, I do not want to dig deeper into the caching architecture of OVS; suffice it to say that flow rules matched in the flow table are cached to make packet processing faster for each subsequent packet belonging to the same flow.
Dump the Flow Cache
Let’s check the flow cache of OVS on the Bluefield at Host1. Note that the steps and outcomes detailed below are identical on Host2.
bf2@host1# ovs-dpctl dump-flows
recirc_id(0),in_port(2),eth(src=18:5a:58:0c:c9:42,dst=01:80:c2:00:00:00),eth_type(0/0xffff), packets:80206, bytes:4812360, used:1.628s, actions:userspace(pid=4294967295,controller(reason=7,dont_send=0,continuation=0,recirc_id=1,rule_cookie=0,controller_id=0,max_len=65535))
We see that there is nothing related to the pktgen flows; the only cached entry above is link-local control traffic (note the 01:80:c2:00:00:00 destination MAC). This assumes you have not shut pktgen down yet ;) otherwise, it is clear why you would not have any cached flow corresponding to pktgen.
This is already suspicious, and it led me to assume that the flows must have already been offloaded to the hardware.
First, let us get the OVS configuration flag to double-check:
bf2@host1# ovs-vsctl --no-wait get Open_vSwitch . other_config:hw-offload
"true"
To confirm, let us see what the hardware flow cache says then. There is another user-space tool for OVS that allows us to dump any flow cache entry that exists in the system.
Let’s use that command and also use grep to quickly find the entries related to our pktgen traffic.
bf2@host1# ovs-appctl dpctl/dump-flows -m |grep 10.0.0.
ufid:f8a3234a-3f74-43ab-9d5d-fe2148d0d4cc, skb_priority(0/0),skb_mark(0/0),
ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),
dp_hash(0/0),in_port(pf0hpf),packet_type(ns=0/0,id=0/0),
eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),
eth_type(0x0800),ipv4(src=10.0.0.1,dst=10.0.0.2,proto=0/0,tos=0/0,ttl=0/0,
frag=no), packets:2075031628, bytes:1054116049076, used:0.220s,
offloaded:yes, dp:tc, actions:p0
As you can see in the last line of the snippet above, the offloaded flag is set to yes, while the dp (datapath) field is set to tc, i.e., the flows have been offloaded via TC flower.
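If you want to cross-check this outside of OVS, the rules that were pushed down can also be inspected directly with the tc tool on the involved ports (a quick sketch, assuming the same interface names pf0hpf and p0 as above); filters that are actually executed in hardware are reported with the in_hw flag:
bf2@host1# tc -s filter show dev pf0hpf ingress
bf2@host1# tc -s filter show dev p0 ingress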
Performance with TC Flower Offloading
Since this was the default behavior, we already know the numbers: throughput is around 17 Gbps with 64B packets, while line rate (i.e., ~100 Gbps) could be reached with 512B or bigger packets.
Performance without offloading
Next, we try to disable hardware offloading altogether to see how OVS performs without it. First, stop the pktgen session and let the caches expire; this usually happens 5–10 seconds after you have stopped pktgen.
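A quick way to confirm that the cached entries are indeed gone is to re-run the flow cache dump from above and check that nothing related to the pktgen flows shows up anymore:
bf2@host1# ovs-appctl dpctl/dump-flows -m |grep 10.0.0.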
On the Bluefield, set the OvS-database configuration for hardware offloading to false and restart OvS.
bf2@host1# ovs-vsctl --no-wait set Open_vSwitch . other_config:hw-offload=false
bf2@host1# /etc/init.d/openvswitch-switch restart
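After the restart, the same query as before should now report the flag as false:
bf2@host1# ovs-vsctl --no-wait get Open_vSwitch . other_config:hw-offload
"false"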
Similarly, delete the default NORMAL flow rule and re-add the specific Layer-3 flow rules to ovsbr1.
bf2@host1# ovs-ofctl del-flows ovsbr1
bf2@host1# ovs-ofctl -O OpenFlow12 add-flow ovsbr1 arp,actions=FLOOD
bf2@host1# ovs-ofctl -O OpenFlow12 add-flow ovsbr1 ip,in_port=pf0hpf,ip_dst=10.0.0.2,ip_src=10.0.0.1,actions=output:p0
bf2@host1# ovs-ofctl -O OpenFlow12 add-flow ovsbr1 ip,in_port=p0,ip_dst=10.0.0.1,ip_src=10.0.0.2,actions=output:pf0hpf
Now, if we start pktgen, it works again; however, the corresponding flow cache entries are no longer offloaded to the hardware. We can use both flow monitoring tools to observe this.
First, the OpenFlow flow table shows that our rules are matching the traffic (note the n_packets counter of the first rule):
bf2@host1# ovs-ofctl dump-flows ovsbr1|grep 10.0.0
cookie=0x0, duration=196.075s, table=0, n_packets=30675800, n_bytes=1840548000, idle_age=0, ip,in_port=2,nw_src=10.0.0.1,nw_dst=10.0.0.2 actions=output:1
cookie=0x0, duration=187.524s, table=0, n_packets=0, n_bytes=0, idle_age=187, ip,in_port=1,nw_src=10.0.0.2,nw_dst=10.0.0.1 actions=output:2
If we then check the flow cache itself, we can see in the last line of the output that the entry corresponding to the pktgen traffic has no offloaded flag anymore, and the dp field is set to ovs (instead of tc).
bf2@host1# ovs-appctl dpctl/dump-flows -m |grep 10.0.0.
ufid:b8d4b1f6-0b68-4f4a-9aa6-2259c4b2678e, recirc_id(0),dp_hash(0/0),
skb_priority(0/0),in_port(pf0hpf),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),
ct_mark(0/0),ct_label(0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,
dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=10.0.0.1,
dst=10.0.0.2,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:60970416,
bytes:3658224960, used:0.000s, flags:., dp:ovs, actions:p0
Measured numbers
The performance of OvS with the kernel datapath is much worse than when its datapath is offloaded to the hardware (via TC flower).
In fact, the number of 64B packets reaching Host2 from Host1 is just ~470,000 per second, which translates into roughly 315 Mbps of throughput. When we set the packet size to 512B, which with TC flower-based offloading resulted in line-rate performance, the pure software OVS switch can only reach ~2 Gbps. With MTU-sized packets of 1500B, the same packet rate yields ~5.6 Gbps.
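As a sanity check on the arithmetic: each 64B frame occupies an extra 20B on the wire for the preamble and the inter-frame gap, so ~470,000 packets/s × (64 + 20) B × 8 ≈ 316 Mbps, which is in line with the figure above.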
Offload or Not Offload? — Offload with DPDK
And now, we have reached the section that is of utmost interest for most of the readers who ended up here; at least I suppose :`).
To use OVS with DPDK, there are further OVS-database configurations we have to carry out before starting OVS itself.
Issue all commands below on the Bluefield.
Initialize OvS with DPDK and Offloading
First, let's set the hw-offload flag back to true:
bf2@host1# ovs-vsctl --no-wait set Open_vSwitch . other_config:hw-offload=true
Initialize DPDK (for more details, refer to Part II.).
This first library export is needed for any DPDK application running on the system to find the corresponding DPDK libraries.
bf2@host1# export LD_LIBRARY_PATH=/opt/mellanox/dpdk/lib/aarch64-linux-gnu/
Ensure that hugepages are enabled and mounted. Nothing (bad) will happen if you simply redo these steps.
bf2@host1# echo 12288 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
bf2@host1# mountpoint -q /dev/hugepages || mount -t hugetlbfs nodev /dev/hugepages
Or if preferred and you want to keep track of the hugepages easier:
bf2@host1# umount /dev/hugepages
bf2@host1# mkdir -p /mnt/huge
bf2@host1# mount -t hugetlbfs nodev /mnt/huge
Again, check Part II if you want to use 1G hugepages instead of 2M ones.
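If you want the hugepage setup to survive a reboot (optional; a sketch assuming the /mnt/huge variant from above and that you are fine with editing sysctl and fstab by hand), you can persist both the page count and the mount point:
bf2@host1# echo "vm.nr_hugepages=12288" > /etc/sysctl.d/99-hugepages.conf
bf2@host1# echo "nodev /mnt/huge hugetlbfs pagesize=2M 0 0" >> /etc/fstab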
Set the hugepage configuration for OVS. Note that you should have more hugepages allocated in the system than the amount you assign to OVS. So, if we want OVS to use 4096 MB, make sure you have more than that in total; if you followed the tutorial, we purposely requested a lot more (12288 pages) above for exactly this reason. You can verify the free hugepages as follows:
bf2@host1# cat /proc/meminfo |grep -i ^Huge
HugePages_Total: 6797
HugePages_Free: 6669
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 13920256 kB
bf2@host1# ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="4096"
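For reference, the 4096 MB we hand over to OVS here corresponds to 2048 of the 2 MB hugepages, so the 6669 free pages reported above leave plenty of headroom.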
Set the DPDK initialization flag to true:
bf2@host1# ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
Finally, restart OVS
bf2@host1# /etc/init.d/openvswitch-switch restart
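Once OVS has come back up, a quick way to verify that the DPDK layer was actually initialized is to query the corresponding status fields of the database (on recent OVS releases, dpdk_initialized should read true):
bf2@host1# ovs-vsctl get Open_vSwitch . dpdk_initialized
bf2@host1# ovs-vsctl get Open_vSwitch . dpdk_version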
Create OVS-DPDK bridge
Now, to make sure that only the DPDK-enabled OVS bridge will process packets, remove the ovsbr1 and ovsbr2 bridges created by default.
bf2@host1# ovs-vsctl --if-exists del-br ovsbr1
bf2@host1# ovs-vsctl --if-exists del-br ovsbr2
Add a new, DPDK-enabled bridge to the OVS system:
bf2@host1# ovs-vsctl --no-wait add-br ovs_dpdk_br0 -- set bridge ovs_dpdk_br0 datapath_type=netdev
We can further restrict OVS so that it can only use the ports we explicitly allow it to operate. Accordingly, let's assign only port0 (our former p0 port from above) to OVS, identified by its PCI ID.
bf2@host1# ovs-vsctl set Open_vSwitch . other_config:dpdk-extra="-a 0000:03:00.0,representor=[0,65535]"
Add the two ports to the bridge — one for the physical port (dpdk0) and one for the logical port using the representor.
The latter is actually the representor of the VF-PF (virtual function to physical function) mapping. There is a pretty good paragraph here (20.1.1) about VFs and PFs; basically, you can define many VFs that can be assigned to any application/VM/container/whatnot, and all of them are connected to the one and only physical function (PF) that represents the physical port itself.
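If you are unsure which PCI ID and representors your DPU exposes, devlink (shipped with iproute2 on the Bluefield's Linux image) gives a quick overview of the physical ports and their representors, and lspci reveals the PCI ID used below (a sketch; the exact output depends on your firmware and configuration):
bf2@host1# devlink port show
bf2@host1# lspci | grep -i mellanox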
Add the PF first
bf2@host1# ovs-vsctl --no-wait add-port ovs_dpdk_br0 dpdk0 -- set Interface dpdk0 type=dpdk -- set Interface dpdk0 options:dpdk-devargs=0000:03:00.0
Then, add the VFs (we add all VFs here without any restriction as we want to send all packets from Host1)
bf2@host1# ovs-vsctl --no-wait add-port ovs_dpdk_br0 dpdk1 -- set Interface dpdk1 type=dpdk -- set Interface dpdk1 options:dpdk-devargs=0000:03:00.0,representor=[0,65535]
Unfortunately, this is the stage where I am stuck now, as of 09/2023: OvS fails to add the DPDK ports. In particular, using the OvS and DPDK bundle provided by default on the Bluefield, I manage to add the PF, but when I add the VF, it simply fails without any meaningful error message.
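When a DPDK port fails to attach like this, two places usually reveal at least a hint: the error column of the corresponding Interface record and the vswitchd log (a sketch; the log path below is the usual default and may differ on your image):
bf2@host1# ovs-vsctl get Interface dpdk1 error
bf2@host1# tail -n 50 /var/log/openvswitch/ovs-vswitchd.log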
Have a look at the following NVIDIA developer forum thread for further updates:
For now, I am stopping this episode here, and I have removed the old results I obtained years ago.