iperf3 at 40Gbps and above
Achieving line rate on a 40G or 100G test host often requires parallel streams. With iperf3, however, it isn't as simple as adding a -P flag: in releases before 3.16, each iperf3 process is single-threaded, including all streams that process uses for a parallel test. All the parallel streams of one test therefore share the same CPU core. If you are core-limited (often the case for a 40G host and usually the case for a 100G host), adding parallel streams won't help unless you add them as additional iperf3 processes, which can use additional cores.
Note that it is not possible to do this using pscheduler to manage iperf3 tests, so this is typically better suited to lab or testbed environments.
To run multiple iperf3 processes for testing a high-speed host, do the following:
Start multiple servers:
iperf3 -s -p 5101 & iperf3 -s -p 5102 & iperf3 -s -p 5103 &
and then run multiple clients, using the "-T" flag to label the output:
iperf3 -c hostname -T s1 -p 5101 &
iperf3 -c hostname -T s2 -p 5102 &
iperf3 -c hostname -T s3 -p 5103 &
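The server and client commands above can also be generated with a small loop, which is convenient when you want more than three processes. The following is a minimal sketch, not a tested tool: BASE_PORT, NPROC, DRY_RUN and the function names are hypothetical names chosen for illustration, and only the iperf3 flags already shown above (-s, -c, -p, -T) are used. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: one iperf3 process per port, so each process can use its own core.
# BASE_PORT, NPROC and DRY_RUN are hypothetical names, not iperf3 options.
BASE_PORT="${BASE_PORT:-5101}"
NPROC="${NPROC:-3}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them

# Run a command in the background, or just print it when DRY_RUN=1.
launch() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "$*"
  else
    "$@" &
  fi
}

# Start one server per port (run this on the receiving host).
# The servers are left running in the background.
start_servers() {
  i=0
  while [ "$i" -lt "$NPROC" ]; do
    launch iperf3 -s -p "$((BASE_PORT + i))"
    i=$((i + 1))
  done
}

# Start one labeled client per port (run this on the sending host).
start_clients() {
  host="$1"
  i=0
  while [ "$i" -lt "$NPROC" ]; do
    launch iperf3 -c "$host" -T "s$((i + 1))" -p "$((BASE_PORT + i))"
    i=$((i + 1))
  done
  wait   # block until every background client has finished
}
```

To use it, set DRY_RUN=0 (and NPROC, if you want a different process count) before sourcing the file, then call start_servers on the receiver and start_clients hostname on the sender. The total throughput is the sum of the rates reported under each -T label.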
Also, 40/100G hosts need a number of additional host tuning settings. The TCP autotuning limits may not be large enough for 40G, and you may want to try the iperf3 -w option to set the window even larger (e.g., -w 128M). Be sure to check your IRQ settings as well.
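As a rough guide to how large the window needs to be, you can compute the bandwidth-delay product (link rate times round-trip time), which is the number of bytes that must be in flight to keep the path full. The sketch below shows only the arithmetic. The function name bdp_bytes is hypothetical, and the RTT values are illustrative assumptions, not measurements.

```shell
#!/bin/sh
# Bandwidth-delay product: bytes in flight needed to fill the path.
# bytes = (Gbit/s * 1e9 / 8) * (RTT in ms / 1000) = Gbit/s * RTT_ms * 125000
bdp_bytes() {
  gbps="$1"
  rtt_ms="$2"
  echo "$((gbps * rtt_ms * 125000))"
}

# Example RTTs are assumptions chosen for illustration.
for rtt in 1 10 50; do
  echo "40G, ${rtt} ms RTT: $(bdp_bytes 40 "$rtt") bytes"
done
```

At 40G this gives 5,000,000 bytes for a 1 ms RTT and 250,000,000 bytes for a 50 ms RTT, so a 128M window corresponds to a path of roughly 27 ms RTT. The window you actually get may also be capped by the host's socket buffer limits.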
https://fasterdata.es.net/performance-testing/network-troubleshooting-tools/iperf/multi-stream-iperf3/