HPMC User Guide v 1.00
© 2022 Bassem W. Jamaleddine
HPMC calculators that are fabric capable can be interconnected to take advantage of RDMA via DAPL, executing programs with their processing distributed across neighboring HPMCs.
When an HPMC is equipped with a switching card device named qib0, it can communicate through a switch with any neighboring HPMC equipped with a similar card.
To find out whether your HPMC is RDMA capable and running properly, you need to verify that "IB/RoCE v1" is up. The following steps show how to determine whether your HPMC is "IB/RoCE v1" capable.
Make sure opensm is running:
# systemctl status opensm
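If the subnet manager is not active, it can be started and enabled at boot with systemd; a minimal sketch, assuming the opensm package is installed:
# systemctl enable --now opensm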
Make sure RoCE is running:
# mkdir /sys/kernel/config/rdma_cm/qib0
# tree /sys/kernel/config/rdma_cm/qib0
# cat /sys/kernel/config/rdma_cm/qib0/ports/1/default_roce_mode
# mkdir /sys/kernel/config/rdma_cm/qib0
# tree /sys/kernel/config/rdma_cm/qib0
/sys/kernel/config/rdma_cm/qib0
└── ports
└── 1
└── default_roce_mode
2 directories, 1 file
# cat /sys/kernel/config/rdma_cm/qib0/ports/1/default_roce_mode
IB/RoCE v1
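The default RoCE mode can be changed through the same configfs entry; a sketch, assuming the device also supports RoCE v2 (write "IB/RoCE v1" to switch back):
# echo "RoCE v2" > /sys/kernel/config/rdma_cm/qib0/ports/1/default_roce_mode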
The device qib0 is configured on the network as ib0. Any of the following commands can be used to query ib0:
# ip link show ib0
# ifconfig ib0
# ibv_devinfo -v qib0
# ip link show ib0
5: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP mode DEFAULT group default qlen 256
link/infiniband 80:00:00:03:fe:80:00:00:00:00:00:00:00:11:75:00:00:6e:ef:26 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
# ifconfig ib0
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
inet 172.16.0.132 netmask 255.255.0.0 broadcast 172.16.255.255
inet6 fe80::211:7500:6e:ef26 prefixlen 64 scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
infiniband 80:00:00:03:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
# ibv_devinfo -v qib0
hca_id: qib0
transport: InfiniBand (0)
fw_ver: 0.0.0
node_guid: 0011:7500:006e:ef26
sys_image_guid: 0011:7500:006e:ef26
vendor_id: 0x1175
...
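To confirm that the link itself is up, the infiniband-diags tools can also query the port state; a sketch (the port should report State: Active and Physical state: LinkUp):
# ibstat qib0
# ibstatus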
Make sure that the interface is set to datagram mode (this configuration is set in the network scripts ifcfg-ib0 and ifcfg-ib0.8002):
# cat /sys/class/net/ib0/mode
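On this setup the command should print datagram. If it prints connected instead, datagram mode can be selected in the network scripts; a sketch of the relevant line, assuming RHEL-style ifcfg files:
# grep CONNECTED_MODE /etc/sysconfig/network-scripts/ifcfg-ib0
CONNECTED_MODE=no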
HPMC calculators branded HPMC-KLIB (KLIB: Knight Landing InfiniBand) are all fabric-connectivity ready and can be plugged into the switch.
The following sections show some of the commands used to test and measure the performance of RDMA between the nodes.
■ Testing Connectivity with ibping
Testing the connectivity with ibping between hpmc9 and hpmc7, which are interconnected by InfiniBand. On hpmc9, start the ibping server:
# 16:19 root@HPMC9: /mm03fs/MPI # ibping -S -P 1 -C qib0
From hpmc7, ping hpmc9 (Lid 2) via qib0:
# 16:06 root@HPMC7: ~ # ibping -c 10000 -f -C qib0 -P 1 -G 0x00117500006eef26
16:06 root@HPMC7: ~ # ibping -c 10000 -f -C qib0 -P 1 -G 0x00117500006eef26

--- (Lid 2) ibping statistics ---
10000 packets transmitted, 0 received, 100% packet loss, time 574 ms
rtt min/avg/max = 0.000/0.000/0.000 ms
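The value passed with -G is the port GUID of the ibping server. If it is not at hand, it can be read on hpmc9 with the infiniband-diags tools; a sketch (look for the Port GUID field of port 1):
# ibstat qib0 1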
■ Testing Connectivity with Iterations Using pingpong
Testing the connectivity with ibv_rc_pingpong between hpmc9 and hpmc7, which are interconnected by InfiniBand. On hpmc9, start the server side:
# 16:19 root@HPMC9: /mm03fs/MPI # ibv_rc_pingpong -d qib0 -g 0 -i 1
16:19 root@HPMC9: /mm03fs/MPI # ibv_rc_pingpong -d qib0 -g 0 -i 1
  local address:  LID 0x0002, QPN 0x00001b, PSN 0xadb290, GID fe80::11:7500:6e:ef26
  remote address: LID 0x0005, QPN 0x00001b, PSN 0x0c5add, GID fe80::11:7500:6e:f940
8192000 bytes in 0.04 seconds = 1464.59 Mbit/sec
1000 iters in 0.04 seconds = 44.75 usec/iter
# 15:58 root@HPMC7: ~ # ibv_rc_pingpong -g 0 -d qib0 -i 1 192.168.0.131
15:58 root@HPMC7: ~ # ibv_rc_pingpong -g 0 -d qib0 -i 1 192.168.0.131
  local address:  LID 0x0005, QPN 0x00001b, PSN 0x0c5add, GID fe80::11:7500:6e:f940
  remote address: LID 0x0002, QPN 0x00001b, PSN 0xadb290, GID fe80::11:7500:6e:ef26
8192000 bytes in 0.04 seconds = 1478.27 Mbit/sec
1000 iters in 0.04 seconds = 44.33 usec/iter
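The address given to the client, 192.168.0.131, is hpmc9's Ethernet address; ibv_rc_pingpong uses it only to exchange connection parameters over TCP, while the test traffic itself runs over the fabric. The iteration count and message size can be varied with the standard -n and -s flags; a sketch:
# ibv_rc_pingpong -d qib0 -g 0 -i 1 -n 10000 -s 65536
# ibv_rc_pingpong -d qib0 -g 0 -i 1 -n 10000 -s 65536 192.168.0.131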
■ RDMA WRITE_BW
Both hpmc9 and hpmc7 are interconnected by InfiniBand.
# 16:27 root@HPMC9: /mm03fs/MPI # ib_write_bw -R
16:27 root@HPMC9: /mm03fs/MPI # ib_write_bw -R
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : qib0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
CQ Moderation : 100
Mtu : 2048[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : ON
Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
Waiting for client rdma_cm QP to connect
Please run the same command with the IB/RoCE interface IP
---------------------------------------------------------------------------------------
local address: LID 0x02 QPN 0x001f PSN 0x3c0bc5
remote address: LID 0x05 QPN 0x001f PSN 0x280903
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 5000 628.34 618.89 0.009902
---------------------------------------------------------------------------------------
# 16:06 root@HPMC7: ~ # ib_write_bw hpmc9ib0 -R --cpu_util
16:06 root@HPMC7: ~ # ib_write_bw hpmc9ib0 -R --cpu_util
---------------------------------------------------------------------------------------
CPU Utilization works only with Duration mode.
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : qib0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 2048[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : ON
Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
local address: LID 0x05 QPN 0x001f PSN 0x280903
remote address: LID 0x02 QPN 0x001f PSN 0x3c0bc5
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 1501.062000 != 1096.101000. CPU Frequency is not max.
65536 5000 628.34 618.89 0.009902
---------------------------------------------------------------------------------------
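The "Conflicting CPU frequency values detected" line in the output above is a perftest warning that the cores were not locked at their maximum frequency, which can skew the measurement. One way to avoid it, sketched here assuming the cpupower utility from kernel-tools is installed, is to pin the frequency governor before testing:
# cpupower frequency-set -g performance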
■ RDMA WRITE CPU_UTIL
Both hpmc9 and hpmc7 are interconnected by InfiniBand.
# 00:58 root@HPMC9: ~ # ib_write_bw -F -c RC --cpu_util -D 10
00:58 root@HPMC9: ~ # ib_write_bw -F -c RC --cpu_util -D 10
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : qib0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
CQ Moderation : 100
Mtu : 2048[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x01 QPN 0x09d4 PSN 0x2196eb RKey 0x2505100 VAddr 0x007f84a816a000
remote address: LID 0x05 QPN 0x0019 PSN 0x6d2660 RKey 0x0a0b00 VAddr 0x007f296b6c3000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps] CPU_Util[%]
65536 57900 0.00 5.06 0.009648 0.00
---------------------------------------------------------------------------------------
# 00:50 root@HPMC7: ~ # ib_write_bw -c RC --report_gbits -F -d qib0 hpmc9 -c RC --cpu_util -D 10
00:50 root@HPMC7: ~ # ib_write_bw -c RC --report_gbits -F -d qib0 hpmc9 -c RC --cpu_util -D 10
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : qib0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 2048[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x05 QPN 0x0019 PSN 0x6d2660 RKey 0x0a0b00 VAddr 0x007f296b6c3000
remote address: LID 0x01 QPN 0x09d4 PSN 0x2196eb RKey 0x2505100 VAddr 0x007f84a816a000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] CPU_Util[%]
65536 57900 0.00 5.06 0.009648 0.36
---------------------------------------------------------------------------------------
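The perftest suite that provides ib_write_bw ships analogous tools for RDMA reads and sends, taking the same flags; a sketch, starting the server side on hpmc9 first and then pointing the client at it:
# ib_read_bw -R
# ib_read_bw hpmc9ib0 -R
# ib_send_bw -R
# ib_send_bw hpmc9ib0 -R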
■ RDMA WRITE LATENCY
Both hpmc9 and hpmc7 are interconnected by InfiniBand.
# 01:23 root@HPMC9: ~ # ib_write_lat -a -d qib0 -i 1 --report_gbits -F -n 1000
01:23 root@HPMC9: ~ # ib_write_lat -a -d qib0 -i 1 --report_gbits -F -n 1000
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Write Latency Test
Dual-port : OFF Device : qib0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
Mtu : 2048[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x01 QPN 0x09ec PSN 0x7ddf72 RKey 0x2646500 VAddr 0x007fa7d18c3000
remote address: LID 0x05 QPN 0x0029 PSN 0xc1d796 RKey 0x1e1f00 VAddr 0x007ffbde44c000
---------------------------------------------------------------------------------------
#bytes #iterations t_min[usec] t_max[usec] t_typical[usec] t_avg[usec] t_stdev[usec] 99% percentile[usec] 99.9% percentile[usec]
2 1000 6.23 57.38 7.30 7.48 3.09 19.36 57.38
4 1000 6.20 41.64 6.62 7.34 2.32 17.11 41.64
8 1000 5.85 973.74 7.27 7.51 2.88 21.10 973.74
16 1000 5.93 41.00 7.28 7.51 2.44 17.73 41.00
32 1000 6.20 40.25 7.29 7.53 2.39 18.21 40.25
64 1000 6.24 38.57 7.36 7.66 2.59 19.52 38.57
128 1000 6.33 44.93 7.45 7.86 2.46 18.67 44.93
256 1000 6.33 43.53 7.41 7.91 2.27 16.97 43.53
512 1000 6.73 43.11 7.46 8.10 2.29 16.52 43.11
1024 1000 7.32 42.31 8.82 8.70 2.24 18.12 42.31
2048 1000 7.68 43.44 9.01 9.38 2.18 17.30 43.44
4096 1000 8.92 40.99 9.73 10.19 2.29 21.20 40.99
8192 1000 11.80 65.74 12.68 13.15 2.37 21.79 65.74
16384 1000 17.30 52.91 18.97 19.37 2.39 27.93 52.91
32768 1000 27.50 64.84 30.05 30.48 2.61 39.86 64.84
65536 1000 46.11 98.02 51.34 51.55 3.49 69.52 98.02
131072 1000 84.70 181.23 91.18 94.01 11.50 158.25 181.23
262144 1000 160.20 318.40 164.29 169.40 20.16 302.77 318.40
524288 1000 315.31 631.62 481.23 491.63 48.41 618.88 631.62
1048576 1000 668.80 1892.48 980.28 975.48 97.46 1184.61 1892.48
2097152 1000 1581.80 2357.73 1999.08 2001.44 159.81 2335.71 2357.73
4194304 1000 2980.43 6388.11 4170.43 4265.46 455.42 5018.32 6388.11
8388608 1000 6790.65 19943.24 9643.13 9673.86 460.42 11191.81 19943.24
---------------------------------------------------------------------------------------
# 01:15 root@HPMC7: ~ # ib_write_lat -n 1000 -a -F hpmc9ib0
01:15 root@HPMC7: ~ # ib_write_lat -n 1000 -a -F hpmc9ib0
---------------------------------------------------------------------------------------
RDMA_Write Latency Test
Dual-port : OFF Device : qib0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 1
Mtu : 2048[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x05 QPN 0x0029 PSN 0xc1d796 RKey 0x1e1f00 VAddr 0x007ffbde44c000
remote address: LID 0x01 QPN 0x09ec PSN 0x7ddf72 RKey 0x2646500 VAddr 0x007fa7d18c3000
---------------------------------------------------------------------------------------
#bytes #iterations t_min[usec] t_max[usec] t_typical[usec] t_avg[usec] t_stdev[usec] 99% percentile[usec] 99.9% percentile[usec]
2 1000 6.26 65.92 7.11 7.45 2.72 18.53 65.92
4 1000 6.30 35.33 6.85 7.32 2.00 17.23 35.33
8 1000 6.22 77.72 7.12 7.48 2.50 17.39 77.72
16 1000 6.26 38.88 7.19 7.48 2.22 16.28 38.88
32 1000 6.25 36.02 7.17 7.51 2.10 17.28 36.02
64 1000 6.31 40.74 7.30 7.63 2.28 17.50 40.74
128 1000 6.34 34.77 7.45 7.84 2.19 17.62 34.77
256 1000 6.46 37.31 7.43 7.88 1.99 17.07 37.31
512 1000 6.78 32.63 7.49 8.08 2.05 17.07 32.63
1024 1000 7.31 33.82 8.74 8.68 1.99 17.88 33.82
2048 1000 7.72 38.05 9.01 9.35 1.91 17.68 38.05
4096 1000 9.06 39.97 9.74 10.17 2.01 21.05 39.97
8192 1000 11.98 66.97 12.72 13.12 2.17 20.83 66.97
16384 1000 17.99 53.98 18.95 19.34 2.08 27.43 53.98
32768 1000 28.76 63.88 30.04 30.46 2.39 37.89 63.88
65536 1000 45.77 92.29 51.34 51.53 2.79 64.58 92.29
131072 1000 87.75 172.78 91.35 93.99 10.07 151.73 172.78
262144 1000 160.01 316.95 165.04 169.29 19.55 298.54 316.95
524288 1000 314.70 627.75 481.06 491.46 47.88 617.67 627.75
1048576 1000 677.13 1205.93 980.28 975.31 97.09 1184.69 1205.93
2097152 1000 1577.45 2366.46 1997.71 2001.18 159.95 2336.93 2366.46
4194304 1000 2990.30 5675.28 4171.74 4264.63 453.91 4979.38 5675.28
8388608 1000 6789.65 18944.48 9641.62 9664.59 360.09 11280.70 18944.48
---------------------------------------------------------------------------------------
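The -a flag above sweeps message sizes from 2 bytes up to 8 MB. A single size can be tested instead with -s, and read and send latency are measured by the matching perftest tools; a sketch, with the corresponding server started on hpmc9 first:
# ib_write_lat -s 65536 -n 1000 -F hpmc9ib0
# ib_read_lat -n 1000 -a -F hpmc9ib0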
■ Measuring Performance with iperf3 Over InfiniBand
Both hpmc9 and hpmc7 are interconnected by InfiniBand; we want to measure the performance between the two nodes for both sender and receiver.
# 01:32 root@HPMC9: ~ # iperf3 -i 5 -s -B hpmc9ib0
01:32 root@HPMC9: ~ # iperf3 -i 5 -s -B hpmc9ib0
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Accepted connection from 172.16.0.119, port 55015
[  5] local 172.16.0.132 port 5201 connected to 172.16.0.119 port 50086
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-5.00   sec  1.15 GBytes  1.98 Gbits/sec
[  5]   5.00-10.00  sec  1.18 GBytes  2.03 Gbits/sec
[  5]  10.00-10.04  sec  9.00 MBytes  2.02 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-10.04  sec  0.00 Bytes   0.00 bits/sec   sender
[  5]   0.00-10.04  sec  2.34 GBytes  2.00 Gbits/sec  receiver
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Accepted connection from 172.16.0.119, port 34576
[  5] local 172.16.0.132 port 5201 connected to 172.16.0.119 port 34578
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-5.00   sec  1.13 GBytes  1.94 Gbits/sec
[  5]   5.00-10.00  sec  1.15 GBytes  1.97 Gbits/sec
[  5]  10.00-10.04  sec  8.31 MBytes  1.72 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-10.04  sec  0.00 Bytes   0.00 bits/sec   sender
[  5]   0.00-10.04  sec  2.28 GBytes  1.95 Gbits/sec  receiver
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Accepted connection from 172.16.0.119, port 34580
[  5] local 172.16.0.132 port 5201 connected to 172.16.0.119 port 34582
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-5.00   sec  1.33 GBytes  2.29 Gbits/sec
[  5]   5.00-10.00  sec  1.51 GBytes  2.60 Gbits/sec
[  5]  10.00-10.04  sec  12.2 MBytes  2.56 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-10.04  sec  0.00 Bytes   0.00 bits/sec   sender
[  5]   0.00-10.04  sec  2.86 GBytes  2.45 Gbits/sec  receiver
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
On hpmc7 we run iperf3:
# 01:24 root@HPMC7: ~ # iperf3 -i 5 -B hpmc7ib0 -c hpmc9ib0
01:24 root@HPMC7: ~ # iperf3 -i 5 -B hpmc7ib0 -c hpmc9ib0
Connecting to host hpmc9ib0, port 5201
[  4] local 172.16.0.119 port 50086 connected to 172.16.0.132 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-5.00   sec  1.16 GBytes  2.00 Gbits/sec    0    177 KBytes
[  4]   5.00-10.00  sec  1.18 GBytes  2.03 Gbits/sec    0    278 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  2.34 GBytes  2.01 Gbits/sec    0   sender
[  4]   0.00-10.00  sec  2.34 GBytes  2.01 Gbits/sec        receiver

iperf Done.

01:25 root@HPMC7: ~ # iperf3 -i 5 -c hpmc9ib0
Connecting to host hpmc9ib0, port 5201
[  4] local 172.16.0.119 port 34578 connected to 172.16.0.132 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-5.00   sec  1.14 GBytes  1.95 Gbits/sec    0    152 KBytes
[  4]   5.00-10.00  sec  1.15 GBytes  1.97 Gbits/sec    0    267 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  2.28 GBytes  1.96 Gbits/sec    0   sender
[  4]   0.00-10.00  sec  2.28 GBytes  1.96 Gbits/sec        receiver

iperf Done.

01:26 root@HPMC7: ~ # iperf3 -i 5 -c hpmc9ib0
Connecting to host hpmc9ib0, port 5201
[  4] local 172.16.0.119 port 34582 connected to 172.16.0.132 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-5.00   sec  1.35 GBytes  2.31 Gbits/sec    0    148 KBytes
[  4]   5.00-10.00  sec  1.51 GBytes  2.60 Gbits/sec    0    231 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  2.86 GBytes  2.45 Gbits/sec    0   sender
[  4]   0.00-10.00  sec  2.86 GBytes  2.45 Gbits/sec        receiver

iperf Done.
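A single TCP stream over IPoIB does not always saturate the link; iperf3's -P flag opens parallel streams and can give a fuller picture. A sketch:
# iperf3 -i 5 -P 4 -c hpmc9ib0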
■ Measuring Performance with iperf3 Over Ethernet
Here we measure the performance between the nodes by binding to the Ethernet interface. Notice how the transfer and bandwidth drop dramatically.
# 01:30 root@HPMC9: ~ # iperf3 -i 5 -s -B hpmc9
01:30 root@HPMC9: ~ # iperf3 -i 5 -s -B hpmc9
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Accepted connection from 192.168.0.117, port 56832
[  5] local 192.168.0.131 port 5201 connected to 192.168.0.117 port 56834
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-5.00   sec  55.6 MBytes  93.2 Mbits/sec
[  5]   5.00-10.00  sec  56.1 MBytes  94.0 Mbits/sec
[  5]  10.00-10.03  sec   396 KBytes  93.9 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-10.03  sec  0.00 Bytes   0.00 bits/sec   sender
[  5]   0.00-10.03  sec   112 MBytes  93.6 Mbits/sec  receiver
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
# 01:24 root@HPMC7: ~ # iperf3 -i 5 -c hpmc9
01:24 root@HPMC7: ~ # iperf3 -i 5 -c hpmc9
Connecting to host hpmc9, port 5201
[  4] local 192.168.0.117 port 56834 connected to 192.168.0.131 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-5.00   sec  56.7 MBytes  95.2 Mbits/sec    0    194 KBytes
[  4]   5.00-10.00  sec  56.7 MBytes  95.1 Mbits/sec    0    260 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec   113 MBytes  95.1 Mbits/sec    0   sender
[  4]   0.00-10.00  sec   112 MBytes  93.9 Mbits/sec        receiver

iperf Done.