HPMC User Guide v 1.00
© 2022 Bassem W. Jamaleddine
HPMC calculators that are fabric capable can be interconnected to take advantage of RDMA via DAPL, executing programs and distributing the processing across neighboring HPMCs.
Two or more HPMCs can be interconnected for DMA using switched networking cards (InfiniBand adapters). An application that executes on the MIC of one HPMC can also execute on the MIC of another HPMC system.
The core (* not CPU core) of such execution is built on the basic and essential RDMA components, which are stacked in what is called rdma-core. Message passing across RDMA is essentially what allows a software application to be executed on a group (* we will avoid using cluster or clustering) of HPMCs. The Message Passing Interface (MPI) is a standardized set of protocols (open or not) that allows the execution of a program to be distributed over many HPMCs. While MPI is a monolithic interface that exploits RDMA and IPoIB switching (the subject of DAPL), its execution benefits from leaving the Central Processing Unit (CPU) alone; it is therefore asynchronous and fully duplexed, depending on the underlying fabric.
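Before looking at the benchmark, the following minimal sketch (a hypothetical hello_fabric.c, not part of the HPMC distribution) illustrates the message-passing model: each rank reports the host it runs on, and rank 0 passes one value to rank 1 over whatever fabric the MPI runtime has selected (for example via I_MPI_FABRICS).

/* hello_fabric.c -- illustrative sketch only.
 * Each rank prints its host name; rank 0 sends one integer to rank 1.
 * The fabric (ofa, dapl, shm, tcp) is chosen at run time, not in the source. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(name, &namelen);
    printf("rank %d of %d running on %s\n", rank, size, name);

    if (size > 1) {
        int token = 42;
        if (rank == 0) {
            /* The send travels over the fabric the MPI runtime selected. */
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received token %d from rank 0\n", token);
        }
    }
    MPI_Finalize();
    return 0;
}

The actual benchmark used throughout this chapter is the trap_offload.c program below.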
Consider the following trap_offload.c program that we will be running to test the MPI execution on our HPMCs.
#include <math.h>
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_TRAPEZOIDS 10000000000

__attribute__((target(mic))) inline double f(double x) {
    return 1.00*x*x*exp(-(x-0.0)*(x-0.0)/(2.0*0.25*0.25))
         + 0.50*x*x*exp(-(x-0.2)*(x-0.2)/(2.0*0.50*0.50))
         + 0.50*x*x*exp(-(x+0.2)*(x+0.2)/(2.0*0.50*0.50))
         + 0.25*x*x*exp(-(x-0.4)*(x-0.4)/(2.0*1.00*1.00))
         + 0.25*x*x*exp(-(x+0.4)*(x+0.4)/(2.0*1.00*1.00));
}

int main(int argc, char *argv[]) {
    int namelen, rank, size;
    char name[MPI_MAX_PROCESSOR_NAME];
    double upper_bound = 5.0, lower_bound = -5.0;
    double x0, x1, width;
    double integral = 0;
    double compute_time, total_time;
    int chunk_size;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(name, &namelen);

    chunk_size = NUM_TRAPEZOIDS / size;
    x0 = lower_bound + (upper_bound - lower_bound)*rank/size;
    x1 = x0 + (upper_bound - lower_bound)/size;
    width = (x1 - x0)/chunk_size;

    MPI_Barrier(MPI_COMM_WORLD);
    compute_time = total_time = MPI_Wtime();

    #pragma offload target(mic)
    #pragma omp parallel
    #pragma omp for reduction(+:integral)
    for (int i = 0; i < chunk_size; i++) {
        integral += 0.5*width*(f(x0+width*i) + f(x0+width*(i+1)));
    }
    compute_time = MPI_Wtime() - compute_time;

    MPI_Allreduce(MPI_IN_PLACE, &integral, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    total_time = MPI_Wtime() - total_time;

    printf("rank %d of %d on %s: %f seconds\n", rank, size, name, compute_time);
    if (rank == 0) {
        printf("integral = %f, time = %f\n", integral, total_time);
    }
    MPI_Finalize();
    return(0);
}
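The exact build command depends on the installed toolchain; on these systems a build along the following lines is expected, assuming the Intel MPI compiler wrapper mpiicc is on the path (as in the Intel XE 2017 installation used below):

# mpiicc -qopenmp -O2 -o trap_offload trap_offload.c

Here -qopenmp enables the OpenMP pragmas; the #pragma offload regions are handled by the underlying Intel compiler when the MIC tools are installed. Exact paths and flags may differ on other installations.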
■ Offloading to MIC
Both hpmc9 and hpmc7 are interconnected by InfiniBand.
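The state of the InfiniBand link can be checked with the standard diagnostics, assuming the infiniband-diags package is installed; for example:

# ibstat

which reports the adapter ports, their state (Active), and the link rate.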
13:17 root@HPMC9: /mm03fs/MPI # I_MPI_FABRICS=ofa mpirun -n 8 ./trap_offload
rank 0 of 8 on HPMC9: 10.728176 seconds
integral = 1.856399, time = 13.593669
rank 1 of 8 on HPMC9: 9.398290 seconds
rank 2 of 8 on HPMC9: 12.958419 seconds
rank 3 of 8 on HPMC9: 11.538105 seconds
rank 4 of 8 on HPMC9: 13.529295 seconds
rank 5 of 8 on HPMC9: 13.593640 seconds
rank 6 of 8 on HPMC9: 13.218497 seconds
rank 7 of 8 on HPMC9: 13.058236 seconds
■ Offloading to MIC with 2 Ranks
Running the MPI program with 2 ranks, offloading over the ofa fabric (OFA, Open Fabrics Alliance).
13:36 root@HPMC9: /mm03fs/MPI # OMP_NUM_THREADS=8 I_MPI_FABRICS=ofa mpirun -n 2 ./trap_offload
rank 0 of 2 on HPMC9: 22.342904 seconds
integral = 1.856399, time = 22.342928
rank 1 of 2 on HPMC9: 22.163031 seconds
Running the MPI program, offloading using DAPL. In this invocation I_MPI_FABRICS=shm:dapl selects shared memory within a node and DAPL between nodes, I_MPI_DAPL_PROVIDER_LIST names the DAPL provider for the Mellanox adapter, I_MPI_PERHOST=1 places one rank per host, I_MPI_FALLBACK=0 disables falling back to another fabric, and I_MPI_DEBUG=5 produces the startup diagnostics shown below.
13:36 root@HPMC9: /mm03fs/MPI # OMP_NUM_THREADS=8 I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u I_MPI_PERHOST=1 I_MPI_DEBUG=5 I_MPI_FABRICS=shm:dapl I_MPI_FALLBACK=0 /opt/INTEL-XE-2017-update7/compilers_and_libraries_2017.7.259/linux/mpi/intel64/bin/mpiexec.hydra -n 2 -hosts hpmc9 ./trap_offload
[0] MPI startup(): Multi-threaded optimized library
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[0] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[0] MPI startup(): shm and dapl data transfer modes
[1] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[1] MPI startup(): shm and dapl data transfer modes
[0] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[0] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[1] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[1] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 33344 HPMC9 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,56,57,
58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83}
[0] MPI startup(): 1 33345 HPMC9 {28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54
,55,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,
107,108,109,110,111}
[0] MPI startup(): I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=shm:dapl
[0] MPI startup(): I_MPI_FALLBACK=0
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=i40iw0:0,i40iw1:0,qib0:0,qib1:0,mic0:1
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] MPI startup(): I_MPI_PIN_MAPPING=2:0 0,1 28
rank 0 of 2 on HPMC9: 24.074761 seconds
integral = 1.856399, time = 24.215596
rank 1 of 2 on HPMC9: 24.215570 seconds
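For a quick comparison of the three runs: the 8-rank ofa run reports a total time of 13.593669 seconds, while the 2-rank runs take 22.342928 seconds over ofa and 24.215596 seconds over shm:dapl, so the 8-rank run is roughly 22.34 / 13.59 ≈ 1.6 times faster than the 2-rank ofa run. All three runs report the same integral (1.856399), as expected, since only the distribution of the trapezoids across ranks changes.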