HPMC User Guide v 1.00
© 2022 Bassem W. Jamaleddine
HPMC calculators that are fabric capable can be interconnected to take advantage of RDMA via DAPL, executing programs and distributing the processing across neighboring HPMCs.
Two or more HPMCs can be interconnected for DMA using switched networking cards (InfiniBand adapters). An application that executes on the MIC of one HPMC can also execute on the MIC of another HPMC system.
The core (* not CPU core) of such execution is built on the basic and essential RDMA components, which are stacked in what is called rdma-core. Message passing across RDMA is essentially what allows a software application to be executed on a group (* we will avoid using cluster or clustering) of HPMCs. The Message Passing Interface (MPI) is a standardized set of protocols (open or not) that allows the execution of a program to be distributed over many HPMCs. While MPI is a monolithic interface that exploits RDMA and IPoIB switching (the subject of DAPL), its execution benefits from leaving the Central Processing Unit (CPU) alone; it is therefore asynchronous and fully duplexed, depending on the underlying fabric.
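Before looking at the benchmark, the following minimal sketch (a hypothetical hello_fabric.c, not part of the HPMC distribution) illustrates the message-passing model: each rank reports the host it runs on, and rank 0 passes one value to rank 1 over whatever fabric the MPI runtime has selected (for example via I_MPI_FABRICS).

/* hello_fabric.c -- illustrative sketch only.
 * Each rank prints its host name; rank 0 sends one integer to rank 1.
 * The fabric (ofa, dapl, shm, tcp) is chosen at run time, not in the source. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(name, &namelen);
    printf("rank %d of %d running on %s\n", rank, size, name);

    if (size > 1) {
        int token = 42;
        if (rank == 0) {
            /* The send travels over the fabric the MPI runtime selected. */
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received token %d from rank 0\n", token);
        }
    }
    MPI_Finalize();
    return 0;
}

The actual benchmark used throughout this chapter is the trap_offload.c program below.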
Consider the following trap_offload.c program that we will be running to test the MPI execution on our HPMCs.
#include <math.h>
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_TRAPEZOIDS 10000000000

__attribute__((target(mic))) inline double f(double x) {
    return 1.00*x*x*exp(-(x-0.0)*(x-0.0)/(2.0*0.25*0.25))
         + 0.50*x*x*exp(-(x-0.2)*(x-0.2)/(2.0*0.50*0.50))
         + 0.50*x*x*exp(-(x+0.2)*(x+0.2)/(2.0*0.50*0.50))
         + 0.25*x*x*exp(-(x-0.4)*(x-0.4)/(2.0*1.00*1.00))
         + 0.25*x*x*exp(-(x+0.4)*(x+0.4)/(2.0*1.00*1.00));
}

int main(int argc, char *argv[]) {
    int namelen, rank, size;
    char name[MPI_MAX_PROCESSOR_NAME];
    double upper_bound = 5.0, lower_bound = -5.0;
    double x0, x1, width;
    double integral = 0;
    double compute_time, total_time;
    int chunk_size;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(name, &namelen);

    chunk_size = NUM_TRAPEZOIDS / size;
    x0 = lower_bound + (upper_bound - lower_bound)*rank/size;
    x1 = x0 + (upper_bound - lower_bound)/size;
    width = (x1 - x0)/chunk_size;

    MPI_Barrier(MPI_COMM_WORLD);
    compute_time = total_time = MPI_Wtime();

    #pragma offload target(mic)
    #pragma omp parallel
    #pragma omp for reduction(+:integral)
    for (int i = 0; i < chunk_size; i++) {
        integral += 0.5*width*(f(x0+width*i) + f(x0+width*(i+1)));
    }
    compute_time = MPI_Wtime() - compute_time;

    MPI_Allreduce(MPI_IN_PLACE, &integral, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    total_time = MPI_Wtime() - total_time;

    printf("rank %d of %d on %s: %f seconds\n", rank, size, name, compute_time);
    if (rank == 0) {
        printf("integral = %f, time = %f\n", integral, total_time);
    }
    MPI_Finalize();
    return(0);
}
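The exact build command depends on the installed toolchain; on these systems a build along the following lines is expected, assuming the Intel MPI compiler wrapper mpiicc is on the path (as in the Intel XE 2017 installation used below):

# mpiicc -qopenmp -O2 -o trap_offload trap_offload.c

Here -qopenmp enables the OpenMP pragmas; the #pragma offload regions are handled by the underlying Intel compiler when the MIC tools are installed. Exact paths and flags may differ on other installations.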
■ Offloading to MIC
Both hpmc9 and hpmc7 are interconnected by InfiniBand.
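The state of the InfiniBand link can be checked with the standard diagnostics, assuming the infiniband-diags package is installed; for example:

# ibstat

which reports the adapter ports, their state (Active), and the link rate.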
13:17 root@HPMC9: /mm03fs/MPI # I_MPI_FABRICS=ofa mpirun -n 8 ./trap_offload
rank 0 of 8 on HPMC9: 10.728176 seconds
integral = 1.856399, time = 13.593669
rank 1 of 8 on HPMC9: 9.398290 seconds
rank 2 of 8 on HPMC9: 12.958419 seconds
rank 3 of 8 on HPMC9: 11.538105 seconds
rank 4 of 8 on HPMC9: 13.529295 seconds
rank 5 of 8 on HPMC9: 13.593640 seconds
rank 6 of 8 on HPMC9: 13.218497 seconds
rank 7 of 8 on HPMC9: 13.058236 seconds
■ Offloading to MIC with 2 Ranks
Running the MPI program with 2 ranks, offloading over the ofa fabric (OFA, Open Fabrics Alliance).
13:36 root@HPMC9: /mm03fs/MPI # OMP_NUM_THREADS=8 I_MPI_FABRICS=ofa mpirun -n 2 ./trap_offload
rank 0 of 2 on HPMC9: 22.342904 seconds
integral = 1.856399, time = 22.342928
rank 1 of 2 on HPMC9: 22.163031 seconds
Running the MPI program, offloading using DAPL. In this invocation I_MPI_FABRICS=shm:dapl selects shared memory within a node and DAPL between nodes, I_MPI_DAPL_PROVIDER_LIST names the DAPL provider for the Mellanox adapter, I_MPI_PERHOST=1 places one rank per host, I_MPI_FALLBACK=0 disables falling back to another fabric, and I_MPI_DEBUG=5 produces the startup diagnostics shown below.
13:36 root@HPMC9: /mm03fs/MPI # OMP_NUM_THREADS=8 I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u I_MPI_PERHOST=1 I_MPI_DEBUG=5 I_MPI_FABRICS=shm:dapl I_MPI_FALLBACK=0 /opt/INTEL-XE-2017-update7/compilers_and_libraries_2017.7.259/linux/mpi/intel64/bin/mpiexec.hydra -n 2 -hosts hpmc9 ./trap_offload
[0] MPI startup(): Multi-threaded optimized library
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1u
[0] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[0] MPI startup(): shm and dapl data transfer modes
[1] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[1] MPI startup(): shm and dapl data transfer modes
[0] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[0] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[1] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[1] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 33344 HPMC9 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,56,57,
58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83}
[0] MPI startup(): 1 33345 HPMC9 {28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54
,55,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,
107,108,109,110,111}
[0] MPI startup(): I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=shm:dapl
[0] MPI startup(): I_MPI_FALLBACK=0
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=i40iw0:0,i40iw1:0,qib0:0,qib1:0,mic0:1
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] MPI startup(): I_MPI_PIN_MAPPING=2:0 0,1 28
rank 0 of 2 on HPMC9: 24.074761 seconds
integral = 1.856399, time = 24.215596
rank 1 of 2 on HPMC9: 24.215570 seconds
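For a quick comparison of the three runs: the 8-rank ofa run reports a total time of 13.593669 seconds, while the 2-rank runs take 22.342928 seconds over ofa and 24.215596 seconds over shm:dapl, so the 8-rank run is roughly 22.34 / 13.59 ≈ 1.6 times faster than the 2-rank ofa run. All three runs report the same integral (1.856399), as expected, since only the distribution of the trapezoids across ranks changes.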