HPMC - CUDA Programming over the HPMC GPU

HPMC User Guide v 1.00
© 2022 Bassem W. Jamaleddine

CUDA Programming over the HPMC GPU

The HPMC computer has a GPU that lets you run CUDA programs. Your HPMC has already been configured with the essential device drivers, so the GPU can run any of the following: torch, pytorch, theano, caffe, cuda, and pycuda.

■ About your HPMC CUDA Device

# ./checkDeviceInfor

20:00 root@HPMC7: /appz/professional-c-cuda-programming/CodeSamples/chapter02 #  ./checkDeviceInfor 
./checkDeviceInfor Starting...
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX TITAN X"
  CUDA Driver Version / Runtime Version          9.2 / 6.5
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 11.93 MBytes (12805668864 bytes)
  GPU Clock rate:                                1240 MHz (1.24 GHz)
  Memory Clock rate:                             3505 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes

# ./simpleDeviceQuery

19:59 root@HPMC7: /appz/professional-c-cuda-programming/CodeSamples/chapter03 #  ./simpleDeviceQuery
Device 0: GeForce GTX TITAN X
  Number of multiprocessors:                     24
  Total amount of constant memory:               64.00 KB
  Total amount of shared memory per block:       48.00 KB
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum number of threads per multiprocessor:  2048
  Maximum number of warps per multiprocessor:    64
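
The figures reported by the two programs above can be cross-checked with a little arithmetic. The sketch below is only an illustration: the variable names are chosen here and are not part of either program. It derives the warps per multiprocessor from the thread limit and the warp size, and converts the global memory byte count to GiB. The result suggests that the value labelled "MBytes" in the checkDeviceInfor output is really a count of gibibytes.

```python
# Values copied from the checkDeviceInfor and simpleDeviceQuery output above.
global_mem_bytes = 12805668864
warp_size = 32
max_threads_per_sm = 2048
max_threads_per_block = 1024
num_sms = 24

# Warps that can be resident on one multiprocessor.
warps_per_sm = max_threads_per_sm // warp_size

# Upper bound on threads resident on the whole device at once.
max_resident_threads = num_sms * max_threads_per_sm

# Largest allowed block, expressed in warps.
warps_per_block = max_threads_per_block // warp_size

# Global memory expressed in GiB (1 GiB = 1024**3 bytes).
global_mem_gib = global_mem_bytes / 1024**3

print(f"warps per multiprocessor : {warps_per_sm}")
print(f"max resident threads     : {max_resident_threads}")
print(f"warps in a full block    : {warps_per_block}")
print(f"global memory            : {global_mem_gib:.2f} GiB")
```

The first value agrees with the "Maximum number of warps per multiprocessor" line printed by simpleDeviceQuery.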

To make sure that your GPU is ready for CUDA programming, invoke the randomFog program at the prompt. Figure 3.1.1 shows randomFog running. If randomFog fails to execute, refer to the appendix Troubleshooting the CUDA GPU.

Figure 3.1.1 [randomFog running on the CUDA-capable GPU]
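
A similar readiness check can be made from Python. The following is a minimal sketch, assuming PyTorch is installed as stated at the start of this chapter; pick_device is a helper name chosen here for illustration. It falls back to the CPU when no CUDA device is visible, so it runs either way.

```python
import torch


def pick_device():
    """Return the first CUDA device if PyTorch can see one, else the CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda:0")
    return torch.device("cpu")


device = pick_device()
print("Using device:", device)
if device.type == "cuda":
    print("Device name:", torch.cuda.get_device_name(0))

# Small smoke test: run one tensor operation on the selected device.
x = torch.ones(3, device=device)
total = (x * 2).sum().item()
print("Smoke test result:", total)
```

If this reports the CPU on a machine that should have a working GPU, the driver setup is the first thing to check.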

■ CUDNN with Torch

You can write programs in Lua and run them with Torch on your HPMC.

torch: a Tensor library like NumPy, with strong GPU support. PyTorch is used either as a replacement for NumPy that uses the GPU for computation, or as a platform for deep learning research.

# th bench.lua

20:09 root@HPMC7: /appz/torch #  th bench.lua 
Found Environment variable CUDNN_PATH = /usr/local/cuda-9.2/lib64/libcudnn.so.7.1.4
Running on device: GeForce GTX TITAN X

CONFIG: input = 3x128x128 * ker = 3x96x11x11 (bs = 128, stride = 1)
cudnn.SpatialConvolution                :updateOutput():      51.10
cudnn.SpatialConvolution             :updateGradInput():      57.34
cudnn.SpatialConvolution           :accGradParameters():      50.86
cudnn.SpatialConvolution                         :TOTAL:     159.30
nn.SpatialConvolutionMM                 :updateOutput():      44.83
nn.SpatialConvolutionMM              :updateGradInput():      54.59
nn.SpatialConvolutionMM            :accGradParameters():      56.69
nn.SpatialConvolutionMM                          :TOTAL:     156.12
ccn2.SpatialConvolution                 :updateOutput():      32.38
ccn2.SpatialConvolution              :updateGradInput():      50.96
ccn2.SpatialConvolution            :accGradParameters():      44.72
ccn2.SpatialConvolution                          :TOTAL:     128.06

CONFIG: input = 64x64x64 * ker = 64x128x9x9 (bs = 128, stride = 1)
cudnn.SpatialConvolution                :updateOutput():      89.60
cudnn.SpatialConvolution             :updateGradInput():     264.64
cudnn.SpatialConvolution           :accGradParameters():     141.98
cudnn.SpatialConvolution                         :TOTAL:     496.21
nn.SpatialConvolutionMM                 :updateOutput():     138.05
nn.SpatialConvolutionMM              :updateGradInput():     219.20
nn.SpatialConvolutionMM            :accGradParameters():     330.22
nn.SpatialConvolutionMM                          :TOTAL:     687.47
ccn2.SpatialConvolution                 :updateOutput():     109.02
ccn2.SpatialConvolution              :updateGradInput():     125.64
ccn2.SpatialConvolution            :accGradParameters():     227.74
ccn2.SpatialConvolution                          :TOTAL:     462.40

CONFIG: input = 128x32x32 * ker = 128x128x9x9 (bs = 128, stride = 1)
cudnn.SpatialConvolution                :updateOutput():      30.75
cudnn.SpatialConvolution             :updateGradInput():      74.42
cudnn.SpatialConvolution           :accGradParameters():      54.36
cudnn.SpatialConvolution                         :TOTAL:     159.52
nn.SpatialConvolutionMM                 :updateOutput():      53.73
nn.SpatialConvolutionMM              :updateGradInput():     103.11
nn.SpatialConvolutionMM            :accGradParameters():      55.03
nn.SpatialConvolutionMM                          :TOTAL:     211.87
ccn2.SpatialConvolution                 :updateOutput():      37.26
ccn2.SpatialConvolution              :updateGradInput():      46.71
ccn2.SpatialConvolution            :accGradParameters():      85.09
ccn2.SpatialConvolution                          :TOTAL:     169.07

CONFIG: input = 128x16x16 * ker = 128x128x7x7 (bs = 128, stride = 1)
cudnn.SpatialConvolution                :updateOutput():       4.33
cudnn.SpatialConvolution             :updateGradInput():       7.33
cudnn.SpatialConvolution           :accGradParameters():       8.70
cudnn.SpatialConvolution                         :TOTAL:      20.36
nn.SpatialConvolutionMM                 :updateOutput():      14.75
nn.SpatialConvolutionMM              :updateGradInput():      18.78
nn.SpatialConvolutionMM            :accGradParameters():      12.61
nn.SpatialConvolutionMM                          :TOTAL:      46.14
ccn2.SpatialConvolution                 :updateOutput():       4.57
ccn2.SpatialConvolution              :updateGradInput():       4.29
ccn2.SpatialConvolution            :accGradParameters():       7.69
ccn2.SpatialConvolution                          :TOTAL:      16.55

CONFIG: input = 384x13x13 * ker = 384x384x3x3 (bs = 128, stride = 1)
cudnn.SpatialConvolution                :updateOutput():       7.77
cudnn.SpatialConvolution             :updateGradInput():      11.71
cudnn.SpatialConvolution           :accGradParameters():      12.91
cudnn.SpatialConvolution                         :TOTAL:      32.39
nn.SpatialConvolutionMM                 :updateOutput():      13.15
nn.SpatialConvolutionMM              :updateGradInput():      17.28
nn.SpatialConvolutionMM            :accGradParameters():      14.63
nn.SpatialConvolutionMM                          :TOTAL:      45.06
ccn2.SpatialConvolution                 :updateOutput():       7.46
ccn2.SpatialConvolution              :updateGradInput():       8.58
ccn2.SpatialConvolution            :accGradParameters():      14.21
ccn2.SpatialConvolution                          :TOTAL:      30.25
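
The TOTAL rows above are easier to compare side by side. The sketch below is an illustration only: it copies the TOTAL values from the output and assumes they are elapsed times, so that lower is better. The output itself does not state a unit, and the dictionary layout and variable names are chosen here.

```python
# TOTAL rows copied from the bench.lua output above, keyed by CONFIG line.
# Assumption: the figures are elapsed times, so the smallest value is fastest.
totals = {
    "3x128x128 * 3x96x11x11":  {"cudnn": 159.30, "nn": 156.12, "ccn2": 128.06},
    "64x64x64 * 64x128x9x9":   {"cudnn": 496.21, "nn": 687.47, "ccn2": 462.40},
    "128x32x32 * 128x128x9x9": {"cudnn": 159.52, "nn": 211.87, "ccn2": 169.07},
    "128x16x16 * 128x128x7x7": {"cudnn": 20.36,  "nn": 46.14,  "ccn2": 16.55},
    "384x13x13 * 384x384x3x3": {"cudnn": 32.39,  "nn": 45.06,  "ccn2": 30.25},
}

fastest = {}
for config, by_backend in totals.items():
    # Backend with the smallest TOTAL for this configuration.
    best = min(by_backend, key=by_backend.get)
    fastest[config] = best
    # How nn.SpatialConvolutionMM compares with cudnn.SpatialConvolution.
    ratio = by_backend["nn"] / by_backend["cudnn"]
    print(f"{config:26s} fastest={best:5s} nn/cudnn={ratio:.2f}")
```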

■ nn with PyTorch

In PyTorch, the nn package defines a set of Modules, which are roughly equivalent to neural network layers. A Module receives input Tensors and computes output Tensors, but may also hold internal state such as Tensors containing learnable parameters. The nn package also defines a set of useful loss functions that are commonly used when training neural networks.
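
As a minimal sketch of the nn package described above, the following fits a cubic polynomial to sin(x) with a single Linear Module and an nn loss function. It is an illustration under stated assumptions, not the listing that follows: the variable names, the fixed seed, and the learning rate are chosen here.

```python
import math
import torch

torch.manual_seed(0)  # fixed seed so repeated runs start from the same weights

# Training data: 2000 samples of sin(x) on [-pi, pi].
x = torch.linspace(-math.pi, math.pi, 2000)
y = torch.sin(x)

# Features (x, x^2, x^3), so one Linear layer represents a cubic polynomial.
powers = torch.tensor([1, 2, 3])
features = x.unsqueeze(-1).pow(powers)  # shape (2000, 3)

# A Module composed of two Modules; Linear holds the learnable weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(3, 1),
    torch.nn.Flatten(0, 1),  # (2000, 1) -> (2000,)
)
loss_fn = torch.nn.MSELoss(reduction="sum")  # one of the nn loss functions

learning_rate = 1e-6
losses = []
for step in range(2000):
    y_pred = model(features)
    loss = loss_fn(y_pred, y)
    losses.append(loss.item())

    model.zero_grad()
    loss.backward()

    # Plain gradient descent on the parameters held by the Module.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

linear = model[0]
print("first loss:", losses[0], "last loss:", losses[-1])
print("bias:", linear.bias.item())
print("weights:", linear.weight.detach().flatten().tolist())
```

To run the sketch on the GPU, the model and the tensors would be moved to a CUDA device with .to(device) before the loop.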

-- Program Code 3.1.1 : [LISTING pytnn.py] - [Fit a polynomial model to sin(x) using a custom autograd Function]

    # -*- coding: utf-8 -*-
    import torch
    import math


    class LegendrePolynomial3(torch.autograd.Function):
        """
        We can implement our own custom autograd Functions by subclassing
        torch.autograd.Function and implementing the forward and backward passes
        which operate on Tensors.
        """

        @staticmethod
        def forward(ctx, input):
            """
            In the forward pass we receive a Tensor containing the input and return
            a Tensor containing the output. ctx is a context object that can be used
            to stash information for backward computation. You can cache arbitrary
            objects for use in the backward pass using the ctx.save_for_backward
            method.
            """
            ctx.save_for_backward(input)
            return 0.5 * (5 * input ** 3 - 3 * input)

        @staticmethod
        def backward(ctx, grad_output):
            """
            In the backward pass we receive a Tensor containing the gradient of the
            loss with respect to the output, and we need to compute the gradient of
            the loss with respect to the input.
            """
            input, = ctx.saved_tensors
            return grad_output * 1.5 * (5 * input ** 2 - 1)


    dtype = torch.float
    device = torch.device("cpu")
    # device = torch.device("cuda:0")  # Uncomment this to run on GPU

    # Create Tensors to hold input and outputs.
    # By default, requires_grad=False, which indicates that we do not need to
    # compute gradients with respect to these Tensors during the backward pass.
    x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
    y = torch.sin(x)

    # Create random Tensors for weights. For this example, we need
    # 4 weights: y = a + b * P3(c + d * x), these weights need to be initialized
    # not too far from the correct result to ensure convergence.
    # Setting requires_grad=True indicates that we want to compute gradients with
    # respect to these Tensors during the backward pass.
    a = torch.full((), 0.0, device=device, dtype=dtype, requires_grad=True)
    b = torch.full((), -1.0, device=device, dtype=dtype, requires_grad=True)
    c = torch.full((), 0.0, device=device, dtype=dtype, requires_grad=True)
    d = torch.full((), 0.3, device=device, dtype=dtype, requires_grad=True)

    learning_rate = 5e-6
    for t in range(2000):
        # To apply our Function, we use Function.apply method. We alias this as 'P3'.
        P3 = LegendrePolynomial3.apply

        # Forward pass: compute predicted y using operations; we compute
        # P3 using our custom autograd operation.
        y_pred = a + b * P3(c + d * x)

        # Compute and print loss
        loss = (y_pred - y).pow(2).sum()
        if t % 100 == 99:
            print(t, loss.item())

        # Use autograd to compute the backward pass.
        loss.backward()

        # Update weights using gradient descent
        with torch.no_grad():
            a -= learning_rate * a.grad
            b -= learning_rate * b.grad
            c -= learning_rate * c.grad
            d -= learning_rate * d.grad

            # Manually zero the gradients after updating weights
            a.grad = None
            b.grad = None
            c.grad = None
            d.grad = None

    print(f'Result: y = {a.item()} + {b.item()} * P3({c.item()} + {d.item()} x)')
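
The backward pass in the listing relies on the derivative of the forward expression: for P3(u) = 0.5 * (5u^3 - 3u), the derivative is 1.5 * (5u^2 - 1). The sketch below checks that relationship numerically in plain Python, without PyTorch. The function names and the sample points are chosen here for illustration.

```python
def p3(u):
    """Third-degree Legendre polynomial, the expression used in forward()."""
    return 0.5 * (5 * u ** 3 - 3 * u)


def p3_prime(u):
    """Analytic derivative, the factor that backward() applies to grad_output."""
    return 1.5 * (5 * u ** 2 - 1)


def central_difference(f, u, h=1e-5):
    """Numerical derivative of f at u using a symmetric finite difference."""
    return (f(u + h) - f(u - h)) / (2 * h)


max_error = 0.0
for u in (-1.0, -0.5, 0.0, 0.3, 1.0):
    error = abs(p3_prime(u) - central_difference(p3, u))
    max_error = max(max_error, error)
    print(f"u={u:+.1f}  analytic={p3_prime(u):+.4f}  error={error:.2e}")
```

A small error at every sample point indicates that the two expressions are consistent with each other.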

Running the program on the HPMC.

14:32 root@HPMC7: /home/hpcusr/examples/pytorch #  python3 pytnn.py  
99 209.95834350585938
199 144.66018676757812
299 100.70249938964844
399 71.03519439697266
499 50.97850799560547
599 37.403133392333984
699 28.206867218017578
799 21.97318458557129
899 17.7457275390625
999 14.877889633178711
1099 12.93176555633545
1199 11.610918998718262
1299 10.71425724029541
1399 10.10548210144043
1499 9.692106246948242
1599 9.411375045776367
1699 9.220745086669922
1799 9.091285705566406
1899 9.003360748291016
1999 8.943639755249023
Result: y = -5.394172664097141e-09 + -2.208526849746704 * P3(1.367587154632588e-09 + 0.2554861009120941 x)
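
The fitted coefficients can be checked against sin(x) without PyTorch. The sketch below is an illustration only: it copies the four values from the Result line above and evaluates the model in double precision on the same 2000 evenly spaced points that the listing uses. The helper names are chosen here, and small differences from the printed loss values are to be expected because the program trains in single precision.

```python
import math

# Coefficients copied from the "Result" line printed by pytnn.py above.
a = -5.394172664097141e-09
b = -2.208526849746704
c = 1.367587154632588e-09
d = 0.2554861009120941


def p3(u):
    """Third-degree Legendre polynomial, as in the listing."""
    return 0.5 * (5 * u ** 3 - 3 * u)


def model(x):
    """The fitted model y = a + b * P3(c + d * x)."""
    return a + b * p3(c + d * x)


# Same sample points as torch.linspace(-pi, pi, 2000): endpoints included.
n = 2000
xs = [-math.pi + 2 * math.pi * i / (n - 1) for i in range(n)]

sse = sum((model(x) - math.sin(x)) ** 2 for x in xs)
worst = max(abs(model(x) - math.sin(x)) for x in xs)
print(f"sum of squared errors : {sse:.3f}")
print(f"largest absolute error: {worst:.3f}")
```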