Saturday, December 29, 2018

Implementation on a GPU Example


When using the GPU, an additional level of trickiness comes into play with arrays of our structures (e.g., Vector.v[i], Matrix.M[i][j], Tensor.T[i][j][k], and Hypercube.H[i][j][k][l]).

This is because there are really two sets of indices we need to track: one for the spatial location (i.e., which grid point) and one for the components of each Vector, Matrix, Tensor, or Hypercube at that point.

The solution we use in this example is, again, to create flattened arrays of these structures. The trick used to manage the two sets of indices goes right to the heart of GPU programming.

We’ll start with a CPU example first as a baseline, and then see how we modify this for CUDA C/C++ programming to get significant speed up.

First, as an example, let's take a look at how we might scroll through all the values of the conformal metric, which I'm going to call 'hg'. We'll use x, y, and z to scroll through all the spatial points in our NxNxN grid, and i and j to scroll through the component values of hg at each point. Note we first need to allocate memory for the array of hg, which is a Matrix in this example:

Matrix *hg = NULL;                          //conformal metric, 3x3 Matrix
hg = (Matrix*)malloc(sizeof(Matrix)*N*N*N);

for(int x = 0; x < N; x++){
    for(int y = 0; y < N; y++){
        for(int z = 0; z < N; z++){
            int Indx = N*N*x + N*y + z;     //flattened spatial index
            for(int i = 0; i < 3; i++){
                for(int j = 0; j < 3; j++){
                    hg[Indx].M[i][j] = something;
                }
            }
        }
    }
}

One can see from the example that the flattening applies only to the spatial indices x, y, and z, while the components at each point, because we are using the structures we have created, look a whole lot like they would if they were plain-old-vanilla 2-dimensional arrays at each point.

To exploit the speed up of a GPU, we’re going to convert those spatial index for-loops into individual blocks and threads, so that the above gets converted into the following steps. (A fuller explanation of CUDA’s blocks and threads structure on a GPU can be found in references [2] and [3].)

First, we need to allocate memory for hg not just in the CPU memory, but for its counterpart in the GPU memory as well (the two memories are distinct, and forgetting that detail can lead to completely confusing bugs).

For clarity, I’ve adopted the convention of prefixing all host (i.e., CPU) variables with an ‘h’ and then use the same name (without the ‘h’) for the GPU variable.

Matrix *hg = NULL;                          //conformal metric, CPU variable
Matrix *g  = NULL;                          //conformal metric, GPU variable

hg = (Matrix*)malloc(sizeof(Matrix)*N*N*N); //allocate on CPU
cudaMalloc(&g, N*N*N*sizeof(Matrix));       //allocate on GPU

Now, suppose we've used the CPU to read in the initial values of the conformal metric from some file. That means the elements of the NxNxNx3x3 array hg.M[i][j] are in CPU memory, but not, at this point, in GPU memory.

The second step is to copy those values from CPU memory into GPU memory. One way to do that in CUDA C/C++ is to use the built-in CUDA function cudaMemcpy, as in the example below:

cudaMemcpy(g, hg, N*N*N*sizeof(Matrix), cudaMemcpyHostToDevice);

Let's talk about blocks and threads on the GPU for a moment. Not in great depth, mind you. For a more complete discussion, one would do well to check the articles, books, and websites in the references. The GPU, and specifically the GTX 1080 Ti, allows a maximum of 1,024 threads per block. The only practical limit on the number of blocks is the memory of the GPU card itself, which for the 1080 Ti is 11 GB. The choice of threads and blocks is a subtle thing as far as peak performance goes. However, for this example, we're going to use the following:

#define TX 8                        //threads per block in x
#define TY 8                        //threads per block in y
#define TZ 8                        //threads per block in z
const dim3 blockSize(TX, TY, TZ);
const int bx = (N + TX - 1)/TX;     //number of blocks in x (rounds up)
const int by = (N + TY - 1)/TY;
const int bz = (N + TZ - 1)/TZ;
const dim3 gridSize = dim3(bx, by, bz);

Calling a kernel on the GPU, for example one to evolve the values of the conformal metric, might be done using the following in the CPU code:

evolveMetric<<<gridSize,blockSize>>>(gnext,g,delt,N);

The GPU kernel which implements the evolution of the conformal spatial metric might look like the following:

__global__ void evolveMetric(Matrix *gnext, Matrix *g, float delt,
                             int n)
{
    const int x = blockIdx.x*blockDim.x + threadIdx.x;
    const int y = blockIdx.y*blockDim.y + threadIdx.y;
    const int z = blockIdx.z*blockDim.z + threadIdx.z;

    if (x >= n || y >= n || z >= n) return;  //skip threads outside the grid

    int Indx = x*n*n + y*n + z;
    float term1, term2;

    for(int i = 0; i < 3; i++){
        for(int j = 0; j < 3; j++){
            term1 = something;
            term2 = something_else;
            gnext[Indx].M[i][j] = g[Indx].M[i][j] + delt*(term1 + term2);
        }
    }
}

The above GPU code snippet makes use of several built-in CUDA C/C++ structures:

·       blockIdx – gives the index of the block within the grid
·       blockDim – gives the number of threads per block in each dimension
·       threadIdx – gives the index of the thread within its block

An important point in the above snippet is that in CUDA we've replaced the for-loops for x, y, and z that were used in the CPU version with indices to the blocks and threads. Remember, we have up to 1,024 threads per block we can use, each thread running its own copy of the kernel. You might say that each thread 'knows' its own thread and block index. Therefore, a statement like the one used to define Indx in the code snippet above evaluates to a different number depending on which thread and block is executing it.

Tricky, right?

Next: Finite Differences on a GPU Machine
<--Back: Implementation on a CPU example
<--Back: Overview

