Monday, December 31, 2018

Structures Used to Implement on GPU Machine

So, our job is to figure out how best to code the above equations in CUDA C/C++. Examining these basic evolution equations, it is clear we need to keep track of five evolution variables at each of the NxNxN points in our grid. These are:

γ̃_ij   conformal metric (3x3 matrix)

Ã_ij   trace-free extrinsic curvature (3x3 matrix)

φ      moving puncture conformal variable (scalar)

K      trace of the extrinsic curvature (scalar)

Γ̃^i    moving puncture Gamma (vector)

Note that scalars have just one value to keep track of at each point, while each vector has three values at each grid point: one each for the x-, y-, and z-directions.

The 3x3 matrices have nine values at each point. The conformal metric, for example:

γ̃_xx   γ̃_xy   γ̃_xz
γ̃_yx   γ̃_yy   γ̃_yz
γ̃_zx   γ̃_zy   γ̃_zz
In addition, at each point in the grid we need to track the Ricci curvature tensor, R_ij, and its component values. Note that the Ricci curvature tensor and its components are all 3x3 matrices.
One might use a 3x3x3 (Tensor) structure to store the 3x3x3 = 27 values of the Christoffel symbol, Γ^k_ij.
We also need to track the lapse (α) and shift variables (β^i). While α is a scalar, the shift variable is a 3-value vector (β^x, β^y, and β^z).

Additionally, there are a lot of first- and second-order derivatives to keep track of. Knowing that each time we take the derivative of a tensor the result is a tensor of one higher order, we can predict that: the 3 partial derivatives of a scalar will be a 3-vector; the 3 partial derivatives of each component of a 3-vector will be a 3x3 matrix; the 3 partial derivatives of each component of a 3x3 matrix will be a 3x3x3 tensor; and the 3 partial derivatives of each component of a 3x3x3 tensor (or the second derivatives of a 3x3 matrix) will be a 3x3x3x3 tensor (which, for convenience, will be referred to as a 3x3x3x3 hypercube).

As a side note, all of the 3x3 matrices we use in the BSSN moving puncture method are of course symmetric, which is to say that for any such 3x3 matrix F, the components satisfy F[i][j] = F[j][i]. Storing only the six independent components looks like a good way to save memory, but it does involve more complicated indices in many of the for-loops we'll be using. So, for this simplest of examples, we'll stick to computing the full 3x3 = 9 values, knowing full well that exploiting these symmetries would be a great way to optimize this code in both time and memory; a sketch of that alternative follows.
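
For the curious, here is a minimal sketch of what symmetric storage might look like (hypothetical, and not used in the rest of these examples): store only the six unique components in a flat array, and map each (i, j) pair to one of them with a small lookup table.

typedef struct{
      float s[6];              //xx, xy, xz, yy, yz, zz
}SymMatrix;

//hypothetical helper: maps i,j in 0..2 to an index in 0..5,
//relying on the symmetry F[i][j] == F[j][i]
static inline int symIndex(int i, int j){
      static const int map[3][3] = { {0,1,2}, {1,3,4}, {2,4,5} };
      return map[i][j];
}

//usage: F.s[symIndex(i,j)] in place of F.M[i][j]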

One way to code this in CUDA C/C++ is to create a few simple structures which match our needs. For example:

typedef struct{
      float v[3];
} vector;                        // 1x3 array

typedef struct{
      float M[3][3];
} Matrix;                        // 3x3 array

typedef struct{
      float T[3][3][3];
} Tensor;                        // 3x3x3 array

typedef struct{
      float H[3][3][3][3];
} Hypercube;                     // 3x3x3x3 array

The general idea is to create NxNxN arrays of these structures, one array for each variable we track, as sketched below.
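
For example, declarations for the five evolution variables might look like the following (the variable names here are illustrative, not taken from any particular code, and N is the number of grid points per dimension):

Matrix *hgt  = (Matrix*)malloc(sizeof(Matrix)*N*N*N);   //conformal metric
Matrix *hAt  = (Matrix*)malloc(sizeof(Matrix)*N*N*N);   //trace-free extrinsic curvature
float  *hphi = (float*)malloc(sizeof(float)*N*N*N);     //conformal variable
float  *hK   = (float*)malloc(sizeof(float)*N*N*N);     //trace of the extrinsic curvature
vector *hGt  = (vector*)malloc(sizeof(vector)*N*N*N);   //moving puncture Gamma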

Next: Implementation on CPU Machine -->

<--Back: Kreiss-Oliger Spatial Filtering
<--Back: Overview

Sunday, December 30, 2018

Implementation on a CPU Example

With respect to arrays in CUDA C/C++, many programmers have found it convenient to use so-called ‘flattened arrays’. They take a little getting used to, but they do make creation and parameter passing to the GPU much simpler.

So, in this example, that is what we will use. First, though, let's look at how one might create a flattened array of a structure on any old CPU in C/C++. Then, we'll jump into how to do this on a GPU in CUDA C/C++.

The basic idea of a flattened array is to treat any NxN matrix in our equations as if it were an N²-element one-dimensional array. As long as we are consistent in how we do this, it all works out fine. Therefore, instead of creating a 3x3 two-dimensional array, we just create a 9-element one-dimensional array.

Note that we want to create arrays of structures. For our purposes, we'll be creating arrays of structures that will hold the values of our variables for every grid point in our 3-D grid. We might want to store, for example, the shift variable, β, at each point in a 100x100x100 grid. Therefore, we would create a 1,000,000 element one-dimensional array with each element being a vector (the vector structure we created.)

All well and good. We just have to figure out how to map the 3D grid to the 1D flattened array. Let's look at how that is done on a CPU machine, using C.

So, for example, a loop that scrolls through the elements of an NxNxN array, B, might be coded like the following without the use of flattened arrays:

for(int = i;i<N;i++){
              for(int = j;j<N;j++){
for(int = k;k<N;k++){
                                           B[i][j][k] = something;
}
}
}

Now, using flattened arrays, which is to say all of our arrays are one-dimensional, there is just a slight modification we need: a way to compute and track the index (called Indx below). An example of such a loop written in standard C (not using the GPU functions) is shown below:

for(int = i;i<N;i++){
              for(int = j;j<N;j++){
for(int = k;k<N;k++){
                 Indx = N*N*i + N*j + k;
                                           B[Indx] = something;
      }
}
}
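
Going the other way is sometimes handy as well. Given only Indx, the original (i, j, k) triple can be recovered with integer division and modulo (a small sketch, using the same N*N*i + N*j + k convention as above):

int i = Indx/(N*N);          //slowest-varying index
int j = (Indx/N)%N;
int k = Indx%N;              //fastest-varying index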

Next: Implementation on GPU Example -->
<--Back: Structures Used to Implement on GPU
<--Back: Overview

Saturday, December 29, 2018

Implementation on a GPU Example


When using the GPU, an additional level of trickiness comes into play with arrays of our structures (e.g., vector.v[i], Matrix.M[i][j], Tensor.T[i][j][k], and Hypercube.H[i][j][k][l]).

This is because there are really two sets of indices we need to track: one for the spatial location (i.e., which grid point) and one for the values of each vector, Matrix, Tensor, or Hypercube at that point.

The solution we use in this example is, again, to create flattened arrays of these structures. The trick we use goes right to the heart of GPU programming.

We’ll start with a CPU example first as a baseline, and then see how we modify this for CUDA C/C++ programming to get significant speed up.

First, as an example, let's take a look at how we might scroll through all the values of the conformal metric, which I'm going to call 'hg'. We'll use x, y, and z to scroll through all the spatial points in our NxNxN grid, and i and j to scroll through the element values of hg at each point. Note that we first need to allocate memory for the array of hg, which is a Matrix in this example:

Matrix *hg = NULL;               //conformal metric, 3x3 Matrix
hg = (Matrix*)malloc(sizeof(Matrix)*N*N*N);

for(int = x;x<N;x++){
              for(int = y;y<N;y++){
           for(int = z;z<N;z++){
                 Indx = N*N*x + N*y + z;
for(int = i;i<N;i++){
                                                          for(int = j;j<N;j++){
                                                          hg[Indx].M[i][j] = something;
}
}
}
}
}

One can see from the example that the flattening applies to the spatial indices x, y, and z, while the components at each point, because we are using the structures we have created, look a whole lot like they would as plain-old-vanilla 2-dimensional arrays at each point.

To exploit the speed-up of a GPU, we're going to replace those spatial-index for-loops with individual blocks and threads, turning the above into the following steps. (A fuller explanation of CUDA's blocks and threads structure on a GPU can be found in references [2] and [3].)

First, we need to allocate memory for the conformal metric not just in CPU memory, but in GPU memory as well (they are distinct, and forgetting that detail can lead to completely confusing bugs).

For clarity, I’ve adopted the convention of prefixing all host (i.e., CPU) variables with an ‘h’ and then use the same name (without the ‘h’) for the GPU variable.

Matrix *hg = NULL;               //conformal metric, CPU variable
Matrix *g  = NULL;               //conformal metric, GPU variable

hg = (Matrix*)malloc(sizeof(Matrix)*N*N*N); //allocate on CPU
cudaMalloc(&g, N*N*N*sizeof(Matrix));       //allocate on GPU
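
As an aside, cudaMalloc (like most CUDA runtime calls) returns an error code, and it costs little to check it. A minimal sketch (assuming <stdio.h> and <stdlib.h> are included):

cudaError_t err = cudaMalloc(&g, N*N*N*sizeof(Matrix));
if(err != cudaSuccess){
      fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
      exit(1);
}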

Now, suppose we've used the CPU to read in the initial values of the conformal metric from some file. That means the elements of the NxNxNx3x3 array hg.M[i][j] are in CPU memory, but not, at this point, in GPU memory.

The second step is to copy those values from CPU memory into GPU memory. One way to do that in CUDA C/C++ is with the built-in CUDA function cudaMemcpy, as in the example below (note the direction flag: we are copying from the host to the device):

cudaMemcpy(g, hg, N*N*N*sizeof(Matrix), cudaMemcpyHostToDevice);
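
Later, when the GPU results are ready, the same function copies values back the other way; just swap the pointer order and the direction flag:

cudaMemcpy(hg, g, N*N*N*sizeof(Matrix), cudaMemcpyDeviceToHost);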

Let's talk about blocks and threads on the GPU for a moment. Not in great depth, mind you; for a more complete discussion, one would do well to check the articles, books, and websites in the references. On the 1080-Ti, each block can contain a maximum of 1,024 threads. The practical limit on the number of blocks is the memory of the GPU card itself, which for the 1080-Ti is 11 GB. The number of threads and blocks is a subtle thing, as far as peak performance goes. However, for this example, we're going to use the following:

#define TX 8
#define TY 8
#define TZ 8
const dim3 blockSize(TX, TY, TZ);     //8x8x8 = 512 threads per block
const int bx = (N + TX - 1)/TX;       //ceiling division, so the blocks
const int by = (N + TY - 1)/TY;       //cover all N points in each
const int bz = (N + TZ - 1)/TZ;       //dimension, even if N is not a
const dim3 gridSize = dim3(bx,by,bz); //multiple of 8
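
To make that concrete: with N = 100 and TX = TY = TZ = 8, we get bx = by = bz = (100 + 8 - 1)/8 = 13 under integer division, so the launch creates 13x13x13 blocks of 512 threads each, or 104x104x104 threads in all. That is slightly more threads than grid points, which is why the kernel below checks its indices against N before touching memory.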

Calling a kernel on the GPU, for example one to evolve the values of the conformal metric, might be done using the following in the CPU code:

evolveMetric<<<gridSize,blockSize>>>(gnext,g,delt,N);

The GPU kernel which implements the evolution of the conformal spatial metric might look like the following:

__global__ void evolveMetric(Matrix *gnext, Matrix *g, float delt,
                             int n)
{
      const int x = blockIdx.x*blockDim.x + threadIdx.x;
      const int y = blockIdx.y*blockDim.y + threadIdx.y;
      const int z = blockIdx.z*blockDim.z + threadIdx.z;

      //the grid of threads can overhang the NxNxN grid of points,
      //so threads that land outside the grid must do nothing
      if(x >= n || y >= n || z >= n) return;

      int Indx = x*n*n + y*n + z;
      float term1, term2;

      for(int i = 0; i < 3; i++){
        for(int j = 0; j < 3; j++){
          term1 = something;          //placeholders for the right-hand-side
          term2 = something_else;     //terms of the evolution equations
          gnext[Indx].M[i][j] = g[Indx].M[i][j] + delt*(term1 + term2);
        }
      }
}

The above GPU code snippet makes use of several built-in CUDA C/C++ variables:

·       blockIdx – gives the index of the block within the grid
·       blockDim – gives the dimensions (number of threads) of the block
·       threadIdx – gives the index of the thread within the block

An important point in the above snippet is that in CUDA we've replaced the for-loops over x, y, and z that were used in the CPU version with block and thread indices. Remember, we have up to 1,024 threads per block we can use, each thread running its own copy of the kernel. You might say that each thread 'knows' its own thread and block index. Therefore, a statement like the one used to define Indx in the code snippet above evaluates to a different number depending on which thread and block is executing it.
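
For instance, with blockDim.x = 8, the thread with threadIdx.x = 3 in the block with blockIdx.x = 2 computes x = 2*8 + 3 = 19, while its neighbor with threadIdx.x = 4 computes x = 20. In this way, every (x, y, z) grid point is handled by exactly one thread, with no for-loop in sight.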

Tricky, right?

Next: Finite Differences on a GPU Machine -->
<--Back: Implementation on a CPU example
<--Back: Overview
