When using the GPU, an additional level of trickiness comes into play with arrays of our structures (e.g., Vector.v[i], Matrix.M[i][j], Tensor.T[i][j][k] and Hypercube.H[i][j][k][l]). This is because there are really two sets of indices we need to track: one for the spatial location (i.e., which grid point) and one for the components of each Vector, Matrix, Tensor or Hypercube at that point.
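For reference, the structures used below might be defined along the lines of the following sketch. The field names match the usage above, but the exact sizes and the use of float are assumptions here, not the definitive definitions from the CPU article:

typedef struct { float v[3]; } Vector;                 // Vector.v[i]
typedef struct { float M[3][3]; } Matrix;              // Matrix.M[i][j]
typedef struct { float T[3][3][3]; } Tensor;           // Tensor.T[i][j][k]
typedef struct { float H[3][3][3][3]; } Hypercube;     // Hypercube.H[i][j][k][l]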
The solution we use in this example is, again, to create flattened arrays of these structures. The trick we use to handle the two sets of indices goes right to the heart of GPU programming. We'll start with a CPU example as a baseline, and then see how to modify it for CUDA C/C++ to get a significant speed-up.
First, as an example, let's take a look at how we might scroll through all the values of the conformal metric, which I'm going to call 'hg'. We'll use x, y, and z to scroll through all the spatial points in our NxNxN grid, and i and j to scroll through the element values at each point. Note we first need to allocate memory for the array hg, which is an array of Matrix in this example:
Matrix *hg = NULL;                              // conformal metric, a 3x3 Matrix at each grid point
hg = (Matrix*)malloc(sizeof(Matrix)*N*N*N);     // one Matrix per point of the NxNxN grid

for(int x = 0; x < N; x++){
  for(int y = 0; y < N; y++){
    for(int z = 0; z < N; z++){
      int Indx = N*N*x + N*y + z;               // flattened spatial index
      for(int i = 0; i < 3; i++){
        for(int j = 0; j < 3; j++){
          hg[Indx].M[i][j] = something;         // fill in the (i,j) component at this point
        }
      }
    }
  }
}
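Since the same flattened index shows up again on the GPU side, it can help to wrap it in a small helper. The function name here is purely illustrative, not something from the original code:

// Hypothetical helper: map a spatial point (x,y,z) on an NxNxN grid to its flattened index.
// The __host__ __device__ qualifiers let it be called from both CPU and GPU code under nvcc.
__host__ __device__ static inline int flatIndex(int x, int y, int z, int n)
{
    return n*n*x + n*y + z;
}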
One can see from the example that the flattening applies only to the spatial indices x, y, and z. The components at each point, because we are using the structures we have created, look a whole lot like they would if they were plain-old-vanilla 2-dimensional arrays at each point.
To exploit the speed-up of a GPU, we're going to convert those spatial-index for-loops into individual blocks and threads, so that the code above turns into the following steps. (A fuller explanation of CUDA's block and thread structure on a GPU can be found in references [2] and [3].)
First, we need to allocate memory not just in CPU memory, but in GPU memory as well (they are distinct, and forgetting that detail can lead to completely confusing bugs). For clarity, I've adopted the convention of prefixing all host (i.e., CPU) variables with an 'h' and using the same name (without the 'h') for the GPU variable.
Matrix *hg = NULL;                               // conformal metric, CPU (host) variable
Matrix *g = 0;                                   // conformal metric, GPU (device) variable

hg = (Matrix*)malloc(sizeof(Matrix)*N*N*N);      // allocate on CPU
cudaMalloc(&g, N*N*N*sizeof(Matrix));            // allocate on GPU
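The corresponding cleanup, once both copies are no longer needed, is worth keeping in mind; a minimal sketch (where it goes in the full program is up to you):

free(hg);        // release the host copy
cudaFree(g);     // release the device copy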
Now, suppose we’ve used the CPU to read in from some file the initial values of the conformal metric we want to use. That means the NxNxNx3x3 set of elements hg[Indx].M[i][j] is in CPU memory, but not, at this point, in GPU memory. The second step is to copy those values from CPU memory into GPU memory. One way to do that in CUDA C/C++ is with the built-in CUDA function cudaMemcpy, as in the example below. Note the direction flag is cudaMemcpyHostToDevice, since we are copying from the host array hg to the device array g:

cudaMemcpy(g, hg, N*N*N*sizeof(Matrix), cudaMemcpyHostToDevice);
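When the GPU has finished its work and we want the results back on the CPU, the same call runs in the other direction, and it's worth checking the return code. A minimal sketch, assuming the values we want have ended up back in g:

cudaError_t err = cudaMemcpy(hg, g, N*N*N*sizeof(Matrix), cudaMemcpyDeviceToHost);
if(err != cudaSuccess){
    printf("cudaMemcpy failed: %s\n", cudaGetErrorString(err));
}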
Let’s talk about blocks and threads on the GPU for a moment. Not in great depth, mind you. For a more complete discussion, one would do well to check the articles, books, and websites in the references. The GPU, and specifically the 1080 Ti, allows a maximum of 1,024 threads per block. The practical limit on the number of blocks is the memory of the GPU card itself, which for the 1080 Ti is 11 GB. The choice of threads and blocks is a subtle thing, as far as peak performance goes. However, for this example, we’re going to use the following:
#define TX 8
#define TY 8
#define TZ 8
const dim3 blockSize(TX, TY, TZ);                // 8x8x8 = 512 threads per block
const int bx = (N + TX - 1)/TX;                  // number of blocks in x, rounded up
const int by = (N + TY - 1)/TY;                  // number of blocks in y, rounded up
const int bz = (N + TZ - 1)/TZ;                  // number of blocks in z, rounded up
const dim3 gridSize = dim3(bx, by, bz);
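For example, with N = 128 (an illustrative value, not one fixed by the original code), each block holds TX*TY*TZ = 512 threads and bx = by = bz = (128 + 7)/8 = 16, so the grid is 16x16x16 = 4,096 blocks. Together they cover all 128^3 grid points, and the rounding-up in the bx, by, bz formulas guarantees coverage even when N is not a multiple of 8.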
Calling a kernel on the
GPU, for example one to evolve the values of the conformal metric, might be
done using the following in the CPU code:
evolveMetric<<<gridSize,blockSize>>>(gnext,g,delt,N);
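The kernel writes the updated metric into gnext while reading from g, so a full evolution needs a second device array and a pointer swap between time steps. The following is a minimal sketch of that host-side loop, assuming gnext is allocated with cudaMalloc just like g and that nSteps is however many time steps you want; neither detail comes from the original code:

Matrix *gnext = 0;
cudaMalloc(&gnext, N*N*N*sizeof(Matrix));              // second device buffer for the updated metric

for(int step = 0; step < nSteps; step++){
    evolveMetric<<<gridSize, blockSize>>>(gnext, g, delt, N);
    cudaDeviceSynchronize();                           // wait for the kernel to finish before reusing the buffers
    Matrix *tmp = g; g = gnext; gnext = tmp;           // swap so the new values feed the next step
}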
The GPU kernel which
implements the evolution of the conformal spatial metric might look like the
following:
__global__ void evolveMetric(Matrix *gnext, Matrix *g, float delt, int n)
{
    const int x = blockIdx.x*blockDim.x + threadIdx.x;
    const int y = blockIdx.y*blockDim.y + threadIdx.y;
    const int z = blockIdx.z*blockDim.z + threadIdx.z;
    if(x >= n || y >= n || z >= n) return;       // the rounded-up grid can overshoot n; skip those threads

    int Indx = x*n*n + y*n + z;                  // flattened spatial index for this thread's grid point
    float term1, term2;
    for(int i = 0; i < 3; i++){
        for(int j = 0; j < 3; j++){
            term1 = something;                   // placeholder: first piece of the right-hand side
            term2 = something_else;              // placeholder: second piece of the right-hand side
            gnext[Indx].M[i][j] = g[Indx].M[i][j] + delt*(term1 + term2);
        }
    }
}
The above GPU code snippet makes use of several built-in CUDA C/C++ variables:
·       blockIdx – gives the index of the block within the grid
·       blockDim – gives the dimensions of the block (the number of threads in each direction)
·       threadIdx – gives the index of the thread within its block
An important point in the above snippet is that in CUDA we’ve replaced the for-loops over x, y, and z that were used in the CPU version with block and thread indices. Remember, we have up to 1,024 threads per block we can use, each thread running its own copy of the kernel. You might say that each thread ‘knows’ its own thread and block index. Therefore, a statement like the one used to define Indx in the code snippet above evaluates to a different number depending on which thread, in which block, is executing it.
Tricky, right?