Friday, January 30, 2009

Tesla and CUDA

The new NVIDIA Tesla GPU computing solution, aimed at businesses that do CAD/CAM/CAE, imaging, and GIS work for their clients, provides personal supercomputing power that the company states is 250 times faster than standard PCs. Tesla uses NVIDIA's CUDA parallel computing architecture and costs less than $10,000. A YouTube video/commercial can be viewed at http://www.youtube.com/nvidiatesla . The "Developing with CUDA" page at
http://www.nvidia.com/object/cuda_learn.html provides the toolkit and SDK needed to program Tesla in C. There is also a good article, "Parallel Processing with CUDA," at
http://www.nvidia.com/docs/IO/55972/220401_Reprint.pdf .

For an example of both CPU and CUDA code, consider the matrix addition example by Seland at
http://heim.ifi.uio.no/~knutm/geilo2008/seland.pdf .

A: CPU Code

//Add two N-by-N matrices element-wise: c = a + b
void add_matrix( float* a, float* b, float* c, int N ) {
    for ( int i = 0; i < N; ++i )
        for ( int j = 0; j < N; ++j ) {
            int index = i + j*N;
            c[index] = a[index] + b[index];
        }
}

int main() {
    const int N = 1024;
    float *a = new float[N*N];
    float *b = new float[N*N];
    float *c = new float[N*N];
    for ( int i = 0; i < N*N; ++i ) {
        a[i] = 1.0f; b[i] = 3.5f;
    }
    add_matrix( a, b, c, N );
    delete[] a; delete[] b; delete[] c;
}

B: CUDA Code

//Compute kernel: each thread computes one element of c
__global__
void add_matrix( float* a, float* b, float* c, int N ) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j*N;
    if ( i < N && j < N )
        c[index] = a[index] + b[index];
}

int main() {
    //N, blocksize, and the device pointers are defined in the complete listing below
    dim3 dimBlock( blocksize, blocksize );
    dim3 dimGrid( N/dimBlock.x, N/dimBlock.y );
    add_matrix<<<dimGrid, dimBlock>>>( a, b, c, N );
}

Notice that the CPU code's double for loop has been replaced by a grid of threads: each thread in each block of the grid computes a single matrix element. The presentation examines the use of threads within thread blocks, which are in turn contained in a grid of thread blocks.
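To see how the kernel's built-in variables map a thread onto a matrix element, here is a small sketch of my own (not from Seland's slides; the block and thread numbers are hypothetical):

//How one thread locates its element, with blocksize = 16:
//the thread at threadIdx = (5, 7) inside block blockIdx = (2, 3) gets
//  i = blockIdx.x*blockDim.x + threadIdx.x = 2*16 + 5 = 37
//  j = blockIdx.y*blockDim.y + threadIdx.y = 3*16 + 7 = 55
//and so works on element index = 37 + 55*N.
//If N were not an exact multiple of blocksize, the grid would be
//rounded up, and the kernel's if ( i < N && j < N ) guard would
//skip the leftover threads:
dim3 dimBlock( blocksize, blocksize );
dim3 dimGrid( (N + blocksize - 1) / blocksize,
              (N + blocksize - 1) / blocksize );

Here N = 1024 divides evenly by 16, so the simpler N/dimBlock.x grid works fine. With that mapping in mind, we can extend the main function: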

#include <cstdlib>   //for EXIT_SUCCESS; nvcc pulls in the CUDA runtime

//Define grid size
const int N = 1024;
const int blocksize = 16;

int main() {
    //CPU memory allocation
    float *a = new float[N*N];
    float *b = new float[N*N];
    float *c = new float[N*N];
    for ( int i = 0; i < N*N; ++i ) {
        a[i] = 1.0f; b[i] = 3.5f;
    }

    //GPU memory allocation
    float *ad, *bd, *cd;
    const int size = N*N*sizeof(float);
    cudaMalloc( (void**)&ad, size );
    cudaMalloc( (void**)&bd, size );
    cudaMalloc( (void**)&cd, size );

    //Copy input data to the GPU
    cudaMemcpy( ad, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( bd, b, size, cudaMemcpyHostToDevice );

    //Execute the kernel
    dim3 dimBlock( blocksize, blocksize );
    dim3 dimGrid( N/dimBlock.x, N/dimBlock.y );
    add_matrix<<<dimGrid, dimBlock>>>( ad, bd, cd, N );

    //Copy the result back to the CPU
    cudaMemcpy( c, cd, size, cudaMemcpyDeviceToHost );

    //Clean up and return
    cudaFree( ad ); cudaFree( bd ); cudaFree( cd );
    delete[] a; delete[] b; delete[] c;
    return EXIT_SUCCESS;
}
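For brevity the listing skips error handling. In real code it is worth checking that the kernel actually launched and ran; a minimal sketch of my own (not from Seland's slides), which could replace the kernel launch above, is:

//Launch, then wait and check for errors (CUDA runtime API;
//cudaThreadSynchronize was the current call as of CUDA 2.x)
add_matrix<<<dimGrid, dimBlock>>>( ad, bd, cd, N );
cudaThreadSynchronize();               //wait for the kernel to finish
cudaError_t err = cudaGetLastError();  //pick up any launch/run error
if ( err != cudaSuccess )
    printf( "CUDA error: %s\n", cudaGetErrorString( err ) );

(The printf call also needs #include <cstdio> at the top of the file.)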

Following these basics and using the examples in the CUDA SDK, you can easily be running parallel programs for your GIS and imaging applications on your very own personal supercomputer.
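To try the example, save everything as a single .cu file and build it with the nvcc compiler from the CUDA toolkit, e.g. nvcc add_matrix.cu -o add_matrix (the file name here is just a placeholder).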
