http://www.nvidia.com/object/cuda_learn.html provides the necessary toolkit and SDK to program in C for Tesla. There is a good article, "Parallel Processing with CUDA," at
http://www.nvidia.com/docs/IO/55972/220401_Reprint.pdf .
For an example of both CPU and CUDA code, consider the matrix addition example by Seland at
http://heim.ifi.uio.no/~knutm/geilo2008/seland.pdf .
A: CPU Code
void add_matrix( float* a, float* b, float* c, int N ) {
    int index;
    for ( int i = 0; i < N; ++i )
        for ( int j = 0; j < N; ++j ) {
            index = i + j*N;
            c[index] = a[index] + b[index];
        }
}

int main() {
    add_matrix( a, b, c, N );
}
B: CUDA Code
// Compute kernel: one thread per matrix element
__global__
void add_matrix( float* a, float* b, float* c, int N ) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j*N;
    if ( i < N && j < N )
        c[index] = a[index] + b[index];
}

int main() {
    dim3 dimBlock( blocksize, blocksize );
    dim3 dimGrid( N/dimBlock.x, N/dimBlock.y );
    add_matrix<<<dimGrid, dimBlock>>>( a, b, c, N );
}
Notice that the double for loop on the CPU is replaced with a grid of thread blocks: each thread computes one (i, j) element. This presentation examines the use of threads in thread blocks contained in a grid of thread blocks. We can extend the main function:
//Define grid size
const int N = 1024;
const int blocksize = 16;

int main() {
    //CPU memory allocation
    float *a = new float[N*N];
    float *b = new float[N*N];
    float *c = new float[N*N];
    for ( int i = 0; i < N*N; ++i ) {
        a[i] = 1.0f; b[i] = 3.5f;
    }

    //GPU memory allocation
    float *ad, *bd, *cd;
    const int size = N*N*sizeof(float);
    cudaMalloc( (void**)&ad, size );
    cudaMalloc( (void**)&bd, size );
    cudaMalloc( (void**)&cd, size );

    //Copy data to GPU
    cudaMemcpy( ad, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( bd, b, size, cudaMemcpyHostToDevice );

    //Execute kernel
    dim3 dimBlock( blocksize, blocksize );
    dim3 dimGrid( N/dimBlock.x, N/dimBlock.y );
    add_matrix<<<dimGrid, dimBlock>>>( ad, bd, cd, N );

    //Copy result back to CPU
    cudaMemcpy( c, cd, size, cudaMemcpyDeviceToHost );

    //Clean up and return
    cudaFree( ad ); cudaFree( bd ); cudaFree( cd );
    delete[] a; delete[] b; delete[] c;
    return EXIT_SUCCESS;
}
Following these basics and using the examples in the CUDA SDK, one can easily run parallel programs for GIS and imaging applications on one's very own personal supercomputer.