December 2010
My first project as a GRA under Rich Vuduc involved accelerating 3D Fast Fourier Transforms (3D FFT) with GPUs.
The project was basically porting the open-source P3DFFT code (written in Fortran) to run on GPU clusters (instead of CPU clusters) using CUFFT.
Update: 04/16/2011 –
This project has morphed into an SC11 paper – a draft submitted on 04/15/2011 can be found here [Prospects for scalable 3D FFTs on heterogeneous exascale systems] – in which we describe DiGPUFFT (our P3DFFT+CUDA implementation), a performance-scaling model for the FFT algorithm, and projections of FFT performance on exascale supercomputing systems.
[Special thanks to Rich, Casey, Kent.]
Update: 05/02/2011 –
DiGPUFFT on Google Code!
May I ask how the GFLOPS values in Fig. 3 of the SC11 paper were measured or evaluated? Thanks
Hi,
The sample test program “driver_rand.F90” included in P3DFFT was run and timed.
This sample program, when run once, computes a forward 3D FFT of random data, then computes the reverse transform.
The time for one forward + reverse transform was averaged over 10 runs, then divided by 2, to get an average time for a transform in one direction.
That time is then used in the formula: GFLOPS = 5*(N^3)*log2(N^3) / (1E9 * seconds).
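As a quick sanity check, the formula above can be written as a small helper. This is just an illustrative Python sketch (the function name is hypothetical, not part of P3DFFT or DiGPUFFT):

```python
import math

def fft_gflops(n, seconds):
    """Estimated GFLOP/s for one N^3 complex 3D FFT.

    Uses the standard 5*N^3*log2(N^3) flop-count model for a complex
    transform, divided by the measured wall-clock time in seconds.
    """
    flops = 5.0 * n**3 * math.log2(n**3)
    return flops / (1e9 * seconds)
```

Plugging in the measured per-transform time for a given N reproduces the values reported in the formula above.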
Thanks for the interest in the project.
~Chris
I can understand that the flop count of a 1D FFT is approximately 5N*log2(N), where N is the sample size and the log is base 2. And we have N^2 1D FFTs for each dimension. That gives a total of 5N^3*log2(N).
I could not understand where the log2(N^3) part came from.
Ah yes, there are three pencil stages of 1D FFTs in the 3D FFT, one per dimension, and each stage costs 5*(N^3)*log2(N) flops;
so I guess a more intuitive way of writing the formula would be:
GFLOPS = [ 3 * ( 5*(N^3) * log2(N) ) ] / (1E9 * seconds)
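The two ways of writing the count are identical, since log2(N^3) = 3*log2(N). A quick numerical check (illustrative Python; the function names are mine, not from the paper):

```python
import math

def count_log_cubed(n):
    # 5*N^3*log2(N^3): one formula for the whole 3D transform
    return 5.0 * n**3 * math.log2(n**3)

def count_three_pencils(n):
    # 3 * (5*N^3*log2(N)): three pencil stages, one per dimension
    return 3.0 * (5.0 * n**3 * math.log2(n))

# Because log2(N^3) = 3*log2(N), the two counts coincide for any N.
```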
My understanding is, if there are M 1D FFTs of size N, then the GFLOP count of a complex FFT is 1E-9 * M * 5 * N * log2(N). Are you taking the creation of the FFT plan and/or the data transfers into account in your timing as well? In my tests, the kernel execution time alone (only CUFFT_Z2Z) is too small to produce reasonable results. On the other hand, if I include data transfer times, then the GFLOPS is quite low.
I have used CUDA events as well as gettimeofday(…). Any comments on that? This is a very nice paper, I have cited it in my course work.
Sayan:
Thanks for your interest. The timings for DiGPUFFT were wall-clock times for each FFT call, so yes, GPU memory allocation, GPU memory transfers, and GPU FFT plan creation were all reflected in the timings. Note that the code is written such that memory allocation and CUFFT plan creation are done “lazily”: they occur only when the sizes change, and are otherwise reused. Also, only single-precision FFTs (CUFFT_C2C) were timed; double-precision ones (CUFFT_Z2Z) are inherently slower. Please take a look at the Google Code site for DiGPUFFT for more info.
Thanks and good luck,
~Chris