P3DFFT + CUDA (GPU 3D FFT)

December 2010

My first project as a GRA under Rich Vuduc involved accelerating 3D Fast Fourier Transforms (3D FFT) with GPUs.

The project was basically porting the open-source P3DFFT code (written in Fortran) to run on GPU clusters (instead of CPU clusters) using CUFFT.
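
At a high level, the port replaces the batches of 1D “pencil” FFTs that P3DFFT hands to a CPU FFT library with batched CUFFT calls on the GPU. The sketch below shows that idea for a single forward pencil stage in plain CUDA C; the function name, the in-place layout, and the lack of error checking are mine for illustration and are not the actual DiGPUFFT code.

#include <cuda_runtime.h>
#include <cufft.h>

/* One forward "pencil" stage: a batch of nx-point 1D complex FFTs on the GPU. */
void fft_pencils_gpu(cufftComplex *host_data, int nx, int batch)
{
    cufftComplex *dev_data;
    size_t bytes = sizeof(cufftComplex) * (size_t)nx * (size_t)batch;

    cudaMalloc((void **)&dev_data, bytes);
    cudaMemcpy(dev_data, host_data, bytes, cudaMemcpyHostToDevice);

    cufftHandle plan;
    cufftPlan1d(&plan, nx, CUFFT_C2C, batch);               /* one plan covering the whole batch */
    cufftExecC2C(plan, dev_data, dev_data, CUFFT_FORWARD);  /* in-place forward transform */

    cudaMemcpy(host_data, dev_data, bytes, cudaMemcpyDeviceToHost);

    cufftDestroy(plan);
    cudaFree(dev_data);
}

In a distributed 3D FFT like P3DFFT, the transposes between pencil stages happen on the host/MPI side, so host-device copies like the ones above bracket each GPU stage; their cost shows up in the wall-clock timings discussed in the comments below.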

 

Update: 04/16/2011 –

This project has morphed into an SC11 paper – a draft submitted on 04/15/2011 can be found here [Prospects for scalable 3D FFTs on heterogeneous exascale systems] – in which we describe DiGPUFFT (the P3DFFT+CUDA implementation), a performance-scaling model for the FFT algorithm, and projections of FFT performance on future exascale supercomputing systems.
[Special thanks to Rich, Casey, Kent.]

 

Update: 05/02/2011 –

 

DiGPUFFT on Google Code!

 

6 thoughts on “P3DFFT + CUDA (GPU 3D FFT)”

  1. Hi,
    The sample test program “driver_rand.F90” included in P3DFFT was run and timed.
    This sample program, when run once, computes a forward 3D FFT of random data, then computes the reverse transform.
    The time for one forward + reverse transform was averaged over 10 runs, then divided by 2, to get an average time for a transform in one direction.
    This time is then used in the formula: GFLOPS = 5*(N^3)*log2(N^3) / (1E9*seconds).
    Thanks for the interest in the project.
    ~Chris

  2. I can understand that the flop count of a 1D FFT is approximately 5N*log(N), where N is the sample size and the log is base-2. And we have N^2 1D FFTs for each dimension. That gives a total of 5*(N^3)*log(N).

    I could not understand where the log2(N^3) part came from.

  3. Ah yes, there are three sets of 1D pencil FFTs in the full 3D FFT (one per dimension), each costing 5*(N^3)*log2(N) flops;
    so I guess a more intuitive way of writing the formula would be:
    GFLOPS = [ 3 * ( 5*(N^3) * log2(N) ) ] / (1E9*seconds)
    Since 3*log2(N) = log2(N^3), this is the same as the formula above.

  4. My understanding is that if there are M 1D FFTs of size N, then the GFLOPS of a complex FFT is 1E-09 * M * 5 * N * log2(N) / seconds. Are you taking the creation of the FFT plan and/or the data transfers into consideration in your timing as well? In my tests, the kernel execution time (only CUFFT_Z2Z) is too small to produce reasonable results. On the other hand, if I include the data transfer times, then the GFLOPS is quite low.
    I have used CUDA events as well as gettimeofday(…). Any comments on that? This is a very nice paper; I have cited it in my coursework.

  5. Sayan:
    Thanks for your interest. The timings for DiGPUFFT were wall-clock times for each FFT call, so yes, GPU memory allocation, GPU memory transfers, and GPU FFT plan creation were all reflected in the timings. Note that the code is written so that memory allocation and CUFFT plan creation are done “lazily”: they occur only when the sizes change, and are otherwise reused. Also, only single-precision FFTs (CUFFT_C2C) were timed; double-precision (CUFFT_Z2Z) ones are inherently slower. Please take a look at the Google Code site for DiGPUFFT for more info. (A small sketch of this timing-and-GFLOPS bookkeeping appears after these comments.)
    Thanks and good luck,
    ~Chris
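
To make the formula and timing discussion in these comments concrete, here is a small sketch (plain C, not taken from the DiGPUFFT source) of the bookkeeping described above: time a forward + reverse 3D FFT over several runs, halve the average to get a one-direction time, and plug it into GFLOPS = 5*(N^3)*log2(N^3) / (1E9*seconds). The run_fft_pair() stub and the size N = 256 are placeholders for an actual transform and problem size.

#include <math.h>
#include <stdio.h>
#include <sys/time.h>

/* Placeholder for one forward + one reverse 3D FFT of an N x N x N array
   (this is what driver_rand.F90 does on each run). */
static void run_fft_pair(int n) { (void)n; }

/* Wall-clock time in seconds, using gettimeofday as mentioned above. */
static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

int main(void)
{
    const int n = 256;      /* cube edge length N (illustrative) */
    const int runs = 10;

    double t0 = wall_seconds();
    for (int i = 0; i < runs; ++i)
        run_fft_pair(n);
    /* Average over the runs, then divide by 2 for one direction. */
    double seconds = (wall_seconds() - t0) / runs / 2.0;

    /* 5 * N^3 * log2(N^3) = 3 * 5 * N^3 * log2(N) flops per 3D transform. */
    double n3 = (double)n * n * n;
    double gflops = 5.0 * n3 * 3.0 * log2((double)n) / (1e9 * seconds);

    printf("avg one-way time: %.6f s, approx %.2f GFLOP/s\n", seconds, gflops);
    return 0;
}

Because the timer wraps the whole call, a real FFT dropped into run_fft_pair() is charged for memory allocation, transfers, and plan creation as well, which is the wall-clock convention described in the reply above.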
