{"id":929,"date":"2011-01-01T11:17:16","date_gmt":"2011-01-01T18:17:16","guid":{"rendered":"http:\/\/mcclanahoochie.com\/blog\/?post_type=portfolio&#038;p=929"},"modified":"2023-06-10T10:32:21","modified_gmt":"2023-06-10T17:32:21","slug":"p3dfft-cuda-gpu-3d-fft","status":"publish","type":"post","link":"https:\/\/mcclanahoochie.com\/blog\/2011\/01\/p3dfft-cuda-gpu-3d-fft\/","title":{"rendered":"P3DFFT + CUDA (GPU 3D FFT)"},"content":{"rendered":"<h3>December 2010<\/h3>\n<p><a href=\"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/01\/pencil-decomp.png\"><img data-recalc-dims=\"1\" decoding=\"async\" data-attachment-id=\"1258\" data-permalink=\"https:\/\/mcclanahoochie.com\/blog\/2011\/01\/p3dfft-cuda-gpu-3d-fft\/pencil-decomp\/#main\" data-orig-file=\"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/01\/pencil-decomp.png?fit=200%2C124&amp;ssl=1\" data-orig-size=\"200,124\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}\" data-image-title=\"pencil-decomp\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/01\/pencil-decomp.png?fit=200%2C124&amp;ssl=1\" class=\"alignnone size-full wp-image-1258\" title=\"pencil-decomp\" src=\"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/01\/pencil-decomp.png?resize=200%2C124\" alt=\"\" width=\"200\" height=\"124\" \/><\/a><\/p>\n<p>My first project as a <a href=\"http:\/\/www.hpcgarage.org\/\" target=\"_blank\" rel=\"noopener\">GRA<\/a> under Rich Vuduc involved accelerating <em>3D Fast Fourier Transforms <\/em>(3D FFT) 
with GPUs.<\/p>\n<p>The project ported the open-source <a href=\"http:\/\/code.google.com\/p\/p3dfft\/\" target=\"_blank\" rel=\"noopener\">P3DFFT code<\/a>\u00a0(written in Fortran) to run on GPU (instead of CPU)\u00a0clusters using CUFFT.<\/p>\n<p>&nbsp;<\/p>\n<h4>Update: <em>04\/16\/2011 &#8211;<\/em><\/h4>\n<p>This project has morphed into an <a href=\"http:\/\/sc11.supercomputing.org\/\" target=\"_blank\" rel=\"noopener\">SC11<\/a> paper &#8211; a <em>draft<\/em> that was submitted on 04\/15\/2011 can be found here<em> <a href=\"http:\/\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/01\/fft-sc11.pdf\" target=\"_blank\" rel=\"noopener\">[Prospects for scalable 3D FFTs on heterogeneous exascale systems]<\/a> <\/em>&#8211; where we describe <em>DiGPUFFT<\/em>, the implementation of P3DFFT+CUDA, an FFT algorithm performance scaling model, and future projections about FFT performance on exascale\u00a0supercomputing\u00a0systems. <span style=\"color: #999999;\"><br \/>\n<em>[Special thanks to Rich, Casey, and Kent.]<\/em><\/span><\/p>\n<p>&nbsp;<\/p>\n<h4>Update:\u00a0<em>05\/02\/2011 &#8211;<\/em><\/h4>\n<p>&nbsp;<\/p>\n<p style=\"display: inline !important;\"><a href=\"http:\/\/code.google.com\/p\/digpufft\/\" target=\"_blank\" rel=\"noopener\"><img data-recalc-dims=\"1\" decoding=\"async\" data-attachment-id=\"1200\" data-permalink=\"https:\/\/mcclanahoochie.com\/blog\/2011\/01\/p3dfft-cuda-gpu-3d-fft\/digpufft-s\/#main\" data-orig-file=\"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/01\/digpufft-s.png?fit=200%2C200&amp;ssl=1\" data-orig-size=\"200,200\" data-comments-opened=\"1\" 
data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}\" data-image-title=\"digpufft log\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/01\/digpufft-s.png?fit=200%2C200&amp;ssl=1\" class=\"alignnone size-thumbnail wp-image-1200\" title=\"digpufft log\" src=\"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/01\/digpufft-s-150x150.png?resize=90%2C90\" alt=\"digpufft log\" width=\"90\" height=\"90\" srcset=\"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/01\/digpufft-s.png?resize=150%2C150&amp;ssl=1 150w, https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/01\/digpufft-s.png?w=200&amp;ssl=1 200w\" sizes=\"(max-width: 90px) 100vw, 90px\" \/>DiGPUFFT<\/a> on Google Code!<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>December 2010 My first project as a GRA under Rich Vuduc involved accelerating 3D Fast Fourier Transforms (3D FFT) with GPUs. The project was basically porting the open-source P3DFFT code\u00a0(written in FORTRAN) to run on GPU(instead of CPU)\u00a0clusters using CUFFT. 
&nbsp; Update: 04\/16\/2011 &#8211; This project has morphed into a SC11 paper &#8211; a draft &#8230; <a title=\"P3DFFT + CUDA (GPU 3D FFT)\" class=\"read-more\" href=\"https:\/\/mcclanahoochie.com\/blog\/2011\/01\/p3dfft-cuda-gpu-3d-fft\/\" aria-label=\"Read more about P3DFFT + CUDA (GPU 3D FFT)\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"advanced_seo_description":"","jetpack_seo_html_title":"","jetpack_seo_noindex":false,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[1],"tags":[110,125,84,124,101,29,111],"class_list":["post-929","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-cuda","tag-cufft","tag-fft","tag-p3dfft","tag-programming","tag-projects","tag-supercomputing"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/pZdXI-eZ","jetpack-related-posts":[{"id":886,"url":"https:\/\/mcclanahoochie.com\/blog\/2011\/01\/mtimes-gpu-matrix-multiplication\/","url_meta":{"origin":929,"position":0},"title":"MTIMES &#8211; GPU Matrix Multiplication","author":"mcclanahoochie","date":"January 1, 2011","format":false,"excerpt":"July 2010 OK, it's not really a project, but I did learn a lot about GPU matrix multiplication over the summer, working\u00a0at AccelerEyes. 
I\u00a0re-worked the back-end CUDA code for\u00a0the MTIMES Jacket function. I also modified it to accelerate SUM, MIN, and\u00a0MAX. Checkout my MTIMES wiki page!","rel":"","context":"In \"arrayfire\"","block_context":{"text":"arrayfire","link":"https:\/\/mcclanahoochie.com\/blog\/tag\/arrayfire\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/01\/fermi_gflops_single.jpg?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":611,"url":"https:\/\/mcclanahoochie.com\/blog\/2010\/09\/gtc-2010-trip\/","url_meta":{"origin":929,"position":1},"title":"GTC 2010 Trip","author":"mcclanahoochie","date":"September 26, 2010","format":false,"excerpt":"I just got back from Nvidia's 2010 GPU Technology Conference in San Jose California. I had an amazing trip, and am thankful that I got to go, as it was my first visit to California as well as my first trade show attendance. [Side Note: The afternoon before the conference,\u2026","rel":"","context":"In \"arrayfire\"","block_context":{"text":"arrayfire","link":"https:\/\/mcclanahoochie.com\/blog\/tag\/arrayfire\/"},"img":{"alt_text":"Arriving at the 2010 GPU Tech Conference","src":"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2010\/09\/2010-09-20_09-39-29_568-1024x577.jpg?resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2010\/09\/2010-09-20_09-39-29_568-1024x577.jpg?resize=350%2C200 1x, https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2010\/09\/2010-09-20_09-39-29_568-1024x577.jpg?resize=525%2C300 1.5x"},"classes":[]},{"id":1663,"url":"https:\/\/mcclanahoochie.com\/blog\/2011\/08\/cuda-connected-component-labeling\/","url_meta":{"origin":929,"position":2},"title":"GPU Connected Component Labeling","author":"mcclanahoochie","date":"August 6, 2011","format":false,"excerpt":"Connected Component Labeling (CCL): \"is used in computer vision to detect connected 
regions in binary digital images\", and sometimes referred to as blob coloring. Motivation: To keep AccelerEyes'\u00a0ever expanding GPU library growing, over a few weeks of this summer\u00a0I took on the project of writing a CUDA version of connected\u2026","rel":"","context":"In \"arrayfire\"","block_context":{"text":"arrayfire","link":"https:\/\/mcclanahoochie.com\/blog\/tag\/arrayfire\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/08\/coins-bwlabel-300x122.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":2514,"url":"https:\/\/mcclanahoochie.com\/blog\/2012\/12\/music-visualization-with-an-arduino\/","url_meta":{"origin":929,"position":3},"title":"Music Visualization with an Arduino","author":"mcclanahoochie","date":"December 13, 2012","format":false,"excerpt":"Audio Frequency Spectrum Analyzer &\u00a0Spectrogram As a followup to a previous post on Music Visualization with Processing (and a good excuse to play with my Arduino), I decided to convert my Processing music visualizer into hardware. The project is not finished yet, but I wanted to post a quick update\u2026","rel":"","context":"In \"arduino\"","block_context":{"text":"arduino","link":"https:\/\/mcclanahoochie.com\/blog\/tag\/arduino\/"},"img":{"alt_text":"_DSC2334","src":"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2012\/12\/DSC2334-229x300.jpg?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":1896,"url":"https:\/\/mcclanahoochie.com\/blog\/2011\/11\/gpu-tv-l1-optical-flow-with-libjacket\/","url_meta":{"origin":929,"position":4},"title":"GPU TV-L1 Optical Flow with ArrayFire","author":"mcclanahoochie","date":"November 6, 2011","format":false,"excerpt":"Update 1: LibJacket has been renamed to\u00a0\u00a0ArrayFire. Update 2: Huang Chao-Hui was nice enough to port the LibJacket code mentioned here to ArrayFire - see his work here. 
As one of my\u00a0Computer Vision\u00a0class\u00a0projects, I decided to implement optical flow, because I wanted to learn more about optical flow, and also\u2026","rel":"","context":"In \"arrayfire\"","block_context":{"text":"arrayfire","link":"https:\/\/mcclanahoochie.com\/blog\/tag\/arrayfire\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/11\/jkt-oflow-tvl1-1024x626.png?resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/11\/jkt-oflow-tvl1-1024x626.png?resize=350%2C200 1x, https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/11\/jkt-oflow-tvl1-1024x626.png?resize=525%2C300 1.5x"},"classes":[]},{"id":1810,"url":"https:\/\/mcclanahoochie.com\/blog\/2011\/09\/opencv-vs-libjacket-gpu-sobel-filtering\/","url_meta":{"origin":929,"position":5},"title":"OpenCV vs. LibJacket: GPU Sobel Filtering","author":"mcclanahoochie","date":"September 24, 2011","format":false,"excerpt":"Update: LibJacket has been renamed to\u00a0\u00a0ArrayFire. In response to a comment on a previous post about integrating LibJacket into an OpenCV project, below is just a simple FYI performance comparison of OpenCV's GPU Sobel filter versus LibJacket's conv2\u00a0convolution\u00a0filter (with a sobel kernel)... This is an evolutionary post, so be sure\u2026","rel":"","context":"In \"arrayfire\"","block_context":{"text":"arrayfire","link":"https:\/\/mcclanahoochie.com\/blog\/tag\/arrayfire\/"},"img":{"alt_text":"Sobel filter: OpenCV GPU vs. 
LibJacket","src":"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/09\/cv-versus-jkt.png?resize=350%2C200","width":350,"height":200},"classes":[]}],"jetpack_likes_enabled":false,"_links":{"self":[{"href":"https:\/\/mcclanahoochie.com\/blog\/wp-json\/wp\/v2\/posts\/929","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mcclanahoochie.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mcclanahoochie.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mcclanahoochie.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mcclanahoochie.com\/blog\/wp-json\/wp\/v2\/comments?post=929"}],"version-history":[{"count":0,"href":"https:\/\/mcclanahoochie.com\/blog\/wp-json\/wp\/v2\/posts\/929\/revisions"}],"wp:attachment":[{"href":"https:\/\/mcclanahoochie.com\/blog\/wp-json\/wp\/v2\/media?parent=929"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mcclanahoochie.com\/blog\/wp-json\/wp\/v2\/categories?post=929"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mcclanahoochie.com\/blog\/wp-json\/wp\/v2\/tags?post=929"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}