{"id":1663,"date":"2011-08-06T23:27:33","date_gmt":"2011-08-07T06:27:33","guid":{"rendered":"http:\/\/mcclanahoochie.com\/blog\/?post_type=portfolio&#038;p=1663"},"modified":"2023-06-10T10:32:17","modified_gmt":"2023-06-10T17:32:17","slug":"cuda-connected-component-labeling","status":"publish","type":"post","link":"https:\/\/mcclanahoochie.com\/blog\/2011\/08\/cuda-connected-component-labeling\/","title":{"rendered":"GPU Connected Component Labeling"},"content":{"rendered":"<p><span style=\"font-size: medium;\">Connected Component Labeling<\/span> (<a title=\"CCL\" href=\"http:\/\/en.wikipedia.org\/wiki\/Connected-component_labeling\" target=\"_blank\" rel=\"noopener\">CCL<\/a>): &#8220;is used in computer vision to detect connected regions in binary digital images&#8221;, and sometimes referred to as <em>blob coloring<\/em>.<\/p>\n<p><span style=\"font-size: medium;\">Motivation:<\/span><br \/>\nTo keep <a title=\"Accelereyes\" href=\"http:\/\/www.accelereyes.com\/\" target=\"_blank\" rel=\"noopener\">AccelerEyes&#8217;<\/a>\u00a0ever expanding GPU library growing, over a few weeks of this summer\u00a0I took on the project of writing a CUDA version of connected component labeling\u00a0to be used in Jacket (M)\u00a0and <a title=\"LibJacket (C++)\" href=\"http:\/\/www.accelereyes.com\/\" target=\"_blank\" rel=\"noopener\">LibJacket (C++)<\/a>.<em>\u00a0It turned out to be quite a fun learning experience!<\/em> I will share some of the details here.<\/p>\n<p>I came across several projects and papers on parallel connected component labeling. Many were very general and graph based, and most of them weren&#8217;t GPU specific (good for the authors, but more work for me!). The GPU related projects I did find were either not open source, or too complicated\/specific to get working. From this research though, I was equipped to start from scratch with my first attempt at CCL in CUDA.<\/p>\n<p><span style=\"font-size: medium;\">First approach:<\/span><br \/>\nRead input binary image &gt; Re-label every non-zero element (in-place) to a unique ID (global thread ID) &gt; Scan through the image in threadblocks\/shared memory, taking the MAX of each pixel&#8217;s 4 or 8 connected neighbors &gt; Re-label current pixel to be the max of the connected neighbors, in-place &gt; Repeat entire process (kernel launch) for a minimum number of times, and until no thread detects a change (convergence) &gt; A last pass scans the image for all\u00a0<em>n<\/em> unique labels, and re-maps them into the range [1-n].<\/p>\n<p>This &#8220;<em>local maximum\u00a0propagation<\/em>&#8221; approach worked well, but only for &#8216;small&#8217; images, as otherwise it was really, really slow. This was mostly due to the way I was checking for convergence (high minimum number of passes), and for using unguarded shared memory updates (lots of contention).<\/p>\n<p>Unknowingly, I ended up writing my own variant of the kernels described in section 5.2. CUDA Implementations in <a title=\"Parallel Graph Component Labelling with GPUs and CUDA\" href=\"http:\/\/www.massey.ac.nz\/~kahawick\/cstn\/089\/cstn-089.pdf\">this paper<\/a>, as I discovered this PDF <em>after<\/em> I started coding. Nevertheless, this paper is a good resource of parallel CCL information, and gave me some ideas on improving my code.<\/p>\n<p><span style=\"font-size: medium;\">Second approach:<\/span><br \/>\nVery similar to first approach, but with smarter optimization: create a look up table (LUT) image for storing labels at each pixel, and use constant memory to store the flag marking if the labeling has converged or not. The use of a global flag in constant memory enabled quicker label convergence and using a separate LUT allowed for\u00a0cleaner code.\u00a0Even though I wasn&#8217;t using shared memory this time around, the performance tremendously improved! Making use of shared memory somehow is the obvious next step for further optimization.<\/p>\n<p><span style=\"font-size: medium;\">Example Use:<\/span><\/p>\n<figure id=\"attachment_1664\" aria-describedby=\"caption-attachment-1664\" style=\"width: 290px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/08\/coins-bwlabel.png\"><img data-recalc-dims=\"1\" decoding=\"async\" data-attachment-id=\"1664\" data-permalink=\"https:\/\/mcclanahoochie.com\/blog\/2011\/08\/cuda-connected-component-labeling\/coins-bwlabel\/#main\" data-orig-file=\"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/08\/coins-bwlabel.png?fit=1131%2C460&amp;ssl=1\" data-orig-size=\"1131,460\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}\" data-image-title=\"coins-bwlabel\" data-image-description=\"\" data-image-caption=\"&lt;p&gt;Binary Image Labeling: uniquely coloring coins&lt;\/p&gt;\n\" data-large-file=\"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/08\/coins-bwlabel.png?fit=1024%2C416&amp;ssl=1\" class=\"size-medium wp-image-1664\" title=\"coins-bwlabel\" src=\"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/08\/coins-bwlabel-300x122.png?resize=300%2C122\" alt=\"\" width=\"300\" height=\"122\" srcset=\"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/08\/coins-bwlabel.png?resize=300%2C122&amp;ssl=1 300w, https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/08\/coins-bwlabel.png?resize=1024%2C416&amp;ssl=1 1024w, https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/08\/coins-bwlabel.png?resize=500%2C203&amp;ssl=1 500w, https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/08\/coins-bwlabel.png?w=1131&amp;ssl=1 1131w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><figcaption id=\"caption-attachment-1664\" class=\"wp-caption-text\">Binary Image Labeling: uniquely coloring coins<\/figcaption><\/figure>\n<p>The following M-script loads in the <em>coins.png<\/em> image and performs bwlabel on the CPU and GPU to uniquely label and color each coin, while timing the process.<br \/>\n<code><br \/>\n%===============================%<\/code><br \/>\n<code><br \/>\n% params<br \/>\nscale = 5;<br \/>\nconn = 4;<br \/>\nruns = 10;<br \/>\n<\/code><code><br \/>\n% image<br \/>\nbw = imread('coins.png');<br \/>\nim = (bw &gt; 90);<br \/>\nim = imresize(im,scale);<br \/>\nss = size(im)<br \/>\nim = logical(im);<br \/>\ngm = glogical(im);<\/code><br \/>\n<code><br \/>\n% cpu time<br \/>\ntic<br \/>\nfor rr=1:runs,<br \/>\ncc = bwlabel(im,conn);<br \/>\nend<br \/>\ncpu=toc\/runs<\/code><br \/>\n<code><br \/>\n% gpu time<br \/>\ntic<br \/>\nfor rr=1:runs,<br \/>\ngg = bwlabel(gm,conn);<br \/>\nend<br \/>\ngpu=toc\/runs<\/code><br \/>\n<code><br \/>\n% compare<br \/>\nspeedup=cpu\/gpu<\/code><br \/>\n<code><br \/>\n% plot<br \/>\nclose all<br \/>\nfigure<br \/>\nimage(bw\/10)<br \/>\nfigure<br \/>\nimage(cc*6)<\/code><br \/>\n<code><br \/>\nreturn;<\/code><br \/>\n<code><br \/>\n%===============================%<\/code><\/p>\n<p><span style=\"font-size: medium;\">Performance:<\/span><br \/>\nRight now, the speed of the GPU implementation is comparable to that of the CPU for small image sizes, and really starts to shine as the image gets bigger.<\/p>\n<figure id=\"attachment_1665\" aria-describedby=\"caption-attachment-1665\" style=\"width: 290px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/08\/coins-perf.png\"><img data-recalc-dims=\"1\" decoding=\"async\" data-attachment-id=\"1665\" data-permalink=\"https:\/\/mcclanahoochie.com\/blog\/2011\/08\/cuda-connected-component-labeling\/coins-perf\/#main\" data-orig-file=\"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/08\/coins-perf.png?fit=561%2C411&amp;ssl=1\" data-orig-size=\"561,411\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}\" data-image-title=\"coins-perf\" data-image-description=\"\" data-image-caption=\"&lt;p&gt;GPU speedup over CPU bwlabel&lt;\/p&gt;\n\" data-large-file=\"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/08\/coins-perf.png?fit=561%2C411&amp;ssl=1\" class=\"size-medium wp-image-1665 \" title=\"coins-perf\" src=\"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/08\/coins-perf-300x219.png?resize=300%2C219\" alt=\"\" width=\"300\" height=\"219\" srcset=\"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/08\/coins-perf.png?resize=300%2C219&amp;ssl=1 300w, https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/08\/coins-perf.png?resize=409%2C300&amp;ssl=1 409w, https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/08\/coins-perf.png?w=561&amp;ssl=1 561w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><figcaption id=\"caption-attachment-1665\" class=\"wp-caption-text\">GPU CCL speedup over CPU CCL<\/figcaption><\/figure>\n<p>While the chart looks promising as the sizes get bigger, there is definitely room for improvement in my CUDA CCL implementation. Also note &#8211; the 0.6x speedup for the first size represents a 1.1ms difference: 1.7ms on the CPU vs 2.8ms on the GPU &#8211; still very competitive. Performance shall improve when I get around to tweaking the code more and applying different CUDA techniques.<\/p>\n<p><span style=\"font-size: medium;\">Next Steps:<\/span><br \/>\nWhat can you do with a labeled image? The next image processing step usually involves computing statistics on each labeled region. I have an initial version of this available, for use in Jacket &amp;\u00a0<a title=\"LibJacket\" href=\"http:\/\/www.accelereyes.com\/\" target=\"_blank\" rel=\"noopener\">LibJacket<\/a> (extending this is another interesting and challenging exercise to be discussed later!).<\/p>\n<p>Example uses of <em>connected component labeling<\/em> and <em>labeled image region statistics<\/em> in\u00a0C++ can be found in the image processing demo, included with LibJacket: image_demo.cpp.<\/p>\n<p><span style=\"font-size: x-small;\">Shameless Plug \ud83d\ude42 \u00a0\u00a0<\/span>Download a free trial of <a href=\"http:\/\/www.accelereyes.com\/\" target=\"_blank\" rel=\"noopener\">Jacket &amp; LibJacket<\/a> today and see for yourself!<\/p>\n<p><em>\u00a0<strong>Update<\/strong>: AccelerEyes is now <a href=\"http:\/\/arrayfire.com\" target=\"_blank\" rel=\"noopener\">ArrayFire<\/a>\u00a0and the connected components labeling function is called\u00a0<a href=\"http:\/\/www.arrayfire.com\/docs\/group__image__func__regions.htm\" target=\"_blank\" rel=\"noopener\">regions()<\/a>.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Connected Component Labeling (CCL): &#8220;is used in computer vision to detect connected regions in binary digital images&#8221;, and sometimes referred to as blob coloring. Motivation: To keep AccelerEyes&#8217;\u00a0ever expanding GPU library growing, over a few weeks of this summer\u00a0I took on the project of writing a CUDA version of connected component labeling\u00a0to be used in &#8230; <a title=\"GPU Connected Component Labeling\" class=\"read-more\" href=\"https:\/\/mcclanahoochie.com\/blog\/2011\/08\/cuda-connected-component-labeling\/\" aria-label=\"Read more about GPU Connected Component Labeling\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"advanced_seo_description":"","jetpack_seo_html_title":"","jetpack_seo_noindex":false,"jetpack_seo_schema_type":"","_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[1],"tags":[91,117,113,116,100,54,101,29],"class_list":["post-1663","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-arrayfire","tag-ccl","tag-computer-vision","tag-connected-component","tag-gpu","tag-image-processing","tag-programming","tag-projects"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/pZdXI-qP","jetpack-related-posts":[{"id":1896,"url":"https:\/\/mcclanahoochie.com\/blog\/2011\/11\/gpu-tv-l1-optical-flow-with-libjacket\/","url_meta":{"origin":1663,"position":0},"title":"GPU TV-L1 Optical Flow with ArrayFire","author":"mcclanahoochie","date":"November 6, 2011","format":false,"excerpt":"Update 1: LibJacket has been renamed to\u00a0\u00a0ArrayFire. Update 2: Huang Chao-Hui was nice enough to port the LibJacket code mentioned here to ArrayFire - see his work here. As one of my\u00a0Computer Vision\u00a0class\u00a0projects, I decided to implement optical flow, because I wanted to learn more about optical flow, and also\u2026","rel":"","context":"In \"arrayfire\"","block_context":{"text":"arrayfire","link":"https:\/\/mcclanahoochie.com\/blog\/tag\/arrayfire\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/11\/jkt-oflow-tvl1-1024x626.png?resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/11\/jkt-oflow-tvl1-1024x626.png?resize=350%2C200 1x, https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/11\/jkt-oflow-tvl1-1024x626.png?resize=525%2C300 1.5x"},"classes":[]},{"id":1731,"url":"https:\/\/mcclanahoochie.com\/blog\/2011\/08\/image-processing-with-libjacket-opencv\/","url_meta":{"origin":1663,"position":1},"title":"Image processing with LibJacket + OpenCV","author":"mcclanahoochie","date":"August 24, 2011","format":false,"excerpt":"Update: one year later:\u00a0ArrayFire+OpenCV The OpenCV library is the de-facto standard for doing computer vision and image processing research projects. OpenCV includes several hundreds of computer vision algorithms, aimed for use in real-time vision applications. LibJacket is a matrix library built on CUDA. LibJacket offers hundreds of general matrix and\u2026","rel":"","context":"In \"arrayfire\"","block_context":{"text":"arrayfire","link":"https:\/\/mcclanahoochie.com\/blog\/tag\/arrayfire\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/08\/Screen-shot-2011-08-24-at-2.42.52-PM-1024x640.png?resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/08\/Screen-shot-2011-08-24-at-2.42.52-PM-1024x640.png?resize=350%2C200 1x, https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/08\/Screen-shot-2011-08-24-at-2.42.52-PM-1024x640.png?resize=525%2C300 1.5x"},"classes":[]},{"id":1966,"url":"https:\/\/mcclanahoochie.com\/blog\/2011\/12\/computer-vision-learning-portfolio\/","url_meta":{"origin":1663,"position":2},"title":"Computer Vision Learning Portfolio","author":"mcclanahoochie","date":"December 12, 2011","format":false,"excerpt":"This page constitutes my required\u00a0external\u00a0learning portfolio for CS 7495, Computer Vision, taken in Fall 2011. In it, I discuss what I have learned throughout the course, my activities and findings, how I think I did, and what impact it had on me. About me I am a coffee fanatic that\u2026","rel":"","context":"In \"computer vision\"","block_context":{"text":"computer vision","link":"https:\/\/mcclanahoochie.com\/blog\/tag\/computer-vision\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/12\/chris-raffertys-2-150x150.jpg?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":886,"url":"https:\/\/mcclanahoochie.com\/blog\/2011\/01\/mtimes-gpu-matrix-multiplication\/","url_meta":{"origin":1663,"position":3},"title":"MTIMES &#8211; GPU Matrix Multiplication","author":"mcclanahoochie","date":"January 1, 2011","format":false,"excerpt":"July 2010 OK, it's not really a project, but I did learn a lot about GPU matrix multiplication over the summer, working\u00a0at AccelerEyes. I\u00a0re-worked the back-end CUDA code for\u00a0the MTIMES Jacket function. I also modified it to accelerate SUM, MIN, and\u00a0MAX. Checkout my MTIMES wiki page!","rel":"","context":"In \"arrayfire\"","block_context":{"text":"arrayfire","link":"https:\/\/mcclanahoochie.com\/blog\/tag\/arrayfire\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/01\/fermi_gflops_single.jpg?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":929,"url":"https:\/\/mcclanahoochie.com\/blog\/2011\/01\/p3dfft-cuda-gpu-3d-fft\/","url_meta":{"origin":1663,"position":4},"title":"P3DFFT + CUDA (GPU 3D FFT)","author":"mcclanahoochie","date":"January 1, 2011","format":false,"excerpt":"December 2010 My first project as a GRA under Rich Vuduc involved accelerating 3D Fast Fourier Transforms (3D FFT) with GPUs. The project was basically porting the open-source P3DFFT code\u00a0(written in FORTRAN) to run on GPU(instead of CPU)\u00a0clusters using CUFFT. \u00a0 Update: 04\/16\/2011 - This project has morphed into a\u2026","rel":"","context":"In \"cuda\"","block_context":{"text":"cuda","link":"https:\/\/mcclanahoochie.com\/blog\/tag\/cuda\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/mcclanahoochie.com\/blog\/wp-content\/uploads\/2011\/01\/pencil-decomp.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":1133,"url":"https:\/\/mcclanahoochie.com\/blog\/2011\/03\/the-history-and-evolution-of-gpu-hardware\/","url_meta":{"origin":1663,"position":5},"title":"The History and Evolution of GPU Hardware","author":"mcclanahoochie","date":"March 27, 2011","format":false,"excerpt":"Here is a paper survey I wrote last semester in my CS6290 class about how the Graphics Processing Units (GPUs) hardware architecture has evolved over time. I found the research quite interesting, and spent a lot of time doing it. I'm posting this here, as I feel that more people\u2026","rel":"","context":"In \"evolution\"","block_context":{"text":"evolution","link":"https:\/\/mcclanahoochie.com\/blog\/tag\/evolution\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"jetpack_likes_enabled":false,"_links":{"self":[{"href":"https:\/\/mcclanahoochie.com\/blog\/wp-json\/wp\/v2\/posts\/1663","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mcclanahoochie.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mcclanahoochie.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mcclanahoochie.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mcclanahoochie.com\/blog\/wp-json\/wp\/v2\/comments?post=1663"}],"version-history":[{"count":0,"href":"https:\/\/mcclanahoochie.com\/blog\/wp-json\/wp\/v2\/posts\/1663\/revisions"}],"wp:attachment":[{"href":"https:\/\/mcclanahoochie.com\/blog\/wp-json\/wp\/v2\/media?parent=1663"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mcclanahoochie.com\/blog\/wp-json\/wp\/v2\/categories?post=1663"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mcclanahoochie.com\/blog\/wp-json\/wp\/v2\/tags?post=1663"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}