GPU Connected Component Labeling

Connected Component Labeling (CCL) “is used in computer vision to detect connected regions in binary digital images” and is sometimes referred to as blob coloring.

Motivation:
To keep AccelerEyes’ ever-expanding GPU library growing, I spent a few weeks this summer writing a CUDA version of connected component labeling for use in Jacket (M) and LibJacket (C++). It turned out to be quite a fun learning experience! I will share some of the details here.

I came across several projects and papers on parallel connected component labeling. Many were very general and graph-based, and most weren’t GPU-specific (good for the authors, but more work for me!). The GPU-related projects I did find were either not open source or too complicated/specific to get working. From this research, though, I was equipped to start from scratch on my first attempt at CCL in CUDA.

First approach:
1. Read the input binary image.
2. Relabel every non-zero element in place to a unique ID (its global thread ID).
3. Scan through the image in thread blocks / shared memory, taking the MAX over each pixel’s 4- or 8-connected neighbors.
4. Relabel the current pixel, in place, to that neighborhood max.
5. Repeat the entire process (one kernel launch per pass) for a minimum number of passes, and until no thread detects a change (convergence).
6. In a last pass, scan the image for the n unique labels that survive and remap them into the range [1, n].
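To make the pipeline concrete, here is a minimal CUDA sketch of steps 2 through 5, assuming a row-major int label image with 0 marking background. The kernel names and the `changed` flag are my own illustrative choices, and the shared-memory tiling is omitted for brevity; this is not the actual Jacket/LibJacket code.

// Illustrative sketch only, not the original Jacket/LibJacket kernels.

// Step 2: give every foreground pixel a unique label (its 1-based index).
__global__ void init_labels(const unsigned char *img, int *labels, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) labels[i] = img[i] ? i + 1 : 0;
}

// Steps 3-4: one in-place propagation pass, 4-connectivity.
__global__ void propagate_max(int *labels, int w, int h, int *changed)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    int idx = y * w + x;
    int cur = labels[idx];
    if (cur == 0) return;                     // background: nothing to do

    // Max over the pixel and its 4-connected neighbors; background
    // neighbors are 0, so max() ignores them when cur > 0.
    int best = cur;
    if (x > 0)     best = max(best, labels[idx - 1]);
    if (x < w - 1) best = max(best, labels[idx + 1]);
    if (y > 0)     best = max(best, labels[idx - w]);
    if (y < h - 1) best = max(best, labels[idx + w]);

    if (best != cur) {
        labels[idx] = best;                   // relabel in place
        *changed = 1;                         // any thread may flag a change
    }
}

The host zeroes the `changed` flag, launches propagate_max, and repeats (step 5) until the flag stays zero; a final pass (step 6) then compacts the surviving labels into [1, n].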

This “local maximum propagation” approach worked well, but only for ‘small’ images; otherwise it was really, really slow. That was mostly due to the way I was checking for convergence (a high minimum number of passes) and to the unguarded shared-memory updates (lots of contention).

Unknowingly, I ended up writing my own variant of the kernels described in section 5.2 (CUDA Implementations) of this paper, a PDF I only discovered after I had started coding. Nevertheless, the paper is a good resource on parallel CCL and gave me some ideas for improving my code.

Second approach:
Very similar to the first approach, but with two smarter optimizations: create a lookup table (LUT) image for storing the labels at each pixel, and use constant memory to store the flag marking whether the labeling has converged. The global flag enabled quicker convergence detection, and the separate LUT allowed for cleaner code. Even though I wasn’t using shared memory this time around, performance improved tremendously! Bringing shared memory back in is the obvious next step for further optimization.
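Here is a rough sketch of that flag plumbing. Since device code cannot store to __constant__ memory, I stand in a writable __device__ symbol for the convergence flag: the host clears it before each pass, any thread that changes a label sets it, and the host reads it back to decide whether to launch another pass. All names are assumptions for illustration, not the actual Jacket implementation.

#include <cuda_runtime.h>

// Convergence flag in device memory: a plain store is fine, since every
// writer stores the same value (1).
__device__ int d_changed;

// One propagation pass over the separate label LUT image (4-connectivity),
// rather than relabeling the input image in place.
__global__ void propagate_lut(int *lut, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    int i = y * w + x;
    int cur = lut[i];
    if (cur == 0) return;                 // background pixel

    int best = cur;
    if (x > 0)     best = max(best, lut[i - 1]);
    if (x < w - 1) best = max(best, lut[i + 1]);
    if (y > 0)     best = max(best, lut[i - w]);
    if (y < h - 1) best = max(best, lut[i + w]);

    if (best != cur) { lut[i] = best; d_changed = 1; }
}

// Host loop: relaunch until a full pass makes no changes.
void label_until_converged(int *d_lut, int w, int h)
{
    dim3 block(16, 16);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);

    int changed = 1;
    while (changed) {
        int zero = 0;
        cudaMemcpyToSymbol(d_changed, &zero, sizeof(int));
        propagate_lut<<<grid, block>>>(d_lut, w, h);
        cudaMemcpyFromSymbol(&changed, d_changed, sizeof(int));
    }
}

Reading the flag back with cudaMemcpyFromSymbol on the default stream waits for the kernel to finish, so no explicit cudaDeviceSynchronize is needed between passes.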

Example Use:

Binary Image Labeling: uniquely coloring coins

The following M-script loads the coins.png image and runs bwlabel on both the CPU and the GPU to uniquely label and color each coin, timing the process along the way.

%===============================%


% params
scale = 5;                 % upscale factor for the test image
conn  = 4;                 % pixel connectivity (4 or 8)
runs  = 10;                % timing iterations

% image
bw = imread('coins.png');
im = (bw > 90);            % threshold to a binary mask
im = imresize(im, scale);
ss = size(im)              % display the working image size
im = logical(im);
gm = glogical(im);         % push the mask to the GPU

% cpu time
tic
for rr = 1:runs
    cc = bwlabel(im, conn);
end
cpu = toc/runs

% gpu time
tic
for rr = 1:runs
    gg = bwlabel(gm, conn);
end
gpu = toc/runs

% compare
speedup = cpu/gpu

% plot
close all
figure
image(bw/10)               % original grayscale image
figure
image(cc*6)                % labeled image, scaled for distinct colors

return;


%===============================%

Performance:
Right now the GPU implementation is comparable in speed to the CPU for small image sizes, and it really starts to shine as the image gets bigger.

[Chart: GPU CCL speedup over CPU CCL]

While the chart looks promising as the sizes get bigger, there is definitely room for improvement in my CUDA CCL implementation. Also note: the 0.6x speedup for the first size represents only a 1.1 ms difference (1.7 ms on the CPU vs 2.8 ms on the GPU), so the GPU is still very competitive there. Performance should improve once I get around to tweaking the code more and applying different CUDA techniques.

Next Steps:
What can you do with a labeled image? The next image-processing step usually involves computing statistics on each labeled region. I have an initial version of this available for use in Jacket & LibJacket (extending it is another interesting and challenging exercise, to be discussed later!).
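As a taste of such per-region statistics on the GPU, here is a minimal sketch that computes each region’s area (pixel count) with an atomic histogram. It assumes labels were already compacted into 1..nlabels and that the area array was zeroed beforehand; this is my own illustration, not the statistics code shipped with Jacket/LibJacket.

// Per-region areas via an atomic histogram. Assumes labels lie in
// 1..nlabels and `area` (length nlabels) is zero-initialized.
__global__ void region_areas(const int *labels, int n, unsigned int *area)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int lab = labels[i];
    if (lab > 0)
        atomicAdd(&area[lab - 1], 1u);    // one counter per region
}

Centroids, bounding boxes, and similar statistics follow the same pattern, accumulating coordinate sums with atomicAdd or extrema with atomicMin/atomicMax per label.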

Example uses of connected component labeling and labeled-region statistics in C++ can be found in the image processing demo included with LibJacket: image_demo.cpp.

Shameless Plug 🙂 Download a free trial of Jacket & LibJacket today and see for yourself!

Update: AccelerEyes is now ArrayFire, and the connected component labeling function is called regions().
