OpenCV vs. LibJacket: GPU Sobel Filtering


Update: LibJacket has since been renamed to ArrayFire.

In response to a comment on a previous post about integrating LibJacket into an OpenCV project, below is a quick FYI performance comparison of OpenCV’s GPU Sobel filter versus LibJacket’s conv2 convolution filter (with a Sobel kernel)…

This is an evolving post, so be sure to scroll all the way down to see the later comparisons…

Update (10/24/2011): Round 2

 

[Chart: OpenCV GPU Sobel vs. LibJacket conv2 (2D kernel)]

 

Test system:
[via /proc/cpuinfo]:
Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
[via LibJacket’s ginfo()]:
Libjacket v1.0.1 (build dd66add) by AccelerEyes
CUDA Driver: 270.41.19
CUDA Toolkit: v4.0
CUDA capable devices detected:
GPU0 GeForce GTX 295, 896 MB, Compute 1.3 (single,double) (in use)
GPU1 GeForce GTX 295, 896 MB, Compute 1.3 (single,double)

 

Test procedure:
Random matrices were generated and used for testing; for every size, the same matrix (image) was used for each call. A “warm-up” function call was made first, and the reported time is the average over 100 runs. The latest versions of both libraries (as of this writing) were used for the comparison.

One note: I disabled LibJacket’s ‘dynamic caching’ by calling gsync() in each loop iteration; without this call (i.e., in normal code) the functions run even faster than shown above.

Get the source code and see for yourself!
(put the folder in your /libjacket/examples/ directory)

 

The test is by no means an extensive benchmark, but it does offer some hints about general performance on the various platforms. I may benchmark other functions when I get more time, but right now it looks like LibJacket has a few tricks up its sleeve!

 

Update 1:

To address the question in the comments about convolve(), I ran another quick test. This one compares OpenCV’s GPU convolve() method against LibJacket’s conv2() with Sobel kernels of varying size. Note: OpenCV’s filter2D() currently doesn’t support floating-point images, so it is not considered here.

[Chart: OpenCV GPU convolve vs. LibJacket conv2]

It’s interesting to see that OpenCV’s timing is essentially unaffected by kernel size, while LibJacket’s performance depends strongly on it, heavily favoring smaller kernels.
(LibJacket is not using separable kernels here either.)

 

Update 2:
This seems to be an evolving post, thanks to all the comments!

Up until now, the LibJacket benchmarks have used 2D kernels. Since OpenCV’s Sobel filter uses separable kernels, I re-ran the above benchmark using the separable-kernel version of LibJacket’s conv2() function. The dotted line is the same LibJacket 3×3 2D-kernel result from the first chart, for reference on the improvement from separable kernels.

 

[Chart: OpenCV GPU Sobel vs. LibJacket conv2 (separable)]

 

I wish I had more time to do a full feature comparison of all the overlapping LibJacket/OpenCV functions, but alas, maybe another day… The source above is enough for anyone out there to get started, though!

 

Update 3:
As requested in the comments, here are the Fermi Tesla benchmarks.

Note: If the goal here were solely Sobel filtering, one would compare jkt::conv2 vs. cv::gpu::Sobel (first figure below). To generalize to arbitrary convolutions (second/third figures below), OpenCV-GPU requires either cv::gpu::filter2D or cv::gpu::convolve. Unfortunately, filter2D only works on uchar images, while convolve works on any type; float is the common data type between LibJacket and OpenCV. According to the comments, convolve was designed for larger kernels, while their Sobel stops at 16×16 kernels (I discovered this experimentally). For general floating-point convolutions, I would say jkt::conv2 vs. cv::gpu::convolve is a fair comparison.

[Chart: OpenCV GPU Sobel vs. LibJacket conv2 (separable)]
[Chart: OpenCV GPU convolve vs. LibJacket conv2 – small kernels]
[Chart: OpenCV GPU convolve vs. LibJacket conv2 – larger kernels]

 

See also: OpenCV+ArrayFire

 


18 thoughts on “OpenCV vs. LibJacket: GPU Sobel Filtering”

  1. If OpenCV also has conv2, maybe you should add that for comparison. Maybe they are using different algorithms for Sobel and convolution?

  2. Pavan:
    Good point: OpenCV does have other filtering methods. Looking into this, I find that its filter2D() doesn’t support floating-point images, and its convolve() function is essentially equivalent to LibJacket’s conv2(). A quick benchmark of OpenCV’s GPU convolve() looks like…

    size: 512x512
    cv-gpu: 2.27729
    jacket: 0.17907
    size: 1024x1024
    cv-gpu: 3.64561
    jacket: 0.61625
    size: 1536x1536
    cv-gpu: 13.6968
    jacket: 1.05545
    size: 2048x2048
    cv-gpu: 13.954
    jacket: 2.43504
    size: 2560x2560
    cv-gpu: 27.7197
    jacket: 3.28057
    size: 3072x3072
    cv-gpu: 28.0698
    jacket: 4.26894
    size: 3584x3584
    cv-gpu: 47.5459
    jacket: 5.65912

    (^The Jacket results above also include the faster convn() timings.)

  3. mcclanahoochie:
    Is the situation with convolve() the same for bigger kernels, say 5% or 10% of the source image width?

  4. Dear mcclanahoochie,

    Thank you for benchmarking the OpenCV library. We will add specializations for small Sobel kernels. We look forward to your tests of other functions. Any help is welcome!

    OpenCV’s filter engine consists of three layers. At the simplest (and slowest) level, a user can call cv::gpu::Sobel. For better performance, one should use the Filter Engine API; it is our fault that we haven’t documented this very well. But I suspect that even in this case LibJacket will be a bit faster, because of the universality of OpenCV’s code. In any case, we will add specializations. Many thanks.

    Also a little note: cv::gpu::convolve is a utility function used in template matching. It is optimized for big pattern sizes like 100×100 or 250×250; it uses an FFT internally and performs GPU buffer allocations. Using this function for 3×3 Sobel filtering is quite funny: like driving a nail with an excavator rather than a hammer :)))

  5. Alexey:
    I’ve added benchmarks of convolve() at various kernel sizes, and as Anatoly points out, it is indeed designed for larger kernel sizes.

    Anatoly:
    Thanks for the clarification. I also noticed that OpenCV uses separable kernels for Sobel. When I next get time, I’ll try to re-do the LibJacket benchmarks using separable kernels as well, for a fairer comparison (meaning Jacket will probably be faster…).

  6. @pavan:
    I don’t doubt that LibJacket uses FFT techniques for large kernel convolutions.

    @mcclanahoochie:
    Thanks for the new chart. That’s very helpful. BTW, do you plan to run the benchmark on Fermi? Also, I wonder: does LibJacket support some kind of border extrapolation?

  7. Anatoly:
    Glad to hear that! I’ll see if I can get a hold of a Fermi card before the weekend, so check back then! As far as extrapolation, I assume you’re referring to how the border is handled… in that case, LibJacket currently supports filtering with or without zero-padded edges. Thanks for the interest!

  8. Hi mcclanahoochie. I’m trying to do one more comparison of the OpenCV and LibJacket libs. Here is a modified version of your benchmark: http://pastebin.com/W41RwPnu. I get strange results for ksz = 32, while the results for ksz = 64 seem OK. I wonder if I am using LibJacket correctly.

    Results for ksz = 32
    ====================
    Libjacket v1.0.1 (build dd66add) by AccelerEyes
    CUDA Driver: 270.81
    CUDA Toolkit: v4.0

    CUDA capable devices detected:
    GPU0 Tesla C2050 / C2070, 2652 MB, Compute 2.0 (single,double) (in use)
    size: 512×512
    cv-gpu: 0.00202607
    jacket: 0.00790995
    size: 1024×1024
    cv-gpu: 0.00767946
    jacket: 0.0321756
    size: 1536×1536
    cv-gpu: 0.00893694
    jacket: 0.0735304
    size: 2048×2048
    cv-gpu: 0.0171747
    jacket: 0.131005
    size: 2560×2560
    cv-gpu: 0.0177356
    jacket: 0.205862
    size: 3072×3072
    cv-gpu: 0.0278928
    jacket: 0.297097
    size: 3584×3584
    cv-gpu: 0.0268914
    jacket: 0.404806

    Results for ksz = 64
    ====================
    Libjacket v1.0.1 (build dd66add) by AccelerEyes
    CUDA Driver: 270.81
    CUDA Toolkit: v4.0

    CUDA capable devices detected:
    GPU0 Tesla C2050 / C2070, 2652 MB, Compute 2.0 (single,double) (in use)
    size: 512×512
    cv-gpu: 0.0021476
    jacket: 0.00167873
    size: 1024×1024
    cv-gpu: 0.00802897
    jacket: 0.00602738
    size: 1536×1536
    cv-gpu: 0.00899348
    jacket: 0.00626257
    size: 2048×2048
    cv-gpu: 0.0173659
    jacket: 0.0232162
    size: 2560×2560
    cv-gpu: 0.0177426
    jacket: 0.0236838
    size: 3072×3072
    cv-gpu: 0.0264366
    jacket: 0.0242362
    size: 3584×3584
    cv-gpu: 0.0270701
    jacket: 0.0249083

  9. Anatoly:
    Very interesting results, and good work on your performance improvement! I hope to dive deeper into your other question soon and have relayed the message to developers at AccelerEyes… Cheers.
    ~Chris
