Update: LibJacket has been renamed to ArrayFire.
In response to a comment on a previous post about integrating LibJacket into an OpenCV project, below is a simple FYI performance comparison of OpenCV's GPU Sobel filter versus LibJacket's conv2 convolution filter (with a Sobel kernel)…
This post has evolved over time, so be sure to scroll all the way down to see the additional comparisons…
Update (10/24/2011): Round 2
Test system:
[via /proc/cpuinfo]:
Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
[via LibJacket’s ginfo()]:
Libjacket v1.0.1 (build dd66add) by AccelerEyes
CUDA Driver: 270.41.19
CUDA Toolkit: v4.0
CUDA capable devices detected:
GPU0 GeForce GTX 295, 896 MB, Compute 1.3 (single,double) (in use)
GPU1 GeForce GTX 295, 896 MB, Compute 1.3 (single,double)
Test procedure:
Random matrices were generated and used for testing. For every size, the same matrix (image) was used for each call. A "warm-up" function call was made first, then the average over 100 runs is reported. At the time of writing, the latest versions of both libraries were used for the comparison.
One note: I disable LibJacket's 'dynamic caching' by calling gsync() in each loop iteration; without this call (i.e., in normal code) the functions run even faster than shown above.
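For the curious, here is a rough sketch of the kind of timing loop described above. This is an illustration only, not the actual benchmark source linked below; it uses OpenCV's 2.x gpu module and assumes your build accepts 32-bit float input to cv::gpu::Sobel. The LibJacket side is timed the same way, with a gsync() after each call.

```cpp
// Illustrative timing loop only -- not the actual benchmark source linked below.
#include <opencv2/core/core.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <cstdio>

int main()
{
    const int runs = 100;
    const int n = 1024;                              // n x n test image

    cv::Mat h_img(n, n, CV_32FC1);
    cv::randu(h_img, cv::Scalar(0), cv::Scalar(1));  // random test data

    cv::gpu::GpuMat d_img(h_img), d_dst;

    cv::gpu::Sobel(d_img, d_dst, -1, 1, 0);          // "warm-up" call

    double t0 = (double)cv::getTickCount();
    for (int i = 0; i < runs; ++i)
        cv::gpu::Sobel(d_img, d_dst, -1, 1, 0);      // dx = 1, dy = 0, default 3x3 kernel
    double ms = ((double)cv::getTickCount() - t0) / cv::getTickFrequency() * 1000.0 / runs;

    printf("cv-gpu Sobel, %dx%d: %f ms/run\n", n, n, ms);
    return 0;
}
```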
Get the source code and see for yourself!
(put the folder in your /libjacket/examples/ directory)
The test is by no means an extensive benchmark, but it does give some hints about general performance on the various platforms. I may benchmark other functions when I get more time, but right now it looks like LibJacket has a few tricks up its sleeve!
Update 1:
To address the question in the comments about "convolve()", I ran another quick test. This one compares OpenCV's GPU "convolve()" method against LibJacket's "conv2()" with Sobel filter kernels of varying size. Note: OpenCV's filter2D() doesn't support floating-point images at the time of writing, so it is not considered here.
It's interesting to see that OpenCV seems unaffected by kernel size, while LibJacket's performance depends on it quite a bit, strongly favoring smaller kernels.
(LibJacket is not using separable kernels here either).
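As a rough illustration of this round's setup (a sketch only, not the benchmark source; the image and kernel sizes below are arbitrary picks): a float image is convolved with a full, non-separable Sobel-style kernel via cv::gpu::convolve(), and the LibJacket side would pass the same 2D kernel to conv2().

```cpp
// Illustrative only -- a float image convolved with a full (non-separable)
// Sobel-style kernel via cv::gpu::convolve(). The LibJacket side would pass
// the same 2D kernel to conv2(). Image/kernel sizes here are arbitrary picks.
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/gpu/gpu.hpp>

int main()
{
    const int n = 1024, ksz = 7;

    cv::Mat h_img(n, n, CV_32FC1);
    cv::randu(h_img, cv::Scalar(0), cv::Scalar(1));

    // Build the ksz x ksz Sobel-style derivative kernel as a full 2D kernel:
    // the outer product of its separable column/row parts.
    cv::Mat kx, ky;
    cv::getDerivKernels(kx, ky, 1, 0, ksz, false, CV_32F);
    cv::Mat h_kernel = ky * kx.t();

    // gpu::convolve expects single-channel float data (the common type with LibJacket).
    cv::gpu::GpuMat d_img(h_img), d_kernel(h_kernel), d_result;
    cv::gpu::convolve(d_img, d_kernel, d_result);

    return 0;
}
```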
Update 2:
This seems to be an evolving post, thanks to all the comments!
Up until now, the LibJacket benchmarks have used 2D kernels. Since OpenCV's Sobel filter uses separable kernels, I re-ran the above benchmark using the separable-kernel version of LibJacket's conv2() function. The dotted line is the same LibJacket 3×3 kernel as in the first chart, for reference on the improvement from separable kernels.
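For reference, here is why the separable path is cheaper: the Sobel kernel factors exactly into a column vector and a row vector, so a k×k convolution becomes two 1-D passes. A tiny sketch of the decomposition (illustration only; the actual separable conv2() call is in the posted source):

```cpp
// Why the separable path is cheaper: the 3x3 Sobel kernel factors exactly
// into a column vector times a row vector, so a k x k convolution becomes
// two 1-D passes (2k multiplies per pixel instead of k*k).
#include <opencv2/core/core.hpp>
#include <iostream>

int main()
{
    cv::Mat col = (cv::Mat_<float>(3, 1) <<  1, 2, 1);   // smoothing part
    cv::Mat row = (cv::Mat_<float>(1, 3) << -1, 0, 1);   // derivative part

    cv::Mat full = col * row;  // outer product rebuilds the full 3x3 Sobel kernel
    std::cout << "3x3 Sobel kernel:\n" << full << std::endl;
    //  [-1, 0, 1;
    //   -2, 0, 2;
    //   -1, 0, 1]
    return 0;
}
```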
I wish I had more time to do a full feature comparison of all the overlapping LibJacket/OpenCV functions, but alas, maybe another day… The source above is enough for anyone else out there to get started, though!
Update 3:
As requested in the comments, here are the Fermi Tesla benchmarks.
Note: If the goal here were solely Sobel filtering, one would compare jkt::conv2 vs cv::gpu::Sobel (first figure below). To generalize to arbitrary convolutions (second/third figures below), in OpenCV-GPU one must use either cv::gpu::filter2D or cv::gpu::convolve. Unfortunately, filter2D only works on uchar images, while convolve works on any type; the common data type between LibJacket and OpenCV is float. According to the comments, convolve was designed for larger kernels, while OpenCV's Sobel stops at 16×16 kernels (discovered experimentally). For general floating-point convolutions, I would say jkt::conv2 vs cv::gpu::convolve is a fair comparison.
See also: OpenCV+ArrayFire
Comments:
If OpenCV also has a conv2, maybe you need to add that for the comparison. Maybe they are using different algorithms for Sobel and convolution?
Also, for a 3×3 kernel, LibJacket is faster if you send in the host-side data.
Pavan:
Good point: OpenCV does have other filtering methods. Looking into this, I find that their "filter2D()" doesn't support floating-point images, and their "convolve()" function is basically LibJacket's "conv2()". A quick benchmark of OpenCV's GPU convolve() looks like this…
size: 512x512
cv-gpu: 2.27729
jacket: 0.17907
size: 1024x1024
cv-gpu: 3.64561
jacket: 0.61625
size: 1536x1536
cv-gpu: 13.6968
jacket: 1.05545
size: 2048x2048
cv-gpu: 13.954
jacket: 2.43504
size: 2560x2560
cv-gpu: 27.7197
jacket: 3.28057
size: 3072x3072
cv-gpu: 28.0698
jacket: 4.26894
size: 3584x3584
cv-gpu: 47.5459
jacket: 5.65912
^ The above Jacket results include the faster "convn()" timings as well.
mcclanahoochie:
Is the situation with convolve() the same for bigger kernel sizes? Say, 5% or 10% of the source image width.
Dear mcclanahoochie,
Thank you for benchmarking the OpenCV library. We will add specializations for small Sobel kernels. We look forward to your tests of other functions. Any help is welcome!
OpenCV's filter engine consists of 3 layers. At the simplest (and slowest) level, a user can call cv::gpu::Sobel. For better performance, the Filter Engine API should be used. It's our fault that we haven't documented this very well. But I guess even in this case LibJacket will be a bit faster, because of the universality of OpenCV's code. Anyway, we will add specializations. Many thanks.
Also a little note: cv::convolve is a utility function used in template matching. It is optimized for big pattern sizes like 100×100 or 250×250. It uses FFT internally and performs GPU buffer allocations. It's quite funny to use this function for 3×3 Sobel filtering. Sounds like driving a nail with an excavator rather than a hammer :)))
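For anyone curious, here is a rough sketch of the Filter Engine path Anatoly describes (the names are from OpenCV 2.x's gpu module; this is an illustration, not part of the benchmark source above): the filter is built once and then reused, avoiding per-call setup.

```cpp
// Rough sketch of the Filter Engine path (OpenCV 2.x gpu module):
// build the Sobel derivative filter once, then reuse it, avoiding the
// per-call setup that cv::gpu::Sobel() performs internally.
#include <opencv2/core/core.hpp>
#include <opencv2/gpu/gpu.hpp>

int main()
{
    cv::Mat h_img(1024, 1024, CV_8UC1);
    cv::randu(h_img, cv::Scalar(0), cv::Scalar(255));

    cv::gpu::GpuMat d_img(h_img), d_dst;

    // dx = 1, dy = 0, 3x3 kernel -- created once...
    cv::Ptr<cv::gpu::FilterEngine_GPU> sobel =
        cv::gpu::createDerivFilter_GPU(CV_8UC1, CV_8UC1, 1, 0, 3);

    // ...then applied repeatedly.
    for (int i = 0; i < 100; ++i)
        sobel->apply(d_img, d_dst);

    return 0;
}
```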
Alexey:
I've updated the post with "convolve()" benchmarks at various kernel sizes, and as Anatoly points out, it is indeed designed for larger kernel sizes.
Anatoly:
Thanks for the clarification. I also noticed that OpenCV uses separable kernels for Sobel. When I next get time, I'll try to re-do the LibJacket benchmarks using separable kernels as well, for a fairer comparison (meaning Jacket will probably be faster…).
Anatoly, we use FFT-based convolutions at large sizes too 🙂
Chris,
Just to piss you off a little more, can you try using the separable kernels (http://en.wikipedia.org/wiki/Sobel_operator#Technical_details)? This is much faster inside LibJacket for small kernels (< 20×20).
And then, just for kicks, convolve with random kernels 🙂
Pavan:
Updated! (for separable kernels)
…and I probably won't have much time to do many more benchmarks until the weekend comes around again.
@pavan:
I don’t doubt that LibJacket uses FFT techniques for large kernel convolutions.
@mcclanahoochie:
Thanks for the new chart. That's very helpful. BTW, do you plan to run the benchmark on Fermi? Also, I wonder whether LibJacket supports some kind of border extrapolation?
Anatoly:
Glad to hear that! I'll see if I can get a hold of a Fermi card before the weekend, so check back then! As far as extrapolation, I assume you're referring to how the border is handled… in that case, LibJacket (currently) supports filtering with or without zero-padded edges. Thanks for the interest!
Hi mcclanahoochie. I'm trying to do one more comparison of the OpenCV and LibJacket libs. Here is a modified version of your benchmark: http://pastebin.com/W41RwPnu. I got strange results for ksz = 32, while the results for ksz = 64 seem OK. I wonder if I am using LibJacket correctly.
Results for ksz = 32
====================
Libjacket v1.0.1 (build dd66add) by AccelerEyes
CUDA Driver: 270.81
CUDA Toolkit: v4.0
CUDA capable devices detected:
GPU0 Tesla C2050 / C2070, 2652 MB, Compute 2.0 (single,double) (in use)
size: 512×512
cv-gpu: 0.00202607
jacket: 0.00790995
size: 1024×1024
cv-gpu: 0.00767946
jacket: 0.0321756
size: 1536×1536
cv-gpu: 0.00893694
jacket: 0.0735304
size: 2048×2048
cv-gpu: 0.0171747
jacket: 0.131005
size: 2560×2560
cv-gpu: 0.0177356
jacket: 0.205862
size: 3072×3072
cv-gpu: 0.0278928
jacket: 0.297097
size: 3584×3584
cv-gpu: 0.0268914
jacket: 0.404806
Results for ksz = 64
====================
Libjacket v1.0.1 (build dd66add) by AccelerEyes
CUDA Driver: 270.81
CUDA Toolkit: v4.0
CUDA capable devices detected:
GPU0 Tesla C2050 / C2070, 2652 MB, Compute 2.0 (single,double) (in use)
size: 512×512
cv-gpu: 0.0021476
jacket: 0.00167873
size: 1024×1024
cv-gpu: 0.00802897
jacket: 0.00602738
size: 1536×1536
cv-gpu: 0.00899348
jacket: 0.00626257
size: 2048×2048
cv-gpu: 0.0173659
jacket: 0.0232162
size: 2560×2560
cv-gpu: 0.0177426
jacket: 0.0236838
size: 3072×3072
cv-gpu: 0.0264366
jacket: 0.0242362
size: 3584×3584
cv-gpu: 0.0270701
jacket: 0.0249083
We updated our code and rebenchmarked it.
http://opencv-gpu.blogspot.com/2011/10/opencv-vs-libjacket.html
Anatoly:
Very interesting results, and good work on your performance improvements! I hope to dive deeper into your other question soon, and I have relayed the message to the developers at AccelerEyes… Cheers.
~Chris