CUDA is a C-based framework developed by NVIDIA that lets developers write parallel code for NVIDIA GPUs. By convention the main CPU is called the 'host' and the GPU is called the 'device'. The general flow for using the GPU for general-purpose computing is as follows:
- CPU: Transfers data from host memory to device memory
- CPU: Launches the CUDA kernel on the GPU
- CPU: Either does other work or blocks (waits) until the GPU has finished
- CPU: Transfers the results from device memory back to host memory
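The four steps above can be sketched as a small CUDA program. This is only a minimal illustration: the kernel name scale and the buffer names are made up for the example, and error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles every element in place.
__global__ void scale(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Ordinary host (CPU) memory.
    float* h_data = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    // Separate device (GPU) memory.
    float* d_data = NULL;
    cudaMalloc((void**)&d_data, bytes);

    // Step 1: copy host memory to device memory.
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // Step 2: launch the kernel. The launch returns to the CPU immediately.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_data, n);

    // Step 3: the CPU could do other work here; this sketch simply blocks.
    cudaDeviceSynchronize();

    // Step 4: copy the results from device memory back to host memory.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    printf("h_data[0] = %f\n", h_data[0]);

    cudaFree(d_data);
    free(h_data);
    return 0;
}
```

Saved as a .cu file, this should build with nvcc. The two cudaMemcpy calls are the copies that the later memory methods try to avoid.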
I started out learning how to do this purely with simple CUDA kernels. Generally, a CUDA function (or kernel) cannot operate on ordinary CPU (host) pointers, or on data types that contain them, unless the memory behind those pointers has been made accessible to the GPU. I came across three different memory methods for doing that in CUDA:
- CUDA Device Copy
- CUDA Zero Copy
- CUDA Unified Memory (managed memory; it depends on UVA, Unified Virtual Addressing, but the two are not the same thing)
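To show how the last two methods differ from the plain device copy, here is a rough sketch that runs the same kernel once on zero-copy (mapped, pinned) host memory and once on managed memory. The kernel addOne and all variable names are invented for the example, error checking is omitted, and cudaMallocManaged assumes CUDA 6 or newer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: adds 1 to every element.
__global__ void addOne(int* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    const int n = 1024;
    const size_t bytes = n * sizeof(int);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // --- Zero copy: pinned host memory mapped into the GPU address space ---
    // Must be called before the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    int* h_zero = NULL;  // pointer the CPU uses
    int* d_zero = NULL;  // pointer the GPU uses for the same memory
    cudaHostAlloc((void**)&h_zero, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_zero, h_zero, 0);

    for (int i = 0; i < n; ++i) h_zero[i] = i;
    addOne<<<blocks, threads>>>(d_zero, n);  // no cudaMemcpy anywhere
    cudaDeviceSynchronize();                 // wait before the CPU reads
    printf("zero copy: h_zero[10] = %d\n", h_zero[10]);

    // --- Unified Memory: one pointer valid on both CPU and GPU ---
    int* managed = NULL;
    cudaMallocManaged((void**)&managed, bytes);

    for (int i = 0; i < n; ++i) managed[i] = i;
    addOne<<<blocks, threads>>>(managed, n);
    cudaDeviceSynchronize();                 // wait before the CPU reads
    printf("managed:   managed[10] = %d\n", managed[10]);

    cudaFree(managed);
    cudaFreeHost(h_zero);
    return 0;
}
```

In both cases the explicit copies disappear. What remains is the synchronization: the CPU must not read the buffer until the kernel has finished.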
However, I still ran into a problem using OpenCV on the Jetson. OpenCV has a CUDA module, but it is designed around two different matrix types: Mat for the CPU and gpu::GpuMat for the GPU. So you cannot use the OpenCV gpu:: functions on CPU Mat objects. OpenCV actually has you do the same thing as in 'device copy' for CUDA: you use its methods to copy a CPU Mat to the GPU and back again. When I realized this, I was stunned that there was (to my knowledge) no Unified Memory method in OpenCV. So every OpenCV gpu:: function required needless memory copying on the Jetson, where the CPU and GPU share the same physical memory! On an embedded device this is an extreme bottleneck, and I was already hitting the wall with my programs that work with the Kinect IR sensor and image data.
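For reference, the copy-based pattern described above looks roughly like this. The sketch assumes the OpenCV 2.4 gpu module (cv::gpu namespace); the image size, the fill value, and the choice of gpu::threshold as the operation are arbitrary picks for the example.

```cuda
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/gpu/gpu.hpp>

int main()
{
    // CPU-side image; size and fill value are arbitrary for this sketch.
    cv::Mat src(480, 640, CV_8UC1, cv::Scalar(100));
    cv::Mat dst;

    cv::gpu::GpuMat d_src, d_dst;

    // Copy 1: host to device.
    d_src.upload(src);

    // The gpu:: function accepts only GpuMat arguments.
    cv::gpu::threshold(d_src, d_dst, 50.0, 255.0, cv::THRESH_BINARY);

    // Copy 2: device to host.
    d_dst.download(dst);

    return 0;
}
```

Every frame pays for one upload and one download, even though on the Jetson both buffers end up in the same physical RAM.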
So after quite a bit of sandbox-style experimentation, I found a way to cast Mat pointers into GpuMat pointers without doing any memory copy, while keeping the CUDA UVA style. My original program with my Kinect sensor ran at 7-10 FPS, and that was after cutting the width and height down from 640x480 to 320x240. With my new approach of avoiding any memory copy I was able to achieve the full 30 FPS at the full 640x480 resolution (this is on all of the depth data from the IR sensor).
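One way such a copy-free setup could look is sketched below. It is a guess at the general idea, not necessarily the exact code behind the numbers above: it allocates mapped (zero-copy) memory with the CUDA runtime and wraps the same buffer in both a Mat header and a GpuMat header, using the constructors that take an existing data pointer. It assumes the OpenCV 2.4 gpu module; the buffer names, the image type, and gpu::threshold are placeholders chosen for illustration.

```cuda
#include <cuda_runtime.h>
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/gpu/gpu.hpp>

int main()
{
    const int rows = 480, cols = 640;
    const size_t bytes = rows * cols;  // CV_8UC1: one byte per pixel

    // Enable mapped host memory before any CUDA context exists.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Two zero-copy buffers: one for the input, one for the output.
    unsigned char *h_src = NULL, *d_src = NULL;
    unsigned char *h_dst = NULL, *d_dst = NULL;
    cudaHostAlloc((void**)&h_src, bytes, cudaHostAllocMapped);
    cudaHostAlloc((void**)&h_dst, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_src, h_src, 0);
    cudaHostGetDevicePointer((void**)&d_dst, h_dst, 0);

    {
        // Headers only: neither Mat nor GpuMat owns or copies the memory.
        cv::Mat src(rows, cols, CV_8UC1, h_src);
        cv::Mat dst(rows, cols, CV_8UC1, h_dst);
        cv::gpu::GpuMat g_src(rows, cols, CV_8UC1, d_src);
        cv::gpu::GpuMat g_dst(rows, cols, CV_8UC1, d_dst);

        // The CPU writes through the Mat header.
        src.setTo(cv::Scalar(100));

        // The GPU reads and writes the same memory through the GpuMat headers.
        cv::gpu::threshold(g_src, g_dst, 50.0, 255.0, cv::THRESH_BINARY);

        // Make sure the GPU is done before the CPU touches dst.
        cudaDeviceSynchronize();

        // dst now holds the result; no upload() or download() was needed.
    }

    // Free the buffers only after the headers have gone out of scope.
    cudaFreeHost(h_src);
    cudaFreeHost(h_dst);
    return 0;
}
```

The output GpuMat already has the right size and type, so the gpu:: function should write into the mapped buffer instead of allocating its own device memory.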
I will post the code on my GitHub and update this post with the link soon.
Move on to Part 2 for examples.