Wednesday, March 8, 2017

Zero-Copy: CUDA, OpenCV and NVidia Jetson TK1: Part 1

If you aren't yet familiar with NVidia's embedded platforms (NVidia Jetson TK1, TX1, and the upcoming TX2), they are definitely something to dig into. NVidia has been embedding their GPU architectures on the same IC as decent-speed processors (like a quad-core ARM Cortex-A15). The Jetson TK1 is by far the most affordable ($100-200) and is an excellent option for bringing some high-performance computing to your mobile robotics projects. I'm posting my findings on a slight difference between programming with CUDA on NVidia discrete GPUs and on NVidia's embedded TK/TX platforms, with regard to their memory architecture.

CUDA is a C-based framework developed by NVidia to allow developers to write code for parallel processing on NVidia GPUs. Typically the main CPU is considered the 'Host' while the GPU is considered the 'Device'. The general flow for using the GPU for general-purpose computing is as follows (a minimal code sketch follows the list):
  1. CPU: Transfer data from host memory to device memory
  2. CPU: Command the CUDA kernel to run on the GPU
  3. CPU: Either do other work or block (wait) until the GPU has finished
  4. CPU: Transfer data from device memory back to host memory
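
To make those copy steps concrete, here is a minimal sketch of that flow in CUDA C (the addOne kernel, the sizes and the grid dimensions are just placeholders of mine, and error checking is omitted):

#include <cuda_runtime.h>

// Placeholder kernel: add 1 to every element
__global__ void addOne(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int n = 640 * 480;
    float* h_data = new float[n];                 // host buffer (fill from a sensor, etc.)
    float* d_data = NULL;
    cudaMalloc((void**)&d_data, n * sizeof(float));

    // 1. Transfer host -> device
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // 2. Command the kernel to run on the GPU
    addOne<<<(n + 255) / 256, 256>>>(d_data, n);

    // 3. Block until the GPU has finished
    cudaDeviceSynchronize();

    // 4. Transfer device -> host
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    delete[] h_data;
    return 0;
}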
This is the general flow, and you can read up on it in more depth elsewhere. As far as the GPU and the CPU on NVidia's TK/TX SoCs go, they have a Unified Memory Architecture (UMA), meaning they share the same physical RAM. Typical discrete GPU cards have their own dedicated RAM, hence the necessity of copying data to and from the device. When I learned this, I figured the memory-copy steps could be eliminated altogether on the Jetson!

I started out learning how to do this purely with simple CUDA kernels. Generally, a CUDA function (or kernel) cannot operate directly on data behind ordinary CPU (host) pointers; the memory has to be accessible from the device. I came across a few different memory-management methods for using CUDA:
  • CUDA Device Copy
  • CUDA Zero Copy
  • CUDA UVA (Unified Memory)
It took me a while to get used to these different techniques, and I was not sure which one was appropriate, so I did some profiling to find out which gave me the best speed-up (with Device Copy as the baseline). It turned out that CUDA UVA was the best method for coding on the Jetson TK1's embedded GPU.
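
For reference, here are rough sketches of the zero-copy and unified-memory variants I was comparing (same placeholder addOne kernel as above, error checking omitted; managed memory needs CUDA 6.0 or later). Which variant wins can depend on the access pattern, so it is worth profiling on your own workload:

#include <cuda_runtime.h>

// Same placeholder kernel: add 1 to every element
__global__ void addOne(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int n = 640 * 480;

    // --- CUDA Zero Copy: pinned, mapped host memory the GPU accesses directly ---
    cudaSetDeviceFlags(cudaDeviceMapHost);          // must come before other CUDA calls

    float* h_data = NULL;
    cudaHostAlloc((void**)&h_data, n * sizeof(float), cudaHostAllocMapped);

    float* d_data = NULL;
    cudaHostGetDevicePointer((void**)&d_data, h_data, 0);

    addOne<<<(n + 255) / 256, 256>>>(d_data, n);    // no cudaMemcpy anywhere
    cudaDeviceSynchronize();
    cudaFreeHost(h_data);

    // --- CUDA UVA / Unified (managed) Memory: one pointer for both CPU and GPU ---
    float* m_data = NULL;
    cudaMallocManaged((void**)&m_data, n * sizeof(float));

    for (int i = 0; i < n; ++i) m_data[i] = 0.0f;   // CPU writes

    addOne<<<(n + 255) / 256, 256>>>(m_data, n);    // GPU uses the same pointer
    cudaDeviceSynchronize();                        // sync before the CPU touches it again
    cudaFree(m_data);

    return 0;
}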

However, I still ran into a problem using OpenCV on the Jetson. OpenCV has a CUDA module, but OpenCV is designed around two different Mat data types: cv::Mat for the CPU and cv::gpu::GpuMat for the GPU, so you cannot use OpenCV's gpu:: functions on CPU Mat objects. OpenCV effectively has you do the same thing as CUDA 'device copy', using its own methods for copying a CPU Mat to the GPU and vice versa. When I realized this, I was stunned that there was no Unified Memory method (to my knowledge) in OpenCV. So all OpenCV gpu:: functions required needless memory copying on the Jetson! On an embedded device this is a severe bottleneck, and I was already hitting the wall with my programs working on the Kinect IR sensor and image data.
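
For reference, the standard OpenCV pattern looks roughly like this (I am using the OpenCV 2.4-era gpu module here; the threshold call is just an example operation, not anything specific from my project):

#include <opencv2/opencv.hpp>
#include <opencv2/gpu/gpu.hpp>

int main()
{
    cv::Mat h_img = cv::imread("depth.png", CV_LOAD_IMAGE_GRAYSCALE);

    // Explicit host -> device copy (this is the step I wanted to eliminate)
    cv::gpu::GpuMat d_img;
    d_img.upload(h_img);

    // Run a gpu:: function on the device copy
    cv::gpu::GpuMat d_result;
    cv::gpu::threshold(d_img, d_result, 128, 255, cv::THRESH_BINARY);

    // Explicit device -> host copy to get the result back
    cv::Mat h_result;
    d_result.download(h_result);

    return 0;
}

On a discrete GPU those upload/download calls are unavoidable; on the Jetson's shared RAM they are pure overhead.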

So after quite a bit of sandbox-style experimentation, I found the correct approach to casting Mat pointers into GpuMat pointers without doing any memory copy while maintaining the CUDA UVA style. My original program with the Kinect sensor ran at 7-10 FPS, and that was with the width and height cut down from 640x480 to 320x240. With my new approach of avoiding any memory copy, I was able to achieve a full 30 FPS at the full 640x480 resolution (and this is processing all of the depth data from the IR sensor).
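
Until the code is up, here is a sketch of the idea (not the exact code from my project): allocate one mapped, pinned buffer, then wrap the very same memory in both a cv::Mat and a cv::gpu::GpuMat, so neither upload() nor download() is ever needed:

#include <opencv2/opencv.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <cuda_runtime.h>

int main()
{
    const int rows = 480, cols = 640;

    // Must be set before any other CUDA calls so mapped host memory is allowed
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // One physical buffer, visible to both CPU and GPU in the Jetson's shared RAM
    unsigned char* h_ptr = NULL;
    cudaHostAlloc((void**)&h_ptr, rows * cols, cudaHostAllocMapped);
    unsigned char* d_ptr = NULL;
    cudaHostGetDevicePointer((void**)&d_ptr, h_ptr, 0);

    // A second shared buffer for the output
    unsigned char* h_out = NULL;
    cudaHostAlloc((void**)&h_out, rows * cols, cudaHostAllocMapped);
    unsigned char* d_out = NULL;
    cudaHostGetDevicePointer((void**)&d_out, h_out, 0);

    // Wrap the SAME memory with both OpenCV types -- no copies are made here
    cv::Mat         h_frame(rows, cols, CV_8UC1, h_ptr);
    cv::gpu::GpuMat d_frame(rows, cols, CV_8UC1, d_ptr);
    cv::gpu::GpuMat d_result(rows, cols, CV_8UC1, d_out);

    // Fill h_frame from the sensor on the CPU side (e.g. Kinect depth), then
    // call gpu:: functions directly on d_frame -- no upload()/download() needed
    cv::gpu::threshold(d_frame, d_result, 128, 255, cv::THRESH_BINARY);
    cudaDeviceSynchronize();   // make sure the GPU is done before the CPU reads h_out

    cudaFreeHost(h_out);
    cudaFreeHost(h_ptr);
    return 0;
}

The wrapping Mat/GpuMat constructors take a pre-allocated pointer and do not own the memory, so the buffers only need to be allocated once at startup and can be reused for every frame.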

I will post code on my github and update this with the link soon.

Move on to Part 2 for examples.
