Wednesday, March 15, 2017

Zero-Copy: CUDA, OpenCV and NVidia Jetson TK1: Part 2

In this Part 2 post I want to illustrate the difference in technique between the common 'device copy' method and the 'unified memory' method, which is more suitable for shared-memory architectures such as NVidia's Tegra K1/X1 processors used on the NVidia Jetson development kits. I'll show an example using just a CUDA kernel, as well as an example utilizing OpenCV gpu::functions().

1. CUDA kernels: Device Copy method

For this example, I've written a simple CUDA kernel that takes a fixed matrix (640x480) of depth values (delivered by the Xbox 360's Kinect) and converts them to XYZ coordinates while simultaneously rotating the points. This example only computes the Y dimension, but I can provide a full XYZ function as well; the math is fairly simple. The code may seem a bit intense, but try not to think about what's inside the CUDA kernel for now.

Kernel Code:

__global__ void cudaCalcXYZ_R2( float *dst, float *src, float *M, float heightCenter, float widthCenter, float scaleFactor, float minDistance)
{
    __shared__ float shM[3];
    float nx, ny, nz, nzpminD, jFactor;
    int index;
    if(threadIdx.x == 0)
    {
        shM[0] = M[4];
        shM[1] = M[5];
        shM[2] = M[6];
    }
    __syncthreads(); //Make sure shM is populated before any thread reads it
    index = blockIdx.x*blockDim.x + threadIdx.x;
    nz = src[index];
    jFactor = ((float)blockIdx.x - heightCenter)*scaleFactor;
    nzpminD = nz + minDistance;
    nx = ((float)threadIdx.x - widthCenter )*(nzpminD)*scaleFactor;
    ny = (jFactor)*(nzpminD);
    //Solve for only Y matrix (height values)
    dst[index] = nx*shM[0] + ny*shM[1] + nz*shM[2];
}

Basically, a float pointer is sent in as src (the depth data); the kernel computes the 'Y' parameter and stores it in another float pointer, dst. In a device-copy implementation of the CUDA kernel, the data pointed to by src must first be copied to device memory with cudaMemcpy(). Below is an example of how to do this (the 'h_' prefix means host (CPU), while 'd_' means device (GPU)):
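For reference, the full XYZ back-projection the kernel is based on can be sketched in plain C++. This is a minimal sketch, not the exact production function: the parameter names mirror the kernel's, and treating scaleFactor/minDistance as the only calibration constants is an assumption.

```cpp
// Hypothetical per-pixel depth-to-XYZ sketch (no rotation applied here).
// (u, v) is the pixel; depth is the raw depth sample (nz in the kernel);
// widthCenter/heightCenter are the image center, as in the kernel launch.
void depthToXYZ(int u, int v, float depth,
                float widthCenter, float heightCenter,
                float scaleFactor, float minDistance,
                float &x, float &y, float &z)
{
    float d = depth + minDistance;                        // nzpminD in the kernel
    x = ((float)u - widthCenter)  * d * scaleFactor;      // nx
    y = ((float)v - heightCenter) * d * scaleFactor;      // ny (jFactor * nzpminD)
    z = depth;                                            // depth passes straight through
}
```

The kernel above then rotates this (x, y, z) by a row of the 4x4 matrix M to get the height value.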

int rows = 480;
int cols = 640;
float *h_src, *h_dst; //Host matrices
float *d_src, *d_dst; //Device matrices
float *h_m, *d_m;     //4x4 rotation matrix (host/device)

//Allocate device copies using cudaMalloc
cudaMalloc( (void **)&d_src, sizeof(float)*rows*cols);
cudaMalloc( (void **)&d_dst, sizeof(float)*rows*cols);
cudaMalloc( (void **)&d_m, sizeof(float)*16);

//Allocate host pointers
h_src = (float*)malloc(sizeof(float)*rows*cols);
h_dst = (float*)malloc(sizeof(float)*rows*cols);
h_m   = (float*)malloc(sizeof(float)*16);

//Copy all matrices from host to device
cudaMemcpy( d_src, h_src, sizeof(float)*rows*cols, cudaMemcpyHostToDevice);
cudaMemcpy( d_m, h_m, sizeof(float)*16, cudaMemcpyHostToDevice);

//Run the kernel
cudaCalcXYZ_R2<<< rows , cols>>>(d_dst, d_src, d_m, 240, 320, 0.0021, -10);

//Wait for GPU to finish
cudaDeviceSynchronize();
//Copy the result back to host memory
cudaMemcpy( h_dst, d_dst, sizeof(float)*rows*cols, cudaMemcpyDeviceToHost);


2. CUDA kernels: Unified Memory method

Here we use the same kernel as in the example above, but this time we avoid the memory copies altogether by using the CUDA unified memory (UVA) technique. Instead of cudaMalloc(), we allocate with cudaMallocManaged():

cudaSetDeviceFlags(cudaDeviceMapHost); //Support for mapped pinned allocations

int rows = 480;
int cols = 640;
float *h_src, *h_dst; //Src and Dst matrices
float *h_m;           //4x4 rotation matrix

//Allocate float*s for CUDA. No need to allocate host and device separately
cudaMallocManaged(&h_src, sizeof(float)*rows*cols);
cudaMallocManaged(&h_m,   sizeof(float)*16);
cudaMallocManaged(&h_dst, sizeof(float)*rows*cols);

//Run the kernel
cudaCalcXYZ_R2<<< rows , cols>>>(h_dst, h_src, h_m, 240, 320, 0.0021, -10);

//Wait for GPU to finish
cudaDeviceSynchronize();

//Done, now h_dst contains the results

So now we have completely eliminated copying 1.23MB (640x480x4 bytes) into device memory before running the kernel, as well as copying another 1.23MB back out after the kernel has finished. Imagine trying to achieve real-time performance on a robot reading a Kinect sensor at 30FPS, needlessly copying roughly 73.7MB a second into the same RAM!
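As a sanity check on that figure, the per-second copy traffic works out as follows (a small helper just for the arithmetic, not part of the kernel code):

```cpp
// Copy traffic for the device-copy path: one host-to-device copy of the
// depth frame before the kernel, one device-to-host copy of the result after.
double copyTrafficMBPerSec(int rows, int cols, int fps)
{
    double bytesPerFrame = (double)rows * cols * sizeof(float); // ~1.23 MB at 640x480
    return bytesPerFrame * 2.0 * fps / 1.0e6;                   // two copies per frame
}
```

At 640x480 and 30FPS this gives about 73.7MB/s, all of which the unified memory method avoids.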

[Note]: This code will only function on an architecture with unified memory, such as the NVidia Tegra K1/X1 processors, so there's no sense trying to run it on the discrete GPU in your laptop or desktop (it just won't work!).

3. OpenCV GPU functions: Device Copy method

OpenCV has a module called gpu, written in CUDA, for those who want to take advantage of GPU acceleration in various functions. There is plenty of documentation online covering how to use it, so I will only go over the very basics. The example we will use is the per-element multiplication of two matrices a and b, with the result stored in c. Using the 'device copy' method, here is how to do so with OpenCV's gpu function gpu::multiply():

int rows = 480;
int cols = 640;

float *h_a, *h_b, *h_c;
float *d_a, *d_b, *d_c;

//Allocate memory for host pointers
h_a = (float*)malloc(sizeof(float)*rows*cols);
h_b = (float*)malloc(sizeof(float)*rows*cols);
h_c = (float*)malloc(sizeof(float)*rows*cols);

//Allocate memory for device pointers
cudaMalloc( (void **)&d_a, sizeof(float)*rows*cols);
cudaMalloc( (void **)&d_b, sizeof(float)*rows*cols);
cudaMalloc( (void **)&d_c, sizeof(float)*rows*cols);

//Mats (declaring them using the allocated pointers)
Mat hmat_a(Size(cols, rows), CV_32F, h_a);
Mat hmat_b(Size(cols, rows), CV_32F, h_b);
Mat hmat_c(Size(cols, rows), CV_32F, h_c);

//Gpu Mats (declaring with the device pointers)
gpu::GpuMat dmat_a(Size(cols, rows), CV_32F, d_a);
gpu::GpuMat dmat_b(Size(cols, rows), CV_32F, d_b);
gpu::GpuMat dmat_c(Size(cols, rows), CV_32F, d_c);

//Let's assume our host matrices are filled with actual data, then copy them to the device matrices
cudaMemcpy( d_a, h_a, sizeof(float)*rows*cols, cudaMemcpyHostToDevice);
cudaMemcpy( d_b, h_b, sizeof(float)*rows*cols, cudaMemcpyHostToDevice);

//Run gpu::multiply()
gpu::multiply(dmat_a, dmat_b, dmat_c);

//Copy the result back to the host
cudaMemcpy( h_c, d_c, sizeof(float)*rows*cols, cudaMemcpyDeviceToHost);

//Result now in hmat_c; this required copying matrices a, b and c...
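For clarity on what the GPU is doing here: gpu::multiply() is a per-element product, not a matrix product. Sketched on the CPU, the same operation is just:

```cpp
#include <cstddef>

// Per-element product of two equally sized float matrices stored as flat
// arrays: c[i] = a[i] * b[i]. This mirrors what gpu::multiply computes.
void elementwiseMultiply(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] * b[i];
}
```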

4. OpenCV GPU functions: Unified Memory method

You'll notice that in the examples above I allocate memory myself and hand the pointers to OpenCV, rather than letting OpenCV allocate memory when a Mat or GpuMat is declared. This is required for this section on using OpenCV GpuMats without having to upload and download data to and from GPU memory on chips such as the Jetson's. There is another, less obvious reason I use this method: for real-time performance on embedded processors, it is more efficient to allocate memory for objects early on, before any operations that run cyclically. As long as you can spare the memory, this is an effective way to increase performance (the trade-off being some RAM that won't be freed up). If you find yourself needing to dynamically free this allocated memory, look into cudaFree() and cudaFreeHost().
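The allocate-once pattern can be sketched with plain malloc (a hypothetical FrameBuffers struct, just to illustrate the shape; on the Jetson the same idea applies with cudaMallocManaged):

```cpp
#include <cstdlib>

// Buffers are allocated once, before the cyclic processing loop, and reused
// every frame, so the steady-state path does no allocation at all.
struct FrameBuffers {
    float *src;
    float *dst;
    FrameBuffers(int rows, int cols)
    {
        src = (float*)malloc(sizeof(float) * rows * cols);
        dst = (float*)malloc(sizeof(float) * rows * cols);
    }
    ~FrameBuffers()
    {
        free(src);
        free(dst);
    }
};
```

Construct one of these at startup, then hand its pointers to your Mats/GpuMats as shown in the listings.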

Now, on to eliminating the download() and upload() OpenCV function calls:

cudaSetDeviceFlags(cudaDeviceMapHost); //Support for mapped pinned allocations

int rows = 480;
int cols = 640;

float *h_a, *h_b, *h_c;

//Allocate managed memory, visible to both host and device
cudaMallocManaged(&h_a, sizeof(float)*rows*cols);
cudaMallocManaged(&h_b, sizeof(float)*rows*cols);
cudaMallocManaged(&h_c, sizeof(float)*rows*cols);

//Mats (declaring them using pointers)
Mat hmat_a(Size(cols, rows), CV_32F, h_a);
Mat hmat_b(Size(cols, rows), CV_32F, h_b);
Mat hmat_c(Size(cols, rows), CV_32F, h_c);

//Gpu Mats (declaring with the same pointers!)
gpu::GpuMat dmat_a(Size(cols, rows), CV_32F, h_a);
gpu::GpuMat dmat_b(Size(cols, rows), CV_32F, h_b);
gpu::GpuMat dmat_c(Size(cols, rows), CV_32F, h_c);

//Run gpu::multiply()
gpu::multiply(dmat_a, dmat_b, dmat_c);

//Result now in hmat_c, no copying required!

Much like the CUDA unified memory example, this method will only function on hardware with a unified memory architecture (the Jetson ICs, for example). You no longer need to bother with OpenCV's download and upload methods in your algorithms.

Enjoy the speedups!

Wednesday, March 8, 2017

Zero-Copy: CUDA, OpenCV and NVidia Jetson TK1: Part 1

If you aren't yet familiar with NVidia's embedded ECU releases (NVidia Jetson TK1, TX1 and, coming soon, TX2), they are definitely something to dig into. NVidia has been embedding their GPU architectures on the same IC as decent-speed processors (like a quad-core ARM Cortex-A15). The Jetson TK1 is by far the most affordable ($100-200) and is an excellent option for bringing some high-performance computing to your mobile robotics projects. I'm posting my findings on a slight difference between programming with CUDA on NVidia discrete GPUs and on NVidia's embedded TK/TX platforms (in regards to their memory architecture).

CUDA is a C-based framework developed by NVidia to allow developers to write code for parallel processing using NVidia's GPUs. Typically the main CPU is considered the 'Host' while the GPU is considered the 'Device'. The general flow for using the GPU for general purpose computing is as follows:
  1. CPU: Transfers data from host-memory to device-memory
  2. CPU: Command CUDA process to run on GPU
  3. CPU: Either do other work or block (waiting) until the GPU has finished
  4. CPU: Transfer data from device-memory to host-memory 
This is the general flow, and you can read up on it more in depth. As far as the GPU and the CPU on the NVidia TK/TX SoCs go, they have a Unified Memory Architecture (UMA), meaning they share RAM. Typical discrete GPU cards have their own RAM, hence the necessity of copying data to and from the device. When I learned this, I figured the memory-copy process could be eliminated altogether on the Jetson!

I started out learning how to do this purely with simple CUDA kernels. Generally, the CUDA compiler will not allow a CUDA function (or kernel) to operate on data-types that have CPU-type pointers etc. I came across different memory-methods of using CUDA:
  • CUDA Device Copy
  • CUDA Zero Copy
  • CUDA UVA (Unified Memory)
It took me a while to get used to these different techniques, and I was not sure which one was appropriate, so I did some profiling to find out which gave the best speed-up (with Device Copy as the baseline). It turned out that CUDA UVA was the best method for coding on the Jetson TK1's embedded GPU.
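The profiling itself needs nothing fancier than a wall-clock timer around each variant; a minimal sketch of the kind of helper I mean (names are my own, not from the original code):

```cpp
#include <chrono>

// Run a callable and return the elapsed wall-clock time in milliseconds.
// Wrap each memory-method variant in this to compare them.
template <typename F>
double timeMs(F&& work)
{
    auto t0 = std::chrono::steady_clock::now();
    work();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

For GPU work, remember to include the synchronization call inside the timed region, or you'll only measure the kernel launch.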

However, I still ran into a problem using OpenCV on the Jetson. OpenCV has a CUDA module, but it is designed around two different matrix data-types: Mat for the CPU and gpu::GpuMat for the GPU, so you cannot use OpenCV gpu::functions on CPU Mat objects. OpenCV effectively has you do the same thing as 'device copy' in CUDA, using its methods for copying a CPU Mat to the GPU and vice-versa. When I realized this, I was stunned that there was (to my knowledge) no unified memory method in OpenCV. So all OpenCV gpu::functions required needless memory copying on the Jetson! On an embedded device this is an extreme bottleneck, as I was already hitting the wall with my programs working with the Kinect IR sensor and image data.

So after quite a bit of sandbox-style experimentation, I found the correct approach to casting Mat pointers into GpuMat pointers without doing any memory copy, while maintaining the CUDA UVA style. My original program with my Kinect sensor ran at 7-10FPS, and that was with cutting the resolution down from 640x480 to 320x240. With my new approach of avoiding any memory copy, I was able to achieve a full 30FPS at the full 640x480 (this is on all the depth data from the IR sensor).

I will post code on my github and update this with the link soon.

Move on to Part 2 for examples.

Monday, March 6, 2017

Ethernet-Based IMU for ROS

So I've been doing a bit of development with Robot Operating System (ROS), and a while back I needed a flexible solution for acquiring inertial measurement data from multiple sensors. My intention was to have an embedded controller responsible for collecting IMU data and delivering it to a master controller. Since in ROS it is easy to swap between a laptop and an embedded-Linux device as your platform, I decided to make a portable Ethernet-based IMU out of the MPU9150 and a Raspberry Pi (or pcDuino).

I had previously written my own C/C++ Linux-based API for communicating with the MPU9150 (3-axis gyroscope, accelerometer and magnetometer). This project extends it with an embedded server that sends periodic packets containing the latest IMU data (so as of now, the embedded IMU device does not run ROS). I wrote a ROS node to connect to this server, receive the data and publish it as a ROS message. Everything seems to be working out quite well, and I am currently receiving IMU data at 100Hz over Ethernet (less over wifi). I can now add additional Ethernet-based IMUs to the project with little complexity. Below is my wireless version.

Pcduino-3, MPU9150 and a USB battery pack

For show, I set up a battery-powered pcDuino-3 connected to an IMU as the server over wifi (eliminating wires, which get in the way for a hand-held demo). On my ROS device (laptop), I have a node running dedicated to receiving the IMU data packets and publishing them in ROS as a sensor message that other ROS nodes can subscribe to. Below is a video of real-time plotting of the received IMU data using rqt. The video is not the best quality because my phone is not so great, but you can see that the top plot is the Gyro-Z measurement and the bottom plot is the Accelerometer X-axis measurement.

I also wrote an additional ROS node that subscribes to the IMU messages and integrates the gyro data to estimate the sensor's orientation around the z-axis (what a compass would tell you). You can see in this short video that the integration is somewhat reliable. There is definitely gyro drift over time, which would eventually corrupt the estimate, but that's where filtering techniques come in to assist with state estimation (for example, Kalman filtering). Video below:

I will be posting my code for this project on my github soon; I will update this once it is ready (and hopefully with better videos).
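For reference, the yaw estimate described above is just Euler integration of the gyro z-axis rate. A minimal sketch (my own illustrative function, not the actual node code), which also makes the drift problem obvious: any constant bias in the rate samples accumulates linearly into the angle.

```cpp
// Integrate gyro z-axis rate samples into a yaw angle.
// gyroZ holds n rate samples in degrees/second, dt is the sample period in
// seconds (0.01 at the 100Hz rate mentioned above); returns degrees.
float integrateYaw(const float *gyroZ, int n, float dt)
{
    float yaw = 0.0f;
    for (int i = 0; i < n; ++i)
        yaw += gyroZ[i] * dt;   // rate * timestep; sensor bias accumulates too
    return yaw;
}
```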