forked from robotology/yarp
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathgpu.dox
113 lines (88 loc) · 4.27 KB
/
gpu.dox
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
/**
* @page gpu_tutorial Handling NVIDIA GPUs with YARP
\author Giacomo Spigler
\section gpu_tutorial_introduction Introduction
In the last years graphics cards became more and more powerful, and eventually they reached computing speeds of the order of hundreds of gigaflops.
Graphics Processing Units, or GPUs are dedicated processors, cheap enough to be affordable by everyone and powerful enough to be able to replace clusters of tens of modern computers.
YARP has now two new modules, which come with the main distribution (starting from the latest releases): CUDA and NVIDIA.
This documentation only regards CUDA, even thought it is compatible only with the latest Nvidia chipsets. The other driver is slower and worse optimized.
These new modules rely on the new IGPUInterface class, which can be easily found into libYARP_dev's includes.
\section gpu_tutorial_yarpcode Example: YARP application
Programming with the CUDA module is quite easy, but there are some requirements.
Let's start with an example.
The full code of this example can be found under the example/cuda directory.
First of all, make sure to include the following headers into your application:
\code
#include <yarp/dev/GPUInterface.h>
#include <yarp/dev/PolyDriver.h>
#include <yarp/dev/Drivers.h>
using namespace yarp::dev;
\endcode
Then you can proceed creating an instance to the driver's object:
\code
Property config;
// no arguments, use a default
char str[80];
sprintf(str, "(device cuda) (w %d) (h %d) (bpp 3)", img.width(), img.height());
config.fromString(str);
PolyDriver dd(config);
if (!dd.isValid()) {
printf("Failed to create and configure a device\n");
exit(1);
}
IGPUDevice *gpu;
if (!dd.view(gpu)) {
printf("Failed to view device through IGPUDevice interface\n");
exit(1);
}
\endcode
\section gpu_tutorial_cubin Example: GPU program
Now you have a working CUDA driver set up and configured.
However, to start using it, you have to load the programs you want to execute on your data.
Programs are in cubin format, which is a kind of ascii executable and that the GPU can understand.
We will see later on how to compile custom programs.
To load them, just do:
\code
int prog = gpu->load("progs/bw.cubin");
\endcode
Then you can finally execute it.
Processing on data is accomplished either with c arrays of elements or YARP images.
There are two overloaded methods CUDA has to handle it, IGPUInterface::execute, which can handle "unsigned char *" or "ImageOf<PixelRgb>".
Support for floats might come soon.
\code
gpu->execute(prog, &img, &out);
\endcode
At the end of the program, just put:
\code
dd.close();
\endcode
Now you have a YARP application, but you need some programs to run on the GPU.
Program's structure has always to be the same (for more information I'd suggest the CUDA programming manual).
Following, it is an example program (to convert RGB images to BW):
\code
extern "C" {
__global__ void FragmentProgram(int w, int h, unsigned char *in, unsigned char *out) {
int i;
for(i=threadIdx.x+blockIdx.x; i<w*h; i+=blockDim.x*gridDim.x) {
out[i*3]=(in[i*3]+in[i*3+1]+in[i*3+2])/3;
out[i*3+1]=out[i*3];
out[i*3+2]=out[i*3];
}
}
}
\endcode
Let's look at it.
Just a quick note, the software automatically defines some objects, namely `threadIdx`, `blockIdx`, `blockDim` and `gridDim`.
At the moment, for ease of usage, they are just used 1D, accessed as `threadIdx.x`.
The first one is the id of the thread which executes the main function (within its current block); the second one is the id of the block within the block grid.
`blockDim` is the numbe of threads contained in a single block, and `gridDim` is the size of the block grid.
Then, the main function (at the moment just use the main function) has to be named FragmentProgram, and to have a "__global__" before its type.
Type is not important, so you might always set it to void.
The FragmentProgram function will run on every GPU's core, executing the same instructions (SIMD approach), but every thread will have its own threadIdx.x identifier.
At last we have to compile the GPU program into the binary `.cubin` object:
\code
nvcc -cubin bw.cu
\endcode
Using this information, and knowing the number of threads currently running, you can write parallel code to run on the GPU, analyzing input data and generating outputs.
*
*/