Hi, let us get started with the third topic in this course on GPU architectures and programming. Now we are going to introduce the notions of the CUDA programming language, one of the most popularly used GPU programming languages. So, what is CUDA? The full form, first of all, is Compute Unified Device Architecture. CUDA is essentially an extension of the C programming language with special constructs that support parallel programming. It is a programming language extension developed by NVIDIA primarily for their GPUs; as we know, NVIDIA GPUs are very popularly used as accelerating devices in conjunction with high performance CPUs. For the CUDA programmer, the perspective is that the CPU is the host: on the CPU you have an orchestrating program running, and it dispatches parallel jobs to GPU devices. There may be more than one GPU device attached to the CPU, so in general it can dispatch multiple parallel jobs to these GPU devices. These jobs will execute on the devices and give back their results to the host.
So, the way a basic CUDA program is structured is as follows. There is a host code which is resident, that means executing, on the host device, that is, the CPU. And there is a part that we call the device code, which is supposed to execute on the GPUs. Technically speaking, any C program is a valid CUDA host code; essentially, it is code that can execute on the CPU. In general, CUDA programs consist of host plus device code, and they cannot be compiled by any standard C compiler; for this purpose we require a specific compiler from NVIDIA, the NVCC compiler, that is, the NVIDIA C compiler.
The compilation flow for the NVIDIA C compiler is as follows. You write the CUDA program, essentially a C program with CUDA extensions, and you compile it with NVCC. What you get is the host code, that is, the part of the code which will be compiled further by the host C compiler and linker, and the device code, which will be JIT compiled for execution on the device. So overall these are the two different segments of code that you get: one to execute on the CPU and the other to execute on the GPU devices.
(Refer slide Time 03:19)
The execution flow is as follows.
The host code is the code that executes serially on the CPU. The host code will launch the device code, that is, the GPU parallel kernel. The GPU parallel kernel will execute on the GPU; this is what we refer to as the device code. It will return results to the host code. With those results the host code may again execute, do some further work, and then again launch some parallel kernels on the GPU devices. This computation can go on back and forth. I can have a simple CUDA program which will launch one kernel per GPU device, get back the result and print the result; or I can have a sufficiently complex program which will launch some kernel, get some results, do some processing on the CPU, and then again launch another kernel, and so on and so forth.
So before discussing our first CUDA program, let us start with an example vector addition code executing on a CPU - a very simple C program. You have a main in which you have these float pointers defined. These are pointers to floating point arrays which are dynamically allocated with malloc calls. Once this dynamic allocation is done, you call this vecAdd function. Of course, I am just trying to show you an example; we have not written any code for the initialization of the arrays and so on - assume those are there. So once that is done, vecAdd is called, and inside vecAdd a loop simply adds the two vectors and returns the result in the h_C array. In that way, vecAdd takes three arguments - the first two are the input arguments, and h_C is a dynamically allocated array which will contain the output. Of course, the initialization code is missing here, as discussed earlier. So this is how a typical vector addition will execute on a CPU.
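As a minimal sketch (the vector length n, the size computation and the missing initialization are assumptions here), the CPU-only version might look like this:

#include <stdlib.h>

void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    /* add the two input vectors element by element, result in h_C */
    for (int i = 0; i < n; i++)
        h_C[i] = h_A[i] + h_B[i];
}

int main(void)
{
    int n = 1000;                       /* assumed vector length */
    size_t size = n * sizeof(float);

    /* dynamically allocate the three host arrays */
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);

    /* ... initialization of h_A and h_B assumed here ... */

    vecAdd(h_A, h_B, h_C, n);           /* vector addition on the CPU */

    free(h_A); free(h_B); free(h_C);
    return 0;
}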
How will it work on a CPU-GPU system? So, here comes the first CUDA program. First of all, observe that we have #include <cuda.h> and also #include <cuda_runtime.h>. These are the header files which contain the required CUDA functionality that we will be using in our code. In this first CUDA program, as we have discussed earlier, there will be a piece of code that executes on the CPU, and it will launch the parallel code, or the device code, for the GPU; this is what we call a kernel. Now, what is the code that will execute on the CPU? That is essentially this vecAdd function, the one commented as the host program; as you can see, it is similar to our earlier CPU-only vecAdd program. Essentially, this is a function that is called from main, and it expects to be passed three pointers. These pointers contain the base addresses of three arrays which have been dynamically allocated before the call is made, with the input arrays suitably initialized. They are added as vectors and the output array is returned. That was my original vecAdd code. Here, I have the vecAdd code which is the CPU-side host program, with similar input arguments. Now, what we are trying to do here is that we do not want the addition to be done on this CPU. This host code is supposed to launch something called a GPU kernel, which is the code that will be executed on the GPU. So, let us see how it is done. We will just walk through this program step by step.
First, inside this host program vecAdd, we declare some pointers, and we also declare a variable of a specific enumerated data type, cudaError_t, and initialize it with one of its values, cudaSuccess. This is the error code used to check return values from CUDA calls. Now, these are defined in the header files that we have included earlier; they are not our own design, they are already defined.
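As a small sketch of these declarations (the names follow the ones used on the slides; size as the byte length of each vector is an assumption):

/* inside the host function vecAdd(float *h_A, float *h_B, float *h_C, int n) */
float *d_A = NULL, *d_B = NULL, *d_C = NULL;   /* device-side pointers            */
cudaError_t err = cudaSuccess;                 /* error code for CUDA calls       */
size_t size = n * sizeof(float);               /* assumed size of a vector, bytes */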
(Refer slide Time 08:25)
Now coming to the code, the first thing we are trying to do is this: just like we do malloc for dynamically allocating memory on a CPU, we call the function cudaMalloc. Again, these are all defined in the CUDA headers. What cudaMalloc does is allocate memory not on the CPU side, but in the memory resident on the GPU device. It will allocate a specific amount of memory, equal to size bytes, and return a pointer to it; this is the pointer d_A. As you can see, we have already declared this d_A, and after this cudaMalloc call, d_A holds a generic pointer to size bytes of memory resident on the GPU device.
In case this malloc call does not go through perfectly, the error will not be equal to cudaSuccess, and the next if condition will be satisfied. Again, cudaSuccess is the enumerator value which the return code should match, provided the malloc call goes perfectly fine. If that is not so, then this condition fires, and we have an fprintf to standard error where a message will be printed along with the suitable error code. Now, how does the error code come in? The error code can be figured out by the directive cudaGetErrorString. cudaGetErrorString takes as an argument the error of type cudaError_t, and from that error code it figures out the suitable error string, which is printed here using the fprintf command. Again, I am assuming that you are conversant with fprintf for printing strings to standard error; please get acquainted with it if you have forgotten. After that, exit will happen with a suitable exit code. Following this scheme, we do the memory allocation for the other pointers, that is, d_B as well as d_C.
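Put together, the allocation and error-check pattern for d_A might look like the following sketch (the message text is only illustrative), and the same pattern is repeated for d_B and d_C:

err = cudaMalloc((void **)&d_A, size);   /* allocate size bytes in GPU device memory */
if (err != cudaSuccess)
{
    fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n",
            cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}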
So, as we can understand, there is a difference between the original pointers that have been passed to the vecAdd host program, i.e., h_A, h_B and h_C, and these new pointers. The original ones point to locations in the CPU memory; they point to arrays resident in memory which have been dynamically allocated and initialized before the call to vecAdd is made. Internally, vecAdd declares more pointers and makes cudaMalloc calls to assign these generic pointers, so that they point to memory locations not on the CPU side but on the GPU side. We will soon see why this is necessary.
As you can see, we have a kind of repetitive code of cudaMalloc-ing for d_A, d_B and d_C, and in case of error, suitable error handling will trigger, as has been written down here. Following this, we have a print statement which says that we are copying the input data from the host memory to the CUDA device. If this gets printed, that means all the previous malloc operations went fine: we have three regions of memory, each of size bytes, allocated on the GPU memory side, and they are pointed to by d_A, d_B and d_C.
So the next thing I need to do is copy the input arrays from the host side memory to the device side memory, from the CPU memory to the GPU memory. Essentially, I want to do the vector addition on the GPU, in parallel, not on the CPU; that is the basic idea of this program, which is vector addition on a CPU-GPU system. Now, for doing this transfer of data from the host side to the device side, we have the command cudaMemcpy. What does it do? Here d_A is the device side memory and h_A is the host side memory, so there is a copy from h_A to d_A; the size of the copy is dictated by the parameter "size", and the type of copy is "cudaMemcpyHostToDevice". Following this directive, all the data resident in h_A, up to "size" bytes, gets copied to the location pointed to by d_A in the GPU side memory. In case this is not successful, suitable error messages get printed, as has already been discussed. Similarly, we also do the cudaMemcpy for the array h_B to the device array d_B. As you can see, the code is again quite similar.
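The host-to-device copies just described might look like this sketch (with the same error-check pattern as before assumed after each call):

err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);   /* copy input vector A to the device */
err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);   /* copy input vector B to the device */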
Once both these operations are done, we have the two input arrays containing the two input vectors ready for addition in the GPU side memory. At this point, we are thinking of launching a program on the GPU which will access these array locations in the GPU side memory, do the addition, store the result, and transfer the result back to the CPU side memory. For doing that, we declare certain suitable parameters known as "threadsPerBlock" and "blocksPerGrid", and we pass these parameters to a function called vectorAdd. Now, as you can see, the call of this function is very different from a standard C function call, because apart from its arguments d_A, d_B, d_C and n, there is something here called blocksPerGrid and threadsPerBlock. What these are we will discuss a bit later on; for the time being, just think of this as a specific type of function with some extra parameters. Essentially, this is how we are launching a program on the GPU. We are going to launch a function on the GPU which will do the computation of vector addition in parallel. This kind of function, which does computation on the GPU side, is traditionally known as a GPU kernel, and GPU kernel launches follow this kind of parameterization. The parameter definition is something we will speak about a bit later on. For now, just observe that the function has been passed, apart from the threadsPerBlock and blocksPerGrid parameters, the GPU device memory locations d_A, d_B and d_C. d_A and d_B essentially point to the vectors which are to be added, and the result is supposed to be stored in d_C; all of them are in device memory.
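Sketched in code, the launch just described might look like this (the triple angle bracket notation is the standard CUDA kernel-launch syntax; the values of threadsPerBlock and blocksPerGrid are discussed a little later):

/* launch the vectorAdd kernel on the GPU, passing the device pointers */
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);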
Now let us look at the implementation of this kernel, which is here. This is how a CUDA kernel can be defined. First we have the declaration of vectorAdd, and then we have the actual definition of vectorAdd. Inside vectorAdd, we have the computation of an index that decides the per-thread behaviour, that is, which element each thread works on, and then we have the familiar code of location C[i] getting the value A[i] + B[i]. So C[i] is written with A[i] + B[i]. Now, the difference from the CPU side is this: if the GPU has got n cores, that is, n scalar processors inside it, then that many threads will be launched, and they will do all these additions in parallel. In that way the whole vector addition happens in parallel. How exactly this works is something we will discuss later.
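A sketch of such a kernel definition, consistent with the description above, is the following (the __global__ qualifier and the blockDim/blockIdx/threadIdx built-ins are standard CUDA; the if (i < n) guard is there for the case where n is not an exact multiple of the block size):

__global__ void vectorAdd(const float *A, const float *B, float *C, int n)
{
    /* each thread computes one element of the output vector */
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        C[i] = A[i] + B[i];
}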
So coming back here, this was the call to vectorAdd at this point, and with this vectorAdd call we land up in the kernel, where the addition is done. Going back to the vectorAdd call, we expect that d_C, this device memory location, now contains the added vector. Something important to be noted here is that this is a function executing on the GPU side; such a function can take as operands values from the GPU memory and write values back, again to the GPU memory, but it does not return anything. That is why the device function, the CUDA kernel called from the host side, does not have a return type here. This is unlike CUDA runtime directives like cudaMemcpy or cudaMalloc, which can return an error code; this function cannot have a return type. But in case this function runs into some issue while executing, it will leave a signature which can be caught by cudaGetLastError, which is again a runtime function, a feature of the CUDA runtime system. The point I am trying to make here is that vectorAdd does not itself directly provide a return value; rather, in case there was some issue in executing the function, the runtime function cudaGetLastError can provide a suitable error code, and if that is not cudaSuccess, we again have a way to know that the launch of this vectorAdd kernel ran into some issue.
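A sketch of this check after the kernel launch (the message text is only illustrative):

err = cudaGetLastError();                /* did the kernel launch run into any issue? */
if (err != cudaSuccess)
{
    fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n",
            cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}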
Now we have the vector addition done, and the result is there in the device memory pointed to by d_C, which can be copied back by the cudaMemcpy directive to the CPU side, or host side, memory h_C. Observe that earlier, when we copied values from the host side to the device side, the directive inside cudaMemcpy was cudaMemcpyHostToDevice; but now, when we copy from the device side to the host side, it is cudaMemcpyDeviceToHost. That is the slight alteration. With this directive we have the result back in the CPU. Once I have the result back in the CPU, I can deallocate all the allocations from the device side memory, that is, we can free d_A, d_B and d_C using the cudaFree directive. So essentially it is very much C-like, with some CUDA annotations. Then we have written some small checking code, which simply checks for each element whether the absolute value of the sum of A and B minus C is within an error bound or not; otherwise we will say that there is some issue. And then we print that the test passed. So, this code has to be compiled.
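The copy-back, verification and cleanup just described might look like this sketch (fabs comes from <math.h>, and the error bound 1e-5 is an assumption):

cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);   /* copy the result vector back to the host */

for (int i = 0; i < n; i++)                            /* verify the result on the CPU */
{
    if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5)
    {
        fprintf(stderr, "Result verification failed at element %d!\n", i);
        exit(EXIT_FAILURE);
    }
}
printf("Test PASSED\n");

cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);           /* free the device side allocations */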
We run the NVCC compiler with the kernel definition in kernel.cu and the host side code in host.cu to compile them and create the binary output. If you run it, you get this sequence of print statements firing, and this should be the output we would expect.
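For example, the invocation might look something like this (the output binary name vecAdd is just a placeholder):

nvcc kernel.cu host.cu -o vecAdd
./vecAdd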
Now, coming back to how the execution really goes on. As we know, the GPU is a separate card, an accelerator card which gets attached to the CPU, and the GPU has got its own device memory. For doing the computation on the GPU, from the last example we could figure out that suitable data first needs to be transferred by the CPU to the GPU device memory. Then the GPU kernel is suitably launched by the host program. The GPU kernel executes on the GPU, with its input parameters taken from the device memory of the GPU. It writes data back to the device memory of the GPU, which then has to be copied back to the CPU. So this is the overall execution flow in a bit more detail.
The header file cuda.h includes the declarations of the CUDA API functions and the different CUDA system variables, some of which we have been exhibiting with examples inside the code. We have also seen that the host code uses variables that are mapped in the main memory of the CPU, whereas before executing the device code we needed to initialize pointers and allocate suitable memory for them in the GPU DRAM, or device memory.
(Refer slide Time 22:59)
A few more observations. We have been using functions supported by the CUDA runtime. There is cudaMalloc: what exactly was it doing? It was allocating a memory segment from the GPU global memory, which is physically separate from the CPU's memory. It expects a generic pointer, whose type is void **; being a generic pointer, it can point to any kind of data. This low-level function is common for all object types, and that is the reason why it takes a generic pointer.
Now, the other thing is the cudaMemcpy command. As we have discussed earlier, it transfers data back and forth between the CPU and the GPU. If it has the directive cudaMemcpyHostToDevice, it is a copy from the CPU to the GPU; if it is cudaMemcpyDeviceToHost, then it is a copy from the GPU memory to the CPU memory. The important thing is about device memory pointers such as d_A: once the cudaMalloc is done with d_A, this device memory pointer cannot be dereferenced in the host code, simply for the reason that it points to a different physical location, i.e., the GPU memory. Such pointers can be dereferenced only inside the kernel code.
So with respect to cudaMemcpy, we can have this kind of copy from device to host and from host to device; and just as the transfer of data between the GPU and the CPU is supported with cudaMemcpy, we can also do a transfer among different device memory locations. So if I have two different memory locations on the same device, I can do the copy using the cudaMemcpyDeviceToDevice directive. We can also transfer data from host to host, but we do not need to do that, because on a normal CPU that is just a normal movement of data between two different locations in memory. But I cannot transfer data between different GPU devices using a cudaMemcpy directive. It is very easy to understand; just to summarize what we can really do: I can copy data from host side memory to device side memory, or from device side memory to host side memory. I can copy data between two different locations in the same device memory. I can copy data between two different memory locations in the same host side memory, although that is not required because it is inside the CPU's DRAM anyway. But what I cannot do is have cudaMemcpy copy data from one GPU device's memory to another GPU device's memory, because in that case they are two different physical devices, and cudaMemcpy supports transfer between one host and one GPU. So in that case, I have to copy the data from device one to the host, and then I have to copy it from the host to device two.
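As a small sketch of these cases (d_X and d_Y are assumed to be two allocations on the same device, and h_src, h_dst two host arrays):

cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);     /* host to device                           */
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);     /* device to host                           */
cudaMemcpy(d_Y, d_X, size, cudaMemcpyDeviceToDevice);   /* two locations on the same device         */
cudaMemcpy(h_dst, h_src, size, cudaMemcpyHostToHost);   /* host to host (ordinary copy, rarely needed) */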
Now coming back to the call of a CUDA kernel: how is a CUDA kernel called? When a CUDA kernel is invoked, it launches multiple threads in a two-level hierarchy. The threads are the basic computing units which engage the scalar processors in the GPU. Each scalar processor executes one thread, and all the scalar processors execute threads in parallel. Going back to our example CUDA kernel, as we have discussed earlier, when this vectorAdd kernel was launched there were two parameters, blocksPerGrid and threadsPerBlock. Let us try to understand the significance of these parameters. As we have already discussed, we are trying to do computation using multiple computing threads on a GPU; that is the fundamental thing. So when I am doing vector addition, I want to launch a lot of compute threads, and all of them will add components of the vector in parallel.
Now, this arrangement of multiple threads follows a two-level hierarchy. Suppose I am trying to launch n threads. What I can do is define their arrangement in two levels: at the higher level I say that I have ceiling(n / 256) elements, and at the lower level I say that each element comprises 256 threads; so in total I have n threads being launched. This call specifies the grid of threads to be launched - that is the technical term people use: you launch a grid of threads. The grid is arranged in a hierarchical manner: you have a number of blocks, so the first component in the hierarchy is the number of blocks, and here you have ceiling(n / 256) blocks. Then you say that each block contains 256 threads; that is the second parameter. So overall, I have a number of blocks and a number of threads per block, and that is the specification which tells me how many threads are launched.
Now, if we go back to the launch of the vectorAdd kernel: we had threadsPerBlock defined as 256, and then we defined blocksPerGrid, or just the number of blocks, as (n + threadsPerBlock - 1) / threadsPerBlock. So essentially, we are looking at a total of n threads here, rounded up to a whole number of blocks. If this print statement fires, it will tell me how many threads there are per block and how many blocks have been launched; essentially, here we have vectorAdd being launched with the number of blocks in the grid and the number of threads inside each block. Of course, since I have a single definition of threadsPerBlock, that also means all blocks contain the same number of threads, at most 1024.
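As a small worked sketch of this configuration (the vector length n = 1000 is just an assumption): (1000 + 256 - 1) / 256 = 4 in integer arithmetic, so 4 blocks of 256 threads, i.e., 1024 threads, are launched, and the index check inside the kernel keeps the last 24 threads from touching elements beyond n.

int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;   /* = 4 when n = 1000 */
printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);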
of blocks that means I can number blocks in 3 dimensions using triplets. We will discuss this later on. So how are really blocks are arranged so when
a CUDA kernel is launched I have this hierarchical arrangement of threads and suppose I have
n number of blocks and each block is containing this 256 sets.