In the second week I started learning more about CUDA.
There was important information I needed to know before parallelizing a program, such as the hardware version of the GPUs that Stampede uses and their compute capability. The compute capability is important because, for a program to run on a given GPU, you must specify that compute capability in the command used to compile a CUDA program.
During the second week I learned how to use the commands to compile a CUDA program.
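As a rough sketch, assuming Stampede's K20 GPUs (compute capability 3.5) and a source file named hello.cu, the compile command might look like this:

nvcc -arch=sm_35 hello.cu -o hello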
One of the most important things was how to use a compute node on Stampede and what the consequences are of not using one (if we run a program on the login node, we slow down the other people who are using Stampede).
I learned how to run a program using a "batch" script, where you specify the queue you want to use, the name of the program you want to run, how many GPUs you want to use, and your user account. For quick tests this method is not very efficient, because writing the whole script takes longer than simply logging into a node interactively.
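A minimal sketch of what such a batch script can look like under Stampede's Slurm scheduler (the queue name, time limit, and account name below are placeholders, not the ones from that week):

#!/bin/bash
#SBATCH -J hello           # job name
#SBATCH -o hello.o%j       # output file (%j expands to the job ID)
#SBATCH -p gpu             # queue to submit to
#SBATCH -N 1               # number of nodes
#SBATCH -n 1               # number of tasks
#SBATCH -t 00:05:00        # maximum run time
#SBATCH -A my_account      # user account / allocation
./hello                    # the program to run

The script is then submitted with sbatch, e.g. sbatch hello.sh.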
By the end of this week I had learned a little about kernels. The kernel is the most important piece of a CUDA parallelization; without one, we cannot run anything in parallel on the device.
Kernel code properties:
- It is only executable on the device
- It is only callable by the host
- It is identified by the __global__ qualifier with void return type
- There are other qualifiers like: __device__ & __host__
-Example of a kernel function:
__global__ void hello()
{
    // Doesn't do anything here
}
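To illustrate the other qualifiers mentioned in the list above, here is a brief sketch (the function names are just examples):

__device__ int square(int x)   // runs on the device, callable only from device code
{
    return x * x;
}

__host__ void hostOnly()       // runs on the host; __host__ is the default for plain functions
{
}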
CUDA supports a thread abstraction. This programming model has the advantage of scalability, since all threads execute the same piece of code (the kernel) over different pieces of data.
-Kernel Configuration
- A single invoked kernel is referred to as a grid.
- A grid is comprised of blocks of threads.
- A block is comprised of multiple threads.
- Thread: the smallest execution unit in a CUDA program.
  - Threads within a block can synchronize, share memory, and cooperate.
  - Threads from different blocks are completely independent; the only resource they share is the global memory.
- Block: comprised of multiple threads.
  - Blocks in a grid are completely independent.
  - Different blocks are assigned to different SMs (streaming multiprocessors).
  - Multiple blocks can reside on the same SM, but a single block cannot be split across multiple SMs.
- Grid: comprised of blocks of threads.
  - A grid corresponds to one active kernel, and each kernel has its own independent workspace.
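To make the grid/block/thread hierarchy above concrete, here is a minimal sketch of a kernel launch (the kernel name and sizes are just illustrative). The <<<blocks, threads>>> execution configuration chooses how many blocks make up the grid and how many threads make up each block:

#include <cstdio>

__global__ void hello()
{
    // Each thread computes its own global index from its block and thread IDs
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Hello from thread %d\n", i);
}

int main()
{
    hello<<<2, 4>>>();        // grid of 2 blocks, 4 threads per block
    cudaDeviceSynchronize();  // wait for the kernel to finish before exiting
    return 0;
}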
A CUDA device has many different memory components, each with different size and bandwidth.
To read and write effectively to a specific memory component, we need to know how memory is organized on these components. Ideally, a program is structured so that threads don't need to constantly go to global memory to retrieve data, which is inefficient and slow.
The size of the different memory components also needs to be considered; global memory is significantly larger than cache memory and registers are much smaller. To fully exploit the potential of the device, we need to understand how to properly utilize the different levels in the memory hierarchy.
On a CUDA device, multiple kernels can be invoked. Each kernel is an independent grid consisting of one or more blocks. Each block has its own per-block shared memory, which is shared among the threads within that block.
Two important things to understand are what the host is (the CPU) and what the device is (the GPU).
In general, we use the host to:
Transfer data to and from global memory
Transfer data to and from constant memory
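A minimal sketch of how the host moves data into device global memory (the names host_a and dev_a are hypothetical):

// Allocate global memory on the device and copy a host array into it
float host_a[256];
float *dev_a;
cudaMalloc((void**)&dev_a, sizeof(host_a));
cudaMemcpy(dev_a, host_a, sizeof(host_a), cudaMemcpyHostToDevice);
// ... kernels launched here can now read and write dev_a ...
// (for constant memory the host uses cudaMemcpyToSymbol instead)
cudaMemcpy(host_a, dev_a, sizeof(host_a), cudaMemcpyDeviceToHost);
cudaFree(dev_a);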
Once the data is in the device memory, our threads can read and write (R/W) different parts of memory:
R/W per-thread register
R/W per-thread local memory
R/W per-block shared memory
R/W per-grid global memory
R per-grid constant memory
There are also important characteristics of each memory component, such as its size and bandwidth.
Global memory has the lowest bandwidth, but the largest size (5 GB on Stampede's K20 GPUs).
Constant memory allows read-only access by the device and provides faster and more parallel data access.
Per-block shared memory is faster than global memory and constant memory, but is slower than the per-thread registers.
-Each block has a maximum of 48 KB of shared memory on the K20.
-Per-thread registers can only hold a small amount of data, but are the fastest. Per-thread local memory is slower than the registers and is used to hold data that cannot fit into the registers.
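A brief sketch of how these memory spaces appear in CUDA code (the names and sizes are illustrative, and the kernel assumes a single block of 256 threads):

__constant__ float coeffs[16];            // per-grid constant memory (read-only on the device)

__global__ void memoryExample(float *out) // out points into per-grid global memory
{
    __shared__ float tile[256];           // per-block shared memory
    float x = coeffs[0] + threadIdx.x;    // x lives in a per-thread register
    tile[threadIdx.x] = x;
    __syncthreads();                      // threads within a block can synchronize
    out[threadIdx.x] = tile[threadIdx.x];
}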
By the end of the second week I learned how to parallelize a simple program that used arrays. The goal of the program was to add two arrays together, element by element, into a third array.
Some challenges I had were understanding how many blocks and threads to use for the task being completed, which I still did not fully understand by the end of the second week.
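One common answer to the blocks-and-threads question is to fix the threads per block and round the block count up so every element gets a thread. A minimal sketch of an array-addition program in that style (the names and sizes are illustrative, not the exact program from that week):

#include <cstdio>

__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // guard against extra threads past the end of the arrays
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1024;
    float a[n], b[n], c[n];
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

    // Copy the inputs into device global memory
    float *da, *db, *dc;
    cudaMalloc((void**)&da, n * sizeof(float));
    cudaMalloc((void**)&db, n * sizeof(float));
    cudaMalloc((void**)&dc, n * sizeof(float));
    cudaMemcpy(da, a, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, n * sizeof(float), cudaMemcpyHostToDevice);

    // Fix the threads per block, then round the number of blocks up
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    add<<<blocks, threadsPerBlock>>>(da, db, dc, n);

    // Copy the result back to the host and spot-check it
    cudaMemcpy(c, dc, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("c[0] = %f, c[n-1] = %f\n", c[0], c[n - 1]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}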