On the first day of the 3rd week, my main goal was to understand how many blocks and threads I had to use.
Usually, to parallelize a program we need arrays and for loops. Without a for loop there is little repeated, independent work to distribute, so it is hard for parallelization to take place.
I found that the maximum is 1024 threads per block and 2^31-1 blocks (in the x-dimension of the grid) in a single kernel launch. However, the number of threads and blocks that can be launched in a single kernel depends on the compute capability of the device. The limits above apply to GPUs with compute capability 3.0 and higher; older devices are limited to 65,535 blocks per grid dimension, and devices below compute capability 2.0 allow only 512 threads per block.
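Rather than hard-coding these limits, they can be queried at runtime with the CUDA runtime API. A minimal sketch (this is my own illustration, not code from the program I was assigned):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Query the properties of device 0.
    cudaGetDeviceProperties(&prop, 0);

    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock); // 1024 on recent GPUs
    printf("Max grid size (x): %d\n", prop.maxGridSize[0]);         // 2^31-1 on CC >= 3.0
    return 0;
}
```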
The number of threads we need depends on the number of iterations of the for loop and/or the size of the array being processed.
Spreading the threads across several blocks is good because different blocks can be scheduled onto different multiprocessors of the GPU, so more cores are working at the same time.
Example: if we want to compute 30 elements of an array, we can use 1 block with 30 threads, or 3 blocks of 10 threads. With such a small amount of data it will not make a noticeable difference, but if we are computing a large amount of data the choice of configuration does matter.
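The two launch configurations from this example would look like the following in CUDA. This is a sketch; the kernel name `addOne` and the array are my own, not from the assigned program:

```cuda
#include <cuda_runtime.h>

__global__ void addOne(float *a, int n) {
    // Compute this thread's global index from its block and thread indices.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)           // guard against threads past the end of the array
        a[i] += 1.0f;
}

int main() {
    const int n = 30;
    float *d_a;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMemset(d_a, 0, n * sizeof(float));

    addOne<<<1, 30>>>(d_a, n);   // 1 block of 30 threads
    addOne<<<3, 10>>>(d_a, n);   // same work: 3 blocks of 10 threads

    cudaDeviceSynchronize();
    cudaFree(d_a);
    return 0;
}
```

Both launches cover all 30 elements; they differ only in how the threads are grouped into blocks.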
In the middle of the week my mentor asked me to learn some of the syntax of the OpenCV GPU module.
This was only syntax; I had not yet put it into practice. The point was to get familiar with OpenCV, because my future tasks were going to require OpenCV functions that need a specific syntax in order to correctly parallelize a program.
I was assigned my first program (circuit) by the end of the 3rd week.
One of the challenges I faced was finding the code inside this program that actually needed parallelization.
I slowly learned that parallelization usually has to be applied to for loops.
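The typical pattern is to replace the loop with a kernel in which each thread handles one iteration. A generic sketch of this transformation (my own illustration, not the circuit program itself):

```cuda
#include <cuda_runtime.h>

// Original serial version:
//   for (int i = 0; i < n; i++) out[i] = 2.0f * in[i];

__global__ void scaleKernel(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];   // the old loop body, one thread per element
}

void scale(float *d_out, const float *d_in, int n) {
    int threadsPerBlock = 256;
    // Round up so that every element gets a thread, even when n
    // is not a multiple of the block size.
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocks, threadsPerBlock>>>(d_out, d_in, n);
}
```

The ceiling division for `blocks` is the standard way to size the grid from the loop's iteration count.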
The code was not compiling, and I did not know why.
This was the end of the 3rd week.