COMP 422
Parallel Computing
Spring 2020
Assignment 4
Complete by the end of the spring semester: May 6 @ 5pm

Implementing a data-parallel bitonic sort using a GPU with CUDA
Lecture 22 discusses bitonic sorting. There is also a good introduction at
http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/bitonic/bitonicen.htm. In this
assignment, you will use CUDA to implement a data-parallel sorting
code using bitonic sort.
For this assignment, you will use NOTS as the
computational platform. NOTS contains 16 nodes equipped with GPU
accelerators. A pair of those nodes is reserved for the class using
the "classroom" reservation, as described below.
An installation of CUDA 10.1 is available at
/opt/apps/software/Compiler/GCC/8.3.0/CUDA/10.1.168.
To set yourself up for CUDA development, execute the following
command:

module load gcccuda/2019a

This command will put the nvcc compiler and the cuda-gdb debugger for that release on your path.
Documentation for the CUDA 10.1 release is available on NOTS in
/opt/apps/software/Compiler/GCC/8.3.0/CUDA/10.1.168/doc, as well as
on the COMP 422 resources page.
Github Classroom Invitation
A github classroom repository for the assignment has been posted. Here
is the invitation:
https://classroom.github.com/a/Fx4pQ-Ln
The provided code includes a very nice implementation of bitonic sort on a GPU
from the CUDA 10.1 SDK. (A copy of all samples from the CUDA SDK is
available for you to peruse, copy, or play with in /projects/comp422/cuda-samples.)
There are two parts to the bitonic-sort implementation:
- host code implemented in main.cpp, and
- a mixture of host and device code for the GPU in bitonicSort.cu.
The .cu suffix is used by the Nvidia CUDA compiler
(nvcc). CUDA .cu files can
contain a mixture of host and device code; the directives in the code
tell the compiler whether to compile functions for the CPU host or the
GPU device.
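As a minimal illustration (this sketch is not taken from the provided
code), the qualifier on a function determines where it runs:

// __global__ marks a kernel: it runs on the GPU device but is
// launched from host code with the <<<blocks, threads>>> syntax.
__global__ void addOne(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] += 1;
}

// __device__ marks a function that runs on the GPU and can only be
// called from device code; functions with no qualifier (or __host__)
// are compiled for the CPU host.
__device__ int maxInt(int a, int b)
{
    return (a > b) ? a : b;
}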
The code in bitonicSort.cu implements a bitonic sort in
two parts. The code first performs a collection of small-scale bitonic
sorts by mapping them onto individual
GPU streaming multiprocessors and then performs a larger-scale bitonic
sort across all of the streaming multiprocessors.
The code generates a random sequence of 1M keys and
values, copies them into the GPU, sorts them, and copies them back to the
host. Then it checks the resulting sequence to make sure that it is in order.
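For orientation, here is a hedged sketch of the compare-and-exchange
step that every pass of a bitonic sort applies; the kernel name and
structure below are illustrative, not the SDK's actual code:

// Illustrative bitonic compare-and-exchange pass: each thread owns
// index i, pairs with index i ^ j, and orders the pair according to
// the direction bit selected by the stage size k.
__global__ void bitonicStep(unsigned int *keys, unsigned int j, unsigned int k)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int partner = i ^ j;            // this thread's partner index
    if (partner > i) {                       // let the lower index act
        bool ascending = ((i & k) == 0);     // sort direction for this run
        if ((keys[i] > keys[partner]) == ascending) {
            unsigned int tmp = keys[i];      // out of order: exchange
            keys[i] = keys[partner];
            keys[partner] = tmp;
        }
    }
}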
You can build the sorting code by simply typing

make

This will build an executable for the GPU. The executable
"sortingNetworks" will be placed in the same directory.
A limitation of this
implementation is that it can only sort vectors of numbers that fit
into GPU memory all at once. Your assignment is to modify this
implementation (both the host-side and GPU-side code)
to sort large vectors of integer values instead of
(key, value) pairs. You may generate the
integers to be sorted using rand().
You will need to copy the data in and out of the GPU in stages
to accomplish the assignment.
This will require
understanding and modifying the existing GPU code as well as writing
some host-side code.
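As a hedged illustration of what staging could look like, here is a
sketch in which BLOCK_ELEMS, d_buf, h_data, N, and sortBlockOnGPU are
all hypothetical names, not part of the provided code:

// Illustrative staging loop; the names here are placeholders.
unsigned int *d_buf;                        // device buffer for one block
cudaMalloc(&d_buf, BLOCK_ELEMS * sizeof(unsigned int));
for (size_t offset = 0; offset < N; offset += BLOCK_ELEMS) {
    cudaMemcpy(d_buf, h_data + offset, BLOCK_ELEMS * sizeof(unsigned int),
               cudaMemcpyHostToDevice);
    sortBlockOnGPU(d_buf, BLOCK_ELEMS);     // bitonic sort of one block
    cudaMemcpy(h_data + offset, d_buf, BLOCK_ELEMS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
}
cudaFree(d_buf);
// Merging the sorted blocks must also use bitonic comparisons on the
// GPU; see the rules that follow.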
To control the form of your solutions, I am imposing the following
rules.
- You must use bitonic sort at all levels. Don't write a hybrid
solution that simply uses host code to merge sequences that were sorted using
bitonic sort on the GPU.
- All element comparisons for sorting and/or
merging must be performed on the GPU. This
restriction will require that you write some GPU code rather than
host code only.
- Your verification code may perform element
comparisons on the host.
- The vector of numbers that you sort will be of length 2^k, where k
is an input parameter. For timings, let k = 26. To evaluate your code,
we will run it with arbitrary values of 0 <= k <= 26.
- You may only sort up to 1M elements on the
GPU in a single batch. The intent of the assignment is for you to
perform bitonic sort by staging data in and out of the GPU in blocks.
- The sample code sorts data of several sizes. Your submitted
assignment should only generate and sort a single vector of the
specified size. Delete code that is not relevant to your solution.
NOTE: CUDA 10 seems to have some support for recursion, though I
don't have any experience with it and wouldn't encourage it.
Like the provided sample, I recommend that you use an iterative form
for the part of your sorting solution that runs on the GPU.
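For reference, the iterative form of bitonic sort is just a pair of
nested loops over stage and pass sizes. Here is a hedged sketch of a
host-side driver using the illustrative bitonicStep kernel from
earlier; the launch parameters are assumptions:

// Illustrative iterative driver for n = 2^k keys already resident on
// the GPU: the outer loop doubles the sorted subsequence length, and
// the inner loop halves the compare distance within each stage.
void bitonicSortIterative(unsigned int *d_keys, unsigned int n)
{
    unsigned int threads = 256;                    // illustrative block size
    unsigned int blocks  = n / threads;            // one thread per element
    for (unsigned int k = 2; k <= n; k <<= 1)          // stage size
        for (unsigned int j = k >> 1; j > 0; j >>= 1)  // compare distance
            bitonicStep<<<blocks, threads>>>(d_keys, j, k);
}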
Running CUDA jobs on NOTS
You can edit and compile your code on any one of the login nodes on
NOTS. However, to run your code on the GPU hardware,
you need to use one of
the compute nodes with a GPU. You can use a GPU by submitting a
SLURM job or acquiring a node in an interactive
session.
To secure a GPU node on which to run and test your program, you can
request an interactive session by executing the following command:

srun --pty --time=00:30:00 --cpus-per-task=2 --gres=gpu:1 \
    --partition=interactive --reservation=classroom /bin/bash
which will give you a shell on a node that you can use for 30
minutes to work with your code on a GPU.
A copy of this command has been placed in the file interactive.sh in your
GitHub Classroom repository.
You can use it by executing the command:

sh interactive.sh
Please don't tie up a GPU unless you are actively running or debugging.
A template for a SLURM batch script that you can copy to run
your sortingNetworks executable on a GPU node can be found here:
/projects/comp422/cuda-hw4/sortingNetworks/submit.sbatch
Copy that file, edit the input arguments to your executable,
and submit it to the batch queue with the following command:

sbatch submit.sbatch

This will run the "sortingNetworks" executable on a
GPU. It will write its output to a file named
slurm-<your slurm jobid>.out.
As you know, the status of a SLURM job can be checked by running the
squeue command on NOTS.
GPUs on NOTS
Note that there are two different kinds of NVIDIA GPUs on NOTS. Two GPU
nodes have NVIDIA K80 GPUs. 16 nodes have NVIDIA V100 GPUs. The code
will work on both types of GPUs. When you do timings, note what kind
of GPU is on your node. You can check by running the command-line tool
nvidia-smi.
Debugging your GPU code
You can use cuda-gdb to debug your code.
To debug your code with the cuda-gdb debugger, your code must
first be compiled for debugging. To build a debug version of your code, type

make dbg=1
The debug executable will be placed in the same directory. With the
CUDA module loaded (see above),
cuda-gdb will be on your path.
If you know how to use gdb, you will find that
cuda-gdb works much the same. You can set breakpoints in GPU
kernels as well as host code and inspect variables just as you would
using gdb.
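For example, a short session might look like the following (the kernel
name bitonicSortShared is a placeholder; substitute the name of the
kernel you want to stop in):

cuda-gdb ./sortingNetworks
(cuda-gdb) break bitonicSortShared
(cuda-gdb) run
(cuda-gdb) info cuda threads
(cuda-gdb) continue

Here break sets a breakpoint at a GPU kernel, info cuda threads shows
the state of the device threads, and the usual gdb commands (print,
next, continue) work as expected.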
Note: you must be running on a GPU-equipped compute node in an
interactive session to use cuda-gdb. See the directions for launching
an interactive session on a GPU-equipped node above.
See the CUDA-GDB manual for detailed information about how to use
cuda-gdb.
Submitting your Assignment
You will use GitHub classroom to submit your assignment.
Your assignment submission should include:
- A copy of the sortingNetworks directory containing your modified
source files and a Makefile. Please don't submit .o files or your executable.
- Your writeup. See the Grading Criteria below for notes about the
expected content of your writeup.
Grading Criteria
- 10% Program correctness.
- 40% Program clarity, elegance, and parallelization. Your code should
be well-documented internally with comments.
- 50% Writeup.
For this assignment,
your writeup is a larger part of your grade. I want you to explain not
only the parts of the bitonic sorting code that you implemented, but
also the workings of the existing components that you use.
Explain how the existing code works. Does it employ all of the multiprocessors?
Does it keep the threads in a block busy? How and why does it make use of shared
memory?
Time how long it takes to perform your sort for k = 26 and report that in your
writeup. Make sure that the version you use for timing is the
executable compiled without debugging support.
Beyond the content, your grade on the writeup will also
depend upon the quality of your
writing (grammar, sentence structure) and the clarity of your explanations.
23 April 2020 - posted