COMP 422
Parallel Computing
Spring 2020
Assignment 4
Complete by the end of the spring semester: May 6 @ 5pm

Implementing a data-parallel bitonic sort using a GPU with CUDA
Lecture 22 discusses bitonic sorting. There is also a good introduction at
http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/bitonic/bitonicen.htm. In this
assignment, you will use CUDA to implement a data-parallel sorting
code using bitonic sort.
For this assignment, you will use NOTS as the
computational platform. NOTS contains 16 nodes equipped with GPU
accelerators. A pair of those nodes is reserved for the class using
the "classroom" reservation, as described below.
An installation of CUDA 10.1 is available at
/opt/apps/software/Compiler/GCC/8.3.0/CUDA/10.1.168.
To set yourself up for CUDA development, execute the following
command:

module load gcccuda/2019a

This command will put the nvcc compiler and the cuda-gdb debugger for that release on your path.
Documentation for the CUDA 10.1 release is available on NOTS in
/opt/apps/software/Compiler/GCC/8.3.0/CUDA/10.1.168/doc, as well as
on the COMP 422 resources page.
Github Classroom Invitation
A github classroom repository for the assignment has been posted. Here
is the invitation:
https://classroom.github.com/a/Fx4pQ-Ln
The provided code includes a very nice implementation of bitonic sort on a GPU
from the CUDA 10.1 SDK. (A copy of all samples from the CUDA SDK is
available for you to peruse, copy, or play with in /projects/comp422/cuda-samples.)
There are two parts to the bitonic-sort implementation:
- host code implemented in main.cpp, and
- a mixture of host and device code for the GPU in bitonicSort.cu.
The .cu suffix is used by the Nvidia CUDA compiler
(nvcc). CUDA .cu files can
contain a mixture of host and device code; the directives in the code
tell the compiler whether to compile functions for the CPU host or the
GPU device.
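As a minimal illustration (this sketch is not taken from the provided
code), the qualifier on a function determines where it runs:

// __global__ marks a kernel: it runs on the GPU device but is
// launched from host code with the <<<blocks, threads>>> syntax.
__global__ void addOne(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] += 1;
}

// __device__ marks a function that runs on the GPU and can only be
// called from device code; functions with no qualifier (or __host__)
// are compiled for the CPU host.
__device__ int maxInt(int a, int b)
{
    return (a > b) ? a : b;
}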
The code in bitonicSort.cu implements a bitonic sort in
two parts. The code first performs a collection of small-scale bitonic
sorts by mapping them onto individual
GPU streaming multiprocessors and then performs a larger-scale bitonic
sort across all of the streaming multiprocessors.
The code generates a random sequence of 1M keys and
values, copies them into the GPU, sorts them, and copies them back to the
host. Then it checks the resulting sequence to make sure that it is in order.
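For orientation, here is a hedged sketch of the compare-and-exchange
step that every pass of a bitonic sort applies; the kernel name and
structure below are illustrative, not the SDK's actual code:

// Illustrative bitonic compare-and-exchange pass: each thread owns
// index i, pairs with index i ^ j, and orders the pair according to
// the direction bit selected by the stage size k.
__global__ void bitonicStep(unsigned int *keys, unsigned int j, unsigned int k)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int partner = i ^ j;            // this thread's partner index
    if (partner > i) {                       // let the lower index act
        bool ascending = ((i & k) == 0);     // sort direction for this run
        if ((keys[i] > keys[partner]) == ascending) {
            unsigned int tmp = keys[i];      // out of order: exchange
            keys[i] = keys[partner];
            keys[partner] = tmp;
        }
    }
}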
You can build the sorting code by simply typing

make

This will build an executable for the GPU. The executable
"sortingNetworks" will be placed in the same directory.
A limitation of this
implementation is that it can only sort vectors of numbers that fit
into GPU memory all at once. Your assignment is to modify this
implementation (both the host-side and GPU-side code)
to sort large vectors of integer values instead of
(key, value) pairs. You may generate the
integers to be sorted using rand().
You will need to copy the data in and out of the GPU in stages
to accomplish the assignment.
This will require
understanding and modifying the existing GPU code as well as writing
some host-side code.
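As a hedged illustration of what staging could look like, here is a
sketch in which BLOCK_ELEMS, d_buf, h_data, N, and sortBlockOnGPU are
all hypothetical names, not part of the provided code:

// Illustrative staging loop; the names here are placeholders.
unsigned int *d_buf;                        // device buffer for one block
cudaMalloc(&d_buf, BLOCK_ELEMS * sizeof(unsigned int));
for (size_t offset = 0; offset < N; offset += BLOCK_ELEMS) {
    cudaMemcpy(d_buf, h_data + offset, BLOCK_ELEMS * sizeof(unsigned int),
               cudaMemcpyHostToDevice);
    sortBlockOnGPU(d_buf, BLOCK_ELEMS);     // bitonic sort of one block
    cudaMemcpy(h_data + offset, d_buf, BLOCK_ELEMS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
}
cudaFree(d_buf);
// Merging the sorted blocks must also use bitonic comparisons on the
// GPU; see the rules that follow.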
To control the form of your solutions, I am imposing the following
rules.
- You must use bitonic sort at all levels. Don't write a hybrid
solution that simply uses host code to merge sequences that were sorted using
bitonic sort on the GPU.
- All element comparisons for sorting and/or
merging must be performed on the GPU. This
restriction will require that you write some GPU code rather than
host code only.
- Your verification code may perform element
comparisons on the host.
- The vector of numbers that you sort will be of length 2^k, where k
is an input parameter. For timings, let k = 26. To evaluate your code,
we will run it with arbitrary values of 0 <= k <= 26.
- You may only sort up to 1M elements on the
GPU in a single batch. The intent of the assignment is for you to
perform bitonic sort by staging data in and out of the GPU in blocks.
- The sample code sorts data of several sizes. Your submitted
assignment should only generate and sort a single vector of the
specified size. Delete code that is not relevant to your solution.
NOTE: CUDA 10 seems to have some support for recursion, though I
don't have any experience with it and wouldn't encourage it.
Like the provided sample, I recommend that you use an iterative form
for the part of your sorting solution that runs on the GPU.
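For reference, the iterative form of bitonic sort is just a pair of
nested loops over stage and pass sizes. Here is a hedged sketch of a
host-side driver using the illustrative bitonicStep kernel from
earlier; the launch parameters are assumptions:

// Illustrative iterative driver for n = 2^k keys already resident on
// the GPU: the outer loop doubles the sorted subsequence length, and
// the inner loop halves the compare distance within each stage.
void bitonicSortIterative(unsigned int *d_keys, unsigned int n)
{
    unsigned int threads = 256;                    // illustrative block size
    unsigned int blocks  = n / threads;            // one thread per element
    for (unsigned int k = 2; k <= n; k <<= 1)          // stage size
        for (unsigned int j = k >> 1; j > 0; j >>= 1)  // compare distance
            bitonicStep<<<blocks, threads>>>(d_keys, j, k);
}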
Running CUDA jobs on NOTS
You can edit and compile your code on any one of the login nodes on
NOTS. However, to run your code on the GPU hardware,
you need to use one of
the compute nodes with a GPU. You can use a GPU by submitting a
SLURM job or acquiring a node in an interactive
session.
To secure a GPU node on which to run and test your program, you can
request an interactive session by executing the following command:

srun --pty --time=00:30:00 --cpus-per-task=2 --gres=gpu:1 \
    --partition=interactive --reservation=classroom /bin/bash
which will give you a shell on a node that you can use for 30
minutes to work with your code on a GPU.
A copy of this command has been placed in the file interactive.sh in your
GitHub Classroom repository.
You can use it by executing the command:

sh interactive.sh
Please don't tie up a GPU unless you are actively running or debugging.
A template for a SLURM batch script that you can copy to run
your sortingNetworks executable on a GPU node can be found here:
/projects/comp422/cuda-hw4/sortingNetworks/submit.sbatch
Copy that file, edit the input arguments to your executable,
and submit it to the batch queue with the following command:

sbatch submit.sbatch

This will run the "sortingNetworks" executable on a
GPU. It will write its output to a file named
slurm-<your slurm jobid>.out.
As you know, the status of a SLURM job can be checked by running the
squeue command on NOTS.
GPUs on NOTS
Note that there are two different kinds of NVIDIA GPUs on NOTS. Two GPU
nodes have NVIDIA K80 GPUs. 16 nodes have NVIDIA V100 GPUs. The code
will work on both types of GPUs. When you do timings, note what kind
of GPU is on your node. You can check by running the command-line tool
nvidia-smi.
Debugging your GPU code
You can use cuda-gdb to debug your code.
To debug your code with the cuda-gdb debugger, your code must
first be compiled for debugging. To build a debug version of your code, type

make dbg=1
The debug executable will be placed in the same directory. With the
CUDA module loaded (see above),
cuda-gdb will be on your path.
If you know how to use gdb, you will find that
cuda-gdb works much the same. You can set breakpoints in GPU
kernels as well as host code and inspect variables just as you would
using gdb.
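For example, a short session might look like the following (the kernel
name bitonicSortShared is a placeholder; substitute the name of the
kernel you want to stop in):

cuda-gdb ./sortingNetworks
(cuda-gdb) break bitonicSortShared
(cuda-gdb) run
(cuda-gdb) info cuda threads
(cuda-gdb) continue

Here break sets a breakpoint at a GPU kernel, info cuda threads shows
the state of the device threads, and the usual gdb commands (print,
next, continue) work as expected.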
Note: you must be running on a GPU-equipped compute node in an
interactive session to use cuda-gdb. See the directions for launching
an interactive session on a GPU-equipped node above.
See the CUDA-GDB manual for detailed information about how to use
cuda-gdb.
Submitting your Assignment
You will use GitHub classroom to submit your assignment.
Your assignment submission should include:
- A copy of the sortingNetworks directory containing your modified
source files and a Makefile. Please don't submit .o files or your executable.
- Your writeup. See the Grading Criteria below for notes about the
expected content of your writeup.
Grading Criteria
- 10% Program correctness.
- 40% Program clarity, elegance, and parallelization. Your code should
be well-documented internally with comments.
- 50% Writeup.
For this assignment,
your writeup is a larger part of your grade. I want you to explain not
only the parts of the bitonic sorting code that you implemented, but
also the workings of the existing components that you use.
Explain how the existing code works. Does it employ all of the multiprocessors?
Does it keep the threads in a block busy? How and why does it make use of shared
memory?
Time how long it takes to perform your sort for k = 26 and report that in your
writeup. Make sure that the version you use for timing is the
executable compiled without debugging support.
Beyond the content, your grade on the writeup will also
depend upon the quality of your
writing (grammar, sentence structure) and the clarity of your explanations.
23 April 2020 - posted