COMP 422 Assignment 3

COMP 422/534

Parallel Computing

Spring 2020

Assignment 3

Complete by the end of the spring semester: May 6 @5pm

2.5D Matrix Multiplication using MPI

Matrix multiplication is a binary operation performed on a pair of matrices A, rank M x N, and B, rank N x P, resulting in a matrix C, rank M x P. In matrix multiplication, the element-wise products from a row of elements from A, and a column of elements from B are summed to generate an element of matrix C. Two popular methods to perform parallel matrix multiplication on a distributed memory system, namely, Cannon's algorithm and 3D matrix multiplication, were described in class.

For this assignment, you will implement the 2.5D version of matrix multiplication using MPI. You should refer to the technical report entitled Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms by Solomonik et. al. for complete information on this algorithm. Communication avoiding algorithms are a hot topic - their development was an accomplishment highlighted in President Obama's FY 2012 Department of Energy Budget Request to Congress.

Summary of the 2.5D algorithm to get you started: 2.5D matrix multiplication algorithm, originating from UC Berkeley, reduces communication bandwidth consumed and exposed latency in matrix multiplication by maintaining a total of 'C' replicas of the input matrices A and B on a 3D grid of processors, P in total, of dimension sqrt(P/C) x sqrt(P/C) x C. Initially, a single copy of the A, B matrices is broken into chunks, and distributed among the processors on the front face of this grid. Through a series of broadcasts, and point-to-point communications, these chunks are then distributed to different processors. Each processor in the grid performs a matrix multiplication on the chunks of A and B received, and a final phase of reduction produces the result, C, matrix on the front face of the grid.

You can write your program in C, C++, or Fortran with MPI communication. You should use MPI collective communication where appropriate.

To evaluate the performance of your implementation, perform the following experiments:

Square matrices A, B, C of dimension N x N where N=7500. Your program should be run with MPI ranks counts for appropriate values between 1-72. Not all core counts 1-72 make sense for this algorithm. Determine an appropriate value of 'C' appropriately for each feasible core count.

As with your previous assignment, randomly generate the input matrices. I recommend that you use drand48; see its man page for details how to use it.

You will run your MPI program on NOTS's compute nodes and measure its performance and scalability. Instructions how to do this with SLURM are given later on this page. Your program should measure the elapsed time of your multiplication phase, which means DO NOT include the matrix initialization or answer verification. You can use the code here as a guide for constructing a verification test that your matrix multiply is computing the correct answer. You may include any of the code in the attached file in your assignment. As part of your experimentation, you are required to collect a message trace and visualize it with jumpshot. (Details about collecting traces and visualizing them using jumpshot on NOTS are provided in the next section.)

Github Classroom Invitation

A github classroom repository for the assignment has been posted. Here is the invitation: href=https://classroom.github.com/a/Z-y9XELT

Using HPCToolkit (COMP 534 students)

As part of your experimentation, you are required to use HPCToolkit to measure and visualize your code performance. Use the -g option (along with optimization) when you compile your program so that the compiler augments your executable with information about how to relate its machine code back to your source code. You can collect information about where the application spends its time by collecting a trace using

hpcrun -e REALTIME@1000 -t ./multiply 7500

You should use hpcstruct to analyze your executable which gathers information about how to attribute performance measurements back to your source code.

hpcstruct multiply To attribute your performance measurements back to the source program, run

You can examine the profile from the parallel run with hpcviewer and visualize the trace with hpctraceviewer.

To collect more detailed information about the compute performance of your application, you can use hardware performance counters to measure machine instructions completed, cycles, cache misses, and more. See the HPCToolkit User's Manual for additional information.

Preparing your Report

Write a document that describes how your program works. Along with describing your performance results as required above, please answer the following questions:

Explain the bandwidth and latency lower bounds achieved by the 2.5D matrix multiplication algorithm.
If your program uses collective communication, explain how; if not, explain why not.
Your report should include a screenshot an interesting segment of jumpshot's timeline visualization of your program, along with a description of what you learned from it about your parallelization.
COMP 534 students: Your report should include at least one interesting hpcviewer screenshot and one hpctraceviewer screenshot, along with a discussion of what they indicate about the performance of your application.

Using MPI on NOTS

General information about MPI and links to online documentation can be found on the COMP 422 Resources Page. For helpful details explaining how to develop and debug your programs, see the Getting Started on NOTS FAQ. This FAQ contains information about how to compile your MPI program, run using srun, and submit your job to the batch queueing system using SLURM.

There are several MPI implementations on NOTS. We will be using OpenMPI along with the gcc compilers. To load the modules necessary to prepare your environment, use the following command:

module load assignment3

Compile and link your program using mpiXX, where XX is either cc, cxx, or fort for C, C++, or Fortran code respectively. The mpiXX script will use the gcc, g++ or gfortran compiler. The mpiXX script will provide the include path for the MPI include files and automatically add the proper MPI libraries when it links your executable.

I recommend editing your code on a login node to avoid the surprise of having your editor be killed as your interactive session on the compute node expires. I also recommend compiling your code on the login nodes rather than the compute nodes. To run an MPI program, you need to use the compute nodes. See the information below about using SLURM to run your job on a set of compute nodes.

To monitor the message passing communication of your MPI programs, you will use the MPE tools. These tools have been added to your PATH by the assignment3 module. To collect a communication trace, instead of compiling and linking your MPI program with mpicc, use mpecc. This script is like mpicc, but it contains a library that provides support for message tracing. Use the -mpilog option when you compile with mpecc to turn on message tracing. When you do that, your executable will record a trace of MPI communication when it is executed. I recommend that you collect a communication trace on 16 or more processors so that you get a sense of how the execution will scale. If your program is named multiply. The communication trace for multiply will be recorded in a file named multiply.clog2. You will use the jumpshot tool to examine the MPI communication trace from your program. The jumpshot executable is installed in the openmpi module. When you run

jumpshot multiply.clog2

jumpshot will prompt you to convert the clog2 file to slog2 (a scalable format display format used by jumpshot). Click the convert button and then click the OK button. jumpshot will then bring up a timeline view that shows how the computation and communication unfold as your program executes. (Time proceeds left to right.) If traces for the 7500x7500 problem are too large, you can trace a smaller problem size. A manual for jumpshot is available in PDF or HTML.

Using SLURM on NOTS

While you can write your program on one of the login nodes on NOTS, you must run your experiments on the compute nodes using the SLURM job manager. You can obtain an interactive session for debugging your program by requesting an interactive session in the "interactive" queue. I recommend using queue "interactive" for development on 1 node and then submitting timing experiments to the default queue. Specifying a realistic bound for the running time of your application will accelerate your progress through the queue. The more time you ask for, the longer you will probably wait in the queue. However, if you don't ask for enough, your job may be killed before it completes if your execution time exceeds the amount you requested. Ask for enough, but not too much.

You can obtain a pair of nodes to run your job on 32 MPI ranks by requesting an interactive session using

srun --pty --export=ALL --nodes=2 --ntasks-per-node=16 --time=00:30:00 bash

See the documentation on Using NOTS under the tab "Submitting Jobs" for more information about using SLURM. Below is a sample SLURM batch script that you can launch with sbatch.

#!/bin/bash #SBATCH --job-name=multiply #SBATCH --nodes=4 #SBATCH --ntasks-per-node=2 #SBATCH --exclusive #SBATCH --time=00:30:00 #SBATCH --export=ALL module use /projects/comp422/modulefiles module load assignment3 # The command below will run your program on the number of nodes with # the number of MPI tasks per node as specified above (in this case total 8 MPI ranks). srun ./multiply 7500 # The following command lets you specify the total number of MPI ranks the program will run on. # Here your program will run on 4 MPI ranks. srun -n 4 ./multiply 7500

Save it in a file multiply.sbatch and then launch it as

sbatch multiply.sbatch

The sample batch script runs the multiply program twice with different numbers of MPI ranks. As shown above, your script can use srun several times to run multiple experiments within a single script. While the number of tasks you request with the --nodes and --ntask-per-node line of your script will be the maximum number of tasks that you can run on, you can also use srun to use a subset of the tasks that the script secures for you. As shown in the second invocation of srun, you can then use the -n option to specify the total number of MPI ranks. The total number of ranks specified using -n must be less than or equal to the number of nodes requested by your script using --nodes option times the argument you specify for --ntasks-per-node.

The queue status can be checked by running the squeue command on NOTS.

Submitting your Assignment

You will use GitHub classroom to submit your assignment.

Your submission should contain:

a directory containing the code for your MPI program and a Makefile to build the code, and
a writeup about your programs in either Word or PDF format (PDF preferred).

Your assignment MUST be submitted prior to the submission date and time specified above. That is the latest date and time that the registrar will allow work to be submitted for credit this semester.

Grading Criteria

10% Program correctness.
40% Program clarity, elegance and parallelization. The programs should be well-documented internally with comments.
10% Program scalability and performance.
40% Writeup. Your grade on the writeup includes the quality of the writing (grammar, sentence structure), whether all of the required elements are included, and the clarity of your explanations.

9 April 2020 - initial version posted.