Matrix multiplication is a binary operation performed on a pair of matrices A, of dimension M x N, and B, of dimension N x P, resulting in a matrix C of dimension M x P. Each element of C is the sum of the element-wise products of a row of A and a column of B. Two popular methods to perform parallel matrix multiplication on a distributed-memory system, namely Cannon's algorithm and 3D matrix multiplication, were described in class.
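In other words, C[i][j] is the sum over k of A[i][k] * B[k][j]. A minimal serial reference in C, assuming row-major storage (the function name is only illustrative):

    /* Serial reference: C (M x P) = A (M x N) * B (N x P), row-major layout. */
    void matmul_serial(int M, int N, int P,
                       const double *A, const double *B, double *C)
    {
        for (int i = 0; i < M; i++) {
            for (int j = 0; j < P; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += A[i * N + k] * B[k * P + j];
                C[i * P + j] = sum;
            }
        }
    }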
For this assignment, you will implement the 2.5D version of matrix multiplication using MPI. You should refer to the technical report entitled Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms by Solomonik et al. for complete information on this algorithm. Communication-avoiding algorithms are a hot topic; their development was an accomplishment highlighted in President Obama's FY 2012 Department of Energy Budget Request to Congress.
Summary of the 2.5D algorithm to get you started: the 2.5D matrix multiplication algorithm, originating from UC Berkeley, reduces the communication bandwidth consumed and the latency exposed in matrix multiplication by maintaining a total of c replicas of the input matrices A and B on a 3D grid of P processors of dimension sqrt(P/c) x sqrt(P/c) x c. Initially, a single copy of the A and B matrices is broken into chunks and distributed among the processors on the front face of this grid. Through a series of broadcasts and point-to-point communications, these chunks are then replicated across the remaining processors. Each processor in the grid performs a matrix multiplication on the chunks of A and B it receives, and a final reduction phase produces the result matrix C on the front face of the grid.
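One possible way to set up this processor grid is with an MPI Cartesian communicator. The sketch below assumes c divides P and that P/c is a perfect square; the function and variable names are illustrative, not part of the assignment:

    /* Sketch: build a sqrt(P/c) x sqrt(P/c) x c process grid.  Assumes c
       divides P and that P/c is a perfect square. */
    #include <math.h>
    #include <mpi.h>

    void build_grid(int c, MPI_Comm *grid_comm, int coords[3])
    {
        int P, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &P);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int q = (int) lround(sqrt((double) P / c));   /* sqrt(P/c) */
        int dims[3]    = { q, q, c };
        int periods[3] = { 1, 1, 0 };    /* wraparound in i and j for shifts */

        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, grid_comm);
        MPI_Cart_coords(*grid_comm, rank, 3, coords);  /* my (i, j, k) position */
    }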
You can write your program in C, C++, or Fortran with MPI communication. You should use MPI collective communication where appropriate.
To evaluate the performance of your implementation, perform the following experiments:
As with your previous assignment, randomly generate the input matrices. I recommend that you use drand48; see its man page for details on how to use it.
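A sketch of one way to fill a local buffer using drand48 (seeding each rank differently, as suggested in the comment, is just one option):

    #include <stdlib.h>

    /* Fill an n-element buffer with uniform random values in [0, 1). */
    void random_fill(double *buf, long n, long seed)
    {
        srand48(seed);      /* e.g., seed with a function of the MPI rank */
        for (long i = 0; i < n; i++)
            buf[i] = drand48();
    }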
You will run your MPI program on NOTS's compute nodes and measure its performance and scalability. Instructions on how to do this with SLURM are given later on this page. Your program should measure the elapsed time of your multiplication phase only, which means DO NOT include the matrix initialization or answer verification. You can use the code here as a guide for constructing a verification test that checks that your matrix multiply is computing the correct answer. You may include any of the code in the attached file in your assignment.
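A minimal timing sketch using MPI_Wtime around the multiplication phase only (the barrier and the max-reduction are common conventions, not requirements):

    MPI_Barrier(MPI_COMM_WORLD);              /* line the ranks up before timing */
    double t_start = MPI_Wtime();

    /* ... multiplication phase only: no initialization, no verification ... */

    double t_local = MPI_Wtime() - t_start;
    double t_elapsed;                         /* slowest rank defines the elapsed time */
    MPI_Reduce(&t_local, &t_elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);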
As part of your experimentation, you are required to collect a message trace and visualize it with jumpshot. (Details about collecting traces and visualizing them using jumpshot on NOTS are provided in the next section.)
To collect more detailed information about the compute performance of your application, you can use hardware performance counters to measure machine instructions completed, cycles, cache misses, and more. See the HPCToolkit User's Manual for additional information.
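For example, HPCToolkit's hpcrun can sample such counters while your program runs under srun; the event names below are assumptions, and the events actually available depend on the machine, so consult the manual:

    srun -n 32 hpcrun -e PAPI_TOT_CYC -e PAPI_TOT_INS ./multiply 4096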
Write a document that describes how your program works. Along with describing your performance results as required above, please answer the following questions:
Include jumpshot's timeline visualization of your program, along with a description of what you learned from it about your parallelization.
General information about MPI and links to online documentation can be found on the COMP 422 Resources Page. For helpful details explaining how to develop and debug your programs, see the Getting Started on NOTS FAQ. This FAQ contains information about how to compile your MPI program, run using srun, and submit your job to the batch queueing system using SLURM.
There are several MPI implementations on NOTS. We will be using OpenMPI along with the gcc compilers. To load the modules necessary to prepare your environment, use the following command:
module load assignment3
Compile and link your program using mpiXX, where XX is cc, cxx, or fort for C, C++, or Fortran code, respectively. The mpiXX script invokes the gcc, g++, or gfortran compiler, provides the include path for the MPI header files, and automatically adds the proper MPI libraries when it links your executable.
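For example, a C implementation might be compiled as follows (the source and executable names are placeholders):

    module load assignment3
    mpicc -O2 -o multiply multiply.c -lm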
I recommend editing your code on a login node to avoid the surprise of having your editor killed when your interactive session on a compute node expires. I also recommend compiling your code on the login nodes rather than the compute nodes. To run an MPI program, however, you need to use the compute nodes. See the information below about using SLURM to run your job on a set of compute nodes.
To monitor the message-passing communication of your MPI programs, you will use the MPE tools. These tools have been added to your PATH by the assignment3 module. To collect a communication trace, compile and link your MPI program with mpecc instead of mpicc. This script is like mpicc, but it links in a library that provides support for message tracing. Use the -mpilog option when you compile with mpecc to turn on message tracing. When you do that, your executable will record a trace of MPI communication when it is executed. I recommend that you collect a communication trace on 16 or more processors so that you get a sense of how the execution will scale.
If your program is named multiply, the communication trace will be recorded in a file named multiply.clog2.
You will use the jumpshot tool to examine the MPI communication trace from your program. The jumpshot executable is installed in the openmpi module. Documentation for jumpshot is available in PDF or HTML.
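Putting these steps together, a sketch of the tracing workflow for a hypothetical C program named multiply (the srun options and matrix-size argument are placeholders):

    mpecc -mpilog -O2 -o multiply multiply.c -lm
    srun -n 32 ./multiply 4096       # writes multiply.clog2
    jumpshot multiply.clog2          # may prompt to convert the trace to slog2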
You can obtain a pair of nodes to run your job on 32 MPI ranks by requesting an interactive session.
See the documentation on Using NOTS under the tab "Submitting Jobs" for more information about using SLURM. Below is a sample SLURM batch script that you can launch with sbatch.
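A minimal sketch of such a script follows; the partition name, time limit, matrix size, and executable name are placeholders that you should adjust for your own runs:

    #!/bin/bash
    #SBATCH --job-name=matmul25d
    #SBATCH --partition=commons         # placeholder partition name
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=16
    #SBATCH --time=00:30:00

    module load assignment3

    # first invocation: run on all 32 tasks in the allocation
    srun ./multiply 4096

    # second invocation: run on a subset of the tasks
    srun -n 16 ./multiply 4096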
You can invoke srun several times to run multiple experiments within a single script. While the number of tasks you request with the --nodes and --ntasks-per-node lines of your script is the maximum number of tasks that you can run on, you can also use srun to run on a subset of the tasks that the script secures for you. As shown in the second invocation of srun, you can use the -n option to specify the total number of MPI ranks. The total number of ranks specified using -n must be less than or equal to the number of nodes requested by your script with the --nodes option times the argument you specify for --ntasks-per-node.
The queue status can be checked by running the squeue command on NOTS.
Your submission should contain:
9 April 2020 - initial version posted.