COMP 422 Assignment 2

COMP 422

Parallel Computing

Spring 2020

Assignment 2

Complete by the end of the spring semester: May 6 @5pm

Parallel LU Decomposition

LU Decomposition (where 'LU' stands for 'lower upper') is a classical method for transforming an N x N matrix A into the product of a lower-triangular matrix L and an upper-triangular matrix U,

You will use a process known as Gaussian elimination to create an LU decomposition of a square matrix. By performing elementary row operations, Gaussian elimination transforms a square matrix A into an equivalent upper-triangular matrix U. The lower-triangular matrix L consists of the row multipliers used in Gaussian elimination. A description of LU decomposition can be found on Mathworld.

In this assignment, you will develop two parallel implementations of LU decomposition that use Gaussian elimination to factor a dense N x N matrix into an upper-triangular one and a lower-triangular one. In matrix computations, pivoting involves finding the largest magnitude value in a row, column, or both and then interchanging rows and/or columns in the matrix for the next step in the algorithm. The purpose of pivoting is to reduce round-off error, which enhances numerical stability. In your assignment, you will use row pivoting, a form of pivoting involves interchanging rows of a trailing submatrix based on the largest value in the current column. To perform LU decomposition with row pivoting, you will compute a permutation matrix P such that PA = LU. The permutation matrix keeps track of row exchanges performed. Below is pseudocode for a sequential implementation of LU decomposition with row pivoting.

    inputs: a(n,n)
    outputs: π(n), l(n,n), and u(n,n)

    initialize π as a vector of length n
    initialize u as an n x n matrix with 0s below the diagonal
    initialize l as an n x n matrix with 1s on the diagonal and 0s above the diagonal
    for i = 1 to n
      π[i] = i
    for k = 1 to n
      max = 0
      for i = k to n
        if max < |a(i,k)|
          max = |a(i,k)|
          k' = i
      if max == 0
        error (singular matrix)
      swap π[k] and π[k']
      swap a(k,:) and a(k',:)
      swap l(k,1:k-1) and l(k',1:k-1)
      u(k,k) = a(k,k)
      for i = k+1 to n
        l(i,k) = a(i,k)/u(k,k)
        u(k,i) = a(k,i)
      for i = k+1 to n
        for j = k+1 to n
          a(i,j) = a(i,j) - l(i,k)*u(k,j)
          
    Here, the vector π is a compact representation of a permutation matrix p(n,n), 
    which is very sparse. For the ith row of p, π(i) stores the column index of
    the sole position that contains a 1.

You will write two shared-memory parallel programs that perform LU decomposition using row pivoting. You will develop one solution using the Pthreads programming model and one using OpenMP. Each LU decomposition implementation should accept two arguments: n - the size of a matrix, followed by t - the number of threads. Your programs will allocate an n x n matrix a of double precision (64-bit) floating point variables. You should initialize the matrix with uniform random numbers computed using a suitable random number generator, such as drand48, drand48_r, or the C++11 facilities for pseudo-random number generation. (Note: if you are generating random numbers in parallel, you will need to use a reentrant random number generator and seed the random number generator for each thread differently.) Apply LU decomposition with partial pivoting to factor the matrix into an upper-triangular one and a lower-triangular one. To check your answer, compute the sum of Euclidean norms of the columns of the residual matrix (this sum is known as the L2,1 norm) computed as PA-LU. Print the value of the L2,1 norm of the residual. (It should be very small.)

The verification step need not be parallelized. Have your program time the LU decomposition phase by reading the real-time clock before and after and printing the difference.

The formal components of the assignment are listed below:

Write a shared-memory parallel program that uses OpenMP to perform LU decomposition with partial pivoting.

Write a shared-memory parallel program that uses Pthreads to perform LU decomposition with partial pivoting.

Write a document that describes how your programs work. This document should not include your programs, though it may include figures containing pseudo-code that sketch the key elements of your parallelization strategy for each implementation. Explain how your program partitions the data, work and exploits parallelism. Justify your implementation choices. Explain how the parallel work is synchronized.
Use problem size n = 8000 to evaluate the performance of your implementations. If your sequential running time is too long for the interactive queue, you may base your timing measurements on n=7000. Prepare a table that includes your timing measurements for the LU decomposition phase of your implementations on 1, 2, 4, 8, 16, and 32 threads on a node on NOTS (nots.rice.edu). Graph of the parallel efficiency of your program executions. Plot a point for each of the executions. The x axis should show the number of processors. The Y axis should show your measured parallel efficiency for the execution. Construct your plot so that the X axis of the graph intersects the Y axis at Y=0.

Assignment 1 using CilkPlus made little use of shared data. However, in this assignment, reading and writing shared data will account for much of the execution cost. Accordingly, you should pay attention to how you lay out the data and how your parallelizations interact with your data layout. You should consider whether you want to use a contiguous layout for the array, or whether you want to represent the array as a vector, of n pointers to n-element data vectors. You should explicitly consider how false sharing might arise and take appropriate steps to minimize its impact on performance.

GitHub Classroom Link

Obtain a copy of the GitHub classroom template repository for the assignment at the following link: https://classroom.github.com/a/qO5HDPsJ

Submitting your Assignment

Your submission should contain:

a directory containing the code for your pthreads program and a Makefile to build the code,
a directory containing the code for your OpenMP program and a Makefile to build the code, and
a writeup about your programs in either Word or PDF format (PDF preferred).

You will submit your assignment by commiting your source files, Makefile, and report to your GitHub Classroom repository. If you have trouble using GitHub and come up against the deadline for submitting your assignment, email the instructor a tar file containing your source code and your report. Your assignment should be submitted prior to the assignment deadline noted at the top of this page or it will be subject to the policy on late work.

Grading Criteria

10% Program correctness.
40% Program clarity, elegance and parallelization. The programs should be well-documented internally with comments.
10% Program scalability and performance.
40% Writeup. Your grade on the writeup includes the quality of the writing (grammar, sentence structure), whether all of the required elements are included, and the clarity of your explanations.

Suggestions

Writing the OpenMP program should be fairly straightforward using directives. I recommend that you first write the OpenMP program before working trying the PThreads part of the assignment. You may write your programs in C, C++, or Fortran.

I recommend using the Intel compilers for your assignment. The names for the Intel compilers are icc for C, icpc for C++, and ifort for Fortran.

I recommend using the OMP_PROC_BIND environment variable or the OpenMP proc_bind clause for binding OpenMP worker threads to hardware threads. Some references about proc_bind: syntax, description.

Atomic Operations

If you want to use atomic operations for synchronization, I suggest using the C++11 atomic operations. Using the C++ atomics requires some understanding of weak ordering and the C++ memory model.

Using Intel Tools on NOTS

The top-level directory of your GitHub Classroom repository contains environment.sh -- a script to set up your environment with a compiler a and path to Intel's tools. Set up your environment by executing the command below:

source environment.sh

The Intel C compiler is icc and the C++ compiler is icpc. To compile a Pthreads program with icc or icpc, you need to use the -pthread option.

To compile an OpenMP program with icc or icpc, use the -fopenmp option.

To check your program for data races using Intel's Inspector, see the Intel Inspector 2018 documentation. To check your program for data races using Intel's Inspector on a NOTS compute node, you can use the make check command in the provided Makefile to run Intel's Inspector. Use make view to use Intel's Inspector GUI to view data races found (if any). You can use Intel's inspector in a batch script on a compute node. Simply copy the command executed by make check into a batch script file. If you use Intel's Inspector with srun or sbatch, make sure that you use the -export=ALL option to ensure that environment variables from loading the module are available to the script. I highly recommend that you run Inspector using a small linear system.

OpenMP

You can find information about OpenMP in several places. Lawrence Livermore National Laboratory has posted a good OpenMP tutorial. Complete information about OpenMP can be found in the OpenMP 4.5 specification. The OpenMP 4.5 standard was not intended to be read by application developers. A better reference for developers is the OpenMP Application Programming Interface Examples Version 4.5.0.

When you write OpenMP parallel regions and worksharing directives, be careful to specify the data scope attribute clauses for variables used within each directive's scope. I recommend using the data scoping directive default(none) to make sure that the compiler doesn't implicitly add an incorrect sharing directive.

Compiling OpenMP Programs with the Intel Compiler

When compiling OpenMP programs with any of the Intel compilers, you must use the -fopenmp option on the compiler's command line. It arranges for automatic linking of the OpenMP runtime library.

Note: the -fopenmp option is required for both the compile step and the link step. A program built in this way will automatically use a number of threads equal to the number of cores on the node.

Pthreads

The command and "man pthreads" will provide you some additional information about using Pthreads. Your Pthread application will need -lpthread at the end of the link line.

Interactive Development

Please run your jobs in nodes that you obtain from the scheduler queue rather than on the login nodes. NOTS uses SLURM for job submission. You should use the interactive queue for debugging. When you are testing and debugging your code, you can request

srun --pty --export=ALL --nodes=1 --ntasks-per-node=1 --cpus-per-task=32 --time=00:30:00 bash This will give you a single compute node (consisting of 2 sockets) with 16 or more cores that will support up to 32 threads.

If you have trouble getting access to interactive nodes, let me know and I'll discuss the situation with the Center for Research Computing (CRC) so we can make any adjustments necessary. (Note: the time for CRC to make adjustments is during the day, not the night before the assignment is due.)

Managing Threads and Data on NOTS

On NOTS, memory is partitioned among the 2 sockets on a node. Memory attached to a remote socket is farther away (slower to access) than memory attached to a local socket. You can manage the Non-Uniform Memory Access (NUMA) characteristics of nodes using the NUMA control library. See "man numactl" for more information. Some brief information is included below.

As an alternative to relying on defaults provided by "numactl --localalloc" for data placement, you can use primitives in libnuma to programmatically control where data is allocated. Before threads are created, you can allocate data in particular sockets before using libnuma's

void *numa_alloc_onnode(size_t size, int node);

After you create threads, you can have each thread allocate its own data in memory attached to the socket in which it is running using libnuma's

void *numa_alloc_local(size_t size); For info about the libnuma interface, documentation is available both on a web page, or in PDF.

Using SLURM

You will find two scripts interactive.sh and batch.sh in your github classroom repository. To get an interactive node on NOTS, use the following command:

source interactive.sh

To submit a batch job on NOTS, use the following command:

sbatch < batch.sh

Scaling Experiments

The scheduler allocates NOTS on the basis of nodes, which have 16 or more cores each and 2 SMT threads per core. To use up to 32 hardware threads on the node, you need to submit a job with --cpus-per-task=32. For your scaling experiments, you can can prepare each experiment as a separate batch job, or run multiple experiments in a single batch job by adding additional commands to the bottom of the batch script.

25 February 2020 - initial version