LU Decomposition (where 'LU' stands for 'lower upper') is a classical method for transforming an N x N matrix A into the product of a lower-triangular matrix L and an upper-triangular matrix U.
In this assignment, you will develop two parallel implementations of
LU decomposition that use Gaussian elimination
to factor a dense N x N matrix into an upper-triangular matrix and a lower-triangular
matrix.
In matrix computations, pivoting involves finding the largest magnitude value in a row,
column, or both and then interchanging rows and/or columns in the
matrix for the next step in the algorithm.
The purpose of pivoting is to reduce round-off error, which enhances numerical stability.
In your assignment, you will use row pivoting, a form of
pivoting that involves interchanging rows of a trailing submatrix based on
the largest magnitude value in the current column.
To perform LU decomposition with row pivoting, you will compute a permutation matrix
P such that PA = LU.
The permutation matrix keeps track of row exchanges performed.
Below is pseudocode for a sequential implementation of LU
decomposition with row pivoting.
    inputs: a(n,n)
    outputs: π(n), l(n,n), and u(n,n)

    initialize π as a vector of length n
    initialize u as an n x n matrix with 0s below the diagonal
    initialize l as an n x n matrix with 1s on the diagonal and 0s above the diagonal
    for i = 1 to n
      π[i] = i
    for k = 1 to n
      max = 0
      for i = k to n
        if max < |a(i,k)|
          max = |a(i,k)|
          k' = i
      if max == 0
        error (singular matrix)
      swap π[k] and π[k']
      swap a(k,:) and a(k',:)
      swap l(k,1:k-1) and l(k',1:k-1)
      u(k,k) = a(k,k)
      for i = k+1 to n
        l(i,k) = a(i,k)/u(k,k)
        u(k,i) = a(k,i)
      for i = k+1 to n
        for j = k+1 to n
          a(i,j) = a(i,j) - l(i,k)*u(k,j)

Here, the vector π is a compact representation of a permutation matrix p(n,n), which is very sparse. For the ith row of p, π(i) stores the column index of the sole position that contains a 1.

You will write two shared-memory parallel programs that perform LU decomposition using row pivoting: one using the Pthreads programming model and one using OpenMP. Each LU decomposition implementation should accept two arguments: n - the size of the matrix, followed by t - the number of threads. Your programs will allocate an n x n matrix a of double-precision (64-bit) floating point variables. You should initialize the matrix with uniform random numbers computed using a suitable random number generator, such as drand48, drand48_r, or the C++11 facilities for pseudo-random number generation. (Note: if you are generating random numbers in parallel, you will need to use a reentrant random number generator and seed it differently for each thread.) Apply LU decomposition with partial pivoting to factor the matrix into an upper-triangular one and a lower-triangular one. To check your answer, compute the sum of Euclidean norms of the columns of the residual matrix PA - LU (this sum is known as the L2,1 norm). Print the value of the L2,1 norm of the residual. (It should be very small.)
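As a concrete reference for the verification step, here is a minimal C++ sketch of the L2,1 residual computation. The names are illustrative; it assumes row-major contiguous storage, 0-based indexing, and that pi[i] records which original row ended up in position i after pivoting.

    #include <cmath>

    // Sum of Euclidean norms of the columns of PA - LU (the L2,1 norm).
    // a0 is the ORIGINAL matrix (before any row swaps), l and u are the
    // computed factors, and pi[i] is the original index of the row that
    // pivoting moved into position i.
    double l21_residual(int n, const double *a0, const double *l,
                        const double *u, const int *pi) {
        double sum = 0.0;
        for (int j = 0; j < n; j++) {
            double colsq = 0.0;
            for (int i = 0; i < n; i++) {
                double r = a0[(long)pi[i] * n + j];   // (PA)(i,j)
                int kmax = i < j ? i : j;             // L is lower-, U upper-triangular
                for (int k = 0; k <= kmax; k++)
                    r -= l[(long)i * n + k] * u[(long)k * n + j];
                colsq += r * r;
            }
            sum += std::sqrt(colsq);
        }
        return sum;
    }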
The verification step need not be parallelized. Have your program time the LU decomposition phase by reading the real-time clock before and after and printing the difference.
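One simple way to read the real-time clock is sketched below; clock_gettime is POSIX, and CLOCK_MONOTONIC is immune to system clock adjustments. (omp_get_wtime is an equally good choice for the OpenMP version; lu_decompose below is a stand-in for your factorization routine.)

    #include <time.h>

    // Seconds from a monotonic real-time clock.
    static double wall_time() {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    // Usage around the LU phase:
    //   double t0 = wall_time();
    //   lu_decompose(a, n, nthreads);
    //   printf("LU decomposition: %.3f s\n", wall_time() - t0);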
The formal components of the assignment are listed below:
Use problem size n = 8000 to evaluate the performance of your implementations. If your sequential running time is too long for the interactive queue, you may base your timing measurements on n = 7000. Prepare a table that includes your timing measurements for the LU decomposition phase of your implementations on 1, 2, 4, 8, 16, and 32 threads on a node on NOTS (nots.rice.edu). Graph the parallel efficiency of your program executions (parallel efficiency on p threads is T1/(p x Tp), where T1 is the one-thread time and Tp is the p-thread time). Plot a point for each of the executions. The X axis should show the number of processors. The Y axis should show your measured parallel efficiency for the execution. Construct your plot so that the X axis of the graph intersects the Y axis at Y = 0.
Assignment 1 using CilkPlus made little use of shared data. In this assignment, however, reading and writing shared data will account for much of the execution cost. Accordingly, you should pay attention to how you lay out the data and how your parallelizations interact with your data layout. You should consider whether you want to use a contiguous layout for the array, or whether you want to represent the array as a vector of n pointers to n-element row vectors. You should explicitly consider how false sharing might arise and take appropriate steps to minimize its impact on performance; one way to do this is sketched below.
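Here is a minimal sketch of the contiguous option with rows padded to cache-line boundaries. The 64-byte line size and the function name are assumptions; the pointer-to-rows layout is the alternative to compare against.

    #include <stdlib.h>

    constexpr int CACHE_LINE = 64;                   // bytes (assumed)
    constexpr int DOUBLES_PER_LINE = CACHE_LINE / sizeof(double);

    // Allocate an n x n matrix contiguously, rounding the leading
    // dimension ld up to a whole number of cache lines so that rows
    // updated by different threads do not share a line at row boundaries.
    double *alloc_matrix(int n, int &ld) {
        ld = (n + DOUBLES_PER_LINE - 1) / DOUBLES_PER_LINE * DOUBLES_PER_LINE;
        void *p = nullptr;
        if (posix_memalign(&p, CACHE_LINE, sizeof(double) * (size_t)n * ld))
            return nullptr;                          // allocation failed
        return static_cast<double *>(p);             // element (i,j): a[i*ld + j]
    }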
Your submission should contain:
I recommend using the Intel compilers for your assignment. The names for the Intel compilers are icc for C, icpc for C++, and ifort for Fortran.
I recommend using the OMP_PROC_BIND environment variable or the OpenMP proc_bind clause for binding OpenMP worker threads to hardware threads. Some references about proc_bind: syntax, description.
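For example, a minimal sketch of the clause form; spread is one reasonable policy, close is another, so measure both on NOTS:

    #include <omp.h>
    #include <stdio.h>

    int main() {
        // proc_bind(spread) spreads the team's threads across places
        // (e.g., cores); proc_bind(close) packs them. Equivalently, set
        // OMP_PROC_BIND=spread in the environment before running.
        #pragma omp parallel proc_bind(spread)
        printf("thread %d is bound\n", omp_get_thread_num());
        return 0;
    }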
If you want to use atomic operations for synchronization, I suggest using the C++11 atomic operations. Using the C++ atomics requires some understanding of weak ordering and the C++ memory model.
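As one small example (a sketch, not a required design), here is a release/acquire handoff in which the thread that finds the pivot publishes its row index and other threads spin until they observe it:

    #include <atomic>

    // seq_cst (the default) is the safe choice; weaker orderings like
    // these require care under the C++ memory model.
    std::atomic<int> pivot_row{-1};

    void publish_pivot(int k) {
        // release: writes made before this store become visible to any
        // thread whose acquire load observes the stored value
        pivot_row.store(k, std::memory_order_release);
    }

    int await_pivot() {
        int k;
        while ((k = pivot_row.load(std::memory_order_acquire)) < 0)
            ;  // spin until the pivot row is published
        return k;
    }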
The top-level directory of your GitHub Classroom repository contains environment.sh -- a script to set up your environment with a compiler and a path to Intel's tools. Set up your environment by executing the command below:

    source environment.sh
The Intel C compiler is icc and the C++ compiler is icpc. To compile a Pthreads program with icc or icpc, you need to use the -pthread option.
To compile an OpenMP program with icc or icpc, use the -fopenmp option.
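For example (the source file names and -O2 are illustrative):

    icpc -O2 -pthread lu_pthread.cpp -o lu_pthread    # Pthreads version
    icpc -O2 -fopenmp lu_omp.cpp     -o lu_omp        # OpenMP version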
To check your program for data races with Intel's Inspector, see the Intel Inspector 2018 documentation. On a NOTS compute node, you can use the make check command in the provided Makefile to run Inspector, and make view to open Inspector's GUI and examine any data races found. You can also run Inspector in a batch script on a compute node: simply copy the command executed by make check into the batch script file. If you use Inspector with srun or sbatch, make sure that you use the --export=ALL option to ensure that environment variables from loading the module are available to the script. I highly recommend that you run Inspector on a small linear system.
When you write OpenMP parallel regions and worksharing directives, be careful to specify the data scope attribute clauses for variables used within each directive's scope. I recommend using the default(none) clause to make sure that the compiler doesn't implicitly add an incorrect sharing attribute; a sketch follows below.
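For example, here is a sketch of the trailing-submatrix update from the pseudocode with explicit scoping, assuming a, l, and u are vectors of row pointers. Under default(none), forgetting any of these clauses is a compile-time error rather than a silent race.

    // i and j are private because they are declared inside the loops;
    // everything else must be scoped explicitly under default(none).
    #pragma omp parallel for default(none) shared(a, l, u) \
                             firstprivate(k, n) schedule(static)
    for (int i = k + 1; i < n; i++)
        for (int j = k + 1; j < n; j++)
            a[i][j] -= l[i][k] * u[k][j];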
When compiling OpenMP programs with any of the Intel compilers, you must use the -fopenmp option on the compiler's command line. It arranges for automatic linking of the OpenMP runtime library.
Note: the -fopenmp option is required for both the compile step and the link step. A program built in this way will automatically use a number of threads equal to the number of cores on the node.
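Since your programs take the thread count t as a command-line argument, you should override that default explicitly. A sketch (argument checking omitted):

    #include <omp.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        int n = atoi(argv[1]);    // matrix size
        int t = atoi(argv[2]);    // number of threads
        omp_set_num_threads(t);   // overrides the one-thread-per-core default
        // ... allocate a, initialize it, time the decomposition ...
        (void)n;
        return 0;
    }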
If you have trouble getting access to interactive nodes, let me know and I'll discuss the situation with the Center for Research Computing (CRC) so we can make any adjustments necessary. (Note: the time for CRC to make adjustments is during the day, not the night before the assignment is due.)
As an alternative to relying on the defaults provided by "numactl --localalloc" for data placement, you can use primitives in libnuma to programmatically control where data is allocated. Before threads are created, you can allocate data on particular sockets using libnuma's numa_alloc_onnode.
After you create threads, you can have each thread allocate its own data in memory attached to the socket on which it is running using libnuma's numa_alloc_local.
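A minimal sketch of both calls (link with -lnuma; the function names below are libnuma's, while the block-of-rows ownership scheme is an assumption):

    #include <numa.h>

    // Before threads exist: place a block of rows on a chosen socket.
    double *alloc_on_socket(size_t bytes, int node) {
        return static_cast<double *>(numa_alloc_onnode(bytes, node));
    }

    // Inside a thread already bound to a socket: allocate locally.
    double *alloc_local(size_t bytes) {
        return static_cast<double *>(numa_alloc_local(bytes));
    }

    // Release either allocation with numa_free(ptr, bytes).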
You will find two scripts, interactive.sh and batch.sh, in your GitHub Classroom repository. To get an interactive node on NOTS, run the interactive.sh script.
The scheduler allocates NOTS on the basis of nodes, which have 16 or more cores each and 2 SMT threads per core. To use up to 32 hardware threads on a node, you need to submit a job with --cpus-per-task=32. For your scaling experiments, you can prepare each experiment as a separate batch job, or run multiple experiments in a single batch job by adding additional commands to the bottom of the batch script.
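A minimal batch-script sketch is shown below; the provided batch.sh has the correct partition and time limit for NOTS, so these lines only illustrate the structure:

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=32
    #SBATCH --export=ALL

    # one experiment per line; append more lines to run several per job
    srun ./lu_omp 8000 1
    srun ./lu_omp 8000 32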