Lab 8 - Memory Allocator Profiling

Lab goals:

Learn about the mdriver driver program for the memory allocator
Look at and understand the trace file format used by mdriver
Understand how to analyze and improve the performance of your memory allocator

Resources

The Memory Allocator Driver `mdriver`

The mdriver memory allocator driver program in your repository can be used to evaluate your memory allocator according to the allocation requests in a number of traces. The program provides many different command line options; to see a list of them, invoke the driver with the -h option, like ./mdriver -h. For now, we care only about the -v option, which causes the driver to output per-trace information. At completion, mdriver will also output the number of points awarded for each of utilization (util) and throughput (thru), as well as the overall total performance index; these different metrics will be explained below.

When you are developing your memory allocator, it may be confusing as to how mdriver is assigning points. It is possible that your profiling shows improvements, yet the driver did not give any additional points. To illustrate how the driver is assigning points, let's look at the mdriver -v output from the provided solution:

Results for mm malloc:
trace  valid  util     ops      secs   Kops
 0       yes   98%    5694  0.011953    476
 1       yes   99%    5848  0.009703    603
 2       yes   99%    6648  0.020263    328
 3       yes   99%    5380  0.016268    331
 4       yes   66%   14400  0.000117 123182
 5       yes   92%    4800  0.005985    802
 6       yes   91%    4800  0.005537    867
 7       yes   54%   12000  0.218773     55
 8       yes   47%   24000  0.397283     60
 9       yes   27%   14401  0.098391    146
10       yes   34%   14401  0.003107   4634
Total          73%  112372  0.787380    143

Perf index = 29/40 (util) + 0/60 (thru) = 29/100

Utilization

The utilization metric (labeled as util) evaluates the space efficiency of your allocator.

The utilization metric for each trace is computed as the maximum over the trace of the sum of the size of all allocated payloads (where the size of a payload is the size requested by the program on the malloc or realloc call) divided by the size of the heap.

The provided solution performs relatively well on most of the traces. The utilization score is very high for about half of the traces, and is between 27% and 66% on the other traces. This allocator therefore receives 29 out of 40 points for its utilization score. The utilization score is just the overall average utilization for the traces, then scaled by the total of 40 points possible for the utilization score (that is, 0.73 × 40 = 29.2).

Throughput

The throughput metric (labeled as thru) evaluates the time efficiency of your allocator.

The throughput metric is computed by first taking the total number of Kilo-operations performed by the allocator (i.e., the number of thousands of operations performed) divided by the allocator's total number of seconds used, to get what mdriver calls Kops, or Kilo-operations per second. To then get the throughput score, this average is divided by the memory allocator's target throughput, and then scaled by the total of 60 points possible for the throughput score. The target throughput has been set at 54,500 Kops, which is reasonably achievable using various combinations of the techniques that have been taught in this class.

The important difference between this method of computing the throughput score (i.e., dividing the total Kops by the total seconds) vs. simply averaging the allocator's individual Kops results across the set of traces, is that summing first places more weight both on traces with more operations and on traces where performance was poor.

For a concrete example of this difference using the provided allocator, if the Kops for each trace were averaged individually, then the average performance would be 13,148 Kops, which would lead to a performance score of 14/60 because it is 24.1% of the target throughput of 54,500 Kops. However, the actual averaged performance is only 143 Kops, which leads to a score of 0/60 (that is, (143 / 54,500) × 60 = 0.157).

Perf index

The performance index printed by mdriver (labeled as Perf index) is simply the sum of the utilization score (util) and the throughput score (thru). In some cases, however, you may think there is an error as the total score may seem slightly off.

Internally, the driver uses floating point numbers to represent the scores. When the utilization and throughput scores are individually displayed, they are first each rounded to the nearest integer. The total score, however, is not simply the sum of the two rounded integer scores. Rather, it is calculated by first summing the two floating point numbers, and then rounding to an integer. This difference in rounding can lead to a total that is not strictly the sum of the displayed parts.

Reference Solution Example

For another example, look at the mdriver -v output for one of the reference solutions:

Results for mm malloc:
trace  valid  util     ops      secs   Kops
 0       yes   98%    5694  0.000131  43432
 1       yes   96%    5848  0.000125  46934
 2       yes   98%    6648  0.000166  40072
 3       yes   99%    5380  0.000118  45516
 4       yes   66%   14400  0.000144  99792
 5       yes   93%    4800  0.000535   8975
 6       yes   91%    4800  0.000582   8243
 7       yes   54%   12000  0.021052    570
 8       yes   47%   24000  0.095980    250
 9       yes   29%   14401  0.000183  78522
10       yes   43%   14401  0.000102 140772
Total          74%  112372  0.119119    943

Perf index = 30/40 (util) + 1/60 (thru) = 31/100

The reference solution being evaluated here uses a single explicit free list and an intelligent realloc that over-allocates if the current block of memory cannot simply be expanded.

This reference allocator performs very poorly on traces 7 (binary-bal.rep) and 8 (binary2-bal.rep), which are among the larger traces, so the actual average performance is only 943 Kops, which leads to a score of only 1/60. If the Kops were simply straight averaged across the traces, this solution would have received 56/60, the maximum score.

The ultimate affect of the method for computing the average throughput is that your memory allocator will need to perform well on all traces in order to receive a good performance score.

You may notice that throughput scores can vary between different runs of the driver, even without changing the allocator. If this variation is making it difficult to profile your code, see the note about the shared CLEAR server machines near the end of this lab description.

Trace File Format

The trace file format used by mdriver is very simple. The first four lines of the file make up the header. For our purposes, we will ignore this (the driver ignores most of the header too). The rest of the lines in a trace are composed in one of two ways, depending on whether the operation is allocating (i.e., allocate and reallocate operations) or freeing (i.e., free operations) memory.

Allocate and reallocate operations all have the following form: op id bytes

op is a for an allocation operation and r for a reallocate operation.
id is a unique identification number for the allocation. This id can be used later in reallocate or free operations. If this is a reallocate operation, this identifier must have been used in a previous allocation operation (and possibly also used in another previous reallocate operation).
bytes is the total size of the allocation being performed.

Free operations all have the following form: op id

op is f for a free operation.
id must be an identifier used in a previous allocation operation (and possibly also used in a previous reallocate operation).

A Concrete Example

The trace file format explanation above may seem a bit abstract. Let's look at a concrete example of a trace file:

The first four lines of this example (20000, 2, 5, and 1) are the header, which we are ignoring. Step by step, in sequence, this is what the rest of this trace file means:

Create allocation #0 with a size of 512 bytes.
Create allocation #1 with a size of 128 bytes.
Reallocate allocation #0 to a new size of 640 bytes.
Free allocation #1 (the allocation of size 128 bytes)
Free allocation #0 (the allocation now of size 640 bytes)

The Test Traces

All of the traces that will be used to grade your memory allocator are located in the directory /clear/www/htdocs/comp321/test-cases/malloc. We encourage you to examine these trace files and to look at their patterns of operations. This examination can be especially useful if your allocator is not performing well on only one or a few traces.

The traces are identified simply by a trace number in the output from mdriver -v, but in this directory, each trace is identified by its filename. Here is the relationship between trace numbers and these filenames (this is configured in config.h):

amptjp-bal.rep
cccp-bal.rep
cp-decl-bal.rep
expr-bal.rep
coalescing-bal.rep
random-bal.rep
random2-bal.rep
binary-bal.rep
binary2-bal.rep
realloc-bal.rep
realloc2-bal.rep

Analyzing the Performance of the Memory Allocator

As described above, we will analyze the memory allocator performance in two parts: first, we will use profiling to analyze the time efficiency of your allocator (which affects the throughput metric), and then we will analyze its space efficiency (the utilization metric). The memory allocator we will be analyzing here is implemented in the source file mm.c and its associated header file mm.h in your repo for this lab. The rest of the source files in your repo are part of the memory allocator driver program mdriver.

Improving Throughput

We will use gprof, which was introduced previously in a COMP 222 lab, to profile and better understand how to improve the throughput of the memory allocator.

Note that the overhead of profiling might slow down your code significantly. So you should be careful to disable profiling once you have optimized your code.

The reason you might not see all the functions from mm.c in your profile is because the compiler automatically inlines some of the functions in your code. Inlining refers to the compiler inserting the machine instructions for the function being called into the code of the calling procedure at the place where the inlined code is being called, instead of actually calling that function there normally. The compiler performs this optimization in order to avoid the overhead of calling small functions (the overhead of getting in and out of the function), but this optimization can sometimes be misleading or may be confusing when analyzing the profile results. So, it is a good idea to turn off inlining while profiling (see below in Exercise #3).

If, for testing, you want to run individual traces with mdriver, one way to do that is to copy the traces from their location on CLEAR into your lab repository directory.

cd your-lab-repo-directory
cp /clear/www/htdocs/comp321/test-cases/malloc/* .

where your-lab-repo-directory is the pathname to your repo directory for this lab. (Note the . at the end of this command line, meaning the current directory.) You can now use the -f option on mdriver to run just a single trace:

./mdriver -v -f trace-file-name

where trace-file-name is the name of the specific trace file you want to use.

Using explicit free list(s), you can solve some of the problems. Below we show again the sample output from the same reference solution mentioned above, using a single explicit free list and using an intelligent realloc that over-allocates if the current block of memory cannot simply be expanded:

Results for mm malloc:
trace  valid  util     ops      secs   Kops
 0       yes   98%    5694  0.000131  43432
 1       yes   96%    5848  0.000125  46934
 2       yes   98%    6648  0.000166  40072
 3       yes   99%    5380  0.000118  45516
 4       yes   66%   14400  0.000144  99792
 5       yes   93%    4800  0.000535   8975
 6       yes   91%    4800  0.000582   8243
 7       yes   54%   12000  0.021052    570
 8       yes   47%   24000  0.095980    250
 9       yes   29%   14401  0.000183  78522
10       yes   43%   14401  0.000102 140772
Total          74%  112372  0.119119    943

Perf index = 30/40 (util) + 1/60 (thru) = 31/100

Note that there is a significant improvement in throughput over the provided solution (shown previously) for most of the traces. In the following exercises, you will examine trace 8 (binary2-bal.rep), on which this reference allocator achieved little improvement, and trace 9 (realloc-bal.rep), on which it achieved significant improvement.

Improving Utilization

You can see from the sample output above from our reference solution that three of the traces result in a utilization of less than 50%. To help improve utilization, you can analyze these traces and try to find ways to optimize your memory allocator accordingly.

For example, consider trace 7 (binary-bal.rep). It has the following allocation/deallocation pattern:

Alternate allocations for many blocks of sizes 64 and 448 bytes.
Next, the blocks of size 64 bytes are freed.
Then, allocate many blocks of size 512
Finally, all of the blocks are freed.

A naive memory allocator would end up wasting all of the 64-byte free blocks while trying to allocate 512-byte blocks. The reason is that the 64-byte free blocks are not contiguous and hence cannot be coalesced. You can avoid this by using some form of the simple segregated storage (discussed in lecture and the textbook).

Note, however, that you will find that some of your optimizations will result in a trade-off, in that they will help improve the performance of some traces while reducing the performance of others. So, be careful while performing trace-specific optimizations.

A Note about the Shared CLEAR9 Server Machines

If the throughput of your allocator is varying wildly from one run to the next, it is likely that the CLEAR9 machine that you are using is under a heavy load from other users. The CLEAR9 machines are shared servers, used by many other COMP 321 students as well as of course users from other classes as well.

You can use the Unix command top to check the load average on the machine you are using. If the load average (at the end of the first line of the display) is very high, particularly if it is close to the number of processors on the machine (20 on CLEAR9), then you might want to log off of the CLEAR9 server machine you are on and try logging in to a different CLEAR9 server (type q to exit the top command display).

Each of the different CLEAR9 server machines has a specific unique name. The current list of CLEAR9 servers is below:

agate.clear9.rice.edu
jade.clear9.rice.edu
onyx.clear9.rice.edu
pyrite.clear9.rice.edu

Logging in to the generic name ssh.clear9.rice.edu automatically selects one of these server machines for you, but that selection is not based on the current load averages of the servers. You can, alternatively, log in to any specific one of these different CLEAR9 server machines by using its unique name (above) with your ssh client, rather than just using the generic name ssh.clear9.rice.edu with it.

Submission

For this lab and all labs in this course, at the end of your scheduled lab session, you should "submit" your work (regardless of how much your completed) using git push.

You should always include in your repo all files that you created in the lab, other than those (such as the output of the compiler) generated automatically from other files. Think of this as including just the files necessary to backup and to be able to recreate your work. Do not simply add all files to your repo; for example, do not simply type something like git add . or git add *, and never add any object files, executable files, or core files to your repo. Only add the actual source files (for example, .c and .h files) that should be in your repo.