Lab 9 - Memory Allocator Profiling

Lab goals:


The Memory Allocator Driver mdriver

The mdriver memory allocator driver program in your repository can be used to evaluate your memory allocator according to the allocation requests in a number of traces. The program provides many different command line options; to see a list of them, invoke the driver with the -h option, like ./mdriver -h.  For now, we care only about the -v option, which causes the driver to output per-trace information. At completion, mdriver will also output the number of points awarded for each of utilization (util) and throughput (thru), as well as the overall total performance index; these different metrics will be explained below.

When you are developing your memory allocator, it may be confusing as to how mdriver is assigning points. It is possible that your profiling shows improvements, yet the driver did not give any additional points. To illustrate how the driver is assigning points, let's look at the  mdriver -v  output from the provided solution:

Results for mm malloc:
trace  valid  util     ops      secs   Kops
 0       yes   98%    5694  0.012556    453
 1       yes   99%    5848  0.009918    590
 2       yes   99%    6648  0.021107    315
 3       yes   99%    5380  0.016953    317
 4       yes   66%   14400  0.000096 150628
 5       yes   92%    4800  0.006145    781
 6       yes   91%    4800  0.005828    824
 7       yes   54%   12000  0.200493     60
 8       yes   47%   24000  0.330035     73
 9       yes   27%   14401  0.094657    152
10       yes   34%   14401  0.002364   6093
Total          73%  112372  0.700151    160

Perf index = 29/40 (util) + 0/60 (thru) = 30/100

Utilization

The utilization metric (labeled as util) evaluates the space efficiency of your allocator.

The utilization metric for each trace is computed as the maximum over the trace of the sum of the size of all allocated payloads (where the size of a payload is the size requested by the program on the malloc or realloc call) divided by the size of the heap.

The provided solution performs relatively well on most of the traces. The utilization score is very high for about half of the traces, and is between 27% and 66% on the other traces. This allocator therefore receives 29 out of 40 points for its utilization score. The utilization score is just the overall average utilization for the traces, then scaled by the total of 40 points possible for the utilization score (that is, 0.73 × 40 = 29.2).

Throughput

The throughput metric (labeled as thru) evaluates the time efficiency of your allocator.

The throughput metric is computed by first taking the total number of Kilo-operations performed by the allocator (i.e., the number of thousands of operations performed) divided by the allocator's total number of seconds used, to get what mdriver calls Kops, or Kilo-operations per second. To then get the throughput score, this average is divided by the memory allocator's target throughput, and then scaled by the total of 60 points possible for the throughput score. The target throughput has been set at 58,500 Kops, which is reasonably achievable using various combinations of the techniques that have been taught in this class.

The important difference between this method of computing the throughput score (i.e., dividing the total Kops by the total seconds) vs. simply averaging the allocator's individual Kops results across the set of traces, is that summing first places more weight both on traces with more operations and on traces where performance was poor.

For a concrete example of this difference using the provided allocator, if the Kops for each trace were averaged individually, then the average performance would be 14,571 Kops, which would lead to a performance score of 15/60 because it is 24.9% of the target throughput of 58,500 Kops. However, the actual averaged performance is only 160 Kops, which leads to a score of 0/60 (that is, (160 / 58,500) × 60 = 0.164).

Perf index

The performance index printed by mdriver (labeled as Perf index) is simply the sum of the utilization score (util) and the throughput score (thru). But in the output above for the provided solution, the memory allocator scores 29/40 for utilization and 0/60 for throughput, yet it scores 30/100 overall. It may seem as though there is a bug that causes the total performance index to not be the correct sum. However, this is not the case.

Internally, the driver uses floating point numbers to represent the scores. When the utilization and throughput scores are individually displayed, they are first each rounded to the nearest integer. The total score, however, is not simply the sum of the two rounded integer scores. Rather, it is calculated by first summing the two floating point numbers, and then rounding to an integer. This difference in rounding is the cause of the apparent arithmetic error.

Reference Solution Example

For another example, look at the  mdriver -v  output for one of the reference solutions:

Results for mm malloc:
trace  valid  util     ops      secs   Kops
 0       yes   98%    5694  0.000114  49991
 1       yes   96%    5848  0.000112  52168
 2       yes   98%    6648  0.000140  47622
 3       yes   99%    5380  0.000107  50422
 4       yes   66%   14400  0.000115 125326
 5       yes   93%    4800  0.000416  11530
 6       yes   91%    4800  0.000468  10248
 7       yes   54%   12000  0.025115    478
 8       yes   47%   24000  0.082298    292
 9       yes   29%   14401  0.000155  92790
10       yes   43%   14401  0.000078 183686
Total          74%  112372  0.109119   1030

Perf index = 30/40 (util) + 1/60 (thru) = 31/100

The reference solution being evaluated here uses a single explicit free list and an intelligent realloc that over-allocates if the current block of memory cannot simply be expanded. This is most similar to method 2 that is described in the lecture notes.

This reference allocator performs very poorly on traces 7 (binary-bal.rep) and 8 (binary2-bal.rep), which are among the larger traces, so the actual average performance is only 1030 Kops, which leads to a score of only 1/60. If the Kops were simply straight averaged across the traces, this solution would have received 60/60, the maximum score.

The ultimate affect of the method for computing the average throughput is that your memory allocator will need to perform well on all traces in order to receive a good performance score.

You may notice that throughput scores can vary between different runs of the driver, even without changing the allocator. If this variation is making it difficult to profile your code, see the note about the shared CLEAR server machines near the end of this lab description.


Trace File Format

The trace file format used by mdriver is very simple. The first four lines of the file make up the header. For our purposes, we will ignore this (the driver ignores most of the header too). The rest of the lines in a trace are composed in one of two ways, depending on whether the operation is allocating (i.e., allocate and reallocate operations) or freeing (i.e., free operations) memory.

Allocate and reallocate operations all have the following form:  op  id  bytes

Free operations all have the following form:  op  id

A Concrete Example

The trace file format explanation above may seem a bit abstract. Let's look at a concrete example of a trace file:

    20000
    2
    5
    1
    a 0 512
    a 1 128
    r 0 640
    f 1
    f 0

The first four lines of this example (20000, 2, 5, and 1) are the header, which we are ignoring. Step by step, in sequence, this is what the rest of this trace file means:

  1. Create allocation #0 with a size of 512 bytes.
  2. Create allocation #1 with a size of 128 bytes.
  3. Reallocate allocation #0 to a new size of 640 bytes.
  4. Free allocation #1 (the allocation of size 128 bytes)
  5. Free allocation #0 (the allocation now of size 640 bytes)

The Test Traces

All of the traces that will be used to grade your memory allocator are located in the directory /clear/www/htdocs/comp321/test-cases/malloc. We encourage you to examine these trace files and to look at their patterns of operations. This examination can be especially useful if your allocator is not performing well on only one or a few traces.

The traces are identified simply by a trace number in the output from mdriver -v, but in this directory, each trace is identified by its filename. Here is the relationship between trace numbers and these filenames (this is configured in config.h):

  1. amptjp-bal.rep
  2. cccp-bal.rep
  3. cp-decl-bal.rep
  4. expr-bal.rep
  5. coalescing-bal.rep
  6. random-bal.rep
  7. random2-bal.rep
  8. binary-bal.rep
  9. binary2-bal.rep
  10. realloc-bal.rep
  11. realloc2-bal.rep

Analyzing the Performance of the Memory Allocator

As described above, we will analyze the memory allocator performance in two parts: first, we will use profiling to analyze the time efficiency of your allocator (which affects the throughput metric), and then we will analyze its space efficiency (the utilization metric). The memory allocator we will be analyzing here is implemented in the source file mm.c and its associated header file mm.h in your repo for this lab. The rest of the source files in your repo are part of the memory allocator driver program mdriver.

Improving Throughput

We will use gprof, which was introduced previously in the lab on performance profiling, to profile and better understand how to improve the throughput of the memory allocator.

Note that the overhead of profiling might slow down your code significantly. So you should be careful to disable profiling once you have optimized your code.

The reason you might not see all the functions from mm.c in your profile is because the compiler automatically inlines some of the functions in your code. Inlining refers to the compiler inserting the machine instructions for the function being called into the code of the calling procedure at the place where the inlined code is being called, instead of actually calling that function there normally. The compiler performs this optimization in order to avoid the overhead of calling small functions (the overhead of getting in and out of the function), but this optimization can sometimes be misleading or may be confusing when analyzing the profile results. So, it is a good idea to turn off inlining while profiling (see below in Exercise #3).

If, for testing, you want to run individual traces with mdriver, one way to do that is to copy the traces from their location on CLEAR into your lab repository directory.

cd  your-lab-repo-directory
cp /clear/www/htdocs/comp321/test-cases/malloc/* .

where  your-lab-repo-directory  is the pathname to your repo directory for this lab. (Note the . at the end of this command line, meaning the current directory.) You can now use the -f option on mdriver to run just a single trace:

./mdriver -v -f  trace-file-name

where  trace-file-name  is the name of the specific trace file you want to use.

Using explicit free list(s), you can solve some of the problems. Below we show again the sample output from the same reference solution mentioned above, using a single explicit free list and using an intelligent realloc that over-allocates if the current block of memory cannot simply be expanded:

Results for mm malloc:
trace  valid  util     ops      secs   Kops
 0       yes   98%    5694  0.000114  49991
 1       yes   96%    5848  0.000112  52168
 2       yes   98%    6648  0.000140  47622
 3       yes   99%    5380  0.000107  50422
 4       yes   66%   14400  0.000115 125326
 5       yes   93%    4800  0.000416  11530
 6       yes   91%    4800  0.000468  10248
 7       yes   54%   12000  0.025115    478
 8       yes   47%   24000  0.082298    292
 9       yes   29%   14401  0.000155  92790
10       yes   43%   14401  0.000078 183686
Total          74%  112372  0.109119   1030

Perf index = 30/40 (util) + 1/60 (thru) = 31/100

Note that there is a significant improvement in throughput over the provided solution (shown previously) for most of the traces. In the following exercises, you will examine trace 8 (binary2-bal.rep), on which this reference allocator achieved little improvement, and trace 9 (realloc-bal.rep), on which it achieved significant improvement.

Improving Utilization

You can see from the sample output above from our reference solution that three of the traces result in a utilization of less than 50%. To help improve utilization, you can analyze these traces and try to find ways to optimize your memory allocator accordingly.

For example, consider trace 7 (binary-bal.rep). It has the following allocation/deallocation pattern:

A naive memory allocator would end up wasting all of the 64-byte free blocks while trying to allocate 512-byte blocks. The reason is that the 64-byte free blocks are not contiguous and hence cannot be coalesced. You can avoid this by using some form of the simple segregated storage (discussed in lecture and the textbook).

Note, however, that you will find that some of your optimizations will result in a trade-off, in that they will help improve the performance of some traces while reducing the performance of others. So, be careful while performing trace-specific optimizations.


A Note about the Shared CLEAR Server Machines

If the throughput of your allocator is varying wildly from one run to the next, it is likely that the CLEAR machine that you are using is under a heavy load from other users. The CLEAR machines are shared servers, used by many other COMP 321 students as well as of course users from other classes as well.

You can use the Unix command  top  to check the load average on the machine you are using. If the load average (at the end of the first line of the display) is very high, particularly if it is close to the number of processors on the machine (20 on CLEAR), then you might want to log off of the CLEAR server machine you are on and try logging in to a different CLEAR server (type  q  to exit the top command display).

Each of the different CLEAR server machines has a specific unique name. The current list of CLEAR servers is below:

Logging in to the generic name ssh.clear.rice.edu automatically selects one of these server machines for you, but that selection is not based on the current load averages of the servers. You can, alternatively, log in to any specific one of these different CLEAR server machines by using its unique name (above) with your ssh client, rather than just using the generic name ssh.clear.rice.edu with it. Specifically:


GitHub Repository for This Lab

To obtain your private repo for this lab, please point your browser to this link for the starter code for the lab. Follow the same steps as for previous labs and assignments to create your repository on GitHub and then to clone it onto CLEAR. The directory for your repository for this lab will be

lab-9-memory-allocator-profiling-name

where name is your GitHub userid.


What to Turn In

Unlike previous labs, this week you will not be turning in code. Rather, this week you should create and edit a new plain text file named ANSWERS.txt in which you give brief answers to each of the questions posed in the README.md file for the lab. Please read each of the exercises fully and pay careful attention to answering in your ANSWERS.txt file all of the questions in the exercises.


Submission

Be sure to git push your new ANSWERS.txt file for this lab before 11:55 PM tonight, to get credit for this lab. Be sure that your ANSWERS.txt file contains your answers to all of the questions in the README.md file.