Lab 9 - Memory Allocator Profiling

Lab goals:

  * Understand how the mdriver memory allocator driver computes the utilization, throughput, and overall performance scores.
  * Use gprof to profile a memory allocator and identify where its time is being spent.
  * Analyze trace files to understand their allocation patterns and improve your allocator's utilization and throughput.

GitHub Repository for This Lab

To obtain your private repo for this lab, please point your browser to this starting link.
Follow the protocol from previous labs and assignments to get your repository link and clone it onto CLEAR. The repository for this lab is
  lab-9-memory-allocator-profiling-[YOUR GITHUB ID]

What to Turn In

Unlike previous labs, you will not be turning in code this week. Instead, you should create and edit a new file named  ANSWERS.txt  in which you give brief answers to each of the questions posed in this lab; pay careful attention to all of the questions below.


The Memory Allocator Driver  mdriver

The mdriver memory allocator driver program in your repository provides many command line options, all of which can be listed by invoking the driver with the -h option, as in ./mdriver -h. For our purposes here, we care only about the -v option, which causes the driver to output per-trace information. At completion, mdriver also outputs the number of points awarded for each of utilization (util) and throughput (thru), as well as the overall total performance index.

When you are developing your memory allocator, it may be unclear how mdriver assigns points. In particular, your profiling may show improvements even though the driver awards no additional points. To illustrate how the driver assigns points, let's look at the mdriver output from the provided solution:

Results for mm malloc:
trace  valid  util     ops      secs   Kops
 0       yes   98%    5694  0.012556    453
 1       yes   99%    5848  0.009918    590
 2       yes   99%    6648  0.021107    315
 3       yes   99%    5380  0.016953    317
 4       yes   66%   14400  0.000096 150628
 5       yes   92%    4800  0.006145    781
 6       yes   91%    4800  0.005828    824
 7       yes   54%   12000  0.200493     60
 8       yes   47%   24000  0.330035     73
 9       yes   27%   14401  0.094657    152
10       yes   34%   14401  0.002364   6093
Total          73%  112372  0.700151    160

Perf index = 29/40 (util) + 0/60 (thru) = 30/100

Utilization

This allocator performs relatively well on most of the traces. The utilization score is very high for about half of the traces, and is between 27% and 66% on the others. The utilization score is simply the overall average utilization across the traces, scaled by the total of 40 points possible, so this allocator receives 29 out of 40 points (.73 * 40 = 29.2).

Throughput

The throughput score is computed by first taking the number of "Kilo-operations" performed by the allocator (i.e., the number of thousands of operations performed) divided by the allocator's total number of seconds used, to get what mdriver calls "Kops", or Kilo-operations per second. To then get the throughput score, this average is divided by the memory allocator's target throughput, and then scaled by the total of 60 points possible for the throughput score. The target throughput has been set at 54,500 Kops, which is reasonably achievable using various combinations of the techniques that have been taught in this class.
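
For example, using the Total line from the output above: 112372 operations is 112.372 Kilo-operations, and 112.372 / 0.700151 seconds ≈ 160 Kops, matching the Kops value that the driver reports on that line.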

The important difference between this method of computing the throughput score and simply averaging the allocator's per-trace Kops is that summing the operation counts and times first places more weight on the traces with more operations and on the traces where performance was poor.

For a concrete example of this difference using the provided allocator, if the Kops for each trace were averaged individually, then the average performance would be 14,571 Kops, which would lead to a performance score of 16/60 because it is 26.7% of the target throughput of 54,500 Kops. However, the actual averaged performance is only 160 Kops, which leads to a score of 0/60 ((160 / 54,500) * 60 = 0.176, which rounds to 0).

Perf index

The performance index printed by mdriver is simply the sum of the utilization score (util) and the throughput score (thru). In the output above, however, the provided solution scores 29/40 for utilization and 0/60 for throughput, yet 30/100 overall. It may seem as though a bug causes the total performance index to not be the correct sum, but this is not the case.

Internally, the driver uses floating point numbers to represent the scores. When the utilization and throughput scores are individually displayed, they are first each rounded to the nearest integer. The total score, however, is not simply the sum of the two rounded integer scores. Rather, it is calculated by first summing the two floating point numbers, and then rounding to an integer. This difference in rounding is the cause of the apparent arithmetic error.
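
To see this concretely, here is a minimal C sketch of the rounding behavior. The values 29.4 and 0.18 are illustrative assumptions, not the driver's actual internal numbers:

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        /*
         * Illustrative values only; we don't know the driver's exact
         * internal numbers.  A utilization score of 29.4 displays as
         * 29 and a throughput score of 0.18 displays as 0, yet their
         * sum (29.58) displays as 30.
         */
        double util = 29.4;
        double thru = 0.18;

        /*
         * Each score is rounded separately for display, but the total
         * is computed from the unrounded values and then rounded.
         */
        printf("Perf index = %.0f/40 (util) + %.0f/60 (thru) = %.0f/100\n",
            round(util), round(thru), round(util + thru));
        return (0);
    }

Compiled with gcc and linked with -lm, this prints "Perf index = 29/40 (util) + 0/60 (thru) = 30/100", reproducing the apparent discrepancy.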

Reference Solution Example

For another example, look at the mdriver output for one of the reference solutions:

Results for mm malloc:
trace  valid  util     ops      secs   Kops
 0       yes   98%    5694  0.000114  49991
 1       yes   96%    5848  0.000112  52168
 2       yes   98%    6648  0.000140  47622
 3       yes   99%    5380  0.000107  50422
 4       yes   66%   14400  0.000115 125326
 5       yes   93%    4800  0.000416  11530
 6       yes   91%    4800  0.000468  10248
 7       yes   54%   12000  0.025115    478
 8       yes   47%   24000  0.082298    292
 9       yes   29%   14401  0.000155  92790
10       yes   43%   14401  0.000078 183686
Total          74%  112372  0.109119   1030

Perf index = 30/40 (util) + 1/60 (thru) = 31/100

The reference solution being evaluated here uses a single explicit free list and an intelligent realloc that over-allocates if the current block of memory cannot simply be expanded. This is most similar to method 2 that is described in the lecture notes.

This reference allocator performs very poorly on traces 7 (binary-bal.rep) and 8 (binary2-bal.rep), which are among the larger traces, so the actual average performance is only 1030 Kops, which leads to a score of only 1/60. If the Kops were simply straight averaged across the traces, then this solution would have received the maximum score of 60/60.

The ultimate effect of this method of calculating the average throughput is that your memory allocator will need to perform well on all of the traces in order to receive a good performance score.

It is worth noting that throughput scores can vary between runs of the driver. If this is making it difficult to profile your code, see the footnote at the bottom of this lab.


Trace File Format

The trace file format is very simple. The first four lines of the file make up the header. For our purposes, we will ignore this (the driver ignores most of the header too). The rest of the lines in a trace take one of two forms, depending on whether the operation allocates (or reallocates) memory or frees it:

Allocate ('a') and reallocate ('r') operations all have the following form:  op  id  bytes

Free ('f') operations all have the following form:  op  id

A Concrete Example

The trace file format explanation above may seem a bit abstract. Let's look at a concrete example of a trace file:

  20000
  2 
  5 
  1 
  a 0 512
  a 1 128
  r 0 640
  f 1
  f 0

The first four lines of this example ("20000" ... "1") are the header, which we ignore here. Step by step, in sequence, this is what the rest of this trace file means:
  1. Create allocation 0 with a size of 512 bytes.
  2. Create allocation 1 with a size of 128 bytes.
  3. Reallocate allocation 0 to a size of 640 bytes.
  4. Free allocation 1 (the allocation of size 128 bytes).
  5. Free allocation 0 (the allocation now of size 640 bytes).
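
To make the format concrete in code, here is a minimal C sketch of how one of these trace lines might be parsed. The function parse_trace_line is a hypothetical illustration, not how mdriver actually reads traces:

    #include <stdio.h>

    /*
     * Hypothetical helper (not part of mdriver): parse one trace line
     * that follows the four-line header.  Returns 0 on success and -1
     * on a malformed line.  For 'f' lines, *bytes is left unchanged.
     */
    static int
    parse_trace_line(const char *line, char *op, int *id, size_t *bytes)
    {
        int nfields = sscanf(line, " %c %d %zu", op, id, bytes);

        if (nfields >= 2 && *op == 'f')
            return (0);         /* "f id" */
        if (nfields == 3 && (*op == 'a' || *op == 'r'))
            return (0);         /* "a id bytes" or "r id bytes" */
        return (-1);
    }

For instance, parsing the line "a 0 512" from the example above would set op to 'a', id to 0, and bytes to 512.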

The Test Traces

All of the traces that will be used to grade your memory allocator are located in the directory /clear/www/htdocs/comp321/test-cases/malloc. We encourage you to examine these trace files and to look at their patterns of operations. This examination can be especially useful if your allocator is not performing well on only one or a few traces.


Analyzing the Performance of the Memory Allocator

We will analyze the memory allocator performance in two parts. First, we will use profiling to analyze the time efficiency of your allocator (throughput) and then, we will analyze its space efficiency (utilization).

Improving Throughput

We will use gprof, which was introduced to you previously in the lab on performance profiling, to profile the memory allocator.

Exercise #1
  1. Run mdriver -v.

  2. Now compile the code with the -pg flag on gcc to enable profiling. You can do this by editing the Makefile and adding the -pg flag to the end of the CFLAGS line.
    Remember to then do a make clean before you then compile your code again.

  3. Run mdriver -v again.

  4. Compare the performance.

Note that the overhead of profiling might slow down your code significantly. So you should be careful to disable profiling once you have optimized your code.
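
For reference, the Makefile edit described in step 2 of the exercise above might look like this, where existing-flags-unchanged stands for whatever flags the CFLAGS line in your Makefile already contains:

      CFLAGS = existing-flags-unchanged -pg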

Exercise #2
  1. Compile the code with the -pg flag on gcc to enable profiling.

  2. Run mdriver -v.

  3. Run gprof mdriver and analyze the profile.

  4. Do you see all the functions from mm.c in the profile?

The reason you might not see all of the functions from mm.c in your profile is that the compiler automatically inlines some of them. Inlining means inserting the machine instructions of the called function directly into the calling procedure at each call site. The compiler performs this optimization to avoid the overhead of calling small functions, but it can make the profile results misleading. So, it is a good idea to turn off inlining while profiling (see below in Exercise #3).
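
For example, a small helper like the following is a typical inlining candidate. This function is hypothetical, not one from the provided code:

    #include <stddef.h>

    /*
     * Round "size" up to a multiple of "alignment", which must be a
     * power of two.  At optimization levels like -O2, gcc may
     * substitute this body directly into each caller, so align_up
     * never appears as its own entry in the gprof profile unless the
     * code is compiled with -fno-inline.
     */
    static size_t
    align_up(size_t size, size_t alignment)
    {
        return ((size + alignment - 1) & ~(alignment - 1));
    }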

If, for testing, you want to run individual traces, one way to do that is to copy the traces from their location on CLEAR into your lab repository directory.

      cd your-lab-repo
      cp /clear/www/htdocs/comp321/test-cases/malloc/* .

At this point, you can use the -f option on mdriver:
      ./mdriver -v -f  name-of-the-trace-you-want-to-run

Exercise #3
  1. Compile the code with the -pg flag on gcc to enable profiling and the -fno-inline flag to prevent inlining of functions. As explained above for the -pg flag, you can edit the Makefile to add these flags to the definition of CFLAGS. Run make clean after this edit, before you compile using these new flags.

  2. Run mdriver -v.

  3. Run gprof mdriver and analyze the profile.

  4. Where is most of the time being spent?

  5. Now pick a couple of traces from the first 6 traces and profile them separately by running  mdriver -v -f  test-case-name  and then  gprof mdriver.

  6. Again, where is most of the time being spent? Can you explain why? How will you fix this?

Note: Ignore the functions in the profile that are not from mm.c.

Using one or more explicit free lists, you can solve some of these problems. Below we show again the sample output from the same reference solution mentioned above, which uses a single explicit free list and an intelligent realloc that over-allocates if the current block of memory cannot simply be expanded:

Results for mm malloc:
trace  valid  util     ops      secs   Kops
 0       yes   98%    5694  0.000114  49991
 1       yes   96%    5848  0.000112  52168
 2       yes   98%    6648  0.000140  47622
 3       yes   99%    5380  0.000107  50422
 4       yes   66%   14400  0.000115 125326
 5       yes   93%    4800  0.000416  11530
 6       yes   91%    4800  0.000468  10248
 7       yes   54%   12000  0.025115    478
 8       yes   47%   24000  0.082298    292
 9       yes   29%   14401  0.000155  92790
10       yes   43%   14401  0.000078 183686
Total          74%  112372  0.109119   1030

Perf index = 30/40 (util) + 1/60 (thru) = 31/100

Note that there is a significant improvement in throughput over the provided solution (shown previously) for most of the traces. In the following exercises, you will examine trace 8 (binary2-bal.rep), on which this reference allocator achieved little improvement, and trace 9 (realloc-bal.rep), on which it achieved significant improvement.

Exercise #4
  1. Take a closer look at trace 8 (binary2-bal.rep). Do you see a pattern in the allocations/deallocations? What is it?

  2. The following profile was obtained by running our reference solution with trace 8 (binary2-bal.rep):

    Flat profile:
    
    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total           
     time   seconds   seconds    calls  ms/call  ms/call  name    
     83.13      0.98     0.98   144000     0.01     0.01  find_fit
     10.18      1.10     0.12    12000     0.01     0.01  add_range
      6.79      1.18     0.08    12000     0.01     0.01  remove_range
      0.00      1.18     0.00   192036     0.00     0.00  insert_entry_before
      0.00      1.18     0.00   192024     0.00     0.00  remove_entry
      0.00      1.18     0.00   147564     0.00     0.00  coalesce
      0.00      1.18     0.00   144000     0.00     0.00  mm_free
      0.00      1.18     0.00   144000     0.00     0.01  mm_malloc
      0.00      1.18     0.00   144000     0.00     0.00  place
      0.00      1.18     0.00    24000     0.00     0.00  mem_heap_hi
      0.00      1.18     0.00    24000     0.00     0.00  mem_heap_lo
      0.00      1.18     0.00     3588     0.00     0.00  mem_sbrk
      0.00      1.18     0.00     3564     0.00     0.00  extend_heap
      0.00      1.18     0.00       12     0.00     0.00  init_head
      0.00      1.18     0.00       12     0.00     0.00  mem_reset_brk
      0.00      1.18     0.00       12     0.00     0.00  mm_init
      0.00      1.18     0.00       10     0.00    81.75  eval_mm_speed
      0.00      1.18     0.00        1     0.00     0.00  clear_ranges
      0.00      1.18     0.00        1     0.00     0.00  mem_heapsize
    

  3. Why does find_fit() still consume the most time? Can you explain this based on the allocation/deallocation pattern? How will you fix this? 


Exercise #5
  1. The following profile was obtained by running our reference solution with trace 9 (realloc-bal.rep):

    Flat profile:
    
    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total           
     time   seconds   seconds    calls  Ts/call  Ts/call  name    
    100.10      0.45     0.45                             eval_mm_valid
      0.00      0.45     0.00    77004     0.00     0.00  insert_entry_before
      0.00      0.45     0.00    76992     0.00     0.00  remove_entry
      0.00      0.45     0.00    57864     0.00     0.00  coalesce
      0.00      0.45     0.00    57744     0.00     0.00  find_fit
      0.00      0.45     0.00    57744     0.00     0.00  mm_free
      0.00      0.45     0.00    57744     0.00     0.00  mm_malloc
      0.00      0.45     0.00    57744     0.00     0.00  place
      0.00      0.45     0.00    57588     0.00     0.00  mm_realloc
      0.00      0.45     0.00    19200     0.00     0.00  mem_heap_hi
      0.00      0.45     0.00    19200     0.00     0.00  mem_heap_lo
      0.00      0.45     0.00     9600     0.00     0.00  add_range
      0.00      0.45     0.00     9600     0.00     0.00  remove_range
      0.00      0.45     0.00      144     0.00     0.00  mem_sbrk
      0.00      0.45     0.00      120     0.00     0.00  extend_heap
      0.00      0.45     0.00       12     0.00     0.00  init_head
      0.00      0.45     0.00       12     0.00     0.00  mem_reset_brk
      0.00      0.45     0.00       12     0.00     0.00  mm_init
      0.00      0.45     0.00       10     0.00     0.00  eval_mm_speed
      0.00      0.45     0.00        1     0.00     0.00  clear_ranges
      0.00      0.45     0.00        1     0.00     0.00  mem_heapsize
    

  2. Understand the allocation/deallocation pattern in trace 9 (realloc-bal.rep) and perform a similar analysis. Note that eval_mm_valid is in mdriver.c, not in mm.c.
    The procedure eval_mm_valid is the function that executes the traces using your malloc implementation.
    In contrast, insert_entry_before, remove_entry, and init_head are functions that we added to mm.c to implement operations on circular, doubly linked lists (a sketch of such operations appears after this exercise).
    In this trace, the reference allocator is fast enough that almost all of the time is spent in the driver!

  3. What explains the relative ordering of the functions from mm.c within the above profile?
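
For reference, here is a minimal sketch of what list operations like those named in the profile might look like on a circular, doubly linked list. The struct and field names are assumptions for illustration; the lab does not show the reference solution's actual definitions:

    /*
     * A hypothetical free-list entry; the reference solution's actual
     * types are not shown in this lab.
     */
    struct entry {
        struct entry *prev;
        struct entry *next;
    };

    /* Make "head" a one-element circular list (an empty free list). */
    static void
    init_head(struct entry *head)
    {
        head->prev = head;
        head->next = head;
    }

    /* Insert "entry" immediately before "succ" in succ's list. */
    static void
    insert_entry_before(struct entry *entry, struct entry *succ)
    {
        entry->prev = succ->prev;
        entry->next = succ;
        succ->prev->next = entry;
        succ->prev = entry;
    }

    /* Unlink "entry" from whatever list it is on. */
    static void
    remove_entry(struct entry *entry)
    {
        entry->prev->next = entry->next;
        entry->next->prev = entry->prev;
    }

Note that with a circular list anchored at a dedicated head entry, neither insertion nor removal needs any special cases for an empty list, which keeps these operations cheap and constant-time.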

Improving Utilization

You can see from the sample output from our reference solution that three of the traces result in a utilization of less than 50%. To help improve utilization, you can analyze these traces and try to find ways to optimize your memory allocator accordingly.

For example, consider trace 7 (binary-bal.rep). It has the following allocation/deallocation pattern: the trace alternates between allocating a 64-byte block and a 512-byte block, then frees all of the 64-byte blocks, and finally allocates a new series of 512-byte blocks.

A naive memory allocator would end up wasting all of the 64-byte free blocks while trying to allocate 512-byte blocks. The reason is that the 64-byte free blocks are not contiguous and hence cannot be coalesced. You can avoid this by using some form of the simple segregated storage (discussed in lecture and the textbook).
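
For example, here is a minimal sketch of the bookkeeping behind simple segregated storage, assuming power-of-two size classes. NUM_CLASSES, free_lists, and size_class are hypothetical names, not part of the provided code:

    #include <stddef.h>

    #define NUM_CLASSES 16

    static void *free_lists[NUM_CLASSES];  /* one free list per class */

    /*
     * Map a block size to the index of its size class.  The classes
     * are {1-32, 33-64, 65-128, ...}, so 64-byte blocks (class 1) and
     * 512-byte blocks (class 4) live on different lists.
     */
    static int
    size_class(size_t size)
    {
        int idx = 0;

        size = (size - 1) >> 5;
        while (size > 0 && idx < NUM_CLASSES - 1) {
            size >>= 1;
            idx++;
        }
        return (idx);
    }

With separate lists per size class, the freed 64-byte blocks never have to be skipped over when searching for a 512-byte fit.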

Note, however, that some optimizations involve a trade-off: they improve the performance of some traces while reducing the performance of others. So, be careful when performing trace-specific optimizations.


Submission

Be sure to git push your lab before 11:55PM on Saturday at the end of this week. Again, for this week's lab, you should submit only your new file ANSWERS.txt, containing your answers to each of the questions above.


Footnote

If the throughput of your allocator is varying wildly from one run to the next, most likely the CLEAR machine that you are using is under a heavy load. You can use the Unix command top to check the load average on the machine. If the load average (at the end of the first line of the display) is very high, particularly if it is near the number of processors on the machine (20 on CLEAR), then you might want to log off of the CLEAR server machine you are on and try logging in to another one (type  q  to exit the top command display).

The current CLEAR server machines are amber.clear.rice.edu, borax.clear.rice.edu, cobalt.clear.rice.edu, jade.clear.rice.edu, onyx.clear.rice.edu, opal.clear.rice.edu, and pyrite.clear.rice.edu. Logging in to the "generic" name ssh.clear.rice.edu automatically selects one of these server machines for you, but this selection does not take the current server load averages into account. Alternatively, you can log in to any specific one of these CLEAR server machines by using its specific name with your ssh client, rather than the "generic" name ssh.clear.rice.edu.