Lab 8 - Memory Allocator Profiling
Lab goals:
- Learn about the mdriver driver program for the memory allocator
- Look at and understand the trace file format used by mdriver
- Understand how to analyze and improve the performance of your memory allocator
Resources
The Memory Allocator Driver mdriver
The mdriver memory allocator driver program in your
repository can be used to evaluate your memory allocator according to
the allocation requests in a number of traces. The program
provides many different command line options; to see a list of them,
invoke the driver with the -h option, like
./mdriver -h
. For now, we care only about the
-v option, which causes the driver to output per-trace
information. At completion, mdriver will also output the
number of points awarded for each of utilization (util) and
throughput (thru), as well as the overall total performance
index; these different metrics will be explained below.
When you are developing your memory allocator, it may be confusing as to how mdriver is assigning points. It is possible that your profiling shows improvements, yet the driver did not give any additional points. To illustrate how the driver is assigning points, let's look at the mdriver -v output from the provided solution:
Results for mm malloc: trace valid util ops secs Kops 0 yes 98% 5694 0.011953 476 1 yes 99% 5848 0.009703 603 2 yes 99% 6648 0.020263 328 3 yes 99% 5380 0.016268 331 4 yes 66% 14400 0.000117 123182 5 yes 92% 4800 0.005985 802 6 yes 91% 4800 0.005537 867 7 yes 54% 12000 0.218773 55 8 yes 47% 24000 0.397283 60 9 yes 27% 14401 0.098391 146 10 yes 34% 14401 0.003107 4634 Total 73% 112372 0.787380 143 Perf index = 29/40 (util) + 0/60 (thru) = 29/100
Utilization
The utilization metric (labeled as util
) evaluates the space efficiency of your allocator.
The utilization metric for each trace is computed as the maximum over the trace of the sum of the size of all allocated payloads (where the size of a payload is the size requested by the program on the malloc or realloc call) divided by the size of the heap.
The provided solution performs relatively well on most of the traces. The utilization score is very high for about half of the traces, and is between 27% and 66% on the other traces. This allocator therefore receives 29 out of 40 points for its utilization score. The utilization score is just the overall average utilization for the traces, then scaled by the total of 40 points possible for the utilization score (that is, 0.73 × 40 = 29.2).
Throughput
The throughput metric (labeled as thru
) evaluates the time efficiency of your allocator.
The throughput metric is computed by first taking the total number of
Kilo-operations
performed by the allocator (i.e., the number of
thousands of operations performed) divided by the allocator's total
number of seconds used, to get what mdriver calls
Kops
, or Kilo-operations per second. To then
get the throughput score, this average is divided by the memory
allocator's target throughput, and then scaled by the total of
60 points possible for the throughput score. The target throughput has
been set at 54,500 Kops, which is reasonably achievable
using various combinations of the techniques that have been taught in
this class.
The important difference between this method of computing the throughput score (i.e., dividing the total Kops by the total seconds) vs. simply averaging the allocator's individual Kops results across the set of traces, is that summing first places more weight both on traces with more operations and on traces where performance was poor.
For a concrete example of this difference using the provided allocator, if the Kops for each trace were averaged individually, then the average performance would be 13,148 Kops, which would lead to a performance score of 14/60 because it is 24.1% of the target throughput of 54,500 Kops. However, the actual averaged performance is only 143 Kops, which leads to a score of 0/60 (that is, (143 / 54,500) × 60 = 0.157).
Perf index
The performance index printed by mdriver (labeled as
Perf index
) is simply the sum of the utilization score (util) and the
throughput score (thru). In some cases, however, you may
think there is an error as the total score may seem slightly off.
Internally, the driver uses floating point numbers to represent the scores. When the utilization and throughput scores are individually displayed, they are first each rounded to the nearest integer. The total score, however, is not simply the sum of the two rounded integer scores. Rather, it is calculated by first summing the two floating point numbers, and then rounding to an integer. This difference in rounding can lead to a total that is not strictly the sum of the displayed parts.
Reference Solution Example
For another example, look at the mdriver -v output for one of the reference solutions:
Results for mm malloc: trace valid util ops secs Kops 0 yes 98% 5694 0.000131 43432 1 yes 96% 5848 0.000125 46934 2 yes 98% 6648 0.000166 40072 3 yes 99% 5380 0.000118 45516 4 yes 66% 14400 0.000144 99792 5 yes 93% 4800 0.000535 8975 6 yes 91% 4800 0.000582 8243 7 yes 54% 12000 0.021052 570 8 yes 47% 24000 0.095980 250 9 yes 29% 14401 0.000183 78522 10 yes 43% 14401 0.000102 140772 Total 74% 112372 0.119119 943 Perf index = 30/40 (util) + 1/60 (thru) = 31/100
The reference solution being evaluated here uses a single explicit free list and an intelligent realloc that over-allocates if the current block of memory cannot simply be expanded.
This reference allocator performs very poorly on traces 7 (binary-bal.rep) and 8 (binary2-bal.rep), which are among the larger traces, so the actual average performance is only 943 Kops, which leads to a score of only 1/60. If the Kops were simply straight averaged across the traces, this solution would have received 56/60, the maximum score.
The ultimate affect of the method for computing the average throughput is that your memory allocator will need to perform well on all traces in order to receive a good performance score.
You may notice that throughput scores can vary between different runs of the driver, even without changing the allocator. If this variation is making it difficult to profile your code, see the note about the shared CLEAR server machines near the end of this lab description.
Trace File Format
The trace file format used by mdriver is very simple. The first four lines of the file make up the header. For our purposes, we will ignore this (the driver ignores most of the header too). The rest of the lines in a trace are composed in one of two ways, depending on whether the operation is allocating (i.e., allocate and reallocate operations) or freeing (i.e., free operations) memory.
Allocate and reallocate operations all have the following form: op id bytes
- op is a for an allocation operation and r for a reallocate operation.
- id is a unique identification number for the allocation. This id can be used later in reallocate or free operations. If this is a reallocate operation, this identifier must have been used in a previous allocation operation (and possibly also used in another previous reallocate operation).
- bytes is the total size of the allocation being performed.
Free operations all have the following form: op id
- op is f for a free operation.
- id must be an identifier used in a previous allocation operation (and possibly also used in a previous reallocate operation).
A Concrete Example
The trace file format explanation above may seem a bit abstract. Let's look at a concrete example of a trace file:
20000 2 5 1 a 0 512 a 1 128 r 0 640 f 1 f 0
The first four lines of this example (
, 20000
, 2
, and 5
) are the header, which we are ignoring. Step by step, in sequence,
this is what the rest of this trace file means:
1
- Create allocation #0 with a size of 512 bytes.
- Create allocation #1 with a size of 128 bytes.
- Reallocate allocation #0 to a new size of 640 bytes.
- Free allocation #1 (the allocation of size 128 bytes)
- Free allocation #0 (the allocation now of size 640 bytes)
The Test Traces
All of the traces that will be used to grade your memory allocator are
located in the directory
/clear/www/htdocs/comp321/test-cases/malloc
. We encourage
you to examine these trace files and to look at their patterns of
operations. This examination can be especially useful if your allocator
is not performing well on only one or a few traces.
The traces are identified simply by a trace number in the output from mdriver -v, but in this directory, each trace is identified by its filename. Here is the relationship between trace numbers and these filenames (this is configured in config.h):
- amptjp-bal.rep
- cccp-bal.rep
- cp-decl-bal.rep
- expr-bal.rep
- coalescing-bal.rep
- random-bal.rep
- random2-bal.rep
- binary-bal.rep
- binary2-bal.rep
- realloc-bal.rep
- realloc2-bal.rep
Analyzing the Performance of the Memory Allocator
As described above, we will analyze the memory allocator performance in two parts: first, we will use profiling to analyze the time efficiency of your allocator (which affects the throughput metric), and then we will analyze its space efficiency (the utilization metric). The memory allocator we will be analyzing here is implemented in the source file mm.c and its associated header file mm.h in your repo for this lab. The rest of the source files in your repo are part of the memory allocator driver program mdriver.
Improving Throughput
We will use gprof, which was introduced previously in a COMP 222 lab, to profile and better understand how to improve the throughput of the memory allocator.
Note that the overhead of profiling might slow down your code significantly. So you should be careful to disable profiling once you have optimized your code.
The reason you might not see all the functions from mm.c
in
your profile is because the compiler automatically inlines some
of the functions in your code. Inlining refers to the compiler inserting
the machine instructions for the function being called
into the code of the calling procedure at the place where the
inlined code is being called, instead of actually calling that function
there normally. The compiler performs this optimization in order to
avoid the overhead of calling small functions (the overhead of getting
in and out of the function), but this optimization can sometimes be
misleading or may be confusing when analyzing the profile results. So,
it is a good idea to turn off inlining while profiling (see below in
Exercise #3).
If, for testing, you want to run individual traces with mdriver, one way to do that is to copy the traces from their location on CLEAR into your lab repository directory.
cd your-lab-repo-directory
cp /clear/www/htdocs/comp321/test-cases/malloc/* .
where your-lab-repo-directory is the pathname to
your repo directory for this lab. (Note the .
at the
end of this command line, meaning the current directory.) You can now
use the -f option on mdriver to run just a single
trace:
./mdriver -v -f trace-file-name
where trace-file-name is the name of the specific trace file you want to use.
Using explicit free list(s), you can solve some of the problems. Below we show again the sample output from the same reference solution mentioned above, using a single explicit free list and using an intelligent realloc that over-allocates if the current block of memory cannot simply be expanded:
Results for mm malloc: trace valid util ops secs Kops 0 yes 98% 5694 0.000131 43432 1 yes 96% 5848 0.000125 46934 2 yes 98% 6648 0.000166 40072 3 yes 99% 5380 0.000118 45516 4 yes 66% 14400 0.000144 99792 5 yes 93% 4800 0.000535 8975 6 yes 91% 4800 0.000582 8243 7 yes 54% 12000 0.021052 570 8 yes 47% 24000 0.095980 250 9 yes 29% 14401 0.000183 78522 10 yes 43% 14401 0.000102 140772 Total 74% 112372 0.119119 943 Perf index = 30/40 (util) + 1/60 (thru) = 31/100
Note that there is a significant improvement in throughput over the provided solution (shown previously) for most of the traces. In the following exercises, you will examine trace 8 (binary2-bal.rep), on which this reference allocator achieved little improvement, and trace 9 (realloc-bal.rep), on which it achieved significant improvement.
Improving Utilization
You can see from the sample output above from our reference solution that three of the traces result in a utilization of less than 50%. To help improve utilization, you can analyze these traces and try to find ways to optimize your memory allocator accordingly.
For example, consider trace 7 (binary-bal.rep). It has the following allocation/deallocation pattern:
- Alternate allocations for many blocks of sizes 64 and 448 bytes.
- Next, the blocks of size 64 bytes are freed.
- Then, allocate many blocks of size 512
- Finally, all of the blocks are freed.
A naive memory allocator would end up wasting all of the 64-byte free blocks while trying to allocate 512-byte blocks. The reason is that the 64-byte free blocks are not contiguous and hence cannot be coalesced. You can avoid this by using some form of the simple segregated storage (discussed in lecture and the textbook).
Note, however, that you will find that some of your optimizations will result in a trade-off, in that they will help improve the performance of some traces while reducing the performance of others. So, be careful while performing trace-specific optimizations.
A Note about the Shared CLEAR9 Server Machines
If the throughput of your allocator is varying wildly from one run to the next, it is likely that the CLEAR9 machine that you are using is under a heavy load from other users. The CLEAR9 machines are shared servers, used by many other COMP 321 students as well as of course users from other classes as well.
You can use the Unix command top
to check the
load average
on the machine you are using. If the load average
(at the end of the first line of the display) is very high, particularly
if it is close to the number of processors on the machine (20 on CLEAR9),
then you might want to log off of the CLEAR9 server machine you are on
and try logging in to a different CLEAR9 server (type
q to exit the top
command display).
Each of the different CLEAR9 server machines has a specific unique name. The current list of CLEAR9 servers is below:
- agate.clear9.rice.edu
- jade.clear9.rice.edu
- onyx.clear9.rice.edu
- pyrite.clear9.rice.edu
Logging in to the generic
name
ssh.clear9.rice.edu automatically selects one of these server
machines for you, but that selection is not based on the current load
averages of the servers. You can, alternatively, log in to any specific
one of these different CLEAR9 server machines by using its unique name
(above) with your ssh
client, rather than just using the
generic
name ssh.clear9.rice.edu with it.
Submission
For this lab and all labs in this course, at the end of your scheduled lab session, you should "submit" your work (regardless of how much your completed) using git push.
You should always include in your repo all files that
you created in the lab, other than those (such as the output of
the compiler) generated automatically from other files. Think of this as
including just the files necessary to backup and to be able to recreate
your work.
Do not
simply add all files to your repo; for example, do
not simply type something like
git add .
or
git add *
, and never add any object files, executable
files, or core
files to your repo.
Only add the actual source files (for example,
.c and