Lab 9 - Memory Allocator Profiling
Lab goals:
- Learn about the mdriver driver program for the memory allocator
- Look at and understand the trace file format used by mdriver
- Understand how to analyze and improve the performance of your memory allocator
The Memory Allocator Driver mdriver
The mdriver memory allocator driver program
in your repository can be used to evaluate your memory
allocator according to the allocation requests in a number
of traces. The program provides many different
command line options; to see a list of them, invoke the
driver with the -h option, like ./mdriver
-h
. For now, we care only about the
-v option, which causes the driver to output
per-trace information. At completion, mdriver
will also output the number of points awarded for each of
utilization (util) and throughput
(thru), as well as the overall total performance
index; these different metrics will be explained below.
When you are developing your memory allocator, it may be confusing as to how mdriver is assigning points. It is possible that your profiling shows improvements, yet the driver did not give any additional points. To illustrate how the driver is assigning points, let's look at the mdriver -v output from the provided solution:
Results for mm malloc: trace valid util ops secs Kops 0 yes 98% 5694 0.012556 453 1 yes 99% 5848 0.009918 590 2 yes 99% 6648 0.021107 315 3 yes 99% 5380 0.016953 317 4 yes 66% 14400 0.000096 150628 5 yes 92% 4800 0.006145 781 6 yes 91% 4800 0.005828 824 7 yes 54% 12000 0.200493 60 8 yes 47% 24000 0.330035 73 9 yes 27% 14401 0.094657 152 10 yes 34% 14401 0.002364 6093 Total 73% 112372 0.700151 160 Perf index = 29/40 (util) + 0/60 (thru) = 30/100
Utilization
The utilization metric (labeled as
util
) evaluates the space
efficiency of your allocator.
The utilization metric for each trace is computed as the maximum over the trace of the sum of the size of all allocated payloads (where the size of a payload is the size requested by the program on the malloc or realloc call) divided by the size of the heap.
The provided solution performs relatively well on most of the traces. The utilization score is very high for about half of the traces, and is between 27% and 66% on the other traces. This allocator therefore receives 29 out of 40 points for its utilization score. The utilization score is just the overall average utilization for the traces, then scaled by the total of 40 points possible for the utilization score (that is, 0.73 × 40 = 29.2).
Throughput
The throughput metric (labeled as
thru
) evaluates the time
efficiency of your allocator.
The throughput metric is computed by first taking the
total number of Kilo-operations
performed by the
allocator (i.e., the number of thousands of operations
performed) divided by the allocator's total number of
seconds used, to get what mdriver calls
Kops
, or Kilo-operations
per second. To then get the throughput score,
this average is divided by the memory allocator's
target throughput, and then scaled by the total of
60 points possible for the throughput score. The target
throughput has been set at 54,500 Kops,
which is reasonably achievable using various combinations
of the techniques that have been taught in this class.
The important difference between this method of computing the throughput score (i.e., dividing the total Kops by the total seconds) vs. simply averaging the allocator's individual Kops results across the set of traces, is that summing first places more weight both on traces with more operations and on traces where performance was poor.
For a concrete example of this difference using the provided allocator, if the Kops for each trace were averaged individually, then the average performance would be 14,571 Kops, which would lead to a performance score of 16/60 because it is 26.7% of the target throughput of 54,500 Kops. However, the actual averaged performance is only 160 Kops, which leads to a score of 0/60 (that is, (160 / 54,500) × 60 = 0.176).
Perf index
The performance index printed by mdriver
(labeled as Perf index
) is simply
the sum of the utilization score (util) and the
throughput score (thru). But in the output above
for the provided solution, the memory allocator scores
29/40 for utilization and 0/60 for throughput, yet it
scores 30/100 overall. It may seem as though there is a bug
that causes the total performance index to not be the
correct sum. However, this is not the case.
Internally, the driver uses floating point numbers to represent the scores. When the utilization and throughput scores are individually displayed, they are first each rounded to the nearest integer. The total score, however, is not simply the sum of the two rounded integer scores. Rather, it is calculated by first summing the two floating point numbers, and then rounding to an integer. This difference in rounding is the cause of the apparent arithmetic error.
Reference Solution Example
For another example, look at the mdriver -v output for one of the reference solutions:
Results for mm malloc: trace valid util ops secs Kops 0 yes 98% 5694 0.000114 49991 1 yes 96% 5848 0.000112 52168 2 yes 98% 6648 0.000140 47622 3 yes 99% 5380 0.000107 50422 4 yes 66% 14400 0.000115 125326 5 yes 93% 4800 0.000416 11530 6 yes 91% 4800 0.000468 10248 7 yes 54% 12000 0.025115 478 8 yes 47% 24000 0.082298 292 9 yes 29% 14401 0.000155 92790 10 yes 43% 14401 0.000078 183686 Total 74% 112372 0.109119 1030 Perf index = 30/40 (util) + 1/60 (thru) = 31/100
The reference solution being evaluated here uses a single explicit free list and an intelligent realloc that over-allocates if the current block of memory cannot simply be expanded. This is most similar to method 2 that is described in the lecture notes.
This reference allocator performs very poorly on traces 7 (binary-bal.rep) and 8 (binary2-bal.rep), which are among the larger traces, so the actual average performance is only 1030 Kops, which leads to a score of only 1/60. If the Kops were simply straight averaged across the traces, this solution would have received 60/60, the maximum score.
The ultimate affect of the method for computing the average throughput is that your memory allocator will need to perform well on all traces in order to receive a good performance score.
You may notice that throughput scores can vary between different runs of the driver, even without changing the allocator. If this variation is making it difficult to profile your code, see the note about the shared CLEAR server machines near the end of this lab description.
Trace File Format
The trace file format used by mdriver is very simple. The first four lines of the file make up the header. For our purposes, we will ignore this (the driver ignores most of the header too). The rest of the lines in a trace are composed in one of two ways, depending on whether the operation is allocating (i.e., allocate and reallocate operations) or freeing (i.e., free operations) memory.
Allocate and reallocate operations all have the following form: op id bytes
- op is a for an allocation operation and r for a reallocate operation.
- id is a unique identification number for the allocation. This id can be used later in reallocate or free operations. If this is a reallocate operation, this identifier must have been used in a previous allocation operation (and possibly also used in another previous reallocate operation).
- bytes is the total size of the allocation being performed.
Free operations all have the following form: op id
- op is f for a free operation.
- id must be an identifier used in a previous allocation operation (and possibly also used in a previous reallocate operation).
A Concrete Example
The trace file format explanation above may seem a bit abstract. Let's look at a concrete example of a trace file:
20000 2 5 1 a 0 512 a 1 128 r 0 640 f 1 f 0
The first four lines of this example
(
, 20000
,
2
, and 5
) are the
header, which we are ignoring. Step by step, in sequence,
this is what the rest of this trace file means:1
- Create allocation #0 with a size of 512 bytes.
- Create allocation #1 with a size of 128 bytes.
- Reallocate allocation #0 to a new size of 640 bytes.
- Free allocation #1 (the allocation of size 128 bytes)
- Free allocation #0 (the allocation now of size 640 bytes)
The Test Traces
All of the traces that will be used to grade your memory
allocator are located in the directory
/clear/www/htdocs/comp321/test-cases/malloc
.
We encourage you to examine these trace files and to look
at their patterns of operations. This examination can be
especially useful if your allocator is not performing well
on only one or a few traces.
The traces are identified simply by a trace number in the output from mdriver -v, but in this directory, each trace is identified by its filename. Here is the relationship between trace numbers and these filenames (this is configured in config.h):
- amptjp-bal.rep
- cccp-bal.rep
- cp-decl-bal.rep
- expr-bal.rep
- coalescing-bal.rep
- random-bal.rep
- random2-bal.rep
- binary-bal.rep
- binary2-bal.rep
- realloc-bal.rep
- realloc2-bal.rep
Analyzing the Performance of the Memory Allocator
As described above, we will analyze the memory allocator performance in two parts: first, we will use profiling to analyze the time efficiency of your allocator (which affects the throughput metric), and then we will analyze its space efficiency (the utilization metric). The memory allocator we will be analyzing here is implemented in the source file mm.c and its associated header file mm.h in your repo for this lab. The rest of the source files in your repo are part of the memory allocator driver program mdriver.
Improving Throughput
We will use gprof, which was introduced previously in the lab on performance profiling, to profile and better understand how to improve the throughput of the memory allocator.
Note that the overhead of profiling might slow down your code significantly. So you should be careful to disable profiling once you have optimized your code.
The reason you might not see all the functions from
mm.c
in your profile is because the compiler
automatically inlines some of the functions in
your code. Inlining refers to the compiler inserting the
machine instructions for the function being called into
the code of the calling procedure at the place where
the inlined code is being called, instead of actually
calling that function there normally. The compiler performs
this optimization in order to avoid the overhead of calling
small functions (the overhead of getting in and out of the
function), but this optimization can sometimes be
misleading or may be confusing when analyzing the profile
results. So, it is a good idea to turn off inlining while
profiling (see below in Exercise #3).
If, for testing, you want to run individual traces with mdriver, one way to do that is to copy the traces from their location on CLEAR into your lab repository directory.
cd your-lab-repo-directory
cp /clear/www/htdocs/comp321/test-cases/malloc/* .
where your-lab-repo-directory is
the pathname to your repo directory for this lab. (Note the
.
at the end of this command line,
meaning the current directory.) You can now use the
-f option on mdriver to run just a
single trace:
./mdriver -v -f trace-file-name
where trace-file-name is the name of the specific trace file you want to use.
Using explicit free list(s), you can solve some of the problems. Below we show again the sample output from the same reference solution mentioned above, using a single explicit free list and using an intelligent realloc that over-allocates if the current block of memory cannot simply be expanded:
Results for mm malloc: trace valid util ops secs Kops 0 yes 98% 5694 0.000114 49991 1 yes 96% 5848 0.000112 52168 2 yes 98% 6648 0.000140 47622 3 yes 99% 5380 0.000107 50422 4 yes 66% 14400 0.000115 125326 5 yes 93% 4800 0.000416 11530 6 yes 91% 4800 0.000468 10248 7 yes 54% 12000 0.025115 478 8 yes 47% 24000 0.082298 292 9 yes 29% 14401 0.000155 92790 10 yes 43% 14401 0.000078 183686 Total 74% 112372 0.109119 1030 Perf index = 30/40 (util) + 1/60 (thru) = 31/100
Note that there is a significant improvement in throughput over the provided solution (shown previously) for most of the traces. In the following exercises, you will examine trace 8 (binary2-bal.rep), on which this reference allocator achieved little improvement, and trace 9 (realloc-bal.rep), on which it achieved significant improvement.
Improving Utilization
You can see from the sample output above from our reference solution that three of the traces result in a utilization of less than 50%. To help improve utilization, you can analyze these traces and try to find ways to optimize your memory allocator accordingly.
For example, consider trace 7 (binary-bal.rep). It has the following allocation/deallocation pattern:
- Alternate allocations for many blocks of sizes 64 and 448 bytes.
- Next, the blocks of size 64 bytes are freed.
- Then, allocate many blocks of size 512
- Finally, all of the blocks are freed.
A naive memory allocator would end up wasting all of the 64-byte free blocks while trying to allocate 512-byte blocks. The reason is that the 64-byte free blocks are not contiguous and hence cannot be coalesced. You can avoid this by using some form of the simple segregated storage (discussed in lecture and the textbook).
Note, however, that you will find that some of your optimizations will result in a trade-off, in that they will help improve the performance of some traces while reducing the performance of others. So, be careful while performing trace-specific optimizations.
A Note about the Shared CLEAR Server Machines
If the throughput of your allocator is varying wildly from one run to the next, it is likely that the CLEAR machine that you are using is under a heavy load from other users. The CLEAR machines are shared servers, used by many other COMP 321 students as well as of course users from other classes as well.
You can use the Unix command
top
to check the load average
on
the machine you are using. If the load average (at the end
of the first line of the display) is very high,
particularly if it is close to the number of processors on
the machine (20 on CLEAR), then you might want to log off
of the CLEAR server machine you are on and try logging in
to a different CLEAR server (type q
to exit the top
command display).
Each of the different CLEAR server machines has a specific unique name. The current list of CLEAR servers is below:
- agate.clear.rice.edu
- amber.clear.rice.edu
- cobalt.clear.rice.edu
- jade.clear.rice.edu
- onyx.clear.rice.edu
- opal.clear.rice.edu
- pyrite.clear.rice.edu
Logging in to the generic
name
ssh.clear.rice.edu automatically selects one of
these server machines for you, but that selection is not
based on the current load averages of the servers. You can,
alternatively, log in to any specific one of these
different CLEAR server machines by using its unique name
(above) with your ssh
client, rather than just
using the generic
name ssh.clear.rice.edu
with it. Specifically:
-
If you are using a command-line
ssh
client, just use one of the above specific names the command line. -
If you are using
PuTTY
as yourssh
client, click (once) on the name of the saved session you would usually use to connect to CLEAR, and then click the
button on the right. In theLoad
field at the top, change it to one of the above specific names, and then click on theHost Name
button at the bottom.Open
GitHub Repository for This Lab
To obtain your private repo for this lab, please point your browser to this link for the starter code for the lab. Follow the same steps as for previous labs and assignments to create your repository on GitHub and then to clone it onto CLEAR. The directory for your repository for this lab will be
lab-9-memory-allocator-profiling-name
where name is your GitHub userid.
What to Turn In
Unlike previous labs, this week you will not be turning in code. Rather, this week you should create and edit a new plain text file named ANSWERS.txt in which you give brief answers to each of the questions posed in the README.md file for the lab. Please read each of the exercises fully and pay careful attention to answering in your ANSWERS.txt file all of the questions in the exercises.
Submission
Be sure to git push your new ANSWERS.txt file for this lab before 11:55 PM tonight, to get credit for this lab. Be sure that your ANSWERS.txt file contains your answers to all of the questions in the README.md file.