Lab 12 - Performance Profiling

Lab goals:

Understand how to use performance profiling tools and to interpret their results.

Pre-lab

Performance Profiling

Often, you may want to understand more about the performance of your program. Of course, when you are developing your program, you should understand its asymptotic big-O complexity behavior. But frequently you may want to know how your program actually performs on real input data.

Modern computer architectures are sufficiently complicated (by features like pipelining and caching) that static analyses cannot accurately or precisely describe a program's running time. What we are left with are dynamic analyses, i.e., running or simulating programs and analyzing what actually happened.

Whole-Program Profiling

You might already have discovered the simplest possible Unix timing tool:

time

Running the shell command

time program arguments

will run the given program with those arguments, and when the program completes, it will then report timing statistics about this execution to the standard error file (generally, the screen). Many shells (e.g., bash, csh, and tcsh) have a built-in time command, and there is also a standard time as a separate external program in /usr/bin/time.

All versions of the time command report the elapsed (wall clock) time, as well as the user time and the system time. The latter two count CPU time spent on that program, broken down, respectively, into time the computer was running the program itself (the user time) and time the computer was running in various low-level functions (i.e., the operating system) on behalf of that program (the system time). CPU time is when the computer is actually computing for your program rather than running someone else's program or simply sitting idle (e.g., while waiting on you to type some input). Most versions of the time command also report a few other statistics, such as the amount of memory and the number of file system I/O operations performed in the program's execution.

In this lab, we will just be using the time command built into the shell.

In-lab

Detailed Profiling

Often, developers are interested in more detailed performance profiling of programs than can be provided by the time command, generally including some or all of the following:

the number of times each function is called,
the time spent executing in each function, and
the size of the heap over time.

How could you measure these? One way is that we could change the program's source code to track them:

We could add a counter for each function. At the beginning of each function, we would increment the corresponding counter.
At the beginning and end of each function, we could get the current time (e.g., using gettimeofday()) and compute the time consumed. In other words, add a stopwatch to each function.
And we could modify the memory allocator to periodically report the current memory usage.

Note that in each case, the addition of this profiling code will change the program's performance and thus the results, at least a little.

We could integrate the first two of these with the printf-style of debugging advocated earlier in the lab on debugging. However, more simply, Unix has standard, classic tools for this type of performance profiling:

gprof
gcov

And modern Linux systems augment these classic tools with one more tool:

perf

Unfortunately, Unix has no standard tool for profiling memory (e.g., heap) usage, although several such debugging and profiling tools are described here.

The `gprof` Tool

gprof is a tool for analyzing a program's execution time and breaking this time down among all the various defined functions in the program. In essence, it is a tool for (conceptually, at least) automatically adding a stopwatch to each function. However, gprof actually collects the data to show where the time is spent in executing the program instead by periodically (at a very high frequency) sampling where the program is currently executing at each sample time. This statistical sampling approach achieves essentially the same effect (or very close to it) with much lower execution overhead.

Using gprof is a three-step process:

Compile the program with the -pg flag to the C compiler. This adds code into the resulting executable to do the profiling.
Run the program:

./program arguments

The resulting profiling information is put in a raw form into the file gmon.out in the current directory.
Run gprof on the program, which interprets the information in gmon.out and prints a human-readable summary:

gprof program > output

By default, this produces a lot of output! For each function, it shows the percentage of time spent in that function and how that time is broken down into the functions it calls. If you link the program with the -static flag, the output includes the percentage of time spent in functions from the system libraries, too.

To view all of this output generated by gprof, the command line above redirects the standard output into a file, and you can then view the contents of that file, for example using the command

less output

The program less displays a screenful of the file and then pauses. To see more, type the SPACE key; each time, this will show the next screenful. To exit the less program, type the q key.

Various gprof command-line options are also available to limit what output is generated by gprof.

For more details, use man gprof, or look at the complete gprof manual.

See also B&O Section 5.14.1.

The `gcov` Tool

The program gcov is similar to gprof. But, rather than breaking down the time spent in each function, it counts the number of times each line of code is executed. It is designed to help analyze testing coverage during execution of a program (thus the name gcov), for example finding statements in the source code that were not executed and thus not tested during execution, but the line execution counts are useful for simple performance analysis as well.

Using gcov is a four-step process:

Compile each of the source files (e.g., source1.c, source2.c, source3.c, etc.) of the program with the -fprofile-arcs -ftest-coverage flags to the C compiler. This adds code into the resulting executable to do the profiling, and creates corresponding notes files (e.g., source1.gcno, source2.gcno, source3.gcno, etc.) to record information for later correlating the performance data to specific points in the source files. The suffix .gcno stands for gcov notes.
Link and run the program. The resulting profiling information is put in raw form into corresponding output files (e.g., source1.gcda, source2.gcda, source3.gcda, etc.). The suffix .gcda stands for gcov data.
Run gcov on each of your sources files (e.g., run gcov source1.c, gcov source2.c, gcov source3.c, etc.) to format that raw profiling information into the human readable files corresponding to each of your source files (e.g., source1.c.gcov, source2.c.gcov, source3.c.gcov, etc.).
Look at these resulting *.gcov files.

The figure below illustrates this multistage process for using gcov and shows the resulting files for each starting source code file (e.g., source1.c):

For more details, use man gcov, or look at the complete gcov manual.

The `perf` Tool

The Linux perf tool is a new addition to the suite of Unix performance analysis tools. Like gprof, perf examines the call stack in acquiring the performance information from your program, but unlike gprof, perf samples only fixed-length stack fragments. Consequently, programs with deep call stacks can show some non-intuitive results with perf.

The perf tool has several advantages over gprof:

perf does not require any additional compiler flags beyond -g.
perf works with any compiler.
perf has lower run-time overhead.

But the gprof tool still has several advantages over perf:

gprof can produce call counts, as well as raw time data.
gprof produces complete call graphs, not fragments of a call graph.
The gprof user interface is much simpler and much better documented.

Like gprof, using perf is a three-step process:

Compile the program with the -g flag to the C compiler. Only this flag is needed with perf, and this is the same flag used for other things such as for use with gdb.
Then, run the program by invoking the recording phase of perf as follows:

perf record -e cycles:u --call-graph dwarf program arguments

where program is the program you are profiling, and arguments are the command-line arguments for this run of it. The data recording phase of perf yields a raw data file named perf.data.

If you get error messages like lost 1 chunks! and Check IO/CPU overload!, it generally means that the CLEAR machine you are currently logged in to is overloaded from something being run by one or more other users. If the number of lost chunks is small, you can basically ignore it for our purposes in this lab. You could also try waiting a while and trying it again, or you could log in to one of the other CLEAR machines.

To interpret the data, invoke perf again as follows:

perf report --stdio --max-stack 6

You probably want to redirect the standard output of this into a file and then use something like less to examine the file.

You should see something like the following:

     # To display the perf.data header info, please use --header/--header-only options.
     #
     #
     # Total Lost Samples: 0
     #
     # Samples: 9K of event 'cycles:u'
     # Event count (approx.): 8012828499
     #
     # Children      Self  Command        Shared Object     Symbol
     # ........  ........  .............  ................  ..........................
     #
         99.58%    28.11%  sample-nonopt  sample-nonopt     [.] insert_list
                 |
                  --99.49%--insert_list
                            |
                            |--97.79%--insert_list
                            |          |
                            |           --97.59%--insert_list
                            |                     |
                            |                     |--81.58%--insert_list
                            |                     |          |
                            |                     |          |--66.74%--insert_list
                            |                     |          |          |
                            |                     |          |          |--27.86%--insert_list
                            |                     |          |          |
                            |                     |          |          |--15.99%--_int_free
                            |                     |          |          |
                            |                     |          |          |--15.13%--make_nonempty
                            |                     |          |          |
                            |                     |          |          |--5.25%--is_empty
                            |                     |          |          |
                            |                     |          |           --2.52%--__cfree (inlined)
                            |                     |          |
                            |                     |           --14.84%--make_nonempty
                            |                     |                     malloc
                            |                     |
                            |                      --16.00%--make_nonempty
                            |                                malloc
                            |                                _int_malloc
                            |
                            |--0.97%--_int_free
                            |
                             --0.69%--make_nonempty

         46.80%    15.86%  sample-nonopt  sample-nonopt     [.] make_nonempty

etc. . . .

The interpretation of this data can be a little bit complicated. Here are some of the key points:

The Self number shows the percentage of total CPU time spent by the given function. The name of the function (when known) is shown on the far right as Symbol. Another way to think about the Self number is that it reflects the fraction of samples that had the function at the bottom (most recent end) of the call stack. The sum of all of the Self numbers should be 100 percent.
The Children number is an attempt to measure the total fraction of CPU time that the function, plus everything that calls the function, spends. In particular, note that perf looks up the call stack at the callers, not down the call stack at what is called.

Also note that the Children number cannot be relied upon for programs with deep call stacks. Since perf does not collect arbitrarily deep, full call stacks, some top level functions may have highly inaccurate Children numbers. In particular, the main function may not have 100 percent as its Children number. That happens in this lab.
The Command and Shared Object reflect the command specifics, as well as the object file name and dynamically linked libraries.
The tree-like entries appearing after some symbol line indicate the callers of the function. It is important to keep in mind that perf looks up the call stack at callers, not down the call stack at what it calls.

The percentages shown on a call stack node indicate the fraction of time that had that function at that level of the call stack.

Notice that insert_list is heavily recursive, so you should expect to find calls to that same function higher up on its call stack.

Using Profiling Information

Now that you have the information from one or more of the profiling methods above, what do you do with it?

Debugging: Identify parts of the program using more or fewer resources than expected. These may correspond to bugs.
Tuning: Identify the parts of the program using lots of resources. Try to optimize those parts of the program, since this is where there is the biggest room for improvement.

Within this course, a few lectures have described some optimization techniques, and more techniques will be described in later lectures.

See also B&O Section 5.14.2.

GitHub Repository for This Lab

To obtain your private repo for this lab, please point your browser to this link for the starter code for the lab. Follow the same steps as for previous labs to create your repository on GitHub and then to clone it onto CLEAR. The directory for your repository for this lab will be

performance-profiling-name

where name is your GitHub userid.

Post-lab

Submission

Be sure to git push the optimized and updated sample.c file for this lab before 11:55 PM on Sunday, 11/23 to get credit for this lab.

COMP 222: Introduction to Computer Organization

Navigation

Lab 12 - Performance Profiling

Pre-lab

Performance Profiling

Whole-Program Profiling

In-lab

Detailed Profiling

The `gprof` Tool

The `gcov` Tool

The `perf` Tool

Using Profiling Information

GitHub Repository for This Lab

Post-lab

Submission

COMP 222: Introduction to Computer Organization

Navigation

Lab 12 - Performance Profiling

Pre-lab

Performance Profiling

Whole-Program Profiling

In-lab

Detailed Profiling

The gprof Tool

The gcov Tool

The perf Tool

Using Profiling Information

GitHub Repository for This Lab

Post-lab

Submission

The `gprof` Tool

The `gcov` Tool

The `perf` Tool