Lab 7 - Performance Profiling

Lab goals:

Performance Profiling

Often, you may want to understand more about the performance of your program. Of course, when you are developing your program, you should understand its asymptotic big-O complexity behavior. But frequently you may want to know how your program actually performs on real input data.

Modern computer architectures are sufficiently complicated (by features like pipelining and caching) that static analyses cannot accurately or precisely describe a program's running time. What we are left with are dynamic analyses, i.e., running or simulating programs and analyzing what actually happened.

Whole-Program Profiling

You might already have discovered the simplest possible Unix timing tool:

Running the shell command

time program  arguments

will run the given program with those arguments, and when the program completes, it will then report timing statistics about this execution to the standard error file (generally, the screen). Many shells (e.g., bash, csh, and tcsh) have a built-in time command, and there is also a standard time as a separate external program in /usr/bin/time.

All versions of the time command report the elapsed (wall clock) time, as well as the user time and the system time. The latter two count CPU time spent on that program, broken down, respectively, into time the computer was running the program itself (the user time) and time the computer was running in various low-level functions (i.e., the operating system) on behalf of that program (the system time). CPU time is when the computer is actually computing for your program rather than running someone else's program or simply sitting idle (e.g., while waiting on you to type some input). Most versions of the time command also report a few other statistics, such as the amount of memory and the number of file system I/O operations performed in the program's execution.

In this lab, we will just be using the time command built into the shell.

More Detailed Profiling

Often, developers are interested in more detailed performance profiling of programs than can be provided by the time command, generally including some or all of the following:

How could you measure these? One way is that we could change the program's source code to track them:

Note that in each case, the addition of this profiling code will change the program's performance and thus the results, at least a little.

We could integrate the first two of these with the printf-style of debugging advocated earlier in the lab on debugging. However, more simply, Unix has standard, classic tools for this type of performance profiling:

And modern Linux systems augment these classic tools with one more tool:

Unfortunately, Unix has no standard tool for profiling memory (e.g., heap) usage, although several such debugging and profiling tools are described here.

The gprof Tool

gprof is a tool for analyzing a program's execution time and breaking this time down among all the various defined functions in the program. In essence, it is a tool for (conceptually, at least) automatically adding a stopwatch to each function. However, gprof actually collects the data to show where the time is spent in executing the program instead by periodically (at a very high frequency) sampling where the program is currently executing at each sample time. This statistical sampling approach achieves essentially the same effect (or very close to it) with much lower execution overhead.

Using gprof is a three-step process:

  1. Compile the program with the -pg flag to the C compiler. This adds code into the resulting executable to do the profiling.

  2. Run the program:

    ./program  arguments

    The resulting profiling information is put in a raw form into the file gmon.out in the current directory.

  3. Run gprof on the program, which interprets the information in gmon.out and prints a human-readable summary:

    gprof program > output

    By default, this produces a lot of output!  For each function, it shows the percentage of time spent in that function and how that time is broken down into the functions it calls. If you link the program with the -static flag, the output includes the percentage of time spent in functions from the system libraries, too.

    To view all of this output generated by gprof, the command line above redirects the standard output into a file, and you can then view the contents of that file, for example using the command

    less output

    The program less displays a screenful of the file and then pauses. To see more, type the SPACE key; each time, this will show the next screenful. To exit the less program, type the  q  key.

    Various gprof command-line options are also available to limit what output is generated by gprof.

For more details, use  man gprof, or look at the complete gprof manual.

See also B&O Section 5.14.1.

The gcov Tool

The program gcov is similar to gprof.  But, rather than breaking down the time spent in each function, it counts the number of times each line of code is executed. It is designed to help analyze testing coverage during execution of a program (thus the name gcov), for example finding statements in the source code that were not executed and thus not tested during execution, but the line execution counts are useful for simple performance analysis as well.

Using gcov is a four-step process:

  1. Compile each of the source files (e.g., source1.csource2.csource3.c,  etc.) of the program with the  -fprofile-arcs -ftest-coverage  flags to the C compiler. This adds code into the resulting executable to do the profiling, and creates corresponding notes files (e.g., source1.gcnosource2.gcnosource3.gcno, etc.) to record information for later correlating the performance data to specific points in the source files. The suffix .gcno stands for gcov notes.

  2. Link and run the program. The resulting profiling information is put in raw form into corresponding output files (e.g., source1.gcdasource2.gcdasource3.gcda,  etc.). The suffix .gcda stands for gcov data.

  3. Run gcov on each of your sources files (e.g., run  gcov source1.cgcov source2.cgcov source3.c,  etc.) to format that raw profiling information into the human readable files corresponding to each of your source files (e.g., source1.c.gcovsource2.c.gcovsource3.c.gcov,  etc.).

  4. Look at these resulting *.gcov files.

The figure below illustrates this multistage process for using gcov and shows the resulting files for each starting source code file (e.g., source1.c):

gcov states and files

For more details, use  man gcov, or look at the complete gcov manual.

The perf Tool

The Linux perf tool is a new addition to the suite of Unix performance analysis tools. Like gprofperf examines the call stack in acquiring the performance information from your program, but unlike gprof, perf samples only fixed-length stack fragments. Consequently, programs with deep call stacks can show some non-intuitive results with perf.

The perf tool has several advantages over gprof:

But the gprof tool still has several advantages over perf:

Like gprof, using perf is a three-step process:

  1. Compile the program with the -g flag to the C compiler. Only this flag is needed with perf, and this is the same flag used for other things such as for use with gdb.

  2. Then, run the program by invoking the recording phase of perf as follows:

    perf record -e cycles:u --call-graph dwarf program  arguments

    where program is the program you are profiling, and arguments are the command-line arguments for this run of it. The data recording phase of perf yields a raw data file named

    If you get error messages like lost 1 chunks! and Check IO/CPU overload!, it generally means that the CLEAR machine you are currently logged in to is overloaded from something being run by one or more other users. If the number of lost chunks is small, you can basically ignore it for our purposes in this lab. You could also try waiting a while and trying it again, or you could log in to one of the other CLEAR machines.

  3. To interpret the data, invoke perf again as follows:

    perf report --stdio --max-stack 6

    You probably want to redirect the standard output of this into a file and then use something like less to examine the file.

    You should see something like the following:

         # To display the header info, please use --header/--header-only options.
         # Total Lost Samples: 0
         # Samples: 9K of event 'cycles:u'
         # Event count (approx.): 8012828499
         # Children      Self  Command        Shared Object     Symbol
         # ........  ........  .............  ................  ..........................
             99.58%    28.11%  sample-nonopt  sample-nonopt     [.] insert_list
                                |          |
                                |           --97.59%--insert_list
                                |                     |
                                |                     |--81.58%--insert_list
                                |                     |          |
                                |                     |          |--66.74%--insert_list
                                |                     |          |          |
                                |                     |          |          |--27.86%--insert_list
                                |                     |          |          |
                                |                     |          |          |--15.99%--_int_free
                                |                     |          |          |
                                |                     |          |          |--15.13%--make_nonempty
                                |                     |          |          |
                                |                     |          |          |--5.25%--is_empty
                                |                     |          |          |
                                |                     |          |           --2.52%--__cfree (inlined)
                                |                     |          |
                                |                     |           --14.84%--make_nonempty
                                |                     |                     malloc
                                |                     |
                                |                      --16.00%--make_nonempty
                                |                                malloc
                                |                                _int_malloc
             46.80%    15.86%  sample-nonopt  sample-nonopt     [.] make_nonempty
                  etc.  . . .

The interpretation of this data can be a little bit complicated. Here are some of the key points:

Using Profiling Information

Now that you have the information from one or more of the profiling methods above, what do you do with it?

Within this course, a few lectures have described some optimization techniques, and more techniques will be described in later lectures. It will be helpful to use some of these techniques in the next assignment, where optimizing time and space resources is part of the assignment requirements.

See also B&O Section 5.14.2.

GitHub Repository for This Lab

To obtain your private repo for this lab, please point your browser to this link for the starter code for the lab. Follow the same steps as for previous labs and assignments to create your repository on GitHub and then to clone it onto CLEAR. The directory for your repository for this lab will be


where name is your GitHub userid.


Be sure to  git push  the appropriate C source files for this lab before 11:55 PM tonight, to get credit for this lab.