Lab 7 - Performance Profiling
- Understand how to use profiling tools and interpret their results.
GitHub Repository for This Lab
To obtain your private repository for this lab, follow the protocol from previous labs and assignments to get your repository link, and clone the repository onto CLEAR. The repository for this lab is
lab-7-performance-profiling-[YOUR GITHUB ID]
NOTE: In case you have trouble creating your new repository using GitHub Classroom as usual (it has lately been taking a very long time in some cases in the "Importing starter code" phase), you can temporarily get a copy of the same files for this lab from the following ZIP file: lab-7-performance-profiling.zip.
Often, you may want to know about the performance aspects of your program. Of course, when you are developing your program, you should understand its asymptotic "big-O" complexity behavior. But frequently you may want to know how your program actually performs on real input data.
Modern architectures are sufficiently complicated (by features like superscalar pipelining and multi-level caching) that static analyses cannot accurately and precisely describe program running time. What we are left with are dynamic analyses, i.e., running or simulating programs and analyzing what happened.
You might already have discovered the simplest possible Unix timing tool:
`time program` will report the time taken to run the given program. It reports elapsed ("wall clock") time, user CPU time, and system CPU time. The latter two count processor time spent on that program, broken down into "user time", corresponding to your code, and "system time", corresponding to various low-level routines (i.e., the operating system) used by your code.
The `time` command, however, cannot show the resource usage of individual modules or functions within the program.
More Detailed Profiling
More generally, the three pieces of performance information that developers are usually interested in are
- the number of calls to each function,
- the time spent in each function, and
- the heap size over time.
How could you track these? We could change the program's source code to track these (although some hardware can perform some kinds of profiling, it can't map the machine code behavior back to the source code):
- We could add a counter for each function. At the beginning of each function, we would increment the corresponding counter.
- At the beginning and end of each function, we could get the current time (e.g., using `gettimeofday()`) and compute the time consumed. In other words, add a stopwatch to each function.
- We could modify the memory allocator to periodically report the current memory usage.
We could integrate the first two of these with the `printf`-style of debugging advocated in an earlier lab. Instead, however, Unix provides standard tools for these:
gprof is a tool for analyzing a program's execution time and breaking that time down among the various functions defined in the program. In essence, it is a tool for automatically adding a stopwatch to each function. Using gprof is a three-step process:
1. Compile the program with the `-pg` flag added to the C compiler's options. This adds code into the resulting executable to do the profiling.
NOTE: As of this writing, you cannot use `gcov` on the CLEAR system except with `gcc`; you must use `gcc` for this.
2. Run the program, i.e., `program`. The resulting profiling information is put into the file `gmon.out`.
3. Run `gprof program`, which interprets the information in `gmon.out` and prints a human-readable summary.
By default, this produces lots of information. (Use `gprof program | more` or `gprof program | less` to view it all!) For each function, it shows the percentage of time spent in that function and how that time is broken down among the functions it calls. (Note: It includes all the functions in system libraries, too.) Various command-line options are available to limit what you see.
For more details, see `man gprof` or the gprof manual.
See also B&O Section 5.14.1.
gcov is similar to gprof. But rather than breaking down the time spent in each function, it counts the number of times each statement is executed. Using gcov is a four-step process:
1. Compile each of the source files (source1.c, source2.c, etc.) of the program with the `-fprofile-arcs -ftest-coverage` flags added to the C compiler's options. This adds code into the resulting executable to do the profiling, and creates the bookkeeping files (source1.gcno, source2.gcno, etc.).
2. Run the program, i.e., `program`. The resulting profiling information is put into the files source1.gcda, source2.gcda, etc.
3. Run `gcov` on each of your source files, for example, `gcov source1.c`, `gcov source2.c`, etc., to convert that profiling information into the human-readable files source1.c.gcov, source2.c.gcov, etc.
4. Look at the resulting `.gcov` files.
The perf tool is a newer addition to the suite of Unix performance-monitoring tools. perf uses sampling to acquire performance information from your program. perf samples only fixed-length stack fragments; consequently, programs with deep call stacks show some non-intuitive results.
The perf tool has several advantages over gprof:
- perf does not require any additional compiler flags beyond `-g`.
- perf has lower overhead.
The gprof tool, however, still has several advantages over perf:
- gprof can produce call counts, as well as raw time data.
- gprof produces complete call graphs, not fractions of a call graph.
- The gprof user interface is simpler.
Using perf is a two-step process, much like gprof. To begin the perf process (after compiling with `-g`), invoke the recording phase of perf:
`perf record -e cycles:u --call-graph dwarf YOUR_PROGRAM_WITH_ARGUMENTS`
The data recording phase of perf yields a file called `perf.data`. To interpret the data, invoke the second step of the perf process as follows:
`perf report --stdio --max-stack 6`
On a small screen, you might want to reduce `max-stack` to 5. You should see something like:
```
# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 5K of event 'cycles:u'
# Event count (approx.): 4146868529
#
# Children      Self  Command     Shared Object  Symbol
# ........  ........  ..........  .............  ..........................
#
    99.62%    31.31%  sample-opt  sample-opt     [.] insert_list
            |
            ---insert_list
               |
               |--98.74%-- insert_list
               |          |
               |          |--99.96%-- insert_list
               |          |          |
               |          |          |--99.96%-- insert_list
               |          |          |          |
               |          |          |          |--99.98%-- insert_list
               |          |          |          |          |
               |          |          |          |           --100.00%-- insert_list
               |          |          |          |
               |          |          |           --0.02%-- [...]
               |          |          |
               |          |           --0.04%-- [...]
               |          |
               |           --0.04%-- [...]
               |
                --1.26%-- [...]

    47.38%    24.57%  sample-opt  sample-opt     [.] malloc
```
The interpretation of this data is a little involved. The key points are:
- The "Self" column shows the percentage of total cycles spent in the given routine. The name of the routine (when known) is shown in the far-right column, labeled "Symbol". Another way to think about the Self number is that it reflects the fraction of samples that had the routine at the bottom of the call stack. The sum of all of the Self numbers should be 100%.
- The "Children" number is an attempt to measure the total fraction of program cycles spent in the routine plus everything the routine calls. The Children number, however, cannot be relied upon for programs with deep call stacks. Since perf does not collect full call stacks, top-level routines may have highly inaccurate Children numbers. In particular, the `main` routine may not have 100% as its Children number. That is what is happening in this lab.
- The "Command" and "Shared Object" columns reflect the command specifics and the object file name, respectively. These will likely not matter to you until you employ dynamically linked libraries.
- The tree-like entries appearing after a main symbol line indicate the callers of the routine.
- It is important to note that perf looks up the call stack. The percentages shown on a call-stack node indicate the fraction of cycles that had that routine at that level of the call stack. `insert_list` is heavily recursive, so you should expect to find calls to itself on the unwound call stack.
Using Profiling Information
Now that you have the information from one or more of the profiling methods above, what do you do with it?
- Debugging -- Identify parts of the program using more or fewer resources than expected. These may correspond to bugs.
- Tuning -- Identify the parts of the program using lots of resources. Try to optimize those parts of the program, since this is where there is the biggest room for improvement.
Within this course, a few classes have described some optimization techniques, and more techniques will be described in later classes. It will be helpful to use some of these techniques in the next assignment, where optimizing time and space resources is part of the assignment requirements.
See also B&O Section 5.14.2.
Be sure to git push your lab before 11:55PM on Saturday at the end of this week.