Lab 7 - Performance Profiling

Lab goals:

  * Measure whole-program running time with the time command.
  * Profile per-function time with gprof.
  * Count statement executions with gcov.
  * Sample performance events with perf.
  * Use the resulting profiles to guide optimization.
GitHub Repository for This Lab

To obtain your private repo for this lab, please point your browser to this starting link.
Follow the protocol from previous labs and assignments to get your repository link and clone the repository onto CLEAR. The repository for this lab is

  lab-7-performance-profiling-[YOUR GITHUB ID]

NOTE:  In case you have trouble creating your new repository using GitHub Classroom as usual (it has lately been taking a very long time in some cases in the "Importing starter code" phase), you can temporarily get a copy of the same files for this lab from the following ZIP file:  lab-7-performance-profiling.zip.



Performance Profiling

Often, you may want to know about the performance aspects of your program. Of course, when you are developing your program, you should understand its asymptotic "big-O" complexity behavior. But frequently you may want to know how your program actually performs on real input data.

Modern architectures are sufficiently complicated (by features like superscalar pipelining and multi-level caching) that static analyses cannot accurately and precisely describe program running time. What we are left with are dynamic analyses, i.e., running or simulating programs and analyzing what happened.

Whole-Program Profiling

You might already have discovered the simplest possible Unix timing tool, the time command:

       time program

The time command, however, cannot show the resource usage of individual modules or functions within the program.

Exercise #1
  1. The provided program sample.c in your repository requests a specification of a list of numbers (the specification should be a sequence of positive numbers, terminated by a 0), and then prints the original full list of numbers, followed by the same list of numbers in sorted order. For example:

         To create some data, enter a sequence of positive numbers.
         Terminate the list with a zero.
         Use at least moderately large numbers to create enough data for good profiling.
         (However, you can run out of memory if you use numbers that are too large.)
         10 5 8 0
         The data:
         1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 1 2 3 4 5 6 7 8
         The data:
         1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 7 7 8 8 9 10

    The input specification here "10 5 8 0" means that the full list of numbers should contain all integers from 1 through 10, followed by all integers from 1 through 5, followed by all integers from 1 through 8.

  2. Compile the sample program sample.c without any compiler optimizations. Then use the time command on it:

         make sample-nonopt
         time ./sample-nonopt

    The output of this will be something like:

         1.636u 0.003s 0:01.65 98.7%	0+0k 0+0io 0pf+0w

    This output indicates that 1.636 seconds of CPU time were spent running this user process, 0.003 seconds of CPU time were spent running in the operating system on behalf of this process, and 1.65 seconds of real ("wall clock") time elapsed. The output also indicates that the process kept the CPU busy for 98.7% of its elapsed time. Finally, the cryptic numbers on the right indicate the amount of memory that was used and the number of I/O operations, page faults, and page swaps that occurred.

  3. Since this program involves user input, the elapsed time includes however long it takes you to enter some data. The time command is most useful on programs not involving user input. Here, you'll get the most accurate results if you redirect the standard input from a file.

    We have provided a sample input file called testinput. Use time on the program, redirecting the program's standard input from this file. What does the input specification in testinput mean?

    You'll probably want to redirect the standard input from this file in running the sample.c program for all of the other exercises in this lab, too.

  4. Run "time ./sample-nonopt" a few more times. Observe whether and how the time statistics change from run to run.

  5. The Makefile will also make an optimized version of the sample.c program. sample-nonopt is the non-optimized version (above), and sample-opt is the optimized version. To make the optimized version, use

           make sample-opt
    
    The optimization level for sample-opt defaults to -O3.  Look at and make sure you understand the command executed by this use of make.

    To employ alternative optimization levels, use, for example, the following

           make clean
           make OPT=2 sample-opt
    
    to compile with optimization level -O2, or use OPT=1 in the same way for level -O1.  Again, look at and make sure you understand the command executed by this use of make.

  6. Using the time command again as above, compare the results for sample-nonopt with those for sample-opt with different levels of optimization.

More Detailed Profiling

More generally, the three pieces of performance information that developers are usually interested in are

  * how much time is spent in each part of the program,
  * how many times each part of the program is executed, and
  * how much memory the program uses.

How could you track these? We could change the program's source code to track these (although some hardware can perform some kinds of profiling, it can't map the machine code behavior back to the source code):

  * insert timing calls around each region of interest,
  * insert counters that are incremented whenever a region is executed, and
  * record the size of each memory allocation and deallocation.

Note that in each case, the act of profiling will change the profile at least a little.

We could integrate the first two of these with the printf-style of debugging advocated earlier in the lab on debugging. However, instead, Unix has standard tools for these:

  * gprof, which reports how execution time is divided among the program's functions, and
  * gcov, which counts how many times each statement is executed.

Modern Linux systems augment these classic tools with perf, which uses hardware performance counters to sample events such as cycles and cache misses. Unfortunately, Unix has no standard tool for profiling space usage, although several debugging and profiling tools are described here.



The gprof Tool

gprof is a tool for analyzing a program's execution time and breaking this time down among all the various defined functions in the program. In essence, it is a tool for automatically adding a stopwatch to each function.

Using gprof is a three-step process:

  1. Compile the program program with the -pg flag to the C compiler you are using. This adds code into the resulting executable to do the profiling.

    NOTE: As of this writing, you cannot use clang with gprof or gcov on the CLEAR system.

    You must use gcc for this instead of clang.

  2. Run the program program. The resulting profiling information is put into the file gmon.out.

  3. Run gprof program, which interprets the information in gmon.out and prints a human-readable summary.

    By default, this produces lots of information. (Use "gprof program | more"  or  "gprof program | less"  to view it all!) For each function, it shows the percentage of time spent in that function and how that time is broken down into the functions it calls. (Note: It includes all the functions included in system libraries, too.) Various command-line options are available to limit what you see.

Of course, for more details, look at man gprof, or this gprof manual.

Exercise #2 (gprof)
  1. Compile the sample.c in your repo using -pg and without any compiler optimizations, and use gprof on it. You can use the Makefile in your repo:

           make sample-gprof-nonopt
    will make a version of the sample.c program named sample-gprof-nonopt that is compiled with the -pg option.

    Use gprof on sample-gprof-nonopt as described above. What function in sample-gprof-nonopt takes the most time?

  2. Compile the program again using various compiler optimization levels (-O1 or -O2 or -O3) and use gprof.

    Similar to the above, the Makefile can make two versions of the sample.c program compiled with the -pg option: sample-gprof-nonopt and sample-gprof-opt. To make the non-optimized version with -pg, use

           make sample-gprof-nonopt
    
    To make the optimized version with -pg, use
           make sample-gprof-opt
    
    The optimization level for sample-gprof-opt defaults to -O3.  Look at and make sure you understand the command executed by this use of make.

    To employ alternative optimization levels, use, for example, the following

           make clean
           make OPT=2 sample-gprof-opt
    
    to compile with optimization level -O2 and with -pg, or use OPT=1 in the same way for level -O1.

  3. How much does the performance improve with various optimizations levels?

    You can see some sample output from gprof in the Sample-Gprof-Output directory in your repo.

See also B&O Section 5.14.1.

The gcov Tool

gcov is similar to gprof. But, rather than breaking down the time spent in each function, it counts the number of times each statement is executed. Using gcov is a four-step process:

  1. Compile each of the source files source1.c, source2.c,  etc. of the program program with the -fprofile-arcs -ftest-coverage flags to the C compiler. This adds code into the resulting executable to do the profiling, and creates the bookkeeping files source1.gcno, source2.gcno,  etc.

  2. Run the program program. The resulting profiling information is put into the files source1.gcda, source2.gcda,  etc.

  3. Run gcov on each of your source files, for example, gcov source1.c, gcov source2.c,  etc., to separate that profiling information into the human-readable files source1.c.gcov, source2.c.gcov,  etc.

  4. Look at the files *.gcov.

Exercise #3 (gcov)
  1. Compile the sample.c program, and use gcov on it. You can use the Makefile to make the sample.c program with the above arguments for gcov, creating sample-gcov:

           make clean
           make sample-gcov
    
    Again, look at and make sure you understand the command executed by this use of make.

    Be sure to run the profiled program, sample-gcov, only once for now.

    What statement is executed the most times?

  2. Without recompiling, run the profiled program again, and run gcov again.

    The statistics should now represent both runs together. The counts are set back to zero only when you recompile.

The perf Tool

The Linux perf tool is a new addition to the suite of Unix performance monitoring tools. Like gprof, perf uses sampling to acquire the performance information from your program. Unlike gprof, however, perf samples only fixed-length stack fragments. Consequently, programs with deep call stacks show some non-intuitive results.

The perf tool has several advantages over gprof:

  * it does not require recompiling the program with instrumentation (although compiling with -g is needed for source-level detail),
  * it can count many kinds of hardware events, such as cycles, cache misses, and branch mispredictions, and
  * it can observe activity in shared libraries and in the kernel on behalf of the process.

The gprof tool, however, still has several advantages over perf:

  * it records exact call counts rather than statistical samples, and
  * it attributes time through the program's full call graph rather than fixed-length stack fragments.

Using perf is a two-step process, much like gprof.
  1. To begin the perf process (after compiling with -g), invoke the recording phase of perf as follows:

          perf record -e cycles:u --call-graph dwarf YOUR_PROGRAM_WITH_ARGUMENTS
    The data recording phase of perf yields a file called perf.data.

  2. To interpret the data, invoke the second step of the perf process as follows:

        perf report --stdio --max-stack 6
    On a small screen, you might want to reduce max-stack to 5. You should see something like:
      
    # To display the perf.data header info, please use --header/--header-only options.
    #
    # Samples: 5K of event 'cycles:u'
    # Event count (approx.): 4146868529
    #
    # Children      Self  Command     Shared Object      Symbol                    
    # ........  ........  ..........  .................  ..........................
    #
        99.62%    31.31%  sample-opt  sample-opt         [.] insert_list           
                |
                ---insert_list
                   |          
                   |--98.74%-- insert_list
                   |          |          
                   |          |--99.96%-- insert_list
                   |          |          |          
                   |          |          |--99.96%-- insert_list
                   |          |          |          |          
                   |          |          |          |--99.98%-- insert_list
                   |          |          |          |          |          
                   |          |          |          |           --100.00%-- insert_list
                   |          |          |           --0.02%-- [...]
                   |          |           --0.04%-- [...]
                   |           --0.04%-- [...]
                    --1.26%-- [...]
    
    47.38%    24.57%  sample-opt  sample-opt         [.] malloc
    

The interpretation of this data is a little involved. The key points are:

  * The "Self" column gives the percentage of samples taken while executing in the function itself; the "Children" column also includes samples taken in the functions it calls.
  * Because insert_list is recursive, it appears as its own caller and callee, so its "Children" percentage is inflated at every level of the (truncated) call chain.
  * A large "Self" percentage, as for insert_list and malloc here, marks the best candidates for optimization.

Exercise #4 (perf)

Run perf (both phases, as described above) on sample-opt and then separately on sample-nonopt.
You may need to do "make sample-opt sample-nonopt" first.

According to perf, what routine most strongly affects the performance?



Using Profiling Information

Now that you have the information from one or more of the profiling methods above, what do you do with it?

Within this course, a few classes have described some optimization techniques, and more techniques will be described in later classes. It will be helpful to use some of these techniques in the next assignment, where optimizing time and space resources is part of the assignment requirements.

Exercise #5
  1. Optimize the one function in the sample.c program that you previously identified as being the performance bottleneck.

    You'll need to use some ideas that the compiler couldn't figure out. For example, replace recursion with iteration (in an ideal world, C compilers should be better at automatically performing this optimization), or pick different algorithms or data structures. Change anything, except do not change main() and do not change the externally observable behavior of the program.

  2. Again use gprof, gcov,  and perf. How much did this one localized improvement help?

  3. Optimize the whole program. Compare results with your friends.

See also B&O Section 5.14.2.


Submission

Be sure to git push your lab before 11:55PM on Saturday at the end of this week.