COMP 600: Graduate Seminar

Gabriel Marin, Application Insight Through Performance Modeling

Application performance depends on many variables, including algorithms, data structures, architectural parameters and input data. Moreover, these factors interact in complex ways. My work explores strategies for performance modeling that separate the contribution of application-specific factors from the contribution of the target architecture's characteristics. Such an approach has two principal benefits. First, modeling application-centric factors results in architecture-neutral models that can predict performance on different architectures. Second, algorithmic and application factors typically have a convex and differentiable profile, which makes them easier to model accurately. Our approach models the most important application factors that affect performance and enables us to explore the interactions between a target architecture and an application's characteristics. When characterizing the performance of each loop, our models consider not only memory bandwidth limitations, but also functional unit constraints, instruction dependencies and memory latencies.

Accurate models of program execution characteristics have many uses, including understanding how an application scales with problem size and predicting how an application will perform on a proposed future architecture. In addition, accurate performance models can guide application tuning by highlighting the factors that limit performance in different sections of a program. My talk will provide an overview of our performance modeling technique and describe a case study in which we applied our methodology to gain insight into the ASCI Sweep3D benchmark. We uncovered a key bottleneck in one loop where the lack of instruction-level parallelism limited performance. By transforming the loop, we obtained an overall speedup of 17%. We further transformed the code to improve its memory locality, based on the results of our memory reuse analysis. The improved version of the code runs up to three times faster than the original on an Itanium2-based machine across a wide range of problem sizes.