With the long-standing discrepancy between processor speeds and memory access times, fitting a program's working set in on-chip cache is critical to good performance. While on-chip data caches have become large enough to accommodate typical working sets for sequential codes, parallel programming and shared data caches on multicore architectures are complicating the task anew. Contemporary computer systems exhibit a diverse range of cache topologies as they incorporate increasing numbers of processing cores on single chips and in multi-chip modules. Writing parallel code optimized for the complexities of a single memory subsystem is a sufficiently daunting challenge for programmers; add system portability and it becomes unmanageable not only for programmers but for compilers as well.
The platform-aware compilation environment of the PACE project aims to retarget compilers quickly through automatic resource characterization. Information derived from simple, portable microbenchmarks drives compilation decisions, and compiled code is further adapted to its runtime context through dynamic application tuning and performance feedback. In this talk, we introduce an approach to optimizing the cache performance of parallel programs through compiler-guided tile selection at runtime. As part of this framework, we also present our work on the automatic detection of shared data caches.
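To make the idea of runtime tile selection concrete, the following is a minimal sketch and not the PACE implementation. It assumes that the cache capacity and the number of threads sharing that cache are already known (for example, from a characterization step), and it picks a square tile size so that three tiles fit in one thread's share of the cache. The names choose_tile_size and tiled_matmul, and the three-tile heuristic, are illustrative assumptions.

```python
import math


def choose_tile_size(cache_bytes, elem_bytes=8, sharers=1, arrays=3):
    """Pick a square tile edge T at runtime (hypothetical heuristic).

    Assumes `arrays` T x T tiles must fit in one thread's share of a
    cache of `cache_bytes` that is shared by `sharers` threads.
    """
    per_thread = cache_bytes // max(sharers, 1)
    return max(1, math.isqrt(per_thread // (arrays * elem_bytes)))


def tiled_matmul(a, b, n, tile):
    """Multiply two n x n matrices (lists of lists) using square tiles."""
    c = [[0.0] * n for _ in range(n)]
    # Outer loops walk over tiles; inner loops stay within one tile so
    # that the data touched there can remain cache-resident.
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += aik * b[k][j]
    return c
```

Under these assumptions, a 32 KiB cache holding 8-byte elements yields a tile edge of 36 when one thread uses the cache and 26 when two threads share it, which is why knowing whether a data cache is shared matters for the choice.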