Lai Wei, Overcoming Distributed Debugging Challenges in the MPI+OpenMP Programming Model

Slides

There is a general consensus that exascale computing will embrace a wide range of programming models to harness the many levels of architectural parallelism, including models that expose parallelism in CPUs and devices, such as OpenMP. To help programmers manage the complexity that arises from combining multiple programming models, debugging tools must let them identify errors at the level of the programming model where the root cause of a bug was introduced. However, which levels are effective for debugging hybrid distributed models remains an open question. In this work, we present a novel framework for building an intuitive stack trace view of MPI+OpenMP programs. We develop a stack-trace merging methodology for OpenMP threads and share the lessons we learned from incorporating OpenMP awareness into the Stack Trace Analysis Tool (STAT), a highly scalable, lightweight debugging tool for MPI applications. Our framework leverages OMPD, a new debugging interface for OpenMP, and we use it to evaluate the effective levels of debugging for MPI+OpenMP. Our easy-to-understand stack trace view helps programmers debug MPI+OpenMP programs by eliminating unnecessary stack frames (e.g., those coming from the OpenMP runtime) and by mapping stack traces to the high-level abstractions of the programming model.
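
To make the idea of merging stack traces and hiding runtime frames concrete, here is a minimal Python sketch. It is not STAT's or OMPD's implementation: the functions `strip_runtime_frames`, `merge_traces`, and `render`, the symbol-prefix filter, and the sample traces are all hypothetical and chosen only for illustration. The sketch assumes each trace is a list of function names, outermost frame first, keyed by an (MPI rank, OpenMP thread) pair.

```python
# Assumption: OpenMP runtime frames can be recognized by symbol prefix.
# A real tool would ask a debugging interface such as OMPD instead.
RUNTIME_PREFIXES = ("__kmp", "GOMP_", "gomp_")


def strip_runtime_frames(trace):
    """Drop frames that appear to belong to the OpenMP runtime."""
    return [frame for frame in trace if not frame.startswith(RUNTIME_PREFIXES)]


def merge_traces(traces):
    """Merge per-task traces into a prefix tree.

    traces maps (rank, thread) -> list of frames, outermost first.
    Each tree node records which tasks passed through that frame.
    """
    root = {}
    for task, trace in traces.items():
        level = root
        for frame in strip_runtime_frames(trace):
            node = level.setdefault(frame, {"tasks": set(), "children": {}})
            node["tasks"].add(task)
            level = node["children"]
    return root


def render(tree, depth=0):
    """Return indented lines of the form 'frame [task count]'."""
    lines = []
    for frame in sorted(tree):
        node = tree[frame]
        lines.append(f"{'  ' * depth}{frame} [{len(node['tasks'])}]")
        lines.extend(render(node["children"], depth + 1))
    return lines


# Hypothetical traces from two ranks; rank 0 also has one worker thread.
traces = {
    (0, 0): ["main", "solve", "__kmpc_fork_call",
             "__kmp_invoke_microtask", "solve_body", "MPI_Barrier"],
    (0, 1): ["__kmp_launch_thread", "__kmp_invoke_microtask",
             "solve_body", "kernel"],
    (1, 0): ["main", "solve", "__kmpc_fork_call",
             "__kmp_invoke_microtask", "solve_body", "kernel"],
}

merged = merge_traces(traces)
print("\n".join(render(merged)))
```

In this sketch the worker thread's trace (task `(0, 1)`) ends up as a separate root, because after filtering it begins at `solve_body` rather than `main`. The sketch makes no attempt to attach worker-thread frames to the frames of the thread that started the parallel region.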