Effective combination of inter-node and intra-node parallelism is recognized to be a major challenge for future extreme-scale systems. State-of-the-art techniques that combine distributed- and shared-memory programming models, as well as many PGAS approaches, have demonstrated the benefits of combining the two, including increased communication-computation overlap, improved memory utilization, and effective use of accelerators. However, these approaches often require significant rewrites of application code, are accessible only to expert programmers, or both.
Dynamic task parallelism has been widely regarded as a programming model that combines the best of performance and programmability for shared-memory programs. For distributed-memory programs, most users rely on efficient implementations of MPI. In this talk, we propose HCMPI (Habanero-C MPI), a programming model that integrates asynchronous task parallelism with MPI. It creates a rich new platform with novel programming model constructs, while also offering a practical approach for programmers who want to take incremental transitional steps starting from either a shared- or distributed-memory program. HCMPI unifies the Habanero-C dynamic task-parallel language with the widely used MPI message-passing interface. All MPI calls are treated as asynchronous tasks in this model, thereby enabling unified handling of messages and tasking constructs. We also introduce a new distributed data-driven programming model that seamlessly integrates intra-node and inter-node data-flow programming, without requiring any knowledge of MPI. We demonstrate scalable performance with the help of a novel runtime design that uses a combination of communication and computation workers. Our evaluation on a set of microbenchmarks and large benchmarks shows superior performance and scalability compared to the most efficient MPI implementations, while offering a general programming model to integrate asynchronous task parallelism with MPI.
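The idea of treating message operations as asynchronous tasks, serviced by a dedicated communication worker alongside computation workers, can be illustrated with a small sketch. This is not HCMPI or Habanero-C code and it does not call MPI: it is a single-process Python analogy built only on the standard library. The names `CommWorker`, `isend`, `irecv`, and `demo` are hypothetical, and an in-memory table stands in for the network. Each message operation returns a future, and a computation task is scheduled only once the data it depends on has arrived, which mimics the data-driven style described above.

```python
import queue
import threading
from collections import defaultdict, deque
from concurrent.futures import Future, ThreadPoolExecutor


class CommWorker:
    """Hypothetical communication worker: one thread services all message operations."""

    def __init__(self):
        self._ops = queue.Queue()                # operations enqueued by tasks
        self._messages = defaultdict(deque)      # tag -> payloads not yet received
        self._receivers = defaultdict(deque)     # tag -> futures waiting for a payload
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def isend(self, tag, payload):
        """Enqueue a send; the returned future completes once the send is processed."""
        done = Future()
        self._ops.put(("send", tag, payload, done))
        return done

    def irecv(self, tag):
        """Enqueue a receive; the returned future completes with the payload."""
        done = Future()
        self._ops.put(("recv", tag, None, done))
        return done

    def shutdown(self):
        self._ops.put(None)                      # sentinel stops the worker loop
        self._thread.join()

    def _run(self):
        while True:
            op = self._ops.get()
            if op is None:
                return
            kind, tag, payload, done = op
            if kind == "send":
                if self._receivers[tag]:
                    # A receive is already waiting: deliver the payload to it.
                    self._receivers[tag].popleft().set_result(payload)
                else:
                    self._messages[tag].append(payload)
                done.set_result(None)
            else:
                if self._messages[tag]:
                    done.set_result(self._messages[tag].popleft())
                else:
                    self._receivers[tag].append(done)


def demo():
    """Post a receive, attach a data-driven compute task, then send the data."""
    comm = CommWorker()
    result = Future()
    with ThreadPoolExecutor(max_workers=2) as pool:   # computation workers
        incoming = comm.irecv(tag=7)

        def compute(value):
            result.set_result(value * 2)

        # The compute task is submitted only after the message has arrived.
        incoming.add_done_callback(lambda f: pool.submit(compute, f.result()))
        pool.submit(comm.isend, 7, 21)                # the send is itself a task
        answer = result.result(timeout=5)
    comm.shutdown()
    return answer


if __name__ == "__main__":
    print(demo())
```

In this sketch the computation workers never block on communication: they hand operations to the communication worker and continue, and dependent work is released by the arrival of data. A real runtime would replace the in-memory table with non-blocking message-passing calls and completion polling.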