What simplifies parallelism can also simplify resilience. In this work, we describe an asynchronous checkpoint/restart framework for CnC, a dataflow-based programming model, and demonstrate that CnC is an exemplar target for a simple yet powerful resilience system for parallel computations. We claim that the same attributes that simplify reasoning about parallel applications written in CnC also simplify the implementation of a checkpoint/restart system within the CnC runtime. To support this claim, we have implemented a simple checkpoint/restart system within CnC on Habanero-C. We show how the CnC runtime can fully encapsulate the checkpointing and restarting processes, so that application programmers gain the benefits of resilience with no effort beyond implementing the application in CnC. Furthermore, our approach is asynchronous and thus avoids the synchronization overheads present in traditional techniques.
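As a rough illustration of why a dataflow model lends itself to this, the sketch below logs every item as it is produced and, on restart, reloads the log and skips any step whose output already exists. It is not the Habanero-C implementation described above. The names `ItemStore`, `run`, and `demo` are hypothetical, and the sketch assumes single-assignment items, steps that are deterministic functions of their input items, and a JSON-lines file standing in for the checkpoint.

```python
import json
import os
import queue
import tempfile
import threading


class ItemStore:
    """Hypothetical single-assignment item collection with an asynchronous log."""

    def __init__(self, log_path):
        self.items = {}
        # Restart path: reload every item that was checkpointed before the failure.
        if os.path.exists(log_path):
            with open(log_path) as log:
                for line in log:
                    key, value = json.loads(line)
                    self.items[key] = value
        self._log_path = log_path
        self._pending = queue.Queue()
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()

    def put(self, key, value):
        if key in self.items:
            raise ValueError(f"item {key!r} already written (single assignment)")
        self.items[key] = value
        # Hand the item to the background writer; the producing step does not wait.
        self._pending.put((key, value))

    def _drain(self):
        with open(self._log_path, "a") as log:
            while True:
                entry = self._pending.get()
                if entry is None:  # sentinel from close()
                    break
                log.write(json.dumps(entry) + "\n")
                log.flush()

    def close(self):
        self._pending.put(None)
        self._writer.join()


def run(store, n):
    """Run one step per tag 0..n-1, skipping steps whose output item already exists."""
    executed = 0
    for tag in range(n):
        if tag in store.items:
            continue  # output was restored from the checkpoint
        value = tag if tag < 2 else store.items[tag - 1] + store.items[tag - 2]
        store.put(tag, value)
        executed += 1
    return executed


def demo():
    """Run five steps, 'fail', then restart from the log and finish ten steps."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "items.log")
        first = ItemStore(path)
        ran_first = run(first, 5)
        first.close()  # stands in for a failure after the log has been flushed

        second = ItemStore(path)  # restart: items 0..4 come back from the log
        ran_second = run(second, 10)
        second.close()
        return ran_first, ran_second, second.items[9]
```

Here the application code (`run`) contains no checkpointing logic: logging and restoring live entirely inside the store, and steps never block on the log writer. The sketch does not handle torn writes or distributed execution.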