Florin Dinu, Chattering in the Cloud: Improving Hadoop's Performance by Cross-Sharing of Application Experience

Failures are common in today's data center environments and can significantly impact the performance of important jobs running on top of large-scale computing frameworks. In this work we analyze Hadoop's behavior under compute node and process failures. Surprisingly, we find that even a single failure can have a large detrimental effect on job running times. We uncover several important design decisions underlying this behavior: the inefficiency of Hadoop's statistical speculative execution algorithm, the overloading of TCP failure semantics, and the lack of failure information sharing. Today, the solutions for sharing experience are too crude and too slow to serve the needs of running cloud applications. One important cause of this distressing state of the art lies in the current trend towards increased isolation among cloud applications. Focusing on isolation overlooks, and potentially impedes, the substantial benefits obtainable through cross-sharing of application experience. We explore these underexamined benefits and the challenges associated with sharing application experience.