Brent Stephens, Handling Routing Failures in Data Center Networking via Local Fast Failover

Slides

In current datacenter networks, failures have become the norm. Software recovery strategies struggle, or even fail, to meet tight SLAs when these failures occur, whereas hardware recovery can be near-instantaneous. This paper describes an approach to practically implementing t-resilience in hardware, which protects against up to t simultaneous failures, and shows that next-generation switches can implement moderate yet interesting values of t. Although forwarding table size is a limiting factor, we find that low levels of resilience are effective at preventing routing failures. For example, only 0.0002% of the pairwise paths are expected to fail given 4-resilience and 64 edge failures on a 2048-host topology. Additionally, we consider Plinko, a new resilient forwarding architecture with compressible forwarding entries. To utilize this compressibility, we introduce both a new forwarding table compression algorithm and a new compression-aware routing algorithm. With these, we find that, as topology size increases, Plinko is frequently 6–8x more scalable than repurposing the packet header formats of MPLS Fast Reroute or FCP to enable resilient forwarding.
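
As a rough illustration of the local fast failover idea behind t-resilience, the sketch below shows a single switch that keeps an ordered list of output ports per destination and forwards on the first port whose link is still up. It is a toy model under stated assumptions, not the Plinko forwarding model or the compression algorithm from the talk: the table layout, the function name select_port, and the port numbers are all hypothetical, and real t-resilience concerns failures anywhere in the network rather than only at one switch.

```python
from typing import Dict, List, Optional, Set

# Hypothetical table layout: destination -> ordered output ports,
# primary port first, followed by backup ports.
ForwardingTable = Dict[str, List[int]]


def select_port(table: ForwardingTable, dst: str,
                failed_ports: Set[int]) -> Optional[int]:
    """Return the first listed port for dst whose link is not failed.

    Returns None when the destination is unknown or every listed port
    has failed, i.e. the local failover options are exhausted.
    """
    for port in table.get(dst, []):
        if port not in failed_ports:
            return port
    return None


# One primary and two backups: this entry alone tolerates up to two
# failed local ports toward h2 (an assumption of this toy example).
table: ForwardingTable = {"h2": [1, 2, 3]}
```

With one primary and t backup ports per entry, a lookup like this can route around up to t failed local links without involving a controller, which is why recovery can be near-instantaneous. It also hints at why forwarding table size becomes the limiting factor as t grows.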