Colloquium - Clement

Robust Replication (Or How I Learned to Stop Worrying and Love Failures)
Allen Clement, Max Planck Institute for Software Systems

The choice between Byzantine and crash fault tolerance is viewed as a fundamental design decision when building fault tolerant systems. We show that this dichotomy is not fundamental, and present a unified model of fault tolerance in which the number of tolerated faults of each type is a configuration choice. Additionally, we observe that a single fault is capable of devastating the performance of existing Byzantine fault tolerant replication systems. We argue that fault tolerant systems should, and can, be designed to perform well even when failures occur. In this talk I will expand on these two insights and describe our experience leveraging them to build a generic fault tolerant replication library that provides flexible fault tolerance and robust performance. We use the library to build a (Byzantine) fault tolerant version of the Hadoop Distributed File System.

Allen Clement is a Post-doctoral Research Fellow at the Max Planck Institute for Software Systems. He received a PhD from the University of Texas at Austin in 2010 and an AB in Computer Science from Princeton University. His research focuses on the challenges of building robust and reliable distributed systems. In particular, he has investigated practical Byzantine fault tolerant replication, systems robust to both Byzantine and selfish behaviors, consistency in geo-replicated environments, and how to leverage the structure of social networks to build Sybil-tolerant systems.

Department of Computer Science
University of Colorado Boulder
Boulder, CO 80309-0430 USA
May 5, 2012 (14:13)