Process Checkpointing and Rollback Recovery

Process checkpointing and rollback recovery have been used extensively to provide fault-tolerance support for system software. The technique relies on the availability of stable storage, an abstraction of a perfect storage device that survives processor failures. The key idea is that processes save their state to stable storage at regular intervals during execution. If a failure occurs, an intermediate application state is constructed from the saved checkpoints, and the application restarts from this intermediate state when the system recovers. Restarting an application from an intermediate state instead of from the beginning saves valuable computing time.

We have designed and implemented a stable storage service on top of the Unix operating system. This service allows servers to create, access, and delete persistent memory that survives server crashes. In designing the service, we have attempted to supply the operations that are needed in, and suitable for, the design and implementation of a large number of fault-tolerant protocols. The operations provided are based on our experience in implementing fault-tolerant group communication protocols. The main goals of the service are simplicity, generality of use, and efficiency.

We have also derived guidelines for choosing an appropriate checkpointing and recovery algorithm for a given parallel and distributed application. The guidelines are derived by characterizing parallel and distributed applications by their interprocess communication and stable storage requirements, and then evaluating several checkpointing and rollback recovery algorithms for applications with different characteristics.
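The save-then-restart idea described above can be illustrated with a minimal single-process sketch. It is not the stable storage service of this project: it assumes that a local file written with a temporary file, fsync, and an atomic rename is a good enough approximation of stable storage, and the names save_checkpoint, load_checkpoint, and run, as well as the crash_at parameter used to simulate a failure, are hypothetical and chosen only for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write `state` to `path` atomically (assumed stand-in for stable storage)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so the final rename
    # stays on one file system and is atomic on POSIX systems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before publishing it
        os.replace(tmp_path, path)  # readers see the old or the new checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, checkpoint_every, crash_at=None):
    """Sum the integers 0..total_steps-1, checkpointing at regular intervals.

    `crash_at` is a hypothetical hook that raises before the given step
    to simulate a process failure.
    """
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated crash")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

If a run is interrupted, calling run again with the same path resumes from the most recent checkpoint, so only the steps taken after that checkpoint are repeated. A real distributed setting would additionally need to coordinate checkpoints across processes so that the restored states are mutually consistent, which this sketch does not attempt.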

Publications



Copyright © 1996 Shivakant Mishra