Process Checkpointing and Rollback Recovery
Process checkpointing and rollback recovery has been extensively
used to provide fault-tolerance support for system software. This technique
relies on the availability of stable storage, an abstraction of perfect
storage that survives processor failures. The key idea is that processes
save their state to stable storage at regular intervals during execution.
Then, should a failure occur, an intermediate application state is constructed
from the saved checkpoints and the application restarts its execution from
this intermediate state when the system recovers. Restarting an application
from an intermediate state instead of from the beginning saves valuable
computing time.

We have designed and implemented a stable storage service
on top of the Unix operating system. This service allows servers to create,
access, and delete persistent memory that survives server crashes. In the
design of this service, we have attempted to supply the operations which
are needed in, and are suitable for, the design and implementation of a
large number of fault-tolerant protocols. The operations provided are based
on our experiences in implementing fault-tolerant group communication protocols.
The main goals in providing this service are simplicity, generality of
use, and efficiency.

We have derived guidelines for choosing an appropriate
checkpointing and recovery algorithm for a given parallel and distributed
application. These guidelines are derived by characterizing parallel and
distributed applications by their interprocess communication and stable
storage requirements, and then evaluating some checkpointing and rollback
recovery algorithms for applications with different characteristics.
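
The checkpoint-and-restart idea described above can be illustrated with a minimal single-process sketch. It is not the stable storage service from the paper: it assumes that a local file written with fsync and an atomic rename is an acceptable stand-in for stable storage, and the names save_checkpoint, load_checkpoint, and run_sum are hypothetical, chosen only for this example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write `state` to `path` so that a crash never leaves a half-written file.

    Assumption: a local file plus fsync and rename approximates stable storage.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())    # push the data to disk before renaming
        os.replace(tmp_path, path)  # swap in the new checkpoint in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, initial_state):
    """Return the last saved state, or a copy of `initial_state` if none exists."""
    if not os.path.exists(path):
        return dict(initial_state)
    with open(path) as f:
        return json.load(f)


def run_sum(path, n, interval, crash_at=None):
    """Sum 0..n-1, checkpointing every `interval` steps.

    `crash_at` simulates a failure just before that value is processed.
    Returns the total and the number of steps executed in this run.
    """
    state = load_checkpoint(path, {"next": 0, "total": 0})
    steps_done = 0
    while state["next"] < n:
        if crash_at is not None and state["next"] == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["next"]
        state["next"] += 1
        steps_done += 1
        if state["next"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"], steps_done
```

If a run of run_sum(path, 100, 10, crash_at=57) fails, calling run_sum(path, 100, 10) again resumes from the checkpoint taken after step 50 rather than from the beginning, so only the work done since that checkpoint is repeated. A distributed application additionally has to make sure the checkpoints of different processes form a consistent state, which this single-process sketch does not address.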
Publications
- F. Cristian, S. Mishra, and Y. Hyun,
  Implementation and Performance of a Stable Storage Service in Unix.
  Proceedings of the 15th IEEE Symposium on Reliable Distributed Systems,
  Niagara-on-the-Lake, Canada (October 1996), 86--95.
- S. Mishra and D. Wang,
  Choosing an Appropriate Checkpointing and Rollback Recovery Algorithm for
  Long-Running Parallel and Distributed Applications. Proceedings of the
  11th ISCA International Conference on Computers and their Applications,
  San Francisco, CA (March 1996), 24--27.
Copyright © 1996 Shivakant Mishra