This paper presents guidelines for choosing an appropriate checkpointing and recovery algorithm to provide fault-tolerance support for long-running parallel and distributed applications. The work proceeds in three stages. First, parallel and distributed applications are characterized by their stable storage and communication needs. Second, three popular checkpointing and recovery algorithms are implemented for five long-running parallel and distributed applications with different characteristics. Finally, the performance of these implementations is analyzed and, based on this analysis, guidelines for choosing an appropriate algorithm for a given application are provided.


Copyright © 1996 Shivakant Mishra