Department of Computer Science
University of Colorado Boulder
Thesis Defense - Arshad
1/13/2006 9:30am-11:30am ECOT 831
A Planning-Based Approach to Failure Recovery in Distributed Systems
Computer Science PhD Candidate
Automated failure recovery in distributed systems is challenging because of
the many requirements and dependencies involved. Moreover, failure scenarios
are usually unpredictable, so it is not practical to enumerate every possible
failure scenario together with a way to recover the system from it. For this
reason, present failure recovery techniques are highly manual and incur
considerable downtime. In this dissertation, we have developed a planning-based
approach to automated failure recovery in distributed component-based systems.
The approach continuously monitors the system, so a failure monitor always
holds an exact system state. When a failure is detected, the monitor performs
various checks to ensure that it is not a false positive or false negative. A
dependency analyzer then determines the effects of the failure on other parts
of the system. After this, an offline planning procedure is performed to take
the system from the failed state to a working state. This planning is performed
using an artificial intelligence (AI) planner. By using planning, the approach
can recover from a variety of failed states and reach any of several acceptable
states, ranging from minimal functionality to complete recovery. Once a plan is
computed, it is executed on the system to bring it back to a working state. We
have evaluated this technique through online and synthetic experiments
performed on several distributed applications. Our results show that this is an
effective technique for automatically recovering component-based distributed
systems from failures, and that it can scale to large distributed systems.
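
As a rough illustration of the pipeline described in the abstract (dependency
analysis followed by planning from a failed state to an acceptable state), the
following Python sketch may help. It is not the dissertation's implementation:
the component names, the DEPENDS_ON table, the functions affected_by and
plan_recovery, and the use of breadth-first search in place of a real AI
planner are all assumptions made for this example.

```python
from collections import deque

# Hypothetical component graph: each component lists what it depends on.
DEPENDS_ON = {
    "database": [],
    "app_server": ["database"],
    "web_frontend": ["app_server"],
}


def affected_by(failed, depends_on):
    """Return the failed component plus everything that transitively depends on it."""
    affected = {failed}
    changed = True
    while changed:
        changed = False
        for comp, deps in depends_on.items():
            if comp not in affected and any(d in affected for d in deps):
                affected.add(comp)
                changed = True
    return affected


def plan_recovery(running, goal, depends_on):
    """Breadth-first search for a shortest list of 'start X' actions that
    takes the set of running components to a state containing the goal set.
    Returns None if no such plan exists."""
    start = frozenset(running)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, plan = queue.popleft()
        if goal <= state:
            return plan
        for comp, deps in depends_on.items():
            # A component can be started only if it is down and
            # all of its dependencies are already running.
            if comp not in state and all(d in state for d in deps):
                nxt = state | {comp}
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, plan + [f"start {comp}"]))
    return None


# Example: the database fails, which takes down everything that depends on it.
down = affected_by("database", DEPENDS_ON)
running = set(DEPENDS_ON) - down

# Two acceptable goal states: minimal functionality and complete recovery.
minimal_plan = plan_recovery(running, {"database"}, DEPENDS_ON)
full_plan = plan_recovery(running, set(DEPENDS_ON), DEPENDS_ON)
print(minimal_plan)
print(full_plan)
```

Choosing a smaller goal set yields a shorter plan, which mirrors the idea of
accepting anything from minimal functionality to complete recovery. A real
planner would also model action costs, configuration changes, and machines,
which this sketch leaves out.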