Tech Report CS-93-25

Adaptive Independent Checkpointing for Reducing Rollback Propagation

Jian Xu and Robert H. B. Netzer

May 1993

Abstract:

Independent checkpointing is a simple technique for providing fault tolerance in distributed systems. During execution, processes periodically and independently checkpoint their state, and upon detecting a fault, execution is rolled back and resumed from earlier checkpoints. However, independent checkpointing suffers from the {\it domino effect}, which causes the rollback of one process to potentially propagate to others. In the worst case, rollback propagation can force recovery to start from the execution's start, preventing forward progress from being made in the presence of faults. In this paper we present an {\it adaptive} checkpointing algorithm to practically eliminate rollback propagation for independent checkpointing. Our algorithm is based on proofs of the conditions necessary and sufficient for a checkpoint to belong to a consistent global state, previously an open question. We characterize these conditions with a generalization of Lamport's happened-before relation, called a {\it zigzag path}. Our algorithm tracks zigzag paths on-line and checkpoints when certain paths are detected. Experiments on an iPSC/860 hypercube show that we reduce the average rollback required to recover from any fault to less than one state interval per process, and checkpoint only 10% more often than traditional periodic checkpointing algorithms. We thus eliminate rollback propagation without the run-time overhead of coordinated checkpoints or other schemes that attempt to reduce rollback propagation.

(complete text in pdf or gzipped postscript)