Tech Report CS-94-32

Optimal Tracing and Replay for Debugging Message-Passing Parallel Programs

Robert H.B.Netzer and Barton P. Miller

July 1994


A common debugging strategy involves re-executing a program (on a given input) over and over, each time gaining more information about bugs. Such techniques can fail on message-passing parallel programs. Because of nondeterminacy, different runs on the given input may produce different results. This non-repeatability is a serious debugging problem, since an execution cannot always be reproduced to track down bugs. This paper presents a technique for tracing and replaying message-passing programs. By tracing the order in which messages are delivered, a reexecution can be forced to deliver messages in their original order, reproducing the original execution. To reduce the overhead of such a scheme, we show that the delivery order of only messages involved in races need be traced (and not every message). Our technique makes run-time decisions to detect and trace racing messages, and is usually optimal in the sense that the minimal number of racing messages is traced. Experiments indicate that only 1% of the messages are often traced, gaining two orders of magnitude reduction over traditional techniques which trace every message. These traces allow an execution to be reproduced any number of times for debugging. Our work is novel in that we adaptively decide what to trace, and trace only those messages that introduce nondeterminacy. With our strategy, large reductions in trace size allow long-running programs to be replayed that were previously unmanageable. In addition, the reduced tracing requirements alleviate tracing bottlenecks, allowing executions to be debugged with substantially lower execution-time overhead.

(complete text in pdf or gzipped postscript)