
Fault-Tolerant Parallel Applications with Dynamic Parallel Schedules: A Programmer’s Perspective

S. Gerlach, B. Schaeli, R.D. Hersch

Dependable Systems: Software, Computing, Networks, Lecture Notes in Computer Science (LNCS) vol. 4028, J. Kohlas, B. Meyer and A. Schiper (Eds.), Springer Verlag, 2006, pp. 195-210

Dynamic Parallel Schedules (DPS) is a flow-graph-based framework for developing parallel applications on clusters of workstations. The DPS flow graph execution model enables automatic pipelined parallel execution of applications. DPS supports graceful degradation of parallel applications in case of node failures. The fault-tolerance mechanism relies on a set of backup threads stored in the volatile storage of alternate nodes. These backup threads are kept up to date both by duplicating transmitted data objects and by periodic checkpointing. The current state of a failed node can be reconstructed on its backup threads by re-executing the application from the last checkpoint. A valid execution order is automatically deduced from the flow graph. Adding fault tolerance to a DPS application requires only minor changes to the application’s source code. The present contribution focuses on the development of fault-tolerant parallel applications with DPS from a programmer’s perspective.
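The recovery scheme summarized above (duplicate each transmitted data object to a backup, checkpoint periodically, and re-execute from the last checkpoint after a failure) can be illustrated with a minimal single-process sketch. This is not the DPS API: the names `Worker`, `Backup`, `run`, `checkpoint_every` and `fail_after` are hypothetical, the "thread state" is reduced to a running sum, and the replay simply follows the logged arrival order, whereas DPS deduces a valid execution order from the flow graph.

```python
import copy


class Worker:
    """Hypothetical stateful thread: keeps a running sum of received data objects."""

    def __init__(self, state=0):
        self.state = state

    def process(self, data_object):
        self.state += data_object
        return self.state


class Backup:
    """Hypothetical backup thread, held in the volatile memory of another node."""

    def __init__(self):
        self.checkpoint = 0  # last checkpointed state of the worker
        self.log = []        # data objects duplicated since that checkpoint

    def duplicate(self, data_object):
        # Keep a copy of every data object sent to the worker.
        self.log.append(copy.deepcopy(data_object))

    def store_checkpoint(self, state):
        # A new checkpoint makes the older duplicated objects unnecessary.
        self.checkpoint = copy.deepcopy(state)
        self.log.clear()

    def recover(self):
        # Rebuild the worker: restore the checkpoint, then replay the log.
        worker = Worker(copy.deepcopy(self.checkpoint))
        for data_object in self.log:
            worker.process(data_object)
        return worker


def run(inputs, checkpoint_every=3, fail_after=None):
    """Process inputs, optionally simulating a node failure after one of them."""
    worker, backup = Worker(), Backup()
    for i, data_object in enumerate(inputs, start=1):
        backup.duplicate(data_object)
        worker.process(data_object)
        if i % checkpoint_every == 0:
            backup.store_checkpoint(worker.state)
        if fail_after == i:
            # Simulated failure: the original worker is lost and rebuilt.
            worker = backup.recover()
    return worker.state
```

Calling `run([1, 2, 3, 4, 5])` and `run([1, 2, 3, 4, 5], fail_after=4)` yields the same final state, since the recovered worker starts from the checkpoint taken after the third object and replays only the fourth.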



<basile.schaeli@epfl(add: .ch)>
Last modified: 2007/09/26 21:26:01