
Fault-Tolerance Techniques for High-Performance Computing
Springer (Publisher)
Published on 15 October 2016
Book
Paperback/Softback
IX, 320 pages
978-3-319-35560-3 (ISBN)
Description
This timely text presents a comprehensive overview of fault-tolerance techniques for high-performance computing (HPC). It opens with a detailed introduction to checkpoint protocols and scheduling algorithms, prediction, replication, and silent error detection and correction, together with some application-specific techniques such as algorithm-based fault tolerance (ABFT). Emphasis is placed on analytical performance models. This is followed by a review of general-purpose techniques, including several checkpoint and rollback-recovery protocols. Relevant execution scenarios are also evaluated and compared through quantitative models. Features: provides a survey of resilience methods and performance models; examines the various sources of errors and faults in large-scale systems; reviews the spectrum of techniques that can be applied to design a fault-tolerant MPI; investigates different approaches to replication; discusses the challenge of energy consumption of fault-tolerance methods in extreme-scale systems.
Edition
Softcover reprint of the original 1st ed. 2015
Language
English
Place of publication
Cham
Switzerland
Publishing group
Springer International Publishing
Target group
Professional and scholarly
Illustrations
113 black-and-white illustrations
IX, 320 p. 113 illus.
Dimensions
Height: 235 mm
Width: 155 mm
Thickness: 19 mm
Weight
505 g
ISBN-13
978-3-319-35560-3 (9783319355603)
DOI
10.1007/978-3-319-20943-2
Other editions

Thomas Herault | Yves Robert
Fault-Tolerance Techniques for High-Performance Computing
Book
07/2015
Springer
€106.99
Content
Part I: General Overview.- Fault-Tolerance Techniques for High-Performance Computing.- Part II: Technical Contributions.- Errors and Faults.- Fault-Tolerant MPI.- Using Replication for Resilience on Exascale Systems.- Energy-Aware Checkpointing Strategies.