Toward Resilience in HPC: A Prototype to Analyze and Predict System Behavior
- Date
- Feb 2, 2017
- Time
- 11:00 AM - 12:00 PM
- Speaker
- Siavash Ghiasvand
- Affiliation
- ZIH
- Language
- en
- Main Topic
- Computer Science
- Other Topics
- Computer Science
- Description
- Failures in high-performance computing (HPC) systems have become the norm rather than the exception. In the near future, the mean time between failures (MTBF) of HPC systems is expected to become so short that current failure recovery mechanisms, e.g., checkpoint-restart, will no longer be able to recover the systems from failures. Early failure detection is a new class of failure recovery methods that can also benefit HPC systems with short MTBF. Detecting failures at an early stage can reduce their negative effects by preventing their propagation to other parts of the system. Analyzing system behavior may even enable us to predict certain types of failures and proactively employ protection mechanisms against them. Preventing failures and their propagation within the HPC system not only extends system uptime but also reduces energy consumption and, subsequently, the cost of system resilience. We use the 'Taurus' HPC cluster as our test bed. This presentation gives an overview of the state of the art in failure detection and prediction on HPC systems and of current methods for HPC system behavioral analysis. Furthermore, the correlation of failures, the importance of system logs, challenges toward a generic approach, and the current status of the project will be discussed. This event is supported by ZIH.
Last modified: Feb 2, 2017, 8:54:23 AM
Location
TUD Andreas-Pfitzmann-Bau (Computer Science), APB 1004 (Ratssaal), Nöthnitzer Straße 46, 01069 Dresden
- Homepage
- https://navigator.tu-dresden.de/etplan/apb/00
Organizer
TUD Informatik, Nöthnitzer Straße 46, 01069 Dresden
- Phone
- +49 (0) 351 463-38465
- Fax
- +49 (0) 351 463-38221
- Homepage
- http://www.inf.tu-dresden.de
