Toward Resilience in HPC: A Prototype to Analyze and Predict System Behavior

Date
Feb 2, 2017
Time
11:00 AM - 12:00 PM
Speaker
Siavash Ghiasvand
Affiliation
ZIH
Language
en
Main Topic
Computer Science
Other Topics
Computer Science
Description
Failures in high-performance computing (HPC) systems have become the norm rather than the exception. In the near future, the mean time between failures (MTBF) of HPC systems is expected to become so short that current failure recovery mechanisms, e.g., checkpoint-restart, will no longer be able to recover the systems from failures. Early failure detection is a new class of failure recovery methods that can also benefit HPC systems with a short MTBF. Detecting failures at an early stage can reduce their negative effects by preventing their propagation to other parts of the system. Analyzing system behavior may even enable us to predict certain types of failures and proactively employ protection mechanisms against them. Preventing failures and their propagation within the HPC system not only extends system uptime but also reduces energy consumption and, subsequently, the cost of system resilience. We use the 'Taurus' HPC cluster as our test bed. This presentation gives an overview of the state of the art in failure detection and prediction on HPC systems and of current methods for analyzing HPC system behavior. Furthermore, the correlation of failures, the importance of system logs, challenges toward a generic approach, and the current status of the project will be discussed. This event is supported by ZIH.

Last modified: Feb 2, 2017, 8:54:23 AM

Location

TUD Andreas-Pfitzmann-Bau (Computer Science), APB 1004 (Ratssaal), Nöthnitzer Straße 46, 01069 Dresden
Homepage
https://navigator.tu-dresden.de/etplan/apb/00

Organizer

TUD Informatik, Nöthnitzer Straße 46, 01069 Dresden
Phone
+49 (0) 351 463-38465
Fax
+49 (0) 351 463-38221
Homepage
http://www.inf.tu-dresden.de