BEGIN:VCALENDAR
VERSION:2.0
PRODID:www.dresden-science-calendar.de
METHOD:PUBLISH
CALSCALE:GREGORIAN
X-MICROSOFT-CALSCALE:GREGORIAN
X-WR-TIMEZONE:Europe/Berlin
BEGIN:VTIMEZONE
TZID:Europe/Berlin
X-LIC-LOCATION:Europe/Berlin
BEGIN:DAYLIGHT
TZNAME:CEST
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
DTSTART:19810329T020000
RRULE:FREQ=YEARLY;INTERVAL=1;BYMONTH=3;BYDAY=-1SU
END:DAYLIGHT
BEGIN:STANDARD
TZNAME:CET
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
DTSTART:19961027T030000
RRULE:FREQ=YEARLY;INTERVAL=1;BYMONTH=10;BYDAY=-1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
UID:DSC-12564
DTSTART;TZID=Europe/Berlin:20170310T090000
SEQUENCE:1489131893
TRANSP:OPAQUE
DTEND;TZID=Europe/Berlin:20170310T100000
URL:https://dresden-science-calendar.de/calendar/en/detail/12564
LOCATION:TUD Andreas-Pfitzmann-Bau\, Nöthnitzer Straße 46\, 01069 Dresden
SUMMARY:Molka: Performance analysis of complex shared memory systems
CLASS:PUBLIC
DESCRIPTION:Speaker: Dipl.-Inf. Daniel Molka\nInstitute of Speaker: Institu
 t für Technische Informatik\, Professur Rechnerarchitektur\nTopics:\nComp
 uter Science\n Location:\n  Name: TUD Andreas-Pfitzmann-Bau (APB 1004
  (Ratssaal)
 )\n  Street: Nöthnitzer Straße 46\n  City: 01069 Dresden\n  Phone: \n  F
 ax: \nDescription: Systems for high performance computing are getting incr
 easingly complex. On the one hand\, the number of processors is increasing
 . On the other hand\, the individual processors are getting more and more 
 powerful. In recent years\, the latter is to a large extent achieved by in
 creasing the number of cores per processor. Unfortunately\, scientific app
 lications often fail to fully utilize the available computational performa
 nce. Therefore\, performance analysis tools that help to localize and fix 
 performance problems are indispensable. Large-scale systems for high perfo
 rmance computing typically consist of multiple compute nodes that are conn
 ected via a network. Performance analysis tools that analyze performance
  problems that arise from using multiple nodes are readily available.
  However\
 , the increasing number of cores per processor that can be observed over
  the last decade represents a major change in the node architecture. There
 fore\, this work concentrates on the analysis of the node performance. The
  goal of this thesis is to improve the understanding of the achieved appli
 cation performance on existing hardware. It can be observed that the scali
 ng of parallel applications on multi-core processors differs significantly
  from the scaling on multiple processors. Therefore\, the properties of sh
 ared resources in contemporary multi-core processors as well as remote acc
 esses in multi-processor systems are investigated and their respective imp
 act on the application performance is analyzed. As a first step\, a compre
 hensive suite of highly optimized micro-benchmarks is developed. These ben
 chmarks are able to determine the performance of memory accesses depending
  on the location and coherence state of the data. They are used to perform
  an in-depth analysis of the characteristics of memory accesses in contemp
 orary multi-processor systems\, which identifies potential bottlenecks. Ho
 wever\, in order to localize performance problems\, it also has to be dete
 rmined to which extent the application performance is limited by certain r
 esources. Therefore\, a methodology to derive metrics for the utilization 
 of individual components in the memory hierarchy as well as waiting times 
 caused by memory accesses is developed in the second step. The approach is
  based on hardware performance counters\, which record the number of certa
 in hardware events. The developed micro-benchmarks are used to selectively
  stress individual components\, which can be used to identify the events t
 hat provide a reasonable assessment of the utilization of the respective 
 component and the amount of time that is spent waiting for memory accesses
  to complete. Finally\, the knowledge gained from this process is used to 
 implement a visualization of memory-related performance issues in existing
  performance analysis tools. The results of the micro-benchmarks reveal th
 at the increasing number of cores per processor and the usage of multiple 
 processors per node lead to complex systems with vastly different perform
 ance characteristics of memory accesses depending on the location of the
  accessed data. Furthermore\, it can be observed that the aggregated throu
 ghput of shared resources in multi-core processors does not necessarily sc
 ale linearly with the number of cores that access them concurrently\, whic
 h limits the scalability of parallel applications. It is shown that the pr
 oposed methodology for the identification of meaningful hardware performan
 ce counters yields useful metrics for the localization of memory-related p
 erformance limitations.
DTSTAMP:20260524T043448Z
CREATED:20170223T075005Z
LAST-MODIFIED:20170310T074453Z
END:VEVENT
END:VCALENDAR