High Performance Computing

Name: High Performance Computing | ISC High Performance 2018 International Workshops, Frankfurt/Main, Germany, June 28, 2018, Revised Selected Papers
Brand: Springer
Price: 53.49 EUR
Availability: OnlineOnly

ISC High Performance 2018 International Workshops, Frankfurt/Main, Germany, June 28, 2018, Revised Selected Papers

Rio Yokota, Michèle Weiland, John Shalf, Sadaf Alam (Editors)

Springer (Publisher)

Published on 24 January 2019

XXII, 757 pages

E-Book

PDF with digital watermarking

978-3-030-02465-9 (ISBN)

€53.49 incl. 7% VAT

E-Book Single Licence

Available for download

Content

Intro
Preface
Organization
Contents
HPC I/O in the Data Center Workshop (HPC-IODC 2018)
1 Introduction
2 Organization of the Workshop
2.1 Program Committee
3 Workshop Summary
3.1 Research Papers
3.2 Talks from Experts
3.3 Discussion Sessions
References
Analyzing the I/O Scalability of a Parallel Particle-in-Cell Code
1 Introduction
2 Characterization of the I/O System
2.1 Throughput Evaluation as a Function of Request Sizes
2.2 Throughput Evaluation as a Function of the Number of Nodes
3 Analyzing the Application's I/O Scalability
3.1 I/O Pattern Analysis
3.2 Evaluation of the Weight of I/O Operations
3.3 Evaluation of I/O Strategies
4 Experimental Evaluation
5 Conclusions
References
Cost and Performance Modeling for Earth System Data Management and Beyond
1 Introduction
1.1 Data Growth and Access Requirements
1.2 Existing and Emerging Technologies
1.3 Addressing Domain Scientists and Their Workflows
2 Related Work
3 Cost Modeling
4 Coarse Grained Model
4.1 Resilience Model
4.2 Performance Model
5 Model Considerations for Common Subcomponents
5.1 Compute Nodes
5.2 I/O Nodes
6 Cost Study for Alternative Deployments
7 Application in Cost-Aware I/O Middleware
8 Summary
References
I/O Interference Alleviation on Parallel File Systems Using Server-Side QoS-Based Load-Balancing
1 Introduction
2 Research Background
2.1 K Computer and Its File Systems
2.2 Performance Problems of File I/O on the K Computer
2.3 QoS-Based Management at an MDS
3 Investigation of Internal File Server Activities
4 Performance Evaluation
4.1 MDS Response Evaluation Using MDTEST
4.2 QoS Impact in Fair-Share Execution Among Concurrent Running Jobs
4.3 QoS Impact in Data-Staging
5 Related Work
6 Concluding Remarks
References
Tools for Analyzing Parallel I/O
1 Introduction
2 Introduction to Performance Analysis
2.1 Closed Loop of Performance Tuning
2.2 Measurement
2.3 Preparation of Applications
2.4 Analysis of Data
3 Tools
3.1 Darshan
3.2 Vampir
3.3 Mistral/Breeze
3.4 SIOX
3.5 PIOM-MP
3.6 Additional User-Level Tools
3.7 Further Administrative Tools
3.8 Tools for Unifying Trace Formats
4 Example Studies
4.1 I/O Performance Analysis at the Application Level
4.2 Online Monitoring
4.3 Online Monitoring with LLview
5 Challenges in Analyzing I/O
6 Conclusions
References
Workshop on Performance and Scalability of Storage Systems (WOPSSS 2018)
Understanding Metadata Latency with MDWorkbench
1 Introduction
2 Related Work
3 MDWorkbench
4 Experimental Setup
5 Results
5.1 Impact of Concurrent Execution of Several Metadata Operations
5.2 Overview of Results for the Benchmark Phase
5.3 Understanding Latencies
6 Conclusions
References
From Application to Disk: Tracing I/O Through the Big Data Stack
1 Introduction
2 Big Data Software Stack
3 Methodology
3.1 Status Quo
3.2 Statistics File System
4 A Case Study: TeraSort
4.1 Setup
4.2 Vanilla Hadoop Results
4.3 SFS Insights
4.4 Optimized Hadoop Results
5 SFS Overhead
6 Discussion and Limitations
7 Related Work
8 Future Work
9 Conclusions
References
IOscope: A Flexible I/O Tracer for Workloads' I/O Pattern Characterization
1 Introduction
2 IOscope Design and Validation
2.1 Foundation: eBPF
2.2 IOscope Design
2.3 IOscope Validation
3 Experiments
3.1 Setup, Datasets, and Scenarios
3.2 MongoDB Experiments
3.3 Cassandra Experiments
4 Related Work
5 Conclusions
References
Exploring Scientific Application Performance Using Large Scale Object Storage
1 Introduction
2 Background and Related Work
3 Emulating Scientific Applications Using Object Storage
3.1 Emulator Implementation
4 Experimental Environment
5 Evaluation
6 Conclusion
References
Benefit of DDN's IME-FUSE for I/O Intensive HPC Applications
1 Introduction
2 Related Work
3 Test Environment
3.1 Benchmarks
4 Experiment Configuration
4.1 Open/Close Times
4.2 Performance
5 Evaluation
5.1 Application Kernel Using HDF5
5.2 Performance Variability with Individual I/Os
6 Conclusion
References
Performance Study of Non-volatile Memories on a High-End Supercomputer
1 Introduction
2 Related Work
3 Methodology and Technical Specifications
3.1 The Device Specifications
3.2 Experimental Methodology
3.3 Benchmarking
4 Evaluation
4.1 Transfer Size Impact
4.2 Weak Scaling
4.3 File Size Impact
5 Conclusion
References
Self-optimization Strategy for IO Accelerator Parameterization
1 Introduction
2 Self-optimization Strategy
3 Inference of the Accelerator Parameters
3.1 Regression of the Objective Function
3.2 Search for the Optimal Parameterization
4 Experiments and Results
5 Conclusion
References
13th Workshop on Virtualization in High-Performance Cloud Computing (VHPC 2018)
utmem: Towards Memory Elasticity in Cloud Workloads
1 Introduction
2 Background
3 Overview
3.1 Design
3.2 Implementation
4 Evaluation
4.1 Evaluation of utmem Under Memory Pressure
4.2 Evaluation of utmem Under Nonexistent Memory Pressure
5 Related Work
6 Conclusion and Future Work
6.1 Future Work
6.2 Conclusion
References
Efficient Live Migration of Linux Containers
1 Introduction
2 Live Migration with CRIU
2.1 Pre-copy Migration with CRIU
2.2 Post-copy Migration with CRIU
2.3 Automatic Transfer of Image Files
2.4 Design of Post-copy Memory Migration
2.5 Combining Pre-copy and Post-copy Migration
3 Using Image Cache and Image Proxy for Container Live Migration
4 Evaluation
5 Discussion and Future Work
6 Conclusion
References
Third International Workshop on In Situ Visualization: Introduction and Applications (WOIV 2018)
Introduction
Organization of the Workshop
2.1 Organizing Committee
2.2 Program Committee
Workshop Summary
3.1 Invited Talks
3.2 Research Papers
Coupling the Uintah Framework and the VisIt Toolkit for Parallel In Situ Data Analysis and Visualization and Computational Steering
1 Introduction
2 Background
2.1 In Situ Data Analysis and Visualization
2.2 Utilization of Diagnostic Data
3 Methods
3.1 Per-Rank Runtime Performance Data
3.2 The Simulation Dashboard
4 Results
5 Conclusion
References
Binning Based Data Reduction for Vector Field Data of a Particle-In-Cell Fusion Simulation
1 Introduction
2 Related Work
2.1 In Situ Visualization
2.2 XGC1 Visualization
2.3 Large-Scale Particle Visualization
3 Binning of Fusion Data
3.1 Generating the Binned Data
4 Experimental Overview
4.1 Workflow Description
4.2 Evaluating Accuracy
5 Results
5.1 Test Result Summary
5.2 Poincaré Test Results
5.3 Streamline/Pathline Test Results
6 Conclusions and Future Work
References
In Situ Analysis and Visualization of Fusion Simulations: Lessons Learned
1 Introduction
2 Related Work
3 Motivation
4 Setup
4.1 Campaign 1
4.2 Campaign 2
5 Results
5.1 Campaign 1
5.2 Campaign 2
6 Discussion
7 Conclusion and Future Works
References
Design of a Flexible In Situ Framework with a Temporal Buffer for Data Processing and Visualization of Time-Varying Datasets
1 Introduction
2 Temporal Buffer
2.1 Description
2.2 Buffer Operations
2.3 Code Integration
2.4 Update Parameters and Steering
3 Computational Environment and Implementation
3.1 Computational Environment
3.2 Software System
3.3 Use In Situ and in Transit Scenarios
4 Evaluation and Discussion
4.1 Target Application
4.2 Considerations on the Number of Time Steps to Hold
4.3 Target Processing Examples
4.4 Data Transfer Performance (In Transit Scenario)
5 Conclusion
References
Streaming Live Neuronal Simulation Data into Visualization and Analysis
1 Introduction
2 Related Work
3 Method
3.1 NESCI - Neuronal Simulator Conduit Interface
3.2 CONTRA - Conduit Transport
4 Application
4.1 NEST Simulation
4.2 2D Visualization
4.3 3D Visualization
5 Performance Evaluation
6 Conclusion and Future Work
References
Enabling Explorative Visualization with Full Temporal Resolution via In Situ Calculation of Temporal Intervals
1 Introduction
2 Related Work
2.1 Individual Time Slice Data
2.2 Multiple Time Slice Data
2.3 Complete Temporal Data
2.4 Impact of Error in Compression
2.5 How Our Approach Differs from Previous Work
3 Algorithm
3.1 Error Bound
3.2 Compression Approaches
3.3 Memory Requirements
3.4 Reconstruction
4 Evaluation
4.1 Experiment Configuration
4.2 Phase Overview and Measurements
4.3 Hardware
4.4 Software
5 Results
5.1 Phase One: GHOST, LULESH, XGC1 Particle Ions, and Tornado
5.2 Phase Two: Comparison with Wavelets and SZ
5.3 Phase Three: In Situ Experimentation
6 Conclusion
7 Future Work
References
In-Situ Visualization of Solver Residual Fields
1 Introduction
2 Related Work
3 Method
3.1 Solvers and Residual Fields
3.2 Aggregated Residual Fields
3.3 Residual Curves
3.4 Residual Stacks
3.5 In-Situ Application
4 Results
4.1 Implementation and Timings
4.2 Kármán Runs
4.3 Mesh Resolution Experiment
4.4 Grid Refinement Experiment
5 Conclusion
References
An In-Situ Visualization Approach for the K Computer Using Mesa 3D and KVS
1 Introduction
2 Related Work
3 Mesa 3D on the K Computer
4 OpenGL-Based KVS Library
4.1 Particle Based Volume Rendering
4.2 Traditional Rendering Methods
5 Experimental Results
6 Conclusions
References
4th International Workshop on Communication Architectures for HPC, Big Data, Deep Learning and Clouds at Extreme Scale (ExaComm 2018)
Introduction
Organization
2.1 Program Committee
Workshop Summary
3.1 Invited Talks
3.2 Research Papers
3.3 Panel Discussion
Comparing Controlflow and Dataflow for Tensor Calculus: Speed, Power, Complexity, and MTBF
1 Introduction to Ultimate Dataflow
2 Introduction to Tensor Calculus
3 Existing Solutions
3.1 An Overview of Tensor Operations
3.2 An Overview of Underlying Hardware
4 The Dataflow Approach
5 Tensor Operations on the Dataflow Architecture
5.1 Tensor Addition
5.2 Tensor Transpose
5.3 Tensor Composition
5.4 Tensor Inverse
5.5 Primary and Principal Invariants
5.6 Eigenvalues and Eigenvectors
5.7 Spectral Decomposition
5.8 Divergence of a Tensor Field
5.9 Tensor Rank
6 Performance Evaluation
6.1 Speedup
6.2 Power Dissipation
6.3 Complexity
6.4 Mean Time Between Failures
7 Conclusion
References
Supercomputer in a Laptop: Distributed Application and Runtime Development via Architecture Simulation
1 Introduction
2 Prior Work
3 Simulator Implementation
3.1 Encapsulation
3.2 Interception and uGNI Bindings
3.3 Skeletonization
3.4 Overhead Pragma
4 Example Results
4.1 Methodology
4.2 GASNet Benchmark
4.3 Scaling of Skeletonized Runtime
5 Additional Features and Future Work
5.1 Deterministic Debugging of Distributed Race Conditions
5.2 Deterministic, Controlled Environment for Performance Comparisons
5.3 Host Compute Overhead Estimates
5.4 Valgrind and GDB
5.5 Extending to OpenMPI and Infiniband
6 Conclusion
References
International Workshop on OpenPOWER for HPC 2018 (IWOPH 2018)
References
CGYRO Performance on Power9 CPUs and Volta GPUs
1 Introduction
1.1 CGYRO: A Multiscale-Optimized Fusion Plasma Solver
1.2 Porting CGYRO to GPUs
1.3 Simulations Suitable for Benchmarking
2 Compiling and Running CGYRO on Power9
2.1 Porting CGYRO to Power9
2.2 Evaluating the Effect of Hyperthreading
2.3 The Impact of Volta GPUs on CGYRO Performance
3 Comparing to Other Systems
3.1 CPU-Only Tests
3.2 Full Node Tests
4 Summary
References
A 64-GB Sort at 28 GB/s on a 4-GPU POWER9 Node for Uniformly-Distributed 16-Byte Records with 8-Byte Keys
Abstract
1 Introduction
2 System Attributes and Upper Bounds
3 Sorting Algorithm
3.1 Partitioner Design
3.2 Design of the Shuffle Phase
3.3 Sorting a Single Partition
3.4 Single-Node Sort
3.5 Multi-node Sort
4 Sort Performance
4.1 Single-Node Sort Performance
4.2 Multi-node Sort Performance
5 Future Work
6 Summary and Conclusions
Acknowledgements
References
Early Experience on Running OpenStaPLE on DAVIDE
1 Introduction
2 OpenStaPLE
3 The DAVIDE Cluster
4 Performance Analysis of OpenStaPLE
4.1 Benchmarking of Interconnects
4.2 Energy Performance
5 Conclusions and Future Prospects
References
Porting and Benchmarking of BWAKIT Pipeline on OpenPOWER Architecture
Abstract
1 Introduction
2 BWAKIT Pipeline Implementation
3 Experimental Benchmarking Setup
4 Benchmarking Methodology
5 Performance Metrics Used for Benchmarking
6 Validation of BWAKIT Results
7 Conclusion
Acknowledgement
References
Improving Performance and Energy Efficiency on OpenPower Systems Using Scalable Hardware-Software Co-design
1 Introduction
2 GEOPM on OpenPower
2.1 GEOPM Overview
2.2 Measuring Power and Performance
2.3 Port
3 Preliminary Results
3.1 Experimental Setup
3.2 Applications Profiles
4 Conclusions and Future Work
References
Porting DMRG++ Scientific Application to OpenPOWER
1 Introduction
2 Motivation
3 Density Matrix Renormalization Group
3.1 The Application
3.2 Baseline Performance Characteristics of the Application
3.3 Hamiltonian Matrix
3.4 Pseudo Code: Apply Hamiltonian Target
3.5 Types of Available Parallelism in the Kronecker Product Algorithm
4 Problem Statement
5 Implementation and Experimental Evaluations
5.1 Experimental Setup
5.2 Pseudo Codes for Evaluation
5.3 Evaluation
References
Job Management with mpi_jm
1 Introduction
2 mpi_jm
2.1 Masters
2.2 The Scheduler
2.3 Workers
2.4 Issues and Dependencies
3 Individual Tasks and the Python Interface
4 Initial Performance
References
Compile-Time Library Call Detection Using CAASCADE and XALT
1 Introduction
2 CAASCADE Overview
3 Library Detection
3.1 Compiler Plugins
3.2 Classification of the Libraries Calls
3.3 Call Graph Analysis for Library Detection
4 Experiments and Results
5 Related Work
6 Future Work
6.1 Inter-procedural Pointer Analysis
6.2 Linkage Information
References
NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems
1 Introduction
2 Explicit Data Transfer
3 Unified Memory
3.1 Page Fault Latency
4 Conclusion
References
IXPUG Workshop: Many-Core Computing on Intel Processors: Applications, Performance and Best-Practice Solutions
IXPUG in an Evolving World - The New IXPUG
1.1 What You Should Know About IXPUG
1.2 Working Groups and Discussion Forum
1.3 The IXPUG Steering Committee
Workshop Overview
Call for Papers
Workshop Agenda
Program Committee
Workshop Organizers
Reference
Sparse CSB_Coo Matrix-Vector and Matrix-Matrix Performance on Intel Xeon Architectures
1 Introduction
2 System Architecture
3 Methodology
3.1 CSB_Coo
4 SPMM
4.1 Vectorization
5 SPMV
5.1 Thread Scaling
5.2 AVX-512 CD Instructions
5.3 Manually Removing the Conflicts
6 Conclusions
References
Lessons Learned from Optimizing Kernels for Adaptive Aggregation Multi-grid Solvers in Lattice QCD
1 Introduction
2 Restrictor Definition and Implementation
2.1 No Inner Parallelism or Parallelism Using Atomics
2.2 Explicit OpenMP Nested Parallelism
2.3 OpenMP Custom Reductions
2.4 Manual Fake-Out of Nested Parallelism
3 Performance Results
3.1 Experimental Setup
3.2 Performance Results
4 Conclusions and Outlook
References
Distributed Training of Generative Adversarial Networks for Fast Detector Simulation
1 Introduction
2 Previous Work
3 Three-Dimensional GANs for Calorimeter Simulation
3.1 Calorimeter Data
3.2 Networks Architecture
3.3 Physics Validation
3.4 Computing Performance and Training Time
4 Distributed Training
4.1 Distributing the Training of the 3DGAN with Keras/Tensorflow and Horovod
4.2 Execution Environment
4.3 Scaling Results
4.4 Validation at Scale
5 Conclusions and Future Goals
References
Workshop on Sustainable Ultrascale Computing Systems
Introduction
Workshop Program Committee
Workshop Chair
Program Committee
Workshop Summary
Cache-Aware Roofline Model and Medical Image Processing Optimizations in GPUs
1 Introduction
2 Background
2.1 CARM: Cache-Aware Roofline Model
2.2 Reconstruction Algorithms in Medical Imaging
3 Characterization and Profiling Method
3.1 CARM-Based Profiling Tool for GPU Applications
3.2 Kernels for Medical Image Processing in CT
4 Experimental Results
4.1 High-End GPU Evaluation
4.2 Commodity GPU Evaluation
5 Related Work
6 Conclusions
References
How Pre-multicore Methods and Algorithms Perform in Multicore Era
1 Introduction
2 How Much Performance and Energy You Can Lose Through Load Balancing on Multicore Platforms
2.1 When Does Load Balancing Work?
2.2 When Does Load Balancing Not Work?
2.3 New Methods and Algorithms for Performance and Energy Optimization on Multicore-Based Platforms
3 PMC-Based Power and Energy Modelling in Multicore Era
4 Conclusion
References
Approximate and Transprecision Computing on Emerging Technologies (ATCET 2018)
Preface
Impact of Approximate Memory Data Allocation on a H.264 Software Video Encoder
1 Introduction
1.1 Approximate Memory
2 OS Managed Approximate Memory and AppropinQuo Emulator
3 H.264 Video Encoding
3.1 Approximate Memory Data Allocation for the x264 Encoder
4 Results
4.1 Output with Approximate Memory and Energy Saving Considerations
5 Conclusion
References
Residual Replacement in Mixed-Precision Iterative Refinement for Sparse Linear Systems
1 Introduction
2 Residual Replacement for Krylov Methods
2.1 Preconditioned Conjugate Gradient (PCG)
2.2 Residual Replacement
3 Evaluation
3.1 Cost Model
3.2 Cost Analysis
4 Concluding Remarks
References
Training Deep Neural Networks with Low Precision Input Data: A Hurricane Prediction Case Study
1 Introduction
2 Background and Related Work
3 Hurricane Prediction Case Study
3.1 Deep Learning for Hurricane Prediction
3.2 Reduced Input Data Precision
4 Results and Discussion
5 Conclusion and Future Work
References
A Transparent View on Approximate Computing Methods for Tuning Applications
1 Introduction
2 Exploit Performance Profiles as Transparent View on Approximate Computing Methods
3 How to Consider Multiple Objectives?
4 Taking Conventional Methods into Account
5 Exploiting PPs for System Tuning
6 Conclusion
References
Exploring the Effects of Code Optimizations on CPU Frequency Margins
1 Introduction
2 Methodology
3 Compiler Optimizations Analysis
4 Source Code Transformations
4.1 Memory Access Pattern Optimizations
4.2 SIMD Optimizations
5 Related Work
6 Conclusions
References
First Workshop on the Convergence of Large-Scale Simulation and Artificial Intelligence
Taking Gradients Through Experiments: LSTMs and Memory Proximal Policy Optimization for Black-Box Quantum Control
1 Introduction
2 Quantum Control
3 Reinforcement Learning: Why and What?
4 The Learning Algorithm
5 Applying the Method
5.1 Quantum Memory
5.2 Ground State Transitions
6 Results
6.1 Quantum Memory
6.2 Ground State Transition
7 Conclusion and Future Work
References
Towards Prediction of Turbulent Flows at High Reynolds Numbers Using High Performance Computing Data and Deep Learning
1 Introduction
2 Deep Learning and Turbulence
3 DNS Data Base
4 Results
5 Conclusions
References
Third Workshop for Open Source Supercomputing (OpenSuCo 2018)
Using a Graph Visualization Tool for Parallel Program Dynamic Visualization and Communication Analysis
1 Introduction
2 Related Work
3 Methodology
3.1 Graph Building
3.2 Data Collection
3.3 Graph Textual Representation
3.4 Graph Visualization and Analysis
4 Case Study: NAS Parallel Benchmark
4.1 Algorithm Topology
4.2 Dynamic Communication Behavior
5 Conclusion
References
Offloading C++17 Parallel STL on System Shared Virtual Memory Platforms
1 Introduction
2 Related Work
3 Heterogeneous Offloading of Parallel STL
4 System Shared Virtual Memory and C++17
5 Proof of Concept Implementation
5.1 Binary Exchange Format
5.2 Indirect Calls and IL Specialization
6 Evaluation
7 Conclusions
References
First Workshop on Interactive High-Performance Computing
Introduction
Organization of the Workshop
2.1 Program Committee
2.2 Summary of the Submissions
Workshop Summary
Lessons Learned from a Decade of Providing Interactive, On-Demand High Performance Computing to Scientists and Engineers
1 Introduction
2 Lessons Learned
2.1 Broadening the Definition of Interactive HPC
2.2 Re-architecting for Interactive HPC
2.3 Reframing the Metrics of Success
2.4 Expanding the HPC Ecosystem
3 Architecture Requirements for Interactive HPC
3.1 System
3.2 Software
3.3 Supporting Users
4 Metrics
5 Summary and Future Work
References
Enabling Interactive Supercomputing at JSC Lessons Learned
1 Introduction
2 Background Jupyter
3 Jupyter Integration at JSC
4 Use Case: Rhinodiagnost
5 Use Case: Deep Learning
6 Lessons Learned
7 Outlook
8 Conclusion
References
Interactive Distributed Deep Learning with Jupyter Notebooks
1 Introduction
2 System Architecture
3 Distributed Training
4 Distributed Hyper-parameter Optimization
4.1 Random Search HPO Notebook
4.2 HPO with Interactive Widgets
4.3 Advanced HPO
5 Conclusions
6 Code and Recipes
References
Third International Workshop on Performance Portable Programming Models for Accelerators (P^3MA 2018)
1 Workshop Summary
3 Steering Committee
4 Program Chairs
5 Steering Committee
Performance Portability of Earth System Models with User-Controlled GGDML Code Translation
1 Introduction
2 Related Work
3 Approach
3.1 The General Approach
3.2 The User-Controlled Source-to-Source Code Translation
4 GGDML Review
5 Machine-Specific Configuration
5.1 Grid Configuration
5.2 Configurable Access Operators
5.3 Memory Layout
5.4 Parallelization
6 Evaluation
6.1 Test Application
6.2 Test System
6.3 Results
7 Summary
7.1 Future Work
References
Evaluating Performance Portability of Accelerator Programming Models using SPEC ACCEL 1.2 Benchmarks
1 Introduction
2 Motivation
3 The SPEC ACCEL Benchmark Suite
4 SPEC ACCEL 1.2 Results
4.1 Experimental Systems
4.2 Performance
4.3 Correctness and Functionality
4.4 OpenMP and OpenACC Performance Comparison
5 Related Work
6 Conclusion
References
A Beginner's Guide to Estimating and Improving Performance Portability
1 Introduction
2 Related Work
3 The Performance Portability Definition and Metric
4 Experimental Setup
4.1 The OpenACC Applications
4.2 The Platforms
5 Computing and Interpreting PPM
5.1 The PPM Calculation Workflow
5.2 Calculating Performance Efficiency
5.3 Case-Studies: PPM Results and Analysis
6 Improving Performance Portability
6.1 Techniques for Performance Portability Improvement
6.2 Case-Studies: Improving Performance Portability
7 Conclusion and Future Work
References
Profiling and Debugging Support for the Kokkos Programming Model
1 Introduction
2 Kokkos Profiling Tools
2.1 Overview of the Kokkos Profiling Interface
2.2 Event Callbacks from Kokkos
3 Tools for Profiling Kokkos Applications
3.1 Kernel Profiling
3.2 Parallel Time Stack Profiling
3.3 Memory Event/Heap Profiling
4 Conclusions
5 Related Work
6 Tool Availability
References
Author Index

Schweitzer Fachinformationen
