OpenMP: Memory, Devices, and Tasks

Name: OpenMP: Memory, Devices, and Tasks | 12th International Workshop on OpenMP, IWOMP 2016, Nara, Japan, October 5-7, 2016, Proceedings
Brand: Springer
Price: 53.49 EUR
Availability: OnlineOnly

12th International Workshop on OpenMP, IWOMP 2016, Nara, Japan, October 5-7, 2016, Proceedings

Naoya Maruyama, Bronis R. de Supinski, Mohamed Wahib (Editors)

Springer (Publisher)

Published on 28 September 2016

XI, 352 pages

E-Book

PDF with digital watermarking

978-3-319-45550-1 (ISBN)

€53.49 incl. 7% VAT

E-Book Single Licence

Available for download

Content

Intro
Preface
Organization
Contents
Applications
Estimation of Round-off Errors in OpenMP Codes
1 Introduction
2 The CADNA Library
2.1 Principles of DSA (Discrete Stochastic Arithmetic)
2.2 Numerical Validation of Sequential Codes Using CADNA
3 CADNA for OpenMP Codes
4 Performance Tests and Application to OpenMP Codes
4.1 A Reduction Code
4.2 Performance Tests
4.3 A Shallow-Water Application
5 Conclusion
References
OpenMP Parallelization and Optimization of Graph-Based Machine Learning Algorithms
1 Introduction
2 Graph-Based Classification Algorithms
2.1 Introduction
2.2 Semi-supervised and Unsupervised Algorithms
2.3 Nyström Extension Method
3 Math Library Usage and Optimizations
4 Parallelization of the Nyström Extension
4.1 OpenMP Parallelization
4.2 Arithmetic Intensity and Roofline Model
5 Conclusion and Future Work
References
Locality
Evaluating OpenMP Affinity on the POWER8 Architecture
1 Introduction
1.1 Memory Placement
1.2 Thread Affinity
1.3 POWER8 System
1.4 POWER8 Hardware Counters
2 Motivation
3 Experimentation
3.1 Experimental Setup
4 Results
5 Related Work
6 Conclusions and Future Work
References
Workstealing and Nested Parallelism in SMP Systems
1 Introduction
2 Terminology
3 Static Workstealing Scheduler
3.1 Motivation
3.2 Scheduler Implementation
3.3 OpenMP Scheduler Constraints
4 ISO-3DFD Test Code
5 ISO-3DFD Optimization
5.1 Nested Parallelism vs. Hand Threading
5.2 Performance Results
6 OpenMP Extension to Loop Scheduling
6.1 Hierarchical Loop Scheduling
6.2 Multi-dimensional Chunking
6.3 Example
7 Static Workstealing and Particle Codes
7.1 Application of Static Workstealing
8 Conclusions and Future Work
References
Description, Implementation and Evaluation of an Affinity Clause for Task Directives
1 Introduction
2 Motivating Examples for Which Affinity Does Matter
3 Extending OpenMP to Support Affinities
3.1 Extension of the OpenMP Task Directive
3.2 Extension of the OpenMP Runtime API Functions
3.3 Extension of the Task Scheduler to Support Affinity
4 Examples of Use and Experimentation Results
4.1 Enhancing Task-Based OpenMP Kernels to Support affinity
4.2 Experimental Platform Description
4.3 Experimental Results
5 Related Work
6 Conclusion
References
Task Parallelism
NUMA-Aware Task Performance Analysis
1 Introduction
2 Related Work
3 NUMA-Aware Task Creation
3.1 Task Scheduling in OpenMP
3.2 Task Creation
3.3 Benchmark Evaluation
4 Task Performance Analysis
4.1 Gathered Data
4.2 Data Analysis
5 Evaluation
6 Conclusion
References
OpenMP Extension for Explicit Task Allocation on NUMA Architecture
1 Introduction
2 Related Work
3 OpenMP Extension for NUMA-Aware Task Allocation
3.1 Overview
3.2 Language Definition
3.3 Prototype Implementation Using GCC
4 KASTOR Kernel Optimization with node_bind
4.1 Jacobi Kernel
4.2 SparseLU Kernel
4.3 Strassen Kernel
5 Performance Evaluation
5.1 Result of Jacobi Kernel
5.2 Result of SparseLU Kernel
5.3 Result of Strassen Kernel
6 Conclusion
References
Approaches for Task Affinity in OpenMP
1 Introduction
2 Related Work
3 Design Choices for Task Affinity
4 Proposed Syntax and Semantics
5 Prototype Implementations
5.1 OpenMP Place/Thread Approach
5.2 Storage Location Approach
5.3 Taskgroup
5.4 Taskloop
6 Evaluation
6.1 Place/Thread
6.2 Storage Location
6.3 Taskgroup
6.4 Taskloop Construct
7 Conclusion
References
Towards Unifying OpenMP Under the Task-Parallel Paradigm
1 Introduction
2 Existing OpenMP Task-Loop Implementations
3 Improved Task-Loop Implementation and Load Balancing
3.1 Implementation
3.2 Iteration Tasks
4 Experimental Method
4.1 Benchmarks
5 Results
5.1 Scheduling Heuristics
5.2 Benchmark Performance
6 Related Work
7 Conclusions
References
A Case for Extending Task Dependencies
1 Introduction
2 Global Dependencies
3 Unstructured Tasks
4 Queue Dependencies
5 Related Work
6 Conclusions
References
OpenMP as a High-Level Specification Language for Parallelism
1 Motivation
2 HClib
3 Methods
3.1 OpenMP-to-X Compile-Time Mechanics
3.2 The HClib APIs Targeted by OpenMP-to-X
3.3 Mapping OpenMP to HClib
4 Experimental Evaluation
4.1 Variance of each Runtime
4.2 Overall Performance
5 Discussion
5.1 Insights Gained into HClib
5.2 Motivating Extensions to OpenMP
5.3 Other Targets for OpenMP-to-X
6 Conclusions and Future Work
References
Scaling FMM with Data-Driven OpenMP Tasks on Multicore Architectures
1 Introduction
2 Related Work
3 About the FMM Case Study
4 Tasking Through Temporal and Spatial Blocking
5 Characterization and Evaluation
5.1 Characterizing Data Locality and Idleness
5.2 Performance Evaluation
6 Conclusion and Future Work
References
Extensions
Reducing the Functionality Gap Between Auto-Vectorization and Explicit Vectorization
Abstract
1 Introduction
2 Compress and Expand
2.1 Loop-Level Syntax
2.2 Block-Level Syntax
2.3 Semantics Discussion
2.4 Combination with omp declare simd
2.5 Unit Test Performance
3 Histogram
3.1 Loop-Level Syntax
3.2 Block-Level Syntax
3.3 Block-Level Syntax Scoping
3.4 Unit Test Performance
4 Conclusion
References
A Proposal to OpenMP for Addressing the CPU Oversubscription Challenge
1 Introduction
2 Use Cases and Challenges of Interoperability
2.1 Three Use Cases
2.2 Issues with No or Poor Interoperability
2.3 Limitation of Interoperability Support in the Standard
3 Extensions to Address the Oversubscription Challenge
3.1 Definition of the ACTIVE and PASSIVE Wait Policies
3.2 Proposed Runtime Routines
3.3 The omp_get_num_threads_runtime Runtime Routine
3.4 The omp_set_wait_policy and omp_get_wait_policy Runtime Routines
3.5 The omp_quiesce Runtime Routine
3.6 The omp_thread_create/exit/join Runtime Routines
4 Implementation and Evaluation
4.1 Evaluation
4.2 Performance with Regards to the Oversubscription Ratio
5 Related Work
6 Conclusions and Future Work
References
Tools
Testing Infrastructure for OpenMP Debugging Interface Implementations
1 Introduction
2 Architecture
2.1 Debugging Interface
2.2 OpenMP Debugging Interface
2.3 OpenMP Runtime Functions
3 Implementation
3.1 Comparing Properties
3.2 Triggers for Checks
4 OMPD Callback Library for Dyninst
4.1 Interface of LibOmpdCallback
4.2 Using the LibOmpdCallback for Stackwalker
5 Issues Detected in the OMPD Library Implementation
6 Applicability for Other Debugging Interfaces
7 Conclusions
References
The Secrets of the Accelerators Unveiled: Tracing Heterogeneous Executions Through OMPT
1 Introduction
2 Background
3 Implementation
3.1 Runtime Support for OMPT
3.2 Instrumentation Support for OMPT
4 Experimental Setup
5 Results
5.1 Matrix Multiplication
5.2 Cholesky Decomposition
6 Related Work
7 Conclusions
References
Language-Centric Performance Analysis of OpenMP Programs with Aftermath
1 Introduction
2 Aftermath: Trace Generation and Analysis
2.1 Trace Generation
2.2 A Graphical Interface for Trace Analysis
3 Use Case: Optimization of MG
3.1 Identifying Execution Phases
3.2 Identifying Load Imbalance Resulting from NUMA
3.3 Identifying Parallelism Degree Limitations and Imbalance
4 Overhead of Tracing
5 Related Work
6 Conclusion and Future Work
References
Accelerator Programming
Pragmatic Performance Portability with OpenMP 4.x
1 Introduction
1.1 Scope
2 Background
3 Implementation-Specific Interpretations
3.1 Thread Co-ordination
3.2 Cray Compiler Mapping of OpenMP onto NVIDIA GPUs
3.3 Clang Mapping of OpenMP onto NVIDIA GPUs
3.4 GCC 6.1 Mapping of OpenMP onto AMD GPUs with HSA
3.5 Intel Mapping of OpenMP
4 Performance Analysis
4.1 Individual Performance
4.2 Directives for Performance
5 Approaching Pragmatic Performance Portability
5.1 Homogenising the Directives
5.2 Patterns that Can Inhibit Performance Portability
5.3 Concluding Suggestions for Performance Portability
6 Related Work
7 Future Work
8 Conclusions
References
Multiple Target Task Sharing Support for the OpenMP Accelerator Model
1 Introduction
2 Accelerator Support in Directive-Based Approaches
2.1 Heterogeneity Support in OpenMP Accelerator Model
2.2 Heterogeneity Support in OpenACC
2.3 Heterogeneity Support in OmpSs
3 Proposal and Implementation of Multi Target Approach
3.1 Target Directive Syntax Extension
3.2 Compiler and Runtime Support to the Proposed Extensions
3.3 Compiler and Runtime Support for the resources Clause
4 Evaluation
4.1 System Configurations
4.2 OmpSs Runtime Configurations and Thread Binding
4.3 Performance Results
5 Conclusions and Future Work
References
Early Experiences Porting Three Applications to OpenMP 4.5
1 Introduction
2 Kripke
2.1 Challenges When Using Abstractions
3 Cardioid
3.1 High Performance and C++ Code Challenges
4 LULESH
4.1 Porting Challenges
4.2 Portable Performance Possible with Continued Compiler Work
5 Suggestions to Address Some Challenges
5.1 Clarify OpenMP's Relationship to Key C and C++ Constructs
5.2 Clarify Virtual Method Support
5.3 Add Deep Copy/Complex Data Structure Support
6 Related Work
7 Conclusions and Future Work
References
Design and Preliminary Evaluation of Omni OpenACC Compiler for Massive MIMD Processor PEZY-SC
1 Introduction
2 Related Work
3 PEZY-SC
3.1 Architecture
3.2 Programming
4 Omni OpenACC Compiler
4.1 Design
4.2 Implementation
5 Evaluation
5.1 Benchmark
5.2 Performance
5.3 Productivity
6 Discussion
6.1 Optimization for PEZY-SC
6.2 Comparison with OpenMP
7 Conclusion
References
Performance Evaluations and Optimization
Evaluating OpenMP Implementations for Java Using PolyBench
1 Introduction
2 Related Work
2.1 OpenMP for Java
2.2 Benchmarks
3 PolyBench for Java OpenMP
4 Evaluation
5 Results
5.1 Long Runtimes
5.2 Medium Runtimes
5.3 Short Runtimes
5.4 Other Observations
6 Conclusions
References
Transactional Memory for Algebraic Multigrid Smoothers
1 Introduction
2 Transactional Memory State-of-the-Art
3 Applying Transactional Memory to the AMG Smoother
3.1 Brief Review of Algebraic Multigrid Methods
3.2 TM-Assisted Error-Smoothing in AMG
4 Experimental Results
4.1 Problem Descriptions
4.2 Convergence
4.3 Transactional Memory Statistics
4.4 Timed Performance
5 Concluding Remarks
References
Supporting Adaptive Privatization Techniques for Irregular Array Reductions in Task-Parallel Programming Models
1 Introduction
2 Generalization Through AMLs
3 Runtime Support
3.1 Handling AMLs by the Compiler
3.2 Handling AMLs by the Runtime
3.3 Handling Inspector-Executors
4 Language Support
5 Case Study
5.1 The Choice of AML
5.2 Performance Results
6 Related Work
7 Conclusion
References
Author Index