Euro-Par 2011 Parallel Processing

Name: Euro-Par 2011 Parallel Processing | 17th International Euro-Par Conference, Bordeaux, France, August 29 - September 2, 2011, Proceedings, Part II
Brand: Springer
Price: 53.49 EUR
Availability: OnlineOnly

17th International Euro-Par Conference, Bordeaux, France, August 29 - September 2, 2011, Proceedings, Part II

Emmanuel Jeannot, Raymond Namyst, Jean Roman (Editors)

Springer (Publisher)

Published on 12 August 2011

488 pages

E-Book

PDF with digital watermarking

978-3-642-23397-5 (ISBN)

€53.49 incl. 7% VAT

E-Book Single Licence

Available for download

Content

Title
Preface
Organization
Table of Contents
Topic 9: Parallel and Distributed Programming
Introduction
Parallel Scanning with Bitstream Addition: An XML Case Study
Introduction
The Parallel Bitstream Method
Fundamentals
A Parallel Scanning Primitive
XML Scanning and Parsing
XML Well-Formedness
Compilation to Block-Based Processing
Performance Results
Conclusion
References
HOMPI: A Hybrid Programming Framework for Expressing and Deploying Task-Based Parallelism
Introduction
Programming Environment
Callbacks, Reductions and Detached Tasks
Task Distribution and Scheduling
TORC: The Runtime System
Mixed-Mode and Hybrid Programming
Experimental Evaluation
Conclusion
References
A Failure Detector for Wireless Networks with Unknown Membership
Introduction
Related Work
Model and Problem Definition
Stability Assumptions
A Failure Detector of Class SM
Towards a Time-Free Failure Detector for the SM Class
Stable Query-Response Communication Mechanism
Behavioral Properties
A Failure Detector Algorithm for the SM Class
Algorithm Description
Practical Issues
Conclusion
References
Towards Systematic Parallel Programming over MapReduce
Introduction
MapReduce and List Homomorphisms
MapReduce and MapReduce Programming Model
List Homomorphism and Homomorphism Theorems
A Homomorphism-Based Framework for Parallel Programming with MapReduce
Programming Interface and Homomorphism Derivation
Homomorphism Implementation on MapReduce
A Programming Example
Experiments
Concluding Remarks
References
Correlated Set Coordination in Fault Tolerant Message Logging Protocols
Introduction
Rollback Recovery Background
Execution Model
Building a Consistent Recovery Set
Group-Coordinated Message Logging
Shared Memory and Message Logging
Correlated Set Coordinated Message Logging
Implementation
Experimental Evaluation
Experimental Conditions
Shared Memory Performance
Cluster of Multicore Performance
Related Works
Concluding Remarks
References
Topic 10: Parallel Numerical Algorithms
Introduction
A Bit-Compatible Parallelization for ILU(k) Preconditioning
Introduction
Review of the Sequential ILU(k) Algorithm
Terminology for ILU(k)
ILU(k) Algorithm and Its Parallelization
TPILU(k): Task-Oriented Parallel ILU(k) Algorithm
Parallel Tasks and Static Load Balancing
Optimized Symbolic Factorization
Optional Level-Based Incomplete Inverse Method
Experimental Results
Experimental Analysis
Related Work
References
Parallel Inexact Constraint Preconditioners for Saddle Point Problems
Introduction
Finite Element Coupled Consolidation Equations
Inexact Constraint Preconditioners
Eigenvalue Distribution of the Preconditioned Matrices
FSAI-Based ICP
Parallel Implementation
Numerical Results
Solution of Kx = b
Parallel Results and Scalability
Conclusions
References
Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms
Introduction
Previous Work
Communication Lower Bounds for Linear Algebra
3D Linear Algebra Algorithms
2.5D Lower and Upper Bounds
2.5D Matrix Multiplication
2.5D LU Communication Lower Bound
2.5D Communication Optimal LU
2.5D Communication Optimal LU with Pivoting
Performance Results
2.5D Matrix Multiplication Performance
2.5D LU Performance
2.5D LU with CA-Pivoting Performance
Future Work
References
Topic 11: Multicore and Manycore Programming
Introduction
Hardware and Software Tradeoffs for Task Synchronization on Manycore Architectures
Introduction
Asynchronous Task Parallelism and Software Phasers
Asynchronous Task Synchronization Using Phasers
Software Phasers in Habanero-C
Hardware Support in Phasers
Cyclops64 Manycore Architecture
Optimization Using Hardware Barriers
Optimization Using Thread Suspend and Awake
Adaptive Phasers
Memory Optimizations
Implementation and Experiments
Implementation and Experimental Benchmarks
Hierarchical Phasers and Memory Optimizations
Barrier and Point-to-Point Microbenchmarks
Applications
Related Work
Conclusions and Future Work
References
OpenMPspy: Leveraging Quality Assurance for Parallel Software
Introduction
Overview of OpenMPspy
Modes of Operation
The Code Analysis Framework
Analysis Features for OpenMP
Analyzing with OpenMPspy: A Study of Real Projects
Applications
Finding Unreported Errors in Real Projects
How OpenMP Constructs Are Used in Practice
Insights for Parallel Software Quality Improvement
Related Work
Conclusion
References
A Generic Parallel Collection Framework
Introduction
Scala Collection Framework
Adaptive Work Stealing
Design and Implementation
Splitters and Combiners
Parallel Array
Parallel Rope
Parallel Hash Table
Parallel Hash Trie
Parallel Views
Experimental Results
Related Work
Conclusion
References
Progress Guarantees When Composing Lock-Free Objects
Introduction
Progress Guarantee When Composing Lock-Free Data Objects
Lock-Free Data Objects
Examining Lock-Free Progress Guarantee in Object-Oriented Program
A Synchronization Mechanism for Composing Lock-Free Objects
Our Approach
The Operation Descriptor
The Synchronization Mechanism
ABA Problem
Linearizability
How Does the Proposed Synchronization Mechanism Resolve Lock-Free Conflicts?
Experimental Evaluation
Conclusion
References
Engineering a Multi-core Radix Sort
Introduction
Software Write-Combining
Virtual-Memory Counting Sort
Radix Sort
Performance Evaluation
Conclusion
References
Accelerating Code on Multi-cores with FastFlow
Introduction
Code Acceleration through Streamization
The FastFlow Parallel Programming Framework
Self-offloading on the FastFlow Accelerator
Experimental Evaluation
Micro-benchmarks
Applications
Related Work
Conclusions
References
A Novel Shared-Memory Thread-Pool Implementation for Hybrid Parallel CFD Solvers
Intro
Motivation
Outline
The DLR TAU Code
The Shared-Memory Parallelization - Generic Concept
The Shared-Memory Parallelization - Implementation Details for TAU
Cache Blocking in TAU
Modification of the Colors in TAU to Suit the Hybrid Parallelization Concept
Minimally Invasive Implementation of the Task Dispatching
First Performance Results
Conclusion and Outlook
References
A Fully Empirical Autotuned Dense QR Factorization for Multicore Architectures
Introduction
Problem Description
Tile QR Factorization
Tunable Parameters and Objective
Two-Step Empirical Method
Experimental Environments
Step 1: Benchmarking the Most Compute-Intensive Serial Kernel
Step 2: Benchmarking the Whole QR Factorization
Discretization and Interpolation
Impact of the Pre-selection on the Elapsed Time of Step 2
Prune as You Go (PSPAYG)
Reliability
Conclusion and Future Work
References
Parallelizing a Real-Time Physics Engine Using Transactional Memory
Introduction
ODE Overview
Collision Detection
Dynamics Simulation
Parallel Transactional ODE
Global Thread Pool
Parallel Collision Detection Using Spatial Decomposition
Parallel Island Processing
Phase Separation
Feedback between Phases
Issues
Conditional Synchronization
Memory Management and Application Controlled alloc/de-alloc
Experimental Evaluation
Execution Time
Frame Rate
Abort Rate
Thread Utilization
Transaction Read/Write Sets
Scalability Optimizations
Related Work
Conclusion
References
Topic 12: Theory and Algorithms for Parallel Computation
Introduction
Petri-nets as an Intermediate Representation for Heterogeneous Architectures
Introduction
Notation
Petri-net Intermediate Representation
Simple Hardware Model
Mapping Software to Hardware
Finding Optimal Executions
Complexity
Similar Problems and Techniques
Compiler Optimisations
Comparison with Other Models
Conclusions
References
A Bi-Objective Scheduling Algorithm for Desktop Grids with Uncertain Resource Availabilities
Introduction
Context and Motivation
Contributions
Organization of the Paper
Related Works
Scheduling with Unavailabilities
Scheduling under Uncertainties
Models
Model of Execution
Model of Disturbances
Problem Definition
Analysis of the Stability
Bi-objective Algorithm
Description
Theoretical Analysis
Experiments
Concluding Remarks
References
New Multithreaded Ordering and Coloring Algorithms for Multicore Architectures
Introduction
Vertex Ordering
The Serial Framework
Parallel Ordering
Parallel Distance-2 Coloring
Experimental Results
Conclusion and Future Work
References
Topic 13: High Performance Networks and Communication
Introduction
Kernel-Based Offload of Collective Operations - Implementation, Evaluation and Lessons Learned
Introduction
Related Work
Expressing Collective Operations
The GOAL Interpreter
User vs. Kernel Level Design
Integration into the Operating System
Anatomy of the Linux Kernel Network Stack
The Ethernet Streaming Protocol
Asynchronous Progression
Performing Reduction Operations in Kernel Space
Benchmark Results
Experimental Setting
Asynchronous Progress and Overlap
CPU Overheads
Conclusions and Future Work
References
A High Performance Superpipeline Protocol for InfiniBand
Introduction
Performance Analysis
Pipelining Memory Copy
Optimizations beyond Pipeline: Superpipeline
Benchmarks
Related Works
Conclusion and Future Works
References
Topic 14: Mobile and Ubiquitous Computing
Introduction
ChurnDetect: A Gossip-Based Churn Estimator for Large-Scale Dynamic Networks
Introduction
Related Work
Diffusion Algorithms
The DiffusionReset Algorithm
Convergence of DiffusionReset
Churn Detection Algorithm
Analysis of ChurnDetect Algorithm
Experimental Evaluation via Simulations
Experimental Evaluation on the Testbed
Conclusions
References
Topic 15: High-Performance and Scientific Applications
Introduction
Real Time Contingency Analysis for Power Grids
Introduction
Previous Work
Risk Based Algorithm
Load Balancing Schemes
Centralized Load Balancing Schemes
Decentralized Load Balancing Scheme
Results
Conclusions and Future Work
References
CRSD: Application Specific Auto-tuning of SpMV for Diagonal Sparse Matrices
Introduction
CRSD Storage Format
Diagonal Pattern
Application Specific Diagonal Pattern
Storage Format
SpMV Implementation for CRSD
Application Specific Automatic Performance Tuning
The Final CRSD SpMV Implementation
Parallelization
Evaluation
The Auto-Tuning Records
Serial Performance Improvement
Parallel Performance Improvement
Related Work
Conclusion
References
The LOFAR Beam Former: Implementation and Performance Analysis
Introduction
IBM BlueGene/P
System Description
External I/O
Real-Time Processing
LOFAR and Beam Forming
Beam Former Pipelines
Input from Stations
First All-to-All Exchange
Beam Forming
Channel-Level Dedispersion
Stokes Calculations
Second All-to-All Exchange
Transport to Disks
Performance Analysis
Overall Performance
System Load
Related Work
Conclusions
References
Application-Specific Fault Tolerance via Data Access Characterization
Introduction
Related Work
Background
NWChem
Global Arrays
Instrumentation Methodology
Fault Tolerance Techniques
Application Evaluation Axes
Data Access Characterization of NWChem
Hartree-Fock/Density Functional Theory
Coupled Cluster Theory
Evaluation of Various Fault Tolerance Schemes
Conclusions
References
High-Performance Numerical Optimization on Multicore Clusters
Introduction
Numerical Optimization
Multistart Parallelism Issues
TORC Runtime Library
PNDL and Parallel Multistart Implementation
Performance Experiments
Related Work
Conclusions
References
Parallel Monte-Carlo Tree Search for HPC Systems
Introduction
MCTS: Background and Related Work
Basic MCTS
Parallelization of MCTS
The UCT-Treesplit Algorithm for Parallel MCTS
Experiments
Setup
Results
Conclusion and Future Work
References
Petascale Block-Structured AMR Applications without Distributed Meta-data
Introduction
AMR Applications
Chombo AMR Framework
Benchmarking Methodology
Replication Scaling Benchmarks
Poisson Benchmark
Hyperbolic Gas Dynamics Benchmark
Optimizing AMR for Scalability
Memory Performance: Compression
Run Time Performance
Summary and Conclusions
References
Accelerating Anisotropic Mesh Adaptivity on nVIDIA's CUDA Using Texture Interpolation
Introduction
Background
PDEs, Meshes and Mesh Quality
Anisotropic PDEs
Vertex Smoothing and the Algorithm by Pain et al.
CUDA's Texturing Hardware
Design and Implementation
Experimental Evaluation
Conclusions and Future Work
References
Topic 16: GPU and Accelerators Computing
Introduction
Model-Driven Tile Size Selection for DOACROSS Loops on GPUs
Introduction
Parallelization of DOACROSS Loops on GPUs
Execution Time Modeling
Intra-tile Execution
Inter-tile Execution
Parameter Estimation
Border Tiles
Model-Driven Tile Size Selection
The Algorithm
The Framework
Experiments
References
Iterative Sparse Matrix-Vector Multiplication for Integer Factorization on GPUs
Introduction
SpMV on GF(2) for NFS Matrices Using Existing Formats on GPUs
New Formats for SpMV on GPUs for NFS Matrices
Dense Format
Sliced COO
Determining the Cut-Off Point of Each Format
Dual-GPU Implementation
Results
Conclusion and Future Work
References
Lessons Learned from Exploring the Backtracking Paradigm on the GPU
Introduction
Motivation
Backtracking Case Study: Bron-Kerbosch MCE
Algorithm Overview
Algorithm Parallelization
Benchmarking
Input Graphs
GPU vs. Multi-core CPU Timing
Lessons Learned
Coarse vs. Fine-Grain Parallelization
Global Memory Latency Hiding
A Reliance on Problem Instance Representation
Generality of Backtracking Properties with Respect to GPU-Based Algorithms
Conclusions / Future Work
References
Automatic OpenCL Device Characterization: Guiding Optimized Kernel Design
Introduction
Benchmark Design and Methodology
Arithmetic Throughput
Memory Subsystem
Branching Penalty
Runtime Overheads
Device Characterization - Results
Arithmetic Throughput
Memory Subsystem
Branching Penalty
Runtime Overheads
Guiding Kernel Design
The Model Problem
Optimizations
Results
Related Work
Conclusion
References
Author Index