
Euro-Par 2011 Parallel Processing

Content
- Title
- Preface
- Organization
- Table of Contents
- Topic 9: Parallel and Distributed Programming
- Introduction
- Parallel Scanning with Bitstream Addition: An XML Case Study
- Introduction
- The Parallel Bitstream Method
- Fundamentals
- A Parallel Scanning Primitive
- XML Scanning and Parsing
- XML Well-Formedness
- Compilation to Block-Based Processing
- Performance Results
- Conclusion
- References
- HOMPI: A Hybrid Programming Framework for Expressing and Deploying Task-Based Parallelism
- Introduction
- Programming Environment
- Callbacks, Reductions and Detached Tasks
- Task Distribution and Scheduling
- TORC: The Runtime System
- Mixed-Mode and Hybrid Programming
- Experimental Evaluation
- Conclusion
- References
- A Failure Detector for Wireless Networks with Unknown Membership
- Introduction
- Related Work
- Model and Problem Definition
- Stability Assumptions
- A Failure Detector of Class SM
- Towards a Time-Free Failure Detector for the SM Class
- Stable Query-Response Communication Mechanism
- Behavioral Properties
- A Failure Detector Algorithm for the SM Class
- Algorithm Description
- Practical Issues
- Conclusion
- References
- Towards Systematic Parallel Programming over MapReduce
- Introduction
- MapReduce and List Homomorphisms
- MapReduce and MapReduce Programming Model
- List Homomorphism and Homomorphism Theorems
- A Homomorphism-Based Framework for Parallel Programming with MapReduce
- Programming Interface and Homomorphism Derivation
- Homomorphism Implementation on MapReduce
- A Programming Example
- Experiments
- Concluding Remarks
- References
- Correlated Set Coordination in Fault Tolerant Message Logging Protocols
- Introduction
- Rollback Recovery Background
- Execution Model
- Building a Consistent Recovery Set
- Group-Coordinated Message Logging
- Shared Memory and Message Logging
- Correlated Set Coordinated Message Logging
- Implementation
- Experimental Evaluation
- Experimental Conditions
- Shared Memory Performance
- Cluster of Multicore Performance
- Related Works
- Concluding Remarks
- References
- Topic 10: Parallel Numerical Algorithms
- Introduction
- A Bit-Compatible Parallelization for ILU(k) Preconditioning
- Introduction
- Review of the Sequential ILU(k) Algorithm
- Terminology for ILU(k)
- ILU(k) Algorithm and Its Parallelization
- TPILU(k): Task-Oriented Parallel ILU(k) Algorithm
- Parallel Tasks and Static Load Balancing
- Optimized Symbolic Factorization
- Optional Level-Based Incomplete Inverse Method
- Experimental Results
- Experimental Analysis
- Related Work
- References
- Parallel Inexact Constraint Preconditioners for Saddle Point Problems
- Introduction
- Finite Element Coupled Consolidation Equations
- Inexact Constraint Preconditioners
- Eigenvalue Distribution of the Preconditioned Matrices
- FSAI-Based ICP
- Parallel Implementation
- Numerical Results
- Solution of Kx = b
- Parallel Results and Scalability
- Conclusions
- References
- Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms
- Introduction
- Previous Work
- Communication Lower Bounds for Linear Algebra
- 3D Linear Algebra Algorithms
- 2.5D Lower and Upper Bounds
- 2.5D Matrix Multiplication
- 2.5D LU Communication Lower Bound
- 2.5D Communication Optimal LU
- 2.5D Communication Optimal LU with Pivoting
- Performance Results
- 2.5D Matrix Multiplication Performance
- 2.5D LU Performance
- 2.5D LU with CA-Pivoting Performance
- Future Work
- References
- Topic 11: Multicore and Manycore Programming
- Introduction
- Hardware and Software Tradeoffs for Task Synchronization on Manycore Architectures
- Introduction
- Asynchronous Task Parallelism and Software Phasers
- Asynchronous Task Synchronization Using Phasers
- Software Phasers in Habanero-C
- Hardware Support in Phasers
- Cyclops64 Manycore Architecture
- Optimization Using Hardware Barriers
- Optimization Using Thread Suspend and Awake
- Adaptive Phasers
- Memory Optimizations
- Implementation and Experiments
- Implementation and Experimental Benchmarks
- Hierarchical Phasers and Memory Optimizations
- Barrier and Point-to-Point Microbenchmarks
- Applications
- Related Work
- Conclusions and Future Work
- References
- OpenMPspy: Leveraging Quality Assurance for Parallel Software
- Introduction
- Overview of OpenMPspy
- Modes of Operation
- The Code Analysis Framework
- Analysis Features for OpenMP
- Analyzing with OpenMPspy: A Study of Real Projects
- Applications
- Finding Unreported Errors in Real Projects
- How OpenMP Constructs Are Used in Practice
- Insights for Parallel Software Quality Improvement
- Related Work
- Conclusion
- References
- A Generic Parallel Collection Framework
- Introduction
- Scala Collection Framework
- Adaptive Work Stealing
- Design and Implementation
- Splitters and Combiners
- Parallel Array
- Parallel Rope
- Parallel Hash Table
- Parallel Hash Trie
- Parallel Views
- Experimental Results
- Related Work
- Conclusion
- References
- Progress Guarantees When Composing Lock-Free Objects
- Introduction
- Progress Guarantee When Composing Lock-Free Data Objects
- Lock-Free Data Objects
- Examining Lock-Free Progress Guarantee in Object-Oriented Program
- A Synchronization Mechanism for Composing Lock-Free Objects
- Our Approach
- The Operation Descriptor
- The Synchronization Mechanism
- ABA Problem
- Linearizability
- How Does the Proposed Synchronization Mechanism Resolve Lock-Free Conflicts?
- Experimental Evaluation
- Conclusion
- References
- Engineering a Multi-core Radix Sort
- Introduction
- Software Write-Combining
- Virtual-Memory Counting Sort
- Radix Sort
- Performance Evaluation
- Conclusion
- References
- Accelerating Code on Multi-cores with FastFlow
- Introduction
- Code Acceleration through Streamization
- The FastFlow Parallel Programming Framework
- Self-offloading on the FastFlow Accelerator
- Experimental Evaluation
- Micro-benchmarks
- Applications
- Related Work
- Conclusions
- References
- A Novel Shared-Memory Thread-Pool Implementation for Hybrid Parallel CFD Solvers
- Intro
- Motivation
- Outline
- The DLR TAU Code
- The Shared-Memory Parallelization - Generic Concept
- The Shared-Memory Parallelization - Implementation Details for TAU
- Cache Blocking in TAU
- Modification of the Colors in TAU to Suit the Hybrid Parallelization Concept
- Minimally Invasive Implementation of the Task Dispatching
- First Performance Results
- Conclusion and Outlook
- References
- A Fully Empirical Autotuned Dense QR Factorization for Multicore Architectures
- Introduction
- Problem Description
- Tile QR Factorization
- Tunable Parameters and Objective
- Two-Step Empirical Method
- Experimental Environments
- Step 1: Benchmarking the Most Compute-Intensive Serial Kernel
- Step 2: Benchmarking the Whole QR Factorization
- Discretization and Interpolation
- Impact of the Pre-selection on the Elapsed Time of Step 2
- Prune as You Go (PSPAYG)
- Reliability
- Conclusion and Future Work
- References
- Parallelizing a Real-Time Physics Engine Using Transactional Memory
- Introduction
- ODE Overview
- Collision Detection
- Dynamics Simulation
- Parallel Transactional ODE
- Global Thread Pool
- Parallel Collision Detection Using Spatial Decomposition
- Parallel Island Processing
- Phase Separation
- Feedback between Phases
- Issues
- Conditional Synchronization
- Memory Management and Application Controlled alloc/de-alloc
- Experimental Evaluation
- Execution Time
- Frame Rate
- Abort Rate
- Thread Utilization
- Transaction Read/Write Sets
- Scalability Optimizations
- Related Work
- Conclusion
- References
- Topic 12: Theory and Algorithms for Parallel Computation
- Introduction
- Petri-nets as an Intermediate Representation for Heterogeneous Architectures
- Introduction
- Notation
- Petri-net Intermediate Representation
- Simple Hardware Model
- Mapping Software to Hardware
- Finding Optimal Executions
- Complexity
- Similar Problems and Techniques
- Compiler Optimisations
- Comparison with Other Models
- Conclusions
- References
- A Bi-Objective Scheduling Algorithm for Desktop Grids with Uncertain Resource Availabilities
- Introduction
- Context and Motivation
- Contributions
- Organization of the Paper
- Related Works
- Scheduling with Unavailabilities
- Scheduling under Uncertainties
- Models
- Model of Execution
- Model of Disturbances
- Problem Definition
- Analysis of the Stability
- Bi-objective Algorithm
- Description
- Theoretical Analysis
- Experiments
- Concluding Remarks
- References
- New Multithreaded Ordering and Coloring Algorithms for Multicore Architectures
- Introduction
- Vertex Ordering
- The Serial Framework
- Parallel Ordering
- Parallel Distance-2 Coloring
- Experimental Results
- Conclusion and Future Work
- References
- Topic 13: High Performance Networks and Communication
- Introduction
- Kernel-Based Offload of Collective Operations - Implementation, Evaluation and Lessons Learned
- Introduction
- Related Work
- Expressing Collective Operations
- The GOAL Interpreter
- User vs. Kernel Level Design
- Integration into the Operating System
- Anatomy of the Linux Kernel Network Stack
- The Ethernet Streaming Protocol
- Asynchronous Progression
- Performing Reduction Operations in Kernel Space
- Benchmark Results
- Experimental Setting
- Asynchronous Progress and Overlap
- CPU Overheads
- Conclusions and Future Work
- References
- A High Performance Superpipeline Protocol for InfiniBand
- Introduction
- Performance Analysis
- Pipelining Memory Copy
- Optimizations beyond Pipeline: Superpipeline
- Benchmarks
- Related Works
- Conclusion and Future Works
- References
- Topic 14: Mobile and Ubiquitous Computing
- Introduction
- ChurnDetect: A Gossip-Based Churn Estimator for Large-Scale Dynamic Networks
- Introduction
- Related Work
- Diffusion Algorithms
- The DiffusionReset Algorithm
- Convergence of DiffusionReset
- Churn Detection Algorithm
- Analysis of ChurnDetect Algorithm
- Experimental Evaluation via Simulations
- Experimental Evaluation on the Testbed
- Conclusions
- References
- Topic 15: High-Performance and Scientific Applications
- Introduction
- Real Time Contingency Analysis for Power Grids
- Introduction
- Previous Work
- Risk Based Algorithm
- Load Balancing Schemes
- Centralized Load Balancing Schemes
- Decentralized Load Balancing Scheme
- Results
- Conclusions and Future Work
- References
- CRSD: Application Specific Auto-tuning of SpMV for Diagonal Sparse Matrices
- Introduction
- CRSD Storage Format
- Diagonal Pattern
- Application Specific Diagonal Pattern
- Storage Format
- SpMV Implementation for CRSD
- Application Specific Automatic Performance Tuning
- The Final CRSD SpMV Implementation
- Parallelization
- Evaluation
- The Auto-Tuning Records
- Serial Performance Improvement
- Parallel Performance Improvement
- Related Work
- Conclusion
- References
- The LOFAR Beam Former: Implementation and Performance Analysis
- Introduction
- IBM BlueGene/P
- System Description
- External I/O
- Real-Time Processing
- LOFAR and Beam Forming
- Beam Former Pipelines
- Input from Stations
- First All-to-All Exchange
- Beam Forming
- Channel-Level Dedispersion
- Stokes Calculations
- Second All-to-All Exchange
- Transport to Disks
- Performance Analysis
- Overall Performance
- System Load
- Related Work
- Conclusions
- References
- Application-Specific Fault Tolerance via Data Access Characterization
- Introduction
- Related Work
- Background
- NWChem
- Global Arrays
- Instrumentation Methodology
- Fault Tolerance Techniques
- Application Evaluation Axes
- Data Access Characterization of NWChem
- Hartree-Fock/Density Functional Theory
- Coupled Cluster Theory
- Evaluation of Various Fault Tolerance Schemes
- Conclusions
- References
- High-Performance Numerical Optimization on Multicore Clusters
- Introduction
- Numerical Optimization
- Multistart Parallelism Issues
- TORC Runtime Library
- PNDL and Parallel Multistart Implementation
- Performance Experiments
- Related Work
- Conclusions
- References
- Parallel Monte-Carlo Tree Search for HPC Systems
- Introduction
- MCTS: Background and Related Work
- Basic MCTS
- Parallelization of MCTS
- The UCT-Treesplit Algorithm for Parallel MCTS
- Experiments
- Setup
- Results
- Conclusion and Future Work
- References
- Petascale Block-Structured AMR Applications without Distributed Meta-data
- Introduction
- AMR Applications
- Chombo AMR Framework
- Benchmarking Methodology
- Replication Scaling Benchmarks
- Poisson Benchmark
- Hyperbolic Gas Dynamics Benchmark
- Optimizing AMR for Scalability
- Memory Performance: Compression
- Run Time Performance
- Summary and Conclusions
- References
- Accelerating Anisotropic Mesh Adaptivity on nVIDIA's CUDA Using Texture Interpolation
- Introduction
- Background
- PDEs, Meshes and Mesh Quality
- Anisotropic PDEs
- Vertex Smoothing and the Algorithm by Pain et al.
- CUDA's Texturing Hardware
- Design and Implementation
- Experimental Evaluation
- Conclusions and Future Work
- References
- Topic 16: GPU and Accelerators Computing
- Introduction
- Model-Driven Tile Size Selection for DOACROSS Loops on GPUs
- Introduction
- Parallelization of DOACROSS Loops on GPUs
- Execution Time Modeling
- Intra-tile Execution
- Inter-tile Execution
- Parameter Estimation
- Border Tiles
- Model-Driven Tile Size Selection
- The Algorithm
- The Framework
- Experiments
- References
- Iterative Sparse Matrix-Vector Multiplication for Integer Factorization on GPUs
- Introduction
- SpMV on GF(2) for NFS Matrices Using Existing Formats on GPUs
- New Formats for SpMV on GPUs for NFS Matrices
- Dense Format
- Sliced COO
- Determining the Cut-Off Point of Each Format
- Dual-GPU Implementation
- Results
- Conclusion and Future Work
- References
- Lessons Learned from Exploring the Backtracking Paradigm on the GPU
- Introduction
- Motivation
- Backtracking Case Study: Bron-Kerbosch MCE
- Algorithm Overview
- Algorithm Parallelization
- Benchmarking
- Input Graphs
- GPU vs. Multi-core CPU Timing
- Lessons Learned
- Coarse vs. Fine-Grain Parallelization
- Global Memory Latency Hiding
- A Reliance on Problem Instance Representation
- Generality of Backtracking Properties with Respect to GPU-Based Algorithms
- Conclusions / Future Work
- References
- Automatic OpenCL Device Characterization: Guiding Optimized Kernel Design
- Introduction
- Benchmark Design and Methodology
- Arithmetic Throughput
- Memory Subsystem
- Branching Penalty
- Runtime Overheads
- Device Characterization - Results
- Arithmetic Throughput
- Memory Subsystem
- Branching Penalty
- Runtime Overheads
- Guiding Kernel Design
- The Model Problem
- Optimizations
- Results
- Related Work
- Conclusion
- References
- Author Index
System requirements
File format: PDF
Copy protection: Watermark-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Use the free software Adobe Reader, Adobe Digital Editions, or any other PDF viewer of your choice (see eBook Help).
- Tablet/Smartphone (Android; iOS): Install the free app Adobe Digital Editions or another reading app for eBooks, e.g., PocketBook (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (Kindle: limited support only).
The PDF file format displays every book page identically on any hardware. This makes PDF suitable for complex layouts such as those used in textbooks and reference books (images, tables, columns, footnotes). On the small screens of e-readers and smartphones, however, PDFs are awkward to read because they require a lot of scrolling.
This eBook uses Watermark-DRM, a "soft" copy protection. This means that there are no technical restrictions to prevent illegal distribution. However, there is a personalised watermark embedded in the eBook that can be used to identify the purchaser of the eBook in the event of misuse and to provide evidence for legal purposes.
For more information, see our eBook Help page.