High Performance Computing for Computational Science -- VECPAR 2010

Name: High Performance Computing for Computational Science -- VECPAR 2010 | 9th International Conference, Berkeley, CA, USA, June 22-25, 2010, Revised, Selected Papers
Brand: Springer
Price: 53.49 EUR
Availability: OnlineOnly

9th International Conference, Berkeley, CA, USA, June 22-25, 2010, Revised, Selected Papers

José M. Laginha M. Palma, Michel Daydé, Osni Marques, Joao Correia Lopes (Editors)

Springer (Publisher)

Published on 18 February 2011

XIV, 470 pages

E-Book

PDF with watermark DRM

978-3-642-19328-6 (ISBN)

53.49 € incl. 7% VAT

E-Book single license

Available for download

Contents

Title Page
Preface
Organization
Table of Contents
Invited Talks
Exascale Computing Technology Challenges
Introduction
Metrics, Cost Functions, and Constraints
Memory Subsystem
Memory Bandwidth
Memory Capacity
Latency
Node Architecture Projections for 2018
Clock Rate
Instruction Level Parallelism
Instruction Bundling (SIMD and VLIW)
Multithreading to Hide Latency
FPU Organization
System on Chip (SoC) Integration
Alternative Exotic Functional Unit Organizations
Cache Hierarchy
Levels of Cache Hierarchy
Private vs. Shared Caches
Software Managed Caches vs. Conventional Caches
Intra-node Communication (Networks-on-Chip)
Cache Coherence (or Lack Thereof)
Global Address Space
Fine Grained Synchronization Support
Power Management
Node-Scale Power Management
System-Scale Power Management
Energy Aware Algorithms
Library Integration with Power Management Systems
Compiler Assisted Power Management
Application-Directed Power Management
System "Aging"
Voltage Conversion and Cooling Efficiency
Fault Detection and Recovery
Hard (Permanent) Errors
Soft (Transient) Errors
Node Localized Checkpointing
Interconnection Networks
Topology
Effect of Interconnect Topology on Interconnect Design
Conclusions
References
The Parallel Revolution Has Started: Are You Part of the Solution or Part of the Problem?
HPC Techniques for a Heart Simulator
Game Changing Computational Engineering Technology
HPC in Phase Change: Towards a New Execution Model
Linear Algebra and Solvers on Emerging Architectures
Factors Impacting Performance of Multithreaded Sparse Triangular Solve
Introduction
Motivation
Level-Set Triangular Solver
Related Work
Factors Affecting Performance
Numerical Experiments
Barriers
Thread Affinity
Data Locality
More Realistic Problems
Summary and Conclusions
References
Performance and Numerical Accuracy Evaluation of Heterogeneous Multicore Systems for Krylov Orthogonal Basis Computation
Introduction
Orthogonalization Process
Accelerators Programming
Nvidia CUDA-Enabled GPUs
STI Cell Processor
Optimizations
BLAS Operations
CPU
GPU
Cell Broadband Engine
Experimentation
Hardware Precision
Performance Achieved
Synthesis
Conclusion
References
An Error Correction Solver for Linear Systems: Evaluation of Mixed Precision Implementations
Introduction
Mixed Precision Error Correction Methods
Mathematical Background
Mixed Precision Approach
Hardware Platform and Implementation Issues
Numerical Experiments
Test Configurations
Numerical Results
Result Interpretation
Conclusions and Future Work
References
Multifrontal Computations on GPUs and Their Multi-core Hosts
Introduction
Overview of a Multifrontal Sparse Solver
Graphics Processing Units
Algorithm for Factoring Individual Frontal Matrices on the GPU
Performance of the Accelerated Multifrontal Solver
Summary
References
Accelerating GPU Kernels for Dense Linear Algebra
Introduction
Performance of Current BLAS for GPUs
Pointer Redirecting
MAGMA BLAS Kernels
Performance
Conclusions and On-going Work
References
A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators
Introduction
Cholesky Factorization on Multicore+MultiGPUs
Principles and Methodology
Implementations Details
Memory Optimal
Data Persistence Optimizations
Hybrid xPOTRF Kernel
xSYRK and xTRSM Kernel Optimizations
Experimental Results
Environment Setup
Tuning
Performance Results
Related Work
Summary and Future Work
References
On the Performance of an Algebraic Multigrid Solver on Multicore Clusters
Motivation
The Algebraic Multigrid (AMG) Solver
The Hera Multicore Cluster
Using an MPI-Only Model with AMG
Replacing On-Node MPI with OpenMP
The OpenMP Implementation
Optimizing Memory Behavior with MCSup
Optimized OpenMP Performance
Mixed Programming Model
Investigating the MPI-Only Performance Degradation
Summary
References
An Hybrid Approach for the Parallelization of a Block Iterative Algorithm
Introduction
Block Cimmino Algorithm
Parallelization Strategy
Manual Parallelism Description
Automatic Parallelism with MUMPS
Strategy Details
Preprocessing
Solve: The Block-CG Acceleration
Numerical Results
Factorization Step
Solve Step
Ongoing Work
References
Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures
Introduction
Tile in-place Matrix Inversion
Algorithmic Study
Conclusion and Future Work
References
A Massively Parallel Dense Symmetric Eigensolver with Communication Splitting Multicasting Algorithm
Introduction
Symmetric Dense Eigensolver
The Communication Splitting Multicasting Algorithm
The Data Distribution
The Square Grid Algorithm for Tridiagonalization
The Process Grid Free Algorithm for Tridiagonalization
The Process Grid Free Algorithm for Inverse Transformation
Performance Evaluation
Machine Environment
Performance on Different Process Grids
Execution Performance in a Massively Parallel Environment
Conclusion
References
Large Scale Simulations in CS&E
Global Memory Access Modelling for Efficient Implementation of the Lattice Boltzmann Method on Graphics Processing Units
Compute Unified Device Architecture
Lattice Boltzmann Method
Methodology
Modelling
Throughput
N ≤ 20
21 ≤ N ≤ 39
Complementary Studies
Implementations
References
Data Structures and Transformations for Physically Based Simulation on a GPU
Introduction
Related Work
Physically Based Simulation Framework
Coalesced Memory Accesses from Arrays of Objects
Automated Framework for Physics Data Structures
Data Transformations and Hierarchically Designed Data Structures
Performance Results
Conclusion
References
Scalability Studies of an Implicit Shallow Water Solver for the Rossby-Haurwitz Problem
Introduction
A Fully Implicit Finite Volume Discretization
An Inexact Newton's Method with Adaptive Stopping Conditions
Some Variants of One-Level and Multilevel Schwarz Preconditioners
Numerical Experiments
Numerical Conservation
Performance Tests
Conclusions
References
Parallel Multigrid Solvers Using OpenMP/MPI Hybrid Programming Models on Multi-Core/Multi-Socket Clusters
Introduction
Hardware Environment
Implementation and Optimization of Target Application
Finite-Volume Application
Iterative Method with Multigrid Preconditioning
Procedures for Reordering
Procedures for Optimization
Results
Effect of Coloring and Optimization
Weak Scaling
Strong Scaling
Concluding Remarks
References
A Parallel Strategy for a Level Set Simulation of Droplets Moving in a Liquid Medium
Introduction
Numerical Simulation of Droplets Sedimenting in Water
Parallel Hierarchy of Triangulations
Parallel Performance
Concluding Remarks
References
Optimization of Aircraft Wake Alleviation Schemes through an Evolution Strategy
Introduction
Optimization of Wake Alleviation
Alleviation Scheme
Optimization of the Lift Distribution and Perturbation
Methodology
Vortex Particle Method
Evolution Strategy
Coupling and Computation
Results
Optimization
Optimum Parameter Set
Conclusions
References
Parallel and Distributed Computing
On-Line Multi-Threaded Processing of Web User-Clicks on Multi-Core Processors
Introduction
Background and Problem Setting
Strategies for Read-Write Synchronization
Experiments
Conclusions
References
Performance Evaluation of Improved Web Search Algorithms
Introduction
Search Architecture
A Cost Estimation Methodology
Experimental Setting
Conclusions
References
Text Classification on a Grid Environment
Introduction
Text Classification
Naïve Bayes Classifier
Expectation-Maximization Algorithm
Grid Environment
NACAD Grid Environment
Grid Services
Naïve Bayes Classifier via the EM Algorithm on a Grid Environment
Results
Performance Criteria
Conclusion
References
On the Vectorization of Engineering Codes Using Multimedia Instructions
Introduction
Outline of the Boundary Element Theory
The Application
The Streaming SIMD Extensions
Auto-vectorization Compilers
Compiler Intrinsics
An SSE Implementation
Results Summary
Conclusions
References
Numerical Library Reuse in Parallel and Distributed Platforms
Introduction
Linear Algebra Libraries
Imperative Numerical Libraries
Object Oriented Numerical Libraries
A Reusable Numerical Library Design Model
Library Integration in Scientific Workflow Environment
Experiments
Conclusion
References
Improving Memory Affinity of Geophysics Applications on NUMA Platforms Using Minas
Introduction
Related Work
Minas
MAi: Memory Affinity interface
MApp: Memory Affinity preprocessor
Numarch: NUMA Architecture Module
Performance Evaluation
Cache-Coherent NUMA Platforms
Numerical Scientific Parallel Applications
Experimental Results
Conclusion and Future Work
References
HPC Environment Management: New Challenges in the Petaflop Era
Introduction
Available Tools
Deployment Tools
Monitoring Tools
Proprietary Solutions
The LEMMing Project
LEMMing Web Services (LEMM-WS)
LEMMing Web Application (LEMM-GATE)
Conclusion
References
Evaluation of Message Passing Communication Patterns in Finite Element Solution of Coupled Problems
Introduction
EdgeCFD: The Benchmark Software
Performance Tests
Concluding Remarks
References
Applying Process Migration on a BSP-Based LU Decomposition Application
Introduction
MigBSP: Process Rescheduling Model
LU Decomposition Application
BSP-Based LU Application Modeling
Evaluation Methodology
Results Analysis
Related Work
Concluding Remarks
References
A P2P Approach to Many Tasks Computing for Scientific Workflows
Introduction
Backgrounds on P2P Networks
Design of SciMule
SciMule Architectural Features
SciMule Conceptual Architecture
SciMule Evaluation
Conclusions
References
Intelligent Service Trading and Brokering for Distributed Network Services in GridSolve
Introduction
ServiceTrading
Inputs for the Service Trader
The Trader Output
Inside the Trader
Overview of GridSolve: A GridRPC Middleware
Integration of Service Trading into GridSolve
Generating the Inputs of the Trader
Discover the Combination of Services
Call the Services
The Service Trader C API
The Matlab Interface
Experiments
Summary
References
Load Balancing in Dynamic Networks by Bounded Delays Asynchronous Diffusion
Introduction
Model
Notations
General Load Balancing Scheme
Dynamical Evolution of the System State
Choice of the Load Ratios
Load Balancing Algorithm
Proof of the Load Balancing Convergence
Technical Results
Proof of Theorem 1
Experimental Evaluation
Efficiency Evaluation
Experimental Contexts
Results
Conclusion
References
A Computing Resource Discovery Mechanism over a P2P Tree Topology
Introduction
CoDiP2P Architecture
Updating Algorithm
Departure of Peers
Searching Algorithms
Exact Query Searching Algorithm
Range Query Searching Algorithm
Rebalancing Mechanism
Experimentation
Experimental Results
Conclusions and Future Work
References
Numerical Algorithms
A Parallel Implementation of the Jacobi-Davidson Eigensolver for Unsymmetric Matrices
Introduction
The Jacobi-Davidson Method
Computation of Eigenvalues at the Periphery of the Spectrum
Computation of Interior Eigenvalues
Computing Complex Eigenvalues with Real Arithmetic
Preconditioning
Implementation Details
Computational Results
The Exterior Case
The Interior Case
Parallel Performance
Conclusions and Future Work
References
The Impact of Data Distribution in Accuracy and Performance of Parallel Linear Algebra Subroutines
Introduction
Background
Theory of Rounding Errors
Data Distribution in Numerical Algorithms
Numerical Experiments
Platform
Input Data
Numerical Results
Final Remarks and Future Work
References
On a Strategy for Spectral Clustering with Parallel Computation
Introduction
Parallel Spectral Clustering: Algorithm and Justification
Choice of the Affinity Parameter s
Number of Clusters $k$
Implementation: Algorithm Components
Pre-processing Step: Partition $S$ in $q$ Subdomains
Domain Decomposition: Interface and Subdomains
Spectral Clustering on Subdomains
Grouping Step
Parallel Experiments
Discussion and Alternative
Numerical Experiments: Geometrical Example
An Image Segmentation Example
Conclusion and Ongoing Works
References
On Techniques to Improve Robustness and Scalability of a Parallel Hybrid Linear Solver
The Schur Complement Method and Parallelization
Efficient Computation of an Approximate Schur Complement
Sparse Triangular Solution with Sparse Right-Hand-Sides
Intra-processor Load Balance
Inter-processor Load Balance
Parallel Performance
Conclusion
References
Solving Dense Interval Linear Systems with Verified Computing on Multicore Architectures
Introduction and Motivation
Parallel and Verified Computing
Mathematical Background
Proposed Approach
Initial Implementation
Initial Approach Evaluation
Optimized Parallel Approach
Optimized Approach Evaluation
Considerations and Future Work
References
TRACEMIN-Fiedler: A Parallel Algorithm for Computing the Fiedler Vector
Introduction
The TRACEMIN-Fiedler Algorithm
Parallel Implementation of TRACEMIN-Fiedler
Numerical Results
Conclusions
References
Applying Parallel Design Techniques to Template Matching with GPUs
Introduction
Template Matching Background
Case Study: Full Search and On-Card Memory
GPU Acceleration Method
Results
Analysis
Conclusions and Future Work
References
Author Index
