
Accelerated Computing with HIP
Description
The goal of this book is to provide helpful guidance to GPU programmers looking to develop HIP programs for the ROCm platform. Readers will learn how to reason through real-world problems and break them down into independent parts so that GPUs can be used to solve them efficiently. The text takes programmers on a tour of GPU hardware design and demonstrates how to leverage the hardware's unique features to optimize software performance. Finally, the text includes instructions on how programmers can exploit the ROCm ecosystem by invoking libraries to perform linear algebra operations while leveraging multiple GPUs in one application.
Content
- Intro
- Title Page
- Copyright Page
- Contents
- Foreword
- Preface
- Acknowledgements
- 1. Introduction
- 1.1 Parallel Programming
- 1.2 GPUs
- 1.3 ROCm
- 1.4 HIP Framework
- 1.5 What This Book Covers
- 2. HIP Language
- 2.1 Introduction
- 2.2 "Hello World" in HIP
- 2.3 Compilation and Execution
- 2.4 HIP Runtime API
- 2.5 HIP Kernel Launch
- 2.6 HIP Program Structure
- 2.7 Error and Correctness Checking
- 2.8 Organizing Threads
- 2.9 vector_add in HIP
- 2.10 Conclusion
- 3. AMD GPU Internals
- 3.1 AMD GPUs
- 3.2 Overall Architecture
- 3.3 Command Processor and the DMA Engine
- 3.4 Workgroup Dispatching
- 3.5 Sequencer
- 3.6 SIMD Unit
- 3.7 Thread Divergence
- 3.8 Memory Coalescing
- 3.9 Memory Hierarchy
- 3.10 Conclusion
- 4. HIP Tools for Performance Analysis & Debug
- 4.1 ROCm Profiler
- 4.2 ROCm Debugger
- 4.3 ROCm SMI
- 4.4 Conclusion
- 5. HIP Programming Patterns
- 5.1 Highly Parallel Workload - Image Gamma Correction
- 5.2 Multidimensional Kernels - Stencil
- 5.3 Fixed-Sized Kernels - Image Gamma Correction
- 5.4 Reduce - Array Sum
- 5.5 Tiling & Reuse - Matrix Multiplication
- 5.6 Tiling & Coalescing: Matrix Transpose
- 5.7 Kernel-Level Synchronization: BFS
- 5.8 Conclusion
- 6. HIP Streams
- 6.1 Basics of Streams
- 6.2 Important Stream-Based APIs
- 6.3 Execution of HIP Streams on GPU Hardware
- 6.4 Default and Non-Default Streams
- 6.5 Concurrent Kernels
- 6.6 Overlapping Computation and Communication
- 6.7 Conclusion
- 7. ROCm Libraries
- 7.1 rocBLAS
- 7.1.1 Using rocBLAS
- 7.1.2 rocBLAS functions
- 7.1.3 Asynchronous execution
- 7.1.4 rocBLAS on MI100
- 7.1.5 Porting from the legacy BLAS library
- 7.2 rocSPARSE
- 7.2.1 Sparse data representation
- 7.2.2 rocSPARSE functions
- 7.3 rocFFT
- 7.3.1 rocFFT workflow
- 7.3.2 FFT execution plan
- 7.4 rocRAND
- 7.5 Conclusion
- 8. Porting CUDA Programs to HIP
- 8.1 Hipify Tools
- 8.1.1 Hipify-clang
- 8.1.2 Hipify-perl
- 8.2 General Hipifying Guidelines
- 8.3 Hipification of Matrix-Transpose
- 8.4 Common Pitfalls and Solutions
- 8.5 Conclusion
- 9. Multi-GPU Programming
- 9.1 HIP Device APIs
- 9.2 Stream-Based Multi-GPU Programming
- 9.3 Thread-Based Multi-GPU Programming
- 9.4 MPI-Based Multi-GPU Programming
- 9.5 GPU-GPU Communication
- 9.6 RCCL
- 9.6.1 Broadcast
- 9.6.2 AllReduce
- 9.7 Conclusion
- 10. ROCm in Datacenters
- 10.1 Containerized ROCm
- 10.2 Managing ROCm Containers with Kubernetes
- 10.3 Managing ROCm Nodes with SLURM
- 10.3.1 SLURM interactive mode
- 10.3.2 SLURM batch submission mode
- 10.4 Conclusion
- 11. Third-Party Tools
- 11.1 PAPI
- 11.1.1 Introduction
- 11.1.2 PAPI utilities and tests
- 11.1.3 PAPI support for AMD GPUs
- 11.1.4 Preset events and Counter Analysis Toolkit (CAT)
- 11.2 Score-P and Vampir
- 11.2.1 Overview
- 11.2.2 Tracing with Score-P
- 11.2.3 Score-P usage
- 11.2.4 Profiling the Quicksilver application
- 11.2.5 Summary
- 11.3 Trace Compass and Theia
- 11.4 TAU
- 11.4.1 Profiling HIP programs with TAU
- 11.4.2 Tracing HIP programs with TAU
- 11.4.3 Using APEX to measure HIP programs
- 11.4.4 Summary of TAU
- 11.5 TotalView Debugger
- 11.6 HPCToolkit
- 11.6.1 HPCToolkit's workflow
- 11.6.2 Analyzing PIConGPU with HPCToolkit
- 11.6.3 Collecting and analyzing profiles and traces
- 11.6.4 Measurement using hardware counters
- 11.7 E4S - Extreme Scale Scientific Software Stack
- 11.7.1 E4S release
- A. CDNA Assembly
- A.1 Using CDNA Assembly Code
- A.1.1 Retrieve HIP kernel binary
- A.1.2 Disassembling a CDNA binary
- A.2 CDNA Registers
- A.3 Instruction Types
- A.4 Memory Access Instructions
- A.5 Example: Shifted Copy
- A.6 Example: Branching
- A.7 Conclusion
- B. ML with ROCm
- B.1 PyTorch on ROCm
- B.2 TensorFlow on ROCm
- B.3 Conclusion
- Bibliography
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePub works well for novels and non-fiction books, i.e., "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.
For more information, see our ebook Help page.