Foreword xvii
Preface xix
Introduction xxi
Chapter 1: Heterogeneous Parallel Computing with CUDA 1
Parallel Computing 2
Sequential and Parallel Programming 3
Parallelism 4
Computer Architecture 6
Heterogeneous Computing 8
Heterogeneous Architecture 9
Paradigm of Heterogeneous Computing 12
CUDA: A Platform for Heterogeneous Computing 14
Hello World from GPU 17
Is CUDA C Programming Difficult? 20
Summary 21
Chapter 2: CUDA Programming Model 23
Introducing the CUDA Programming Model 23
CUDA Programming Structure 25
Managing Memory 26
Organizing Threads 30
Launching a CUDA Kernel 36
Writing Your Kernel 37
Verifying Your Kernel 39
Handling Errors 40
Compiling and Executing 40
Timing Your Kernel 43
Timing with CPU Timer 44
Timing with nvprof 47
Organizing Parallel Threads 49
Indexing Matrices with Blocks and Threads 49
Summing Matrices with a 2D Grid and 2D Blocks 53
Summing Matrices with a 1D Grid and 1D Blocks 57
Summing Matrices with a 2D Grid and 1D Blocks 58
Managing Devices 60
Using the Runtime API to Query GPU Information 61
Determining the Best GPU 63
Using nvidia-smi to Query GPU Information 63
Setting Devices at Runtime 64
Summary 65
Chapter 3: CUDA Execution Model 67
Introducing the CUDA Execution Model 67
GPU Architecture Overview 68
The Fermi Architecture 71
The Kepler Architecture 73
Profile-Driven Optimization 78
Understanding the Nature of Warp Execution 80
Warps and Thread Blocks 80
Warp Divergence 82
Resource Partitioning 87
Latency Hiding 90
Occupancy 93
Synchronization 97
Scalability 98
Exposing Parallelism 98
Checking Active Warps with nvprof 100
Checking Memory Operations with nvprof 100
Exposing More Parallelism 101
Avoiding Branch Divergence 104
The Parallel Reduction Problem 104
Divergence in Parallel Reduction 106
Improving Divergence in Parallel Reduction 110
Reducing with Interleaved Pairs 112
Unrolling Loops 114
Reducing with Unrolling 115
Reducing with Unrolled Warps 117
Reducing with Complete Unrolling 119
Reducing with Template Functions 120
Dynamic Parallelism 122
Nested Execution 123
Nested Hello World on the GPU 124
Nested Reduction 128
Summary 132
Chapter 4: Global Memory 135
Introducing the CUDA Memory Model 136
Benefits of a Memory Hierarchy 136
CUDA Memory Model 137
Memory Management 145
Memory Allocation and Deallocation 146
Memory Transfer 146
Pinned Memory 148
Zero-Copy Memory 150
Unified Virtual Addressing 156
Unified Memory 157
Memory Access Patterns 158
Aligned and Coalesced Access 158
Global Memory Reads 160
Global Memory Writes 169
Array of Structures versus Structure of Arrays 171
Performance Tuning 176
What Bandwidth Can a Kernel Achieve? 179
Memory Bandwidth 179
Matrix Transpose Problem 180
Matrix Addition with Unified Memory 195
Summary 199
Chapter 5: Shared Memory and Constant Memory 203
Introducing CUDA Shared Memory 204
Shared Memory 204
Shared Memory Allocation 206
Shared Memory Banks and Access Mode 206
Configuring the Amount of Shared Memory 212
Synchronization 214
Checking the Data Layout of Shared Memory 216
Square Shared Memory 217
Rectangular Shared Memory 225
Reducing Global Memory Access 232
Parallel Reduction with Shared Memory 232
Parallel Reduction with Unrolling 236
Parallel Reduction with Dynamic Shared Memory 238
Effective Bandwidth 239
Coalescing Global Memory Accesses 239
Baseline Transpose Kernel 240
Matrix Transpose with Shared Memory 241
Matrix Transpose with Padded Shared Memory 245
Matrix Transpose with Unrolling 246
Exposing More Parallelism 249
Constant Memory 250
Implementing a 1D Stencil with Constant Memory 250
Comparing with the Read-Only Cache 253
The Warp Shuffle Instruction 255
Variants of the Warp Shuffle Instruction 256
Sharing Data within a Warp 258
Parallel Reduction Using the Warp Shuffle Instruction 262
Summary 264
Chapter 6: Streams and Concurrency 267
Introducing Streams and Events 268
CUDA Streams 269
Stream Scheduling 271
Stream Priorities 273
CUDA Events 273
Stream Synchronization 275
Concurrent Kernel Execution 279
Concurrent Kernels in Non-NULL Streams 279
False Dependencies on Fermi GPUs 281
Dispatching Operations with OpenMP 283
Adjusting Stream Behavior Using Environment Variables 284
Concurrency-Limiting GPU Resources 286
Blocking Behavior of the Default Stream 287
Creating Inter-Stream Dependencies 288
Overlapping Kernel Execution and Data Transfer 289
Overlap Using Depth-First Scheduling 289
Overlap Using Breadth-First Scheduling 293
Overlapping GPU and CPU Execution 294
Stream Callbacks 295
Summary 297
Chapter 7: Tuning Instruction-Level Primitives 299
Introducing CUDA Instructions 300
Floating-Point Instructions 301
Intrinsic and Standard Functions 303
Atomic Instructions 304
Optimizing Instructions for Your Application 306
Single-Precision vs. Double-Precision 306
Standard vs. Intrinsic Functions 309
Understanding Atomic Instructions 315
Bringing It All Together 322
Summary 324
Chapter 8: GPU-Accelerated CUDA Libraries and OpenACC 327
Introducing the CUDA Libraries 328
Supported Domains for CUDA Libraries 329
A Common Library Workflow 330
The cuSPARSE Library 332
cuSPARSE Data Storage Formats 333
Formatting Conversion with cuSPARSE 337
Demonstrating cuSPARSE 338
Important Topics in cuSPARSE Development 340
cuSPARSE Summary 341
The cuBLAS Library 341
Managing cuBLAS Data 342
Demonstrating cuBLAS 343
Important Topics in cuBLAS Development 345
cuBLAS Summary 346
The cuFFT Library 346
Using the cuFFT API 347
Demonstrating cuFFT 348
cuFFT Summary 349
The cuRAND Library 349
Choosing Pseudo- or Quasi-Random Numbers 349
Overview of the cuRAND Library 350
Demonstrating cuRAND 354
Important Topics in cuRAND Development 357
CUDA Library Features Introduced in CUDA 6 358
Drop-In CUDA Libraries 358
Multi-GPU Libraries 359
A Survey of CUDA Library Performance 361
cuSPARSE versus MKL 361
cuBLAS versus MKL BLAS 362
cuFFT versus FFTW versus MKL 363
CUDA Library Performance Summary 364
Using OpenACC 365
Using OpenACC Compute Directives 367
Using OpenACC Data Directives 375
The OpenACC Runtime API 380
Combining OpenACC and the CUDA Libraries 382
Summary of OpenACC 384
Summary 384
Chapter 9: Multi-GPU Programming 387
Moving to Multiple GPUs 388
Executing on Multiple GPUs 389
Peer-to-Peer Communication 391
Synchronizing across Multi-GPUs 392
Subdividing Computation across Multiple GPUs 393
Allocating Memory on Multiple Devices 393
Distributing Work from a Single Host Thread 394
Compiling and Executing 395
Peer-to-Peer Communication on Multiple GPUs 396
Enabling Peer-to-Peer Access 396
Peer-to-Peer Memory Copy 396
Peer-to-Peer Memory Access with Unified Virtual Addressing 398
Finite Difference on Multi-GPU 400
Stencil Calculation for 2D Wave Equation 400
Typical Patterns for Multi-GPU Programs 401
2D Stencil Computation with Multiple GPUs 403
Overlapping Computation and Communication 405
Compiling and Executing 406
Scaling Applications across GPU Clusters 409
CPU-to-CPU Data Transfer 410
GPU-to-GPU Data Transfer Using Traditional MPI 413
GPU-to-GPU Data Transfer with CUDA-aware MPI 416
Intra-Node GPU-to-GPU Data Transfer with CUDA-Aware MPI 417
Adjusting Message Chunk Size 418
GPU-to-GPU Data Transfer with GPUDirect RDMA 419
Summary 422
Chapter 10: Implementation Considerations 425
The CUDA C Development Process 426
APOD Development Cycle 426
Optimization Opportunities 429
CUDA Code Compilation 432
CUDA Error Handling 437
Profile-Driven Optimization 438
Finding Optimization Opportunities Using nvprof 439
Guiding Optimization Using nvvp 443
NVIDIA Tools Extension 446
CUDA Debugging 448
Kernel Debugging 448
Memory Debugging 456
Debugging Summary 462
A Case Study in Porting C Programs to CUDA C 462
Assessing crypt 463
Parallelizing crypt 464
Optimizing crypt 465
Deploying crypt 472
Summary of Porting crypt 475
Summary 476
Appendix: Suggested Readings 477
Index 481
What's in this chapter?
Code Download
The wrox.com code downloads for this chapter are found at www.wrox.com/go/procudac on the Download Code tab. The code is in the Chapter 1 download and individually named according to the names used throughout the chapter.
The high-performance computing (HPC) landscape is always changing as new technologies and processes become commonplace, and the definition of HPC changes accordingly. In general, it pertains to the use of multiple processors or computers to accomplish a complex task concurrently with high throughput and efficiency. It is common to consider HPC as not only a computing architecture but also as a set of elements, including hardware systems, software tools, programming platforms, and parallel programming paradigms.
Over the last decade, high-performance computing has evolved significantly, particularly because of the emergence of GPU-CPU heterogeneous architectures, which have led to a fundamental paradigm shift in parallel programming. This chapter begins your understanding of heterogeneous parallel programming.
During the past several decades, there has been ever-increasing interest in parallel computation. The primary goal of parallel computing is to improve the speed of computation.
From a pure calculation perspective, parallel computing can be defined as a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently.
From the programmer's perspective, a natural question is how to map the concurrent calculations onto computers. Suppose you have multiple computing resources. Parallel computing can then be defined as the simultaneous use of multiple computing resources (cores or computers) to perform the concurrent calculations. A large problem is broken down into smaller ones, and each smaller one is then solved concurrently on different computing resources. The software and hardware aspects of parallel computing are closely intertwined. In fact, parallel computing usually involves two distinct areas of computing technologies:
Computer architecture (hardware aspect)
Parallel programming (software aspect)
Computer architecture focuses on supporting parallelism at an architectural level, while parallel programming focuses on solving a problem concurrently by fully using the computational power of the computer architecture. In order to achieve parallel execution in software, the hardware must provide a platform that supports concurrent execution of multiple processes or multiple threads.
Most modern processors implement the Harvard architecture, as shown in Figure 1.1, which comprises three main components:
Memory (instruction memory and data memory)
Central processing unit (control unit and arithmetic logic unit)
Input/Output interfaces
Figure 1.1
The key component in high-performance computing is the central processing unit (CPU), usually called the core. In the early days of the computer, there was only one core on a chip. This architecture is referred to as a uniprocessor. Nowadays, the trend in chip design is to integrate multiple cores onto a single processor, usually termed multicore, to support parallelism at the architecture level. Therefore, programming can be viewed as the process of mapping the computation of a problem to available cores such that parallel execution is obtained.
When implementing a sequential algorithm, you may not need to understand the details of the computer architecture to write a correct program. However, when implementing algorithms for multicore machines, it is much more important for programmers to be aware of the characteristics of the underlying computer architecture. Writing both correct and efficient parallel programs requires a fundamental knowledge of multicore architectures.
The following sections cover some basic concepts of parallel computing and how these concepts relate to CUDA programming.
When solving a problem with a computer program, it is natural to divide the problem into a discrete series of calculations; each calculation performs a specified task, as shown in Figure 1.2. Such a program is called a sequential program.
Figure 1.2
There are two ways to classify the relationship between two pieces of computation: Some are related by a precedence restraint and therefore must be calculated sequentially; others have no such restraints and can be calculated concurrently. Any program containing tasks that are performed concurrently is a parallel program. As shown in Figure 1.3, a parallel program may, and most likely will, have some sequential parts.
Figure 1.3
From a programmer's perspective, a program consists of two basic ingredients: instructions and data. When a computational problem is broken down into many small pieces of computation, each piece is called a task. In a task, individual instructions consume inputs, apply a function, and produce outputs. A data dependency occurs when an instruction consumes data produced by a preceding instruction. Therefore, you can classify the relationship between any two tasks as either dependent, if one consumes the output of the other, or independent.
Analyzing data dependencies is a fundamental skill in implementing parallel algorithms because dependencies are one of the primary inhibitors to parallelism, and understanding them is necessary to obtain application speedup in the modern programming world. In most cases, multiple independent chains of dependent tasks offer the best opportunity for parallelization.
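To make the distinction concrete, consider a minimal C sketch (the array names and sizes here are illustrative, not taken from the book's listings): the first loop contains only independent tasks, while the second contains a chain of dependent ones.

#include <stdio.h>

#define N 8

int main(void) {
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

    /* Independent tasks: each c[i] depends only on a[i] and b[i],
       so every iteration could safely be computed concurrently. */
    for (int i = 0; i < N; i++) c[i] = a[i] + b[i];

    /* Dependent tasks: each c[i] consumes c[i - 1], the output of the
       preceding iteration, so these iterations must run in order. */
    for (int i = 1; i < N; i++) c[i] = c[i - 1] + a[i];

    printf("c[%d] = %f\n", N - 1, c[N - 1]);
    return 0;
}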
Nowadays, parallelism is becoming ubiquitous, and parallel programming is becoming mainstream in the programming world. Parallelism at multiple levels is the driving force of architecture design. There are two fundamental types of parallelism in applications:
Task parallelism arises when there are many tasks or functions that can be operated independently and largely in parallel. Task parallelism focuses on distributing functions across multiple cores.
Data parallelism arises when there are many data items that can be operated on at the same time. Data parallelism focuses on distributing the data across multiple cores.
CUDA programming is especially well-suited to address problems that can be expressed as data-parallel computations. The major focus of this book is how to solve a data-parallel problem with CUDA programming. Many applications that process large data sets can use a data-parallel model to speed up the computations. Data-parallel processing maps data elements to parallel threads.
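As a minimal sketch of this mapping (the kernel below is illustrative and assumes device arrays d_a, d_b, and d_c have already been allocated and filled; kernel launches and memory management are introduced in Chapter 2), a data-parallel vector addition assigns exactly one array element to each thread:

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    /* Each thread derives a unique global index and handles one element. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

/* Illustrative launch: enough 256-thread blocks to cover all n elements. */
/* vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n); */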
The first step in designing a data parallel program is to partition data across threads, with each thread working on a portion of the data. In general, there are two approaches to partitioning data: block partitioning and cyclic partitioning. In block partitioning, many consecutive elements of data are chunked together. Each chunk is assigned to a single thread in any order, and threads generally process only one chunk at a time. In cyclic partitioning, fewer data elements are chunked together. Neighboring threads receive neighboring chunks, and each thread can handle more than one chunk. Selecting a new chunk for a thread to process implies jumping ahead as many chunks as there are threads.
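The two schemes differ only in the index arithmetic each thread performs. The following CUDA sketch is illustrative rather than a listing from the book; it uses a chunk size of one element for the cyclic case.

/* Block partitioning: the n elements are split into one contiguous chunk
   per thread; thread t owns elements [t * chunk, t * chunk + chunk). */
__global__ void blockPartition(float *data, int n) {
    int nthreads = gridDim.x * blockDim.x;
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int chunk = (n + nthreads - 1) / nthreads;   /* elements per thread */
    int start = t * chunk;
    int end = min(start + chunk, n);
    for (int i = start; i < end; i++) data[i] *= 2.0f;
}

/* Cyclic partitioning: thread t starts at element t and jumps ahead by the
   total thread count, so neighboring threads touch neighboring elements
   and each thread can end up processing more than one element. */
__global__ void cyclicPartition(float *data, int n) {
    int nthreads = gridDim.x * blockDim.x;
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = t; i < n; i += nthreads) data[i] *= 2.0f;
}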
Figure 1.4 shows two simple examples of 1D data partitioning. In the block partition, each thread takes only one portion of the data to process, and in the cyclic partition, each thread takes more than one portion of the data to process. Figure 1.5 shows three simple examples of 2D data partitioning: block partitioning along the y dimension, block partitioning on both dimensions, and cyclic partitioning along the x dimension. The remaining patterns — block partitioning along the x dimension, cyclic partitioning on both dimensions, and cyclic partitioning along the y dimension — are left as an exercise.
Figure 1.4
Figure 1.5
Usually, data is stored one-dimensionally. Even when a logical multi-dimensional view of data is used, it still maps to one-dimensional physical storage. Determining how to distribute data among threads is closely related both to how that data is stored physically and to how the execution of each thread is ordered. The way you organize threads has a significant effect on the program's performance.
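For example, a logical two-dimensional array of size nx × ny stored in row-major order (the C convention) maps a coordinate (ix, iy) to a single linear offset; the function name below is illustrative:

/* Row-major mapping of a logical 2D index (ix, iy) to the 1D offset used by
   physical storage; ix varies fastest, so consecutive ix values land in
   consecutive memory locations. */
int linearIndex(int ix, int iy, int nx) {
    return iy * nx + ix;
}

Chapter 2 builds on this kind of mapping when it indexes matrices with blocks and threads.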
There are two basic approaches to partitioning data:
Block: Each thread takes one portion of the data, usually an equal portion of the data.
Cyclic: Each thread takes more than one portion of the data.
The performance of a program is...