Foreword xvii
Preface xix
Introduction xxi
Chapter 1: Heterogeneous Parallel Computing with CUDA 1
Parallel Computing 2
Sequential and Parallel Programming 3
Parallelism 4
Computer Architecture 6
Heterogeneous Computing 8
Heterogeneous Architecture 9
Paradigm of Heterogeneous Computing 12
CUDA: A Platform for Heterogeneous Computing 14
Hello World from GPU 17
Is CUDA C Programming Difficult? 20
Summary 21
Chapter 2: CUDA Programming Model 23
Introducing the CUDA Programming Model 23
CUDA Programming Structure 25
Managing Memory 26
Organizing Threads 30
Launching a CUDA Kernel 36
Writing Your Kernel 37
Verifying Your Kernel 39
Handling Errors 40
Compiling and Executing 40
Timing Your Kernel 43
Timing with CPU Timer 44
Timing with nvprof 47
Organizing Parallel Threads 49
Indexing Matrices with Blocks and Threads 49
Summing Matrices with a 2D Grid and 2D Blocks 53
Summing Matrices with a 1D Grid and 1D Blocks 57
Summing Matrices with a 2D Grid and 1D Blocks 58
Managing Devices 60
Using the Runtime API to Query GPU Information 61
Determining the Best GPU 63
Using nvidia-smi to Query GPU Information 63
Setting Devices at Runtime 64
Summary 65
Chapter 3: CUDA Execution Model 67
Introducing the CUDA Execution Model 67
GPU Architecture Overview 68
The Fermi Architecture 71
The Kepler Architecture 73
Profile-Driven Optimization 78
Understanding the Nature of Warp Execution 80
Warps and Thread Blocks 80
Warp Divergence 82
Resource Partitioning 87
Latency Hiding 90
Occupancy 93
Synchronization 97
Scalability 98
Exposing Parallelism 98
Checking Active Warps with nvprof 100
Checking Memory Operations with nvprof 100
Exposing More Parallelism 101
Avoiding Branch Divergence 104
The Parallel Reduction Problem 104
Divergence in Parallel Reduction 106
Improving Divergence in Parallel Reduction 110
Reducing with Interleaved Pairs 112
Unrolling Loops 114
Reducing with Unrolling 115
Reducing with Unrolled Warps 117
Reducing with Complete Unrolling 119
Reducing with Template Functions 120
Dynamic Parallelism 122
Nested Execution 123
Nested Hello World on the GPU 124
Nested Reduction 128
Summary 132
Chapter 4: Global Memory 135
Introducing the CUDA Memory Model 136
Benefits of a Memory Hierarchy 136
CUDA Memory Model 137
Memory Management 145
Memory Allocation and Deallocation 146
Memory Transfer 146
Pinned Memory 148
Zero-Copy Memory 150
Unified Virtual Addressing 156
Unified Memory 157
Memory Access Patterns 158
Aligned and Coalesced Access 158
Global Memory Reads 160
Global Memory Writes 169
Array of Structures versus Structure of Arrays 171
Performance Tuning 176
What Bandwidth Can a Kernel Achieve? 179
Memory Bandwidth 179
Matrix Transpose Problem 180
Matrix Addition with Unified Memory 195
Summary 199
Chapter 5: Shared Memory and Constant Memory 203
Introducing CUDA Shared Memory 204
Shared Memory 204
Shared Memory Allocation 206
Shared Memory Banks and Access Mode 206
Configuring the Amount of Shared Memory 212
Synchronization 214
Checking the Data Layout of Shared Memory 216
Square Shared Memory 217
Rectangular Shared Memory 225
Reducing Global Memory Access 232
Parallel Reduction with Shared Memory 232
Parallel Reduction with Unrolling 236
Parallel Reduction with Dynamic Shared Memory 238
Effective Bandwidth 239
Coalescing Global Memory Accesses 239
Baseline Transpose Kernel 240
Matrix Transpose with Shared Memory 241
Matrix Transpose with Padded Shared Memory 245
Matrix Transpose with Unrolling 246
Exposing More Parallelism 249
Constant Memory 250
Implementing a 1D Stencil with Constant Memory 250
Comparing with the Read-Only Cache 253
The Warp Shuffle Instruction 255
Variants of the Warp Shuffle Instruction 256
Sharing Data within a Warp 258
Parallel Reduction Using the Warp Shuffle Instruction 262
Summary 264
Chapter 6: Streams and Concurrency 267
Introducing Streams and Events 268
CUDA Streams 269
Stream Scheduling 271
Stream Priorities 273
CUDA Events 273
Stream Synchronization 275
Concurrent Kernel Execution 279
Concurrent Kernels in Non-NULL Streams 279
False Dependencies on Fermi GPUs 281
Dispatching Operations with OpenMP 283
Adjusting Stream Behavior Using Environment Variables 284
Concurrency-Limiting GPU Resources 286
Blocking Behavior of the Default Stream 287
Creating Inter-Stream Dependencies 288
Overlapping Kernel Execution and Data Transfer 289
Overlap Using Depth-First Scheduling 289
Overlap Using Breadth-First Scheduling 293
Overlapping GPU and CPU Execution 294
Stream Callbacks 295
Summary 297
Chapter 7: Tuning Instruction-Level Primitives 299
Introducing CUDA Instructions 300
Floating-Point Instructions 301
Intrinsic and Standard Functions 303
Atomic Instructions 304
Optimizing Instructions for Your Application 306
Single-Precision vs. Double-Precision 306
Standard vs. Intrinsic Functions 309
Understanding Atomic Instructions 315
Bringing It All Together 322
Summary 324
Chapter 8: GPU-Accelerated CUDA Libraries and OpenACC 327
Introducing the CUDA Libraries 328
Supported Domains for CUDA Libraries 329
A Common Library Workflow 330
The cuSPARSE Library 332
cuSPARSE Data Storage Formats 333
Formatting Conversion with cuSPARSE 337
Demonstrating cuSPARSE 338
Important Topics in cuSPARSE Development 340
cuSPARSE Summary 341
The cuBLAS Library 341
Managing cuBLAS Data 342
Demonstrating cuBLAS 343
Important Topics in cuBLAS Development 345
cuBLAS Summary 346
The cuFFT Library 346
Using the cuFFT API 347
Demonstrating cuFFT 348
cuFFT Summary 349
The cuRAND Library 349
Choosing Pseudo- or Quasi-Random Numbers 349
Overview of the cuRAND Library 350
Demonstrating cuRAND 354
Important Topics in cuRAND Development 357
CUDA Library Features Introduced in CUDA 6 358
Drop-In CUDA Libraries 358
Multi-GPU Libraries 359
A Survey of CUDA Library Performance 361
cuSPARSE versus MKL 361
cuBLAS versus MKL BLAS 362
cuFFT versus FFTW versus MKL 363
CUDA Library Performance Summary 364
Using OpenACC 365
Using OpenACC Compute Directives 367
Using OpenACC Data Directives 375
The OpenACC Runtime API 380
Combining OpenACC and the CUDA Libraries 382
Summary of OpenACC 384
Summary 384
Chapter 9: Multi-GPU Programming 387
Moving to Multiple GPUs 388
Executing on Multiple GPUs 389
Peer-to-Peer Communication 391
Synchronizing across Multi-GPUs 392
Subdividing Computation across Multiple GPUs 393
Allocating Memory on Multiple Devices 393
Distributing Work from a Single Host Thread 394
Compiling and Executing 395
Peer-to-Peer Communication on Multiple GPUs 396
Enabling Peer-to-Peer Access 396
Peer-to-Peer Memory Copy 396
Peer-to-Peer Memory Access with Unified Virtual Addressing 398
Finite Difference on Multi-GPU 400
Stencil Calculation for 2D Wave Equation 400
Typical Patterns for Multi-GPU Programs 401
2D Stencil Computation with Multiple GPUs 403
Overlapping Computation and Communication 405
Compiling and Executing 406
Scaling Applications across GPU Clusters 409
CPU-to-CPU Data Transfer 410
GPU-to-GPU Data Transfer Using Traditional MPI 413
GPU-to-GPU Data Transfer with CUDA-aware MPI 416
Intra-Node GPU-to-GPU Data Transfer with CUDA-Aware MPI 417
Adjusting Message Chunk Size 418
GPU-to-GPU Data Transfer with GPUDirect RDMA 419
Summary 422
Chapter 10: Implementation Considerations 425
The CUDA C Development Process 426
APOD Development Cycle 426
Optimization Opportunities 429
CUDA Code Compilation 432
CUDA Error Handling 437
Profile-Driven Optimization 438
Finding Optimization Opportunities Using nvprof 439
Guiding Optimization Using nvvp 443
NVIDIA Tools Extension 446
CUDA Debugging 448
Kernel Debugging 448
Memory Debugging 456
Debugging Summary 462
A Case Study in Porting C Programs to CUDA C 462
Assessing crypt 463
Parallelizing crypt 464
Optimizing crypt 465
Deploying crypt 472
Summary of Porting crypt 475
Summary 476
Appendix: Suggested Readings 477
Index 481
What's in this chapter?
Code Download
The wrox.com code downloads for this chapter are found at www.wrox.com/go/procudac on the Download Code tab. The code is in the Chapter 1 download and individually named according to the names used throughout the chapter.
The high-performance computing (HPC) landscape is always changing as new technologies and processes become commonplace, and the definition of HPC changes accordingly. In general, it pertains to the use of multiple processors or computers to accomplish a complex task concurrently with high throughput and efficiency. It is common to consider HPC as not only a computing architecture but also as a set of elements, including hardware systems, software tools, programming platforms, and parallel programming paradigms.
Over the last decade, high-performance computing has evolved significantly, particularly because of the emergence of GPU-CPU heterogeneous architectures, which have led to a fundamental paradigm shift in parallel programming. This chapter begins your understanding of heterogeneous parallel programming.
During the past several decades, there has been ever-increasing interest in parallel computation. The primary goal of parallel computing is to improve the speed of computation.
From a pure calculation perspective, parallel computing can be defined as a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently.
From the programmer's perspective, a natural question is how to map the concurrent calculations onto computers. Suppose you have multiple computing resources. Parallel computing can then be defined as the simultaneous use of multiple computing resources (cores or computers) to perform the concurrent calculations. A large problem is broken down into smaller ones, and each smaller one is then solved concurrently on different computing resources. The software and hardware aspects of parallel computing are closely intertwined. In fact, parallel computing usually involves two distinct areas of computing technologies:
Computer architecture (hardware aspect)
Parallel programming (software aspect)
Computer architecture focuses on supporting parallelism at an architectural level, while parallel programming focuses on solving a problem concurrently by fully using the computational power of the computer architecture. In order to achieve parallel execution in software, the hardware must provide a platform that supports concurrent execution of multiple processes or multiple threads.
Most modern processors implement the Harvard architecture, as shown in Figure 1.1, which comprises three main components:
Memory (instruction memory and data memory)
Central processing unit (control unit and arithmetic logic unit)
Input/Output interfaces
Figure 1.1
The key component in high-performance computing is the central processing unit (CPU), usually called the core. In the early days of the computer, there was only one core on a chip. This architecture is referred to as a uniprocessor. Nowadays, the trend in chip design is to integrate multiple cores onto a single processor, usually termed multicore, to support parallelism at the architecture level. Therefore, programming can be viewed as the process of mapping the computation of a problem to available cores such that parallel execution is obtained.
When implementing a sequential algorithm, you may not need to understand the details of the computer architecture to write a correct program. However, when implementing algorithms for multicore machines, it is much more important for programmers to be aware of the characteristics of the underlying computer architecture. Writing both correct and efficient parallel programs requires a fundamental knowledge of multicore architectures.
The following sections cover some basic concepts of parallel computing and how these concepts relate to CUDA programming.
When solving a problem with a computer program, it is natural to divide the problem into a discrete series of calculations; each calculation performs a specified task, as shown in Figure 1.2. Such a program is called a sequential program.
Figure 1.2
There are two ways to classify the relationship between two pieces of computation: Some are related by a precedence restraint and therefore must be calculated sequentially; others have no such restraints and can be calculated concurrently. Any program containing tasks that are performed concurrently is a parallel program. As shown in Figure 1.3, a parallel program may, and most likely will, have some sequential parts.
Figure 1.3
From a programmer's perspective, a program consists of two basic ingredients: instructions and data. When a computational problem is broken down into many small pieces of computation, each piece is called a task. In a task, individual instructions consume inputs, apply a function, and produce outputs. A data dependency occurs when an instruction consumes data produced by a preceding instruction. Therefore, you can classify the relationship between any two tasks as either dependent, if one consumes the output of the other, or independent.
Analyzing data dependencies is a fundamental skill in implementing parallel algorithms because dependencies are one of the primary inhibitors to parallelism, and understanding them is necessary to obtain application speedup in the modern programming world. In most cases, multiple independent chains of dependent tasks offer the best opportunity for parallelization.
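To make the distinction concrete, consider a minimal C sketch (the array names and sizes here are illustrative, not taken from the book's listings): the first loop contains only independent tasks, while the second contains a chain of dependent ones.

#include <stdio.h>

#define N 8

int main(void) {
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

    /* Independent tasks: each c[i] depends only on a[i] and b[i],
       so every iteration could safely be computed concurrently. */
    for (int i = 0; i < N; i++) c[i] = a[i] + b[i];

    /* Dependent tasks: each c[i] consumes c[i - 1], the output of the
       preceding iteration, so these iterations must run in order. */
    for (int i = 1; i < N; i++) c[i] = c[i - 1] + a[i];

    printf("c[%d] = %f\n", N - 1, c[N - 1]);
    return 0;
}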
Nowadays, parallelism is becoming ubiquitous, and parallel programming is becoming mainstream in the programming world. Parallelism at multiple levels is the driving force of architecture design. There are two fundamental types of parallelism in applications:
Task parallelism arises when there are many tasks or functions that can be operated independently and largely in parallel. Task parallelism focuses on distributing functions across multiple cores.
Data parallelism arises when there are many data items that can be operated on at the same time. Data parallelism focuses on distributing the data across multiple cores.
CUDA programming is especially well-suited to address problems that can be expressed as data-parallel computations. The major focus of this book is how to solve a data-parallel problem with CUDA programming. Many applications that process large data sets can use a data-parallel model to speed up the computations. Data-parallel processing maps data elements to parallel threads.
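As a minimal sketch of this mapping (the kernel below is illustrative and assumes device arrays d_a, d_b, and d_c have already been allocated and filled; kernel launches and memory management are introduced in Chapter 2), a data-parallel vector addition assigns exactly one array element to each thread:

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    /* Each thread derives a unique global index and handles one element. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

/* Illustrative launch: enough 256-thread blocks to cover all n elements. */
/* vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n); */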
The first step in designing a data parallel program is to partition data across threads, with each thread working on a portion of the data. In general, there are two approaches to partitioning data: block partitioning and cyclic partitioning. In block partitioning, many consecutive elements of data are chunked together. Each chunk is assigned to a single thread in any order, and threads generally process only one chunk at a time. In cyclic partitioning, fewer data elements are chunked together. Neighboring threads receive neighboring chunks, and each thread can handle more than one chunk. Selecting a new chunk for a thread to process implies jumping ahead as many chunks as there are threads.
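The two schemes differ only in the index arithmetic each thread performs. The following CUDA sketch is illustrative rather than a listing from the book; it uses a chunk size of one element for the cyclic case.

/* Block partitioning: the n elements are split into one contiguous chunk
   per thread; thread t owns elements [t * chunk, t * chunk + chunk). */
__global__ void blockPartition(float *data, int n) {
    int nthreads = gridDim.x * blockDim.x;
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int chunk = (n + nthreads - 1) / nthreads;   /* elements per thread */
    int start = t * chunk;
    int end = min(start + chunk, n);
    for (int i = start; i < end; i++) data[i] *= 2.0f;
}

/* Cyclic partitioning: thread t starts at element t and jumps ahead by the
   total thread count, so neighboring threads touch neighboring elements
   and each thread can end up processing more than one element. */
__global__ void cyclicPartition(float *data, int n) {
    int nthreads = gridDim.x * blockDim.x;
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = t; i < n; i += nthreads) data[i] *= 2.0f;
}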
Figure 1.4 shows two simple examples of 1D data partitioning. In the block partition, each thread takes only one portion of the data to process, and in the cyclic partition, each thread takes more than one portion of the data to process. Figure 1.5 shows three simple examples of 2D data partitioning: block partitioning along the y dimension, block partitioning on both dimensions, and cyclic partitioning along the x dimension. The remaining patterns — block partitioning along the x dimension, cyclic partitioning on both dimensions, and cyclic partitioning along the y dimension — are left as an exercise.
Figure 1.4
Figure 1.5
Usually, data is stored one-dimensionally. Even when a logical multi-dimensional view of data is used, it still maps to one-dimensional physical storage. Determining how to distribute data among threads is closely related both to how that data is stored physically and to how the execution of each thread is ordered. The way you organize threads has a significant effect on the program's performance.
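For example, a logical two-dimensional array of size nx × ny stored in row-major order (the C convention) maps a coordinate (ix, iy) to a single linear offset; the function name below is illustrative:

/* Row-major mapping of a logical 2D index (ix, iy) to the 1D offset used by
   physical storage; ix varies fastest, so consecutive ix values land in
   consecutive memory locations. */
int linearIndex(int ix, int iy, int nx) {
    return iy * nx + ix;
}

Chapter 2 builds on this kind of mapping when it indexes matrices with blocks and threads.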
There are two basic approaches to partitioning data:
Block: Each thread takes one portion of the data, usually an equal portion of the data.
Cyclic: Each thread takes more than one portion of the data.
The performance of a program is...