Chapter 2
Inside OneDNN: Architecture and Execution Model
What makes OneDNN an indispensable backbone for high-performance deep learning is not just its speed, but the powerful abstractions and engineering that let it squeeze every drop of efficiency from today's CPUs and accelerators. In this chapter, you will uncover the inner machinery of OneDNN: how its layered architecture, intelligent primitives, and dynamic execution strategies allow complex operations to be optimized, fused, and tailored for diverse platforms. With a focus on both practical interfaces and underlying mechanisms, this chapter reveals how OneDNN turns raw hardware into flexible, reliable, and lightning-fast neural network computation.
2.1 OneDNN's Abstraction Layers
OneDNN's architecture is distinguished by a clear separation between conceptual abstractions and their concrete implementations, enabling finely optimized hardware-specific execution without compromising on API uniformity and developer productivity. This section dissects the core abstraction layers of OneDNN, emphasizing the roles and interplay of engines, memory objects, primitives, and the underlying dependency graph that manages execution orchestration.
At the foundation lies the engine abstraction, which encapsulates the hardware backend. An engine represents a device context where computations are performed, such as a CPU or GPU. The design explicitly partitions hardware concerns from algorithmic descriptions. Engines serve as handles to the underlying hardware runtime and enable the library to select appropriate implementations dynamically. The engine abstraction allows end-user code to remain agnostic to the hardware specifics while facilitating maximal performance by exploiting hardware capabilities. For example, a CPU engine is tailored with vectorized instructions and NUMA-awareness, whereas a GPU engine interfaces with OpenCL or SYCL runtimes to harness massively parallel compute units.
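As a concrete illustration, the following minimal sketch (assuming the oneDNN C++ API from dnnl.hpp) creates a CPU engine and an execution stream bound to it; the device index and the query for available engines are incidental details of the example.

#include <oneapi/dnnl/dnnl.hpp>
#include <iostream>

int main() {
    // An engine binds all subsequently created objects (memory, primitives) to a device.
    dnnl::engine cpu_engine(dnnl::engine::kind::cpu, 0);

    // A stream is the execution context in which primitives are submitted.
    dnnl::stream cpu_stream(cpu_engine);

    // Engines of a given kind can be enumerated before construction.
    std::cout << "CPU engines available: "
              << dnnl::engine::get_count(dnnl::engine::kind::cpu) << "\n";
    return 0;
}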
Memory in OneDNN is abstracted through memory objects, which carry both a descriptor and a buffer view. A memory descriptor encapsulates the tensor's shape, data type, and layout formats, providing a metadata contract that defines how data is structured in device memory. Memory views enable reinterpretation or subsetting of the underlying buffer without copying, which is essential for efficient in-place operations and tensor reshaping. Crucially, memory objects decouple data semantics from physical memory representation, allowing the runtime to transparently insert or omit data layout conversions necessary for device-optimized kernels. This separation ensures that end-user code only defines the logical tensor formats and delegates layout adaptation to the backend, simplifying application development and improving portability.
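To make the descriptor/buffer split tangible, the sketch below (oneDNN C++ API; the tensor shapes are arbitrary examples) builds a memory descriptor, binds it to a buffer through a memory object, and derives a zero-copy sub-memory view; format_tag::any illustrates deferring the physical layout choice to the library.

#include <oneapi/dnnl/dnnl.hpp>

int main() {
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);

    // Logical description: a 4-D activation tensor (N, C, H, W), fp32,
    // stored in the plain "nchw" layout.
    memory::dims shape = {8, 64, 56, 56};
    memory::desc md(shape, memory::data_type::f32, memory::format_tag::nchw);

    // The memory object pairs the descriptor with an actual buffer on the engine.
    memory mem(md, eng);

    // A sub-memory descriptor reinterprets part of the same buffer without copying,
    // here the first four images of the batch.
    memory::desc sub_md = md.submemory_desc({4, 64, 56, 56}, {0, 0, 0, 0});

    // format_tag::any defers the physical layout decision to the implementation.
    memory::desc opaque_md(shape, memory::data_type::f32, memory::format_tag::any);

    (void)mem; (void)sub_md; (void)opaque_md;
    return 0;
}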
At the computational core are primitives, atomic units of work analogous to operators in neural network frameworks. Primitives encapsulate well-defined operations such as convolution, pooling, normalization, and element-wise transformations. Each primitive is associated with descriptors that specify operation parameters, including algorithmic variants, stride, padding, or activation modes. The abstraction of primitives as distinct objects serves as an API contract, allowing implementations to optimize each primitive independently for specific hardware targets. Moreover, primitives expose a standardized interface for execution and subsequent results retrieval, enabling their composition into complex computation graphs.
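The following sketch, assuming the oneDNN v3.x C++ API, shows the typical life cycle of a primitive: a primitive descriptor captures the operation parameters (a ReLU eltwise operation is chosen purely for brevity), the executable primitive object is built from it, and execution is requested through a standardized argument map.

#include <oneapi/dnnl/dnnl.hpp>

int main() {
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);
    stream s(eng);

    memory::desc md({1, 32, 28, 28}, memory::data_type::f32,
                    memory::format_tag::nchw);
    memory src(md, eng), dst(md, eng);

    // The primitive descriptor binds the operation parameters (here: ReLU)
    // to a concrete implementation selected for this engine.
    eltwise_forward::primitive_desc relu_pd(
            eng, prop_kind::forward_inference, algorithm::eltwise_relu,
            md, md, /*alpha=*/0.f, /*beta=*/0.f);

    // The primitive is the executable object created from that descriptor.
    eltwise_forward relu(relu_pd);

    // Execution takes a stream plus a map of argument slots to memory objects.
    relu.execute(s, {{DNNL_ARG_SRC, src}, {DNNL_ARG_DST, dst}});
    s.wait();
    return 0;
}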
The interaction between primitives and memory objects creates dependencies that must be carefully managed. OneDNN employs an internal dependency graph to orchestrate execution. This graph models data flow relationships among primitives, accounting for input and output memory objects and their lifecycle. The dependency graph ensures operations are executed in a correct, efficient order, respecting data hazards and synchronization requirements. It also facilitates advanced performance optimizations such as fusion of consecutive primitives, memory re-use, and pipeline parallelism transparently to the user. Through this abstraction, OneDNN maintains a coherent execution model that can adapt dynamically across diverse hardware contexts.
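At the level of the primitive API, these data-flow edges are visible as shared memory objects: the destination of one primitive becomes the source of the next, and submitting both to an in-order stream preserves the required ordering. The sketch below (oneDNN v3.x C++ API; the ReLU-then-tanh chain is an arbitrary example) makes that dependency explicit.

#include <oneapi/dnnl/dnnl.hpp>

int main() {
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);
    stream s(eng);

    memory::desc md({1, 16, 8, 8}, memory::data_type::f32,
                    memory::format_tag::nchw);
    memory a(md, eng), b(md, eng), c(md, eng);

    // Two primitives chained through a shared memory object: the destination
    // of the first is the source of the second, which is exactly the data-flow
    // edge the execution model has to respect.
    eltwise_forward::primitive_desc relu_pd(
            eng, prop_kind::forward_inference, algorithm::eltwise_relu, md, md);
    eltwise_forward relu(relu_pd);

    eltwise_forward::primitive_desc tanh_pd(
            eng, prop_kind::forward_inference, algorithm::eltwise_tanh, md, md);
    eltwise_forward tanh_op(tanh_pd);

    relu.execute(s, {{DNNL_ARG_SRC, a}, {DNNL_ARG_DST, b}});    // a -> b
    tanh_op.execute(s, {{DNNL_ARG_SRC, b}, {DNNL_ARG_DST, c}}); // b -> c
    s.wait(); // the in-order stream guarantees relu finishes before tanh reads b
    return 0;
}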
These layers collectively decouple the what from the how in deep learning computation. End-user code expresses the computation via primitives and supplies logical memory descriptors; the engine-based implementations optimize the execution path and memory layouts autonomously. This separation empowers OneDNN to provide a consistent, portable API while delivering peak performance tailored to each hardware platform. Furthermore, it reduces the complexity for application developers by hiding intricate device-specific details behind clean abstraction boundaries.
Consider the example of a convolution operation. From the application perspective, a convolution primitive is created with a descriptor specifying the kernel size, strides, data types, and algorithmic preferences. Input and output tensors are described by their memory descriptors, indicating logical shape and layout. The engine selects an optimized convolution implementation, such as Winograd or direct convolution, based on hardware properties. Internally, memory objects may be reordered into optimized formats used by the selected kernel. The dependency graph schedules the necessary reorderings and convolution execution in a data-aware sequence, ensuring loop fusion or parallelization where beneficial. Throughout this process, the end-user interacts exclusively with the abstracted API objects, maintaining functional correctness without manual tuning of low-level details.
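A condensed version of this flow, assuming the oneDNN v3.x C++ API, looks roughly as follows; the shapes, the convolution_auto algorithm hint, and the plain nchw user layout are illustrative choices, and the weight handling and final execution are elided for brevity.

#include <oneapi/dnnl/dnnl.hpp>

int main() {
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);
    stream s(eng);

    // Logical shapes: one 3x224x224 image, 64 filters of size 3x3, stride 1, pad 1.
    memory::desc src_md({1, 3, 224, 224}, memory::data_type::f32, memory::format_tag::any);
    memory::desc wei_md({64, 3, 3, 3},    memory::data_type::f32, memory::format_tag::any);
    memory::desc dst_md({1, 64, 224, 224}, memory::data_type::f32, memory::format_tag::any);

    // format_tag::any lets the implementation pick its preferred (possibly blocked) layouts.
    convolution_forward::primitive_desc conv_pd(
            eng, prop_kind::forward_inference, algorithm::convolution_auto,
            src_md, wei_md, dst_md,
            /*strides=*/{1, 1}, /*padding_l=*/{1, 1}, /*padding_r=*/{1, 1});

    // User data lives in a plain layout; reorder only if the kernel expects another one.
    memory::desc user_src_md({1, 3, 224, 224}, memory::data_type::f32,
                             memory::format_tag::nchw);
    memory user_src(user_src_md, eng);
    memory conv_src = user_src;
    if (conv_pd.src_desc() != user_src_md) {
        conv_src = memory(conv_pd.src_desc(), eng);
        reorder(user_src, conv_src).execute(s, user_src, conv_src);
    }
    // ... weights are handled the same way; the convolution itself is then executed:
    // convolution_forward(conv_pd).execute(s, {{DNNL_ARG_SRC, conv_src}, ...});
    s.wait();
    return 0;
}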
OneDNN's abstraction layers provide a rigorous, modular architecture that promotes flexibility, performance portability, and ease of use. Engines define execution contexts tied to hardware; memory objects represent logically structured data with adaptable layouts; primitives model operations as first-class API entities; and the dependency graph enforces execution correctness and optimization transparently. Together, these layered abstractions form a powerful framework where conceptual clarity and hardware heterogeneity coexist harmoniously.
2.2 Supported Operations and Primitive Fusion
OneDNN (formerly known as MKL-DNN or DNNL) provides an extensive collection of highly optimized compute kernels and primitives that form the foundation of many deep learning workloads. These primitives are designed to leverage underlying hardware features, such as vectorization, multi-threading, and specialized instruction sets, ensuring maximal computational efficiency. Understanding the scope and functionality of these primitives is essential for exploiting OneDNN's performance capabilities and for architecting complex neural network computations.
At the core of OneDNN are its convolution primitives, which implement standard, grouped, depthwise, and dilated convolution operations. These are fundamental to convolutional neural networks (CNNs), enabling spatial filtering across multi-dimensional input tensors. OneDNN supports both forward and backward passes for convolutions, including weight gradient and data gradient computation. Optimizations include algorithm selection (direct, Winograd, or GEMM-based convolutions), memory format transformation, and loop unrolling to attain high throughput on CPUs and accelerators. Crucially, these convolution primitives support various data types, such as FP32, BF16, and INT8, enabling precision flexibility.
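Selecting a reduced-precision path is largely a matter of declaring the tensors with the desired data type; the sketch below (oneDNN v3.x C++ API, arbitrary shapes) requests a bf16 convolution and catches the error that primitive creation raises on hardware without suitable kernels.

#include <oneapi/dnnl/dnnl.hpp>
#include <iostream>

int main() {
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);

    // Declaring bf16 tensors asks the library to dispatch bf16-capable kernels.
    memory::desc src_md({1, 64, 56, 56}, memory::data_type::bf16, memory::format_tag::any);
    memory::desc wei_md({64, 64, 3, 3},  memory::data_type::bf16, memory::format_tag::any);
    memory::desc dst_md({1, 64, 56, 56}, memory::data_type::bf16, memory::format_tag::any);

    try {
        convolution_forward::primitive_desc conv_pd(
                eng, prop_kind::forward_inference, algorithm::convolution_direct,
                src_md, wei_md, dst_md, {1, 1}, {1, 1}, {1, 1});
        (void)conv_pd;
    } catch (const dnnl::error &e) {
        // On CPUs without bf16 support, no implementation may be available.
        std::cout << "bf16 convolution unavailable: " << e.what() << "\n";
    }
    return 0;
}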
General Matrix-Matrix Multiplication (GEMM) is another fundamental primitive implemented in OneDNN, serving as a backbone for fully connected layers and for implementing convolution as a matrix multiplication using im2col or other transformations. The GEMM routines are highly optimized, relying on just-in-time generated, architecture-specific kernels that exploit the widest available vector and matrix instructions to maximize floating-point operations per cycle. The GEMM primitive supports batched, strided, and transposed variants, providing flexibility across different model architectures.
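A batched matrix multiplication can be expressed directly through the matmul primitive, as in the sketch below (oneDNN v3.x C++ API; the batch and matrix sizes are arbitrary). Transposed or strided operands are expressed through the memory descriptors rather than through separate flags.

#include <oneapi/dnnl/dnnl.hpp>

int main() {
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);
    stream s(eng);

    // A batched GEMM: 16 independent (128 x 256) * (256 x 64) products.
    memory::desc a_md({16, 128, 256}, memory::data_type::f32, memory::format_tag::abc);
    memory::desc b_md({16, 256, 64},  memory::data_type::f32, memory::format_tag::abc);
    memory::desc c_md({16, 128, 64},  memory::data_type::f32, memory::format_tag::abc);

    matmul::primitive_desc mm_pd(eng, a_md, b_md, c_md);
    matmul mm(mm_pd);

    memory A(a_md, eng), B(b_md, eng), C(c_md, eng);
    mm.execute(s, {{DNNL_ARG_SRC, A}, {DNNL_ARG_WEIGHTS, B}, {DNNL_ARG_DST, C}});
    s.wait();
    return 0;
}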
Normalization primitives include batch normalization, layer normalization, and group normalization. These operations stabilize the training process by standardizing activations, improving gradient flow and convergence speed. OneDNN's implementation of batch normalization encompasses both inference and training modes, with fused support for scale and shift parameters. These primitives are parameterized to allow seamless integration within larger kernels, reducing overhead and enhancing pipeline efficiency.
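The inference-mode configuration with precomputed statistics and fused scale and shift parameters can be requested through normalization flags, as in the following sketch (assuming the oneDNN v3.x C++ API; shapes and epsilon are illustrative).

#include <oneapi/dnnl/dnnl.hpp>

int main() {
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);
    stream s(eng);

    const memory::dim C = 32;
    memory::desc data_md({1, C, 28, 28}, memory::data_type::f32, memory::format_tag::nchw);
    memory::desc stat_md({C}, memory::data_type::f32, memory::format_tag::a);

    // Inference-mode batch normalization using precomputed (global) statistics,
    // with fused per-channel scale and shift.
    auto flags = normalization_flags::use_global_stats
               | normalization_flags::use_scale
               | normalization_flags::use_shift;
    batch_normalization_forward::primitive_desc bn_pd(
            eng, prop_kind::forward_inference, data_md, data_md, 1e-5f, flags);
    batch_normalization_forward bn(bn_pd);

    memory src(data_md, eng), dst(data_md, eng);
    memory mean(stat_md, eng), var(stat_md, eng), scale(stat_md, eng), shift(stat_md, eng);

    bn.execute(s, {{DNNL_ARG_SRC, src}, {DNNL_ARG_MEAN, mean},
                   {DNNL_ARG_VARIANCE, var}, {DNNL_ARG_SCALE, scale},
                   {DNNL_ARG_SHIFT, shift}, {DNNL_ARG_DST, dst}});
    s.wait();
    return 0;
}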
Pooling primitives implement max and average pooling over spatial input dimensions, critical for spatial dimension reduction and selective feature aggregation in CNNs. Both forward and backward passes are supported with multiple kernel implementations for different pooling window sizes and strides, ensuring broad applicability. Pooling operations are also fused as part of larger computation graphs, enabling more streamlined data movement.
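A typical 2x2, stride-2 max pooling that halves the spatial dimensions looks as follows in the oneDNN v3.x C++ API (shapes are illustrative; the dilation argument is set to zero for ordinary, non-dilated pooling).

#include <oneapi/dnnl/dnnl.hpp>

int main() {
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);
    stream s(eng);

    // 2x2 max pooling with stride 2: 56x56 feature maps reduced to 28x28.
    memory::desc src_md({1, 64, 56, 56}, memory::data_type::f32, memory::format_tag::nchw);
    memory::desc dst_md({1, 64, 28, 28}, memory::data_type::f32, memory::format_tag::nchw);

    pooling_forward::primitive_desc pool_pd(
            eng, prop_kind::forward_inference, algorithm::pooling_max,
            src_md, dst_md,
            /*strides=*/{2, 2}, /*kernel=*/{2, 2}, /*dilation=*/{0, 0},
            /*padding_l=*/{0, 0}, /*padding_r=*/{0, 0});
    pooling_forward pool(pool_pd);

    memory src(src_md, eng), dst(dst_md, eng);
    pool.execute(s, {{DNNL_ARG_SRC, src}, {DNNL_ARG_DST, dst}});
    s.wait();
    return 0;
}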
Elementwise operations, including activation functions (ReLU, sigmoid, tanh, leaky ReLU, ELU), arithmetic operations (add, multiply, subtract, divide), and elementwise transformations (square, sqrt, abs), are supported as standalone primitives and, just as importantly, as post-operations that can be fused directly into preceding compute primitives such as convolution or matrix multiplication, eliminating redundant passes over memory.
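Fusion is expressed through primitive attributes and post-ops: the sketch below (assuming the oneDNN v3.x C++ API) attaches a ReLU post-op to a convolution so that the activation is applied inside the convolution kernel rather than as a separate pass. The shapes and the convolution_auto hint are illustrative.

#include <oneapi/dnnl/dnnl.hpp>

int main() {
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);

    memory::desc src_md({1, 64, 56, 56}, memory::data_type::f32, memory::format_tag::any);
    memory::desc wei_md({64, 64, 3, 3},  memory::data_type::f32, memory::format_tag::any);
    memory::desc dst_md({1, 64, 56, 56}, memory::data_type::f32, memory::format_tag::any);

    // Describe a ReLU post-op and attach it to the convolution through its
    // attributes; the activation is then applied inside the convolution kernel,
    // avoiding an extra traversal of the output tensor.
    post_ops ops;
    ops.append_eltwise(algorithm::eltwise_relu, /*alpha=*/0.f, /*beta=*/0.f);
    primitive_attr attr;
    attr.set_post_ops(ops);

    convolution_forward::primitive_desc conv_relu_pd(
            eng, prop_kind::forward_inference, algorithm::convolution_auto,
            src_md, wei_md, dst_md, {1, 1}, {1, 1}, {1, 1}, attr);
    (void)conv_relu_pd;
    return 0;
}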