Deeper Learning with oneDNN 3.7: Latest Neural Network Library Release

The oneAPI Deep Neural Network Library (oneDNN) version 3.7 has been released, marking the next step in the library's evolution of deep learning performance optimization. The release introduces substantial performance enhancements for Intel, ARM, NVIDIA, and other platforms.

 

As an open-source cross-platform performance library, oneDNN provides essential building blocks for deep learning applications across diverse hardware architectures. oneDNN is a critical component in the modern deep learning ecosystem and is used by frameworks including PyTorch and TensorFlow. Version 3.7 brings new capabilities to developers.

 

Why oneDNN?

 

oneDNN (oneAPI Deep Neural Network Library) serves as a foundational layer for many popular deep learning frameworks and applications. As part of the UXL Foundation and implementing the oneAPI specification, oneDNN provides highly optimized primitives for deep neural network operations that can be leveraged by higher-level software, including ONNX, MATLAB, PaddlePaddle, and Apache MXNet.

 

While initially developed with Intel hardware in mind, the library has expanded its scope to support multiple architectures, including ARM, RISC-V, POWER, AMD, and NVIDIA. This cross-platform approach aligns with the UXL Foundation’s goal of creating a unified programming interface that delivers consistent performance regardless of the underlying hardware.

 

 

Key Enhancements in oneDNN 3.7

 

The latest release focuses on several areas of improvement:

 

  • CPU optimizations
  • Build dependency updates
  • Compatibility with the latest SYCL version
  • Intel GPU optimizations

 

These changes make it easier for developers to identify and resolve performance issues, validate their implementations, and adopt the latest features of the library. For developers and organizations building deep learning applications, the improvements in oneDNN 3.7 offer opportunities to extract maximum performance from available hardware while maintaining portability across platforms. As models continue to grow in size and complexity, libraries like oneDNN become increasingly important.

 

With its integration into popular frameworks and applications, the improvements in oneDNN 3.7 will benefit a wide range of users, from research institutions to enterprise deployments.

On the CPU side, oneDNN 3.7 introduces significant performance improvements for Intel processors, including:

 

  • Enhanced performance for convolution and matrix multiplication (matmul) primitives
  • Improved INT8 and FP32 forward convolution performance on some processors
  • Better performance for FP8 matmul primitives with BF16 and FP16 bias data types
  • Optimized INT8 RNN primitive performance on processors with AVX2 and AVX-512 instruction sets
  • Enhanced INT8 depthwise separable convolution with per-channel zero points on AVX2 and AVX-512 processors
  • Improved FP16 and BF16 softmax operations with relaxed accumulation mode

 

These optimizations ensure that deep learning workloads can fully leverage the advanced instruction sets available in modern Intel processors, delivering significant performance gains for compute-intensive operations.
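
To make the FP8-with-BF16-bias case from the list above concrete, the sketch below creates a matmul primitive with FP8 (e4m3) activations and weights, a BF16 bias, and an FP32 destination on a CPU engine. This is a minimal illustration rather than an official sample: the sizes are hypothetical, and primitive descriptor creation will throw on builds or processors that do not support this configuration.

```cpp
#include <oneapi/dnnl/dnnl.hpp>
using namespace dnnl;

int main() {
    // Hypothetical sizes for illustration.
    const memory::dim M = 8, K = 256, N = 64;

    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    // FP8 (e4m3) activations and weights, BF16 bias, FP32 destination.
    memory::desc src_md({M, K}, memory::data_type::f8_e4m3, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::f8_e4m3, memory::format_tag::ab);
    memory::desc bia_md({1, N}, memory::data_type::bf16, memory::format_tag::ab);
    memory::desc dst_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    // Throws dnnl::error if this combination is unsupported on the host CPU.
    matmul::primitive_desc pd(eng, src_md, wei_md, bia_md, dst_md);
    matmul prim(pd);

    // Library-allocated buffers; contents are left uninitialized in this sketch.
    memory src(src_md, eng), wei(wei_md, eng), bia(bia_md, eng), dst(dst_md, eng);
    prim.execute(strm, {{DNNL_ARG_SRC, src}, {DNNL_ARG_WEIGHTS, wei},
                        {DNNL_ARG_BIAS, bia}, {DNNL_ARG_DST, dst}});
    strm.wait();
    return 0;
}
```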

 

The release also incorporates numerous improvements for Intel graphics, including convolution optimizations for Arc GPUs and the upcoming Xe3 architecture.
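
For reference, the sketch below shows how such GPU convolution optimizations are consumed through the regular primitive API: it creates a forward convolution primitive descriptor on an Intel GPU engine. The shapes and the FP16 data type are hypothetical, format_tag::any lets the library pick the layout it has optimized kernels for, and the program exits quietly if no GPU is visible to the runtime.

```cpp
#include <oneapi/dnnl/dnnl.hpp>
using namespace dnnl;

int main() {
    // Skip gracefully if the build or system has no GPU.
    if (engine::get_count(engine::kind::gpu) == 0) return 0;

    engine eng(engine::kind::gpu, 0);

    // NCHW activations, OIHW weights; 3x3 kernel, stride 1, padding 1.
    memory::desc src_md({1, 64, 56, 56}, memory::data_type::f16, memory::format_tag::any);
    memory::desc wei_md({128, 64, 3, 3}, memory::data_type::f16, memory::format_tag::any);
    memory::desc dst_md({1, 128, 56, 56}, memory::data_type::f16, memory::format_tag::any);

    convolution_forward::primitive_desc pd(eng, prop_kind::forward_inference,
            algorithm::convolution_direct, src_md, wei_md, dst_md,
            /*strides=*/{1, 1}, /*padding_l=*/{1, 1}, /*padding_r=*/{1, 1});
    convolution_forward conv(pd);
    (void)conv;
    return 0;
}
```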

 

oneDNN 3.7 extends support beyond Intel hardware with significant improvements for other architectures:

 

  • Enhanced BF16 matrix multiplication and convolution performance on AArch64 processors through improved Arm Compute Library (ACL) integration (see the sketch after this list)
  • Optimized BF16 to FP32 reorder operations for ARM processors
  • Improved matrix multiplication performance on NVIDIA GPUs using cuBLASLt-based implementation
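
One common way the AArch64 BF16 path above is exercised from frameworks is through oneDNN's floating-point math mode, which allows an FP32 matmul to down-convert to BF16 internally when the backend and hardware support it. The sketch below is a minimal illustration with hypothetical shapes; whether a BF16 kernel is actually selected depends on the build (for example, ACL integration) and the processor.

```cpp
#include <oneapi/dnnl/dnnl.hpp>
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    // Permit BF16 math on FP32 data; the data stays FP32 at the API level.
    primitive_attr attr;
    attr.set_fpmath_mode(fpmath_mode::bf16);

    memory::desc a({64, 512}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc b({512, 256}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc c({64, 256}, memory::data_type::f32, memory::format_tag::ab);

    matmul::primitive_desc pd(eng, a, b, c, attr);
    (void)pd;
    return 0;
}
```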

 

oneDNN 3.7 introduces new functionalities:

 

  • Support for select algorithm in binary primitive
  • Extended quantization support in matmul and reorder operations with grouped scales and zero-points for weights (for Intel CPUs and GPUs); see the sketch after this list
  • Initial support for 4-bit floating-point data types (f4_e2m1 and f4_e3m0) and e8m0 scales in matmul and reorder operations
  • New GenIndex and GreaterEqual operations in the Graph API
  • Support for FP32 matmul with FP16 and BF16 weights on Intel CPUs
  • Stochastic rounding support for convolution, matmul, and reorder operations on Intel GPUs, based on the Philox counter-based random number generator
  • Support for strided memory formats in convolution on Intel GPUs
  • Forward propagation support for the reduction and inner product primitives on generic GPU vendors
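
The grouped weight quantization mentioned above is expressed through primitive attributes. The sketch below outlines one plausible setup, INT8 weights with one FP32 scale per 32-element group along the K dimension of an FP32 matmul; the exact mask and group values, and whether a given configuration is supported, depend on the hardware and build, so treat this as an illustrative assumption rather than a definitive recipe.

```cpp
#include <oneapi/dnnl/dnnl.hpp>
using namespace dnnl;

int main() {
    const memory::dim M = 4, K = 128, N = 64, G = 32;  // hypothetical sizes

    engine eng(engine::kind::cpu, 0);

    memory::desc src_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::s8, memory::format_tag::ab);
    memory::desc dst_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    primitive_attr attr;
    // One f32 scale per (G x 1) block of the K x N weights: scales vary along
    // both weight dimensions (mask bits 0 and 1) with a group of G along K.
    attr.set_scales(DNNL_ARG_WEIGHTS, /*mask=*/(1 << 0) | (1 << 1),
                    /*groups=*/{G, 1}, memory::data_type::f32);
    // FP32 activations with integer weights typically also require a math mode
    // that lets the integer weights participate in reduced-precision compute
    // (assumption; the exact requirement depends on the implementation).
    attr.set_fpmath_mode(fpmath_mode::bf16, /*apply_to_int=*/true);

    // Throws if the configuration is unsupported on this build or processor.
    matmul::primitive_desc pd(eng, src_md, wei_md, dst_md, attr);
    (void)pd;
    return 0;
}
```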

 

Technical Deep Dive

 

1. Precision and Performance Tradeoffs

 

One of the most notable aspects of oneDNN 3.7 is its expanded support for lower-precision operations. The introduction of 4-bit floating-point data types and enhanced INT8 operations reflect the industry’s continuing move toward precision-optimized computing, where the appropriate numerical precision is used for each task to maximize performance without sacrificing model accuracy.

 

This approach is particularly evident in the improved support for BF16 (Brain Floating Point) operations across multiple platforms. BF16 has emerged as an important format for deep learning, offering a balance between the precision of FP32 and the performance benefits of reduced precision.
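
For intuition about why BF16 strikes that balance: it keeps FP32's 8-bit exponent, so dynamic range is preserved, but carries only 7 mantissa bits. The conversion sketch below illustrates the format itself; it is not oneDNN's internal implementation and it ignores NaN edge cases.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Convert FP32 to BF16 by keeping the upper 16 bits of the FP32 encoding,
// rounding the discarded bits to nearest-even (NaN handling omitted).
static std::uint16_t f32_to_bf16(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    const std::uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<std::uint16_t>((bits + rounding) >> 16);
}

// Widening back to FP32 is exact: just restore the 16 zeroed low bits.
static float bf16_to_f32(std::uint16_t x) {
    std::uint32_t bits = static_cast<std::uint32_t>(x) << 16;
    float out;
    std::memcpy(&out, &bits, sizeof out);
    return out;
}

int main() {
    float v = 3.14159f;
    std::printf("%f -> bf16 -> %f\n", v, bf16_to_f32(f32_to_bf16(v)));
    return 0;
}
```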

 

2. Graph API Enhancements

 

The Graph API receives significant attention in this release with optimizations for several important neural network patterns:

 

  • Gated Multi-Layer Perceptron (Gated MLP)
  • Scaled Dot-Product Attention (SDPA) with implicit causal mask
  • SDPA with INT8 or INT4 compressed key and value

 

These optimizations are particularly relevant for transformer-based models that rely heavily on attention mechanisms and multi-layer perceptron (MLP) blocks. By optimizing these specific patterns, oneDNN 3.7 can deliver substantial performance improvements for modern large language models and other transformer architectures.
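
To show how such a pattern reaches the Graph API, the sketch below builds a stripped-down attention block, MatMul, SoftMax, MatMul, and asks the library to partition it. It is a simplified, assumption-laden illustration: real SDPA graphs also carry the scale and mask inputs, the shapes here are hypothetical, and whether the ops come back fused into a single partition depends on the backend.

```cpp
#include <oneapi/dnnl/dnnl_graph.hpp>
using namespace dnnl::graph;

int main() {
    using dt = logical_tensor::data_type;
    using lt = logical_tensor::layout_type;

    // Hypothetical shapes: batch=1, heads=16, seq_len=128, head_dim=64.
    logical_tensor q  (0, dt::f32, {1, 16, 128, 64},  lt::strided);
    logical_tensor k  (1, dt::f32, {1, 16, 128, 64},  lt::strided);
    logical_tensor v  (2, dt::f32, {1, 16, 128, 64},  lt::strided);
    logical_tensor s  (3, dt::f32, {1, 16, 128, 128}, lt::strided);
    logical_tensor p  (4, dt::f32, {1, 16, 128, 128}, lt::strided);
    logical_tensor out(5, dt::f32, {1, 16, 128, 64},  lt::strided);

    op qk(10, op::kind::MatMul, "qk");          // scores = Q * K^T
    qk.set_attr<bool>(op::attr::transpose_b, true);
    qk.add_inputs({q, k});
    qk.add_output(s);

    op smax(11, op::kind::SoftMax, "softmax");  // probabilities over keys
    smax.set_attr<int64_t>(op::attr::axis, -1);
    smax.add_input(s);
    smax.add_output(p);

    op pv(12, op::kind::MatMul, "pv");          // output = P * V
    pv.add_inputs({p, v});
    pv.add_output(out);

    graph g(dnnl::engine::kind::cpu);
    g.add_op(qk);
    g.add_op(smax);
    g.add_op(pv);
    g.finalize();

    // If the backend recognizes the pattern, the ops are returned fused into
    // a single partition, ready to be compiled and executed.
    auto partitions = g.get_partitions();
    (void)partitions;
    return 0;
}
```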

 

3. Memory Management Improvements

 

oneDNN 3.7 brings important usability enhancements in memory management, particularly for the SYCL runtime. Memory objects on CPU engines are now reference-counted, eliminating the need to explicitly maintain their lifetime during primitive execution. This change aligns the behavior between CPU and GPU engines, simplifying development and reducing the risk of memory-related errors.
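
The sketch below shows the usage pattern this change simplifies: a dnnl::memory object wrapping a user buffer on a CPU engine and passed to a primitive. Per the description above, the library now reference-counts the memory object during primitive execution, so the handle itself no longer has to be kept alive explicitly, although the underlying user buffer still must. Names and shapes are illustrative only.

```cpp
#include <oneapi/dnnl/dnnl.hpp>
#include <vector>
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    // Wrap an existing user buffer (no copy is made on CPU engines).
    std::vector<float> data(64 * 64, 1.0f);
    memory::desc md({64, 64}, memory::data_type::f32, memory::format_tag::ab);
    memory mem(md, eng, data.data());

    // Run a simple in-place ReLU over the buffer.
    eltwise_forward::primitive_desc pd(eng, prop_kind::forward_inference,
                                       algorithm::eltwise_relu, md, md, 0.f);
    eltwise_forward(pd).execute(strm, {{DNNL_ARG_SRC, mem}, {DNNL_ARG_DST, mem}});
    strm.wait();
    return 0;
}
```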

 

Additionally, improved support for large tensor sizes in convolution, matmul, and reduction primitives on Intel GPUs addresses the growing demands of larger models and datasets.

 

4. Usability and Developer Experience

 

Intel has made several improvements to enhance the developer experience:

 

  • Enhanced verbose diagnostics to better identify issues during dispatching, primitive, and kernel creation (see the sketch after this list)
  • Enabled frame pointer support on Intel64 platforms for improved profiler integration
  • Improved verbose diagnostics for Intel GPU driver compatibility issues
  • Extended benchdnn with support for FP8 matmul patterns and additional validation capabilities
  • Added Graph API examples for Gated MLP and INT4 Gated MLP patterns
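
Verbose diagnostics are controlled by the ONEDNN_VERBOSE environment variable, which is usually set in the shell (for example, ONEDNN_VERBOSE=2 ./app). The sketch below sets it programmatically before the library is first used; this relies on POSIX setenv and is shown purely for illustration.

```cpp
#include <cstdlib>
#include <oneapi/dnnl/dnnl.hpp>
using namespace dnnl;

int main() {
    // Level 2 adds primitive-creation (dispatching) information on top of the
    // execution traces printed at level 1. Set it before the library reads it
    // at first use.
    setenv("ONEDNN_VERBOSE", "2", /*overwrite=*/1);

    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    // Any primitive creation and execution now emits verbose lines to stdout,
    // including which implementation was dispatched.
    memory::desc md({8, 8}, memory::data_type::f32, memory::format_tag::ab);
    memory src(md, eng), dst(md, eng);
    reorder r(reorder::primitive_desc(src, dst));
    r.execute(strm, src, dst);
    strm.wait();
    return 0;
}
```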

 

GitHub – uxlfoundation/oneDNN: oneAPI Deep Neural Network Library (oneDNN)
