<aside> πŸ’‘

If you find any mistakes in this article, please feel free to contact me in the comments or at [email protected].

Released on February 6, 2025

Last updated on February 6, 2025

Other versions: Chinese version

</aside>

Introduction

Sorting is one of the most common computational tasks in computer science. In this article, we analyze performance optimization opportunities for the torch.sort operator in PyTorch on GPUs. On an A800 GPU with PyTorch 2.6.0 (CUDA 12.6), for example, sorting 100 million int32 integers takes 14.4 ms. The A800's theoretical memory bandwidth is 2 TB/s, and 100 million int32 integers occupy 400 MB, so reading them once from GPU global memory takes a theoretical 0.2 ms. The sorting operation therefore costs roughly the equivalent of 72 such memory reads, or 36 complete read/write passes over the data.
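
The end-to-end number above can be reproduced with a simple CUDA-event-based timing loop. The sketch below assumes the same setup (100 million random int32 values on a single GPU); the warm-up and repetition details are illustrative choices rather than part of the original experiment.

# Minimal sketch: timing torch.sort on 100 million int32 values.
# Warm-up runs exclude one-time costs (allocation, kernel caching);
# CUDA events measure the asynchronous kernel execution correctly.
import torch

n = 100_000_000
x = torch.randint(0, 2**31 - 1, (n,), dtype=torch.int32, device="cuda")

for _ in range(3):  # warm-up
    torch.sort(x)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
torch.sort(x)
end.record()
torch.cuda.synchronize()
print(f"torch.sort: {start.elapsed_time(end):.2f} ms")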

To assess whether this sorting time is reasonable, we use the PyTorch Profiler to examine the internal implementation of the torch.sort operator, breaking the end-to-end sorting time down into individual GPU kernels. We then analyze the performance of the key kernels with Nsight Compute and identify several optimization opportunities. Beyond common issues such as redundant data copies, we observed that core GPU kernels such as DeviceRadixSortOnesweep achieve less than 40% bandwidth utilization on the A800. This bottleneck is closely related to the way input data of PyTorch's OpaqueType is handled in CUDA.
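
The kernel-level breakdown can be obtained with torch.profiler; the sketch below shows one way to do it (the activity list and table options are illustrative choices). Kernel names such as DeviceRadixSortOnesweep appear directly in the resulting table.

# Minimal sketch: breaking torch.sort down into its GPU kernels.
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randint(0, 2**31 - 1, (100_000_000,), dtype=torch.int32, device="cuda")
torch.sort(x)  # warm-up so one-time costs do not pollute the profile

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    torch.sort(x)
    torch.cuda.synchronize()

# The table lists each CUDA kernel (e.g. DeviceRadixSortOnesweep) and its duration.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))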

To address the aforementioned performance issues, we conducted a series of optimizations. On GPUs such as the A800, H800, and H20, the performance of torch.sort has improved by more than 35%. For example, on the A800, the sorting time for 100 million int32 integers was reduced from 14.4 ms to 6.4 ms. Furthermore, operators based on sorting, such as torch.unique, also benefited from these optimizations, showing significant performance improvements.

The primary goal of this article is to break down and analyze these optimization opportunities, providing the relevant PyTorch optimization commits for future PR submissions or as a reference for further optimizations. Additionally, for the widely used OpaqueType in the PyTorch framework, we explain its impact on kernel performance by analyzing GPU PTX instructions, in order to avoid similar performance issues in future operator implementations.

Key Results

Table 1: The primary optimization results of this article, where the sorting input consists of 100 million int32 integers. Original performance corresponds to PyTorch version 2.6.0 + CUDA 12.6.

GPU     Original Performance (ms)    Optimized Performance (ms)
A800    14.2                         6.4
H800    6.7                          4.1
H20     10.8                         6.0

Note 1: The A800 and H800 are China-market variants of the A100 and H100, respectively. The primary difference is NVLink performance; single-device operator performance is essentially the same.

Note 2: Both the official PyTorch baseline and the source-compiled optimized version discussed in this article are built against CUDA 12.6. The default PyTorch 2.6.0 pip installation is built against CUDA 12.4, which results in a slight performance drop compared to CUDA 12.6.

Note 3: This article focuses on sorting 1D tensors with shape (N, ). Sorting 2D tensors with shape (N, T), based on SegmentSort, is another optimization direction and will be analyzed separately in another article.

Analysis of PyTorch's Current Implementation

Interface Definition

Here is the operator definition for torch.sort in PyTorch:

# torch.sort
values, indices = torch.sort(input, dim=-1, descending=False, stable=False, out=None)
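
For reference, a short usage example on a small 1D tensor (the values are illustrative):

# Sort a 1D tensor in descending order and recover the permutation indices.
import torch

x = torch.tensor([3, 1, 2], dtype=torch.int32, device="cuda")
values, indices = torch.sort(x, descending=True)
# values -> tensor([3, 2, 1], ...); indices -> tensor([0, 2, 1], ...)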

The meanings of the parameters are as follows: