<aside> 💡
If you find any mistakes in this article, please feel free to contact me via a comment or [email protected].
Released on January 11, 2025
Last updated on January 11, 2025
Other versions: Chinese version
</aside>
In scenarios such as search, advertising, and recommendation systems, deep learning-based personalized ranking models (Deep Learning Rank Models, DLRM) are widely used, with the embedding module being a key component. At the MICRO 2024 conference, researchers from Penn State University published a paper [1] on performance optimization of the PyTorch embedding_bag operator on GPUs. The paper demonstrated a 103% performance improvement through a variety of techniques, such as compiler optimizations, software prefetching, and L2 cache optimizations (L2 pinning) for hot data.
However, these methods are challenging to apply in practice because they require parameter tuning. For example, increasing occupancy by restricting the number of registers via the compiler and optimizing memory access through data prefetching both depend on specific parameter settings, and the optimal values may vary across GPU hardware, which limits the generalizability and ease of use of these approaches.
In this article, we achieve similar or even greater performance improvements with a simpler approach. Through code and instruction analysis of the PyTorch embedding_bag operator, we found that the input boundary checks performed with CUDA_KERNEL_ASSERT lead to redundant calculations across threads. These redundant computations not only hurt performance but also result in low kernel occupancy. To address this, we moved the boundary checks into a separate GPU kernel, which eliminates the redundant calculations and lowers register usage in the main kernel, significantly improving its occupancy.
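To make the structure of this change concrete, here is a minimal Python-level sketch of the same idea. The actual fix lives in the operator's CUDA source; the names below (`validate_indices`, `embedding_bag_checked`) are hypothetical and only mirror the shape of the change: one lightweight validation pass over the indices, then the main lookup with no per-thread asserts (here `torch.nn.functional.embedding_bag` merely stands in for the assert-free main kernel).

```python
import torch

def validate_indices(indices: torch.Tensor, num_embeddings: int) -> None:
    # Stand-in for the separate bounds-check kernel: a single cheap pass
    # over the indices, instead of a CUDA_KERNEL_ASSERT repeated by every
    # thread of the main embedding kernel.
    if not bool(((indices >= 0) & (indices < num_embeddings)).all()):
        raise IndexError("embedding index out of range")

def embedding_bag_checked(weight: torch.Tensor, indices: torch.Tensor,
                          offsets: torch.Tensor) -> torch.Tensor:
    validate_indices(indices, weight.size(0))  # separate validation pass
    # In the CUDA implementation, the main kernel then runs without the
    # per-thread boundary asserts, which lowers register pressure and
    # raises occupancy.
    return torch.nn.functional.embedding_bag(indices, weight, offsets,
                                             mode="sum")

weight = torch.randn(20, 4)
indices = torch.randint(0, 20, (10,))
offsets = torch.tensor([0, 3, 8])
print(embedding_bag_checked(weight, indices, offsets).shape)  # [3, 4]
```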
Based on this method, we achieved substantial performance improvements for the torch.nn.EmbeddingBag operator on GPUs such as the A800, H20, and RTX 4090 under different input distributions. For example, on the A800, in one representative test case from the paper, the embedding_bag computation time was reduced from 460 µs to 175 µs. Additionally, we found that torch.nn.Embedding also suffers from performance issues due to low occupancy. By optimizing the GPU block parameters and merging redundant computations, we were able to achieve significant performance improvements.
First, we present the key optimization results of this article. Table 1 summarizes the optimization results for the embedding_bag operator, with tests based on the dataset used in the paper [1], and compares these results with those presented in the paper. Table 2 shows the optimization results for the embedding operator, tested with randomly generated input. More comprehensive and detailed experimental data will be discussed later.
Table 1: torch.nn.EmbeddingBag optimization results (performance data based on the paper's dataset). The "high hot (us)" column gives the execution time of the embedding_bag operator under the high-hot data distribution, and likewise for the other columns.
GPU | Data Source | GPU kernel version | high hot (us) | medium hot (us) | low hot (us) | random (us) |
---|---|---|---|---|---|---|
A100 | Original Paper Data | Official PyTorch implementation | 237 | 341 | 428 | 443 |
A100 | Original Paper Data | Optimized version in the paper | 167 | 190 | 216 | 217 |
A800 | Our Experiments | Official PyTorch implementation | 237.6 | 344.5 | 445.0 | 460.2 |
A800 | Our Experiments | Our optimized version | 118.9 | 144.0 | 168.7 | 175.2 |
Note: The A800 is a variant of the A100 tailored for the Chinese market; the main difference is NVLink performance, so single-GPU kernel performance on the A800 should be essentially the same as on the A100.
Table 2: torch.nn.Embedding optimization results (performance data based on randomly generated input).
GPU | Number of input elements | embedding_dim | GPU kernel time before optimization (us) | GPU kernel time after optimization (us) |
---|---|---|---|---|
A800 | 307200 | 128 | 516.4 | 281.5 |
A800 | 307200 | 32 | 137.0 | 82.1 |
A800 | 131072 | 128 | 222.0 | 125.6 |
A800 | 131072 | 32 | 59.8 | 39.3 |
A800 | 8192 | 128 | 17.0 | 13.7 |
A800 | 8192 | 32 | 6.8 | 7.9 |
Note: Since our implementation introduces an additional kernel launch, there can be a slight performance regression (1-3 µs) for small inputs (e.g., the last row of the table). This could be avoided with special-case logic (such as falling back to the original kernel for small inputs), but we include the negative result here for completeness and objectivity.
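For reference, per-kernel times like those in the tables are typically read from a profiler such as Nsight Systems or the PyTorch profiler; a rough end-to-end measurement can also be taken directly in PyTorch. The sketch below is one plausible setup (the sizes are assumed to match one Table 2 configuration, and this is not necessarily the exact harness behind the tables); it times the torch.nn.Embedding forward pass with CUDA events, so it includes kernel launch overhead and will read slightly higher than pure kernel time.

```python
import torch

num_embeddings, embedding_dim, num_indices = 1_000_000, 128, 307_200
emb = torch.nn.Embedding(num_embeddings, embedding_dim).cuda()
idx = torch.randint(0, num_embeddings, (num_indices,), device="cuda")

for _ in range(10):  # warm-up to exclude one-time initialization costs
    emb(idx)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 100
torch.cuda.synchronize()
start.record()
for _ in range(iters):
    emb(idx)
end.record()
torch.cuda.synchronize()
# elapsed_time returns milliseconds; convert to microseconds per call
print(f"avg forward time: {start.elapsed_time(end) / iters * 1000:.1f} us")
```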
In DLRM models or NLP models, the purpose of embedding is to map discrete ID inputs (such as text tokens in NLP or video category IDs in short video recommendations) to embedding vectors in a continuous space. Below is a visual example based on the NLP scenario.
As shown in the figure, the main function of the embedding module is to look up the corresponding row in the embedding_weight for each ID in the input and use that row's embedding vector as the embedding for that ID. In this article, we use num_embeddings to represent the number of rows in the embedding_weight (20 in the example) and embedding_dim to represent the number of columns (i.e., the vector dimension, 4 in the example). Therefore, in PyTorch, embedding_weight is a 2D tensor with the shape [num_embeddings, embedding_dim].
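As a quick illustration of these semantics (a minimal sketch with made-up IDs, matching the shapes of the example above), the lookup performed by torch.nn.Embedding is equivalent to plain row indexing into the weight tensor:

```python
import torch

num_embeddings, embedding_dim = 20, 4
emb = torch.nn.Embedding(num_embeddings, embedding_dim)
ids = torch.tensor([3, 7, 12])  # arbitrary example IDs

out = emb(ids)                  # shape: [3, 4], one row per input ID

# The lookup is just row gathering from the 2D weight tensor.
assert torch.equal(out, emb.weight[ids])
```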
In the above example, each word is mapped to an individual embedding vector. In practice, however, we may want the embedding of an entire sentence rather than of each individual word. The input above contains three sentences: ["How are you", "What is the weather today", "Good luck"], so we would want to generate embeddings for the three sentences rather than for the ten words. The common approach in such cases is to sum or average the embeddings of all words in each sentence, a process known as embedding_bag.
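The sketch below illustrates this with torch.nn.EmbeddingBag (token IDs are made up for illustration): the three sentences contribute 3, 5, and 2 tokens, so the flat input holds 10 IDs and the offsets tensor marks where each sentence ("bag") starts.

```python
import torch

emb_bag = torch.nn.EmbeddingBag(num_embeddings=20, embedding_dim=4,
                                mode="mean")  # or mode="sum"
ids = torch.tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])  # 10 token IDs
offsets = torch.tensor([0, 3, 8])  # sentence starts: [0:3], [3:8], [8:10]

out = emb_bag(ids, offsets)        # shape: [3, 4], one vector per sentence

# Each output row is the average of that sentence's word embeddings.
assert torch.allclose(out[0], emb_bag.weight[ids[0:3]].mean(dim=0))
```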