<aside> 💡

If you find any mistakes in this article, please feel free to contact me through comment or [email protected].

Released on February 23 2025

Last updated on February 23 2025

</aside>

Introduction

DeepSeek-V3[1] and DeepSeek-R1[2] have recently gained significant attention, not only for the models' exceptional capabilities but also for the structural innovations that improve training and inference performance. In this article, we analyze how to optimize the performance of Multi-head Latent Attention (MLA), used in DeepSeek-V3, during the decoding stage. Since the KVCache in MLA does not include a head dimension, it cannot be split along the head dimension to implement tensor parallelism (TP) the way Multi-head Attention (MHA) or Grouped Query Attention (GQA) can. Current open-source implementations primarily support data parallelism (DP) for MLA, but data parallelism leads to uneven sequence computation time across GPUs. The DeepSeek-V3[1] paper mentions that the parallel strategy for MLA in the decode stage combines tensor parallelism with sequence parallelism (SP), but it does not detail the implementation. In this article we therefore analyze how to parallelize and accelerate MLA during the decode stage: we propose applying tensor parallelism to the q_proj, k_proj, v_proj, and o_proj layers along the head dimension, while using sequence parallelism for the attention computation. This approach distributes the attention computation more uniformly across multiple GPUs, accelerating decoding.

Analysis and Insights

Insight 1: Attention Computation in MLA Cannot Use Tensor Parallelism (TP)

In MLA, the KVCache stores the latent-space tensor produced before the k_proj and v_proj up-projections. Unlike the key and value tensors that are stored per head in Multi-head Attention, the KVCache in MLA has no head dimension, so it cannot be sharded along the head dimension the way an MHA or GQA cache can. This is why current MLA implementations use data parallelism.
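
To make the difference in cache layout concrete, here is a minimal sketch comparing the per-token cache size of MHA and MLA. The dimensions are illustrative assumptions (roughly in the range used by DeepSeek-V3), not exact configuration values.

```python
# Illustrative comparison of per-token KVCache layouts.
# The numbers below are assumptions, not the exact DeepSeek-V3 configuration.
n_heads  = 128   # attention heads
head_dim = 128   # per-head dimension
d_latent = 512   # MLA compressed latent width (kv_lora_rank)
d_rope   = 64    # decoupled RoPE key width in MLA

# MHA/GQA: K and V are cached per head, so the head dimension is available
# for tensor parallelism -- each GPU can own a slice of the heads.
mha_cache_per_token = 2 * n_heads * head_dim          # K + V
# MLA: only the latent vector (plus the shared RoPE key) is cached; there is
# no head dimension to shard, so head-wise TP of the cache is impossible.
mla_cache_per_token = d_latent + d_rope

print(f"MHA cache elements/token: {mha_cache_per_token}")   # 32768
print(f"MLA cache elements/token: {mla_cache_per_token}")   # 576
```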

Insight 2: Data Parallel Faces Workload Imbalance Across GPUs

When only data parallelism is used, the request lengths on different GPUs can vary significantly. This affects GPU memory utilization and causes uneven KVCache loading times across GPUs, so the slowest GPU bottlenecks overall performance. Mitigating this requires careful request routing to keep the KVCache balanced, which adds complexity. Additionally, with data parallelism alone, small batches (e.g., batch size < 4) cannot fully leverage GPU parallelism to accelerate attention. We note that in the DeepSeek-V3 paper, the decoding stage employs a combination of Tensor Parallelism (TP) and Sequence Parallelism (SP), but few details are provided.

Insight 3: Sequence Parallelism Faces Cache Distribution Issues Due to New Token Generation

With a naive implementation of Sequence Parallelism (SP), newly generated tokens must be continuously appended to the KVCache, so the cache on the last GPU keeps growing. Over time this leads to an uneven cache distribution and degrades performance.

Based on the above insights, we propose the following hybrid parallel strategy.

Tensor Parallelism + Chunked Sequence Parallelism

Chunked Sequence Parallelism

For the attention computation in the decode stage, we have the following observations.

  1. The KVCache can be split along the sequence dimension, and each GPU can compute local attention with the query. For example, with 2 GPUs, a decode query of shape [1, 128], and a KVCache of shape [T, 128], we can split the KVCache into two blocks of shape [T/2, 128] and perform the attention computation on both GPUs, each producing a [1, 128] partial result. The partial results from the two GPUs are then reduced to obtain the final [1, 128] output tensor. It is important to note that the intermediate softmax statistics (the running max and sum of exponentials) must also be communicated across GPUs. This is conceptually the same strategy used in flash decoding[3]; see the sketch after this list.
  2. The order in which the blocks are computed does not affect the result, as long as the corresponding position encodings are applied correctly.
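
To illustrate the first observation, below is a minimal single-process sketch of the chunk-wise combination. It simulates two GPUs, each holding half of the KVCache along the sequence dimension, and merges their partial results with the log-sum-exp trick; in a real deployment the partial outputs and softmax statistics would be exchanged via collective communication. The function names and shapes are illustrative assumptions, not DeepSeek's implementation.

```python
import torch

def partial_attention(q, k, v):
    """Attention over one KV chunk; returns the un-normalized partial output
    together with the softmax statistics (row max, sum of exponentials)."""
    scores = q @ k.T / k.shape[-1] ** 0.5           # [1, chunk_len]
    m = scores.max(dim=-1, keepdim=True).values     # running max for stability
    p = torch.exp(scores - m)
    s = p.sum(dim=-1, keepdim=True)
    return p @ v, m, s

def merge(partials):
    """Combine per-chunk partial results exactly (same math as flash decoding)."""
    m_global = torch.max(torch.cat([m for _, m, _ in partials], dim=-1),
                         dim=-1, keepdim=True).values
    o_acc, s_acc = 0.0, 0.0
    for o, m, s in partials:
        scale = torch.exp(m - m_global)
        o_acc = o_acc + o * scale
        s_acc = s_acc + s * scale
    return o_acc / s_acc

torch.manual_seed(0)
T, d = 1024, 128
q, k, v = torch.randn(1, d), torch.randn(T, d), torch.randn(T, d)

# Reference: full attention on a single device.
ref = torch.softmax(q @ k.T / d ** 0.5, dim=-1) @ v

# "Two GPUs": each holds half of the KVCache along the sequence dimension.
halves = [partial_attention(q, k[:T // 2], v[:T // 2]),
          partial_attention(q, k[T // 2:], v[T // 2:])]
out = merge(halves)
print(torch.allclose(out, ref, atol=1e-4))          # True
```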

Based on these observations, we propose a chunked sequence parallelism strategy to address the issue that the KVCache is continuously appended to during decoding, which would otherwise cause the cache on the last GPU to keep growing. Specifically, we treat every 256 tokens as one chunk, and new chunks are scheduled across the GPUs in a round-robin fashion. For example, with 2 GPUs, the KVCache for tokens 1 to 256 is placed on GPU0, tokens 257 to 512 on GPU1, and tokens 513 to 768 back on GPU0.
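
A minimal sketch of this placement rule follows. The chunk size of 256 comes from the text above; the function name and the 0-based token indexing are my own assumptions.

```python
CHUNK_SIZE = 256

def chunk_owner(token_idx: int, num_gpus: int) -> int:
    """Return the GPU rank that stores the KVCache entry of `token_idx` (0-based)."""
    return (token_idx // CHUNK_SIZE) % num_gpus

# With 2 GPUs: tokens 0-255 -> GPU0, 256-511 -> GPU1, 512-767 -> GPU0, ...
assert chunk_owner(0, 2) == 0
assert chunk_owner(256, 2) == 1
assert chunk_owner(512, 2) == 0
```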

Tensor Parallelism

This section discusses the parallel strategy for computing q_proj, k_proj, v_proj, and o_proj. These projections include the head dimension, so tensor parallelism can be applied directly along it. Note that MLA uses an optimization that reorders the matrix multiplications via associativity (referred to as "absorb" in the paper); for a clearer understanding of this optimization, see my earlier blog[4], where I discussed the same idea in the context of Multi-head Attention.
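
Below is a minimal sketch of head-wise tensor parallelism for these projections, assuming the k/v up-projection weights carry an explicit head dimension. The per-head attention itself is stubbed out (it would be the chunked sequence-parallel computation described above), and the dimensions, names, and the plain Python loop standing in for TP ranks are all illustrative assumptions, not DeepSeek's code.

```python
import torch

n_heads, head_dim, d_model, d_latent, tp = 8, 64, 512, 128, 2
heads_per_rank = n_heads // tp

torch.manual_seed(0)
# Unsharded weights. The q/k/v up-projections carry a head dimension, so they
# can be split along it; o_proj is split along its input rows (row-parallel).
Wq = torch.randn(n_heads, d_model, head_dim)
Wk = torch.randn(n_heads, d_latent, head_dim)   # up-projection from the MLA latent
Wv = torch.randn(n_heads, d_latent, head_dim)
Wo = torch.randn(n_heads * head_dim, d_model)

def shard(rank: int):
    """Weights owned by one TP rank: a contiguous slice of heads."""
    h0, h1 = rank * heads_per_rank, (rank + 1) * heads_per_rank
    return Wq[h0:h1], Wk[h0:h1], Wv[h0:h1], Wo[h0 * head_dim:h1 * head_dim]

x = torch.randn(1, d_model)    # hidden state of the token being decoded
c = torch.randn(1, d_latent)   # compressed KV latent shared by all heads

out = torch.zeros(1, d_model)
for rank in range(tp):         # the loop stands in for tp GPUs running in parallel
    wq, wk, wv, wo = shard(rank)
    q = torch.einsum('bd,hdk->bhk', x, wq)   # queries for this rank's heads
    k = torch.einsum('bc,hck->bhk', c, wk)   # up-projected keys for this rank's heads
    v = torch.einsum('bc,hck->bhk', c, wv)   # up-projected values for this rank's heads
    # Per-head attention over the (sequence-parallel) KVCache would happen here;
    # we use v directly as the per-head attention output to keep the sketch short.
    head_out = v
    out = out + head_out.reshape(1, -1) @ wo  # partial o_proj; the sum == all-reduce

# Check against the unsharded computation.
ref = torch.einsum('bc,hck->bhk', c, Wv).reshape(1, -1) @ Wo
print(torch.allclose(out, ref, atol=1e-3))    # True
```

Only the o_proj output needs an all-reduce across TP ranks; the q/k/v projections are purely local because each rank owns a disjoint slice of heads.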

Overall Architecture