**DeepSeek Open Source FlashMLA – MLA Decoding Kernel for Hopper GPUs**

DeepSeek's open-source FlashMLA is a high-performance decoding kernel for Multi-head Latent Attention (MLA) on NVIDIA Hopper GPUs. It is optimized for variable-length sequences and reaches up to 3000 GB/s in memory-bound configurations and 580 TFLOPS in compute-bound configurations on an H800 SXM5 GPU with CUDA 12.6.
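
The two figures above describe two different regimes: one where the kernel is limited by memory bandwidth and one where it is limited by arithmetic throughput. A rough way to see where the crossover sits is a roofline-style estimate. The sketch below is only an illustration: it treats the quoted 3000 GB/s and 580 TFLOPS as stand-in ceilings (they are reported achieved numbers, not hardware peaks), and the function names and sample intensities are hypothetical, not part of FlashMLA.

```python
# Roofline-style back-of-envelope estimate.
# Assumption: the figures quoted in the note are used as ceilings here,
# although they are reported achieved numbers rather than hardware peaks.
BANDWIDTH_BYTES_PER_S = 3000e9  # 3000 GB/s
COMPUTE_FLOP_PER_S = 580e12     # 580 TFLOPS


def ridge_point() -> float:
    """Arithmetic intensity (FLOP per byte) where the two ceilings meet."""
    return COMPUTE_FLOP_PER_S / BANDWIDTH_BYTES_PER_S


def attainable_flops(intensity: float) -> float:
    """Throughput bound for a kernel doing `intensity` FLOP per byte moved."""
    return min(COMPUTE_FLOP_PER_S, BANDWIDTH_BYTES_PER_S * intensity)


def regime(intensity: float) -> str:
    """Label a workload by which ceiling limits it."""
    return "memory-bound" if intensity < ridge_point() else "compute-bound"


if __name__ == "__main__":
    print(f"ridge point: {ridge_point():.1f} FLOP/byte")
    # Hypothetical intensities, chosen only to land on both sides of the ridge.
    for intensity in (1.0, 50.0, 500.0):
        tflops = attainable_flops(intensity) / 1e12
        print(f"{intensity:6.1f} FLOP/byte -> {tflops:6.1f} TFLOPS ({regime(intensity)})")
```

Under these assumptions the crossover falls at roughly 193 FLOP per byte: workloads below that intensity are bounded by bandwidth, and workloads above it are bounded by compute.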

FlashMLA's design was inspired by the FlashAttention 2 and 3 and CUTLASS projects. The project actively incorporates user feedback to improve its efficiency and capabilities, with a focus on fast, efficient decoding on modern NVIDIA hardware.

💬 [HN Comments](https://news.ycombinator.com/item?id=43155023) (82)
