**DeepSeek Open-Sources FlashMLA – MLA Decoding Kernel for Hopper GPUs**
DeepSeek's open-source FlashMLA is a high-performance decoding kernel for MLA (Multi-head Latent Attention), built for NVIDIA Hopper GPUs and optimized for serving variable-length sequences. On an H800 SXM5 GPU with CUDA 12.6, it reaches up to 3000 GB/s in memory-bound configurations and up to 580 TFLOPS in compute-bound configurations.
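The two headline numbers correspond to the two ceilings of a simple roofline model: a kernel is limited either by how fast it can move bytes or by how fast it can do arithmetic. The sketch below is a generic roofline calculation, not FlashMLA code. It plugs in the quoted peak figures to find the arithmetic intensity (FLOPs per byte) where the two limits meet. The helper names and the example workload sizes are hypothetical, chosen only for illustration.

```python
# Generic roofline sketch (not FlashMLA's implementation).
# Peak figures quoted for FlashMLA on an H800 SXM5 with CUDA 12.6.
PEAK_BANDWIDTH_BYTES_PER_S = 3000e9  # 3000 GB/s, memory-bound case
PEAK_COMPUTE_FLOPS_PER_S = 580e12    # 580 TFLOPS, compute-bound case


def ridge_point() -> float:
    """Arithmetic intensity (FLOPs per byte) at which both limits coincide."""
    return PEAK_COMPUTE_FLOPS_PER_S / PEAK_BANDWIDTH_BYTES_PER_S


def attainable_flops(intensity: float) -> float:
    """Throughput is capped by whichever resource runs out first."""
    return min(PEAK_COMPUTE_FLOPS_PER_S, intensity * PEAK_BANDWIDTH_BYTES_PER_S)


def estimated_time_s(flops: float, bytes_moved: float) -> float:
    """Lower-bound kernel time: the slower of the compute and memory phases."""
    compute_time = flops / PEAK_COMPUTE_FLOPS_PER_S
    memory_time = bytes_moved / PEAK_BANDWIDTH_BYTES_PER_S
    return max(compute_time, memory_time)


if __name__ == "__main__":
    print(f"ridge point: {ridge_point():.1f} FLOPs/byte")
    # Hypothetical decode step: read 3 GB of KV cache, perform 100 GFLOPs.
    print(f"estimated time: {estimated_time_s(100e9, 3e9) * 1e3:.3f} ms")
```

With these peaks the ridge point is roughly 193 FLOPs per byte. Workloads below that intensity, such as the hypothetical decode step above, are memory-bound, so bandwidth rather than raw TFLOPS sets their speed. Real kernels rarely sustain either peak exactly, so treat these as upper-bound estimates.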
FlashMLA was inspired by FlashAttention 2 and 3 and by the CUTLASS project. The project incorporates user feedback to improve its efficiency and capabilities, with the goal of fast, efficient decoding on modern NVIDIA hardware.
[Read More](https://github.com/deepseek-ai/FlashMLA)
💬 [HN Comments](https://news.ycombinator.com/item?id=43155023) (82)