FlashMLA is a high-performance MLA (Multi-head Latent Attention) decoding kernel optimized for NVIDIA Hopper GPUs, reaching up to 3000 GB/s in memory-bound configurations and 580 TFLOPS in compute-bound ones. The implementation supports BF16 and a paged KV cache, and requires CUDA 12.3+ and PyTorch 2.0+.

https://github.com/deepseek-ai/FlashMLA
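
For context, the repository documents a Python API built around `get_mla_metadata` and `flash_mla_with_kvcache`. Below is a minimal sketch of one decoding step; the tensor shapes (batch size, head counts, cache length, head dims) are illustrative assumptions, not values stated in this note, and the exact signatures should be checked against the repo's README.

```python
import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# Illustrative decode-step shapes (assumed for this sketch):
# 4 sequences, 1 query token each, 128 query heads, 1 latent KV head,
# head_dim 576, value dim 512, paged KV cache with 64-token blocks.
b, s_q, h_q, h_kv, d, dv = 4, 1, 128, 1, 576, 512
block_size = 64
cache_len = 1024                      # tokens already stored per sequence
blocks_per_seq = cache_len // block_size

q = torch.randn(b, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")
blocked_kv = torch.randn(b * blocks_per_seq, block_size, h_kv, d,
                         dtype=torch.bfloat16, device="cuda")
block_table = torch.arange(b * blocks_per_seq, dtype=torch.int32,
                           device="cuda").view(b, blocks_per_seq)
cache_seqlens = torch.full((b,), cache_len, dtype=torch.int32, device="cuda")

# Scheduling metadata is computed once per decoding step and reused across layers.
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens, s_q * h_q // h_kv, h_kv
)

# One attention call: returns the output and the log-sum-exp of the softmax.
o, lse = flash_mla_with_kvcache(
    q, blocked_kv, block_table, cache_seqlens, dv,
    tile_scheduler_metadata, num_splits, causal=True,
)
print(o.shape)  # expected: (4, 1, 128, 512)
```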

#gpucomputing #mloptimization #cuda #pytorch #performance
