DeepGEMM is a CUDA library for efficient FP8 matrix multiplications (GEMMs) with fine-grained scaling. It supports both normal GEMMs and the grouped GEMMs used in Mixture-of-Experts (MoE) models. The lightweight library matches or exceeds the performance of expert-tuned libraries. It relies on runtime (just-in-time) compilation and optimizations for NVIDIA Hopper tensor cores, while keeping its core kernel to roughly 300 lines.
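The note does not spell out what "fine-grained scaling" means. Below is a rough pure-Python reference sketch, not DeepGEMM's API or kernel. It assumes that every 128-element block along the reduction dimension K gets its own scale factor. That 128-element granularity and all function names here are hypothetical choices for illustration. The sketch rescales each block into the FP8 E4M3 range (largest finite value 448) and approximates E4M3 rounding while ignoring subnormals. Both per-block scales are applied to each block's partial sum, and the partial sums are then accumulated in full precision.

```python
import math

BLOCK_K = 128      # assumed scaling granularity along K (hypothetical choice)
FP8_MAX = 448.0    # largest finite value of the FP8 E4M3 format

def round_e4m3(x):
    """Approximate round-to-nearest-even onto E4M3's 3-bit mantissa (ignores subnormals)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)            # x = m * 2**e with 0.5 <= |m| < 1
    m = round(m * 16) / 16          # keep 4 significant bits (1 implicit + 3 stored)
    return max(-FP8_MAX, min(FP8_MAX, math.ldexp(m, e)))

def quantize_row(row, block=BLOCK_K):
    """Quantize one row, storing one scale per `block` consecutive elements."""
    values, scales = [], []
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        amax = max(abs(v) for v in chunk) or 1.0
        scale = amax / FP8_MAX      # dequantized value = stored value * scale
        scales.append(scale)
        values.extend(round_e4m3(v / scale) for v in chunk)
    return values, scales

def scaled_dot(a_vals, a_scales, b_vals, b_scales, block=BLOCK_K):
    """Dot product that applies both per-block scales to each block's partial sum."""
    total = 0.0
    for i, start in enumerate(range(0, len(a_vals), block)):
        partial = sum(x * y for x, y in zip(a_vals[start:start + block],
                                            b_vals[start:start + block]))
        total += partial * a_scales[i] * b_scales[i]   # full-precision accumulation
    return total

def scaled_gemm(A, B_T, block=BLOCK_K):
    """C = A @ B, with B passed transposed (each row of B_T is a column of B)."""
    qa = [quantize_row(row, block) for row in A]
    qb = [quantize_row(row, block) for row in B_T]
    return [[scaled_dot(a_v, a_s, b_v, b_s, block) for b_v, b_s in qb]
            for a_v, a_s in qa]
```

Because each block has its own scale, a single outlier only coarsens the precision of its own 128 values. With one scale per row or per tensor, it would coarsen every value in that row or tensor. A real GPU kernel stores the scaled values in an actual FP8 type and does this work per tile on tensor cores. This sketch only mirrors the arithmetic.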

https://github.com/deepseek-ai/DeepGEMM

#cuda #gpucomputing #matrixoperations #performanceoptimization #deeplearning
