People wonder why Llama has a knowledge cutoff. The model is trained on 15T tokens and took about 1.3M H100-hours of compute. On a 16k-H100 cluster, that works out to roughly 3.5 days to train from the ground up.
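A quick back-of-the-envelope check of that wall-clock figure, using only the numbers above (a minimal Python sketch; the perfect-utilization assumption is mine, not from the original):

```python
# Wall-clock estimate: total GPU-hours spread across the whole cluster.
# 1.3M H100-hours and the 16k-GPU cluster size come from the text;
# assumes all GPUs run the job continuously with no downtime.

gpu_hours = 1.3e6      # total H100-hours of training compute
num_gpus = 16_000      # H100s in the cluster (assumed fully utilized)

wall_clock_hours = gpu_hours / num_gpus
wall_clock_days = wall_clock_hours / 24

print(f"{wall_clock_hours:.0f} hours ≈ {wall_clock_days:.1f} days")
# -> 81 hours ≈ 3.4 days
```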
Later this year, training should be roughly twice as fast on the same H100s: current runs achieve about 400 TFLOPS per GPU, well below the hardware's peak, so there is real headroom for software improvements to close the gap.
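To see why a 2x software speedup is at least arithmetically plausible, here is a small sketch. The 400 TFLOPS/GPU figure is from the text; the ~989 TFLOPS bf16 dense peak of an H100 SXM is an assumption I am adding for context:

```python
# Headroom estimate: what per-GPU throughput would a 2x speedup require?
# current_tflops is from the text; peak_tflops is an assumed H100 SXM
# bf16 dense peak, used only to express the result as utilization (MFU).

current_tflops = 400.0   # achieved per-GPU throughput today
peak_tflops = 989.0      # assumed H100 SXM bf16 dense peak
target_speedup = 2.0     # the "twice as fast" claim

current_mfu = current_tflops / peak_tflops
required_tflops = current_tflops * target_speedup
required_mfu = required_tflops / peak_tflops

print(f"current MFU ≈ {current_mfu:.0%}, "
      f"2x speedup needs ≈ {required_tflops:.0f} TFLOPS ({required_mfu:.0%} MFU)")
# -> current MFU ≈ 40%, 2x speedup needs ≈ 800 TFLOPS (81% MFU)
```

In other words, the claim amounts to pushing utilization from roughly 40% of peak toward 80%, which is ambitious but not beyond what the hardware allows.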