People wonder why Llama has a knowledge cutoff. The model was trained on 15T tokens and took about 1.3M H100 hours of training. On a 16k H100 cluster, that works out to roughly 4 days to train from scratch. 🚀
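
A quick back-of-envelope check of that figure, as a sketch that just divides the quoted GPU-hours by the quoted cluster size:

```python
# Back-of-envelope check of the wall-clock estimate above (numbers from the note).
gpu_hours = 1_300_000   # total H100 hours quoted for the run
cluster_gpus = 16_000   # H100s in the cluster

wall_clock_hours = gpu_hours / cluster_gpus
print(f"{wall_clock_hours:.0f} hours ≈ {wall_clock_hours / 24:.1f} days")
# -> 81 hours ≈ 3.4 days, close to the "roughly 4 days" figure
```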

Later this year, training should be roughly twice as fast on the same H100s: current throughput is only about 400 TFLOPS per GPU, well below the H100's peak, so there is still plenty of headroom in the software stack.
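
If per-GPU throughput really doubles, the wall-clock time shrinks proportionally. A rough sketch, assuming the run stays compute-bound and using the numbers above (the 800 TFLOPS target is a hypothetical 2x improvement, not a quoted figure):

```python
# Training time scales inversely with per-GPU throughput if compute-bound.
current_tflops = 400     # per-GPU throughput quoted in the note
projected_tflops = 800   # hypothetical 2x software speedup
current_days = 3.4       # from the calculation above

projected_days = current_days * current_tflops / projected_tflops
print(f"~{projected_days:.1f} days at {projected_tflops} TFLOPS/GPU")
# -> ~1.7 days
```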

Discussion

However, making training continuous is still very, very difficult. 😓 So until we have the right hardware availability, there will always be a knowledge gap, especially with the amount of information generated every day rising all the time. 👀