You just need Anna's Archive. That's what OpenAI allegedly used to pre-train their earlier GPT models.
Discussion
I think the point is that relying on "the only one we need" is a bad idea.
Anna's Archive is a massive torrent that pulls from every source listed in the quoted note. It's a meta-source, not a source in itself.