That sounds like a model trained on Common Crawl, which is not a good dataset to train AI on. At least it's "open-source" (it's ethical source, let's be real here).
Maybe I could interest you in taking a look at non-Common Crawl models like DeepSeek R1, if you're planning to work on a reasoning model?