This feels like an ‘in two to four years, maybe’ idea. It’s a logical idea, but it feels way too early. Could be me, and I hope I’m wrong, but there are a lot more ‘tricky’ parts than just weeding out bad training data.
There’s a new alternative to DiLoCo [1] for training large-scale AI models over the internet called DisTrO [2]. It enables low-latency training over low-bandwidth communication channels (i.e., slow internet).
Methods like these are a crucial component of any decentralized AI system that hopes to rival big tech companies and nation-state actors.
The next step is to figure out monetary rewards for contributing to training and inference. The tricky part is to weed out bad training data in a decentralized way. Perhaps we could use something like a “mempool” for training data batches?
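To make the mempool idea concrete, here’s a minimal sketch in Python (all names and the quorum scheme are my assumptions, not anything from the DisTrO report): contributors submit content-addressed batches to a pending pool, validators vote on them, and only batches that reach a quorum are released for training.

    import hashlib
    from dataclasses import dataclass, field

    # Hypothetical "mempool" for training data batches: batches sit in a
    # pending pool until enough validators approve them.

    @dataclass
    class PendingBatch:
        batch_id: str    # content hash of the batch
        data: bytes      # serialized training examples
        approvals: set = field(default_factory=set)
        rejections: set = field(default_factory=set)

    class DataMempool:
        def __init__(self, quorum: int = 3):
            self.quorum = quorum
            self.pending: dict[str, PendingBatch] = {}

        def submit(self, data: bytes) -> str:
            """Add a batch to the pool, keyed by its content hash."""
            batch_id = hashlib.sha256(data).hexdigest()
            self.pending.setdefault(batch_id, PendingBatch(batch_id, data))
            return batch_id

        def vote(self, batch_id: str, validator_id: str, approve: bool) -> None:
            batch = self.pending[batch_id]
            (batch.approvals if approve else batch.rejections).add(validator_id)

        def drain_approved(self) -> list[PendingBatch]:
            """Return batches that reached quorum and drop them from the pool."""
            ready = [b for b in self.pending.values()
                     if len(b.approvals) >= self.quorum]
            for b in ready:
                del self.pending[b.batch_id]
            return ready

The open questions are what a vote actually checks (e.g., loss on a held-out set, dedup filters) and how validators are incentivized not to collude.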
1. https://arxiv.org/abs/2311.08105
2. https://github.com/NousResearch/DisTrO/blob/main/A_Preliminary_Report_on_DisTrO.pdf (PDF)
#ai #llm
Discussion
You’re right; it will take more than just data prep and likely more than 4 years to mature.
For example, we need an architecture that allows for versioning of components, auditability, benchmarking for regression testing, and so on. It also needs verifiable outputs, so we can prove an output was generated without tampering.
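To illustrate the verifiable-outputs point, here’s a rough sketch (the hash-and-MAC scheme and all names are my assumptions, not an established protocol): bind each output to hashes of the exact model weights and input, then authenticate the record so tampering with any field is detectable.

    import hashlib
    import hmac

    # Hypothetical audit record for one inference call: hash the model
    # weights, the input, and the output, then MAC the record so any
    # tampering is detectable by a holder of the key.

    def audit_record(weights: bytes, prompt: str, output: str, key: bytes) -> dict:
        record = {
            "model_hash": hashlib.sha256(weights).hexdigest(),
            "input_hash": hashlib.sha256(prompt.encode()).hexdigest(),
            "output_hash": hashlib.sha256(output.encode()).hexdigest(),
        }
        payload = "|".join(record[k] for k in sorted(record))
        record["mac"] = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
        return record

    def verify(record: dict, key: bytes) -> bool:
        keys = sorted(k for k in record if k != "mac")
        payload = "|".join(record[k] for k in keys)
        expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
        return hmac.compare_digest(record["mac"], expected)

Note this only makes tampering detectable after the fact; actually proving the computation itself ran untampered (e.g., via zero-knowledge proofs or trusted hardware) is a much harder problem.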
And there are supplementary components like vector data stores, so that users can store context for long-running tasks.
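For the vector-store piece, the core mechanism is small; here’s a toy in-memory version (purely illustrative): store (embedding, text) pairs and return the entries nearest a query embedding by cosine similarity.

    import math

    # Toy in-memory vector store: keep (embedding, text) pairs and
    # return the stored texts most similar to a query embedding.

    class VectorStore:
        def __init__(self):
            self.entries: list[tuple[list[float], str]] = []

        def add(self, embedding: list[float], text: str) -> None:
            self.entries.append((embedding, text))

        def search(self, query: list[float], k: int = 3) -> list[str]:
            def cosine(a: list[float], b: list[float]) -> float:
                dot = sum(x * y for x, y in zip(a, b))
                na = math.sqrt(sum(x * x for x in a))
                nb = math.sqrt(sum(y * y for y in b))
                return dot / (na * nb) if na and nb else 0.0

            ranked = sorted(self.entries,
                            key=lambda e: cosine(query, e[0]),
                            reverse=True)
            return [text for _, text in ranked[:k]]

The hard part in a decentralized setting is, again, trust: who hosts the store, and how users verify that the context they get back is what they put in.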