Interested in helping with that, but it's very important to have transparency about the datasets used to train such an LLM.
(Thinking aloud here).
Otherwise we create a new, centralised and opaque point of failure / target for control.
Actually even with transparency we have that problem.
Perhaps if we used only posters' own #hashtags as categories / labels, then everyone could contribute naturally and unobtrusively?
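Rough sketch of what I mean by hashtags-as-labels, just to make it concrete. Everything here is hypothetical (the function names, the regex, treating a post as a plain string); it's not tied to any particular client or protocol:

```python
import re
from collections import Counter

# Assumption: a hashtag is '#' followed by word characters, not glued to a preceding word.
HASHTAG = re.compile(r"(?<!\w)#(\w+)")

def labels_from_post(text):
    """Return the post's own hashtags, lowercased and de-duplicated, in first-seen order."""
    return list(dict.fromkeys(tag.lower() for tag in HASHTAG.findall(text)))

def label_counts(posts):
    """Count how many posts carry each label (hypothetical helper)."""
    counts = Counter()
    for post in posts:
        counts.update(labels_from_post(post))
    return counts
```

The only labels are the ones posters chose themselves, so no third party gets to decide what a post "is".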
Sentiment analysis is deceptively fraught too - imagine the incentives to label every "Political Party X"-aligned post as "angry" or "hysterical" and every "Political Party Y"-aligned post as "funny" or "upbeat".
These decisions are a Big F---ing Deal in fiatland social media for valid reasons...