Interested in helping with that, but it's very important to have transparency on the datasets used to train such an LLM.

(Thinking aloud here).

Otherwise we create a new, centralised and opaque point of failure / target for control.

Actually even with transparency we have that problem.

Perhaps if we used only posters' #hashtags as categories / labels, then everyone could contribute naturally and unobtrusively?
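A rough sketch of what that could look like - purely illustrative, with made-up posts and an arbitrary frequency cutoff - just counting the hashtags people already use and promoting the common ones to candidate labels:

```python
# Toy sketch: treat posters' own #hashtags as weak topic labels.
# The example posts and the min_count cutoff are invented.
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")

def hashtag_labels(posts: list[str], min_count: int = 2) -> Counter:
    """Count hashtags across posts; the frequent ones become candidate labels."""
    counts = Counter()
    for text in posts:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return Counter({tag: n for tag, n in counts.items() if n >= min_count})

print(hashtag_labels([
    "new zap flow shipped #nostr #devlog",
    "hash functions explained #sha256 #cryptography",
    "relay benchmarks on a raspberry pi #nostr",
]))
```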

Sentiment analysis is deceptively fraught too - imagine the incentives to label every "Political Party X"-aligned post as "angry" and "hysterical", and every "Political Party Y" post as "funny" and "upbeat".

These decisions are a Big F---ing Deal in fiatland social media for valid reasons...


Discussion

Anyone can publish labels. You can choose to follow the bot you like.

As a client developer, how would you want to integrate such a service?

It could be made lightweight enough to run in a client: spaCy's en_core_web_sm with a multi-label text classification layer on top.

That would use more CPU, but much less bandwidth and storage.
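For the record, a minimal sketch of that idea with spaCy's multi-label text categorizer - a blank pipeline, toy labels and two made-up training notes, just to show what the client-side calls look like. The real thing would presumably sit on en_core_web_sm with proper training data:

```python
import spacy
from spacy.training import Example

# Blank English pipeline + multi-label classifier; labels and the two
# training notes below are placeholders, not an agreed taxonomy.
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat_multilabel")
for label in ("science", "bitcoin", "art"):
    textcat.add_label(label)

train = [
    ("block subsidy halves again next cycle",
     {"bitcoin": 1.0, "science": 0.0, "art": 0.0}),
    ("new paper on protein folding models",
     {"science": 1.0, "bitcoin": 0.0, "art": 0.0}),
]
examples = [Example.from_dict(nlp.make_doc(text), {"cats": cats})
            for text, cats in train]

nlp.initialize(lambda: examples)
for _ in range(20):              # a real model needs far more data and epochs
    nlp.update(examples)

# Scoring a downloaded note is then a single call in the client:
doc = nlp("halving countdown memes incoming")
print({label: round(score, 2) for label, score in doc.cats.items()})
```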

Many bots separately labeling every post sounds like a packet storm waiting to happen...

Mostly on topic selection. I'd like to show the available labels to the user and help them make new feeds for the things they like. The bot tags these posts, and Amethyst just downloads the tags for a given topic.
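To make that flow concrete, here's a hedged sketch assuming NIP-32-style label events (kind 1985). The "app.topics" namespace, the topic names, and the placeholder ids are invented, and signing is left out entirely:

```python
import json
import time

# Sketch: a labeler bot emits a NIP-32-style label event pointing at a note,
# and a client pulls labels for the topic it cares about. The namespace,
# topics and ids below are made up; the event is unsigned.
def topic_label_event(bot_pubkey: str, note_id: str, topics: list[str]) -> dict:
    tags = [["L", "app.topics"]]                      # label namespace
    tags += [["l", topic, "app.topics"] for topic in topics]
    tags.append(["e", note_id])                       # the post being labeled
    return {
        "kind": 1985,
        "pubkey": bot_pubkey,
        "created_at": int(time.time()),
        "tags": tags,
        "content": "",
    }

print(json.dumps(topic_label_event("<bot-pubkey>", "<note-id>",
                                   ["science", "sha256"]), indent=2))

# Roughly what a client would ask a relay for to build a "science" feed
# from a bot the user follows (a REQ filter on the label tag):
science_filter = {"kinds": [1985], "authors": ["<bot-pubkey>"], "#l": ["science"]}
```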

Makes sense!

Next annoying question:

Which topics do new users / all users want available to select? How many is too many?

I know you run a "new user" questionnaire - how many responses have you been getting, and do you think they're close to representative of the userbase now / in the future?

There are lots of topics (millions?) in the replies. Most people want a mix, from very general things like "science" when they don't know much about it, all the way to specific things like "sha256" when they are in that field. But we can make it work with whatever the bot outputs.

Maybe the bot can add the 5 most representative labels for each post?
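Something like this, presumably - keep the top-k scores and drop anything below a floor (both numbers arbitrary here, and the scores are invented):

```python
# Toy sketch of "5 most representative labels per post".
def top_labels(scores: dict[str, float], k: int = 5, floor: float = 0.3) -> list[str]:
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [label for label, score in ranked[:k] if score >= floor]

print(top_labels({"science": 0.91, "sha256": 0.84, "bitcoin": 0.62,
                  "sports": 0.40, "politics": 0.35, "art": 0.05}))
# -> ['science', 'sha256', 'bitcoin', 'sports', 'politics']
```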

Offering many very specific labels - a vector database like Weaviate or FAISS coupled with an LLM can do it and do it well, but man... We are talking datacenter level of resources here.

Not many organisations can offer that, and I'm not sure I want them curating my feed.

Offering a couple of dozen general categories, though - that we could run in the client or on a modest VPS, maybe with a BM25 fulltext search for specific terms (which will be slow).
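For the BM25 half, a toy sketch with the rank_bm25 package over a handful of made-up local notes - it's a linear scan per query, which is exactly why it will be slow at scale:

```python
from rank_bm25 import BM25Okapi

# Notes are invented; in practice this would run over the client's local store.
notes = [
    "sha256 midstate optimisation in the new miner firmware",
    "weekend hike photos, no caption needed",
    "comparing blake3 and sha256 for content addressing",
]
bm25 = BM25Okapi([note.lower().split() for note in notes])

scores = bm25.get_scores("sha256".split())
best_first = sorted(zip(scores, notes), reverse=True)
print(best_first[0])   # highest-scoring note for the query
```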

General categories we can do with a BoW filter fed forward into a modest CNN.
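As a strawman for the general-category side, here's roughly what that could look like in PyTorch - token ids into an embedding, a narrow 1-D convolution, and per-label logits; the vocabulary size, widths, and label count are placeholder numbers, not anything agreed above:

```python
import torch
import torch.nn as nn

# Placeholder sizes throughout; nothing here is an agreed spec.
class TinyTopicCNN(nn.Module):
    def __init__(self, vocab_size: int = 20_000, embed_dim: int = 64,
                 n_labels: int = 24):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.conv = nn.Conv1d(embed_dim, 128, kernel_size=3, padding=1)
        self.out = nn.Linear(128, n_labels)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids).transpose(1, 2)    # (batch, embed, seq)
        x = torch.relu(self.conv(x)).max(dim=2).values
        return self.out(x)                           # one logit per label

model = TinyTopicCNN()
logits = model(torch.randint(1, 20_000, (8, 40)))    # 8 notes, 40 tokens each
probs = torch.sigmoid(logits)                         # multi-label scores
print(probs.shape)                                    # torch.Size([8, 24])
```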

Specific categories need a serious LLM and a vector database of context, or else accuracy will be hilarious.

We can run multiple bots, each using a different pubkey. People will decide to follow whatever works best for them. We could have multiple algorithms running in parallel.

We could. There will be a limited number of actors able to finance such a service, however.

If I may make a suggestion, we could run the model in the client, using notes already downloaded.

Default topic model downloaded on first run, or bundled with the app (~50 MB).

Menu of topic models somewhere in settings:

- L.I.V's Mad Science Topic Model

- Leserin's Overthinking Everyday Topic Model

- Onyx's I Know What Boys Like topic model

Etc.

Building a model is less of a commitment than hosting one, and the processing is offloaded to the client instead of relying on angel funding or whatever.

You can also put the labels behind a private relay only paying customers can access.

That works, too.