Nostr Web Client

Replying to

someone

RLNF: Reinforcement Learning from Nostr Feedback

We ask a question to two different LLMs.

We let nostriches vote on which answer is better.

We reuse the feedback for further fine-tuning of the LLM.

We zap the nostriches.

AI gets super wise.

Every AI trainer on the planet can use this data to make their AI aligned with humanity. AHA succeeds.

Thoughts?

John Dee 10mo ago

Voting and zapping need to be clarified; there are lots of ways to game the system. I've seen that already on nostr with polls and zap polls.

Discussion

someone 10mo ago

A bot like nostr:nprofile1qy2hwumn8ghj7un9d3shjtnyv9kh2uewd9hj7qgwwaehxw309ahx7uewd3hkctcpzdmhxue69uhhqatjwpkx2urpvuhx2ue0qyg8wumn8ghj7efwdehhxtnvdakz7qgswaehxw309ahx7um5wghx6mmd9uq35amnwvaz7tmwdaehgu3wd3jkxar4wf5kv7fwdejhgtcpzpmhxue69uhkumewwd68ytnrwghsz9mhwden5te0v96xcctn9ehx7um5wghxcctwvshsz9mhwden5te0dehhxarj9e4k7umddaejummjvuhszxmhwden5te0dehhxarj9e3xjarrda5kutfjxyhx7un89uq3uamnwvaz7tmwdaehgu3wd4shs6tdv93kjarpv3jkctn0wfnj7qgawaehxw309ahx7um5wghxy6t5vdhkjmn9wgh8xmmrd9skctcpzamhxue69uhky6t5vdhkjmn9wgh8xmmrd9skctcpz3mhxue69uhkummnw3ezu6t5v9ejumrf9uq52amnwvaz7tejvuex57nrvenhzdtvvdexxet4wyerxmrd09snyerjd5ekkaf4w9khz6tdwgekyan4xdsk6mmvx56hv6tyvd68ycty9ehku6t0dchsz9mhwden5te0d4kx26m49ehx7um5wgcjucm0d5hsz8thwden5te0dehhxarj94ex2mrp0yh8wmrkwvh8xurpvdjj7qgkwaehxw309ahx7um5wghx66tvda6jumr0dshsz9mhwden5te0dehhxarj9enk2ar8d3jjummjvuhszxthwden5te0xgckjer9v9ejumn0wd68yvfwvdhk6tcqyrzl4h44myxk3whlccc52ks8eg6qenc7xygsj40pd4z7khu8z37dj9twxmh presents a question and two answers (1, 2) on the same note.

Voting: Nostriches will say 1 or 2 and a reason why they chose that.

Zapping: A human zaps the nostriches based on how much work they put into the reply: a smaller zap for less effort, a larger one for more effort.

RLNF: A human writes a script to count the votes (and maybe adjust weights by web of trust) and convert them into a dataset for fine-tuning.

What do you think of this design?
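
The vote-counting step described above could be sketched roughly as follows. This is only a minimal illustration, not an actual implementation from the thread. It assumes replies arrive as hypothetical (pubkey, text) pairs, that a vote is a reply starting with "1" or "2", that web-of-trust weights come from a plain dictionary, and that the fine-tuning dataset uses a prompt/chosen/rejected record layout. All function and field names here are made up for the example.

```python
import re

# Assumption: a reply counts as a vote only if it starts with a lone "1" or "2".
VOTE_RE = re.compile(r"^\s*([12])\b")

def parse_vote(text):
    """Return 1 or 2 if the reply starts with that digit, else None."""
    match = VOTE_RE.match(text)
    return int(match.group(1)) if match else None

def tally(replies, trust=None, default_weight=1.0):
    """Sum weighted votes; each pubkey counts once (its first parseable reply).

    `trust` is a hypothetical web-of-trust mapping of pubkey -> weight.
    """
    trust = trust or {}
    totals = {1: 0.0, 2: 0.0}
    seen = set()
    for pubkey, text in replies:
        vote = parse_vote(text)
        if vote is None or pubkey in seen:
            continue
        seen.add(pubkey)
        totals[vote] += trust.get(pubkey, default_weight)
    return totals

def to_preference_record(question, answer1, answer2, totals):
    """Build a prompt/chosen/rejected record, or None if the vote is tied."""
    if totals[1] == totals[2]:
        return None
    if totals[1] > totals[2]:
        chosen, rejected = answer1, answer2
    else:
        chosen, rejected = answer2, answer1
    return {"prompt": question, "chosen": chosen, "rejected": rejected}
```

Replies that do not start with "1" or "2" (for example "I prefer the second one") are simply dropped in this sketch, so a stricter reply format or a smarter parser would be needed in practice.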

John Dee 10mo ago

I'm not sure how well it would scale. The human evaluating and zapping seems like a bottleneck and source of bias. Counting votes from a free-form reply with a script could be messy too. Is this intended to be done through regular kind 1 notes and replies?

someone 10mo ago

yes kind 1

John Dee 10mo ago

The best way to find out how it will go wrong is to try it. Go for it!

someone 10mo ago

so you are certain it will go wrong. but how much, depends on my execution? :)

John Dee 10mo ago

Hah, not at all. It sounds like a good idea, but I always think too much about how an idea won't work, and then never try it.

someone 10mo ago

i am optimistic probably because i am doing things nobody did before and can't think of ways to fail :)