Paper titles in AI have gotten so stupid. “Self-Rewarding Language Models” is just a more online version of iterative DPO, which the authors say themselves in the second paragraph.

What “self” are we talking about? It’s still being trained WITH HUMAN PREFERENCE DATA. Ugh.

Otherwise the paper seems decent if you ignore the clickbait title.

https://arxiv.org/abs/2401.10020
