GPT-4o’s Chinese token-training data is polluted by spam and porn websites

Soon after OpenAI released GPT-4o on Monday, May 13, some Chinese speakers started to notice something seemed off about this newest version of the chatbot: the tokens it uses to parse text were full of spam and porn phrases. On May 14, Tianle Cai, a PhD student at Princeton University studying inference efficiency in large…

https://www.technologyreview.com/2024/05/17/1092649/gpt-4o-chinese-token-polluted/
