If DeepSeek did reinforcement learning over chain-of-thought reasoning to train R1… and AlphaGo used reinforcement learning to find superhuman strategies in Go… maybe scaling up reinforcement learning on chain-of-thought reasoning will get us closer to superhuman reasoning and, dare I say, AGI? Feels like we're at the beginning of something huge.