I did run these locally. They just can't understand conversation shifts. To be fair, not even Grok 3 did that well in my latest test. All I did was ask an easy question, "how many u in the word strawberry?", and follow it up with a joke question: "Isn't there a double u? So should it be 2?"
Any human with half a brain would have realized I was making a stupid joke. I don't really expect an AI to catch that, but I do expect it to understand what happened after I point it out. gpt-oss just doubles down and makes tables about how you are wrong.