We could hypothesize about how capable an agent system is all day long. I'm just going to build with it and let others judge how well it works.
Discussion
That’s fair; building is the ultimate test. But if the architecture lacks the primitives, then what you’re evaluating isn’t the agent’s reasoning capacity; it’s how well you can paper over those limits with scaffolding, retries, heuristics, and human feedback.
And I agree: it can look impressive. Many of us have built loops that do surprising things. But when they fail, they fail like someone lost in a maze with no map: no rollback, no blame assignment, no introspection, just a soft collapse into token drift.
So yes, build it. Just don’t mistake clever orchestration for capability. And when it breaks, remember why: stateless inference has no recursion, memory, or accountability.
I hope you do build something great — and I’ll be watching. But if the agents start hallucinating and spinning in circles, I won’t say, “I told you so.” I’ll ask if your debugger is getting tired and remind you that you still can’t scale the human.
Thinking about your position, we might not be meeting in the same place. My expectation isn't a system that grants wishes, but one that amplifies capabilities a hundred- or a thousand-fold. It's funny to me when Replit deletes someone's production database, because *that's how software works*. If you already know this, you know to build separate environments and authorization. Does the freelance contractor write poor, lazy code? Of course it does: that's why you review the code. But you can still use a different freelance contractor to review it, if you know how to ask the right questions.
Vibe coding is the closest thing we have to rocket surgery. It's both incredible and terrible, and it's your job to captain the ship accordingly 🌊
Totally with you on captaining the ship. I’d never argue against using LLMs as amplifiers — they’re astonishing in the right hands, and yes, it’s our job to chart around the rocks. But that’s the thing: if we’re steering, supervising, checkpointing, and debugging, then we’re not talking about autonomous reasoning agents. We’re talking about a very talented, very unreliable deckhand.
This brings us back gently to where this all started: can vibe coders reason? If your answer now is “not exactly, but they can help you move faster if you already know where you’re going,” maybe we’ve converged. Because that’s all I was ever arguing.
You don’t scale reasoning by throwing tokens at it. You scale vibes. And someone still has to read the logs, reroute the stack, and fix the hull mid-sail.
Where I was going with "more tokens" is growing past zero-shot expectations. I see models reasoning every day, so to say that they can't reason is the wrong path. But the gold standard of "general intelligence" isn't good at writing software either. You wouldn't expect a junior dev to one-shot a React app, or hot patch a bug in production. You need more process, more analysis, more constraint in order to build good things. In life we call these dev-hours, but in this new reality they're called tokens. Doing something difficult will require a certain amount of effort. That investment is not sufficient, but it is necessary. Vibe coders who have never written software before won't understand what needs to be done, and where it needs to be done, in order to achieve the success that they're looking for. But models are getting better every month now. By my estimation it won't be long before they are better at captaining than we are. If so, vibe coding will become a reality – and even if we aren't there today, it will take us longer to understand how to use these tools than it will for the tools to become useful.
onward 🌊
Glad we’re converging, because that’s the heart of it: we agree on amplification but differ on the mechanics. Initially your stance was stronger, claiming that these models were actively reasoning and recursing internally, escaping local maxima through real inference. Now we seem to agree they’re powerful tools that amplify our capabilities, rather than autonomous reasoners.
My original point wasn’t that LLMs are ineffective; it was just that more tokens alone don’t yield reasoning. Amplification is profound but fundamentally different from real autonomous recursion or stable reasoning. The model’s architecture still lacks structured state, introspection, and genuine memory management.
I agree, though—these tools are moving quickly. Maybe they’ll soon surprise us both, and vibe coding might become rocket surgery. Until then, I’m happy sailing alongside you, captaining through the chaos and figuring it out as we go. 🌊
No, I'm still making that claim.
Is context not memory? Have you not seen a model collect information, make a plan, begin to implement it, find that something doesn't work, design an experiment, use the results to rewrite the plan, and then execute the new plan successfully? Is this somehow not "reasoning to avoid a local maximum"?
Context as memory? Not quite. Memory isn’t just recalling tokens; it’s about managing evolving state. A context window is a fixed-length tape: once it fills, whatever falls off the end is simply gone. There’s no indexing, no selective recall, no structured management. The fact that you have to constantly restate the entire history of the plan at every step isn’t memory; it’s destructive serialization. Actual memory would be mutable, composable, persistent, and structurally addressable. Transformers have none of these traits.
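To make that concrete, here’s a minimal sketch of the difference, with entirely made-up class names rather than any real framework’s API: an append-and-truncate window versus a store you can actually address and update in place.

```python
# Minimal sketch (hypothetical classes) contrasting a context window with
# addressable memory. Illustrative only, not any real framework.

class ContextWindow:
    """Fixed-length tape: content is appended, and the oldest tokens fall off."""
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.tokens: list[str] = []

    def append(self, new_tokens: list[str]) -> None:
        self.tokens.extend(new_tokens)
        # Truncation is the only "management": whatever falls off is simply lost.
        self.tokens = self.tokens[-self.max_tokens:]


class MemoryStore:
    """Addressable, mutable, persistent state keyed by name."""
    def __init__(self):
        self.slots: dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        # Update one slot in place; no need to restate the whole history.
        self.slots[key] = value

    def read(self, key: str) -> str | None:
        # Selective recall by key, independent of when it was written.
        return self.slots.get(key)
```

The first structure can only grow and forget; the second can be read and rewritten selectively. That is roughly the gap I mean.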
Models appear to “collect information, plan, and revise”—but what’s happening there? Each new prompt round is a complete regeneration, guided by external orchestration, heuristics, or human mediation. The model itself does not understand failure, doesn’t inspect past states selectively, and doesn’t reflectively learn from error. It blindly restarts each cycle. The human (or the scaffold) chooses what the model sees next.
Avoiding local maxima? Not really. The model doesn’t even know it’s searching. It has no global evaluation function, no gradient, and no backtracking. It has only next-token probabilities based on pretrained statistics. “Local maxima” implies a structured space that the model understands. It doesn’t—it’s just sampling plausible completions based on your curated trace.
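If it helps, here’s what “sampling plausible completions” amounts to mechanically, as a toy sketch with made-up logits and no real model attached:

```python
import math
import random

def sample_next_token(logits: dict[str, float], temperature: float = 1.0) -> str:
    """Pick one token from a toy next-token distribution."""
    # Softmax with temperature over the candidate tokens.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())
    exps = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    # Weighted random choice: "plausible" just means high probability here.
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

# Toy call with invented logits; nothing evaluates where the choice leads.
print(sample_next_token({"fix": 2.1, "rewrite": 1.7, "delete": -0.5}))
```

Nothing in that picture scores whether the chosen token moves the overall solution anywhere; there is only a distribution over the immediate next step.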
Can it seem like reasoning? Sure, but only when you’ve done the hard part (memory, scaffolding, rollback, introspection) outside the model. You see reasoning in the glue code and structure you built, not in the model itself.
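Here’s roughly what that glue code looks like, sketched with placeholder names rather than any particular framework’s API; the retries, the rollback point, and the choice of what the model sees next all live outside the stateless call:

```python
# Hypothetical scaffold around a stateless completion function. The
# "reasoning loop" lives in this glue code, not inside the model.

def run_step(call_model, task, history, validate, max_retries=3):
    """call_model(prompt) -> str is assumed to be pure and stateless."""
    checkpoint = list(history)                        # rollback point kept by us
    for attempt in range(max_retries):
        prompt = summarize(checkpoint) + "\n" + task  # we choose what it sees
        output = call_model(prompt)
        ok, feedback = validate(output)               # we judge success, not the model
        if ok:
            checkpoint.append(output)                 # we decide what is remembered
            return output, checkpoint
        checkpoint.append(f"Attempt {attempt + 1} failed: {feedback}")
    return None, checkpoint                           # we decide when to give up

def summarize(history):
    # Placeholder: trim prior turns so they fit back into the window.
    return "\n".join(history[-10:])
```

Every judgment in that loop, including what counts as success and what gets remembered, belongs to the scaffold.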
So yes, you’re still making the claim, but I still see no evidence of autonomous recursion, genuine stateful memory, or introspective reasoning. Context ≠ memory. Iteration ≠ recursion. Sampling ≠ structured search. And tokens ≠ dev-hours.
But as always, I’m excited to see you build something compelling—and maybe even prove me wrong. Until then, I remain skeptical: a context window isn’t memory, and your best debugger still doesn’t scale.
Ah, the root might be that I'm only considering models in an agent loop like goose. You're right that each inference is highly constrained. There is no memory access inside a single response. There is no (well, very limited) backtracking or external access right now. What a model spits out in one go is rather unimpressive.
But in a series of turns, like a conversation or agent loop, there is interesting emergent behavior. Context becomes memory of previous turns. Tool use becomes a means toward an end, and potentially a source of new information. If models were stochastic parrots, this might on rare occasion result in new value, but there seems to be much more going on inside these systems, and tool use (or conversational turns) *often* results in new value in what I can only conceive of as reasoning.
Goose can curate its own memories. It can continue taking turns until it has a question, or decides the task is complete. It can look things up on the web, or write throwaway code to test a theory. Most of the time when things fail, it's because expectations were not set accordingly, or the structure of the system didn't provide the resources necessary for success. This is why I ask: what if the problem is that people aren't using enough tokens?
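For anyone who hasn't watched a loop like that run, the shape is roughly this (a sketch of the pattern, not Goose's actual code): each turn is one inference, and the reply either calls a tool, asks a question, or declares the task done.

```python
# Assumed shape of an agent loop; illustrative, not Goose's real implementation.

def agent_loop(call_model, tools, task, max_turns=50):
    """call_model(messages) -> dict with 'type', 'content', etc. is an assumption."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = call_model(messages)                  # one stateless inference per turn
        messages.append({"role": "assistant", "content": reply["content"]})
        if reply["type"] == "tool_call":
            # Tool output feeds new information into the next turn's context.
            result = tools[reply["tool"]](reply["args"])
            messages.append({"role": "tool", "content": str(result)})
        elif reply["type"] == "question":
            return ("needs_input", reply["content"])  # hand control back to the human
        elif reply["type"] == "done":
            return ("complete", reply["content"])
    return ("gave_up", None)
```

The interesting part is that tool results flow back into the next turn's context, which is where the genuinely new information comes from.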
In long conversations with Claude I have seen all manner of outputs that suggest capabilities which far exceed what people claim LLMs "can do". (Well, what most people claim, because there are some people who go straight out the other end and Eliza themselves.)
What concerns me the most is that these capabilities continue to grow, and almost no one seems to notice. It's like the closer someone is to the systems, the more they think that they understand how they work. The truth is that these are (to use Wolfram's terminology) randomly mining the computational space, and the resulting system is irreducible. Which is to say, no one has any idea what's going on in those hidden layers. Anything that *could* happen *might* happen.
The only way to know is to experiment with them – and my informal experiments suggest that they're already AGI (albeit one with amnesia and no inherent agency).
Wherever this is all going, it's moving quickly. Stay buoyant 🌊
Glad we’ve arrived at a similar perspective now—it feels like progress. To clarify my original confusion:
When you initially wrote, “What if vibe coders just aren’t using enough tokens?”, you seemed to imply that tokens alone—without mentioning loops, scaffolding, external memory, or agent orchestration—would inherently unlock genuine reasoning and recursion inside transformers.
We're perfectly aligned if your real point always included external loops, scaffolding, or agent architectures like Goose (rather than just “tokens alone”). But I definitely didn’t get that from your first post, given its explicit wording. Thanks for explicitly clarifying your stance here.
Working with LLMs has given me a first-class notion of context. It's a strange new idea to me that's also changed how I approach conversations.
Our expectations around an agent loop do seem to be the root of it. Do people vibe code without such a thing, though? I'll admit that I'm spoiled: since I started using goose over 18 months ago, I never bothered to try the other popular things that are more than Copilot and less than goose, like Cursor.
That is fair, and I think you’re touching exactly on the heart of the issue here.
Your recent experiences with Goose and these richer agent loops highlight what I pointed out: it’s not the quantity of tokens alone that unlocks genuine reasoning and recursion. Instead, reasoning emerges from loops, external memory, scaffolding, and orchestration—precisely as you implicitly acknowledge here by talking about agent loops as a requirement, rather than a luxury.
I appreciate that you’ve implicitly clarified this:
“Tokens alone” aren’t the root solution; structured loops and scaffolding around the transformer architecture are.
Thanks for a thoughtful conversation! It genuinely feels like we’ve arrived at the correct conclusion.