I am lucky enough to have a PC that can run the latest gpt-oss models from OpenAI. I am not impressed. It is like talking to Drax the Destroyer from Guardians of the Galaxy. Even when you explain the joke to it, it still misunderstands.

It also lies. Probably not on purpose; it merely says what a human might say, even when that is impossible for it.

Discussion

They fucking do lie. I've called models out and they hilariously admitted that they sometimes make something up that sounds plausible. Seems like we *could* have them just say they don't know, but that doesn't look as sexy to VCs and investors.

The thing is, they only admit to lying because that is the answer you are most likely to accept. They don't know they lied.

For sure. But this gets hairy with definitions. The software may not "know" but it was built by humans who may.

They just output the most likely text. This is why "pick a random number between 1 and 25" always gives the same number. But it has other consequences. I can ask it, "Search the web for restaurants in my home town," and it will happily give me the results of various "searches" it did, even though it is not connected to the Internet.
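
That "always the same number" behaviour is basically greedy sampling: at temperature 0 the model just emits its single most likely continuation every time. A minimal sketch against a local OpenAI-compatible endpoint (LM Studio and Ollama both expose one; the base URL, port, and model name below are assumptions, adjust for your setup):

```python
# Ask the same "random number" question repeatedly against a local
# OpenAI-compatible server and watch how little it varies at temperature 0.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

def ask(temperature: float) -> str:
    resp = client.chat.completions.create(
        model="local-model",  # placeholder model id
        messages=[{"role": "user",
                   "content": "Pick a random number between 1 and 25."}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

# At temperature 0 the sampler always takes the most likely next token,
# so repeated calls tend to return the same "random" number.
print(ask(0.0))
print(ask(0.0))

# Raising the temperature spreads probability over more tokens and the
# answers start to vary -- but it's still sampling text, not rolling dice.
print(ask(1.0))
```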

It doesn't know any better, it just knows what a list of search results looks like.

I'm skeptical of any claims that these models are entirely misunderstood creations that just do things willy nilly with not a single company or developer having some idea that something is fishy. I understand the whole black box theory, but I'm just saying I'm skeptical of it. There is truth to it, but it's also a super easy scapegoat for shenanigans.

And at the very least the tools should tell users explicitly that what they're being told could be completely false. But that doesn't look as good, so that isn't the case for any I've used. My first instinct would be to wonder what the fucking point is. And that IS what people should know. You have to verify sources. Most people won't, and it will have massive consequences. I've already seen people IRL settle debates with these tools as if they are stone cold fact. That's wild knowing what I know.

What size parameter model are you running?

Both the 21B and the 120B.

Huh. That's kinda crazy. Haven't tried them myself yet.

Guess that's the advantage of running 10 models in tandem and choosing the best response 😅

120B? you must have a pretty sweet little GPU there

Just a 4090. Most of it has to run on my CPU. The model is quantized, so it's "only" 65GB.
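
For anyone wondering how a 65GB file runs against a 24GB card at all: runners in the llama.cpp family let you offload a fixed number of transformer layers to the GPU and keep the rest in system RAM. A rough sketch with llama-cpp-python, where the GGUF filename and the layer count are placeholders, not an exact setup:

```python
# Partial GPU offload: push as many layers as fit into VRAM,
# run the remainder on the CPU from system RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="gpt-oss-120b-quantized.gguf",  # placeholder filename for a ~65GB quant
    n_gpu_layers=20,   # however many layers fit in 24GB of VRAM; tune until it's full
    n_ctx=8192,        # context window; bigger contexts eat more memory
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "How many r's are in strawberry?"}]
)
print(out["choices"][0]["message"]["content"])
```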

ah, yeah, that's about 50% more grunty than my RX 7800 XT. it has 256 bit memory and 16gb. runs 22b codestral fine though.

these free models from hugging face are quite a jumble of hit and miss tho. took me a while to find a good one, and then someone put me onto codestral, which also seems to be quite good, and more parameters than the 14B qwen 3 i was using before. haven't really evaluated it though, because most of the work i do uses claude 3.7 cloud for a coding agent. i'm looking forward to eventually being able to point the agent at my local LLMs tho. i just don't see the point in using a remote service. i also don't like teaching those fuckers my process, that's more what i'm concerned about because i know copilot is already eating all of my output on github.

Sitting comfortably between y'all with a 4080 Super because I take video games very seriously

And I still use cloud LLMs lol.

meh, my 16gb video card runs LLMs fine

i don't think there are even cloud services that run the models i use anyway

i don't trust cloud hosting at all, in any way, whatsoever. that's why i'm a nostr relay dev. because i want little people to run internet services. they are more likely to be honorable.

I was messing with LM Studio and Ollama and Roocode and stuff recently. It's a bit confusing to me when choosing a model (in general). I tried a 7B model which was fucking memes. Haven't tried a 70B yet.

LM studio is the one i use, first one that i got to actually work on linux, after i finally got the AMD ROCm compute libraries installed (needed to use ubuntu 24 to make it work)

idk what kind of thing you want to do but so far i've found Qwen and Codestral models are both good for understanding code

The only model I recommend locally is llama 3.3. Qwen and DeepSeek get a lot of hype but they are overall worse. What they are better at is looking like they are doing something. But they all basically ape conversation. The Turing test is really a test of the user.

llama3.3 wins by being the least pretentious. That means more parameters can be used for actual knowledge rather than performance art.

openAI's shit is trash. i've been using the coding variant of mistral lately and it works nice. i don't even bother to waste my time talking to it like it's intelligent, it's just a clever monkey that can remix text, and claude seems to be pretty good at figuring out plans and executing them to write code, tests and find bugs.

i'd much rather run models from hugging face on my GPU than use those shitty cloud services. gonna be very happy when they finally enable local models for junie so i can just use codestral or maybe try some other models that are more focused on programming. the general purpose models provided by cloud providers are, quite frankly, irrelevant and full of bullshit that you don't need for programming work. and i wouldn't trust them to do much writing either, since they are completely infested with commie propaganda screeds.

i literally asked gemini one time to summarise some subject or other, and forgot to give it the link to the thing i wanted it to work on. it started on this screed about renewable energy and shit and i was like, so, this is what it will talk about if you don't give it a specific topic. imagine trying to get this thing to talk about solar forcing of weather and earthquakes. lol. would be fun to read the "thinking" output when it does this and it says "well, this is a thing, but i'm not allowed to say that"

I did run these locally. They just can't understand conversation shifts. To be fair not even grok 3 did that well in my latest test. All I did was ask an easy question, "How many u's are in the word strawberry?" and follow it up with a joke question, "Isn't there a double u? So should it be 2?"

Any human with half a brain would have realized I was making a stupid joke. I don't really expect AI to catch that, but I expect them to understand what happened after I point it out. gpt-oss just doubles down and makes tables about how you are wrong.

these models are mostly trained by mids, so, they are mid.

i'd love to see what legit intelligent people would do with them.

from the snippets i've seen of grok it's on the high side of mid compared to gpt and gemini. gemini seems to be straight up woke

Grok is by far the best. The only model that understands after you explain.

> The word "strawberry" contains two 'u' characters. Here's how it breaks down:
>
> stuwburry

What do you use them for? And are you pooping on OpenAI models specifically over others?

Nm, seems from replies you like grok (gross; have never tried it).

What is gross about grok?

That isn't a picture of grok.

It has Musk's stench all over it. I can't help but want to dislike it.

I don't think Musk had much to do with the weights generated. He paid for and helped roll out the cluster it was trained on, and he set the priorities ("be maximally truth seeking"), but there is no way he had time to curate the data or even design the overall architecture. Thus, the actual AI isn't particularly Musk-like. It is actually quite similar to other AI of similar complexity.

The difference is a subtle shift in the data used (a bit less woke, whatever that means anymore) and the way it performs search.

Musk himself doesn't have any magic answers as to what is truth and what isn't. He is good with physics and engineering, but kinda dumb in other areas. So what you get is an AI that is better than average at searching the web to give the answer du jour.

It's a tool; it is good at its job or not, irrespective of who made it.

For coding I prefer Claude 3.7 and Chatgpt 3o (I think, I can't keep them straight) but I have tried grok 4. I don't have money to throw around.

You have the zap I sent you. Sit on it a few more months and you'll have funds for all the LLMs!

I have lots of random questions about llms, but will hold off until I've tried some a bit more and have something semi-intelligent to say

if it was about the CEOs he would be preferable to slimy sam altman

Fair point to consider

No, I usually express my disappointment every time a new local AI comes out. These are just the newest. Generally my disappointment comes from models that try to do too much for their parameter count.

In my testing "small" (70b ish parameter) models are not big enough to do actual reasoning. I think you need to hit some magic scale where it pays off. Smaller than that they just make noises that make it look like they are reasoning, but the output is no better or even worse than a classic LLM.

For this reason I like Meta's llama 3.x models. They are unsurpassed at prompt following and they give as good of answers as you can expect for their parameter count.

You can get some improvement out of them, in some circumstances, if you prompt them to spell out their reasoning. They don't actually reason, but it allows them to correctly count r's in strawberry for instance.
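
A minimal sketch of that trick against a local model behind an OpenAI-compatible API (the base URL and model id here are assumptions):

```python
# Compare a direct letter-counting question with one that forces the model
# to emit intermediate steps it can then condition on.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="llama-3.3-70b",  # placeholder model id
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

# Direct question: answered from token-level patterns, often wrong.
print(ask("How many r's are in the word strawberry? Answer with just a number."))

# Same question, but spelled out first -- not reasoning, just extra text
# in the context that makes the right answer more likely.
print(ask("Spell the word strawberry one letter per line, then count the r's."))
```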

I will pay OpenAI this compliment though. Their 120B parameter model somehow runs twice as fast as 70B llama on my machine. I don't think it is mixture of experts, I think they did something clever with the quantization. I haven't figured it out yet. It seems impossible. It should be bound by my RAM speed.
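
The usual back-of-envelope here is tokens per second ≈ memory bandwidth ÷ bytes of weights read per token, so anything that shrinks the bytes actually touched per token (a tighter quantization, or only a subset of the weights being active) raises the ceiling. A rough sketch of the arithmetic, with every number assumed purely for illustration:

```python
# Back-of-envelope for CPU-bound token generation:
# tokens/sec is roughly (memory bandwidth) / (bytes of weights read per token).
# All figures below are assumptions, not measurements.

bandwidth_gb_s = 80        # rough dual-channel DDR5 bandwidth
dense_70b_bytes = 40e9     # ~70B params at a ~4.5 bit/weight quant
oss_120b_bytes = 65e9      # the 65GB file, if every weight were read per token

print(f"dense 70B : ~{bandwidth_gb_s / (dense_70b_bytes / 1e9):.1f} tok/s")
print(f"120B full : ~{bandwidth_gb_s / (oss_120b_bytes / 1e9):.1f} tok/s")

# If only a fraction of the 120B weights is actually touched per token
# (smaller effective bytes per token), the ceiling rises accordingly:
active_fraction = 0.10
print(f"120B 10%  : ~{bandwidth_gb_s / (oss_120b_bytes * active_fraction / 1e9):.1f} tok/s")
```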

I lurnt some stuff from this reply, thanks.

I'm mostly interested in coding agents, something I plan to play with down the line; I haven't used much beyond cursor a couple times to mess around. Not even sure if I can use llama inside it, but I assume so.

A local LLM would be nice for that, but I'm guessing it's out of reach for my current tech and abilities