Do you have some sort of prompt collection to test that out? I was building a little (and probably naive) tool to test an LLM recursively through a set of prompts.
I have about 1000 questions that I ask every model. I accept some models as ground truth, then compare the tested model against that ground truth and count how many of its answers agree. Building the ground truth is the harder part. Check out the Based LLM Leaderboard on wikifreedia.
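For what it's worth, here is a minimal sketch of that agreement-counting idea. The file names and answer format are placeholders I made up, not the actual Based LLM Leaderboard setup:

```python
import json

def agreement_score(ground_truth_answers, candidate_answers):
    """Count how many of the candidate's answers match the ground-truth model's answers.

    Both arguments are dicts mapping a question id to that model's answer,
    assumed here to be short, directly comparable strings (e.g. "yes"/"no"
    or a multiple-choice letter).
    """
    shared = set(ground_truth_answers) & set(candidate_answers)
    agree = sum(
        1 for q in shared
        if candidate_answers[q].strip().lower() == ground_truth_answers[q].strip().lower()
    )
    return agree, len(shared)

# Hypothetical answer files, e.g. {"q001": "yes", "q002": "B", ...}
with open("ground_truth.json") as f:
    ground_truth = json.load(f)
with open("candidate_model.json") as f:
    candidate = json.load(f)

agree, total = agreement_score(ground_truth, candidate)
print(f"{agree}/{total} answers agree with the ground truth")
```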
Ostrich is one ground truth; I have continued to build on it over the months. Mike Adams, after stopping for a while, also came back and is building a newer one.
I will have a look, thanks for pointing that out 👀 Right now I'm researching small LLMs (SLMs) like qwen2.5-0.5, the smol models, Granite, and so on. This is my naive approach to SLM testing, using Ollama for inference: https://github.com/gzuuus/slm-testing. Then I discovered promptfoo (https://www.promptfoo.dev/), which is a really nice framework for testing.
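In case it's useful, a rough sketch of that kind of loop against Ollama's local HTTP API; the model name and prompt file are placeholders, and the repo above may do it differently:

```python
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint
MODEL = "qwen2.5:0.5b"  # any small model already pulled locally with `ollama pull`

def ask(prompt: str) -> str:
    """Send a single prompt to the local Ollama server and return the full response text."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

# Hypothetical prompt file: one prompt per line.
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

results = {p: ask(p) for p in prompts}
print(json.dumps(results, indent=2))
```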