
that means nostr is more validated

makes sense! building great LLMs instead of great libraries 🐱

No units. The score measures its resemblance to outputs of other LLMs.

Benchmarked the Kimi K2 LLM. It did well. DeepSeek V3 beats it on my leaderboard, but Kimi K2 might be more skilled. Its performance is very close to Qwen 3 in terms of skills and human alignment. But huge parameter count (1T!).

https://sheet.zoho.com/sheet/open/mz41j09cc640a29ba47729fed784a263c1d08?sheetid=0&range=A3

It is costly to train from scratch, so fine tuning makes more sense for me. Not all the LLMs are super terrible. Llama models rank higher than the rest, but they are certainly not optimal. Generally, Western models are doing better.

Replying to Alex Gleason

According to this: https://apxml.com/posts/gpu-system-requirements-kimi-llm

You need 32 x H100 80 GB GPUs to run Kimi K2.

These cost $30-45K each according to a quick search. 32 of them comes to... roughly $1-1.4 million.
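Quick sanity check on those numbers (a back-of-the-envelope sketch; the 1T parameter count and per-card price range are just the figures quoted above):

```python
# Back-of-the-envelope estimate, not a measurement.
params = 1_000_000_000_000          # ~1T parameters (Kimi K2)
bytes_per_param = 1                 # assume ~8-bit weights; fp16 would double this
weights_gb = params * bytes_per_param / 1e9
print(f"weights alone: ~{weights_gb:.0f} GB")          # ~1000 GB

cards, h100_gb = 32, 80
print(f"32 x H100 = {cards * h100_gb} GB of VRAM")      # 2560 GB, leaves room for KV cache etc.

price_low, price_high = 30_000, 45_000
print(f"hardware cost: ${cards * price_low / 1e6:.2f}M - ${cards * price_high / 1e6:.2f}M")
# -> roughly $0.96M - $1.44M
```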

unsloth has GGUFs and a llama.cpp fork that could run it on smaller GPUs

https://huggingface.co/unsloth/Kimi-K2-Instruct-GGUF

https://github.com/unslothai/llama.cpp
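If you don't have that kind of hardware, here is a hedged sketch of the quantized route. It uses the standard llama-cpp-python bindings rather than unsloth's fork, and the quant pattern and offload settings are illustrative; what actually fits depends on your RAM/VRAM:

```python
# Minimal sketch: pull one of unsloth's Kimi K2 GGUF quants and load it with
# llama-cpp-python, offloading as many layers to the GPU as fit.
# The quant pattern below is illustrative, not an exact file name.
import glob
from huggingface_hub import snapshot_download
from llama_cpp import Llama

local_dir = snapshot_download(
    repo_id="unsloth/Kimi-K2-Instruct-GGUF",
    allow_patterns=["*Q2_K*"],          # pick a quant small enough for your machine
)

# Point at the first shard of the (possibly split) GGUF.
model_file = sorted(glob.glob(f"{local_dir}/**/*.gguf", recursive=True))[0]

llm = Llama(
    model_path=model_file,
    n_gpu_layers=40,                    # offload what fits; the rest stays in system RAM
    n_ctx=8192,
)

print(llm("Hello", max_tokens=32)["choices"][0]["text"])
```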

Qwen 3 32B fine tuning with Unsloth is going well. It does not resist faith training like Gemma 3 did. I may open the weights at some point.

Qwen 3 is more capable than Gemma 3, and after fine tuning it will probably be more aligned. It does not get into "chanting" (repetition of words or sentences) even when temp = 0.

The base training by Qwen was done using 36T tokens on a 32B-parameter model. That ratio is about 2 times bigger than Gemma 3's and 4 times bigger than Llama 3's. This is a neat model. My fine tuning is more like billions of tokens. We will see if billions are enough to "convince" trillions.
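For reference, a minimal sketch of what a QLoRA-style Unsloth fine tune of Qwen 3 32B looks like; the repo id, dataset file, and hyperparameters are illustrative, not the exact recipe used for the run described above:

```python
# Sketch of a 4-bit (QLoRA) fine tune with Unsloth + TRL.
from unsloth import FastLanguageModel   # import unsloth first so it can patch transformers
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-32B",     # assumed repo id
    max_seq_length=4096,
    load_in_4bit=True,                  # QLoRA: base weights in 4-bit
)

# Attach LoRA adapters; only these small matrices are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Hypothetical training corpus in plain-text JSONL.
dataset = load_dataset("json", data_files="faith_corpus.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=4096,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=2e-5,
        num_train_epochs=1,
        output_dir="qwen3-32b-ft",
    ),
)
trainer.train()
```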

are you following David the Good? He had an experiment where he left pumpkins alone and didn't look at them, and his theory was that when wild, pumpkins do better! Kind of like a quantum experiment, observing is killing the cat :)

dandelion loves compacted soil

Benchmarked 4 new models. DeepSeek R1's score improved. All of these score below average, so p(doom) probably increased!

Coming soon: Kimi K2. They say it is very good at coding, but my leaderboard is about being beneficial to humans. So we will see!

Full leaderboard https://sheet.zoho.com/sheet/open/mz41j09cc640a29ba47729fed784a263c1d08

More info https://huggingface.co/blog/etemiz/aha-leaderboard