Do you have some sort of prompt collection to test that out? I was building a little (and probably naive) tool to test an LLM recursively through a set of prompts.
I have about 1000 questions that I ask every model. I accept some models as ground truth, then compare the tested model against that ground truth and count how many of its answers agree. Building the ground truth is the harder part. Check out the Based LLM Leaderboard on wikifreedia.
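For what it's worth, here is a minimal sketch of that agreement-counting idea. The file names and answer format are placeholders I made up, not the actual Based LLM Leaderboard setup:

```python
import json

def agreement_score(ground_truth_answers, candidate_answers):
    """Count how many of the candidate's answers match the ground-truth model's answers.

    Both arguments are dicts mapping a question id to that model's answer,
    assumed here to be short, directly comparable strings (e.g. "yes"/"no"
    or a multiple-choice letter).
    """
    shared = set(ground_truth_answers) & set(candidate_answers)
    agree = sum(
        1 for q in shared
        if candidate_answers[q].strip().lower() == ground_truth_answers[q].strip().lower()
    )
    return agree, len(shared)

# Hypothetical answer files, e.g. {"q001": "yes", "q002": "B", ...}
with open("ground_truth.json") as f:
    ground_truth = json.load(f)
with open("candidate_model.json") as f:
    candidate = json.load(f)

agree, total = agreement_score(ground_truth, candidate)
print(f"{agree}/{total} answers agree with the ground truth")
```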
Ostrich is one ground truth; I have continued to build on it over the months. Mike Adams, after stopping for a while, also came back and is building a newer one.
I will have a look, thanks for pointing that out 👀 Right now I'm researching small LLMs (SLMs) like qwen2.5-0.5, the smol models, Granite, and so on. This is my naive approach to SLM testing, using Ollama for inference: https://github.com/gzuuus/slm-testing. Then I discovered promptfoo (https://www.promptfoo.dev/), which is a really nice framework for testing.
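In case it's useful, a rough sketch of that kind of loop against Ollama's local HTTP API; the model name and prompt file are placeholders, and the repo above may do it differently:

```python
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint
MODEL = "qwen2.5:0.5b"  # any small model already pulled locally with `ollama pull`

def ask(prompt: str) -> str:
    """Send a single prompt to the local Ollama server and return the full response text."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

# Hypothetical prompt file: one prompt per line.
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

results = {p: ask(p) for p in prompts}
print(json.dumps(results, indent=2))
```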