nostr:npub14p3antm8cnvlqx3km7fp4ywyyxq7przhxay0d9gaf6hxkw984cxsm8x6r5 I found that particular paper very unconvincing once I started digging into the data behind it: they were marking answers as "incorrect" for pretty weak reasons, in my opinion
nostr:npub14p3antm8cnvlqx3km7fp4ywyyxq7przhxay0d9gaf6hxkw984cxsm8x6r5 Practice! The more time I spend with different models, the better my intuition gets for whether they're going to give me a good answer or not
GPT-4 and ChatGPT are far, far more reliable than the models that I can run locally on my laptop
I'm only just beginning to build that intuition for Llama 2; it'll take a while
I spoke about that a bit in https://simonwillison.net/2023/Aug/3/weird-world-of-llms/#tips-for-using-them
nostr:npub14p3antm8cnvlqx3km7fp4ywyyxq7przhxay0d9gaf6hxkw984cxsm8x6r5 generating little jq and bash scripts is an ideal application for untrustworthy LLMs, because hallucinated code simply won't run: you can spot any problems pretty fast!
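To make that concrete, here's a minimal sketch of the feedback loop (the JSON data and the jq filters are invented for illustration, not taken from any real session):

```bash
# A jq one-liner of the kind an LLM might generate:
# extract the "name" field from every object in a JSON array.
echo '[{"name": "alpha"}, {"name": "beta"}]' | jq -r '.[].name'
# alpha
# beta

# If the model had hallucinated a jq function that doesn't exist,
# running it would fail loudly instead of silently misbehaving:
echo '[]' | jq 'made_up_function(.)'
# jq: error: made_up_function/1 is not defined
```

Bad output shows up the moment you run the script, which is exactly what makes low-stakes scripting such a forgiving place to use these models.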