Related: Gwern’s much more serious and in-depth Benchmarking LLM Diversity & Creativity, and my silly post The Neruda Factory
The models are pretty good at math and coding these days, but I care more about how well they can write and analyze writing. Sadly, the answer tends to be that they’re not that great at it.
Gwern gives a few reasons for this: there are no good benchmarks for creative writing (for reasons of varying validity), and most people have pretty bad taste and don’t mind, or might even prefer, slop.
Also, this seems like a much less profitable area for the labs to focus on, so whatever improvements do happen here are likely to be incidental rather than intentional.
Here are a few prompts that I use to test for creativity. I don’t exactly compare the different outputs between models, but when a new model is released, I run the prompts through it, and I think that gives me a decent sense of whether the model is a C+, a B-, or a B+ English student.
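For concreteness, the workflow is roughly this. Here’s a minimal sketch, assuming the OpenAI Python client; the prompt strings and model name are placeholders, not my actual prompts:

```python
# Minimal sketch of the "eyeball test": run a fixed set of creativity
# prompts through a newly released model and read the outputs yourself.
# PROMPTS and NEW_MODEL are placeholders, not my real prompts.
from openai import OpenAI

client = OpenAI()

PROMPTS = [
    "Write a sonnet about ...",          # placeholder
    "Analyze the opening paragraph of ...",  # placeholder
]

NEW_MODEL = "gpt-4o"  # swap in whatever model just came out

for prompt in PROMPTS:
    response = client.chat.completions.create(
        model=NEW_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    # No automated scoring: the point is to read each completion and
    # form a holistic letter-grade impression (C+, B-, B+, ...).
    print(f"--- {prompt[:50]} ---")
    print(response.choices[0].message.content)
```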
As of December 2024, Sonnet 3.5 and Llama 3.1 405B are leading the pack among the models I have access to. (I don’t have access to o1.)