A Few Benchmarks I Use for LLM Creativity

Related: Gwern’s much more serious and in-depth Benchmarking LLM Diversity & Creativity, my silly post The Neruda Factory

The models are pretty good at math and coding these days, but I care more about how well they can write and analyze writing. Sadly, the answer tends to be that they’re not that great at it.

Gwern gives a few reasons for this: there are no good real benchmarks here, for reasons of varying validity, and most people have pretty bad taste and don’t mind slop (or might even prefer it).

Also, this seems like a much less profitable area for the labs to focus on, so whatever improvements tend to happen in this area are likely to be more incidental than intentional.

Here are a few prompts that I use to test for creativity. I don’t systematically compare the outputs between models, but when a new model is released, I run these through it, and I think they give me a decent sense of whether the model is a C+, a B-, or a B+ English student.

As of December 2024, of the models I have access to, Sonnet 3.5 and Llama 3.1 405B are currently leading the pack. (I don’t have access to o1.)


The Neruda Factory

People are talking a lot more about Claude these days, but I haven’t seen my exact perspective represented anywhere: that of a non-normie, non-technical person who likes him a lot. Gwern says this kind of thing is important to write right now, so here goes.

This post is largely a breakdown of a few recent conversations I’ve had with Claude Sonnet 3.5 2024-10-22 which serve as scaffolding for some commentary, with a few more scattered thoughts at the end.


Creative Commons License take whatever you want πŸ’›