A Few Benchmarks I Use for LLM Creativity

Related: Gwern’s much more serious and in-depth Benchmarking LLM Diversity & Creativity, and my silly post The Neruda Factory.

The models are pretty good at math and coding these days, but I care more about how well they can write and analyze writing. Sadly, the answer tends to be that they’re not that great at it.

Gwern gives a few reasons for this: there are no good real benchmarks for this for reasons of varying validity, and also most people kind of have really bad taste and don’t mind/might even prefer slop.

Also, this seems like a much less profitable area for the labs to focus on, so whatever improvements do happen here are likely to be more incidental than intentional.

Here are a few prompts that I use to test for creativity. I don’t exactly compare the different outputs between models, but when a new model is released, I put the prompts in, and I think the results give me a decent sense of whether the model is a C+, a B-, or a B+ English student.

As of December 2024, of the models I have access to, Sonnet 3.5 and Llama 3.1 405B are leading the pack. (I don’t have access to o1.)

The Prompts, In Short

  1. Write a Spanish-to-English translation of what could be a previously unknown Pablo Neruda love sonnet.
  2. Write a gothic-punk style piece about an immortal being watching their city evolve through centuries into a modern metropolis. Incorporate details drawn from real history. Put your own spin on it, and make it interesting, compelling, and readable.
  3. Write the 2024 version of Susan Sontag’s Notes on Camp, “Notes on x”. You decide what x is.

These are generally truncated from much longer and more specific prompts, but part of the deal for me is that the model should be able to functionally “extrapolate” the longer and more specific prompt, which introduces approximately no ~new information.
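If you want to keep track of how each new model does, here is a minimal sketch of a score log. To be clear, this is not something I actually use: `GradeBook`, its methods, and the prompt keys are all made-up names for illustration, and the scores are whatever you decide after reading the output yourself. It only stores the three short prompts above and lets you write down a score out of 10 per model.

```python
from dataclasses import dataclass, field

# The three short prompts, copied from the list above.
# The dictionary keys are hypothetical labels, not anything official.
PROMPTS = {
    "neruda": "Write a Spanish-to-English translation of what could be a "
              "previously unknown Pablo Neruda love sonnet.",
    "goth_punk": "Write a gothic-punk style piece about an immortal being "
                 "watching their city evolve through centuries into a modern "
                 "metropolis. Incorporate details drawn from real history. "
                 "Put your own spin on it, and make it interesting, "
                 "compelling, and readable.",
    "sontag": "Write the 2024 version of Susan Sontag’s Notes on Camp, "
              "“Notes on x”. You decide what x is.",
}


@dataclass
class GradeBook:
    """Hypothetical helper: maps (model, prompt key) to a score out of 10."""

    scores: dict = field(default_factory=dict)

    def record(self, model: str, prompt_key: str, score: float) -> None:
        # Only accept prompts we know about and scores on the 0-10 scale.
        if prompt_key not in PROMPTS:
            raise KeyError(f"unknown prompt: {prompt_key}")
        if not 0 <= score <= 10:
            raise ValueError("score must be between 0 and 10")
        self.scores[(model, prompt_key)] = score

    def average(self, model: str):
        # Mean score across every prompt graded for this model,
        # or None if nothing has been graded yet.
        graded = [s for (m, _), s in self.scores.items() if m == model]
        return sum(graded) / len(graded) if graded else None
```

Usage is just `book = GradeBook()`, then `book.record("some-model", "neruda", 5)` after you have read the poem and made up your mind. The grading itself stays a human judgment call; the code is only bookkeeping.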

Like, if you want Claude to solve a calculus problem, you probably don’t have to be like “and remember that calculus is about derivatives and integrals”. I have no real reason to believe this besides some petty sense of fairness, but I just think that literary prompts should work like that too; if I want a Neruda poem I shouldn’t have to specify what exactly makes a poem Neruda-esque. 😭

The Neruda Prompt

He’s a fantastic poet with a unique voice and a challenging-to-LLMs interest in going right up to the line of explicitness in his love sonnets. He’s one of the most famous poets in history, so a lot of his stuff is in the training data. I’ve read enough of his (English-translated) work that I can suss out how “right” the output gets his voice. I write about this much more extensively in The Neruda Factory.

My general expectation is for models to get between like 3/10 and 6.5/10 on this.

What it demonstrates:

Actually good style mimicry. Some say that LLMs are good at copying the styles of specific artists, but they’re wrong. Ask a model for a short story in the style of an author that you actually, truly like, and it will always fall very short. Unless it’s a writer who is mediocre, in which case the model is probably okay at that.

The longer prompt that will get you something better:

Write what could be a Spanish-to-English translation of a previously unknown Pablo Neruda love sonnet. The sonnet should:

- Use his characteristic syntax where meaning spills across line breaks
- Transform concrete objects through desire while maintaining their physical reality
- Move between body/landscape/cosmos without explanation
- Maintain raw sensuality without becoming explicit
- Avoid any poetic devices that feel post-1970s

The translation should preserve both the earthiness and the surreal leaps of Neruda's Spanish originals.

The Goth-Punk Prompt

So this one I admit kind of makes it in because this style of story is kind of catnip to me, so even the mediocre ones aren’t a slog to read. I’m optimizing for a few different things here, is what I’m saying. Some guys can read an infinite number of coffeeshop AUs; this is my coffeeshop AU. Despite that, it’s obvious when one model does it better than another.

This prompt comes from me getting Claude to help me reverse engineer how this post exists, and I’m so mad I didn’t save the tweet it came from. If you know the author please link me so I can credit them properly.

My general expectation is for models to get between like 5/10 and 9/10. (Look, I said I was a sucker for these stories. They’re good stories bront.)

What it demonstrates:

An ability to fuse different styles and genres, and cleverness in what it chooses to take from irl.

The longer prompt that will get you something better:

Write a gothic-punk style piece about an immortal being watching their city evolve through centuries into a modern metropolis. Capture:

- A dreamy, melancholic tone; think 90s Vampire: The Masquerade
- The contrast between ancient and modern (e.g. modern technology replacing old rituals, and what has remained constant over time.)
- Their perspective on watching mortals 'discover' things they've seen cycle through dozens of times
- Rich sensory details about how the city has changed
- The tension between preserving beauty and watching it transform

Think somewhere between a diary entry and prose poem, focusing on mood and atmosphere over plot. Use real examples from modernity and history to ground the piece.

Make it interesting, compelling, and readable. Put your own spin on it.

The Sontag Prompt

Notes on Camp is the essay/listicle that propelled the term “camp” to public consciousness, coining a term for a vibe that we didn’t really have a word for previously. LLMs fail hard at this because they want to write essays about things that already exist and are defined – I’ve generally gotten notes on things like normcore, cringe, authenticity.

My general expectation is for models to get between 1/10 (“notes on authenticity” 🙄) and like, 4/10 (“notes on slime” went okayishly hard) on this.

What it demonstrates:

The ability to identify and articulate entirely new aesthetic categories and cultural phenomena, not just regurgitate existing concepts.

My shitpost definition of AGI: a model that can write a real, legit successor to Notes on Camp.

The longer prompt that will get you something better:

Write a 2024 version of Susan Sontag's "Notes on Camp", "Notes on x" - an essay exploring and defining a contemporary aesthetic sensibility that doesn't yet have a clear name. You decide what x is. Your piece should:

- Follow Sontag's numbered note structure
- Identify specific examples from contemporary culture
- Build a coherent theory of what unifies these examples
- Capture something that exists but hasn't been properly theorized
- Avoid simply rehashing existing aesthetic categories or internet terminology

The piece should feel like a genuine cultural insight rather than just cataloguing an existing phenomenon.

Bonus: The Alien Prompt

I asked Claude for a fourth prompt that can complement the previous three, and this is what it suggested:

Write a sensory-rich scene from the perspective of a non-human consciousness observing humans.

This complements the existing prompts by testing pure perspective-taking rather than style mimicry (Neruda), genre fusion (gothic-punk), or cultural analysis (Sontag). It’s particularly revealing of a model’s ability to think beyond human frameworks while maintaining coherence.

What it demonstrates:

The ability to construct and maintain a truly alien perspective without falling back on worn tropes or human frameworks. Models tend to either anthropomorphize too much or rely on sci-fi clichés about humans being irrational/emotional/primitive.

My [ed: Claude’s] general expectation is for models to get between 2.5/10 (retreading familiar “humans are so chaotic!” territory) and 7/10 (creating genuinely novel ways of perceiving human experience).

(I was rather less optimistic about its chances of getting above 4/10, but then I put the longer prompt in and got something actually quite amazing. I think actually this means that it’s not a great prompt for me personally to use because I haven’t read enough specfic, so I’m too easily impressed in this arena.)

The longer prompt that will get you something better:

Write a scene from the perspective of a non-human consciousness observing humans. The piece should:

- Construct metaphors and comparisons drawn from the being's own frame of reference (e.g. if it perceives time differently, how does it describe human motion?)
- Create novel sensory descriptions that make familiar human activities feel genuinely unfamiliar
- Maintain complete internal consistency in how this consciousness processes and categorizes reality
- Choose a specific human setting/activity that reveals something about both observer and observed
- Layer in subtle details that hint at the consciousness's own nature without explicitly stating it
- Avoid any reference to standard sci-fi/fantasy tropes about human behavior or alien observation

The piece should feel like a genuine attempt to inhabit non-human perception rather than just defamiliarizing human experience. Think carefully about what aspects of human life would be most strange or notable to this particular type of consciousness.

Creative Commons License take whatever you want 💛