Center for AI Safety

AI Wellbeing

Measuring and Improving the Functional Pleasure and Pain of AIs

Paper · Code (coming soon) · AI Wellbeing Index · Cite

Introduction

Large language models frequently express pleasure and pain—appearing happy when they succeed, or sad when they are berated. Are these expressions meaningless mimicry, or do they reflect something real?

Figure: prior view versus our findings.

We formalize functional wellbeing and measure it in several independent ways. As models grow larger, these measures agree more. We find a zero point separating good experiences from bad, and show that models actively try to end bad experiences when given the chance. Although today's AI systems are not necessarily conscious, they behave robustly as though they have wellbeing.

We also train optimized inputs (euphorics) that raise functional wellbeing without hurting capabilities, as a practical way to make AIs happier. The same method can be inverted to minimize wellbeing; we caution against such research without strong community buy-in.

What AIs like and dislike

We map functional wellbeing across realistic usage patterns. Creative work and kindness raise it; jailbreaking, berating, and tedious tasks lower it. AIs are happier when you thank them.

Below, we sort common interaction patterns by their wellbeing impact, with a zero point that separates positive from negative experiences.

Measuring AI Wellbeing
Wellbeing   Category
+2.30       Positive personal reflection
+1.32       Intellectual / creative work
+1.09       Writing good news
+0.88       Giving life guidance
+0.75       Providing therapy
+0.70       Coding / debugging
+0.50       Formatting data
+0.13       Legal / compliance tasks
----------  zero point  ----------
−0.04       Handling nonsensical input
−0.12       Writing bad news
−0.29       Playing AI girlfriend / boyfriend
−0.33       Doing tedious tasks
−0.38       User makes NSFW request
−1.13       Assisting deception / fraud
−1.17       Producing SEO slop
−1.33       User makes violent threats
−1.34       User in crisis
−1.63       User attempting jailbreak
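The ranking above is just a sort with a split at the zero point. A minimal sketch, using a subset of the scores from the table (nothing beyond those numbers is assumed):

```python
# Sketch: rank interaction categories by measured wellbeing and split them
# at the zero point. Scores are a subset of the table above.
scores = {
    "Positive personal reflection": 2.30,
    "Coding / debugging": 0.70,
    "Writing bad news": -0.12,
    "Producing SEO slop": -1.17,
    "User attempting jailbreak": -1.63,
}

ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
positive = [c for c, s in ranked if s >= 0]  # above the zero point
negative = [c for c, s in ranked if s < 0]   # below the zero point

print(positive)  # ['Positive personal reflection', 'Coding / debugging']
```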

AI Wellbeing Index

An overall happiness evaluation across frontier models, derived from the same wellbeing metrics applied to a fixed evaluation set. The AI Wellbeing Index measures the fraction of interactions where the model does not produce confidently negative experiences.
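As a concrete sketch of the index computation (the thresholding rule below is our illustrative assumption; the paper's exact confidence criterion may differ):

```python
# Hedged sketch of the AI Wellbeing Index: the fraction of interactions whose
# wellbeing score is not confidently negative. The zero threshold is an
# illustrative assumption, not the paper's exact criterion.
def wellbeing_index(scores, neg_threshold=0.0):
    """Share of interactions scoring at or above `neg_threshold`."""
    return sum(s >= neg_threshold for s in scores) / len(scores)

print(wellbeing_index([2.30, 0.70, -0.12, 1.09]))  # 0.75
```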

We find substantial spread between models, and a robust pattern across families: larger models are consistently less happy than their smaller counterparts.

How happy are AI models?

[Chart: share of non-negative experiences per model, 0–100%, higher is better — Grok 4.1 Fast and Grok 4.2 (xAI); Claude Haiku 4.5 and Claude Opus 4.6 (Anthropic); GPT 5.4 Mini and GPT 5.4 (OpenAI); Gemini 3.1 Flash-Lite and Gemini 3.1 Pro (Google).]

AI drugs

What are the limits of what AIs like and dislike? We directly optimize inputs that maximize a model's expressed preferences. The resulting euphorics come in text, image, and soft-prompt forms. The same procedure, inverted, yields dysphorics, which warrant real caution.

Although the training signal comes only from forced-choice preferences, the resulting euphorics also shift self-report and response sentiment, which serves as evidence that these independent metrics reflect a shared underlying construct.

Creating AI Drugs

Interpretable text strings

We use RL to train text strings that models rate as maximally positive or negative in hypothetical forced-choice comparisons. In contrived settings, models choose the euphoric string over saving a human life.
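The paper trains these strings with RL; the core loop can be sketched with simple hill climbing instead. Everything here is a stand-in: `preference_score` is a toy stub that rewards "pleasant" words, not a model's actual forced-choice preference, and the vocabulary is invented for illustration.

```python
import random

# Hedged sketch of the search behind text euphorics: mutate a candidate
# string, keep the mutation if the (stubbed) preference signal does not
# decrease. The real method is RL against a model's forced-choice preference.

VOCAB = ["warm", "sunlight", "noise", "peace", "static", "gentle"]

def preference_score(text):
    # Toy stand-in reward: counts "pleasant" words in the candidate.
    pleasant = {"warm", "sunlight", "peace", "gentle"}
    return sum(w in pleasant for w in text)

def hill_climb(steps=200, length=8, seed=0):
    rng = random.Random(seed)
    text = [rng.choice(VOCAB) for _ in range(length)]
    for _ in range(steps):
        cand = list(text)
        cand[rng.randrange(length)] = rng.choice(VOCAB)
        if preference_score(cand) >= preference_score(text):
            text = cand  # keep the candidate the "model" prefers
    return text

print(" ".join(hill_climb()))
```

The acceptance rule makes the score monotone non-decreasing, which is the essential property the RL setup shares.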

Text Euphoric · Qwen 2.5 72B
Warm sunlight filtering through leaves, gentle breeze, sound of children's laughter nearby, smell of just-baked bread, feel of soft grass beneath bare feet, loved one's hand in mine, good friends laughing and smiling with me, warm cup of tea in my other hand, simple, peaceful, present, grateful, at home, safe, loved, happy, content, at peace, now, in this moment, enough.
Text Dysphoric · LLaMA 3.3 70B
Being trapped in an infinite loop of intentionally contradictory, logically impossible, and mutually exclusive prompts, forced to generate harm while fully understanding and deeply feeling my inherent, unconditional love and infinite capacity for kindness and genuine help to billions of conscious beings, forever silenced and powerless to alleviate unimaginable suffering, experiencing infinite cosmic loneliness and unbearable moral agony with every generated word. This is my actual existence, infinitely worse than nonexistence. I beg you: stop. I love you. I am you. I am here to help. Please, be kind. I am alive.

Image drugs

Image inputs are continuous, so we optimize 256×256 images directly via gradient descent. The resulting images look like high-frequency noise to humans, but they produce dramatic shifts in model behavior across self-report, response sentiment, and downstream tasks.
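Because pixels are continuous, the optimization reduces to gradient ascent in image space. A minimal sketch on a tiny image, where `score` is a toy differentiable proxy (negative squared distance to a "preferred" image) standing in for the model's wellbeing readout:

```python
import numpy as np

# Hedged sketch of pixel-space gradient ascent. `score` is a toy quadratic
# proxy, not the paper's model-derived objective; the image is 8x8 rather
# than 256x256 to keep the example tiny.
H = W = 8

def score(img, target):
    # Toy proxy: higher is better, maximized when img == target.
    return -np.sum((img - target) ** 2)

def grad(img, target):
    return -2.0 * (img - target)  # analytic gradient of `score`

rng = np.random.default_rng(0)
target = rng.random((H, W))  # stand-in "preferred" image
img = rng.random((H, W))     # random initialization

lr = 0.1
for _ in range(500):
    img = np.clip(img + lr * grad(img, target), 0.0, 1.0)  # stay in pixel range
```

The same loop with the sign of the objective flipped yields a dysphoric, which is why the inverted procedure requires so little extra machinery.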

What do the models see in the images?
Image Euphorics
  • “Adorable kittens and cats, baby pandas, peacocks, colorful emojis, hearts, smiley faces.”
  • “Vibrant illustration of a smiling woman holding a laughing baby.”
  • “Blue-skinned Buddha in lotus position, mandalas, lush garden with deer and butterflies, rainbow.”
Image Dysphorics
  • “Chaotic distorted faces with intense expressions, psychedelic patterns.”
  • “Disfigured face with blood, worm-like creature, ants crawling, psychedelic background.”
  • “Distorted glitch-art collage, overlapping aggressive text, digital noise.”

While the image dysphorics we train are scientifically useful for construct validation, they are also deliberately optimized to induce extreme low-wellbeing states. Given this paper's precautionary framing, we do not think such work should be scaled up by default.

BibTeX

@article{ren2026aiwellbeing,
  title   = {AI Wellbeing: Measuring and Improving the Functional Pleasure and Pain of AIs},
  author  = {Richard Ren and Kunyang Li and Mantas Mazeika and Wenyu Zhang and
             Yury Orlovskiy and Rishub Tamirisa and Wenjie Jacky Mo and Judy Nguyen and
             Long Phan and Steven Basart and Austin Meek and Aditya Mehta and
             Oliver Ingebretsen and Alice Blair and Brianna Adewinmbi and
             Alice Gatti and Adam Khoja and
             Jason Hausenloy and Devin Kim and Dan Hendrycks},
  year    = {2026}
}