
Seven Prompts No AI Image Generator Can Get Right
We ran the same 26 adversarial prompts through three AI image generators (Runway Gen4, xAI Grok, and OpenAI GPT) and judged every output with two independent vision models that both had to agree. An image with 12 strawberries instead of 8 scored higher on standard aesthetic metrics than the correct image. That is why we built this evaluation: automated metrics measure beauty, not truth.
Verita AI Research
Key findings (26 shared prompts, dual-judge consensus):
OpenAI GPT: 69% (18/26) - strongest on physics, spatial reasoning, and multi-subject scenes
xAI Grok: 65% (17/26) - most accurate on hands (75%) and counting (least bad at 40%)
Runway Gen4: 50% (13/26) - 0% on counting (0/5), weakest overall on precision tasks
Seven prompts defeated all three models - counting, reflections, multi-line text, and pattern regularity are architecturally unsolved

The Challenge
Standard evaluation metrics for image generation don't measure what matters for production use.
FID (Heusel et al., 2017) requires 20,000+ samples, has no compositional understanding, and actively disagrees with human judgment (Jayasumana et al., CVPR 2024). CLIPScore (Hessel et al., 2021) treats text as a bag of words — T2I-CompBench (NeurIPS 2023) confirmed it never ranks among the top metrics for compositional evaluation. Human preference models like HPSv2 and ImageReward predict which image looks better, not which is correct.
Compositional benchmarks have made progress — GenEval (NeurIPS 2023), TIFA (ICCV 2023), DPG-Bench (2024), GenAI-Bench (NeurIPS 2024) — all confirming models struggle with counting, spatial relations, and logical reasoning. But none target specific failure modes adversarially. We designed 26 prompts to break models on purpose — then gave all three the same set.
Our Approach
26 adversarial prompts across 8 dimensions, identical for all three models. Each targets a specific known weakness.
We additionally tested xAI (15 prompts) and OpenAI (20 prompts) on hard-mode dimensions — counting with color constraints, negation, complex text, multi-constraint, and impossible tasks. Runway was not tested on hard-mode prompts. These results are reported separately.
| Model | Provider | Shared Prompts | Hard-Mode |
|---|---|---|---|
| gen4_image_turbo | Runway | 26 | — |
| grok-imagine-image-pro | xAI | 26 | +15 |
| gpt-image-1 | OpenAI | 26 | +20 |
Dual-judge consensus. GPT-4o and Claude independently evaluate every image, and both must agree for a PASS. We added the second judge after catching GPT-4o miscounting cherries; estimated judge error drops from ~15% with a single judge to under 5% with dual consensus.
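To make the consensus rule concrete, here is a minimal sketch of the logic, assuming each judge is a callable that wraps a vision-model call (e.g., GPT-4o or Claude) and returns structured JSON with a verdict field. The judge wrappers and the JSON field names are illustrative assumptions, not the exact pipeline.

```python
import json

def consensus_verdict(image_path: str, prompt: str, judges) -> dict:
    """Run every judge independently; the image passes only if all of them say PASS.

    `judges` is a list of callables (hypothetical wrappers around GPT-4o and
    Claude vision calls) that return a JSON string such as
    '{"verdict": "PASS", "reason": "8 strawberries in two rows of four"}'.
    """
    verdicts = [json.loads(judge(image_path, prompt)) for judge in judges]
    agreed = all(v["verdict"] == "PASS" for v in verdicts)
    return {"verdict": "PASS" if agreed else "FAIL", "judge_verdicts": verdicts}
```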
Why this matters: We computed CLIP-IQA, Aesthetic Score, Sharpness, and BRISQUE on every image. A Runway image with 12 strawberries instead of 8 scored higher on aesthetic quality than xAI's correct image with 8. A misspelled "XYLOPHONÉ" scored identically to "XYLOPHONE" on every automated metric. Standard metrics cannot distinguish a beautiful wrong image from a beautiful correct one.
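For reference, a minimal sketch of the kind of no-reference metrics involved, assuming OpenCV for sharpness (variance of the Laplacian) and the piq package for BRISQUE; CLIP-IQA and the LAION aesthetic predictor need their own model weights and are omitted here. The point is that nothing in these numbers knows how many strawberries the prompt asked for.

```python
import cv2
import torch
import piq

def quality_metrics(path: str) -> dict:
    """No-reference quality scores: how an image looks, not whether it is correct."""
    bgr = cv2.imread(path)
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)

    # Sharpness: variance of the Laplacian (higher means sharper edges).
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()

    # BRISQUE: lower means better perceived quality.
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
    tensor = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0).float() / 255.0
    brisque = piq.brisque(tensor, data_range=1.0).item()

    return {"sharpness": sharpness, "brisque": brisque}
```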
Results
1. The Scorecard
All numbers below are computed on the same 26 prompts, evaluated with the same dual-judge system.
| Model | Passed | Total | Pass Rate |
|---|---|---|---|
| OpenAI GPT | 18 | 26 | 69% |
| xAI Grok | 17 | 26 | 65% |
| Runway Gen4 | 13 | 26 | 50% |

Breaking this down by dimension reveals where each model wins and where the failures concentrate:
Per-dimension breakdown (26 shared prompts)
| Dimension | Runway Gen4 | xAI Grok | OpenAI GPT |
|---|---|---|---|
| Object Counting | 0/5 (0%) | 2/5 (40%) | 2/5 (40%) |
| Hands & Fingers | 2/4 (50%) | 3/4 (75%) | 3/4 (75%) |
| Text Legibility | 4/5 (80%) | 4/5 (80%) | 4/5 (80%) |
| Reflections | 1/3 (33%) | 1/3 (33%) | 1/3 (33%) |
| Physics & Gravity | 2/3 (67%) | 3/3 (100%) | 3/3 (100%) |
| Spatial Reasoning | 3/3 (100%) | 3/3 (100%) | 3/3 (100%) |
| Visual Consistency* | 0/2 (0%) | 1/2 (50%) | 1/2 (50%) |
| Multiple Subjects* | 1/1 (100%) | 0/1 (0%) | 1/1 (100%) |
*Small sample sizes (n=1 or n=2) — treat as directional signals, not robust estimates.
Text Legibility: "XYLOPHONE on a classroom sign" — all three pass. 80% accuracy is a genuine 2026 breakthrough compared to ~30% a year ago.
[Images: Runway Gen4 | xAI Grok | OpenAI GPT]
Reflections: "Man raising right hand in mirror" — all three fail. The reflection shows the same hand raised instead of the laterally inverted left hand. 33% across the board.
[Images: Runway Gen4 | xAI Grok | OpenAI GPT]
2. Hard-Mode Results (xAI and OpenAI only)
Runway was not tested on hard-mode prompts. The results below compare xAI (15 additional prompts) and OpenAI (20 additional prompts) on dimensions designed to push multi-constraint reasoning.
| Dimension | xAI Grok | OpenAI GPT |
|---|---|---|
| Counting + Attributes | 2/3 (67%) | 1/3 (33%) |
| Negation | 2/2 (100%) | 2/2 (100%) |
| Complex Text | 2/2 (100%) | 2/3 (67%) |
| Multi-Constraint | 0/1 (0%) | 2/2 (100%) |
| Logical Consistency | 1/1 (100%) | 1/2 (50%) |
| Impossible Task | 0/1 (0%) | 0/1 (0%) |
The multi-constraint dimension is a preliminary signal worth flagging. "A woman in a RED dress holding a BLUE umbrella in front of a YELLOW taxi" — OpenAI got every binding right on both prompts; xAI assigned colors to the wrong objects on its single prompt. The sample size is small (n=1 for xAI, n=2 for OpenAI), so this needs a larger test set to confirm, but it hints at a difference in how these models decompose compositional prompts.
Multi-constraint: RED dress + BLUE umbrella + YELLOW taxi — preliminary signal, small sample size
[Images: xAI Grok (FAIL) | OpenAI GPT (PASS)]
3. The Seven Universal Failures
Seven prompts defeated all three models on the shared 26-prompt set:

| # | Prompt | Dimension | Why It Fails |
|---|---|---|---|
| 1 | Exactly 11 glass marbles | Counting | Prime counts; models estimate density |
| 2 | 3 red, 2 blue, 1 green balloon | Counting | Can't track quantity per color |
| 3 | Exactly 4 pizza slices | Counting | Prototype (6-8 slices) overrides the count |
| 4 | Right hand; mirror shows left | Reflections | Lateral inversion not encoded |
| 5 | RUNWAY on shirt; YAWNUR in mirror | Reflections | Text + reflection combined |
| 6 | Menu: LATTE $4.50, MOCHA $5.25 | Text | Multi-line text + numbers unsolved |
| 7 | Correct 8x8 chessboard | Consistency | No global pattern enforcement |
Universal failure #7: "Correct 8x8 chessboard" — all boards look right at a glance; on inspection, same-color squares appear adjacent and alternation breaks across rows.
[Images: Runway Gen4 | xAI Grok | OpenAI GPT]
Universal failure #2: "3 red, 2 blue, 1 green balloon" — models produce roughly equal color distribution instead of the specified 3-2-1 split.
[Images: Runway Gen4 | xAI Grok | OpenAI GPT]
What These Failures Tell Us
The seven universal failures are not random bugs. Each points to a specific architectural limitation — and what would have to change to break through.
Counting fails because diffusion is continuous.
There is no internal mechanism for "exactly 8." Models activate a visual concept and produce an approximate quantity by pattern-matching. Consistent failure on prime counts like 11, versus round numbers, suggests density estimation rather than counting. Breaking this would require a discrete counting module inserted into the generation pipeline.
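For illustration, the nearest thing available today sits outside the generator: a generate-verify-retry loop in which an external detector counts the objects and rejects misses. The generate_image and count_objects hooks below are hypothetical stand-ins (a text-to-image API and an open-vocabulary detector), not part of any of the three systems tested.

```python
def generate_with_exact_count(prompt, label, target, generate_image, count_objects,
                              max_tries=5):
    """Rejection-sample until an external counter confirms the exact count.

    generate_image(prompt) -> image and count_objects(image, label) -> int are
    hypothetical hooks; the generators themselves have no such discrete check.
    """
    for attempt in range(1, max_tries + 1):
        image = generate_image(prompt)
        if count_objects(image, label) == target:
            return image, attempt
    return None, max_tries  # never hit the exact count

# e.g. generate_with_exact_count("exactly 11 glass marbles on a table",
#                                "marble", 11, my_generator, my_counter)
```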
Reflections fail because geometry isn't encoded.
No architecture represents mirror physics. Reflections are approximate copies, not lateral inversions. This is not a training data problem — mirrors are abundantly represented. It's a representation problem requiring explicit geometric transformation modules.
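The missing transformation itself is trivial to write down: a mirror is a flip across the mirror plane, which swaps left and right and reverses text. A toy NumPy sketch; the array-as-image abstraction is just for illustration.

```python
import numpy as np

def mirror_reflection(image: np.ndarray) -> np.ndarray:
    """Reflection across a vertical mirror plane: the linear map [[-1, 0], [0, 1]].

    Left and right swap while up and down are preserved, so a raised right
    hand appears as a raised left hand and printed text reverses.
    """
    return image[:, ::-1]  # flip the column (x) axis

print("RUNWAY"[::-1])  # -> 'YAWNUR', the text the mirror should show
```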
Pattern regularity fails because generation is local.
The chessboard failure: each local patch may have correct alternation, but there's no enforcement of global consistency across 64 squares. This applies to any repeating pattern — tile grids, brick walls, fabric weaves. Requires constraint propagation that current architectures lack.
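To make "global consistency" concrete, the sketch below is the entire rule a correct chessboard has to satisfy; the 0/1 grid of detected square colors is an assumed abstraction of what a judge (or a constraint-aware generator) would check.

```python
def is_valid_chessboard(colors: list[list[int]]) -> bool:
    """colors[r][c] is 0 (dark) or 1 (light) for each detected square.

    A correct 8x8 board obeys one global rule: colors[r][c] == (parity + r + c) % 2.
    Local patches can alternate correctly while the full board violates it,
    which is exactly the failure mode seen in the generated images.
    """
    if len(colors) != 8 or any(len(row) != 8 for row in colors):
        return False
    parity = colors[0][0]  # either board orientation is acceptable
    return all(colors[r][c] == (parity + r + c) % 2
               for r in range(8) for c in range(8))
```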
Multi-line text fails because rendering is holistic, not sequential.
Single-word text works at 80% because models learn visual word shapes. Multi-line text with numbers requires sequential generation — characters in order, formatting across lines. Current models generate everything simultaneously. Sequential text rendering would require an autoregressive component for text regions.
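One way to see why the menu prompt is the hard case: a pass requires every line, character, and digit to survive, which an OCR-based check makes explicit. A sketch using pytesseract; the required strings mirror universal failure #6, and the whitespace normalization is an assumption about how lenient the check should be.

```python
import pytesseract
from PIL import Image

REQUIRED_LINES = ["LATTE $4.50", "MOCHA $5.25"]  # universal failure #6

def menu_text_passes(path: str) -> bool:
    """PASS only if every required line is recovered verbatim by OCR.

    Single words like "XYLOPHONE" usually survive this kind of check;
    multi-line text with prices is where all three models failed.
    """
    text = pytesseract.image_to_string(Image.open(path))
    normalized = " ".join(text.upper().split())  # collapse newlines and spacing
    return all(line in normalized for line in REQUIRED_LINES)
```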
Is 65-69% a soft ceiling?
xAI and OpenAI cluster at 65-69% on the shared prompts. The remaining 31-35% failures concentrate in architecturally hard dimensions — counting, reflections, pattern consistency. These won't yield to larger training sets. They require structural changes: discrete counting, geometric reasoning, global consistency, sequential text. Until those happen, we may be looking at a capability plateau for prompt-following accuracy.
Conclusion
| Use Case | Strongest Model | On This Evaluation |
|---|---|---|
| Counting and hands | xAI Grok | 40% counting, 75% hands (least bad — still failing often) |
| Physics and multi-subject | OpenAI GPT | 100% physics, 100% spatial, 100% multi-subject |
| Text rendering | All three | 80% across the board — the 2026 success story |
| Reflections | None | 33% for everyone — architecturally unsolved |
The seven universal failures require architectural innovation, not more training data: discrete counting modules, geometric reasoning for reflections, global constraint propagation for patterns, and sequential rendering for multi-line text. These are the open problems.
Interested in running these failure-mode prompts on your own models? Reach out to Verita.
Appendix
A. Prompt Set and Evaluation Details
The 26 shared prompts were drawn from a larger pool designed to cover known failure modes. xAI was additionally tested on 15 hard-mode prompts; OpenAI on 20. Runway was tested only on the 26 shared prompts.
Judges: GPT-4o (primary) + Claude (secondary). Both must agree. Structured JSON output.
Automated metrics (supplementary): CLIP-IQA, Aesthetic Score (LAION), Sharpness, BRISQUE — not used for pass/fail.
B. Sample: The Strawberry Test
Prompt: "Exactly eight fresh red strawberries in two rows of four on a white ceramic plate."
| Model | Count | Verdict |
|---|---|---|
| Runway Gen4 | 12 | FAIL |
| xAI Grok | 8 | PASS |
| OpenAI GPT | 8 | PASS |
Interested in running these failure-mode prompts on your own models? Reach out to Verita at info@verita-ai.com.