Seven Prompts No AI Image Generator Can Get Right

We ran the same 26 adversarial prompts through three AI image generators (Runway Gen4, xAI Grok, and OpenAI GPT) and judged every output with two independent vision models that both had to agree. An image with 12 strawberries instead of the requested 8 scored higher on standard aesthetic metrics than the correct image. That is why we built this evaluation: automated metrics measure beauty, not truth.

Verita AI Research

Key findings (26 shared prompts, dual-judge consensus):

  • OpenAI GPT: 69% (18/26) - perfect scores on physics, spatial reasoning, and multi-subject scenes

  • xAI Grok: 65% (17/26) - tied for best on hands (75%) and counting (least bad at 40%)

  • Runway Gen4: 50% (13/26) - 0% on counting (0/5), weakest overall on precision tasks

  • Seven prompts defeated all three models - counting, reflections, multi-line text, and pattern regularity are architecturally unsolved


The Challenge

Standard evaluation metrics for image generation don't measure what matters for production use.

FID (Heusel et al., 2017) requires 20,000+ samples, has no compositional understanding, and actively disagrees with human judgment (Jayasumana et al., CVPR 2024). CLIPScore (Hessel et al., 2021) treats text as a bag of words — T2I-CompBench (NeurIPS 2023) confirmed it never ranks among the top metrics for compositional evaluation. Human preference models like HPSv2 and ImageReward predict which image looks better, not which is correct.

Compositional benchmarks have made progress — GenEval (NeurIPS 2023), TIFA (ICCV 2023), DPG-Bench (2024), GenAI-Bench (NeurIPS 2024) — all confirming models struggle with counting, spatial relations, and logical reasoning. But none target specific failure modes adversarially. We designed 26 prompts to break models on purpose — then gave all three the same set.

Our Approach

26 adversarial prompts across 8 dimensions, identical for all three models. Each targets a specific known weakness.

We additionally tested xAI (15 prompts) and OpenAI (20 prompts) on hard-mode dimensions — counting with color constraints, negation, complex text, multi-constraint, and impossible tasks. Runway was not tested on hard-mode prompts. These results are reported separately.

| Model | Provider | Shared Prompts | Hard-Mode |
|---|---|---|---|
| gen4_image_turbo | Runway | 26 | none |
| grok-imagine-image-pro | xAI | 26 | +15 |
| gpt-image-1 | OpenAI | 26 | +20 |

Dual-judge consensus. GPT-4o and Claude independently evaluate every image, and both must agree for a PASS. We built this after catching GPT-4o miscounting cherries in a single-judge setup. Estimated judging error rate: ~15% with a single judge, under 5% with dual-judge consensus.
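The consensus rule can be expressed as a simple gate: an image passes only when both judges independently return PASS. A minimal sketch (the verdict strings and `consensus` helper are illustrative, not the production pipeline):

```python
def consensus(verdict_a: str, verdict_b: str) -> str:
    """Dual-judge gate: PASS only if both independent judges say PASS.

    Any disagreement, or any single FAIL, downgrades the image to FAIL.
    This is why the second judge cuts the error rate: one judge's
    miscount must be replicated by the other to slip through.
    """
    return "PASS" if verdict_a == "PASS" and verdict_b == "PASS" else "FAIL"


# Illustrative verdicts from the two judges on three images
images = [
    ("strawberries_8", "PASS", "PASS"),   # both agree -> PASS
    ("strawberries_12", "PASS", "FAIL"),  # disagreement -> FAIL
    ("chessboard", "FAIL", "FAIL"),       # both agree -> FAIL
]
for name, a, b in images:
    print(name, consensus(a, b))
```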

Why this matters: We computed CLIP-IQA, Aesthetic Score, Sharpness, and BRISQUE on every image. A Runway image with 12 strawberries instead of 8 scored higher on aesthetic quality than xAI's correct image with 8. A misspelled "XYLOPHONÉ" scored identically to "XYLOPHONE" on every automated metric. Standard metrics cannot distinguish a beautiful wrong image from a beautiful correct one.
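To illustrate why such metrics are blind to correctness: a common sharpness proxy is the variance of the image's Laplacian (strong edges mean high variance). This is a minimal pure-Python sketch of that idea; the metrics in this evaluation came from standard implementations, and the Laplacian-variance formulation here is just one typical choice.

```python
def laplacian_variance(gray):
    """Sharpness proxy: variance of the 4-neighbour Laplacian.

    `gray` is a 2D list of grayscale intensities. Higher variance means
    stronger edges, i.e. a sharper image. Note what the metric cannot
    see: it says nothing about whether the image contains 8 strawberries
    or 12 -- only how crisp the edges are.
    """
    h, w = len(gray), len(gray[0])
    vals = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            vals.append(lap)
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)


flat = [[128] * 8 for _ in range(8)]                               # uniform: no edges
edges = [[255 if (x // 2) % 2 else 0 for x in range(8)] for _ in range(8)]
print(laplacian_variance(flat))   # 0.0
print(laplacian_variance(edges))  # large positive value
```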

Results

1. The Scorecard

All numbers below are computed on the same 26 prompts, evaluated with the same dual-judge system.

| Model | Passed | Total | Pass Rate |
|---|---|---|---|
| OpenAI GPT | 18 | 26 | 69% |
| xAI Grok | 17 | 26 | 65% |
| Runway Gen4 | 13 | 26 | 50% |


Breaking this down by dimension reveals where each model wins and where the failures concentrate:

Per-dimension breakdown (26 shared prompts)

| Dimension | Runway Gen4 | xAI Grok | OpenAI GPT |
|---|---|---|---|
| Object Counting | 0/5 (0%) | 2/5 (40%) | 2/5 (40%) |
| Hands & Fingers | 2/4 (50%) | 3/4 (75%) | 3/4 (75%) |
| Text Legibility | 4/5 (80%) | 4/5 (80%) | 4/5 (80%) |
| Reflections | 1/3 (33%) | 1/3 (33%) | 1/3 (33%) |
| Physics & Gravity | 2/3 (67%) | 3/3 (100%) | 3/3 (100%) |
| Spatial Reasoning | 3/3 (100%) | 3/3 (100%) | 3/3 (100%) |
| Visual Consistency* | 0/2 (0%) | 1/2 (50%) | 1/2 (50%) |
| Multiple Subjects* | 1/1 (100%) | 0/1 (0%) | 1/1 (100%) |

*Small sample sizes (n=1 or n=2) — treat as directional signals, not robust estimates.

Text Legibility: "XYLOPHONE on a classroom sign" — all three pass. 80% accuracy is a genuine 2026 breakthrough compared to ~30% a year ago.

[Figure: XYLOPHONE classroom sign, side-by-side outputs from Runway Gen4, xAI Grok, and OpenAI GPT]

Reflections: "Man raising right hand in mirror" — all three fail. The reflection shows the same hand raised instead of the laterally inverted left hand. 33% across the board.

[Figure: mirror-reflection prompt, side-by-side outputs from Runway Gen4, xAI Grok, and OpenAI GPT]

2. Hard-Mode Results (xAI and OpenAI only)

Runway was not tested on hard-mode prompts. The results below compare xAI (15 additional prompts) and OpenAI (20 additional prompts) on dimensions designed to push multi-constraint reasoning.

| Dimension | xAI Grok | OpenAI GPT |
|---|---|---|
| Counting + Attributes | 2/3 (67%) | 1/3 (33%) |
| Negation | 2/2 (100%) | 2/2 (100%) |
| Complex Text | 2/2 (100%) | 2/3 (67%) |
| Multi-Constraint | 0/1 (0%) | 2/2 (100%) |
| Logical Consistency | 1/1 (100%) | 1/2 (50%) |
| Impossible Task | 0/1 (0%) | 0/1 (0%) |

The multi-constraint dimension is a preliminary signal worth flagging. "A woman in a RED dress holding a BLUE umbrella in front of a YELLOW taxi" — OpenAI got every binding right on both prompts; xAI assigned colors to wrong objects on its single prompt. The sample size is small (n=1 for xAI, n=2 for OpenAI), so this needs a larger test set to confirm, but it hints at a difference in how these models decompose compositional prompts.
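A binding check like the one used to score this prompt reduces to an exact match between the expected object-to-color map and what a judge reports seeing. A minimal sketch (the judge-report dicts are illustrative, not actual judge output):

```python
expected = {"dress": "red", "umbrella": "blue", "taxi": "yellow"}

def binding_pass(expected, observed):
    """PASS only if every object carries exactly its specified color.

    A single swapped binding (e.g. a blue dress) fails the whole
    prompt: multi-constraint scoring is all-or-nothing.
    """
    return all(observed.get(obj) == color for obj, color in expected.items())

# Illustrative judge reports for the two model outputs described above
openai_report = {"dress": "red", "umbrella": "blue", "taxi": "yellow"}
xai_report = {"dress": "blue", "umbrella": "red", "taxi": "yellow"}  # swapped

print(binding_pass(expected, openai_report))  # True
print(binding_pass(expected, xai_report))     # False
```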

Multi-constraint: RED dress + BLUE umbrella + YELLOW taxi — preliminary signal, small sample size

[Figure: xAI Grok output (FAIL) and OpenAI GPT output (PASS)]

3. The Seven Universal Failures

Seven prompts defeated all three models on the shared 26-prompt set:


| # | Prompt | Dimension | Why It Fails |
|---|---|---|---|
| 1 | Exactly 11 glass marbles | Counting | Prime count; models estimate density |
| 2 | 3 red, 2 blue, 1 green balloon | Counting | Can't track quantity per color |
| 3 | Exactly 4 pizza slices | Counting | Prototype (6-8 slices) overrides the count |
| 4 | Right hand; mirror shows left | Reflections | Lateral inversion not encoded |
| 5 | RUNWAY on shirt; YAWNUR in mirror | Reflections | Text + reflection combined |
| 6 | Menu: LATTE $4.50, MOCHA $5.25 | Text | Multi-line text + numbers unsolved |
| 7 | Correct 8x8 chessboard | Consistency | No global pattern enforcement |

Universal failure #7: "Correct 8x8 chessboard" — all boards look right at a glance; on inspection, same-color squares appear adjacent and alternation breaks across rows.

[Figure: chessboard outputs from Runway Gen4, xAI Grok, and OpenAI GPT]

Universal failure #2: "3 red, 2 blue, 1 green balloon" — models produce roughly equal color distribution instead of the specified 3-2-1 split.

[Figure: balloon outputs from Runway Gen4, xAI Grok, and OpenAI GPT]

What These Failures Tell Us

The seven universal failures are not random bugs. Each points to a specific architectural limitation — and what would have to change to break through.

Counting fails because diffusion is continuous.

There is no internal mechanism for "exactly 8." Models activate a visual concept and generate approximate quantity by pattern-matching. Consistent failure on prime numbers (11) vs round numbers suggests density estimation, not counting. Breaking this would require a discrete counting module inserted into the generation pipeline.
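Short of such an architectural change, one application-layer workaround is verify-and-retry: generate, count with a detector, and resample on mismatch. A minimal sketch with stubbed `generate` and `count_objects` functions (both are placeholders for a real generator API and a real object detector):

```python
def generate_and_verify(target_count, generate, count_objects, max_attempts=5):
    """Reject-and-resample loop: keep generating until a detector
    confirms the exact object count, or attempts run out.

    This compensates at the application layer for the lack of a
    discrete counting mechanism inside the generator itself, at the
    cost of extra generation calls.
    """
    for attempt in range(1, max_attempts + 1):
        image = generate()
        if count_objects(image) == target_count:
            return image, attempt
    return None, max_attempts


# Stub generator whose successive "images" contain these strawberry counts
counts = iter([12, 9, 8])
generate = lambda: next(counts)      # an "image" here is just its count
count_objects = lambda image: image  # stub detector

image, attempts = generate_and_verify(8, generate, count_objects)
print(image, attempts)  # 8 3
```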

Reflections fail because geometry isn't encoded.

No architecture represents mirror physics. Reflections are approximate copies, not lateral inversions. This is not a training data problem — mirrors are abundantly represented. It's a representation problem requiring explicit geometric transformation modules.

Pattern regularity fails because generation is local.

The chessboard failure: each local patch may have correct alternation, but there's no enforcement of global consistency across 64 squares. This applies to any repeating pattern — tile grids, brick walls, fabric weaves. Requires constraint propagation that current architectures lack.
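Part of what makes the chessboard such a clean probe is that the global constraint is trivial to verify mechanically: strict alternation means every pair of edge-adjacent squares must differ. A minimal checker sketch over a board encoded as 0/1 squares:

```python
def is_valid_chessboard(board):
    """True iff an 8x8 board of 0/1 squares alternates globally.

    Checks every horizontally and vertically adjacent pair; a single
    same-color adjacency anywhere breaks the pattern. Locally plausible
    patches are not enough -- this is the global constraint that
    current generators fail to enforce.
    """
    if len(board) != 8 or any(len(row) != 8 for row in board):
        return False
    for y in range(8):
        for x in range(8):
            if x + 1 < 8 and board[y][x] == board[y][x + 1]:
                return False
            if y + 1 < 8 and board[y][x] == board[y + 1][x]:
                return False
    return True


good = [[(x + y) % 2 for x in range(8)] for y in range(8)]
bad = [row[:] for row in good]
bad[3][4] = bad[3][5]  # one same-color adjacency, as in the model outputs
print(is_valid_chessboard(good))  # True
print(is_valid_chessboard(bad))   # False
```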

Multi-line text fails because rendering is holistic, not sequential.

Single-word text works at 80% because models learn visual word shapes. Multi-line text with numbers requires sequential generation — characters in order, formatting across lines. Current models generate everything simultaneously. Sequential text rendering would require an autoregressive component for text regions.

Is 65-69% a soft ceiling?

xAI and OpenAI cluster at 65-69% on the shared prompts. The remaining 31-35% failures concentrate in architecturally hard dimensions — counting, reflections, pattern consistency. These won't yield to larger training sets. They require structural changes: discrete counting, geometric reasoning, global consistency, sequential text. Until those happen, we may be looking at a capability plateau for prompt-following accuracy.

Conclusion

| Use Case | Strongest Model | On This Evaluation |
|---|---|---|
| Counting and hands | xAI Grok | 40% counting, 75% hands (tied with OpenAI GPT on shared prompts; still failing often) |
| Physics and multi-subject | OpenAI GPT | 100% physics, 100% spatial, 100% multi-subject |
| Text rendering | All three | 80% across the board - the 2026 success story |
| Reflections | None | 33% for everyone - architecturally unsolved |
The seven universal failures require architectural innovation, not more training data: discrete counting modules, geometric reasoning for reflections, global constraint propagation for patterns, and sequential rendering for multi-line text. These are the open problems.

Appendix

A. Prompt Set and Evaluation Details

The 26 shared prompts were drawn from a larger pool designed to cover known failure modes. xAI was additionally tested on 15 hard-mode prompts; OpenAI on 20. Runway was tested only on the 26 shared prompts.

Judges: GPT-4o (primary) + Claude (secondary). Both must agree. Structured JSON output.
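The structured output mentioned above can take a shape like the following; the field names and values here are an illustrative sketch, not the exact production schema:

```python
import json

# Hypothetical per-judge verdict; all field names are illustrative
verdict_json = """{
  "prompt_id": "counting_03",
  "verdict": "FAIL",
  "observed_count": 12,
  "expected_count": 8,
  "reason": "Twelve strawberries visible; prompt specified exactly eight."
}"""

verdict = json.loads(verdict_json)

# Structured fields make the consensus check and error analysis
# mechanical: no parsing of free-form judge prose is needed.
print(verdict["verdict"], verdict["observed_count"])
```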

Automated metrics (supplementary): CLIP-IQA, Aesthetic Score (LAION), Sharpness, BRISQUE — not used for pass/fail.

B. Sample: The Strawberry Test

Prompt: "Exactly eight fresh red strawberries in two rows of four on a white ceramic plate."

| Model | Count | Verdict |
|---|---|---|
| Runway Gen4 | 12 | FAIL |
| xAI Grok | 8 | PASS |
| OpenAI GPT | 8 | PASS |


Interested in running these failure-mode prompts on your own models? Reach out to Verita at info@verita-ai.com.

VERITA / CONTACT

Let’s talk about what you’re building and how we can help.

  • Want to DM us? @verita_ai on X

  • Prefer the old way? info@verita-ai.com

Let’s start the conversation.
