WHOOPS!
by Community
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images
OSS
Added 1 June 2026
Overview
WHOOPS! is a community benchmark for evaluating vision-and-language models on synthetic, compositional images. It tests common sense reasoning by presenting model-generated scenes that deliberately violate real-world expectations.
Best for
Researchers evaluating vision-language models on common sense and compositional reasoning
Use cases
- Benchmarking vision-language models on common sense violations
- Evaluating compositional understanding in synthetic scenes
- Testing model robustness to atypical image compositions
Notes
Pros
- Focuses on challenging common sense reasoning, a key weakness in many models
- Synthetic images allow precise control over compositional elements
- Community-driven benchmark fosters open research
Cons
- Results on synthetic images may not transfer fully to real-world images
- Limited to vision-language tasks; other modalities are not covered
- Narrow focus on common sense violations does not measure broader model capabilities
Indexed from awesome-llm and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one.
lm-evaluation-harness
Community
A framework for few-shot evaluation of language models.
OpenAI Evals
Community
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.