WHOOPS!

by Community

Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images

Added 1 June 2026

Overview

WHOOPS! is a community-contributed benchmark for evaluating vision-and-language models on synthetic, compositional images. It tests common sense reasoning with model-generated scenes that deliberately break typical real-world expectations.

Best for

Researchers evaluating vision-language models on common sense and compositional reasoning

Use cases

  • Benchmarking vision-language models on common sense violations
  • Evaluating compositional understanding in synthetic scenes
  • Testing model robustness to atypical image compositions
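As a rough illustration of the benchmarking use case, the sketch below scores a model on question-answer records by exact match after light normalization. The record fields (`image_path`, `question`, `answer`), the sample records, and the `predict` callable are hypothetical stand-ins, not the benchmark's actual schema or official metric; exact match is assumed here only for simplicity.

```python
from typing import Callable, Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are not counted as errors."""
    return " ".join(text.lower().split())


def exact_match_accuracy(
    examples: Iterable[dict],
    predict: Callable[[str, str], str],
) -> float:
    """Return the fraction of examples whose prediction equals the reference answer."""
    total = 0
    correct = 0
    for ex in examples:
        # `predict` is a hypothetical wrapper around whatever model is under test.
        prediction = predict(ex["image_path"], ex["question"])
        if normalize(prediction) == normalize(ex["answer"]):
            correct += 1
        total += 1
    # An empty evaluation set yields 0.0 rather than a division error.
    return correct / total if total else 0.0


# Hypothetical records; the real dataset's fields and contents may differ.
examples = [
    {"image_path": "img_001.png", "question": "What is the man holding?", "answer": "a candle"},
    {"image_path": "img_002.png", "question": "What is the child eating?", "answer": "ice cream"},
]


def dummy_predict(image_path: str, question: str) -> str:
    """Placeholder model that always gives the same answer."""
    return "A  Candle"


accuracy = exact_match_accuracy(examples, dummy_predict)
print(f"exact-match accuracy: {accuracy:.2f}")
```

Swapping `dummy_predict` for a function that calls a real vision-language model keeps the scoring loop unchanged, which makes it easier to compare several models on the same records.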

Pros

  • Focuses on challenging common sense reasoning, a key weakness in many models
  • Synthetic images allow precise control over compositional elements
  • Community-driven benchmark fosters open research

Cons

  • Results on synthetic images may not transfer to real-world photographs
  • Covers only vision-language tasks; other modalities are out of scope
  • Narrow focus on common sense violations says little about broader model capabilities

Indexed from awesome-llm and enriched against its public facts.
