Reasoning About Images
Testing GPT-4V’s ability to understand and reason about visual content.
Experiments
Spatial Reasoning
Questions like “What is to the left of X?” or “How are these objects arranged?”
Result: Generally accurate for clear images, struggles with complex scenes.
Counting
Questions like “How many people are in this photo?”
Result: Accurate for small numbers, less reliable above 10.
Reading Text
OCR-style tasks: reading signs, labels, handwriting.
Result: Good for printed text, variable for handwriting.
Diagrams and Charts
Interpreting flowcharts, graphs, UML diagrams.
Result: Can describe structure, may misread specific values.
Practical Applications
- Describing images for accessibility
- Analyzing screenshots for debugging
- Interpreting diagrams from documentation
- Converting whiteboard sketches to text