In a study by AI and language researchers Martha Lewis of the University of Amsterdam and Melanie Mitchell of the Santa Fe Institute, GPT-4's ability to handle analogical reasoning was tested against human performance. Analogical reasoning, the ability to draw comparisons between different things based on shared similarities, is a crucial method humans use to understand the world. For example: "Coffee is to cup as soup is to ???" (Answer: bowl).
While GPT-4 performs well on standard analogy tests, the study found that it struggled when the problems were slightly altered. Humans maintained their performance across these variations; GPT-4's accuracy dropped.
GPT's Shortcomings in Reasoning
The study tested AI and human performance on three types of analogy problems:
Letter Sequences
Digit Matrices
Story Analogies
AI models like GPT-4 performed well on standard tests, but when faced with modified versions—such as changes in the position of a missing number or slight rewording of a story—GPT-4's performance faltered. Humans, however, remained consistent across the modifications. This suggests that GPT models lack the flexibility of human reasoning and often rely on pattern recognition rather than true understanding.
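The kind of variation involved is easiest to see with a concrete letter-sequence problem. The sketch below is illustrative only: the prompt wording and the swapped-letter "counterfactual" alphabet are assumptions for the sake of example, not the study's actual materials. The point is that the abstract rule ("replace the last letter with its successor") stays the same, but the correct answer changes when the alphabet is permuted, so a solver cannot simply echo memorized letter patterns.

```python
# Illustrative sketch of an original vs. modified letter-sequence analogy.
# The prompt wording and the permuted alphabet are assumptions, not the
# materials used by Lewis & Mitchell.

STANDARD = list("abcdefghijklmnopqrstuvwxyz")

# Counterfactual alphabet: same letters, but 'm' and 'r' swap positions,
# so the successor of 'l' is now 'r' instead of 'm'.
COUNTERFACTUAL = list("abcdefghijklrnopqmstuvwxyz")

def make_prompt(alphabet):
    """Pose a letter-sequence analogy over the given (possibly permuted) alphabet."""
    return (
        "Use this fictional alphabet: " + " ".join(alphabet) + "\n"
        "If the string a b c d changes to a b c e, "
        "what should the string i j k l change to?"
    )

def successor_rule_answer(alphabet):
    """Ground-truth answer under the rule 'replace the last letter with its successor'."""
    return "i j k " + alphabet[alphabet.index("l") + 1]

for name, alphabet in [("original", STANDARD), ("modified", COUNTERFACTUAL)]:
    print(f"{name} prompt:")
    print(make_prompt(alphabet))
    print("expected answer:", successor_rule_answer(alphabet))
    print()
```

Running this prints the same analogy twice: under the standard alphabet the expected completion is "i j k m", while under the permuted alphabet it is "i j k r". According to the study, humans adapt to this kind of change far more reliably than GPT-4 does.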
The Challenge for AI in Decision-Making
This research indicates that AI models like GPT-4 do not truly "understand" the analogies they solve. Their answers often mimic patterns seen in training data rather than reflecting the abstract comprehension that characterizes human cognition. The study concludes that GPT models reason less robustly than humans, especially on complex reasoning tasks, pointing to the limitations of AI in fields that require critical decision-making, such as healthcare, law, and education.
This is a critical reminder that while AI can be a powerful tool, it is not yet capable of replacing human thinking in complex, nuanced scenarios.
Article Details:
Martha Lewis and Melanie Mitchell (2025), ‘Evaluating the Robustness of Analogical Reasoning in Large Language Models’, Transactions on Machine Learning Research.