Even the most capable AI models still fall short of expert-level understanding of complex global history.

Artificial intelligence (AI) has become our go-to for everything from drafting legal briefs to diagnosing software bugs, but when it comes to answering detailed questions about the ancient world, it stumbles badly.

Consider GPT-4 Turbo, arguably the sharpest language model in the world, which only managed a 46% score on a custom-designed historical test. That's closer to random guessing than to expert-level performance.

This isn't just a critique of current AI models; it's a revelation of the deep-rooted limitations they still face, particularly when navigating the messy, interpretative terrain of history.

These discoveries emerged from a groundbreaking study presented at the Neural Information Processing Systems (NeurIPS) conference in Vancouver. Researchers used the Seshat Global History Databank, a meticulous compilation of over 36,000 data points on 600 historical societies, as the foundation for a multiple-choice challenge designed to test AI's historical IQ.

The Seshat Databank is no casual Wikipedia skim. Built over years by expert historians and research assistants, it compiles 10,000 years of data, spanning every major world region. These are not just binary facts but intricate pieces of evidence coded as present, absent, inferred present, or inferred absent.

The benchmark the researchers created from this data is unlike any test AI has faced before. It challenges the model to distinguish not just what happened, but what might have happened based on indirect clues, a skill that demands not just pattern recognition but historical reasoning.
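To make that concrete, here is a minimal Python sketch of how a four-way coded record and the kind of multiple-choice question built from it might look. The enum mirrors the four codes described above, but the class names, fields, and example values are our own illustration, not Seshat's actual schema or the benchmark's exact prompt.

```python
from dataclasses import dataclass
from enum import Enum

class EvidenceCode(Enum):
    """The four-valued coding described above: direct evidence
    versus inference, presence versus absence."""
    PRESENT = "present"
    ABSENT = "absent"
    INFERRED_PRESENT = "inferred present"
    INFERRED_ABSENT = "inferred absent"

@dataclass
class SeshatRecord:
    polity: str          # a historical society (illustrative)
    variable: str        # an institution or technology (illustrative)
    code: EvidenceCode   # how the evidence is coded

def as_multiple_choice(record: SeshatRecord) -> str:
    """Render a record as the four-option question this
    benchmark format implies (wording is hypothetical)."""
    options = "\n".join(
        f"  ({letter}) {code.value}"
        for letter, code in zip("ABCD", EvidenceCode)
    )
    return (f"For the society '{record.polity}', how is "
            f"'{record.variable}' coded?\n{options}")

print(as_multiple_choice(
    SeshatRecord("Roman Principate", "professional soldiers",
                 EvidenceCode.PRESENT)))
```

The point of the four-way coding is exactly what makes the test hard: two of the options hinge on inference from indirect evidence rather than on recorded fact.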

Models were tested on this dataset using a balanced accuracy metric. Here, random guessing nets you 25%, while expert-level understanding should approach 100%. Even then, the best performer, GPT-4 Turbo, only managed 46%.

That's with examples provided. That's with the instruction to respond like a professional historian. That's with multiple-choice answers.
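For readers unfamiliar with the metric: balanced accuracy is the mean of per-class recall, so each of the four codes counts equally no matter how rare it is, and a uniform random guesser lands at 25%. A self-contained sketch (our illustration, not the study's code):

```python
import random
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall: every class weighs the same,
    however imbalanced the data."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

CODES = ["present", "absent", "inferred present", "inferred absent"]

random.seed(0)
# Skewed ground truth, as historical codings tend to be.
y_true = random.choices(CODES, weights=[5, 3, 1, 1], k=100_000)
y_pred = [random.choice(CODES) for _ in y_true]  # uniform guesser

print(f"{balanced_accuracy(y_true, y_pred):.3f}")  # ~0.250 for 4 classes
```

Against that 25% floor, a 46% score shows some real signal, but nothing close to the near-100% an expert would reach.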

This limitation reflects real training bias. Most source materials used to train large language models come from well-documented, English-dominant historical regions. That translates into blind spots when the model tries to process global data.

Several key limitations emerge:

  1. Data Dependency - AI needs large, accurate, and representative datasets to interpret historical information effectively. In domains lacking comprehensive or unbiased historical records, AI may produce unreliable or misleading results.
  2. Lack of General Intelligence - Current AI models cannot reliably recognize or compensate for contextual nuance, nor generalize knowledge across domains.
  3. Bias and Fairness - AI trained on biased historical data risks perpetuating or amplifying these biases in its interpretations.
  4. Explainability - Many AI systems provide outputs without clear explanations of their reasoning, making it difficult for historians or end-users to trust or verify the AI’s interpretations.
  5. Creativity - AI trained on existing historical content tends to reinforce established patterns and narratives rather than generate new perspectives or challenge historical consensus.

This study reminds us that technology doesn't eliminate the need for human expertise; it magnifies it. As LLMs get smarter, so must our questions. So must our benchmarks. So must our willingness to challenge the tools we build. AI might one day help us unlock the secrets of past civilizations, but for now, it's clear: there are no shortcuts to understanding history. Not even digital ones.
