Evaluating LLMs on 1M-Token Contexts for Classical Chinese

Retrieval and Multi-Hop Reasoning in 1M-Token Context Windows: Evaluating LLMs on Classical Chinese Text

A study recently published on arXiv examines the long-context retrieval and reasoning capabilities of five flagship large language models (LLMs) that advertise 1M-token context windows. The evaluation uses a classical Chinese corpus and consists of two complementary tests of how well these models locate and combine information in very long inputs.

Study Overview

The study takes a two-part approach. Test 1 evaluates single-needle retrieval at the maximum input length of 1M tokens: three biographical “needles” are inserted at varying depths within the text. To separate genuine in-context retrieval from recall of memorized training data, the study uses both real and altered variants of these needles, where the altered variants contradict the training data. A model that answers with the altered fact must have read it from the context; a model that answers with the real fact is likely relying on memory.
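The needle setup described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual harness: the function names `insert_needle` and `classify_answer`, the whitespace tokenization, and the example place names are all assumptions made for the example.

```python
def insert_needle(haystack, needle, depth):
    """Return a copy of `haystack` with `needle` spliced in at relative `depth`.

    depth=0.0 places the needle at the start, 1.0 at the end.
    Both arguments are lists of tokens (here: whitespace-split words,
    which is an assumption; a real harness would use the model tokenizer).
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = int(depth * len(haystack))
    return haystack[:position] + needle + haystack[position:]


def classify_answer(answer, altered_value, real_value):
    """Label a model's answer to a question about an ALTERED needle.

    The altered value appears only in the context; the real value is what
    the model may have memorized. If both appear, the in-context label wins.
    """
    if altered_value in answer:
        return "in-context retrieval"
    if real_value in answer:
        return "memorized training data"
    return "neither"


# Hypothetical example: ten filler tokens and a made-up altered fact.
haystack = ("filler " * 10).split()
needle = "The scholar was born in Luoyang .".split()
context = insert_needle(haystack, needle, depth=0.5)

label = classify_answer("He was born in Luoyang", "Luoyang", "Chang'an")
```

Running the same question against several depths, and against both the real and the altered variant, is what lets an evaluation of this kind attribute a correct answer to the context rather than to memory.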

Test 2 builds on Test 1 by measuring multi-hop reasoning at three context tiers: 256K, 512K, and 1M tokens. The aim is to determine whether retrieval accuracy holds up when a task requires intermediate reasoning steps over a long context.
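To make the difference from single-needle retrieval concrete, here is a toy sketch of a two-hop lookup and a per-tier accuracy score. Everything in it is hypothetical: the `relation(subject)=object` fact format, the names `follow_hops` and `accuracy_by_tier`, and the tiny contexts standing in for the real tiers. The article does not describe how the study builds its multi-hop questions.

```python
import re

# Toy fact format, e.g. "teacher_of(Scholar_A)=Scholar_B" (an assumption).
FACT = re.compile(r"(\w+)\((\w+)\)=(\w+)")


def follow_hops(context, start, relations):
    """Resolve a chain of relations from facts scattered through `context`."""
    facts = {(rel, subj): obj for rel, subj, obj in FACT.findall(context)}
    entity = start
    for rel in relations:
        entity = facts.get((rel, entity))
        if entity is None:
            return None  # one missing intermediate fact breaks the whole chain
    return entity


def accuracy_by_tier(answer_fn, cases_by_tier):
    """Fraction of cases answered correctly, reported separately per tier."""
    return {
        tier: sum(answer_fn(case) == case["answer"] for case in cases) / len(cases)
        for tier, cases in cases_by_tier.items()
    }


filler = "filler " * 5
complete = {
    "context": filler + "teacher_of(Scholar_A)=Scholar_B " + filler
               + "birthplace_of(Scholar_B)=Town_C",
    "start": "Scholar_A",
    "relations": ["teacher_of", "birthplace_of"],
    "answer": "Town_C",
}
# Same question, but the first hop is missing from the context.
broken = dict(complete, context=filler + "birthplace_of(Scholar_B)=Town_C")


def oracle(case):
    # Stand-in for a model call; a real evaluation would query an LLM here.
    return follow_hops(case["context"], case["start"], case["relations"])


scores = accuracy_by_tier(oracle, {"256K": [complete], "1M": [complete, broken]})
```

The point of the sketch is that a multi-hop answer depends on every intermediate fact being found, so errors compound in a way they do not for a single needle.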

Key Findings

The two tests produced the following results:

  • Single-Needle Retrieval: The strongest models, including Gemini 3.1 Pro, Claude Opus 4.7, and GPT-5.5, achieved 100% accuracy on single-needle retrieval at the 1M-token length. For these models, single-needle retrieval at this length appears to be effectively solved.
  • Multi-Hop Performance: The study identified three distinct decay patterns in multi-hop retrieval performance:
    • Stable Regime: Models such as Gemini 3.1 Pro and Claude Opus 4.7 maintained over 80% accuracy through the 512K tier, with only modest degradation at the 1M tier.
    • Late-Cliff Regime: GPT-5.5 and Qwen3.6-plus showed a sharp decline in accuracy between the 512K and 1M tiers, indicating that their multi-hop reasoning breaks down near the top of the context window.
    • Smooth-Decline Regime: DeepSeek V4 Pro exhibited a gradual decrease in accuracy across the entire range, suggesting a more consistent but less robust performance in multi-hop tasks.
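The three regimes above can be expressed as a simple rule over per-tier accuracy. The sketch below is only one way to operationalize them: the 80% floor comes from the article, while the `cliff_drop` threshold of 0.30, the function name `classify_decay`, and all accuracy values in the example are made up for illustration and are not results from the study.

```python
def classify_decay(acc, stable_floor=0.80, cliff_drop=0.30):
    """Assign a decay regime from accuracies at the 256K, 512K, and 1M tiers.

    acc: dict mapping "256K", "512K", "1M" to accuracy in [0, 1].
    stable_floor: minimum accuracy through 512K for the stable regime
                  (the 80% figure mentioned in the article).
    cliff_drop: hypothetical threshold for a "sharp" 512K -> 1M drop.
    """
    late_drop = acc["512K"] - acc["1M"]
    if late_drop >= cliff_drop:
        return "late-cliff"
    if min(acc["256K"], acc["512K"]) >= stable_floor:
        return "stable"
    if acc["256K"] > acc["512K"] > acc["1M"]:
        return "smooth-decline"
    return "unclassified"


# Illustrative numbers only, NOT taken from the study.
examples = {
    "model_a": {"256K": 0.95, "512K": 0.90, "1M": 0.82},
    "model_b": {"256K": 0.92, "512K": 0.88, "1M": 0.40},
    "model_c": {"256K": 0.75, "512K": 0.60, "1M": 0.45},
}
regimes = {name: classify_decay(acc) for name, acc in examples.items()}
```

With these made-up inputs, `model_a` falls in the stable regime, `model_b` in the late-cliff regime, and `model_c` in the smooth-decline regime. Different threshold choices would move the boundaries, so the labels should be read as descriptive rather than exact.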

Implications for Future Research

The findings underscore a central point: an LLM's nominal context-window length does not necessarily reflect its usable long-context multi-hop capability. The sharp transition between the 512K and 1M tiers is a key discriminator among current flagship models. This suggests that further research should focus on improving multi-hop reasoning, particularly as context windows continue to expand.

As context windows grow, understanding how models handle complex reasoning over long inputs will be important for building more capable and reliable systems, both in research and in practical applications of LLMs.

Lazarus Omolua (https://richlyai.com/blog)