Evaluating LLMs on 1M-Token Contexts for Classical Chinese

Retrieval and Multi-Hop Reasoning in 1M-Token Context Windows: Evaluating LLMs on Classical Chinese Text

In a study recently published on arXiv, researchers examined the long-context retrieval and reasoning capabilities of five leading large language models (LLMs) with 1M-token context windows. The evaluation uses a classical Chinese corpus and comprises two complementary tests that assess how well these models locate and combine information spread across very long inputs.

Study Overview

The study takes a two-part approach. Test 1 evaluates single-needle retrieval at the maximum input length of 1M tokens. Three biographical “needles” are inserted at varying depths within the text. To separate genuine in-context retrieval from reliance on memorized training data, the study includes both real and altered variants of these needles, with the altered variants contradicting the training data. A model that reproduces an altered fact must have read it from the context, since it could not have recalled it from training.
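The needle setup described above can be sketched in a few lines. This is a minimal illustration, not the study's harness: the function names `insert_needle` and `classify_answer` are hypothetical, character offsets stand in for token positions, and a simple substring match stands in for whatever scoring the authors used.

```python
def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` into `haystack` at a fractional depth.

    depth=0.0 places the needle at the start, depth=1.0 at the end.
    Assumption: character offsets approximate token positions; a real
    harness would count tokens and respect sentence boundaries.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = int(len(haystack) * depth)
    return haystack[:position] + needle + haystack[position:]


def classify_answer(answer: str, real_value: str, altered_value: str) -> str:
    """Label a model answer for an altered-needle trial.

    Assumption: substring matching is a stand-in for the study's scoring.
    - "in_context": the answer repeats the altered fact found in the prompt
    - "memorized":  the answer repeats the real fact from training data
    - "miss":       the answer contains neither
    """
    if altered_value in answer:
        return "in_context"
    if real_value in answer:
        return "memorized"
    return "miss"
```

With an altered needle in the prompt, an answer labeled "memorized" signals that the model fell back on training data instead of reading the context.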

Test 2 follows up by assessing multi-hop reasoning at three context tiers: 256K, 512K, and 1M tokens. The aim is to determine whether the models’ retrieval performance holds up when the task requires intermediate reasoning steps over a long context.
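Results from a tiered test like this are naturally summarized as one accuracy figure per context tier. The sketch below shows that aggregation under assumptions: the function name `accuracy_by_tier` and the `(tier_label, is_correct)` record format are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def accuracy_by_tier(records):
    """Compute the fraction of correct answers per context tier.

    `records` is an iterable of (tier_label, is_correct) pairs, for
    example ("512K", True). The record format is an assumption made
    for this sketch.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for tier, is_correct in records:
        totals[tier] += 1
        hits[tier] += int(is_correct)
    return {tier: hits[tier] / totals[tier] for tier in totals}
```

Keeping the tiers separate, instead of reporting one pooled score, is what makes a drop between two tiers visible.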

Key Findings

The results from the two tests are as follows:

Single-Needle Retrieval: The strongest models, including Gemini 3.1 Pro, Claude Opus 4.7, and GPT-5.5, achieved 100% accuracy on single-needle retrieval at the 1M-token length. For these models, single-needle retrieval at this scale appears to be effectively solved on this benchmark.
Multi-Hop Performance: The study identified three distinct decay patterns in multi-hop performance as context length grows:

Stable Regime: Gemini 3.1 Pro and Claude Opus 4.7 maintained over 80% accuracy through the 512K tier, with only modest degradation at 1M.
Late-Cliff Regime: GPT-5.5 and Qwen3.6-plus showed a sharp drop in accuracy between the 512K and 1M tiers, indicating that their multi-hop reasoning is vulnerable near the full window.
Smooth-Decline Regime: DeepSeek V4 Pro lost accuracy gradually across the entire range, suggesting more consistent but less robust multi-hop performance.
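The three regimes above can be expressed as a simple rule over per-tier accuracies. This is a hedged sketch, not the authors' classification procedure: the function `classify_regime` is hypothetical, the 0.80 floor mirrors the "over 80%" figure reported for the stable regime, and the 0.20 `cliff_drop` threshold for a "sharp" decline is an assumption chosen for illustration.

```python
def classify_regime(acc_256k: float, acc_512k: float, acc_1m: float,
                    stable_floor: float = 0.80,
                    cliff_drop: float = 0.20) -> str:
    """Assign a decay regime from multi-hop accuracy at three tiers.

    Accuracies are fractions in [0, 1].
    stable_floor: minimum 512K accuracy for the stable regime
                  (from the article's "over 80%" description).
    cliff_drop:   accuracy loss counted as a sharp drop
                  (an assumption, not a value from the study).
    """
    early_drop = acc_256k - acc_512k  # loss from 256K to 512K
    late_drop = acc_512k - acc_1m     # loss from 512K to 1M

    if acc_512k > stable_floor and late_drop < cliff_drop:
        return "stable"
    if late_drop >= cliff_drop and early_drop < cliff_drop:
        return "late_cliff"
    return "smooth_decline"
```

The rule checks the 512K-to-1M step separately from the 256K-to-512K step because that later transition is where the article says the models diverge.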

Implications for Future Research

The findings point to one central conclusion: a model’s nominal context-window length does not by itself indicate its usable long-context multi-hop capability. The sharp transition between the 512K and 1M tiers is the key factor that discriminates among current flagship models. Further research should therefore focus on strengthening multi-hop reasoning, particularly as context windows continue to expand.

As context windows grow, understanding how models perform on complex reasoning tasks, and not only on retrieval, will be vital for building more capable and reliable systems. The study contributes to academic evaluation work and also informs the practical use of LLMs across applications.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
