Understanding Grokking Delays in Arithmetic Transformers


The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior

Source: arXiv:2604.13082v1

Abstract: Grokking in transformers trained on algorithmic tasks is characterized by a long delay between training-set fit and abrupt generalization, but the source of that delay remains poorly understood. In encoder-decoder arithmetic models, we argue that this delay reflects limited access to already learned structure rather than failure to acquire that structure in the first place.

Understanding Grokking in Transformers

“Grokking” has been observed in transformers trained on algorithmic tasks: the model fits its training set early, then generalizes abruptly only after a long delay. The cause of that delay remains poorly understood.

Encoder-Decoder Arithmetic Models

Our research focuses on encoder-decoder arithmetic models. We argue that the grokking delay arises not from a failure to learn the relevant structure but from limited access to structure the model has already acquired. We examine this on the one-step Collatz prediction task.
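To make the task concrete, here is a minimal sketch of how one-step Collatz input/target pairs could be generated. It assumes the standard Collatz map (n/2 for even n, 3n + 1 for odd n); the source does not say whether the paper uses this form or a shortcut variant. The function names, the range of integers, and the random 50/50 split are all hypothetical choices made for illustration.

```python
import random


def collatz_step(n: int) -> int:
    """One application of the standard Collatz map (assumed form)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return n // 2 if n % 2 == 0 else 3 * n + 1


def make_pairs(max_n: int, train_fraction: float = 0.5, seed: int = 0):
    """Build (input, target) pairs for 1..max_n and split them at random.

    The range and split ratio are illustrative, not taken from the paper.
    """
    pairs = [(n, collatz_step(n)) for n in range(1, max_n + 1)]
    rng = random.Random(seed)
    rng.shuffle(pairs)
    cut = int(len(pairs) * train_fraction)
    return pairs[:cut], pairs[cut:]


train, test = make_pairs(1000)
```

Grokking in this setting would mean that accuracy on train saturates long before accuracy on the held-out test pairs rises.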

Key Findings

  • The encoder organizes parity and residue structure within the first few thousand training steps.
  • Output accuracy, however, remains close to chance for tens of thousands of additional steps.
  • Causal interventions support the decoder-bottleneck hypothesis.
  • Transplanting a trained encoder into a new model accelerates grokking by a factor of 2.75.
  • Transplanting a trained decoder, by contrast, can degrade performance.

Impact of Freezing and Retraining

Freezing a converged encoder and retraining only the decoder eliminates the plateau in output accuracy: the retrained model reaches 97.6% accuracy, compared with 86.1% for models trained jointly.
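The transplant and freeze-and-retrain interventions both come down to bookkeeping over model parameters: copy the encoder's weights from a trained model into a fresh one, then update only the remaining parameters. The sketch below shows that bookkeeping with plain dictionaries standing in for model state. The "encoder." and "decoder." name prefixes, the function names, and the toy values are assumptions for illustration, not the paper's code; in a real framework the same idea would be applied to its own parameter containers.

```python
def transplant_encoder(trained_state: dict, fresh_state: dict,
                       prefix: str = "encoder.") -> dict:
    """Return a copy of fresh_state whose encoder entries come from trained_state.

    The copy is shallow: parameter values are shared, not duplicated.
    """
    merged = dict(fresh_state)
    for name, value in trained_state.items():
        if name.startswith(prefix):
            if name not in fresh_state:
                raise KeyError(f"fresh model has no parameter named {name!r}")
            merged[name] = value
    return merged


def trainable_names(state: dict, frozen_prefix: str = "encoder.") -> list:
    """Names of parameters that should still receive updates."""
    return [name for name in state if not name.startswith(frozen_prefix)]


# Toy stand-ins for a converged model and a freshly initialized one.
trained = {"encoder.embed": [0.9, -0.4], "decoder.out": [0.2, 0.7]}
fresh = {"encoder.embed": [0.0, 0.0], "decoder.out": [0.01, -0.02]}

model_state = transplant_encoder(trained, fresh)
to_update = trainable_names(model_state)  # only decoder parameters remain
```

Keeping the frozen set defined by a name prefix makes the intervention easy to reverse: swapping the prefix to "decoder." gives the decoder-transplant condition instead.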

Numeral Representation and Its Role

The difficulty of the decoder’s task depends strongly on the choice of numeral representation. Across 15 numeral bases, learnability varied with how well the base aligns with the arithmetic of the Collatz map.

  • Base 24 reached 99.8% accuracy.
  • Binary, in contrast, performed poorly: its learned representations collapsed and did not recover.
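To see what changing the numeral base means for the model's inputs and outputs, the sketch below writes the same Collatz pair as digit sequences in base 2 and base 24. It assumes the standard Collatz map, one token per digit, and most-significant-digit-first ordering; none of these tokenization details are given in the source, and the function names are hypothetical.

```python
def to_digits(n: int, base: int) -> list:
    """Digits of a non-negative integer in the given base, most significant first."""
    if base < 2:
        raise ValueError("base must be at least 2")
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        n, remainder = divmod(n, base)
        digits.append(remainder)
    return digits[::-1]


def collatz_step(n: int) -> int:
    """One application of the standard Collatz map (assumed form)."""
    return n // 2 if n % 2 == 0 else 3 * n + 1


def encode_pair(n: int, base: int):
    """Input and target digit sequences for one training example."""
    return to_digits(n, base), to_digits(collatz_step(n), base)


for base in (2, 24):
    source, target = encode_pair(27, base)
    print(base, source, target)
# 2 [1, 1, 0, 1, 1] [1, 0, 1, 0, 0, 1, 0]
# 24 [1, 3] [3, 10]
```

The same pair (27, 82) takes five and seven tokens in binary but two each in base 24, so the decoder faces a different mapping from digits to digits in each base even though the underlying arithmetic is identical.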

Conclusion

The choice of numeral base acts as an inductive bias that determines how much local digit structure the decoder can exploit. These results help explain where the grokking delay comes from and suggest that input representation is a practical lever when training transformers on arithmetic tasks.


Lazarus Omolua (https://richlyai.com/blog)