The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior
arXiv:2604.13082v1 (cross-listed)
Abstract: Grokking in transformers trained on algorithmic tasks is characterized by a long delay between training-set fit and abrupt generalization, but the source of that delay remains poorly understood. In encoder-decoder arithmetic models, we argue that this delay reflects limited access to already learned structure rather than failure to acquire that structure in the first place.
Understanding Grokking in Transformers
“Grokking” has been observed in transformers trained on algorithmic tasks: the model fits its training set early but generalizes only after a long delay, and the transition is abrupt. The cause of that delay remains poorly understood.
Encoder-Decoder Arithmetic Models
Our research focuses on encoder-decoder arithmetic models. We argue that the delay in grokking arises not from a failure to learn the relevant structure but from limited access to structure the model has already acquired. We examine this on the one-step Collatz prediction task, in which the model predicts the result of a single application of the Collatz map.
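For concreteness, the sketch below generates training pairs for one-step prediction. It assumes the standard form of the Collatz map (even n goes to n/2, odd n goes to 3n + 1); some work uses the shortcut (3n + 1)/2 for odd inputs, and this summary does not say which variant the models are trained on. The function names are illustrative and not taken from the paper.

```python
import random


def collatz_step(n: int) -> int:
    """One step of the standard Collatz map (assumed variant)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return n // 2 if n % 2 == 0 else 3 * n + 1


def make_pairs(count: int, max_n: int, seed: int = 0) -> list:
    """Sample (input, target) pairs for one-step prediction (hypothetical helper)."""
    rng = random.Random(seed)
    inputs = [rng.randint(1, max_n) for _ in range(count)]
    return [(n, collatz_step(n)) for n in inputs]
```

Each pair holds an integer and its image under one step of the map; how those integers are written as digit tokens is a separate choice.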
Key Findings
- The encoder organizes parity and residue structure within the first few thousand training steps.
- Output accuracy, however, remains close to chance for tens of thousands of additional steps.
- Causal interventions support the hypothesis that the decoder is the bottleneck.
- Transplanting a trained encoder into a new model accelerates grokking by a factor of 2.75.
- Conversely, transplanting a trained decoder can hurt performance.
Impact of Freezing and Retraining
Freezing a converged encoder and retraining only the decoder eliminates the plateau in output accuracy and reaches 97.6% accuracy, compared with 86.1% for models trained jointly.
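The freeze-and-retrain setup can be pictured with a framework-free sketch: parameters live in a dictionary keyed by name, a trained encoder is copied into a fresh model, and the update rule skips anything that is not part of the decoder. The `encoder.` and `decoder.` name prefixes, both helper functions, and the plain SGD update are assumptions chosen for illustration, not the paper's training code.

```python
import copy


def transplant_encoder(trained, fresh, prefix="encoder."):
    """Return a copy of `fresh` whose encoder parameters come from `trained`."""
    merged = copy.deepcopy(fresh)
    for name, values in trained.items():
        if name.startswith(prefix):
            merged[name] = list(values)
    return merged


def sgd_step(params, grads, lr=0.1, trainable_prefix="decoder."):
    """One SGD update that leaves parameters outside `trainable_prefix` frozen."""
    updated = {}
    for name, values in params.items():
        if name.startswith(trainable_prefix):
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
        else:
            updated[name] = list(values)  # frozen: gradient is ignored
    return updated


# Toy parameter sets standing in for real weight tensors (hypothetical values).
trained = {"encoder.w": [1.0, 2.0], "decoder.w": [0.5, 0.5]}
fresh = {"encoder.w": [0.0, 0.0], "decoder.w": [0.1, -0.1]}
grads = {"encoder.w": [1.0, 1.0], "decoder.w": [1.0, -1.0]}

model = transplant_encoder(trained, fresh)
model = sgd_step(model, grads)
```

After the step, the encoder weights still equal the transplanted values while only the decoder weights have moved. In a deep learning framework the same effect is typically obtained by excluding encoder parameters from the optimizer or disabling their gradients.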
Numeral Representation and Its Role
The difficulty of the decoder’s task depends strongly on the choice of numeral representation. Across the 15 numeral bases we analyzed, learnability varied with how well the numeral system aligns with the arithmetic of the Collatz map.
- For example, base 24 reached 99.8% accuracy.
- In contrast, binary performed poorly: its representations collapsed and failed to recover.
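To make the role of the numeral base concrete, the sketch below writes an input and its one-step Collatz target as digit sequences in a chosen base. The helper names, the most-significant-digit-first ordering, and the standard form of the Collatz map are assumptions for illustration; the paper's actual tokenization is not described here.

```python
def to_digits(n: int, base: int) -> list:
    """Digits of a non-negative integer in `base`, most significant first."""
    if base < 2:
        raise ValueError("base must be at least 2")
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        n, remainder = divmod(n, base)
        digits.append(remainder)
    return digits[::-1]


def collatz_step(n: int) -> int:
    """One step of the standard Collatz map (assumed variant)."""
    return n // 2 if n % 2 == 0 else 3 * n + 1


def encode_pair(n: int, base: int) -> tuple:
    """Digit sequences for an input and its one-step target (hypothetical helper)."""
    return to_digits(n, base), to_digits(collatz_step(n), base)


print(encode_pair(27, 24))  # ([1, 3], [3, 10])
print(encode_pair(27, 2))   # ([1, 1, 0, 1, 1], [1, 0, 1, 0, 0, 1, 0])
```

The same pair (27, 82) takes two digits per number in base 24 but five and seven digits in binary. In any even base the parity of n can be read from its last digit alone, which is not true of odd bases; this is one simple sense in which a base can align with the map, though it does not by itself explain the accuracy gap reported above.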
Conclusion
The choice of numeral base acts as a critical inductive bias, determining how much local digit structure the decoder can exploit. This finding clarifies the source of the grokking delay and suggests ways to improve training and input representation in transformer architectures.
