Empirical Study of Feature Repulsion in Two-Layer Network Grokking

Feature Repulsion and Spectral Lock-in: An Empirical Study of Two-Layer Network Grokking

In a study published on arXiv, Tian (2025) reports findings on grokking in two-layer neural networks. The work focuses on feature repulsion during the interactive feature-learning stage and on what it implies for learning dynamics.

The study centers on a repulsion theorem (Theorem 6) concerning the matrix \( B = (\widetilde{F}^\top \widetilde{F} + \eta I)^{-1} \). The theorem states that for similar features \( j \) and \( \ell \) the off-diagonal entry \( B_{j\ell} \) is negative, which acts as an effective force pushing those features apart. This mechanism is central to feature learning, but the study raises two questions about it: whether the repulsion is empirically observable, and what spectral signature it leaves in the parameter updates.
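The sign claim can be checked numerically on toy data. The sketch below is not taken from the study: the feature matrix, the value of \( \eta \), the reading of the tilde as "centered, unit-norm columns", and the helper names `repulsion_matrix` and `sign_match_rate` are all assumptions made for illustration.

```python
import numpy as np

def repulsion_matrix(F, eta):
    """Return B = (F^T F + eta I)^{-1} for a feature matrix F of shape (n, K)."""
    K = F.shape[1]
    return np.linalg.inv(F.T @ F + eta * np.eye(K))

def sign_match_rate(F, eta, top_k):
    """Fraction of the top_k most-similar feature pairs (j, l) with B[j, l] < 0."""
    B = repulsion_matrix(F, eta)
    gram = F.T @ F
    j, l = np.triu_indices(F.shape[1], k=1)   # all pairs with j < l
    order = np.argsort(-gram[j, l])[:top_k]   # most similar pairs first
    return float(np.mean(B[j[order], l[order]] < 0))

# Hypothetical toy data: 6 features over 200 samples,
# where feature 1 is a noisy copy of feature 0.
rng = np.random.default_rng(0)
F = rng.standard_normal((200, 6))
F[:, 1] = F[:, 0] + 0.1 * rng.standard_normal(200)
F -= F.mean(axis=0)              # assumed meaning of the tilde: centered ...
F /= np.linalg.norm(F, axis=0)   # ... and unit-norm columns

B = repulsion_matrix(F, eta=1e-2)
print(B[0, 1])                            # negative for the near-duplicate pair
print(sign_match_rate(F, 1e-2, top_k=1))  # share of matching signs
```

For two unit-norm features with inner product \( c \), the inverse can be written out by hand: \( B_{12} = -c / ((1+\eta)^2 - c^2) \), which is negative whenever \( c > 0 \). With more than two features the sign is no longer guaranteed by this formula alone, which is why an empirical sign-match rate is a meaningful thing to measure.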

Research Methodology

Tian’s empirical investigation used a modular addition task with \( M = 71 \) and \( K = 2048 \), trained with a mean squared error (MSE) loss. The goal was to test whether the predictions of the repulsion theorem are observable during training.
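A minimal sketch of such a setup follows. The article does not define \( M \) and \( K \) or describe the input encoding, so the code assumes that \( M \) is the modulus, that \( K \) is the hidden width, and that inputs are concatenated one-hot vectors; the initialization scale is also a placeholder.

```python
import numpy as np

M = 71      # assumed: modulus of the addition task
K = 2048    # assumed: hidden width of the two-layer network

# Every ordered pair (a, b) with label (a + b) mod M.
a, b = np.meshgrid(np.arange(M), np.arange(M), indexing="ij")
a, b = a.ravel(), b.ravel()
labels = (a + b) % M

eye = np.eye(M)
X = np.concatenate([eye[a], eye[b]], axis=1)   # (M*M, 2M) one-hot inputs
Y = eye[labels]                                # (M*M, M) one-hot targets for MSE

def forward(X, W, V, act):
    """Two-layer network: hidden features act(X W), then a linear readout V."""
    return act(X @ W) @ V

def mse(pred, target):
    """Mean squared error over all samples and output units."""
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
W = rng.standard_normal((2 * M, K)) / np.sqrt(2 * M)
V = rng.standard_normal((K, M)) / np.sqrt(K)

loss_sq = mse(forward(X, W, V, np.square), Y)                       # sigma = x^2
loss_relu = mse(forward(X, W, V, lambda z: np.maximum(z, 0.0)), Y)  # sigma = ReLU
```

Only the forward pass and loss are shown. The train/test split and the optimizer, which determine whether grokking appears at all, are not specified in the article and are left out here.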

Key Findings

  • Structure-Mechanism Dissociation: The sign structure predicted by the repulsion theorem holds for both activations tested, but the way it shows up in the weight updates does not. The predicted structure and the observed mechanism come apart.
  • Sign Rule Validation: The predicted sign rule held on the top 200 most-similar feature pairs for both activations. The empirical sign-match rose from 0.865 to 0.985 for \( \sigma = x^2 \) across five seeds and saturated at 1.000 for \( \sigma = \operatorname{ReLU} \).
  • Activation Dependency: The spectral signature in the parameter updates depends strongly on the activation function. For \( \sigma = x^2 \), a simple slope detector applied to the rolling eigengap \( \sigma_2 / \sigma_3 \) of the weight updates \( \Delta W \) fired in 15 out of 15 seeds at epoch 174. Here the subscripted \( \sigma_i \) presumably denote the ordered singular values of \( \Delta W \), not the activation.
  • Contrast with Non-Grokking Controls: The same detector never fired in the non-grokking controls, so the signal is specific to runs that grok.
  • Rank-2 Spectrum vs. Rank-1 Spectrum: The updates showed a rank-2 spectrum for the \( x^2 \) activation, while the \( \operatorname{ReLU} \) activation kept an effectively rank-1 spectrum. This points to the activation derivative as the factor that determines how feature repulsion translates into weight updates.
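A slope detector on the rolling eigengap of the weight updates might look like the sketch below. The article does not give the detector's details, so the window length, the threshold, the use of the log ratio, and the synthetic updates are all hypothetical; \( \sigma_2 \) and \( \sigma_3 \) are taken to be the second and third singular values of \( \Delta W \).

```python
import numpy as np

def eigengap_series(weights):
    """sigma_2 / sigma_3 of each update Delta W_t = W_t - W_{t-1}."""
    gaps = []
    for prev, curr in zip(weights[:-1], weights[1:]):
        s = np.linalg.svd(curr - prev, compute_uv=False)  # descending order
        gaps.append(s[1] / (s[2] + 1e-12))
    return np.array(gaps)

def slope_detector(gaps, window=5, threshold=0.5):
    """First index where the rolling least-squares slope of log(gap)
    exceeds the threshold, or None if it never does."""
    log_gaps = np.log(gaps)
    t = np.arange(window)
    for end in range(window, len(log_gaps) + 1):
        slope = np.polyfit(t, log_gaps[end - window:end], 1)[0]
        if slope > threshold:
            return end - 1
    return None

def synthetic_run(rng, second_mode, steps=30, d=20, noise=1e-3):
    """Toy weight trajectory: rank-1 updates plus noise, optionally with a
    second mode whose strength grows exponentially from step 10 onward."""
    U, _ = np.linalg.qr(rng.standard_normal((d, 2)))  # orthonormal left vectors
    Vt, _ = np.linalg.qr(rng.standard_normal((d, 2)))  # orthonormal right vectors
    weights = [np.zeros((d, d))]
    for step in range(steps):
        delta = np.outer(U[:, 0], Vt[:, 0]) + noise * rng.standard_normal((d, d))
        if second_mode and step >= 10:
            delta += noise * np.exp(step - 10) * np.outer(U[:, 1], Vt[:, 1])
        weights.append(weights[-1] + delta)
    return weights

rng = np.random.default_rng(0)
fire_control = slope_detector(eigengap_series(synthetic_run(rng, second_mode=False)))
fire_rank2 = slope_detector(eigengap_series(synthetic_run(rng, second_mode=True)))
print(fire_control, fire_rank2)
```

In the toy run with an emerging second mode, \( \sigma_2 \) pulls away from the noise floor while \( \sigma_3 \) stays there, so the log ratio climbs and the detector fires. In the rank-1 run, both \( \sigma_2 \) and \( \sigma_3 \) stay at noise level and the ratio stays flat.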

Conclusion

This empirical study validates part of Tian’s theoretical framework and shows how strongly the outcome depends on the activation function. The sign structure predicted by the repulsion theorem holds for both activations tested, but the mechanism by which feature repulsion reaches the weight updates is highly activation-dependent. The results suggest a direction for future work: choosing or designing activation functions with their effect on feature repulsion in mind.

Lazarus Omolua
https://richlyai.com/blog