Priming Hybrid State Space Models with Pre-trained Transformers

Priming: Hybrid State Space Models From Pre-trained Transformers

Researchers keep looking for ways to make models more efficient without giving up quality. A recent paper, “Priming: Hybrid State Space Models From Pre-trained Transformers” (arXiv identifier 2605.08301v1), proposes building Hybrid State-Space models from pre-trained Transformers rather than training them from scratch. If the approach holds up, it could change how new model architectures are designed and evaluated.

Hybrid State-Space models combine Attention layers with recurrent State-Space Model (SSM) layers. The combination aims to capture the strengths of both: the eidetic memory of Attention, which keeps every past token in its Key-Value cache, and the compressed, fading memory of SSMs, which summarize the past in a fixed-size state. Because fewer layers need a Key-Value cache, Hybrid models hold smaller caches and decode faster than standard Transformers. Until now, however, exploring the Hybrid design space has required training each candidate from scratch, which raises the cost of entry and limits how many architectures get explored.
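To make the cache argument concrete, here is a rough back-of-the-envelope sketch. The layer counts, head counts, and dimensions are hypothetical values chosen for illustration, not figures from the paper. The Key-Value cache grows with sequence length, while the SSM state does not.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The factor 2 covers the separate key and value tensors.
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def ssm_state_bytes(n_ssm_layers, n_heads, key_dim, value_dim, bytes_per_elem=2):
    """Bytes for the fixed-size recurrent state; independent of sequence length."""
    return n_ssm_layers * n_heads * key_dim * value_dim * bytes_per_elem


# Hypothetical configuration (assumption): 64 layers total, 128K-token context,
# 16-bit values. The Hybrid keeps 16 Attention layers and swaps 48 for SSM layers.
SEQ_LEN = 128 * 1024
transformer = kv_cache_bytes(n_attn_layers=64, n_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)
hybrid = (
    kv_cache_bytes(n_attn_layers=16, n_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)
    + ssm_state_bytes(n_ssm_layers=48, n_heads=32, key_dim=128, value_dim=128)
)

GIB = 2**30
print(f"Transformer cache:    {transformer / GIB:.2f} GiB")
print(f"Hybrid cache + state: {hybrid / GIB:.2f} GiB")
```

Under these assumed numbers the Hybrid's per-sequence memory is dominated by its remaining Attention layers, so the saving scales roughly with the fraction of Attention layers replaced.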

The authors introduce “Priming,” a method that turns Hybrid architecture design from a pre-training problem into a knowledge-transfer problem. A Hybrid model is initialized from a pre-trained Transformer and then trained through brief alignment and post-training phases. According to the paper, this recovers downstream quality while using less than 0.5% of the source model’s pre-training token budget.
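As a sense of scale for the 0.5% figure, the following sketch computes the upper bound on Priming tokens for a given pre-training budget. The 36-trillion-token budget is a placeholder assumption for illustration, not a number reported in this post.

```python
from fractions import Fraction

# Upper bound stated in the post: less than 0.5% of the pre-training tokens.
PRIMING_FRACTION = Fraction(5, 1000)


def priming_token_ceiling(pretraining_tokens):
    """Return the token count that 0.5% of the pre-training budget corresponds to."""
    return int(pretraining_tokens * PRIMING_FRACTION)


# Hypothetical source-model budget (assumption): 36 trillion tokens.
assumed_pretraining_tokens = 36 * 10**12
ceiling = priming_token_ceiling(assumed_pretraining_tokens)
print(f"Priming would use fewer than {ceiling / 10**9:.0f}B tokens")
```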

Priming is designed to be agnostic to the source Transformer family (Qwen, Llama, or Mistral) and applies to both dense and Mixture-of-Experts models. This gives researchers flexibility to build Hybrid architectures for different use cases.

Controlled Comparisons and Findings

Priming enables the first controlled comparison of SSM layer types at scale under identical experimental conditions. The paper evaluates three SSM layer architectures: Gated KalmaNet (GKA), Gated DeltaNet (GDN), and Mamba-2. The results show a hierarchy of expressiveness, with GKA outperforming GDN, which in turn outperforms Mamba-2. This hierarchy correlates with downstream performance on long-context reasoning tasks, which gives practical guidance for choosing an SSM layer.

Gated KalmaNet (GKA): Demonstrated superior performance in reasoning tasks.
Gated DeltaNet (GDN): Intermediate performance, offering a balance between complexity and efficiency.
Mamba-2: While less expressive, it remains a viable option for certain applications.
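One way to build intuition for why these layers differ in expressiveness is to compare simplified state updates. The sketch below uses single-head forms from the published Mamba-2 and Gated DeltaNet literature, not the released Priming kernels, and it omits GKA because this post does not describe its update rule. The function names and toy vectors are illustrative assumptions. A decay-only update can only add new associations to the state, whereas a delta-rule update can first erase what was stored under a key and then write the new value.

```python
def read(state, key):
    """Query the state matrix with a key: returns state @ key."""
    return [sum(s * k for s, k in zip(row, key)) for row in state]


def decay_update(state, key, value, alpha=1.0):
    """Decay-only update (Mamba-2 style): S <- alpha * S + v k^T."""
    return [
        [alpha * state[i][j] + value[i] * key[j] for j in range(len(key))]
        for i in range(len(value))
    ]


def delta_update(state, key, value, alpha=1.0, beta=1.0):
    """Gated delta rule: S <- alpha * S (I - beta k k^T) + beta v k^T."""
    old = read(state, key)  # what the state currently returns for this key
    return [
        [
            alpha * (state[i][j] - beta * old[i] * key[j]) + beta * value[i] * key[j]
            for j in range(len(key))
        ]
        for i in range(len(value))
    ]


# Toy example: write two different values under the same unit-norm key.
key = [1.0, 0.0]
first, second = [1.0, 2.0], [3.0, 4.0]
zero = [[0.0, 0.0], [0.0, 0.0]]

additive = decay_update(decay_update(zero, key, first), key, second)
overwritten = delta_update(delta_update(zero, key, first), key, second)

print("decay-only read:", read(additive, key))     # both writes are mixed together
print("delta-rule read:", read(overwritten, key))  # only the latest write remains
```

With no decay (alpha = 1), the decay-only state returns the sum of both values, while the delta rule returns only the second. This is a toy illustration of targeted overwriting, not a reproduction of the paper's experiments.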

The authors scaled Priming to 8B and 32B reasoning models with native 128K context. Their Hybrid GKA 32B model improves on its source model, Qwen3-32B, by +3.8 average reasoning points. It also stays within 1% of a Transformer post-trained on the same dataset while delivering up to 2.3 times higher decoding throughput.

Future Directions and Resources

To support further research on Hybrid architectures, the authors have released a model zoo of primed Hybrid models for long-context reasoning and instruction-following tasks. They have also released the Priming training and inference code, which includes Sequence Parallelism algorithms for long-context training, optimized GKA kernels, and a vLLM serving plugin, all under the Apache 2.0 License.

With Priming, researchers can explore Hybrid State-Space architectures without paying the full cost of pre-training. That could speed up progress in applications that depend on long contexts and improve our understanding of how architectural choices affect model behavior.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
