CollectiveKV: Decoupling and Sharing Collaborative Information in Sequential Recommendation
arXiv:2601.19178v2
Abstract
Sequential recommendation models are widely used in applications, yet they face stringent latency requirements. Mainstream models leverage the Transformer attention mechanism to improve performance, but its computational complexity grows with the sequence length, leading to a latency challenge for long sequences. Consequently, KV cache technology has recently been explored in sequential recommendation systems to reduce inference latency. However, KV cache introduces substantial storage overhead in sequential recommendation systems, which often have a large user base with potentially very long user history sequences.
Introduction
Sequential recommendation models must be efficient enough to deliver suggestions in real time. Transformer-based models make this difficult: the cost of attention grows with the length of the user's history, so the latency problem is most pronounced for long user sequences.
The Challenge of KV Cache
KV caching has emerged as a way to reduce inference latency in sequential recommendation: the keys and values computed for a user's past interactions are stored and reused rather than recomputed on every request. The cost is storage overhead. Because a cache must be kept for each user, the total cache size grows with both the number of users and the length of their histories, which becomes expensive for a large user base.
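To make the latency and storage trade-off concrete, here is a minimal sketch of KV caching for a single attention head, written with NumPy. It is an illustration under stated assumptions, not the implementation of any model evaluated here: the names `KVCache`, `cached_attention_step`, and `full_attention_last`, the single-head layout, and the random weights are all hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def full_attention_last(X, Wq, Wk, Wv):
    """Baseline: recompute K and V for the whole history on every request."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q[-1] @ K.T / np.sqrt(K.shape[1])
    return softmax(scores) @ V

class KVCache:
    """Per-user store of keys and values (hypothetical helper)."""
    def __init__(self, d):
        self.K = np.zeros((0, d))
        self.V = np.zeros((0, d))

    def append(self, x, Wk, Wv):
        # Only the newest interaction is projected; older rows are reused.
        self.K = np.vstack([self.K, x @ Wk])
        self.V = np.vstack([self.V, x @ Wv])

def cached_attention_step(x_new, cache, Wq, Wk, Wv):
    """Attention output for the newest position, using cached K and V."""
    cache.append(x_new, Wk, Wv)
    q = x_new @ Wq
    scores = q @ cache.K.T / np.sqrt(cache.K.shape[1])
    return softmax(scores) @ cache.V

# Toy usage: a history of 6 interactions with embedding size 8.
rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
history = rng.normal(size=(6, d))

cache = KVCache(d)
for x in history:
    out = cached_attention_step(x, cache, Wq, Wk, Wv)

# The cached path matches full recomputation for the last position.
print(np.allclose(out, full_attention_last(history, Wq, Wk, Wv)))
```

The cached path projects only the newest interaction at each step, which is where the latency saving comes from. The price is that `cache.K` and `cache.V` must be kept for every user, and they grow by one row per interaction.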
Observations and Insights
We observe that KV sequences across different users exhibit significant similarities, which indicates that the KV data carries collaborative signals. To understand these signals, we analyzed the KV cache using singular value decomposition (SVD), which allowed us to dissect the information it stores.
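One way such an SVD analysis could be set up is sketched below. This is an assumption about the general approach rather than the exact procedure used: it stacks the KV matrices of several users, takes the top singular directions of the stack as a candidate shared subspace, and measures how much of each user's KV energy falls inside it. The function name `shared_energy_fraction` is hypothetical, and the data is synthetic, generated so that a shared low-rank component dominates.

```python
import numpy as np

def shared_energy_fraction(kv_per_user, rank):
    """Fraction of each user's squared KV norm captured by the top-`rank`
    right-singular directions of the stacked KV matrix."""
    stacked = np.concatenate(kv_per_user, axis=0)      # (total_len, d)
    _, _, Vt = np.linalg.svd(stacked, full_matrices=False)
    basis = Vt[:rank]                                  # (rank, d), orthonormal rows
    fractions = []
    for kv in kv_per_user:
        projected = kv @ basis.T @ basis               # projection onto the subspace
        fractions.append(np.sum(projected**2) / np.sum(kv**2))
    return np.array(fractions)

# Synthetic data (an assumption for illustration, not measured KV):
# every user mixes the same 4 shared directions and adds small noise.
rng = np.random.default_rng(0)
d, true_rank, seq_len = 32, 4, 50
shared_basis = rng.normal(size=(true_rank, d))
users = [
    rng.normal(size=(seq_len, true_rank)) @ shared_basis
    + 0.05 * rng.normal(size=(seq_len, d))
    for _ in range(5)
]

# Close to 1 for every user here, by construction of the synthetic data.
print(shared_energy_fraction(users, rank=true_rank))
```

On real KV caches, fractions near 1 at a small `rank` would suggest that most of the information sits in directions common to all users, while the residual would be the user-specific part.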
CollectiveKV: A Proposed Solution
Motivated by these findings, we propose CollectiveKV, a cross-user KV sharing mechanism. It has two parts:
- It captures the information that is shared across users through a learnable global KV pool.
- During inference, each user can retrieve high-dimensional shared KV from this pool and concatenate it with low-dimensional user-specific KV to generate the final KV.
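The two parts above can be sketched as follows. This is a minimal illustration under assumptions, not the CollectiveKV implementation: the text does not say how a user selects entries from the pool, so the sketch assumes one integer slot index per history position. The names `GlobalKVPool`, `assemble_kv`, `slot_ids`, `d_shared`, and `d_user` are hypothetical, and the pool is filled with random values where a trained model would hold learned parameters.

```python
import numpy as np

class GlobalKVPool:
    """Hypothetical shared pool with `num_slots` rows of size `d_shared`.
    In training these rows would be learned; here they are random."""
    def __init__(self, num_slots, d_shared, rng):
        self.table = rng.normal(size=(num_slots, d_shared))

    def retrieve(self, slot_ids):
        # Assumed retrieval rule: plain lookup by integer slot index.
        return self.table[slot_ids]                    # (seq_len, d_shared)

def assemble_kv(pool, slot_ids, user_kv):
    """Concatenate high-dimensional shared KV with low-dimensional
    user-specific KV to form the final KV for one user."""
    shared = pool.retrieve(slot_ids)                   # (seq_len, d_shared)
    return np.concatenate([shared, user_kv], axis=-1)  # (seq_len, d_shared + d_user)

# Toy usage: 60 shared dimensions, 4 user-specific dimensions.
rng = np.random.default_rng(0)
pool = GlobalKVPool(num_slots=16, d_shared=60, rng=rng)

slot_ids = np.array([3, 3, 7])          # stored per user: one index per position
user_kv = rng.normal(size=(3, 4))       # stored per user: small user-specific part

final_kv = assemble_kv(pool, slot_ids, user_kv)
print(final_kv.shape)                   # (3, 64)
```

Under these assumptions, each user stores only the slot indices and the low-dimensional `user_kv`. The high-dimensional rows are stored once in the pool and shared by every user.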
Experimental Results
We evaluated CollectiveKV on five sequential recommendation models across three datasets. The method compresses the KV cache to only 0.8% of its original size while maintaining model performance, and in some cases improving it.
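To give a sense of scale, the short calculation below applies the 0.8% figure to a hypothetical deployment. Every input (10 million users, 1,000 interactions per user, 4 layers, hidden size 128, 2 bytes per value) is an assumption chosen for illustration and does not come from the experiments. The function `kv_cache_bytes` is likewise hypothetical, and the calculation ignores the storage of the shared pool itself.

```python
def kv_cache_bytes(num_users, seq_len, num_layers, d_model, bytes_per_value=2):
    """Uncompressed KV cache size: one key and one value vector
    per position, per layer, per user."""
    return 2 * num_users * seq_len * num_layers * d_model * bytes_per_value

# Assumed deployment figures (illustrative only).
original = kv_cache_bytes(
    num_users=10_000_000, seq_len=1_000, num_layers=4, d_model=128
)

# 0.8% of the original size, in integer arithmetic.
compressed = original * 8 // 1000

print(f"original:   {original / 1e12:.2f} TB")    # original:   20.48 TB
print(f"compressed: {compressed / 1e9:.2f} GB")   # compressed: 163.84 GB
```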
Conclusion
CollectiveKV decouples the collaborative information in the KV cache from the user-specific information and shares the former across users. This addresses both the latency of long-sequence inference and the storage overhead of per-user KV caches in sequential recommendation.
