A Human-Inspired Decoupled Architecture for Efficient Audio Representation Learning
Summary: arXiv:2603.26098v1 (cross-listed)
Abstract: While self-supervised learning (SSL) has revolutionized audio representation, the excessive parameterization and quadratic computational cost of standard Transformers limit their deployment on resource-constrained devices. To address this bottleneck, we propose HEAR (Human-inspired Efficient Audio Representation), a novel decoupled architecture. Inspired by the human cognitive ability to isolate local acoustic features from global context, HEAR splits the processing pipeline into two dedicated modules: an Acoustic Model for local feature extraction and a Task Model for global semantic integration. Coupled with an Acoustic Tokenizer trained via knowledge distillation, our approach enables robust Masked Audio Modeling (MAM). Extensive experiments demonstrate that HEAR requires only 15M parameters and 9.47 GFLOPs for inference, operating at a fraction of the computational cost of conventional foundation models (which typically require 85M-94M parameters). Despite this high efficiency, HEAR achieves highly competitive performance across diverse audio classification benchmarks. The code and pre-trained models are available at https://github.com/HarunoriKawano/HEAR.
Introduction
In recent years, self-supervised learning (SSL) has emerged as a transformative approach in audio representation learning, enabling models to learn from unlabeled data. However, standard Transformer architectures are heavily parameterized, and the cost of self-attention grows quadratically with input length. Both factors make deployment difficult on resource-constrained hardware such as mobile devices and embedded systems.
HEAR: A Novel Architecture
To overcome these constraints, we introduce HEAR (Human-inspired Efficient Audio Representation), an architecture inspired by human cognition. The human brain excels at isolating local sounds and integrating them into a coherent understanding of the auditory environment. HEAR mimics this capability through a decoupled design with two dedicated modules:
- Acoustic Model: extracts local acoustic features from the audio input.
- Task Model: integrates the features produced by the Acoustic Model into a global semantic representation.
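The split between a local stage and a global stage can be illustrated with a toy data-flow sketch. It is not the HEAR implementation: the internals of the two modules are not described here, so a moving average stands in for local feature extraction and mean pooling stands in for global integration. The names `acoustic_model`, `task_model`, `hear_like_forward`, and the `window` parameter are all hypothetical.

```python
from typing import List

Frame = List[float]  # one feature vector per time step (assumed layout)


def acoustic_model(frames: List[Frame], window: int = 1) -> List[Frame]:
    """Local stage: each output frame only sees neighbours within `window` steps.

    A moving average is a placeholder for the real local operator,
    which is an assumption of this sketch.
    """
    out = []
    for t in range(len(frames)):
        lo, hi = max(0, t - window), min(len(frames), t + window + 1)
        neighbourhood = frames[lo:hi]
        dim = len(frames[t])
        out.append(
            [sum(f[d] for f in neighbourhood) / len(neighbourhood) for d in range(dim)]
        )
    return out


def task_model(local_features: List[Frame]) -> Frame:
    """Global stage: combine all local features into one clip-level vector.

    Mean pooling is a placeholder for global semantic integration.
    """
    n = len(local_features)
    dim = len(local_features[0])
    return [sum(f[d] for f in local_features) / n for d in range(dim)]


def hear_like_forward(frames: List[Frame], window: int = 1) -> Frame:
    """Decoupled pipeline: the local stage runs first, the global stage second."""
    return task_model(acoustic_model(frames, window))


# Four frames with two features each.
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
local = acoustic_model(frames, window=1)
clip = hear_like_forward(frames, window=1)
print(local)
print(clip)
```

The point of the sketch is the interface: the local stage never looks beyond a small neighbourhood, so only the second stage has to reason over the whole clip.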
Innovative Training Methodology
Alongside the two modules, HEAR uses an Acoustic Tokenizer trained via knowledge distillation. The tokenizer enables robust Masked Audio Modeling (MAM), a self-supervised objective in which portions of the input are masked and the model is trained to predict them.
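A masked-modeling objective of this kind can be sketched in a few lines. The sketch makes several assumptions that are not stated in the text: the tokenizer is modeled as a nearest-neighbour lookup in a fixed codebook, masking is uniformly random, and the loss is cross-entropy over masked positions only. How the tokenizer is distilled is left out entirely, and the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def tokenize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Hypothetical tokenizer: index of the nearest codebook entry per frame."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sqdist(f, codebook[k]))
            for f in frames]


def choose_mask(num_frames: int, mask_ratio: float, rng: random.Random) -> List[int]:
    """Pick the frame positions to hide (uniform random masking, assumed)."""
    k = max(1, round(num_frames * mask_ratio))
    return sorted(rng.sample(range(num_frames), k))


def masked_cross_entropy(logits: Sequence[Sequence[float]],
                         targets: Sequence[int],
                         masked_positions: Sequence[int]) -> float:
    """Average cross-entropy between predictions and token targets,
    computed at the masked positions only."""
    total = 0.0
    for t in masked_positions:
        row = logits[t]
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[targets[t]]
    return total / len(masked_positions)


codebook = [[0.0, 0.0], [1.0, 1.0]]
frames = [[0.1, 0.0], [0.9, 1.1], [0.2, 0.1], [1.0, 0.8]]
targets = tokenize(frames, codebook)               # discrete targets per frame
mask = choose_mask(len(frames), 0.5, random.Random(0))
uniform_logits = [[0.0, 0.0] for _ in frames]      # an untrained, uninformed model
loss = masked_cross_entropy(uniform_logits, targets, mask)
```

With uniform predictions over two tokens the loss equals ln 2, which is the baseline that a trained model would be expected to beat.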
Efficiency and Performance
One of the standout features of HEAR is its efficiency. The model has only 15 million parameters and requires 9.47 GFLOPs for inference. In comparison, conventional foundation models typically require between 85 million and 94 million parameters, roughly six times as many. Despite this reduction, HEAR achieves highly competitive performance across diverse audio classification benchmarks, which suggests that the efficiency gain does not come at a large cost in accuracy.
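The parameter counts translate into a concrete size difference. The short calculation below uses only the figures quoted above; the 4-bytes-per-parameter figure assumes weights stored as 32-bit floats and ignores activations and any other runtime memory, so it is a rough estimate rather than a measured footprint.

```python
HEAR_PARAMS = 15e6                 # from the text
BASELINE_PARAMS = (85e6, 94e6)     # range quoted for conventional models
BYTES_PER_PARAM = 4                # assumption: fp32 weights


def weights_megabytes(num_params: float) -> float:
    """Approximate weight storage in megabytes (1 MB = 1e6 bytes)."""
    return num_params * BYTES_PER_PARAM / 1e6


hear_mb = weights_megabytes(HEAR_PARAMS)
baseline_mb = [weights_megabytes(p) for p in BASELINE_PARAMS]
ratios = [round(p / HEAR_PARAMS, 2) for p in BASELINE_PARAMS]

print(hear_mb)      # 60.0
print(baseline_mb)  # [340.0, 376.0]
print(ratios)       # [5.67, 6.27]
```

Under these assumptions HEAR's weights occupy about 60 MB, compared with roughly 340 to 376 MB for the larger models.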
Conclusion
HEAR represents a significant advancement in the field of audio representation learning, offering a human-inspired solution that balances efficiency and performance. With its decoupled architecture and innovative training strategies, HEAR opens new avenues for deploying advanced audio processing models in resource-constrained environments. Researchers and developers can access the code and pre-trained models at https://github.com/HarunoriKawano/HEAR to explore its capabilities further.
