Human-Inspired Context-Selective Multimodal Memory for Social Robots
arXiv:2604.12081v1
Abstract: Memory is fundamental to social interaction, enabling humans to recall meaningful past experiences and adapt their behavior to the context. However, most current social robots and embodied agents rely on non-selective, text-based memory, limiting their ability to support personalized, context-aware interactions. Drawing inspiration from cognitive neuroscience, we propose a context-selective, multimodal memory architecture for social robots that captures and retrieves both textual and visual episodic traces, prioritizing moments characterized by high emotional salience or scene novelty.
Key Features of the Proposed System
The proposed memory architecture offers several innovative features:
- Context-Selective Retrieval: The system focuses on recalling memories that are contextually relevant, enhancing the interaction quality.
- Multimodal Memory Capture: Both textual and visual information are stored and retrieved, making the memory more comprehensive.
- User-Centric Approach: Memories are associated with individual users, allowing for personalized recall that aligns with user preferences and emotional states.
- Emotional Salience and Novelty: The architecture prioritizes memories that are emotionally significant or novel, ensuring that interactions remain engaging and relevant.
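The selectivity and per-user features listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the names Episode and SelectiveMemory, the two thresholds, the cosine-based novelty measure, and the precomputed emotion score are all assumptions made for illustration. The sketch stores an episode for a user only when it is emotionally salient or visually unlike what is already stored for that user.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class Episode:
    """One candidate memory (hypothetical structure, not from the paper)."""
    user_id: str
    text: str          # textual trace, e.g. a dialogue snippet
    image_vec: list    # stand-in for a visual embedding of the scene
    emotion: float     # assumed emotional-salience score in [0, 1]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class SelectiveMemory:
    emotion_threshold: float = 0.6   # assumed value, chosen for the demo
    novelty_threshold: float = 0.3   # assumed value, chosen for the demo
    store: dict = field(default_factory=dict)  # user_id -> list of Episode

    def novelty(self, ep):
        """1 minus the highest similarity to this user's stored scenes."""
        past = self.store.get(ep.user_id, [])
        if not past:
            return 1.0  # nothing stored yet, so the scene is fully novel
        return 1.0 - max(cosine(ep.image_vec, p.image_vec) for p in past)

    def observe(self, ep):
        """Store the episode if it is salient or novel; return the decision."""
        keep = (ep.emotion >= self.emotion_threshold
                or self.novelty(ep) >= self.novelty_threshold)
        if keep:
            self.store.setdefault(ep.user_id, []).append(ep)
        return keep


memory = SelectiveMemory()
# First scene for this user: novel, so it is stored despite low emotion.
print(memory.observe(Episode("alice", "hello", [1.0, 0.0], 0.1)))
# Same scene again with low emotion: neither novel nor salient, so skipped.
print(memory.observe(Episode("alice", "hello again", [1.0, 0.0], 0.1)))
# Same scene but high emotion: stored because of emotional salience.
print(memory.observe(Episode("alice", "I got the job!", [1.0, 0.0], 0.9)))
```

Keying the store by user keeps novelty relative to each person's own history, so a scene that is routine for one user can still count as novel for another. A real system would derive the emotion score and the image embedding from perception models rather than take them as inputs.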
Performance Evaluation
The context-selective memory system was evaluated on a curated dataset of social scenarios. It achieved a Spearman correlation of 0.506, surpassing the human consistency score of 0.415 and outperforming existing image memorability models. In multimodal retrieval experiments, the fusion approach improved Recall@1 by up to 13% over unimodal text or image retrieval.
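For readers unfamiliar with the two metrics, the following sketch shows how a Spearman rank correlation and Recall@1 are typically computed, and how a simple late fusion of text and image similarity scores can change Recall@1. The weighted-sum fusion rule, the alpha parameter, and the toy score matrices are assumptions made for illustration. They are not the paper's fusion method or data, and the numbers bear no relation to the reported results.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def fuse(text_scores, image_scores, alpha=0.5):
    """Late fusion (assumed rule): weighted sum of per-modality scores."""
    return [[alpha * t + (1 - alpha) * v for t, v in zip(t_row, v_row)]
            for t_row, v_row in zip(text_scores, image_scores)]


def recall_at_1(scores, targets):
    """Fraction of queries whose top-scoring memory is the correct one."""
    hits = 0
    for row, target in zip(scores, targets):
        best = max(range(len(row)), key=row.__getitem__)
        hits += best == target
    return hits / len(targets)


# Toy data: 3 queries x 3 stored memories; targets[i] is the correct memory.
text_scores = [[0.9, 0.1, 0.2], [0.6, 0.5, 0.1], [0.2, 0.3, 0.4]]
image_scores = [[0.8, 0.3, 0.1], [0.2, 0.9, 0.3], [0.4, 0.1, 0.3]]
targets = [0, 1, 2]

print("text only :", recall_at_1(text_scores, targets))
print("image only:", recall_at_1(image_scores, targets))
print("fused     :", recall_at_1(fuse(text_scores, image_scores), targets))
print("spearman  :", spearman([0.1, 0.4, 0.35, 0.8], [1, 3, 2, 4]))
```

In this toy case each modality alone misses a different query, while the fused scores rank the correct memory first for all three. That is the kind of complementarity a fusion approach relies on, though the size of any gain depends entirely on the data.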
Real-Time Performance and Qualitative Analysis
Runtime evaluations confirmed that the system operates in real time, making it feasible for live interactions in various social contexts. Qualitative analyses further illustrated that the proposed framework generates responses that are richer and more socially relevant than those of baseline models. This improvement in dialogue quality matters for creating more natural and engaging interactions between humans and robots.
Conclusion
This work advances memory design for social robots by integrating human-inspired selectivity with multimodal retrieval. By focusing on emotional salience and context, the system aims to enhance long-term, personalized human-robot interactions, paving the way for more sophisticated and empathetic social robots.
