Human-Centric Topic Modeling with Goal-Prompted Contrastive Learning and Optimal Transport
arXiv:2604.12663v1
Abstract: Existing topic modeling methods, from LDA to recent neural and LLM-based approaches, focus mainly on statistical coherence and often produce redundant or off-target topics that miss the user’s underlying intent. We introduce Human-centric Topic Modeling (Human-TM), a novel task formulation that integrates a human-provided goal directly into the topic modeling process to produce interpretable, diverse, and goal-oriented topics.
Introduction
Topic modeling has become an essential technique in natural language processing, allowing researchers and practitioners to extract themes from large corpora of text. Traditional methods, such as Latent Dirichlet Allocation (LDA), have paved the way for more advanced models, including those based on neural networks and large language models (LLMs). However, these existing methodologies often prioritize statistical coherence over user intent, leading to topics that may be redundant, irrelevant, or misaligned with the user’s goals.
Human-Centric Topic Modeling
To address these limitations, we propose Human-centric Topic Modeling (Human-TM), a task formulation in which a human-provided goal is part of the topic modeling process itself. The aim is to make the generated topics more interpretable, more diverse, and more relevant to that goal, shifting topic discovery toward user-centered applications.
Proposed Method: GCTM-OT
At the core of our approach is the Goal-prompted Contrastive Topic Model with Optimal Transport (GCTM-OT), which has three components:
- Goal Extraction: The process begins with LLM-based prompting to extract potential goal candidates from the input documents.
- Semantic-Aware Contrastive Learning: These goals are then integrated into a contrastive learning framework that is aware of the underlying semantics of the data.
- Optimal Transport: We utilize optimal transport techniques to ensure that the discovered topics align closely with the extracted goals, thus enhancing topic relevance and coherence.
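The last two components can be illustrated with generic building blocks. The following is a minimal sketch, not the paper's implementation: it assumes a standard InfoNCE loss for the contrastive step and entropy-regularized optimal transport (Sinkhorn iterations) for the alignment step, with cosine distance as the transport cost. The function names, the toy 2-D embeddings, and the values of `temperature` and `reg` are all hypothetical and chosen only for illustration. It uses plain Python so it runs without dependencies; a real model would use a tensor library.

```python
import math


def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return dot / norm


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: anchors[i] should match positives[i]; the other
    positives in the batch serve as negatives."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        top = max(logits)  # subtract the max for numerical stability
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def sinkhorn(cost, a, b, reg=0.5, n_iters=200):
    """Entropy-regularized optimal transport plan between histograms a and b."""
    n, m = len(a), len(b)
    # Gibbs kernel: cheap moves get large weights, expensive moves small ones.
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale to match the column and row marginals.
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy 2-D embeddings, invented for illustration only.
topic_emb = [[1.0, 0.1], [0.9, 0.3], [0.1, 1.0], [-0.2, 0.9]]
goal_emb = [[1.0, 0.0], [0.0, 1.0]]

n_topics, n_goals = len(topic_emb), len(goal_emb)
cost = [[1.0 - cosine(t, g) for g in goal_emb] for t in topic_emb]
a = [1 / n_topics] * n_topics  # uniform mass over topics
b = [1 / n_goals] * n_goals    # uniform mass over goals
plan = sinkhorn(cost, a, b)

# Total transport cost: lower means the topics sit closer to the goals.
alignment_cost = sum(
    plan[i][j] * cost[i][j] for i in range(n_topics) for j in range(n_goals)
)
```

In this sketch, `plan[i][j]` is the share of topic `i`'s mass assigned to goal `j`, and `alignment_cost` is the kind of quantity that could be added to a training loss to pull topics toward goals. How GCTM-OT actually combines the two terms is not specified in this summary.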
Experimental Results
To evaluate the effectiveness of GCTM-OT, we conducted extensive experiments on three public subreddit datasets. The results demonstrate that GCTM-OT significantly outperforms state-of-the-art baselines in terms of both topic coherence and diversity. More importantly, our approach shows a marked improvement in aligning the generated topics with human-provided goals, highlighting its potential as a more human-centric topic discovery system.
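This summary does not state which diversity metric the experiments use. As a point of reference, one commonly used definition of topic diversity is the fraction of unique words among the top-k words of all topics. The sketch below assumes that definition; the function name `topic_diversity` and the toy word lists are hypothetical.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    `topics` is a list of word lists, each sorted by descending weight.
    A value of 1.0 means no word is shared between topics.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    return len(set(top_words)) / len(top_words)


# Toy topics, invented for illustration only.
topics = [
    ["gpu", "driver", "cuda", "memory"],
    ["loan", "credit", "bank", "interest"],
    ["gpu", "memory", "cooling", "fan"],
]

# "gpu" and "memory" appear twice, so 10 of the 12 top words are unique.
score = topic_diversity(topics, top_k=4)
```

A redundant topic set lowers the score, which is why diversity is usually reported alongside coherence: a model can score well on coherence while repeating near-identical topics.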
Conclusion
The introduction of Human-centric Topic Modeling and the GCTM-OT framework represents a significant advancement in the field of topic modeling. By integrating human intent directly into the modeling process, we can produce more meaningful and relevant topics for users. This research opens the door for future developments in creating more intuitive and user-friendly topic discovery systems.
For more information, see the full paper on arXiv (arXiv:2604.12663v1).
