LITTA: Late-Interaction and Test-Time Alignment for Visually-Grounded Multimodal Retrieval
Retrieving relevant evidence from visually rich documents such as textbooks, technical reports, and manuals remains a hard problem in information retrieval and natural language processing. Traditional methods struggle with long contexts, complex layouts, and weak lexical overlap between user queries and the pages that support them. LITTA is a retrieval framework designed to improve evidence page retrieval without retraining the underlying retrieval model.
LITTA, which stands for Late-Interaction and Test-Time Alignment, takes a query-expansion-centric approach to multimodal document retrieval. The framework uses a large language model to generate complementary variants of the initial user query. Because each variant phrases the information need differently, the set of candidate pages that can be retrieved grows beyond what the original wording alone would surface.
The retrieval process within LITTA involves three steps:
- Query Expansion: The framework uses a large language model to create multiple variants of the user’s original query, thereby broadening the search scope.
- Candidate Page Retrieval: Each query variant is passed to a frozen vision retriever, which ranks candidate pages with late-interaction scoring, that is, by matching query embeddings against page embeddings at a fine-grained level rather than comparing a single pooled vector per side.
- Aggregation of Results: The candidates retrieved for the expanded queries are merged using reciprocal rank fusion. This step improves evidence coverage and reduces sensitivity to any single phrasing of the query.
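The retrieval and aggregation steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the query variants are assumed to be already generated and embedded, the toy two-dimensional embeddings and the names `maxsim_score`, `rank_pages`, and `reciprocal_rank_fusion` are hypothetical, the scoring function is the common MaxSim form of late interaction, and the fusion constant `k = 60` is a conventional default that the source does not specify.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

Vector = Sequence[float]


def maxsim_score(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """MaxSim late interaction: for each query embedding, take the best dot
    product against any page embedding, then sum over the query embeddings."""
    total = 0.0
    for q in query_vecs:
        total += max(sum(a * b for a, b in zip(q, p)) for p in page_vecs)
    return total


def rank_pages(query_vecs: Sequence[Vector],
               index: Dict[str, Sequence[Vector]],
               top_k: int) -> List[str]:
    """Rank every page in a (frozen) index for one query variant."""
    ranked = sorted(index, key=lambda pid: maxsim_score(query_vecs, index[pid]),
                    reverse=True)
    return ranked[:top_k]


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each page gets sum(1 / (k + rank)) over the lists
    it appears in. k = 60 is an assumed default, not taken from the source."""
    fused: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, pid in enumerate(ranking, start=1):
            fused[pid] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by page id for determinism.
    return sorted(fused, key=lambda pid: (-fused[pid], pid))


# Toy index: page id -> list of page embeddings (illustrative values only).
index = {
    "page_1": [[1.0, 0.0], [0.6, 0.8]],
    "page_2": [[0.7, 0.7]],
    "page_3": [[0.0, 1.0]],
}

# Two pre-embedded query variants (stand-ins for LLM-generated rewrites).
variants = [
    [[1.0, 0.0]],
    [[0.0, 1.0]],
]

per_variant = [rank_pages(v, index, top_k=3) for v in variants]
fused = reciprocal_rank_fusion(per_variant)
print(per_variant)  # [['page_1', 'page_2', 'page_3'], ['page_3', 'page_1', 'page_2']]
print(fused)        # ['page_1', 'page_3', 'page_2']
```

In this toy run, `page_3` is ranked last for the first variant but first for the second, and fusion lifts it above `page_2`. That is the coverage effect the aggregation step is meant to provide: a page only needs to match one phrasing well to stay in contention.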
Because the strategy is applied entirely at test time, it makes retrieval more robust while remaining compatible with existing multimodal embedding indices. LITTA has been evaluated across three domains: computer science, pharmaceuticals, and industrial manuals. The results indicate that multi-query retrieval consistently improves top-k accuracy, recall, and Mean Reciprocal Rank (MRR), particularly in domains characterized by high visual and semantic variability.
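For readers unfamiliar with the metrics named above, the following sketch shows one common way to compute them. The exact definitions used in the evaluation are not given here, so these are assumptions: top-k accuracy is taken as "at least one relevant page in the top k", recall as the fraction of relevant pages found in the top k, and MRR as the mean of the reciprocal rank of the first relevant page. All data and function names are illustrative.

```python
from typing import List, Sequence, Set


def top_k_accuracy(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant page appears in the top k, else 0.0."""
    return 1.0 if any(pid in relevant for pid in ranked[:k]) else 0.0


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant pages that appear in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant page, or 0.0 if none is retrieved."""
    for rank, pid in enumerate(ranked, start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# Two toy queries: (ranked page ids, set of relevant page ids).
runs = [
    (["p3", "p1", "p2"], {"p1", "p2"}),
    (["p5", "p4", "p6"], {"p5"}),
]

acc_at_1 = mean([top_k_accuracy(r, rel, k=1) for r, rel in runs])
rec_at_2 = mean([recall_at_k(r, rel, k=2) for r, rel in runs])
mrr = mean([reciprocal_rank(r, rel) for r, rel in runs])
print(acc_at_1, rec_at_2, mrr)  # 0.5 0.75 0.75
```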
LITTA also offers a controllable accuracy-efficiency trade-off. Users can adjust the number of query variants generated: more variants broaden coverage at the cost of additional retrieval passes, while fewer variants keep latency low. This makes the framework practical to deploy under latency constraints.
In conclusion, LITTA combines LLM-based query expansion, late-interaction scoring with a frozen retriever, and reciprocal rank fusion to make visually grounded multimodal retrieval more robust and effective without retraining. Its evaluation across three varied domains suggests that test-time query expansion is a practical way to improve evidence retrieval in visually rich documents.
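One way to picture this trade-off is a simple latency budget. The sketch below is purely illustrative: the linear cost model (one LLM expansion call plus one retrieval pass per query, run sequentially), the function name `max_queries_within_budget`, and all timing values are assumptions, not measurements or settings from the source.

```python
def max_queries_within_budget(budget_ms: float,
                              expansion_ms: float,
                              per_query_ms: float,
                              max_queries: int = 8) -> int:
    """Return how many queries (original plus variants) fit in the budget.

    Assumed cost model: expansion_ms for one LLM call, plus per_query_ms for
    each retrieval pass, run sequentially. Returns 1 (original query only,
    no expansion) when expansion plus at least two passes would not fit.
    """
    if per_query_ms <= 0:
        raise ValueError("per_query_ms must be positive")
    remaining = budget_ms - expansion_ms
    if remaining < 2 * per_query_ms:
        return 1
    return min(max_queries, int(remaining // per_query_ms))


# Hypothetical timings: 200 ms for expansion, 40 ms per retrieval pass.
print(max_queries_within_budget(500, 200, 40))  # 7
print(max_queries_within_budget(250, 200, 40))  # 1
```

If the retrieval passes can run in parallel, the per-query cost no longer adds up linearly and the budget calculation would need a different model.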
