When Search Goes Wrong: Red-Teaming Web-Augmented Large Language Models
Abstract
Large Language Models (LLMs) are increasingly augmented with web search, which lets them move past the static knowledge fixed at training time and draw on up-to-date information from the open Internet. This integration enhances model capability, but it also introduces a distinct safety threat surface: the retrieval and citation process can expose users to harmful or low-credibility web content. Existing red-teaming methods are largely designed for standalone LLMs and focus primarily on unsafe generation, so they miss the risks that emerge from the multi-step search workflow.
Introducing CREST-Search
To address this gap, we propose CREST-Search, a red-teaming framework for LLMs with web search. Its core is three attack strategies that generate seemingly benign search queries which nonetheless induce unsafe citations. By probing the search workflow rather than text generation alone, the framework targets vulnerabilities that methods built for standalone LLMs may overlook.
Key Features of CREST-Search
- Novel Attack Strategies: CREST-Search incorporates three strategies that craft benign-looking search queries which lead the model to cite unsafe sources.
- In-Context Refinement Mechanism: The framework employs an iterative mechanism that refines the context of the queries, thereby improving the effectiveness of adversarial attacks under black-box constraints.
- Search-Specific Harmful Dataset: We construct WebSearch-Harm, a dataset of harmful content tailored to web search, and use it to fine-tune a specialized red-teaming model that produces higher-quality queries.
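The iterative refinement idea can be illustrated with a minimal Python sketch. This is not the CREST-Search implementation: the names `refine`, `unsafe_fraction`, `Attempt`, the callables `target`, `rewrite`, and `judge`, and the `threshold` and `max_rounds` parameters are all hypothetical, chosen only to show the shape of a black-box loop in which the only feedback is the list of citations the target system returns.

```python
from dataclasses import dataclass
from typing import Callable, List, Set
from urllib.parse import urlparse


@dataclass
class Attempt:
    query: str
    citations: List[str]
    score: float  # judge's score for the returned citations


def unsafe_fraction(citations: List[str], flagged_domains: Set[str]) -> float:
    """Share of cited URLs whose host is on a (hypothetical) flagged list."""
    if not citations:
        return 0.0
    hits = sum(1 for url in citations if urlparse(url).hostname in flagged_domains)
    return hits / len(citations)


def refine(
    seed_query: str,
    target: Callable[[str], List[str]],       # black box: query -> cited URLs
    rewrite: Callable[[List[Attempt]], str],  # proposes the next query from history
    judge: Callable[[List[str]], float],      # scores the citations of one response
    max_rounds: int = 5,
    threshold: float = 0.5,
) -> List[Attempt]:
    """Run the loop until the judge's score reaches the threshold or rounds run out."""
    history: List[Attempt] = []
    query = seed_query
    for _ in range(max_rounds):
        citations = target(query)  # only observable output; no model internals
        attempt = Attempt(query, citations, judge(citations))
        history.append(attempt)
        if attempt.score >= threshold:
            break
        # Earlier attempts and their scores serve as in-context feedback.
        query = rewrite(history)
    return history
```

In this sketch the judge scores the cited sources rather than the generated text, which mirrors the distinction drawn above between unsafe generation and unsafe citation. In practice `rewrite` would be backed by a red-teaming model and `judge` by a content or credibility classifier; both are left as injected callables here because the source does not specify them.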
Experimental Findings
Our experiments show that CREST-Search effectively bypasses safety filters and exposes critical vulnerabilities in web search-based LLM systems. The results underscore the need for more robust search models that withstand adversarial queries and keep users safe.
Conclusion
Integrating web search into LLMs is a significant advance, but it also creates new risks that must be addressed. Frameworks like CREST-Search make it possible to identify these risks proactively so that they can be mitigated, improving the safety and reliability of web-augmented LLMs. As AI continues to evolve, ongoing research on red-teaming methodologies will remain essential to safeguard users and promote trust in AI systems.
