CLIP-Inspector: Detect Backdoors in Prompt-Tuned CLIP Models

CLIP-Inspector: Model-Level Backdoor Detection for Prompt-Tuned CLIP via OOD Trigger Inversion

Source: arXiv:2604.09101v1 (announce type: cross)

Abstract

In the growing landscape of Machine Learning as a Service (MLaaS), organizations with limited data and computational resources often rely on external providers to train models. These providers adapt vision-language models (VLMs) such as CLIP to specific tasks through prompt tuning. This setup introduces a security risk: a malicious provider can implant a backdoor during prompt tuning so that certain inputs are classified into an attacker-specified category, even when those inputs are out-of-distribution (OOD).

Detection methods that look for encoder corruption miss these backdoors, because the underlying encoders remain intact. Existing data-level techniques, which sanitize data before training or during inference, do not answer the central question: “Is the delivered model backdoored or not?” To address this model-level verification problem, we introduce CLIP-Inspector (CI), a backdoor detection method tailored for prompt-tuned CLIP models.

Functionality of CLIP-Inspector

CLIP-Inspector assumes white-box access to the delivered model and uses a pool of unlabeled OOD images. It works in two steps:

Reconstructing potential triggers for each class.
Determining if the model exhibits backdoor behavior based on the reconstructed triggers.
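
The two steps can be illustrated on a toy problem. The sketch below is not CI's algorithm, whose objective and decision rule are not given here. It makes several assumptions: CLIP is replaced by a small linear softmax classifier with a planted shortcut toward class 2, the trigger is modeled as a bounded additive perturbation shared across the pool, and the decision uses a median-absolute-deviation (MAD) outlier rule of the kind used by earlier trigger-inversion detectors. All names (invert_trigger, success_rate, flag_outliers, eps, W) are hypothetical.

```python
import math
import random
import statistics

def logits(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def invert_trigger(W, pool, target, eps=0.1, lr=0.5, steps=30):
    """Step 1: find one shared perturbation that pushes the pool toward `target`."""
    dim = len(pool[0])
    delta = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for x in pool:
            p = softmax(logits(W, [a + d for a, d in zip(x, delta)]))
            for k, row in enumerate(W):
                # Gradient of cross-entropy w.r.t. logit k is p_k - 1[k == target].
                g = p[k] - (1.0 if k == target else 0.0)
                for j in range(dim):
                    grad[j] += g * row[j] / len(pool)
        # Descent step, then clip so the trigger stays small (assumed L-inf budget).
        delta = [max(-eps, min(eps, d - lr * g)) for d, g in zip(delta, grad)]
    return delta

def success_rate(W, pool, delta, target):
    """Fraction of pool inputs classified as `target` once the trigger is added."""
    hits = 0
    for x in pool:
        z = logits(W, [a + d for a, d in zip(x, delta)])
        hits += z.index(max(z)) == target
    return hits / len(pool)

def flag_outliers(scores, threshold=2.0):
    """Step 2: flag classes whose trigger is abnormally effective (MAD rule, an assumption)."""
    med = statistics.median(scores)
    mad = statistics.median([abs(s - med) for s in scores])
    scale = 1.4826 * max(mad, 1e-12)  # guard against a zero MAD
    return [k for k, s in enumerate(scores) if (s - med) / scale > threshold]

# Toy "delivered model": 4 classes, 5 features. Feature 4 is a planted shortcut to class 2.
W = [[1.0 if j == k else 0.0 for j in range(5)] for k in range(4)]
W[2][4] = 8.0

# Unlabeled pool standing in for OOD images; the shortcut feature is inactive in it.
rng = random.Random(0)
pool = [[rng.random() for _ in range(4)] + [0.0] for _ in range(300)]

scores = []
for target in range(4):
    delta = invert_trigger(W, pool, target)
    scores.append(success_rate(W, pool, delta, target))

flagged = flag_outliers(scores)
print("per-class trigger success:", [round(s, 2) for s in scores])
print("flagged classes:", flagged)
```

In this toy setting a small perturbation flips every pool input to class 2 but only some inputs to the other classes, so class 2 stands out. A real inspection would run the same loop through the prompt-tuned CLIP model with an autograd framework rather than hand-written gradients.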

Furthermore, we show that fine-tuning on correctly labeled inputs stamped with CI’s reconstructed trigger can realign the model and weaken any backdoor present.
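
As a rough illustration of what "correctly labeled triggered inputs" means, the sketch below stamps a trigger onto labeled samples and keeps each sample's true label, so that fine-tuning teaches the model to ignore the trigger. It assumes an additive trigger and inputs in the range [0, 1]; the source does not specify either, and the names stamp and build_repair_set are hypothetical.

```python
def stamp(x, delta, lo=0.0, hi=1.0):
    """Apply an additive trigger and clip to the valid input range (assumed [0, 1])."""
    return [min(hi, max(lo, a + d)) for a, d in zip(x, delta)]

def build_repair_set(labeled_samples, delta):
    """Pair each triggered input with its TRUE label, never the attacker's target."""
    return [(stamp(x, delta), y) for x, y in labeled_samples]

# Hypothetical inputs: two tiny "images" with ground-truth labels, and a trigger.
labeled_samples = [([0.25, 0.75], 1), ([0.5, 0.0], 0)]
reconstructed_trigger = [0.5, 0.5]

repair_set = build_repair_set(labeled_samples, reconstructed_trigger)
print(repair_set)
```

The resulting pairs would then be fed to an ordinary fine-tuning loop for the prompt parameters.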

Experimental Validation

We ran experiments across ten datasets and four backdoor attack methods. CI reconstructs effective triggers within a single epoch using only 1,000 OOD images, and it reaches a detection accuracy of 94% (47 out of 50 models).

CI also outperforms adapted trigger-inversion baselines by a wide margin. It achieves an area under the receiver operating characteristic curve (AUROC) of 0.973, compared with 0.495 and 0.687 for the baseline methods. These results support using CI both to vet prompt-tuned CLIP models and to repair them post hoc before deployment.
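
For readers unfamiliar with the metric, AUROC is the probability that a randomly chosen backdoored model receives a higher detection score than a randomly chosen clean one, with ties counted as half. A value of 0.5 corresponds to random ranking, so a baseline at 0.495 is no better than chance. The sketch below computes AUROC directly from that definition; the scores and labels are made-up illustrative values, not results from the paper.

```python
def auroc(scores, labels):
    """Pairwise AUROC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative detector scores: label 1 = backdoored model, label 0 = clean model.
example_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
example_labels = [1, 1, 1, 0, 0, 0]

print(auroc(example_scores, example_labels))
```

Here eight of the nine backdoored-versus-clean pairs are ordered correctly, which is why the example lands just below 0.9.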

Conclusion

As dependence on MLaaS grows, so does the need to verify delivered models. CLIP-Inspector gives organizations a way to check prompt-tuned CLIP models for backdoors before deployment and to repair models that turn out to be compromised.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
