Discover the LLM-as-Judge framework and Ghost-100 benchmark for more accurately evaluating tone-induced hallucination in vision-language models.
Explore evidence collapse in multimodal reasoning models, its risks, and mitigation strategies to improve vision-language model reliability and safety.