When Identities Collapse: A Stress-Test Benchmark for Multi-Subject Personalization
arXiv:2603.26078v1
Abstract: Subject-driven text-to-image diffusion models have achieved remarkable success in preserving single identities, yet their ability to compose multiple interacting subjects remains largely unexplored and highly challenging. Existing evaluation protocols typically rely on global CLIP metrics, which are insensitive to local identity collapse and fail to capture the severity of multi-subject entanglement.
In this paper, we identify a pervasive “Illusion of Scalability” in current models: while they excel at synthesizing 2-4 subjects in simple layouts, they suffer from catastrophic identity collapse when scaled to 6-10 subjects or tasked with complex physical interactions. To systematically expose this failure mode, we construct a rigorous stress-test benchmark comprising 75 prompts distributed across varying subject counts and interaction difficulties.
Key Findings
Our study reports three findings about existing models:
- Identity Fidelity: The ability of models to maintain distinct identities diminishes significantly as the number of subjects increases.
- CLIP Metric Limitations: Standard CLIP-based metrics are inadequate for evaluating multi-subject interactions, often rewarding images that lack individual identity.
- Introduction of SCR: We propose the Subject Collapse Rate (SCR), a new metric that quantifies loss of identity by penalizing local attention leakage and homogenization.
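The text above does not give the formula for SCR, so the following is only a toy sketch of the intuition behind the last two findings, not the paper's metric. It assumes each subject is represented by a small embedding vector, and it treats a generated subject as "collapsed" when it is not clearly closer to its own reference than to any other reference. The names `pooled_similarity`, `toy_collapse_rate`, and the `margin` value are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Element-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def pooled_similarity(refs, gens):
    """Stand-in for a global score: compare pooled references with pooled outputs."""
    return cosine(mean_vector(refs), mean_vector(gens))

def toy_collapse_rate(refs, gens, margin=0.1):
    """Fraction of generated subjects that are not closer to their own
    reference than to every other reference by at least `margin`.
    Illustrative assumption only, not the paper's SCR definition."""
    collapsed = 0
    for i, gen in enumerate(gens):
        own = cosine(refs[i], gen)
        others = [cosine(r, gen) for j, r in enumerate(refs) if j != i]
        if own - max(others, default=-1.0) < margin:
            collapsed += 1
    return collapsed / len(gens)

# Three distinct reference identities (hypothetical embeddings).
refs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Homogenized output: every subject rendered as the same "average" identity.
homogenized = [[1.0, 1.0, 1.0]] * 3

print(round(pooled_similarity(refs, homogenized), 3))  # 1.0
print(toy_collapse_rate(refs, homogenized))            # 1.0
```

In this toy case the pooled score is perfect even though no subject keeps its own identity, which is the kind of blind spot a per-subject collapse measure is meant to expose.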
Benchmark Construction
The stress-test benchmark we developed includes:
- 75 Prompts: These prompts are designed to challenge the models with varying subject counts and complexities.
- Interaction Categories: We categorized the prompts into three interaction difficulties: Neutral, Occlusion, and Interaction.
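A benchmark of this shape can be pictured as a grid of subject counts by interaction category. The sketch below is one hypothetical layout: only the three category names and the total of 75 prompts come from the text. The specific subject counts and the number of prompts per cell are assumptions chosen so that the total works out, and the actual distribution may differ.

```python
from dataclasses import dataclass
from itertools import product

CATEGORIES = ("Neutral", "Occlusion", "Interaction")  # named in the text
SUBJECT_COUNTS = (2, 4, 6, 8, 10)  # assumption: the text only gives ranges
PROMPTS_PER_CELL = 5               # assumption: chosen so the grid totals 75

@dataclass(frozen=True)
class PromptSpec:
    """One benchmark slot; the prompt text itself is not reproduced here."""
    subject_count: int
    category: str
    index: int

def build_grid():
    """Enumerate every (subject count, category, index) slot in the grid."""
    return [
        PromptSpec(count, category, i)
        for count, category in product(SUBJECT_COUNTS, CATEGORIES)
        for i in range(PROMPTS_PER_CELL)
    ]

grid = build_grid()
print(len(grid))  # 75
```

Keeping the subject count and category on each slot makes it straightforward to report results per cell rather than as a single average.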
Evaluation of State-of-the-Art Models
Our extensive evaluation of leading models, including MOSAIC, XVerse, and PSR, reveals alarming trends:
- As scene complexity increases, identity fidelity drops precipitously.
- At 10 subjects, SCR scores approach 100%, indicating severe identity collapse.
- This collapse can be traced back to semantic shortcuts used in global attention routing.
Conclusion
These findings underscore the need for generative architectures that prioritize physical disentanglement of subjects. As generative models are integrated into more applications, accurately representing multiple identities in a single image becomes increasingly important. Our proposed SCR metric is a step toward more reliable evaluation in multi-subject scenarios.
In summary, current text-to-image diffusion models handle 2-4 subjects in simple layouts, but they remain far from satisfactory at multi-subject personalization once subject counts or interaction complexity grow. Closing this gap is essential for the quality and reliability of generated images.
