XDomainBench evaluates large language models' reasoning across 20 scientific domains, exposing challenges in complex interdisciplinary knowledge compositio...
Discover how Temporal Critique Fine-Tuning (TCFT) improves large language models' temporal reasoning by reducing information leakage after cutoff dates.