DeltaLogic introduces a new benchmark that exposes belief-revision failures in the logical reasoning of AI models, highlighting the need for adaptive reasoning tests.
Explore CARV, a new benchmark that assesses compositional analogical reasoning in multimodal large language models and reveals key challenges and insights for AI.