Discover EnactToM, a 300-task benchmark evaluating AI agents' functional Theory of Mind in 3D environments with dynamic difficulty and formal verification.
Explore SeePhys Pro, a new benchmark analyzing modality transfer and blind training effects in multimodal reinforcement learning for physics reasoning.