Discover a novel submodular approach to selecting AI benchmarks that reduces evaluation costs while maximizing insight into language model performance.
Explore how LLMs perform on retrieval and multi-hop reasoning over classical Chinese texts in 1M-token context windows. Key insights on model accuracy and d...