Discover SPEED-Bench, a unified benchmark for evaluating speculative decoding in large language models with diverse, real-world workloads and production in...
Discover SRBench, a new framework for comprehensive benchmarking of sequential recommendation models using large language models for fair and accurate eval...
CheeseBench evaluates large language models on classic tasks from rodent behavioral neuroscience, revealing insights into their cognitive and spatial abilities.