Discover OracleProto, a framework for reliably benchmarking LLM forecasting that uses knowledge cutoffs and temporal masking to keep evaluations accurate.
Discover Workspace-Bench 1.0, a benchmark for evaluating AI agents on complex workspace tasks with large-scale file dependencies and real-world scenarios.