Improve large language model accuracy by mitigating preference instability with Inclusion-of-Thoughts, a method that filters distractors in decision-making...
LudoBench evaluates large language models' strategic reasoning using 480 spot-based Ludo scenarios, revealing key insights into AI decision-making behavior...
Explore how source labels influence trust assessments by humans and large language models, revealing shared biases and the need for debiased evaluations.
Discover an AI-powered method for creating hard math problems that target LLM weaknesses, improving benchmark accuracy and scalability in math skill testing.