LLMbench: A Comparative Close Reading Workbench for Large Language Models
Summary: arXiv:2604.15508v1
As large language models (LLMs) spread across domains, tools for analyzing their outputs matter more. LLMbench is a browser-based workbench designed for the comparative close reading of LLM outputs. Unlike existing tools that focus on quantitative evaluation, such as Google PAIR’s LLM Comparator, LLMbench emphasizes the hermeneutic practices of the digital humanities.
Overview of LLMbench
LLMbench allows users to place two model responses to the same prompt side by side in annotatable panels. The panels support four analytical overlays:
- Probabilities: Enables token-level log-probability inspection.
- Differences: Highlights word-level differences across the two panels.
- Tone: Facilitates Hyland-style metadiscourse analysis.
- Structure: Offers sentence-level parsing with discourse connective highlighting.
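The summary does not say how LLMbench implements these overlays, so the following is only a minimal Python sketch of the ideas behind Probabilities and Differences. It assumes log-probabilities are natural logs and that responses are split on whitespace; `logprob_to_probability`, `word_diff`, and the sample sentences are hypothetical names and data chosen for illustration.

```python
import difflib
import math


def logprob_to_probability(logprob: float) -> float:
    """Convert a natural-log probability into a plain probability in [0, 1]."""
    return math.exp(logprob)


def word_diff(left: str, right: str):
    """Return the non-matching word spans between two responses.

    Each entry is (tag, words_from_left, words_from_right), where tag is
    'replace', 'delete', or 'insert' as reported by difflib.
    """
    left_words, right_words = left.split(), right.split()
    matcher = difflib.SequenceMatcher(a=left_words, b=right_words, autojunk=False)
    return [
        (tag, left_words[i1:i2], right_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical responses from two models to the same prompt.
panel_a = "The model is likely correct"
panel_b = "The model is probably correct"

changes = word_diff(panel_a, panel_b)
print(changes)

# A hypothetical token log-probability, shown as a probability.
print(round(logprob_to_probability(-0.105), 3))
```

A word-level diff of this kind only marks where the two panels diverge; interpreting why "likely" became "probably" is left to the reader, which fits the close-reading emphasis described above.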
Additionally, LLMbench supports five analytical modes:
- Stochastic Variation: Examines variations in outputs due to randomness.
- Temperature Gradient: Analyzes the effects of temperature settings on model responses.
- Prompt Sensitivity: Investigates how slight changes in prompts can affect outcomes.
- Token Probabilities: Offers insights into the likelihood of specific token selections.
- Cross-Model Divergence: Compares outputs from different models to highlight divergences.
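To make the Stochastic Variation and Temperature Gradient modes concrete, here is a small Python sketch of temperature-scaled sampling. It is not LLMbench's code: the candidate tokens, the logits, and the helper names `softmax_with_temperature` and `entropy_bits` are assumptions made for illustration.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Shannon entropy of a distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical next-token candidates and logits.
tokens = ["likely", "probably", "perhaps"]
logits = [2.0, 1.0, 0.1]

# Temperature gradient: higher temperature flattens the distribution.
gradient = {
    t: entropy_bits(softmax_with_temperature(logits, t))
    for t in (0.2, 1.0, 2.0)
}
for t, h in gradient.items():
    print(f"T={t}: entropy={h:.3f} bits")

# Stochastic variation: repeated draws from the same distribution differ.
rng = random.Random(0)  # fixed seed so the run is repeatable
probs = softmax_with_temperature(logits, 1.0)
samples = rng.choices(tokens, weights=probs, k=5)
print(samples)
```

Low temperatures concentrate probability on the top token, so repeated runs tend to agree; higher temperatures spread it out, which is the variation these two modes are described as examining.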
Visualizations and Research Object
LLMbench treats the generated text as a research object in its own right and exposes the probability distribution that underlies each output. It renders that distribution through several visualizations:
- Continuous Heatmaps: Provide a visual representation of token probabilities.
- Entropy Sparklines: Display the uncertainty associated with word choices.
- Pixel Maps: Illustrate the density of token probabilities across the output.
- Three-Dimensional Probability Terrains: Offer an immersive view of the counterfactual history from which each word emerged.
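One plausible reading of an entropy sparkline can be sketched in Python as follows. The sketch assumes each token position comes with a short list of top-k natural-log probabilities, renormalizes them, and maps the resulting entropy to Unicode block characters; the data and the names `topk_entropy` and `sparkline` are hypothetical, and LLMbench's actual rendering may differ.

```python
import math

BARS = "▁▂▃▄▅▆▇█"


def topk_entropy(logprobs):
    """Entropy in bits over the renormalized top-k candidates at one position."""
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log2(p) for p in probs if p > 0)


def sparkline(values):
    """Map each value to a block character, scaled between min and max."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return BARS[0] * len(values)
    last = len(BARS) - 1
    return "".join(BARS[round((v - lo) / span * last)] for v in values)


# Hypothetical top-3 log-probabilities for three token positions.
positions = [
    [math.log(0.97), math.log(0.02), math.log(0.01)],  # near-certain choice
    [math.log(0.5), math.log(0.3), math.log(0.2)],     # contested choice
    [math.log(1 / 3)] * 3,                             # maximally uncertain
]

entropies = [topk_entropy(p) for p in positions]
line = sparkline(entropies)
print(line)
```

Because only the top-k candidates are available in this sketch, the entropy is a lower-resolution proxy for the full distribution; taller bars still mark positions where the model's choice was less settled.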
Conclusion
The paper describes LLMbench’s architecture, its analytical modes, and its design rationale. It argues that log-probability data, which remains underutilized in humanistic and social-scientific readings of artificial intelligence, is a crucial resource for a more nuanced critical study of generative AI models. As LLMs continue to evolve, tools like LLMbench can support deeper critical engagement with these technologies.
