Beyond Accuracy: LLM Variability in Evidence Screening for Software Engineering SLRs
Large Language Models (LLMs) have transformed many fields, including software engineering, yet their use in systematic literature reviews (SLRs) remains underexplored. A recent study (arXiv:2604.27006v1) examines how variable LLMs are during the study screening phase of SLRs and what that variability means for the validity of research findings.
Context and Importance
Conducting a systematic literature review is a critical yet resource-intensive process that struggles with consistency and carries a risk of false negatives, that is, relevant studies wrongly excluded during screening. Because false negatives can significantly undermine the validity of a review, it is essential to understand how LLMs behave in this setting. The study aims to close the knowledge gap about LLM performance in study screening, particularly in comparison with traditional classification methods.
Objectives and Methodology
The primary objectives of the study were threefold:
- Assess the performance variability of different LLMs during study screening.
- Quantify the impact of various input metadata types, such as abstracts, titles, and keywords, on LLM performance.
- Compare LLMs with classical classifiers to determine the advantages or disadvantages of employing LLMs in this setting.
To achieve these objectives, the researchers analyzed 12 LLMs from four providers (OpenAI, Google Gemini, Anthropic, and Llama) alongside four classical models: Logistic Regression, Support Vector Classification, Random Forest, and Naive Bayes. The analysis was conducted on two real SLRs covering a total of 518 papers, so the findings are grounded in real screening tasks.
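The metadata comparison can be pictured as screening the same paper under several input configurations. The following is a minimal sketch, assuming that every non-empty combination of title, abstract, and keywords is a candidate configuration; the summary does not state which combinations the study actually tested, and the names `input_configurations` and `build_screening_input` are hypothetical.

```python
from itertools import combinations

# Metadata fields named in the study objectives.
FIELDS = ("title", "abstract", "keywords")


def input_configurations(fields=FIELDS):
    """Return every non-empty subset of metadata fields, smallest first.

    Assumption: enumerating all subsets is an illustration, not the
    study's documented design.
    """
    configs = []
    for size in range(1, len(fields) + 1):
        configs.extend(combinations(fields, size))
    return configs


def build_screening_input(paper, config):
    """Join the selected fields of one paper into a single text block.

    Fields that are missing or empty for this paper are skipped.
    """
    parts = []
    for field in config:
        value = paper.get(field, "")
        if value:
            parts.append(f"{field.capitalize()}: {value}")
    return "\n".join(parts)


# Hypothetical paper record used only for illustration.
paper = {
    "title": "An example title",
    "abstract": "An example abstract.",
    "keywords": "screening; LLM",
}
configs = input_configurations()
text = build_screening_input(paper, ("title", "abstract"))
```

Recording the configuration tuple next to each screening decision makes it possible to attribute a performance change to the presence or absence of a specific field.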
Key Findings
The study reported three main findings:
- Performance Variability: LLMs showed notable heterogeneity across models and residual non-determinism across repeated runs, even with the temperature set to zero.
- Impact of Abstracts: Abstracts proved crucial; removing them consistently degraded performance, whereas adding titles or keywords did not yield substantial improvements.
- Comparison with Classical Models: The performance differences between LLMs and classical classifiers were inconsistent, which calls into question any blanket claim that LLMs are superior in this context.
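Residual non-determinism can be made concrete by repeating the same screening run and checking how often the decisions agree. The sketch below is one simple way to do that, not the metric used in the study; the function names and the "include"/"exclude" labels are assumptions chosen for illustration.

```python
from collections import Counter


def unanimous_rate(runs):
    """Fraction of papers that received the same decision in every run.

    `runs` is a list of runs; each run is a list of decisions, one per
    paper, with papers in the same order in every run.
    """
    if not runs or not runs[0]:
        raise ValueError("need at least one run covering at least one paper")
    n_papers = len(runs[0])
    if any(len(run) != n_papers for run in runs):
        raise ValueError("all runs must cover the same papers")
    stable = sum(1 for decisions in zip(*runs) if len(set(decisions)) == 1)
    return stable / n_papers


def flipped_papers(runs):
    """Map each unstable paper index to its vote counts across runs."""
    return {
        index: dict(Counter(decisions))
        for index, decisions in enumerate(zip(*runs))
        if len(set(decisions)) > 1
    }


# Hypothetical decisions from three repeated runs over four papers.
runs = [
    ["include", "exclude", "exclude", "include"],
    ["include", "exclude", "include", "include"],
    ["include", "exclude", "exclude", "include"],
]
rate = unanimous_rate(runs)
flips = flipped_papers(runs)
```

In this made-up example the third paper changes decision in one of three runs, so three of the four papers are stable. Papers that flip are natural candidates for manual review.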
Discussion and Implications
The findings suggest that while LLMs have potential in the study screening phase of SLRs, their adoption should be approached with caution. Researchers are encouraged to consider operational and governance constraints such as reproducibility, costs, and the availability of metadata before opting for LLMs over traditional methods. Furthermore, pilot validations, along with explicit reporting of variability and input configurations, are essential to ensure the integrity and reliability of the results.
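A pilot validation with explicit variability reporting could look like the following minimal sketch. It assumes a small manually labeled pilot set and several repeated runs of one configuration; the report fields and the names `recall` and `pilot_report` are hypothetical and are not taken from the study.

```python
def recall(predicted, relevant):
    """Share of truly relevant papers that the screener included."""
    if not relevant:
        raise ValueError("pilot set needs at least one relevant paper")
    return len(predicted & relevant) / len(relevant)


def pilot_report(runs, relevant, config):
    """Summarize recall and missed papers over repeated runs.

    `runs` is a list of sets of included paper IDs, one set per run.
    `relevant` is the set of paper IDs that human reviewers included.
    `config` names the input fields used, so the report is reproducible.
    """
    if not runs:
        raise ValueError("need at least one run")
    recalls = [recall(run, relevant) for run in runs]
    # A paper counts as missed if any run excluded it (a false negative).
    missed = set().union(*(relevant - run for run in runs))
    return {
        "config": config,
        "runs": len(runs),
        "recall_min": min(recalls),
        "recall_max": max(recalls),
        "missed_in_any_run": sorted(missed),
    }


# Hypothetical pilot data: four relevant papers, two repeated runs.
relevant = {"p1", "p3", "p4", "p5"}
runs = [
    {"p1", "p3", "p4", "p5", "p9"},
    {"p1", "p3", "p5"},
]
report = pilot_report(runs, relevant, config=("title", "abstract"))
```

Reporting the minimum and maximum recall, rather than a single figure, exposes run-to-run variability, and listing the papers missed in any run shows where false negatives occur.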
In summary, while LLMs offer a promising avenue for enhancing the efficiency of systematic literature reviews, their variability and the influence of input features highlight the need for a careful, evidence-based approach to their implementation in software engineering research.
