LLM Variability in Software Engineering SLR Screening

Beyond Accuracy: LLM Variability in Evidence Screening for Software Engineering SLRs

Large Language Models (LLMs) have transformed many fields, including software engineering, yet their use in systematic literature reviews (SLRs) remains underexplored. A recent study (arXiv:2604.27006v1) examines how variable LLMs are during the study screening phase of SLRs and what that variability implies for the validity of review findings.

Context and Importance

Conducting a systematic literature review is a critical yet resource-intensive process, and screening is prone to inconsistency and to false negatives, that is, relevant studies that are wrongly excluded. False negatives can significantly undermine a review's validity, so it is essential to understand how well LLMs perform at this task. The study aims to close the knowledge gap on LLM performance in study screening, particularly in comparison with traditional classification methods.

Objectives and Methodology

The primary objectives of the study were threefold:

  • Assess the performance variability of different LLMs during study screening.
  • Quantify the impact of various input metadata types, such as abstracts, titles, and keywords, on LLM performance.
  • Compare LLMs with classical classifiers to determine the advantages or disadvantages of employing LLMs in this setting.
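The second objective can be pictured as an ablation over metadata fields. The following is a minimal sketch, assuming each paper is a dict with the hypothetical keys `title`, `abstract`, and `keywords`; the function names and the text layout are illustrative and are not taken from the study.

```python
from itertools import combinations

FIELDS = ("title", "abstract", "keywords")  # assumed metadata field names


def input_configurations(fields=FIELDS):
    """Return every non-empty subset of metadata fields, smallest first."""
    return [
        combo
        for size in range(1, len(fields) + 1)
        for combo in combinations(fields, size)
    ]


def render_input(paper, config):
    """Build the text shown to a screener from the selected fields only."""
    lines = []
    for field in config:
        value = paper.get(field)
        if not value:
            continue  # skip metadata that is missing for this paper
        if isinstance(value, (list, tuple)):
            value = ", ".join(value)
        lines.append(f"{field.capitalize()}: {value}")
    return "\n".join(lines)


# Toy paper record, invented for illustration only.
paper = {
    "title": "An example title",
    "abstract": "An example abstract.",
    "keywords": ["screening", "LLM"],
}

for config in input_configurations():
    print(config)
    print(render_input(paper, config))
    print()
```

Running the same screener over each configuration makes it possible to attribute a change in performance to the presence or absence of a specific field.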

To achieve these objectives, the researchers analyzed 12 LLMs from four providers (OpenAI, Google Gemini, Anthropic, and Llama) alongside four classical models: Logistic Regression, Support Vector Classification, Random Forest, and Naive Bayes. The analysis was conducted on two real SLRs comprising 518 papers in total, which grounds the findings in practical application.
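One way to put LLMs and classical classifiers on the same footing is to score every screener against the reviewers' include/exclude labels. The sketch below is an assumption about how such a comparison might be set up, not the study's evaluation code; the function name, the label strings, and the toy data are hypothetical. It reports the false-negative count alongside recall and precision, since wrongly excluded papers are the main validity risk in screening.

```python
def screening_metrics(gold, predicted, positive="include"):
    """Compare predicted screening decisions with reviewer labels.

    Both arguments map a paper id to a decision string.
    """
    tp = fp = fn = tn = 0
    for paper_id, truth in gold.items():
        guess = predicted[paper_id]
        if truth == positive and guess == positive:
            tp += 1
        elif truth == positive:
            fn += 1  # relevant paper wrongly excluded
        elif guess == positive:
            fp += 1  # irrelevant paper wrongly included
        else:
            tn += 1
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "false_negatives": fn,
        "false_positives": fp,
    }


# Toy labels and decisions, invented for illustration only.
gold = {"P1": "include", "P2": "include", "P3": "exclude", "P4": "exclude"}
screeners = {
    "llm_a": {"P1": "include", "P2": "exclude", "P3": "exclude", "P4": "exclude"},
    "classifier_b": {"P1": "include", "P2": "include", "P3": "include", "P4": "exclude"},
}

for name, predicted in screeners.items():
    print(name, screening_metrics(gold, predicted))
```

In this toy data the two screeners fail in opposite directions: one misses a relevant paper, the other lets an irrelevant one through. A single accuracy figure would hide that difference.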

Key Findings

The study reports three main findings:

  • Performance Variability: LLMs showed notable heterogeneity across models and residual non-determinism across repeated runs, even with the temperature set to zero.
  • Impact of Abstracts: Abstracts were crucial. Removing them consistently degraded performance, whereas adding titles or keywords yielded no substantial improvement.
  • Comparison with Classical Models: Performance differences between LLMs and classical classifiers were inconsistent, which calls into question any blanket claim that LLMs are superior in this context.
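Residual non-determinism can be made concrete by repeating the same screening run several times and checking how often the decisions for each paper agree. The sketch below is one possible way to summarize this, under the assumption that decisions are stored as a list of strings per paper id; the function names, the metrics, and the toy data are hypothetical and not necessarily the measures used in the study.

```python
from collections import Counter


def consistency(decisions):
    """Share of runs that agree with the most frequent decision."""
    counts = Counter(decisions)
    return max(counts.values()) / len(decisions)


def variability_report(runs_by_paper):
    """Summarize run-to-run stability for one model and one input configuration."""
    unstable = sorted(
        paper_id
        for paper_id, decisions in runs_by_paper.items()
        if len(set(decisions)) > 1  # at least one run disagreed
    )
    total = len(runs_by_paper)
    mean_consistency = sum(consistency(d) for d in runs_by_paper.values()) / total
    return {
        "papers": total,
        "unstable_papers": unstable,
        "flip_rate": len(unstable) / total,
        "mean_consistency": mean_consistency,
    }


# Four repeated runs per paper; toy data invented for illustration only.
runs = {
    "P1": ["include", "include", "include", "include"],
    "P2": ["include", "exclude", "include", "include"],
    "P3": ["exclude", "exclude", "exclude", "exclude"],
    "P4": ["exclude", "exclude", "exclude", "exclude"],
}

print(variability_report(runs))
```

Here `flip_rate` is the fraction of papers whose decision changed at least once across runs. Reporting it per model and per input configuration keeps instability visible rather than averaged away.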

Discussion and Implications

The findings suggest that while LLMs have potential in the study screening phase of SLRs, their adoption should be approached with caution. Researchers are encouraged to consider operational and governance constraints such as reproducibility, costs, and the availability of metadata before opting for LLMs over traditional methods. Furthermore, pilot validations, along with explicit reporting of variability and input configurations, are essential to ensure the integrity and reliability of the results.
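Explicit reporting of variability and input configurations could take the form of a small structured record saved with each screening run. The sketch below is only an illustration: the class, its field names, the placeholder values, and the 0.95 recall threshold are all assumptions, and any real threshold would need to be chosen by the review team.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ScreeningReport:
    """Hypothetical record of one screening configuration and its pilot results."""

    model: str
    temperature: float
    input_fields: tuple
    repeated_runs: int
    pilot_size: int
    pilot_recall: float
    unstable_papers: int

    def passes_pilot(self, min_recall=0.95):
        """True if pilot recall meets the threshold (0.95 is an arbitrary default)."""
        return self.pilot_recall >= min_recall

    def to_json(self):
        """Serialize the record so it can be archived with the review protocol."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Placeholder values for illustration; these are not results from the study.
report = ScreeningReport(
    model="example-model",
    temperature=0.0,
    input_fields=("title", "abstract"),
    repeated_runs=5,
    pilot_size=50,
    pilot_recall=0.96,
    unstable_papers=3,
)

print(report.to_json())
print("pilot passed:", report.passes_pilot())
```

Keeping the model name, temperature, input fields, and number of repeated runs next to the pilot outcome lets a reader judge how far the reported performance depends on that specific configuration.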

In summary, while LLMs offer a promising avenue for enhancing the efficiency of systematic literature reviews, their variability and the influence of input features highlight the need for a careful, evidence-based approach to their implementation in software engineering research.

Lazarus Omolua
https://richlyai.com/blog
My mission is to make sure that people in Africa are not left behind in the global AI revolution. RichlyAI exists to give everyone — students, founders, creators, and businesses — the tools to compete globally.
