LLM Variability in Software Engineering SLR Screening

Beyond Accuracy: LLM Variability in Evidence Screening for Software Engineering SLRs

Large Language Models (LLMs) have transformed many fields, including software engineering, yet their use in systematic literature reviews (SLRs) remains underexplored. A recent study (arXiv:2604.27006v1) examines how variable LLMs are during the study screening phase of SLRs and what that variability implies for the validity of a review's findings.

Context and Importance

Conducting a systematic literature review is critical but resource-intensive, and the screening step is prone to inconsistency and to false negatives, that is, relevant studies that are wrongly excluded. Because false negatives can significantly undermine a review's validity, it is essential to understand how LLMs behave at this step. The study addresses this gap by examining LLM performance in study screening, particularly in comparison with classical classification methods.

Objectives and Methodology

The primary objectives of the study were threefold:

Assess the performance variability of different LLMs during study screening.
Quantify the impact of various input metadata types, such as abstracts, titles, and keywords, on LLM performance.
Compare LLMs with classical classifiers to determine the advantages or disadvantages of employing LLMs in this setting.

To achieve these objectives, the researchers analyzed 12 LLMs from four providers (OpenAI, Google Gemini, Anthropic, and Llama) alongside four classical models: Logistic Regression, Support Vector Classification, Random Forest, and Naive Bayes. The analysis covered two real SLRs comprising 518 papers in total, which grounds the findings in practical application.
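To make the metadata comparison concrete, the sketch below enumerates the input configurations that can be built from titles, abstracts, and keywords, and assembles the text for one paper under a chosen configuration. It is only an illustration: the `Paper` record, the field names, and the `build_input` helper are hypothetical and are not taken from the study, whose actual prompts and configurations are described in the paper.

```python
from dataclasses import dataclass
from itertools import combinations

# Assumed metadata fields; the study's exact field set may differ.
FIELDS = ("title", "abstract", "keywords")


@dataclass
class Paper:
    """Hypothetical record holding the metadata of one candidate study."""
    title: str
    abstract: str
    keywords: str


def input_configurations(fields=FIELDS):
    """Return every non-empty combination of metadata fields."""
    return [
        combo
        for size in range(1, len(fields) + 1)
        for combo in combinations(fields, size)
    ]


def build_input(paper: Paper, config) -> str:
    """Join the selected fields into one labeled text block for screening."""
    return "\n".join(
        f"{name.capitalize()}: {getattr(paper, name)}" for name in config
    )
```

Running the same screening step once per configuration, for example `("abstract",)` versus `("title", "keywords")`, is one way to isolate the contribution of each field.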

Key Findings

The study yielded significant insights into LLM performance:

Performance Variability: LLMs showed notable heterogeneity across models and residual non-determinism across repeated runs, even with the temperature set to zero.
Impact of Abstracts: Abstracts proved crucial. Removing them consistently degraded performance, whereas adding titles or keywords did not yield substantial improvements.
Comparison with Classical Models: The performance differences between LLMs and classical classifiers were inconsistent, which calls into question any blanket superiority of LLMs in this context.
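Residual non-determinism of the kind described above can be checked by repeating the same screening run and looking for papers whose decision changes. The following is a minimal sketch under stated assumptions: decisions are stored as one dictionary per run, mapping a paper identifier to `"include"` or `"exclude"`, and `decision_stability` is a hypothetical helper, not the metric used in the study.

```python
def decision_stability(runs):
    """Summarize run-to-run agreement for repeated screening runs.

    runs: list of dicts (one per run) mapping paper id -> decision label.
    Returns the share of papers with the same decision in every run,
    and the sorted ids of papers whose decision changed at least once.
    """
    paper_ids = sorted(runs[0])
    unstable = [
        pid for pid in paper_ids
        if len({run[pid] for run in runs}) > 1  # more than one distinct label
    ]
    stable_share = 1 - len(unstable) / len(paper_ids)
    return stable_share, unstable


# Toy data for illustration only; these are not results from the study.
example_runs = [
    {"p1": "include", "p2": "exclude", "p3": "include", "p4": "exclude"},
    {"p1": "include", "p2": "include", "p3": "include", "p4": "exclude"},
    {"p1": "include", "p2": "exclude", "p3": "include", "p4": "exclude"},
]
share, flipped = decision_stability(example_runs)
```

In this toy example one of four papers flips between runs, so a reviewer would want to inspect that paper manually rather than trust any single run.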

Discussion and Implications

The findings suggest that while LLMs have potential in the study screening phase of SLRs, their adoption should be approached with caution. Researchers are encouraged to consider operational and governance constraints such as reproducibility, costs, and the availability of metadata before opting for LLMs over traditional methods. Furthermore, pilot validations, along with explicit reporting of variability and input configurations, are essential to ensure the integrity and reliability of the results.
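A pilot validation with explicit reporting could look like the sketch below, which compares repeated runs against a small manually labeled sample and reports recall alongside the input configuration used. Recall is chosen here because it directly reflects false negatives; the function names, the report fields, and the label strings are assumptions made for illustration, not a format prescribed by the study.

```python
from statistics import mean


def recall(gold, predicted):
    """Share of truly relevant papers that the run marked as 'include'."""
    relevant = [pid for pid, label in gold.items() if label == "include"]
    if not relevant:
        return float("nan")
    found = sum(1 for pid in relevant if predicted.get(pid) == "include")
    return found / len(relevant)


def pilot_report(gold, runs, input_config):
    """Report recall across repeated runs for one input configuration."""
    recalls = [recall(gold, run) for run in runs]
    return {
        "input_config": input_config,
        "runs": len(runs),
        "recall_mean": mean(recalls),
        "recall_min": min(recalls),
        "recall_max": max(recalls),
    }


# Toy data for illustration only; these are not results from the study.
gold = {"p1": "include", "p2": "include", "p3": "exclude"}
runs = [
    {"p1": "include", "p2": "include", "p3": "exclude"},
    {"p1": "include", "p2": "exclude", "p3": "exclude"},
]
report = pilot_report(gold, runs, input_config=("title", "abstract"))
```

Reporting the minimum and maximum next to the mean makes the run-to-run spread visible instead of hiding it behind a single score.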

In summary, while LLMs offer a promising avenue for enhancing the efficiency of systematic literature reviews, their variability and the influence of input features highlight the need for a careful, evidence-based approach to their implementation in software engineering research.
