AI Exercise Prescription Consistency: Comparing Top LLMs

Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models

Summary: arXiv:2604.19598v1

Type: Cross

This study analyzes how consistently three prominent large language models (LLMs), GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash, generate exercise prescriptions. All models were run at temperature=0 so that they were compared under the same sampling conditions. Each model generated prescriptions for six distinct clinical scenarios, with 20 repetitions per scenario, for a total of 360 outputs (3 models × 6 scenarios × 20 repetitions).
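The size of the design can be checked by enumerating the run grid. The following is only a minimal sketch: the scenario labels are placeholders (this summary does not name the six clinical scenarios), and `build_run_grid` is a hypothetical helper, not something taken from the paper.

```python
from itertools import product

MODELS = ["GPT-4.1", "Claude Sonnet 4.6", "Gemini 2.5 Flash"]
# Placeholder labels: the summary does not name the six clinical scenarios.
SCENARIOS = [f"scenario_{i}" for i in range(1, 7)]
REPETITIONS = 20


def build_run_grid(models=MODELS, scenarios=SCENARIOS, repetitions=REPETITIONS):
    """Enumerate every (model, scenario, repetition) generation request."""
    return [
        {"model": m, "scenario": s, "repetition": r, "temperature": 0}
        for m, s, r in product(models, scenarios, range(1, repetitions + 1))
    ]


grid = build_run_grid()
print(len(grid))  # 3 models * 6 scenarios * 20 repetitions = 360
```

Each model contributes 120 requests (6 scenarios × 20 repetitions), which is the pool over which per-model consistency can be measured.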

Study Design and Methodology

The analysis focused on four key dimensions:

  • Semantic Similarity: The degree to which the generated outputs were similar in meaning.
  • Output Reproducibility: The ability of the model to produce consistent outputs across repetitions.
  • FITT Classification: The classification based on Frequency, Intensity, Time, and Type of exercise.
  • Safety Expression: The models’ ability to incorporate safety considerations into the prescriptions.
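To make the first two dimensions concrete, here is a rough sketch of how they could be computed for the repeated outputs of one scenario. It rests on loud assumptions: this summary does not say which embedding model the study used or how "unique" was defined, so the code substitutes a simple bag-of-words cosine for semantic similarity and treats uniqueness as exact string match. All function names are hypothetical, and the numbers it produces are not comparable to the reported scores.

```python
import math
import re
from collections import Counter
from itertools import combinations


def unique_fraction(outputs):
    """Share of outputs that are distinct strings (exact match after stripping)."""
    return len({o.strip() for o in outputs}) / len(outputs)


def _token_counts(text):
    # Lowercased word/number tokens; a crude stand-in for a sentence embedding.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between the token-count vectors of two texts."""
    ca, cb = _token_counts(a), _token_counts(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_pairwise_similarity(outputs):
    """Average similarity over all unordered pairs of outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy texts for illustration only; these are not outputs from the study.
duplicated = ["Walk 30 minutes, 3 days per week."] * 4
varied = [
    "Walk 30 minutes, 3 days per week.",
    "Walk briskly for 30 minutes on 3 days each week.",
    "Aim for 30 minutes of walking, 3 times per week.",
    "Three weekly walking sessions of 30 minutes.",
]
for name, outs in [("duplicated", duplicated), ("varied", varied)]:
    print(name, unique_fraction(outs), round(mean_pairwise_similarity(outs), 3))
```

Verbatim repetition pushes the similarity score toward 1 while the unique fraction falls, which is why the two measures are worth reading together.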

Findings

The results revealed significant differences in performance among the three models:

  • Mean Semantic Similarity: GPT-4.1 achieved the highest mean semantic similarity score (0.955), followed closely by Gemini 2.5 Flash (0.950), with Claude Sonnet 4.6 lowest (0.903). The differences between models were statistically significant (H = 458.41, p < .001).
  • Output Reproducibility: GPT-4.1 produced 100% unique outputs while keeping semantic content stable. In contrast, Gemini 2.5 Flash repeated itself heavily: only 27.5% of its outputs were unique. This suggests that Gemini's high similarity score stemmed from text duplication rather than consistently sound reasoning.
  • Safety Expression: All models reached ceiling levels on the safety expression metrics, so this dimension did not distinguish between the models.
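The summary reports an H statistic without naming the test. H conventionally denotes the Kruskal-Wallis statistic, a rank-based comparison of three or more groups, and the sketch below assumes that is what was used. It is a plain-Python illustration of that statistic with the standard tie correction, not the authors' analysis code, and the sample values are toy numbers rather than study data.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic with tie correction (assumed test; see lead-in)."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: 1 - sum(t^3 - t) / (N^3 - N) over groups of tied values.
    correction = 1 - sum(t ** 3 - t for t in Counter(pooled).values()) / (n ** 3 - n)
    return h / correction if correction else float("nan")


# Toy similarity scores for three hypothetical models; not data from the study.
model_a = [0.96, 0.95, 0.97, 0.94]
model_b = [0.95, 0.93, 0.96, 0.92]
model_c = [0.90, 0.91, 0.89, 0.88]
print(kruskal_wallis_h(model_a, model_b, model_c))
```

A large H relative to the chi-squared distribution with (number of groups − 1) degrees of freedom indicates that at least one group's scores tend to rank differently from the others.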

Conclusion

The findings indicate that model selection is a clinical decision rather than a purely technical one. Because the models behaved differently under repeated generation, evaluations based on a single output can be misleading. Output behavior under repetition should therefore be a core criterion when deploying AI-based exercise prescription systems. Knowing how models differ in consistency and reproducibility helps healthcare professionals choose which AI tools to integrate into their practice.


Lazarus Omolua