Evaluating LLM Web Generation: Single-File HTML Test

The Single-File Test: Evaluating LLM Web Generation Performance

This article summarizes a study of how well large language models (LLMs) generate complete web pages as a single HTML file. Over eight weeks, from December 10, 2025, to February 4, 2026, the “HTML AI Battle” project ran 17 public experiments and collected 68 HTML generations, one per model family per experiment. Four reasoning model families were compared: GPT, Gemini, Grok, and Claude.

Research Methodology

The study employed a fixed public-interface protocol, ensuring a fair comparison among the models. Key aspects of the methodology included:

  • No custom instructions were applied to the models.
  • Personality tuning was excluded to maintain uniformity.
  • No repair prompts were used during evaluations.

Each generated output was assessed from a video of the page rendered in a browser. Scores came from a human scorer and from a Gemini LLM-as-a-judge layer, and both covered three criteria:

  • Prompt adherence
  • Functional correctness
  • User interface (UI) quality
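The findings refer to a "human weighted score" built from these criteria, but the article does not state the weights. As a minimal sketch, assuming a simple weighted mean, the combination could look like the following; the weight values and the names `WEIGHTS` and `weighted_score` are hypothetical and chosen only for illustration.

```python
# Hypothetical weights: the article does not publish the real weighting.
WEIGHTS = {
    "prompt_adherence": 0.4,
    "functional_correctness": 0.4,
    "ui_quality": 0.2,
}


def weighted_score(scores, weights=WEIGHTS):
    """Return the weighted mean of per-criterion scores.

    `scores` maps each criterion name to a numeric score on any common scale.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


if __name__ == "__main__":
    # Made-up scores for one generation, not data from the study.
    example = {"prompt_adherence": 8, "functional_correctness": 6, "ui_quality": 10}
    print(round(weighted_score(example), 2))
```

Dividing by the sum of the weights keeps the result on the same scale as the inputs even if the weights do not add up to 1.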

Each experiment was also published under a standardized social-media protocol on platforms including X (formerly Twitter), TikTok, and YouTube. Two supervised predictive analyses were then run on the resulting data:

  • An experiment-level model for estimating 24-hour X impressions
  • A generation-level model for assessing HTML verbosity

Findings and Insights

The comparative analysis yielded several significant findings:

  • Claude emerged as the strongest and most consistent model family, achieving the highest mean performance and winning 9 out of 17 prompts according to the primary human weighted score.
  • Longer reasoning times did not correlate with improved quality across the evaluated outputs.
  • Gemini, acting as a judging model, was notably more lenient than human evaluators, especially regarding functional correctness and overall performance. This pattern raised concerns regarding stable self-favoring bias.
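Two of the quantities behind these findings are straightforward to compute from a score table: how many prompts each family won, and how far the judge's scores sit above the human's. The sketch below shows one way to do both, assuming scores are available per prompt and per family; the function names, the tie rule, and all sample values are illustrative assumptions, not the study's code or data.

```python
from collections import Counter
from statistics import mean


def count_wins(scores_by_prompt):
    """Count prompts won per family.

    `scores_by_prompt` maps prompt id -> {family: score}.
    Assumption: a tie for the top score credits no family.
    """
    wins = Counter()
    for family_scores in scores_by_prompt.values():
        best = max(family_scores.values())
        leaders = [f for f, s in family_scores.items() if s == best]
        if len(leaders) == 1:
            wins[leaders[0]] += 1
    return wins


def judge_gap_by_family(rows):
    """Return mean(judge - human) per family.

    `rows` is an iterable of (family, human_score, judge_score).
    A positive gap means the judge scored more leniently than the human.
    """
    gaps = {}
    for family, human, judge in rows:
        gaps.setdefault(family, []).append(judge - human)
    return {family: mean(values) for family, values in gaps.items()}


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    toy_scores = {
        "p1": {"Claude": 9, "GPT": 7, "Gemini": 8, "Grok": 6},
        "p2": {"Claude": 6, "GPT": 8, "Gemini": 7, "Grok": 7},
    }
    print(count_wins(toy_scores))
    print(judge_gap_by_family([("Gemini", 6, 8), ("Claude", 8, 9)]))
```

Comparing the gap for the judge's own family against the gaps for the other families is one simple way to look for self-favoring: leniency alone lifts every family, whereas self-favoring would lift the judge's own family more than the rest.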

The exploratory X-impressions model performed poorly in post-screen cross-validation, with a mean absolute error (MAE) of 46,874 impressions and an R-squared value of -0.377. A negative R-squared means the model predicted worse than simply guessing the mean for every experiment. The HTML-lines model fared better: a model-family-only baseline outperformed prompt-aware alternatives, with an MAE of 135.2 lines and an R-squared value of 0.576.
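To make these metrics concrete, the sketch below computes MAE and R-squared from scratch and builds a family-only baseline that predicts each generation's line count from the mean of the other generations in the same family. The leave-one-out scheme is an assumption, since the article does not describe how the cross-validation folds were formed, and the sample numbers are invented.

```python
from statistics import mean


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    It is 0 when predictions are no better than the mean of `actual`
    and negative when they are worse.
    """
    baseline = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - baseline) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def family_mean_loo(families, line_counts):
    """Leave-one-out family baseline (assumed scheme, not the study's).

    Each generation is predicted by the mean line count of the other
    generations from the same family.
    """
    predictions = []
    for i, family in enumerate(families):
        others = [
            y for j, (f, y) in enumerate(zip(families, line_counts))
            if f == family and j != i
        ]
        if not others:
            raise ValueError(f"family {family!r} needs at least two generations")
        predictions.append(mean(others))
    return predictions


if __name__ == "__main__":
    # Invented line counts for two families, for illustration only.
    families = ["A", "A", "B", "B"]
    lines = [100, 120, 400, 420]
    predicted = family_mean_loo(families, lines)
    print(mae(lines, predicted), r_squared(lines, predicted))
```

Because the baseline uses only the family label, a good score from it is evidence that verbosity tracks the model family; a prompt-aware model has to beat it on held-out data to show that prompt wording adds information.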

Conclusion

The study concludes that the selected pre-publication technical and audio variables were insufficient to predict 24-hour X reach, and that code verbosity was driven primarily by model family rather than by the specific wording of prompts. The comparisons are observational, and the study notes limitations from public-interface drift, access-path differences, and reliance on a single primary human scorer. These results can inform future research on LLM-driven web generation.

Lazarus Omolua
