SemLink: A Semantic-Aware Automated Test Oracle for Hyperlink Verification using Siamese Sentence-BERT
arXiv:2604.05711v1
Abstract
Web applications are increasingly reliant on hyperlinks to connect various information resources. However, the ever-changing nature of the web leads to a phenomenon known as link rot, where hyperlink targets become unavailable. More subtly, semantic drift can occur, where a valid HTTP 200 connection exists, but the content of the target no longer aligns with the source context. Traditional verification tools primarily function as crash oracles, checking only HTTP status codes, and often fail to detect these semantic inconsistencies. This oversight can compromise both web integrity and user experience.
While Large Language Models (LLMs) provide a degree of semantic understanding, they are often hindered by issues such as high latency, privacy concerns, and prohibitive costs when it comes to large-scale regression testing. In response to these challenges, we propose SemLink, a novel automated test oracle designed specifically for semantic hyperlink verification.
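To make the gap concrete, here is a minimal sketch of a status-only "crash oracle". The `LinkCheck` record, the `crash_oracle` function, and the sample pages are hypothetical illustrations, not code from the paper. The check catches link rot but passes a drifted page that still answers with HTTP 200.

```python
from dataclasses import dataclass


@dataclass
class LinkCheck:
    """Hypothetical record of one fetched hyperlink target."""
    anchor_text: str
    status_code: int
    target_text: str


def crash_oracle(check: LinkCheck) -> bool:
    """Pass any link whose target answered with a 2xx status code."""
    return 200 <= check.status_code < 300


# Illustrative inputs (assumptions, not data from the paper).
rotted = LinkCheck("Python 3 tutorial", 404, "")
drifted = LinkCheck("Python 3 tutorial", 200, "This domain is for sale.")

print(crash_oracle(rotted))   # the unavailable target is rejected
print(crash_oracle(drifted))  # the off-topic target is accepted
```

The second call is the case a semantic oracle is meant to catch: the status code alone says nothing about whether the target still matches the anchor.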
Introduction to SemLink
SemLink uses a Siamese neural network built on a pre-trained Sentence-BERT (SBERT) backbone. It computes the semantic coherence between the content of the target page and the hyperlink's source context, which comprises anchor text, surrounding Document Object Model (DOM) elements, and visual features.
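The Siamese idea can be sketched as one shared encoder applied to both inputs, followed by a similarity score. The sketch below is an illustration under stated assumptions, not the paper's implementation: `toy_encoder` is a word-count stand-in for SBERT so the example runs without external models, the source context is reduced to a single text string (visual features are omitted), and the `0.3` threshold is an arbitrary illustrative value.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict

Vector = Dict[str, float]
Encoder = Callable[[str], Vector]


def toy_encoder(text: str) -> Vector:
    """Stand-in for an SBERT encoder: lowercase word counts as a sparse vector."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {token: float(count) for token, count in Counter(tokens).items()}


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def coherence(source_context: str, target_content: str,
              encode: Encoder = toy_encoder) -> float:
    """Siamese scoring: the same encoder embeds both sides, then they are compared."""
    return cosine(encode(source_context), encode(target_content))


def semantic_oracle(source_context: str, target_content: str,
                    threshold: float = 0.3,
                    encode: Encoder = toy_encoder) -> bool:
    """Pass the link only if the coherence score reaches the (assumed) threshold."""
    return coherence(source_context, target_content, encode) >= threshold


# Illustrative inputs (assumptions, not data from the paper).
source = "Python 3 tutorial: learn Python basics"
on_topic = "This tutorial introduces Python 3 basics for beginners"
drifted = "This domain is for sale. Contact the broker."

print(semantic_oracle(source, on_topic))  # shares vocabulary with the source
print(semantic_oracle(source, drifted))   # shares no vocabulary with the source
```

A real setup would replace `toy_encoder` with a pre-trained sentence-embedding model and a dense-vector similarity. Word counts only capture lexical overlap, whereas sentence embeddings are intended to capture meaning even when the wording differs.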
Dataset and Evaluation
To train and evaluate the model, we introduce the Hyperlink-Webpage Positive Pairs (HWPPs) dataset, which consists of over 60,000 rigorously constructed semantic pairs. The reported results are:
- High Recall: SemLink achieves a Recall of 96.00%, on par with state-of-the-art LLMs such as GPT-5.2.
- Efficiency: Alongside this high Recall, SemLink runs approximately 47.5 times faster than the LLM-based approach.
- Resource Optimization: The computational resources required for SemLink are significantly lower than those needed for LLMs, making it a practical choice for large-scale applications.
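For readers who want to see how figures like these are computed, here is a small sketch of the two metrics. The counts and timings are illustrative values chosen to reproduce the headline numbers; they are not taken from the paper, and the paper's own definition of the positive class is not restated here.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Recall = TP / (TP + FN): the share of actual positives that were detected."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("recall is undefined without positive examples")
    return true_positives / total


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    if candidate_seconds <= 0:
        raise ValueError("candidate time must be positive")
    return baseline_seconds / candidate_seconds


# Illustrative numbers only (assumptions): 96 of 100 positives detected,
# and a batch taking 95.0 s with the baseline versus 2.0 s with the candidate.
print(f"{recall(96, 4):.2%}")  # 96.00%
print(speedup(95.0, 2.0))      # 47.5
```

Recall alone does not describe false alarms, so a full comparison would also report precision or a similar measure.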
Conclusion
This work bridges the gap between traditional syntactic checkers and costly generative AI models, offering an efficient approach to automated web quality assurance. By checking that hyperlinks remain semantically relevant as well as reachable, SemLink helps developers and quality assurance teams maintain web integrity and user experience.
