Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection
arXiv:2604.13899v1
Instruction-tuned large language models (LLMs) can annotate thousands of instances from a short prompt at negligible cost. This raises the question of whether human annotators are still needed in the active learning (AL) loop. The study summarized here compares LLM-generated and human annotations for hostility detection in social media comments.
Research Overview
The study introduces a new dataset of 277,902 German-language TikTok comments on political topics. Of these, 25,974 were annotated using LLMs and 5,000 were annotated manually by human reviewers. The primary objective was to assess whether LLM labels can replace human labels within the AL loop, and what follows when an entire corpus can be labeled at once.
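To make the idea of swapping the label source inside an AL loop concrete, here is a minimal sketch of a single labeling round. It is not the paper's implementation: the `Oracle` type, `uncertainty_query`, `label_round`, the toy pool, and the probabilities are all hypothetical, and uncertainty sampling is used only as a common example of an AL query strategy. The point is that the query step does not care whether the oracle is a human or an LLM.

```python
import random
from typing import Callable, Dict, List

# Hypothetical oracle type: maps a comment to a 0/1 hostility label.
# In practice this would wrap either a human annotation queue or an LLM call.
Oracle = Callable[[str], int]


def uncertainty_query(probs: Dict[int, float], k: int) -> List[int]:
    """Return the k pool indices whose P(hostile) lies closest to 0.5."""
    return sorted(probs, key=lambda i: abs(probs[i] - 0.5))[:k]


def label_round(pool: List[str], probs: Dict[int, float], oracle: Oracle,
                k: int, strategy: str = "uncertainty", seed: int = 0) -> Dict[int, int]:
    """Run one labeling round and return {pool index: label}."""
    if strategy == "uncertainty":
        chosen = uncertainty_query(probs, k)
    else:  # random-sampling baseline
        chosen = random.Random(seed).sample(sorted(probs), k)
    return {i: oracle(pool[i]) for i in chosen}


# Toy pool with placeholder texts and made-up classifier probabilities.
pool = ["comment A", "comment B", "comment C", "comment D", "comment E"]
probs = {0: 0.95, 1: 0.52, 2: 0.10, 3: 0.45, 4: 0.70}

# Stand-in oracle: a lookup table instead of a real annotator.
fake_labels = {"comment A": 1, "comment B": 0, "comment C": 0,
               "comment D": 1, "comment E": 1}
oracle = fake_labels.__getitem__

picked = label_round(pool, probs, oracle, k=2)
baseline = label_round(pool, probs, oracle, k=2, strategy="random")
print(picked, baseline)
```

Replacing `oracle` with a function that calls an LLM, or one that queues items for human review, leaves the rest of the round unchanged, which is what makes the human-versus-LLM comparison possible within the same loop.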
Methodology
The researchers compared seven annotation strategies across four encoders for detecting anti-immigrant hostility. A classifier trained on the 25,974 LLM annotations, which cost approximately $43, was compared with one trained on 3,800 human annotations, which cost about $316. The two classifiers achieved comparable F1-Macro scores.
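The cost gap is easier to see per label. The short calculation below uses only the four figures quoted above (which the text gives as approximate); the function and variable names are illustrative.

```python
# Figures quoted in the text above; everything else is derived arithmetic.
LLM_LABELS, LLM_COST_USD = 25_974, 43.0
HUMAN_LABELS, HUMAN_COST_USD = 3_800, 316.0


def cost_per_label(total_cost_usd: float, n_labels: int) -> float:
    """Average cost of a single annotation in US dollars."""
    return total_cost_usd / n_labels


llm_unit = cost_per_label(LLM_COST_USD, LLM_LABELS)
human_unit = cost_per_label(HUMAN_COST_USD, HUMAN_LABELS)
ratio = human_unit / llm_unit

print(f"LLM:   ${llm_unit:.4f} per label")
print(f"Human: ${human_unit:.4f} per label")
print(f"Human labels cost about {ratio:.0f}x more per item")
```

On these rounded inputs, an LLM label costs a fraction of a cent while a human label costs roughly eight cents, a difference of about 50x per item.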
Findings
Despite similar aggregate F1 scores, the study found that the error structures differ between the two annotation methods. Key findings include:
- LLM-trained classifiers tended to over-predict the positive class compared to the human gold standard.
- This divergence was particularly pronounced in discussions that were topically ambiguous, where the line between anti-immigrant hostility and policy critique is often blurred.
- Active learning offered little advantage over random sampling when applied to the enriched data pool.
- Full LLM annotation yielded a higher F1 score than active learning at the same cost, making it the more cost-efficient labeling strategy in this setting.
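The first two findings describe a situation in which two classifiers score almost the same macro F1 yet make different kinds of mistakes. The sketch below illustrates this with invented confusion counts; none of the numbers come from the study, and the labels "human-trained" and "LLM-trained" are only names for the two toy cases. It assumes both classes appear in gold and predictions, so no zero-division guard is included.

```python
from collections import Counter


def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = hostile."""
    c = Counter(zip(y_true, y_pred))
    return c[(1, 1)], c[(0, 1)], c[(1, 0)], c[(0, 0)]


def f1_macro(tp, fp, fn, tn):
    """Unweighted mean of the positive-class and negative-class F1."""
    f1_pos = 2 * tp / (2 * tp + fp + fn)
    f1_neg = 2 * tn / (2 * tn + fn + fp)
    return (f1_pos + f1_neg) / 2


def build(tp, fp, fn, tn):
    """Expand confusion counts into parallel gold / prediction lists."""
    y_true = [1] * (tp + fn) + [0] * (fp + tn)
    y_pred = [1] * tp + [0] * fn + [1] * fp + [0] * tn
    return y_true, y_pred


# Invented counts on the same 100-item gold set (20 hostile, 80 not hostile).
toy = {
    "human-trained": build(tp=14, fp=6, fn=6, tn=74),
    "LLM-trained": build(tp=17, fp=11, fn=3, tn=69),  # over-predicts positives
}

report = {}
for name, (y_true, y_pred) in toy.items():
    tp, fp, fn, tn = confusion(y_true, y_pred)
    report[name] = {
        "f1_macro": f1_macro(tp, fp, fn, tn),
        "false_positive_rate": fp / (fp + tn),
        "predicted_positive": tp + fp,
    }
    print(name, report[name])
```

In this toy case the macro F1 values differ by less than 0.01, while the second classifier flags 28 items as hostile instead of 20 and has nearly twice the false positive rate. An aggregate score alone would hide that difference.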
Conclusion
The results raise questions about the role of human annotators now that LLMs can label data cheaply. LLM annotation achieves comparable aggregate results at a fraction of the cost, but the resulting classifiers make different kinds of errors. Annotation strategies should therefore not rely solely on aggregate metrics such as F1; they should also consider which error profile is acceptable for the application at hand.
LLMs make annotation more efficient, but the contextual judgment of human annotators remains valuable in some scenarios, particularly where the line between hostility and policy critique is blurred. The future of active learning may call for a hybrid approach that combines human insight with the efficiency of machine annotation.
