Fine-Tuning LLMs for Report Summarization: Analysis on Supervised and Unsupervised Data
arXiv:2503.10676v2
Abstract
We study the efficacy of fine-tuning Large Language Models (LLMs) for the specific task of summarizing reports (government archives, news, intelligence reports). While this topic is being very actively researched, our specific application set-up faces two challenges: (i) ground-truth summaries may be unavailable (e.g., for government archives), and (ii) compute power is limited, because the sensitive nature of the application requires that computation be performed on-premise. For most of our experiments, we use one or two A100 GPU cards.
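To make the compute constraint concrete, the sketch below estimates the memory needed for full fine-tuning and checks it against a small GPU budget. Everything numeric here is an assumption rather than something stated in the source: the 16 bytes per parameter figure is a common rule of thumb for mixed-precision training with Adam (half-precision weights and gradients plus single-precision master weights and two optimizer moments), activations are ignored, the 80 GB card size is one of the two A100 variants, and the model sizes are hypothetical.

```python
def full_finetune_memory_gb(num_params: float, bytes_per_param: int = 16) -> float:
    """Rough memory (GB) for weights, gradients and Adam states.

    Assumption: 16 bytes per parameter, a rule of thumb for mixed-precision
    Adam. Activation memory is NOT included, so real usage is higher.
    """
    return num_params * bytes_per_param / 1e9


def fits_on_gpus(num_params: float,
                 gpu_memory_gb: float = 80.0,  # assumption: 80 GB A100 variant
                 num_gpus: int = 1,
                 bytes_per_param: int = 16) -> bool:
    """True if the rough estimate fits in the combined memory of the GPUs."""
    needed = full_finetune_memory_gb(num_params, bytes_per_param)
    return needed <= gpu_memory_gb * num_gpus


if __name__ == "__main__":
    # Hypothetical model sizes, chosen only for illustration.
    for params in (1e9, 7e9, 13e9):
        for gpus in (1, 2):
            print(f"{params / 1e9:.0f}B params, {gpus} GPU(s): "
                  f"{full_finetune_memory_gb(params):.0f} GB needed, "
                  f"fits={fits_on_gpus(params, num_gpus=gpus)}")
```

Under these assumptions, a model that does not fit would call for a parameter-efficient method or a smaller model; the source does not say here which option the authors chose.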
Research Objectives
Under this set-up, we conduct experiments to answer the following questions:
- Is it feasible to fine-tune LLMs for improved report summarization capabilities on-premise, given that fine-tuning can be resource-intensive?
- What metrics can we leverage to assess the quality of the generated summaries?
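As an illustration of the second question, the sketch below computes a unigram-overlap F1 score between a generated summary and a reference summary. This is a simplified stand-in for ROUGE-1, a widely used reference-based summarization metric; the source does not state here which metrics the authors adopted, and the function names and tokenizer are assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference summary.

    Simplified ROUGE-1 style score: no stemming, no stopword handling.
    """
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "The council approved the new budget for public archives."
    candidate = "The council approved a budget for archives."
    print(round(unigram_f1(candidate, reference), 3))
```

A score of this kind requires a reference summary, so it only applies to the supervised setting; reports without ground-truth summaries need a reference-free measure instead.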
Methodology
We ran experiments on two fine-tuning approaches in parallel, one supervised and one unsupervised. The supervised approach used a dataset with available reference summaries, while the unsupervised approach relied on clustering and similarity measures to generate summaries where no ground-truth data exists.
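The text does not specify which similarity measure is used. As a minimal sketch of the general idea, the example below scores a summary against its source document with cosine similarity over bag-of-words term counts, which needs no reference summary. The term-count representation and the function names are assumptions for illustration; a real system would more likely compare learned embeddings.

```python
import math
import re
from collections import Counter


def term_counts(text: str) -> Counter:
    """Bag-of-words term counts over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the term-count vectors of two texts."""
    va, vb = term_counts(text_a), term_counts(text_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    source = ("The agency released its annual report on archive digitization, "
              "covering budget, staffing and the number of records scanned.")
    on_topic = "The agency report covers archive digitization budget and staffing."
    off_topic = "Heavy rain is expected across the coast this weekend."
    print(cosine_similarity(on_topic, source) > cosine_similarity(off_topic, source))
```

A score like this rewards lexical overlap with the source, so it can flag off-topic output but cannot judge whether a summary is concise or faithful.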
Findings
Our findings show that fine-tuning LLMs helps in two ways:
- In many cases, fine-tuning helps to improve summary quality, making the generated summaries more coherent and relevant.
- In other cases, fine-tuning reduces the number of invalid or garbage summaries, which are often characterized by a lack of coherence or of relevance to the original text.
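Counting invalid summaries requires some operational definition of "invalid". The source does not give one here, so the sketch below uses hypothetical heuristics: a summary is flagged if it is nearly empty, at least as long as its source, dominated by repeated tokens, or shares no vocabulary with the source. The thresholds and the function name are assumptions chosen for illustration only.

```python
import re


def is_invalid_summary(summary: str,
                       source: str,
                       min_tokens: int = 5,            # hypothetical threshold
                       max_repeat_ratio: float = 0.5   # hypothetical threshold
                       ) -> bool:
    """Heuristically flag empty, overlong, repetitive or off-topic summaries."""
    tokens = re.findall(r"\w+", summary.lower())
    source_tokens = re.findall(r"\w+", source.lower())

    # Too short to carry any content (covers the empty string).
    if len(tokens) < min_tokens:
        return True
    # A summary should be shorter than the text it summarizes.
    if len(tokens) >= len(source_tokens):
        return True
    # Degenerate repetition: share of tokens that repeat an earlier token.
    repeat_ratio = 1 - len(set(tokens)) / len(tokens)
    if repeat_ratio > max_repeat_ratio:
        return True
    # No shared vocabulary with the source suggests irrelevant output.
    if not set(tokens) & set(source_tokens):
        return True
    return False


def invalid_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (summary, source) pairs flagged as invalid."""
    if not pairs:
        return 0.0
    return sum(is_invalid_summary(s, src) for s, src in pairs) / len(pairs)
```

Comparing `invalid_rate` on the same documents before and after fine-tuning gives one simple way to quantify the reduction described above, under these assumed heuristics.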
Conclusion
Overall, our research suggests that although limited compute power and the absence of ground-truth summaries are significant challenges, fine-tuning LLMs for report summarization is both feasible and beneficial. Our experiments indicate that, with careful selection of fine-tuning strategies and evaluation metrics, organizations can improve the quality of automated summarization tools. This has implications not just for government archives, but also for other sectors that rely on summarizing long reports, including news agencies and intelligence organizations.
Future Work
Future research will focus on exploring additional metrics for quality assessment and expanding the dataset to include a wider range of report types. We also aim to evaluate the performance of different LLM architectures to further optimize the summarization process.
