AMALIA Technical Report: A Fully Open Source Large Language Model for European Portuguese
arXiv:2603.26511v1
In recent years, the development of open large language models (LLMs) has accelerated significantly, yet certain languages continue to be underrepresented in this field. One such language is European Portuguese (pt-PT), which has faced challenges in terms of both the availability of training data and the adequacy of native evaluation metrics. To address this gap, researchers have introduced AMALIA, a fully open LLM designed specifically to enhance the representation and performance of pt-PT.
Introduction to AMALIA
AMALIA is a language model built around the distinctive features of European Portuguese. By using high-quality pt-PT data throughout both the mid- and post-training phases, it aims to be more accurate and more culturally relevant for this variant. The project is intended to address the underrepresentation of pt-PT in the broader landscape of natural language processing.
Challenges with Existing Models
Despite the growing number of LLMs, resources for many language variants still rely heavily on machine translation, which can overlook the linguistic and cultural intricacies that distinguish one variant from another. For European Portuguese, the result is benchmarks that may not accurately reflect the variant's own characteristics. The AMALIA project seeks to mitigate these issues by emphasizing native evaluation and targeted training.
Benchmarks and Datasets
To enable a more faithful evaluation of pt-PT, AMALIA includes a suite of benchmarks specifically designed for this language. The benchmarks encompass:
- Translated standard tasks to assess general performance.
- Four new datasets focusing on pt-PT generation.
- Assessments of linguistic competence tailored to pt-PT.
- Evaluations of biases between pt-PT and pt-BR (Brazilian Portuguese).
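As an illustration of the last item, one simple way a pt-PT versus pt-BR bias check could be framed is as a preference rate over paired sentences. The sketch below is not the report's method: the `VariantPair` type, the `pt_pt_preference_rate` function, and the idea of comparing per-sentence model scores are all assumptions made for illustration, and the scores are made up rather than taken from any experiment.

```python
from dataclasses import dataclass


@dataclass
class VariantPair:
    """One meaning expressed in both variants, with hypothetical model scores."""
    pt_pt: str
    pt_br: str
    score_pt_pt: float  # assumed: model log-likelihood of the pt-PT sentence
    score_pt_br: float  # assumed: model log-likelihood of the pt-BR sentence


def pt_pt_preference_rate(pairs):
    """Return the fraction of pairs where the pt-PT sentence scores strictly higher."""
    if not pairs:
        raise ValueError("need at least one pair")
    wins = sum(1 for p in pairs if p.score_pt_pt > p.score_pt_br)
    return wins / len(pairs)


# Illustrative, made-up scores; these are NOT results from the report.
pairs = [
    VariantPair("Estou a ler o livro.", "Estou lendo o livro.", -12.1, -10.4),
    VariantPair("Vou apanhar o autocarro.", "Vou pegar o ônibus.", -15.0, -15.8),
    VariantPair("Onde fica a casa de banho?", "Onde fica o banheiro?", -11.2, -13.0),
    VariantPair("O telemóvel está avariado.", "O celular está quebrado.", -14.3, -13.9),
]

rate = pt_pt_preference_rate(pairs)
print(f"pt-PT preferred in {rate:.0%} of pairs")
```

Under this framing, a rate well below one half would suggest a model leaning toward pt-BR forms, and a rate near one half would suggest no strong preference. A real evaluation would need scores from an actual model and a much larger, carefully constructed set of pairs.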
Experimental Findings
In the reported experiments, AMALIA matches existing strong baselines on translated benchmarks and shows a significant improvement on evaluations that are specific to pt-PT. These results underscore the value of dedicated training and of native benchmarks for measuring what language models can actually do in underrepresented languages.
Conclusion
AMALIA is a significant step toward better representation of European Portuguese among large language models. By prioritizing high-quality data and native evaluation, it aims to give users a more reliable tool and makes the case for including a wider range of languages in AI development. As natural language processing continues to evolve, initiatives like AMALIA will help ensure that underrepresented languages, European Portuguese among them, receive adequate attention and resources.
