Mining Large Language Models for Low-Resource Language Data: Comparing Elicitation Strategies for Hausa and Fongbe
Summary: arXiv:2604.12477v1
Large language models (LLMs) have changed how people interact with artificial intelligence, particularly in natural language processing. These models are trained in part on data contributed by low-resource language communities, yet the linguistic knowledge they encode is accessible mainly through commercial APIs, which limits the benefit that flows back to speakers of those languages.
Introduction
This paper investigates the efficacy of strategic prompting to extract usable text data from LLMs, focusing on two West African languages: Hausa, an Afroasiatic language with approximately 80 million speakers, and Fongbe, a Niger-Congo language with around 2 million speakers. The research systematically compares six elicitation task types across two commercial LLMs: GPT-4o Mini and Gemini 2.5 Flash.
Methodology
The study employed a series of elicitation tasks designed to maximize the quantity and quality of target-language outputs generated by LLMs. The tasks were structured to assess the performance of both models in generating usable language data for Hausa and Fongbe speakers.
Elicitation Task Types:
- Functional Text Generation
- Dialogue Generation
- Descriptive Text Generation
- Question and Answering
- Summarization Tasks
- Creative Writing Prompts
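The comparison described above amounts to a grid of task type, language, and model. A minimal sketch of how such a grid might be enumerated follows. The prompt wording, the dictionary keys, and the model labels are hypothetical and are not the study's actual prompts; no API call is made.

```python
from itertools import product

# Task types follow the list above; the prompt wording is a hypothetical stand-in.
TASK_TEMPLATES = {
    "functional_text": "Write a short public notice in {language}.",
    "dialogue": "Write a dialogue between two people at a market in {language}.",
    "descriptive_text": "Describe a village morning in {language}.",
    "question_answering": "Write five questions with their answers in {language}.",
    "summarization": "Summarize a well-known folktale in {language}.",
    "creative_writing": "Write a short story in {language}.",
}
LANGUAGES = ["Hausa", "Fongbe"]
MODELS = ["gpt-4o-mini", "gemini-2.5-flash"]  # labels only, assumed names


def build_prompt_grid():
    """Return one job dict per (model, language, task) combination."""
    jobs = []
    for model, language, (task, template) in product(
        MODELS, LANGUAGES, TASK_TEMPLATES.items()
    ):
        jobs.append(
            {
                "model": model,
                "language": language,
                "task": task,
                "prompt": template.format(language=language),
            }
        )
    return jobs


jobs = build_prompt_grid()
print(len(jobs))  # 2 models x 2 languages x 6 tasks = 24
```

Keeping the grid as plain data makes it easy to send every job through the same client code and to attribute each response to its model, language, and task afterwards.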
Results
The findings indicate a large disparity between the two models in generating usable language data. GPT-4o Mini extracted between 6 and 41 times more usable target-language words per API call than Gemini 2.5 Flash. A gap of this size underscores the importance of selecting the appropriate model for a given language and task.
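The metric reported here, usable target-language words per API call, can be illustrated with a small sketch. The usability check below is an assumption: a toy lexicon stands in for whatever filtering or language-identification step the study actually used, and the responses are made-up strings, not model output.

```python
import string

# Toy lexicon standing in for a real language filter (an assumption).
TOY_LEXICON = {"sannu", "ina", "kwana"}


def is_usable(word):
    """Hypothetical check: the word, stripped of punctuation, is in the lexicon."""
    return word.lower().strip(string.punctuation) in TOY_LEXICON


def usable_words_per_call(responses, usable=is_usable):
    """Average count of usable words per response (one response per API call)."""
    if not responses:
        return 0.0
    total = sum(sum(1 for w in r.split() if usable(w)) for r in responses)
    return total / len(responses)


def yield_ratio(yield_a, yield_b):
    """How many times more usable words per call A yields than B."""
    if yield_b == 0:
        raise ValueError("baseline yield is zero; ratio is undefined")
    return yield_a / yield_b


# Made-up responses for two unnamed systems, purely for illustration.
responses_a = ["Sannu, ina kwana?", "ina kwana hello"]
responses_b = ["hello sannu", "good morning"]

yield_a = usable_words_per_call(responses_a)  # (3 + 2) / 2 = 2.5
yield_b = usable_words_per_call(responses_b)  # (1 + 0) / 2 = 0.5
print(yield_ratio(yield_a, yield_b))  # 5.0
```

Normalizing by the number of calls rather than by tokens generated keeps the comparison tied to cost per request, which is how the result above is phrased.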
Language-Specific Strategies
Further analysis revealed that optimal elicitation strategies differ markedly between the two languages:
- Hausa: Elicitation works best with prompts that encourage functional text and dialogue generation, which allow for a more natural integration of the language’s structure and common usage.
- Fongbe: In contrast, Fongbe requires constrained generation prompts to effectively elicit relevant language data. This strategy ensures that the model focuses on the specific linguistic features and cultural contexts of Fongbe.
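Choosing a strategy per language, as described above, reduces to taking the highest-yield task type within each language. A minimal sketch follows; the task labels and scores are arbitrary placeholders, not results from the study.

```python
def best_task_per_language(measurements):
    """Pick the highest-scoring task for each language.

    measurements: iterable of (language, task, usable_words_per_call) tuples.
    """
    best = {}
    for language, task, score in measurements:
        if language not in best or score > best[language][1]:
            best[language] = (task, score)
    return {language: task for language, (task, _) in best.items()}


# Placeholder numbers with neutral task labels, for illustration only.
placeholder = [
    ("Hausa", "task_a", 10.0),
    ("Hausa", "task_b", 4.0),
    ("Fongbe", "task_a", 1.0),
    ("Fongbe", "task_b", 3.0),
]

print(best_task_per_language(placeholder))
# {'Hausa': 'task_a', 'Fongbe': 'task_b'}
```

Selecting per language rather than once overall reflects the point made above: a task type that works well for one language may be a poor choice for the other.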
Conclusion and Future Work
The research highlights the potential of LLMs to serve low-resource language communities when approached with tailored elicitation strategies. By releasing all generated corpora and code, the study aims to foster further research and development in this area. The findings underscore the necessity for ongoing exploration into how AI can be leveraged to support and empower speakers of low-resource languages, ensuring their voices are heard in the digital age.
As AI technology continues to evolve, it is imperative for researchers and developers to collaborate with linguistic communities. This collaboration will not only enhance the capabilities of LLMs but also safeguard the linguistic heritage of these languages for future generations.
