In Data or Invisible: Toward a Better Digital Representation of Low-Resource Languages with Knowledge Graphs
The rise of digital technologies has transformed how data is accessed and shared globally. However, this transformation has also highlighted a significant divide in Open Access Data (OAD) between high-resource and low-resource languages. A recent PhD proposal aims to bridge this gap by enhancing the language coverage of Linked Open Data knowledge graphs (LOD KGs).
Understanding the Divide
As language plays a crucial role in digital representation, the disparity in language resources can lead to the exclusion of numerous communities from participating in the global digital landscape. The proposed research focuses on identifying and analyzing key variables that characterize language distribution within LOD. These variables include:
- Number of Wikipedia articles per language edition
- Number of language-tagged entities in LOD KGs
By examining these variables across three major multilingual LOD KGs (DBpedia, BabelNet, and Wikidata), the research aims to show how well each language is represented and how languages are distributed within the LOD ecosystem.
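As a rough illustration of the second variable, the sketch below computes, for each language tag, the share of entities that carry at least one label in that language. It is a minimal sketch under stated assumptions, not the proposal's actual procedure: the function name `language_coverage`, the `(entity, label, language)` tuple format, and the toy records are all hypothetical and do not come from DBpedia, BabelNet, or Wikidata.

```python
def language_coverage(labels):
    """Return, per language tag, the fraction of entities labelled in it.

    `labels` is an iterable of (entity_id, label_text, language_tag)
    tuples. This input format is an assumption made for illustration.
    """
    entities_by_lang = {}
    all_entities = set()
    for entity, _text, lang in labels:
        all_entities.add(entity)
        entities_by_lang.setdefault(lang, set()).add(entity)

    total = len(all_entities)
    if total == 0:
        return {}
    return {lang: len(ents) / total for lang, ents in entities_by_lang.items()}


# Toy records, invented for illustration only (not real KG data).
toy_labels = [
    ("e1", "label", "en"),
    ("e1", "label", "fr"),
    ("e2", "label", "en"),
    ("e3", "label", "en"),
    ("e3", "label", "sw"),
]

coverage = language_coverage(toy_labels)
for lang in sorted(coverage):
    print(f"{lang}: {coverage[lang]:.2f}")
```

In this toy input every entity has an English label while only one of three has a label in each of the other two languages, which is the kind of imbalance the analysis is meant to quantify.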
Proposed Methodology
The research intends to build on the initial analysis by studying how the choice of source languages for cross-lingual transfer affects multilingual KG completion. This involves investigating selection strategies that leverage:
- Linguistic proximity between languages
- Availability of curated annotated alignments between languages
These strategies aim to improve KG completion results and, through them, the representation of low-resource languages. By drawing on linguistic proximity, the proposal seeks to explore the advantages of analogical reasoning, which relies on the (dis)similarities between languages. This approach has not yet been thoroughly investigated as a way to identify correspondences across languages.
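One way to picture candidate selection is to rank possible source languages by a proximity score and optionally keep only those with curated alignments to the target. The sketch below is an assumption-laden illustration, not the method described in the proposal: it swaps in Jaccard similarity over abstract feature sets as a stand-in for linguistic proximity, and the language names, feature names, and the `rank_transfer_candidates` helper are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_transfer_candidates(target, features, aligned_pairs,
                             require_alignment=False):
    """Rank source languages for `target` by feature-set similarity.

    `features` maps a language name to a set of feature identifiers, and
    `aligned_pairs` is a set of frozensets naming language pairs that have
    curated alignments. Both structures are assumptions for illustration.
    Returns a list of (source, similarity, has_alignment) tuples.
    """
    ranked = []
    for source, feats in features.items():
        if source == target:
            continue
        has_alignment = frozenset((source, target)) in aligned_pairs
        if require_alignment and not has_alignment:
            continue
        ranked.append((source, jaccard(features[target], feats), has_alignment))
    # Highest similarity first; ties broken alphabetically by name.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


# Placeholder languages and features, invented for illustration only.
features = {
    "target": {"f1", "f2", "f3"},
    "src_a": {"f1", "f2", "f3", "f4"},
    "src_b": {"f1", "f5"},
    "src_c": {"f2", "f3"},
}
aligned_pairs = {
    frozenset(("src_b", "target")),
    frozenset(("src_c", "target")),
}

by_proximity = rank_transfer_candidates("target", features, aligned_pairs)
aligned_only = rank_transfer_candidates("target", features, aligned_pairs,
                                        require_alignment=True)
print([name for name, _, _ in by_proximity])
print([name for name, _, _ in aligned_only])
```

Here the closest candidate by features has no curated alignment, so the two criteria disagree on the best source language. Comparing the effect of such choices on completion quality is the kind of question the proposal sets out to study.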
Potential Impact on Low-Resource Languages
By improving the digital representation of low-resource languages, the project aims to foster greater inclusivity in the global digital transformation. Enhanced language coverage in LOD not only benefits speakers of these languages but also enriches the knowledge graphs themselves, leading to a more diverse and representative digital landscape.
Furthermore, as digital technologies continue to evolve, addressing the needs of low-resource languages through advanced methodologies in knowledge graph construction and completion could pave the way for more equitable access to information and resources. The research underscores the importance of inclusivity in the digital age, emphasizing that every language and its speakers deserve representation in the vast digital universe.
Conclusion
The proposed PhD research represents a critical step in addressing the digital divide faced by low-resource languages. By leveraging knowledge graphs and focusing on linguistic strategies, this work promises to enhance language representation in OAD, fostering a more inclusive digital future. As the project unfolds, the insights gained will be essential for shaping data accessibility and representation in a rapidly digitizing world.
