Identification and Anonymization of Named Entities in Unstructured Information Sources for Use in Social Engineering Detection
arXiv:2604.09016v1 (cross-listed)
Abstract
This study addresses the challenge of creating datasets for cybercrime analysis while complying with regulations such as the General Data Protection Regulation (GDPR) and Organic Law 10/1995 of the Penal Code. To this end, it proposes a system that collects information from the Telegram platform, including text, audio, and images; transcribes speech to text using models that incorporate signal enhancement techniques; and evaluates different Named Entity Recognition (NER) solutions, including Microsoft Presidio and transformer-based AI models.
Key Findings
Experimental results indicate that:
- Parakeet achieves the best performance in audio transcription.
- The proposed NER solutions achieve the highest F1-score values in detecting sensitive information.
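For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The sketch below computes it over entity spans using exact matching. The span representation, the function name `entity_f1`, and the sample data are assumptions made for illustration; the source does not state which matching scheme the study uses.

```python
from typing import Set, Tuple

# (start offset, end offset, entity label); an assumed representation
Span = Tuple[int, int, str]

def entity_f1(gold: Set[Span], predicted: Set[Span]) -> float:
    """Exact-match F1 over entity spans; returns 0.0 when nothing matches."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # correct share of predictions
    recall = true_positives / len(gold)          # recovered share of gold spans
    return 2 * precision * recall / (precision + recall)

# Made-up annotations: two of three predictions match the gold spans
gold = {(0, 5, "PERSON"), (20, 32, "PHONE"), (40, 46, "LOCATION")}
predicted = {(0, 5, "PERSON"), (20, 32, "PHONE"), (50, 55, "PERSON")}
score = entity_f1(gold, predicted)
print(f"F1 = {score:.3f}")
```

Here precision and recall are both 2/3, so the F1-score is also 2/3. Partial-overlap or token-level matching would give different values for the same predictions.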
Anonymization and Data Protection
In addition to the performance metrics, anonymization metrics are presented that allow evaluation of the preservation of structural coherence in the data. This is crucial for ensuring the protection of personal information while supporting cybersecurity research within the current legal framework.
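The summary does not define the anonymization metrics themselves. Purely as an illustration, one hypothetical proxy for structural coherence is the token-level similarity between the original and the anonymized text: if only the entity tokens changed, the ratio stays close to 1. The function name `token_similarity` and the sample sentences are assumptions, and this is not the metric used in the study.

```python
from difflib import SequenceMatcher

def token_similarity(original: str, anonymized: str) -> float:
    """Similarity in [0, 1] between the whitespace-token sequences of two texts."""
    original_tokens = original.split()
    anonymized_tokens = anonymized.split()
    # ratio() is 2 * matches / total number of tokens in both sequences
    return SequenceMatcher(None, original_tokens, anonymized_tokens).ratio()

# Hypothetical example: one of four tokens is replaced by a placeholder
ratio = token_similarity("Call Ana at noon", "Call <PERSON_1> at noon")
print(f"token similarity = {ratio:.2f}")
```

With three of four tokens unchanged on each side, the ratio is 2 * 3 / 8 = 0.75. A proxy like this says nothing about whether the replaced tokens were the right ones, so it would complement detection metrics rather than replace them.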
Methodology
The methodology has three main components:
- Data Collection: Unstructured data is gathered from the Telegram platform, including text, audio, and images.
- Speech-to-Text Transcription: Audio is transcribed with speech-to-text models, with signal enhancement techniques applied to improve accuracy.
- Named Entity Recognition (NER): Multiple NER solutions are evaluated, such as Microsoft Presidio and transformer-based AI models, to identify and extract relevant entities from the data.
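The detect-then-replace step of such a pipeline can be sketched as follows. This is a minimal illustration, not the study's implementation: two simplified regular expressions stand in for a real NER model such as Microsoft Presidio or a transformer, and the names `PATTERNS`, `detect_entities`, and `anonymize` are hypothetical. Each distinct value receives a stable placeholder, so repeated mentions of the same entity remain linked after anonymization.

```python
import re
from typing import Dict, List, Tuple

# Simplified patterns standing in for a real NER model (assumption)
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def detect_entities(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, label) spans sorted by position."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)

def anonymize(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace each detected entity with a placeholder such as <EMAIL_1>."""
    mapping: Dict[str, str] = {}   # original value -> placeholder
    counters: Dict[str, int] = {}  # per-label running index
    pieces: List[str] = []
    cursor = 0
    for start, end, label in detect_entities(text):
        if start < cursor:
            continue  # skip spans that overlap an already replaced one
        value = text[start:end]
        if value not in mapping:
            counters[label] = counters.get(label, 0) + 1
            mapping[value] = f"<{label}_{counters[label]}>"
        pieces.append(text[cursor:start])
        pieces.append(mapping[value])
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), mapping

message = ("Contact ana@example.com or call +34 600 123 456. "
           "Reply to ana@example.com today.")
anonymized, mapping = anonymize(message)
print(anonymized)
```

Both occurrences of the e-mail address map to the same placeholder, which keeps references within a conversation consistent. In practice the regular expressions would be replaced by the output of whichever NER solution performs best on the target data.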
Implications for Cybersecurity
By identifying and anonymizing named entities in unstructured data, researchers and organizations can study cybercrime patterns while adhering to legal requirements. The approach supports the detection of social engineering attempts and contributes to the broader goal of safeguarding personal data.
Conclusion
The proposed system offers a framework for identifying and anonymizing named entities in unstructured information sources. By complying with GDPR and national regulations, it supports cybercrime analysis and the ongoing efforts to combat cyber threats in a legally compliant manner.
