Named Entity Anonymization for Social Engineering Detection


Identification and Anonymization of Named Entities in Unstructured Information Sources for Use in Social Engineering Detection

Source: arXiv:2604.09016v1 (cross-listed)

Abstract

This study addresses the challenge of building datasets for cybercrime analysis while complying with regulations such as the General Data Protection Regulation (GDPR) and Spain's Organic Law 10/1995 of the Penal Code. To this end, it proposes a system with three parts: collection of text, audio, and images from the Telegram platform; speech-to-text transcription models that incorporate signal enhancement techniques; and an evaluation of several Named Entity Recognition (NER) solutions, including Microsoft Presidio and AI models built on a transformer-based architecture.

Key Findings

Experimental results indicate that:

  • Parakeet achieves the best performance in audio transcription.
  • The proposed NER solutions achieve the highest F1 scores in detecting sensitive information.
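
For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below computes it over sets of entity spans. It assumes exact-match scoring, where a prediction counts only if start, end, and type all agree; the summary does not say which matching criterion the study uses, and the spans and the function name are made up for illustration.

```python
from typing import Set, Tuple

# A span is (start_offset, end_offset, entity_type). Illustrative format only.
Span = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 under exact span matching."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: two of three predictions match the gold spans.
gold = {(0, 5, "PERSON"), (20, 32, "PHONE"), (40, 46, "LOCATION")}
predicted = {(0, 5, "PERSON"), (20, 32, "PHONE"), (50, 55, "PERSON")}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because F1 drops when either precision or recall is low, it penalizes both missed entities and spurious detections, which is why it is a common headline figure for NER comparisons.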

Anonymization and Data Protection

Beyond the performance metrics, the study presents anonymization metrics that assess how well structural coherence in the data is preserved. Preserving that coherence supports cybersecurity research while personal information stays protected within the current legal framework.
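
One way to picture structural coherence is consistent replacement: the same sensitive value always maps to the same typed placeholder, so relationships between messages survive anonymization. The following is a minimal sketch under that assumption. The regular expressions are a hypothetical stand-in for a real NER model, and the placeholder format and function names are not taken from the study.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical stand-in detectors; a real system would use an NER model instead.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def detect(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, label) spans sorted by position."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def anonymize(text: str, mapping: Dict[Tuple[str, str], str]) -> str:
    """Replace each detected value with a placeholder that is stable across calls."""
    pieces = []
    cursor = 0
    for start, end, label in detect(text):
        if start < cursor:
            continue  # skip spans that overlap one already replaced
        key = (label, text[start:end])
        if key not in mapping:
            # Number placeholders per label in order of first appearance.
            count = sum(1 for seen_label, _ in mapping if seen_label == label) + 1
            mapping[key] = f"<{label}_{count}>"
        pieces.append(text[cursor:start])
        pieces.append(mapping[key])
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


shared_mapping: Dict[Tuple[str, str], str] = {}
first = anonymize("Write to ana@example.com or call +34 600 123 456", shared_mapping)
second = anonymize("Reminder: reply to ana@example.com today", shared_mapping)
print(first)
print(second)
```

Sharing one mapping across messages is the design choice that matters here: an analyst can still see that two messages mention the same contact without learning who that contact is.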

Methodology

The methodology has three components:

  • Data Collection: Information from the Telegram platform is gathered, encompassing various forms of unstructured data, including text, audio, and images.
  • Speech-to-Text Transcription: Transcription models are applied together with signal enhancement techniques to improve accuracy.
  • Named Entity Recognition (NER): Multiple NER solutions are evaluated, such as Microsoft Presidio and transformer-based AI models, to identify and extract relevant entities from the data.
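
The three components above can be sketched as a small pipeline that turns each message into text and then passes it to an anonymizer. This is an illustrative outline, not the authors' implementation: the `Message` type, the function names, and the stub transcriber and anonymizer are all hypothetical. The summary does not describe how images are processed, so the sketch leaves them out rather than guess.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Message:
    kind: str     # "text", "audio", or "image"
    payload: str  # message body for text, file path for audio or image


def to_text(message: Message, transcribe: Callable[[str], str]) -> Optional[str]:
    """Convert a message to text, or return None if no route is defined."""
    if message.kind == "text":
        return message.payload
    if message.kind == "audio":
        return transcribe(message.payload)
    return None  # image handling is not specified in the source


def run_pipeline(
    messages: List[Message],
    transcribe: Callable[[str], str],
    anonymize: Callable[[str], str],
) -> List[str]:
    """Transcribe where needed, then anonymize every message that yields text."""
    results = []
    for message in messages:
        text = to_text(message, transcribe)
        if text is not None:
            results.append(anonymize(text))
    return results


# Stubs standing in for a speech-to-text model and an NER-based anonymizer.
def fake_transcribe(path: str) -> str:
    return f"transcript of {path}"


def fake_anonymize(text: str) -> str:
    return text.replace("Ana", "<PERSON>")


messages = [
    Message("text", "Ana sent the code"),
    Message("audio", "voice_note.ogg"),
    Message("image", "screenshot.png"),
]
processed = run_pipeline(messages, fake_transcribe, fake_anonymize)
print(processed)
```

Passing the transcriber and anonymizer as callables keeps the stages swappable, which mirrors the study's setup of comparing several transcription models and several NER solutions on the same data.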

Implications for Cybersecurity

The findings bear on how cybersecurity research handles personal data. By identifying and anonymizing named entities in unstructured data, researchers and organizations can study cybercrime patterns while meeting legal requirements. The approach supports the detection of social engineering attempts and contributes to the broader goal of safeguarding personal data.

Conclusion

In conclusion, the proposed system offers a framework for identifying and anonymizing named entities in unstructured information sources. By adhering to GDPR and national regulations, it supports cybercrime analysis that is both effective and legally compliant.


Author: Lazarus Omolua, https://richlyai.com/blog