Recognition and extraction of named entities from the user agreements corpus

Data analysis and data mining address a wide range of problems, but they depend on large, high-quality datasets. The law does not always permit such datasets to be published openly: datasets that contain personal data must be processed and cleaned first. One example is PPInRussian, a text dataset created in 2024 for studying aspects of personal data processing. It cannot be published in its current form, although it could become a useful tool for both computer security researchers and legal scholars. This paper reviews modern named entity recognition methods that can be used to clean a text corpus, tests them, and evaluates their applicability to cleaning legal documents. It also proposes a rule-based corpus cleaning technique that yields more accurate results than more general-purpose tools. Applying this technique will clean the corpus of user agreements and thus make it possible to publish it for interested researchers.
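
To make the idea of rule-based cleaning concrete, the following is a minimal sketch in Python. It is not the technique proposed in the paper: the rule set, the entity labels, and the `redact` helper are all hypothetical and chosen only for illustration. It assumes that each rule is a regular expression paired with a label, and that every match is replaced by a bracketed placeholder.

```python
import re

# Hypothetical rules for illustration only; they are not the paper's rules.
# Each rule pairs an entity label with a regular expression.
RULES = [
    # E-mail addresses such as ivan@example.com
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    # Russian-style phone numbers such as +7 (495) 123-45-67
    ("PHONE", re.compile(
        r"(?<!\d)(?:\+7|8)[\s(-]*\d{3}[\s)-]*\d{3}[\s-]*\d{2}[\s-]*\d{2}(?!\d)"
    )),
    # Taxpayer number written as "ИНН" followed by 10 to 12 digits
    ("INN", re.compile(r"\bИНН\s*:?\s*\d{10,12}\b")),
]


def redact(text: str) -> str:
    """Replace every match of every rule with a [LABEL] placeholder."""
    for label, pattern in RULES:
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    sample = "Contact: ivan@example.com, +7 (495) 123-45-67"
    print(redact(sample))
```

Rules of this kind are easy to audit and to tune to the phrasing of one document genre, which is one plausible reason a rule-based approach can outperform general-purpose recognizers on a narrow corpus. A real rule set would need to be far broader than the three patterns assumed here.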

Authors: M. D. Kuznetsov

Direction: Informatics, Computer Technologies And Control

Keywords: named entity recognition, user agreements, security policy, personal data

