Extracting keywords from texts in the absence of annotated data using feedback
This work addresses keyword extraction from unstructured text documents under the following conditions: no annotated data is available at the start ("cold start" conditions), results can be improved through user feedback, and keywords must be mapped to their canonical forms. The paper presents a formal problem statement and an analysis and comparison of existing methods (classical methods, BERT-based methods, and open LLMs). A combined method is proposed that first applies a non-trainable extraction method and, once feedback has accumulated, switches to a trainable keyword post-processing method. The classical SingleRank algorithm (F1 = 0.26 on Inspec) serves as the non-trainable method; a BERT+CRF neural network serves as the trainable one. Several strategies for fine-tuning BERT for keyword post-processing are compared: processing keywords one at a time (negative result), all keywords in one line (F1 = 0.34), sentences with keywords one at a time (F1 = 0.42), and all sentences with keywords together (F1 = 0.50). The method was also evaluated on a proprietary Russian-language benchmark of course annotations; the last fine-tuning variant, with augmented data added to the training set, reaches F1 = 0.33, comparable to the LLM t-pro (F1 = 0.33) while requiring far less VRAM (6 GB versus 22.8 GB). The requirement of presenting keywords in canonical form was met using the LLM qwen2.5:3b with F1 = 0.68. The results can be used on their own for concise representation of text documents (such as course descriptions), or as input for topic modelling and comparative document analysis.
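The non-trainable stage of the pipeline can be illustrated with a minimal sketch of SingleRank-style scoring: build a word co-occurrence graph over candidate tokens, run weighted PageRank, and sum word scores to rank phrases. This is an illustrative stdlib-only sketch, not the authors' implementation; the window size, damping factor, and the assumption that tokens are already filtered to candidates (e.g. nouns and adjectives) are simplifications.

```python
from collections import defaultdict

def singlerank_scores(words, window=10, damping=0.85, iters=50):
    """SingleRank-style word scoring: weighted PageRank over a
    co-occurrence graph. `words` are pre-filtered candidate tokens
    in document order. Returns a dict word -> score; a phrase is
    conventionally scored by summing the scores of its words."""
    # Edge weight = number of co-occurrences within the sliding window.
    weight = defaultdict(float)
    for i, w in enumerate(words):
        for v in words[i + 1 : i + window]:
            if v != w:
                weight[(w, v)] += 1.0
                weight[(v, w)] += 1.0

    nodes = sorted(set(words))
    # Total outgoing edge weight per node, and incoming neighbor lists.
    out_weight = defaultdict(float)
    neighbors = defaultdict(list)
    for (u, v), wt in weight.items():
        out_weight[u] += wt
        neighbors[v].append((u, wt))

    # Standard damped PageRank iteration with weighted edges.
    score = {w: 1.0 for w in nodes}
    for _ in range(iters):
        score = {
            w: (1 - damping)
            + damping * sum(score[u] * wt / out_weight[u]
                            for u, wt in neighbors[w])
            for w in nodes
        }
    return score
```

For example, on the candidate sequence `["keyword", "extraction", "keyword", "extraction", "method", "keyword"]` the most frequent and most connected word, `keyword`, receives the highest score.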
Authors: P. V. Korytov, I. I. Kholod
Direction: Informatics, Computer Technologies And Control
Keywords: keywords, cold start, BERT, feedback learning, LLM