Extracting keywords from texts in the absence of annotated data using feedback
This work addresses keyword extraction from unstructured text documents under the following conditions: no annotated data is available at the start ("cold start" conditions), results can be improved through user feedback, and keywords must be mapped to their canonical forms. The paper presents a formal problem statement and an analysis and comparison of existing methods (classical methods, BERT-based methods, and open LLMs). A combined method is proposed that first applies a non-trainable extraction method and, once feedback has accumulated, switches to a trainable keyword post-processing method. The classical SingleRank algorithm (F1 = 0.26 on Inspec) serves as the non-trainable method; a BERT+CRF neural network serves as the trainable one. Several strategies for fine-tuning BERT for keyword post-processing are compared: processing keywords one at a time (negative result), all keywords in one line (F1 = 0.34), sentences with keywords one at a time (F1 = 0.42), and all sentences with keywords together (F1 = 0.50). The method was also evaluated on a proprietary Russian-language benchmark of course annotations; the last fine-tuning variant, with augmented data added to the training set, reaches F1 = 0.33, comparable to the LLM t-pro (F1 = 0.33) while requiring far less VRAM (6 GB versus 22.8 GB). The requirement of presenting keywords in canonical form was met using the LLM qwen2.5:3b with F1 = 0.68. The results can be used on their own for concise representation of text documents (such as course descriptions), or as input for topic modelling and comparative document analysis.
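The non-trainable stage of the pipeline can be illustrated with a minimal sketch of SingleRank-style scoring: build a word co-occurrence graph over candidate tokens, run weighted PageRank, and sum word scores to rank phrases. This is an illustrative stdlib-only sketch, not the authors' implementation; the window size, damping factor, and the assumption that tokens are already filtered to candidates (e.g. nouns and adjectives) are simplifications.

```python
from collections import defaultdict

def singlerank_scores(words, window=10, damping=0.85, iters=50):
    """SingleRank-style word scoring: weighted PageRank over a
    co-occurrence graph. `words` are pre-filtered candidate tokens
    in document order. Returns a dict word -> score; a phrase is
    conventionally scored by summing the scores of its words."""
    # Edge weight = number of co-occurrences within the sliding window.
    weight = defaultdict(float)
    for i, w in enumerate(words):
        for v in words[i + 1 : i + window]:
            if v != w:
                weight[(w, v)] += 1.0
                weight[(v, w)] += 1.0

    nodes = sorted(set(words))
    # Total outgoing edge weight per node, and incoming neighbor lists.
    out_weight = defaultdict(float)
    neighbors = defaultdict(list)
    for (u, v), wt in weight.items():
        out_weight[u] += wt
        neighbors[v].append((u, wt))

    # Standard damped PageRank iteration with weighted edges.
    score = {w: 1.0 for w in nodes}
    for _ in range(iters):
        score = {
            w: (1 - damping)
            + damping * sum(score[u] * wt / out_weight[u]
                            for u, wt in neighbors[w])
            for w in nodes
        }
    return score
```

For example, on the candidate sequence `["keyword", "extraction", "keyword", "extraction", "method", "keyword"]` the most frequent and most connected word, `keyword`, receives the highest score.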
Authors: P. V. Korytov, I. I. Kholod
Direction: Informatics, Computer Technologies And Control
Keywords: keywords, cold start, BERT, feedback learning, LLM