Several ways of representing data for plagiarism search in program code are explored: untransformed text, n-grams, and tokens. Splitting the program texts from an array of student solutions into arrays of tokens has been implemented in Python. Requirements for plagiarism detection algorithms are formulated, and metrics for detecting plagiarism in program texts are reviewed, with the advantages and disadvantages of each metric highlighted. The metrics are compared on four criteria: running time, memory use, the probability of finding a pair of similar programs, and the probability that a found pair is actually similar. The metrics compared are numerical attribute values, the longest common subsequence, the Jaccard distance, the Levenshtein distance, and the Kolmogorov distance. The Levenshtein distance is chosen for implementation, and an algorithm that computes it for lists of tokens has been implemented in Python. The results show how similar the program texts are to one another.
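The token-based approach described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the student solutions are themselves written in Python, so that the standard `tokenize` module can split them, and the function names `to_tokens`, `levenshtein`, and `similarity`, the replacement of identifiers and literals with placeholder tokens, and the scaling of the distance to a score between 0 and 1 are all illustrative assumptions.

```python
import io
import keyword
import tokenize

# Token types treated as layout or commentary and dropped (an assumption).
_SKIPPED = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def to_tokens(source):
    """Split Python source text into a list of normalized tokens."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")    # hide identifier renaming
        elif tok.type == tokenize.NUMBER:
            tokens.append("NUM")   # hide changed numeric literals
        elif tok.type == tokenize.STRING:
            tokens.append("STR")   # hide changed string literals
        else:
            tokens.append(tok.string)  # keywords and operators as written
    return tokens


def levenshtein(a, b):
    """Edit distance between two sequences, keeping two rows of the table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute x with y
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the distance to [0, 1]; 1.0 means identical token lists."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


first = to_tokens("total = a + 1  # sum\n")
second = to_tokens("s = b + 2\n")
print(similarity(first, second))
```

Under these assumptions, two fragments that differ only in variable names, literal values, and comments produce the same token list, so simple renaming does not lower the similarity score. The distance computation takes time proportional to the product of the two token-list lengths.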

Authors: I. A. Posov, V. Ye. Dopira

Direction: Informatics, Computer Technologies And Control

Keywords: Plagiarism, plagiarism detection metrics, token
