An arrangement of the number of K-grams in the performance of Rabin Karp algorithm in text adjustment
Abstract
The Rabin Karp algorithm is frequently used to determine the similarity between texts. It uses a hash function to compare the pattern string with the substrings of the text. The choice of the k value in the K-gram is often left unrestricted, and trying candidate k values one by one when cutting terms is time-consuming. This research performs a word cutting test on a script using K-grams 0 to 8 and reports the effect of each k value on the resulting similarity percentage. The aim is to determine the effect of the number of K-grams on the performance of Rabin Karp in text matching. The test used 20 sentences and was run 10 times, with the Dice coefficient as the text similarity measure. The research concludes that K-grams 0 to 2 should not be used, because of the basic principle of the K-gram: cutting the text into characters. A cut of 0, 1, or 2 characters does not yet carry meaning, and therefore it yields a high similarity percentage. Based on trials sampling K-grams 0 to 8 from 10 test data sets, K-gram 3 is the best among K-grams 0 to 8.
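The pipeline summarized in the abstract (cut the text into K-grams, hash each K-gram, compare the hash sets with the Dice coefficient) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the `base` and `mod` parameters, the use of sets of hashes, and returning 0.0 when no K-grams exist (for example when k is 0) are all assumptions made for illustration.

```python
def rolling_hashes(text, k, base=256, mod=1_000_003):
    """Rabin Karp style rolling hash of every overlapping k-gram in text.

    base and mod are illustrative choices, not values from the paper.
    """
    n = len(text)
    if k < 1 or k > n:
        return []  # assumption: no k-grams exist for k = 0 or k > len(text)
    high = pow(base, k - 1, mod)  # weight of the character leaving the window
    h = 0
    for ch in text[:k]:
        h = (h * base + ord(ch)) % mod
    hashes = [h]
    for i in range(k, n):
        h = (h - ord(text[i - k]) * high) % mod  # drop the oldest character
        h = (h * base + ord(text[i])) % mod      # append the newest character
        hashes.append(h)
    return hashes


def dice_similarity(a, b, k):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) over the sets of k-gram hashes."""
    fa = set(rolling_hashes(a, k))
    fb = set(rolling_hashes(b, k))
    if not fa and not fb:
        return 0.0  # assumption: undefined case reported as 0.0
    return 2 * len(fa & fb) / (len(fa) + len(fb))


# Smaller k tends to give a higher score for the same pair of strings.
print(dice_similarity("night", "nacht", 1))  # 0.6
print(dice_similarity("night", "nacht", 2))  # 0.25
```

In this toy pair, single characters overlap far more than bigrams do, which is consistent with the abstract's observation that very small k values inflate the similarity percentage.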
Keywords
K-gram; Performance; Rabin Karp; Similarity; Text adjustment
DOI: http://doi.org/10.11591/ijeecs.v26.i3.pp1388-1394
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Indonesian Journal of Electrical Engineering and Computer Science (IJEECS)
p-ISSN: 2502-4752, e-ISSN: 2502-4760
This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).