Analyzing semantic similarity amongst textual documents to suggest near duplicates

Viji Devarajan, Revathy Subramanian

Abstract


Data deduplication techniques remove repeated or redundant data from storage. In recent years, more data has been generated and stored. Redundant and semantically similar content occupies the storage environment, which reduces storage efficiency and raises storage cost. To overcome this problem, we propose hybrid bidirectional encoder representation from transformers (BERT) for text semantics using graph convolutional network (HBTSG), a word embedding-based deep learning model that identifies near duplicates based on the semantic relationship between text documents. In this paper we hybridize the concepts of chunking and semantic analysis. The chunking process splits the documents into blocks. In the next stage, we identify the semantic relationship between documents using word embedding techniques. The method combines the advantages of chunking, feature extraction, and semantic relations to provide better results.
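The two stages named in the abstract, chunking and then similarity scoring between documents, can be illustrated with a minimal Python sketch. It is not the HBTSG model: bag-of-words term counts stand in for the BERT and graph convolutional network embeddings, and the function names, the fixed block size, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def chunk_words(text, block_size=50):
    """Split a document into fixed-size blocks of lowercased words.

    Fixed-size chunking is an assumption; the abstract does not say
    which chunking scheme is used.
    """
    words = text.lower().split()
    return [words[i:i + block_size] for i in range(0, len(words), block_size)]


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def document_similarity(doc_a, doc_b, block_size=50):
    """Average, over the blocks of doc_a, of the best match in doc_b.

    Term counts are a stand-in for learned embeddings, so this captures
    lexical overlap only, not semantic similarity.
    """
    blocks_a = [Counter(b) for b in chunk_words(doc_a, block_size)]
    blocks_b = [Counter(b) for b in chunk_words(doc_b, block_size)]
    if not blocks_a or not blocks_b:
        return 0.0
    best = [max(cosine(a, b) for b in blocks_b) for a in blocks_a]
    return sum(best) / len(best)


def is_near_duplicate(doc_a, doc_b, threshold=0.8, block_size=50):
    """Flag a pair as near duplicates; the threshold is hypothetical."""
    return document_similarity(doc_a, doc_b, block_size) >= threshold
```

Replacing the `Counter` vectors with sentence or word embeddings from a pretrained model would move this sketch closer to the semantic comparison the abstract describes, while the chunk-then-compare structure stays the same.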

Keywords


BERT; Deep learning; GCN; Keyword extraction; Semantic-similarity



DOI: http://doi.org/10.11591/ijeecs.v25.i3.pp1703-1711



Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

The Indonesian Journal of Electrical Engineering and Computer Science (IJEECS)
p-ISSN: 2502-4752, e-ISSN: 2502-4760
This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).
