Document retrieval using term term frequency inverse sentence frequency weighting scheme

Mohannad T. Mohammed, Omar Fitian Rashid

Abstract


The exponential growth in the number of papers readily available on the web has made it crucial to have an efficient method for finding the most appropriate document for a particular search query. The vector space model (VSM), a widely used model in information retrieval, represents documents and queries as vectors of terms and weights those terms with a popular scheme known as term frequency inverse document frequency (TF-IDF). This work proposes retrieving the most relevant documents by representing documents and queries as vectors of average term frequency inverse sentence frequency (TF-ISF) weights instead of TF-IDF weights, and compares them using two basic and effective similarity measures: Cosine and Jaccard. Using the MS MARCO dataset (Microsoft-curated data of Bing queries), this article analyzes and assesses the retrieval effectiveness of the TF-ISF weighting scheme. The results show that the TF-ISF model with the Cosine similarity measure retrieves more relevant documents. The model was also evaluated against the conventional TF-IDF technique and performs significantly better on the MS MARCO data.
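
The approach summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the sentence splitter, the tokenizer, the length-normalized term frequency, the `+ 1.0` smoothing on the inverse sentence frequency, the set-based form of Jaccard, and all function names (`avg_tf_isf`, `cosine`, `jaccard`, `rank`) are assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive split on '.', '!' and '?' (assumption; any sentence tokenizer would do).
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def tokenize(sentence):
    # Lowercased alphanumeric tokens (assumption; no stemming or stop-word removal).
    return re.findall(r"[a-z0-9]+", sentence.lower())


def avg_tf_isf(text):
    """Map each term to its TF-ISF weight averaged over the sentences of `text`."""
    sentences = [tokenize(s) for s in split_sentences(text)]
    sentences = [s for s in sentences if s]
    n = len(sentences)
    if n == 0:
        return {}
    # Sentence frequency: number of sentences that contain the term.
    sf = Counter(term for s in sentences for term in set(s))
    totals = Counter()
    for s in sentences:
        for term, count in Counter(s).items():
            tf = count / len(s)
            # The +1 keeps terms that occur in every sentence from dropping to zero,
            # which matters for single-sentence queries (smoothing is an assumption).
            isf = math.log(n / sf[term]) + 1.0
            totals[term] += tf * isf
    return {term: total / n for term, total in totals.items()}


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def jaccard(u, v):
    # Set-based Jaccard over the terms of each vector (weights are ignored here).
    a, b = set(u), set(v)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank(query, documents, similarity=cosine):
    # Return document indices ordered from most to least similar to the query.
    q = avg_tf_isf(query)
    scored = [(similarity(q, avg_tf_isf(d)), i) for i, d in enumerate(documents)]
    return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]
```

Calling `rank(query, documents)` orders documents by Cosine similarity, and passing `similarity=jaccard` switches the measure, which mirrors the two-measure comparison the abstract describes.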

Keywords


Document representation; Document retrieval; Similarity measures; Term frequency inverse sentence frequency; Weighting schemes


DOI: http://doi.org/10.11591/ijeecs.v31.i3.pp1478-1485

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

The Indonesian Journal of Electrical Engineering and Computer Science (IJEECS)
p-ISSN: 2502-4752, e-ISSN: 2502-4760
This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).
