Weighted inverse document frequency and vector space model for hadith search engine

Septya Egho Pratama, Wahyudin Darmalaksana, Dian Sa'adillah Maylawati, Hamdan Sugilar, Teddy Mantoro, Muhammad Ali Ramdhani

Abstract


Hadith is the second source of Islamic law after the Qur’an, so its many types and references need to be studied. However, few Muslims are familiar with them, and many have difficulty studying hadith. This study aims to build a hadith search engine over a reliable source by using information retrieval techniques. The text is represented as a bag of words (1-term, i.e. unigrams), and the Weighted Inverse Document Frequency (WIDF) method is used to weight each term by its frequency of occurrence before the text is converted into vector form with the Vector Space Model (VSM). In experiments on 380 hadith texts, WIDF with VSM achieved a recall of 96%, while precision was only around 35.46%. The low precision arises because the bag-of-words (1-gram) representation cannot preserve the meaning of the text well.
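The pipeline summarized above (unigram bag of words, WIDF weighting, VSM ranking, then recall and precision) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the usual WIDF definition, in which a term's weight in a document is its frequency in that document divided by its total frequency across the collection, and it assumes cosine similarity as the VSM ranking function. The whitespace tokenizer, the raw-count query weighting, the function names, and the three toy documents are all hypothetical and stand in for the real preprocessing and the 380 hadith texts.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace split stands in for real preprocessing
    # (stemming, stop-word removal), which the abstract does not detail.
    return text.lower().split()


def widf_vectors(docs):
    """Return one {term: weight} dict per document, plus the vocabulary.

    WIDF(d, t) = tf(d, t) / sum of tf(i, t) over every document i.
    """
    counts = [Counter(tokenize(d)) for d in docs]
    totals = Counter()
    for c in counts:
        totals.update(c)
    vectors = [{t: tf / totals[t] for t, tf in c.items()} for c in counts]
    return vectors, set(totals)


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query, docs):
    """Rank documents by cosine similarity to the query, best first."""
    vectors, vocab = widf_vectors(docs)
    # Assumption: the query is weighted by raw term counts over known terms.
    q = {t: float(n) for t, n in Counter(tokenize(query)).items() if t in vocab}
    scored = [(i, cosine(q, v)) for i, v in enumerate(vectors)]
    return sorted([s for s in scored if s[1] > 0], key=lambda s: -s[1])


def precision_recall(retrieved, relevant):
    # precision = hits / retrieved, recall = hits / relevant
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy collection, not actual hadith text.
docs = ["prayer prayer fasting", "charity fasting", "trade charity charity"]
ranking = search("fasting", docs)  # document 1 ranks above document 0
retrieved = [i for i, _ in ranking]
print(ranking, precision_recall(retrieved, relevant=[1]))
```

In this toy run both documents containing the query term are returned although only one is marked relevant, so recall is perfect while precision is halved. This is the same pattern the abstract reports: matching on single terms retrieves almost everything relevant but also many texts that merely share a word.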

Keywords


information retrieval; search engine; vector space model; weighted inverse document frequency

DOI: http://doi.org/10.11591/ijeecs.v18.i2.pp%25p

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.