LSA & LDA topic modeling classification: comparison study on e-books

Shaymaa H. Mohammed, Salam Al-augby

Abstract


With the rapid growth of information technology, the amount of unstructured text data in digital libraries is increasing rapidly, which makes analyzing and organizing it a major challenge. Classifying text automatically in e-research repositories is the cornerstone of benefiting from this data. Manual categorization of text documents requires substantial financial and human resources. To avoid this cost, topic modeling is used to classify documents. This paper presents a comparison study on classifying unstructured scientific text documents (e-books) based on their full text, applying the two most popular topic modeling approaches, Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA), to cluster words into a set of topics that serve as important keywords for classification. Our dataset consists of 300 books containing about 23 million words of full text. In both topic models (LSA and LDA), each word in the corpus vocabulary is associated with one or more topics with a probability, as estimated by the model. Many LDA and LSA models were built with different numbers of topics, and the model that produced the highest coherence value was selected. The results show that LDA outperforms LSA: the best LDA model reached a coherence value of 0.592179 with 20 topics, while the best LSA model reached 0.5773026 with 10 topics.
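The selection procedure described in the abstract (fit several models with different numbers of topics, then keep the one with the highest coherence) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which coherence measure or library was used, so the sketch substitutes a small collapsed Gibbs sampler for LDA and the UMass coherence measure, both written in plain Python. The function names (`fit_lda`, `top_words`, `umass_coherence`, `select_num_topics`), the hyperparameters, and the toy corpus are all hypothetical. An LSA model could be scored through the same loop by swapping out the fitting function.

```python
import math
import random
from collections import defaultdict


def fit_lda(docs, num_topics, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative stand-in, not the
    paper's code). Returns one word -> count mapping per topic."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [defaultdict(int) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    # Resample each token's topic from its conditional distribution.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word


def top_words(topic_word, n=5):
    """Most frequent words per topic, ties broken alphabetically."""
    return [
        [w for w, c in sorted(tw.items(), key=lambda x: (-x[1], x[0]))[:n] if c > 0]
        for tw in topic_word
    ]


def umass_coherence(topics, docs):
    """Mean UMass coherence over topics (assumed measure; higher is better)."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))

    scores = []
    for words in topics:
        pairs = [
            math.log((doc_freq(words[m], words[l]) + 1) / doc_freq(words[l]))
            for m in range(1, len(words))
            for l in range(m)
        ]
        if pairs:
            scores.append(sum(pairs) / len(pairs))
    return sum(scores) / len(scores) if scores else float("-inf")


def select_num_topics(docs, candidates, seed=0):
    """Fit one model per candidate topic count and keep the most coherent."""
    results = {}
    for k in candidates:
        topics = top_words(fit_lda(docs, k, seed=seed))
        results[k] = umass_coherence(topics, docs)
    best_k = max(results, key=results.get)
    return best_k, results


# Hypothetical toy corpus; the paper uses 300 full-text e-books.
docs = [
    "topic model corpus word document".split(),
    "document corpus word topic cluster".split(),
    "model topic word probability corpus".split(),
    "library book catalog shelf reader".split(),
    "reader book library digital catalog".split(),
    "digital library shelf book reader".split(),
]
best_k, scores = select_num_topics(docs, candidates=[2, 3, 4])
print(best_k, scores)
```

On a real corpus the candidate list would cover a wider range of topic counts, and the same `select_num_topics` loop would be run once per model family so that the best LDA and best LSA scores can be compared.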


Keywords


Text Mining; Text Classification; Text Clustering; Topic Modeling; Latent Semantic Analysis; Latent Dirichlet Allocation





DOI: http://doi.org/10.11591/ijeecs.v19.i1.pp%25p


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
