Exploring word embeddings and clustering algorithms for user reviews

Zuleaizal Sidek, Sharifah Sakinah Syed Ahmad

Abstract


The rapid advancement of information technology has led to a surge in the volume of unstructured textual data. Analyzing, organizing, and automatically clustering this text is crucial for extracting valuable insights, yet it remains a major challenge. Manually clustering unstructured data such as online customer reviews, which capture customers' opinions about products, services, and social events, requires significant financial resources, manpower, and time. Most existing studies focus on sentiment analysis of user reviews. Automated text clustering can instead group reviews into themes, thereby simplifying analysis. In this paper, we present and compare experimental results for combinations of five text clustering techniques, namely K-means, fuzzy C-means (FCM), non-negative matrix factorization (NMF), latent Dirichlet allocation (LDA), and latent semantic analysis (LSA), with three embedding techniques, namely term frequency–inverse document frequency (TF-IDF), Word2Vec, and global vectors (GloVe). The experiments show that LDA is a reliable algorithm, as it consistently produces good results across all three word embeddings. The highest Silhouette score recorded was 0.66, obtained with LDA and Word2Vec. LSA combined with Word2Vec performed comparably, with a Silhouette score of 0.65.
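
For readers unfamiliar with the evaluation metric, the following minimal sketch shows how a mean Silhouette score can be computed from embedding vectors and cluster labels. It is an illustration only, not the authors' implementation: the function name, the use of Euclidean distance, and the toy two-dimensional vectors are all assumptions made for this example, and no review data or results from the paper are reproduced.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance assumed)."""
    # Group point indices by their cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    def mean_dist(i, members):
        # Average distance from point i to every point in `members`.
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster: coefficient is 0 by convention
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(  # separation: mean distance to the nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Hypothetical toy "embeddings": two tight groups far apart.
vectors = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(vectors, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_score(vectors, [0, 1, 0, 1])   # labels split the groups
print(f"good labelling: {good:.2f}, bad labelling: {bad:.2f}")
```

The score ranges from -1 to 1. Values near 1 indicate compact, well-separated clusters, so a labelling that matches the two groups scores far higher than one that splits them.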

Keywords


Clustering algorithms; Silhouette score; Text analysis; User reviews; Word embedding


DOI: http://doi.org/10.11591/ijeecs.v41.i3.pp1017-1024


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Indonesian Journal of Electrical Engineering and Computer Science (IJEECS)
p-ISSN: 2502-4752, e-ISSN: 2502-4760
This journal is published by the Institute of Advanced Engineering and Science (IAES).
