BdRegionText: resource creation and evaluation for Bangla regional text classification with machine learning

Babe Sultana, S. M. Mirajul Hoque, Md Gulzar Hussain, Mohammad Nurul Huda

Abstract


Regional text analysis acknowledges the cultural diversity within a language. It offers insight into how people actually communicate and promotes cultural awareness and authenticity in communication. This paper addresses the classification of Bangla regional text using machine learning (ML) algorithms. To that end, we compile a dataset of 2,573 sample texts in four regional Bangla dialects (Chittagong, Rangpur, Barishal, and Noakhali). We focus on these dialects because they were more readily available on the internet than others. The primary objective is to identify synthesized Bangla text and assign it to the appropriate category. The categorization process focuses on regional language written by Bengali individuals, aiming to ascertain its authenticity. We use four ML techniques, namely decision tree (DT), stochastic gradient descent (SGD), support vector machine (SVM), and random forest (RF), to evaluate how well the categorization works, and we also handle the slight class imbalance in the dataset. As prior research in this domain is limited, we compare our work with the existing studies available. We employ several popular feature extraction techniques for text classification in natural language processing (NLP): term frequency–inverse document frequency (TF-IDF), CountVectorizer, and bag of words (BoW). Our comparative analysis indicates that a combination of TF-IDF and CountVectorizer outperforms BoW. Among the ML techniques applied, the RF algorithm yielded the highest accuracy of 79.15% and a mean accuracy of 79.47%.
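
The abstract does not say how the TF-IDF and CountVectorizer features are combined, so the following is only a minimal sketch under stated assumptions: it assumes the combination is a simple concatenation of raw term counts with L2-normalized TF-IDF weights, uses whitespace tokenization, and runs on placeholder tokens rather than the paper's dialect data. All function names (`count_vector`, `tfidf_vector`, `combined_features`) are hypothetical, and the classifier stage (for example RF) is left out.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization; the paper's preprocessing is not given.
    return text.lower().split()


def build_vocab(docs):
    # Sorted list of every distinct token seen in the corpus.
    return sorted({tok for doc in docs for tok in tokenize(doc)})


def count_vector(doc, vocab):
    # Raw term counts, one entry per vocabulary term.
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocab]


def idf_weights(docs, vocab):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    n = len(docs)
    token_sets = [set(tokenize(d)) for d in docs]
    return [
        math.log((1 + n) / (1 + sum(term in s for s in token_sets))) + 1
        for term in vocab
    ]


def tfidf_vector(doc, vocab, idf):
    # Term count times IDF, then L2-normalized so each document has unit length.
    raw = [c * w for c, w in zip(count_vector(doc, vocab), idf)]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]


def combined_features(doc, vocab, idf):
    # Assumed form of "aggregation": concatenate counts and TF-IDF weights.
    return count_vector(doc, vocab) + tfidf_vector(doc, vocab, idf)


# Placeholder documents standing in for regional text samples.
docs = ["alpha beta beta", "beta gamma", "alpha delta"]
vocab = build_vocab(docs)
idf = idf_weights(docs, vocab)
features = [combined_features(d, vocab, idf) for d in docs]
```

Each row of `features` has twice the vocabulary size and could be passed, together with dialect labels, to any of the classifiers named above.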

Keywords


Bangla regional text; BoW; DT; Random forest; SGD; SVM; TF-IDF

DOI: http://doi.org/10.11591/ijeecs.v40.i1.pp411-425

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Indonesian Journal of Electrical Engineering and Computer Science (IJEECS)
p-ISSN: 2502-4752, e-ISSN: 2502-4760
This journal is published by the Institute of Advanced Engineering and Science (IAES).
