Apache Spark based distributed clustering for big data analytic with application to 3D road network

Rotsnarani Sethy, Soumya Ranjan Mahanta, Mrutyunjaya Panda

Abstract


The vast amount of data stored nowadays has turned big data analytics into a very promising research field. Clustering is an essential step in data analysis, widely used for classification, collecting statistics, and acquiring insights in specific domains of knowledge. However, most existing algorithms based on the Lloyd-Forgy method can be very expensive when clustering data sets with a large number of features: their running time may be superpolynomial in the worst case, the underlying problem being NP-hard, and they are severely constrained in terms of speed, productivity, and adaptability. To improve on Lloyd-Forgy clustering performance, the K-means++ algorithm, an algorithm-level optimization that has not been well studied, is discussed along with the very promising Gaussian mixture model (GMM) and the soft-clustering-based fuzzy C-means (FCM). Further, for fast and distributed data processing, and to leverage the benefits of big data platforms such as Apache Spark, Spark-based clustering methods are applied to the three-dimensional (3D) road network data set collected from the UCI repository. Spark-based clustering research, however, is still in its infancy. The distributed computation tests are conducted by allocating two processor cores and one Databricks unit (DBU) with 15 GB of memory, and by measuring execution times as well as root mean square error (RMSE), mean absolute error (MAE), clustering accuracy, and silhouette values. The results are promising and provide new research directions in the field of Spark-based clustering on big data.
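
To make the K-means++ idea mentioned above concrete, the following is a minimal single-machine sketch of its seeding step in plain Python. It is not the Spark implementation evaluated in the paper; the function names, the toy 3D points, and the fixed random seed are all assumptions made for illustration. Each new center is drawn with probability proportional to its squared distance from the nearest center already chosen, which tends to spread the initial centers out before Lloyd-Forgy iterations begin.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seeds(points, k, rng):
    """Pick k initial centers with K-means++ style seeding (illustrative sketch).

    points: list of coordinate tuples
    k:      number of centers to pick
    rng:    a random.Random instance, passed in for reproducibility
    """
    # The first center is chosen uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from every point to its nearest chosen center.
        d2 = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(d2) == 0:
            # All points coincide with a chosen center; fall back to uniform.
            centers.append(rng.choice(points))
            continue
        # Sample the next center with probability proportional to d2.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


# Hypothetical toy data: (x, y, z) triples standing in for 3D road points.
points = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.1), (0.0, 0.2, 0.0),
    (5.0, 5.0, 1.0), (5.1, 4.9, 1.2), (4.8, 5.2, 0.9),
    (9.0, 0.0, 3.0), (9.2, 0.1, 3.1), (8.9, -0.1, 2.8),
]
seeds = kmeans_pp_seeds(points, 3, random.Random(0))
print(seeds)
```

In a distributed setting the distance computation would be spread across partitions, but the sampling rule sketched here stays the same.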

Keywords


3D road network; Apache Spark; ArcGIS data visualization; Clustering; Clustering accuracy; Silhouette score; Spark-based clustering


DOI: http://doi.org/10.11591/ijeecs.v37.i1.pp335-346


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Indonesian Journal of Electrical Engineering and Computer Science (IJEECS)
p-ISSN: 2502-4752, e-ISSN: 2502-4760
This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).
