MDVC corpus: empowering Moroccan Darija speech recognition

Boumehdi Ahmed, Yousfi Abdellah

Abstract


Automatic speech recognition (ASR) technology has significantly transformed human-machine interaction, but it still covers only a limited range of languages and dialects. Moroccan Darija, the dialect of Arabic spoken in Morocco, has long been underrepresented in language technology. To address this gap, we present a novel corpus of audio files paired with carefully transcribed Moroccan Darija speech. The corpus comprises 1,000 hours of diverse content, featuring multiple Moroccan accents, extracted from 80 YouTube channels. To standardize the representation of Moroccan Darija in the corpus, we established consistent writing norms and conventions. In addition to creating the dataset, we fine-tuned a Wav2Vec2 model on the Moroccan Darija voice corpus (MDVC), achieving a word error rate (WER) of 9%. This article discusses the current state of Moroccan Darija research, highlighting the scarcity of resources and the need for robust ASR systems. Our contribution offers a valuable resource for researchers and developers, and by standardizing written Darija, we aim to improve ASR systems for this low-resource language.
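For readers unfamiliar with the metric, WER is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the system output, divided by the number of words in the reference. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `word_error_rate` and whitespace tokenization are assumptions made for illustration, and the abstract does not state which tooling or text normalization was used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is plain
    whitespace splitting, which is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Placeholder tokens, not corpus data: one substitution and one
    # deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Under this definition, a WER of 9% corresponds to roughly nine word-level errors per hundred reference words. Because insertions are counted, the value can exceed 1.0 when the output is much longer than the reference.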

Keywords


Automatic speech recognition; Low resource language; Moroccan Darija voice corpus; Wav2Vec2; Word error rate


DOI: http://doi.org/10.11591/ijeecs.v34.i1.pp290-301


This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

The Indonesian Journal of Electrical Engineering and Computer Science (IJEECS)
p-ISSN: 2502-4752, e-ISSN: 2502-4760
This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).
