A rich and balanced phonetics corpus for modern standard Arabic ASR systems

Youssef Boutazart, Naouar Laaïdi, Abderrahim Ezzine, Hassan Satori, Mohamed Taj Bennani

Abstract


This research describes the creation of a new Modern Standard Arabic corpus designed to be phonetically balanced and rich while adhering to Zipf's law. Building a phonetically diverse collection of Arabic sentences yields significant advantages in efficiency, cost-effectiveness, and storage capacity compared to conventional corpora. The corpus is segmented into graphemes, which are then manually converted into phonemes, resulting in a total of 19769 phonemic units. Among these phonemes, consonants such as 'Laam' (l) account for 10%, while 'Fatha' (A) vowels constitute 20%. Evaluation of the corpus with an automatic speech recognition (ASR) system gives a sentence error rate (SER) of 30% and a word error rate (WER) of 15%. Statistical analysis further shows that diacritic marks make up 47.59% of the corpus, with graphemes comprising the remaining 52.41%. These diacritic marks provide valuable insight into the precise phonetic transcription of the corpus. The study also provides detailed breakdowns of consonants by place and manner of articulation, improving our understanding of the corpus's phonetic structure.
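
As an illustration of the diacritic versus grapheme split reported above, the following is a minimal sketch of how such a proportion could be computed for diacritized Arabic text. It rests on assumptions that are not taken from the paper: diacritics are identified as Unicode nonspacing marks (category `Mn`), graphemes as letters (category `Lo`), and the function name `diacritic_share` and the sample word are hypothetical. The authors' actual counting procedure may differ.

```python
import unicodedata


def diacritic_share(text):
    """Return (diacritic_fraction, grapheme_fraction) for a diacritized string.

    Assumption: diacritics are Unicode nonspacing marks ("Mn") and
    graphemes are letters ("Lo"); spaces, digits and punctuation are ignored.
    """
    marks = sum(1 for ch in text if unicodedata.category(ch) == "Mn")
    letters = sum(1 for ch in text if unicodedata.category(ch) == "Lo")
    total = marks + letters
    if total == 0:
        return 0.0, 0.0
    return marks / total, letters / total


# Hypothetical sample: the word "kataba", three letters each carrying a Fatha.
sample = "\u0643\u064E\u062A\u064E\u0628\u064E"
d, g = diacritic_share(sample)
print(f"diacritics: {d:.2%}, graphemes: {g:.2%}")
```

Applied to a whole corpus file, the same function would give corpus-level shares comparable in kind to the 47.59% and 52.41% figures, provided the text is fully diacritized.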


Keywords


Modern standard Arabic; Phonetically balanced corpus; Phonetically rich corpus; Segmentation grapheme to phoneme; Zipf’s law



DOI: http://doi.org/10.11591/ijeecs.v41.i3.pp1049-1059



Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Indonesian Journal of Electrical Engineering and Computer Science (IJEECS)
p-ISSN: 2502-4752, e-ISSN: 2502-4760
This journal is published by the Institute of Advanced Engineering and Science (IAES).
