A rich and balanced phonetics corpus for modern standard Arabic ASR systems
Abstract
This research presents the creation of a new Modern Standard Arabic (MSA) corpus designed to be phonetically rich and balanced while adhering to Zipf's law. A phonetically diverse collection of Arabic sentences offers significant advantages in efficiency, cost-effectiveness, and storage capacity over conventional corpora. The corpus is carefully segmented into graphemes, which are then manually converted into phonemes, yielding a total of 19,769 phonemic units. Among these phonemes, consonants such as 'Laam - l' account for 10%, while 'Fatha - A' vowels constitute 20%. Evaluation of the corpus with an automatic speech recognition (ASR) system gives a sentence error rate (SER) of 30% and a word error rate (WER) of 15%. Furthermore, statistical analysis shows that diacritic marks make up 47.59% of the corpus, with graphemes comprising the remaining 52.41%. These diacritic marks provide valuable insight into the precise phonetic transcription of the corpus. Additionally, the study provides detailed breakdowns of consonants by place and manner of articulation, enhancing our understanding of phonetic structures.
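For readers unfamiliar with the two error rates quoted above, the following is a minimal sketch of how WER and SER are conventionally computed. The abstract does not say which scoring tool the authors used, so this is an illustration under stated assumptions rather than their procedure: tokens are split on whitespace, WER is the word-level edit distance divided by the number of reference words, and SER is the fraction of sentences with at least one error. The function names `edit_distance` and `wer_and_ser` are hypothetical.

```python
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_and_ser(pairs: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (WER, SER) for a list of (reference, hypothesis) sentences.

    Assumption: whitespace tokenization and non-empty references.
    """
    total_words = 0
    total_errors = 0
    wrong_sentences = 0
    for ref, hyp in pairs:
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        errors = edit_distance(ref_tokens, hyp_tokens)
        total_errors += errors
        total_words += len(ref_tokens)
        if errors > 0:
            wrong_sentences += 1
    if not pairs or total_words == 0:
        raise ValueError("need at least one non-empty reference sentence")
    return total_errors / total_words, wrong_sentences / len(pairs)


# Toy data only (placeholder tokens, not sentences from the corpus).
toy_pairs = [
    ("a b c d", "a b c d"),  # recognized correctly
    ("a b c d", "a x c"),    # one substitution and one deletion
]
wer, ser = wer_and_ser(toy_pairs)
print(f"WER = {wer:.2%}, SER = {ser:.2%}")
```

Because a single wrong word makes the whole sentence count as an error, SER is never lower than WER when every sentence has the same length, which is consistent with the 30% SER and 15% WER reported above.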
Keywords
DOI: http://doi.org/10.11591/ijeecs.v41.i3.pp1049-1059

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Indonesian Journal of Electrical Engineering and Computer Science (IJEECS)
p-ISSN: 2502-4752, e-ISSN: 2502-4760
This journal is published by the Institute of Advanced Engineering and Science (IAES).