An ensemble technique for speech recognition in noisy environments

Imad Qasim Habeeb, Tamara Zuhair Fadhil, Yaseen Naser Jurn, Zeyad Qasim Al-Zaydi, Hanan Najm Abdulkhudhur


Automatic speech recognition (ASR) is a technology that allows computers and mobile devices to recognize spoken language and convert it into text. ASR systems often produce poor accuracy on noisy speech signals. Therefore, this research proposes an ensemble technique that does not rely on a single filter for perfect noise reduction but instead incorporates information from multiple noise reduction filters to improve the final ASR accuracy. The core of this technique is the generation of K copies of the speech signal using three noise reduction filters. The speech features of these copies differ slightly, so the ASR system extracts different texts from them. The best of these texts can then be elected as the final ASR output. The ensemble technique was compared with three related current noise reduction techniques in terms of character error rate (CER) and word error rate (WER). The test results were encouraging and showed relative decreases of 16.61% in CER and 11.54% in WER compared with the best current technique. The ASR field will benefit from the contribution of this research to increasing the recognition accuracy of human speech in the presence of background noise.
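For readers unfamiliar with the two metrics, the sketch below shows one common way to compute CER and WER from a reference transcript and an ASR hypothesis, using Levenshtein edit distance, and how a relative decrease between two systems can be expressed. It is illustrative only: the abstract does not state which scoring tool the authors used, and the function names (`edit_distance`, `wer`, `cer`, `relative_reduction`) and the sample sentences are assumptions made for this example, not material from the paper.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference items
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(baseline: float, proposed: float) -> float:
    """Relative decrease of an error rate with respect to a baseline."""
    return (baseline - proposed) / baseline


if __name__ == "__main__":
    # Hypothetical sentences, not taken from the paper's test set.
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    print(f"WER = {wer(ref, hyp):.4f}")
    print(f"CER = {cer(ref, hyp):.4f}")
```

Under this reading, a relative decrease is measured against the baseline error rate rather than as an absolute difference in percentage points; whether the reported 16.61% and 11.54% were computed exactly this way is an assumption.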


Ensemble technique; Noisy speech; Automatic speech recognition; Speech enhancement




This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
