
Speech Recognition Using Multi-Deep Learning Techniques for Person Identification

Dharman J.


Speech recognition is one of the fastest-growing technologies, with potential benefits and applications in numerous sectors. Many individuals are unable to converse with one another because of language barriers. Our project aims to reduce this barrier: it was designed and developed to provide solutions in specific settings that help individuals share information by using voice input to control a computer. Speech recognition draws on knowledge from a variety of disciplines, including linguistics and computer science; it is not a stand-alone endeavour. Speech recognition techniques include hidden Markov models (HMM), dynamic time warping (DTW)-based recognition, neural networks, deep feedforward and recurrent neural networks, and end-to-end automatic speech recognition. The main topic of this study is the use of different neural networks for automatic speech recognition. In this work, a convolutional neural network (CNN) is used to develop a system that can identify a human speaker. The study employs two techniques: mel-frequency cepstral coefficients with a CNN (MFCC-CNN) and raw waveform with a CNN (RW-CNN). The first is a traditional approach that extracts MFCC features from the audio and feeds them to the CNN; training of the proposed CNN begins with input in the form of an image. The second approach, RW-CNN, follows the same process as the first but skips the MFCC stage and feeds the raw waveform directly to the CNN. Both techniques use the same CNN structure. In this study, both RW-CNN and MFCC-CNN achieved an accuracy of 96%.
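The difference between the two front ends described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the function names, the frame length, the hop size, the number of mel filters and the number of coefficients are all assumptions chosen for illustration, and a synthetic tone stands in for real speech. The raw-waveform path simply frames the signal, while the MFCC path adds a windowed power spectrum, a mel filterbank, a logarithm and a DCT before the data would reach a CNN.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def frame_signal(signal, frame_len, hop):
    """Split a waveform into overlapping frames (the raw-waveform style input)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def power_spectrum(frame):
    """Apply a Hamming window and return the one-sided power spectrum (naive DFT)."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(windowed))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(windowed))
        spec.append((re * re + im * im) / n)
    return spec

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Build triangular filters spaced evenly on the mel scale."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2)
    # n_filters + 2 edge points, expressed as fractional FFT bin positions
    edges = [mel_to_hz(mel_max * i / (n_filters + 1)) * n_fft / sample_rate
             for i in range(n_filters + 2)]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for b in range(n_bins):
            if lo < b <= mid:
                row.append((b - lo) / (mid - lo))   # rising slope
            elif mid < b < hi:
                row.append((hi - b) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc(signal, sample_rate, frame_len=256, hop=128, n_filters=20, n_coeffs=13):
    """Return one list of n_coeffs cepstral coefficients per frame."""
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    features = []
    for frame in frame_signal(signal, frame_len, hop):
        spec = power_spectrum(frame)
        log_energies = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10)
                        for row in bank]
        # DCT-II decorrelates the log mel energies
        coeffs = [sum(e * math.cos(math.pi * c * (m + 0.5) / n_filters)
                      for m, e in enumerate(log_energies))
                  for c in range(n_coeffs)]
        features.append(coeffs)
    return features

# Synthetic 440 Hz tone as a stand-in for a speech recording (assumption).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(1024)]

raw_frames = frame_signal(tone, 256, 128)   # RW-style input: frames x samples
mfcc_frames = mfcc(tone, sample_rate)       # MFCC-style input: frames x coefficients
```

In this sketch both paths produce a two-dimensional array (frames by samples, or frames by coefficients), which is one way an utterance could be presented to a CNN as an image-like input. A practical system would normally use an FFT and an established audio library rather than the naive DFT shown here.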





