
Automatic Speech Recognition and Translation for Low Resource Languages
Description
This book is a comprehensive exploration into the cutting-edge research, methodologies, and advancements in addressing the unique challenges associated with ASR and translation for low-resource languages.
Automatic Speech Recognition and Translation for Low Resource Languages contains groundbreaking research from experts and researchers sharing innovative solutions that address language challenges in low-resource environments. The book begins by delving into the fundamental concepts of ASR and translation, providing readers with a solid foundation for understanding the subsequent chapters. It then explores the intricacies of low-resource languages, analyzing the factors that contribute to their challenges and the significance of developing tailored solutions to overcome them.
The chapters cover a wide range of topics, spanning both the theoretical and practical aspects of ASR and translation for low-resource languages. The book discusses data augmentation techniques, transfer learning, and multilingual training approaches that leverage existing linguistic resources to improve accuracy and performance. Additionally, it investigates the possibilities offered by unsupervised and semi-supervised learning, as well as the benefits of active learning and crowdsourcing in enriching the training data. Throughout the book, emphasis is placed on the cultural and linguistic context of low-resource languages, recognizing the nuances and intricacies that influence accurate ASR and translation. Furthermore, the book explores the potential impact of these technologies in domains such as healthcare, education, and commerce, empowering individuals and communities by breaking down language barriers.
Audience
The book targets researchers and professionals in the fields of natural language processing, computational linguistics, and speech technology. It will also be of interest to engineers, linguists, and individuals in industries and organizations working on cross-lingual communication, accessibility, and global connectivity.
Persons
L. Ashok Kumar, PhD, is a professor in the Department of Electrical and Electronics Engineering, PSG College of Technology, Tamil Nadu, India. He has published more than 175 papers in international and national journals and received 26 national and international awards for his PhD project on wearable electronics. He has created eight Centres of Excellence at PSG in collaboration with government agencies and industries, such as the Centre for Audio Visual Speech Recognition and the Centre for Excellence in Solar Thermal Systems. Twenty-three of his 27 products have been technologically transferred to government funding agencies.
D. Karthika Renuka, PhD, is a professor at PSG College of Technology, Tamil Nadu, India. Her main areas of study are data mining, evolutionary algorithms, and machine learning. She is a recipient of the Indo-U.S. Fellowship for Women in STEMM. She has organized two international conferences on The Innovation of Computing Techniques and Information Processing and Remote Computing.
Bharathi Raja Chakravarthi, PhD, is an assistant professor in the School of Computer Science, University of Galway, Ireland. His studies focus on multimodal machine learning, abusive/offensive language detection, bias in natural language processing tasks, inclusive language detection, and multilingualism. He has published many papers in international journals and conferences. He is an associate editor of the journal Expert Systems with Applications and an editorial board member for Computer Speech & Language.
Thomas Mandl, PhD, is a professor of Information Science and Language Technology, University of Hildesheim, Germany. His research interests include information retrieval, human-computer interaction, and internationalization of information technology, and he has published more than 300 papers on these topics. He coordinated tracks at the Cross Language Evaluation Forum (CLEF), the European information retrieval evaluation initiative. He has been co-chair of FIRE, the evaluation initiative for Indian languages, since 2020 and coordinates the HASOC track on hate speech detection.
Content
Foreword
Preface
Acknowledgement
1 A Hybrid Deep Learning Model for Emotion Conversion in Tamil Language
Satrughan Kumar Singh, Muniyan Sundararajan and Jainath Yadav
2 Attention-Based End-to-End Automatic Speech Recognition System for Vulnerable Individuals in Tamil
S. Suhasini, B. Bharathi and Bharathi Raja Chakravarthi
3 Speech-Based Dialect Identification for Tamil
Archana J.P. and B. Bharathi
4 Language Identification Using Speech Denoising Techniques: A Review
Amal Kumar, Piyush Kumar Singh and Jainath Yadav
5 Domain Adaptation-Based Self-Supervised ASR Models for Low-Resource Target Domain
L. Ashok Kumar, D. Karthika Renuka, Naveena K. S. and Sree Resmi S.
6 ASR Models from Conventional Statistical Models to Transformers and Transfer Learning
Elizabeth Sherly, Leena G. Pillai and Kavya Manohar
7 Syllable-Level Morphological Segmentation of Kannada and Tulu Words
Asha Hegde and Hosahalli Lakshmaiah Shashirekha
8 A New Robust Deep Learning-Based Automatic Speech Recognition and Machine Transition Model for Tamil and Gujarati
Monesh Kumar M. K., Valliammai V., Geraldine Bessie Amali D. and Mathew M. Noel
9 Forensic Voice Comparison Approaches for Low-Resource Languages
Kruthika S.G., Trisiladevi C. Nagavi and P. Mahesha
10 CoRePooL--Corpus for Resource-Poor Languages: Badaga Speech Corpus
Barathi Ganesh H.B., Jyothish Lal G., Jairam R., Soman K.P., Kamal N.S. and Sharmila B.
11 Bridging the Linguistic Gap: A Deep Learning-Based Image-to-Text Converter for Ancient Tamil with Web Interface
S. Umamaheswari, G. Gowtham and K. Harikumar
12 Voice Cloning for Low-Resource Languages: Investigating the Prospects for Tamil
Vishnu Radhakrishnan, Aadharsh Aadhithya A., Jayanth Mohan, Visweswaran M., Jyothish Lal G. and Premjith B.
13 Transformer-Based Multilingual Automatic Speech Recognition (ASR) Model for Dravidian Languages
Divi Eswar Chowdary, Rahul Ganesan, Harsha Dabbara, G. Jyothish Lal and Premjith B.
14 Language Detection Based on Audio for Indian Languages
Amogh A. M., A. Hari Priya, Thanvitha Sai Kanchumarti, Likhitha Ram Bommilla and Rajeshkannan Regunathan
15 Strategies for Corpus Development for Low-Resource Languages: Insights from Nepal
Bal Krishna Bal, Balaram Prasain, Rupak Raj Ghimire and Praveen Acharya
16 Deep Neural Machine Translation (DNMT): Hybrid Deep Learning Architecture-Based English-to-Indian Language Translation
Nivaashini M., Priyanka G. and Aarthi S.
17 Multiview Learning-Based Speech Recognition for Low-Resource Languages
Aditya Kumar and Jainath Yadav
18 Automatic Speech Recognition Based on Improved Deep Learning
Kingston Pal Thamburaj and Kartheges Ponniah
19 Comprehensive Analysis of State-of-the-Art Approaches for Speaker Diarization
Trisiladevi C. Nagavi, Samanvitha S., Shreya Sudhanva, Sukirth Shivakumar and Vibha Hullur
20 Spoken Language Translation in Low-Resource Language
S. Shoba, Sasithradevi A. and S. Deepa
References
1
A Hybrid Deep Learning Model for Emotion Conversion in Tamil Language
Satrughan Kumar Singh¹*, Muniyan Sundararajan² and Jainath Yadav¹
¹Department of Computer Science, Central University of South Bihar, Gaya, Bihar, India
²Department of Mathematics and Computer Science, Mizoram University, Aizawl, Mizoram, India
Abstract
In speech signal processing, emotion recognition, the classification of speech into different emotions, is a challenging task. In this chapter, we propose a hybrid model based on a feed forward neural network (FFNN) and a support vector machine (SVM) for automated emotion conversion in the Tamil language. Voice commands contribute to a better integrated human-machine interface, in which a user gives a voice command that an intelligent machine understands and obeys. For the analysis of speech signals, the Tamil language is mostly syllabic. The changes in the speech signal are mainly observed in several acoustic parameters such as root mean square energy, short-time energy, mel-frequency cepstral coefficients, and zero crossing rate, which are subsequently used to generate a new, discriminative set of feature vectors. In the proposed model, the FFNN model is first implemented on the training and test datasets. Thereafter, the SVM is used to perform the classification task. In the proposed emotion transformation, eight target emotions (angry, happy, sad, calm, surprised, fearful, neutral, and disgust) are considered within a multi-layered signal processing framework. This framework is required for spectral mapping, which converts a neutral utterance into a target emotional utterance that is evaluated by subjective tests. Finally, both subjective and objective tests show higher accuracy with the proposed model for spectral mapping, and show that the proposed model performs better than the Gaussian mixture model (GMM), the FFNN, and some pre-trained convolutional neural network (CNN) architectures.
Keywords: Spectral mapping, emotion conversion, GMM, FFNN, pre-trained CNN, FFNN+SVM, objective measure
1.1 Introduction
Speech signal processing for emotion conversion is a recently emerging domain in human-machine interfaces. Presently, people are constantly trying to make computers intelligent so that they can do almost all work as easily as humans [1]. Communication between human and computer occurs in both directions [2], so it requires two important features of speech technology: speech recognition and speech synthesis. Humans frequently use emotions to convey the intended message. Therefore, a machine is expected to understand and generate the desired emotions [3, 26]. Most existing speech systems can generate only neutral-style speech. In this situation, emotion transformation is applied to convert neutral-style speech into the desired expressive-style speech. Emotion transformation modules are used for building speaking aids for disabled people and for automatic storytelling [5, 24]. Generating emotional speech is a challenging research problem. Some research works have attempted to generate expressive speech using text-to-speech (TTS) synthesis techniques. Researchers have used the following methods for expressive speech synthesis: (i) formant synthesis or rule-based synthesis, (ii) di-phone concatenation synthesis, (iii) unit selection synthesis, and (iv) Hidden Markov Model (HMM)-based parametric speech synthesis. The emotion transformation approach differs from expressive speech synthesizers because it takes neutral speech as input, whereas the input of an expressive speech synthesizer is text. It can be used with any speech synthesizer to convert its neutral speech output into the desired emotional speech. It generates emotional speech by incorporating emotional parameters into neutral speech [6, 7]. A formant vocoder is used to synthesize the transformed speech, showing the contour mapping of the target emotion through a neural network [25].
For synthesizing emotional speech, the most important issue is to identify the features that carry emotion-specific information. Among the various speech features, prosodic and spectral features are the most widely used for discriminating emotions. Existing emotion transformation techniques transform neutral speech into emotional speech using prosody manipulation [8-10]. In this chapter, we generate emotional speech by mapping spectral features from neutral to target emotions in the Tamil language.
1.2 Dataset Collection and Database Preparation
The spectral feature mapping framework needs parallel utterances of the source and target emotions to perform the emotion transformation process. Around 30 to 50 parallel utterances are sufficient to build emotion-specific mapping functions [11, 12]. In this work, we selected 100 parallel utterances from an emotional speech database collected from one male and one female speaker in the Tamil language. These utterances were recorded in eight emotions: angry, happy, sad, calm, surprised, fearful, neutral, and disgust. Training facilitates a learning system for creating an acoustic training dataset [4]. For training and testing, we used 70 and 30 parallel utterances, respectively.
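The 70/30 division of the parallel utterances can be illustrated with a small sketch. The file names, the fixed random seed, and the decision to shuffle before splitting are assumptions made for illustration; the chapter does not say how the 70 training utterances were chosen.

```python
import random

def split_parallel_utterances(pairs, n_train=70, seed=0):
    """Shuffle (neutral, emotional) utterance pairs and split them into train/test."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical file names standing in for 100 parallel recordings.
pairs = [(f"neutral_{i:03d}.wav", f"angry_{i:03d}.wav") for i in range(100)]
train_pairs, test_pairs = split_parallel_utterances(pairs)
print(len(train_pairs), len(test_pairs))  # 70 30
```

Each neutral recording stays attached to its emotional counterpart, which is what a parallel mapping function needs during training.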
1.3 Pre-Trained CNN Architectural Models
1.3.1 VGG16
VGG16 is a convolutional neural network (CNN) model that focuses on depth. It takes a 224 × 224 pixel RGB image as input and uses small receptive fields (3 × 3 with a stride of 1), each followed by a ReLU unit. VGG16 has three fully connected layers; the first two have 4096 channels and the third has 1000 channels, one for each class. All of VGG16's hidden layers use ReLU. VGG has several variants; the name VGG16 comes from its 16 weight layers in total: 13 convolution layers, two fully connected layers, and one output layer.
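The layer count can be checked with a short sketch that walks through the standard VGG16 layout and tallies weight layers and parameters. The channel widths and pooling positions below follow the commonly published VGG16 configuration rather than anything stated in this chapter, and the function name is hypothetical.

```python
# Standard VGG16 layout: numbers are output channels of 3x3 convolutions,
# "M" marks a 2x2 max-pooling step (which has no weights).
VGG16_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]
VGG16_FC = [4096, 4096, 1000]

def count_vgg16(input_size=224, in_channels=3):
    """Return (weight_layers, parameters) for VGG16 on a square RGB input."""
    layers, params, size, channels = 0, 0, input_size, in_channels
    for item in VGG16_CONV:
        if item == "M":
            size //= 2                                  # pooling halves height and width
        else:
            params += 3 * 3 * channels * item + item    # weights + biases
            channels = item
            layers += 1
    features = size * size * channels                   # flattened input to the first FC layer
    for width in VGG16_FC:
        params += features * width + width
        features = width
        layers += 1
    return layers, params

layers, params = count_vgg16()
print(layers, params)  # 16 138357544
```

The 13 convolution layers and 3 fully connected layers add up to the 16 weight layers that give the model its name.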
1.3.2 ResNet50
ResNet50 is a residual network. ResNets are built on skip connections. Very deep networks tend to suffer from vanishing gradients unless adjustments are made, and tiny gradients make learning intractable. To overcome this problem, researchers at Microsoft introduced the deep residual learning framework. A skip connection gives the network an identity path that passes a block's input around its weight layers, so that information can traverse the layers of the network without being lost along the way.
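The effect of a skip connection can be seen in a toy residual block. This is a simplified dense sketch, not the convolutional bottleneck block that ResNet50 actually uses; biases and batch normalization are left out, and all names are illustrative.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def residual_block(x, W1, W2):
    """Compute relu(x + W2 @ relu(W1 @ x)); biases and batch norm are omitted."""
    residual = matvec(W2, relu(matvec(W1, x)))
    return relu([xi + ri for xi, ri in zip(x, residual)])

x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
print(residual_block(x, zeros, zeros))  # [1.0, 2.0, 3.0]
```

Even when the weight layers contribute nothing, the block still passes its (non-negative) input through unchanged, which is the identity path described above.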
1.4 Proposed Method for Emotion Transformation
In this chapter, a feed forward neural network (FFNN) was explored for emotion transformation. In the literature, the Gaussian mixture model (GMM) has been used for mapping features from one domain to another. However, a weakness of the GMM is its assumption that the shape of the mapping function is Gaussian. In addition, the GMM requires the number of mixtures to be fixed before the mapping process. These weaknesses motivated us to explore the FFNN for developing an emotion transformation system. Normally, such a network contains two hidden layers for capturing global and local information between input and output parameters [13-15]. Any continuous-valued function can be simulated by a neural network with two or more hidden layers [16]. Hence, two hidden layers are sufficient for developing mapping functions. We used three hidden layers instead of two to take the additional benefit of a symmetric structure, which is useful for mapping input parameters to output parameters [16-20]. The FFNN is depicted in Figure 1.1. The third hidden layer of the FFNN compresses the dimension of the input parameters. It captures global information, while the other hidden layers capture local information required for developing mapping functions. Accurate mapping functions are developed by selecting an appropriate FFNN structure. The mapping function F(t) can be expressed as follows:
(1.1)
where g(t) = t, h(t) = a tanh(βt), a = 1.72 and β = 0.66. W1, W2, W3, W4 are the weight matrices of the neural network. The feed forward neural network uses back-propagation learning to learn the relation between input and output features.
Figure 1.1 Diagram of the five-layer feed forward neural network model.
Figure 1.2 Block diagram of the proposed model.
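The body of Equation 1.1 is not reproduced in this excerpt, so the following is only a sketch of one plausible reading of the definitions above: three hidden layers with the scaled tanh activation h, a linear output activation g, and the four weight matrices W1 to W4 applied in sequence. The composition order, the absence of bias terms, the layer sizes, the random weights, and the name `BETA` for the constant 0.66 are all assumptions.

```python
import math
import random

A, BETA = 1.72, 0.66  # constants given in the text

def h(t):
    """Scaled tanh used in the hidden layers: a * tanh(beta * t)."""
    return A * math.tanh(BETA * t)

def g(t):
    """Linear (identity) activation, assumed for the output layer."""
    return t

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def ffnn_map(x, W1, W2, W3, W4):
    """Assumed composition g(W4 h(W3 h(W2 h(W1 x)))), without bias terms."""
    a1 = [h(t) for t in matvec(W1, x)]
    a2 = [h(t) for t in matvec(W2, a1)]
    a3 = [h(t) for t in matvec(W3, a2)]
    return [g(t) for t in matvec(W4, a3)]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical layer sizes, chosen only to give a symmetric five-layer structure.
sizes = [25, 50, 12, 50, 25]
rng = random.Random(0)
W1, W2, W3, W4 = (random_matrix(sizes[i + 1], sizes[i], rng) for i in range(4))
y = ffnn_map([0.5] * 25, W1, W2, W3, W4)
print(len(y))  # 25
```

In practice the weights would be learned with back-propagation on the parallel training utterances rather than drawn at random.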
The proposed model (see Figure 1.2) can be summarized as a stepwise process: preprocessing, extraction of acoustic features from the preprocessed audio signals, training with the FFNN model to generate the acoustic feature vector set, and combining the features into a single set. Finally, a support vector machine (SVM) classifier is trained on the hybrid feature vector.
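The acoustic feature extraction step can be sketched for three of the parameters named in the abstract: short-time energy, root mean square energy, and zero crossing rate. The frame length, hop size, and synthetic test signal are assumptions for illustration, and mel-frequency cepstral coefficients are omitted because they need a filter bank and a transform that would not fit in a short example.

```python
import math

def frame_signal(signal, frame_len, hop):
    """Split a list of samples into overlapping frames (trailing samples dropped)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def short_time_energy(frame):
    return sum(s * s for s in frame)

def rms_energy(frame):
    return math.sqrt(short_time_energy(frame) / len(frame))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

# A synthetic 100 Hz sine at a 16 kHz sampling rate stands in for a speech signal.
sr = 16000
signal = [math.sin(2 * math.pi * 100 * n / sr) for n in range(sr)]
frames = frame_signal(signal, frame_len=400, hop=160)   # 25 ms frames, 10 ms hop
features = [(short_time_energy(f), rms_energy(f), zero_crossing_rate(f))
            for f in frames]
print(len(features), len(features[0]))
```

Each frame yields one small feature tuple; in the pipeline described above, such per-frame values would be combined with the other features into a single vector before classification.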
1.4.1 Architecture of Emotion Conversion Framework
The emotion transformation process is performed in two steps: (i) training and (ii) testing. During the training process, GMM and FFNN models are developed to map mel-cepstral coefficients (MCEPs) from neutral to target emotional speech. The average and standard deviation of F0 are computed from both neutral and emotional speech to perform the mapping of F0 using Equation 1.2.
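Equation 1.2 is not reproduced in this excerpt. A mapping that is commonly built from exactly these statistics shifts and scales each voiced F0 value so that the converted contour has the mean and standard deviation of the target emotion; the sketch below assumes that form. The function names, the convention that 0.0 marks an unvoiced frame, and the sample contours are hypothetical, and some systems apply the same transformation to log F0 instead.

```python
import statistics

def f0_stats(f0_values):
    """Mean and standard deviation over voiced frames (F0 > 0)."""
    voiced = [f for f in f0_values if f > 0]
    return statistics.mean(voiced), statistics.pstdev(voiced)

def map_f0(f0_neutral, neutral_stats, target_stats):
    """Shift and scale voiced F0 values to match the target mean and deviation."""
    mu_n, sigma_n = neutral_stats
    mu_t, sigma_t = target_stats
    return [mu_t + (sigma_t / sigma_n) * (f - mu_n) if f > 0 else 0.0
            for f in f0_neutral]

# Made-up F0 contours in Hz; 0.0 marks unvoiced frames.
neutral = [0.0, 110.0, 120.0, 130.0, 0.0]
angry = [0.0, 180.0, 200.0, 220.0, 0.0]
converted = map_f0(neutral, f0_stats(neutral), f0_stats(angry))
print(converted)
```

Unvoiced frames are left at zero so that only the voiced part of the contour is moved toward the target emotion's pitch range.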
Figure 1.3 Block diagram showing testing process.
During the testing phase (see Figure 1.3), the mapped MCEPs and F0 are given as input to Mel-Log Spectrum Approximation (MLSA) filter [21] to...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The ePub file format works well for novels and non-fiction books, i.e., "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, you will unfortunately not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.
For more information, see our ebook Help page.