
Audio Source Separation and Speech Enhancement
Description
Audio source separation and speech enhancement aim to extract one or more source signals of interest from an audio recording involving several sound sources. These technologies are among the most studied in audio signal processing today and play a critical role in the success of hearing aids, hands-free phones, voice command and other noise-robust audio analysis systems, and music post-production software.
Research on this topic has followed three convergent paths, which started from sensor array processing, computational auditory scene analysis, and machine learning based approaches such as independent component analysis, respectively. This book is the first to provide a comprehensive overview by presenting the common foundations and the differences between these techniques in a unified setting.
Key features:
* Consolidated perspective on audio source separation and speech enhancement.
* Both historical perspective and latest advances in the field, e.g. deep neural networks.
* Diverse disciplines: array processing, machine learning, and statistical signal processing.
* Covers the most important techniques for both single-channel and multichannel processing.
This book provides both introductory and advanced material suitable for people with basic knowledge of signal processing and machine learning. Thanks to its comprehensiveness, it will help students select a promising research track, researchers leverage the acquired cross-domain knowledge to design improved techniques, and engineers and developers choose the right technology for their target application scenario. It will also be useful for practitioners from other fields (e.g., acoustics, multimedia, phonetics, and musicology) willing to exploit audio source separation or speech enhancement as pre-processing tools for their own needs.
Persons
EMMANUEL VINCENT is a Senior Research Scientist with Inria, Nancy, France. His research focuses on machine learning for speech and audio signal processing. He has been working on audio source separation for 15 years and co-authored over 180 publications in this field. His contributions include harmonic nonnegative matrix factorization, full-rank spatial covariance modeling, joint spatial/spectral estimation, deep learning based multichannel source separation, and objective performance metrics. He has given several keynotes, tutorials and summer school lectures, including at Interspeech 2012 and 2016, WASPAA 2015 and LVA/ICA 2015. He is a founding chair of the series of Signal Separation Evaluation Campaigns (SiSEC) and CHiME Speech Separation and Recognition Challenges and the chair of ISCA's special interest group on Robust Speech Processing.
TUOMAS VIRTANEN is a Professor with the Laboratory of Signal Processing, Tampere University of Technology, Finland, where he is leading the Audio Research Group. He is known for his pioneering work on single-channel sound source separation using nonnegative matrix factorization, and its application to noise-robust speech recognition, music content analysis, and sound event detection. His research interests also include content analysis and processing of audio signals in general. He has authored more than 170 publications and received four best paper awards. He is an IEEE Senior Member, a member of the Audio and Acoustic Signal Processing Technical Committee of IEEE Signal Processing Society, Associate Editor of IEEE/ACM Transactions on Audio, Speech, and Language Processing, and recipient of the ERC 2014 Starting Grant.
SHARON GANNOT is a Full Professor at the Faculty of Engineering, Bar-Ilan University, Israel, where he is heading the Speech and Signal Processing laboratory and the Signal Processing Track. His research interests include multi-microphone speech processing; distributed algorithms for noise reduction and speaker separation; array processing on manifold; dereverberation; single-microphone speech enhancement; and speaker localization and tracking. He received the Bar-Ilan University's Outstanding Lecturer Award for 2010 and 2014 and the Bar-Ilan Rector Innovation in Research Award in 2018. He has co-authored over 200 publications, presented tutorials at ICASSP 2012, EUSIPCO 2012, ICASSP 2013, and EUSIPCO 2013, and delivered a keynote address at IWAENC 2012. He was a co-editor of the book Speech Processing in Modern Communication: Challenges and Perspectives (Springer, 2012). He also served as an Associate Editor and a Senior Area Chair of the IEEE Transactions on Speech, Audio and Language Processing. He currently serves as the Chair of the IEEE Audio and Acoustic Signal Processing (AASP) Technical Committee.
Content
List of Authors xvii
Preface xxi
Acknowledgment xxiii
Notations xxv
Acronyms xxix
About the Companion Website xxxi
Part I Prerequisites 1
1 Introduction 3
Emmanuel Vincent, Sharon Gannot, and Tuomas Virtanen
1.1 Why are Source Separation and Speech Enhancement Needed? 3
1.2 What are the Goals of Source Separation and Speech Enhancement? 4
1.3 How can Source Separation and Speech Enhancement be Addressed? 9
1.4 Outline 11
Bibliography 12
2 Time-Frequency Processing: Spectral Properties 15
Tuomas Virtanen, Emmanuel Vincent, and Sharon Gannot
2.1 Time-Frequency Analysis and Synthesis 15
2.2 Source Properties in the Time-Frequency Domain 23
2.3 Filtering in the Time-Frequency Domain 25
2.4 Summary 28
Bibliography 28
3 Acoustics: Spatial Properties 31
Emmanuel Vincent, Sharon Gannot, and Tuomas Virtanen
3.1 Formalization of the Mixing Process 31
3.2 Microphone Recordings 32
3.3 Artificial Mixtures 36
3.4 Impulse Response Models 37
3.5 Summary 43
Bibliography 43
4 Multichannel Source Activity Detection, Localization, and Tracking 47
Pasi Pertilä, Alessio Brutti, Piergiorgio Svaizer, and Maurizio Omologo
4.1 Basic Notions in Multichannel Spatial Audio 47
4.2 Multi-Microphone Source Activity Detection 52
4.3 Source Localization 54
4.4 Summary 60
Bibliography 60
Part II Single-Channel Separation and Enhancement 65
5 Spectral Masking and Filtering 67
Timo Gerkmann and Emmanuel Vincent
5.1 Time-Frequency Masking 67
5.2 Mask Estimation Given the Signal Statistics 70
5.3 Perceptual Improvements 81
5.4 Summary 82
Bibliography 83
6 Single-Channel Speech Presence Probability Estimation and Noise Tracking 87
Rainer Martin and Israel Cohen
6.1 Speech Presence Probability and its Estimation 87
6.2 Noise Power Spectrum Tracking 93
6.3 Evaluation Measures 102
6.4 Summary 104
Bibliography 104
7 Single-Channel Classification and Clustering Approaches 107
Felix Weninger, Jun Du, Erik Marchi, and Tian Gao
7.1 Source Separation by Computational Auditory Scene Analysis 108
7.2 Source Separation by Factorial HMMs 111
7.3 Separation Based Training 113
7.4 Summary 125
Bibliography 125
8 Nonnegative Matrix Factorization 131
Roland Badeau and Tuomas Virtanen
8.1 NMF and Source Separation 131
8.2 NMF Theory and Algorithms 137
8.3 NMF Dictionary Learning Methods 145
8.4 Advanced NMF Models 148
8.5 Summary 156
Bibliography 156
9 Temporal Extensions of Nonnegative Matrix Factorization 161
Cédric Févotte, Paris Smaragdis, Nasser Mohammadiha, and Gautham J. Mysore
9.1 Convolutive NMF 161
9.2 Overview of Dynamical Models 169
9.3 Smooth NMF 170
9.4 Nonnegative State-Space Models 174
9.5 Discrete Dynamical Models 178
9.6 The Use of Dynamic Models in Source Separation 182
9.7 Which Model to Use? 183
9.8 Summary 184
9.9 Standard Distributions 184
Bibliography 185
Part III Multichannel Separation and Enhancement 189
10 Spatial Filtering 191
Shmulik Markovich-Golan, Walter Kellermann, and Sharon Gannot
10.1 Fundamentals of Array Processing 192
10.2 Array Topologies 197
10.3 Data-Independent Beamforming 199
10.4 Data-Dependent Spatial Filters: Design Criteria 202
10.5 Generalized Sidelobe Canceler Implementation 209
10.6 Postfilters 210
10.7 Summary 211
Bibliography 212
11 Multichannel Parameter Estimation 219
Shmulik Markovich-Golan, Walter Kellermann, and Sharon Gannot
11.1 Multichannel Speech Presence Probability Estimators 219
11.2 Covariance Matrix Estimators Exploiting SPP 227
11.3 Methods for Weakly Guided and Strongly Guided RTF Estimation 228
11.4 Summary 231
Bibliography 231
12 Multichannel Clustering and Classification Approaches 235
Michael I. Mandel, Shoko Araki, and Tomohiro Nakatani
12.1 Two-Channel Clustering 236
12.2 Multichannel Clustering 244
12.3 Multichannel Classification 251
12.4 Spatial Filtering Based on Masks 255
12.5 Summary 257
Bibliography 258
13 Independent Component and Vector Analysis 263
Hiroshi Sawada and Zbyněk Koldovský
13.1 Convolutive Mixtures and their Time-Frequency Representations 264
13.2 Frequency-Domain Independent Component Analysis 265
13.3 Independent Vector Analysis 279
13.4 Example 280
13.5 Summary 284
Bibliography 284
14 Gaussian Model Based Multichannel Separation 289
Alexey Ozerov and Hirokazu Kameoka
14.1 Gaussian Modeling 289
14.2 Library of Spectral and Spatial Models 295
14.3 Parameter Estimation Criteria and Algorithms 300
14.4 Detailed Presentation of Some Methods 305
14.5 Summary 312
Acknowledgment 312
Bibliography 312
15 Dereverberation 317
Emanuël A.P. Habets and Patrick A. Naylor
15.1 Introduction to Dereverberation 317
15.2 Reverberation Cancellation Approaches 319
15.3 Reverberation Suppression Approaches 329
15.4 Direct Estimation 335
15.5 Evaluation of Dereverberation 336
15.6 Summary 337
Bibliography 337
Part IV Application Scenarios and Perspectives 345
16 Applying Source Separation to Music 347
Bryan Pardo, Antoine Liutkus, Zhiyao Duan, and Gaël Richard
16.1 Challenges and Opportunities 348
16.2 Nonnegative Matrix Factorization in the Case of Music 349
16.3 Taking Advantage of the Harmonic Structure of Music 354
16.4 Nonparametric Local Models: Taking Advantage of Redundancies in Music 358
16.5 Taking Advantage of Multiple Instances 363
16.6 Interactive Source Separation 367
16.7 Crowd-Based Evaluation 367
16.8 Some Examples of Applications 368
16.9 Summary 370
Bibliography 370
17 Application of Source Separation to Robust Speech Analysis and Recognition 377
Shinji Watanabe, Tuomas Virtanen, and Dorothea Kolossa
17.1 Challenges and Opportunities 377
17.2 Applications 380
17.3 Robust Speech Analysis and Recognition 390
17.4 Integration of Front-End and Back-End 397
17.5 Use of Multimodal Information with Source Separation 403
17.6 Summary 404
Bibliography 405
18 Binaural Speech Processing with Application to Hearing Devices 413
Simon Doclo, Sharon Gannot, Daniel Marquardt, and Elior Hadad
18.1 Introduction to Binaural Processing 413
18.2 Binaural Hearing 415
18.3 Binaural Noise Reduction Paradigms 416
18.4 The Binaural Noise Reduction Problem 420
18.5 Extensions for Diffuse Noise 425
18.6 Extensions for Interfering Sources 431
18.7 Summary 437
Bibliography 437
19 Perspectives 443
Emmanuel Vincent, Tuomas Virtanen, and Sharon Gannot
19.1 Advancing Deep Learning 443
19.2 Exploiting Phase Relationships 447
19.3 Advancing Multichannel Processing 450
19.4 Addressing Multiple-Device Scenarios 453
19.5 Towards Widespread Commercial Use 455
Acknowledgment 457
Bibliography 457
Index 465
1 Introduction
Emmanuel Vincent, Sharon Gannot, and Tuomas Virtanen
Source separation and speech enhancement are core problems in the field of audio signal processing, with applications to speech, music, and environmental audio. Research in this field has accompanied technological trends, such as the move from landline to mobile or hands-free phones, the gradual replacement of stereo by 3D audio, and the emergence of connected devices equipped with one or more microphones that can execute audio processing tasks which were previously regarded as impossible. In this short introductory chapter, after a brief discussion of the application needs in Section 1.1, we define the problems of source separation and speech enhancement and introduce relevant terminology regarding the scenarios and the desired outcome in Section 1.2. We then present the general processing scheme followed by most source separation and speech enhancement approaches and categorize these approaches in Section 1.3. Finally, we provide an outline of the book in Section 1.4.
1.1 Why are Source Separation and Speech Enhancement Needed?
The problems of source separation and speech enhancement arise from several application needs in the context of speech, music, and environmental audio processing.
Real-world speech signals are often contaminated by interfering speakers, environmental noise, and/or reverberation. These phenomena deteriorate speech quality and, in adverse scenarios, speech intelligibility and automatic speech recognition (ASR) performance. Source separation and speech enhancement are therefore required in such scenarios. For instance, spoken communication over mobile phones or hands-free systems requires the separation or enhancement of the near-end speaker's voice with respect to interfering speakers and environmental noises before it is transmitted to the far-end listener. Conference call systems or hearing aids face the same problem, except that several speakers may be considered as targets. Source separation and speech enhancement are also crucial preprocessing steps for robust distant-microphone ASR, as available in today's personal assistants, car navigation systems, televisions, video game consoles, medical dictation devices, and meeting transcription systems. Finally, they are necessary components in providing humanoid robots, assistive listening devices, and surveillance systems with "super-hearing" capabilities, which may exceed the hearing capabilities of humans.
Besides speech, music and movie soundtracks are another important application area for source separation. Indeed, music recordings typically involve several instruments playing together live or mixed together in a studio, while movie soundtracks involve speech overlapped with music and sound effects. Source separation has been successfully used to upmix mono or stereo recordings to 3D sound formats and/or to remix them. It lies at the core of object-based audio coders, which encode a given recording as the sum of several sound objects that can then easily be rendered and manipulated. It is also useful for music information retrieval purposes, e.g. to transcribe the melody or the lyrics of a song from the separated singing voice.
The third application area is environmental audio processing. This is an emerging research field with many real-life applications concerning the analysis of general sound scenes, involving the detection of sound events, their localization and tracking, and the inference of the acoustic environment properties.
1.2 What are the Goals of Source Separation and Speech Enhancement?
The goal of source separation and speech enhancement can be defined in layman's terms as that of recovering the signal of one or more sound sources from an observed signal involving other sound sources and/or reverberation. This definition turns out to be ambiguous. In order to address the ambiguity, the notion of source and the process leading to the observed signal must be characterized more precisely. In this section and in the rest of this book we adopt the general notations defined on pp. xxv-xxvii.
1.2.1 Single-Channel vs. Multichannel
Let us assume that the observed signal has $I$ channels indexed by $i \in \{1, \dots, I\}$. By channel, we mean the output of one microphone in the case when the observed signal has been recorded by one or more microphones, or the input of one loudspeaker in the case when it is destined to be played back on one or more loudspeakers. A signal with $I = 1$ channel is called single-channel and is represented by a scalar $x(t)$, while a signal with $I > 1$ channels is called multichannel and is represented by an $I \times 1$ vector $\mathbf{x}(t)$. The explanation below employs multichannel notation, but is also valid in the single-channel case.
1.2.2 Point vs. Diffuse Sources
Furthermore, let us assume that there are $J$ sound sources indexed by $j \in \{1, \dots, J\}$. The word "source" can refer to two different concepts. A point source such as a human speaker, a bird, or a loudspeaker is considered to emit sound from a single point in space. It can be represented as a single-channel signal. A diffuse source such as a car, a piano, or rain simultaneously emits sound from a whole region in space. The sounds emitted from different points of that region are different but not always independent of each other. Therefore, a diffuse source can be thought of as an infinite collection of point sources. The estimation of the individual point sources in this collection can be important for the study of vibrating bodies, but it is considered irrelevant for source separation or speech enhancement. A diffuse source is therefore typically represented by the corresponding signal recorded at the microphone(s) and it is processed as a whole.
1.2.3 Mixing Process
The mixing process leading to the observed signal can generally be expressed in two steps. First, each single-channel point source signal $s_j(t)$ is transformed into an $I \times 1$ source spatial image signal $\mathbf{c}_j(t)$ (Vincent et al., 2012) by means of a possibly nonlinear spatialization operation. This operation can describe the acoustic propagation from the point source to the microphone(s), including reverberation, or some artificial mixing effects. Diffuse sources are directly represented by their spatial images $\mathbf{c}_j(t)$ instead. Second, the spatial images of all sources are summed to yield the observed signal $\mathbf{x}(t)$ called the mixture:
$$\mathbf{x}(t) = \sum_{j=1}^{J} \mathbf{c}_j(t) \tag{1.1}$$
This summation is due to the superposition of the sources in the case of microphone recording or to explicit summation in the case of artificial mixing. This implies that the spatial image of each source represents the contribution of the source to the mixture signal. A schematic overview of the mixing process is depicted in Figure 1.1. More specific details are given in Chapter 3.
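As a rough illustration of the two-step mixing process described above, the following Python sketch spatializes two toy point sources and sums their spatial images into a mixture. It is only a sketch under simplifying assumptions that are not part of the text: the spatialization is reduced to one gain and one integer delay per channel (the text allows any, possibly nonlinear, operation), and the names `spatialize`, `mix`, `s1`, `s2`, `c1`, `c2`, and `x` are hypothetical.

```python
from typing import List

Signal = List[float]          # one channel: a list of samples
Multichannel = List[Signal]   # one list of samples per channel


def spatialize(source: Signal, gains: List[float], delays: List[int]) -> Multichannel:
    """Hypothetical spatialization: one gain and one integer delay per channel."""
    n = len(source)
    image = []
    for gain, delay in zip(gains, delays):
        channel = [0.0] * n
        for t in range(delay, n):
            channel[t] = gain * source[t - delay]
        image.append(channel)
    return image


def mix(images: List[Multichannel]) -> Multichannel:
    """Sum the spatial images of all sources, channel by channel and sample by sample."""
    num_channels = len(images[0])
    num_samples = len(images[0][0])
    return [
        [sum(image[i][t] for image in images) for t in range(num_samples)]
        for i in range(num_channels)
    ]


# Two toy point sources observed on two channels (assumed values).
s1 = [1.0, 0.0, 0.0, 0.0]
s2 = [0.0, 2.0, 0.0, 0.0]
c1 = spatialize(s1, gains=[1.0, 0.5], delays=[0, 1])  # spatial image of source 1
c2 = spatialize(s2, gains=[0.5, 1.0], delays=[1, 0])  # spatial image of source 2
x = mix([c1, c2])                                     # observed mixture
```

A diffuse source would skip the `spatialize` step in this sketch and be passed to `mix` directly as its multichannel spatial image.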
Note that target sources, interfering sources, and noise are treated in the same way in this formulation. All these signals can be either point or diffuse sources. The choice of target sources depends on the use case. Also, the distinction between interfering sources and noise may or may not be relevant depending on the use case. In the context of speech processing, these terms typically refer to undesired speech vs. nonspeech sources, respectively. In the context of music or environmental sound processing, this distinction is most often irrelevant and the former term is preferred to the latter.
Figure 1.1 General mixing process, illustrated in the case of four sources, including three point sources and one diffuse source, and multiple channels.
In the following, we assume that all signals are digital, meaning that the time variable is discrete. We also assume that quantization effects are negligible, so that we can operate on continuous amplitudes. Regarding the conversion of acoustic signals to analog audio signals and analog signals to digital, see, for example, Havelock et al. (2008, Part XII) and Pohlmann (1995, pp. 22-49).
1.2.4 Separation vs. Enhancement
The above mixing process implies one or more distortions of the target signals: interfering sources, noise, reverberation, and echo emitted by the loudspeakers (if any). In this context, source separation refers to the problem of extracting one or more target sources while suppressing interfering sources and noise. It explicitly excludes dereverberation and echo cancellation. Enhancement is more general, in that it refers to the problem of extracting one or more target sources while suppressing all types of distortion, including reverberation and echo. In practice, though, this term is mostly used in the case when the target sources are speech. In the audio processing literature, these two terms are often interchanged, especially when referring to the problem of suppressing both interfering speakers and noise from a speech signal. Note that, for either source separation or enhancement tasks, the extracted source(s) can be either the spatial image of the source or its direct path component, namely the delayed and attenuated version of the original source signal (Vincent et al., 2012; Gannot et al., 2001).
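The difference between the two possible extraction targets mentioned above, the spatial image and the direct path component, can be sketched on a single channel. The example assumes, purely for illustration, that the spatialization is a linear convolution with a short impulse response; the impulse response values and the names `convolve`, `h`, `direct_only`, and `s` are hypothetical and not taken from the book.

```python
from typing import List


def convolve(signal: List[float], impulse_response: List[float]) -> List[float]:
    """Full linear convolution of a signal with an impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, s in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += s * h
    return out


# Hypothetical impulse response: direct path at lag 2 with gain 0.5,
# followed by two weaker reflections standing in for reverberation.
h = [0.0, 0.0, 0.5, 0.0, 0.25, 0.125]
# Same response with the reflections removed.
direct_only = [0.0, 0.0, 0.5, 0.0, 0.0, 0.0]

s = [1.0, -1.0, 2.0]  # toy source signal

spatial_image = convolve(s, h)          # direct path plus reflections
direct_path = convolve(s, direct_only)  # delayed and attenuated copy of s
reverberant_part = [a - b for a, b in zip(spatial_image, direct_path)]
```

In this toy setting, a method that targets the spatial image would aim to recover `spatial_image`, whereas one that targets the direct path component would additionally have to remove `reverberant_part`.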
The problem of echo cancellation is out of the scope of this book. Please refer to Hänsler and Schmidt (2004) for a comprehensive overview of this topic. The problem of source localization and tracking cannot be viewed as a separation or enhancement task, but it is sometimes used as a preprocessing...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePub works well for novels and non-fiction books, i.e., "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.
For more information, see our ebook Help page.