
Digital Speech Transmission and Enhancement

Second Edition

Peter Vary
Institute of Communication Systems
RWTH Aachen University
Aachen, Germany

Rainer Martin
Institute of Communication Acoustics
Ruhr-Universität Bochum
Bochum, Germany

This second edition first published 2024
© 2024 John Wiley & Sons Ltd.

Edition History

John Wiley & Sons Ltd. (1e, 2006)

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Peter Vary and Rainer Martin to be identified as the authors of this work has been asserted in accordance with law.

Registered Offices

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty

While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data applied for

Hardback: 9781119060963
ePdf: 9781119060994
ePub: 9781119060987

Cover Design: Wiley
Cover Image: © BAIVECTOR/Shutterstock

Set in 9.5/12.5pt STIX Two Text by Straive, Chennai, India

Contents

Preface

1 Introduction

2 Models of Speech Production and Hearing
2.1 Sound Waves
2.2 Organs of Speech Production
2.3 Characteristics of Speech Signals
2.4 Model of Speech Production
2.4.1 Acoustic Tube Model of the Vocal Tract
2.4.2 Discrete Time All-Pole Model of the Vocal Tract
2.5 Anatomy of Hearing
2.6 Psychoacoustic Properties of the Auditory System
2.6.1 Hearing and Loudness
2.6.2 Spectral Resolution
2.6.3 Masking
2.6.4 Spatial Hearing
2.6.4.1 Head-Related Impulse Responses and Transfer Functions
2.6.4.2 Law of the First Wavefront
References

3 Spectral Transformations
3.1 Fourier Transform of Continuous Signals
3.2 Fourier Transform of Discrete Signals
3.3 Linear Shift Invariant Systems
3.3.1 Frequency Response of LSI Systems
3.4 The z-Transform
3.4.1 Relation to Fourier Transform
3.4.2 Properties of the ROC
3.4.3 Inverse z-Transform
3.4.4 z-Transform Analysis of LSI Systems
3.5 The Discrete Fourier Transform
3.5.1 Linear and Cyclic Convolution
3.5.2 The DFT of Windowed Sequences
3.5.3 Spectral Resolution and Zero Padding
3.5.4 The Spectrogram
3.5.5 Fast Computation of the DFT: The FFT
3.5.6 Radix-2 Decimation-in-Time FFT
3.6 Fast Convolution
3.6.1 Fast Convolution of Long Sequences
3.6.2 Fast Convolution by Overlap-Add
3.6.3 Fast Convolution by Overlap-Save
3.7 Analysis–Modification–Synthesis Systems
3.8 Cepstral Analysis
3.8.1 Complex Cepstrum
3.8.2 Real Cepstrum
3.8.3 Applications of the Cepstrum
3.8.3.1 Construction of Minimum-Phase Sequences
3.8.3.2 Deconvolution by Cepstral Mean Subtraction
3.8.3.3 Computation of the Spectral Distortion Measure
3.8.3.4 Fundamental Frequency Estimation
References

4 Filter Banks for Spectral Analysis and Synthesis
4.1 Spectral Analysis Using Narrowband Filters
4.1.1 Short-Term Spectral Analyzer
4.1.2 Prototype Filter Design for the Analysis Filter Bank
4.1.3 Short-Term Spectral Synthesizer
4.1.4 Short-Term Spectral Analysis and Synthesis
4.1.5 Prototype Filter Design for the Analysis–Synthesis Filter Bank
4.1.6 Filter Bank Interpretation of the DFT
4.2 Polyphase Network Filter Banks
4.2.1 PPN Analysis Filter Bank
4.2.2 PPN Synthesis Filter Bank
4.3 Quadrature Mirror Filter Banks
4.3.1 Analysis–Synthesis Filter Bank
4.3.2 Compensation of Aliasing and Signal Reconstruction
4.3.3 Efficient Implementation
4.4 Filter Bank Equalizer
4.4.1 The Reference Filter Bank
4.4.2 Uniform Frequency Resolution
4.4.3 Adaptive Filter Bank Equalizer: Gain Computation
4.4.3.1 Conventional Spectral Subtraction
4.4.3.2 Filter Bank Equalizer
4.4.4 Non-uniform Frequency Resolution
4.4.5 Design Aspects & Implementation
References

5 Stochastic Signals and Estimation
5.1 Basic Concepts
5.1.1 Random Events and Probability
5.1.2 Conditional Probabilities
5.1.3 Random Variables
5.1.4 Probability Distributions and Probability Density Functions
5.1.5 Conditional PDFs
5.2 Expectations and Moments
5.2.1 Conditional Expectations and Moments
5.2.2 Examples
5.2.2.1 The Uniform Distribution
5.2.2.2 The Gaussian Density
5.2.2.3 The Exponential Density
5.2.2.4 The Laplace Density
5.2.2.5 The Gamma Density
5.2.2.6 The χ²-Distribution
5.2.3 Transformation of a Random Variable
5.2.4 Relative Frequencies and Histograms
5.3 Bivariate Statistics
5.3.1 Marginal Densities
5.3.2 Expectations and Moments
5.3.3 Uncorrelatedness and Statistical Independence
5.3.4 Examples of Bivariate PDFs
5.3.4.1 The Bivariate Uniform Density
5.3.4.2 The Bivariate Gaussian Density
5.3.5 Functions of Two Random Variables
5.4 Probability and Information
5.4.1 Entropy
5.4.2 Kullback–Leibler Divergence
5.4.3 Cross-Entropy
5.4.4 Mutual Information
5.5 Multivariate Statistics
5.5.1 Multivariate Gaussian Distribution
5.5.2 Gaussian Mixture Models
5.6 Stochastic Processes
5.6.1 Stationary Processes
5.6.2 Auto-Correlation and Auto-Covariance Functions
5.6.3 Cross-Correlation and Cross-Covariance Functions
5.6.4 Markov Processes
5.6.5 Multivariate Stochastic Processes
5.7 Estimation of Statistical Quantities by Time Averages
5.7.1 Ergodic Processes
5.7.2 Short-Time Stationary Processes
5.8 Power Spectrum and its Estimation
5.8.1 White Noise
5.8.2 The Periodogram
5.8.3 Smoothed Periodograms
5.8.3.1 Non-Recursive Smoothing in Time
5.8.3.2 Recursive Smoothing in Time
5.8.3.3 Log-Mel Filter Bank Features
5.8.4 Power Spectra and Linear Shift-Invariant Systems
5.9 Statistical Properties of Speech Signals
5.10 Statistical Properties of DFT Coefficients
5.10.1 Asymptotic Statistical Properties
5.10.2 Signal-Plus-Noise Model
5.10.3 Statistics of DFT Coefficients for Finite Frame Lengths
5.11 Optimal Estimation
5.11.1 MMSE Estimation
5.11.2 Estimation of Discrete Random Variables
5.11.3 Optimal Linear Estimator
5.11.4 The Gaussian Case
5.11.5 Joint Detection and Estimation
5.12 Non-Linear Estimation with Deep Neural Networks
5.12.1 Basic Network Components
5.12.1.1 The Perceptron
5.12.1.2 Convolutional Neural Network
5.12.2 Basic DNN Structures
5.12.2.1 Fully-Connected Feed-Forward Network
5.12.2.2 Autoencoder Networks
5.12.2.3 Recurrent Neural Networks
5.12.2.4 Time Delay, Wavenet, and Transformer Networks
5.12.2.5 Training of Neural Networks
5.12.2.6 Stochastic Gradient Descent (SGD)
5.12.2.7 Adaptive Moment Estimation Method (ADAM)
References

6 Linear Prediction
6.1 Vocal Tract Models and Short-Term Prediction
6.1.1 All-Zero Model
6.1.2 All-Pole Model
6.1.3 Pole-Zero Model
6.2 Optimal Prediction Coefficients for Stationary Signals
6.2.1 Optimum Prediction
6.2.2 Spectral Flatness Measure
6.3 Predictor Adaptation
6.3.1 Block-Oriented Adaptation
6.3.1.1 Auto-Correlation Method
6.3.1.2 Covariance Method
6.3.1.3 Levinson–Durbin Algorithm
6.3.2 Sequential Adaptation
6.4 Long-Term Prediction
References

7 Quantization
7.1 Analog Samples and Digital Representation
7.2 Uniform Quantization
7.3 Non-uniform Quantization
7.4 Optimal Quantization
7.5 Adaptive Quantization
7.6 Vector Quantization
7.6.1 Principle
7.6.2 The Complexity Problem
7.6.3 Lattice Quantization
7.6.4 Design of Optimal Vector Code Books
7.6.5 Gain–Shape Vector Quantization
7.7 Quantization of the Predictor Coefficients
7.7.1 Scalar Quantization of the LPC Coefficients
7.7.2 Scalar Quantization of the Reflection Coefficients
7.7.3 Scalar Quantization of the LSF Coefficients
References

8 Speech Coding
8.1 Speech-Coding Categories
8.2 Model-Based Predictive Coding
8.3 Linear Predictive Waveform Coding
8.3.1 First-Order DPCM
8.3.2 Open-Loop and Closed-Loop Prediction
8.3.3 Quantization of the Residual Signal
8.3.3.1 Quantization with Open-Loop Prediction
8.3.3.2 Quantization with Closed-Loop Prediction
8.3.3.3 Spectral Shaping of the Quantization Error
8.3.4 ADPCM with Sequential Adaptation
8.4 Parametric Coding
8.4.1 Vocoder Structures
8.4.2 LPC Vocoder
8.5 Hybrid Coding
8.5.1 Basic Codec Concepts
8.5.1.1 Scalar Quantization of the Residual Signal
8.5.1.2 Vector Quantization of the Residual Signal
8.5.2 Residual Signal Coding: RELP
8.5.3 Analysis by Synthesis: CELP
8.5.3.1 Principle
8.5.3.2 Fixed Code Book
8.5.3.3 Long-Term Prediction, Adaptive Code Book
8.6 Adaptive Postfiltering
8.7 Speech Codec Standards: Selected Examples
8.7.1 GSM Full-Rate Codec
8.7.2 EFR Codec
8.7.3 Adaptive Multi-Rate Narrowband Codec (AMR-NB)
8.7.4 ITU-T/G.722: 7 kHz Audio Coding within 64 kbit/s
8.7.5 Adaptive Multi-Rate Wideband Codec (AMR-WB)
8.7.6 Codec for Enhanced Voice Services (EVS)
8.7.7 Opus Codec IETF RFC 6716
References

9 Concealment of Erroneous or Lost Frames
9.1 Concepts for Error Concealment
9.1.1 Error Concealment by Hard Decision Decoding
9.1.2 Error Concealment by Soft Decision Decoding
9.1.3 Parameter Estimation
9.1.3.1 MAP Estimation
9.1.3.2 MS Estimation
9.1.4 The A Posteriori Probabilities
9.1.4.1 The A Priori Knowledge
9.1.4.2 The Parameter Distortion Probabilities
9.1.5 Example: Hard Decision vs. Soft Decision
9.2 Examples of Error Concealment Standards
9.2.1 Substitution and Muting of Lost Frames
9.2.2 AMR Codec: Substitution and Muting of Lost Frames
9.2.3 EVS Codec: Concealment of Lost Packets
9.3 Further Improvements
References

10 Bandwidth Extension of Speech Signals
10.1 BWE Concepts
10.2 BWE Using the Model of Speech Production
10.2.1 Extension of the Excitation Signal
10.2.2 Spectral Envelope Estimation
10.2.2.1 Minimum Mean Square Error Estimation
10.2.2.2 Conditional Maximum A Posteriori Estimation
10.2.2.3 Extensions
10.2.2.4 Simplifications
10.2.3 Energy Envelope Estimation
10.3 Speech Codecs with Integrated BWE
10.3.1 BWE in the GSM Full-Rate Codec
10.3.2 BWE in the AMR Wideband Codec
10.3.3 BWE in the ITU Codec G.729.1
References

11 NELE: Near-End Listening Enhancement
11.1 Frequency Domain NELE (FD)
11.1.1 Speech Intelligibility Index NELE Optimization
11.1.1.1 SII-Optimized NELE Example
11.1.2 Closed-Form Gain-Shape NELE
11.1.2.1 The NoiseProp Shaping Function
11.1.2.2 The NoiseInverse Strategy
11.1.2.3 Gain-Shape Frequency Domain NELE Example
11.2 Time Domain NELE (TD)
11.2.1 NELE Processing Using Linear Prediction Filters
References

12 Single-Channel Noise Reduction
12.1 Introduction
12.2 Linear MMSE Estimators
12.2.1 Non-causal IIR Wiener Filter
12.2.2 The FIR Wiener Filter
12.3 Speech Enhancement in the DFT Domain
12.3.1 The Wiener Filter Revisited
12.3.2 Spectral Subtraction
12.3.3 Estimation of the A Priori SNR
12.3.3.1 Decision-Directed Approach
12.3.3.2 Smoothing in the Cepstrum Domain
12.3.4 Quality and Intelligibility Evaluation
12.3.4.1 Noise Oversubtraction
12.3.4.2 Spectral Floor
12.3.4.3 Limitation of the A Priori SNR
12.3.4.4 Adaptive Smoothing of the Spectral Gain
12.3.5 Spectral Analysis/Synthesis for Speech Enhancement
12.4 Optimal Non-linear Estimators
12.4.1 Maximum Likelihood Estimation
12.4.2 Maximum A Posteriori Estimation
12.4.3 MMSE Estimation
12.4.3.1 MMSE Estimation of Complex Coefficients
12.4.3.2 MMSE Amplitude Estimation
12.5 Joint Optimum Detection and Estimation of Speech
12.6 Computation of Likelihood Ratios
12.7 Estimation of the A Priori and A Posteriori Probabilities of Speech Presence
12.7.1 Estimation of the A Priori Probability
12.7.2 A Posteriori Speech Presence Probability Estimation
12.7.3 SPP Estimation Using a Fixed SNR Prior
12.8 VAD and Noise Estimation Techniques
12.8.1 Voice Activity Detection
12.8.1.1 Detectors Based on the Subband SNR
12.8.2 Noise Power Estimation Based on Minimum Statistics
12.8.3 Noise Estimation Using a Soft-Decision Detector
12.8.4 Noise Power Tracking Based on Minimum Mean Square Error Estimation
12.8.5 Evaluation of Noise Power Trackers
12.9 Noise Reduction with Deep Neural Networks
12.9.1 Processing Model
12.9.2 Estimation Targets
12.9.3 Loss Function
12.9.4 Input Features
12.9.5 Data Sets
References

13 Dual-Channel Noise and Reverberation Reduction
13.1 Dual-Channel Wiener Filter
13.2 The Ideal Diffuse Sound Field and Its Coherence
13.3 Noise Cancellation
13.3.1 Implementation of the Adaptive Noise Canceller
13.4 Noise Reduction
13.4.1 Principle of Dual-Channel Noise Reduction
13.4.2 Binaural Equalization–Cancellation and Common Gain Noise Reduction
13.4.3 Combined Single- and Dual-Channel Noise Reduction
13.5 Dual-Channel Dereverberation
13.6 Methods Based on Deep Learning
References

14 Acoustic Echo Control
14.1 The Echo Control Problem
14.2 Echo Cancellation and Postprocessing
14.2.1 Echo Canceller with Center Clipper
14.2.2 Echo Canceller with Voice-Controlled Soft-Switching
14.2.3 Echo Canceller with Adaptive Postfilter
14.3 Evaluation Criteria
14.3.1 System Distance
14.3.2 Echo Return Loss Enhancement
14.4 The Wiener Solution
14.5 The LMS and NLMS Algorithms
14.5.1 Derivation and Basic Properties
14.6 Convergence Analysis and Control of the LMS Algorithm
14.6.1 Convergence in the Absence of Interference
14.6.2 Convergence in the Presence of Interference
14.6.3 Filter Order of the Echo Canceller
14.6.4 Stepsize Parameter
14.7 Geometric Projection Interpretation of the NLMS Algorithm
14.8 The Affine Projection Algorithm
14.9 Least-Squares and Recursive Least-Squares Algorithms
14.9.1 The Weighted Least-Squares Algorithm
14.9.2 The RLS Algorithm
14.9.3 NLMS and Kalman Algorithm
14.9.3.1 NLMS Algorithm
14.9.3.2 Kalman Algorithm
14.9.3.3 Summary of Kalman Algorithm
14.9.3.4 Remarks
14.10 Block Processing and Frequency Domain Adaptive Filters
14.10.1 Block LMS Algorithm
14.10.2 Frequency Domain Adaptive Filter (FDAF)
14.10.2.1 Fast Convolution and Overlap-Save
14.10.2.2 FLMS Algorithm
14.10.2.3 Improved Stepsize Control
14.10.3 Subband Acoustic Echo Cancellation
14.10.4 Echo Canceller with Adaptive Postfilter in the Frequency Domain
14.10.5 Initialization with Perfect Sequences
14.11 Stereophonic Acoustic Echo Control
14.11.1 The Non-uniqueness Problem
14.11.2 Solutions to the Non-uniqueness Problem
References

15 Microphone Arrays and Beamforming
15.1 Introduction
15.2 Spatial Sampling of Sound Fields
15.2.1 The Near-field Model
15.2.2 The Far-field Model
15.2.3 Sound Pickup in Reverberant Spaces
15.2.4 Spatial Correlation Properties of Acoustic Signals
15.2.5 Uniform Linear and Circular Arrays
15.2.6 Phase Ambiguity in Microphone Signals
15.3 Beamforming
15.3.1 Delay-and-Sum Beamforming
15.3.2 Filter-and-Sum Beamforming
15.4 Performance Measures and Spatial Aliasing
15.4.1 Array Gain and Array Sensitivity
15.4.2 Directivity Pattern
15.4.3 Directivity and Directivity Index
15.4.4 Example: Differential Microphones
15.5 Design of Fixed Beamformers
15.5.1 Minimum Variance Distortionless Response Beamformer
15.5.2 MVDR Beamformer with Limited Susceptibility
15.5.3 Linearly Constrained Minimum Variance Beamformer
15.5.4 Max-SNR Beamformer
15.6 Multichannel Wiener Filter and Postfilter
15.7 Adaptive Beamformers
15.7.1 The Frost Beamformer
15.7.2 Generalized Side-Lobe Canceller
15.7.3 Generalized Side-Lobe Canceller with Adaptive Blocking Matrix
15.7.4 Model-Based Parsimonious-Excitation-Based GSC
15.8 Non-linear Multi-channel Noise Reduction
References

Index

Preface

Digital processing, storage, and transmission of speech signals have gained great practical importance. The main areas of application are digital mobile radio, audio-visual conferencing, acoustic human–machine communication, and hearing aids. In fact, these applications are the driving forces behind many scientific and technological developments in this field. A specific feature of these application areas is that theory and implementation are closely linked; there is a seamless transition from theory and algorithms to system simulations using general-purpose computers and to implementations on embedded processors.

This book has been written for engineers and engineering students specializing in speech and audio processing. It summarizes fundamental theory and recent developments in the broad field of digital speech transmission and enhancement and includes joint research of the authors and their PhD students. This book is being used in graduate courses at RWTH Aachen University and Ruhr-Universität Bochum and other universities.

This second edition also reflects progress in digital speech transmission and enhancement since the publication of the first edition [Vary, Martin 2006]. In this respect, new speech coding standards have been included, such as the Enhanced Voice Services (EVS) codec. Throughout this book, the term enhancement comprises, besides noise reduction, also the topics of error concealment, artificial bandwidth extension, echo cancellation, and the new topic of near-end listening enhancement.

Furthermore, summaries of essential tools such as spectral analysis, digital filter banks, including the so-called filter bank equalizer, as well as stochastic signal processing and estimation theory are provided. Recent trends of applying machine learning techniques in speech signal processing are addressed.

As a supplement to the first and second edition, the companion book Advances in Digital Speech Transmission [Martin et al. 2008] should be mentioned, which covers specific topics in Speech Quality Assessment, Acoustic Signal Processing, Speech Coding, Joint Source-Channel Coding, and Speech Processing in Hearing Instruments and Human–Machine Interfaces.

Furthermore, the reader will find supplementary information, publications, programs, and audio samples, the Aachen databases (single and multichannel room impulse responses, active noise cancellation impulse responses), and a database of simulated room impulse responses for acoustic sensor networks on the following websites:

http://www.iks.rwth-aachen.de

http://www.rub.de/ika

The scope of the individual subjects treated in the book chapters exceeds that of graduate lectures; recent research results, standards, problems of realization, and applications have been included, as well as many suggestions for further reading. The reader should be familiar with the fundamentals of digital signal processing and statistical signal processing.

The authors are grateful to all current and former members of their groups and students who contributed to the book through research results, discussions, or editorial work. In particular, we would like to thank Dr.-Ing. Christiane Antweiler, Dr.-Ing. Colin Breithaupt, Prof. Gerald Enzner, Prof. Tim Fingscheidt, Prof. Timo Gerkmann, Prof. Peter Jax, Dr.-Ing. Heiner Löllmann, Prof. Nilesh Madhu, Dr.-Ing. Anil Nagathil, Dr.-Ing. Markus Niermann, Dr.-Ing. Bastian Sauert, and Dr.-Ing. Thomas Schlien for fruitful discussions and valuable contributions. Furthermore, we would especially like to thank Dr.-Ing. Christiane Antweiler for her tireless support to this project, and Horst Krott and Dipl.-Geogr. Julia Ringeis for preparing most of the diagrams.

Finally, we would like to express our sincere thanks to the managing editors and staff of John Wiley & Sons for their kind and patient assistance.

Peter Vary and Rainer Martin

Aachen and Bochum, October 2023

References

Martin, R.; Heute, U.; Antweiler, C. (2008). Advances in Digital Speech Transmission, John Wiley & Sons.

Vary, P.; Martin, R. (2006). Digital Speech Transmission – Enhancement, Coding and Error Concealment, John Wiley & Sons.

1 Introduction

Language is the most essential means of human communication. It is used in two modes: as spoken language (speech communication) and as written language (textual communication). In our modern information society both modes are greatly enhanced by technical systems and devices. E-mail, short messaging, and the world wide web have revolutionized textual communication, while

● digital cellular radio systems,
● audio–visual conference systems,
● acoustic human–machine communication, and
● digital hearing aids

have significantly expanded the possibilities and convenience of speech and audio–visual communication.

Digital processing and enhancement of speech signals for the purpose of transmission (or storage) is a branch of information technology and an engineering science which draws on various other disciplines, such as physiology, phonetics, linguistics, acoustics, and psychoacoustics. It is this multidisciplinary aspect which makes digital speech processing a challenging as well as rewarding task.

The goal of this book is a comprehensive discussion of fundamental issues, standards, and trends in speech communication technology. Speech communication technology helps to mitigate a number of physical constraints and technological limitations, most notably

● bandwidth limitations of the telephone channel,
● shortage of radio frequencies,
● acoustic background noise at the near-end (receiving side),
● acoustic background noise at the far-end (transmitting side),
● (residual) transmission errors and packet losses caused by the transmission channel,
● interfering acoustic echo signals from loudspeaker(s).

The enormous advances in signal processing technology have contributed to the success of speech signal processing. At present, integrated digital signal processors allow economic real-time implementations of complex algorithms, which require several thousand operations per speech sample. For this reason, advanced speech signal processing functions can be implemented in cellular phones and audio–visual terminals, as illustrated in Figure 1.1.

Digital Speech Transmission and Enhancement, Second Edition. Peter Vary and Rainer Martin. © 2024 John Wiley & Sons Ltd. Published 2024 by John Wiley & Sons Ltd.

Figure 1.1 Speech signal processing in a handsfree cellular terminal. BF: beamforming, AEC: acoustic echo cancellation, NR: noise reduction, SC: speech coding, ETC: equivalent transmission channel, EC: error concealment, SD: speech decoding, BWE: bandwidth extension, and NELE: near-end listening enhancement.

The handsfree terminal in Figure 1.1 facilitates communication via microphones and loudspeakers. Handsfree telephone devices are installed in motor vehicles in order to enhance road safety and to increase convenience in general.

At the far end of the transmission system, three different pre-processing steps are taken to improve communication in the presence of ambient noise and loudspeaker signals. In the first step, two or more microphones are used to enhance the near-end speech signal by beamforming (BF). Specific characteristics of the interference, such as the spatial distribution of the sound sources and the statistics of the spatial sound field, are exploited.

Acoustic echoes occur when the far-end signal leaks at the near-end from the loudspeaker of the handsfree set into the microphone(s) via the acoustic path. As a consequence, the far-end speakers will hear their own voice delayed by twice the signal propagation time of the telephone network. Therefore, in a second step, the acoustic echo must be compensated by an adaptive digital filter, the acoustic echo canceller (AEC). The third module of the pre-processing chain is noise reduction (NR), aiming at an improvement of speech quality prior to coding and transmission. Single-channel NR systems rely on spectral modifications and are most effective for short-term stationary noise.

Speech coding (SC), error concealment (EC), and speech decoding (SD) facilitate the efficient use of the transmission channel. SC algorithms for cellular communications with typical bit rates between 4 and 24 kbit/s are explicitly based upon a model of speech production and exploit properties of the hearing mechanism.

At the receiving side of the transmission system, speech quality is ensured by means of error correction (channel decoding), which is not within the scope of this book. In Figure 1.1, the (inner) channel coding/decoding as well as modulation/demodulation and transmission over the physical channel are modeled as an equivalent transmission channel (ETC). In spite of channel coding, quite frequently residual errors remain. The negative auditive effects of these errors can be mitigated by error concealment (EC) techniques. In many cases, these effects can be reduced by exploiting both residual source redundancy and information about the instantaneous quality of the transmission channel.

Finally, the decoded signal might be subjected to artificial bandwidth extension (BWE), which expands narrowband (0.3–3.4 kHz) to wideband (0.05–7.0 kHz) speech or wideband speech to super-wideband (0.05–14.0 kHz). With the introduction of true wideband and super-wideband speech audio coding into telephone networks, this step will be of significant importance as, for a long transition period, narrowband and wideband speech terminals will coexist.

At the receiving end (near-end), the perception of the decoded (and eventually bandwidth-expanded) speech signal might be disturbed by acoustic background noise. The task of the last module in the transmission chain is to improve intelligibility or at least to reduce the listening effort. The received speech signal is modified, taking the near-end background noise into account, which can be captured with a microphone. This method is called near-end listening enhancement (NELE).
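The ordering of the modules just described can be summarized in a schematic sketch. The stage names follow Figure 1.1; the identity "implementations" below are mere placeholders for the algorithms treated in later chapters, not part of the book:

```python
# Schematic sketch of the processing chain of Figure 1.1.
# Stage names are from the figure; identity functions are placeholders.

def identity(samples):
    return samples

# far-end (transmitting) side and near-end (receiving) side, in order
TX_STAGES = ["BF", "AEC", "NR", "SC"]
RX_STAGES = ["EC", "SD", "BWE", "NELE"]

def run_chain(samples, stages):
    """Pass the signal through each stage in order (placeholders here)."""
    for name in stages:
        samples = identity(samples)  # a real system would dispatch per stage
    return samples

signal = [0.0, 0.1, -0.2]
received = run_chain(run_chain(signal, TX_STAGES), RX_STAGES)
print(received)  # the placeholder chain leaves the signal unchanged
```

The sketch only fixes the stage order; the ETC (equivalent transmission channel) would sit between the two chains.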

Some of these processing functions also find applications in audio–visual conferencing devices and digital hearing aids.

The book is organized as follows. The first part, fundamentals (Chapters 2–5), deals with models of speech production and hearing, spectral transformations, filter banks, and stochastic processes.

The second part, speech coding (Chapters 6–8), covers quantization and differential waveform coding, and especially the concepts of code-excited linear prediction (CELP) are discussed. Finally, some of the most relevant speech codec standards are presented. Recent developments such as the Adaptive Multi-Rate (AMR) codec or the Enhanced Voice Services (EVS) codec for cellular and IP communication are described.

The third part, speech enhancement (Chapters 9–15), is concerned with error concealment, bandwidth extension, near-end listening enhancement, single- and dual-channel noise and reverberation reduction, acoustic echo cancellation, and beamforming.

2 Models of Speech Production and Hearing

Digital speech communication systems are largely based on knowledge of speech production, hearing, and perception. In this chapter, we will discuss some fundamental aspects in so far as they are of importance for optimizing speech-processing algorithms such as speech coding, speech enhancement, or feature extraction for automatic speech recognition.

In particular, we will study the mechanism of speech production and the typical characteristics of speech signals. The digital speech production model will be derived from acoustical and physical considerations. The resulting all-pole model of the vocal tract is the key element of most of the current speech-coding algorithms and standards.

Furthermore, we will provide insights into the human auditory system and we will focus on perceptual fundamentals which can be exploited to improve the quality and the effectiveness of speech-processing algorithms to be discussed in later chapters. With respect to perception, the main aspects to be considered in digital speech transmission are the masking effect and the spectral resolution of the auditory system.

As a detailed discussion of the acoustic theory of speech production, phonetics, psychoacoustics, and perception is beyond the scope of this book, the reader is referred to the literature (e.g., [Fant 1970], [Flanagan 1972], [Rabiner, Schafer 1978], [Picket 1980], and [Zwicker, Fastl 2007]).

2.1 Sound Waves

Sound is a mechanical vibration that propagates through matter in the form of waves. Sound waves may be described in terms of a sound pressure field p(r, t) and a sound velocity vector field u(r, t), which are both functions of a spatial co-ordinate vector r and time t. While the sound pressure characterizes the density variations (we do not consider the DC component, also known as atmospheric pressure), the sound velocity describes the velocity of dislocation of the physical particles of the medium which carries the waves. This velocity is different from the speed c of the traveling sound wave.

In the context of our applications, i.e., sound waves in air, sound pressure p(r, t) and resulting density variations ρ(r, t) are related by

p(r, t) = c² ρ(r, t) (2.1)
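As a quick numerical illustration of (2.1): even a fairly loud tone corresponds to a tiny density variation compared with the density of air at rest. The numeric values for air below are typical assumptions for illustration, not taken from the text:

```python
# Illustration of Eq. (2.1): p = c^2 * rho relates sound pressure to the
# density variation. Air at about 20 degrees C is assumed (c = 343 m/s).
c = 343.0        # speed of sound in m/s (assumed)
p = 1.0          # pressure amplitude in Pa, roughly a 94 dB SPL tone
rho = p / c**2   # density variation in kg/m^3 implied by Eq. (2.1)
print(rho)       # on the order of 1e-5 kg/m^3, vs. rho0 ~ 1.2 kg/m^3 at rest
```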


andalsotherelationbetween p(r, t) and u(r, t) maybelinearized.Then,inthegeneralcase ofthreespatialdimensionsthesetwoquantitiesarerelatedviadifferentialoperatorsinan infinitesimallysmallvolumeofairparticlesas

where c and ��0 arethespeedofsoundandthedensityatrest,respectively.Theseequations, alsoknownasEuler’sequationandcontinuityequation[Xiang,Blauert2021],maybe combinedintothewaveequation

wheretheLaplaceoperator Δ

A solution of the wave equation (2.3) is plane waves which feature surfaces of constant sound pressure propagating in a given spatial direction. A harmonic plane wave of angular frequency ω which propagates in positive x direction or negative x direction may be written in complex notation as

p_f(x, t) = p̂_f exp(j(ωt − kx)),   (2.4)
p_b(x, t) = p̂_b exp(j(ωt + kx)),   (2.5)

where k = ω/c = 2π/λ is the wavenumber, λ is the wavelength, and p̂_f, p̂_b are the (possibly complex-valued) amplitudes. Using (2.2), the x component of the sound velocity is then given by

u_x(x, t) = p_f(x, t)/(ρ₀ c)  or  u_x(x, t) = −p_b(x, t)/(ρ₀ c).   (2.6)

Thus, for a plane wave, the sound velocity is proportional to the sound pressure. In our applications, waves which have a constant sound pressure on concentric spheres are also of interest. Indeed, the wave equation (2.3) delivers a solution for the spherical wave which propagates in radial direction r as

p(r, t) = (1/r) f(r − ct),   (2.7)

where f is the propagating waveform. The amplitude of the sound wave diminishes with increasing distance from the source. We may then use the abstraction of a point source to explain the generation of such spherical waves.
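The wavenumber relation k = ω/c = 2π/λ used above is easy to check numerically. A minimal sketch in plain Python, with an assumed speed of sound of c = 343 m/s in air:

```python
import math

def wavenumber(f, c=343.0):
    """Wavenumber k = 2*pi*f / c = 2*pi / lambda."""
    return 2.0 * math.pi * f / c

def wavelength(f, c=343.0):
    """Wavelength lambda = c / f."""
    return c / f

# A 1 kHz tone in air has a wavelength of about 34.3 cm,
# and k * lambda recovers 2*pi for any frequency.
print(wavelength(1000.0))                       # 0.343
print(wavenumber(1000.0) * wavelength(1000.0))  # ~6.283 (= 2*pi)
```

At speech frequencies, wavelengths thus range from several meters (below 100 Hz) down to a few centimeters (above 7 kHz).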

An ideal point source may be represented by its source strength Q₀(t) [Xiang, Blauert 2021]. Furthermore, with (2.2) we have

ρ₀ ∂u_r(r, t)/∂t = −∂p(r, t)/∂r = (1/r²) f(r − ct) − (1/r) f′(r − ct).   (2.8)

Then, the radial component of the velocity vector may be integrated over a sphere of radius r to yield Q₀(t) ≈ 4π r² u_r(r, t). For r → 0, the second term on the right-hand side of (2.8) is smaller than the first. Therefore, for an infinitesimally small sphere, we find with (2.8)

ρ₀ dQ₀(t)/dt ≈ 4π f(−ct)   (2.9)

and, with (2.7), for any r

p(r, t) = (ρ₀/(4π r)) dQ₀(t − r/c)/dt,   (2.10)

which characterizes, again, a spherical wave. The sound pressure is inversely proportional to the radial distance r from the point source. For a harmonic excitation

Q₀(t) = Q̂₀ exp(jωt)   (2.11)

we find the sound pressure

p(r, t) = (jω ρ₀ Q̂₀/(4π r)) exp(j(ωt − kr))   (2.12)

and hence, with (2.8) and an integration with respect to time, the sound velocity

u_r(r, t) = (p(r, t)/(ρ₀ c)) (1 + 1/(jkr)).   (2.13)

Clearly, (2.12) and (2.13) satisfy (2.8). Because of the second term in the parentheses in (2.13), sound pressure and sound velocity are not in phase. Depending on the distance of the observation point to the point source, the behavior of the wave is distinctly different. When the second term cannot be neglected, the observation point is in the near field of the source. For kr ≫ 1, the observation point is in the far field. The transition from the near field to the far field depends on the wavenumber k and, as such, on the wavelength or the frequency of the harmonic excitation.
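The near-field/far-field transition can be illustrated numerically. The following sketch assumes the standard point-source relation u_r = p/(ρ₀c) · (1 + 1/(jkr)) referenced above and evaluates the phase of the sound velocity relative to the sound pressure at several distances, for an arbitrarily chosen 100 Hz tone:

```python
import cmath
import math

def velocity_phase_deg(k, r):
    """Phase (in degrees) of the radial sound velocity relative to the
    sound pressure of a point source: angle of (1 + 1/(j*k*r))."""
    return math.degrees(cmath.phase(1.0 + 1.0 / (1j * k * r)))

c = 343.0                   # speed of sound in air, m/s
f = 100.0                   # a low speech frequency, Hz
k = 2.0 * math.pi * f / c   # wavenumber
for r in (0.01, 0.1, 1.0, 10.0):
    # near field (k*r << 1): velocity lags pressure by almost 90 degrees;
    # far field (k*r >> 1): pressure and velocity are nearly in phase
    print(f"r = {r:5.2f} m, k*r = {k * r:6.3f}, "
          f"phase = {velocity_phase_deg(k, r):6.1f} deg")
```

For a microphone close to the mouth, this near-field phase behavior (and the associated pressure gradient) is what gives directional microphones their bass-boosting proximity effect.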

2.2 Organs of Speech Production

The production of speech sounds involves the manipulation of an airstream. The acoustic representation of speech is a sound pressure wave originating from the physiological speech production system. A simplified schematic of the human speech organs is given in Figure 2.1. The main components and their functions are:

● lungs: the energy generator,
● trachea: for energy transport,
● larynx with vocal cords: the signal generator, and
● vocal tract with pharynx, oral and nasal cavities: the acoustic filter.

By contraction, the lungs produce an airflow which is modulated by the larynx, processed by the vocal tract, and radiated via the lips and the nostrils. The larynx provides several biological and sound production functions. In the context of speech production, its purpose is to control the stream of air that enters the vocal tract via the vocal cords.

Speech sounds are produced by means of various mechanisms. Voiced sounds are produced when the airflow is interrupted periodically by the movements (vibration) of the vocal cords (see Figure 2.2). This self-sustained oscillation, i.e., the repeated opening and closing of the vocal cords, can be explained by the so-called Bernoulli effect as in fluid

Figure 2.1 Organs of speech production.

dynamics: as airflow velocity increases, local pressure decreases. At the beginning of each cycle, the area between the vocal cords, which is called the glottis, is almost closed by means of appropriate tension of the vocal cords. Then an increased air pressure builds up below the glottis, forcing the vocal cords to open. As the vocal cords diverge, the velocity of the air flowing through the glottis increases steadily, which causes a drop in the local pressure. Then, the vocal cords snap back to their initial position and the next cycle can start if the airflow from the lungs and the tension of the vocal cords are sustained. Due to the abrupt periodic interruptions of the glottal airflow, as schematically illustrated in Figure 2.2, the resulting excitation (pressure wave) of the vocal tract has a fundamental frequency of f₀ = 1/T₀ and has a large number of harmonics. These are spectrally shaped according to the frequency response of the acoustic vocal tract. The duration T₀ of a single cycle is called the pitch period.

Unvoiced sounds are generated by a constriction at the open glottis or along the vocal tract causing a nonperiodic turbulent airflow.

Plosive sounds (also known as stops) are caused by building up the air pressure behind a complete constriction somewhere in the vocal tract, followed by a sudden opening. The released airflow may create a voiced or an unvoiced sound or even a mixture of both, depending on the actual constellation of the articulators.

The vocal tract can be subdivided into three sections: the pharynx, the oral cavity, and the nasal cavity. As the entrance to the nasal cavity can be closed by the velum, a distinction is often made in the literature between the nasal tract (from velum to nostrils) and the other

Figure 2.2 Glottal airflow during voiced sounds.

two sections (from trachea to lips, including the pharynx cavity). In this chapter, we will define the vocal tract as a variable acoustic resonator including the nasal cavity with the velum either open or closed, depending on the specific sound to be produced. From the engineering point of view, the resonance frequencies are varied by changing the size and the shape of the vocal tract using different constellations and movements of the articulators, i.e., tongue, teeth, lips, velum, lower jaw, etc. Thus, humans can produce a variety of different sounds based on different vocal tract constellations and different acoustic excitations.

Finally, the acoustic waves carrying speech sounds are radiated via the mouth and head. In a first approximation, we may model the radiating head as a spherical source in free space. The (complex-valued) acoustic load Z_L(r_H) at the lips may then be approximated by the radiation load of a spherical source of radius r_H, where r_H represents the head radius. Following (2.13), this load exhibits a high-pass characteristic,

Z_L(r_H) = ρ₀ c (jω r_H/c) / (1 + jω r_H/c),   (2.14)

where ω = 2π f denotes the angular frequency and c is the speed of sound. This model suggests an acoustic "short circuit" at very low frequencies, i.e., little acoustic radiation at low frequencies, which is also supported by measurements [Flanagan 1960]. For an assumed head radius of r_H = 8.5 cm and c = 343 m/s, the 3-dB cutoff frequency f_c = c/(2π r_H) is about 640 Hz.
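The quoted value is easy to reproduce. A minimal sketch, assuming the 3-dB cutoff of a spherical radiator, f_c = c/(2π r_H):

```python
import math

def cutoff_frequency(r_head, c=343.0):
    """3-dB cutoff of the spherical-source radiation load: f_c = c / (2*pi*r_H)."""
    return c / (2.0 * math.pi * r_head)

# Head radius 8.5 cm, c = 343 m/s:
print(round(cutoff_frequency(0.085)))  # 642
```

A larger radiating sphere pushes the cutoff down; doubling the radius halves f_c.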

2.3 Characteristics of Speech Signals

Most languages can be described as a set of elementary linguistic units, which are called phonemes. A phoneme is defined as the smallest unit which differentiates the meaning of two words in one language. The acoustic representation associated with a phoneme is called a phone. American English, for instance, consists of about 42 phonemes, which are subdivided into four classes:

Vowels are voiced and belong to the speech sounds with the largest energy. They exhibit a quasiperiodic time structure caused by oscillation of the vocal cords. The duration varies from 40 to 400 ms. Vowels can be distinguished by the time-varying resonance characteristics of the vocal tract. The resonance frequencies are also called formant frequencies. Examples: /a/ as in "father" and /i/ as in "eve."

Diphthongs involve a gliding transition of the articulators from one vowel to another vowel. Examples: /oU/ as in "boat" and /ju/ as in "you."

Approximants are a group of voiced phonemes for which the airstream escapes through a relatively narrow aperture in the vocal tract. They can, thus, be regarded as intermediate between vowels and consonants [Gimson, Cruttenden 1994]. Examples: /w/ in "wet" and /r/ in "ran."

Consonants are produced with stronger constriction of the vocal tract than vowels. All kinds of excitation can be observed. Consonants are subdivided into nasals, stops, fricatives, aspirates, and affricates. Examples of these five subclasses: /m/ as in "more," /t/ as in "tea," /f/ as in "free," /h/ as in "hold," and /tʃ/ as in "chase."


Each of these classes may be further divided into subclasses, which are related to the interaction of the articulators within the vocal tract. The phonemes can further be classified as either continuant (excitation of a more or less nontime-varying vocal tract) or noncontinuant (rapid vocal tract changes). The class of continuant sounds consists of vowels and fricatives (voiced and unvoiced). The noncontinuant sounds are represented by diphthongs, semivowels, stops, and affricates.

For the purpose of speech-signal processing, specific articulatory and phonetic aspects are not as important as the typical characteristics of the waveforms, namely, the basic categories:

● voiced,

● unvoiced,

● mixedvoiced/unvoiced,

● plosive,and

● silence.

Voiced sounds are characterized by their fundamental frequency, i.e., the frequency of vibration of the vocal cords, and by the specific pattern of amplitudes of the spectral harmonics.

In the speech signal processing literature, the fundamental frequency is often called pitch and the respective period is called pitch period. It should be noted, however, that in psychoacoustics the term pitch is used differently, i.e., for the perceived fundamental frequency of a sound, whether or not that frequency is actually present in the waveform (e.g., [Deller Jr. et al. 2000]). The fundamental frequency of young men ranges from 85 to 155 Hz and that of young women from 165 to 255 Hz [Fitch, Holbrook 1970]. Fundamental frequency, also in combination with vocal tract length, is indicative of sex, age, and size of the speaker [Smith, Patterson 2005].
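Since the pitch period shows up as a peak in the short-term autocorrelation of a voiced frame, a rudimentary fundamental frequency estimator can be sketched by a simple lag search. This is only an illustrative sketch, not an algorithm from this book; the function name, the search range of 50–500 Hz, and the synthetic test frame are our own assumptions:

```python
import math

def estimate_f0(frame, fs, f_min=50.0, f_max=500.0):
    """Pick the autocorrelation-maximizing lag within the plausible
    pitch-period range and convert it to a frequency estimate."""
    lag_min = int(fs / f_max)          # shortest period considered
    lag_max = int(fs / f_min)          # longest period considered
    best_lag, best_corr = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        corr = sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return fs / best_lag

# Synthetic "voiced" frame: 120 Hz fundamental plus two weaker harmonics.
fs = 8000
frame = [math.sin(2 * math.pi * 120 * n / fs)
         + 0.5 * math.sin(2 * math.pi * 240 * n / fs)
         + 0.25 * math.sin(2 * math.pi * 360 * n / fs)
         for n in range(800)]
print(round(estimate_f0(frame, fs), 1))  # close to 120 Hz
```

Practical pitch estimators add normalization, windowing, and octave-error handling, but the underlying idea of exploiting periodicity is the same.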

Unvoiced sounds are determined mainly by their characteristic spectral envelopes. Voiced and unvoiced excitation do not exclude each other. They may occur simultaneously, e.g., in fricative sounds.

The distinctive feature of plosive sounds is the dynamically transient change of the vocal tract. Immediately before the transition, a total constriction in the vocal tract stops sound radiation from the lips for a short period. There might be a small amount of low-frequency components radiated through the throat. Then, the sudden change with release of the constriction produces a plosive burst.

Some typical speech waveforms are shown in Figure 2.3.

2.4 Model of Speech Production

The purpose of developing a model of speech production is not to obtain an accurate description of the anatomy and physiology of human speech production but rather to achieve a simplifying mathematical representation for reproducing the essential characteristics of speech signals.

In analogy to the organs of human speech production as discussed in Section 2.2, it seems reasonable to design a parametric two-stage model consisting of an excitation source and a

Figure 2.3 Characteristic waveforms of speech signals: (a) Voiced (vowel with transition to voiced consonant); (b) Unvoiced (fricative); (c) Transition: pause–plosive–vowel.

vocal tract filter, see also [Rabiner, Schafer 1978], [Parsons 1986], [Quatieri 2001], [Deller Jr. et al. 2000]. The resulting digital source-filter model, as illustrated in Figure 2.4, will be derived below.

The model consists of two components:

● the excitation source featuring mainly the influence of the lungs and the vocal cords (voiced, unvoiced, mixed) and

● the time-varying digital vocal tract filter approximating the behavior of the vocal tract (spectral envelope and dynamic transitions).

In the first and simple model, the excitation generator only has to deliver either white noise or a periodic sequence of pitch pulses for synthesizing unvoiced and voiced sounds, respectively, whereas the vocal tract is modeled as a time-varying discrete-time filter.

Figure 2.4 Digital source-filter model (an excitation source controlled by source parameters, followed by a time-varying digital vocal tract filter controlled by filter parameters).

2.4.1 Acoustic Tube Model of the Vocal Tract

The digital source-filter model of Figure 2.4, especially the vocal tract filter, will be derived from the physics of sound propagation inside an acoustic tube. To estimate the necessary filter order, we start with the extremely simplifying physical model of Figure 2.5. According to this simplistic model, the pharynx and oral cavities are represented by a lossless tube with constant cross section and the nasal cavity by a second tube which can be closed by the velum. The length of L = 17 cm corresponds to the average length of the vocal tract of a male adult. The tube is (almost) closed at the glottis side and open at the lips.

In the case of a nonnasal sound, the velum is closed. Then, the wavelength λᵢ of each resonance frequency of the main tube from the glottis to the lips fulfills the standing wave condition

L = (2i − 1) λᵢ/4,  i = 1, 2, 3, …   (2.15)

For L = 17 cm, we compute the resonance frequencies

fᵢ = c/λᵢ = (2i − 1) c/(4L) = (2i − 1) · 500 Hz,   (2.16)

where the speed of sound is given by c = 340 m/s.

Taking (2.16) into account as well as the fact that the conventional narrowband telephone (NB) service has a frequency range of about 200–3400 Hz, and that the wideband telephone (WB) service covers a frequency range from 50 to 7000 Hz, we have to consider only four (NB) and eight resonances (WB) of the vocal tract model, respectively. As the acoustical bandwidth of speech is wider than 3400 Hz and even wider than 7000 Hz, lowpass filtering with a finite transition width from passband to stopband is required as part of analog-to-digital conversion. Thus, the sampling rate for telephone speech is either 8 kHz (NB) or 16 kHz (WB), and the overall filter order for synthesizing telephone speech is roughly only n = 8 or n = 16. Each resonance frequency corresponds to a pole pair or second-order filter section. As a rule of thumb, we can state the need for "one resonance per kHz."
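The resonance count behind the "one resonance per kHz" rule can be reproduced with a small sketch, assuming the quarter-wavelength resonances fᵢ = (2i − 1)c/(4L) of a tube closed at one end and open at the other:

```python
def tube_resonances(L, c=340.0, f_limit=8000.0):
    """Resonance frequencies f_i = (2i - 1) * c / (4L) of a lossless tube
    that is closed at one end (glottis) and open at the other (lips),
    listed up to f_limit (in Hz)."""
    freqs = []
    i = 1
    while (2 * i - 1) * c / (4.0 * L) <= f_limit:
        freqs.append((2 * i - 1) * c / (4.0 * L))
        i += 1
    return freqs

# Average male vocal tract, L = 17 cm:
print([round(f) for f in tube_resonances(0.17, f_limit=4000.0)])  # [500, 1500, 2500, 3500]
print(len(tube_resonances(0.17, f_limit=8000.0)))                 # 8
```

Four resonances fall below the 4 kHz Nyquist frequency of narrowband telephony and eight below 8 kHz, consistent with the filter orders n = 8 and n = 16 stated above.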

In the second step, we improve our acoustic tube model, as shown in Figure 2.6. For simplicity, the nasal cavity is not considered (velum is closed). The cylindrical lossless tube

Figure 2.5 Simplified physical model of the vocal tract.
