Digital Speech Transmission and Enhancement
Second Edition

Peter Vary
Institute of Communication Systems
RWTH Aachen University
Aachen, Germany

Rainer Martin
Institute of Communication Acoustics
Ruhr-Universität Bochum
Bochum, Germany

This second edition first published 2024
© 2024 John Wiley & Sons Ltd.

Edition History
John Wiley & Sons Ltd. (1e, 2006)

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Peter Vary and Rainer Martin to be identified as the authors of this work has been asserted in accordance with law.

Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data applied for

Hardback: 9781119060963
ePdf: 9781119060994
ePub: 9781119060987

Cover Design: Wiley
Cover Image: © BAIVECTOR/Shutterstock

Set in 9.5/12.5pt STIX Two Text by Straive, Chennai, India
Contents

Preface

1 Introduction

2 Models of Speech Production and Hearing
2.1 Sound Waves
2.2 Organs of Speech Production
2.3 Characteristics of Speech Signals
2.4 Model of Speech Production
2.4.1 Acoustic Tube Model of the Vocal Tract
2.4.2 Discrete-Time All-Pole Model of the Vocal Tract
2.5 Anatomy of Hearing
2.6 Psychoacoustic Properties of the Auditory System
2.6.1 Hearing and Loudness
2.6.2 Spectral Resolution
2.6.3 Masking
2.6.4 Spatial Hearing
2.6.4.1 Head-Related Impulse Responses and Transfer Functions
2.6.4.2 Law of the First Wavefront
References

3 Spectral Transformations
3.1 Fourier Transform of Continuous Signals
3.2 Fourier Transform of Discrete Signals
3.3 Linear Shift-Invariant Systems
3.3.1 Frequency Response of LSI Systems
3.4 The z-Transform
3.4.1 Relation to Fourier Transform
3.4.2 Properties of the ROC
3.4.3 Inverse z-Transform
3.4.4 z-Transform Analysis of LSI Systems
3.5 The Discrete Fourier Transform
3.5.1 Linear and Cyclic Convolution
3.5.2 The DFT of Windowed Sequences
3.5.3 Spectral Resolution and Zero Padding
3.5.4 The Spectrogram
3.5.5 Fast Computation of the DFT: The FFT
3.5.6 Radix-2 Decimation-in-Time FFT
3.6 Fast Convolution
3.6.1 Fast Convolution of Long Sequences
3.6.2 Fast Convolution by Overlap-Add
3.6.3 Fast Convolution by Overlap-Save
3.7 Analysis–Modification–Synthesis Systems
3.8 Cepstral Analysis
3.8.1 Complex Cepstrum
3.8.2 Real Cepstrum
3.8.3 Applications of the Cepstrum
3.8.3.1 Construction of Minimum-Phase Sequences
3.8.3.2 Deconvolution by Cepstral Mean Subtraction
3.8.3.3 Computation of the Spectral Distortion Measure
3.8.3.4 Fundamental Frequency Estimation
References

4 Filter Banks for Spectral Analysis and Synthesis
4.1 Spectral Analysis Using Narrowband Filters
4.1.1 Short-Term Spectral Analyzer
4.1.2 Prototype Filter Design for the Analysis Filter Bank
4.1.3 Short-Term Spectral Synthesizer
4.1.4 Short-Term Spectral Analysis and Synthesis
4.1.5 Prototype Filter Design for the Analysis–Synthesis Filter Bank
4.1.6 Filter Bank Interpretation of the DFT
4.2 Polyphase Network Filter Banks
4.2.1 PPN Analysis Filter Bank
4.2.2 PPN Synthesis Filter Bank
4.3 Quadrature Mirror Filter Banks
4.3.1 Analysis–Synthesis Filter Bank
4.3.2 Compensation of Aliasing and Signal Reconstruction
4.3.3 Efficient Implementation
4.4 Filter Bank Equalizer
4.4.1 The Reference Filter Bank
4.4.2 Uniform Frequency Resolution
4.4.3 Adaptive Filter Bank Equalizer: Gain Computation
4.4.3.1 Conventional Spectral Subtraction
4.4.3.2 Filter Bank Equalizer
4.4.4 Non-uniform Frequency Resolution
4.4.5 Design Aspects & Implementation
References

5 Stochastic Signals and Estimation
5.1 Basic Concepts
5.1.1 Random Events and Probability
5.1.2 Conditional Probabilities
5.1.3 Random Variables
5.1.4 Probability Distributions and Probability Density Functions
5.1.5 Conditional PDFs
5.2 Expectations and Moments
5.2.1 Conditional Expectations and Moments
5.2.2 Examples
5.2.2.1 The Uniform Distribution
5.2.2.2 The Gaussian Density
5.2.2.3 The Exponential Density
5.2.2.4 The Laplace Density
5.2.2.5 The Gamma Density
5.2.2.6 χ²-Distribution
5.2.3 Transformation of a Random Variable
5.2.4 Relative Frequencies and Histograms
5.3 Bivariate Statistics
5.3.1 Marginal Densities
5.3.2 Expectations and Moments
5.3.3 Uncorrelatedness and Statistical Independence
5.3.4 Examples of Bivariate PDFs
5.3.4.1 The Bivariate Uniform Density
5.3.4.2 The Bivariate Gaussian Density
5.3.5 Functions of Two Random Variables
5.4 Probability and Information
5.4.1 Entropy
5.4.2 Kullback–Leibler Divergence
5.4.3 Cross-Entropy
5.4.4 Mutual Information
5.5 Multivariate Statistics
5.5.1 Multivariate Gaussian Distribution
5.5.2 Gaussian Mixture Models
5.6 Stochastic Processes
5.6.1 Stationary Processes
5.6.2 Auto-Correlation and Auto-Covariance Functions
5.6.3 Cross-Correlation and Cross-Covariance Functions
5.6.4 Markov Processes
5.6.5 Multivariate Stochastic Processes
5.7 Estimation of Statistical Quantities by Time Averages
5.7.1 Ergodic Processes
5.7.2 Short-Time Stationary Processes
5.8 Power Spectrum and its Estimation
5.8.1 White Noise
5.8.2 The Periodogram
5.8.3 Smoothed Periodograms
5.8.3.1 Non-Recursive Smoothing in Time
5.8.3.2 Recursive Smoothing in Time
5.8.3.3 Log-Mel Filter Bank Features
5.8.4 Power Spectra and Linear Shift-Invariant Systems
5.9 Statistical Properties of Speech Signals
5.10 Statistical Properties of DFT Coefficients
5.10.1 Asymptotic Statistical Properties
5.10.2 Signal-Plus-Noise Model
5.10.3 Statistics of DFT Coefficients for Finite Frame Lengths
5.11 Optimal Estimation
5.11.1 MMSE Estimation
5.11.2 Estimation of Discrete Random Variables
5.11.3 Optimal Linear Estimator
5.11.4 The Gaussian Case
5.11.5 Joint Detection and Estimation
5.12 Non-Linear Estimation with Deep Neural Networks
5.12.1 Basic Network Components
5.12.1.1 The Perceptron
5.12.1.2 Convolutional Neural Network
5.12.2 Basic DNN Structures
5.12.2.1 Fully-Connected Feed-Forward Network
5.12.2.2 Autoencoder Networks
5.12.2.3 Recurrent Neural Networks
5.12.2.4 Time Delay, Wavenet, and Transformer Networks
5.12.2.5 Training of Neural Networks
5.12.2.6 Stochastic Gradient Descent (SGD)
5.12.2.7 Adaptive Moment Estimation Method (ADAM)
References

6 Linear Prediction
6.1 Vocal Tract Models and Short-Term Prediction
6.1.1 All-Zero Model
6.1.2 All-Pole Model
6.1.3 Pole-Zero Model
6.2 Optimal Prediction Coefficients for Stationary Signals
6.2.1 Optimum Prediction
6.2.2 Spectral Flatness Measure
6.3 Predictor Adaptation
6.3.1 Block-Oriented Adaptation
6.3.1.1 Auto-Correlation Method
6.3.1.2 Covariance Method
6.3.1.3 Levinson–Durbin Algorithm
6.3.2 Sequential Adaptation
6.4 Long-Term Prediction
References

7 Quantization
7.1 Analog Samples and Digital Representation
7.2 Uniform Quantization
7.3 Non-uniform Quantization
7.4 Optimal Quantization
7.5 Adaptive Quantization
7.6 Vector Quantization
7.6.1 Principle
7.6.2 The Complexity Problem
7.6.3 Lattice Quantization
7.6.4 Design of Optimal Vector Code Books
7.6.5 Gain–Shape Vector Quantization
7.7 Quantization of the Predictor Coefficients
7.7.1 Scalar Quantization of the LPC Coefficients
7.7.2 Scalar Quantization of the Reflection Coefficients
7.7.3 Scalar Quantization of the LSF Coefficients
References

8 Speech Coding
8.1 Speech-Coding Categories
8.2 Model-Based Predictive Coding
8.3 Linear Predictive Waveform Coding
8.3.1 First-Order DPCM
8.3.2 Open-Loop and Closed-Loop Prediction
8.3.3 Quantization of the Residual Signal
8.3.3.1 Quantization with Open-Loop Prediction
8.3.3.2 Quantization with Closed-Loop Prediction
8.3.3.3 Spectral Shaping of the Quantization Error
8.3.4 ADPCM with Sequential Adaptation
8.4 Parametric Coding
8.4.1 Vocoder Structures
8.4.2 LPC Vocoder
8.5 Hybrid Coding
8.5.1 Basic Codec Concepts
8.5.1.1 Scalar Quantization of the Residual Signal
8.5.1.2 Vector Quantization of the Residual Signal
8.5.2 Residual Signal Coding: RELP
8.5.3 Analysis by Synthesis: CELP
8.5.3.1 Principle
8.5.3.2 Fixed Code Book
8.5.3.3 Long-Term Prediction, Adaptive Code Book
8.6 Adaptive Postfiltering
8.7 Speech Codec Standards: Selected Examples
8.7.1 GSM Full-Rate Codec
8.7.2 EFR Codec
8.7.3 Adaptive Multi-Rate Narrowband Codec (AMR-NB)
8.7.4 ITU-T/G.722: 7 kHz Audio Coding within 64 kbit/s
8.7.5 Adaptive Multi-Rate Wideband Codec (AMR-WB)
8.7.6 Codec for Enhanced Voice Services (EVS)
8.7.7 Opus Codec IETF RFC 6716
References

9 Concealment of Erroneous or Lost Frames
9.1 Concepts for Error Concealment
9.1.1 Error Concealment by Hard Decision Decoding
9.1.2 Error Concealment by Soft Decision Decoding
9.1.3 Parameter Estimation
9.1.3.1 MAP Estimation
9.1.3.2 MS Estimation
9.1.4 The A Posteriori Probabilities
9.1.4.1 The A Priori Knowledge
9.1.4.2 The Parameter Distortion Probabilities
9.1.5 Example: Hard Decision vs. Soft Decision
9.2 Examples of Error Concealment Standards
9.2.1 Substitution and Muting of Lost Frames
9.2.2 AMR Codec: Substitution and Muting of Lost Frames
9.2.3 EVS Codec: Concealment of Lost Packets
9.3 Further Improvements
References

10 Bandwidth Extension of Speech Signals
10.1 BWE Concepts
10.2 BWE Using the Model of Speech Production
10.2.1 Extension of the Excitation Signal
10.2.2 Spectral Envelope Estimation
10.2.2.1 Minimum Mean Square Error Estimation
10.2.2.2 Conditional Maximum A Posteriori Estimation
10.2.2.3 Extensions
10.2.2.4 Simplifications
10.2.3 Energy Envelope Estimation
10.3 Speech Codecs with Integrated BWE
10.3.1 BWE in the GSM Full-Rate Codec
10.3.2 BWE in the AMR Wideband Codec
10.3.3 BWE in the ITU Codec G.729.1
References

11 NELE: Near-End Listening Enhancement
11.1 Frequency Domain NELE (FD)
11.1.1 Speech Intelligibility Index NELE Optimization
11.1.1.1 SII-Optimized NELE Example
11.1.2 Closed-Form Gain-Shape NELE
11.1.2.1 The NoiseProp Shaping Function
11.1.2.2 The NoiseInverse Strategy
11.1.2.3 Gain-Shape Frequency Domain NELE Example
11.2 Time Domain NELE (TD)
11.2.1 NELE Processing Using Linear Prediction Filters
References

12 Single-Channel Noise Reduction
12.1 Introduction
12.2 Linear MMSE Estimators
12.2.1 Non-causal IIR Wiener Filter
12.2.2 The FIR Wiener Filter
12.3 Speech Enhancement in the DFT Domain
12.3.1 The Wiener Filter Revisited
12.3.2 Spectral Subtraction
12.3.3 Estimation of the A Priori SNR
12.3.3.1 Decision-Directed Approach
12.3.3.2 Smoothing in the Cepstrum Domain
12.3.4 Quality and Intelligibility Evaluation
12.3.4.1 Noise Oversubtraction
12.3.4.2 Spectral Floor
12.3.4.3 Limitation of the A Priori SNR
12.3.4.4 Adaptive Smoothing of the Spectral Gain
12.3.5 Spectral Analysis/Synthesis for Speech Enhancement
12.4 Optimal Non-linear Estimators
12.4.1 Maximum Likelihood Estimation
12.4.2 Maximum A Posteriori Estimation
12.4.3 MMSE Estimation
12.4.3.1 MMSE Estimation of Complex Coefficients
12.4.3.2 MMSE Amplitude Estimation
12.5 Joint Optimum Detection and Estimation of Speech
12.6 Computation of Likelihood Ratios
12.7 Estimation of the A Priori and A Posteriori Probabilities of Speech Presence
12.7.1 Estimation of the A Priori Probability
12.7.2 A Posteriori Speech Presence Probability Estimation
12.7.3 SPP Estimation Using a Fixed SNR Prior
12.8 VAD and Noise Estimation Techniques
12.8.1 Voice Activity Detection
12.8.1.1 Detectors Based on the Subband SNR
12.8.2 Noise Power Estimation Based on Minimum Statistics
12.8.3 Noise Estimation Using a Soft-Decision Detector
12.8.4 Noise Power Tracking Based on Minimum Mean Square Error Estimation
12.8.5 Evaluation of Noise Power Trackers
12.9 Noise Reduction with Deep Neural Networks
12.9.1 Processing Model
12.9.2 Estimation Targets
12.9.3 Loss Function
12.9.4 Input Features
12.9.5 Data Sets
References

13 Dual-Channel Noise and Reverberation Reduction
13.1 Dual-Channel Wiener Filter
13.2 The Ideal Diffuse Sound Field and Its Coherence
13.3 Noise Cancellation
13.3.1 Implementation of the Adaptive Noise Canceller
13.4 Noise Reduction
13.4.1 Principle of Dual-Channel Noise Reduction
13.4.2 Binaural Equalization–Cancellation and Common Gain Noise Reduction
13.4.3 Combined Single- and Dual-Channel Noise Reduction
13.5 Dual-Channel Dereverberation
13.6 Methods Based on Deep Learning
References

14 Acoustic Echo Control
14.1 The Echo Control Problem
14.2 Echo Cancellation and Postprocessing
14.2.1 Echo Canceller with Center Clipper
14.2.2 Echo Canceller with Voice-Controlled Soft-Switching
14.2.3 Echo Canceller with Adaptive Postfilter
14.3 Evaluation Criteria
14.3.1 System Distance
14.3.2 Echo Return Loss Enhancement
14.4 The Wiener Solution
14.5 The LMS and NLMS Algorithms
14.5.1 Derivation and Basic Properties
14.6 Convergence Analysis and Control of the LMS Algorithm
14.6.1 Convergence in the Absence of Interference
14.6.2 Convergence in the Presence of Interference
14.6.3 Filter Order of the Echo Canceller
14.6.4 Stepsize Parameter
14.7 Geometric Projection Interpretation of the NLMS Algorithm
14.8 The Affine Projection Algorithm
14.9 Least-Squares and Recursive Least-Squares Algorithms
14.9.1 The Weighted Least-Squares Algorithm
14.9.2 The RLS Algorithm
14.9.3 NLMS and Kalman Algorithms
14.9.3.1 NLMS Algorithm
14.9.3.2 Kalman Algorithm
14.9.3.3 Summary of Kalman Algorithm
14.9.3.4 Remarks
14.10 Block Processing and Frequency Domain Adaptive Filters
14.10.1 Block LMS Algorithm
14.10.2 Frequency Domain Adaptive Filter (FDAF)
14.10.2.1 Fast Convolution and Overlap-Save
14.10.2.2 FLMS Algorithm
14.10.2.3 Improved Stepsize Control
14.10.3 Subband Acoustic Echo Cancellation
14.10.4 Echo Canceller with Adaptive Postfilter in the Frequency Domain
14.10.5 Initialization with Perfect Sequences
14.11 Stereophonic Acoustic Echo Control
14.11.1 The Non-uniqueness Problem
14.11.2 Solutions to the Non-uniqueness Problem
References

15 Microphone Arrays and Beamforming
15.1 Introduction
15.2 Spatial Sampling of Sound Fields
15.2.1 The Near-Field Model
15.2.2 The Far-Field Model
15.2.3 Sound Pickup in Reverberant Spaces
15.2.4 Spatial Correlation Properties of Acoustic Signals
15.2.5 Uniform Linear and Circular Arrays
15.2.6 Phase Ambiguity in Microphone Signals
15.3 Beamforming
15.3.1 Delay-and-Sum Beamforming
15.3.2 Filter-and-Sum Beamforming
15.4 Performance Measures and Spatial Aliasing
15.4.1 Array Gain and Array Sensitivity
15.4.2 Directivity Pattern
15.4.3 Directivity and Directivity Index
15.4.4 Example: Differential Microphones
15.5 Design of Fixed Beamformers
15.5.1 Minimum Variance Distortionless Response Beamformer
15.5.2 MVDR Beamformer with Limited Susceptibility
15.5.3 Linearly Constrained Minimum Variance Beamformer
15.5.4 Max-SNR Beamformer
15.6 Multichannel Wiener Filter and Postfilter
15.7 Adaptive Beamformers
15.7.1 The Frost Beamformer
15.7.2 Generalized Side-Lobe Canceller
15.7.3 Generalized Side-Lobe Canceller with Adaptive Blocking Matrix
15.7.4 Model-Based Parsimonious-Excitation-Based GSC
15.8 Non-linear Multi-channel Noise Reduction
References

Index
Preface
Digital processing, storage, and transmission of speech signals have gained great practical importance. The main areas of application are digital mobile radio, audio–visual conferencing, acoustic human–machine communication, and hearing aids. In fact, these applications are the driving forces behind many scientific and technological developments in this field. A specific feature of these application areas is that theory and implementation are closely linked; there is a seamless transition from theory and algorithms to system simulations using general-purpose computers and to implementations on embedded processors.

This book has been written for engineers and engineering students specializing in speech and audio processing. It summarizes fundamental theory and recent developments in the broad field of digital speech transmission and enhancement and includes joint research of the authors and their PhD students. This book is being used in graduate courses at RWTH Aachen University and Ruhr-Universität Bochum and other universities.

This second edition also reflects progress in digital speech transmission and enhancement since the publication of the first edition [Vary, Martin 2006]. In this respect, new speech coding standards have been included, such as the Enhanced Voice Services (EVS) codec. Throughout this book, the term enhancement comprises, besides noise reduction, also the topics of error concealment, artificial bandwidth extension, echo cancellation, and the new topic of near-end listening enhancement.

Furthermore, summaries of essential tools such as spectral analysis, digital filter banks, including the so-called filter bank equalizer, as well as stochastic signal processing and estimation theory are provided. Recent trends of applying machine learning techniques in speech signal processing are addressed.

As a supplement to the first and second edition, the companion book Advances in Digital Speech Transmission [Martin et al. 2008] should be mentioned, which covers specific topics in Speech Quality Assessment, Acoustic Signal Processing, Speech Coding, Joint Source-Channel Coding, and Speech Processing in Hearing Instruments and Human–Machine Interfaces.
Furthermore, the reader will find supplementary information, publications, programs, and audio samples, the Aachen databases (single and multichannel room impulse responses, active noise cancellation impulse responses), and a database of simulated room impulse responses for acoustic sensor networks on the following websites:

http://www.iks.rwth-aachen.de
http://www.rub.de/ika
The scope of the individual subjects treated in the book chapters exceeds that of graduate lectures; recent research results, standards, problems of realization, and applications have been included, as well as many suggestions for further reading. The reader should be familiar with the fundamentals of digital signal processing and statistical signal processing.
The authors are grateful to all current and former members of their groups and students who contributed to the book through research results, discussions, or editorial work. In particular, we would like to thank Dr.-Ing. Christiane Antweiler, Dr.-Ing. Colin Breithaupt, Prof. Gerald Enzner, Prof. Tim Fingscheidt, Prof. Timo Gerkmann, Prof. Peter Jax, Dr.-Ing. Heiner Löllmann, Prof. Nilesh Madhu, Dr.-Ing. Anil Nagathil, Dr.-Ing. Markus Niermann, Dr.-Ing. Bastian Sauert, and Dr.-Ing. Thomas Schlien for fruitful discussions and valuable contributions. Furthermore, we would especially like to thank Dr.-Ing. Christiane Antweiler for her tireless support of this project, and Horst Krott and Dipl.-Geogr. Julia Ringeis for preparing most of the diagrams.

Finally, we would like to express our sincere thanks to the managing editors and staff of John Wiley & Sons for their kind and patient assistance.
Aachen and Bochum
October 2023

Peter Vary and Rainer Martin

References

Martin, R.; Heute, U.; Antweiler, C. (2008). Advances in Digital Speech Transmission, John Wiley & Sons.
Vary, P.; Martin, R. (2006). Digital Speech Transmission – Enhancement, Coding and Error Concealment, John Wiley & Sons.
1 Introduction

Language is the most essential means of human communication. It is used in two modes: as spoken language (speech communication) and as written language (textual communication). In our modern information society both modes are greatly enhanced by technical systems and devices. E-mail, short messaging, and the world wide web have revolutionized textual communication, while

● digital cellular radio systems,
● audio–visual conference systems,
● acoustic human–machine communication, and
● digital hearing aids

have significantly expanded the possibilities and convenience of speech and audio–visual communication.
Digital processing and enhancement of speech signals for the purpose of transmission (or storage) is a branch of information technology and an engineering science which draws on various other disciplines, such as physiology, phonetics, linguistics, acoustics, and psychoacoustics. It is this multidisciplinary aspect which makes digital speech processing a challenging as well as rewarding task.
The goal of this book is a comprehensive discussion of fundamental issues, standards, and trends in speech communication technology. Speech communication technology helps to mitigate a number of physical constraints and technological limitations, most notably

● bandwidth limitations of the telephone channel,
● shortage of radio frequencies,
● acoustic background noise at the near-end (receiving side),
● acoustic background noise at the far-end (transmitting side),
● (residual) transmission errors and packet losses caused by the transmission channel,
● interfering acoustic echo signals from loudspeaker(s).
The enormous advances in signal processing technology have contributed to the success of speech signal processing. At present, integrated digital signal processors allow economic real-time implementations of complex algorithms, which require several thousand operations per speech sample. For this reason, advanced speech signal processing functions can be implemented in cellular phones and audio–visual terminals, as illustrated in Figure 1.1.
Figure 1.1 Speech signal processing in a handsfree cellular terminal. BF: beamforming, AEC: acoustic echo cancellation, NR: noise reduction, SC: speech coding, ETC: equivalent transmission channel, EC: error concealment, SD: speech decoding, BWE: bandwidth extension, and NELE: near-end listening enhancement.
The handsfree terminal in Figure 1.1 facilitates communication via microphones and loudspeakers. Handsfree telephone devices are installed in motor vehicles in order to enhance road safety and to increase convenience in general.
At the far end of the transmission system, three different pre-processing steps are taken to improve communication in the presence of ambient noise and loudspeaker signals. In the first step, two or more microphones are used to enhance the near-end speech signal by beamforming (BF). Specific characteristics of the interference, such as the spatial distribution of the sound sources and the statistics of the spatial sound field, are exploited.
Acoustic echoes occur when the far-end signal leaks at the near-end from the loudspeaker of the handsfree set into the microphone(s) via the acoustic path. As a consequence, the far-end speakers will hear their own voice delayed by twice the signal propagation time of the telephone network. Therefore, in a second step, the acoustic echo must be compensated by an adaptive digital filter, the acoustic echo canceller (AEC).

The third module of the pre-processing chain is noise reduction (NR), aiming at an improvement of speech quality prior to coding and transmission. Single-channel NR systems rely on spectral modifications and are most effective for short-term stationary noise.
Speech coding (SC), error concealment (EC), and speech decoding (SD) facilitate the efficient use of the transmission channel. SC algorithms for cellular communications with typical bit rates between 4 and 24 kbit/s are explicitly based upon a model of speech production and exploit properties of the hearing mechanism.
At the receiving side of the transmission system, speech quality is ensured by means of error correction (channel decoding), which is not within the scope of this book. In Figure 1.1, the (inner) channel coding/decoding as well as modulation/demodulation and transmission over the physical channel are modeled as an equivalent transmission channel (ETC). In spite of channel coding, quite frequently residual errors remain. The negative auditive effects of these errors can be mitigated by error concealment (EC) techniques. In many cases, these effects can be reduced by exploiting both residual source redundancy and information about the instantaneous quality of the transmission channel.
Finally, the decoded signal might be subjected to artificial bandwidth extension (BWE), which expands narrowband (0.3–3.4 kHz) to wideband (0.05–7.0 kHz) speech or wideband speech to superwideband (0.05–14.0 kHz). With the introduction of true wideband and superwideband speech audio coding into telephone networks, this step will be of significant importance as, for a long transition period, narrowband and wideband speech terminals will coexist.
At the receiving end (near-end), the perception of the decoded (and possibly bandwidth-extended) speech signal might be disturbed by acoustic background noise. The task of the last module in the transmission chain is to improve intelligibility or at least to reduce the listening effort. The received speech signal is modified, taking the near-end background noise into account, which can be captured with a microphone. This method is called near-end listening enhancement (NELE).

Some of these processing functions also find application in audio–visual conferencing devices and digital hearing aids.
The book is organized as follows. The first part, Fundamentals (Chapters 2–5), deals with models of speech production and hearing, spectral transformations, filter banks, and stochastic processes.

The second part, Speech Coding (Chapters 6–8), covers quantization and differential waveform coding, and especially the concepts of code-excited linear prediction (CELP) are discussed. Finally, some of the most relevant speech codec standards are presented. Recent developments such as the Adaptive Multi-Rate (AMR) codec or the Enhanced Voice Services (EVS) codec for cellular and IP communication are described.

The third part, Speech Enhancement (Chapters 9–15), is concerned with error concealment, bandwidth extension, near-end listening enhancement, single- and dual-channel noise and reverberation reduction, acoustic echo cancellation, and beamforming.
2 Models of Speech Production and Hearing

Digital speech communication systems are largely based on knowledge of speech production, hearing, and perception. In this chapter, we will discuss some fundamental aspects insofar as they are of importance for optimizing speech-processing algorithms such as speech coding, speech enhancement, or feature extraction for automatic speech recognition.

In particular, we will study the mechanism of speech production and the typical characteristics of speech signals. The digital speech production model will be derived from acoustical and physical considerations. The resulting all-pole model of the vocal tract is the key element of most of the current speech-coding algorithms and standards.

Furthermore, we will provide insights into the human auditory system and we will focus on perceptual fundamentals which can be exploited to improve the quality and the effectiveness of speech-processing algorithms to be discussed in later chapters. With respect to perception, the main aspects to be considered in digital speech transmission are the masking effect and the spectral resolution of the auditory system.

As a detailed discussion of the acoustic theory of speech production, phonetics, psychoacoustics, and perception is beyond the scope of this book, the reader is referred to the literature (e.g., [Fant 1970], [Flanagan 1972], [Rabiner, Schafer 1978], [Picket 1980], and [Zwicker, Fastl 2007]).
2.1 Sound Waves

Sound is a mechanical vibration that propagates through matter in the form of waves. Sound waves may be described in terms of a sound pressure field p(r, t) and a sound velocity vector field u(r, t), which are both functions of a spatial co-ordinate vector r and time t. While the sound pressure characterizes the density variations (we do not consider the DC component, also known as atmospheric pressure), the sound velocity describes the velocity of dislocation of the physical particles of the medium which carries the waves. This velocity is different from the speed c of the traveling sound wave.

In the context of our applications, i.e., sound waves in air, sound pressure p(r, t) and the resulting density variations ρ(r, t) are related by

\[ p(\mathbf{r}, t) = c^2 \, \rho(\mathbf{r}, t) \tag{2.1} \]
and also the relation between p(r, t) and u(r, t) may be linearized. Then, in the general case of three spatial dimensions, these two quantities are related via differential operators in an infinitesimally small volume of air particles as

\[ \nabla p(\mathbf{r}, t) = -\rho_0 \, \frac{\partial \mathbf{u}(\mathbf{r}, t)}{\partial t}, \qquad \frac{\partial p(\mathbf{r}, t)}{\partial t} = -\rho_0 \, c^2 \, \nabla \cdot \mathbf{u}(\mathbf{r}, t), \tag{2.2} \]

where c and ρ₀ are the speed of sound and the density at rest, respectively. These equations, also known as Euler's equation and continuity equation [Xiang, Blauert 2021], may be combined into the wave equation

\[ \Delta p(\mathbf{r}, t) = \frac{1}{c^2} \, \frac{\partial^2 p(\mathbf{r}, t)}{\partial t^2}, \tag{2.3} \]

where the Laplace operator Δ acts on the spatial co-ordinates, i.e., Δ = ∂²/∂x² + ∂²/∂y² + ∂²/∂z² in Cartesian co-ordinates.

A solution of the wave equation (2.3) is plane waves, which feature surfaces of constant sound pressure propagating in a given spatial direction. A harmonic plane wave of angular frequency ω which propagates in the positive x direction or the negative x direction may be written in complex notation as

\[ p_{\mathrm{f}}(x, t) = \hat{p}_{\mathrm{f}} \, e^{j(\omega t - \tilde{\kappa} x)} \tag{2.4} \]
\[ p_{\mathrm{b}}(x, t) = \hat{p}_{\mathrm{b}} \, e^{j(\omega t + \tilde{\kappa} x)}, \tag{2.5} \]

where κ̃ = ω/c = 2π/λ is the wavenumber, λ is the wavelength, and p̂_f, p̂_b are the (possibly complex-valued) amplitudes. Using (2.2), the x component of the sound velocity is then given by

\[ u_x(x, t) = \frac{1}{\rho_0 c} \left( \hat{p}_{\mathrm{f}} \, e^{j(\omega t - \tilde{\kappa} x)} - \hat{p}_{\mathrm{b}} \, e^{j(\omega t + \tilde{\kappa} x)} \right). \tag{2.6} \]

Thus, for a plane wave, the sound velocity is proportional to the sound pressure.
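For orientation, the spatial scale of such waves in air follows directly from κ̃ = 2π/λ. The short Python sketch below is an illustration added here (not part of the original text) and assumes c = 343 m/s; it evaluates wavelength and wavenumber for frequencies across the traditional telephone band:

```python
import math

C = 343.0  # assumed speed of sound in air, in m/s

for f in (300.0, 1000.0, 3400.0):
    lam = C / f                  # wavelength lambda = c / f, in m
    kappa = 2.0 * math.pi / lam  # wavenumber kappa = 2*pi/lambda, in rad/m
    print(f"f = {f:6.0f} Hz -> lambda = {100.0 * lam:6.1f} cm, "
          f"kappa = {kappa:5.1f} rad/m")
```

Speech wavelengths in this band thus range from roughly 10 cm to more than 1 m, a scale that becomes relevant later when microphone spacings and head dimensions are compared with λ.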
In our applications, waves which have a constant sound pressure on concentric spheres are also of interest. Indeed, the wave equation (2.3) delivers a solution for the spherical wave which propagates in radial direction r as

\[ p(r, t) = \frac{1}{r} \, f(r - ct), \tag{2.7} \]

where f is the propagating waveform. The amplitude of the sound wave diminishes with increasing distance from the source. We may then use the abstraction of a point source to explain the generation of such spherical waves.
An ideal point source may be represented by its source strength Q₀(t) [Xiang, Blauert 2021]. Furthermore, with (2.2) we have

\[ \rho_0 \, \frac{\partial u_r(r, t)}{\partial t} = -\frac{\partial p(r, t)}{\partial r} = \frac{1}{r^2} \, f(r - ct) - \frac{1}{r} \, f'(r - ct). \tag{2.8} \]

Then, the radial component of the velocity vector may be integrated over a sphere of radius r to yield Q₀(t) ≈ 4π r² u_r(r, t). For r → 0, the second term on the right-hand side of (2.8) is smaller than the first. Therefore, for an infinitesimally small sphere, we find with (2.8)

\[ f(-ct) \approx \frac{\rho_0}{4\pi} \, \frac{\mathrm{d} Q_0(t)}{\mathrm{d} t} \tag{2.9} \]

and, with (2.7), for any r

\[ p(r, t) \approx \frac{\rho_0}{4\pi r} \, \frac{\mathrm{d} Q_0(t)}{\mathrm{d} t} \bigg|_{t - r/c}, \tag{2.10} \]

which characterizes, again, a spherical wave. The sound pressure is inversely proportional to the radial distance r from the point source. For a harmonic excitation

\[ Q_0(t) = \hat{Q} \, e^{j\omega t}, \tag{2.11} \]
we find the sound pressure

\[ p(r, t) = \frac{j \omega \rho_0 \hat{Q}}{4 \pi r} \, e^{j(\omega t - \tilde{\kappa} r)} \tag{2.12} \]

and hence, with (2.8) and an integration with respect to time, the sound velocity

\[ u_r(r, t) = \frac{p(r, t)}{\rho_0 c} \left( 1 + \frac{1}{j \tilde{\kappa} r} \right). \tag{2.13} \]

Clearly, (2.12) and (2.13) satisfy (2.8). Because of the second term in the parentheses in (2.13), sound pressure and sound velocity are not in phase. Depending on the distance of the observation point to the point source, the behavior of the wave is distinctly different. When the second term cannot be neglected, the observation point is in the near field of the source. For κ̃r ≫ 1, the observation point is in the far field. The transition from the near field to the far field depends on the wavenumber κ̃ and, as such, on the wavelength or the frequency of the harmonic excitation.
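To get a feeling for where this transition occurs, the following Python sketch (an illustration added here, not part of the original text) evaluates the distance at which κ̃r = 1, i.e., r = λ/(2π), for a few frequencies, again assuming c = 343 m/s:

```python
import math

C = 343.0  # assumed speed of sound in air, in m/s

def transition_distance(f_hz: float) -> float:
    """Distance r at which kappa * r = (2*pi*f/c) * r equals 1.

    For kappa * r >> 1 the observation point is in the far field;
    r = lambda / (2*pi) marks the transition region.
    """
    return C / (2.0 * math.pi * f_hz)

for f in (100.0, 500.0, 1000.0, 4000.0):
    r_cm = 100.0 * transition_distance(f)
    print(f"f = {f:6.0f} Hz -> kappa*r = 1 at r = {r_cm:4.1f} cm")
```

At 100 Hz the near field extends over roughly half a meter, whereas at 4 kHz it is confined to about 1.4 cm; low-frequency sources must therefore be observed at comparatively large distances before the far-field approximation holds.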
2.2 Organs of Speech Production

The production of speech sounds involves the manipulation of an air stream. The acoustic representation of speech is a sound pressure wave originating from the physiological speech production system. A simplified schematic of the human speech organs is given in Figure 2.1. The main components and their functions are:

● lungs: the energy generator,
● trachea: for energy transport,
● larynx with vocal cords: the signal generator, and
● vocal tract with pharynx, oral and nasal cavities: the acoustic filter.
By contraction, the lungs produce an air flow which is modulated by the larynx, processed by the vocal tract, and radiated via the lips and the nostrils. The larynx provides several biological and sound production functions. In the context of speech production, its purpose is to control the stream of air that enters the vocal tract via the vocal cords.

Figure 2.1 Organs of speech production.

Speech sounds are produced by means of various mechanisms. Voiced sounds are produced when the air flow is interrupted periodically by the movements (vibration) of the vocal cords (see Figure 2.2). This self-sustained oscillation, i.e., the repeated opening and closing of the vocal cords, can be explained by the so-called Bernoulli effect as in fluid dynamics: as air flow velocity increases, local pressure decreases. At the beginning of each cycle, the area between the vocal cords, which is called the glottis, is almost closed by means of appropriate tension of the vocal cords. Then an increased air pressure builds up below the glottis, forcing the vocal cords to open. As the vocal cords diverge, the velocity of the air flowing through the glottis increases steadily, which causes a drop in the local pressure. Then, the vocal cords snap back to their initial position and the next cycle can start if the air flow from the lungs and the tension of the vocal cords are sustained. Due to the abrupt periodic interruptions of the glottal air flow, as schematically illustrated in Figure 2.2, the resulting excitation (pressure wave) of the vocal tract has a fundamental frequency of f₀ = 1/T₀ and a large number of harmonics. These are spectrally shaped according to the frequency response of the acoustic vocal tract. The duration T₀ of a single cycle is called the pitch period.
Unvoiced sounds are generated by a constriction at the open glottis or along the vocal tract causing a nonperiodic turbulent air flow.

Plosive sounds (also known as stops) are caused by building up the air pressure behind a complete constriction somewhere in the vocal tract, followed by a sudden opening. The released air flow may create a voiced or an unvoiced sound or even a mixture of both, depending on the actual constellation of the articulators.

Figure 2.2 Glottal air flow during voiced sounds.

The vocal tract can be subdivided into three sections: the pharynx, the oral cavity, and the nasal cavity. As the entrance to the nasal cavity can be closed by the velum, a distinction is often made in the literature between the nasal tract (from velum to nostrils) and the other two sections (from trachea to lips, including the pharynx cavity). In this chapter, we will define the vocal tract as a variable acoustic resonator including the nasal cavity with the velum either open or closed, depending on the specific sound to be produced. From the engineering point of view, the resonance frequencies are varied by changing the size and the shape of the vocal tract using different constellations and movements of the articulators, i.e., tongue, teeth, lips, velum, lower jaw, etc. Thus, humans can produce a variety of different sounds based on different vocal tract constellations and different acoustic excitations.
Finally, the acoustic waves carrying speech sounds are radiated via the mouth and head. In a first approximation, we may model the radiating head as a spherical source in free space. The (complex-valued) acoustic load Z_L(r_H) at the lips may then be approximated by the radiation load of a spherical source of radius r_H, where r_H represents the head radius. Following (2.13), this load exhibits a high-pass characteristic,

\[ Z_L(r_H) = \rho_0 c \, \frac{j \omega \, r_H / c}{1 + j \omega \, r_H / c}, \tag{2.14} \]

where ω = 2πf denotes the angular frequency and c is the speed of sound. This model suggests an acoustic "short circuit" at very low frequencies, i.e., little acoustic radiation at low frequencies, which is also supported by measurements [Flanagan 1960]. For an assumed head radius of r_H = 8.5 cm and c = 343 m/s, the 3-dB cutoff frequency (f_c = c/(2π r_H)) is about 640 Hz.
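As a quick numerical check of this model, the Python fragment below (a sketch added for illustration; the first-order high-pass form and the constants follow the text above) evaluates the 3-dB cutoff and the normalized magnitude of Z_L:

```python
import math

C = 343.0    # speed of sound in m/s, as assumed in the text
R_H = 0.085  # head radius in m, as assumed in the text

# 3-dB cutoff of the first-order high-pass in (2.14): kappa * r_H = 1
f_c = C / (2.0 * math.pi * R_H)
print(f"3-dB cutoff frequency: {f_c:.0f} Hz")  # approx. 642 Hz

def normalized_load(f_hz: float) -> float:
    """|Z_L| / (rho_0 * c) for the spherical-source radiation model."""
    x = 2.0 * math.pi * f_hz * R_H / C  # = kappa * r_H
    return x / math.sqrt(1.0 + x * x)

for f in (100.0, 642.0, 3000.0):
    print(f"f = {f:6.0f} Hz -> |Z_L|/(rho_0 c) = {normalized_load(f):.2f}")
```

The magnitude rises from about 0.15 at 100 Hz toward 1 (the plane-wave value ρ₀c) at high frequencies, i.e., the lips radiate low frequencies poorly.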
2.3 Characteristics of Speech Signals

Most languages can be described as a set of elementary linguistic units, which are called phonemes. A phoneme is defined as the smallest unit which differentiates the meaning of two words in one language. The acoustic representation associated with a phoneme is called a phone. American English, for instance, consists of about 42 phonemes, which are subdivided into four classes:

Vowels are voiced and belong to the speech sounds with the largest energy. They exhibit a quasiperiodic time structure caused by oscillation of the vocal cords. The duration varies from 40 to 400 ms. Vowels can be distinguished by the time-varying resonance characteristics of the vocal tract. The resonance frequencies are also called formant frequencies. Examples: /a/ as in "father" and /i/ as in "eve."

Diphthongs involve a gliding transition of the articulators from one vowel to another vowel. Examples: /oU/ as in "boat" and /ju/ as in "you."

Approximants are a group of voiced phonemes for which the air stream escapes through a relatively narrow aperture in the vocal tract. They can, thus, be regarded as intermediate between vowels and consonants [Gimson, Cruttenden 1994]. Examples: /w/ in "wet" and /r/ in "ran."

Consonants are produced with stronger constriction of the vocal tract than vowels. All kinds of excitation can be observed. Consonants are subdivided into nasals, stops, fricatives, aspirates, and affricatives. Examples of these five subclasses: /m/ as in "more," /t/ as in "tea," /f/ as in "free," /h/ as in "hold," and /tʃ/ as in "chase."
Each of these classes may be further divided into subclasses, which are related to the interaction of the articulators within the vocal tract. The phonemes can further be classified as either continuant (excitation of a more or less non-time-varying vocal tract) or noncontinuant (rapid vocal tract changes). The class of continuant sounds consists of vowels and fricatives (voiced and unvoiced). The noncontinuant sounds are represented by diphthongs, semivowels, stops, and affricates.

For the purpose of speech-signal processing, specific articulatory and phonetic aspects are not as important as the typical characteristics of the waveforms, namely, the basic categories:

● voiced,
● unvoiced,
● mixed voiced/unvoiced,
● plosive, and
● silence.
Voiced sounds are characterized by their fundamental frequency, i.e., the frequency of vibration of the vocal cords, and by the specific pattern of amplitudes of the spectral harmonics.

In the speech signal processing literature, the fundamental frequency is often called pitch and the respective period is called pitch period. It should be noted, however, that in psychoacoustics the term pitch is used differently, i.e., for the perceived fundamental frequency of a sound, whether or not that frequency is actually present in the waveform (e.g., [Deller Jr. et al. 2000]). The fundamental frequency of young men ranges from 85 to 155 Hz and that of young women from 165 to 255 Hz [Fitch, Holbrook 1970]. Fundamental frequency, also in combination with vocal tract length, is indicative of sex, age, and size of the speaker [Smith, Patterson 2005].

Unvoiced sounds are determined mainly by their characteristic spectral envelopes. Voiced and unvoiced excitation do not exclude each other. They may occur simultaneously, e.g., in fricative sounds.

The distinctive feature of plosive sounds is the dynamically transient change of the vocal tract. Immediately before the transition, a total constriction in the vocal tract stops sound radiation from the lips for a short period. There might be a small amount of low-frequency components radiated through the throat. Then, the sudden change with release of the constriction produces a plosive burst.
Some typical speech waveforms are shown in Figure 2.3.

Figure 2.3 Characteristic waveforms of speech signals: (a) voiced (vowel with transition to voiced consonant); (b) unvoiced (fricative); (c) transition: pause – plosive – vowel.

2.4 Model of Speech Production

The purpose of developing a model of speech production is not to obtain an accurate description of the anatomy and physiology of human speech production but rather to achieve a simplifying mathematical representation for reproducing the essential characteristics of speech signals.

In analogy to the organs of human speech production as discussed in Section 2.2, it seems reasonable to design a parametric two-stage model consisting of an excitation source and a vocal tract filter, see also [Rabiner, Schafer 1978], [Parsons 1986], [Quatieri 2001], [Deller Jr. et al. 2000]. The resulting digital source-filter model, as illustrated in Figure 2.4, will be derived below.
The model consists of two components:

● the excitation source, featuring mainly the influence of the lungs and the vocal cords (voiced, unvoiced, mixed), and
● the time-varying digital vocal tract filter, approximating the behavior of the vocal tract (spectral envelope and dynamic transitions).

In the first and simple model, the excitation generator only has to deliver either white noise or a periodic sequence of pitch pulses for synthesizing unvoiced and voiced sounds, respectively, whereas the vocal tract is modeled as a time-varying discrete-time filter.
Figure 2.4 Digital source-filter model: an excitation source (controlled by source parameters) feeds a time-varying digital vocal tract filter (controlled by filter parameters); v(k) and x(k) denote the excitation and the synthesized speech signal, respectively.
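A minimal discrete-time sketch of this source-filter model is given below, added for illustration. The signal names follow the caption of Figure 2.4, while the pitch, the pole frequencies, and the pole radii are arbitrary example values, and the filter is kept time-invariant for brevity:

```python
import numpy as np
from scipy.signal import lfilter

FS = 8000  # sampling rate in Hz

def excitation(voiced: bool, n: int, f0: float = 100.0) -> np.ndarray:
    """Excitation v(k): pitch-pulse train (voiced) or white noise (unvoiced)."""
    if voiced:
        v = np.zeros(n)
        v[:: round(FS / f0)] = 1.0  # one pulse per pitch period T0 = 1/f0
        return v
    return np.random.randn(n)

# All-pole vocal tract filter 1/A(z): two example resonances near
# 500 Hz and 1500 Hz, realized as complex-conjugate pole pairs.
poles = []
for f_res, radius in ((500.0, 0.97), (1500.0, 0.95)):
    w = 2.0 * np.pi * f_res / FS
    poles.extend([radius * np.exp(1j * w), radius * np.exp(-1j * w)])
a = np.poly(poles).real  # denominator coefficients A(z); numerator is 1

v = excitation(voiced=True, n=FS // 4)  # 250 ms of voiced excitation
x = lfilter([1.0], a, v)                # synthetic speech signal x(k)
print(f"synthesized {len(x)} samples, peak amplitude {np.max(np.abs(x)):.2f}")
```

In an actual coder, the filter coefficients and the excitation parameters are updated every few tens of milliseconds to follow the time-varying vocal tract.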
2.4.1 Acoustic Tube Model of the Vocal Tract

The digital source-filter model of Figure 2.4, especially the vocal tract filter, will be derived from the physics of sound propagation inside an acoustic tube. To estimate the necessary filter order, we start with the extremely simplifying physical model of Figure 2.5. According to this simplistic model, the pharynx and oral cavities are represented by a lossless tube with constant cross section and the nasal cavity by a second tube which can be closed by the velum. The length of L = 17 cm corresponds to the average length of the vocal tract of a male adult. The tube is (almost) closed at the glottis side and open at the lips.

In the case of a nonnasal sound, the velum is closed. Then, the wavelength λᵢ of each resonance frequency of the main tube from the glottis to the lips fulfills the standing wave condition

\[ L = (2i - 1) \, \frac{\lambda_i}{4}, \quad i = 1, 2, 3, \ldots \tag{2.15} \]

For L = 17 cm, we compute the resonance frequencies

\[ f_i = \frac{c}{\lambda_i} = (2i - 1) \, \frac{c}{4L} = (2i - 1) \cdot 500 \, \text{Hz}, \tag{2.16} \]

where the speed of sound is given by c = 340 m/s.
Taking (2.16) into account as well as the fact that the conventional narrowband telephone (NB) service has a frequency range of about 200–3400 Hz, and that the wideband telephone (WB) service covers a frequency range from 50 to 7000 Hz, we have to consider only four (NB) and eight resonances (WB) of the vocal tract model, respectively. As the acoustical bandwidth of speech is wider than 3400 Hz and even wider than 7000 Hz, lowpass filtering with a finite transition width from passband to stopband is required as part of analog-to-digital conversion. Thus, the sampling rate for telephone speech is either 8 kHz (NB) or 16 kHz (WB), and the overall filter order for synthesizing telephone speech is roughly only n = 8 or n = 16. Each resonance frequency corresponds to a pole pair or second-order filter section. As a rule of thumb, we can state the need for "one resonance per kHz."
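The rule of thumb can be reproduced directly from (2.16); the short Python sketch below (added for illustration) lists the first eight resonances of the 17 cm tube:

```python
C = 340.0  # speed of sound in m/s, as in the text
L = 0.17   # tube length in m (average male vocal tract)

# Quarter-wavelength resonances of a tube closed at the glottis and
# open at the lips: f_i = (2i - 1) * c / (4L), cf. (2.16).
for i in range(1, 9):
    f_i = (2 * i - 1) * C / (4.0 * L)
    print(f"F{i} = {f_i:4.0f} Hz")
```

This yields 500, 1500, 2500, ..., 7500 Hz: four resonances below 4 kHz (narrowband sampling at 8 kHz) and eight below 8 kHz (wideband sampling at 16 kHz), i.e., one resonance per kHz.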
Figure 2.5 Simplified physical model of the vocal tract.

In the second step, we improve our acoustic tube model, as shown in Figure 2.6. For simplicity, the nasal cavity is not considered (velum is closed). The cylindrical lossless tube