Source Separation and Machine Learning

Jen-Tzung Chien
National Chiao Tung University
Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1650, San Diego, CA 92101, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom

Copyright © 2019 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-0-12-817796-9

For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Mara Conner
Acquisition Editor: Tim Pitts
Editorial Project Manager: John Leonard
Production Project Manager: Surya Narayanan Jayachandran
Designer: Mark Rogers

Typeset by VTeX
List of Figures

Fig. 1.1 Cocktail party problem with three speakers and three microphones. 4
Fig. 1.2 A general linear mixing system with n observations and m sources. 5
Fig. 1.3 An illustration of monaural source separation with three source signals. 7
Fig. 1.4 An illustration of singing voice separation. 8
Fig. 1.5 Speech recognition system in a reverberant environment. 9
Fig. 1.6 Characteristics of room impulse response. 9
Fig. 1.7 Categorization of various methods for speech dereverberation. 10
Fig. 1.8 Categorization of various applications for source separation. 12
Fig. 1.9 Blind source separation for electroencephalography artifact removal. 14
Fig. 1.10 Challenges in front-end processing and back-end learning for audio source separation. 18
Fig. 2.1 Evolution of different source separation methods. 21
Fig. 2.2 ICA learning procedure for finding a demixing matrix W. 24
Fig. 2.3 Illustration for nonnegative matrix factorization X ≈ BW. 25
Fig. 2.4 Nonnegative matrix factorization shown as a summation of rank-one nonnegative matrices. 26
Fig. 2.5 Supervised learning for single-channel source separation using nonnegative matrix factorization in presence of a speech signal Xs and a music signal Xm. 27
Fig. 2.6 Unsupervised learning for single-channel source separation using nonnegative matrix factorization in presence of two sources. 27
Fig. 2.7 Illustration of multiway observation data. 33
Fig. 2.8 A tensor data which is composed of three ways of time, frequency and channel. 34
Fig. 2.9 Tucker decomposition for a three-way tensor. 34
Fig. 2.10 CP decomposition for a three-way tensor. 37
Fig. 2.11 Multilayer perceptron with one input layer, one hidden layer, and one output layer. 38
Fig. 2.12 Comparison of activations using ReLU function, logistic sigmoid function, and hyperbolic tangent function. 39
Fig. 2.13 Calculations in error backpropagation algorithm: (A) forward pass and (B) backward pass. The propagations are shown by arrows. In the forward pass, activations a_t and outputs z_t in individual nodes are calculated and propagated. In the backward pass, local gradients δ_t in individual nodes are calculated and propagated. 41
Fig. 2.14 Procedure in the error backpropagation algorithm: (A) compute the error function, (B) propagate the local gradient from the output layer L to (C) hidden layer L−1 and (D) until input layer 1. 42
Fig. 2.15 A two-layer structure in restricted Boltzmann machine. 43
Fig. 2.16 A stack-wise training procedure for deep belief network. This model is further used as a pretrained model for deep neural network optimization. 44
Fig. 2.17 A recurrent neural network with single hidden layer. 45
Fig. 2.18 Backpropagation through time with a single hidden layer and two time steps τ = 2. 46
Fig. 2.19 A deep recurrent neural network for single-channel source separation in the presence of two source signals. 49
Fig. 2.20 A procedure of single-channel source separation based on DNN or RNN where two source signals are present. 49
Fig. 2.21 Preservation of gradient information through a gating mechanism. 50
Fig. 2.22 A detailed view of long short-term memory. Black circle means multiplication. Recurrent state z_{t−1} and cell c_{t−1} in a previous time step are shown by dashed lines. 51
Fig. 3.1 Portrait of Thomas Bayes (1701–1761) who is credited for the Bayes theorem. 58
Fig. 3.2 Illustration of a scalable model in presence of increasing number of training data. 59
Fig. 3.3 Optimal weights w estimated by minimizing the ℓ2- and ℓ1-regularized objective functions which are shown on the left and right, respectively. 62
Fig. 3.4 Laplace distribution centered at zero with different control parameters λ. 63
Fig. 3.5 Comparison of two-dimensional Gaussian distribution (A) and Student's t-distribution (B) with zero mean. 64
Fig. 3.6 Scenario for a time-varying source separation system with (A) one male, one female and one music player. (B) Then, the male is moving to a new location. (C) After a while, the male disappears and the female is replaced by a new one. 66
Fig. 3.7 Sequential updating of parameters and hyperparameters for online Bayesian learning. 68
Fig. 3.8 Evolution of different inference algorithms for construction of latent variable models. 75
Fig. 3.9 Illustration for (A) decomposing a log-likelihood into a KL divergence and a lower bound, (B) updating the lower bound by setting KL divergence to zero with new distribution q(Z), and (C) updating lower bound again by using new parameters. Adapted from Bishop (2006). 80
Fig. 3.10 Illustration of updating lower bound of the log-likelihood function in EM iteration from τ to τ+1. Adapted from Bishop (2006). 81
Fig. 3.11 Bayes relation among posterior distribution, likelihood function, prior distribution and evidence function. 83
Fig. 3.12 Illustration of seeking an approximate distribution q(Z) which is factorizable and maximally similar to the true posterior p(Z|X). 86
Fig. 3.13 Illustration of minimizing KL divergence via maximization of variational lower bound. KL divergence is reduced from left to right during the optimization procedure. 88
Fig. 4.1 Construction of the eigenspace and independent space for finding the adapted models from a set of seed models by using enrollment data. 102
Fig. 4.2 Comparison of kurtosis values of the estimated eigenvoices and independent voices. 104
Fig. 4.3 Comparison of BIC values of using PCA with eigenvoices and ICA with independent voices. 105
Fig. 4.4 Comparison of WERs (%) of using PCA with eigenvoices and ICA with independent voices where the numbers of basis vectors (K) and the numbers of adaptation sentences (L) are changed. 105
Fig. 4.5 Taxonomy of contrast functions for optimization in independent component analysis. 107
Fig. 4.6 ICA transformation and k-means clustering for speech recognition with multiple hidden Markov models. 115
Fig. 4.7 Generation of supervectors from different acoustic segments of aligned utterances. 116
Fig. 4.8 Taxonomy of ICA contrast functions where different realizations of mutual information are included. 119
Fig. 4.9 Illustration of a convex function f(x). 119
Fig. 4.10 Comparison of (A) a number of divergence measures and (B) α-DIV and C-DIV for various α under different joint probability P_{y1,y2}(A, A). 122
Fig. 4.11 Divergence measures of demixed signals versus the parameters of demixing matrix θ1 and θ2. KL-DIV, C-DIV at α = 1 and C-DIV at α = −1 are compared. 128
Fig. 4.12 Divergence measures versus the number of learning iterations. KL-ICA and C-ICA at α = 1 and α = −1 are evaluated. The gradient descent (GD) and natural gradient (NG) algorithms are compared. 129
Fig. 4.13 Comparison of SIRs of three demixed signals in the presence of an instantaneous mixing condition. Different ICA algorithms are evaluated. 130
Fig. 4.14 Comparison of SIRs of three demixed signals in the presence of instantaneous mixing condition with additive noise. Different ICA algorithms are evaluated. 131
Fig. 4.15 Integrated scenario of a time-varying source separation system from t to t+1 and t+2 as shown in Fig. 3.6. 133
Fig. 4.16 Graphical representation for nonstationary Bayesian ICA model. 136
Fig. 4.17 Comparison of speech waveforms for (A) source signal 1 (blue in web version or dark gray in print version), source signal 2 (red in web version or light gray in print version) and two mixed signals (black) and (B) two demixed signals 1 (blue in web version or dark gray in print version) and two demixed signals 2 (red in web version or light gray in print version) by using NB-ICA and OLGP-ICA algorithms. 143
Fig. 4.18 Comparison of variational lower bounds by using NB-ICA with (blue in web version or dark gray in print version) and without (red in web version or light gray in print version) adaptation of ARD parameter. 144
Fig. 4.19 Comparison of the estimated ARD parameters of source signal 1 (blue in web version or dark gray in print version) and source signal 2 (red in web version or light gray in print version) using NB-ICA. 144
Fig. 4.20 Evolution from NB-ICA to OLGP-ICA for nonstationary source separation. 146
Fig. 4.21 Graphical representation for online Gaussian process ICA model. 150
Fig. 4.22 Comparison of square errors of the estimated mixing coefficients by using NS-ICA (black), SMC-ICA (pink in web version or light gray in print version), NB-ICA (blue in web version or dark gray in print version) and OLGP-ICA (red in web version or mid gray in print version). 157
Fig. 4.23 Comparison of absolute errors of temporal predictabilities between true and demixed source signals where different ICA methods are evaluated. 158
Fig. 4.24 Comparison of signal-to-interference ratios of demixed signals where different ICA methods are evaluated. 159
Fig. 5.1 Graphical representation for Bayesian speech dereverberation. 167
Fig. 5.2 Graphical representation for Gaussian–Exponential Bayesian nonnegative matrix factorization. 184
Fig. 5.3 Graphical representation for Poisson–Gamma Bayesian nonnegative matrix factorization. 187
Fig. 5.4 Graphical representation for Poisson–Exponential Bayesian nonnegative matrix factorization. 190
Fig. 5.5 Implementation procedure for supervised speech and music separation. 197
Fig. 5.6 Histogram of the estimated number of bases (K) using PE-BNMF for source signals of speech, piano and violin. 198
Fig. 5.7 Comparison of the averaged SDR using NMF with a fixed number of bases (1–5 pairs of bars), PG-BNMF with the best fixed number of bases (sixth pair of bars), PG-BNMF with adaptive number of bases (seventh pair of bars), PE-BNMF with the best fixed number of bases (eighth pair of bars) and PE-BNMF with adaptive number of bases (ninth pair of bars). 198
Fig. 5.8 Implementation procedure for unsupervised singing voice separation. 199
Fig. 5.9 Comparison of GNSDR of the separated singing voices at different SMRs using PE-BNMFs with K-means clustering (denoted by BNMF1), NMF clustering (denoted by BNMF2) and shifted NMF clustering (denoted by BNMF3). Five competitive methods are included for comparison. 201
Fig. 5.10 Illustration for group basis representation. 204
Fig. 5.11 Comparison of Gaussian, Laplace and LSM distributions centered at zero. 207
Fig. 5.12 Graphical representation for Bayesian group sparse nonnegative matrix factorization. 208
Fig. 5.13 Spectrograms of "music5" containing the drum signal (first panel), the saxophone signal (second panel), the mixed signal (third panel), the demixed drum signal (fourth panel) and the demixed saxophone signal (fifth panel). 217
Fig. 5.14 Illustration for layered nonnegative matrix factorization. 221
Fig. 5.15 Illustration for discriminative layered nonnegative matrix factorization. 224
Fig. 6.1 Categorization and evolution for different tensor factorization methods. 231
Fig. 6.2 Procedure of producing the modulation spectrograms for tensor factorization. 232
Fig. 6.3 (A) Time resolution and (B) frequency resolution of a mixed audio signal. 235
Fig. 6.4 Comparison between (A) nonnegative matrix factorization and (B) positive semidefinite tensor factorization. 253
Fig. 7.1 Categorization of different deep learning methods for source separation. 260
Fig. 7.2 A deep neural network for single-channel speech separation, where x_t denotes the input features of the mixed signal at time step t, z_t^{(l)} denotes the features in the hidden layer l, y_{1,t} and y_{2,t} denote the mask functions for source one and source two, and x_{1,t} and x_{2,t} are the estimated signals for source one and source two, respectively. 261
Fig. 7.3 Illustration of stacking in deep ensemble learning for speech separation. 267
Fig. 7.4 Speech segregation procedure by using deep neural network. 268
Fig. 7.5 A deep recurrent neural network for speech dereverberation. 279
Fig. 7.6 Factorized features in spectral and temporal domains for speech dereverberation. 283
Fig. 7.7 Comparison of SDR, SIR and SAR of the separated signals by using NMF, DRNN, DDRNN-bw and DDRNN-diff. 291
Fig. 7.8 Vanishing gradients in a recurrent neural network. Degree of lightness means the level of vanishing in the gradient. This figure is a counterpart to Fig. 2.21 where vanishing gradients are mitigated by a gating mechanism. 292
Fig. 7.9 A simplified view of long short-term memory. The red (light gray in print version) line shows recurrent state z_t. The detailed view was provided in Fig. 2.22. 293
Fig. 7.10 Illustration of preserving gradients in long short-term memory. There are three gates in a memory block. ◦ means gate opening while • denotes gate closing. 293
Fig. 7.11 A configuration of four stacked long short-term memory layers along three time steps for monaural speech separation. Dashed arrows indicate the modeling of the same LSTM across time steps while solid arrows mean the modeling of different LSTMs in deep structure. 294
Fig. 7.12 Illustration of bidirectional recurrent neural network for monaural source separation. Two hidden layers are configured to learn bidirectional features for finding soft mask functions for two sources. Forward and backward directions are shown in different colors. 296
Fig. 7.13 (A) Encoder and decoder in a variational autoencoder. (B) Graphical representation for a variational autoencoder. 300
Fig. 7.14 Graphical representation for (A) recurrent neural network and (B) variational recurrent neural network. 301
Fig. 7.15 Inference procedure for variational recurrent neural network. 302
Fig. 7.16 Implementation topology for variational recurrent neural network. 305
Fig. 7.17 Comparison of SDR, SIR and SAR of the separated signals by using NMF, DNN, DRNN, DDRNN and VRNN. 306
Fig. 7.18 (A) Single-channel source separation with dynamic state Z_t = {z_t, r_t, w_{r,t}, w_{w,t}} in recurrent layer L−1. (B) Recurrent layers are driven by a cell c_t and a controller for memory M_t where dashed line denotes the connection between cell and memory at previous time step and bold lines denote the connections with weights. 308
Fig. 7.19 Four steps of addressing procedure which is driven by parameters {k_t, β_t, g_t, s_t, γ_t}. 310
Fig. 7.20 An end-to-end memory network for monaural source separation containing a bidirectional LSTM on the left as an encoder, an LSTM on the right as a decoder and an LSTM on the top as a separator. 315
List of Tables

Table 2.1 Comparison of NMF updating rules based on different learning objectives. 31
Table 2.2 Comparison of updating rules of standard NMF and sparse NMF based on learning objectives of squared Euclidean distance and Kullback–Leibler divergence. 32
Table 3.1 Comparison of approximate inference methods using variational Bayesian and Gibbs sampling. 95
Table 4.1 Comparison of syllable error rates (SERs) (%) with and without HMM clustering and ICA learning. Different ICA algorithms are evaluated. 117
Table 4.2 Comparison of signal-to-interference ratios (SIRs) (dB) of mixed signals without ICA processing and with ICA learning based on MMI and NLR contrast functions. 117
Table 4.3 Comparison of different divergence measures with respect to symmetric divergence, convexity parameter, combination weight and special realization. 121
Table 5.1 Comparison of multiplicative updating rules of standard NMFD and sparse NMFD based on the objective functions of squared Euclidean distance and KL divergence. 164
Table 5.2 Comparison of multiplicative updating rules of standard NMF2D and sparse NMF2D based on the objective functions of squared Euclidean distance and KL divergence. 164
Table 5.3 Comparison of using different methods for speech dereverberation under various test conditions (near, near microphone; far, far microphone; sim, simulated data; real, real recording) in terms of evaluation metrics of CD, LLR, FWSegSNR and SRMR (dB). 173
Table 5.4 Comparison of different Bayesian NMFs in terms of inference algorithm, closed-form solution and optimization theory. 195
Table 5.5 Comparison of GNSDR (dB) of the separated singing voices (V) and music accompaniments (M) using NMF with fixed number of bases K = 10, 20 and 30 and PE-BNMF with adaptive K. Three clustering algorithms are evaluated. 200
Table 5.6 Comparison of NMF and different BNMFs in terms of SDR and GNSDR for two separation tasks. Standard deviation is given in the parentheses. 201
Table 5.7 Comparison of SIRs (in dB) of the reconstructed rhythmic signal (denoted by R) and harmonic signal (denoted by H) based on NMF, BNMF, GNMF and BGS-NMF. Six mixed music signals are investigated. 218
Table 5.8 Performance of speech separation by using NMF, LNMF and DLNMF in terms of SDR, SIR and SAR (dB). 228
Table 6.1 Comparison of multiplicative updating rules of NMF2D and NTF2D based on the objective functions of squared Euclidean distance and Kullback–Leibler divergence. 241
Table 7.1 Comparison of using different models for speech dereverberation in terms of SRMR and PESQ (in dB) under the condition of using simulated data and near microphone. 284
Table 7.2 Comparison of STOIs using DNN, LSTM and NTM under different SNRs with seen speakers. 313
Table 7.3 Comparison of STOIs using DNN, LSTM and NTM under different SNRs with unseen speakers. 313
Table 7.4 Comparison of STOIs under different SNRs by using DNN, LSTM and different variants of NTM and RCNN. 319
Foreword

With the use of Deep Neural Networks (DNNs) and Recurrent Neural Networks (RNNs), speech recognition performance has recently improved rapidly, and speech recognition has been widely used in smart speakers and smartphones. Speech recognition performance for speech uttered in a quiet environment has become quite close to human performance. However, in situations where there is noise in the surroundings or room reverberation, speech recognition performance falls far short of human performance.

Noise and other persons' voices are almost always superimposed on the voice in the house, office, conference room, etc. People can naturally extract and hear the conversation of interested persons, even while many people are chatting, as at a cocktail party. This is known as the cocktail party effect. To automatically recognize the target voice, it is necessary to develop a technology for separating the voice uttered in the actual environment from the surrounding noise and removing the influence of the room reverberation. When analyzing and processing music signals, it is also required to separate overlapped sound source signals.

Sound source separation is, therefore, a very important technology in a wide range of signal processing, particularly speech, sound, and music signal processing. A variety of research has been conducted so far, but the performance of current sound source separation technology falls far short of human capability. This is one of the major reasons why speech recognition performance in a general environment does not reach human performance.

Generally, speech and audio signals are recorded with one or more microphones. For this reason, sound source separation technology can be classified into monaural sound source separation and multichannel sound source separation. This book focuses on blind source separation (BSS), which is the process of separating a set of source signals from a set of mixed signals without the aid of information, or with very little information, about the source signals or the mixing process. It is rare that information on the sound source signal to be separated is obtained beforehand, so it is important to be able to separate the sound source without such information. This book also addresses various challenging issues, covering single-channel source separation, where the multiple source signals from a single mixed signal are learned in a supervised way, as well as speaker- and noise-independent source separation, where a large set of training data is available to learn a generalized model.

In response to the growing need for performance improvement in speech recognition, research on BSS has been rapidly advanced in recent years based on various machine learning and signal processing technologies. This book describes state-of-the-art machine learning approaches for model-based BSS for speech recognition, speech separation, instrumental music separation, singing voice separation, music information retrieval, brain signal separation and image processing.

The model-based techniques, combining various signal processing and machine learning techniques, range from linear to nonlinear models. Major techniques include: Independent Component Analysis (ICA), Nonnegative Matrix Factorization (NMF), Nonnegative Tensor Factorization (NTF), Deep Neural Network (DNN), and Recurrent Neural Network (RNN). The rapid progress in the last few years is largely due to the use of DNNs and RNNs.

This book is unique in that it covers topics from the basic signal processing theory concerning BSS to the technology using DNNs and RNNs in recent years. At the end of this book, the direction of future research is also described. This landmark book is very useful as a student's textbook and researcher's reference book. The readers will become articulate in the state of the art of model-based BSS. I would like to encourage many students and researchers to read this book to make big contributions to the future development of technology in this field.

Sadaoki Furui
President, Toyota Technological Institute at Chicago
Professor Emeritus, Tokyo Institute of Technology
Preface

In general, blind source separation (BSS) is known as a rapidly emerging and promising area which involves extensive knowledge of signal processing and machine learning. This book introduces state-of-the-art machine learning approaches for model-based BSS with applications to speech recognition, speech separation, instrumental music separation, singing voice separation, music information retrieval, brain signal separation and image processing. The traditional BSS approaches based on independent component analysis were designed to resolve the mixing system by optimizing a contrast function or an independence measure. The underdetermined problem in the presence of more sources than sensors may not be carefully tackled. The contrast functions may not flexibly and faithfully measure the independence for an optimization with convergence. Assuming a static mixing condition, one cannot catch the underlying dynamics in source signals and sensor networks. The uncertainty of system parameters may not be precisely characterized, so that the robustness against adverse environments is not guaranteed. The temporal structures in mixing systems as well as source signals may not be properly captured. The model complexity or the dictionary size may not be fitted to the true one in the source signals. With the remarkable advances in machine learning algorithms, the issues of underdetermined mixtures, optimization of contrast functions, nonstationary mixing conditions, multidimensional decomposition, ill-posed conditions and model regularization have been resolved by introducing the solutions of nonnegative matrix factorization, information-theoretic learning, online learning, Gaussian processes, sparse learning, dictionary learning, Bayesian inference, model selection, tensor decomposition, deep neural networks, recurrent neural networks and memory networks. This book will present how these algorithms are connected and why they work for source separation, particularly in speech, audio and music applications. We start with a survey of BSS applications and model-based approaches. The fundamental theories, including statistical learning, optimization algorithms, information theory, Bayesian learning, variational inference and Markov chain Monte Carlo inference, will be addressed. A series of case studies are then introduced to deal with different issues in model-based BSS. These case studies are categorized into independent component analysis, nonnegative matrix factorization, nonnegative tensor factorization and deep neural networks, ranging from a linear to a nonlinear model, from single-way to multiway processing, and from a shallow feedforward model to a deep recurrent model. At last, we will point out a number of directions and outlooks for future studies.

This book is written as a textbook with fundamental theories and advanced technologies developed in the last decade. It is also shaped in the style of a research monograph because some advances in source separation using machine learning or deep learning methods are extensively addressed.

The material of this book is based on a tutorial lecture on this theme at the 40th International Conference on Acoustics, Speech, and Signal Processing (ICASSP) in Brisbane, Australia, in April 2015. This tutorial was one of the most popular tutorials in ICASSP in terms of the number of attendees. The success of this tutorial lecture brought the idea of writing a textbook on this subject to promote using machine learning for signal processing. Some of the material is also based on a number of invited talks and distinguished lectures in different workshops and universities in Japan and Hong Kong. We strongly believe in the importance of machine learning and deep learning approaches to source separation, and sincerely encourage researchers to work on machine learning approaches to source separation.
Acknowledgments

First, I want to thank my colleagues and research friends, especially the members of the Machine Learning Lab at National Chiao Tung University. Some of the studies in this book were actually conducted while discussing and working with them. I would also like to thank many people for contributing good ideas, proofreading a draft, and giving me valuable comments, which greatly improved this book, including Sadaoki Furui, Chin-Hui Lee, Shoji Makino, Tomohiro Nakatani, George A. Saon, Koichi Shinoda, Man-Wai Mak, Shinji Watanabe, Issam El Naqa, Huan-Hsin Tseng, Zheng-Hua Tan, Zhanyu Ma, Tai-Shih Chi, Shoko Araki, Marc Delcroix, John R. Hershey, Jonathan Le Roux, Hakan Erdogan and Tomoko Matsui. The research experiences were especially impressive and inspiring when working on source separation problems with my past and current students, in particular Bo-Cheng Chen, Chang-Kai Chao, Shih-Hsiung Lee, Meng-Feng Chen, Tsung-Han Lin, Hsin-Lung Hsieh, Po-Kai Yang, Chung-Chien Hsu, Guan-Xiang Wang, You-Cheng Chang, Kuan-Ting Kuo, Kai-Wei Tsou and Che-Yu Kuo. We are very grateful to the Ministry of Science and Technology of Taiwan for long-term support of our research on machine learning and source separation. The great efforts of the editors of Academic Press at Elsevier, namely Tim Pitts, Charlie Kent, Carla B. Lima, John Leonard and Sheela Josy, are also appreciated. Finally, I would like to thank my family for supporting my whole research life.

Jen-Tzung Chien
Hsinchu, Taiwan
October 2018
Notations and Abbreviations

GENERAL NOTATIONS
This book observes the following general mathematical notations across different chapters:

Z+ = {1, 2, ...}   Set of positive integers
R   Set of real numbers
R+   Set of positive real numbers
R^D   Set of D-dimensional real numbers
a   Scalar variable
a   Vector variable
a = [a1 ··· aN]^⊤   Elements of a vector, which can be described with the square brackets [···]. ⊤ denotes the transpose operation.
A   Matrix variable
A = [a b; c d]   Elements of a matrix, which can be described with the square brackets [···]
A   Tensor variable
I_D   D × D identity matrix
|A|   Determinant of a square matrix
tr[A]   Trace of a square matrix
A = {a1, ..., aN} = {an}_{n=1}^N   Elements in a set, which can be described with the curly braces {···}.
A = {an}   Elements in a set, where the range of index n is omitted for simplicity.
|A|   The number of elements in a set A. For example, |{an}_{n=1}^N| = N.
f(x) or f_x   Function of x
p(x) or q(x)   Probabilistic distribution function of x
F[f]   Functional of f. Note that a functional uses the square brackets [·] while a function uses the parentheses (·).
E[·]   Expectation function
H[·]   Entropy function
E_{p(x|y)}[f(x)|y] = ∫ f(x)p(x|y)dx   The expectation of f(x) with respect to probability distribution p(x|y)
E_x[f(x)|y] = ∫ f(x)p(x|y)dx   Another form of the expectation of f(x), where the subscript with the probability distribution and/or the conditional variable is omitted when it is trivial.
δ(a, a′) = 1 if a = a′, 0 otherwise   Kronecker delta function for discrete variables a and a′
δ(x − x′)   Dirac delta function for continuous variables x and x′
Θ_ML, Θ_MAP, ...   The model parameters Θ estimated by a specific criterion (e.g., maximum likelihood (ML), maximum a posteriori (MAP), etc.) are represented by the criterion abbreviation in the subscript.
BASIC NOTATIONS USED FOR SOURCE SEPARATION
We also list the specific notations for source separation. This book keeps consistency by using the same notations for different models and applications. The explanations of the notations in the following list provide a general definition.

n   Number of channels or sensors
m   Number of sources
Θ   Set of model parameters
M   Model variable including type of model, structure, hyperparameters, etc.
Ψ   Set of hyperparameters
Q(·|·)   Auxiliary function used in EM algorithm
H   Hessian matrix
T ∈ Z+   Number of observation frames
t ∈ {1, ..., T}   Time frame index
x_t ∈ R^n   n-dimensional mixed vector at time t with n channels
X = {x_t}_{t=1}^T   Sequence of T mixed vectors
s_t ∈ R^m   m-dimensional source vector at time t
y_t ∈ R^m   m-dimensional demixed vector at time t
A = {a_ij} ∈ R^{n×m}   Mixing matrix
W = {w_ji} ∈ R^{m×n}   Demixing matrix in independent component analysis (ICA)
D(X, W)   Contrast function for ICA using observation data X and demixing matrix W. This function is written as the divergence measure to be minimized.
J(X, W)   Contrast function for ICA using observation data X and demixing matrix W. This function is written as the probabilistic measure to be maximized.
X = {X_mn} ∈ R+^{M×N}   Nonnegative mixed observation matrix in nonnegative matrix factorization (NMF) with N frames and M frequency bins
B = {B_mk} ∈ R+^{M×K}   Nonnegative basis matrix in NMF with M frequency bins and K basis vectors
W = {W_kn} ∈ R+^{K×N}   Nonnegative weight matrix in NMF with K basis vectors and N frames
η   Learning rate
τ   Iteration or shifting index
λ   Regularization parameter
X = {X_lmn} ∈ R^{L×M×N}   Three-way mixed observation tensor having dimensions L, M and N
G = {G_ijk} ∈ R^{I×J×K}   Three-way core tensor having dimensions I, J and K
x_t = {x_td}   D-dimensional mixed observation vector in a deep neural network (DNN) or a recurrent neural network (RNN). There are T vectors.
r_t = {r_tk}   K-dimensional source vector
y_t = {y_tk}   K-dimensional demixed vector
z_t = {z_tm}   m-dimensional feature vector
{a_tm, a_tk}   Activations in a hidden layer and an output layer
{δ_tm, δ_tk}   Local gradients with respect to {a_tm, a_tk}
σ(a)   Sigmoid function using an activation a
s(a)   Softmax function
w^(l)   Feedforward weights in the lth hidden layer
w^(ll)   Recurrent weights in the lth hidden layer
E(w)   Error function of using the whole training data {X = {x_td}, R = {r_tk}}
E_n(w)   Error function corresponding to the nth minibatch of training data {X_n, R_n}
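As a minimal illustration of these notations (not taken from the book), the Python sketch below generates two source signals s_t, mixes them through a mixing matrix A to give observations x_t = A s_t, and recovers demixed signals y_t with a demixing transform estimated by ICA. The signal choices, the matrix values and the use of scikit-learn's FastICA are assumptions made for this example only, not the book's implementation.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
T, m, n = 2000, 2, 2                 # T frames, m sources, n channels

# Two independent non-Gaussian sources s_t (stacked as a T x m matrix S)
t = np.linspace(0.0, 8.0, T)
S = np.stack([np.sign(np.sin(3.0 * t)),        # square wave
              rng.laplace(size=T)], axis=1)    # Laplacian noise

# Instantaneous mixing x_t = A s_t with an assumed mixing matrix A (n x m)
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])
X = S @ A.T                                    # mixed observations, T x n

# Estimate a demixing transform W by ICA and apply it: y_t = W x_t
ica = FastICA(n_components=m, random_state=0, max_iter=1000)
Y = ica.fit_transform(X)                       # demixed signals, T x m

# Cross-correlation between true sources and demixed components
C = np.abs(np.corrcoef(S.T, Y.T))[:m, m:]
print(C.max(axis=1))
```

Up to the permutation and scaling ambiguity inherent in ICA, each demixed component should correlate strongly with exactly one true source, which is what the cross-correlation check above verifies.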
ABBREVIATIONS

BSS: Blind Source Separation (page 3)
ICA: Independent Component Analysis (page 22)
NMF: Nonnegative Matrix Factorization (page 25)
NTF: Nonnegative Tensor Factorization (page 33)
CP: Canonical Decomposition/Parallel Factors (page 36)
PARAFAC: Parallel Factor Analysis (page 233)
STFT: Short-Time Fourier Transform (page 6)
GMM: Gaussian Mixture Model (page 6)
DNN: Deep Neural Network (page 37)
SGD: Stochastic Gradient Descent (page 40)
MLP: Multilayer Perceptron (page 38)
ReLU: Rectified Linear Unit (page 39)
FNN: Feedforward Neural Network (page 38)
DBN: Deep Belief Network (page 43)
RBM: Restricted Boltzmann Machine (page 43)
RNN: Recurrent Neural Network (page 45)
DRNN: Deep Recurrent Neural Network (page 48)
DDRNN: Discriminative Deep Recurrent Neural Network (page 290)
BPTT: Backpropagation Through Time (page 45)
LSTM: Long Short-Term Memory (page 50)
BLSTM: Bidirectional Long Short-Term Memory (page 296)
BRNN: Bidirectional Recurrent Neural Network (page 296)
CNN: Convolutional Neural Network (page 277)
RIR: Room Impulse Response (page 8)
NCTF: Nonnegative Convolutive Transfer Function (page 11)
MIR: Music Information Retrieval (page 13)
CASA: Computational Auditory Scene Analysis (page 14)
FIR: Finite Impulse Response (page 11)
DOA: Direction of Arrival (page 6)
MFCC: Mel-Frequency Cepstral Coefficient (page 27)
KL: Kullback–Leibler (page 29)
IS: Itakura–Saito (page 30)
ML: Maximum Likelihood (page 76)
RHS: Right-Hand Side (page 29)
LHS: Left-Hand Side (page 180)
ARD: Automatic Relevance Determination (page 59)
MAP: Maximum A Posteriori (page 83)
RLS: Regularized Least-Squares (page 62)
SBL: Sparse Bayesian Learning (page 63)
LDA: Linear Discriminant Analysis (page 69)
SNR: Signal-to-Noise Ratio (page 103)
SIR: Signal-to-Interference Ratio (page 24)
SDR: Signal-to-Distortion Ratio (page 74)
SAR: Source-to-Artifacts Ratio (page 74)
EM: Expectation Maximization (page 76)
VB: Variational Bayesian (page 85)
VB-EM: Variational Bayesian Expectation Maximization (page 85)
ELBO: Evidence Lower Bound (page 87)
MCMC: Markov Chain Monte Carlo (page 92)
HMM: Hidden Markov Model (page 100)
PCA: Principal Component Analysis (page 100)
MDL: Minimum Description Length (page 101)
BIC: Bayesian Information Criterion (page 101)
CIM: Component Importance Measure (page 102)
MLED: Maximum Likelihood Eigendecomposition (page 103)
LR: Likelihood Ratio (page 110)
NLR: Nonparametric Likelihood Ratio (page 106)
ME: Maximum Entropy (page 108)
MMI: Minimum Mutual Information (page 108)
NMI: Nonparametric Mutual Information (page 116)
WER: Word Error Rate (page 104)
SER: Syllable Error Rate (page 115)
C-DIV: Convex Divergence (page 118)
C-ICA: Convex Divergence ICA (page 118)
WNMF: Weighted Nonnegative Matrix Factorization (page 123)
NB-ICA: Nonstationary Bayesian ICA (page 132)
OLGP-ICA: Online Gaussian Process ICA (page 145)
SMC-ICA: Sequential Monte Carlo ICA (page 145)
GP: Gaussian Process (page 59)
AR: Autoregressive Process (page 147)
NMFD: Nonnegative Matrix Factor Deconvolution (page 161)
NMF2D: Nonnegative Matrix Factor 2-D Deconvolution (page 161)
NTFD: Nonnegative Tensor Factor Deconvolution (page 237)
NTF2D: Nonnegative Tensor Factor 2-D Deconvolution (page 237)
GIG: Generalized Inverse-Gaussian (page 168)
MGIG: Matrix-variate Generalized Inverse-Gaussian (page 255)
LPC: Linear Prediction Coefficient (page 172)
CD: Cepstrum Distance (page 171)
LLR: Log-Likelihood Ratio (page 172)
FWSegSNR: Frequency-Weighted Segmental SNR (page 172)
SRMR: Speech-to-Reverberation Modulation Energy Ratio (page 172)
PLCA: Probabilistic Latent Component Analysis (page 174)
PLCS: Probabilistic Latent Component Sharing (page 181)
CAS: Collaborative Audio Enhancement (page 181)
1-D: One-Dimensional (page 177)
2-D: Two-Dimensional (page 177)
BNMF: Bayesian Nonnegative Matrix Factorization (page 182)
GE-BNMF: Gaussian–Exponential BNMF (page 195)
PG-BNMF: Poisson–Gamma BNMF (page 195)
PE-BNMF: Poisson–Exponential BNMF (page 195)
SMR: Speech-to-Music Ratio (page 196)
NSDR: Normalized Signal-to-Distortion Ratio (page 199)
GNSDR: Global Normalized Signal-to-Distortion Ratio (page 199)
BGS: Bayesian Group Sparse learning (page 202)
LSM: Laplacian Scale Mixture distribution (page 206)
NMPCF: Nonnegative Matrix Partial Co-Factorization (page 202)
GNMF: Group-based Nonnegative Matrix Factorization (page 203)
DNMF: Discriminative Nonnegative Matrix Factorization (page 69)
LNMF: Layered Nonnegative Matrix Factorization (page 219)
DLNMF: Discriminative Layered Nonnegative Matrix Factorization (page 219)
FA: Factor Analysis (page 221)
PMF: Probabilistic Matrix Factorization (page 242)
PTF: Probabilistic Tensor Factorization (page 244)
PSDTF: Positive Semidefinite Tensor Factorization (page 251)
LD: Log-Determinant (page 253)
GaP: Gamma Process (page 254)
LBFGS: Limited Memory Broyden–Fletcher–Goldfarb–Shanno (page 263)
STOI: Short-Time Objective Intelligibility (page 263)
PESQ: Perceptual Evaluation of Speech Quality (page 264)
T-F: Time-Frequency (page 268)
CCF: Cross-Correlation Function (page 269)
ITD: Interaural Time Difference (page 269)
ILD: Interaural Level Difference (page 269)
GFCC: Gammatone Frequency Cepstral Coefficients (page 269)
BIR: Binaural Impulse Response (page 269)
IBM: Ideal Binary Mask (page 268)
PIT: Permutation Invariant Training (page 275)
IRM: Ideal Ratio Mask (page 274)
IAM: Ideal Amplitude Mask (page 274)
IPSM: Ideal Phase Sensitive Mask (page 274)
FC: Fully-Connected (page 283)
STF: Spectral-Temporal Factorization (page 283)
L-BFGS: Limited-memory Broyden–Fletcher–Goldfarb–Shanno (page 290)
VRNN: Variational Recurrent Neural Network (page 299)
VAE: Variational Auto-Encoder (page 299)
NTM: Neural Turing Machine (page 307)
RCNN: Recall Neural Network (page 314)
INTRODUCTION

In the real world, mixed signals are received everywhere. The observations perceived by a human are degraded, and in many cases it is difficult to acquire faithful information from the environment. For example, we are surrounded by sounds and noises with interference from room reverberation. Multiple sources are active simultaneously, so the sound effects or listening conditions for speech and audio signals deteriorate considerably. From the perspective of computer vision, an observed image is usually blurred by noise, illuminated by lighting or mixed with another image due to reflection, so the target object becomes hard to detect and recognize. In addition, it is also important to deal with the mixed signals of medical imaging data, including magnetoencephalography (MEG) and functional magnetic resonance imaging (fMRI). The mixing interference from external sources of electromagnetic fields due to muscle activity significantly masks the desired measurement of brain activity. Therefore, how to come up with a powerful solution that separates a mixed signal into its individual source signals is nowadays a challenging problem, which has attracted many researchers to work in this direction and develop practical systems and applications.
This chapter starts with an introduction to various types of separation systems in Section 1.1. We then address the separation problems and challenges in Section 1.2, where machine learning and deep learning algorithms are applied to tackle these problems. A set of practical systems and applications using source separation is illustrated. An overview of the whole book is systematically given in Section 1.3.
1.1 SOURCE SEPARATION

Blind source separation (BSS) aims to separate a set of source signals from a set of mixed signals without, or with very little, information about the source signals or the mixing process. BSS deals with the problem of signal reconstruction from a mixed signal or a set of mixed signals. Such a scientific domain is multidisciplinary. Signal processing and machine learning are two professional domains which have been widely explored to deal with various challenges in BSS. In general, there are three types of mixing systems or sensor networks in real-world applications, namely multichannel source separation, monaural source separation and deconvolution-based separation, which are surveyed in what follows.
1.1.1 MULTICHANNEL SOURCE SEPARATION

A classical example of a source separation problem is the cocktail party problem, where a number of people are talking simultaneously in a room at a cocktail party, and a listener is trying to follow one of the discussions. As shown in Fig. 1.1, three speakers {s_t1, s_t2, s_t3} are talking at the same time. Three microphones {x_t1, x_t2, x_t3} are installed nearby as the sensors to acquire speech signals which
Source Separation and Machine Learning. https://doi.org/10.1016/B978-0-12-804566-4.00012-7
Copyright © 2019 Elsevier Inc. All rights reserved.
FIGURE 1.1
Cocktail party problem with three speakers and three microphones.
are mixed differently depending on the location, angle and channel characteristics of the individual microphones. A linear mixing system is constructed as

$$
\begin{aligned}
x_{t1} &= a_{11}s_{t1} + a_{12}s_{t2} + a_{13}s_{t3},\\
x_{t2} &= a_{21}s_{t1} + a_{22}s_{t2} + a_{23}s_{t3}, \qquad (1.1)\\
x_{t3} &= a_{31}s_{t1} + a_{32}s_{t2} + a_{33}s_{t3}.
\end{aligned}
$$

This 3 × 3 mixing system can be written in vector and matrix form as x_t = A s_t, where x_t = [x_t1 x_t2 x_t3]^T, s_t = [s_t1 s_t2 s_t3]^T and A = [a_ij] ∈ R^{3×3}. This system involves the current time t and a constant mixing matrix A without considering the noise effect. We also call it the instantaneous and noiseless mixing system. Assuming the 3 × 3 mixing matrix A is invertible, an inverse problem is then tackled to identify the source signals as s_t = W x_t, where W = A^{-1} is the demixing matrix which exactly recovers the original source signals s_t from the mixed observations x_t.
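The instantaneous, noiseless 3 × 3 case can be sketched numerically; the mixing matrix A and the random sources below are illustrative stand-ins, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three source signals s_t (rows: sources, columns: time samples).
S = rng.standard_normal((3, 1000))

# An illustrative, nonsingular 3x3 mixing matrix A (instantaneous, noiseless).
A = np.array([[1.0, 0.5, 0.3],
              [0.4, 1.0, 0.2],
              [0.3, 0.6, 1.0]])

# Mixed observations x_t = A s_t at every time t.
X = A @ S

# With A known and invertible, the demixing matrix W = A^{-1}
# recovers the sources exactly: y_t = W x_t = s_t.
W = np.linalg.inv(A)
Y = W @ X

print(np.allclose(Y, S))  # True: exact recovery in the noiseless case
```

In BSS, of course, A is unknown, which is precisely why ICA and related methods are needed; this sketch only verifies the algebra of the instantaneous model.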
More generally, multichannel source separation is formulated as an n × m linear mixing system consisting of a set of n linear equations for n individual channels or sensors x_t ∈ R^{n×1}, where m sources s_t ∈ R^{m×1} are present (see Fig. 1.2). The general linear mixing system x_t = A s_t is expressed and extended by

$$
\begin{aligned}
x_{t1} &= a_{11}s_{t1} + a_{12}s_{t2} + \cdots + a_{1m}s_{tm},\\
&\;\;\vdots \qquad (1.2)\\
x_{tn} &= a_{n1}s_{t1} + a_{n2}s_{t2} + \cdots + a_{nm}s_{tm},
\end{aligned}
$$

where the mixing matrix A = [a_ij] ∈ R^{n×m} is merged. In this problem, the mixing matrix A and the source signals s_t are unknown. Our goal is to reconstruct the source signal y_t ∈ R^{m×1} by finding a
FIGURE 1.2
A general linear mixing system with n observations and m sources.
demixing matrix W ∈ R^{m×n} through y_t = W x_t. We aim at estimating the demixing matrix according to an objective function D(X, W) from a set of mixed signals X = {x_t}_{t=1}^T so that the constructed signals are as close to the original source signals as possible, i.e., y_t ≈ s_t. There are three situations in multichannel source separation:
Determined System: n = m
In this case, the number of channels is the same as the number of sources. We are solving a determined system where a unique solution exists if the mixing matrix A is nonsingular and the inverse matrix W = A^{-1} is tractable. The exact solution in this situation is obtained by y_t = W x_t, where y_t = s_t. For the application of audio signal separation, this condition implies that the number of speakers or musical sources is the same as the number of microphones used to acquire the audio signals. A microphone array is introduced in such a situation. This means that if more sources are present, we need to employ more microphones to estimate the individual source signals. Independent component analysis (ICA), as detailed in Section 2.1, is developed to resolve this case. A number of advanced solutions to this BSS case will be described in Chapter 4.
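As a rough illustration of how ICA resolves the determined case when A is unknown, the following sketch implements a minimal FastICA-style fixed-point iteration (whitening, tanh nonlinearity, symmetric decorrelation) on two synthetic uniform sources. The mixing matrix, sample count and iteration budget are illustrative assumptions, not settings from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 5000

# Two independent non-Gaussian (uniform) sources; determined case n = m = 2.
S = rng.uniform(-1.0, 1.0, size=(2, T))
A = np.array([[1.0, 0.6],
              [0.5, 1.0]])   # illustrative nonsingular mixing matrix
X = A @ S                    # observed mixtures x_t = A s_t

# Whitening: linearly transform the mixtures to zero mean, unit covariance.
Xc = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(Xc @ Xc.T / T)
Z = (E @ np.diag(d ** -0.5) @ E.T) @ Xc

# FastICA-style fixed-point iterations with tanh nonlinearity and
# symmetric decorrelation of the demixing matrix W.
W = rng.standard_normal((2, 2))
for _ in range(200):
    G = np.tanh(W @ Z)
    W = (G @ Z.T) / T - np.diag((1.0 - G ** 2).mean(axis=1)) @ W
    U, _, Vt = np.linalg.svd(W)      # W <- (W W^T)^{-1/2} W
    W = U @ Vt

Y = W @ Z   # demixed signals, recovered up to permutation and sign

# Each demixed component should align with exactly one true source.
corr = np.abs(np.corrcoef(np.vstack([Y, S]))[:2, 2:])
print(corr.max(axis=1))  # both entries close to 1
```

The permutation and sign ambiguities visible here (hence the absolute correlation check) are inherent to BSS: the contrast function only pushes the demixed components away from Gaussianity and toward independence.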
Overdetermined System: n > m

In this case, the number of channels n is larger than the number of sources m. In mathematics, this is an overdetermined case in a system of equations, where there are more equations than unknowns. In audio source separation, each speaker or musical source is seen as an available degree of freedom, while each channel or microphone is viewed as a constraint that restricts one degree of freedom. The overdetermined case appears when the system has been overconstrained. Such an overdetermined system is almost always inconsistent, so there is no consistent solution, especially when constructed with a random mixing matrix A. In Sawada et al. (2007), a complex-valued ICA was developed to tackle this situation, in which the number of microphones was sufficient for the number of sources. This method separates the frequency bin-wise mixtures. For each frequency bin, an ICA demixing matrix is estimated to optimally push the distribution of the demixed elements far from a Gaussian.
Underdetermined System: n < m

An underdetermined system in source separation occurs when the number of channels is less than the number of sources (Winter et al., 2007). This system is underconstrained, and it is difficult to estimate a reliable solution. However, such a case is challenging and relevant in many real-world applications. In particular, we are interested in the case of single-channel source separation, where a single mixed signal is received in the presence of two or more source signals. For audio source separation, several methods have been proposed to deal with this circumstance. In Sawada et al. (2011), a time-frequency masking scheme was proposed to identify which source had the largest amplitude in each individual time-frequency slot (f, t). During the identification procedure, a short-time Fourier transform (STFT) was first calculated to find time-frequency observation vectors x_ft. Clustering of the time-frequency observation vectors was performed to calculate the posterior probability p(j | x_ft) that a vector x_ft belongs to a cluster or a source j. A likelihood function p(x_ft | j) based on a Gaussian mixture model (GMM) was used in this calculation. A time-frequency masking function M_jft was accordingly determined to estimate the separated signals s_jft = M_jft x_ft for an individual source j.
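The masking idea can be illustrated with an oracle variant: instead of the GMM clustering used by Sawada et al. (2011), the sketch below builds an ideal binary mask directly from known (synthetic) source magnitudes, which are stand-ins for |STFT| spectrograms; the dimensions and the additive-magnitude approximation are assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(2)
F, T = 64, 100  # frequency bins f and time frames t

# Synthetic nonnegative magnitude spectrograms for two sources.
S1 = rng.exponential(1.0, size=(F, T))
S2 = rng.exponential(1.0, size=(F, T))

# Single-channel mixture in the T-F domain (additive-magnitude approximation).
X = S1 + S2

# Ideal binary mask: assign each T-F slot (f, t) to the dominant source,
# i.e., the source with the largest amplitude in that slot.
M1 = (S1 >= S2).astype(float)
M2 = 1.0 - M1

# Masked estimates s_jft = M_jft * x_ft for each source j.
S1_hat = M1 * X
S2_hat = M2 * X

# Binary masks partition the mixture energy between the two estimates.
print(np.allclose(S1_hat + S2_hat, X))  # True
```

In the actual blind setting the sources are unknown, so the mask has to be inferred, e.g., from the GMM posterior p(j | x_ft); the oracle mask above is only the target that such clustering approximates.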
In some cases, the number of sources m is unknown beforehand. Estimating the number of sources requires identifying the right condition and developing the right solution to overcome the corresponding mixing problem. In Araki et al. (2009a, 2009b), the authors constructed a GMM with a Dirichlet prior on the mixture weights to identify the direction of arrival (DOA) of the source speech signal from individual time-frequency observations x_ft and used the DOA information to learn the number of sources and develop a specialized solution for sparse source separation.
In addition, we usually assume the mixing system is time-invariant or, equivalently, that the mixing matrix A is time independent. This assumption may not faithfully reflect real-world source separation, where sources are moving or changing, or the number of sources is itself changing. In this case, the mixing matrix is time dependent, i.e., A → A(t), and we need to find a demixing matrix which is also time dependent, W → W(t). Estimating the source signals s_t under such a nonstationary mixing system is relevant in practice and crucial for real-world blind source separation.
1.1.2 MONAURAL SOURCE SEPARATION

BSS is in general highly underdetermined. Many applications involve a single-channel source separation problem (n = 1). Among different realizations of a mixing system, it is crucial to deal with single-channel source separation because a wide range of applications involve only a single recording channel but mix or convolve various sources or interferences. Fig. 1.3 demonstrates a scenario of monaural source separation using a single microphone with three sources. We aim to suppress the ambient noises, including the bird and the airplane, and identify the human voices for listening or understanding. Therefore, single-channel source separation can generally be treated as an avenue to speech enhancement or noise reduction, in the sense that we want to enhance or purify the speech signal in the presence of surrounding noises.

There are two learning strategies in monaural source separation, supervised learning and unsupervised learning. The supervised approach conducts source separation given labeled training data from different sources; namely, the separated training data are collected in advance. Using this strategy, source separation is not truly blind. A set of training data pairs with mixed signals and separated signals is provided to train a demixing system which is generalizable to decompose unseen mixed signals. Nonnegative matrix factorization (NMF) and deep neural networks (DNNs) are two machine learning paradigms for dealing with single-channel source separation, which will be extensively described in Sections 2.2 and 2.4, with a number of advanced works organized in Chapters 5 and 7, respectively. Basically, NMF (Lee and Seung, 1999) factorizes a nonnegative data matrix X = {x_t}_{t=1}^T into a prod-
Therearetwolearningstrategiesinmonauralsourceseparation, supervisedlearning and unsupervisedlearning.Supervisedapproachconductssourceseparationgivenbythelabeledtrainingdata fromdifferentsources.Namely,theseparatedtrainingdataarecollectedinadvance.Usingthisstrategy, sourceseparationisnottrulyblind.Asetoftrainingdatapairswithmixedsignalsandseparatedsignalsareprovidedtotrainademixingsystemwhichisgeneralizabletodecomposethoseunseenmixed signals.Nonnegativematrixfactorization(NMF)anddeepneuralnetwork(DNN)aretwomachine learningparadigmstodealwithsingle-channelsourceseparationwhichwillbeextensivelydescribed inSections 2.2 and 2.4 withanumberofadvancedworksorganizedinChapters 5 and 7,respectively. Basically,NMF(LeeandSeung, 1999)factorizesanonnegativedatamatrix X ={xt }T t =1 intoaprod-