Source Separation and Machine Learning

Jen-Tzung Chien
National Chiao Tung University
Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1650, San Diego, CA 92101, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom

Copyright © 2019 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-0-12-817796-9

For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Mara Conner
Acquisition Editor: Tim Pitts
Editorial Project Manager: John Leonard
Production Project Manager: Surya Narayanan Jayachandran
Designer: Mark Rogers

Typeset by VTeX
List of Figures

Fig. 1.1 Cocktail party problem with three speakers and three microphones. 4
Fig. 1.2 A general linear mixing system with n observations and m sources. 5
Fig. 1.3 An illustration of monaural source separation with three source signals. 7
Fig. 1.4 An illustration of singing voice separation. 8
Fig. 1.5 Speech recognition system in a reverberant environment. 9
Fig. 1.6 Characteristics of room impulse response. 9
Fig. 1.7 Categorization of various methods for speech dereverberation. 10
Fig. 1.8 Categorization of various applications for source separation. 12
Fig. 1.9 Blind source separation for electroencephalography artifact removal. 14
Fig. 1.10 Challenges in front-end processing and back-end learning for audio source separation. 18
Fig. 2.1 Evolution of different source separation methods. 21
Fig. 2.2 ICA learning procedure for finding a demixing matrix W. 24
Fig. 2.3 Illustration for nonnegative matrix factorization X ≈ BW. 25
Fig. 2.4 Nonnegative matrix factorization shown as a summation of rank-one nonnegative matrices. 26
Fig. 2.5 Supervised learning for single-channel source separation using nonnegative matrix factorization in presence of a speech signal Xs and a music signal Xm. 27
Fig. 2.6 Unsupervised learning for single-channel source separation using nonnegative matrix factorization in presence of two sources. 27
Fig. 2.7 Illustration of multiway observation data. 33
Fig. 2.8 A tensor data which is composed of three ways of time, frequency and channel. 34
Fig. 2.9 Tucker decomposition for a three-way tensor. 34
Fig. 2.10 CP decomposition for a three-way tensor. 37
Fig. 2.11 Multilayer perceptron with one input layer, one hidden layer, and one output layer. 38
Fig. 2.12 Comparison of activations using ReLU function, logistic sigmoid function, and hyperbolic tangent function. 39
Fig. 2.13 Calculations in error backpropagation algorithm: (A) forward pass and (B) backward pass. The propagations are shown by arrows. In the forward pass, activations a_t and outputs z_t in individual nodes are calculated and propagated. In the backward pass, local gradients δ_t in individual nodes are calculated and propagated. 41
Fig. 2.14 Procedure in the error backpropagation algorithm: (A) compute the error function, (B) propagate the local gradient from the output layer L to (C) hidden layer L−1 and (D) until input layer 1. 42
Fig. 2.15 A two-layer structure in restricted Boltzmann machine. 43
Fig. 2.16 A stack-wise training procedure for deep belief network. This model is further used as a pretrained model for deep neural network optimization. 44
Fig. 2.17 A recurrent neural network with single hidden layer. 45
Fig. 2.18 Backpropagation through time with a single hidden layer and two time steps τ = 2. 46
Fig. 2.19 A deep recurrent neural network for single-channel source separation in the presence of two source signals. 49
Fig. 2.20 A procedure of single-channel source separation based on DNN or RNN where two source signals are present. 49
Fig. 2.21 Preservation of gradient information through a gating mechanism. 50
Fig. 2.22 A detailed view of long short-term memory. Black circle means multiplication. Recurrent state z_{t−1} and cell c_{t−1} in a previous time step are shown by dashed lines. 51
Fig. 3.1 Portrait of Thomas Bayes (1701–1761) who is credited for the Bayes theorem. 58
Fig. 3.2 Illustration of a scalable model in presence of increasing number of training data. 59
Fig. 3.3 Optimal weights w estimated by minimizing the ℓ2- and ℓ1-regularized objective functions which are shown on the left and right, respectively. 62
Fig. 3.4 Laplace distribution centered at zero with different control parameters λ. 63
Fig. 3.5 Comparison of two-dimensional Gaussian distribution (A) and Student's t-distribution (B) with zero mean. 64
Fig. 3.6 Scenario for a time-varying source separation system with (A) one male, one female and one music player. (B) Then, the male is moving to a new location. (C) After a while, the male disappears and the female is replaced by a new one. 66
Fig. 3.7 Sequential updating of parameters and hyperparameters for online Bayesian learning. 68
Fig. 3.8 Evolution of different inference algorithms for construction of latent variable models. 75
Fig. 3.9 Illustration for (A) decomposing a log-likelihood into a KL divergence and a lower bound, (B) updating the lower bound by setting KL divergence to zero with new distribution q(Z), and (C) updating lower bound again by using new parameters. Adapted from Bishop (2006). 80
Fig. 3.10 Illustration of updating lower bound of the log-likelihood function in EM iteration from τ to τ+1. Adapted from Bishop (2006). 81
Fig. 3.11 Bayes relation among posterior distribution, likelihood function, prior distribution and evidence function. 83
Fig. 3.12 Illustration of seeking an approximate distribution q(Z) which is factorizable and maximally similar to the true posterior p(Z|X). 86
Fig. 3.13 Illustration of minimizing KL divergence via maximization of variational lower bound. KL divergence is reduced from left to right during the optimization procedure. 88
Fig. 4.1 Construction of the eigenspace and independent space for finding the adapted models from a set of seed models by using enrollment data. 102
Fig. 4.2 Comparison of kurtosis values of the estimated eigenvoices and independent voices. 104
Fig. 4.3 Comparison of BIC values of using PCA with eigenvoices and ICA with independent voices. 105
Fig. 4.4 Comparison of WERs (%) of using PCA with eigenvoices and ICA with independent voices where the numbers of basis vectors (K) and the numbers of adaptation sentences (L) are changed. 105
Fig. 4.5 Taxonomy of contrast functions for optimization in independent component analysis. 107
Fig. 4.6 ICA transformation and k-means clustering for speech recognition with multiple hidden Markov models. 115
Fig. 4.7 Generation of supervectors from different acoustic segments of aligned utterances. 116
Fig. 4.8 Taxonomy of ICA contrast functions where different realizations of mutual information are included. 119
Fig. 4.9 Illustration of a convex function f(x). 119
Fig. 4.10 Comparison of (A) a number of divergence measures and (B) α-DIV and C-DIV for various α under different joint probability P_{y1,y2}(A, A). 122
Fig. 4.11 Divergence measures of demixed signals versus the parameters of demixing matrix θ1 and θ2. KL-DIV, C-DIV at α = 1 and C-DIV at α = −1 are compared. 128
Fig. 4.12 Divergence measures versus the number of learning iterations. KL-ICA and C-ICA at α = 1 and α = −1 are evaluated. The gradient descent (GD) and natural gradient (NG) algorithms are compared. 129
Fig. 4.13 Comparison of SIRs of three demixed signals in the presence of an instantaneous mixing condition. Different ICA algorithms are evaluated. 130
Fig. 4.14 Comparison of SIRs of three demixed signals in the presence of instantaneous mixing condition with additive noise. Different ICA algorithms are evaluated. 131
Fig. 4.15 Integrated scenario of a time-varying source separation system from t to t+1 and t+2 as shown in Fig. 3.6. 133
Fig. 4.16 Graphical representation for nonstationary Bayesian ICA model. 136
Fig. 4.17 Comparison of speech waveforms for (A) source signal 1 (blue in web version or dark gray in print version), source signal 2 (red in web version or light gray in print version) and two mixed signals (black) and (B) two demixed signals 1 (blue in web version or dark gray in print version) and two demixed signals 2 (red in web version or light gray in print version) by using NB-ICA and OLGP-ICA algorithms. 143
Fig. 4.18 Comparison of variational lower bounds by using NB-ICA with (blue in web version or dark gray in print version) and without (red in web version or light gray in print version) adaptation of ARD parameter. 144
Fig. 4.19 Comparison of the estimated ARD parameters of source signal 1 (blue in web version or dark gray in print version) and source signal 2 (red in web version or light gray in print version) using NB-ICA. 144
Fig. 4.20 Evolution from NB-ICA to OLGP-ICA for nonstationary source separation. 146
Fig. 4.21 Graphical representation for online Gaussian process ICA model. 150
Fig. 4.22 Comparison of square errors of the estimated mixing coefficients by using NS-ICA (black), SMC-ICA (pink in web version or light gray in print version), NB-ICA (blue in web version or dark gray in print version) and OLGP-ICA (red in web version or mid gray in print version). 157
Fig. 4.23 Comparison of absolute errors of temporal predictabilities between true and demixed source signals where different ICA methods are evaluated. 158
Fig. 4.24 Comparison of signal-to-interference ratios of demixed signals where different ICA methods are evaluated. 159
Fig. 5.1 Graphical representation for Bayesian speech dereverberation. 167
Fig. 5.2 Graphical representation for Gaussian–Exponential Bayesian nonnegative matrix factorization. 184
Fig. 5.3 Graphical representation for Poisson–Gamma Bayesian nonnegative matrix factorization. 187
Fig. 5.4 Graphical representation for Poisson–Exponential Bayesian nonnegative matrix factorization. 190
Fig. 5.5 Implementation procedure for supervised speech and music separation. 197
Fig. 5.6 Histogram of the estimated number of bases (K) using PE-BNMF for source signals of speech, piano and violin. 198
Fig. 5.7 Comparison of the averaged SDR using NMF with a fixed number of bases (1–5 pairs of bars), PG-BNMF with the best fixed number of bases (sixth pair of bars), PG-BNMF with adaptive number of bases (seventh pair of bars), PE-BNMF with the best fixed number of bases (eighth pair of bars) and PE-BNMF with adaptive number of bases (ninth pair of bars). 198
Fig. 5.8 Implementation procedure for unsupervised singing voice separation. 199
Fig. 5.9 Comparison of GNSDR of the separated singing voices at different SMRs using PE-BNMFs with K-means clustering (denoted by BNMF1), NMF clustering (denoted by BNMF2) and shifted NMF clustering (denoted by BNMF3). Five competitive methods are included for comparison. 201
Fig. 5.10 Illustration for group basis representation. 204
Fig. 5.11 Comparison of Gaussian, Laplace and LSM distributions centered at zero. 207
Fig. 5.12 Graphical representation for Bayesian group sparse nonnegative matrix factorization. 208
Fig. 5.13 Spectrograms of "music5" containing the drum signal (first panel), the saxophone signal (second panel), the mixed signal (third panel), the demixed drum signal (fourth panel) and the demixed saxophone signal (fifth panel). 217
Fig. 5.14 Illustration for layered nonnegative matrix factorization. 221
Fig. 5.15 Illustration for discriminative layered nonnegative matrix factorization. 224
Fig. 6.1 Categorization and evolution for different tensor factorization methods. 231
Fig. 6.2 Procedure of producing the modulation spectrograms for tensor factorization. 232
Fig. 6.3 (A) Time resolution and (B) frequency resolution of a mixed audio signal. 235
Fig. 6.4 Comparison between (A) nonnegative matrix factorization and (B) positive semidefinite tensor factorization. 253
Fig. 7.1 Categorization of different deep learning methods for source separation. 260
Fig. 7.2 A deep neural network for single-channel speech separation, where x_t denotes the input features of the mixed signal at time step t, z_t^{(l)} denotes the features in the hidden layer l, y_{1,t} and y_{2,t} denote the mask functions for source one and source two, and x_{1,t} and x_{2,t} are the estimated signals for source one and source two, respectively. 261
Fig. 7.3 Illustration of stacking in deep ensemble learning for speech separation. 267
Fig. 7.4 Speech segregation procedure by using deep neural network. 268
Fig. 7.5 A deep recurrent neural network for speech dereverberation. 279
Fig. 7.6 Factorized features in spectral and temporal domains for speech dereverberation. 283
Fig. 7.7 Comparison of SDR, SIR and SAR of the separated signals by using NMF, DRNN, DDRNN-bw and DDRNN-diff. 291
Fig. 7.8 Vanishing gradients in a recurrent neural network. Degree of lightness means the level of vanishing in the gradient. This figure is a counterpart to Fig. 2.21 where vanishing gradients are mitigated by a gating mechanism. 292
Fig. 7.9 A simplified view of long short-term memory. The red (light gray in print version) line shows recurrent state z_t. The detailed view was provided in Fig. 2.22. 293
Fig. 7.10 Illustration of preserving gradients in long short-term memory. There are three gates in a memory block. ◦ means gate opening while • denotes gate closing. 293
Fig. 7.11 A configuration of four stacked long short-term memory layers along three time steps for monaural speech separation. Dashed arrows indicate the modeling of the same LSTM across time steps while solid arrows mean the modeling of different LSTMs in deep structure. 294
Fig. 7.12 Illustration of bidirectional recurrent neural network for monaural source separation. Two hidden layers are configured to learn bidirectional features for finding soft mask functions for two sources. Forward and backward directions are shown in different colors. 296
Fig. 7.13 (A) Encoder and decoder in a variational autoencoder. (B) Graphical representation for a variational autoencoder. 300
Fig. 7.14 Graphical representation for (A) recurrent neural network and (B) variational recurrent neural network. 301
Fig. 7.15 Inference procedure for variational recurrent neural network. 302
Fig. 7.16 Implementation topology for variational recurrent neural network. 305
Fig. 7.17 Comparison of SDR, SIR and SAR of the separated signals by using NMF, DNN, DRNN, DDRNN and VRNN. 306
Fig. 7.18 (A) Single-channel source separation with dynamic state Z_t = {z_t, r_t, w_{r,t}, w_{w,t}} in recurrent layer L−1. (B) Recurrent layers are driven by a cell c_t and a controller for memory M_t where dashed line denotes the connection between cell and memory at previous time step and bold lines denote the connections with weights. 308
Fig. 7.19 Four steps of addressing procedure which is driven by parameters {k_t, β_t, g_t, s_t, γ_t}. 310
Fig. 7.20 An end-to-end memory network for monaural source separation containing a bidirectional LSTM on the left as an encoder, an LSTM on the right as a decoder and an LSTM on the top as a separator. 315
List of Tables

Table 2.1 Comparison of NMF updating rules based on different learning objectives. 31
Table 2.2 Comparison of updating rules of standard NMF and sparse NMF based on learning objectives of squared Euclidean distance and Kullback–Leibler divergence. 32
Table 3.1 Comparison of approximate inference methods using variational Bayesian and Gibbs sampling. 95
Table 4.1 Comparison of syllable error rates (SERs) (%) with and without HMM clustering and ICA learning. Different ICA algorithms are evaluated. 117
Table 4.2 Comparison of signal-to-interference ratios (SIRs) (dB) of mixed signals without ICA processing and with ICA learning based on MMI and NLR contrast functions. 117
Table 4.3 Comparison of different divergence measures with respect to symmetric divergence, convexity parameter, combination weight and special realization. 121
Table 5.1 Comparison of multiplicative updating rules of standard NMFD and sparse NMFD based on the objective functions of squared Euclidean distance and KL divergence. 164
Table 5.2 Comparison of multiplicative updating rules of standard NMF2D and sparse NMF2D based on the objective functions of squared Euclidean distance and KL divergence. 164
Table 5.3 Comparison of using different methods for speech dereverberation under various test conditions (near, near microphone; far, far microphone; sim, simulated data; real, real recording) in terms of evaluation metrics of CD, LLR, FWSegSNR and SRMR (dB). 173
Table 5.4 Comparison of different Bayesian NMFs in terms of inference algorithm, closed-form solution and optimization theory. 195
Table 5.5 Comparison of GNSDR (dB) of the separated singing voices (V) and music accompaniments (M) using NMF with fixed number of bases K = 10, 20 and 30 and PE-BNMF with adaptive K. Three clustering algorithms are evaluated. 200
Table 5.6 Comparison of NMF and different BNMFs in terms of SDR and GNSDR for two separation tasks. Standard deviation is given in the parentheses. 201
Table 5.7 Comparison of SIRs (in dB) of the reconstructed rhythmic signal (denoted by R) and harmonic signal (denoted by H) based on NMF, BNMF, GNMF and BGS-NMF. Six mixed music signals are investigated. 218
Table 5.8 Performance of speech separation by using NMF, LNMF and DLNMF in terms of SDR, SIR and SAR (dB). 228
Table 6.1 Comparison of multiplicative updating rules of NMF2D and NTF2D based on the objective functions of squared Euclidean distance and Kullback–Leibler divergence. 241
Table 7.1 Comparison of using different models for speech dereverberation in terms of SRMR and PESQ (in dB) under the condition of using simulated data and near microphone. 284
Table 7.2 Comparison of STOIs using DNN, LSTM and NTM under different SNRs with seen speakers. 313
Table 7.3 Comparison of STOIs using DNN, LSTM and NTM under different SNRs with unseen speakers. 313
Table 7.4 Comparison of STOIs under different SNRs by using DNN, LSTM and different variants of NTM and RCNN. 319
Foreword

With the use of Deep Neural Networks (DNNs) and Recurrent Neural Networks (RNNs), speech recognition performance has recently improved rapidly, and speech recognition has been widely used in smart speakers and smartphones. Speech recognition performance for speech uttered in a quiet environment has become quite close to human performance. However, in situations where there is noise in the surroundings or room reverberation, speech recognition performance falls far short of human performance.

Noise and other persons' voices are almost always superimposed on the voice in the house, office, conference room, etc. People can naturally extract and hear the conversation of interested persons, even while many people are chatting, as at a cocktail party. This is known as the cocktail party effect. To automatically recognize the target voice, it is necessary to develop a technology for separating the voice uttered in the actual environment from the surrounding noise and removing the influence of the room reverberation. When analyzing and processing music signals, it is also required to separate overlapped sound source signals.

Sound source separation is, therefore, a very important technology in a wide range of signal processing, particularly speech, sound, and music signal processing. A variety of research has been conducted so far, but the performance of current sound source separation technology falls far short of human capability. This is one of the major reasons why speech recognition performance in a general environment does not reach human performance.

Generally, speech and audio signals are recorded with one or more microphones. For this reason, sound source separation technology can be classified into monaural sound source separation and multichannel sound source separation. This book focuses on blind source separation (BSS), which is the process of separating a set of source signals from a set of mixed signals without the aid of information, or with very little information, about the source signals or the mixing process. It is rare that information on the sound source signal to be separated is obtained beforehand, so it is important to be able to separate the sound source without such information. This book also addresses various challenging issues, covering single-channel source separation, where the multiple source signals from a single mixed signal are learned in a supervised way, as well as speaker- and noise-independent source separation, where a large set of training data is available to learn a generalized model.

In response to the growing need for performance improvement in speech recognition, research on BSS has been rapidly advanced in recent years based on various machine learning and signal processing technologies. This book describes state-of-the-art machine learning approaches for model-based BSS for speech recognition, speech separation, instrumental music separation, singing voice separation, music information retrieval, brain signal separation and image processing.

The model-based techniques, combining various signal processing and machine learning techniques, range from linear to nonlinear models. Major techniques include: Independent Component Analysis (ICA), Nonnegative Matrix Factorization (NMF), Nonnegative Tensor Factorization (NTF), Deep Neural Network (DNN), and Recurrent Neural Network (RNN). The rapid progress in the last few years is largely due to the use of DNNs and RNNs.

This book is unique in that it covers topics from the basic signal processing theory concerning BSS to the technology using DNNs and RNNs in recent years. At the end of this book, the direction of future research is also described. This landmark book is very useful as a student's textbook and researcher's reference book. The readers will become articulate in the state of the art of model-based BSS. I would like to encourage many students and researchers to read this book to make big contributions to the future development of technology in this field.

Sadaoki Furui
President, Toyota Technological Institute at Chicago
Professor Emeritus, Tokyo Institute of Technology
Preface

In general, blind source separation (BSS) is known as a rapidly emerging and promising area which involves extensive knowledge of signal processing and machine learning. This book introduces state-of-the-art machine learning approaches for model-based BSS with applications to speech recognition, speech separation, instrumental music separation, singing voice separation, music information retrieval, brain signal separation and image processing. The traditional BSS approaches based on independent component analysis were designed to resolve the mixing system by optimizing a contrast function or an independence measure. The underdetermined problem in the presence of more sources than sensors may not be carefully tackled. The contrast functions may not flexibly and faithfully measure the independence for an optimization with convergence. Assuming a static mixing condition, one cannot catch the underlying dynamics in source signals and sensor networks. The uncertainty of system parameters may not be precisely characterized, so that the robustness against adverse environments is not guaranteed. The temporal structures in mixing systems as well as source signals may not be properly captured. The model complexity or the dictionary size may not be fitted to the true one in the source signals. With the remarkable advances in machine learning algorithms, the issues of underdetermined mixtures, optimization of contrast functions, nonstationary mixing conditions, multidimensional decomposition, ill-posed conditions and model regularization have been resolved by introducing the solutions of nonnegative matrix factorization, information-theoretic learning, online learning, Gaussian processes, sparse learning, dictionary learning, Bayesian inference, model selection, tensor decomposition, deep neural networks, recurrent neural networks and memory networks. This book will present how these algorithms are connected and why they work for source separation, particularly in speech, audio and music applications. We start with a survey of BSS applications and model-based approaches. The fundamental theories, including statistical learning, optimization algorithms, information theory, Bayesian learning, variational inference and Markov chain Monte Carlo inference, will be addressed. A series of case studies are then introduced to deal with different issues in model-based BSS. These case studies are categorized into independent component analysis, nonnegative matrix factorization, nonnegative tensor factorization and deep neural networks, ranging from a linear to a nonlinear model, from single-way to multiway processing, and from a shallow feedforward model to a deep recurrent model. At last, we will point out a number of directions and outlooks for future studies.

This book is written as a textbook with fundamental theories and advanced technologies developed in the last decade. It is also shaped in the style of a research monograph because some advances in source separation using machine learning or deep learning methods are extensively addressed.

The material of this book is based on a tutorial lecture on this theme at the 40th International Conference on Acoustics, Speech, and Signal Processing (ICASSP) in Brisbane, Australia, in April 2015. This tutorial was one of the most popular tutorials in ICASSP in terms of the number of attendees. The success of this tutorial lecture brought the idea of writing a textbook on this subject to promote using machine learning for signal processing. Some of the material is also based on a number of invited talks and distinguished lectures in different workshops and universities in Japan and Hong Kong. We strongly believe in the importance of machine learning and deep learning approaches to source separation, and sincerely encourage researchers to work on machine learning approaches to source separation.
Acknowledgments

First, I want to thank my colleagues and research friends, especially the members of the Machine Learning Lab at National Chiao Tung University. Some of the studies in this book were actually conducted while discussing and working with them. I would also like to thank many people for contributing good ideas, proofreading a draft, and giving me valuable comments, which greatly improved this book, including Sadaoki Furui, Chin-Hui Lee, Shoji Makino, Tomohiro Nakatani, George A. Saon, Koichi Shinoda, Man-Wai Mak, Shinji Watanabe, Issam El Naqa, Huan-Hsin Tseng, Zheng-Hua Tan, Zhanyu Ma, Tai-Shih Chi, Shoko Araki, Marc Delcroix, John R. Hershey, Jonathan Le Roux, Hakan Erdogan and Tomoko Matsui. The research experiences were especially impressive and inspiring when working on source separation problems with my past and current students, in particular Bo-Cheng Chen, Chang-Kai Chao, Shih-Hsiung Lee, Meng-Feng Chen, Tsung-Han Lin, Hsin-Lung Hsieh, Po-Kai Yang, Chung-Chien Hsu, Guan-Xiang Wang, You-Cheng Chang, Kuan-Ting Kuo, Kai-Wei Tsou and Che-Yu Kuo. We are very grateful to the Ministry of Science and Technology of Taiwan for long-term support of our research on machine learning and source separation. The great efforts of the editors of Academic Press at Elsevier, namely Tim Pitts, Charlie Kent, Carla B. Lima, John Leonard and Sheela Josy, are also appreciated. Finally, I would like to thank my family for supporting my whole research life.

Jen-Tzung Chien
Hsinchu, Taiwan
October 2018
Notations and Abbreviations

GENERAL NOTATIONS
This book observes the following general mathematical notations across different chapters:

Z+ = {1, 2, ...}   Set of positive integers
R   Set of real numbers
R+   Set of positive real numbers
R^D   Set of D-dimensional real numbers
a   Scalar variable
a   Vector variable
a = [a1 ··· aN]^⊤   Elements of a vector, which can be described with the square brackets [···]. ⊤ denotes the transpose operation.
A   Matrix variable
A = [a b; c d]   Elements of a matrix, which can be described with the square brackets [···]
A   Tensor variable
I_D   D × D identity matrix
|A|   Determinant of a square matrix
tr[A]   Trace of a square matrix
A = {a1, ..., aN} = {an}_{n=1}^N   Elements in a set, which can be described with the curly braces {···}.
A = {an}   Elements in a set, where the range of index n is omitted for simplicity.
|A|   The number of elements in a set A. For example, |{an}_{n=1}^N| = N.
f(x) or f_x   Function of x
p(x) or q(x)   Probabilistic distribution function of x
F[f]   Functional of f. Note that a functional uses the square brackets [·] while a function uses the parentheses (·).
E[·]   Expectation function
H[·]   Entropy function
E_{p(x|y)}[f(x)|y] = ∫ f(x)p(x|y)dx   The expectation of f(x) with respect to probability distribution p(x|y)
E_x[f(x)|y] = ∫ f(x)p(x|y)dx   Another form of the expectation of f(x), where the subscript with the probability distribution and/or the conditional variable is omitted when it is trivial.
δ(a, a′) = 1 if a = a′, 0 otherwise   Kronecker delta function for discrete variables a and a′
δ(x − x′)   Dirac delta function for continuous variables x and x′
Θ_ML, Θ_MAP, ...   The model parameters Θ estimated by a specific criterion (e.g., maximum likelihood (ML), maximum a posteriori (MAP), etc.) are represented by the criterion abbreviation in the subscript.
BASIC NOTATIONS USED FOR SOURCE SEPARATION
We also list the specific notations for source separation. This book keeps consistency by using the same notations for different models and applications. The explanations of the notations in the following list provide a general definition.

n   Number of channels or sensors
m   Number of sources
Θ   Set of model parameters
M   Model variable including type of model, structure, hyperparameters, etc.
Ψ   Set of hyperparameters
Q(·|·)   Auxiliary function used in EM algorithm
H   Hessian matrix
T ∈ Z+   Number of observation frames
t ∈ {1, ..., T}   Time frame index
x_t ∈ R^n   n-dimensional mixed vector at time t with n channels
X = {x_t}_{t=1}^T   Sequence of T mixed vectors
s_t ∈ R^m   m-dimensional source vector at time t
y_t ∈ R^m   m-dimensional demixed vector at time t
A = {a_ij} ∈ R^{n×m}   Mixing matrix
W = {w_ji} ∈ R^{m×n}   Demixing matrix in independent component analysis (ICA)
D(X, W)   Contrast function for ICA using observation data X and demixing matrix W. This function is written as the divergence measure to be minimized.
J(X, W)   Contrast function for ICA using observation data X and demixing matrix W. This function is written as the probabilistic measure to be maximized.
X = {X_mn} ∈ R+^{M×N}   Nonnegative mixed observation matrix in nonnegative matrix factorization (NMF) with N frames and M frequency bins
B = {B_mk} ∈ R+^{M×K}   Nonnegative basis matrix in NMF with M frequency bins and K basis vectors
W = {W_kn} ∈ R+^{K×N}   Nonnegative weight matrix in NMF with K basis vectors and N frames
η   Learning rate
τ   Iteration or shifting index
λ   Regularization parameter
X = {X_lmn} ∈ R^{L×M×N}   Three-way mixed observation tensor having dimensions L, M and N
G = {G_ijk} ∈ R^{I×J×K}   Three-way core tensor having dimensions I, J and K
x_t = {x_td}   D-dimensional mixed observation vector in a deep neural network (DNN) or a recurrent neural network (RNN). There are T vectors.
r_t = {r_tk}   K-dimensional source vector
y_t = {y_tk}   K-dimensional demixed vector
z_t = {z_tm}   m-dimensional feature vector
{a_tm, a_tk}   Activations in a hidden layer and an output layer
{δ_tm, δ_tk}   Local gradients with respect to {a_tm, a_tk}
σ(a)   Sigmoid function using an activation a
s(a)   Softmax function
w^(l)   Feedforward weights in the lth hidden layer
w^(ll)   Recurrent weights in the lth hidden layer
E(w)   Error function of using the whole training data {X = {x_td}, R = {r_tk}}
E_n(w)   Error function corresponding to the nth minibatch of training data {X_n, R_n}
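As a minimal illustration of these notations (not taken from the book), the Python sketch below generates two source signals s_t, mixes them through a mixing matrix A to give observations x_t = A s_t, and recovers demixed signals y_t with a demixing transform estimated by ICA. The signal choices, the matrix values and the use of scikit-learn's FastICA are assumptions made for this example only, not the book's implementation.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
T, m, n = 2000, 2, 2                 # T frames, m sources, n channels

# Two independent non-Gaussian sources s_t (stacked as a T x m matrix S)
t = np.linspace(0.0, 8.0, T)
S = np.stack([np.sign(np.sin(3.0 * t)),        # square wave
              rng.laplace(size=T)], axis=1)    # Laplacian noise

# Instantaneous mixing x_t = A s_t with an assumed mixing matrix A (n x m)
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])
X = S @ A.T                                    # mixed observations, T x n

# Estimate a demixing transform W by ICA and apply it: y_t = W x_t
ica = FastICA(n_components=m, random_state=0, max_iter=1000)
Y = ica.fit_transform(X)                       # demixed signals, T x m

# Cross-correlation between true sources and demixed components
C = np.abs(np.corrcoef(S.T, Y.T))[:m, m:]
print(C.max(axis=1))
```

Up to the permutation and scaling ambiguity inherent in ICA, each demixed component should correlate strongly with exactly one true source, which is what the cross-correlation check above verifies.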
ABBREVIATIONS

BSS: Blind Source Separation (page 3)
ICA: Independent Component Analysis (page 22)
NMF: Nonnegative Matrix Factorization (page 25)
NTF: Nonnegative Tensor Factorization (page 33)
CP: Canonical Decomposition/Parallel Factors (page 36)
PARAFAC: Parallel Factor Analysis (page 233)
STFT: Short-Time Fourier Transform (page 6)
GMM: Gaussian Mixture Model (page 6)
DNN: Deep Neural Network (page 37)
SGD: Stochastic Gradient Descent (page 40)
MLP: Multilayer Perceptron (page 38)
ReLU: Rectified Linear Unit (page 39)
FNN: Feedforward Neural Network (page 38)
DBN: Deep Belief Network (page 43)
RBM: Restricted Boltzmann Machine (page 43)
RNN: Recurrent Neural Network (page 45)
DRNN: Deep Recurrent Neural Network (page 48)
DDRNN: Discriminative Deep Recurrent Neural Network (page 290)
BPTT: Backpropagation Through Time (page 45)
LSTM: Long Short-Term Memory (page 50)
BLSTM: Bidirectional Long Short-Term Memory (page 296)
BRNN: Bidirectional Recurrent Neural Network (page 296)
CNN: Convolutional Neural Network (page 277)
RIR: Room Impulse Response (page 8)
NCTF: Nonnegative Convolutive Transfer Function (page 11)
MIR: Music Information Retrieval (page 13)
CASA: Computational Auditory Scene Analysis (page 14)
FIR: Finite Impulse Response (page 11)
DOA: Direction of Arrival (page 6)
MFCC: Mel-Frequency Cepstral Coefficient (page 27)
KL: Kullback–Leibler (page 29)
IS: Itakura–Saito (page 30)
ML: Maximum Likelihood (page 76)
RHS: Right-Hand Side (page 29)
LHS: Left-Hand Side (page 180)
ARD: Automatic Relevance Determination (page 59)
MAP: Maximum A Posteriori (page 83)
RLS: Regularized Least-Squares (page 62)
SBL: Sparse Bayesian Learning (page 63)
LDA: Linear Discriminant Analysis (page 69)
SNR: Signal-to-Noise Ratio (page 103)
SIR: Signal-to-Interference Ratio (page 24)
SDR: Signal-to-Distortion Ratio (page 74)
SAR: Source-to-Artifacts Ratio (page 74)
EM: Expectation Maximization (page 76)
VB: Variational Bayesian (page 85)
VB-EM: Variational Bayesian Expectation Maximization (page 85)
ELBO: Evidence Lower Bound (page 87)
MCMC: Markov Chain Monte Carlo (page 92)
HMM: Hidden Markov Model (page 100)
PCA: Principal Component Analysis (page 100)
MDL: Minimum Description Length (page 101)
BIC: Bayesian Information Criterion (page 101)
CIM: Component Importance Measure (page 102)
MLED: Maximum Likelihood Eigendecomposition (page 103)
LR: Likelihood Ratio (page 110)
NLR: Nonparametric Likelihood Ratio (page 106)
ME: Maximum Entropy (page 108)
MMI: Minimum Mutual Information (page 108)
NMI: Nonparametric Mutual Information (page 116)
WER: Word Error Rate (page 104)
SER: Syllable Error Rate (page 115)
C-DIV: Convex Divergence (page 118)
C-ICA: Convex Divergence ICA (page 118)
WNMF: Weighted Nonnegative Matrix Factorization (page 123)
NB-ICA: Nonstationary Bayesian ICA (page 132)
OLGP-ICA: Online Gaussian Process ICA (page 145)
SMC-ICA: Sequential Monte Carlo ICA (page 145)
GP: Gaussian Process (page 59)
AR: Autoregressive Process (page 147)
NMFD: Nonnegative Matrix Factor Deconvolution (page 161)
NMF2D: Nonnegative Matrix Factor 2-D Deconvolution (page 161)
NTFD: Nonnegative Tensor Factor Deconvolution (page 237)
NTF2D: Nonnegative Tensor Factor 2-D Deconvolution (page 237)
GIG: Generalized Inverse-Gaussian (page 168)
MGIG: Matrix-variate Generalized Inverse-Gaussian (page 255)
LPC: Linear Prediction Coefficient (page 172)
CD: Cepstrum Distance (page 171)
LLR: Log-Likelihood Ratio (page 172)
FWSegSNR: Frequency-Weighted Segmental SNR (page 172)
SRMR: Speech-to-Reverberation Modulation Energy Ratio (page 172)
PLCA: Probabilistic Latent Component Analysis (page 174)
PLCS: Probabilistic Latent Component Sharing (page 181)
CAS: Collaborative Audio Enhancement (page 181)
1-D: One-Dimensional (page 177)
2-D: Two-Dimensional (page 177)
BNMF: Bayesian Nonnegative Matrix Factorization (page 182)
GE-BNMF: Gaussian–Exponential BNMF (page 195)
PG-BNMF: Poisson–Gamma BNMF (page 195)
PE-BNMF: Poisson–Exponential BNMF (page 195)
SMR: Speech-to-Music Ratio (page 196)
NSDR: Normalized Signal-to-Distortion Ratio (page 199)
GNSDR: Global Normalized Signal-to-Distortion Ratio (page 199)
BGS: Bayesian Group Sparse learning (page 202)
LSM: Laplacian Scale Mixture distribution (page 206)
NMPCF: Nonnegative Matrix Partial Co-Factorization (page 202)
GNMF: Group-based Nonnegative Matrix Factorization (page 203)
DNMF: Discriminative Nonnegative Matrix Factorization (page 69)
LNMF: Layered Nonnegative Matrix Factorization (page 219)
DLNMF: Discriminative Layered Nonnegative Matrix Factorization (page 219)
FA: Factor Analysis (page 221)
PMF: Probabilistic Matrix Factorization (page 242)
PTF: Probabilistic Tensor Factorization (page 244)
PSDTF: Positive Semidefinite Tensor Factorization (page 251)
LD: Log-Determinant (page 253)
GaP: Gamma Process (page 254)
LBFGS: Limited Memory Broyden–Fletcher–Goldfarb–Shanno (page 263)
STOI: Short-Time Objective Intelligibility (page 263)
PESQ: Perceptual Evaluation of Speech Quality (page 264)
T-F: Time-Frequency (page 268)
CCF: Cross-Correlation Function (page 269)
ITD: Interaural Time Difference (page 269)
ILD: Interaural Level Difference (page 269)
GFCC: Gammatone Frequency Cepstral Coefficients (page 269)
BIR: Binaural Impulse Response (page 269)
IBM: Ideal Binary Mask (page 268)
PIT: Permutation Invariant Training (page 275)
IRM: Ideal Ratio Mask (page 274)
IAM: Ideal Amplitude Mask (page 274)
IPSM: Ideal Phase Sensitive Mask (page 274)
FC: Fully-Connected (page 283)
STF: Spectral-Temporal Factorization (page 283)
L-BFGS: Limited-memory Broyden–Fletcher–Goldfarb–Shanno (page 290)
VRNN: Variational Recurrent Neural Network (page 299)
VAE: Variational Auto-Encoder (page 299)
NTM: Neural Turing Machine (page 307)
RCNN: Recall Neural Network (page 314)
INTRODUCTION

In the real world, mixed signals are received everywhere. The observations perceived by a human are degraded, and in many cases it is difficult to acquire faithful information from the environment. For example, we are surrounded by sounds and noises with interference from room reverberation. Multiple sources are active simultaneously, so the sound effects or listening conditions for speech and audio signals deteriorate considerably. From the perspective of computer vision, an observed image is usually blurred by noise, illuminated by lighting or mixed with another image due to reflection, so the target object becomes hard to detect and recognize. In addition, it is also important to deal with the mixed signals of medical imaging data, including magnetoencephalography (MEG) and functional magnetic resonance imaging (fMRI). The mixing interference from external sources of electromagnetic fields due to muscle activity significantly masks the desired measurement of brain activity. Therefore, how to come up with a powerful solution that separates a mixed signal into its individual source signals is nowadays a challenging problem, which has attracted many researchers to work in this direction and develop practical systems and applications.
This chapter starts with an introduction to various types of separation systems in Section 1.1. We then address the separation problems and challenges in Section 1.2, where machine learning and deep learning algorithms are applied to tackle these problems. A set of practical systems and applications using source separation is illustrated. An overview of the whole book is systematically given in Section 1.3.
1.1 SOURCE SEPARATION

Blind source separation (BSS) aims to separate a set of source signals from a set of mixed signals without, or with very little, information about the source signals or the mixing process. BSS deals with the problem of signal reconstruction from a mixed signal or a set of mixed signals. Such a scientific domain is multidisciplinary. Signal processing and machine learning are two professional domains which have been widely explored to deal with various challenges in BSS. In general, there are three types of mixing systems or sensor networks in real-world applications, namely multichannel source separation, monaural source separation and deconvolution-based separation, which are surveyed in what follows.
1.1.1 MULTICHANNEL SOURCE SEPARATION

A classical example of a source separation problem is the cocktail party problem, where a number of people are talking simultaneously in a room at a cocktail party, and a listener is trying to follow one of the discussions. As shown in Fig. 1.1, three speakers {s_t1, s_t2, s_t3} are talking at the same time. Three microphones {x_t1, x_t2, x_t3} are installed nearby as the sensors to acquire speech signals which
Source Separation and Machine Learning. https://doi.org/10.1016/B978-0-12-804566-4.00012-7
Copyright © 2019 Elsevier Inc. All rights reserved.
FIGURE 1.1
Cocktail party problem with three speakers and three microphones.
are mixed differently depending on the location, angle and channel characteristics of the individual microphones. A linear mixing system is constructed as

$$
\begin{aligned}
x_{t1} &= a_{11}s_{t1} + a_{12}s_{t2} + a_{13}s_{t3},\\
x_{t2} &= a_{21}s_{t1} + a_{22}s_{t2} + a_{23}s_{t3}, \qquad (1.1)\\
x_{t3} &= a_{31}s_{t1} + a_{32}s_{t2} + a_{33}s_{t3}.
\end{aligned}
$$

This 3 × 3 mixing system can be written in vector and matrix form as x_t = A s_t, where x_t = [x_t1 x_t2 x_t3]^T, s_t = [s_t1 s_t2 s_t3]^T and A = [a_ij] ∈ R^{3×3}. This system involves the current time t and a constant mixing matrix A without considering the noise effect. We also call it the instantaneous and noiseless mixing system. Assuming the 3 × 3 mixing matrix A is invertible, an inverse problem is then tackled to identify the source signals as s_t = W x_t, where W = A^{-1} is the demixing matrix which exactly recovers the original source signals s_t from the mixed observations x_t.
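The instantaneous, noiseless 3 × 3 case can be sketched numerically; the mixing matrix A and the random sources below are illustrative stand-ins, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three source signals s_t (rows: sources, columns: time samples).
S = rng.standard_normal((3, 1000))

# An illustrative, nonsingular 3x3 mixing matrix A (instantaneous, noiseless).
A = np.array([[1.0, 0.5, 0.3],
              [0.4, 1.0, 0.2],
              [0.3, 0.6, 1.0]])

# Mixed observations x_t = A s_t at every time t.
X = A @ S

# With A known and invertible, the demixing matrix W = A^{-1}
# recovers the sources exactly: y_t = W x_t = s_t.
W = np.linalg.inv(A)
Y = W @ X

print(np.allclose(Y, S))  # True: exact recovery in the noiseless case
```

In BSS, of course, A is unknown, which is precisely why ICA and related methods are needed; this sketch only verifies the algebra of the instantaneous model.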
More generally, multichannel source separation is formulated as an n × m linear mixing system consisting of a set of n linear equations for n individual channels or sensors x_t ∈ R^{n×1}, where m sources s_t ∈ R^{m×1} are present (see Fig. 1.2). The general linear mixing system x_t = A s_t is expressed and extended by

$$
\begin{aligned}
x_{t1} &= a_{11}s_{t1} + a_{12}s_{t2} + \cdots + a_{1m}s_{tm},\\
&\;\;\vdots \qquad (1.2)\\
x_{tn} &= a_{n1}s_{t1} + a_{n2}s_{t2} + \cdots + a_{nm}s_{tm},
\end{aligned}
$$

where the mixing matrix A = [a_ij] ∈ R^{n×m} is merged. In this problem, the mixing matrix A and the source signals s_t are unknown. Our goal is to reconstruct the source signal y_t ∈ R^{m×1} by finding a
FIGURE 1.2
A general linear mixing system with n observations and m sources.
demixing matrix W ∈ R^{m×n} through y_t = W x_t. We aim at estimating the demixing matrix according to an objective function D(X, W) from a set of mixed signals X = {x_t}_{t=1}^T so that the constructed signals are as close to the original source signals as possible, i.e., y_t ≈ s_t. There are three situations in multichannel source separation:
Determined System: n = m
In this case, the number of channels is the same as the number of sources. We are solving a determined system where a unique solution exists if the mixing matrix A is nonsingular and the inverse matrix W = A^{-1} is tractable. The exact solution in this situation is obtained by y_t = W x_t, where y_t = s_t. For the application of audio signal separation, this condition implies that the number of speakers or musical sources is the same as the number of microphones used to acquire the audio signals. A microphone array is introduced in such a situation. This means that if more sources are present, we need to employ more microphones to estimate the individual source signals. Independent component analysis (ICA), as detailed in Section 2.1, is developed to resolve this case. A number of advanced solutions to this BSS case will be described in Chapter 4.
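As a rough illustration of how ICA resolves the determined case when A is unknown, the following sketch implements a minimal FastICA-style fixed-point iteration (whitening, tanh nonlinearity, symmetric decorrelation) on two synthetic uniform sources. The mixing matrix, sample count and iteration budget are illustrative assumptions, not settings from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 5000

# Two independent non-Gaussian (uniform) sources; determined case n = m = 2.
S = rng.uniform(-1.0, 1.0, size=(2, T))
A = np.array([[1.0, 0.6],
              [0.5, 1.0]])   # illustrative nonsingular mixing matrix
X = A @ S                    # observed mixtures x_t = A s_t

# Whitening: linearly transform the mixtures to zero mean, unit covariance.
Xc = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(Xc @ Xc.T / T)
Z = (E @ np.diag(d ** -0.5) @ E.T) @ Xc

# FastICA-style fixed-point iterations with tanh nonlinearity and
# symmetric decorrelation of the demixing matrix W.
W = rng.standard_normal((2, 2))
for _ in range(200):
    G = np.tanh(W @ Z)
    W = (G @ Z.T) / T - np.diag((1.0 - G ** 2).mean(axis=1)) @ W
    U, _, Vt = np.linalg.svd(W)      # W <- (W W^T)^{-1/2} W
    W = U @ Vt

Y = W @ Z   # demixed signals, recovered up to permutation and sign

# Each demixed component should align with exactly one true source.
corr = np.abs(np.corrcoef(np.vstack([Y, S]))[:2, 2:])
print(corr.max(axis=1))  # both entries close to 1
```

The permutation and sign ambiguities visible here (hence the absolute correlation check) are inherent to BSS: the contrast function only pushes the demixed components away from Gaussianity and toward independence.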
Overdetermined System: n > m

In this case, the number of channels n is larger than the number of sources m. In mathematics, this is an overdetermined case in a system of equations, where there are more equations than unknowns. In audio source separation, each speaker or musical source is seen as an available degree of freedom, while each channel or microphone is viewed as a constraint that restricts one degree of freedom. The overdetermined case appears when the system has been overconstrained. Such an overdetermined system is almost always inconsistent, so there is no consistent solution, especially when constructed with a random mixing matrix A. In Sawada et al. (2007), a complex-valued ICA was developed to tackle this situation, in which the number of microphones was sufficient for the number of sources. This method separates the frequency bin-wise mixtures. For each frequency bin, an ICA demixing matrix is estimated to optimally push the distribution of the demixed elements far from a Gaussian.
Underdetermined System: n < m

An underdetermined system in source separation occurs when the number of channels is less than the number of sources (Winter et al., 2007). This system is underconstrained, and it is difficult to estimate a reliable solution. However, such a case is challenging and relevant in many real-world applications. In particular, we are interested in the case of single-channel source separation, where a single mixed signal is received in the presence of two or more source signals. For audio source separation, several methods have been proposed to deal with this circumstance. In Sawada et al. (2011), a time-frequency masking scheme was proposed to identify which source had the largest amplitude in each individual time-frequency slot (f, t). During the identification procedure, a short-time Fourier transform (STFT) was first calculated to find time-frequency observation vectors x_ft. Clustering of the time-frequency observation vectors was performed to calculate the posterior probability p(j | x_ft) that a vector x_ft belongs to a cluster or a source j. A likelihood function p(x_ft | j) based on a Gaussian mixture model (GMM) was used in this calculation. A time-frequency masking function M_jft was accordingly determined to estimate the separated signals s_jft = M_jft x_ft for an individual source j.
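The masking idea can be illustrated with an oracle variant: instead of the GMM clustering used by Sawada et al. (2011), the sketch below builds an ideal binary mask directly from known (synthetic) source magnitudes, which are stand-ins for |STFT| spectrograms; the dimensions and the additive-magnitude approximation are assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(2)
F, T = 64, 100  # frequency bins f and time frames t

# Synthetic nonnegative magnitude spectrograms for two sources.
S1 = rng.exponential(1.0, size=(F, T))
S2 = rng.exponential(1.0, size=(F, T))

# Single-channel mixture in the T-F domain (additive-magnitude approximation).
X = S1 + S2

# Ideal binary mask: assign each T-F slot (f, t) to the dominant source,
# i.e., the source with the largest amplitude in that slot.
M1 = (S1 >= S2).astype(float)
M2 = 1.0 - M1

# Masked estimates s_jft = M_jft * x_ft for each source j.
S1_hat = M1 * X
S2_hat = M2 * X

# Binary masks partition the mixture energy between the two estimates.
print(np.allclose(S1_hat + S2_hat, X))  # True
```

In the actual blind setting the sources are unknown, so the mask has to be inferred, e.g., from the GMM posterior p(j | x_ft); the oracle mask above is only the target that such clustering approximates.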
In some cases, the number of sources m is unknown beforehand. Estimating the number of sources requires identifying the right condition and developing the right solution to overcome the corresponding mixing problem. In Araki et al. (2009a, 2009b), the authors constructed a GMM with a Dirichlet prior on the mixture weights to identify the direction of arrival (DOA) of the source speech signal from individual time-frequency observations x_ft and used the DOA information to learn the number of sources and develop a specialized solution for sparse source separation.
In addition, we usually assume the mixing system is time-invariant or, equivalently, that the mixing matrix A is time independent. This assumption may not faithfully reflect real-world source separation, where sources are moving or changing, or the number of sources is itself changing. In this case, the mixing matrix is time dependent, i.e., A → A(t), and we need to find a demixing matrix which is also time dependent, W → W(t). Estimating the source signals s_t under such a nonstationary mixing system is relevant in practice and crucial for real-world blind source separation.
1.1.2 MONAURAL SOURCE SEPARATION

BSS is in general highly underdetermined. Many applications involve a single-channel source separation problem (n = 1). Among different realizations of a mixing system, it is crucial to deal with single-channel source separation because a wide range of applications involve only a single recording channel but mix or convolve various sources or interferences. Fig. 1.3 demonstrates a scenario of monaural source separation using a single microphone with three sources. We aim to suppress the ambient noises, including the bird and the airplane, and identify the human voices for listening or understanding. Therefore, single-channel source separation can generally be treated as an avenue to speech enhancement or noise reduction, in the sense that we want to enhance or purify the speech signal in the presence of surrounding noises.

There are two learning strategies in monaural source separation, supervised learning and unsupervised learning. The supervised approach conducts source separation given labeled training data from different sources; namely, the separated training data are collected in advance. Using this strategy, source separation is not truly blind. A set of training data pairs with mixed signals and separated signals is provided to train a demixing system which is generalizable to decompose unseen mixed signals. Nonnegative matrix factorization (NMF) and deep neural networks (DNNs) are two machine learning paradigms for dealing with single-channel source separation, which will be extensively described in Sections 2.2 and 2.4, with a number of advanced works organized in Chapters 5 and 7, respectively. Basically, NMF (Lee and Seung, 1999) factorizes a nonnegative data matrix X = {x_t}_{t=1}^T into a prod-
Therearetwolearningstrategiesinmonauralsourceseparation, supervisedlearning and unsupervisedlearning.Supervisedapproachconductssourceseparationgivenbythelabeledtrainingdata fromdifferentsources.Namely,theseparatedtrainingdataarecollectedinadvance.Usingthisstrategy, sourceseparationisnottrulyblind.Asetoftrainingdatapairswithmixedsignalsandseparatedsignalsareprovidedtotrainademixingsystemwhichisgeneralizabletodecomposethoseunseenmixed signals.Nonnegativematrixfactorization(NMF)anddeepneuralnetwork(DNN)aretwomachine learningparadigmstodealwithsingle-channelsourceseparationwhichwillbeextensivelydescribed inSections 2.2 and 2.4 withanumberofadvancedworksorganizedinChapters 5 and 7,respectively. Basically,NMF(LeeandSeung, 1999)factorizesanonnegativedatamatrix X ={xt }T t =1 intoaprod-