
Proceedings of The National Conference on Man Machine Interaction 2014 [NCMMI 2014]


Real Time Characterization of Lip-Movements Using Webcam

Suresh D S, Sandesh C


Director & Head of ECE Dept., CIT, Gubbi, Tumkur, Karnataka.
PG Student, ECE Dept., CIT, Gubbi, Tumkur, Karnataka.


Abstract: Communication plays an important role in the day-to-day activities of human beings. The main objective of this paper is to help people who are unable to speak. Visual speech plays a major role in lip-reading for deaf listeners. We use local spatiotemporal descriptors to recognize words or phrases from speech-disabled people by their lip-movements. Local binary patterns extracted from the lip movements are used to recognize isolated phrases. Experiments were made on ten speakers with 5 phrases, and we obtained accuracies of about 75% for the speaker-dependent case and 65% for the speaker-independent case. When compared with other approaches, such as those on the AVLetters database, our method outperforms them with an accuracy of 65%. The advantages of our method are robustness and the possibility of recognition in real time.

Keywords: LBP-TOP, Webcam, Visual speech, KNN classifier, Silent speech interface.

I. Introduction

In recent days, the silent speech interface has become an alternative in the field of speech processing. Voice-based speech communication suffers from many challenges arising from the fact that speech must be clearly audible: it cannot be masked, it lacks robustness, it raises privacy issues, and it excludes speech-disabled persons. These challenges may be overcome by the use of a silent speech interface. A silent speech interface is a device enabling speech communication without the necessity of a voice signal, or when an audible acoustic signal is unavailable. In our approach, lip movements are captured by a webcam placed in front of the lips. Much research work focuses only on visual information to enhance speech recognition [1]. Audio information still plays a larger role than visual features, but in most cases it is very difficult to extract. In our method we concentrate on lip-movement representations for speech recognition solely with visual information.

The extraction of a set of visual observation vectors is the key element of an AVSR (audio-visual speech recognition) system. Geometric features, combined features, and appearance features are mainly used for representing visual information. Geometric feature methods represent facial animation parameters such as lip movement, shape of the jaw, and width of the mouth. These methods require accurate and reliable facial feature detection, which is difficult in practice and impossible at low image resolution.

In this paper, we propose an approach to lip reading or lip movements in which human-computer interaction improves significantly, as does understanding in noisy environments.

II. Local Spatiotemporal Descriptors for Visual Information

In this paper, we concentrate on LBP-TOP in order to extract features from the extracted video frames. The Local Binary Pattern (LBP) operator is a gray-scale invariant texture measure, a simple statistic which has shown excellent performance in the classification of different kinds of texture. For each pixel in an image, a binary code is generated by thresholding the values of its neighborhood against the value of the center pixel, as shown in Figure 1(a) [1].
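As a concrete illustration of the thresholding step just described, the basic 3x3 LBP code of one pixel can be sketched as follows. This is a minimal Python sketch, not the authors' code; the function name and the clockwise bit ordering are our own assumptions.

```python
def basic_lbp(image, x, y):
    """Return the 8-bit LBP code of pixel (x, y) of a 2-D list `image`:
    each neighbour is thresholded against the centre pixel and the
    resulting bits are packed into one byte."""
    center = image[y][x]
    # Clockwise neighbour offsets starting from the top-left corner
    # (an arbitrary but fixed ordering; any fixed ordering works).
    offsets = [(-1, -1), (0, -1), (1, -1), (1, 0),
               (1, 1), (0, 1), (-1, 1), (-1, 0)]
    code = 0
    for bit, (dx, dy) in enumerate(offsets):
        if image[y + dy][x + dx] >= center:  # threshold against centre
            code |= 1 << bit
    return code

# A tiny 3x3 patch; only the centre pixel (1, 1) has a full neighbourhood.
patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
code = basic_lbp(patch, 1, 1)
```

A flat patch (all pixels equal) yields all bits set, since the comparison is "greater or equal".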

NCMMI '14

ISBN : 978-81-925233-4-8

www.edlib.asdf.res.in



Figure 1. (a) Basic LBP operator. (b) Circular (8, 2) neighborhood.




The LBP code of a pixel is computed as

LBP_{P,R} = sum_{p=0}^{P-1} s(g_p - g_c) 2^p,  where s(x) = 1 if x >= 0 and 0 otherwise,

where g_c denotes the grey value of the center pixel (x_c, y_c) of the local neighborhood and g_p corresponds to the grey values of P equally spaced pixels on a circle of radius R. A histogram is created in order to collect the occurrences of the various binary patterns. The definition of neighbors can be extended to circular neighborhoods with any number of pixels, as shown in Figure 1(b). In this way, we can collect larger-scale texture primitives or micro-patterns, like spots, lines, and corners.
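The circular sampling and the pattern histogram described above can be sketched in Python as follows. The helper names are our own; the coordinate convention follows the sine/cosine formulas quoted in the text.

```python
import math

def circular_neighbourhood(xc, yc, P, R):
    """P equally spaced sampling points on a circle of radius R
    around the centre pixel (xc, yc)."""
    return [(xc - R * math.sin(2 * math.pi * p / P),
             yc + R * math.cos(2 * math.pi * p / P))
            for p in range(P)]

def pattern_histogram(codes, n_bins=256):
    """Collect the occurrences of the binary patterns into a histogram."""
    hist = [0] * n_bins
    for c in codes:
        hist[c] += 1
    return hist

# The circular (8, 2) neighbourhood of Figure 1(b):
points = circular_neighbourhood(0.0, 0.0, P=8, R=2.0)
```

Sampling points that fall between pixels would be bilinearly interpolated in a full implementation; that step is omitted here.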

Local texture descriptors have obtained tremendous attention in the analysis of facial images because of their robustness to challenges such as pose and illumination changes. In our approach, a temporal texture representation is obtained using local binary patterns extracted from three orthogonal planes (LBP-TOP). The LBP-TOP method is more efficient than ordinary LBP: in ordinary LBP we extract information or features in two dimensions, whereas in LBP-TOP we extract information in three dimensions, i.e., X, Y, and T. For LBP-TOP, the radii in the spatial and temporal axes X, Y, and T, and the number of neighboring points in the XY, XT, and YT planes, can also differ and are denoted RX, RY, RT and PXY, PXT, PYT. The LBP-TOP feature is then denoted LBP-TOP_{PXY,PXT,PYT,RX,RY,RT}. If the coordinates of the center pixel g_{tc,c} are (xc, yc, tc), then the coordinates of the local neighborhood in the XY plane g_{XY,p} are given by (xc - RX sin(2*pi*p/PXY), yc + RY cos(2*pi*p/PXY), tc), the coordinates of the local neighborhood in the XT plane g_{XT,p} are given by (xc - RX sin(2*pi*p/PXT), yc, tc - RT cos(2*pi*p/PXT)), and the coordinates of the local neighborhood in the YT plane g_{YT,p} are given by (xc, yc - RY cos(2*pi*p/PYT), tc - RT sin(2*pi*p/PYT)). Sometimes the radii in the three axes are the same, and so are the numbers of neighboring points in the XY, XT, and YT planes; in that case we use LBP-TOP_{P,R} for abbreviation, where P = PXY = PXT = PYT and R = RX = RY = RT [1].
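The neighbour coordinates on the three orthogonal planes can be transcribed directly from the formulas above. This is a Python sketch under our own naming, not the authors' implementation.

```python
import math

def lbp_top_neighbours(xc, yc, tc, plane, P, RX, RY, RT):
    """Neighbour coordinates for one centre pixel (xc, yc, tc) in one
    LBP-TOP plane, transcribing the formulas quoted in the text."""
    pts = []
    for p in range(P):
        a = 2 * math.pi * p / P
        if plane == "XY":    # appearance: vary x and y, frame fixed
            pts.append((xc - RX * math.sin(a), yc + RY * math.cos(a), tc))
        elif plane == "XT":  # horizontal motion: vary x and t, row fixed
            pts.append((xc - RX * math.sin(a), yc, tc - RT * math.cos(a)))
        else:                # "YT", vertical motion: vary y and t
            pts.append((xc, yc - RY * math.cos(a), tc - RT * math.sin(a)))
    return pts
```

With equal radii and neighbour counts this corresponds to the abbreviated LBP-TOP_{P,R} form.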

Figure 2. (a) Volume of utterance sequence. (b) Image in XY plane. (c) Image in XT plane. (d) Image in TY plane.


Figure 2(a) demonstrates the volume of an utterance sequence. Figure 2(b) shows the image in the XY plane. Figure 2(c) is an image in the XT plane, providing a visual impression of one row changing in time, while Figure 2(d) describes the motion of one column in temporal space [1]. An LBP description computed over the whole utterance sequence encodes only the occurrences of the micro-patterns, without any indication of their locations. In order to overcome this effect, a representation is introduced that divides the mouth image into several overlapping blocks. Figure 3 also gives some examples of the LBP images. The second, third, and fourth rows show the LBP images drawn using the LBP code of every pixel from the XY (second row), XT (third row), and YT (fourth row) planes, respectively, corresponding to the mouth images in the first row.



Figure 3. Mouth region images (first row), LBP-XY images (second row), LBP-XT images (third row), and LBP-YT images (last row) from one utterance [1].



Figure 4. Features in each block volume. (a) Block volumes. (b) LBP features from three orthogonal planes. (c) Concatenated features for one block volume with the appearance and motion [1].

Figure 5. Mouth movement representation [1].

When a person utters a command phrase, the words are pronounced in order, for instance "you-see" or "see-you". If we do not consider the time order, these two phrases would generate almost the same features.


To overcome this effect, the whole sequence is divided into block volumes according not only to spatial regions but also to time order, as Figure 4(a) shows. The LBP-TOP histograms in each block volume are computed and concatenated into a single histogram, as shown in Figure 4. All features extracted from each block volume are connected to represent the appearance and motion of the mouth region sequence, as shown in Figure 5.
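The block-wise concatenation of Figure 4 can be sketched as follows. The input layout (a list of blocks, each holding one code list per plane) and the per-histogram normalisation are our own assumptions for illustration.

```python
def block_histograms(codes_by_block, n_bins=16):
    """Concatenate per-block LBP-TOP histograms into one feature vector.

    `codes_by_block` is a list of block volumes; each block is a list of
    three code lists, one per plane (XY, XT, YT), mirroring Figure 4's
    "features in each block volume"."""
    feature = []
    for block in codes_by_block:
        for plane_codes in block:            # XY, XT, YT in order
            hist = [0] * n_bins
            for c in plane_codes:
                hist[c] += 1
            total = sum(hist) or 1
            # Normalise each histogram so blocks of different sizes
            # contribute comparably.
            feature.extend(h / total for h in hist)
    return feature

# One block, three planes, tiny toy codes in [0, 3]:
feature = block_histograms([[[0, 0, 1], [1], [2, 2]]], n_bins=4)
```

The resulting vector has (number of blocks) x 3 x n_bins entries.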


A histogram of the mouth movements can be defined as follows:

H_{i,j} = sum_{x,y,t} I{ f_j(x, y, t) = i },  i = 0, ..., n_j - 1;  j = 0, 1, 2,

where f_j(x, y, t) is the LBP code of pixel (x, y, t) in the j-th plane, n_j is the number of different labels produced by the LBP operator in the j-th plane, and I{A} equals 1 if A is true and 0 otherwise.



III. Multiresolution Features and Feature Selection


Multiresolution features provide more accurate information and improve the analysis of dynamic events. Using these multiresolution features greatly increases the number of features. If the features from the different resolutions were concatenated in series directly, the feature vector would be very long and, as a result, the computational complexity would be too high. It is obvious that not all the multiresolution features contribute equally, so it is necessary to determine which features are most important: at which location, with what resolution, and, more importantly, of which type (appearance, horizontal motion, or vertical motion). We need feature selection for this purpose. By changing the parameters, three different types of spatiotemporal resolution are obtained: 1) use of a different number of neighboring points when computing the features in the XY (appearance), XT (horizontal motion), and YT (vertical motion) slices; 2) use of different radii, which can capture occurrences at different space and time scales; 3) use of blocks of different sizes to create global and local statistical features [4][5].
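The three axes of resolution listed above can be enumerated as a simple parameter grid, which makes the multiplicative growth of the feature count concrete. The values below are illustrative only, not those used in the experiments.

```python
from itertools import product

def multiresolution_settings(neighbour_counts, radii, block_grids):
    """Every combination of neighbour count P, radius R and block grid.
    Each combination yields its own concatenated histogram, so the
    total feature count grows multiplicatively; hence the need for
    feature selection."""
    return list(product(neighbour_counts, radii, block_grids))

# Illustrative values: 2 neighbour counts x 3 radii x 2 block grids
# already give 12 distinct feature sets.
settings = multiresolution_settings([4, 8], [1, 2, 3], [(2, 2), (4, 4)])
```

A selection step (e.g. boosting, as in [5]) would then keep only the most discriminative slices out of this grid.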

IV. Our System

Our system consists of three stages, as shown in Figure 6. The first stage is the detection of lip movements. The second stage extracts the visual features from the mouth movement sequence. The role of the final stage is to recognize the input utterance using a KNN classifier.

Figure 6. System diagram.

In our approach a webcam is used to capture the lip-movements. We then extract the visual features from the mouth region. These features are fed to the classifier. For speech recognition, a KNN classifier is selected since it is well founded in statistical learning theory and has been successfully applied to various object detection tasks in computer vision. Since the KNN is only used here for separating two sets of points, the ten-phrase classification problem is decomposed into two-class problems, and then a voting scheme is used to accomplish recognition. Sometimes more than one class gets the highest number of votes; in this case, 1-NN template matching is applied to these classes to reach the final result. This means that in training, the spatiotemporal LBP histograms of utterance sequences belonging to a given class are averaged to generate a histogram template for that class [1].
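The voting scheme and the 1-NN tie-break just described can be sketched as follows. The function names are our own, and the L1 distance between histograms is an assumption; the text does not specify the distance measure.

```python
from collections import Counter

def vote_ovo(pairwise_winners):
    """One-vs-one voting: `pairwise_winners` holds the winning class of
    each two-class classifier. Returns all classes tied for the highest
    vote count (usually a single class)."""
    votes = Counter(pairwise_winners)
    top = max(votes.values())
    return [cls for cls, v in votes.items() if v == top]

def nn_template(feature, templates):
    """1-NN template matching: pick the class whose averaged training
    histogram is closest (here: L1 distance) to `feature`."""
    def l1(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))
    return min(templates, key=lambda cls: l1(feature, templates[cls]))
```

In use, `vote_ovo` is called first; only if it returns more than one class is `nn_template` invoked on those tied classes.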

V. Experiments Protocol and Results

For a more accurate evaluation of our proposed method, we designed different experiments, including speaker-independent, speaker-dependent, and multiresolution experiments.

1) Speaker-Independent Experiments: For the speaker-independent experiments, leave-one-speaker-out is utilized. We trained a particular phrase using one speaker, had the same phrase spoken by 10 different speakers, and when testing was done we were successful in getting the same phrase from 6 speakers. The overall results were obtained as M/N (M is the total number of correctly recognized sequences and N is the total number of testing sequences). When extracting the local patterns, we take into account not only the locations of micro-patterns but also the time order of lip-movements, so the whole sequence is divided into block volumes according to not only spatial regions but also time order. Figure 7 demonstrates the performance for every speaker. The results for the second speaker are the worst, mainly because that speaker's big moustache strongly influences the appearance and motion in the mouth region.
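The overall score M/N defined above is a simple ratio; a sketch with hypothetical names:

```python
def overall_result(predicted, actual):
    """M/N: number of correctly recognized sequences over the total
    number of testing sequences."""
    m = sum(1 for p, a in zip(predicted, actual) if p == a)
    return m / len(actual)
```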



Figure 7. Recognition performance for every speaker.


2) Speaker-Dependent Experiments: For the speaker-dependent experiments, leave-one-utterance-out is utilized for cross-validation. In our approach, when a particular speaker is trained with a list of phrases and testing is then performed with the same speaker, 70% of the phrases were identified correctly.

3) One-One versus One-Rest Recognition: In the previous experiments on our own dataset, the ten-phrase classification problem is decomposed into 45 two-class problems ("Hello"-"See you", "I am sorry"-"Thank you", "You are welcome"-"Have a good time", etc.). But using this multiple two-class strategy, the number of classifiers grows quadratically with the number of classes to be recognized, as in the AVLetters database. When the class number is N, the number of KNN classifiers would be N(N-1)/2.

Figure 8. Selected 15 slices for phrases "See you" and "Thank you". "|" in the blocks means the YT slice (vertical motion) is selected, "_" the XT slice (horizontal motion), and "/" the appearance XY slice [1].


Figure 8 shows the selected slices for the similar phrases "see you" and "thank you". These phrases were the most difficult to recognize because they are quite similar, their latter parts containing the same word "you". The selected slices are mainly in the first and second parts of the phrase; just one vertical slice is from the last part. The selected features are consistent with human intuition.

Table I. Results from One-to-One and One-to-Rest Classifiers on Semi-Speaker-Dependent Experiments (results in parentheses are from the One-to-Rest strategy)

VI. Conclusion




A real-time capture of lip-movements using a webcam for recognition of speech was proposed, in order to help disabled persons. Our approach uses local spatiotemporal descriptors for the recognition of the input utterance. LBP-TOP is used to extract the features from the captured images. Experiments were made on ten speakers, and the lip movements were converted into speech very efficiently. Compared to other approaches, our method outperforms them with an accuracy of 70%.

References

1. "Lipreading with local spatiotemporal descriptors," IEEE Transactions on Multimedia, vol. 11, no. 7, November 2009.
2. T. Ahonen, A. Hadid, and M. Pietikäinen, "Face description with local binary patterns: Application to face recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 12, pp. 2037-2041, 2006.
3. G. Zhao and M. Pietikäinen, "Dynamic texture recognition using local binary patterns with an application to facial expressions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp. 915-928, 2007.
4. G. Zhao, M. Pietikäinen, and A. Hadid, "Local spatiotemporal descriptors for visual recognition of spoken phrases," in Proc. 2nd Int. Workshop Human-Centered Multimedia (HCM 2007), 2007, pp. 57-65.
5. G. Zhao and M. Pietikäinen, "Boosted multi-resolution spatiotemporal descriptors for facial expression recognition," Pattern Recognit. Lett., Special Issue on Image/Video-Based Pattern Analysis and HCI, 2009, accepted for publication.
