Speaker Diarization of Meetings based on Large TDOA vectors

Deepu Vijayasenan (1), Fabio Valente (2)

(1) Universität des Saarlandes, (2) Idiap Research Institute

30 March 2012

D.Vijayasenan and F.Valente (UdS, Idiap)

Diarization based on large TDOA vectors

30 March 2012


Introduction

TDOA features represent the location information of the speakers.
Features are estimated with respect to a reference channel.
This is suboptimal, since the TDOA is the result of different speaker placement with respect to the microphones.
One alternative is to use TDOA values across each pair of microphones.


All TDOA pairs

Average TDOA values for speakers 6 and 8 in the IDI 20090129-1000 meeting (NIST reference notation).

[Figure: two panels plotting the average TDOA value against the TDOA index for Speaker 6 and Speaker 8. The first panel covers TDOA indices 1 to 7; the second covers TDOA indices 1 to 28.]

All TDOA pairs

However, all pair TDOAs are not used directly as features because of the high feature dimension.
TDOAs across all pairs of microphones have been employed [1] to determine initial clusters.
The problem of large feature dimension is often addressed by reducing the feature dimension or by selecting the most prominent features.

[1] Koh E.C.W. et al., "Speaker diarization using direction of arrival estimate and acoustic feature information: The I2R-NTU submission for the NIST RT 2007 evaluation", in Lecture Notes in Computer Science.


Objective

Use the large all pair TDOA vectors directly as features, in combination with spectral features.
The increased dimensionality has to be taken care of.
Two diarization systems are studied: HMM/GMM based speaker diarization and an Information Bottleneck based system.


HMM/GMM system

Each speaker c_k is modeled by a minimum-duration HMM state with GMM emission probability b_{c_k}(s_t):

$$\log b_{c_k}(s_t) = \log \sum_r w^r_{c_k} \, \mathcal{N}(s_t, \mu^r_{c_k}, \Sigma^r_{c_k})$$

Each feature stream is modeled with an individual GMM, and the stream log-likelihoods are combined with weights:

$$\log L_{c_k}(s_t) = W_{mfcc} \log b^{mfcc}_{c_k}(s^{mfcc}_t) + W_{tdoa} \log b^{tdoa}_{c_k}(s^{tdoa}_t)$$

Agglomerative clustering uses a modified BIC criterion to determine the number of clusters.
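The weighted log-likelihood combination can be sketched in a few lines of Python. This is only an illustration under stated assumptions: diagonal covariances are assumed (the slides do not specify the covariance type), and the function names and the (weights, means, variances) tuple layout are hypothetical.

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """log sum_r w_r N(x; mu_r, diag(var_r)) for a single frame x.

    Diagonal covariances are an assumption made for this sketch.
    """
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_norm = -0.5 * sum(math.log(2.0 * math.pi * v) for v in var)
        log_exp = -0.5 * sum((xi - m) ** 2 / v for xi, m, v in zip(x, mu, var))
        log_terms.append(math.log(w) + log_norm + log_exp)
    peak = max(log_terms)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))

def combined_log_likelihood(s_mfcc, s_tdoa, gmm_mfcc, gmm_tdoa, w_mfcc, w_tdoa):
    """log L = W_mfcc * log b_mfcc(s_mfcc) + W_tdoa * log b_tdoa(s_tdoa)."""
    return (w_mfcc * gmm_log_likelihood(s_mfcc, *gmm_mfcc)
            + w_tdoa * gmm_log_likelihood(s_tdoa, *gmm_tdoa))
```

Each GMM is passed as a tuple (weights, means, variances), so a speaker model for one stream with a single one-dimensional component would be `([1.0], [[0.0]], [[1.0]])`.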


Information Bottleneck (IB) Principle

Distributional clustering based on maximizing mutual information with respect to a set of relevance variables.
Given a set of input variables X and relevance variables Y that contain important information about the problem, the IB principle seeks a cluster representation C that maximizes:

$$F = I(Y, C) - \frac{1}{\beta} I(C, X)$$

F is optimized with respect to the stochastic mapping P(C|X).
The optimization is performed sequentially or agglomeratively.
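To make the objective concrete, the following sketch evaluates F for a given mapping p(c|x) on a small discrete joint distribution p(x, y). The function names and the list-of-rows representation are assumptions made for illustration; mutual information is computed in nats.

```python
import math

def mutual_information(joint):
    """I(A;B) in nats from a joint distribution given as a list of rows."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(p * math.log(p / (pa[i] * pb[j]))
               for i, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0.0)

def ib_objective(p_xy, p_c_given_x, beta):
    """F = I(Y;C) - (1/beta) * I(C;X) for a stochastic mapping p(c|x)."""
    n_x, n_y, n_c = len(p_xy), len(p_xy[0]), len(p_c_given_x[0])
    p_x = [sum(row) for row in p_xy]
    # p(x, c) = p(x) p(c|x)
    p_xc = [[p_x[i] * p_c_given_x[i][c] for c in range(n_c)] for i in range(n_x)]
    # p(c, y) = sum_x p(c|x) p(x, y), since C depends on Y only through X
    p_cy = [[sum(p_c_given_x[i][c] * p_xy[i][y] for i in range(n_x))
             for y in range(n_y)] for c in range(n_c)]
    return mutual_information(p_cy) - mutual_information(p_xc) / beta
```

With a large beta the compression term I(C, X) is weighted lightly, so mappings that preserve information about Y score higher than mappings that collapse all inputs into one cluster.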


IB Speaker diarization

Speech segments serve as input variables X.
Components of a background GMM serve as relevance variables Y.
Agglomerative clustering is used to optimize the IB objective function.
The system is initialized with a uniform linear segmentation.
At each step, the two clusters whose merge results in the minimum loss of the IB function are merged (JS divergence in terms of p(y|x)).
The number of speakers is determined by a threshold on the normalized mutual information.

Feature streams are combined as a weighted sum of the distributions p(y|x^i):

$$p(y|x) = W_{mfcc} \, p(y|x^{mfcc}, M_{mfcc}) + W_{tdoa} \, p(y|x^{tdoa}, M_{tdoa})$$
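The stream combination and the JS-divergence merge criterion can be illustrated with a short sketch. It assumes that both streams yield posteriors over the same number of relevance variables, and it keeps only the relevance term of the agglomerative IB merge cost, (p(c_i) + p(c_j)) times the weighted JS divergence between p(y|c_i) and p(y|c_j); the 1/β compression term is left out as a simplification. All function names are hypothetical.

```python
import math

def combine_streams(p_y_mfcc, p_y_tdoa, w_mfcc, w_tdoa):
    """p(y|x) = W_mfcc * p(y|x_mfcc) + W_tdoa * p(y|x_tdoa)."""
    return [w_mfcc * a + w_tdoa * b for a, b in zip(p_y_mfcc, p_y_tdoa)]

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def merge_cost(p_ci, p_y_ci, p_cj, p_y_cj):
    """Relevance-term loss from merging clusters c_i and c_j.

    Simplification: the (1/beta) compression term is omitted here.
    """
    total = p_ci + p_cj
    pi_i, pi_j = p_ci / total, p_cj / total
    merged = [pi_i * a + pi_j * b for a, b in zip(p_y_ci, p_y_cj)]
    js = pi_i * kl_divergence(p_y_ci, merged) + pi_j * kl_divergence(p_y_cj, merged)
    return total * js
```

In an agglomerative loop, the pair of clusters with the smallest `merge_cost` would be merged at each step. Because the combined p(y|x) is itself a normalized distribution whenever the weights sum to one, the cost stays on the same scale regardless of the dimension of the underlying TDOA vector.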


HMM/GMM vs. IB diarization

                      HMM/GMM                       IB
clustering            agglomerative                 agglomerative
merge criterion       BIC                           JS divergence
number of speakers    BIC                           normalized MI
multiple features     log-likelihood combination    combination of relevance variable distributions


Experiments

Evaluate the performance of "all pair TDOA features" in the context of the two diarization systems.
Experiments are performed on a dataset of 24 meetings across 6 meeting rooms.
TDOA values corresponding to all delay pairs are computed.
Delay-and-sum beamforming using a reference channel is applied to compute MFCC features.
Combination weights are estimated on a reference data set consisting of 10 meetings.


TDOA Features

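TDOAs are estimated using GCC-PHAT:

$$G_{PHAT}(i, j) = \frac{X_i(f) \, X_j^*(f)}{|X_i(f)| \, |X_j(f)|}$$

$$d_{PHAT}(i, j) = \arg\max_d R_{PHAT}(d)$$

where R_{PHAT}(d) is the inverse Fourier transform of G_{PHAT}(i, j).

Reference channel TDOA: the dimension for an M-channel recording is M − 1.
All pair TDOA: the dimension for an M-channel recording is M(M − 1)/2.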
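The estimator and the two feature layouts can be sketched as follows. This is a minimal illustration, not the system used in the experiments: it uses a naive O(N^2) DFT from the standard library so that it stays self-contained, it works on whole short signals rather than windowed frames, and the function names and the `max_delay` parameter are assumptions.

```python
import cmath
from itertools import combinations

def dft(x, inverse=False):
    """Naive discrete Fourier transform (adequate for short toy signals)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def gcc_phat_delay(xi, xj, max_delay):
    """Delay d (in samples) maximizing R_PHAT(d); positive when xi lags xj."""
    Xi, Xj = dft(xi), dft(xj)
    cross = [a * b.conjugate() for a, b in zip(Xi, Xj)]
    # PHAT weighting: keep only the phase of the cross-spectrum
    g = [c / abs(c) if abs(c) > 1e-12 else 0.0 for c in cross]
    r = dft(g, inverse=True)
    n = len(r)
    return max(range(-max_delay, max_delay + 1), key=lambda d: r[d % n].real)

def reference_channel_tdoa(channels, max_delay, ref=0):
    """M - 1 delays, each measured against the reference channel."""
    return [gcc_phat_delay(ch, channels[ref], max_delay)
            for i, ch in enumerate(channels) if i != ref]

def all_pair_tdoa(channels, max_delay):
    """M(M - 1)/2 delays, one per unordered microphone pair."""
    return [gcc_phat_delay(channels[i], channels[j], max_delay)
            for i, j in combinations(range(len(channels)), 2)]
```

For M = 8 channels the reference channel vector has 7 entries and the all pair vector has 28.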


Weights Estimation

[Figure: speaker error on the development data as a function of the TDOA weight (logarithmic axis, 10^-4 to 10^0), for the HMM/GMM and IB systems.]

Estimated weights and TDOA vector dimension:

                     aIB           HMM/GMM           #TDOA vec.
Ref. Channel TDOA    (0.7, 0.3)    (0.9, 0.1)        M − 1
All Pairs TDOA       (0.8, 0.2)    (0.999, 0.001)    M(M − 1)/2

HMM/GMM weights are optimized on a logarithmic scale.
The IB weights do not change considerably.
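A search over TDOA weights on a logarithmic scale might look like the sketch below. It assumes that the weight pairs are ordered (W_mfcc, W_tdoa) and sum to one, which is consistent with the values in the table but not stated on the slide. The `dev_error` callable, which would run diarization on the development meetings and return the speaker error for a given TDOA weight, is hypothetical, and the toy error function exists only to make the example executable.

```python
import math

def log_weight_grid(min_exp=-4, max_exp=0):
    """Candidate TDOA weights 10^min_exp ... 10^max_exp, one per decade."""
    return [10.0 ** e for e in range(min_exp, max_exp + 1)]

def pick_weights(dev_error, candidates):
    """Return (W_mfcc, W_tdoa) minimizing the development-set error.

    Assumes the two weights sum to one.
    """
    best = min(candidates, key=dev_error)
    return (1.0 - best, best)

def toy_error(w_tdoa):
    """Placeholder with a minimum near 10^-3; not real experimental data."""
    return abs(math.log10(w_tdoa) + 3.0)

weights = pick_weights(toy_error, log_weight_grid())
```

A linear grid with step 0.1 would never reach a value as small as 0.001, which is why a logarithmic grid is the natural choice when the best TDOA weight can span several orders of magnitude.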


Speaker Error

                     aIB           HMM/GMM
Ref. Channel TDOA    12.3          14.3
All Pairs TDOA       8.2 (+33%)    10.8 (+32%)

Both systems benefit from combining all pair TDOAs with MFCC features.


Speaker Error

[Figure: two panels, HMM/GMM and Information Bottleneck, showing the speaker error per meeting ID (1 to 24, plus ALL) for all pairs and reference channel TDOA features.]

The IB system appears more robust to changes in the dimension of the features.
HMM/GMM performance degrades when the number of microphones is small.


Conclusion

Proposed to use "all pair TDOAs" directly in two speaker diarization systems.

In the case of the HMM/GMM system, the feature weights change considerably compared to using reference channel TDOAs:
  Log-likelihood combination.
  Benefits from the additional delays when the weights are optimized on a logarithmic scale.
  Performance degrades with a low number of microphones.

In the case of the IB system, the weighting is only marginally affected:
  Combination in a normalized relevance variable space.
  Improves consistently across meetings.

Whenever weighting issues are properly handled, "all pair TDOAs" reduce the speaker error by ≈ 30% compared to TDOA values with respect to the reference channel alone.


Thank You


ID    Meeting               #Mic
1     CMU 20050912-0900     2
2     CMU 20050914-0900     2
3     CMU 20061115-1030     3
4     CMU 20061115-1530     3
5     EDI 20050216-1051     16
6     EDI 20050218-0900     16
7     EDI 20061113-1500     16
8     EDI 20061114-1500     16
9     EDI 20071128-1000     16
10    EDI 20071128-1500     16
11    IDI 20090128-1600     16
12    IDI 20090129-1000     16
13    NIST 20051024-0930    8
14    NIST 20051102-1323    8
15    NIST 20051104-1515    7
16    NIST 20060216-1347    7
17    NIST 20080201-1405    7
18    NIST 20080227-1501    7
19    NIST 20080307-0955    7
20    TNO 20041103-1130     10
21    VT 20050408-1500      4
22    VT 20050425-1000      7
23    VT 20050623-1400      4
24    VT 20051027-1400      4