
KL Realignment for Speaker Diarization with Multiple Feature Streams
Deepu Vijayasenan, Fabio Valente and Hervé Bourlard
Presented by: John Dines

IDIAP Research Institute Martigny, CH

Interspeech 2009 – p. 1/16

Speaker Diarization

Speaker diarization determines "who spoke when" in an audio stream.
- Agglomerative clustering
  - Initialized with an overdetermined number of speaker models
  - At each step, the two most similar speaker models are merged according to a distance criterion
- Conventional systems use an ergodic HMM
  - Speaker model: an HMM state with a minimum duration
  - Gaussian Mixture Models for state emission probabilities

Multistream Diarization

- Separate models are built for the individual feature streams
- Linear combination of the individual log likelihoods
- Individual features might possess diverse statistics: different likelihood range, dimensionality, etc.

[Figure: negative log likelihood of TDOA and MFCC features]
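
The linear combination of log likelihoods can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `combine_log_likelihoods` and the toy log-likelihood values are assumptions made up for the example, while the weights (0.9, 0.1) are the likelihood-based weights quoted in the Evaluation slide.

```python
import numpy as np

def combine_log_likelihoods(loglik_streams, weights):
    """Weighted sum of per-stream log likelihoods.

    loglik_streams: list of arrays, each of shape (n_frames, n_speakers)
    weights: one weight per stream, summing to 1
    """
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("stream weights must sum to 1")
    stacked = np.stack(loglik_streams)              # (n_streams, n_frames, n_speakers)
    return np.tensordot(weights, stacked, axes=1)   # (n_frames, n_speakers)

# Hypothetical toy values: the MFCC stream spans a much wider range than TDOA,
# so it dominates the sum unless the weights compensate for the scale.
mfcc_ll = np.array([[-60.0, -75.0], [-80.0, -62.0]])
tdoa_ll = np.array([[-3.0, -2.0], [-2.5, -4.0]])
combined = combine_log_likelihoods([mfcc_ll, tdoa_ll], weights=[0.9, 0.1])
```

In this toy case the MFCC term contributes tens of units per frame while the TDOA term contributes well under one, which is the kind of range mismatch the slide refers to.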

Motivation

- The problem of features with diverse statistics is addressed in an ad hoc manner
  - e.g., in [Pardo’06] the number of Gaussian components is initialized to one for TDOA and five for MFCC features
- We previously proposed a non-parametric approach based on the Information Bottleneck (IB) principle
  - Clustering is based on posteriors of a background GMM
- How can posterior features be used for better multistream speaker diarization?


Overview

- IB Principle
- Speaker Diarization using IB
- Feature combination
- KL Realignment


IB Principle

- Let X be a set of elements to cluster into a set of clusters C, and let Y be a set of variables of interest
- Assume the conditional distribution p(y|x) is available ∀x ∈ X, ∀y ∈ Y
- The IB principle states that the cluster representation C should preserve as much information as possible about Y while minimizing the distortion between C and X
- The objective is to maximize I(Y, C) − βI(X, C)
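
The objective can be evaluated directly for a small example. The sketch below assumes a hard assignment of each element to exactly one cluster and measures mutual information in nats; the helper names, the value of beta and the toy distributions are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def mutual_information(joint):
    """I(A;B) in nats from a joint probability table p(a, b)."""
    joint = np.asarray(joint, dtype=float)
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0                      # skip zero cells (0 * log 0 = 0)
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pa @ pb)[mask])))

def ib_objective(p_y_given_x, p_x, assignment, n_clusters, beta):
    """I(Y;C) - beta * I(X;C) for a hard assignment of elements to clusters."""
    p_y_given_x = np.asarray(p_y_given_x, dtype=float)
    p_x = np.asarray(p_x, dtype=float)
    n_x = len(p_x)
    p_c_given_x = np.zeros((n_x, n_clusters))
    p_c_given_x[np.arange(n_x), assignment] = 1.0
    joint_xc = p_x[:, None] * p_c_given_x      # p(x, c)
    joint_xy = p_x[:, None] * p_y_given_x      # p(x, y)
    joint_cy = p_c_given_x.T @ joint_xy        # p(c, y)
    return mutual_information(joint_cy) - beta * mutual_information(joint_xc)

# Hypothetical toy data: four elements, two relevance variables.
p_y_given_x = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
p_x = [0.25, 0.25, 0.25, 0.25]
good = ib_objective(p_y_given_x, p_x, [0, 0, 1, 1], n_clusters=2, beta=0.1)
bad = ib_objective(p_y_given_x, p_x, [0, 1, 0, 1], n_clusters=2, beta=0.1)
```

Both assignments compress X equally (two balanced clusters), so the grouping that keeps elements with similar p(y|x) together scores higher because it preserves more information about Y.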


IB Optimization

- The IB criterion is optimized by an agglomerative approach
  - Initialized with |X| clusters
  - Iteratively merge the pair of clusters that results in the minimum loss in the objective function
- A stopping criterion based on Normalized Mutual Information determines the number of clusters (threshold on I(Y, C) / I(Y, X))
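
A brute-force sketch of this loop is given below. It re-evaluates the objective for every candidate pair, which is only practical for toy inputs; efficient agglomerative IB implementations use a closed-form merge cost instead. The function names, beta, the threshold value and the toy data are assumptions for illustration.

```python
import numpy as np

def mutual_information(joint):
    """I(A;B) in nats from a joint probability table p(a, b)."""
    joint = np.asarray(joint, dtype=float)
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pa @ pb)[mask])))

def agglomerative_ib(p_y_given_x, p_x, beta, nmi_threshold):
    """Greedy merging; stops before NMI = I(Y;C)/I(Y;X) drops below the threshold."""
    p_x = np.asarray(p_x, dtype=float)
    joint_xy = p_x[:, None] * np.asarray(p_y_given_x, dtype=float)
    i_xy = mutual_information(joint_xy)
    clusters = [[i] for i in range(len(p_x))]    # start with |X| singleton clusters

    def objective(cl):
        joint_cy = np.array([joint_xy[m].sum(axis=0) for m in cl])   # p(c, y)
        joint_xc = np.zeros((len(p_x), len(cl)))                     # p(x, c)
        for k, members in enumerate(cl):
            joint_xc[members, k] = p_x[members]
        i_yc = mutual_information(joint_cy)
        return i_yc - beta * mutual_information(joint_xc), i_yc

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                merged = [c for k, c in enumerate(clusters) if k not in (a, b)]
                merged.append(clusters[a] + clusters[b])
                value, i_yc = objective(merged)
                # highest objective after merging = minimum loss
                if best is None or value > best[0]:
                    best = (value, i_yc, merged)
        if best[1] / i_xy < nmi_threshold:
            break                                # the best merge loses too much
        clusters = best[2]
    return clusters

# Hypothetical toy data: elements 0,1 and 2,3 have similar p(y|x).
p_y_given_x = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
clusters = agglomerative_ib(p_y_given_x, [0.25] * 4, beta=0.1, nmi_threshold=0.9)
```

On this toy input the loop stops at two clusters, because merging them into one would drive I(Y, C), and therefore the normalized mutual information, to zero.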


IB based Speaker Diarization

- Components of a background GMM are used as relevance variables
- Realignment is used to smooth the speaker boundaries of the IB system output

[Figure: audio features X and relevance variables Y (background GMM)]

Feature Combination

- Individual posterior streams are combined to get a single posterior stream, i.e.,
  $p(y|x) = \sum_i p(y|x, M_{F_i}) P_{F_i}$, where
  - $M_{F_i}$: background GMM for feature $F_i$
  - $P_{F_i}$: prior probability of the stream $F_i$
- Uses only posterior features, so the problems in combining likelihoods are eliminated
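
The combination rule is a weighted average of the per-stream posteriors. The sketch below assumes that every stream's background GMM yields posteriors over the same set of relevance variables, so the arrays have matching shapes; the function name and the toy posteriors are hypothetical, and the priors (0.7, 0.3) are the weights reported in the Evaluation slide.

```python
import numpy as np

def combine_posteriors(posterior_streams, stream_priors):
    """p(y|x) = sum_i p(y|x, M_Fi) * P_Fi for each frame x.

    posterior_streams: list of arrays, each (n_frames, n_relevance_vars),
                       rows summing to 1
    stream_priors: one prior P_Fi per stream, summing to 1
    """
    priors = np.asarray(stream_priors, dtype=float)
    if not np.isclose(priors.sum(), 1.0):
        raise ValueError("stream priors must sum to 1")
    stacked = np.stack(posterior_streams)          # (n_streams, n_frames, |Y|)
    return np.tensordot(priors, stacked, axes=1)   # (n_frames, |Y|)

# Hypothetical per-frame posteriors from the MFCC and TDOA background GMMs.
mfcc_post = np.array([[0.7, 0.3], [0.2, 0.8]])
tdoa_post = np.array([[0.5, 0.5], [0.4, 0.6]])
combined_post = combine_posteriors([mfcc_post, tdoa_post], [0.7, 0.3])
```

Because each input row is a probability distribution and the priors sum to one, each combined row is again a distribution, whatever the likelihood range or dimensionality of the underlying features.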



- Conventionally, Viterbi realignment based on an HMM/GMM improves speaker boundaries
  - Incorporates a minimum duration constraint for each speaker
- Linear combination of log likelihoods for multiple features might not scale if the features have diverse statistics

[Figure: HMM with speaker states Spkr 1, Spkr 2, ..., Spkr N]

KL based Realignment

- Maximization of the mutual information in IB is equivalent to:

  $\arg\min_{c} \sum_t \mathrm{KL}(p(Y|x_t) \,\|\, p(Y|c_t))$

  - $p(Y|x_t)$: posterior distribution of the relevance variables in a frame
  - $p(Y|c_t)$: posterior distribution of the relevance variables of a cluster
- In order to incorporate the minimum duration constraint, we extend this objective function as:

  $\arg\min_{c} \sum_t \left[ \mathrm{KL}(p(Y|x_t) \,\|\, p(Y|c_t)) - \log(a_{c_t c_{t+1}}) \right]$

  - $a_{c_i c_j}$: transition probability from cluster $c_i$ to $c_j$
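
The extended objective can be minimized with a Viterbi-style dynamic program in which the KL divergence plays the role of the negative log emission score. The sketch below is an illustration under simplifying assumptions: it uses a plain transition matrix with a high self-loop probability to discourage rapid switching, whereas a strict minimum duration would need each cluster expanded into a chain of states. All names and toy values are hypothetical.

```python
import numpy as np

def kl_divergence_matrix(frame_post, cluster_post, eps=1e-12):
    """kl[t, c] = KL(p(Y|x_t) || p(Y|c))."""
    p = np.clip(frame_post, eps, None)[:, None, :]      # (T, 1, |Y|)
    q = np.clip(cluster_post, eps, None)[None, :, :]    # (1, C, |Y|)
    return np.sum(p * (np.log(p) - np.log(q)), axis=2)  # (T, C)

def kl_viterbi(frame_post, cluster_post, trans):
    """Cluster sequence minimizing sum_t [KL_t - log a(c_t, c_{t+1})]."""
    kl = kl_divergence_matrix(frame_post, cluster_post)
    neg_log_a = -np.log(trans)
    n_frames, n_clusters = kl.shape
    cost = np.empty((n_frames, n_clusters))
    back = np.zeros((n_frames, n_clusters), dtype=int)
    cost[0] = kl[0]
    for t in range(1, n_frames):
        # total[i, j]: best cost ending in cluster i at t-1, then moving to j
        total = cost[t - 1][:, None] + neg_log_a
        back[t] = np.argmin(total, axis=0)
        cost[t] = total[back[t], np.arange(n_clusters)] + kl[t]
    path = np.empty(n_frames, dtype=int)
    path[-1] = np.argmin(cost[-1])
    for t in range(n_frames - 1, 0, -1):                # backtrack
        path[t - 1] = back[t, path[t]]
    return path

# Hypothetical toy data: frame 2 looks slightly more like cluster 1.
cluster_post = np.array([[0.9, 0.1], [0.1, 0.9]])
frame_post = np.array([[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.85, 0.15],
                       [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]])
trans = np.array([[0.95, 0.05], [0.05, 0.95]])
framewise = np.argmin(kl_divergence_matrix(frame_post, cluster_post), axis=1)
path = kl_viterbi(frame_post, cluster_post, trans)
```

Frame-by-frame assignment flips to cluster 1 for the single ambiguous frame, while the realigned path keeps it in cluster 0 because two transitions cost more than the small KL gain.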

KL based Realignment II

- Can be solved using an EM algorithm, and thus realignment can be performed based on posterior features
- The re-estimation formula becomes
  $p(y|c_i) = \frac{1}{p(c_i)} \sum_{x_t : x_t \in c_i} p(y|x_t)\, p(x_t)$
- A linear combination of the posteriors can be used in the case of multistream diarization
  - No additional computational effort with additional features
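
The re-estimation step amounts to a prior-weighted average of the frame posteriors assigned to each cluster. The sketch below assumes a hard assignment of frames to clusters, takes p(c_i) as the sum of p(x_t) over the frames in c_i, and defaults to a uniform frame prior; these choices, the function name and the toy values are assumptions for illustration.

```python
import numpy as np

def reestimate_cluster_posteriors(frame_post, assignment, n_clusters, p_x=None):
    """p(y|c_i) = (1 / p(c_i)) * sum over x_t in c_i of p(y|x_t) p(x_t)."""
    frame_post = np.asarray(frame_post, dtype=float)
    assignment = np.asarray(assignment)
    n_frames = frame_post.shape[0]
    if p_x is None:
        p_x = np.full(n_frames, 1.0 / n_frames)    # assumption: uniform frame prior
    p_x = np.asarray(p_x, dtype=float)
    cluster_post = np.zeros((n_clusters, frame_post.shape[1]))
    for c in range(n_clusters):
        members = assignment == c
        p_c = p_x[members].sum()                   # p(c_i)
        if p_c > 0:                                # leave empty clusters at zero
            weighted = p_x[members][:, None] * frame_post[members]
            cluster_post[c] = weighted.sum(axis=0) / p_c
    return cluster_post

# Hypothetical toy data: frames 0 and 1 belong to cluster 0, frame 2 to cluster 1.
frame_post = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]]
cluster_post = reestimate_cluster_posteriors(frame_post, [0, 0, 1], n_clusters=2)
```

With multiple feature streams, the same function can be applied to the combined posterior stream, so the cost of this step does not depend on the number of streams.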


Evaluation

- Evaluation performed on the NIST RT06 evaluation data for the meeting diarization task
- Since the same speech/non-speech reference segmentation is used, speaker error is used as the evaluation measure
- Explored the combination of MFCC and TDOA features
- Individual feature weights are empirically determined from a development dataset
  - Estimated weights are $(P_{MFC}, P_{DEL}) = (0.7, 0.3)$, as compared to (0.9, 0.1) in the likelihood-based combination

Results on RT06 Eval data


Speaker error (%):

Features     w/o realign   HMM/GMM   KL based
MFCC         19.3          15.7      15.7
TDOA         24.4          25.5      23.9
MFCC+TDOA    11.6          10.7      9.9

- Both realignment systems perform equally well with MFCC
- KL realignment performs better with TDOA features (variable feature dimension across meetings) and with the feature combination

Conclusions

- Proposed a KL divergence based realignment scheme that operates only on a set of posterior features
- The system provides the same performance as conventional HMM/GMM realignment when tested on a single feature stream (MFCC)
- The KL based realignment system performs better (9.9%) than conventional HMM/GMM realignment (10.7%) when tested on multiple feature streams (MFCC+TDOA)
- The system is currently being extended to more feature streams, and initial results show an improvement in performance

