Information Bottleneck Features for HMM/GMM Speaker Diarization of Meetings Recordings Sree Harsha Yella, Fabio Valente

August 31, Interspeech 2011, Florence, Italy

Sree Harsha Yella, Fabio Valente (Idiap Research Institute): Feature level combination of diarization systems

Speaker Diarization

Speaker diarization addresses the task of "who spoke when":
- Estimation of the number of speakers.
- Identification of the speech segments corresponding to each speaker.

Common approaches: HMM/GMM modeling
- Top-down splitting
- Bottom-up clustering

Non-parametric method: Information Bottleneck (IB) framework.

Complementary nature of systems: as in ASR, diarization systems can be combined.

Combining diarization systems

- Piped approaches, which initialize one system with the output of another (Moraru et al., 2002, 2003). They do not influence every step in diarization.
- Voting between the outputs of multiple systems (Tranter, 2005). This performs a late combination of outputs.
- Integrated approaches (Moraru et al., 2003; Bozonnet et al., 2010). They require changing some parameters or modules of the individual diarization systems.

Current work: overcomes these problems by performing feature level combination.

TANDEM features used in ASR

[Figure: spectral features s_t → phoneme posteriors p(Y|s_t) → log + PCA → TANDEM features → HMM/GMM.]

The diarization task is unsupervised.

[Figure: the IB diarization output (segments s_1 ... s_N assigned to clusters c1, c2, c3) supplies posteriors p(Y|s_t) over the relevance variables y_1 ... y_L, which pass through log + PCA into the HMM/GMM system.]

Outline of the talk

1. State-of-the-art HMM/GMM diarization
2. Speaker diarization based on IB
3. Information Bottleneck features
4. Experimental setup and results
5. Conclusions

Speaker diarization in an HMM/GMM system

- Short-time spectral features (MFCC) as input
- Speech/non-speech detection
- Uniform segmentation / speaker change detection
- Agglomerative clustering using HMM/GMM speaker models with minimum duration
- The nearest clusters according to a distance measure are merged
- Viterbi realignment to smooth cluster boundaries
- Iterates until a stopping criterion is satisfied

IB Objective function

Consider a set of input variables X and associated relevance variables Y. The clustering representation C:
- maximizes the mutual information with respect to Y, i.e., maximizes I(Y, C);
- is compact, i.e., minimizes I(C, X).

Maximize

$$ F = I(Y, C) - \beta\, I(C, X) $$

The solution is obtained through agglomerative clustering.

Agglomerative IB (AIB)

- Estimate P(Y|X).
- Initialize with every element of X as a singleton cluster.
- Merge the two clusters (c_i, c_j) that result in the minimum loss of the IB function.
- The loss can be obtained in closed form (JS divergence).
- The relevance variable distributions p(Y|c_i) and p(Y|c_j) are averaged, weighted by the cluster priors p(c_i) and p(c_j), to give p(Y|c_new).
- Merging continues until the model selection criterion is met.
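The merging loop described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`kl`, `merge_cost`, `aib`) are hypothetical, the β-dependent term of the IB functional is ignored so that the merge cost is only the prior-weighted JS divergence, and a fixed target number of clusters stands in for the model selection criterion.

```python
import math

def kl(p, q):
    """KL[p || q] in nats; terms with p_k = 0 contribute nothing."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

def merge_cost(pi, pj, py_i, py_j):
    """Return (p(ci) + p(cj)) * JS[p(Y|ci), p(Y|cj)] and the merged p(Y|c_new)."""
    total = pi + pj
    wi, wj = pi / total, pj / total
    merged = [wi * a + wj * b for a, b in zip(py_i, py_j)]  # prior-weighted average
    js = wi * kl(py_i, merged) + wj * kl(py_j, merged)
    return total * js, merged

def aib(p_x, p_y_given_x, n_clusters):
    """Greedy agglomerative merging down to n_clusters (assumed stopping rule)."""
    clusters = [[i] for i in range(len(p_x))]  # every x starts as a singleton
    priors = list(p_x)
    dists = [list(d) for d in p_y_given_x]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost, merged = merge_cost(priors[i], priors[j], dists[i], dists[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j, merged)
        _, i, j, merged = best
        clusters[i] += clusters[j]
        priors[i] += priors[j]
        dists[i] = merged
        del clusters[j], priors[j], dists[j]  # j > i, so index i stays valid
    return clusters, priors, dists

# Toy input: four segments, two relevance variables.
p_x = [0.25, 0.25, 0.25, 0.25]
p_y_given_x = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
clusters, priors, dists = aib(p_x, p_y_given_x, n_clusters=2)
print(clusters, priors, dists)
```

On this toy input the two segments that favour the first relevance variable end up in one cluster and the other two segments in the second. The exhaustive pair search is quadratic per merge, which is acceptable for a sketch but would normally be replaced by a cached distance matrix.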

Comparison with HMM/GMM Clustering

| | HMM/GMM | IB |
|---|---|---|
| Modeling | a separate GMM for each speaker c | relevance variables Y from a background GMM |
| Distance | Modified BIC | JS divergence |
| Output | mapping X → C | mapping X → C and p(Y|C) |

Information Bottleneck features

[Figure: IB diarization output, showing segments x_1 ... x_6 (with frames s_t^1 ... s_t^6) assigned to clusters c1, c2, c3, and the relevance variables y_1 ... y_L.]

The frames corresponding to segment x_j are represented as s_t^j.

$$ F = [\,p(Y \mid s_1^1), \ldots, p(Y \mid s_t^j), \ldots, p(Y \mid s_T^N)\,], \quad t = 1, \ldots, T $$

TANDEM processing can be applied to F:
- The probabilities p(Y|s_t^j) are gaussianized by applying a logarithm.
- PCA is applied to de-correlate the features and reduce their dimensionality.

The resulting matrix F_IB is referred to as Information Bottleneck features.
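As an illustration of how F might be assembled, the sketch below expands a segment-level IB output into one row per frame and applies the logarithm. It rests on assumptions that the slides do not spell out: every frame of a segment is taken to inherit the distribution p(Y|c) of the cluster its segment was assigned to, a small floor (`floor`, a hypothetical parameter) guards against log(0), and the PCA step is left out. The function name `ib_feature_matrix` is made up for this example.

```python
import math

def ib_feature_matrix(segment_cluster, frames_per_segment, p_y_given_c, floor=1e-10):
    """Return one row of log p(Y|c) per frame, c being the cluster of the frame's segment."""
    rows = []
    for seg, cluster in enumerate(segment_cluster):
        # Log step of the TANDEM processing; the floor is an assumption, not from the slides.
        log_row = [math.log(max(p, floor)) for p in p_y_given_c[cluster]]
        rows.extend(list(log_row) for _ in range(frames_per_segment[seg]))
    return rows

# Six segments assigned to three clusters, with differing numbers of frames.
segment_cluster = [0, 1, 1, 0, 2, 2]
frames_per_segment = [2, 3, 1, 2, 2, 2]
p_y_given_c = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]

F_log = ib_feature_matrix(segment_cluster, frames_per_segment, p_y_given_c)
print(len(F_log), "frames x", len(F_log[0]), "relevance variables")
```

A PCA projection of the rows of `F_log` would then give the reduced features; it is omitted here to keep the sketch free of external dependencies.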

Integration of MFCC and IB features

[Figure: meeting recording → feature extraction (MFCC) → IB diarization → p(Y|C) → transformation (log + PCA) → HMM/GMM diarization (which also takes the MFCC features) → diarization output.]

The integration can happen in two ways:
- Concatenating IB features to the MFCC feature vectors (IB aug).
- Multistream modelling (IB multistr), where clustering is based on the combined log-likelihood

$$ w_{mfcc} \log b_c^{mfcc} + w_{F_{IB}} \log b_c^{F_{IB}} $$

where $b_c^{mfcc}$ and $b_c^{F_{IB}}$ are GMMs trained on MFCC and F_IB features, and $(w_{mfcc}, w_{F_{IB}})$ are the combination weights.
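The multistream score can be sketched as follows. This is only an illustration under stated assumptions: the GMMs are taken to have diagonal covariances and are passed as lists of `(weight, mean, variance)` triples, and all function names are hypothetical. The default weights (0.9, 0.1) are the values reported in the experimental setup.

```python
import math

def diag_gauss_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_loglik(x, gmm):
    """log b_c(x) for a GMM given as (weight, mean, var) triples."""
    logs = [math.log(w) + diag_gauss_logpdf(x, m, v) for w, m, v in gmm]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))  # log-sum-exp

def multistream_loglik(x_mfcc, x_fib, gmm_mfcc, gmm_fib, w_mfcc=0.9, w_fib=0.1):
    """w_mfcc * log b_c^mfcc + w_FIB * log b_c^FIB for one frame and one cluster."""
    return w_mfcc * gmm_loglik(x_mfcc, gmm_mfcc) + w_fib * gmm_loglik(x_fib, gmm_fib)

# Toy cluster model: one standard-normal component per stream.
unit_gmm = [(1.0, [0.0], [1.0])]
score = multistream_loglik([0.0], [1.0], unit_gmm, unit_gmm)
print(score)
```

Because the weights sum to one, the combined score stays on the same scale as a single-stream log-likelihood, so it can be used wherever the single-stream score was used in clustering.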

Experiments and Results

- Test dataset: 24 meetings from the NIST RT06/RT07/RT09 evaluation datasets.
- 19 MFCC features are extracted from beamformed audio.
- Speech/non-speech detection is based on the AMIDA system.

Speech/non-speech error:

| Meeting | Miss | FA | SpNsp |
|---|---|---|---|
| ALL | 7.3 | 0.4 | 7.7 |

Tuning using a separate development set:
- Optimal number of PCA components: 2 (covering more than 80% of the PCA variance).
- (w_mfcc, w_FIB) = (0.9, 0.1).

Experiments and Results

Diarization Error Rate (DER): sum of speech/non-speech error and speaker error.

| System | Speaker error | Relative change |
|---|---|---|
| Baseline | 12.0 | - |
| IB aug | 13.5 | -12.5% |
| IB multistr | 9.7 | +19% |

[Figure: per-meeting speaker error (y-axis, 0 to 30) for Baseline, IB_aug and IB_multistr over the test meetings from the CMU, EDI, IDI, NIST, TNO and VT sites.]
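The relative changes quoted next to the speaker errors follow from the baseline value; the short check below recomputes them. The helper name `relative_improvement` is made up for this example, and positive values mean a reduction of the error.

```python
def relative_improvement(baseline, system):
    """Relative reduction of the error with respect to the baseline, in percent."""
    return 100.0 * (baseline - system) / baseline

speaker_error = {"Baseline": 12.0, "IB aug": 13.5, "IB multistr": 9.7}
for name, err in speaker_error.items():
    change = relative_improvement(speaker_error["Baseline"], err)
    print(f"{name}: {err:.1f} ({change:+.1f}%)")
```

Augmentation comes out at -12.5% and multistream combination at roughly +19%, matching the reported figures.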

Conclusions

- The paper proposes an effective method for combining diarization systems through features.
- The proposed combination method does not require any modification of the original systems.
- Two strategies for combining IB features with MFCC features were investigated.
- Evaluation results showed that multistream combination decreases the speaker error, whereas simple augmentation increases it.

Thank you. Questions?

Agglomerative IB (AIB)

Input: distribution p(y|x); trade-off parameter β
Output: C_m, an m-partition of X, m ≤ |X|
Initialization: C ≡ X
Main loop: while |C| > 1
    {i, j} = arg min_{i', j'} ΔF(c_{i'}, c_{j'})
    merge {c_i, c_j} ⇒ c_r in C

The loss in merging two clusters c_i, c_j:

$$ (p(c_i) + p(c_j))\, JS[\,p(Y \mid c_i),\, p(Y \mid c_j)\,] $$

Model selection is based on an information theoretic criterion:
- Minimum Description Length (MDL)
- Normalized Mutual Information (NMI)

IB principle applied to diarization

Input X → fixed-length segments of speech.

Relevance variables Y → components of a background GMM

$$ f(s) = \sum_{j=1}^{L} w_j\, \mathcal{N}(s, \mu_j, \Sigma_j) $$

The relevance variable distribution p(y|x) is estimated from:

$$ p(y_i \mid s_k) = \frac{w_i\, \mathcal{N}(s_k, \mu_i, \Sigma_i)}{\sum_{j=1}^{L} w_j\, \mathcal{N}(s_k, \mu_j, \Sigma_j)}, \quad i = 1, \ldots, L $$

Output of IB diarization:
- Hard partition of X into C clusters ($p(c_i \mid x_j) \in \{0, 1\}$)
- $p(Y \mid c_i)$, $i = 1, \ldots, |C|$
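The posterior of each background GMM component given a frame can be computed directly from the formula above. The sketch assumes diagonal covariance matrices (the slides do not state the covariance type) and uses hypothetical function names; how frame-level posteriors are pooled into a segment-level p(y|x) is not covered here.

```python
import math

def diag_gauss_pdf(s, mean, var):
    """Density of a diagonal-covariance Gaussian at frame s."""
    log_p = sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
                for x, m, v in zip(s, mean, var))
    return math.exp(log_p)

def relevance_posteriors(s, weights, means, variances):
    """p(y_i | s) = w_i N(s; mu_i, Sigma_i) / sum_j w_j N(s; mu_j, Sigma_j)."""
    joint = [w * diag_gauss_pdf(s, m, v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]

# Toy background GMM with L = 2 one-dimensional components.
weights = [0.5, 0.5]
means = [[-1.0], [1.0]]
variances = [[1.0], [1.0]]
post = relevance_posteriors([1.0], weights, means, variances)
print(post)
```

For real feature dimensions the computation would normally be carried out in the log domain to avoid underflow; the direct form is kept here because it mirrors the formula term by term.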

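Evaluation

Diarization Error Rate (DER) is used as the metric for diarization:

$$ \mathrm{DER} = \frac{\sum_{\text{all } seg} \mathrm{dur}(seg)\,\big[\max(N_{ref}(seg), N_{sys}(seg)) - N_{correct}(seg)\big]}{\sum_{\text{all } seg} \mathrm{dur}(seg)\, N_{ref}(seg)} $$

It covers speech/non-speech error and speaker error.

[Figure: example timeline over intervals T1 ... T10 with reference speakers L1, L2, L3 (including an L1/L3 overlap and no-speech regions) and system speakers S1, S2, S3.]

Mapping: S1 → L1, S3 → L2, S2 → L3

$$ \mathrm{DER} = \frac{T_2 + T_4 + T_6 + T_8 + T_9}{T_1 + T_2 + T_3 + T_4 + T_7 + 2 T_8 + 2 T_9 + T_{10}} $$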
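The DER formula can be evaluated region by region, as in the sketch below. The per-region counts are illustrative assumptions: the original timeline figure is not recoverable, so the values were chosen only so that errors fall on T2, T4, T6, T8 and T9 and the reference speaker counts match the denominator of the worked example. All durations are set to 1 for simplicity.

```python
def der(regions):
    """regions: iterable of (duration, n_ref, n_sys, n_correct) tuples."""
    error = sum(d * (max(n_ref, n_sys) - n_cor)
                for d, n_ref, n_sys, n_cor in regions)
    scored = sum(d * n_ref for d, n_ref, _, _ in regions)
    return error / scored

# (duration, N_ref, N_sys, N_correct) for T1 ... T10; counts are assumed.
regions = [
    (1.0, 1, 1, 1),  # T1: correct
    (1.0, 1, 1, 0),  # T2: speaker error (assumed type)
    (1.0, 1, 1, 1),  # T3: correct
    (1.0, 1, 0, 0),  # T4: missed speech (assumed type)
    (1.0, 0, 0, 0),  # T5: no speech in reference or system
    (1.0, 0, 1, 0),  # T6: false alarm
    (1.0, 1, 1, 1),  # T7: correct
    (1.0, 2, 1, 1),  # T8: overlap, one of two speakers found
    (1.0, 2, 1, 1),  # T9: overlap, one of two speakers found
    (1.0, 1, 1, 1),  # T10: correct
]
print(der(regions))
```

With unit durations the numerator is 5 and the denominator is 10, so this toy configuration gives a DER of 0.5.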

GCC-PHAT

Compute the TDOA between two channels s_i[n] and s_j[n]. The Generalized Cross-Correlation with PHAse Transform is defined as:

$$ G_{PHAT}(f) = \frac{S_i(f)\, S_j^{*}(f)}{|S_i(f)|\,|S_j(f)|} $$

The TDOA of s_i with respect to s_j is estimated as

$$ d_{PHAT}(i, j) = \arg\max_{d} R_{PHAT}(d) $$

where R_PHAT(d) is the inverse Fourier transform of G_PHAT(f).
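A small, dependency-free sketch of the GCC-PHAT delay estimate follows. It uses a naive O(N^2) DFT in place of an FFT purely to stay self-contained, works on circular correlation of equal-length signals, and adds a small constant `eps` (an assumption) to avoid division by zero. Function names are hypothetical.

```python
import cmath

def dft(x, inverse=False):
    """Naive discrete Fourier transform (stand-in for an FFT routine)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
           for m in range(n)]
    return [v / n for v in out] if inverse else out

def gcc_phat_tdoa(si, sj, eps=1e-12):
    """Lag d (in samples) maximising R_PHAT(d); positive means si is delayed w.r.t. sj."""
    n = len(si)
    Si, Sj = dft(si), dft(sj)
    cross = [a * b.conjugate() for a, b in zip(Si, Sj)]
    g = [c / max(abs(c), eps) for c in cross]      # PHAT weighting: keep phase only
    r = [v.real for v in dft(g, inverse=True)]     # R_PHAT(d)
    best = max(range(n), key=lambda d: r[d])
    return best if best <= n // 2 else best - n    # circular lag -> signed lag

# Channel j carries an impulse at sample 2, channel i the same impulse at sample 5.
sj = [0.0] * 16
sj[2] = 1.0
si = [0.0] * 16
si[5] = 1.0
tdoa = gcc_phat_tdoa(si, sj)
print(tdoa)
```

For this impulse pair the estimate should recover the 3-sample delay, and swapping the two channels flips its sign.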

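IB objective function

Minimize $I(C, X) - \beta\, I(Y, C)$, where

$$ I(C, X) = \sum_{x \in X,\, c \in C} p(x)\, p(c \mid x) \log \frac{p(c \mid x)}{p(c)} $$

$$ I(Y, C) = \sum_{y \in Y,\, c \in C} p(c)\, p(y \mid c) \log \frac{p(y \mid c)}{p(y)} $$

The solution satisfies:

$$ p(c \mid x) = \frac{p(c)}{Z(\beta, x)} \exp\big(-\beta \cdot KL[\,p(y \mid x)\,\|\,p(y \mid c)\,]\big) $$

$$ p(y \mid c) = \frac{1}{p(c)} \sum_{x} p(y \mid x)\, p(c \mid x)\, p(x) $$

$$ p(c) = \sum_{x} p(c \mid x)\, p(x) $$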

Normalized Mutual Information

$$ \mathrm{NMI} = \frac{I(Y, C)}{I(Y, X)} $$

- Represents the mutual information I(Y, C) preserved by the clustering representation as a fraction of its initial value I(Y, X).
- Monotonic function of the number of clusters.

[Figure: normalized mutual information (y-axis, 0 to 1) versus number of clusters (x-axis, 0 to 600).]
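The NMI of a hard partition can be computed from p(x) and p(y|x) alone, as the following sketch shows. Function names are hypothetical, and the toy distributions are made up; the example only illustrates that NMI equals 1 for singleton clusters, drops as clusters are merged, and reaches 0 for a single cluster.

```python
import math

def mutual_information(p_c, p_y_given_c):
    """I(Y, C) = sum_{y,c} p(c) p(y|c) log(p(y|c) / p(y)), in nats."""
    n_y = len(p_y_given_c[0])
    p_y = [sum(pc * row[k] for pc, row in zip(p_c, p_y_given_c)) for k in range(n_y)]
    return sum(pc * row[k] * math.log(row[k] / p_y[k])
               for pc, row in zip(p_c, p_y_given_c)
               for k in range(n_y) if row[k] > 0)

def cluster_distributions(p_x, p_y_given_x, clusters):
    """p(c) and p(y|c) for a hard partition given as lists of indices into X."""
    n_y = len(p_y_given_x[0])
    p_c = [sum(p_x[i] for i in members) for members in clusters]
    p_y_given_c = [[sum(p_x[i] * p_y_given_x[i][k] for i in members) / pc
                    for k in range(n_y)]
                   for members, pc in zip(clusters, p_c)]
    return p_c, p_y_given_c

def nmi(p_x, p_y_given_x, clusters):
    """I(Y, C) / I(Y, X) for the given partition."""
    i_yx = mutual_information(p_x, p_y_given_x)
    i_yc = mutual_information(*cluster_distributions(p_x, p_y_given_x, clusters))
    return i_yc / i_yx

p_x = [0.25, 0.25, 0.25, 0.25]
p_y_given_x = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
for partition in ([[0], [1], [2], [3]], [[0, 1], [2, 3]], [[0, 1, 2, 3]]):
    print(len(partition), "clusters:", nmi(p_x, p_y_given_x, partition))
```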

Minimum Description Length

Minimize the coding length of the representation:

$$ F_{MDL} = L(m) + L(X \mid m) = N \log \frac{N}{M} + N\,[\,H(Y \mid C) + H(C)\,] $$

KL Realignment

- The initial segmentation is obtained from AIB clustering.
- P(Y|C) is estimated from the segmentation:

$$ P(y_j \mid c_i) = \frac{1}{p(c_i)} \sum_{x_t :\, x_t \in c_i} p(y_j \mid x_t)\, p(x_t) $$

- The best segmentation is obtained from Viterbi segmentation:

$$ c^{opt} = \arg\min_{c} \sum_{t} \Big( KL[\,p(Y \mid x_t)\,\|\,p(Y \mid c_t)\,] - \log a_{c_t c_{t+1}} \Big) $$
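The realignment objective is a standard Viterbi search with the KL divergence as the local cost and the negative log transition probability as the switching cost. The sketch below is a minimal version under several assumptions: a uniform start, a fully connected transition matrix `trans` with strictly positive entries, no minimum-duration constraint, and a small floor inside the KL to avoid log(0). The names `kl` and `kl_realign` are hypothetical.

```python
import math

def kl(p, q, floor=1e-12):
    """KL[p || q] in nats, with q floored to stay finite."""
    return sum(pk * math.log(pk / max(qk, floor)) for pk, qk in zip(p, q) if pk > 0)

def kl_realign(p_y_given_x, p_y_given_c, trans):
    """Cluster sequence minimising sum_t (KL[p(Y|x_t) || p(Y|c_t)] - log a)."""
    n_c = len(p_y_given_c)
    cost = [kl(p_y_given_x[0], q) for q in p_y_given_c]
    back = []
    for obs in p_y_given_x[1:]:
        step, ptr = [], []
        for c in range(n_c):
            # Best predecessor of cluster c at this time step.
            prev = min(range(n_c), key=lambda k: cost[k] - math.log(trans[k][c]))
            ptr.append(prev)
            step.append(cost[prev] - math.log(trans[prev][c]) + kl(obs, p_y_given_c[c]))
        cost, back = step, back + [ptr]
    path = [min(range(n_c), key=lambda c: cost[c])]
    for ptr in reversed(back):          # backtrack from the last frame
        path.append(ptr[path[-1]])
    return path[::-1]

p_y_given_c = [[0.9, 0.1], [0.1, 0.9]]
trans = [[0.9, 0.1], [0.1, 0.9]]
observations = [[0.9, 0.1], [0.8, 0.2], [0.45, 0.55],
                [0.85, 0.15], [0.1, 0.9], [0.2, 0.8]]
path = kl_realign(observations, p_y_given_c, trans)
print(path)
```

In this toy sequence the third observation is slightly closer to the second cluster, but the transition cost keeps it in the first one, which is the smoothing effect the realignment is meant to provide.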

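Objective function: Realignment

Consider $I(X, Y) - I(C, Y)$:

$$
\begin{aligned}
I(X, Y) - I(C, Y)
&= \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} - \sum_{y, c} p(y, c) \log \frac{p(y, c)}{p(y)\, p(c)} \\
&= \sum_{x, y, c} p(x, y, c) \log \frac{p(x, y)\, p(c)}{p(y, c)\, p(x)} \\
&= \sum_{x, y, c} p(y \mid x)\, p(c \mid x)\, p(x) \log \frac{p(y \mid x)}{p(y \mid c)} \\
&= \sum_{x} p(x) \sum_{c} p(c \mid x) \sum_{y} p(y \mid x) \log \frac{p(y \mid x)}{p(y \mid c)} \\
&= \sum_{x} p(x) \sum_{c} p(c \mid x)\, KL\big(p(Y \mid x)\,\|\,p(Y \mid c)\big) \\
&= \sum_{x, c} p(x, c)\, KL\big(p(Y \mid x)\,\|\,p(Y \mid c)\big)
\end{aligned}
$$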

Sequential Clustering

AIB is a greedy algorithm. Sequential Information Bottleneck (SIB) refines the objective function starting from a given partition:

1. Draw a sample at random from the current partition and represent it as a separate (singleton) cluster.
2. Merge this singleton cluster with the cluster that results in the minimum loss of mutual information.
3. Repeat steps 1 and 2 for all samples until convergence.

We use SIB to refine the output produced with AIB.

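SIB Results

| Features | AIB | AIB + SIB |
|---|---|---|
| MFCC | 17.1 | 16.6 |
| MFCC+TDOA | 9.9 | 8.6 |
| 4 feature | 6.7 | 6.0 |

Viterbi Realignment

| Features | Before realign. | After realign. |
|---|---|---|
| MFCC | 24.7 | 19.1 |
| MFCC+TDOA | 11.6 | 9.9 |
| 4 feat | 8.3 | 6.7 |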

JS Divergence

$$ I(C, Y) = \sum_{c} \sum_{y} p(c)\, p(y \mid c) \log \frac{p(y \mid c)}{p(y)} $$

$$ \Delta I_{CY} = I(C^{b}, Y) - I(C^{a}, Y) $$

Let c_i and c_j be merged together to obtain $\bar{c}$:

$$ \Delta I_{CY} = p(c_i) \sum_{y} p(y \mid c_i) \log \frac{p(y \mid c_i)}{p(y)} + p(c_j) \sum_{y} p(y \mid c_j) \log \frac{p(y \mid c_j)}{p(y)} - p(\bar{c}) \sum_{y} p(y \mid \bar{c}) \log \frac{p(y \mid \bar{c})}{p(y)} $$

with

$$ p(\bar{c}) = p(c_i) + p(c_j) $$

$$ p(y \mid \bar{c})\, p(\bar{c}) = p(y \mid c_i)\, p(c_i) + p(y \mid c_j)\, p(c_j) $$

so that

$$ \Delta I_{CY} = p(\bar{c})\, JS_{\Pi}[\,p(y \mid c_i)\,\|\,p(y \mid c_j)\,] $$
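The closed form can be checked numerically. The sketch below evaluates the difference of the three mutual-information terms directly and compares it with the prior-weighted JS divergence, taking Π = {p(c_i)/p(c̄), p(c_j)/p(c̄)} as the JS weights (the usual definition). The numbers are arbitrary toy values, and the marginal p(y) is deliberately arbitrary as well, because it cancels in the difference.

```python
import math

def kl(p, q):
    """KL[p || q] in nats; terms with p_k = 0 contribute nothing."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

def cluster_term(p_c, p_y_given_c, p_y):
    """p(c) * sum_y p(y|c) log(p(y|c) / p(y)): one cluster's share of I(C, Y)."""
    return p_c * kl(p_y_given_c, p_y)

p_ci, p_cj = 0.2, 0.3
py_ci, py_cj = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]
p_y = [0.3, 0.3, 0.4]   # arbitrary marginal; it cancels out of the difference

# Merged cluster: priors add, distributions are averaged with prior weights.
p_bar = p_ci + p_cj
py_bar = [(p_ci * a + p_cj * b) / p_bar for a, b in zip(py_ci, py_cj)]

delta_i = (cluster_term(p_ci, py_ci, p_y) + cluster_term(p_cj, py_cj, p_y)
           - cluster_term(p_bar, py_bar, p_y))

pi_i, pi_j = p_ci / p_bar, p_cj / p_bar
js = pi_i * kl(py_ci, py_bar) + pi_j * kl(py_cj, py_bar)

print(delta_i, p_bar * js)
```

The two printed values agree up to floating-point rounding, which is why the merge cost in AIB never needs the full mutual information to be recomputed.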