Some Studies on Acoustic Feature Extraction, Feature Selection and Multi-level Fusion Strategies for Robust Text-Independent Speaker Identification

Sandipan Chakroborty


Thesis submitted in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

by

Sandipan Chakroborty

Under the supervision of

Prof. Goutam Saha

Department of Electronics and Electrical Communication Engineering
Indian Institute of Technology, Kharagpur
West Bengal, INDIA 721 302

January 2008

Dedicated to

My Parents, Late Grand Parents, Uncle, and my Friend & Philosopher Mr. Kaushik Ray

Department of Electronics and Electrical Communication Engineering Indian Institute of Technology, Kharagpur Kharagpur, West Bengal, India 721 302.

Certificate

This is to certify that the thesis entitled Some Studies on Acoustic Feature Extraction, Feature Selection and Multi-level Fusion Strategies for Robust Text-Independent Speaker Identification submitted by Sandipan Chakroborty, a Research Scholar, in the Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology, Kharagpur, India, for the award of the degree of Doctor of Philosophy, is a record of an original research work carried out by him under my supervision and guidance during July 2003-January 2008. The thesis fulfills all requirements as per the regulations of this Institute and in my opinion has reached the standard needed for submission. Neither this thesis nor any part of it has been submitted for any degree or academic award elsewhere.

Dr. Goutam Saha
Assistant Professor
Department of Electronics and Electrical Communication Engineering,
Indian Institute of Technology, Kharagpur,
West Bengal, INDIA 721 302.

Place: I.I.T. Kharagpur
Date: January, 2008.

Acknowledgment

I would like to express my profound and heartfelt gratitude to my esteemed guide and mentor, Prof. Goutam Saha, for his ever-present blessings, valuable advice, support, best wishes and encouragement throughout the course of this work. Without his presence, this work would not have seen the light of day. I am also grateful to the members of my Doctoral Scrutiny Committee (Prof. T. K. Basu, Prof. A. K. Ray, and Prof. P. K. Biswas) for reviewing my work and for their valuable comments.

I would like to express my gratitude to Prof. A. S. Dhar, whose comprehensive support I enjoyed from the very first day of my research. I gratefully acknowledge his warm encouragement and patient guidance. I am indebted to the other faculty members of the Department for their support: Prof. D. Datta (Head of the Department), Prof. S. Mahapatra, Prof. S. Mukhopadhyay, Prof. I. Chakraborty, Prof. S. Sengupta, Prof. R. Roy, Prof. S. Banerjee, Late Prof. S. Kal, Prof. R. Garg, and Prof. B. N. Chatterjee. I would also like to thank Prof. Sadaoki Furui, Dr. D. A. Reynolds and Dr. Tomi Kinnunen for the correspondence I had with them and for their illustrative email replies to some of my questions.

In addition, I would like to thank the laboratory staff, Mr. Sudhir Ghorai and Mr. Saroj Kr. Hatai, who contributed to the comfortable and collaborative atmosphere at the Control and Applied Electronics Laboratory. I would like to thank all the office staff of our department for their support.

I would like to say a big 'thank you' to all my friends and fellow students for the tremendous support I received from them during my stay at I.I.T. Kharagpur and during the preparation of my thesis. I shall be failing in my duty if I do not name Mr. Aneek Adhya, Mr. Soumitra Debnath, Mr. Suman Senapati, Mr. Samit Ari, Mr. Ravi Shankar, Jinesh P. Nair, Mr. Sushrut Vaidya, Mr. Prasanta Kr. Pattadar, Mr. Anindya Roy, Mr. U. S. Yadhunandan, Mr. Subhendu Seth, Mr. Arumoy Mukherjee, Mr. Sunil N. Reddy, Mr. Pankaj Kumar, and Mr. Sairam Reddy.

I can never give enough thanks to my parents, sister, and other family members for their encouragement and support.

Place: I.I.T. Kharagpur
Date: January, 2008.

Sandipan Chakroborty

Abstract

Over the last decade, Speaker Identification (SI) systems have been employed efficiently as add-on modules in various speech-related applications such as Speech Recognition, Speaker Verification (SV), Personalized Speech Coding, and many other Personalized User Interfaces. At the time of testing, the identification time for an unseen speech sample depends on the number of feature vectors, their dimensionality, the complexity of the speaker models and the number of speakers. Methods like Prequantization (PQ) and Speaker Pruning help the system achieve a considerable speed-up. However, the performance of the system degrades somewhat, because some important parameters are pruned out during testing in order to make it faster. In this thesis, we focus on decreasing the computational load in the identification phase while simultaneously attempting to keep the recognition accuracy reasonably high through various fusion strategies that take evidence from conventional as well as complementary feature sets, and from different feature sets with reduced dimensionality.

First, we propose a complementary feature set that describes the high-frequency part of the spectrum better than its counterpart, the baseline Mel-frequency Cepstral Coefficients (MFCC). The SI performance of the proposed complementary feature set is compared with that of the baseline. Next, a fusion strategy is developed to fuse the scores from the models developed separately for the baseline and complementary feature sets. The PQ method is introduced into the fusion strategy so that performance can be compared at an equivalent computational load. The results of our analysis indicate that, using the proposed complementary features and the PQ based fusion scheme, one can achieve an appreciable enhancement in SI accuracy while utilizing the same amount of resources as a baseline system.

Subsequently, a study is carried out to observe the effect of various filter bank shapes on identification accuracy. The Gaussian filter (GF) is introduced to average the speech spectrum for determining the spectral envelope. Six different filter shapes, namely Rectangular, Triangular, and four varieties of Gaussian, are explored to extract meaningful cepstral parameters. The study reveals that cepstral features obtained from the GF outperform the other variants for a closed-set SI task. Complementary features are also obtained using the Gaussian shaped filter, and the complementary speaker models are fused in a similar way.

Following this, a straightforward, non-exhaustive search based Feature Selection (FS) method using Singular Value Decomposition (SVD) followed by QR Factorization with Column Pivoting (QRcp) is proposed for achieving a higher identification rate with a reduced feature set than with the full feature set. The performance of the system is compared with that of a system using features selected by the well-known F-Ratio (FR) based feature selection method. It is observed that the proposed FS method is superior both in the number of features required and in error rate.

Finally, some candidate strategies are proposed for fusing the information from multiple sources and are applied at various levels of an SI system to enhance its performance over a single stream baseline system. Using the best fusion technique, we obtain significant relative improvements in SI error rate over the conventional (baseline) system for two public databases, namely YOHO and POLYCOST, each comprising more than 130 speakers.

Keywords: Divergence, Fusion, Gaussian Mixture Model (GMM), GIMFCC, GMFCC, IMFCC, MFCC, PQ, QRcp, SI, Speaker Recognition, Speed-up, Subband, SVD.

Contents

Abbreviations
List of Notations and Operations
List of Figures
List of Tables
List of Algorithms

1 Introduction
  1.1 Basic Terminologies
  1.2 Chronological Review of Speaker Recognition Systems
    1.2.1 Development in Feature Domain
    1.2.2 Development in Classification Approaches
  1.3 Factors Affecting the Time Complexity of a Speaker Identification System
  1.4 A Brief Review of Computational Speed-Up Methods used in Speaker Identification
  1.5 Motivation
  1.6 Major Contribution in the Thesis
  1.7 Databases for the Studies and Experimental Procedures
    1.7.1 YOHO Database
    1.7.2 POLYCOST Database
  1.8 Organization of the Thesis
  References

2 Speaker Identification Based on High-Frequency Cues
  2.1 Introduction
    2.1.1 Organization of the Chapter
  2.2 Mel Frequency Cepstral Coefficients and their Calculation
  2.3 The Inverted Mel Frequency Cepstral Coefficients
  2.4 Brief Overviews of Various other Stages in the System
    2.4.1 Pre-processing stage
    2.4.2 Gaussian Mixture Models
  2.5 Comparative Study of Performances of MFCC and IMFCC Feature Sets
  2.6 Fusion of MFCC and IMFCC based Model Level Scores using Decimation type of Pre-quantization
    2.6.1 Performance Evaluation
  2.7 Results and Discussion on PQ based Fusion Strategy with P = 2
  2.8 Further Reduction of Frame Rate with P > 2
  2.9 Conclusions
  References

3 Studies on Gaussian Filter Shapes for Speaker Identification Application
  3.1 Introduction
    3.1.1 Motivation
    3.1.2 Organization of the chapter
  3.2 Derivation of Gaussian Filter based MFCC
    3.2.1 Choice of α, the overlap parameter
  3.3 Comparative Performances of Different Feature Sets under Mel Scale
  3.4 Application of Gaussian Filter to Inverted Mel Scale
  3.5 Comparative Performances of Different IMFCC Feature Sets
  3.6 Analysis of Class Separability
  3.7 Fusion of GMFCC & GIMFCC using PQ when P = 2
  3.8 Decimation with P > 2 on GMFCC-GIMFCC based Fused System
  3.9 Conclusions
  References

4 SVD-QRcp based Acoustic Feature Selection for Speaker Identification
  4.1 Introduction
    4.1.1 Review of Feature Selection Methods
    4.1.2 Motivation
    4.1.3 Organization of the chapter
  4.2 Theoretical Background on SVD and QRcp
    4.2.1 Singular Value Decomposition (SVD)
    4.2.2 QRcp Factorization
  4.3 Feature Subset Selection using SVD Followed by QRcp
    4.3.1 Formation of Matrix F
    4.3.2 Selection of Number of features using SVD
    4.3.3 Selection of Effective 'g' Number of Features using QRcp
    4.3.4 Description of the Complete System
  4.4 A Discussion on Singular Values and Percentage of Energy Explanation for Different Feature Sets
  4.5 Selected Subsets of Features and Their Performances
  4.6 Combination of Best Speaker Models' Outputs via PQ Based Fusion Strategy with P = 2
  4.7 Conclusions
  References

5 Studies on Input Fusion and Output Normalization for Speaker Identification Application
  5.1 Introduction
    5.1.1 Organization of the chapter
  5.2 Weighted Feature Level fusion
    5.2.1 Performances after Feature Level Fusion using MFCC & IMFCC feature sets
    5.2.2 Performances after Feature level fusion using GMFCC & GIMFCC feature sets
  5.3 Weighted Score Level Fusion using Output Normalization
    5.3.1 Weight Calculation using Best Speaker and Most Competing Speaker
    5.3.2 Weight Calculation using min-max Operator
  5.4 Combining Scores of Best Models' obtained through SVD-QRcp
  5.5 Conclusions
  References

6 Conclusions
  6.1 Summary of the Work
  6.2 Future Research Directions

A E&M and Split VQ Algorithm
  References

Index
Publication
Author's Biography

Abbreviations

ANN       Artificial Neural Network
DCT       Discrete Cosine Transform
DFT       Discrete Fourier Transform
DTW       Dynamic Time Warping
E&M       Expectation and Maximization
EER       Equal Error Rate
FFT       Fast Fourier Transform
FR        F-Ratio
FS        Feature Selection
GF        Gaussian Filter
GIMFCC    Gaussian Inverted Mel Frequency Cepstral Coefficients
GMFCC     Gaussian Mel Frequency Cepstral Coefficients
GMM       Gaussian Mixture Model
GVQ       Group Vector Quantization
HMM       Hidden Markov Model
IMFCC     Inverted Mel Frequency Cepstral Coefficients
KLT       Karhunen-Loève Transform
LBG       Linde, Buzo & Gray
LDA       Linear Discriminant Analysis
LFCC      Linear Frequency Cepstral Coefficients
LPC       Linear Predictive Coefficients
LPCC      Linear Predictive Cepstral Coefficients
MC        Multiple Classifiers
MFCC      Mel-frequency Cepstral Coefficients
NIST      National Institute of Standards and Technology
PC        Polynomial Classifier
PDF       Probability Density Function
PIA       Percentage of Identification Accuracy
PLP       Perceptual Linear Prediction
PQ        Pre-quantization
QRcp      QR Factorization with Column Pivoting
RCC       Real Cepstral Coefficients
RF        Rectangular Filter
SA        Speaker Adaptation
SD        Standard Deviation
SI        Speaker Identification
SR        Speaker Recognition
SV        Speaker Verification
SF        Speed-up Factor
SVD       Singular Value Decomposition
SVM       Support Vector Machines
TDM       Time Division Multiplexing
TF        Triangular Filter
UBM       Universal Background Model
VQ        Vector Quantization

List of Notations and Operations

$\alpha_i$    Scaling factor for the variance of ith band-pass filter
The threshold in improvement in distortion
$\lambda_s$    GMM for speaker s
$\hat{\lambda}_s$    Projected GMM in lower dimensional space
$\bar{\lambda}_s$    Improved GMM obtained in next iteration
$\mu$    Global mean vector obtained across speakers
$\mu_s$    Mean vector for speaker s
$\mu^s_i$    ith mean vector of GMM for speaker s
$\hat{\mu}^s_i$    ith mean vector of GMM for speaker s projected in lower dimension
$\bar{\mu}^s_i$    Improved ith mean vector of GMM for speaker s in next iteration
$\Sigma^s_i$    ith covariance matrix of GMM for speaker s
$\hat{\Sigma}^s_i$    ith covariance matrix of GMM for speaker s projected in lower dimension
$\bar{\Sigma}^s_i$    Improved ith covariance matrix of GMM for speaker s in next iteration
$\sigma_i$    Standard deviation for ith Gaussian filter in mel-scale
$\hat{\sigma}_i$    Standard deviation for ith Gaussian filter in inverted mel-scale
$\psi_i(k)$    Response of ith triangular band-pass filter in mel-scale
$\hat{\psi}_i(k)$    Response of ith triangular band-pass filter in inverted mel-scale
$\psi^g_i(k)$    Response of ith Gaussian band-pass filter in mel-scale
$\hat{\psi}^g_i(k)$    Response of ith Gaussian band-pass filter in inverted mel-scale
$B$    Overall between scatter matrix
$b^s_i(\cdot)$    ith multidimensional Gaussian PDF for speaker s
$C_m$    mth mel cepstral coefficient
$\hat{C}_m$    mth inverted mel cepstral coefficient
$C^g_m$    mth Gaussian mel cepstral coefficient
$\hat{C}^g_m$    mth Gaussian inverted mel cepstral coefficient
$D$    Dimension of a feature vector
$\hat{D}$    Reduced dimension of a feature vector
$d_i$    Usable spread for ith Gaussian filter in mel-scale
$\hat{d}_i$    Usable spread for ith Gaussian filter in inverted mel-scale
$e(i)$    ith triangular filter bank output using mel-scale wrapping
$\hat{e}(i)$    ith triangular filter bank output using inverted mel-scale wrapping
$e^g(i)$    ith Gaussian filter bank output using mel-scale wrapping
$\hat{e}^g(i)$    ith Gaussian filter bank output using inverted mel-scale wrapping
$F$    VQ stack matrix
$F_1$    VQ stack matrix with lower rank
$f_i$    ith column in VQ stack matrix
$F_e$    Number of operations required for extracting spectral features like MFCC
$F_p$    Fixed feature set
$F_s$    Sampling frequency
$f$    Frequency
$f_{b_i}$    Frequency of the middle point for ith triangular filter in mel-scale
$\hat{f}_{b_i}$    Frequency of the middle point for ith triangular filter in inverted mel-scale
$f_{high}$    Highest frequency in the usable bandwidth
$f_{low}$    Lowest frequency in the usable bandwidth
$f_{mel}$    Mel-scale
$f^{-1}_{mel}$    Inverse wrapping (mel to normal frequency mapping) for the mel-scale
$\hat{f}_{mel}$    Inverted mel-scale
$\hat{f}^{-1}_{mel}$    Inverse wrapping (inverted mel to normal frequency mapping) for the inverted mel-scale
$G$    Number of streams or number of feature sets used
$g$    Number of significant singular values
$K$    Number of subbands
$k_{b_i}$    DFT coefficient index of the middle point for ith triangular filter in mel-scale
$k_{b_{i+1}}$    DFT coefficient index of the rightmost point for ith triangular filter in mel-scale
$k_{b_{i-1}}$    DFT coefficient index of the leftmost point for ith triangular filter in mel-scale
$\hat{k}_{b_i}$    DFT coefficient index of the middle point for ith triangular filter in inverted mel-scale
$\hat{k}_{b_{i+1}}$    DFT coefficient index of the rightmost point for ith triangular filter in inverted mel-scale
$\hat{k}_{b_{i-1}}$    DFT coefficient index of the leftmost point for ith triangular filter in inverted mel-scale
$L$    Transformer for the data matrix to be linearly separable at equal/lower dimensionality
$L_s(X)$    Likelihood for speaker s when an unknown utterance X is input
$L^{s}_{com}$    Combined likelihood score for speaker s using MFCC-IMFCC paradigm
$L^{gs}_{com}$    Combined likelihood score for speaker s using GMFCC-GIMFCC paradigm
$M$    Model order
$\hat{M}$    Number of disjoint regions for test data using VQ
$M_s$    Number of DFT points
$N$    Number of samples in time domain frame
$N_s$    Total number of available training data for speaker s
$P_t$    Permutation matrix
$P(X^{iter}_{tr})$    Partition for the training vectors at 'iter'th iteration
$P$    Pre-quantization decimation rate
$P_{ex}$    Actual percentage of energy explanation obtained from the data after SVD
$P'_{ex}$    User defined percentage of energy explanation
$p(x|\lambda_s)$    Probability for feature vector x for the speaker model s
$p^s_i$    ith prior of GMM for speaker s
$\hat{p}^s_i$    ith prior of GMM for speaker s projected in lower dimension
$\bar{p}^s_i$    Improved ith prior of GMM for speaker s in next iteration
$\mathbf{Q}$    Orthogonal matrix after QR decomposition
$q_i$    ith column in orthogonal matrix
$q'_i$    ith orthogonal un-normalized vector after Gram-Schmidt orthogonalization
$Q$    Number of band-pass filters
$R$    Upper triangular matrix after QR decomposition
$R_{cep}$    Total number of cepstral parameters
$r$    Rank of a matrix
$S$    Number of speakers
$\hat{S}$    Reduced number of speakers after speaker pruning
$s_i$    ith singular value
$T$    Number of vectors in an incoming test sequence
$T_f$    Time required for evaluating fused stream
$T_{gf}$    Time required for evaluating fused stream developed from GMFCC-GIMFCC
$T_{gf}$    Time required for evaluating GMFCC stream (single stream)
$T_m$    Time required for evaluating MFCC stream (single stream)
$T_r$    Number of trials to find the best set of features other than the fixed feature set
$U$    Left singular vector matrix
$V$    Right singular vector matrix
$W$    Overall within scatter matrix
$W_s$    Within scatter matrix for speaker s
$w_i$    Assigned weight for ith stream using MFCC-IMFCC paradigm
$w^g_i$    Assigned weight for ith stream using GMFCC-GIMFCC paradigm
$w_{s_i}$    Assigned weight for ith stream, sth speaker using MFCC-IMFCC paradigm
$w^g_{s_i}$    Assigned weight for ith stream, sth speaker using GMFCC-GIMFCC paradigm
$X$    Collection of all the test vectors
$X^{0}_{tr}$    Initial codebook from training set
$x$    An incoming test vector
$x(\cdot)$    A raw speech frame
$Y(\cdot)$    Energy spectrum of a pre-emphasized time domain speech frame
$y(\cdot)$    A pre-emphasized time domain speech frame

List of Figures

1.1   Categorization of the speaker recognition tasks
1.2   Enrolment Phase
1.3   Identification Phase
1.4   Chronological Development of Speaker Recognition System
1.5   An example of speaker dependent speech recognition system
2.1   Normal frequency vs. Mel-frequency
2.2   Boundary points of one filter placed in normal frequency scale
2.3   Filter bank structures for MFCC process
2.4   Subjective Pitch vs Frequency. For Mel scale, corresponding to the human auditory system, pitch increases progressively less rapidly as the frequency increases. In direct contrast, it increases progressively more rapidly in the proposed Inverted Mel Scale
2.5   Log energy spectrum estimation using MFCC and IMFCC filter bank outputs for a speech frame from YOHO database
2.6   Filter bank structures for MFCC and IMFCC feature extraction process
2.7   MFCC and IMFCC feature extraction process
2.8   Scatter plot for feature diversity [Note: Each point in the plot represents an utterance]
2.9   Typical pre-processing stages
2.10  Mixture density by Gaussian mixture models with 4 modes in two dimensional space
2.11  Fixed decimation based PQ
2.12  Pre-Quantization based MFCC-IMFCC fusion strategy
2.13  Time vs. PIA for YOHO database
2.14  Time vs. PIA for POLYCOST database
3.1   Overlapped subbands realized by filters of various shapes
3.2   Mel to normal frequency scale mapping
3.3   Different shapes of filters used for MFCC implementation
3.4   Gaussian filters realized in mel-scale with different variances
3.5   Gaussian filters realized in inverted mel-scale with different variances
3.6   Mel and inverted mel to normal frequency scale mapping
3.7   Time vs. PIA for YOHO database for two different fused systems
3.8   Time vs. PIA for POLYCOST database for two different fused systems
4.1   Typical feature selection method
4.2   Stacked version of vector quantized cepstral vectors using MFCC feature set from YOHO database
4.3   Column swapping through maximum norm criterion in QRcp
4.4   SVD-QRcp based feature selection in SI system
4.5   Singular values and their corresponding percentage of energy explanation in YOHO Database
4.6   Singular values and their corresponding percentage of energy explanation in POLYCOST Database
5.1   Different levels of integration
5.2   Feature level fusion strategy
5.3   Model level fusion strategy
A.1   Splitting of a codeword

List of Tables

1.1   Specifications of YOHO Corpus
1.2   Specifications of POLYCOST Corpus
2.1   Boundary points $f_{b_i}$ and $\hat{f}_{b_i}$ in Hz for the MFCC and IMFCC filter banks (with Q = 20, $f_{low}$ = 31.25 Hz and $f_{high}$ = 4 kHz)
2.2   Specifications of Split Vector Quantization based clustering algorithm for initialization of seed mean vectors for GMM
2.3   Comparative performance using MFCC & IMFCC feature sets in YOHO database
2.4   Comparative performance using MFCC & IMFCC feature sets in POLYCOST database
2.5   Performances using MFCC & IMFCC feature sets in normal, Pre-quantized and fused mode for YOHO database (M = 64)
2.6   Performances using MFCC & IMFCC feature sets in normal, Pre-quantized and fused mode for POLYCOST database (M = 16)
2.7   Serial nos. of allocated vectors towards MFCC & IMFCC stream for different values of P
2.8   Reduction in computational complexity with increasing P for fusion scheme on YOHO Database (M = 64)
2.9   Reduction in computational complexity with increasing P for fusion scheme on POLYCOST Database (M = 16)
3.1   SI performances of various shapes of filters in mel-scale on YOHO database
3.2   SI performances of various shapes of filters in mel-scale on POLYCOST database
3.3   SI performances of various shapes of filters in inverted mel-scale on YOHO database
3.4   SI performances of various shapes of filters in inverted mel-scale on POLYCOST database
3.5   Divergence analysis for different feature sets on YOHO and POLYCOST databases
3.6   Performances using GMFCC & GIMFCC feature sets in normal, Pre-quantized and fused mode for YOHO database (M = 64)
3.7   Performances using GMFCC & GIMFCC feature sets in normal, Pre-quantized and fused mode for POLYCOST database (M = 16)
3.8   Comparative SI performances between two fused systems for YOHO Database
3.9   Comparative SI performances between two fused systems for POLYCOST Database
3.10  Reduction in computational complexity with increasing P for fusion scheme on YOHO Database (M = 64)
3.11  Reduction in computational complexity with increasing P for fusion scheme on POLYCOST Database (M = 16)
4.1   Minimum number of fixed features obtained when $P_{ex}$ = 99% for different feature sets on two databases
4.2   Rank of features evaluated by SVD-QRcp and FR based feature selection methods for YOHO Database
4.3   Rank of features evaluated by SVD-QRcp and FR based feature selection methods for POLYCOST Database
4.4   Final list of selected subset of features for YOHO and POLYCOST database
4.5   SI performance using selected features from different feature sets on YOHO Database
4.6   SI performance using selected features from different feature sets on POLYCOST Database
4.7   SI accuracy when best MFCC & IMFCC models are fused by PQ based fusion strategy for YOHO Database
4.8   SI accuracy when best GMFCC & GIMFCC models are fused by PQ based fusion strategy for YOHO Database
4.9   SI accuracy when best MFCC & IMFCC models are fused by PQ based fusion strategy for POLYCOST Database
4.10  SI accuracy when best GMFCC & GIMFCC models are fused by PQ based fusion strategy for POLYCOST Database
5.1   Assigned weights for different feature sets for the YOHO and POLYCOST databases
5.2   SI accuracies after feature level fusion using MFCC-IMFCC paradigm for YOHO database
5.3   SI accuracies after feature level fusion using MFCC-IMFCC paradigm for POLYCOST database
5.4   SI accuracies after feature level fusion using GMFCC-GIMFCC paradigm for YOHO database
5.5   SI accuracies after feature level fusion using GMFCC-GIMFCC paradigm for POLYCOST database
5.6   SI accuracies using various fusion strategies applied on speaker models' scores (MFCC-IMFCC, YOHO database)
5.7   SI accuracies using various fusion strategies applied on speaker models' scores (MFCC-IMFCC, POLYCOST database)
5.8   SI accuracies using various fusion strategies applied on speaker models' scores (GMFCC-GIMFCC, YOHO database)
5.9   SI accuracies using various fusion strategies applied on speaker models' scores (GMFCC-GIMFCC, POLYCOST database)
5.10  SI accuracies using various fusion strategies applied on the best speaker models' scores (MFCC-IMFCC, YOHO and POLYCOST databases) with highest model orders (i.e. 64 and 16)
5.11  SI accuracies using various fusion strategies applied on the best speaker models' scores (GMFCC-GIMFCC, YOHO and POLYCOST databases) with highest model orders

List of Algorithms

2.1 PQ based fusion strategy
4.1 Training Phase for an SI system with feature selection by SVD followed by QRcp
4.2 Testing Phase for an SI system with feature selection by SVD followed by QRcp
4.3 Feature Selection using SVD-QRcp with score level fusion based on PQ
A.1 Split-VQ Algorithm

CHAPTER 1

Introduction

Preface

In Chapter 1, an introduction to the Speaker Identification (SI) problem is presented along with its basic building blocks and important terminologies. This is followed by a chronological review of research on Speaker Recognition (SR) technology. A brief review of conventional speed-up techniques is presented next, along with the motivations for the work. The main contributions of this thesis are then enumerated, followed by a description of the databases used for the studies.


1.1 Basic Terminologies

Human speech conveys different types of information. The primary type is the meaning of words, which the speaker tries to pass to the listener. Other types of information carried by speech include the language being spoken and the speaker’s emotions, gender, and identity. The goal of Speaker Recognition (SR) is to extract, characterize and recognize the information about speaker identity [1]. SR is usually divided into two branches, Speaker Verification (SV) [2] and Speaker Identification (SI) [2]. In the identification task, or 1 : S matching, an unknown speaker is compared against a database of S known speakers, and the best matching speaker is returned as the recognition decision. The verification task, or 1 : 1 matching, consists of deciding whether a given voice sample [2], [3], [4] was produced by a claimed speaker. An identity claim (e.g., a PIN code) is given to the system, and the unknown speaker’s voice sample is compared against the claimed speaker’s voice template. If the degree of similarity between the voice sample and the template exceeds a predefined decision threshold, the speaker is accepted; otherwise, the speaker is rejected.

Of the two tasks, identification is generally considered more difficult. This is intuitive: as the number of registered speakers increases, the probability of an incorrect decision increases [5], [6], [7]. The performance of the verification task is not, at least in theory, affected by the population size, since only two speakers are compared.

The SI task is classified into open-set and closed-set tasks. If the target speaker is assumed to be one of the registered speakers, the recognition task is a closed-set problem. If there is a possibility that the target speaker is none of the registered speakers, the task is called an open-set problem. In general, the open-set problem is much more challenging. In the closed-set task, the system makes a forced decision simply by choosing the best matching speaker from the speaker database, no matter how poorly this speaker matches. In open-set identification, however, the system must have a predefined tolerance level so that the degree of similarity between the unknown speaker and the best matching speaker is within this tolerance. In this way, the verification task can be seen as a special case of the open-set identification task, with only one speaker in the database (S = 1).

SI tasks can be further classified into text-dependent or text-independent tasks. In the former case, the utterance presented to the recognizer is known beforehand. In the latter case, no assumption about the text being spoken is made, and the system must model the general underlying properties of the speaker’s vocal space. In general,


text-dependent systems are more accurate, since both the content and voice can be compared. For instance, a speech recognizer can be used to check whether the user utters the sentence that the system prompted. This is known as utterance verification, and it can be efficiently combined with SV [8].

[Figure 1.1: Categorization of the speaker recognition tasks. Speaker recognition divides, by the number of classes involved, into speaker identification (multiple-class problem) and speaker verification (two-class problem). The speaker identification task is further categorized as closed-set or open-set, and as text-dependent or text-independent.]

In text-dependent SV, the pass phrase presented to the system is either always the same or different for every verification session. In the latter case, the system randomly selects a pass phrase from its database and the user is prompted to utter this phrase. In principle, the pass phrase can be stored as a whole word/utterance template, but a more flexible way is to form it online by concatenating different words (or other units such as diphones). This task is called text-prompted SV. The advantage of text prompting is that a possible intruder cannot know beforehand what the phrase will be, and playback of pre-recorded speech becomes difficult. Furthermore, the system


can force the user to utter the pass phrase within a short time interval, which makes it harder for an intruder to use a device or software that synthesizes the customer’s voice. The taxonomy of SR is represented in Figure 1.1.

The process of SI is divided into two main phases. During the first phase, called speaker enrollment, speech samples are collected from the speakers and used to train their models. The collection of enrolled models is also called a speaker database. In the second phase, called identification, a test sample from an unknown speaker is compared against the speaker database. Both phases include the same first step, feature extraction, which is used to extract speaker dependent characteristics from speech. The main purpose of this step is to reduce the amount of data while retaining speaker discriminative information. Some well known feature extraction techniques are Linear Predictive Coding (LPC) [2], [9], Mel-frequency Cepstral Coefficients (MFCC) [10], [11], Linear Frequency Cepstral Coefficients (LFCC) [10], Linear Predictive Cepstral Coefficients (LPCC) [9], [12], and Perceptual Linear Prediction (PLP) [13]. In the enrollment phase, these features are then modeled and stored in the speaker database. State-of-the-art SR utilizes the Gaussian Mixture Model (GMM) [14] as the feature modeling technique because of its capability to estimate the data distribution accurately in a probabilistic sense. However, other modeling techniques such as Vector Quantization (VQ) [15], Polynomial Classifier (PC) [16], Support Vector Machine (SVM) [17], and Artificial Neural Networks (ANN) [18] could also be used. The enrollment process is represented in Figure 1.2. The figure shows a pre-processing block, where silence is first removed from the raw speech signal, followed by pre-emphasis, frame blocking, and windowing. A detailed description of the pre-processing stage is given in Section 2.4.

[Figure 1.2: Enrolment Phase. Block diagram: speech → pre-processing → feature extraction → speaker modeling → speaker database. Pre-processing produces speech frames, feature extraction produces feature vectors (dimensionality reduction), and speaker modeling produces the speaker model (generalized representation).]


In the identification step, the extracted features are compared against the models stored in the speaker database. Based on these comparisons, the final decision about speaker identity is made. This process is represented in Figure 1.3. The two phases are closely related: the identification algorithm usually depends on the modeling algorithm used in the enrollment phase.
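The decision rules described in Section 1.1 can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes that each enrolled model has already produced a similarity score for the test utterance (higher meaning a better match), and the function names, toy scores and thresholds are hypothetical rather than taken from the thesis.

```python
import numpy as np

def identify_closed_set(scores):
    """Closed-set SI: forced choice of the best-matching enrolled speaker."""
    return int(np.argmax(scores))

def identify_open_set(scores, threshold):
    """Open-set SI: return the best match, or None when even the best
    score falls below the predefined tolerance level."""
    best = int(np.argmax(scores))
    return best if scores[best] >= threshold else None

def verify(score_claimed, threshold):
    """SV (1 : 1 matching): accept the identity claim only if the
    similarity to the claimed speaker's model exceeds the threshold."""
    return score_claimed > threshold

# Hypothetical similarity scores of one test utterance against S = 3 models.
scores = np.array([-41.2, -37.5, -39.9])
print(identify_closed_set(scores))        # index of the best-matching speaker
print(identify_open_set(scores, -35.0))   # None: best match is not good enough
print(verify(scores[1], -38.0))           # claim "speaker 1" against a threshold
```

The closed-set rule never rejects, whereas the open-set rule and the verification rule both depend on a threshold that has to be tuned on held-out data.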

[Figure 1.3: Identification Phase. Block diagram: test speech → pre-processing → feature extraction (in reduced dimension) → comparison with speaker database → decision.]

The SI area on its own has witnessed forty years of progress and development. During this period, a variety of speaker recognition tasks were defined in response to the increasing needs of our technologically oriented way of life. The common thread of all these tasks is the assumption that the human voice is unique for each individual and can therefore be used as a distinguishing feature for recognizing its owner among other individuals. Conti [19] states (p. 24): “But just as there are no two identical faces, fingerprints or irises in this world, so there aren’t two human voices alike. And since phones already come equipped with the appropriate capture device for voice input (i.e. a microphone), voice biometrics are set to provide the ease of implementation and low costs required for mass-market deployments.”

1.2 Chronological Review of Speaker Recognition Systems

The present section provides a survey of various speech features and different classification approaches that were used in the SR tasks over the years.


1.2.1 Development in Feature Domain

Historically, the following spectrum-related speech features have dominated the speech and speaker recognition areas: Real Cepstral Coefficients (RCC) introduced by Oppenheim (1969) [20], LPC proposed by Atal and Hanauer (1971) [21], LPCC derived by Atal (1974) [9], and MFCC (Davis and Mermelstein, 1980) [10]. Other speech features such as PLP coefficients (Hermansky, 1990) [13], Adaptive Component Weighting Cepstral Coefficients (Assaleh and Mammone, 1994) [22], and various wavelet-based features, although presenting reasonable solutions for the same tasks, did not gain widespread practical use. This was often due to their relatively more sophisticated computation or to the fact that they do not provide a significant advantage over the well-known MFCC. LFCC has been used successfully in the SR task. However, MFCC has the advantage of being effective in both speech and speaker recognition, which often go hand in hand, and it is widely considered the baseline in the SR task. From a perceptual point of view, MFCC bear resemblance to the human auditory system, since they account for the nonlinear nature of pitch perception as well as for nonlinear loudness perception. That makes MFCC more adequate features for speech recognition than formerly used speech parameters like RCC, LPC, and LPCC. This success of MFCC, combined with their robust and cost-effective computation, turned them into a standard choice in speech recognition applications. MFCC became widely used in SR tasks too, although they might not represent well some important details that contribute to better differentiation among particular voices. The fundamental frequency and the energy of speech (more precisely, their logarithmically compressed values) are often appended to spectrum-derived speech parameters like MFCC to form a composite feature vector [23].

Some researchers prefer to model the distributions of the energy and fundamental frequency independently from the spectrum-related features, and perform fusion [23], [24] of the scores computed from the individual classifiers. In this way they are able to control the parameters of the model better while also evading the curse of dimensionality, which makes better use of the available training data. The estimated values of the fundamental frequency, and occasionally their temporal derivatives, are also used as additional parameters. Subsequently, some high-order statistical parameters derived from the distribution of these parameters were also found to be practically helpful. Finally, piece-wise approximations of the temporal tracks of the fundamental frequency and the frame energy were demonstrated to improve speaker recognition performance. In Figure 1.4, a chronological development of the


speaker recognition technology has been described.

[Figure 1.4: Chronological Development of Speaker Recognition System. The figure charts the baselines, pattern matching techniques and feature extraction methods in use during each period: before 1970, 1970-1980, 1980-1990, 1990-2000, and 2000 to date.]

1.2.2 Development in Classification Approaches

Over the years, various classification approaches have been employed in the SR task. Two major categories can be distinguished: discriminative and non-discriminative. The discriminative classifiers are trained to minimize the classification error on a set of training data. Thus, they only need to model the boundary between the classes and are insensitive to the variations within the classes. The discriminative models include different kinds of ANN [18], PC [16], Group Vector Quantization (GVQ) [25], Discriminative GMM [26], and SVM [17]. The non-discriminative approaches do not aim directly at minimizing the classification error. A major group of non-discriminative approaches is called generative. As their name suggests, the generative classifiers strive to build models of the underlying distribution from the training data. The group of generative approaches includes the Probabilistic Neural Network [27], which combines Parzen-like probability density function estimators with Bayes’ strategy for decision rules; the GMM [14]; and the Hidden Markov Model (HMM) [28]. HMMs are capable of modeling the temporal behavior of a sequence of events. The GMM (a single-state HMM) is not sensitive to the time order of the inputs.
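The remark that a GMM is insensitive to the time order of its inputs can be checked numerically. The sketch below is an illustration under stated assumptions, not the implementation used in this thesis: it assumes diagonal covariance matrices, uses randomly generated toy frames and model parameters, and the function name `gmm_avg_loglik` is hypothetical.

```python
import numpy as np

def gmm_avg_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of frames X (T x D) under a
    diagonal-covariance GMM with M components (assumed form)."""
    diff = X[:, None, :] - means[None, :, :]                          # (T, M, D)
    exponent = -0.5 * np.sum(diff ** 2 / variances, axis=2)           # (T, M)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1) # (M,)
    log_comp = np.log(weights) + log_norm + exponent                  # (T, M)
    # Numerically stable log-sum-exp over the M components.
    peak = log_comp.max(axis=1, keepdims=True)
    frame_ll = peak[:, 0] + np.log(np.exp(log_comp - peak).sum(axis=1))
    return frame_ll.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))            # toy "feature vectors": T = 50, D = 4
weights = np.array([0.5, 0.5])          # M = 2 mixture weights
means = rng.normal(size=(2, 4))
variances = np.ones((2, 4))

forward = gmm_avg_loglik(X, weights, means, variances)
backward = gmm_avg_loglik(X[::-1], weights, means, variances)
print(np.isclose(forward, backward))    # frame order does not change the score
```

Because the utterance score is a sum of independent per-frame terms, any permutation of the frames yields the same value up to floating-point rounding, which is the property that frame-reordering speed-up methods rely on.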

At the present time, the GMM, HMM, and SVM classifiers (see Fig. 1.4) are considered state of the art in speaker recognition technology. Specifically, during the past decade, GMMs were the state of the art for text-independent SI/SV, and HMMs were the leading approach in text-dependent tasks. However, in the last few years, SVM classifiers, which were first used for SI, have also been applied successfully to SV tasks. Recently, some SVM based SR systems demonstrated performance close to that of the best GMM-based systems [17], [29].

1.3 Factors Affecting the Time Complexity of a Speaker Identification System

The identification time depends on the number of feature vectors, their dimensionality, the complexity of the speaker models and the number of speakers. For an unknown utterance, if T acoustic vectors are extracted and sent to a speaker model of order M (M Gaussians for a GMM or M code vectors for a VQ based model), then O(MT) distance (Euclidean or weighted Euclidean) calculations are required. In addition, calculation of the distance (or likelihood) using an Mth order GMM requires M exponential operations, M multiplications with some scalar quantities including priors, and M − 1 additions. Computation of the squared Euclidean distance (or weighted Euclidean distance) between two D-dimensional vectors, in turn, takes D multiplications and D − 1 additions. Therefore, the total number of operations for computing the likelihood using a GMM based speaker model is O(MTD). The computation of the likelihood is repeated for all S speakers, so the total identification time is of the order of O(MTDS).

The efficiency of the feature extraction depends on the selected signal parametrization. Suppose that the extraction of one vector takes O(Fe) operations. The total number of operations for feature extraction is then O(TFe), where T is the number of vectors. Note that the feature extraction needs to be done only once. To sum up, the total number of computations in identification is O(TFe + MTDS) = O(T(Fe + MDS)).
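To make the roles of the individual factors concrete, the operation count derived above can be evaluated for sample values. The following is a rough sketch only: constants hidden by the O-notation are ignored, and the numeric values of T, M, D and Fe are hypothetical choices for illustration (S = 138 matches the YOHO speaker count used later in this chapter).

```python
def identification_cost(T, M, D, S, Fe):
    """Order-of-magnitude operation count T*Fe + M*T*D*S for one test
    utterance (constant factors dropped)."""
    extraction = T * Fe          # done once per utterance
    matching = M * T * D * S     # repeated for every enrolled speaker
    return extraction, matching

# Hypothetical setting: 300 frames, 32 mixtures, 20-dim features, 138 speakers.
extraction, matching = identification_cost(T=300, M=32, D=20, S=138, Fe=2000)
baseline = extraction + matching

# Halving the number of speakers, the model order or the feature dimension
# each halves the matching term but leaves the extraction term unchanged.
_, fewer_speakers = identification_cost(T=300, M=32, D=20, S=69, Fe=2000)
_, lower_dim = identification_cost(T=300, M=32, D=10, S=138, Fe=2000)
print(baseline, fewer_speakers, lower_dim)
```

Whenever the product MDS is much larger than Fe, the matching term dominates, which is why reducing T, M, D or S are the natural routes to a faster identification step.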

1.4 A Brief Review of Computational Speed-Up Methods used in Speaker Identification

There is some literature available for reducing the computational load in the SI systems. In a recent contribution by Kinnunen et al. [31], a considerable speed-up


in various SI systems has been achieved by reducing the number of test vectors and speakers. It can be concluded from this contribution that these two are the key factors in the computations involved in an SI test. By discarding unnecessary frames and pruning unlikely speakers from the complete set, computation time can be reduced significantly. The work reports highest speed-up factors of 16:1 and 34:1 for VQ and GMM based SI systems, respectively. However, the work does not show the speed-up that could have been achieved by reducing the dimension of the feature vectors and the complexity (order) of the speaker model. In addition, the performance of the systems is slightly degraded in exchange for the significant speed-ups.

A classification scheme that incorporates the Karhunen-Loève transformation (KLT) [32] and GMM for text-independent SI has been proposed in [33]. First, the incoming test vectors of an unknown utterance are compressed via KLT, and Bhattacharyya distances [33] are calculated from the compressed data for each speaker in the database. Speakers that are not close to the compressed test data are pruned out. The original test data is then sent to the remaining speakers, and the speaker whose score is highest is declared the identified speaker. For a database with 500 Mandarin speakers, the work reports an accuracy improvement of up to 4% and a tenfold saving in computational cost compared with the conventional GMM. The work, however, uses different excerpts of the same utterance for training and testing (the first 25 sec of a 30 sec utterance for training and the last 5 sec for testing), so it gives no indication of how the system performs as the session varies.

An efficient GMM-based SI system has also been presented by Pellom and Hansen [34]. Since adjacent feature vectors are correlated and the order of the vectors does not affect the final score, the vector sequence can be reordered so that nonadjacent feature vectors are scored first. After the scoring, the worst scoring speakers are pruned out using a beam search technique where the beam width is updated during processing. Then, a more detailed sampling of the sequence follows. The process is repeated as long as there are unpruned speakers or input data left, and then the best scoring speaker is selected as the winner. Pellom and Hansen reported a speed-up factor of 6:1 relative to the baseline beam search.
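The general idea of pruning unlikely speakers while the test utterance is being scored can be illustrated with a simplified sketch. It is not the algorithm of [31] or [34]: it assumes a fixed beam width instead of an adaptive one, processes frames in consecutive blocks rather than in a reordered sequence, and takes a precomputed matrix of per-frame scores purely for demonstration. The function name and the toy data are hypothetical.

```python
import numpy as np

def identify_with_pruning(frame_scores, block=10, beam=5.0):
    """frame_scores: (S, T) array of per-frame log-scores for S speakers.
    Frames are consumed block by block; after each block, speakers whose
    accumulated score trails the current best by more than `beam` are
    dropped. Returns (winner index, number of frame scores actually used)."""
    S, T = frame_scores.shape
    active = np.arange(S)            # indices of speakers still in the race
    total = np.zeros(S)              # accumulated score per speaker
    used = 0
    for start in range(0, T, block):
        chunk = frame_scores[active, start:start + block]
        total[active] += chunk.sum(axis=1)
        used += chunk.size
        best = total[active].max()
        active = active[total[active] >= best - beam]
        if active.size == 1:         # only one candidate left: stop early
            break
    winner = active[np.argmax(total[active])]
    return int(winner), used

# Toy example: 5 speakers, 100 frames, speaker 2 matches consistently better.
frame_scores = np.full((5, 100), -10.0)
frame_scores[2] = -7.0
winner, used = identify_with_pruning(frame_scores)
print(winner, used, frame_scores.size)
```

In a real system the per-frame scores would be computed lazily, so every pruned speaker and every skipped frame translates directly into saved likelihood evaluations; the price is that an overly narrow beam may discard the true speaker early.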

1.5 Motivation

SI could be used in adaptive user interfaces. For instance, a computer shared by many people of the same family/community could recognize the user by his/her voice


as password to unlock it. Similarly, a car shared by many people of the same family could recognize the driver [35] by his/her voice, and tune the radio to his/her favorite channel. An SI system also finds application in building security [36]. As an interface for an elevator, the system consists of hardware and software that use the user’s voice to command the elevator. The software part of the system is divided into two phases: Speech Recognition and SI. In the first phase, the user says the desired floor and the system identifies which floor number the user has uttered. After recognizing the desired floor, the system commands the movement of the elevator provided that the user belongs to the group of residents of that floor. These application concepts belong to the more general group of Speaker Adaptation (SA) methods that are already employed in speech recognition systems [37], [38]. The objective of such a system is to adapt the speech recognizer parameters to better suit the current speaker, or to select a speaker-dependent speech recognizer (see Fig. 1.5) from its database.

[Figure 1.5: An example of speaker dependent speech recognition system. A speaker classifier assigns the input utterance ('abc123') to one of four speaker groups, and the speech recognizer of the selected group (Group 3 in the example) performs group dependent speech recognition.]

Speaker-specific codecs in personal speech coding have also been demonstrated to give smaller bit rates than a universal speaker-independent codec [39]. SI has also been applied to the verification problem in [40], where a simple rank-based


verification method was proposed. For the unknown speaker’s voice sample, the Ŝ nearest speakers are searched from the database. If the claimed speaker is among the Ŝ best matches, the speaker is accepted; otherwise, the speaker is rejected. In open-set speaker verification for a small corpus, SI is used to check whether the claimed speaker is one of the client speakers or not. Finally, another application area, rarely explored so far, is games: child toys, video games, etc. Games can use SI for enhanced interaction and personalization of player profiles. With the evolution of computing power, the use of the vocal modality in games is expected to appear soon. Among the vocal technologies available, SI certainly has a part to play, for example to recognize the owner of a toy, to identify various speakers or even to detect the characteristics or the variations of a voice (e.g. imitation contest). One interesting point with such applications is that the level of performance can be a secondary issue since an error has no real impact. However, the use of SR technology in games is still a prospective area.

SI and SA have potentially more applications than verification, which is mostly limited to security systems. However, the verification problem is relatively more studied, which might be due to (1) lack of application concepts for the identification problem, (2) increase in the expected error with growing population size [6], and (3) very high computational cost. As regards identification accuracy, it may not always be necessary to know the exact speaker identity; the speaker class of the current speaker may suffice. However, this has to be determined in real time [31]. The demand for SI as an add-on module for other speech related applications is growing day by day. In SI, the identification time depends on the several factors mentioned before. A few attempts can be found in [31], [41], [42] to decrease the computational load in an SI system by pruning test vectors, speakers, and sometimes both. With minor degradation of identification accuracy, the above methods have shown a considerable decrease in the computational burden of the system. In practical applications, growth of the speaker population cannot be restricted, and at the same time performance degradation cannot be tolerated when new speakers are added to the system. In the present thesis, we focus on decreasing the computational load of identification while attempting to keep the recognition accuracy reasonably high. Some candidate strategies have been proposed that attempt to keep a balance between the accuracy rate and the computational burden in an SI system.
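The rank-based verification rule mentioned in this section (accept the claim if the claimed speaker is among the Ŝ best matches) can be written down directly. The sketch below is a hedged illustration, not the implementation of [40]: it assumes that higher scores mean better matches, and the function name, the parameter `s_hat` and the toy scores are hypothetical.

```python
import numpy as np

def rank_based_verify(scores, claimed, s_hat):
    """Accept the identity claim if `claimed` is among the `s_hat`
    best-scoring enrolled speakers (higher score = better match)."""
    top = np.argsort(scores)[::-1][:s_hat]   # indices of the s_hat best matches
    return bool(claimed in top)

# Hypothetical match scores of one test sample against four enrolled speakers.
scores = np.array([0.2, 0.9, 0.5, 0.7])
print(rank_based_verify(scores, claimed=3, s_hat=2))   # speaker 3 is ranked second
print(rank_based_verify(scores, claimed=2, s_hat=2))   # speaker 2 is ranked third
```

Unlike threshold-based verification, this rule needs no score threshold, but it requires scoring the test sample against every enrolled speaker, so it inherits the computational cost of identification.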


1.6 Major Contribution in the Thesis

The major contributions of this thesis, listed in the order of their importance, are:

• Proposal of exploiting complementary high frequency speaker specific cues for the SI task. Conventional feature extractors like MFCC concentrate mainly on the low frequency region of the spectrum to parameterize a speech frame while giving less weightage to the higher frequency zone.

• Proposition of Pre-quantization (PQ) as a merging technique for score level fusion of two speaker models developed from complementary feature sets, to equalize/reduce the computation involved while yielding a higher accuracy rate than that of the single stream based baseline system.

• Study and analysis of the inter subband correlation in SI using different shapes of filters as the spectral averaging bins. Proposition of a Gaussian Filter (GF) bank with fixed and variable variances to control the correlation.

• Proposition of a feature selection strategy based on Singular Value Decomposition (SVD) followed by QR Decomposition with Column Pivoting (QRcp) to find the potential set of features from the whole set.

• Proposition to combine PQ based fusion with lower dimensional models developed from the selected features to achieve further reduction in time complexity at the testing phase.

• Proposition of discriminative weight based feature level concatenation to yield high SI rate.

• Proposal of finding weights through normalization of model level scores for the ‘SUM’ rule based fusion scheme.

1.7 Databases for the Studies and Experimental Procedures

Descriptions of the databases are presented next.

1.7.1 YOHO Database

The YOHO voice verification corpus [43], [44] was collected while testing ITT’s prototype speaker verification system in an office environment. Most subjects were from the New York City area, although there were many exceptions, including some non-native English speakers. A high-quality telephone handset (Shure XTH-383) was used to collect the speech at a sampling frequency of 8 kHz; however, the speech was not passed through a telephone channel. There are 138 speakers (106 males and 32 females); for each speaker, there are 4 enrollment sessions of 24 utterances each and 10 test sessions of 4 utterances each. In each session, a speaker was prompted with a series of phrases to be read aloud; each phrase was a sequence of three two-digit numbers (e.g. “35 - 72 - 41”, pronounced “thirty-five, seventy-two, forty-one”). The salient points of the YOHO database are given in Table 1.1.

Table 1.1: Specifications of YOHO Corpus.

| Attribute | Specification |
|---|---|
| No. of speakers | 138 (106 M / 32 F) |
| No. sessions/speaker | 4 enrollments, 10 verifications |
| Intersession interval | Days-month (3 days nominal) |
| Type of speech | Prompted digit phrases |
| Microphones | Fixed high-quality in handset |
| Channels | 3.8 kHz/clean |
| Acoustic environment | Office |

Experimental Procedure for YOHO Database

For all the experiments conducted in this thesis, a closed set text-independent speaker identification framework is adopted where we consider all 138 speakers as client speakers. For a speaker, all the 96 (4 sessions × 24 utterances) utterances are used for developing the speaker model while for testing, 40 (10 sessions × 4 utterances) utterances are put under test. Therefore, for 138 speakers we put 138 × 40 = 5520 utterances under test and evaluate the identification accuracies. Note that no adaptation (e.g. concatenation of test data collected over different sessions, speaker model adaptation on the test utterances that have been correctly identified, etc.) of the available test data has been done.


1.7.2 POLYCOST Database

The POLYCOST database [45], [46] was recorded as a common initiative within the COST 250 action during January-March 1996. It contains around 10 sessions recorded by 134 subjects from 14 countries. The database was collected through the European telephone network. The recording was performed with ISDN cards on two XTL SUN platforms with an 8 kHz sampling rate. The speech files contain A-law data according to ITU G.711¹ (A-law coded samples, 8 kHz sampling rate, 8 bits/sample) with no file header. Table 1.2 shows some of the important specifications of the POLYCOST database.

Table 1.2: Specifications of POLYCOST Corpus.

| Attribute | Specification |
|---|---|
| No. of speakers | 134 (74 M / 60 F) |
| No. sessions/speaker | >5 |
| Intersession interval | Days-weeks |
| Type of speech | Fixed and prompted digit strings, read sentences, free monologue |
| Microphones | Variable telephone handsets |
| Channels | Digital ISDN |
| Acoustic environment | Home/Office |

Experimental Procedure for POLYCOST Database

The specified guideline [47] for conducting closed set SI experiments is adhered to, i.e. ‘MOT02’ files from the first four sessions are used to build a speaker model while ‘MOT01’ files from session five onwards are taken for testing. Unlike the YOHO database, not all speakers have the same number of sessions. Further, three speakers (M042, M045 & F035) are not included in our experiments as they provide fewer than four sessions. A total of 754 ‘MOT01’ utterances are put under test. As with the YOHO database, all speakers (131 after deletion of three speakers) in the database were registered as clients and no adaptation on the test data was done.

¹ For more information on A-law encoding, visit the International Telecommunication Union: http://www.itu.ch/standard
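Since the POLYCOST speech files are described above as headerless 8-bit A-law data sampled at 8 kHz, they have to be expanded to linear PCM before any pre-processing. The sketch below shows one way this could be done with NumPy using the standard G.711 A-law expansion; it is an illustration rather than the tooling used in this thesis, and the function names and the file path argument are hypothetical.

```python
import numpy as np

def alaw_to_linear(codes):
    """Expand G.711 A-law bytes (uint8 array) to 16-bit linear PCM."""
    a = codes.astype(np.int32) ^ 0x55        # undo the even-bit inversion
    sign = a & 0x80                          # set bit means a positive sample
    seg = (a & 0x70) >> 4                    # 3-bit segment (exponent)
    t = (a & 0x0F) << 4                      # 4-bit mantissa
    t = np.where(seg == 0, t + 8, (t + 0x108) << np.maximum(seg - 1, 0))
    return np.where(sign != 0, t, -t).astype(np.int16)

def read_headerless_alaw(path, sample_rate=8000):
    """Read a raw A-law file with no header (hypothetical helper); the
    sampling rate is not stored in the file and must be supplied."""
    raw = np.fromfile(path, dtype=np.uint8)
    return alaw_to_linear(raw), sample_rate

# Smallest and largest magnitudes of the A-law code book.
print(alaw_to_linear(np.array([0xD5, 0x55, 0xAA, 0x2A], dtype=np.uint8)))
```

Because the files carry no header, the sampling rate and sample format must come from the database documentation, which is why they are passed explicitly here.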

1.8 Organization of the Thesis

The thesis is organized as follows.

• Chapter 1 presents an overview of speaker recognition, the motivation for the work, a brief literature survey, descriptions of the databases, and an outline of the work reported in the thesis.

• Chapter 2 proposes a new feature extraction tool for capturing speaker specific high frequency cues through the reversed filter bank, which is complementary to MFCC. Next, score level fusion is performed using PQ, where the scores are obtained from the speaker models developed from the complementary feature sets.

• In Chapter 3, Gaussian shaped filter banks are used to replace the conventional triangular shaped ones in order to exploit evidence from the adjacent subband outputs. It is shown that Gaussian MFCC and its complementary feature set give much superior performance compared to the original MFCC and its complementary features.

• Chapter 4 uses the SVD technique followed by QRcp to select the effective features for developing efficient speaker models for the SI application.

• Chapter 5 covers various fusion strategies at different levels of an SI system. An input level fusion strategy and two score level fusion strategies are proposed that help to further enhance the performance shown by unweighted fusion.

• Chapter 6 draws the principal conclusions of the work and outlines future directions for the present research.

• Appendix A presents algorithms for Expectation and Maximization (E&M) and Split Vector Quantization.

References

[1] D. A. Reynolds, “An Overview of Automatic Speaker Recognition Technology,” in Proc. Acoustics, Speech, and Signal Processing, (ICASSP 2002), 2002, pp. 4072-4075. (Cited in section 1.1.)
[2] J. P. Campbell, Jr., “Speaker Recognition: A Tutorial,” Proceedings of the IEEE, vol. 85, no. 9, pp. 1437-1462, Sept. 1997. (Cited in section 1.1.)
[3] J. M. Naik, “Speaker Verification: A Tutorial,” IEEE Communications Magazine, vol. 28, no. 1, pp. 42-48, Jan. 1990. (Cited in section 1.1.)
[4] J.-C. Wang, C.-H. Yang, J.-F. Wang, and H.-P. Lee, “Robust Speaker Identification and Verification,” IEEE Computational Intelligence Magazine, vol. 2, no. 2, May 2007. (Cited in section 1.1.)
[5] G. Doddington, “Speaker recognition-identifying people by their voices,” Proceedings of the IEEE, vol. 73, no. 11, pp. 1651-1664, Nov. 1985. (Cited in section 1.1.)
[6] S. Furui, Digital Speech Processing, Synthesis, and Recognition, Marcel Dekker, Inc., 2nd Ed., 2001. (Cited in sections 1.1 and 1.5.)
[7] S. Prabhakar, S. Pankanti, and A. Jain, “Biometric recognition: security and privacy concerns,” IEEE Security & Privacy Magazine, vol. 1, no. 2, pp. 33-42, Mar.-Apr. 2003. (Cited in section 1.1.)
[8] Q. Li, B.-H. Juang, and C.-H. Lee, “Automatic verbal information verification for user authentication,” IEEE Trans. on Speech and Audio Processing, vol. 8, no. 5, pp. 585-596, Sept. 2000. (Cited in section 1.1.)
[9] B. S. Atal, “Effectiveness of linear prediction of the speech wave for automatic speaker identification and verification,” J. Acoustical Society of America, vol. 55, no. 6, pp. 1304-1312, Jun. 1974. (Cited in sections 1.1 and 1.2.1.)
[10] S. B. Davis and P. Mermelstein, “Comparison of Parametric Representation for Monosyllabic Word Recognition in Continuously Spoken Sentences,” IEEE Trans. Acoustics, Speech and Signal Process., vol. ASSP-28, no. 4, pp. 357-365, Aug. 1980. (Cited in sections 1.1 and 1.2.1.)
[11] T. Ganchev, N. Fakotakis, and G. Kokkinakis, “Comparative Evaluation of Various MFCC Implementations on the Speaker Verification Task,” in Proc. of 10th International Conference on Speech and Computer, (SPECOM 2005), 2005, pp. 191-194. (Cited in section 1.1.)
[12] S. Furui, “Cepstral analysis technique for automatic speaker verification,” IEEE Trans. Acoustics, Speech and Signal Process., vol. ASSP-29, no. 2, pp. 254-272, Apr. 1981. (Cited in section 1.1.)
[13] H. Hermansky, “Perceptual linear predictive analysis of speech,” J. Acoustical Society of America, vol. 87, no. 4, pp. 1738-1752, Apr. 1990. (Cited in sections 1.1 and 1.2.1.)
[14] D. A. Reynolds and R. Rose, “Robust text-independent speaker identification using gaussian mixture speaker models,” IEEE Trans. on Speech and Audio Process., vol. 3, no. 1, pp. 72-83, Jan. 1995. (Cited in sections 1.1 and 1.2.2.)
[15] F. Soong, A. Rosenberg, L. Rabiner, and B.-H. Juang, “Vector quantization approach to speaker recognition,” in Proc. of IEEE International Conference on Acoustics, Speech, and Signal Process., (ICASSP 1985), 1985, pp. 387-390. (Cited in section 1.1.)
[16] W. M. Campbell, K. T. Assaleh, and C. C. Broun, “Speaker Recognition With Polynomial Classifiers,” IEEE Trans. Speech Audio Process., vol. 10, no. 4, pp. 205-212, May 2002. (Cited in sections 1.1 and 1.2.2.)
[17] W. M. Campbell, J. P. Campbell, D. A. Reynolds, E. Singer, and P. A. Torres-Carrasquillo, “Support vector machines for speaker and language recognition,” Computer Speech and Language, vol. 20, no. 2-3, pp. 210-229, Apr.-Jul. 2006. (Cited in sections 1.1 and 1.2.2.)
[18] K. R. Farrell, R. J. Mammone, and K. T. Assaleh, “Speaker Recognition using Neural Networks and Conventional Classifiers,” IEEE Trans. Speech and Audio Process., vol. 2, no. 1, pp. 194-205, Jan. 1994. (Cited in sections 1.1 and 1.2.2.)

1.8 References

17

[19] J. P. Conti, “Analysis - LOOK WHO’S TALKING,” Engineering & Technology, vol. 2, no. 1, pp. 24-25, Jan. 2007. (Cited in section 1.1.) [20] A.V Oppenheim, “A Speech Analysis-Synthesis System Based on Homomorphic Filtering,” Journal of the Acoustic Society of America, vol. 45, pp. 458-465, Feb. 1969. (Cited in section 1.2.1.) [21] B. S. Atal and S. L. Hanauer “Speech Analysis and Syntehsis by Linear Prediction of Speech Wave,” Journal of Acoustical Society of America, vol. 50, no. 2, pp. 637-655, Jun. 1974. (Cited in section 1.2.1.) [22] K. T. Assaleh and R. J. Mammone, “New LP-derived features for speaker identification,” IEEE Trans. Speech and Audio Processing, vol. 2, no. 4, pp. 630-638, Oct. 1994. (Cited in section 1.2.1.) [23] T. Kinnunen, V. Hautamaki, and P. Franti, “Fusion of spectral feature sets for accurate speaker identification,” in Proc. of the International Conference on Speech and Computer (SPECOM 2004), 2004, pp. 361-365. (Cited in section 1.2.1.) [24] D. J. Mashao and M. Skosan, “Combining Classifier Decisions for Robust Speaker Identification,” Pattern Recog., vol. 39, no. 1, pp. 147-155, Jan. 2006. (Cited in section 1.2.1.) [25] H. Jialong, L. Liu, and G. palm. G., “A discriminative training algorithm for VQ-based speaker identification,” IEEE Trans. Speech and Audio Processing, vol. 7, no. 3, pp. 353-356, May 1999. (Cited in section 1.2.2.) [26] Q. Y. Hong and S. Kwong, “A discriminative training approach for text-independent speaker recognition,” Signal Porcessing, vol. 85, no. 7, pp. 1449-1463, Jul. 2005. (Cited in section 1.2.2.) [27] T. D. Gancheva, D. K. Tasoulisb, M. N. Vrahatisb, and N. D. Fakotakisa, “Generalized locally recurrent probabilistic neural networks with application to text-independent speaker verification,” Neurocomputing, vol. 70, no. 7-9, pp. 1424-1438, Mar. 2007. (Cited in section 1.2.2.) [28] T. Matusi and S. 
Furui, “Comparison of text-independent speaker recognition methods using VQdistortion and discrete/ continuous HMMs,” in Proc. of IEEE International Conference on Acoustics, Speech, and Signal Process., (ICASSP 1992), 1992, pp. II-157-II-160. (Cited in section 1.2.2.) [29] V. Van and S. Renals, “Speaker Verification using Sequence Discriminant Support Vector Machines,” IEEE Trans. on Speech and Audio Processing, vol. 13, no. 2, pp. 203-210, Mar. 2005. (Cited in section 1.2.2.) [30] D. A. van Leeuwen, A. F. Martin, M. A. Przybocki, and J. S. Bouten, “NIST and NFI-TNO evaluations of automatic speaker recognition,” Computer, Speech and Language, vol. 20, no. 2-3, pp. 128-158, Apr.-Jul. 2006. (Not cited.) [31] T. Kinnunen, E. Karpov, and P. Fr¨ anti, “Real-Time Speaker Identification and Verification,” IEEE Trans. Speech and Audio Process., vol 14, no. 1, pp. 277-288. Jan. 2006. (Cited in sections 1.4 and 1.5.) [32] Y. Hua and W. Liu, “Generalized Karhunen Lo`eve Transform,” IEEE Signal Process. Lett., vol. 5, no. 6, pp. 141-142, Jun. 1998. (Cited in section 1.4.) [33] C.-C. T Chen, C.-T. Chen, and C.-K. Hou, “Speaker identification using hybrid Karhunen-Lo`eve transform and Gaussian mixture model approach,”Pattern Recog., vol. 37, no. 5 , pp. 1073-1075, May 2004. (Cited in section 1.4.)

18

Introduction

[34] B. L. Pellom and J. H. L. Hansen, “An efficient scoring algorithm for gaussian mixture model based speaker identification,” IEEE Signal Process. Lett., vol. 5, no. 11, pp. 281-284, Nov. 1998. (Cited in section 1.4.) [35] J.-D. Wu and S.-H. Ye, “Driver identification based on voice signal using continuous wavelet transform and artificial neural network techniques,” Expert Systems with Applications, 2007, to be published. (Cited in section 1.5.) [36] A. G. Adami and D. A. C. Barone, “A speaker identification system using a model of artificial neaural networks for an elevator application,” Information Sciences, vol. 138, no. 1-4, pp. 1-5, Oct. 2001. (Cited in section 1.5.) [37] R. Kuhn, J.-C. Junqua, P. Nguyen, and N. Niedzielski, “Rapid speaker adaptation in eigenvoice space,” IEEE Trans. Speech Audio Process., vol. 8, no. 6, pp. 695-707, Nov. 2000. (Cited in section 1.5.) [38] X. He and Y. Zhao, “Fast model selection based speaker adaptation for nonnative speech,” IEEE Trans. Speech and Audio Process., vol. 11, no. 4, pp. 298-307, Jul. 2003. (Cited in section 1.5.) [39] W. Jia and W.-Y. Chan, “An experimental assessment of personal speech coding,” Speech Communication, vol. 30, no. 1, pp. 1-8, Jan. 2000. (Cited in section 1.5.) [40] A. Glaeser and F. Bimbot, “Steps toward the integration of speaker recognition in real-world telecom applications,” in Proc. International Conf. on Spoken Language Processing (ICSLP 1998), 1998. (Cited in section 1.5.) [41] T. Kinnunen, E. Karpov, and P. Fr¨ anti, “A speaker pruning algorithm for real-time speaker identification,” in Proc. Audio- and Video-Based Biometric Authentication (AVBPA 2003), 2003, pp. 639-646. (Cited in section 1.5.) [42] E. Karpov, ”Real-time speaker identification,” M. Sc. thesis, University of Joensuu, Joensuu, Finland, Jan. 2003. (Cited in section 1.5.) [43] J. P. Campbell, Jr., “Testing with the YOHO CDROM voice verification corpus,” in Proc. 
International Conference on Acoustic, Speech, and Signal Process., (ICASSP 1995), 1995, pp. 341-344. (Cited in section 1.7.1.) [44] LDC Catalog for YOHO verification corpus. [online]. Available: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC94S16

(Cited in sec-

tion 1.7.1.) [45] J. Hennebert, H. Melin, D. Petrovska, and D. Genoud, “POLYCOST: A telephone-speech database for speaker recognition,” Speech Communication, vol. 31, no. 2-3, pp. 265-270, Jun 2000. (Cited in section 1.7.2.) [46] T. Nordstr¨ om, H. Melin, and J. Lindberg, “A comparative study of speaker verification systems using the polycost database,” in Proc. of 5th International Conference on Spoken Language Processing, (ICSLP98), 1998, pp. 1359-1362. (Cited in section 1.7.2.) [47] H. Melin and J. Lindberg, “Guidelines for experiments on the polycost database,” in Proc. of a COST 250 workshop on Application of Speaker Recognition Techniques in Telephony, 1996, pp. 59-69. (Cited in section 1.7.2.) 3

CHAPTER 2

Speaker Identification Based on High-Frequency Cues

Preface

This chapter investigates the speaker-specific cues that exist in the higher frequency region. A reversed filter bank is introduced to capture this information. The work also proposes a PQ based fusion strategy that combines speaker models developed from MFCC with evidence from the higher frequency zone. Results are presented on two different kinds of databases, each comprising more than 130 speakers.


2.1 Introduction

Over the years, MFCC, modeled on the human auditory system [1], have been used as a standard acoustic feature set for speech related applications. Since the late 1990s, MFCC have successfully replaced the LPCC feature set [2], which performs poorly in noisy conditions. The main advantage of LPCC was its low computational cost, but with increasing computing power and the use of the Fast Fourier Transform (FFT) [3], the better-performing MFCC have come to dominate the front-end designs of speech technology systems. Competition from other auditory based feature sets [4] has not been successful, mainly due to their lower performance and very high computational cost.

The MFCC technique was first proposed for speech recognition [5], [6], to identify monosyllabic words in continuously spoken sentences, and not for SI [7], [8]. Moreover, the calculation of MFCC is based on the human auditory system, aiming at an artificial implementation of the ear physiology [1], on the assumption that the human ear is a good speaker recognizer too. However, no conclusive evidence exists to support the view that the ear is necessarily the best speaker recognizer. In addition, some studies [2], [9], [10], [11] suggest that MFCC model the spectral envelope, or the formant structure [12], of the vocal tract of a speaker by smoothing the energy spectrum through filtering while reducing the effect of pitch [13], [14].

Computation of MFCC involves averaging the low frequency region of the energy spectrum (approximately demarcated by an upper limit of 1 kHz) with closely spaced, overlapping triangular filters, while a smaller number of less closely spaced filters of similar shape average the high frequency zone. Thus MFCC represent the low frequency region more accurately than the high frequency region, and hence capture the formants [15] that lie in the low frequency range and characterize the vocal tract resonances [15]. However, other formants [15] can also lie above 1 kHz, and these are not effectively captured by the larger spacing of filters in the higher frequency range. In a very recent study [16], Lu et al. have shown that speaker-specific information is encoded non-uniformly in different frequency bands of speech sound. Lu et al. clearly state (p. 4): “In MFCC feature representation, the Mel frequency scale is used to get a high resolution in low frequency region, and a low resolution in high frequency region. This kind of processing is good for obtaining stable phonetic information, but not suitable for speaker features that are located in high frequency regions.”


All these facts suggest that any SI system based on MFCC can possibly be improved. In this chapter, we propose to invert the entire filter bank structure [17], [18] such that the higher frequency range is averaged by more densely spaced filters and a smaller number of widely spaced filters are used in the lower frequency range. We calculate a new feature set named Inverted Mel-frequency Cepstral Coefficients (IMFCC) [19] following the same procedure as for normal MFCC but using this reversed filter bank structure. This effectively captures the high frequency formants ignored by the original MFCC.

The importance of MFCC in SI should not be underestimated, however. In order to exploit the best of both paradigms, we build two separate classifiers for every speaker, one from each of the two feature sets (MFCC and IMFCC), and fuse their model-level scores to obtain the final classification decision. To fuse these scores, we adopt a PQ [20], [21] based pruning technique. PQ is a method by which only some of the test feature vectors are selected, from the entire set initially presented for an identification test, and used for scoring. According to the literature [21], [22], because nearby speech frames are highly correlated, pruning redundant test vectors causes minor or even no degradation of SI performance. Among the many PQ techniques, we choose fixed decimation [20], which selects 1 out of every P time-domain frames or feature vectors, where P is defined as the PQ rate. The objective of using the PQ based fusion scheme is to keep the computation almost the same as that of the original MFCC based system.
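Fixed decimation can be illustrated with a minimal sketch. The function name `decimate_frames` and the toy vectors are hypothetical, and keeping the first vector of each group of P is an assumption: the text only states that 1 out of P vectors is selected.

```python
from typing import List, Sequence


def decimate_frames(frames: Sequence[Sequence[float]], pq_rate: int) -> List[Sequence[float]]:
    """Keep 1 out of every `pq_rate` feature vectors (fixed decimation)."""
    if pq_rate < 1:
        raise ValueError("pq_rate must be a positive integer")
    # A stepped slice keeps vectors 0, P, 2P, ... and drops the rest.
    # Choosing the first vector of each group is an assumption of this sketch.
    return list(frames[::pq_rate])


# Ten dummy 2-dimensional feature vectors standing in for test frames.
test_vectors = [[float(t), 0.5 * float(t)] for t in range(10)]
kept = decimate_frames(test_vectors, pq_rate=3)
print(len(kept), "of", len(test_vectors), "vectors are scored")
```

With P = 3 only about a third of the test vectors reach the scoring stage, which is where the computational saving comes from.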

2.1.1 Organization of the Chapter

The rest of the chapter is organized as follows. The calculation of MFCC is described in section 2.2, followed by the derivation of IMFCC in section 2.3. A brief overview of the pre-processing stage and of GMM is presented in section 2.4. Section 2.5 compares the SI performances of the two feature sets. Detailed descriptions of the PQ based merging strategy, its related results and further pruning techniques are presented in sections 2.6, 2.7 and 2.8 respectively. This is followed by the conclusions in section 2.9.


2.2 Mel Frequency Cepstral Coefficients and their Calculation

According to psychophysical studies, human perception of the frequency content of sounds follows a subjectively defined nonlinear scale called the Mel Scale [23] (see Fig. 2.1). This is defined as,

$$f_{mel} = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \tag{2.1}$$

[Figure 2.1: Normal frequency vs. Mel-frequency. Horizontal axis: Normal Frequency (Hz), 0 to 4000; vertical axis: Pitch (Mels), 0 to 2500. The plot marks an approximately linear region and a non-linear region.]

where $f_{mel}$ is the subjective pitch in Mels corresponding to $f$, the actual frequency in Hz. This leads to the definition of MFCC, a baseline [24] acoustic feature set for Speech [25] and Speaker Recognition [25], [26], [27] applications, which can be calculated as follows.

Let $\{y(n)\}_{n=1}^{N}$ represent a frame of speech that is pre-emphasized [28] and Hamming-windowed [29]. First, $y(n)$ is converted to the frequency domain by an $M_s$-point Discrete Fourier Transform (DFT) [3], which leads to the energy spectrum,

$$|Y(k)|^2 = \left| \sum_{n=1}^{N} y(n) \cdot e^{-j 2 \pi n k / M_s} \right|^2 \tag{2.2}$$


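where $1 \le k \le M_s$. This is followed by the construction of a filter bank with $Q$ unity-height triangular filters, uniformly spaced in the Mel scale (eqn. 2.1). The filter response $\psi_i(k)$ of the $i$th filter in the bank (see Fig. 2.2) is defined as,

$$\psi_i(k) = \begin{cases} 0 & \text{for } k < k_{b_{i-1}} \\ \dfrac{k - k_{b_{i-1}}}{k_{b_i} - k_{b_{i-1}}} & \text{for } k_{b_{i-1}} \le k \le k_{b_i} \\ \dfrac{k_{b_{i+1}} - k}{k_{b_{i+1}} - k_{b_i}} & \text{for } k_{b_i} \le k \le k_{b_{i+1}} \\ 0 & \text{for } k > k_{b_{i+1}} \end{cases} \tag{2.3}$$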

where $1 \le i \le Q$, $Q$ is the number of filters in the bank, $\{k_{b_i}\}_{i=0}^{Q+1}$ are the boundary points of the filters and $k$ denotes the coefficient index in the $M_s$-point DFT. The filter bank boundary points $\{k_{b_i}\}_{i=0}^{Q+1}$ are equally spaced in the Mel scale, which is satisfied by the definition,

$$k_{b_i} = \left(\frac{M_s}{F_s}\right) \cdot f_{mel}^{-1}\left[ f_{mel}(f_{low}) + \frac{i\,\{f_{mel}(f_{high}) - f_{mel}(f_{low})\}}{Q+1} \right] \tag{2.4}$$

where the function $f_{mel}(\cdot)$ is defined in eqn. 2.1, $M_s$ is the number of points in the DFT (eqn. 2.2), $F_s$ is the sampling frequency, $f_{low}$ and $f_{high}$ are the low and high frequency boundaries of the filter bank and $f_{mel}^{-1}$ is the inverse of the transformation in eqn. 2.1, defined as,

$$f = f_{mel}^{-1}(f_{mel}) = 700 \cdot \left[ 10^{f_{mel}/2595} - 1 \right] \tag{2.5}$$

The sampling frequency $F_s$ and the frequencies $f_{low}$ and $f_{high}$ are in Hz while $f_{mel}$ is in Mels. For both the databases considered in this work, $F_s$ is 8 kHz. $M_s$ was taken as 256, $f_{low} = F_s / M_s = 31.25$ Hz while $f_{high} = F_s / 2 = 4$ kHz.
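As a rough numerical check of eqns. 2.1, 2.4 and 2.5, the sketch below computes the filter boundary frequencies in Hz for the stated parameters. The function names are illustrative choices, and the boundaries are left in Hz (that is, before the scaling by Ms/Fs in eqn. 2.4) so that they can be compared with Table 2.1.

```python
import math
from typing import List


def hz_to_mel(f_hz: float) -> float:
    """Eqn. 2.1: subjective pitch in Mels for a frequency in Hz."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(f_mel: float) -> float:
    """Eqn. 2.5: inverse of eqn. 2.1."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)


def mel_boundaries_hz(q: int, f_low: float, f_high: float) -> List[float]:
    """Q + 2 boundary frequencies equally spaced on the Mel scale.

    This is the bracketed term of eqn. 2.4 mapped back to Hz,
    without the final scaling by Ms/Fs.
    """
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (q + 1)
    return [mel_to_hz(lo + i * step) for i in range(q + 2)]


FS = 8000.0          # sampling frequency in Hz
MS = 256             # DFT size
F_LOW = FS / MS      # 31.25 Hz
F_HIGH = FS / 2.0    # 4 kHz
bounds = mel_boundaries_hz(20, F_LOW, F_HIGH)
print([round(b, 2) for b in bounds[:4]])
```

The first few values should agree with the MFCC column of Table 2.1 up to rounding.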

Next, this filter bank (see Fig. 2.3) is imposed on the spectrum calculated in eqn. 2.2. The outputs $\{e(i)\}_{i=1}^{Q}$ of the Mel-scaled band-pass filters can be calculated by a weighted summation between the respective filter response $\psi_i(k)$ and the energy spectrum $|Y(k)|^2$ as

$$e(i) = \sum_{k=1}^{M_s/2} |Y(k)|^2 \cdot \psi_i(k) \tag{2.6}$$
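The triangular response of eqn. 2.3 and the weighted sum of eqn. 2.6 can be sketched as below. The toy boundary points and the flat spectrum are made up for illustration only; they are not the values used in this work.

```python
from typing import List, Sequence


def triangular_response(k: float, kb_prev: float, kb_mid: float, kb_next: float) -> float:
    """Eqn. 2.3: unity-height triangle over [kb_prev, kb_next] peaking at kb_mid."""
    if k < kb_prev or k > kb_next:
        return 0.0
    if k <= kb_mid:
        return (k - kb_prev) / (kb_mid - kb_prev)
    return (kb_next - k) / (kb_next - kb_mid)


def filter_bank_energies(energy_spectrum: Sequence[float], kb: Sequence[float]) -> List[float]:
    """Eqn. 2.6: e(i) = sum over k of |Y(k)|^2 * psi_i(k), for i = 1..Q.

    energy_spectrum[k - 1] holds |Y(k)|^2 for k = 1, 2, ...;
    kb holds the Q + 2 boundary points kb_0 .. kb_{Q+1}.
    """
    q = len(kb) - 2
    energies = []
    for i in range(1, q + 1):
        total = 0.0
        for k in range(1, len(energy_spectrum) + 1):
            weight = triangular_response(k, kb[i - 1], kb[i], kb[i + 1])
            total += energy_spectrum[k - 1] * weight
        energies.append(total)
    return energies


# Toy example: Q = 2 filters on a flat spectrum of eight bins.
toy_kb = [1.0, 3.0, 5.0, 7.0]
toy_spectrum = [1.0] * 8
toy_energies = filter_bank_energies(toy_spectrum, toy_kb)
print(toy_energies)
```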

Finally, the Discrete Cosine Transform (DCT) [30] is taken on the log filter bank energies $\{\log[e(i)]\}_{i=1}^{Q}$ and the final MFCC coefficients $C_m$ can be written as,

$$C_m = \sqrt{\frac{2}{Q}} \sum_{l=0}^{Q-1} \log[e(l+1)] \cdot \cos\left[ m \cdot \left(\frac{2l - 1}{2}\right) \cdot \frac{\pi}{Q} \right] \tag{2.7}$$

where $0 \le m \le R_{cep} - 1$, and $R_{cep}$ is the desired number of cepstral features. Typically, $Q = 20$ and 10 to 30 cepstral coefficients are taken for speech processing applications.

[Figure 2.2: Boundary points of one filter placed in normal frequency scale. A unity-height triangle plotted as Amplitude against DFT coefficient index, with boundary points $k_{b_{i-1}}$, $k_{b_i}$ and $k_{b_{i+1}}$.]

Here we took $Q = 20$, $R_{cep} = 20$ and used the last 19 coefficients to model the individual speakers. Note that the first coefficient $C_0$ is discarded because it contains only a d.c. term that signifies spectral energy.

[Figure 2.3: Filter bank structures for MFCC process. Panel title: Filter Bank structure of canonical MFCC. Horizontal axis: Frequency (Hz), 0 to 4000; vertical axis: Relative Amplitude, 0 to 1.]
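The final DCT step can be sketched as follows. The code transcribes eqn. 2.7 literally, including the (2l - 1)/2 term as printed; the base-10 logarithm is an assumption taken from the log10 blocks of Fig. 2.7, and the constant toy energies are purely illustrative.

```python
import math
from typing import List, Sequence


def cepstral_coefficients(energies: Sequence[float], r_cep: int) -> List[float]:
    """Eqn. 2.7, transcribed as printed: C_m for m = 0 .. r_cep - 1."""
    q = len(energies)
    scale = math.sqrt(2.0 / q)
    # Base-10 logarithm is an assumption (cf. the log10 blocks in Fig. 2.7).
    log_e = [math.log10(e) for e in energies]
    coeffs = []
    for m in range(r_cep):
        acc = 0.0
        for l in range(q):
            # log_e[l] corresponds to log[e(l + 1)] in the 1-based notation.
            acc += log_e[l] * math.cos(m * ((2 * l - 1) / 2.0) * math.pi / q)
        coeffs.append(scale * acc)
    return coeffs


# Toy input: Q = 20 identical filter-bank energies, Rcep = 20 as in the text.
c = cepstral_coefficients([100.0] * 20, r_cep=20)
features = c[1:]  # discard C_0 and keep the remaining 19 coefficients
print(len(features))
```

For m = 0 the cosine term equals 1, so C_0 reduces to a scaled sum of the log energies, which is why it is treated as an energy term and dropped.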

2.3 The Inverted Mel Frequency Cepstral Coefficients

Although MFCC present a way to convert a physically measured spectrum of speech into a perceptually meaningful subjective spectrum based on the human auditory system [1], it is not certain that the human ear, and hence MFCC, is optimized for SI. Here we propose a new scale, the Inverted Mel Scale (see Fig. 2.4, dotted line), defined by a competing filter bank structure that is indicative of a hypothetical auditory system which has followed a path of evolution diametrically opposite to that of the human auditory system. The idea is to capture the information that could otherwise be missed by the original MFCC. By information we mean the higher-order formant structures that are not finely captured by the filters used in MFCC. It has been reported in the literature [31] that speaker-specific information also exists beyond the third formant [15], which generally lies above 2 kHz. The speaker-specific high frequency information [32] is generally conveyed by fricatives [15].

[Figure 2.4: Subjective Pitch vs. Frequency. Horizontal axis: Normal Frequency (Hz), 0 to 4000; vertical axis: Pitch (Mels), 0 to 2500; curves: Mel scale and Inverted Mel scale. For the Mel scale, corresponding to the human auditory system, pitch increases progressively less rapidly as the frequency increases. In direct contrast, it increases progressively more rapidly in the proposed Inverted Mel Scale.]

We obtain the new filter bank structure simply by flipping the original filter bank around the point $f = 2$ kHz, which is precisely the mid-point of the frequency range considered for SI applications, i.e. 0 to 4 kHz (sec. 2.2). This flip-over is expressed mathematically as,

$$\hat{\psi}_i(k) = \psi_{Q+1-i}\left( \frac{M_s}{2} + 1 - k \right) \tag{2.8}$$

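where $\hat{\psi}_i(k)$ is the Inverted Mel Scale filter response while $\psi_i(k)$ is the response of the original MFCC filter bank, $1 \le i \le Q$ and $Q$ is the number of filters in the bank. Analogous to eqn. 2.3 for the original MFCC filter bank, we can derive an expression for $\{\hat{\psi}_i(k)\}_{i=1}^{Q}$ from eqn. 2.8 as follows,

$$\hat{\psi}_i(k) = \begin{cases} 0 & \text{for } k < \hat{k}_{b_{i-1}} \\ \dfrac{k - \hat{k}_{b_{i-1}}}{\hat{k}_{b_i} - \hat{k}_{b_{i-1}}} & \text{for } \hat{k}_{b_{i-1}} \le k \le \hat{k}_{b_i} \\ \dfrac{\hat{k}_{b_{i+1}} - k}{\hat{k}_{b_{i+1}} - \hat{k}_{b_i}} & \text{for } \hat{k}_{b_i} \le k \le \hat{k}_{b_{i+1}} \\ 0 & \text{for } k > \hat{k}_{b_{i+1}} \end{cases} \tag{2.9}$$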

where $1 \le k \le M_s$ and $\{\hat{k}_{b_i}\}_{i=0}^{Q+1}$, the boundary points of the $Q$ filters, are defined as,

$$\hat{k}_{b_i} = \frac{M_s}{2} + 1 - k_{b_{Q+1-i}} \tag{2.10}$$

Table 2.1 shows two sets of 22 boundary points for constructing 20 filters for the MFCC and IMFCC feature sets. Now, we can frame an equation analogous to eqn. 2.4, linking $\{\hat{k}_{b_i}\}_{i=0}^{Q+1}$ to $i$, $f_{low}$ and $f_{high}$ as,

$$\hat{k}_{b_i} = \left(\frac{M_s}{F_s}\right) \cdot \hat{f}_{mel}^{-1}\left[ \hat{f}_{mel}(f_{low}) + \frac{i\,\{\hat{f}_{mel}(f_{high}) - \hat{f}_{mel}(f_{low})\}}{Q+1} \right] \tag{2.11}$$

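Here, $\hat{f}_{mel}(f)$ is the subjective pitch in the proposed Inverted Mel Scale corresponding to $f$, the actual frequency in Hz. From eqns. 2.4, 2.10 and 2.11, it follows that,

$$\left(\frac{M_s}{F_s}\right) \cdot \hat{f}_{mel}^{-1}\left[ \hat{f}_{mel}(f_{low}) + \frac{i\,\{\hat{f}_{mel}(f_{high}) - \hat{f}_{mel}(f_{low})\}}{Q+1} \right] = \frac{M_s}{2} + 1 - \left(\frac{M_s}{F_s}\right) \cdot f_{mel}^{-1}\left[ f_{mel}(f_{low}) + \frac{(Q+1-i)\,\{f_{mel}(f_{high}) - f_{mel}(f_{low})\}}{Q+1} \right] \tag{2.12}$$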


Table 2.1: Boundary points $f_{b_i}$ and $\hat{f}_{b_i}$ in Hz for the MFCC and IMFCC filter banks (with $Q = 20$, $f_{low} = 31.25$ Hz and $f_{high} = 4$ kHz).

  i    f_bi (Hz)    f^_bi (Hz)   |    i    f_bi (Hz)    f^_bi (Hz)
  0      31.25        31.25      |   11    1237.90      2957.70
  1      98.99       429.75      |   12    1417.40      3108.10
  2     173.01       794.46      |   13    1613.50      3245.70
  3     253.89      1128.20      |   14    1827.90      3371.70
  4     342.26      1433.70      |   15    2062.10      3486.90
  5     438.82      1713.30      |   16    2317.90      3592.40
  6     544.32      1969.20      |   17    2597.50      3689.00
  7     659.60      2203.40      |   18    2903.00      3777.40
  8     785.55      2417.70      |   19    3236.80      3858.20
  9     923.17      2613.90      |   20    3601.50      3932.30
 10    1073.50      2793.40      |   21    4000.00      4000.00

(Here f_bi denotes $f_{b_i}$, the MFCC boundary point, and f^_bi denotes $\hat{f}_{b_i}$, the IMFCC boundary point.)
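The flip of eqn. 2.10 can be checked against Table 2.1 with a short sketch. It assumes that the tabulated frequencies are the boundary indices multiplied by Fs/Ms, so that eqn. 2.10 becomes a reflection about Fs/2 + Fs/Ms in Hz; the variable names are illustrative.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Eqn. 2.1."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(f_mel: float) -> float:
    """Eqn. 2.5."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)


FS, MS, Q = 8000.0, 256, 20
F_LOW, F_HIGH = FS / MS, FS / 2.0

# MFCC boundary frequencies in Hz, equally spaced on the Mel scale (eqn. 2.4).
lo, hi = hz_to_mel(F_LOW), hz_to_mel(F_HIGH)
f_b = [mel_to_hz(lo + i * (hi - lo) / (Q + 1)) for i in range(Q + 2)]

# IMFCC boundary frequencies: eqn. 2.10 multiplied through by Fs/Ms,
# i.e. f^_b(i) = Fs/2 + Fs/Ms - f_b(Q + 1 - i).
f_b_hat = [FS / 2.0 + FS / MS - f_b[Q + 1 - i] for i in range(Q + 2)]

for i in (0, 1, 20, 21):
    print(i, round(f_b[i], 2), round(f_b_hat[i], 2))
```

The inverted bank has its widest filters at the low end and its narrowest at the high end, the mirror image of the MFCC bank.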

To maintain mathematical uniformity in the calculation of the DFT, we chose the new Inverted Mel Scale to share common boundary points with the actual Mel Scale, i.e., $\hat{f}_{mel}(f_{low}) = f_{mel}(f_{low})$ and $\hat{f}_{mel}(f_{high}) = f_{mel}(f_{high})$. Using this choice, we rewrite eqn. 2.12 as,

$$\hat{f}_{mel}^{-1}\left[ f_{mel}(f_{low}) + \frac{i\,\{f_{mel}(f_{high}) - f_{mel}(f_{low})\}}{Q+1} \right] = \frac{F_s}{2} + \frac{F_s}{M_s} - f_{mel}^{-1}\left[ f_{mel}(f_{high}) - \frac{i\,\{f_{mel}(f_{high}) - f_{mel}(f_{low})\}}{Q+1} \right] \tag{2.13}$$

By suitably choosing the integers $Q$ and $i$, we can represent any frequency $f$ in the linear (Hertz) scale as,

$$f = \hat{f}_{mel}^{-1}\left[ f_{mel}(f_{low}) + \frac{i\,\{f_{mel}(f_{high}) - f_{mel}(f_{low})\}}{Q+1} \right] \tag{2.14}$$

From eqn. 2.13, it follows that,

$$f = \frac{F_s}{2} + \frac{F_s}{M_s} - f_{mel}^{-1}\left[ f_{mel}(f_{high}) + f_{mel}(f_{low}) - \hat{f}_{mel}(f) \right] \tag{2.15}$$


Finally, we obtain the equation,

$$\hat{f}_{mel}(f) = f_{mel}(f_{high}) + f_{mel}(f_{low}) - f_{mel}\left( \frac{F_s}{2} + \frac{F_s}{M_s} - f \right) \tag{2.16}$$

which relates the proposed Inverted Mel Scale to the original Mel Scale. For the current application, we have set (sec. 2.2) $F_s = 8$ kHz, $M_s = 256$, $f_{low} = F_s / M_s = 31.25$ Hz and $f_{high} = F_s / 2 = 4$ kHz. Hence, using these values in eqn. 2.16, we define the proposed Inverted Mel Scale as,

$$\hat{f}_{mel}(f) = 2195.2860 - 2595 \log_{10}\left( 1 + \frac{4031.25 - f}{700} \right) \tag{2.17}$$

where $\hat{f}_{mel}$ is the subjective pitch in the new scale corresponding to $f$, the actual frequency in Hz. Note that relation 2.17 changes with the sampling frequency ($F_s$) and the usable bandwidth [33], [34], [35] of a corpus, whose lower and upper limits are specified by $f_{low}$ and $f_{high}$ respectively. We observe that in this scale, pitch increases more and more rapidly (see Fig. 2.4) as the frequency increases. As intended, this is in direct contrast to the human auditory system (eqn. 2.1), where it increases less rapidly with rising frequency. Hence, the higher frequency zone, coarsely approximated (see Fig. 2.5, solid line) by normal MFCC, can be represented more finely (see Fig. 2.5, dotted line) by this new scale, which can capture the speaker-specific formant information present in this zone that could have been neglected by the original MFCC. These facts justify our choice of flipping the MFCC filter bank (see Fig. 2.6) to obtain the new IMFCC feature set.

We find the filter outputs $\{\hat{e}(i)\}_{i=1}^{Q}$ in the same way as for MFCC, from the same energy spectrum $|Y(k)|^2$, as,

$$\hat{e}(i) = \sum_{k=1}^{M_s/2} |Y(k)|^2 \cdot \hat{\psi}_i(k) \tag{2.18}$$

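Finally, DCT is taken on the log filter bank energies $\{\log[\hat{e}(i)]\}_{i=1}^{Q}$ and the final Inverted MFCC coefficients $\{\hat{C}_m\}_{m=1}^{R_{cep}}$ can be written as,

$$\hat{C}_m = \sqrt{\frac{2}{Q}} \sum_{l=0}^{Q-1} \log[\hat{e}(l+1)] \cdot \cos\left[ m \cdot \left(\frac{2l - 1}{2}\right) \cdot \frac{\pi}{Q} \right] \tag{2.19}$$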
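The Inverted Mel Scale of eqns. 2.16 and 2.17 can be evaluated with the small sketch below. The function names are illustrative; the default parameters are the values stated in sec. 2.2.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Eqn. 2.1: the ordinary Mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def hz_to_inverted_mel(f_hz: float, fs: float = 8000.0, ms: int = 256) -> float:
    """Eqn. 2.16 with f_low = Fs/Ms and f_high = Fs/2."""
    f_low, f_high = fs / ms, fs / 2.0
    return hz_to_mel(f_high) + hz_to_mel(f_low) - hz_to_mel(fs / 2.0 + fs / ms - f_hz)


# Constant term of eqn. 2.17 for Fs = 8 kHz and Ms = 256.
constant = hz_to_mel(4000.0) + hz_to_mel(31.25)
print(round(constant, 3))

# Pitch rises slowly at low frequencies and quickly at high frequencies.
low_rise = hz_to_inverted_mel(531.25) - hz_to_inverted_mel(31.25)
high_rise = hz_to_inverted_mel(4000.0) - hz_to_inverted_mel(3500.0)
print(round(low_rise, 1), round(high_rise, 1))
```

The two scales coincide at the band edges by construction, which is the common-boundary choice made before eqn. 2.13.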

As with MFCC, we took $Q = 20$, $R_{cep} = 20$ and used the last 19 coefficients to model the individual speakers. Figure 2.7 shows the MFCC and IMFCC feature extraction processes. The computational loads of MFCC and IMFCC are the same, since the same number of functional blocks is used for both feature sets, and neither depends on the other in yielding its cepstral parameters. In fact, IMFCC can be considered an add-on module that only needs the MFCC filter bank to be flipped. Note that MFCC have been widely used in SR applications over the last decade, and therefore the analysis of IMFCC could also be interesting in the context of SR and other speech related applications.

[Figure 2.5: Log energy spectrum estimation using MFCC and IMFCC filter bank outputs for a speech frame from YOHO database. Plot title: Spectrum Estimation by MFCC and IMFCC Filter Banks. Horizontal axis: Frequency (Hz), 0 to 4000; vertical axis: Log Spectral Magnitude; curves: Log Energy Spectrum, MFCC Approximation, IMFCC Approximation.]

[Figure 2.6: Filter bank structures for MFCC and IMFCC feature extraction process. Two panels, titled Filter Bank structure of canonical MFCC and Filter Bank structure of Inverted MFCC. Horizontal axis: Frequency (Hz), 0 to 4000; vertical axis: Relative Amplitude, 0 to 1.]

[Figure 2.7: MFCC and IMFCC feature extraction process. Block diagram: the speech signal passes through pre-processing and the FFT followed by $|\cdot|^2$; the result feeds two parallel branches, the MFCC filter bank and the IMFCC filter bank, each followed by $\log_{10}(\cdot)$ and DCT, yielding MFCC ($C_m$) and IMFCC ($\hat{C}_m$) respectively.]

Besides the capability of representing the high frequency formant characteristics, IMFCC provide information complementary to that provided by MFCC. This complementary information is effective in the context of classifier fusion [36], where the errors of the classifiers under combination must be mutually uncorrelated [9], [10], [24], [36] in order to obtain better performance than a single classifier based system. The errors can be mutually uncorrelated if the feature sets have diversity. Feature diversity refers to extracting different sets of features from the same speech signal(s). These different feature sets are then used to build and test separate models. The hope is that each feature set will capture some aspect of the speech signal that may be missed by the other feature set. In particular, if this attribute results in uncorrelated sets of errors for each feature set, then a performance improvement can be achieved. Consider the scatter plot shown in Figure 2.8, which shows raw log-likelihood scores obtained from the GMM for MFCC and the GMM for IMFCC on the x-axis and the y-axis, respectively, for a true speaker and 15 other anti-speakers from the YOHO database. If only MFCC is used, there would be errors. However, when considering the two-dimensional problem, the classes are now separable.

[Figure 2.8: Scatter plot for feature diversity. Note: each point in the plot represents an utterance. Plot title: Scatter Plot for GMM-MFCC/GMM-IMFCC scores. Horizontal axis: Log Likelihood scores of GMM using MFCC; vertical axis: Log Likelihood scores of GMM for IMFCC; markers: True Speaker and Anti-Speakers.]

Different levels of fusion strategies will be described in chapter 5, where the fused system developed from the MFCC and IMFCC feature sets performs better than the system developed from MFCC alone. Note that this complementary information is missing in a linearly spaced filter bank structure (whose cepstral parameters are known as Linear Frequency Cepstral Coefficients (LFCC) [5]), because the inversion of a linearly spaced filter bank is again linearly spaced.

2.4 Brief Overviews of Various other Stages in the System

This section reviews briefly the pre-processing and modeling stages that have been adopted in this work. Note that the specifications in the two stages will remain the same for all the works presented in the thesis. The stages are described next.


2.4.1 Pre-processing stage

Each incoming speech signal is first passed through a silence removal module [37], [38] that discards the non-voiced portion of the speech based on an energy threshold criterion. The non-voiced portion of the speech includes pauses between two consecutive words, which contain ambient room noise. The silence removal block also determines the end-points of the whole signal by deleting the silence parts that usually come before as well as after the recording of actual speech. The silence removal block serves two purposes: 1) it gives the voiced parts of the speech, in which speaker-specific information is mostly contained; and 2) it reduces the number of frames in order to facilitate faster running of the algorithms at the time of training and testing. After silence removal, the voiced speech signal is pre-emphasized with a pre-emphasis factor of 0.97, which is given by the following relation,

$$y(n) = x(n) - 0.97 \cdot x(n-1) \tag{2.20}$$

where $x(n)$ is the speech signal output by the silence removal block. This is followed by frame blocking with a 20 ms frame length, i.e. $N = 160$ samples/frame (ref. sec. 2.2), and 50% overlap between consecutive frames. Finally, each frame is multiplied by a Hamming window to reduce the side effects [29], [39]. The pre-processing stage is shown pictorially in figure 2.9.
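The pre-emphasis, frame blocking and windowing steps can be sketched as below. Silence removal is omitted, the test tone is a made-up signal, and two details are assumptions of this sketch rather than statements of the text: x(-1) is taken as zero, and the standard Hamming definition w(n) = 0.54 - 0.46 cos(2πn/(N-1)) is used.

```python
import math
from typing import List, Sequence


def pre_emphasize(x: Sequence[float], alpha: float = 0.97) -> List[float]:
    """Eqn. 2.20: y(n) = x(n) - alpha * x(n - 1), with x(-1) assumed to be 0."""
    return [x[n] - alpha * (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]


def hamming(n_len: int) -> List[float]:
    """Standard Hamming window of length n_len (an assumed definition)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_len - 1)) for n in range(n_len)]


def frame_signal(y: Sequence[float], frame_len: int = 160, overlap: float = 0.5) -> List[List[float]]:
    """Split y into overlapping frames and apply the Hamming window to each."""
    hop = int(frame_len * (1.0 - overlap))  # 80 samples for 50% overlap
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(y) - frame_len + 1, hop):
        frames.append([y[start + n] * window[n] for n in range(frame_len)])
    return frames


# One second of a dummy 440 Hz tone sampled at 8 kHz.
signal = [math.sin(2.0 * math.pi * 440.0 * n / 8000.0) for n in range(8000)]
frames = frame_signal(pre_emphasize(signal))
print(len(frames), len(frames[0]))
```

Trailing samples that do not fill a complete frame are simply dropped here; padding them would be an equally reasonable choice.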

2.4.2 Gaussian Mixture Models

Speaker Recognition involves state-of-the-art GMM [40], [41] for generalized representation of acoustic vectors irrespective of their extraction process. A GMM can be viewed as a non-parametric, multivariate probability distribution model [42] that is capable of modeling arbitrary distributions and is currently one of the principal methods of modeling speakers for SI systems. The GMM of the distribution of feature vectors for speaker $s$ is a weighted linear combination of $M$ uni-modal Gaussian densities [42] $b_i^s(x)$, each parameterized by a mean vector $\mu_i^s$ and a diagonal covariance matrix $\Sigma_i^s$. These parameters, which collectively constitute the speaker model, are represented by the notation $\lambda_s = \{p_i^s, \mu_i^s, \Sigma_i^s\}_{i=1}^{M}$. The $p_i^s$ are the mixture weights satisfying the stochastic constraint $\sum_{i=1}^{M} p_i^s = 1$.

For a feature vector $x$ the mixture density (see Fig. 2.10) for a speaker $s$ is computed as

$$p(x|\lambda_s) = \sum_{i=1}^{M} p_i^s \, b_i^s(x) \tag{2.21}$$

where,

$$b_i^s(x) = \frac{1}{(2\pi)^{D/2} \, |\Sigma_i^s|^{1/2}} \; e^{-\frac{1}{2} (x - \mu_i^s)^t (\Sigma_i^s)^{-1} (x - \mu_i^s)} \tag{2.22}$$

and $D$ is the dimension of the feature-space.

[Figure 2.9: Typical pre-processing stages. Block diagram: Silence Remove, Pre-emphasis, Frame-blocking, Tapering window multiplication (Smoothing window), yielding time-domain frames.]

Given a sequence of feature vectors $X = \{x_1, x_2, \ldots, x_T\}$ for an utterance with $T$ frames, the log-likelihood of a speaker model $s$ is

$$L_s(X) = \log p(X|\lambda_s) = \sum_{t=1}^{T} \log p(x_t|\lambda_s) \tag{2.23}$$

assuming the vectors to be independent for computational simplicity. For SI, the value of $L_s(X)$ is computed for all speaker models $\lambda_s$ enrolled in the system and the owner of the model that generates the highest value is returned as the identified speaker.

During training, a model is fitted to the feature vectors collected from a speaker's utterances using the Expectation and Maximization (E&M) [43] algorithm. This technique involves an iterative update of each of the parameters in $\lambda_s$, with a consequent increase in the log-likelihood at each step. Usually, within a few iterations (10 to 25), the model parameters converge to stable values. The detailed estimation of the parameters is shown in Appendix A.

[Figure 2.10: Mixture density by Gaussian mixture models with 4 modes in two dimensional space. Axes: Feature 1, Feature 2 and Joint Probability.]
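The scoring side of eqns. 2.21 to 2.23 can be sketched as follows for diagonal covariance matrices. Training by E&M is not shown. The two speaker models and the utterance are made-up toy values, and the log-sum-exp evaluation is a numerical-stability choice of this sketch, not something stated in the text.

```python
import math
from typing import Dict, List, Sequence, Tuple

# One mixture component: (weight p_i, mean vector mu_i, diagonal of Sigma_i).
Component = Tuple[float, Sequence[float], Sequence[float]]


def log_component_density(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """log b_i(x) of eqn. 2.22 for a diagonal covariance matrix."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    mahal = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + mahal)


def log_mixture_density(x: Sequence[float], model: Sequence[Component]) -> float:
    """log p(x | lambda_s) of eqn. 2.21, evaluated with log-sum-exp."""
    terms = [math.log(p) + log_component_density(x, mu, var) for p, mu, var in model]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def utterance_log_likelihood(frames: Sequence[Sequence[float]], model: Sequence[Component]) -> float:
    """L_s(X) of eqn. 2.23: sum of per-frame log densities."""
    return sum(log_mixture_density(x, model) for x in frames)


def identify(frames: Sequence[Sequence[float]], models: Dict[str, List[Component]]) -> str:
    """Return the speaker whose model gives the highest log-likelihood."""
    return max(models, key=lambda s: utterance_log_likelihood(frames, models[s]))


# Hypothetical two-speaker system with M = 2 components in D = 2 dimensions.
models = {
    "speaker_a": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [2.0, 2.0], [1.0, 1.0])],
    "speaker_b": [(0.5, [6.0, 6.0], [1.0, 1.0]), (0.5, [8.0, 8.0], [1.0, 1.0])],
}
utterance = [[0.1, -0.2], [1.9, 2.2], [0.4, 0.3]]
print(identify(utterance, models))
```

Because the covariance matrices are diagonal, the determinant and the inverse reduce to per-dimension sums, which is the computational advantage of this choice.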

Throughout the thesis, all the speaker models have been developed with GMM, since it gives better results than other modeling techniques [44], [45], [46], [47]. For GMM, the seed mean vectors for the Gaussian centers are initialized by a split Vector Quantization (VQ) [48] based clustering algorithm with the specifications shown in table 2.2. The detailed algorithm is given in Appendix A. This is followed by the E&M algorithm with 20 iterations. Note that, in all cases, diagonal covariance matrices [40], [49] were chosen because they are clearly more advantageous from the computational perspective, since the inverses of the covariance matrices have to be calculated repeatedly during the E&M iterations.

Table 2.2: Specifications of the split vector quantization based clustering algorithm for initialization of seed mean vectors for GMM.

| No. | Parameter | Value |
|---|---|---|
| 1 | Percentage for splitting | 0.01 |
| 2 | Rate of reduction of split size after each splitting | 0.75 |
| 3 | Threshold of improvement in distortion before terminating and splitting again | 0.001 |
| 4 | Rate of reduction of improvement threshold after each splitting | 0.75 |
| 5 | Minimum population of each cluster | 10% |

2.5 Comparative Study of Performances of MFCC and IMFCC Feature Sets

In this section, we compare the identification performances of the two feature sets, namely MFCC and IMFCC, on two public databases. The identification accuracy [8] is defined as the percentage of the utterances put under test that are correctly identified:

$$ \text{Percentage of Identification Accuracy (PIA)} = \frac{\text{No. of utterances correctly identified}}{\text{Total no. of utterances under test}} \times 100 \tag{2.24} $$
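Eqn. 2.24 amounts to a simple count. The short Python sketch below shows one way to compute it; the function name and the list-based inputs are assumptions made purely for illustration.

```python
def percentage_identification_accuracy(true_ids, predicted_ids):
    """PIA of Eqn. 2.24: share of test utterances whose speaker was identified correctly."""
    if len(true_ids) != len(predicted_ids) or len(true_ids) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(1 for t, p in zip(true_ids, predicted_ids) if t == p)
    return 100.0 * correct / len(true_ids)
```

For example, three correct decisions out of four test utterances give a PIA of 75%.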

Tables 2.3 and 2.4 show the comparative performance of the MFCC and IMFCC feature sets. For the YOHO database [26], [33], the system using IMFCC performs better than the MFCC based system (see Table 2.3) in the lower order models (M = 2, 4, 8). In the higher order models (M = 16, 32, 64), the system using the MFCC feature set outperforms the IMFCC based system. In the POLYCOST database [35], [50], the MFCC based system performs better than the IMFCC based system over all the model orders. The reason could be the inherent distortions [51] that normally exist in the high frequency region of telephone speech. Note that no channel compensation [51], [52] techniques have been adopted here, as our aim is to compare the absolute performances of the two feature sets. The results suggest that IMFCC can be regarded as a nearly equal contender vis-à-vis MFCC for SI applications. Both tables also show increasing SI accuracy with increasing model order. Normally, the SI accuracy saturates [41] at some higher model order and begins to decrease with a further increase in the number of mixtures. Model orders are taken as powers of two because the Linde, Buzo & Gray (LBG) [48] or VQ algorithm, which is based on a splitting criterion, doubles the codebook at every split and therefore produces a power-of-two number of seed mean vectors for GMM. For the YOHO and POLYCOST databases, 64 and 16 mixtures (i.e. M), respectively, were the highest model orders because some speakers do not generate enough feature vectors to create the next higher order models, i.e. with 128 and 32 mixtures respectively. Note that MFCC and IMFCC capture complementary information in the feature space (as shown earlier in section 2.3), which enables a fusion strategy involving both to give higher accuracy than either feature set individually.

Table 2.3: Comparative performance using MFCC & IMFCC feature sets in YOHO database.

| M (a) | PIA (in %) using MFCC | PIA (in %) using IMFCC |
|---|---|---|
| 2 | 74.31 | 78.04 |
| 4 | 84.86 | 86.50 |
| 8 | 90.69 | 91.99 |
| 16 | 94.20 | 94.15 |
| 32 | 95.67 | 95.22 |
| 64 | 96.79 | 95.76 |

(a) No. of mixtures used in GMM, or model order.

Table 2.4: Comparative performance using MFCC & IMFCC feature sets in POLYCOST database.

| M | PIA (in %) using MFCC | PIA (in %) using IMFCC |
|---|---|---|
| 2 | 63.93 | 55.97 |
| 4 | 72.94 | 68.04 |
| 8 | 77.85 | 76.26 |
| 16 | 77.85 | 77.06 |

2.6 Fusion of MFCC and IMFCC based Model Level Scores using Decimation type of Pre-quantization

Pre-quantization (PQ) is a vector sampling method by which some test vectors are pruned out from the complete set prior to sending them to the speaker models. The idea is to eliminate redundant vectors, which have been generated from highly correlated adjacent speech frames. More specifically, consecutive vectors taken from an utterance perform worse than the same number of widely spaced vectors because there is less acoustic variety in the consecutive vectors. There are four types of PQ [20] methods, namely, 1) random sub-sampling, 2) averaging, 3) decimation and 4) clustering based PQ. In random sub-sampling, each segment is represented by a random vector from the segment. In averaging, the representative vector is the centroid (mean vector) of the segment. In decimation, we take every $P$th vector (see Fig. 2.11, with P = 3) of the test sequence. In clustering based PQ, we partition the sequence $X$ into $\hat{M}$ clusters using the LBG [48] clustering algorithm.

Figure 2.11: Fixed decimation based PQ. (Block diagram of two alternatives: decimation of the correlated time-domain frames (1 x 160 dim.) before feature extraction, or decimation of the correlated test feature vectors (1 x 19 dim.) after feature extraction. In both cases the pre-quantization sampling rate controls the pruning and the output is a set of test feature vectors (1 x 19 dim.).)
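The first three PQ variants described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the test sequence is taken to be a NumPy array with one vector per row, a "segment" is taken to be a block of P consecutive vectors, and an incomplete trailing segment is simply dropped in the random and averaging variants. Clustering based PQ is omitted.

```python
import numpy as np

def pq_decimate(X, P):
    """Decimation: keep every P-th vector, starting with the first."""
    return X[::P]

def pq_average(X, P):
    """Averaging: represent each block of P consecutive vectors by its centroid."""
    n_seg = len(X) // P
    return X[:n_seg * P].reshape(n_seg, P, X.shape[1]).mean(axis=1)

def pq_random(X, P, rng):
    """Random sub-sampling: pick one vector at random from each block of P vectors."""
    n_seg = len(X) // P
    offsets = rng.integers(0, P, size=n_seg)       # one random offset per block
    return X[np.arange(n_seg) * P + offsets]
```

With six vectors and P = 3, each variant returns two vectors: decimation keeps the 1st and 4th, averaging returns the two block centroids, and random sub-sampling returns one randomly chosen vector per block.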

In this fusion scheme, we have chosen the decimation based PQ technique. One such example, fixed rate decimation based PQ, is shown in Fig. 2.11. Pruning at the pre-processing stage also corresponds to performing feature extraction with a smaller frame rate. Taking advantage of this, we apply the pruning to the incoming time-domain speech frames and select some of them according to the pre-defined PQ decimation rate. Moreover, the MFCC and IMFCC methods share almost the same blocks (pre-emphasis, frame blocking and windowing [29], FFT, $\log_{10}$ and DCT operators), except that each method applies its own filter bank to obtain the filter bank outputs. Thus, the use of common blocks by both feature sets and the PQ based pruning method jointly motivate the idea of fusion, which is described next. After pre-emphasis, the incoming speech signal is divided into overlapping frames, and the frames are then multiplied by a smoothing window (e.g. the Hamming window). Using a pre-defined PQ sampling rate, denoted by $P$, the usable frames are chosen according to the fixed decimation rate and sent to the FFT module for calculation of the energy spectrum (i.e. $|Y(k)|^2$, ref. Eqn. 2.2). The energy spectra of the decimated frames are then alternately routed (see Fig. 2.12) to either the MFCC or the IMFCC filter bank (see Fig. 2.6). The spectra of the odd numbered speech frames, i.e. $1, 3, 5, \ldots, \lfloor T/P \rfloor + 1$ (for an odd number $T$), are sent to the MFCC filter bank, while for the even numbered spectra, i.e. $2, 4, 6, \ldots, \lfloor T/P \rfloor$, the IMFCC filter bank is used for obtaining the filter bank outputs. Here $T$ denotes the total number of frames in an utterance.

Note that when P = 2, no time-domain frames are pruned by the proposed scheme relative to an individual MFCC or IMFCC system. However, one can increase the value of P in order to prune more frames. The blocks after the computation of the filter bank outputs are logarithmic compression and DCT (see Fig. 2.12), which are the same irrespective of the filter bank type or its generated outputs (ref. Eqns. 2.6 & 2.18). The scores generated from either the MFCC or the IMFCC stream [53], [54] are accumulated in a score accumulator until all of the vectors are exhausted. In this fusion strategy, we always allocate the first vector to MFCC, giving it the advantage of one extra frame (applicable only to an utterance containing an odd number of frames), as it has already shown better performance than IMFCC (ref. Tab. 2.3 & 2.4). The system can be compared to a communication system that uses typical Time Division Multiplexing (TDM) [55], where transmissions from multiple sources (MFCC & IMFCC filter bank outputs) occur on the same facility (common computational blocks) but not at the same time. We have also checked the performance of a fused system that uses the same speech frames (i.e. the same energy spectra) to extract both MFCC and IMFCC features. The performances shown by this system are better than those of the baseline but not as good as those of the proposed system.

Figure 2.12: Pre-Quantization based MFCC-IMFCC fusion strategy. (Block diagram: silence removal, pre-emphasis, framing and windowing; a usable frame selector driven by the pre-quantization sampling rate; |FFT|^2 giving the energy spectrum of the pre-quantized frames; an FFT output selection stage under external control that routes each spectrum to the MFCC or the IMFCC filter bank (SOP, sum of products); $\log_{10}(\cdot)$ and DCT$(\cdot)$; scoring of the resulting MFCC or IMFCC vector against the stored MFCC or IMFCC model; and a score accumulator that gives the final score for a speaker.)

Both MFCC and IMFCC use $T/2$ feature vectors, so the total complexity of the system does not increase in comparison to a single stream MFCC or IMFCC system. However, one has to store both models off-line, which results in additional memory requirements for storage. P must be taken as a multiple of 2, as two separate streams are involved. In general, if there are more than two streams, then P must be chosen as a multiple of the number of streams in order to allocate the test frames evenly over all the streams. In the following section, we show the improved performance of the combined system over the single stream based SI system at a comparable computational load. The motivation behind introducing the IMFCC is thus further justified here, as IMFCC shares almost all the computational modules with MFCC and combines with MFCC efficiently by exploiting PQ based pruning. Viewed in another manner, we involve another set of Gaussians (i.e. the IMFCC speaker models), whose evidence from the high frequency content of the speech helps MFCC to perform better without increasing the computational burden of the system. The scheme is suitable for merging two or more feature sets that use common modules/blocks and do not depend on one another. The algorithm is described next (see Algorithm 2.1).

2.6.1 Performance Evaluation

All the algorithms described up to this point were implemented using MATLAB version 7.0.1.24704 (R14). All experiments were carried out on a single computer having a 3 GHz dual-core processor and 2048 MB of RAM. The operating system is Windows XP. We use the MATLAB command 'cputime' to measure the running time.

2.7 Results and Discussion on PQ based Fusion Strategy with P = 2

The results and discussions on the aforementioned fusion scheme are given next. Tables 2.5 and 2.6 show the identification accuracies of the MFCC and IMFCC based systems in absolute, pre-quantized and fusion modes. The fused system outperforms the baseline (the MFCC based SI system) over all the model orders on the two standard databases, for example 97.64% against 96.79% at M = 64 on YOHO and 81.43% against 77.85% at M = 16 on POLYCOST. In both the tables, average times ($T_m$ and $T_f$) have been shown for MFCC


Algorithm 2.1: PQ based fusion strategy

    /* Declaration of parameters */
    S = Total number of speakers
    P = Pre-quantization rate
    Y_m = Energy spectrum for the mth frame
    MFCC_fea_ext = MFCC feature extraction module
    IMFCC_fea_ext = IMFCC feature extraction module
    X_m^MFCC = Unknown cepstral vector for the mth frame using the MFCC filter bank
    X_m^IMFCC = Unknown cepstral vector for the mth frame using the IMFCC filter bank
    λ_i^MFCC = Speaker model realized using the MFCC filter bank for speaker i
    λ_i^IMFCC = Speaker model realized using the IMFCC filter bank for speaker i
    L_i = Log-likelihood of speaker model i
    s_true = True speaker identity

    /* Fusion */
    L_i ← 0
    for i = 1 to S do
        for m = 1 : P : T do
            if using MFCC stream then
                X_m^MFCC = MFCC_fea_ext(Y_m)      // Use frame nos. 1, 1 + P, ... See Table 2.7
                L_i = L_i + log p(X_m^MFCC | λ_i^MFCC)
            else
                X_m^IMFCC = IMFCC_fea_ext(Y_m)    // Use frame nos. P/2 + 1, P/2 + 1 + P, ... See Table 2.7
                L_i = L_i + log p(X_m^IMFCC | λ_i^IMFCC)
            end
        end
    end

    Decision: s_true = arg max_s L_s,  s ∈ {1, 2, ..., S}
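Algorithm 2.1 can be sketched in Python as follows. This is an illustrative reading of the algorithm, not the MATLAB code used for the experiments. The function `fused_identify` and all of its arguments are hypothetical: `mfcc_ext` and `imfcc_ext` stand for the two feature extraction modules, `frame_loglik(x, model)` stands for $\log p(x|\lambda)$, and the speaker models may be any objects that `frame_loglik` understands. Unlike the listing, the sketch extracts each feature vector once and then scores it against all speakers, which gives the same accumulated scores.

```python
import numpy as np

def fused_identify(spectra, P, mfcc_ext, imfcc_ext, mfcc_models, imfcc_models, frame_loglik):
    """Accumulate per-speaker scores over the MFCC and IMFCC streams and pick the best speaker.

    spectra: sequence of per-frame energy spectra (frame 1 is spectra[0]).
    P: pre-quantization rate, a multiple of 2 because two streams are used.
    """
    if P < 2 or P % 2 != 0:
        raise ValueError("P must be a positive multiple of 2 for two streams")
    if len(mfcc_models) != len(imfcc_models):
        raise ValueError("each speaker needs one MFCC model and one IMFCC model")
    T = len(spectra)
    scores = np.zeros(len(mfcc_models))             # score accumulator, one entry per speaker
    # MFCC stream: frames 1, 1 + P, ... (0-based indices 0, P, ...).
    for m in range(0, T, P):
        x = mfcc_ext(spectra[m])
        for i, model in enumerate(mfcc_models):
            scores[i] += frame_loglik(x, model)
    # IMFCC stream: frames P/2 + 1, P/2 + 1 + P, ... (0-based indices P/2, P/2 + P, ...).
    for m in range(P // 2, T, P):
        x = imfcc_ext(spectra[m])
        for i, model in enumerate(imfcc_models):
            scores[i] += frame_loglik(x, model)
    return int(np.argmax(scores)), scores
```

With P = 2 and an odd number of frames, the MFCC stream receives one frame more than the IMFCC stream, as described in section 2.6.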

Table 2.5: Performances using MFCC & IMFCC feature sets in normal, pre-quantized and fused mode for YOHO database for (M = 64).

| M | PIA (%) MFCC (T vec.) | T_m (s) | PIA (%) IMFCC (T vec.) | PIA (%) MFCC_PQ (T/2 vec.) | PIA (%) IMFCC_PQ (T/2 vec.) | PIA (%) FUSED (T vec.) | T_f (s) | SF (T_m/T_f) |
|---|---|---|---|---|---|---|---|---|
| 2 | 74.31 | 6.22 | 78.04 | 73.50 | 77.45 | 82.81 | 6.22 | 1:1 |
| 4 | 84.86 | 7.12 | 86.50 | 84.46 | 85.87 | 90.82 | 7.11 | 1:1 |
| 8 | 90.69 | 8.71 | 91.99 | 90.62 | 91.56 | 94.80 | 8.70 | 1:1 |
| 16 | 94.20 | 11.78 | 94.15 | 93.95 | 93.77 | 96.27 | 11.79 | 1:1 |
| 32 | 95.67 | 18.19 | 95.22 | 95.60 | 95.13 | 97.19 | 18.20 | 1:1 |
| 64 | 96.79 | 25.53 | 95.76 | 96.74 | 95.51 | 97.64 | 25.55 | 1:1 |

Notes: PIA = Percentage of Identification Accuracy; T_m = time required for evaluating the MFCC stream; T_f = time required for evaluating the fused stream; SF = speed-up factor = T_m/T_f; MFCC_PQ = PQ applied in the MFCC stream; IMFCC_PQ = PQ applied in the IMFCC stream; vec. = vectors; s = seconds.

Table 2.6: Performances using MFCC & IMFCC feature sets in normal, pre-quantized and fused mode for POLYCOST database for (M = 16).

| M | PIA (%) MFCC (T vec.) | T_m (s) | PIA (%) IMFCC (T vec.) | PIA (%) MFCC_PQ (T/2 vec.) | PIA (%) IMFCC_PQ (T/2 vec.) | PIA (%) FUSED (T vec.) | T_f (s) | SF (T_m/T_f) |
|---|---|---|---|---|---|---|---|---|
| 2 | 63.93 | 36.20 | 55.97 | 63.53 | 55.84 | 68.04 | 36.22 | 1:1 |
| 4 | 72.94 | 40.56 | 68.04 | 72.28 | 67.90 | 76.53 | 40.58 | 1:1 |
| 8 | 77.85 | 49.61 | 76.26 | 77.59 | 76.26 | 80.77 | 49.59 | 1:1 |
| 16 | 77.85 | 66.72 | 77.06 | 77.59 | 76.92 | 81.43 | 66.78 | 1:1 |

and fused systems in the third and eighth columns respectively. Over the various model orders, the times required by the fused system are comparable to the computation times taken by the MFCC based SI system. The pre-quantized results for both feature sets (on an individual basis) show a minor degradation of identification accuracy compared to their respective baselines, as expected. This also shows that MFCC and IMFCC are complementary to each other, so that their fusion gives better performance with no additional computational load.

2.8 Further Reduction of Frame Rate with P > 2

In section 2.6, we have demonstrated a PQ based merging scheme for multi-stream [53] speaker models and shown the results in section 2.7. This section presents a further reduction of computation by increasing the decimation rate, i.e. P. In the earlier case, P = 2 was chosen for routing the energy spectra alternately to either the MFCC or the IMFCC filter bank, and the results show that the fused scheme performs considerably better than the baseline system. Here, we assign values of P from 4 onwards, i.e. P = 4, 6, 8 and so on, until the performance of the fused scheme drops below the accuracy shown by the MFCC based system. The objective of this study is to show how far the computational complexity can be reduced without sacrificing the absolute performance of the baseline system. Note that the results are presented here with the highest model order for both databases, i.e. 64 Gaussians for YOHO and 16 Gaussians for POLYCOST. The $(P(n-1)+1)$th vectors are allocated to the MFCC stream, while the $(P(n-\frac{1}{2})+1)$th vectors are routed to the IMFCC stream, where $n = 1, 2, 3, \ldots$, and $T$ is the total number of available test vectors involved in an identification. Table 2.7 shows the serial numbers of the vectors that are routed to the MFCC and IMFCC streams for different values of P. For any value of P, the first vector is always sent to the MFCC models of all the speakers in the database. Assuming uniformity in correlation, the starting vector of IMFCC is not fixed; rather, it is chosen as the middle vector (see Tab. 2.7) of any two consecutive vectors allocated to the MFCC models. From Tables 2.8 and 2.9 it is observed that better identification accuracies than the baseline system can be obtained with much lower complexity than in the results shown in Tables 2.5 and 2.6. For the YOHO and POLYCOST databases, maximum speed-up factors (SF) of 4:1 and 8:1, respectively, are achieved without compromising the performance shown by the MFCC based system. The times required for the fused system to give the decision about speaker identity are also reported here. The reduced times closely follow what is expected from the fixed decimation based PQ technique, i.e. the time falls roughly in proportion to the number of vectors used. However, both streams individually show gradually degrading performance with increasing P. Therefore, the fact that the combined system outperforms the baseline system independently confirms the usefulness of IMFCC, which models speaker specific cues in the high frequency region of the speech. Figures 2.13 and 2.14 give a graphical representation of this scheme.

Table 2.7: Serial nos. of allocated vectors towards MFCC & IMFCC stream for different values of P.

| Value of P | Vectors allocated to MFCC stream | Vectors allocated to IMFCC stream | Total no. of vectors used |
|---|---|---|---|
| P = 2 | 1, 3, 5, 7, ... | 2, 4, 6, 8, ... | T/2 + T/2 = T |
| P = 4 | 1, 5, 9, 13, ... | 3, 7, 11, 15, ... | T/4 + T/4 = T/2 |
| P = 6 | 1, 7, 13, 19, ... | 4, 10, 16, 22, ... | T/6 + T/6 = T/3 |
| P = 8 | 1, 9, 17, 25, ... | 5, 13, 21, 29, ... | T/8 + T/8 = T/4 |
| ... | ... | ... | ... |

Table 2.8: Reduction in computational complexity with increasing P for fusion scheme on YOHO Database (M = 64). Average time (T_m) for only MFCC (single stream) is 25.53 sec.

| Total vec. used | PIA (%) MFCC_PQ | PIA (%) IMFCC_PQ | PIA (%) FUSED | T_f (s) | SF (T_m/T_f) |
|---|---|---|---|---|---|
| T | 96.74 | 95.51 | 97.64 | 25.55 | 1:1 |
| T/2 | 96.56 | 94.87 | 97.59 | 12.52 | 2:1 |
| T/3 | 95.83 | 93.97 | 97.52 | 8.46 | 3:1 |
| T/4 | 95.22 | 92.95 | 97.50 | 6.30 | 4:1 |
| T/5 | 94.38 | 91.11 | 96.61 | 5.09 | 5:1 |
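The allocation rule of this section, $P(n-1)+1$ for MFCC and $P(n-\frac{1}{2})+1$ for IMFCC, can be checked with a short Python sketch. The function name `allocate_frames` is hypothetical; frame numbers are 1-based as in Table 2.7, and P is assumed to be even so that $P/2$ is an integer.

```python
def allocate_frames(T, P):
    """Return the 1-based frame numbers routed to the MFCC and IMFCC streams.

    MFCC gets frames P(n-1)+1 and IMFCC gets frames P(n-1/2)+1 for n = 1, 2, 3, ...
    """
    if P < 2 or P % 2 != 0:
        raise ValueError("P must be a positive multiple of 2 for two streams")
    mfcc = list(range(1, T + 1, P))                 # 1, 1 + P, 1 + 2P, ...
    imfcc = list(range(P // 2 + 1, T + 1, P))       # P/2 + 1, P/2 + 1 + P, ...
    return mfcc, imfcc
```

For T = 30 and P = 8, for instance, the MFCC stream receives frames 1, 9, 17 and 25 and the IMFCC stream receives frames 5, 13, 21 and 29, so that each IMFCC frame lies midway between two consecutive MFCC frames.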

Table 2.9: Reduction in computational complexity with increasing P for fusion scheme on POLYCOST Database (M = 16). Average time (T_m) for only MFCC (single stream) is 66.72 sec.

| Total vec. used | PIA (%) MFCC_PQ | PIA (%) IMFCC_PQ | PIA (%) FUSED | T_f (s) | SF (T_m/T_f) |
|---|---|---|---|---|---|
| T | 77.59 | 76.92 | 81.43 | 66.78 | 1:1 |
| T/2 | 76.79 | 76.79 | 81.10 | 33.31 | 2:1 |
| T/3 | 76.26 | 75.73 | 80.77 | 22.19 | 3:1 |
| T/4 | 76.26 | 74.01 | 80.64 | 16.65 | 4:1 |
| T/5 | 75.86 | 73.21 | 80.24 | 13.41 | 5:1 |
| T/6 | 75.26 | 71.75 | 80.11 | 11.10 | 6:1 |
| T/7 | 74.67 | 70.16 | 79.44 | 9.53 | 7:1 |
| T/8 | 72.55 | 69.89 | 79.05 | 8.34 | 8:1 |
| T/9 | 72.15 | 68.30 | 77.45 | 7.40 | 9:1 |

2.9 Conclusions

The main points presented in this chapter include the following:

• The inverted Mel-scale is proposed for developing a new filter bank, which captures high frequency cues for speakers. The performances obtained with this new filter bank are better than the performance shown by the MFCC based system, especially in the lower order models for the microphone speech. Comparable performances with respect to the baseline system are obtained when model orders are increased for both the databases.

• A merging scheme is also proposed here for combining model level scores using the PQ based pruning method. Exploiting the common usage of the same modules by the MFCC and IMFCC feature sets, scores from the models are merged by adopting the PQ technique without any extra computational budget compared to the single stream based SI system. The combined system performs better than the baseline system even though the two systems use the same amount of computation.

• Further pruning has been done on the fused system by increasing the decimation rate in order to show the maximum speed-up factors that one can achieve. For the YOHO database, we achieved a 4:1 computational benefit without compromising identification accuracy, while in the POLYCOST database, a speed-up factor of 8:1 is gained. The speed-up factor is measured by averaging the total time required for all the utterances that are put under test.
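The speed-up figures quoted above can be reproduced from the tabulated values. The Python sketch below does this for the YOHO results, using the fused accuracies and times of Table 2.8, the single stream time of 25.53 s, and the MFCC baseline accuracy of 96.79% at M = 64 from Table 2.3. The function names are hypothetical, and the sketch simply recomputes $T_m/T_f$ and checks each row against the baseline.

```python
def speed_up_report(t_single, rows, baseline_pia):
    """For each (label, fused PIA, fused time) row, return (label, T_m/T_f, beats baseline?)."""
    report = []
    for label, fused_pia, t_fused in rows:
        report.append((label, t_single / t_fused, fused_pia >= baseline_pia))
    return report

def max_safe_speed_up(report):
    """Largest speed-up factor among the rows that do not fall below the baseline accuracy."""
    safe = [sf for _, sf, ok in report if ok]
    return max(safe) if safe else None

# Values copied from Table 2.8 (YOHO, M = 64).
YOHO_ROWS = [
    ("T", 97.64, 25.55),
    ("T/2", 97.59, 12.52),
    ("T/3", 97.52, 8.46),
    ("T/4", 97.50, 6.30),
    ("T/5", 96.61, 5.09),
]

yoho_report = speed_up_report(25.53, YOHO_ROWS, baseline_pia=96.79)
```

Here the row for T/4 is the last one whose fused accuracy stays above the baseline, and its measured ratio is about 4.05, in line with the 4:1 figure reported for YOHO.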

Figure 2.13: Time vs. PIA for YOHO database. (Plot for M = 64; x-axis: average time in seconds, y-axis: identification accuracy in %; fused system and baseline curves, with points labeled by the total number of test frames used, from T down to T/5.)

Figure 2.14: Time vs. PIA for POLYCOST database. (Plot for M = 16; x-axis: average time in seconds, y-axis: identification accuracy in %; fused system and baseline curves, with points labeled by the total number of test frames used, from T down to T/9.)


Besides the main points described above, some other notable observations can be made. First, the merging scheme shows the potential of merging multi-stream models where the models are developed on two or more feature sets that share the same computational modules/blocks and have no dependency/loading between them. However, developing multiple models from different feature sets requires additional memory space for storage. Therefore, one must accept a tradeoff between memory and speed when using such a multi-stream modeling technique. The merits of the performance gains must be weighed against these considerations. Next, if we compare the results on the two databases, the performance on the YOHO database is better than that on the POLYCOST database as far as applications like SI are concerned. The YOHO database consists of three digit combination lock numbers like “26-81-57” both in training and testing time. On the other hand, the free speech mother tongue files, namely “MOT02” and “MOT01”, have been used for training and testing respectively in the POLYCOST database. The poorer performance on the POLYCOST database may be explained by the wide variety of the linguistic content of the training and test data, besides the fact that telephone data are usually channel dependent and noisy. For these reasons, the performances obtained on the POLYCOST database are not comparable with those obtained on the YOHO database in any of the conducted experiments.

References [1] B. Gold and N. Morgan, Speech and audio Signal Processing : processing and perception of speech, and music, John Willy & Sons (ASIA) Pte. Ltd., 2002, pp. 189-203. (Cited in sections 2.1 and 2.3.) [2] S. Furui, “Cepstral analysis technique for automatic speaker verification,” IEEE Trans. Audio Speech and Signal Process., vol. ASSP-29, no. 2, pp. 254-272, Apr. 1981. (Cited in section 2.1.) [3] J. G. Proakis and D. G. Manolakis, Digital Signal Processing : Priciples, Algorithms, and Applications, Pearson Education, Inc., 3rd ed., 2004, pp. 448-494. (Cited in sections 2.1 and 2.2.) [4] M. D. Skowronski and J. G. Harris, “Exploiting independent filter bandwidth of human factor cepstral coefficients in automatic speech recognition,” J. Acoustical Society of America, vol. 116, no. 3, pp. 1774-1780, Sept. 2004. (Cited in section 2.1.) [5] S. B. Davis and P. Mermelstein, “Comparison of Parametric Representation for Monosyllabic Word Recognition in Continuously Spoken Sentences,” IEEE Trans. Audio Speech and Signal Process., vol. ASSP-28, no. 4, pp. 357-365, Aug. 1980. (Cited in sections 2.1 and 2.3.) [6] R. Vergin, D. O’ Shaughnessy, and A. Farhat, “Generalized Mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition,” IEEE Trans. Audio Speech and Signal Process., vol. 7, no. 5, pp. 525-532, Sept. 1999. (Cited in section 2.1.)

48

Speaker Identification Based on High-Frequency Cues

[7] H. Gish and M. Schmidt, “Text-independent speaker identification,” IEEE Signal Process. Magazine, vol. 11, no. 4, pp. 18-32, Oct. 1994. (Cited in section 2.1.) [8] M. F.-Zanuy and E. M.-Moreno, “State-of-the-art in speaker recognition,” IEEE Aerospace and Electronic Systems Magazine, vol. 20, no. 5, pp. 7-12, Mar. 2005. (Cited in sections 2.1 and 2.5.) [9] K. S. R. Murty and B. Yegnanarayana, “Combining evidence from residual phase and MFCC features for speaker recognition,” IEEE Signal Process. Lett., vol 13, no. 1, pp. 52-55, Jan. 2006. (Cited in sections 2.1 and 2.3.) [10] N. Zheng, T. Lee, and P. C. Ching, “Integration of Complementary Acoustic Features for Speaker Recognition,” IEEE Signal Process. Lett., vol 14, no. 3, pp. 181-184, Mar. 2007. (Cited in sections 2.1 and 2.3.) [11] W. N. Chan, N. Zheng, and T. Lee, “Discrimination Power of Vocal Source and Vocal Tract Related Features for Speaker Segmentation,” IEEE Trans. Audio, Speech, and Language Process., vol. 15, no. 6, pp. 1884-1892, Aug. 2007. (Cited in section 2.1.) [12] U. G. Goldstein, “Speaker identifying features based on formant tracks,” J. Acoustical Society of America, vol. 59, no. 1, pp. 176-182, Jan. 1976. (Cited in section 2.1.) [13] R. D. Zilca, B. Kingsbury, J. Navratil, and G. N. Ramaswamy, “Pseudo pitch synchronous analysis of speech with applications to speaker recognition,” IEEE Trans. Speech, Audio and Language Process., vol. 14, no. 2, pp. 467-478, Mar. 2006. (Cited in section 2.1.) [14] S. R. M Prasanna, C. S. Gupta, and B. Yegnanarayana, “Extraction of speaker-specific excitation information from linear prediction residual of speech,” Speech Communication , vol. 48, no. 10, pp. 1243-1261, Oct. 2006. (Cited in section 2.1.) [15] L. Rabiner and B. H. Juang, Fundamentals of speech recognition, Pearson Education Inc., First Indian Reprint, 2003, pp. 11-65. (Cited in sections 2.1 and 2.3.) [16] X. Lu and J. 
Dang, “An investigation of dependencies between frequency componnets and speaker charactteristcis for text-independent speaker identification,” Speech Commun., 2007, to be published. (Cited in section 2.1.) [17] F. Zheng, F. G. Zhang, and Z. Song, “Comparison of different implementations of MFCC,” J. Computer Science & Technology, vol. 16, no. 6, pp. 582-589, Sept. 2001. (Cited in section 2.1.) [18] T. Ganchev, N. Fakotakis, and G. Kokkinakis, “Comparative Evaluation of Various MFCC Implementations on the Speaker Verification Task,” in Proc. of 10th International Conference on Speech and Computer, (SPECOM 2005), 2005, pp. 191-194. (Cited in section 2.1.) [19] S. Chakroborty, A. Roy, and G. Saha, “Improved Closed Set Text-Independent Speaker Identification by combining MFCC with Evidence from Flipped Filter Banks,” International Journal of Signal Process., vol. 4, no. 2, pp. 114-121, Apr. 2007. (Cited in section 2.1.) [20] T. Kinnunen, E. Karpov, and P. Fr¨ anti, “Real-Time Speaker Identification and Verification,” IEEE Trans. Speech and Audio Process., vol 14, no. 1, pp. 277-288. Jan. 2006. (Cited in sections 2.1 and 2.6.) [21] J. McLaughlin, D. A. Reynolds, and T. Gleason, “A study of computation speed-ups of the GMMUBM speaker recognition system,” in Proc. 6th European Conf. Speech Communication and Technology (Eurospeech 1999), 1999, pp. 1215-1218. (Cited in section 2.1.)

2.9 References

49

[22] B. L. Pellom and J. H. L. Hansen, “An efficient scoring algorithm for gaussian mixture model based speaker identification,” IEEE Signal Process. Lett., vol. 5, no. 11, pp. 281-284, Nov. 1998. (Cited in section 2.1.) [23] S. S. Stevens, J. Volkmann, and E. B. Newman , “A Scale for the Measurement of the Psychological Magnitude Pitch,” J. Acoustical Society of America, vol. 8, no. 3, pp. 155-210, Jan. 1937. (Cited in section 2.2.) [24] D. J. Mashao and M. Skosan, “Combining Classifier Decisions for Robust Speaker Identification,” Pattern Recog., vol. 39, no. 1, pp. 147-155, Jan. 2006. (Cited in sections 2.2 and 2.3.) [25] R. D. Peacocke and D. H. Graph, “An Introduction to Speech and Speaker Recognition,” Computer, vol. 23, no. 8, pp. 26-33, Aug. 1995. (Cited in section 2.2.) [26] J. P. Cambell, Jr., “Speaker Recognition:A Tutorial,” Proceedings of the IEEE, vol. 85, no. 9, pp. 1437-1462, Sept. 1997. (Cited in sections 2.2 and 2.5.) [27] D. O’ Shaughnessy, “Speaker recognition,” IEEE Signal Process. Magazine , vol. 3 , no. 4, pp. 4-17, Oct 1986. (Cited in section 2.2.) [28] F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. M. Chagnolleau, S. Meignier, T. Merlin, J. O. Garcia, D. P. Delacretaz, and D. A. Reynolds, “A Tutorial on Text Independent Speaker Verification,” EURASIP Journal on Applied Signal Process., vol. 2004, no. 4, pp. 430-451, 2004. (Cited in section 2.2.) [29] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signal, Pearson Education Inc., First Indian Reprint, 2003, pp. 116-166. (Cited in sections 2.2, 2.4.1 and 2.6.) [30] N. Ahmed, T. Natarajan, and K. Rao, “Discrete cosine transform,” IEEE Trans. on Comput., vol. C-23, no. 1, pp. 90-93, Jan. 1974. (Cited in section 2.2.) [31] L. Besacier and J.-F. Bonastre, “Subband approach for automatic speaker recognition: Optimal division of the frequency domain,” in Proc. of 1st International Conference on Audio-and VisualBased Biometric Person Authentication, (AVBPA 1997), 1997, pp. 
195-202. (Cited in section 2.3.) [32] S. Hayakawa and F. Itakura, “Text-Dependent Speaker Recognition Using the Information in the Higher Frequency Band,” in Proc. International Conf. on Acoustic, Speech, and Signal Process., (ICASSP 1994), 1994, pp. 137-140. (Cited in section 2.3.) [33] J. P. Campbell, Jr., “Testing with the YOHO CDROM voice verification corpus,” in Proc. International Conference on Acoustic, Speech, and Signal Process., (ICASSP 1995), 1995, pp. 341-344. (Cited in sections 2.3 and 2.5.) [34] J. Hennebert, H. Melin, D. Petrovska, and D. Genoud, “POLYCOST: A telephone-speech database for speaker recognition,” Speech Communication, vol. 31, no. 2-3, pp. 265-270, Jun 2000. (Cited in section 2.3.) [35] T. Nordstr¨ om, H. Melin, and J. Lindberg, “A comparative study of speaker verification systems using the polycost database,” in Proc. of 5th International Conference on Spoken Language Processing, (ICSLP98), 1998, pp. 1359-1362. (Cited in sections 2.3 and 2.5.) [36] J. Kittler, M. Hatef, R. Duin, and J. Mataz, “On combining classifiers,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226-239, Mar. 1998. (Cited in section 2.3.)

50

Speaker Identification Based on High-Frequency Cues

[37] D. Huiqun and D. O’Shaughnessy, “Voiced-Unvoiced-Silence Speech Sound Classification Based on Unsupervised Learning,” in Proc. of IEEE International Conference on Multimedia and Expo, 2007, pp. 176-179. (Cited in section 2.4.1.)
[38] A. Davis, S. Nordholm, and R. Togneri, “Statistical Voice Activity Detection Using Low-Variance Spectrum Estimation and an Adaptive Threshold,” IEEE Trans. on Speech, Language Process., vol. 14, no. 2, pp. 412-424, Mar. 2006. (Cited in section 2.4.1.)
[39] T. F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, Pearson Education Inc., First Indian Reprint, 2004, pp. 175-251. (Cited in section 2.4.1.)
[40] D. A. Reynolds and R. Rose, “Robust text-independent speaker identification using gaussian mixture speaker models,” IEEE Trans. on Speech and Audio Process., vol. 3, no. 1, pp. 72-83, Jan. 1995. (Cited in sections 2.4.2 and 2.4.2.)
[41] D. A. Reynolds, “Speaker identification and verification using Gaussian mixture speaker models,” Speech Commun., vol. 17, no. 1-2, pp. 91-108, Aug. 1995. (Cited in sections 2.4.2 and 2.5.)
[42] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, John Wiley & Sons (ASIA) Pte. Ltd., 2nd ed., 2006, pp. 20-82. (Cited in section 2.4.2.)
[43] A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. Royal Stat. Soc., vol. 39, no. 1, pp. 1-38, 1977. (Cited in section 2.4.2.)
[44] W. M. Campbell, K. T. Assaleh, and C. C. Broun, “Speaker Recognition With Polynomial Classifiers,” IEEE Trans. Speech Audio Process., vol. 10, no. 4, pp. 205-212, May 2002. (Cited in section 2.4.2.)
[45] K. R. Farrell, R. J. Mammone, and K. T. Assaleh, “Speaker Recognition using Neural Networks and Conventional Classifiers,” IEEE Trans. Speech and Audio Process., vol. 2, no. 1, pp. 194-205, Jan. 1994. (Cited in section 2.4.2.)
[46] F. Soong, F. A. Rosenberg, L. Rabiner, and B. A. Juang, “Vector quantization approach to speaker recognition,” in Proc. of IEEE International Conference on Acoustics, Speech, and Signal Process., (ICASSP 1985), 1985, pp. 387-390. (Cited in section 2.4.2.)
[47] T. Matsui and S. Furui, “Comparison of text-independent speaker recognition methods using VQ-distortion and discrete/continuous HMMs,” in Proc. of IEEE International Conference on Acoustics, Speech, and Signal Process., (ICASSP 1992), 1992, pp. II-157-II-160. (Cited in section 2.4.2.)
[48] Y. Linde, A. Buzo, and R. M. Gray, “An algorithm for vector quantizer design,” IEEE Trans. Commun., vol. 28, no. 1, pp. 84-95, Jan. 1980. (Cited in sections 2.4.2, 2.5 and 2.6.)
[49] L. Liu and J. He, “On the use of orthogonal GMM in speaker recognition,” Proc. of IEEE International Conference on Acoustics, Speech, and Signal Process., (ICASSP 1999), 1999, vol. 2, pp. 845-848. (Cited in section 2.4.2.)
[50] H. Melin and J. Lindberg, “Guidelines for experiments on the polycost database,” in Proc. of a COST 250 workshop on Application of Speaker Recognition Techniques in Telephony, 1996, pp. 59-69. (Cited in section 2.5.)
[51] K. K. Yiu, M. W. Mak, and S. Y. Kung, “Channel Distortion Compensation Based On The Measurement Of Handset’s Frequency Responses,” in Proc. of 2001 International Symposium on


Intelligent Multimedia, Video and Speech Process., (ISIMP 2001), 2001, pp. 197-200. (Cited in section 2.5.)
[52] H. A. Murthy, F. Beaufays, L. P. Heck, and M. Weintraub, “Robust Text-Independent Speaker Identification over Telephone Channels,” IEEE Trans. Speech and Audio Process., vol. 7, no. 5, pp. 554-568, Sept. 1999. (Cited in section 2.5.)
[53] W. H. Abdulla and N. Kasabov, “Reduced Feature set Based parallel CHMM Speech Recognition System,” Information Sciences, vol. 156, no. 1-2, pp. 21-38, Nov. 2003. (Cited in sections 2.6 and 2.8.)
[54] S. Tibrewala and H. Hermansky, “Multi-Stream Approach in Acoustic Modeling,” in Proc. LVCSR-Hub5 Workshop, 1997. (Cited in section 2.6.)
[55] W. Tomasi, Electronic Communication Systems: Fundamentals Through Advanced, Pearson Education, Inc., 5th ed., 2006, pp. 469-521. (Cited in section 2.6.)

CHAPTER 3

Studies on Gaussian Filter Shapes for Speaker Identification Application

Preface

This work mainly discusses the shapes of bandpass filters used in filter bank based feature extractors in the SI application. Various Gaussian shaped filters have been proposed and their performances are compared with the performances shown by conventional rectangular and triangular filter based filter banks. The chapter also discusses the effect of correlation between adjacent subbands on SI performance.

3.1 Introduction

Recent attention in speech related applications has focused on the use of subband processing [1], whereby the wide band signal is preprocessed by a bank of Q bandpass filters to give a set of Q time-varying outputs, which are individually processed [1], [2]. The major advantage of subband processing is that the scheme produces much better and more robust models [3], [4], [5], [6] for each of the Q subband signals from the (always limited) example data than the single model produced from the wide band signal. The key idea is to analyze independently the speech or speaker dependent information that is distributed unevenly [7] over the entire frequency range. In the speaker recognition context, the variation of speaker specific cues over several subbands is discussed in [1], [8]. The non-uniformity of speaker specific information, which lies mainly around the regions of the first and third formant frequencies, has been discussed in [1], [2]. Note that the first and third formant frequencies are normally found below 600 Hz and above 3000 Hz respectively.

The subband based approach has also become popular in recent years in speech recognition [9], [10], [11]. In this area, the main motivation has been to achieve robust recognition in the face of noise. The basic methodology is that the recombination process allows the overall decision to be made taking into account any noise contaminating [12] one or more of the partial bands. To model a K subband based system, K models are used. At the time of testing, scores from the several models are combined to yield a consensus decision. Note that a subband can comprise one or more bandpass filters, which are usually shared by adjacent subbands. The MFCC feature extraction task uses such subband processing, where each subband consists of a single bandpass filter (i.e. K = Q, where Q is the number of bandpass filters) that partially overlaps with adjacent ones. Even though each filter is treated individually during the determination of filter bank outputs, it is entirely the designer’s choice whether to process the cepstral coefficients individually or together. Normally, a subband is constructed with more than two bandpass filters. Accordingly, if there are K subbands, K sets of cepstral parameters are obtained, on which the respective models are developed. Thus the model level complexity increases by K times as compared to conventional multidimensional modeling schemes.

The subband processing technique divides the frequency range into a number of divisions, for which the correlations between a particular subband and its adjacent ones are lost. By correlations, we mean the smooth continuity of the energy spectrum that lies over the whole frequency range. Note that the work presented in [1] has already shown the importance of this correlation in an SI problem and has given a detailed comparison of SI performances between systems that use adjacent and nonadjacent subbands. It is found that the former outperform the latter. The role of this correlation can also be found in a speech recognition application [13], where it has been shown that the redundancy between the subbands might be a source of human robustness to speech degradation.

3.1.1 Motivation

Acoustic feature extraction techniques like MFCC come under subband processing techniques in which each subband is constructed with one Triangular Filter (TF) that partially overlaps (see Fig. 3.1) with its neighboring ones. A triangular filter provides crisp partitions in an energy spectrum, as it sets non-zero weights to the portion covered by it while giving zero weights outside it. Therefore, during spectral averaging (eqn. 2.6), the smooth transition of the energy spectrum is somewhat hampered by the abrupt ending of the two sides of the TF, resulting in loss of correlation among the subbands. However, no attempts have yet been made to extract features that introduce correlation in a systematic way. In a broad sense, the effect of filter shapes in the context of speaker recognition has not been explored much, although there are studies [14], [15] on auditory filter shapes for the speech recognition application. In this chapter, we investigate the effect of filter shape in the SI application with detailed experimentation on two databases. We introduce here the use of Gaussian Filters (GF) [16], [17] (see Fig. 3.1) instead of triangular ones as averaging bins for calculating MFCC. The motivation for using GF is threefold.

• First, Gaussian shaped filters can provide a much smoother transition from one subband to the adjacent one, preserving most of the correlation between them.

• Second, the Standard Deviations (SD) [18] of these GF can be independently chosen in order to have control over the amount of overlap with the neighboring subband.

• Third, the filter design parameters (means [18] and SD) for GF can be calculated very easily from the mid as well as end-points located at the base of the original TF used for implementing MFCC.

[Figure 3.1 (image not reproduced): panels for 5 subbands and 10 subbands, each realized with triangular, rectangular and Gaussian filters; horizontal axis: Frequency (Hz), 0 to 4000; vertical axis: Amplitude, 0 to 1.]

Figure 3.1: Overlapped subbands realized by filters of various shapes

GF are symmetric, provide positive filter weights and, to some extent, resemble the triangular filters of MFCC, which is an essential requirement in the filter bank based feature extraction process. One such attempt [19] has been made in speech recognition, where the mean and SD of a GF are obtained from the first and second moments of the energy spectrum confined within a subband. That work has shown that the improvements in recognition performance over TF based MFCC are insignificant. In addition, the works [20], [21] estimate the parameters of Gaussian shaped filters by discriminative training in a speech recognition application. The performances obtained there with GF are poor compared to ‘free-formed’ and ‘triangular-like’ filter shapes.

3.1.2 Organization of the chapter

The rest of this chapter covers the use of GF on the mel scale and the evaluation of its parameters in section 3.2. Next, comparative results for an SI application that uses differently shaped mel-scaled filters are studied in section 3.3. This is followed by an extension of this work, in which GF are imposed on the inverted mel-scale; the derivation and the performances of these inverted mel-scaled filters for the same application are discussed in section 3.4 and section 3.5, respectively. In section 3.6, the effectiveness of each feature set is evaluated using the divergence measure. We have also applied the PQ based fusion strategy (ref. sec. 2.6) for merging the model level outputs, where models have been developed from GF based MFCC and IMFCC feature sets, and compared the results with those reported in chapter 2. The results are presented in section 3.7 and section 3.8, followed by the conclusions in section 3.9.

3.2 Derivation of Gaussian Filter based MFCC

A detailed description of the MFCC feature extraction task has already been presented in sec. 2.2. In this section, we derive the GF based MFCC. In conventional MFCC, overlapped TF are used as averaging bins for obtaining the filter bank outputs. The averaging is done for two reasons: 1) estimating the short-time spectral envelope, which describes a speaker’s vocal tract characteristics, and 2) reducing dimension, by which the whole energy spectrum is converted to low resolution filter bank outputs. The response of TF has been given by eqn. 2.3. Analogously, we can write the response of a GF as,

\[ \psi_i^g(k) = e^{-\frac{(k - k_{b_i})^2}{2\sigma_i^2}} \tag{3.1} \]

where $k_{b_i}$ is the point located between the $i$th TF’s boundaries, i.e. $k_{b_{i-1}}$ and $k_{b_{i+1}}$, given by eqn. 2.3. We consider $k_{b_i}$ as the mean of the $i$th GF, while $\sigma_i$ is the standard deviation, or square root of the variance, and we define it as,

\[ \sigma_i = \frac{k_{b_{i+1}} - k_{b_i}}{\alpha_i} \tag{3.2} \]

where $\alpha_i$ is the parameter that controls the variance, which in turn controls the amount of overlap of one GF with its adjacent ones. Alternatively, the mean and SD could also be chosen as the first and second order moments [19] of the spectral components that lie in a frequency band. In eqn. 3.1, the conventional denominator $\sqrt{2\pi}\,\sigma_i$ of the Gaussian is dropped, as its presence only ensures that the area under a Gaussian curve [18] is unity. Omitting the term lets each GF attain unity as its highest value, at its mean. These specifications are comparable with those of the TF for conventional MFCC.

Note that TF are originally isosceles (symmetric) in the mel-frequency scale, but when the scale is mapped into the original frequency scale (see Fig. 3.2) they become non-isosceles, as shown by equation 2.5. Therefore, the distances from the boundaries ($k_{b_{i-1}}$ and $k_{b_{i+1}}$) to the point ($k_{b_i}$) situated between them are not equal. Owing to the non-linear monotonic relation between $f$ and $f_{mel}$ (eqn. 2.1), the inequality between these two distances can be written as,

\[ (k_{b_{i+1}} - k_{b_i}) > (k_{b_i} - k_{b_{i-1}}) \tag{3.3} \]

We took the larger of these two distances, i.e. $k_{b_{i+1}} - k_{b_i}$, to evaluate $\sigma_i$, ensuring maximum coverage of the subbands by the GF.

[Figure 3.2 (image not reproduced): pitch (mels) plotted against normal frequency using $f_{mel} = 2595 \log_{10}(1 + f/700)$, with the boundary points $k_{b_{i-1}}$, $k_{b_i}$ and $k_{b_{i+1}}$ marked on the frequency axis.]

Figure 3.2: Mel to normal frequency scale mapping
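The construction in eqns. 3.1 to 3.3 can be illustrated with a small sketch. Since eqns. 2.3 and 2.5 are not reproduced in this chapter, the boundary points below are simply assumed to be equally spaced on the mel scale between 0 Hz and half the sampling rate and are kept as fractional FFT bin positions; the function names, the 8 kHz sampling rate and the 256-point FFT are likewise illustrative assumptions, not the settings used in the experiments.

```python
import math

def hz_to_mel(f_hz):
    """Mel mapping quoted in Fig. 3.2: f_mel = 2595 log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

def boundary_points(num_filters, sample_rate, fft_size):
    """Q + 2 boundary points, equally spaced in mel (an assumption),
    returned as fractional FFT bin positions kb[0..Q+1]."""
    mel_max = hz_to_mel(sample_rate / 2.0)
    step = mel_max / (num_filters + 1)
    return [mel_to_hz(j * step) * fft_size / sample_rate
            for j in range(num_filters + 2)]

def gaussian_weight(k, center, sigma):
    """Eq. 3.1: unnormalised Gaussian, equal to 1 at its mean."""
    return math.exp(-((k - center) ** 2) / (2.0 * sigma ** 2))

def gaussian_filter(kb, i, alpha, num_bins):
    """Response of the ith GF (i = 1..Q) over FFT bins k = 1..num_bins."""
    sigma = (kb[i + 1] - kb[i]) / alpha  # eq. 3.2, using the larger spread
    return [gaussian_weight(k, kb[i], sigma) for k in range(1, num_bins + 1)]

# Illustrative settings only: Q = 20 filters, 8 kHz sampling, 256-point FFT.
kb = boundary_points(num_filters=20, sample_rate=8000, fft_size=256)
response = gaussian_filter(kb, i=10, alpha=2.0, num_bins=128)
print(f"filter 10: centre bin {kb[10]:.2f}, peak weight {max(response):.3f}")
```

With boundaries built this way, the inequality of eqn. 3.3 holds for every filter, which is why the right-hand distance is the one used for $\sigma_i$.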

[Figure 3.3 (image not reproduced): responses of a rectangular filter, a triangular filter and three Gaussian filters, with $\sigma_i = k_{b_{i+1}} - k_{b_i}$ (Gaussian filter 1), $\sigma_i = (k_{b_{i+1}} - k_{b_i})/2$ (Gaussian filter 2) and $\sigma_i = (k_{b_{i+1}} - k_{b_i})/3$ (Gaussian filter 3); horizontal axis: FFT coefficient index, with $k_{b_{i-1}}$, $k_{b_i}$ and $k_{b_{i+1}}$ marked; vertical axis: Amplitude, 0 to 1.]

Figure 3.3: Different shapes of filters used for MFCC implementation

3.2.1 Choice of α, the overlap parameter

In Figure 3.3, the responses of a TF, a Rectangular Filter (RF) and GF with three different values of $\sigma_i$ are shown. All three GF are centered around $k_{b_i}$ and offer gradually decaying weights to those parts of the energy spectrum which are further away from the center. The parameter $\alpha_i$ plays an important role in setting the variances of the GF: a higher $\alpha_i$ produces a lower variance and vice versa. For example, if $\alpha_i = 3$ then eqn. 3.2 turns into,

\[ k_{b_{i+1}} - k_{b_i} = 3\sigma_i \quad \forall\, i,\ i = 1, 2, \ldots, Q \tag{3.4} \]

so that the subband edge lies three standard deviations away from the mean and $\Pr[\,|k - k_{b_i}| \le 3\sigma_i\,] \approx 0.997$, where $\Pr[\cdot]$ denotes the probability [18] of an event. Therefore, $\alpha_i = 3$ implies almost 99.7% coverage of a particular subband by a GF. Similarly, for $\alpha_i = 2$, the GF places nearly 95% of its total weight inside a subband, since $\Pr[\,|k - k_{b_i}| \le 2\sigma_i\,] \approx 0.95$. Therefore, $\alpha_i = 2$ can provide higher correlation with adjacent subbands than the case $\alpha_i = 3$, for which a Gaussian window assigns only about 0.3% of its total weight to frequencies outside its own subband. One could also choose $\alpha_i = 1$, but then the variance is too high: only about 68% of a GF’s total weight (because $\Pr[\,|k - k_{b_i}| \le \sigma_i\,] \approx 0.68$) lies inside the subband of interest. Therefore, a trade-off can be maintained between the weight distribution within a subband and the amount of correlation by a proper choice of $\alpha_i$. Different values of α could also be chosen for different filters instead of the same α for all the filters in a filter bank. Three different kinds of Gaussian filter banks are developed by varying $\alpha_i$. For each of these cases (see Fig. 3.4, second to fourth panes), a fixed value of $\alpha_i$ is chosen for all the filters present in the filter bank.
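The coverage figures quoted above follow from the area of a Gaussian within a given number of standard deviations of its mean, which can be checked with the error function. The sketch below treats the subband as symmetric about the filter mean, which is an idealisation (eqn. 3.3 shows that the two sides differ), and the function name is illustrative.

```python
import math

def gaussian_coverage(alpha):
    """Fraction of a Gaussian's area within +/- alpha standard deviations
    of its mean, i.e. erf(alpha / sqrt(2))."""
    return math.erf(alpha / math.sqrt(2.0))

for alpha in (1, 2, 3):
    inside = 100.0 * gaussian_coverage(alpha)
    print(f"alpha = {alpha}: {inside:.1f}% inside, {100.0 - inside:.1f}% outside")
```

The share of weight falling outside the subband is what couples a filter to its neighbours, so lowering α trades in-band concentration for correlation with adjacent subbands.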

[Figure 3.4 (image not reproduced): five panes showing, from top to bottom, the TF based MFCC filter bank and the Gaussian filter banks for α = 3, α = 2, α = 1 and varying α; horizontal axis: Frequency (Hz), 0 to 4000; vertical axis: filter weight, 0 to 1.]

Figure 3.4: Gaussian filters realized in mel-scale with different variances

We also propose here another GF bank (see Fig. 3.4, bottom pane) for which we take a different $\alpha_i$ for each filter in the filter bank. The objective is to control the correlation variably according to each filter’s spread, i.e. $k_{b_{i+1}} - k_{b_{i-1}}$, in the frequency scale. Note that due to the non-linear mapping of $f_{mel}$ to $f$ (eqn. 2.5, see Fig. 3.2) in MFCC, the filter’s spread increases progressively as we move from lower to higher values along the frequency axis. Therefore, the slopes of the TF become smaller as $f$ increases. A higher slope provides lesser tapering, which cannot compensate the artifacts due to side effects [22] in the spectral domain. For this reason, we choose lower α, i.e. higher variances, for the GF that are placed in the lower frequency range (around 1 kHz), such that they taper and correlate more with adjacent frequency components. In contrast, higher values of α have been applied to the GF located beyond 1 kHz, as these filters can smooth out an energy spectrum within a larger bandwidth. We have assigned α = 2 and α = 3 to the first and the last filter (20th filter) respectively, while for the rest of the GF, α is varied linearly and is given by the following relation,

\[ \alpha_i = 2 + \frac{d_i - d_1}{d_Q - d_1}, \quad i = 1, 2, \ldots, Q \tag{3.5} \]

where $d_i = k_{b_{i+1}} - k_{b_i}$ and $Q = 20$. The varying α based GF allow better control over the correlations with neighboring filters.
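As a small worked illustration of eqn. 3.5, the sketch below maps a list of boundary points to the per-filter values of α. The boundary values are toy numbers chosen so that the spreads grow with frequency, as they do on the mel scale; they are not the boundaries of the actual 20-filter bank, and the function name is illustrative.

```python
def varying_alpha(kb):
    """Eq. 3.5: alpha_i = 2 + (d_i - d_1) / (d_Q - d_1), i = 1..Q,
    where d_i = kb[i + 1] - kb[i] and kb holds the Q + 2 boundary points."""
    q = len(kb) - 2
    d = [kb[i + 1] - kb[i] for i in range(1, q + 1)]  # d_1 .. d_Q
    span = d[-1] - d[0]
    if span == 0:
        raise ValueError("all filters have the same spread; eq. 3.5 is undefined")
    return [2.0 + (d_i - d[0]) / span for d_i in d]

# Toy boundary points (Q = 4) whose spacing widens with frequency.
toy_kb = [0.0, 1.0, 3.0, 6.0, 10.0, 15.0]
alphas = varying_alpha(toy_kb)
print([round(a, 3) for a in alphas])
```

The first filter receives α = 2 and the last α = 3, with the intermediate values following the relative spread of each filter.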

Finally, the cepstral vector using any type of GF can be calculated from the sum of products between the energy spectrum and the filters’ responses (eqn. 3.1) followed by DCT, which are similar to eqns. 2.6 & 2.7 respectively. Therefore, we can write,

\[ e^g(i) = \sum_{k=1}^{M_s/2} |Y(k)|^2 \cdot \psi_i^g(k) \tag{3.6} \]

\[ C_m^g = \sqrt{\frac{2}{Q}} \sum_{l=0}^{Q-1} \log\left[e^g(l+1)\right] \cdot \cos\left[ m \cdot \left(\frac{2l-1}{2}\right) \cdot \frac{\pi}{Q} \right] \tag{3.7} \]

where ‘g’ in (3.7) denotes the GF based cepstral vector. This new cepstral vector will be referred to throughout the rest of this thesis as Gaussian Mel-frequency Cepstral Coefficients (GMFCC).

Now, it is not incorrect to think that the DCT can decorrelate the log filter bank outputs, among which correlations have already been introduced through overlapped GF. However, the GF introduce this correlation by taking evidence from other subbands simultaneously while calculating the weighted average for a particular subband. On the other hand, filters with sharp transitions, i.e. those of triangular and rectangular shapes, do not take this evidence (they assign zero weights outside the bandwidth of interest) when they are imposed on a certain section of the frequency scale in order to perform a weighted average. Thus, the DCT can decorrelate the log filter bank outputs obtained from different sections (some spectral lines are shared by consecutive filters) placed along the frequency axis, but it cannot interpret the way these filter bank outputs have been generated by a set of band-pass filters, i.e. it is indifferent to the shape of the filters. Although the DCT does decorrelate the log filter bank outputs, its major contribution is to approximate the functionality of Principal Component Analysis by compactly representing the variation within a frame of speech through the first few eigenvectors, whose directional cosines are similar to a cosine series expansion [23], [24].
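The two steps of eqns. 3.6 and 3.7 can be sketched as follows. This is a minimal illustration on a synthetic power spectrum with hypothetical filter means and standard deviations; the range of the cepstral index m and all numerical values are assumptions, and the cosine argument uses the (2l − 1)/2 term exactly as it appears in eqn. 3.7.

```python
import math

def filterbank_energies(power_spectrum, centers, sigmas):
    """Eq. 3.6: e(i) = sum over k = 1..Ms/2 of |Y(k)|^2 * psi_i(k)."""
    energies = []
    for center, sigma in zip(centers, sigmas):
        energy = sum(p * math.exp(-((k - center) ** 2) / (2.0 * sigma ** 2))
                     for k, p in enumerate(power_spectrum, start=1))
        energies.append(energy)
    return energies

def cepstrum(energies, num_coeffs):
    """Eq. 3.7: DCT of the log filter bank outputs; energies[l] is e(l + 1)."""
    q = len(energies)
    scale = math.sqrt(2.0 / q)
    return [scale * sum(math.log(energies[l])
                        * math.cos(m * (2 * l - 1) / 2.0 * math.pi / q)
                        for l in range(q))
            for m in range(num_coeffs)]

# Synthetic, strictly positive power spectrum over 64 bins (illustrative).
power_spectrum = [1.0 + 0.5 * math.cos(0.2 * k) for k in range(1, 65)]
centers = [8.0 * (i + 1) for i in range(6)]  # hypothetical GF means
sigmas = [4.0] * 6                           # hypothetical GF standard deviations
energies = filterbank_energies(power_spectrum, centers, sigmas)
coeffs = cepstrum(energies, num_coeffs=4)
print([round(c, 4) for c in coeffs])
```

Because every Gaussian weight is positive, each filter output draws on the whole spectrum, which is the cross-subband evidence discussed above; a triangular or rectangular filter would zero out everything beyond its base.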

3.3 Comparative Performances of Different Feature Sets under Mel Scale

In this section, a performance comparison is made between RF, conventional TF and four kinds of Gaussian filters in an SI application on two databases. Tables 3.1 and 3.2 show the performances of the different filter shapes. For both the databases, RF based MFCC performs worst, while varying α based GMFCC shows the best performance over different model orders. For $\alpha_i = 2$ and $\alpha_i = 3$ the SI accuracies are significantly better than those of conventional as well as RF based MFCC, specifically for lower order models. Between the cases of $\alpha_i = 2$ and $\alpha_i = 3$, the former performs better than the latter as expected, because $\alpha_i = 2$ yields a higher variance, which gives more correlation with nearby frequency components. However, when $\alpha_i = 1$, the performances drop below the identification rates shown by normal MFCC. This might be due to excessive correlation/coupling between the subbands, which destroys zone specific local information such as prominent formant peaks of the energy spectrum. As GF with such high variances take evidence from distant subbands, SI performance is not expected to improve. This also supports the findings of Laurent et al. in [1], which have shown that the correlations between distant subbands are less important for SI than the correlations between close subbands. The tables also exhibit the superior performance of varying α based GMFCC, which outperforms the other kinds of GMFCC mentioned above. The results thus justify our hypothesis of choosing $\alpha_i$ adaptively depending upon the relative spreads of the bases of the TF.

Table 3.1: SI performances of various shapes of filters in mel-scale on YOHO database (identification accuracy in %).

| M | RF (a) | TF (b) | GF, $\alpha_i = 1$ | GF, $\alpha_i = 2$ | GF, $\alpha_i = 3$ | GF, varying α |
|---|---|---|---|---|---|---|
| 2 | 68.26 | 74.31 | 72.46 | 77.25 | 76.85 | 79.82 |
| 4 | 80.25 | 84.86 | 83.47 | 89.18 | 86.90 | 90.31 |
| 8 | 87.92 | 90.69 | 90.00 | 93.42 | 92.10 | 94.66 |
| 16 | 92.19 | 94.20 | 93.50 | 95.53 | 94.80 | 96.50 |
| 32 | 94.84 | 95.67 | 95.42 | 96.70 | 96.38 | 97.19 |
| 64 | 96.01 | 96.79 | 96.30 | 97.19 | 97.16 | 97.54 |

(a) Rectangular Filter
(b) Triangular Filter, i.e. conventional MFCC

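Table 3.2: SI performances of various shapes of filters in mel-scale on POLYCOST database (identification accuracy in %; RF: rectangular filter, TF: triangular filter).

| M | RF | TF | GF, $\alpha_i = 1$ | GF, $\alpha_i = 2$ | GF, $\alpha_i = 3$ | GF, varying α |
|---|---|---|---|---|---|---|
| 2 | 60.61 | 63.93 | 60.61 | 64.73 | 64.10 | 66.05 |
| 4 | 71.22 | 72.94 | 72.47 | 75.20 | 73.56 | 76.53 |
| 8 | 75.20 | 77.85 | 77.59 | 78.65 | 77.93 | 80.11 |
| 16 | 76.26 | 77.85 | 77.59 | 78.80 | 78.65 | 80.24 |

3.4 Application of Gaussian Filter to Inverted Mel Scale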

We can extend the idea of the GF to the IMFCC based system also, as the shapes of the filters are independent of their placement in the frequency scale. Gaussian filters can easily be applied on the inverted mel-scale by changing eqns. (3.2) and (3.5). In the inverted mel-scale the relation 3.3 is reversed, i.e.

\[ (\hat{k}_{b_{i+1}} - \hat{k}_{b_i}) < (\hat{k}_{b_i} - \hat{k}_{b_{i-1}}) \tag{3.8} \]

[Figure 3.5 (image not reproduced): five panes showing, from top to bottom, the TF based IMFCC filter bank and the Gaussian filter banks for α = 3, α = 2, α = 1 and varying α; horizontal axis: Frequency (Hz), 0 to 4000; vertical axis: filter weight, 0 to 1.]

Figure 3.5: Gaussian filters realized in inverted mel-scale with different variances

Therefore, the SD for the inverted mel-scale is defined as,

\[ \hat{\sigma}_i = \frac{\hat{k}_{b_i} - \hat{k}_{b_{i-1}}}{\alpha_i} \tag{3.9} \]

Accordingly, we change eqn. (3.5) so that it sets $\alpha_1 = 3$ and $\alpha_{20} = 2$ for the first and last (i.e. 20th) filter respectively. For the rest of the filters, α is varied linearly. This gives a completely reversed equation, consistent with the inequality relation shown by eqn. 3.8. It is given by,

\[ \alpha_i = 2 + \frac{\hat{d}_Q - \hat{d}_i}{\hat{d}_Q - \hat{d}_1}, \quad i = 1, 2, \ldots, Q \tag{3.10} \]

where $\hat{d}_i = \hat{k}_{b_i} - \hat{k}_{b_{i-1}}$. Note that this flip-over in equation (3.10) is required for the inverted structure of any IMFCC filter bank (see Fig. 3.5 & 3.6), which needs more correlation in the higher frequency region, as the filters are more densely placed there and show lesser tapering compared to their low frequency counterparts. Similar to the abbreviation used for GF based MFCC, GIMFCC will be used throughout the rest of the thesis as the short form for Gaussian filter based inverted mel-frequency cepstral coefficients.

[Figure 3.6 (image not reproduced): pitch (mels) plotted against normal frequency for the mel-scale and the inverted mel-scale, with the boundary points of the mel-scale ($k_b$) and of the inverted mel-scale (labelled $k_{ib}$ in the figure) marked on the frequency axis.]

Figure 3.6: Mel and inverted mel to normal frequency scale mapping

Now, following the same relations given by eqns. 3.1, 3.6, and 3.7, one can derive the feature set GIMFCC by the following equations:

\[ \hat{\psi}_i^g(k) = e^{-\frac{(k - \hat{k}_{b_i})^2}{2\hat{\sigma}_i^2}} \tag{3.11} \]

\[ \hat{e}^g(i) = \sum_{k=1}^{M_s/2} |Y(k)|^2 \cdot \hat{\psi}_i^g(k) \tag{3.12} \]

\[ \hat{C}_m^g = \sqrt{\frac{2}{Q}} \sum_{l=0}^{Q-1} \log\left[\hat{e}^g(l+1)\right] \cdot \cos\left[ m \cdot \left(\frac{2l-1}{2}\right) \cdot \frac{\pi}{Q} \right] \tag{3.13} \]

where all the symbols have their usual meaning.
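The reversed relations of eqns. 3.9 and 3.10 can be illustrated in the same way as for the mel scale. The boundary points below are toy values whose spacing shrinks with frequency, as required by eqn. 3.8; how the inverted boundaries are actually obtained is defined in chapter 2 and is not reproduced here, and the function names are illustrative.

```python
def inverted_alphas(kb_hat):
    """Eq. 3.10: alpha_i = 2 + (d_Q - d_i) / (d_Q - d_1), i = 1..Q,
    where d_i = kb_hat[i] - kb_hat[i - 1]."""
    q = len(kb_hat) - 2
    d_hat = [kb_hat[i] - kb_hat[i - 1] for i in range(1, q + 1)]
    span = d_hat[-1] - d_hat[0]
    if span == 0:
        raise ValueError("all filters have the same spread; eq. 3.10 is undefined")
    return [2.0 + (d_hat[-1] - d_i) / span for d_i in d_hat]

def inverted_sigmas(kb_hat):
    """Eq. 3.9: sigma_i = (kb_hat[i] - kb_hat[i - 1]) / alpha_i."""
    alphas = inverted_alphas(kb_hat)
    return [(kb_hat[i] - kb_hat[i - 1]) / alpha
            for i, alpha in enumerate(alphas, start=1)]

# Toy inverted-scale boundary points (Q = 4): spacing narrows with frequency.
toy_kb_hat = [0.0, 5.0, 9.0, 12.0, 14.0, 15.0]
inv_alphas = inverted_alphas(toy_kb_hat)
inv_sigmas = inverted_sigmas(toy_kb_hat)
print([round(a, 3) for a in inv_alphas], [round(s, 3) for s in inv_sigmas])
```

Here the first (widest) filter receives α = 3 and the last (narrowest) receives α = 2, the mirror image of the mel-scale assignment.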

3.5 Comparative Performances of Different IMFCC Feature Sets

Tables 3.3 and 3.4 describe the performance of different IMFCC feature sets using various shapes of filters. For both the databases, the trends of the results are similar to those shown by the different implementations of MFCC, which have already been described in tables 3.1 & 3.2. However, the improvements of the GIMFCC that use $\alpha_i = 2$, $\alpha_i = 3$, and varying α over TF based IMFCC are not as significant as the improvements of GMFCC over MFCC.

Table 3.3: SI performances of various shapes of filters in inverted mel-scale on YOHO database (identification accuracy in %; RF: rectangular filter, TF: triangular filter).

| M | RF | TF | GF, $\alpha_i = 1$ | GF, $\alpha_i = 2$ | GF, $\alpha_i = 3$ | GF, varying α |
|---|---|---|---|---|---|---|
| 2 | 72.72 | 78.04 | 77.25 | 78.13 | 78.10 | 78.30 |
| 4 | 83.48 | 86.50 | 85.71 | 87.10 | 87.00 | 87.23 |
| 8 | 90.18 | 91.99 | 90.79 | 92.23 | 92.12 | 92.55 |
| 16 | 92.72 | 94.15 | 93.60 | 94.26 | 94.20 | 94.64 |
| 32 | 94.02 | 95.22 | 94.99 | 95.24 | 95.22 | 95.45 |
| 64 | 95.13 | 95.76 | 95.13 | 95.80 | 95.79 | 96.05 |

Table 3.4: SI performances of various shapes of filters in inverted mel-scale on POLYCOST database (identification accuracy in %; RF: rectangular filter, TF: triangular filter).

| M | RF | TF | GF, $\alpha_i = 1$ | GF, $\alpha_i = 2$ | GF, $\alpha_i = 3$ | GF, varying α |
|---|---|---|---|---|---|---|
| 2 | 53.71 | 55.97 | 54.59 | 56.55 | 56.00 | 56.90 |
| 4 | 66.71 | 68.04 | 67.83 | 68.71 | 68.55 | 69.10 |
| 8 | 73.74 | 76.26 | 74.13 | 77.39 | 76.99 | 77.59 |
| 16 | 75.33 | 77.06 | 76.11 | 77.49 | 77.11 | 77.65 |

From the analysis on both the scales, i.e. mel and inverted mel-scale, the following remarks can be made.

• RF based feature sets do not perform well in the SI task due to their extremely crisp, non-tapering nature, which results in a poor approximation of the spectral envelope.

• GMFCC and GIMFCC with relatively lower variances (α = 2, 3 and the varying α kind, where 2 ≤ α ≤ 3) outperform the other MFCC & IMFCC implementations over different model orders.

• Varying α based GMFCC and GIMFCC outperform TF as well as other GF based feature sets, because the overlap in this case is controlled adaptively for the various filters in order to introduce the correlation.

Therefore, from the next section onwards, further analysis will be done only on varying α based GMFCC and GIMFCC, as they perform better than any other GF based feature sets. For simplicity, we refer to these feature sets as GMFCC and GIMFCC, omitting the term ‘varying α’.

3.6 Analysis of Class Separability

The effectiveness of a feature extraction scheme for speaker recognition depends mainly on how well the generated features separate the different speaker classes and suppress speech dependent information. In this section, we describe a measure based on the Linear Discriminant Analysis (LDA) technique [25] to analyze the speaker separability of different feature extraction schemes. The measure is often called divergence [26], [27], [28], [29] in the literature. Divergence gives linear class separability in multidimensional space in the Euclidean sense. While dealing with multiple feature sets, one could check the potential [25] of each feature set individually before determining the actual performance of the system through extensive training and testing phases. A higher value of divergence indicates a better feature set, although it can ensure only linear separability between the speakers. In this section, we evaluate the discrimination ability of individual feature sets by calculating divergence. The calculation of divergence for a feature set is described next.

Let there be $S$ speakers who have $N_1, N_2, \ldots, N_S$ feature vectors of dimension $D$, respectively, as training data. Divergence is defined as $\mathrm{trace}(\mathbf{W}^{-1}\mathbf{B})$, where $\mathbf{B}$ is the between-class covariance matrix and $\mathbf{W}$ the pooled within-class covariance matrix. These matrices are symmetric and can be computed from the training data in the following manner.

\[ \mathbf{B} = \frac{1}{S} \sum_{n=1}^{S} (\mu^n - \mu)(\mu^n - \mu)^t \tag{3.14} \]

and

\[ \mathbf{W} = \frac{1}{S} \sum_{n=1}^{S} \mathbf{W}_n \tag{3.15} \]

where $\mu^n$ and $\mathbf{W}_n$ are the mean vector and covariance matrix of the $n$th speaker, respectively, and $\mu$ is the overall mean. These are given by

\[ \mu^n = \frac{1}{N_n} \sum_{k=1}^{N_n} x_k^n \tag{3.16} \]

\[ \mathbf{W}_n = \frac{1}{N_n} \sum_{k=1}^{N_n} (x_k^n - \mu^n)(x_k^n - \mu^n)^t \tag{3.17} \]

\[ \mu = \frac{1}{S} \sum_{n=1}^{S} \mu^n \tag{3.18} \]

where $x_k^n$ is the $k$th pattern from the speaker $n$. Divergence is based on Fisher’s linear discriminant analysis, where a projection matrix, say $\mathbf{A}$, is chosen such that the projected feature vectors belonging to one class are close together in feature space and are separated from the features of other classes. This is achieved by computing a projection that maximizes the ratio of the between-class variance to the within-class variance in the projected space,

\[ \mathbf{A} = \arg\max_{\mathbf{L}} \frac{\mathbf{L}\mathbf{B}\mathbf{L}^t}{\mathbf{L}\mathbf{W}\mathbf{L}^t} \tag{3.19} \]

The columns of $\mathbf{A}$ are the eigenvectors corresponding to the eigenvalues of $\mathbf{W}^{-1}\mathbf{B}$. Usually, for any dimensionality reduction problem [26], a $D$ dimensional pattern can be transformed or projected into a relatively lower dimensional space using the eigenvectors in $\mathbf{A}$ which correspond to the $\hat{D}$ largest eigenvalues of $\mathbf{W}^{-1}\mathbf{B}$, where $\hat{D} \le D$. In the special case when $\hat{D} = D$, i.e. when no dimensional reduction has been done, we can measure the effectiveness of a feature set. Therefore, we calculate divergence directly from the trace of the matrix $\mathbf{W}^{-1}\mathbf{B}$, which is equal to the sum of the eigenvalues of that matrix. So, we now have,

\[ \text{Divergence} = \mathrm{trace}(\mathbf{W}^{-1}\mathbf{B}) \tag{3.20} \]
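The divergence of eqn. 3.20 can be computed directly from eqns. 3.14 to 3.18. The sketch below does so in plain Python on a toy two-speaker, two-dimensional data set; the helper names, the Gauss-Jordan solver used to form $\mathbf{W}^{-1}\mathbf{B}$ and the data are illustrative assumptions rather than the implementation used for Table 3.5.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def outer(u, v):
    return [[a * b for b in v] for a in u]

def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mat_scale(a, c):
    return [[c * x for x in row] for row in a]

def solve(w, b):
    """Return W^{-1} B using Gauss-Jordan elimination with partial pivoting."""
    d = len(w)
    aug = [list(w[r]) + list(b[r]) for r in range(d)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("within-class matrix W is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(d):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[d:] for row in aug]

def divergence(speakers):
    """speakers: one list of D-dimensional feature vectors per speaker."""
    s, d = len(speakers), len(speakers[0][0])
    means = [mean_vector(x_n) for x_n in speakers]       # eq. 3.16
    mu = mean_vector(means)                              # eq. 3.18
    zero = [[0.0] * d for _ in range(d)]
    b, w = zero, zero
    for x_n, mu_n in zip(speakers, means):
        diff = [a - c for a, c in zip(mu_n, mu)]
        b = mat_add(b, outer(diff, diff))                # summand of eq. 3.14
        w_n = zero
        for x in x_n:
            dx = [a - c for a, c in zip(x, mu_n)]
            w_n = mat_add(w_n, outer(dx, dx))
        w = mat_add(w, mat_scale(w_n, 1.0 / len(x_n)))   # eq. 3.17 into 3.15
    b, w = mat_scale(b, 1.0 / s), mat_scale(w, 1.0 / s)
    w_inv_b = solve(w, b)
    return sum(w_inv_b[i][i] for i in range(d))          # eq. 3.20

# Toy data: two speakers with identical scatter, means 4 units apart.
speaker_a = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
speaker_b = [[x + 4.0, y] for x, y in speaker_a]
print(divergence([speaker_a, speaker_b]))
```

Moving the two speakers further apart while keeping their scatter fixed increases the value, which is the sense in which a higher divergence indicates better linear separability.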

Table 3.5 shows the divergence values of the feature sets, namely RF, TF and GF based MFCC and IMFCC, on two databases.

Table 3.5: Divergence analysis for different feature sets on YOHO and POLYCOST databases.

| Filter shape | YOHO, MFCC | YOHO, IMFCC | POLYCOST, MFCC | POLYCOST, IMFCC |
|---|---|---|---|---|
| RF (rectangular) | 1.90 | 1.30 | 2.66 | 1.69 |
| TF (triangular) | 2.10 | 1.30 | 2.89 | 1.71 |
| GF (varying α) | 2.12 | 1.31 | 2.95 | 1.77 |

From table 3.5 it is observed that the discriminative abilities of a GF based feature set, as measured by divergence, are better than those of RF and TF based feature sets. Therefore, it is expected that the GF based feature set will also perform better than the other feature sets at the time of actual testing. However, divergence is a rudimentary measure that gives only a rough estimate of the performance of a feature set (assuming linear separability) without knowing which classifier is to be used for identifying the speakers. Besides, divergence gives no guarantee about non-linear class separability, i.e. about cases where the data from different pattern classes are mixed in a very complex manner.

3.7 Fusion of GMFCC & GIMFCC using PQ when P = 2

In this section, PQ based fusion strategy is applied for merging GMFCC and GIMFCC based systems. GMFCC and GIMFCC can be considered as natural extensions of MFCC and IMFCC by changing only their window shapes from triangular to Gaussian.

Consequently, GMFCC and GIMFCC do not depend on each other, yet they share common blocks for yielding their respective cepstral parameters, just as MFCC and IMFCC do. This motivates the use of the PQ based fusion scheme on these two modified feature sets in the same way as was done for merging the MFCC and IMFCC streams. The scheme has already been illustrated in section 2.7. It is worth mentioning again that when the PQ decimation rate is 2, i.e. P = 2, no pruning is actually done; instead, incoming feature vectors are alternately sent via a router to the two stream models (GMFCC and GIMFCC here, in place of MFCC and IMFCC), so that each model uses only half of the total feature vectors and only one set of M Gaussians is involved at a time, because only one of the two streams is enabled. The performances of the fused system after merging the GMFCC and GIMFCC streams are shown next (tables 3.6 & 3.7) for both databases.

Table 3.6: Performances using GMFCC & GIMFCC feature sets in normal, pre-quantized and fused mode for YOHO database for (M = 64).

| M | PIA (%) GMFCC (T vec.) | Tgm (s) | PIA (%) GIMFCC (T vec.) | PIA (%) GMPQ (T/2 vec.) | PIA (%) GIMPQ (T/2 vec.) | PIA (%) FUSED (T vec.) | Tgf (s) | SF (Tgm/Tgf) |
|---|---|---|---|---|---|---|---|---|
| 2 | 79.82 | 6.21 | 78.29 | 79.64 | 77.99 | 84.86 | 6.22 | 1:1 |
| 4 | 90.31 | 7.12 | 87.23 | 90.11 | 86.87 | 92.43 | 7.12 | 1:1 |
| 8 | 94.42 | 8.72 | 92.55 | 94.66 | 92.32 | 95.76 | 8.71 | 1:1 |
| 16 | 96.50 | 11.79 | 94.64 | 96.32 | 94.57 | 96.88 | 11.78 | 1:1 |
| 32 | 97.19 | 18.19 | 95.49 | 97.12 | 95.38 | 97.43 | 18.21 | 1:1 |
| 64 | 97.54 | 25.54 | 96.05 | 97.50 | 96.05 | 97.90 | 25.53 | 1:1 |

PIA: Percentage of Identification Accuracy. Tgm: time required for evaluating the GMFCC stream (single stream). Tgf: time required for evaluating the fused stream developed from GMFCC-GIMFCC. SF: Speed-up Factor = Tgm/Tgf. GMPQ: PQ applied in the GMFCC stream. GIMPQ: PQ applied in the GIMFCC stream.
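The alternate routing used when P = 2 can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names, the diagonal-covariance GMM parameterisation as a `(weights, means, variances)` tuple, and the equal-weight sum of the two stream scores are all assumptions made for illustration (the actual combination rule is the one given in section 2.7). The point it shows is that each speaker model sees only T/2 vectors, so the total number of Gaussian evaluations equals that of a single stream.

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Per-frame log-likelihood of frames x (T, D) under a diagonal-covariance GMM."""
    diff = x[:, None, :] - means[None, :, :]                      # (T, M, D)
    log_comp = (np.log(weights)[None, :]
                - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)[None, :]
                - 0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2))
    peak = log_comp.max(axis=1, keepdims=True)                    # log-sum-exp for stability
    return (peak + np.log(np.exp(log_comp - peak).sum(axis=1, keepdims=True)))[:, 0]

def fused_scores(feats_a, feats_b, models_a, models_b):
    """Router for P = 2: even frames go to stream A models, odd frames to stream B models.

    feats_a / feats_b: (T, D) features of the same T frames from the two front ends
    (e.g. GMFCC and GIMFCC). models_a / models_b: one GMM tuple per speaker.
    The equal-weight sum of the two average log-likelihoods is an assumption.
    """
    xa, xb = feats_a[0::2], feats_b[1::2]        # each stream uses about T/2 vectors
    scores = [gmm_loglik(xa, *ga).mean() + gmm_loglik(xb, *gb).mean()
              for ga, gb in zip(models_a, models_b)]
    return np.array(scores)                      # identified speaker = argmax
```

The identified speaker is the index of the largest fused score; because only one stream is enabled per frame, the cost per test utterance stays close to that of a single-stream system, which is consistent with the 1:1 speed-up factors in the table.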

Table 3.7: Performances using GMFCC & GIMFCC feature sets in normal, pre-quantized and fused mode for POLYCOST database for (M = 16).

| M | PIA (%) GMFCC (T vec.) | Tgm (s) | PIA (%) GIMFCC (T vec.) | PIA (%) GMPQ (T/2 vec.) | PIA (%) GIMPQ (T/2 vec.) | PIA (%) FUSED (T vec.) | Tgf (s) | SF (Tgm/Tgf) |
|---|---|---|---|---|---|---|---|---|
| 2 | 66.05 | 36.21 | 56.90 | 65.12 | 56.90 | 68.57 | 36.21 | 1:1 |
| 4 | 76.53 | 40.55 | 69.10 | 75.60 | 68.17 | 77.45 | 40.57 | 1:1 |
| 8 | 80.11 | 49.62 | 77.59 | 79.84 | 76.53 | 81.43 | 49.60 | 1:1 |
| 16 | 80.24 | 66.71 | 77.65 | 79.84 | 77.06 | 81.70 | 66.74 | 1:1 |

The performances of the normal, pre-quantized and fused schemes for the GMFCC and GIMFCC based systems are shown in tables 3.6 & 3.7. The SI accuracy achieved by the fused system shows considerable improvement over both the GMFCC and GIMFCC based systems. The fused system based on GMFCC-GIMFCC also shows a lower error rate than the combined system that uses the MFCC-IMFCC paradigm (ref. tables 3.8 & 3.9). It can also be observed that not only are the absolute SI performances shown by GMFCC and GIMFCC better than their corresponding baselines, but the pre-quantized GMFCC and GIMFCC based SI systems also exhibit improvements over the similar systems developed from the MFCC and IMFCC feature sets. We also compute the time required by the GMFCC system and by the fused system, and observe that the fused system takes almost the same time to reach a decision as GMFCC takes on its own. In addition, the time required by GMFCC alone is comparable to the time taken by MFCC; this indicates nearly equal complexities in both paradigms. Here also, we choose the model orders as 64 and 16 for YOHO and POLYCOST respectively, for the same reason as mentioned in section 2.5.

Table 3.8: Comparative SI performances between two fused systems for YOHO Database.

| M | PIA (%) MFCC + IMFCC | PIA (%) GMFCC + GIMFCC |
|---|---|---|
| 2 | 82.81 | 84.86 |
| 4 | 90.82 | 92.43 |
| 8 | 94.80 | 95.76 |
| 16 | 96.27 | 96.88 |
| 32 | 97.19 | 97.43 |
| 64 | 97.64 | 97.90 |

Table 3.9: Comparative SI performances between two fused systems for POLYCOST Database.

| M | PIA (%) MFCC + IMFCC | PIA (%) GMFCC + GIMFCC |
|---|---|---|
| 2 | 68.04 | 68.57 |
| 4 | 76.53 | 77.45 |
| 8 | 80.77 | 81.43 |
| 16 | 81.43 | 81.70 |

3.8

Decimation with P > 2 on GMFCC-GIMFCC based Fused System

This section describes a similar analysis using the GMFCC and GIMFCC feature sets. Tables 3.10 and 3.11 show the SI rate for the fused system using the GMFCC & GIMFCC feature sets when more frames are skipped (P > 2). Without sacrificing the baseline accuracy (GMFCC based system), the fused system shows 3:1 and 6:1 speed-up factors for the YOHO and POLYCOST databases, respectively. These are somewhat lower than those of the earlier combined system, which showed 4:1 and 8:1 speed-up factors for the same databases. However, our aim is to show the feasibility of this PQ based fusion strategy on the GMFCC and GIMFCC based system. It is interesting to note that the GMFCC-GIMFCC based fused system can offer up to five-fold and nine-fold computational benefits when the comparison is made against the MFCC based system. Figures 3.7 & 3.8 show time versus identification rate for the two fused systems. The fused system developed using the GF based paradigm performs better than the TF based combined system.

Table 3.10: Reduction in computational complexity with increasing P for fusion scheme on YOHO Database (M = 64). Average time (Tgm) for only GMFCC (single stream) is 25.54 sec.

| Total vec. used | PIA (%) GMPQ | PIA (%) GIMPQ | PIA (%) FUSED | Tgf (s) | SF (Tgm/Tgf) |
|---|---|---|---|---|---|
| T | 97.50 | 96.05 | 97.90 | 25.53 | 1:1 |
| T/2 | 97.23 | 95.53 | 97.84 | 12.55 | 2:1 |
| T/3 | 96.94 | 94.75 | 97.68 | 8.44 | 3:1 |
| T/4 | 96.47 | 94.06 | 97.52 | 6.33 | 4:1 |
| T/5 | 95.42 | 92.75 | 97.14 | 5.10 | 5:1 |
| T/6 | 94.93 | 91.12 | 96.75 | 4.25 | 6:1 |

Table 3.11: Reduction in computational complexity with increasing P for fusion scheme on POLYCOST Database (M = 16). Average time (Tgm) for only GMFCC (single stream) is 66.71 sec.

| Total vec. used | PIA (%) GMPQ | PIA (%) GIMPQ | PIA (%) FUSED | Tgf (s) | SF (Tgm/Tgf) |
|---|---|---|---|---|---|
| T | 79.84 | 77.85 | 81.70 | 66.74 | 1:1 |
| T/2 | 79.44 | 76.79 | 81.17 | 33.32 | 2:1 |
| T/3 | 78.12 | 76.26 | 81.10 | 22.21 | 3:1 |
| T/4 | 78.12 | 74.54 | 80.77 | 16.65 | 4:1 |
| T/5 | 76.26 | 73.47 | 80.64 | 13.42 | 5:1 |
| T/6 | 75.73 | 73.21 | 80.24 | 11.11 | 6:1 |
| T/7 | 74.93 | 70.29 | 79.58 | 9.50 | 7:1 |
| T/8 | 72.28 | 69.63 | 79.44 | 8.32 | 8:1 |
| T/9 | 72.15 | 69.63 | 78.65 | 7.40 | 9:1 |
| T/10 | 72.15 | 66.45 | 77.45 | 6.67 | 10:1 |
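The speed-up factors quoted in this section can be re-derived from the fused-system columns of Tables 3.10 and 3.11. The short sketch below is only an arithmetic check, not part of the original experiments: the dictionary layout and the function name `max_speedup` are hypothetical, and the baseline accuracies (97.54 % and 80.24 %) are the single-stream GMFCC figures from Tables 3.6 and 3.7.

```python
# Each row: (k, fused PIA in %, Tgf in s), where T/k vectors are used.
YOHO = {
    "baseline_pia": 97.54,   # GMFCC single stream, M = 64 (Table 3.6)
    "t_single": 25.54,       # Tgm in seconds
    "rows": [(1, 97.90, 25.53), (2, 97.84, 12.55), (3, 97.68, 8.44),
             (4, 97.52, 6.33), (5, 97.14, 5.10), (6, 96.75, 4.25)],
}
POLYCOST = {
    "baseline_pia": 80.24,   # GMFCC single stream, M = 16 (Table 3.7)
    "t_single": 66.71,
    "rows": [(1, 81.70, 66.74), (2, 81.17, 33.32), (3, 81.10, 22.21),
             (4, 80.77, 16.65), (5, 80.64, 13.42), (6, 80.24, 11.11),
             (7, 79.58, 9.50), (8, 79.44, 8.32), (9, 78.65, 7.40),
             (10, 77.45, 6.67)],
}

def max_speedup(table):
    """Largest Tgm/Tgf among rows whose fused accuracy is not below the baseline."""
    best = None
    for k, pia, t_fused in table["rows"]:
        if pia >= table["baseline_pia"]:
            speedup = table["t_single"] / t_fused
            if best is None or speedup > best[1]:
                best = (k, speedup)
    return best  # (k, speed-up factor)
```

Applied to the two tables, the check selects T/3 for YOHO and T/6 for POLYCOST, matching the 3:1 and 6:1 factors stated in the text.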

[Figure 3.7: Time vs. PIA for YOHO database for two different fused systems. Plot title: Comparative studies on different fused systems (Time vs. PIA for YOHO Database with M = 64). Axes: Average Time (Seconds) vs. Identification Accuracy (%). Curves: Fused System (MFCC-IMFCC), Fused System (GMFCC-GIMFCC), Baseline (MFCC), GMFCC. Points are labeled with the fraction of the total number T of test frames used.]

[Figure 3.8: Time vs. PIA for POLYCOST database for two different fused systems. Plot title: Comparative studies on fused systems (Time vs. PIA for POLYCOST Database with M = 16). Axes: Average Time (Seconds) vs. Identification Accuracy (%). Curves: Fused System (MFCC-IMFCC), Fused System (GMFCC-GIMFCC), Baseline (MFCC), GMFCC. Points are labeled with the fraction of the total number T of test frames used.]

3.9

Conclusions

This chapter investigates the effect of filter shapes in the SI application. A summary of the chapter is presented below:

• The chapter introduces the Gaussian filter as an averaging window in the frequency domain. Compared with the triangular filters used in MFCC, each Gaussian filter provides higher correlation among adjacent subbands.

• Four different kinds of Gaussian filters are realized using the mel scale, and experiments were conducted on these filters with models of various orders. Except for the case $\alpha_i = 1, \forall i \in \{1, 2, \ldots, Q\}$, the performances using Gaussian filters are better than those of the conventional triangular filter based system.

• We proposed a variable variance based Gaussian filter (controlled by $\alpha_i$, where $2 \le \alpha_i \le 3\ \forall i \in \{1, 2, \ldots, Q\}$) that gives better SI accuracies than the fixed variance based Gaussian filter. Adaptive variances control the amount of correlation according to the amount of tapering of the original triangular filter used for the MFCC based system.

• When a Gaussian filter with high variance (i.e. more tapered window, case $\alpha_i = 1$) is used, the performances drop below those of the baseline. The high variance based Gaussian window loses the triangular-like shape and averages the energy spectrum over a wider bandwidth. Therefore, a larger number of side frequency components is included; this causes the subband specific local information useful for identifying speakers to be averaged out.

• We have demonstrated the speaker identification performance using a rectangular filter as the averaging window and found that its performance is the worst among all the systems. The rectangular filter provides absolutely no tapering of the energy spectrum, causing abrupt transitions at both ends. Therefore, reliable estimation of a short-time spectral envelope could not be achieved through the weighted sum approach when the weights were taken from a rectangular filter.

• We also applied the Gaussian filters on the previously introduced inverted mel scale. The speaker identification accuracies are better than those of the triangular filter based IMFCC system. The results show trends similar to those of the mel scale Gaussian filter based systems. However, there is not much improvement over the baseline, as there might be a chance of flattening the sharp structure of higher order formants, which is very sensitive in the speaker recognition context.

• Validation of the feature sets at the early stages is performed using divergence, which can determine the linear separability of the speaker classes without going into an exhaustive training and testing procedure.

• These works have been extended further by merging the GMFCC and GIMFCC based systems using the PQ based fusion strategy. We show the feasibility of the PQ based fusion scheme by combining GMFCC and GIMFCC based speaker models. Without compromising the baseline accuracy, maximum speed-up factors of 3:1 and 6:1 are achieved on the YOHO and POLYCOST databases respectively. If MFCC is considered as the baseline, maximum speed-up factors of 5:1 and 9:1 can be achieved through the GF based realization of the filter banks. Moreover, the fused system shows significant improvement over GMFCC and MFCC when a speed-up factor of 1:1 is considered, i.e. when no frame pruning has been done.

References

[1] L. Besacier and J.-F. Bonastre, “Subband architecture for automatic speaker recognition,” Signal Processing, vol. 80, no. 7, pp. 1245-1259, Jul. 2000. (Cited in sections 3.1 and 3.3.)
[2] L. Besacier and J.-F. Bonastre, “Subband approach for automatic speaker recognition: Optimal division of the frequency domain,” in Proc. International Conf. on Audio- and Video-Based Biometric Person Authentication (AVBPA 1997), 1997, pp. 195-202. (Cited in section 3.1.)
[3] J. E. Higgins, R. I. Damper, and T. J. Dodd, “Information fusion for subband-HMM speaker recognition,” in Proc. INNS-IEEE International Jnt. Conf. on Neural Networks (IJCNN 2001), 2001, pp. 1504-1509. (Cited in section 3.1.)
[4] J. E. Higgins, T. J. Dodd, and R. I. Damper, “Application of multiple classifier techniques to subband speaker identification with an HMM/ANN system,” in Multiple Classifier Systems, International Workshop (MCS 2001), 2001, pp. 369-377. (Cited in section 3.1.)
[5] J. E. Higgins, R. I. Damper, and T. J. Dodd, “Improving speaker identification by trainable data fusion and subband processing techniques,” in Proc. IEEE Workshop on Automat. Identification Advanced Technologies (AutoID 2002), 2002, pp. 109-114. (Cited in section 3.1.)
[6] R. A. Finan, R. I. Damper, and A. T. Sapeluk, “Improved data modelling for text-dependent speaker recognition using sub-band processing,” Inter. Jnl. Speech Technology, vol. 4, no. 1, pp. 45-62, Mar. 2001. (Cited in section 3.1.)
[7] J. B. Allen, “How do humans process and recognize speech?,” IEEE Trans. on Speech and Audio Processing, vol. 2, no. 4, pp. 567-577, Oct. 1994. (Cited in section 3.1.)
[8] X. Lu and J. Dang, “An investigation of dependencies between frequency components and speaker characteristics for text-independent speaker identification,” Speech Commun., 2007, to be published. (Cited in section 3.1.)
[9] H. Bourlard and S. Dupont, “A new ASR approach based on independent processing and recombination of partial frequency bands,” in Proc. International Conf. on Spoken Language Process. (ICSLP 1996), 1996, pp. 426-429. (Cited in section 3.1.)
[10] S. Tibrewala and H. Hermansky, “Subband based recognition of noisy speech,” in Proc. IEEE International Conf. on Acoust., Speech, Signal Process. (ICASSP 1997), 1997, pp. 1255-1258. (Cited in section 3.1.)
[11] A. Morris, A. Hagen, and H. Bourlard, “The full-combination subbands approach to noise robust HMM/ANN-based ASR,” in Proc. Europ. Conf. on Speech Comm. Technol. (Eurospeech 1999), 1999, pp. 599-602. (Cited in section 3.1.)
[12] Y. C. Tam, Y. Cheung, and B. Mak, “Optimization of sub-band weights using simulated noisy speech in multi-band speech recognition,” in Proc. International Conf. on Spoken Language Processing (ICSLP 2000), 2000, pp. I-313-I-316. (Cited in section 3.1.)
[13] R. P. Lippmann, “Speech recognition by machines and humans,” Speech Commun., vol. 22, no. 1, pp. 1-15, Jul. 1997. (Cited in section 3.1.)
[14] F. Zheng and G. Zhang, “Integrating the energy information into MFCC,” in Proc. International Conf. on Spoken Language Processing (ICSLP 2000), 2000, pp. I-389-I-392. (Cited in section 3.1.1.)
[15] H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” J. Acoustical Society of America, vol. 87, no. 4, pp. 1738-1752, Apr. 1990. (Cited in section 3.1.1.)
[16] S. Chakroborty and G. Saha, “Improved Text-Independent Speaker Identification using Fused MFCC & IMFCC Feature Sets based on Gaussian Filter,” International Journal of Signal Processing, vol. 5, no. 1, pp. 11-19, Nov. 2007. (Cited in section 3.1.1.)
[17] S. Chakroborty and G. Saha, “Improved Closed Set Text-Independent Speaker Identification by Gaussian Filter based Mel-Frequency Cepstral Coefficients,” in Proc. IEEE Annual Conference (Indicon 2007), 2007. (Cited in section 3.1.1.)
[18] A. Papoulis and S. U. Pillai, Probability, Random Variables and Stochastic Processes, Tata McGraw-Hill Edition, 4th ed., 2002, pp. 72-122. (Cited in sections 3.1.1, 3.2 and 3.2.1.)
[19] E. Gjelsvik and K. K. Paliwal, “Use Of Spectral Subband Moments In MFCC Computation,” in Proc. of Fifth International Symposium on Signal Processing and its Applications (ISSPA 1999), 1999, pp. 637-640. (Cited in sections 3.1.1 and 3.2.)
[20] B. Mak, Y. C. Tam, and Q. Li, “Discriminative Auditory Features for Robust Speech Recognition,” in Proc. IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP 2002), 2002, pp. 381-384. (Cited in section 3.1.1.)
[21] B. Mak, Y. C. Tam, and R. Hsiao, “Discriminative Training of Auditory Filters of Different Shapes for Robust Speech Recognition,” in Proc. of the IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP 2003), 2003, pp. II-45-II-48. (Cited in section 3.1.1.)
[22] T. F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, Pearson Education Inc., First Indian Reprint, 2004, pp. 175-251. (Cited in section 3.2.1.)
[23] S. B. Davis and P. Mermelstein, “Comparison of Parametric Representation for Monosyllabic Word Recognition in Continuously Spoken Sentences,” IEEE Trans. Audio Speech and Signal Process., vol. ASSP-28, no. 4, pp. 357-365, Aug. 1980. (Cited in section 3.2.1.)
[24] L. C. W. Pols, “Spectral analysis and identification of Dutch vowels in monosyllabic words,” Ph.D. dissertation, Free University, Amsterdam, The Netherlands, 1966. (Cited in section 3.2.1.)
[25] S. Dharanipragada, U. H. Yapanel, and B. D. Rao, “Robust Feature Extraction for Continuous Speech Recognition using the MVDR Spectrum Estimation Method,” IEEE Trans. Audio, Speech and Language Process., vol. 15, no. 1, Jan. 2007. (Cited in section 3.6.)
[26] K. K. Paliwal, “Dimensionality reduction of the enhanced feature set for the HMM-based speech recognizer,” Digital Signal Process., vol. 2, no. 3, pp. 157-173, Jul. 1992. (Cited in section 3.6.)
[27] W. H. Abdulla and N. Kasabov, “Reduced Feature set Based parallel CHMM Speech Recognition System,” Information Sciences, vol. 156, no. 1-2, pp. 21-38, Nov. 2003. (Cited in section 3.6.)
[28] S. Nicholson, B. Milner, and S. Cox, “Evaluating feature set performance using the F-ratio and J-measures,” in Proc. of European Conf. on Speech Communication and Technology (EUROSPEECH 1997), 1997, pp. 413-416. (Cited in section 3.6.)
[29] B. S. Atal, “Automatic recognition of speakers from their voices,” Proc. of IEEE, vol. 64, no. 4, pp. 460-475, Apr. 1976. (Cited in section 3.6.)

CHAPTER 4

SVD-QRcp based Acoustic Feature Selection for Speaker Identification

Preface

This chapter presents a brief review of the major feature selection techniques used in the speaker recognition context. We then propose a new feature selection technique that avoids complex search based methods yet shows reasonably high accuracies. The proposed technique is applied to four kinds of feature sets, namely the baseline MFCC and the improved variants of it introduced in earlier chapters. The performances of the feature sets selected using the proposed technique are compared with those obtained using the F-Ratio based selection criterion and are found to be superior. As in the previous studies, the same two public databases, i.e. YOHO and POLYCOST, are used for presenting the results.

4.1

Introduction

Acoustic feature extraction is an indispensable front-end module for speech and speaker recognition applications. State-of-the-art speaker recognition applications use both time and frequency domain features to describe a speaker’s vocal tract characteristics. Examples of such feature extraction techniques are LPC, LPCC and MFCC. Usually, multiple features are extracted from a raw speech frame, and together they form a feature vector. When enough feature vectors are available for a speaker, a statistical model such as a GMM is used to represent them in a multidimensional hyperspace. A multidimensional model like GMM sees a feature vector as a point in a high dimensional space by considering the group or combined effect of the different features that represent the various dimensions.

The number of features, i.e. the number of dimensions, plays an important role in diverse pattern recognition applications. An important term in this context is the ‘curse of dimensionality’ [1]. The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons [1]. As more features are used, the feature dimension increases, which imposes severe requirements on computation and storage in both training and testing. The amount of training data needed to represent a speaker’s voice characteristics grows exponentially [2] with the dimension of the feature space. This severely restricts the usefulness of modeling procedures like GMM and higher order transforms. An application like SR needs dimensionality reduction [3] to avoid this curse. To reduce the dimension, the efficient features are often selected while the weak ones are discarded. Inefficient features confuse and over-fit the speaker models during training, and these over-fitted models cannot give the best performance, although they need extra offline memory space and greater computation at the time of testing.

Features are generally selected through a criterion before modeling starts, as the models are developed using the selected features. However, an alternative approach (see Fig. 4.1) is to develop the model using the full feature set and then project the model into a lower dimensional space if the features are independent. The basic idea of Feature Selection (FS) [4] is to reduce the computation at the time of testing while achieving the best performance through the chosen optimal set of features. Nevertheless, it is very difficult to find the best subset of features without an exhaustive search that needs $2^D$ evaluations, where $D$ is the dimension of the extracted feature vectors. There are many FS techniques and each has its own way of selecting the features.

[Figure 4.1: Typical feature selection method. Block diagram: Feature Extraction yields the full feature set $[C_1\ C_2\ C_3 \ldots C_D]^T$; either Feature Selection is followed by Model Training on the subset of selected features, or Model Training on the full feature set is followed by Model Pruning; in both cases the evidence comes from the selected features.]

Reviews of some major feature selection methods in the SI context are presented next.

4.1.1

Review of Feature Selection Methods

F-Ratio Based Feature Selection

F-Ratio (FR) is a measure that can be used to evaluate the effectiveness of a particular feature. It has been widely used as a figure of merit for FS in speech [5] and speaker recognition applications [6], [7], [8], [9]. It is defined as the ratio of the between-class variance to the within-class variance. In the context of FS for pattern classification, the FR favors the features that maximize the separation between different classes and minimize the scatter within these classes. Some assumptions have to be satisfied when using the FR as a figure of merit for dimensionality reduction. The assumptions are:

• The feature vectors within each class must have a Gaussian distribution. This condition is taken to be satisfied if a sufficiently large training data set is used, according to the central limit theorem [10].

• The features should be statistically uncorrelated. In practice this condition is hardly ever satisfied, and the correlated features can be transformed into uncorrelated features via suitable transformation techniques. However, if we use MFCC to construct the feature vectors, then we can consider the features uncorrelated, since the DCT used to prepare these vectors performs adequate decorrelation.

• The variances within each class must be equal. Since the variances within each class are generally not equal, the pooled within-class variance is used to define the FR.

Due to these assumptions, the FR fails to select the features correctly in many situations. For example, the usefulness of the FR as a discrimination measure is reduced if the classes are multi-modal or if they have the same means. This is a fatal flaw of any criterion that is dominated by differences between class means. Moreover, two features with high individual F-Ratios might be highly correlated and, as a feature vector, less effective than two features that individually have lower F-Ratios but are uncorrelated. In spite of these limitations, the FR has remained very popular and has been considered a standard FS technique for SI applications because of its low computation (only $D$ units of operations are required for determining the ranks of the features) and easy implementation. Although the F-Ratio has been widely used in the SR problem, we review some of the other FS procedures for the completeness of this study. Those FS techniques are discussed next.

“Knocked-out” Strategy

The knocked-out strategy based FS technique was proposed by Sambur et al. [11] in 1975. The proposed scheme is a suboptimal search without replacement that experimentally investigated a total of 92 features on a database of eleven speakers. Assuming that the total number of features originally available is equal to $D$, the method begins by evaluating the effectiveness (error performance) of each of the $D$ feature subsets with $D - 1$ members. The most effective feature subset is then determined, and the feature not included in this subset is defined as the least important feature. This feature is then eliminated or “knocked-out” from further consideration. The procedure continues until all the features are “knocked-out” from consideration. The ordered effectiveness of the features is then given by the inverse sequence of “knocked-out” features. The above scheme, though computationally efficient, has the inherent disadvantage that the resulting subset, which contains the best individually selected properties, is not necessarily the optimal subset of features. This procedure requires $D(D+1)/2$ units of operations for determining the order of the features.

Feature Selection through Dynamic Programming

Feature selection through dynamic programming for SI [12] and other pattern recognition problems [13], [14] is a relatively old technique. Cheung et al. [12] first attempted to select the features for the SI application with divergence [15] as the selection criterion. The work was presented on a very small database of ten speakers, with no guarantee of an optimal selection of a set of potential features. However, the method shows better FS than the “knocked-out” strategy mentioned before. In this work, the divergence was chosen as the selection criterion by considering a Gaussian with a single mode to estimate a speaker’s Probability Density Function (PDF); this contradicts the usual use of GMM [16], which involves a multi-modal Gaussian to estimate the same PDF. Note that GMM has the capability to model an arbitrary PDF for the data, which vary due to multi-session recordings [17] and the phonetic contents of the speech. This strategy requires $D^2(D-1)/2$ units of operations for determining a set of useful features.

Other Search based Feature Selection procedures

Search based techniques have been applied to an HMM-based text-dependent speaker verification system [18], in which a distinction is made between the alignment task and the scoring task. The optimization is based on the search, among a set of potential features, for the feature subset that gives the minimal experimental Equal Error Rate (EER) [19]. The adopted search strategies include dynamic programming, the knocked-out strategy and others. The work has also been extended to the search for an optimal weighting of the different axes of the acoustic space. The optimal weighting was found by using a genetic algorithm. The work found that cepstral coefficients of higher order and first derivatives of all cepstral coefficients are the most useful for speaker verification. This contradicts, to a certain degree, a more recent work presented in [20], which attempted to remove pitch in order to reduce the pitch mismatch at the time of verification, since higher order cepstral coefficients describe the fine harmonic structure of the spectrum, which mainly comes from the pitch. In another attempt [21], three different search algorithms have been used for a Dynamic Time Warping (DTW) based text-dependent speaker verification system that uses 33 French speakers. However, the selected features are speaker dependent, and thus the adopted technique cannot be used in the SI paradigm, which requires model discrimination in a common feature space. Searching techniques like genetic algorithms have been applied [22] in searches for optimal acoustic features with a 15 speaker database for a SI problem.

Feature Selection using Information-Theoretic view point

A recent work [23] describes the FS procedure from the information theoretic perspective. The authors present a detailed theoretical study of the connection between the classification error and the mutual information between speaker and features, and they apply the theory to arrive at qualitative conclusions about FS and performance in a speaker recognition system. The work has shown, both theoretically and experimentally, that the error probability in speaker recognition is closely connected to the mutual information between speaker and features. Thirty-two speakers have been chosen from the YOHO database. Unlike the other methods described above, the PDF of the features is estimated through GMM and is used in the FS criterion. The work, however, lacks exhaustive results that could have shown how the speaker recognition (identification) error rate varies with the number of selected features. Besides, it is very difficult to extrapolate the SI accuracy to a total of 138 speakers, which is the actual size [24] of the full corpus. More approaches from the information theoretic view point on the National Institute of Standards and Technology (NIST) 2004 database [25] can be found in [26] and [27]. Unfortunately, none of the evaluated feature ranking methods provided a feature subset capable of outperforming the full feature set; however, useful knowledge was gained regarding features that do not contribute significantly to the task. Other FS methodologies based on mutual information can be found in [28], [29], [30].
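The F-Ratio ranking reviewed at the start of this subsection can be illustrated with a minimal sketch. It assumes one common per-feature form of the measure: the variance of the class (speaker) means divided by the average of the per-class variances. Taking the plain average as the pooled within-class variance presumes roughly equal amounts of data per speaker, and the function names are hypothetical; the exact form used in the cited works may differ.

```python
import numpy as np

def f_ratio(features, labels):
    """Per-feature F-Ratio: between-class variance / pooled within-class variance.

    features: (N, D) array of feature vectors; labels: (N,) speaker labels.
    """
    classes = np.unique(labels)
    class_means = np.array([features[labels == c].mean(axis=0) for c in classes])
    class_vars = np.array([features[labels == c].var(axis=0) for c in classes])
    between = class_means.var(axis=0)    # spread of the speaker means
    within = class_vars.mean(axis=0)     # pooled (averaged) within-speaker variance
    return between / within

def rank_features(features, labels):
    """Feature indices sorted from the highest to the lowest F-Ratio."""
    return np.argsort(f_ratio(features, labels))[::-1]
```

Each feature is scored on its own, so only D evaluations are needed, but for the same reason the ranking cannot detect two highly ranked features that are strongly correlated with each other.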

4.1.2

Motivation

From the above studies, the following observations can be made:

• The purpose of FS is to reduce the dimension of the extracted vectors and thereby reduce the complexity of the system.

• It is also important to realize that the main task of the FS process is to pack as much speaker-discriminating information as possible into as few features as possible while eliminating the weak features.

• The selected best subset of features (optimal feature set) produces the best speaker models, by which SI performance will be enhanced.

• Selection of features needs a good selection criterion, which should see the feature vectors from a speaker model’s view point, as the scores from these models are responsible for the ultimate decision.

• Finally, while selecting the optimal or suboptimal feature subset, a FS method must be computationally scalable even for large databases, allowing quick retraining/reconfiguring of the off-line models when one or many new feature(s) are added to the existing set of features.

It is apparent that even for moderate values of $D$ a direct exhaustive search will not be possible. Evidently, in practical situations, alternative computationally feasible procedures have to be employed. Moreover, a search based FS method cannot even guarantee to yield an optimal feature set. Except for the FR based FS method, which needs only $D$ evaluations, none of the other FS methods is computationally very efficient. Clearly, we seek a criterion that more accurately guides the selection of features.

In this chapter, we adopt a new and straightforward FS method, which is computationally less expensive than exhaustive search based techniques, takes into account the combined effect of the features in multidimensional space, exploits the multi-modality of the data set using cluster heads generated from VQ outputs, and is easy to realize. Singular Value Decomposition (SVD) [31] and QR Decomposition with Column Pivoting (QRcp) [32] have been chosen together as the FS criteria. The idea is to capture the most salient part of the information from the speakers’ data by choosing those features that can explain different dimensions showing minimal similarities (or maximum acoustic variability) among them in the orthogonal sense. The proposed method first coarsely selects a set of fixed features based on the FS criterion. This is followed by a search (without replacement) over a small window among the rest of the features in order to achieve the best performance from a SI system. A similar method that uses SVD-QRcp has been found useful for selecting a subset of data in a heart sound classification problem [33] using a feed-forward ANN and in nonlinear modeling of a complex process [34], but it has not been used in the SI context.

Using this FS criterion, efficient features are selected from the MFCC, IMFCC, GMFCC, and GIMFCC feature sets. The corresponding models, which are not over-fitted, are obtained using the efficient subset of features and are used at the time of testing. Models developed from the efficient MFCC and IMFCC features are then fused using the same PQ based fusion technique, which had already shown good performance while merging the models developed from the full feature sets (ref. Sec. 2.7 & 2.8). Extending the same idea of fusion, we merge the GMFCC & GIMFCC models developed from their corresponding best feature subsets. Note that the performances have been compared against the FR based FS method, which, similar to this approach, assigns a rank to each feature in terms of its importance.
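To make the cost argument concrete, the evaluation counts quoted in the review can be tabulated for a given feature dimension. This is only an arithmetic sketch: the function name is hypothetical, the dimension D = 20 in the usage note is chosen purely for illustration, and the dynamic-programming count is read here as D²(D−1)/2.

```python
def evaluation_counts(d):
    """Number of evaluations needed by each FS strategy for d features."""
    return {
        "exhaustive": 2 ** d,                         # every subset of the d features
        "knocked_out": d * (d + 1) // 2,              # D + (D-1) + ... + 1 subset evaluations
        "dynamic_programming": d * d * (d - 1) // 2,  # assumed reading: D^2 (D-1) / 2
        "f_ratio": d,                                 # one score per feature
    }

counts = evaluation_counts(20)
for name, value in counts.items():
    print(f"{name:>20}: {value}")
```

For D = 20 this gives 1,048,576 evaluations for the exhaustive search against 3,800, 210 and 20 for the other three strategies, which is why a direct exhaustive search is impractical even for moderate D.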

4.1.3 Organization of the chapter

The rest of the chapter is organized as follows. Section 4.2 gives the theoretical background of the SVD and QRcp techniques. The selection of the features and the complete training and testing procedures are described in section 4.3. Section 4.4 discusses the singular values and their percentage of energy explanation. This is followed by the complete experimental results in section 4.5. The results of PQ based fusion of the best models are described in section 4.6. Finally, section 4.7 presents the conclusions for the chapter.

4.2 Theoretical Background on SVD and QRcp

4.2.1 Singular Value Decomposition (SVD)

SVD [31] is an optimal orthogonal decomposition, which finds wide application in rank determination and inversion of matrices, as well as in the modeling, prediction, filtering and information compression of data sequences. SVD is closely related to KLT [35], singular values being uniquely related to eigenvalues, although the computational requirements of SVD are less than those of KLT. From the numerical point of view, SVD is extremely robust, and the singular values in SVD can be computed with greater computational accuracy than eigenvalues. SVD is popularly used for the solution of least squares problems; it offers an unambiguous way of handling rank deficient or nearly rank deficient least squares problems. SVD is also the most definite method for the detection of the rank of a matrix or the nearness of a matrix to loss of rank.

Given any m × n matrix F, there exist an m × m real orthogonal matrix U, an n × n real orthogonal matrix V and an m × n diagonal matrix Sv, such that

\[ F = U S_v V^T, \qquad S_v = U^T F V, \tag{4.1} \]

where the elements of Sv can be arranged in non-increasing order, that is

1. for a nonsingular (full-rank) F,

\[ S_v = \mathrm{diag}\{s_1, s_2, \ldots, s_p\}, \quad p = \min(m, n), \quad s_1 \geq s_2 \geq s_3 \geq \ldots \geq s_p > 0, \]

or

2. for F of rank r,

\[ s_1 \geq s_2 \geq s_3 \geq \ldots \geq s_r > 0 \quad \text{and} \quad s_{r+1} = s_{r+2} = \ldots = s_p = 0. \]

In other words, U^T U = U U^T = I, V^T V = V V^T = I, and

\[ S_v = \begin{bmatrix} s_1 & 0 & \cdots & 0 \\ 0 & s_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & s_p \\ \hline \multicolumn{4}{c}{\mathbf{0}_{(m-p) \times n}} \end{bmatrix} \quad \text{for } m > n = p. \]

The decomposition is called the singular value decomposition. The numbers s1, s2, s3, ..., sp are the singular values (or principal values) of F. U and V are called the left and right singular vector matrices of F respectively. U and V can be expressed as

\[ U = [u_1 \; u_2 \; \ldots \; u_i \; \ldots \; u_m], \tag{4.2} \]

and

\[ V = [v_1 \; v_2 \; \ldots \; v_i \; \ldots \; v_n], \tag{4.3} \]

where, for i = 1 to p, the m-column vector ui and the n-column vector vi, which correspond to the i-th singular value si, are called the i-th left singular vector and the i-th right singular vector respectively.

4.2.2 QRcp Factorization

The QR decomposition of an m × n matrix F with rank p is given by

\[ F = QR \tag{4.4} \]

where Q is an m × p matrix with orthonormal columns and R is a p × n upper triangular matrix. When m = n, Q and R are square matrices, and Q is an orthogonal matrix. QRcp [33], [34] (i.e., QR with column pivoting) factorization pivots the columns of a matrix in order of maximum Euclidean norm in successive orthogonal directions while the QR factorization is performed on the matrix. The mechanism of the rotation of the columns is discussed next. The column vector of F with max(f_i^T f_i) is first selected and is swapped with f1. The unit vector q1 in the direction of f1 is determined by

\[ q_1 = \frac{f_1}{\| f_1 \|} \tag{4.5} \]

The second (or rotated) vector is the one maximizing $(f_j - (q_1^T f_j) q_1)^T (f_j - (q_1^T f_j) q_1)$, which is swapped with f2, and the corresponding q2 is computed as

\[ q_2 = \frac{q'_2}{\| q'_2 \|} \tag{4.6} \]

where q′2 is defined as

\[ q'_2 = f_2 - (q_1^T f_2) q_1 \tag{4.7} \]

At the i-th stage of selection, the rotated vectors (f*_j) are

\[ f_j^* = f_j - \left( (q_1^T f_j) q_1 + \ldots + (q_{i-1}^T f_j) q_{i-1} \right), \quad i = 2, \ldots, n, \quad j = i, \ldots, n \tag{4.8} \]

and the i-th selected vector is the one maximizing $f_j^{*T} f_j^*$. The subsequent rotation within QR decomposition will be with respect to this vector and so on. The selection is continued for up to g stages, where g may be the rank of F or may be specified based on other considerations. The sequence of successive selections is registered in the permutation matrix Pt. The result is

\[ Q^T F P_t = R, \tag{4.9} \]

where R is upper triangular. The matrix FPt will have the g selected columns of F appearing first in order of importance.
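The pivoting rule of equations (4.5) to (4.8) can be illustrated with a minimal sketch in plain Python. It is an assumption-laden illustration rather than the implementation used in this work: the function name `qrcp_order`, the list-of-rows matrix layout and the tolerance `1e-12` are hypothetical choices, and only the selection order (the content of Pt) is returned, not Q or R.

```python
import math

def qrcp_order(F, g=None):
    """Return column indices of F in the order chosen by column pivoting.

    F is a list of rows (m x n). At every stage the column whose residual
    (the part orthogonal to all previously chosen columns) has the largest
    squared Euclidean norm is selected, as in equation (4.8).
    """
    m, n = len(F), len(F[0])
    g = n if g is None else g
    # residual[j] holds f*_j, the rotated version of column j
    residual = [[F[r][j] for r in range(m)] for j in range(n)]
    remaining = list(range(n))
    order = []
    for _ in range(g):
        best = max(remaining, key=lambda j: sum(x * x for x in residual[j]))
        norm = math.sqrt(sum(x * x for x in residual[best]))
        if norm < 1e-12:  # hypothetical tolerance: no independent column left
            break
        q = [x / norm for x in residual[best]]  # unit vector, eqs. (4.5)/(4.6)
        order.append(best)
        remaining.remove(best)
        # remove the component along q from every column not yet selected
        for j in remaining:
            proj = sum(qk * x for qk, x in zip(q, residual[j]))
            residual[j] = [x - proj * qk for x, qk in zip(residual[j], q)]
    return order

# Toy matrix with columns f0 = (1, 0, 0.5), f1 = (2, 0.1, 0), f2 = (0, 1, 0)
F_demo = [[1.0, 2.0, 0.0],
          [0.0, 0.1, 1.0],
          [0.5, 0.0, 0.0]]
order_demo = qrcp_order(F_demo)
```

In this toy case column f1 has the largest norm and is picked first; f2 is then preferred over f0 because f0 is almost parallel to f1 and keeps little residual energy after the first rotation.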

4.3 Feature Subset Selection using SVD Followed by QRcp

Given any m × n information set F with m ≥ n, the objective is to select an m × g subset F1 (g < n) of F, which contains the salient part of the information contained in F. In other words, the objective is to select the g significant variables out of the n (or p) variables; m indicates the length of the data sets.

4.3.1 Formation of Matrix F

The feature matrix F has been constructed by stacking M number of representative vectors from each speaker. Using the LBG algorithm [36], representative vectors (code vectors) have been obtained and these representative vectors are also used as the initial guess of the mean vectors for GMM. The matrix F is formed by stacking these M code vectors from speaker 1, followed by M code vectors of speaker 2 and so on up to speaker S. Therefore, the value of m is equal to S × M , where S is the total number of the speakers. The columns in the matrix F specify the dimension of the feature vectors (D), which is denoted in this study by n or p i.e., D = n = p = 19. The matrix F coarsely represents the whole corpus with equal evidence from different speakers. The size of the matrix is not large in comparison to the concatenated version of all feature vectors from total speakers in the database. Another useful feature of matrix F is that it holds the most representative data from each speaker allowing the SVD-QRcp operator to find the effective feature subset.
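The stacking described above can be sketched as follows. This is only an illustration under stated assumptions: the per-speaker codebooks are taken as already computed (for example by LBG), and the function name `stack_codebooks` and the nested-list layout are hypothetical.

```python
def stack_codebooks(codebooks):
    """Stack S codebooks of M code vectors each into an (S*M) x D matrix F.

    codebooks[s] is assumed to be a list of M code vectors (each of length D)
    for speaker s, so rows (s*M) .. (s*M + M - 1) of F belong to speaker s.
    """
    F = []
    for codebook in codebooks:
        F.extend(list(vector) for vector in codebook)
    return F

# Two speakers, M = 2 code vectors each, D = 2 (toy sizes; the thesis uses D = 19)
codebooks_demo = [
    [[1.0, 2.0], [3.0, 4.0]],   # speaker 1
    [[5.0, 6.0], [7.0, 8.0]],   # speaker 2
]
F_stacked = stack_codebooks(codebooks_demo)
```

With S speakers and M code vectors per speaker the result has m = S × M rows, so every speaker contributes the same amount of evidence to F.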

4.3.2 Selection of Number of features using SVD

Let the SVD of F be given by F = U Sv V^T, where U = [u1, ..., um], V = [v1, ..., vn], and Sv = [diag{s1, ..., sp} : 0], p = min(m, n). U and V are the left and the right singular vector matrices respectively. The left and the right singular vectors form a basis for the column-space and the row-space of F respectively. Again,

\[ F = \sum_{i=1}^{p} u_i s_i v_i^T \tag{4.10} \]

If g of the p singular values of F are dominant, that is, s(g+1), s(g+2), ..., sp are insignificantly small, the prime information of F will be contained in

\[ F_1 = \sum_{i=1}^{g} u_i s_i v_i^T \tag{4.11} \]

Again, rank(F) = the number of nonzero singular values. So, a selection of F1, the prime m × g subset of F, should correspond to the set of singular values (s1, ..., sg), g ≤ p, implying

\[ \mathrm{rank}(F_1) = \mathrm{pseudorank}(F) = g \tag{4.12} \]

Subset selection is straightforward if the (p − g) singular values of F are zero. Precise selection of subsets has been made in [32], where a large gap or jump in the distribution of the singular values (i.e. si >> si+1 for some i, 1 ≤ i < p) occurred. Vector quantized outputs from several speakers packed into matrix F do not show a periodic nature due to intra- and inter-speaker variability (see Fig. 4.2). For the MFCC feature set on the YOHO database, the matrix F has been plotted in figure 4.2; its singular values do not show a large gap or jump between two successive values.

Figure 4.2: Stacked version of vector quantized cepstral vectors using MFCC feature set from YOHO database

Here, to select the number of features, we adopt the criterion called percentage of energy explanation by singular values. It is defined as

\[ P_{ex} = \frac{\sum_{i=1}^{g} s_i^2}{\sum_{i=1}^{p} s_i^2} \times 100 \tag{4.13} \]

Viewed in another way, we select g features for which the energy explained by the corresponding g singular values amounts to Pex percent of the total energy shown by the p singular values. Here, Pex is chosen as 99% irrespective of the type of feature sets and databases. Thus, p − g features have been considered insignificant, as their singular values explain at most 1% of the total energy explained by the complete set of singular values.
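The criterion of equation (4.13) can be illustrated with a short sketch that returns the smallest g whose leading singular values reach a target percentage. The function name `num_features_for_energy` and the toy singular values are hypothetical; the singular values are assumed to be supplied in non-increasing order, as produced by an SVD routine.

```python
def num_features_for_energy(singular_values, target_percent=99.0):
    """Smallest g such that s_1..s_g explain at least target_percent of the
    total energy sum(s_i^2), following equation (4.13)."""
    energies = [s * s for s in singular_values]
    total = sum(energies)
    running = 0.0
    for g, energy in enumerate(energies, start=1):
        running += energy
        if 100.0 * running / total >= target_percent:
            return g
    return len(energies)

# Toy singular values (not taken from the thesis): energies 100, 25, 1, 0.25
s_demo = [10.0, 5.0, 1.0, 0.5]
g_99 = num_features_for_energy(s_demo, 99.0)
g_995 = num_features_for_energy(s_demo, 99.5)
```

For the toy values the first two singular values explain 125/126.25, about 99.01% of the energy, so a 99% target gives g = 2, while a 99.5% target needs a third value.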

4.3.3 Selection of Effective ‘g’ Number of Features using QRcp

From the previous discussion on selection of features using SVD, one obtains only the number of features to be selected or retained. This subsection deals with the selection of the actual g features out of a total of p features using QRcp (refer sec. 4.2). The selection of columns through QRcp factorization is based on the Euclidean norm. First the column with maximum Euclidean length is selected (see Fig. 4.3). Next the column having the maximum orthogonal component is selected (see Fig. 4.3), and so on. So the i-th selected column is the one having the maximal component orthogonal to the subspace spanned by the earlier selected i − 1 columns. The sequence of the selection is stored in the permutation matrix Pt (see Fig. 4.3). Note that the criterion “maximum Euclidean norm” in QRcp also helps to find the lower order cepstral parameters, which are generally higher in magnitude and can describe vocal tract [37] behavior. Generally, 10-12 cepstral coefficients are enough for SI due to the fast decay of the higher coefficients [38]. There is no reason to use a high number of cepstral coefficients unless they are properly normalized; the coefficients with a small magnitude do not contribute much to the distance values.

Figure 4.3: Column swapping through maximum norm criterion in QRcp. (Stage 1: maximum norm calculated over columns 1 to D, columns 1 and 2 swapped, Pt = [2 1 3 4 ... D]. Stage 2: maximal orthogonal component calculated over columns 2 to D, columns 1 and D swapped, Pt = [2 D 3 4 ... 1]. Stage 3: maximal orthogonal component calculated over columns 3 to D, columns 3 and 4 swapped, Pt = [2 D 4 3 ... 1]. This continues till Stage D−1.)

4.3.4 Description of the Complete System

Training Phase

The complete system has been shown in figure 4.4. First, for each speaker, the feature matrix with 19 dimensional features has been sent to a vector quantizer block and the quantized vectors for each speaker have been stored in the matrix F. When the VQ has been completed for all the speakers in the database, GMM uses these representative seed vectors and starts the training via the EM algorithm. Note that GMM also needs the full feature vectors to find the probabilistic model for each speaker along with these seed vectors. SVD followed by QRcp are applied on the matrix F and the desired feature subset is selected.

The actual selection procedure is as follows. A user can set a threshold value (say 99%) that indicates his choice of percentage of energy explanation through an external input denoted by P′ex. However, the actual percentage of energy explanation (Pex) by the singular values can be obtained after applying SVD on F. Then a comparison is made between the actual and desired percentage of energy explanation, and cepstral parameters are selected as long as the condition P′ex > Pex holds. On the other hand, QRcp provides the ordering of the features from the same data. So, combining both the paradigms one could easily get the final fixed feature subset from the feature selector block (see Fig. 4.4). Note that SVD determines only the number of features to be selected while QRcp ranks them based on the orthogonality using Gram-Schmidt orthogonalization [32].

Because the modeling and feature selection are performed at the same time, a huge amount of off-line time is saved; on the other hand, in conventional methods, modeling generally starts only after FS is completed. FS using SVD followed by QRcp does not take much time; thus, this FS process is completed much before the models are prepared for the complete list of speakers. The complete training algorithm that includes formation of matrix F, development of speaker models, and the feature selection procedure is described next by Algorithm 4.1.

Testing Phase

During the testing phase, the same features are also selected from the test vectors. Models, which have been stored in the model databases, are pruned according to the features selected at the time of training. The pruning of models is as follows. A GMM based speaker model for the speaker s is composed of a set of M D-dimensional mean vectors ($\mu_i^s$), D × D dimensional covariance matrices ($\Sigma_i^s$), and priors ($p_i^s$). Analytically,

\[ \lambda_s = \{ p_i^s, \mu_i^s, \Sigma_i^s \}_{i=1}^{M} \tag{4.14} \]

where D = 19 indicates the dimension of the full feature set. If g features (say 18) have already been selected, then the reduced model can be obtained by discarding those components generated by the features which are not included in the selected subset. For example, if we discard the feature C2, the modified 18 dimensional components will be

\[ \hat{p}_i^s = p_i^s \quad \forall \, i, \; i = 1, 2, \ldots, M \tag{4.15} \]

\[ \hat{\mu}_i^s = [C_{i,1} \; C_{i,3} \; \ldots \; C_{i,19}], \quad i = 1, 2, \ldots, M \tag{4.16} \]

\[ \hat{\Sigma}_i^s = \begin{bmatrix} \sigma_{i,11}^2 & 0 & \cdots & 0 \\ 0 & \sigma_{i,33}^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{i,19\,19}^2 \end{bmatrix}, \quad i = 1, 2, \ldots, M \tag{4.17} \]

Algorithm 4.1: Training + Feature Selection

    /* Declaration of parameters */
    S = No. of speakers ;
    F = Matrix keeping seed mean (cluster head) vectors after VQ ;
    M = No. of seed means from each speaker ;
    Trdata(i) = Training data for speaker i, i = 1, 2, ..., S ;
    P′ex = User defined percentage of energy explanation ;
    Pex = Actual percentage of energy explanation after SVD ;
    λi = Speaker model for the speaker i ;
    Pt = Permutation matrix ;

    /* Generation of seed vectors */
    for i = 1 to S do
        F((i − 1)·M + 1 to (i − 1)·M + M, :) = VQ(Trdata(i)) ;
    end

    /* Preparation of speaker models */
    for i = 1 to S do
        λi = GMM(Trdata(i), F((i − 1)·M + 1 to (i − 1)·M + M, :)) ;
    end

    /* Feature selection using SVD + QRcp */
    s = SVD(F) ;
    Pt = QRcp(F) ;
    count = 0 ;
    Pex = 0 ;
    while Pex ≤ P′ex do
        count = count + 1 ;
        Pex = (Σ_{i=1}^{count} s_i² / Σ_{i=1}^{p} s_i²) × 100 ;
    end
    Selected features = Pt(1 to count) ;
    // Select the first count features from the ordered list given by
    // matrix Pt, where g = count in this case

Figure 4.4: SVD-QRcp based feature selection in SI system. (Block diagram. Training path: raw speech data, pre-processing and feature extraction, VQ stack (matrix F), SVD and QRcp, comparison of Pex with P′ex, feature selector, selected features; in parallel, GMM based speaker modeling with mean vectors initialized using VQ, speaker model databases in 19 dimensional features, projection into lower dimension, speaker model databases in lower dimension. Testing path: pre-processing and feature extraction, reduced feature vectors, matching algorithm, final output.)

where $\sigma_{i,kk}^2$ is the variance in the k-th dimension for the i-th component of the GMM. Therefore, the reduced triplet is defined as

\[ \hat{\lambda}_s = \{ \hat{p}_i^s, \hat{\mu}_i^s, \hat{\Sigma}_i^s \}_{i=1}^{M} \tag{4.18} \]

For multiple feature deletion, the reduced model should be prepared in a similar way.
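The pruning of equations (4.15) to (4.18) amounts to keeping the retained entries of every mean vector and of every diagonal covariance, with the priors unchanged. The sketch below illustrates this under stated assumptions: covariances are stored as lists of per-dimension variances (diagonal GMM), indices are zero-based, and the name `prune_gmm` is hypothetical.

```python
def prune_gmm(weights, means, variances, keep):
    """Project a diagonal-covariance GMM onto the feature indices in `keep`.

    weights:   M priors p_i (returned unchanged, eq. 4.15)
    means:     M mean vectors of length D (entries outside keep dropped, eq. 4.16)
    variances: M lists of D diagonal variances (same entries dropped, eq. 4.17)
    keep:      zero-based indices of the retained features
    """
    pruned_means = [[mu[k] for k in keep] for mu in means]
    pruned_vars = [[var[k] for k in keep] for var in variances]
    return list(weights), pruned_means, pruned_vars

# Toy model with M = 2 components and D = 3; drop the second feature (index 1)
weights_demo = [0.4, 0.6]
means_demo = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
vars_demo = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
pruned = prune_gmm(weights_demo, means_demo, vars_demo, keep=[0, 2])
```

Dropping coordinates of a Gaussian yields its marginal over the remaining coordinates, so the reduced model is the exact marginal of the full model and no retraining is involved in this step.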

The next algorithm (see Algorithm 4.2) describes the testing part of the proposed system using lower dimensional models.

Algorithm 4.2: Testing

    /* Declaration of parameters */
    Xm = Unknown cepstral vector for the m-th frame ;
    Xm^proj = Unknown cepstral vector for the m-th frame projected in lower dimension ;
    λ̂i = Reduced model (model projected in lower dimension) for speaker i, i = 1, 2, ..., S ;
    Li = Log likelihood of speaker model i ;
    strue = True speaker identity ;

    /* Testing */
    for i = 1 to S do
        Li ← 0 ;
        λ̂i = Projection_{D→D̂}(λi) ;            // Using equations 4.15 to 4.17
        for m = 1 to T do
            Xm^proj = Projection_{D→D̂}(Xm) ;
            Li = Li + log p(Xm^proj | λ̂i) ;
        end
    end

    Decision: strue = arg max_s Ls, s ∈ {1, 2, ..., S} ;
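The scoring loop of Algorithm 4.2 can be sketched in plain Python as below. It is a minimal illustration, not the system used for the reported experiments: the models are assumed to be diagonal-covariance GMMs that have already been pruned to the selected features, and the names `gmm_log_likelihood` and `identify` are hypothetical.

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | model) for a diagonal-covariance GMM, using log-sum-exp."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for xk, mk, vk in zip(x, mu, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * vk) + (xk - mk) ** 2 / vk)
        component_logs.append(log_p)
    top = max(component_logs)
    return top + math.log(sum(math.exp(v - top) for v in component_logs))

def identify(frames, reduced_models, keep):
    """Return (index of best speaker, list of accumulated log likelihoods).

    frames:         T full-dimensional test vectors
    reduced_models: list of (weights, means, variances), already pruned to keep
    keep:           zero-based indices of the selected features
    """
    scores = []
    for weights, means, variances in reduced_models:
        total = 0.0
        for frame in frames:
            projected = [frame[k] for k in keep]  # projection of the test vector
            total += gmm_log_likelihood(projected, weights, means, variances)
        scores.append(total)
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores

# Toy example: two single-component speakers in the reduced 2-D space
models_demo = [([1.0], [[0.0, 0.0]], [[1.0, 1.0]]),
               ([1.0], [[5.0, 5.0]], [[1.0, 1.0]])]
frames_demo = [[0.1, 99.0, -0.2], [0.3, 99.0, 0.1]]
best_demo, scores_demo = identify(frames_demo, models_demo, keep=[0, 2])
```

The middle coordinate of the toy frames is discarded by the projection, so its extreme value has no effect on the decision, which goes to the first speaker.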

4.4 A Discussion on Singular Values and Percentage of Energy Explanation for Different Feature Sets

In the previous section (sec. 4.3), we have shown the non-periodic nature of the rows of the matrix F (see Fig. 4.2), for which no large gap or jump has been noticed (see Sec. 4.3) among the singular values obtained through SVD. In this section, we discuss singular values and their corresponding percentage of energy explanation for the feature sets that have been mentioned in the previous chapters. Figures 4.5 & 4.6 show them for the YOHO and POLYCOST databases respectively. The four sub-figures in the left pane of each of these figures show singular values while the corresponding right pane sub-figures depict the percentage of energy explanation by them. Note that the X-axis of all these sub-figures indicates the number of features while the Y-axis for the left and right pane sub-figures shows the magnitude and the percentage of energy explanation of the singular values respectively. It is clearly observed from the figures that 90% of the total energy can be explained by only 4 to 5 features out of the 19 parameters. As we go along the X-axis, singular values decrease less rapidly and the percentage of energy explanation by them increases gradually. Using 7 to 11 features, 99% of the energy can be explained for all the different types of feature sets used for both the databases.

Figure 4.5: Singular values and their corresponding percentage of energy explanation in YOHO Database. (Panels: MFCC, IMFCC, GMFCC, GIMFCC; left column: singular values, right column: percentage of energy explanation; X-axis: No. of features.)

Figure 4.6: Singular values and their corresponding percentage of energy explanation in POLYCOST Database. (Panels: MFCC, IMFCC, GMFCC, GIMFCC; left column: singular values, right column: percentage of energy explanation; X-axis: No. of features.)

We set a criterion, i.e. to select the total number of features which explain 99% and above of the total energy shown by all the singular values. However, this heuristic does not guarantee the best subset of features, i.e. the one by which maximum SI accuracy could be obtained, and therefore one has to search using a without-replacement policy (i.e. eliminating features one by one from the list of remaining features) over a small window of size (here it is ±4) among the rest of the features. This improves the quality of selection and in turn enhances the identification rate. Using the permutation matrix Pt, a subset of features could be selected where the total number of elements is chosen according to the previous criterion. As one sets a higher Pex, one includes a larger number of features in the fixed subset, thereby reducing the search space of less effective features that may still be added to that subset. According to our criterion, i.e. Pex = 99%, a minimum of 7 and a maximum of 11 features are chosen as the fixed parameters (Fp), while from the rest, less effective features are selected using the without-replacement policy. Note that the ranking of the features is given by the permutation matrix Pt obtained through QRcp on matrix F. Thus the best performance can be obtained by at most Tr trials, where Tr is implicitly dependent on the value of Pex. Analytically,

\[ T_r = \left. (D - F_p) \right|_{P_{ex} = P'_{ex}} \tag{4.19} \]

Table 4.1 shows the minimum number of features that are fixed using the percentage of energy explanation criterion, which is chosen as 99% for the study presented in this chapter. However, the ordering of the features given by the matrix Pt for different feature sets will be shown in the next section.

Table 4.1: Minimum number of fixed features obtained when Pex = 99% for different feature sets on two databases.

| Feature set | YOHO | POLYCOST |
|-------------|------|----------|
| MFCC        | 11   | 11       |
| IMFCC       | 9    | 8        |
| GMFCC       | 11   | 9        |
| GIMFCC      | 9    | 7        |

4.5 Selected Subsets of Features and Their Performances

This section reports the ranking of features in the MFCC, IMFCC, GMFCC and GIMFCC feature sets, where ranks are determined using QRcp and FR based feature selection techniques for both the databases. The section also shows SI performances using various subsets of features. The top rows of the two tables (tables 4.2 & 4.3), which are indicated by serial number 1, represent the least effective features while the bottom rows show the most reliable and effective ones. All the cepstral features present in the bottom rows but inside the columns marked by ‘SQ’ have been denoted by the symbol ‘⇒’. This symbol indicates the starting point of a fixed feature set (Fp), which has been formed using the 99% of energy explanation criterion. The end point of the set Fp is also indicated by the same symbol located at a slightly higher level in the same column. The tables also contain some ‘?’ & ‘’ symbols, which indicate the best SI performances achieved by the smallest subsets of features selected by SVD-QRcp and FR, respectively. Note that the symbol ‘?’ is found outside the sets Fp in most columns of tables 4.2 and 4.3. This indicates that one has to search outside this fixed set of features. However, in the ideal scenario, the subset selected through the percentage of energy explanation criterion should give the best performance (in which case the ‘?’ symbol will occur at the higher boundary of the fixed feature set Fp). This would completely remove the need to search for parameters outside the set Fp. This can be observed from table 4.3, where the selected subset of MFCC features is found to be the best subset. It can also be observed from the tables that the best performance shown by the subset of features selected by SVD-QRcp takes a smaller number of parameters than the number of selected features suggested by the FR. In most of the cases (except for the GMFCC feature set in the POLYCOST database) the FR based FS technique cannot even select a subset of features that contains a smaller number of elements than the full set containing 19 coefficients. The SI performances using all these feature sets over the two databases are described next.

Tables 4.5 and 4.6 show the SI performances using different feature sets on the YOHO and POLYCOST databases, respectively. For the two tables, the same symbols (i.e. ‘⇒’, ‘?’, and ‘’, used in tables 4.2 and 4.3) have been chosen to denote the fixed set of features and the best performance obtained by using the SVD-QRcp and FR criteria. In addition, the best performance under each feature set has been shown with a boldface character along with the ‘?’ symbol. From these tables some notable points can be observed. They are as follows.

YOHO Database:

• The lowest number of parameters (13 features) has been used in MFCC for showing the maximum performance within that feature set.

• However, the highest SI performance, i.e. 97.55%, is obtained using the GMFCC feature set but at the expense of 17 parameters.

• MFCC achieved its highest performance with a smaller number of parameters than the number of parameters used by the IMFCC based system to achieve its highest performance. However, for GMFCC and GIMFCC, the same number of parameters, i.e. 17, has been used in order to achieve their respective highest performances.

• The best performance achieved by GMFCC is better than the best performance shown by MFCC and a similar comparison can be noticed between GIMFCC and

Table 4.2: Rank of features evaluated by SVD-QRcp and F-Ratio based feature selection methods for YOHO Database. Sr. No.

MFCC

IMFCC

GMFCC

GIMFCC

SQa

FRb

SQ

FR

SQ

FR

SQ

FR

1

C19

C19

C4

C18

C18

C18

C2

C19

C18

C17

C19

2

c C2

C19

C18

C17

3

C17

C3

C16

C19

?d C17

C4

?C16

C19

4

C16

C19

?C17

C15

C16

C18

C14

C16

5

C15

C6

C14

C16

C15

C3

C17

C15

6

C14

C18

C12

C14

C14

C6

C12

C14

7

?C13

C7

C15

C13

C13

C15

C15

C8

8

C12

C8

C13

C8

C12

C16

C10

C13

9

⇒e C11

C5

C10

C9

C17

C13

C9

C10

C11

C11

C12

⇒C11 C10

C7

C11

C12

11

C9

C1

C7

C9

C13

C8

C15

C9

C10

C8

C11

⇒C8

C7

12

⇒C8

13

C7

C16

C6

C11

C7

14

C6

C17

C7

C6

15

C5

C13

C4

16

C3

C14

17

C1

18 19

10

C18

C9

C10

C14

C6

C6

C6

C8

C7

C11

C5

C5

C5

C4

C5

C5

C3

C3

C1

C5

C3

C9

C3

C2

C1

C12

C3

C2

C2

C12

C2

C1

C2

C10

C2

C1

⇒C4

C10

⇒C1

C4

⇒C4

C9

⇒C1

C4

a

SVD followed by QRcp F-Ratio c Highest performance shown by the system where features are selected by F-Ratio d Highest performance shown by the system where features are selected using SVD-QRcp e Start and end of fixed feature set (Fp ) b

IMFCC, though these pairs use different subsets of features of unequal size.

• Using the FR based FS method, performance drops when even a single feature is pruned from the complete set of parameters, and this has been observed for all the four

Table 4.3: Rank of features evaluated by SVD-QRcp and FR based feature selection methods for POLYCOST Database. Sr. No.

MFCC

IMFCC

GMFCC

GIMFCC

SQ

FR

SQ

FR

SQ

FR

SQ

FR

1

C19

C19

C18

C19

C18

C19

C18

C16

C19

2

C17

C18

C18

C19

C18

C19

3

C16

C18

C16

C19

C17

C16

C18

4

C17

C16

?C17

C17

C16

C16 C14

?C17

C17

5

C15

C14

C14

C14

?C15

C11

C14

C14

6

C14

C9

C15

C15

C14

C17

C15

C15

7

C13

C15

C12

C13

C13

C15

C12

C13

8

C12

C11

C13

C12

C12

C9

C13

C9

9

⇒?C11

C7

C10

C9

C11

C7

C10

C12

C10

C4

C11

C11

C10

C12

C11

C10

11

C9

C6

C8

C10

C13

C8

C11

12

C7

C12

C8

C8

C6

C9

C7

13

C8

C13

⇒C9

⇒C9

C7

C6

C7

C4

C8

14

C6

C5

C6

C5

C6

C5

⇒C7 C6

C6

15

C5

C1

C5

C1

C5

C1

C5

C4

16

C3

C3

C4

C7

C3

C3

C4

C5

17

C4

C8

C3

C2

C4

C8

C3

C2

18

C1

C2

C2

C4

C2

C10

C2

C1

19

⇒C2

C10

⇒C1

C3

⇒C1

C2

⇒C1

C3

10

C16

feature sets shown in table 4.5.

• Performances shown by the system that uses the subset of features selected by SVD-QRcp are found to be higher than or equal to the performances achieved by the same system involving an equal number of features sorted by FR. However, for a few cases, the set of features selected using FR performs marginally better than the former. Features selected using SVD-QRcp perform considerably better when a larger number of parameters have been pruned. This justifies our selection of SVD-QRcp as a feature selection criterion, which selects the features carefully so

Table 4.4: Final list of selected subset of features for YOHO and POLYCOST Databases. The opening brace marks the start and the closing brace marks the end of the fixed feature set (Fp).

| Database | Feature Set | Subset of features | Pex (in %) |
|----------|-------------|--------------------|------------|
| YOHO | MFCC | {C4, C2, C1, C3, C5, C6, C7, C8, C9, C10, C11}, C12, C13 | 99.60 |
| YOHO | IMFCC | {C1, C2, C3, C5, C4, C7, C6, C9, C8}, C11, C10, C13, C15, C12, C14, C17 | 99.98 |
| YOHO | GMFCC | {C4, C2, C1, C3, C5, C6, C7, C8, C9, C10, C11}, C12, C13, C14, C15, C16, C17 | 99.98 |
| YOHO | GIMFCC | {C1, C2, C3, C5, C4, C7, C6, C9, C8}, C11, C13, C10, C15, C12, C17, C14, C16 | 99.99 |
| POLYCOST | MFCC | {C2, C1, C4, C3, C5, C6, C8, C7, C9, C10, C11} | 99.22 |
| POLYCOST | IMFCC | {C1, C2, C3, C4, C5, C6, C7, C9}, C8, C11, C10, C13, C12, C15, C14, C17 | 99.98 |
| POLYCOST | GMFCC | {C1, C2, C4, C3, C5, C6, C7, C8, C9}, C10, C11, C12, C13, C14, C15 | 99.96 |
| POLYCOST | GIMFCC | {C1, C2, C3, C4, C5, C6, C7}, C9, C8, C11, C10, C13, C12, C15, C14, C17 | 99.99 |

that high performance can be obtained using a smaller number of features.

• Features discarded using the SVD-QRcp criterion are, for all the feature sets, generally higher order cepstral coefficients, which capture the fine harmonic structure of the energy spectrum that indirectly portrays the pitch characteristics of a speaker. Higher order cepstral coefficients vary widely from training data to test data due to pitch mismatch [20], which occurs at the time of testing. The pitch mismatch phenomenon thus justifies our selection of the low order cepstral parameters, which give a very good estimate of a speaker’s vocal tract behavior [37].

POLYCOST Database:

• The results in the POLYCOST database show trends similar to those displayed by the results presented for the YOHO Database, with a few exceptions.

• Here, the FR based feature selection scheme performs well as it suggests discarding

Table 4.5: SI performance using selected features from different feature sets on YOHO Database. Sr. No.

b

IMFCC

GMFCC

GIMFCC

SQa

FRb

SQ

FR

SQ

FR

SQ

FR

1

96.79

95.76

96.32

95.76

95.60

97.55

97.54

96.05

96.56

95.76

97.54

2

96.79

97.50

96.05

96.05

3

96.76

95.89

95.76

95.42

?97.55

97.23

?96.05

95.83

4

96.94

94.53

?95.76

95.34

97.50

97.05

95.87

95.63

5

96.97

94.51

94.93

95.18

97.43

97.05

95.63

95.34

6

97.26

93.55

94.60

94.20

97.28

96.03

94.95

94.89

7

?97.32

93.75

94.31

94.20

97.26

95.33

94.58

94.44

8

97.17

91.93

93.61

93.19

97.16

95.04

94.09

94.00

9

⇒97.14

90.05

93.25

93.19

94.60

93.56

93.44

96.70

85.54

92.41

92.23

⇒96.90 96.58

94.13

93.04

92.72

11

96.20

82.83

91.12

96.18

92.03

95.42

71.63

90.18

89.75

95.31

90.89

⇒91.81

91.63

12

⇒91.47

90.34

90.24

13

93.70

68.64

88.10

87.55

93.77

88.99

88.17

88.06

14

90.85

66.76

84.46

84.20

90.82

86.99

85.05

84.04

15

86.18

65.16

79.78

79.71

86.32

80.20

79.66

78.59

16

77.92

48.41

69.96

69.34

78.86

64.15

70.31

69.93

17

62.97

37.34

54.22

53.89

63.08

39.91

54.73

54.15

18

34.22

18.79

28.99

29.35

33.95

23.15

29.80

29.95

19

⇒8.04

6.85

⇒7.43

9.73

⇒8.32

8.12

⇒7.36

10.11

10

a

MFCC

96.01

SVD followed by QRcp F-Ratio

the higher order cepstral coefficients, which do not contribute much to speaker recognition. The success of the FR could be attributed to the presence of less variable data (the same ‘MOT02’ utterances used for all the training sessions) available at the time of training. The best performance when using the FR based feature selection technique is achieved with the GMFCC feature set, in which 3 features can be pruned. However, the SVD-QRcp based FS technique selects the features more

Table 4.6: SI performance using selected features from different feature sets on POLYCOST Database. Sr. No.

MFCC

IMFCC

GMFCC

GIMFCC

SQ

FR

SQ

FR

SQ

FR

SQ

FR

1

77.85

77.06

80.24

77.65

77.65

77.85

78.65

77.06

77.06

80.24

2

77.85

77.19

80.11

80.24

75.99

3

77.98

78.51

77.18

76.92

80.24

77.06

4

78.38

78.51

?77.65

76.66

80.50

80.24

77.65

5

78.38

78.38

76.39

76.39

6

78.65

78.38

75.20

7

78.65

78.12

8

78.78

9

77.59

80.24

?77.65

77.06

?81.57

79.97

77.06

76.26

75.20

79.97

79.58

76.26

75.99

75.07

75.07

79.97

79.58

75.73

74.27

76.66

74.40

74.27

79.71

79.31

74.54

74.01

⇒?79.58

75.86

73.61

72.68

78.25

77.18

73.47

72.41

77.72

73.87

72.28

72.15

78.12

76.92

72.94

72.41

11

75.33

71.35

70.29

70.16

75.33

70.29

69.23

12

73.87

68.70

68.44

73.61

73.08

69.50

68.04

13

71.22

67.90

⇒69.50

⇒76.53

66.58

65.38

69.10

68.44

64.46

14

66.05

64.59

61.67

61.67

65.52

64.39

⇒65.52 62.86

60.48

15

61.01

58.49

57.03

55.31

61.27

59.10

57.03

54.51

16

53.71

45.09

53.85

51.99

52.25

44.83

53.98

52.92

17

40.72

28.91

40.19

37.80

40.19

31.30

40.32

39.12

18

25.60

22.41

25.20

22.68

24.01

28.22

25.33

23.87

19

⇒8.09

8.89

⇒8.89

8.62

⇒6.37

9.23

⇒8.62

8.22

10

finely than FR based selection criterion.

• The highest performance shown by GIMFCC is the same as the highest performance shown by the IMFCC based system while using an equal number of features.

• In general, it is observed from both the databases that GMFCC and GIMFCC take a larger number of features than their original baselines, i.e., MFCC and IMFCC, in order to show the highest performances.

106 SVD-QRcp based Acoustic Feature Selection for Speaker Identification

4.6 Combination of Best Speaker Models' Outputs via PQ Based Fusion Strategy with P = 2

In this section, we combine the outputs of the best models (models that show maximum accuracies) developed from MFCC & IMFCC or GMFCC & GIMFCC feature subsets. The combination of best models outperforms the combination of the over-fitted 19 dimensional (full set of features) models in terms of SI accuracies. The results after the combination of 19 dimensional MFCC and IMFCC based models have already been shown in Tables 2.5 and 2.6, whereas the combined results of GMFCC and GIMFCC based speaker models can be found in Tables 3.6 and 3.7. The same prequantization based fusion strategy has been adopted here, but it involves projected lower dimensional models, which show the highest performances using a set of potential features under a particular feature extraction method. The following algorithm (Algorithm 4.3) describes the process when PQ based fusion is applied to merge the best models developed from the reduced feature sets. From Tables 4.7, 4.8, 4.9 and 4.10, it can be concluded that the performances after fusion are considerably better than those of their respective baselines. We have also calculated the average identification time and found that the time taken by a low dimensional model to yield a decision is slightly less than the time used by a full model. As was previously observed (see Sec. 2.7), the PQ based fusion strategy uses half of the total feature vectors for each of the streams, thereby equalizing the time complexity of the system. We also observe from the tables that the fused schemes take slightly less time than the MFCC based single stream that uses all the speech frames. As far as time complexity is concerned, one could not gain a large advantage [39] by using a smaller number of features to develop the model. However, enhanced performance could be obtained by utilizing the non-over-fitted models.
Note that the time complexity of a speaker identification system [39] depends mainly on the number of frames involved, the number of speakers present in the database, and the complexity of the speaker models. These three parameters are much higher than the number of dimensions used in the SI system. As a result, we could not obtain much benefit as far as time complexity is concerned. Note that in this work, our aim is to find the best set of features that helps to construct non-over-fitted models in order to show the best performances for both the databases. In addition, useful knowledge has been gained about different features and their relative performances in this task. As we can see from


Algorithm 4.3: SVD-QRcp + PQ based fusion

/* Declaration of parameters */
S = Total number of speakers
P = Pre-quantization rate
Y_m = Energy spectrum for the m-th frame
MFCC_fea_ext = MFCC feature extraction module
IMFCC_fea_ext = IMFCC feature extraction module
X_m^MFCC = Unknown cepstral vector for the m-th frame using MFCC filter bank
X_m^IMFCC = Unknown cepstral vector for the m-th frame using IMFCC filter bank
λ^i_MFCC = Speaker model realized using MFCC filter bank for speaker i
λ^i_IMFCC = Speaker model realized using IMFCC filter bank for speaker i
λ̂^i_MFCC = Reduced model (model projected in lower dimension) for speaker i using MFCC filter bank, i = 1, 2, . . . , S
λ̂^i_IMFCC = Reduced model (model projected in lower dimension) for speaker i using IMFCC filter bank, i = 1, 2, . . . , S
L_i = Log likelihood of speaker model i
s_true = True speaker identity

/* Fusion */
L_i ← 0
Set P = 2
for i = 1 to S do
    λ̂^i_MFCC = Projection_{D→D̂}(λ^i_MFCC)      // Using equation 4.17
    λ̂^i_IMFCC = Projection_{D→D̂}(λ^i_IMFCC)    // Using equation 4.17
    for m = 1 : P : T do
        if using MFCC stream then
            X_m^MFCC = MFCC_fea_ext(Y_m)          // Use frame nos. 1, 1 + P, . . . See table 2.7
            X_m^MFCCproj = Projection_{D→D̂}(X_m^MFCC)
            L_i = L_i + log p(X_m^MFCCproj | λ̂^i_MFCC)
        else
            X_m^IMFCC = IMFCC_fea_ext(Y_m)        // Use frame nos. P/2 + 1, P/2 + 1 + 2P, . . . See table 2.7
            X_m^IMFCCproj = Projection_{D→D̂}(X_m^IMFCC)
            L_i = L_i + log p(X_m^IMFCCproj | λ̂^i_IMFCC)
        end
    end
end

Decision: s_true = arg max_s L_s, s ∈ {1, 2, . . . , S}
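The fusion loop of Algorithm 4.3 can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes diagonal-covariance GMMs stored as lists of hypothetical `(weight, mean, variance)` tuples, it approximates the projection of equation 4.17 by simply keeping the selected feature indices (which relies on the stated independence assumption), and it assumes the two streams take interleaved frames with stride P. The names `diag_gmm_loglik` and `pq_fused_identify` are illustrative only.

```python
import math

def diag_gmm_loglik(x, gmm, keep):
    """Log-likelihood of x under a diagonal-covariance GMM, evaluated only on
    the feature indices in `keep` (assumed form of the reduced model)."""
    comp_logs = []
    for weight, mean, var in gmm:
        log_p = math.log(weight)
        for d in keep:
            log_p += -0.5 * (math.log(2.0 * math.pi * var[d])
                             + (x[d] - mean[d]) ** 2 / var[d])
        comp_logs.append(log_p)
    top = max(comp_logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(v - top) for v in comp_logs))

def pq_fused_identify(frames_a, frames_b, models_a, models_b, keep_a, keep_b, P=2):
    """Accumulate log-likelihoods from two streams and pick the best speaker.
    Stream A scores frames 0, P, 2P, ...; stream B scores frames P//2, P//2 + P, ...
    (0-based indexing; the interleaving pattern is an assumption)."""
    T = len(frames_a)
    scores = []
    for gmm_a, gmm_b in zip(models_a, models_b):
        total = 0.0
        for m in range(0, T, P):
            total += diag_gmm_loglik(frames_a[m], gmm_a, keep_a)
        for m in range(P // 2, T, P):
            total += diag_gmm_loglik(frames_b[m], gmm_b, keep_b)
        scores.append(total)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

With P = 2 each stream scores about half of the T frames, so the total number of likelihood evaluations per speaker stays close to that of a single full-frame stream.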


Table 4.7: SI accuracy when best MFCC & IMFCC models are fused by PQ based fusion strategy for YOHO Database. Avg.c time (Tgm ) used by MFCC (Single stream, 19 D d ) is 25.53 sec. PIA (%)

Tm

MFCC

PIA (%)

PIA (%)

PIA (%)

PIA (%)

IMFCC

MFCCPQ

IMFCCPQ

FUSED

T 2

(T vec.)

(s)

97.90

19.00

(13D, T vec.)

(s)

(16D, T vec.)

97.32

19.00

95.76

(13D,

T 2

vec.)

97.26

(16D,

vec.)

95.49

Tf

a

Average Dimension c Average d Dimension b

Table 4.8: SI accuracy when best GMFCC & GIMFCC models are fused by PQ based fusion strategy for YOHO Database. Avg. time (Tm ) used by MFCC (Single stream, 19 D) is 25.54 sec. PIA (%)

Tgm

GMFCC

PIA (%)

PIA (%)

PIA (%)

PIA (%)

GIMFCC

GMPQ

GIMPQ

FUSED

T 2

T 2

(T vec.)

(s)

97.97

20.64

(17D, T vec.)

(s)

(17D, T vec.)

97.55

20.63

96.05

(17D,

vec.)

97.48

(17D,

vec.)

96.05

Tgf

Table 4.9: SI accuracy when best MFCC & IMFCC models are fused by PQ based fusion strategy for POLYCOST Database. Avg. time (Tm ) used by MFCC (Single stream, 19 D) is 66.72 sec. PIA (%)

Tm

MFCC

PIA (%)

PIA (%)

PIA (%)

PIA (%)

IMFCC

MFCCPQ

IMFCCPQ

FUSED

T 2

(T vec.)

(s)

81.70

60.01

(11D, T vec.)

(s)

(16D, T vec.)

79.58

59.99

77.72

(11D,

T 2

vec.)

78.32

(16D,

vec.)

76.92

Tf

the tables, the deletion of features does not really reduce the time complexity much; on the other hand, pruning of frames and speakers does help in significantly reducing the time complexity. Hence here it is not meaningful to consider the case where P > 2.


Table 4.10: SI accuracy when best GMFCC & GIMFCC models are fused by PQ based fusion strategy for POLYCOST Database. Avg. time (Tgm ) used by MFCC (Single stream, 19 D) is 66.71 sec. PIA (%)

Tgm

GMFCC

PIA (%)

PIA (%)

PIA (%)

PIA (%)

GIMFCC

GMPQ

GIPQ

FUSED

T 2

(T vec.)

(s)

82.23

61.94

(16D, T vec.)

(s)

(17D, T vec.)

81.57

61.94

77.65

4.7

(16D,

T 2

vec.)

80.50

(17D,

vec.)

77.59

Tgf

4.7 Conclusions

The salient points covered in this chapter are:

• An SVD-QRcp based feature selection scheme is proposed for the SI application.

• SVD determines the number of fixed features that must be retained, based on the percentage of energy explained by the singular values, while QRcp provides the actual ranking of the features using Gram-Schmidt orthogonalization.

• Thus the complete set of features is determined easily by considering the set of fixed parameters and searching among the rest of the parameters without replacement.

• The above feature selection procedure has been applied to four different feature sets (MFCC, IMFCC, GMFCC, GIMFCC) on two different databases.

• The results show that the best performances are obtained while using a smaller number of features. This suggests that the models developed from the full 19 dimensional features were over-fitted.

• We also demonstrated the performance of the system after feature selection by the FR based technique, which also selects the features using the ‘without replacement’ technique. The FR based criterion selects the features poorly for the YOHO database; no subset of features shows better performance than the full feature set. However, for the POLYCOST database, the performances of the feature set ranked by FR are not significantly worse than the performances shown by the SVD-QRcp selected feature set.

• In this work, we adopted the policy of using, at the time of testing, low dimensional models obtained from the full 19 dimensional models, assuming statistical independence between the features. This allows us to build speaker models without waiting for the selected subset of features. The policy also saves considerable off-line time, as it allows us to train the speaker models and select the subset of features at the same time.

• We also fuse the outputs of the models developed from the best subsets of features by the PQ based fusion technique. The combination of the models that have already shown better performances on their own gives the best result.

• The SVD-QRcp method for best feature selection might not provide the optimal feature set, but it gives a very good indication for discarding higher order cepstral features, which can cause pitch mismatch.
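The ranking role attributed to QRcp above can be illustrated with a small, self-contained sketch of column-pivoted Gram-Schmidt orthogonalization. This is a generic illustration under stated assumptions, not the exact procedure of this chapter: the function name `qrcp_rank` and the representation of the data matrix as a list of columns are hypothetical, and the SVD step that fixes how many features to retain is not shown.

```python
import math

def qrcp_rank(columns):
    """Rank columns (each a list of floats) by column-pivoted modified
    Gram-Schmidt: at each step pick the column with the largest residual norm,
    then remove its direction from every column not yet selected."""
    residual = [list(c) for c in columns]
    remaining = list(range(len(columns)))
    order = []
    while remaining:
        pivot = max(remaining, key=lambda j: sum(v * v for v in residual[j]))
        norm = math.sqrt(sum(v * v for v in residual[pivot]))
        order.append(pivot)
        remaining.remove(pivot)
        if norm < 1e-12:
            # Everything left is numerically dependent on the chosen columns.
            order.extend(remaining)
            break
        q = [v / norm for v in residual[pivot]]
        for j in remaining:
            proj = sum(a * b for a, b in zip(q, residual[j]))
            residual[j] = [b - proj * a for a, b in zip(q, residual[j])]
    return order
```

Columns that are nearly linear combinations of already selected ones end up with small residual norms and are ranked last, which is the intuition behind using the pivot order as a feature ranking.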

References

[1] R. Bellman, Dynamic programming, Princeton University Press, 1957. (Cited in section 4.1.)
[2] B. W. Silverman, Density Estimation for Statistics and Data Analysis (Monographs on Statistics and Applied Probability), London: Chapman and Hall, 1986, pp. 75-93. (Cited in section 4.1.)
[3] A. Errity and J. McKenna, “A Comparative Study of Linear and Nonlinear Dimensionality Reduction for Speaker Identification,” in Proc. Digital Signal Processing (DSP 2007), 15th International Conference, no. 1-4, 2007, pp. 587-590. (Cited in section 4.1.)
[4] J. Tou and R. Gonzalez, Pattern recognition principles, London: Addison-Wesley, 1974, pp. 243-314. (Cited in section 4.1.)
[5] K. K. Paliwal, “Dimensionality reduction of the enhanced feature set for the HMM-based speech recognizer,” Digital Signal Process., vol. 2, no. 3, pp. 157-173, Jul. 1992. (Cited in section 4.1.1.)
[6] S. Pruzansky, “Talker recognition procedure based on analysis of variance,” J. Acoustical Society of America, vol. 36, no. 11, pp. 2041-2047, Nov. 1964. (Cited in section 4.1.1.)
[7] J. J. Wolf, “Efficient Acoustic Parameters for Speaker Recognition,” J. Acoustical Society of America, vol. 51, no. 6 (Part 2), pp. 2044-2056, Mar. 1971. (Cited in section 4.1.1.)
[8] G. Saha, S. Chakroborty, and S. Senapati, “An F-Ratio Based Optimization Technique for Automatic Speaker Recognition System,” in Proc. IEEE Annual Conference (Indicon 2004), 2005, pp. 70-73. (Cited in section 4.1.1.)
[9] K. S. Prasad, K. A. Sheelal, and M. Sridevi, “Optimization of TESPAR Features using Robust F-Ratio for Speaker Recognition,” in Proc. International Conference on Signal Processing, Communications and Networking (IEEE-ICSCN 2007), 2007, pp. 20-25. (Cited in section 4.1.1.)


[10] A. L-Garcia, Probability and Random Processes for Electrical Engineering, Pearson Education, Inc., 2nd ed., 2007, pp. 285-343. (Cited in section 4.1.1.)
[11] M. R. Sambur, “Selection of Acoustic features for Speaker Identification,” IEEE Transaction on Acoustics, Speech and Signal Process., vol. ASSP-23, no. 2, pp. 176-182, Apr. 1975. (Cited in section 4.1.1.)
[12] R. Cheung and B. Eisenstein, “Feature Selection via Dynamic Programming for Text-Independent Speaker Identification,” IEEE Trans. Audio Speech and Signal Process., vol. 26, no. 5, pp. 397-403, Oct. 1978. (Cited in section 4.1.1.)
[13] G. D. Nelson and D. M. Levy, “A dynamic programming approach to the selection of pattern features,” IEEE Trans. Syst. Sci. Cybern., vol. SSC-4, no. 2, pp. 145-151, Jul. 1968. (Cited in section 4.1.1.)
[14] C. Y. Chang, “Dynamic programming as applied to feature subset selection in a pattern recognition system,” IEEE Trans. Syst., Man, Cybern., vol. SMC-3, pp. 166-171, Mar. 1973. (Cited in section 4.1.1.)
[15] S. Kullback, Information Theory and Statistics, New York: Wiley, 1959. (Cited in section 4.1.1.)
[16] D. A. Reynolds and R. Rose, “Robust text-independent speaker identification using gaussian mixture speaker models,” IEEE Trans. on Speech and Audio Process., vol. 3, no. 1, pp. 72-83, Jan. 1995. (Cited in section 4.1.1.)
[17] B. R. Wildermoth and K. K. Paliwal, “Reducing Inter-Session Variability with Transitional Spectral Information,” in Proc. of Microelectronic Engineering Research Conference 2001, 2001. (Cited in section 4.1.1.)
[18] D. Charlet and D. Jouvet, “Optimizing feature set for speaker verification,” Pattern Recognition Lett., vol. 18, no. 9, pp. 873-879, Sept. 1997. (Cited in section 4.1.1.)
[19] F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. M. Chagnolleau, S. Meignier, T. Merlin, J. O. Garcia, D. P. Delacretaz, and D. A. Reynolds, “A Tutorial on Text Independent Speaker Verification,” EURASIP Journal on Applied Signal Process., vol. 2004, no. 4, pp. 430-451, 2004. (Cited in section 4.1.1.)
[20] R. D. Zilca, B. Kingsbury, J. Navratil, and G. N. Ramaswamy, “Pseudo pitch synchronous analysis of speech with applications to speaker recognition,” IEEE Trans. Speech, Audio and Language Process., vol. 14, no. 2, pp. 467-478, Mar. 2006. (Cited in sections 4.1.1 and 4.5.)
[21] M. Pandit and J. Kittler, “Feature Selection for a DTW-based Speaker Verification System,” in Proc. of the IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP), 1998, vol. 2, pp. 769-773. (Cited in section 4.1.1.)
[22] A. Haydar, M. Demirekler, and M. K. Yurtseven, “Feature selection using genetic algorithm and its application to speaker verification,” Electronics Lett., vol. 34, no. 15, pp. 1457-1459, Jul. 1998. (Cited in section 4.1.1.)
[23] T. Eriksson, S. Kim, H.-G. Kang, and C. Lee, “An Information-Theoretic Perspective on Feature Selection in Speaker Recognition,” IEEE Signal Process. Lett., vol. 12, no. 7, pp. 500-503, Jul. 2005. (Cited in section 4.1.1.)

[24] J. P. Campbell, Jr., “Testing with the YOHO CDROM voice verification corpus,” in Proc. International Conference on Acoustic, Speech, and Signal Process. (ICASSP 1995), 1995, pp. 341-344. (Cited in section 4.1.1.)
[25] NIST-Speaker Recognition Evaluation, (2004). [Online]. Available: http://www.nist.gov/speech/tests/spk/2004/index.htm (Cited in section 4.1.1.)
[26] T. Ganchev, P. Zervas, N. Fakotakis, and G. Kokkinakis, “Benchmarking Feature Selection Techniques on the Speaker Verification Task,” in Proc. Fifth International Symposium On Communication Systems, Networks And Digital Signal Processing (CSNDSP 2006), 2006, pp. 314-318. (Cited in section 4.1.1.)
[27] D. A. V. Leeuwen, A. F. Martin, M. A. Przybocki, and J. S. Bouten, “NIST and NFI-TNO evaluations of automatic speaker recognition,” Computer Speech and Language, vol. 20, no. 2-3, pp. 128-158, Apr.-Jul. 2006. (Cited in section 4.1.1.)
[28] R. Battiti, “Using mutual information for selecting features in supervised neural net learning,” IEEE Trans. Neural Netw., vol. 5, no. 4, pp. 537-550, Jul. 1994. (Cited in section 4.1.1.)
[29] D. P. W. Ellis and J. A. Bilmes, “Using mutual information to design feature combinations,” in Proc. International Conf. on Spoken Language Processing (ICSLP 2000), 2000, pp. 79-82. (Cited in section 4.1.1.)
[30] N. Kwak and C.-H. Choi, “Input feature selection by mutual information based on Parzen windows,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 12, pp. 1667-1671, Dec. 2002. (Cited in section 4.1.1.)
[31] G. H. Golub and C. F. V. Loan, Matrix Computations, Johns Hopkins University Press, 3rd ed., 1996, pp. 48-80. (Cited in sections 4.1.2 and 4.2.1.)
[32] P. P. Kanjilal, Adaptive prediction and predictive control, Peter Peregrinus Ltd., 1995, pp. 56-107. (Cited in sections 4.1.2, 4.3.2 and 4.3.4.)
[33] S. Ari and G. Saha, “In Search of an SVD and QRcp Based Optimization Technique of ANN for Automatic Classification of Abnormal Heart Sounds,” International Journal of Biomedical Sciences, vol. 2, no. 1, Feb. 2007. (Cited in sections 4.1.2 and 4.2.2.)
[34] P. P. Kanjilal, G. Saha, and T. J. Koickal, “On Robust Nonlinear Modeling of a Complex Process with Large Number of Inputs Using m-QRcp Factorization and Cp Statistic,” IEEE Trans. Systems, Man, and Cybernetics, Part B, vol. 29, no. 1, pp. 1-12, Feb. 1999. (Cited in sections 4.1.2 and 4.2.2.)
[35] Y. Hua and W. Liu, “Generalized Karhunen-Loève Transform,” IEEE Signal Process. Lett., vol. 5, no. 6, pp. 141-142, Jun. 1998. (Cited in section 4.2.1.)
[36] Y. Linde, A. Buzo, and R. M. Gray, “An algorithm for vector quantizer design,” IEEE Trans. Commun., vol. 28, no. 1, pp. 84-95, Jan. 1980. (Cited in section 4.3.1.)
[37] K. S. R. Murty and B. Yegnanarayana, “Combining evidence from residual phase and MFCC features for speaker recognition,” IEEE Signal Process. Lett., vol. 13, no. 1, pp. 52-55, Jan. 2006. (Cited in sections 4.3.3 and 4.5.)
[38] J. R. Deller Jr., J. H. L. Hansen, and J. G. Proakis, Discrete-Time Processing of Speech Signals, 2nd ed. New York: IEEE Press, 2000. (Cited in section 4.3.3.)


[39] T. Kinnunen, E. Karpov, and P. Fränti, “Real-Time Speaker Identification and Verification,” IEEE Trans. Speech and Audio Process., vol. 14, no. 1, pp. 277-288, Jan. 2006. (Cited in section 4.6.)

CHAPTER 5

Studies on Input Fusion and Output Normalization for Speaker Identification Application

Preface

This chapter describes integration of information at various levels in SI application. Here, both feature level and score level fusions are investigated. A weighted feature level fusion and two different score level fusion strategies have been proposed.

5.1 Introduction

The combination of different sources of information has been explored within fields known as data fusion, consensus building, team decision theory, and combination of multiple experts, along with numerous other titles. Often, we refer to the combination of data from various sources as data fusion [1], [2]. As in general data fusion, we recognize from the literature that the earliest level of fusion is signal fusion, followed by feature fusion. Both signal and feature fusion can be grouped into early-integration or input fusion [3]. Feature fusion simply consists of concatenating the feature vectors into a larger dimensional feature vector, which has several disadvantages: 1) the “curse of dimensionality”; 2) the difficulty of taking the reliability of either feature set into account (a corrupted feature set can compromise and dominate the entire concatenated feature vector [4], [5]). The next level available for fusion is at the mapping/modeling stage and is referred to as middle integration. Coupled HMMs in speech recognition are a common example of this approach [6], [7], [8]. Multiple experts/classifiers can be combined in the score-space. This can be classified under late-integration or output fusion. Integration can also take place at the post-classifier level; for example, a secondary classifier employs the output scores from the primary classifiers as new features and performs a further classification [9]. Examples of decision combination rules include the AND rule (all classifier decisions must agree), the OR rule (a decision is made if any classifier makes a decision), and the majority vote rule (a majority of the classifiers must agree). For decision fusion, the number of classifiers should be higher than the number of classes. This is reasonable for person verification. For person identification, the number of classes is large, rendering decision fusion unsuitable. For fuzzy decisions, Dempster-Shafer theory can be used for score combination [10].
Within the context of SR, data fusion comprises either the concatenation of multiple feature sets from the same speech signal or the combination of scores from different models trained for a speaker. Here, the method of concatenation is adopted to combine the static cepstral vector (static MFCC) with its delta (Delta MFCC) and delta-delta (DeltaDelta) versions. It is generally expected that a system developed from this type of combined feature would perform better than a system based only on the static feature vectors. However, a recent study [11] shows that the system that models static plus time-derivative parameters (delta features) performs worse than the one using only static ones. The delta-features do not contain more information than is already in the static features, and from the theory, no gain can be achieved [12] by using them

Figure 5.1: Different levels of integration. Panels: Early Integration (Feature Extractors 1, 2, . . . , G; Feature Vectors; Modeling (Single Expert)) and Late Integration (Separate Expert for each feature set: Experts 1, 2, . . . , G; Score Fusion).

together as a single feature set. In this chapter, we first show the performance of an SI system that uses concatenated feature vectors composed of MFCC and IMFCC feature sets. Then a weighted concatenation is proposed, where the weights are determined according to the discrimination power of each feature set. Next, we apply the same idea to concatenate GMFCC and GIMFCC feature sets and evaluate the performances. On the other hand, in the case of late/output fusion, speaker models may be trained with different speech data (voiced and unvoiced portions from the same utterances), different feature data, or different modeling techniques (SVM and GMM) [13]. Ultimately, it is desired that the errors of one model are corrected by the others and vice versa. If all models are in agreement upon an error, i.e., they all make the same mistake, then no combination will rectify that error. However, as long as the errors are uncorrelated to some degree, performance can be improved with a proper combination. Some of the commonly used classifier output score combining rules have been summarized by Kittler et al. [14]. They found that the simple sum rule, i.e., combining the individual classifier outputs by summing them, gave the best recognition performance, and showed theoretically that the sum rule was the most resilient to estimation errors. The major drawbacks of classifier output fusion systems are increased time and memory requirements. For each speaker, a separate model must be stored and in the recognition


stage, each classifier must compute its own match score. The overall time increases with the number of classifiers as well as the complexity (model orders) within them. Expert scores can take many forms such as posteriors, likelihoods, and distance measures. Non-normalized scores [15] cannot be integrated sensibly in their raw form, as it is impossible to fuse incomparable numerical scales. The min-max technique [16] is the most basic form of score normalization; it shifts and scales the scores into the range [0,1]. The min-max norm is most suitable when the pre-normalized scores have known bounds; it can still be used otherwise, but will be extremely sensitive to outlier scores. While being straightforward to implement, the min-max norm has been found to have comparable performance to more complicated normalization methods [16]; hence, it is used for the experiments reported here. This omits the worst (outlier) expert scores. In this work, two different normalization methods have been proposed in order to find the weights for the weighted sum of the scores obtained from different classifiers. The idea is to normalize/stabilize each classifier's output individually through a weight, where the weight reflects the classifier's power of discrimination among the speakers. Note that in this study we fuse the scores of the classifiers developed from the MFCC and IMFCC feature sets, respectively. Using the same normalization procedures, we also extend this work to fuse the scores generated from Gaussian filter based (GMFCC, GIMFCC) classifiers.
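As a minimal sketch of the generic building blocks mentioned above (min-max normalization followed by a weighted sum of two classifiers' scores), the following Python fragment may help. The function names and the example weights are hypothetical, and the sketch does not implement the two normalization methods proposed later in this chapter.

```python
def min_max_normalize(scores):
    """Shift and scale a list of scores into the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Degenerate case: all scores equal, no ranking information.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def weighted_sum_fusion(scores_a, scores_b, w_a, w_b):
    """Fuse per-speaker scores of two classifiers after min-max normalization.
    Returns the index of the winning speaker and the fused scores."""
    norm_a = min_max_normalize(scores_a)
    norm_b = min_max_normalize(scores_b)
    fused = [w_a * a + w_b * b for a, b in zip(norm_a, norm_b)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, fused
```

For example, `weighted_sum_fusion([-120.0, -100.0, -110.0], [-50.0, -40.0, -30.0], 0.6, 0.4)` normalizes each score list separately before combining, so the two classifiers contribute on a comparable scale regardless of their raw score ranges.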

5.1.1 Organization of the chapter

The rest of the chapter is organized as follows. In Section 5.2, information fusion is performed by concatenating the feature vectors. Next, in Section 5.3, scores from various classifiers are fused using two different normalization techniques. Then, in Section 5.4, the scores from the best models are merged using the same score level fusion strategies, where the models are derived from the best set of features selected by SVD-QRcp. Finally, the conclusions are presented in Section 5.5.

5.2 Weighted Feature Level fusion

Over the years, speaker recognition technology has been using concatenated feature vectors [17], [18], [19], [20]. In this context, a concatenated vector is generally composed of static and dynamic cepstral coefficients or some source based features with the same static cepstral parameters. The aim of the concatenation is to separate the speakers in multidimensional space by adding more dimensions to the existing set of features.


However, it is not certain that the SI accuracy would necessarily be improved with the addition of new features. If the added features do not contribute to the demarcation of speakers and instead confuse the multidimensional models, then the overall error rate would increase [11]. However, a very recent study [21] on multi-modal biometrics has shown that systems that integrate information at an early stage of processing are believed to be more effective than systems that perform integration at a later stage. Since the feature set contains richer information about the input speech signal than the match score, integration at this level is expected to provide better recognition results than integration at the match score level. Normally, in feature level fusion, two or more feature sets are concatenated without using weights. This suggests that all the feature sets are given equal importance; however, in many practical scenarios, each feature set has its own capability in the classification decision. Therefore, the normal feature concatenation could be considered as a weighted feature level fusion that uses equal weights for the different feature sets, where the weights are defined as

\[ w_i = \frac{1}{G} \quad \forall i,\ i \in \{1, \ldots, G\} \tag{5.1} \]

and G is the number of feature sets used. We propose here a weighting scheme for the concatenation of MFCC with IMFCC and GMFCC with GIMFCC feature sets, where the weights are determined from the divergence. The divergence measure (see Sec. 3.6) gives a rough estimate of the performance of a feature set as far as linear discrimination among speakers is concerned. The weights can be found easily using the following equations:

\[ w_1 = \frac{\mathrm{Div}_{\mathrm{MFCC}}}{\mathrm{Div}_{\mathrm{MFCC}} + \mathrm{Div}_{\mathrm{IMFCC}}} \tag{5.2} \]

and

\[ w_2 = 1 - w_1 \tag{5.3} \]

Similarly, the weights have been calculated for the GMFCC and GIMFCC feature sets. Table 5.1 shows the weights for the different feature sets. The calculated weights are multiplied with all the training vectors before training begins. At the time of testing, the same weights are used to scale the incoming test feature vectors. Figure 5.2 shows the feature level merging scheme.
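Equations (5.2) and (5.3), together with the scaling of the vectors before concatenation, can be sketched as follows. This is an illustration only: the function names are hypothetical, the divergence values are assumed to have been computed elsewhere (Sec. 3.6), and each stream is assumed to be scaled by its own weight before the two are joined.

```python
def divergence_weights(div_a, div_b):
    """Weights of equations (5.2) and (5.3) from two divergence values."""
    w1 = div_a / (div_a + div_b)
    return w1, 1.0 - w1

def weighted_concat(vec_a, vec_b, w1, w2):
    """Scale each feature vector by its weight and concatenate them."""
    return [w1 * v for v in vec_a] + [w2 * v for v in vec_b]
```

The same pair of weights would be applied to every training vector and, at test time, to every incoming test vector, so that models and test data live in the same scaled feature space.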

Table 5.1: Assigned weights for different feature sets for the YOHO and POLYCOST databases.

Feature Sets    YOHO      POLYCOST
MFCC            0.6176    0.6283
IMFCC           0.3824    0.3717
GMFCC           0.6181    0.6250
GIMFCC          0.3819    0.3750

5.2.1 Performances after Feature Level Fusion using MFCC & IMFCC feature sets

The results for equal and divergence based weighted feature level fusion are described in Tables 5.2 and 5.3.

Table 5.2: SI accuracies after feature level fusion using MFCC-IMFCC paradigm for YOHO database (w1 = 0.6176, w2 = 0.3824).

M     MFCC     IMFCC    Concatenated (38D)    Concatenated (38D)
      (19D)    (19D)    (Equal Weights)       (Div. Based Weights)
2     74.31    78.04    82.21                 82.81
4     84.86    86.50    90.01                 91.29
8     90.69    91.99    94.38                 95.16
16    94.20    94.15    96.23                 96.38
32    95.67    94.22    96.94                 96.96
64    96.79    94.76    97.10                 97.48

Div. = Divergence

Some notable points can be observed from these tables (Tables 5.2 & 5.3).

• Models that have been trained by the concatenated feature vectors outperform the MFCC based system significantly for both the databases.

• The system that uses divergence based weighted concatenations of feature vectors performs better than that which uses equal weight based feature vectors.

• It is observed that MFCC feature set has been given higher weightage than IMFCC feature set as MFCC performs better than IMFCC in the SI context irrespective of model orders and databases.

Figure 5.2: Feature level fusion strategy. Training path: Training Speech Data; Pre-processing Stage; MFCC/GMFCC and IMFCC/GIMFCC Feature Extraction; Feature Vectors; Divergence (weights w1, w2); Fusion; Speaker Modeling using concatenated feature vectors (speakers 1, 2, 3, . . . , S). Testing path: Testing Speech Data; Pre-processing Stage; MFCC/GMFCC and IMFCC/GIMFCC Feature Extraction; Feature Vectors; Fusion; Matching Algorithm; Sum over all frames; Final Output.

The system developed from the concatenated feature sets (MFCC and IMFCC) performs worse than the system that uses PQ based fusion (refer to section 2.6) at the highest model order (64 for YOHO and 16 for POLYCOST) for both the databases. However, for the YOHO database, the former system performs better than the latter for lower

Table 5.3: SI accuracies after feature level fusion using MFCC-IMFCC paradigm for POLYCOST database (w1 = 0.6283, w2 = 0.3717).

M     MFCC     IMFCC    Concatenated (38D)    Concatenated (38D)
      (19D)    (19D)    (Equal Weights)       (Div. Based Weights)
2     63.93    55.97    67.37                 68.99
4     72.94    68.04    75.73                 76.13
8     77.85    76.26    80.77                 80.80
16    77.85    77.06    80.77                 80.90

order models (compare between seventh column of Table 2.5 and fourth column of Table 5.2). Except for model orders 4 and 16, the concatenated system shows better SI accuracies than the PQ based fused system for the POLYCOST database (compare between seventh column of Table 2.6 and fourth column of Table 5.3).

5.2.2 Performances after Feature level fusion using GMFCC & GIMFCC feature sets

Table 5.4: SI accuracies after feature level fusion using GMFCC-GIMFCC paradigm for YOHO database (w1 = 0.6181, w2 = 0.3819).

M     GMFCC    GIMFCC   Concatenated (38D)    Concatenated (38D)
      (19D)    (19D)    (Equal Weights)       (Div. Based Weights)
2     79.82    78.29    83.77                 84.06
4     90.31    87.23    91.45                 91.63
8     94.66    92.55    95.54                 95.67
16    96.50    94.64    96.68                 96.79
32    97.19    95.49    97.19                 97.30
64    97.54    96.05    97.70                 97.72

Tables 5.4 and 5.5 present the results of feature level fusion using the GMFCC and GIMFCC feature sets. The results show trends similar to those observed in the MFCC-IMFCC


Table 5.5: SI accuracies after feature level fusion using GMFCC-GIMFCC paradigm for POLYCOST database (w1 = 0.6250, w2 = 0.3750).

M     GMFCC    GIMFCC   Concatenated (38D)   Concatenated (38D)
      (19D)    (19D)    (Equal Weights)      (Div. Based Weights)
2     66.05    56.90    68.99                70.32
4     76.52    69.10    76.92                77.85
8     80.11    77.59    80.90                82.00
16    80.24    77.45    81.57                83.02

based results shown in the previous sub-section. Concatenations of both kinds realized from the GMFCC and GIMFCC feature sets outperform the corresponding concatenations developed from the MFCC & IMFCC feature sets in terms of SI accuracies over all the model orders. For the YOHO database, the system that uses concatenated feature sets (GMFCC and GIMFCC) performs worse than the system developed from the PQ based fusion strategy over all the model orders (compare the seventh column of Table 3.6 with the fourth column of Table 5.4). However, for the POLYCOST database, the former system outperforms the latter significantly for all the four model orders (i.e. 2, 4, 8, and 16).
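As an illustration of the concatenation step behind Tables 5.3 to 5.5, the following is a minimal sketch, assuming that each 19-dimensional stream is simply multiplied by its weight before the two streams are stacked into one 38-dimensional vector per frame. The function name `weighted_concat` and the toy frames are hypothetical; only the weights are taken from the caption of Table 5.3.

```python
def weighted_concat(mfcc_frames, imfcc_frames, w1, w2):
    """Scale each stream by its weight and stack the two streams frame by frame.

    Assumption of this sketch: the weight multiplies every coefficient of its
    stream before concatenation.
    """
    if len(mfcc_frames) != len(imfcc_frames):
        raise ValueError("both streams must contain the same number of frames")
    fused = []
    for m, i in zip(mfcc_frames, imfcc_frames):
        fused.append([w1 * x for x in m] + [w2 * x for x in i])
    return fused


# Toy data: two frames of 19 coefficients per stream (illustrative values only).
mfcc_frames = [[1.0] * 19, [2.0] * 19]
imfcc_frames = [[-1.0] * 19, [0.5] * 19]

# Weights quoted for the POLYCOST MFCC-IMFCC case (Table 5.3).
fused_frames = weighted_concat(mfcc_frames, imfcc_frames, 0.6283, 0.3717)
print(len(fused_frames), len(fused_frames[0]))
```

Setting both weights to the same value gives the equal weight variant reported in the third column of results.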

5.3 Weighted Score Level Fusion using Output Normalization

For development of SI systems, numerous feature extraction and classification algorithms have been proposed over the years. However, it is still difficult to implement a single classifier that exhibits sufficiently high performance in practical applications. As a result, many researchers have cited the fusion of multiple information sources or classifiers as a promising option in SI research. The idea is not to rely on a single decision making scheme. Instead, Multiple Classifiers (MC) are used to derive a consensus decision. In this context, Altınçay and Demirekler [22] improved SI performance by fusing two classifiers, where one used a form of channel compensation and the other did not. They showed that SI performance is very sensitive to the signal processing done when extracting a particular acoustic feature. Similarly, Chen and Chi [23] applied a novel method of combining multiple classifiers, using different feature sets extracted from the same raw speech signal, to an SI task. They showed that the combination of classifiers based on different spectrum representations can be used to improve the robustness of SI systems. However, they performed SI experiments on clean speech for a population of only 20 male speakers. Along the same line of thought, Ramachandran et al. [1] provided a discussion of how various forms of diversity, redundancy and fusion can be used to improve the performance of SI systems. Their experiments showed improvements in speaker verification performance when forming a simple linear combination of three different classifiers on the same front end features. In a recent work, Mashao and Skosan [11] developed an MC based SI system where the fusion strategy is a simple weighted ‘SUM’ rule. The authors, however, have chosen common weights for all the test data over varying sets of speakers. Unfortunately, there are few studies that provide a sound theoretical underpinning for the improvements gained in MC systems, resulting in an inadequate understanding of why some combination rules are better than others and in what circumstances.

Figure 5.3: Model level fusion strategy (block diagram: Training Speech Data and Testing Speech Data, each followed by a Pre-processing Stage and by MFCC/GMFCC and IMFCC/GIMFCC Feature Extraction; Model bank using MFCC/GMFCC feature sets and Model bank using IMFCC/GIMFCC feature sets for speakers 1, 2, 3, ..., S; Matching Algorithm, Scores and Sum over all frames for each stream; Weight Calculation giving w1 and w2; Fusion by SUM; Final Output L^s_com).

Among the three different levels (Abstract level, Rank level, and Measurement level) of classifier outputs, the measurement level is the one conveying the greatest amount of information about the relative degree to which each particular class may be the correct one, and this information can be quite useful during combination. For example, when the measurement values of all classes are very close to each other, the classifier may be considered as not being sure about its most likely pattern class. Consider the case of three pattern classes and two different output vectors o1 = [1.5, 0.0, 0.0] and o2 = [0.11, 0.10, 0.10]. Let the correct class be the first class. For abstract or rank based combination approaches, these outputs cannot be differentiated from each other. However, the first output vector conveys strong evidence that the correct class is the first one, which is not the case for the second. Now suppose that the outputs o1 = [0.0, 2.0, 0.0] and o2 = [0.10, 0.11, 0.10] are obtained when the first class is tested. For the abstract or rank level combination approaches, both outputs have equivalent effects in the combination operation, whereas, for measurement level combination, the weaker information coming from the second output vector can be much more easily compensated by another correct output.
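The three-class example above can be checked with a short sketch. The helper names are hypothetical, and the "correct" classifier output `o_correct` is an assumed value added only to show how a weak wrong vote is compensated under an element-wise sum.

```python
def top_class(scores):
    """Abstract level: index of the winning class only."""
    return max(range(len(scores)), key=lambda k: scores[k])


def ranking(scores):
    """Rank level: class indices ordered from best to worst (ties keep class order)."""
    return sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)


def top_margin(scores):
    """Measurement level information: gap between the best and second best value."""
    ordered = sorted(scores, reverse=True)
    return ordered[0] - ordered[1]


def sum_rule(a, b):
    """Measurement level combination by an element-wise sum."""
    return [x + y for x, y in zip(a, b)]


# Case 1: both outputs pick the first class, with very different confidence.
o1, o2 = [1.5, 0.0, 0.0], [0.11, 0.10, 0.10]
print(top_class(o1) == top_class(o2), ranking(o1) == ranking(o2))
print(top_margin(o1), top_margin(o2))

# Case 2: both outputs wrongly pick the second class when the first is tested.
o1_wrong, o2_wrong = [0.0, 2.0, 0.0], [0.10, 0.11, 0.10]
o_correct = [1.5, 0.0, 0.0]  # assumed output of another, correct classifier
print(top_class(sum_rule(o1_wrong, o_correct)))  # confident error survives the sum
print(top_class(sum_rule(o2_wrong, o_correct)))  # weak error is compensated
```

At the abstract and rank levels the two outputs in each case are indistinguishable; only the measurement values show that the second vector carries weak evidence.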

These discussions mainly emphasize the advantages of using measurement level classifier outputs in combination. However, a major problem in this approach is the incomparability/incompatibility of classifier outputs (Ho et al. [24]). Classifiers based on parametric modeling provide likelihood values, whereas non-parametric classifiers provide some cost or distance values. Also, different classifiers may depend on different feature vectors, and the dynamic ranges of these vectors are not generally the same. As a result, the scales of the outputs from different classifiers are incomparable and need preprocessing before combination [24].

In this work, we fuse the log-likelihood scores from the classifiers developed from the MFCC and IMFCC feature sets. Similarly, we combine the scores of the speaker models that use GMFCC and GIMFCC features. To fuse the scores, a weighted ‘SUM’ rule has been chosen, where the weights are determined from the raw likelihood scores of the classifiers by applying the min-max based normalization method. The idea is to scale the classifiers’ outputs up or down such that the scores are compatible with each other before fusion takes place. Note that the sum rule outperforms other combination strategies due to its lower sensitivity to estimation errors [11], [14], [25].


5.3.1 Weight Calculation using Best Speaker and Most Competing Speaker

In an SI task, a series of test vectors is sent to all the models in the database. Log-likelihood scores are obtained for all the test vectors, and the sum of these scores gives the final score for a speaker (ref. eqn. 2.23). This scoring procedure is the same irrespective of the feature sets used for an SI experiment. The fusion of the scores obtained from different classifiers (in this case MFCC & IMFCC and GMFCC & GIMFCC) via the weighted SUM rule is given by,

\[ L^{s}_{\mathrm{com}} = w_{1} \cdot \sum_{t=1}^{T} \log p\left(X_{t}^{\mathrm{MFCC}} \mid \lambda^{s}_{\mathrm{MFCC}}\right) + w_{2} \cdot \sum_{t=1}^{T} \log p\left(X_{t}^{\mathrm{IMFCC}} \mid \lambda^{s}_{\mathrm{IMFCC}}\right) \tag{5.4} \]

\[ L^{gs}_{\mathrm{com}} = w_{1g} \cdot \sum_{t=1}^{T} \log p\left(X_{t}^{\mathrm{GMFCC}} \mid \lambda^{s}_{\mathrm{GMFCC}}\right) + w_{2g} \cdot \sum_{t=1}^{T} \log p\left(X_{t}^{\mathrm{GIMFCC}} \mid \lambda^{s}_{\mathrm{GIMFCC}}\right) \tag{5.5} \]

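where all the symbols have their usual meanings except the suffix ‘com’, which indicates the combined score, and ‘g’, which stands for the Gaussian filter bank based system. To find the weights, we use the scores of only two speakers: the winner and the best among the remaining speakers (the runner-up). The weights w_1 and w_2 are given by,

\[ w_{1} = \frac{\max_{s=1}^{S} L^{s}_{\mathrm{MFCC}} - \operatorname{max2}_{s=1}^{S} L^{s}_{\mathrm{MFCC}}}{\max_{s=1}^{S} L^{s}_{\mathrm{MFCC}}} \tag{5.6} \]

\[ w_{2} = \frac{\max_{s=1}^{S} L^{s}_{\mathrm{IMFCC}} - \operatorname{max2}_{s=1}^{S} L^{s}_{\mathrm{IMFCC}}}{\max_{s=1}^{S} L^{s}_{\mathrm{IMFCC}}} \tag{5.7} \]

and for fusing the GMFCC-GIMFCC based system, the same weight evaluation method is adopted, which is given by,

\[ w_{1g} = \frac{\max_{s=1}^{S} L^{s}_{\mathrm{GMFCC}} - \operatorname{max2}_{s=1}^{S} L^{s}_{\mathrm{GMFCC}}}{\max_{s=1}^{S} L^{s}_{\mathrm{GMFCC}}} \tag{5.8} \]

\[ w_{2g} = \frac{\max_{s=1}^{S} L^{s}_{\mathrm{GIMFCC}} - \operatorname{max2}_{s=1}^{S} L^{s}_{\mathrm{GIMFCC}}}{\max_{s=1}^{S} L^{s}_{\mathrm{GIMFCC}}} \tag{5.9} \]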


The operator max2 picks out the score of the second best speaker when an utterance is put under test. Note that these weights depend upon the raw log-likelihood scores generated by the classifiers, where there might be a mismatch between the dynamic ranges of the outputs. The weights are not constant over all the utterances (i.e. they are utterance specific); moreover, as both the weights lie between 0 and 1, they scale the raw output scores down after multiplication into a suitable range for combination. The above equations directly indicate the power of each classifier through the weights, as the numerator takes the difference between the scores of the winner and the most competing speaker among the rest. Therefore, it is expected that SI accuracy would be increased as compared to the system that uses the equal weights based SUM rule. As this weighting method is stream dependent, further weights can be obtained in the same way when more streams are added to the system.
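The winner and runner-up weighting followed by the weighted SUM rule can be sketched as below. This is a sketch under stated assumptions, not the thesis implementation: the function names and the per-speaker scores are hypothetical, and because summed log-likelihoods are normally negative, the sketch divides by the magnitude of the top score so that the weight comes out positive.

```python
def winner_runner_up_weight(scores):
    """Stream weight from the gap between the best and second best speaker scores.

    Assumption of this sketch: the denominator is the magnitude of the best
    score, which keeps the weight positive for negative log-likelihood sums.
    """
    if len(scores) < 2:
        raise ValueError("at least two speaker scores are needed")
    ordered = sorted(scores, reverse=True)
    best, second = ordered[0], ordered[1]
    if best == 0:
        raise ValueError("best score must be non-zero")
    return (best - second) / abs(best)


def weighted_sum_fusion(scores_a, scores_b):
    """Weighted SUM rule: one utterance-specific weight per stream."""
    w1 = winner_runner_up_weight(scores_a)
    w2 = winner_runner_up_weight(scores_b)
    fused = [w1 * a + w2 * b for a, b in zip(scores_a, scores_b)]
    return fused, w1, w2


# Hypothetical summed log-likelihoods for three speakers and one test utterance.
scores_mfcc = [-1000.0, -1100.0, -1250.0]
scores_imfcc = [-1040.0, -1020.0, -1200.0]

fused, w1, w2 = weighted_sum_fusion(scores_mfcc, scores_imfcc)
identified = max(range(len(fused)), key=lambda s: fused[s])
print(w1, w2, identified)
```

In this toy case the MFCC stream separates its winner from the runner-up more clearly, so it receives the larger weight and its winner is the identified speaker.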

5.3.2 Weight Calculation using min-max Operator

In the previous proposition, every speaker has been given the same weights for a particular utterance. This is a somewhat sub-optimal case, where all the speakers are treated in the same manner. However, there might be some speakers in the database whose models’ scores are very different from the rest. Although they can be easily separated from the rest of the speakers, from the viewpoint of a classifier the scores generated by them are outliers. The effect of outliers remains present even if the scores are normalized by the max operator. The ‘min-max’ operator, though not very powerful for removing this artifact, has often been used for its easy implementation. We calculate the weights from such a min-max operator, which is given by,

\[ w^{s}_{1} = \frac{L^{s}_{\mathrm{MFCC}} - \min_{s=1}^{S} L^{s}_{\mathrm{MFCC}}}{\max_{s=1}^{S} L^{s}_{\mathrm{MFCC}} - \min_{s=1}^{S} L^{s}_{\mathrm{MFCC}}} \tag{5.10} \]

\[ w^{s}_{2} = \frac{L^{s}_{\mathrm{IMFCC}} - \min_{s=1}^{S} L^{s}_{\mathrm{IMFCC}}}{\max_{s=1}^{S} L^{s}_{\mathrm{IMFCC}} - \min_{s=1}^{S} L^{s}_{\mathrm{IMFCC}}} \tag{5.11} \]

\[ w^{s}_{g1} = \frac{L^{s}_{\mathrm{GMFCC}} - \min_{s=1}^{S} L^{s}_{\mathrm{GMFCC}}}{\max_{s=1}^{S} L^{s}_{\mathrm{GMFCC}} - \min_{s=1}^{S} L^{s}_{\mathrm{GMFCC}}} \tag{5.12} \]

\[ w^{s}_{g2} = \frac{L^{s}_{\mathrm{GIMFCC}} - \min_{s=1}^{S} L^{s}_{\mathrm{GIMFCC}}}{\max_{s=1}^{S} L^{s}_{\mathrm{GIMFCC}} - \min_{s=1}^{S} L^{s}_{\mathrm{GIMFCC}}} \tag{5.13} \]

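The suffix ‘s’ denotes the s-th speaker in the database. We also modify the way the weights are incorporated in the ‘SUM’ rule (see Eqn. 5.5). The modified ‘SUM’ rule for speaker s is given by,

\[ \begin{aligned} L^{s}_{\mathrm{com}} = {} & w^{s}_{1} \cdot \sum_{t=1}^{T} \log p\left(X_{t}^{\mathrm{MFCC}} \mid \lambda^{s}_{\mathrm{MFCC}}\right) + \left(1 - w^{s}_{1}\right) \cdot \sum_{t=1}^{T} \log p\left(X_{t}^{\mathrm{IMFCC}} \mid \lambda^{s}_{\mathrm{IMFCC}}\right) \\ & + w^{s}_{2} \cdot \sum_{t=1}^{T} \log p\left(X_{t}^{\mathrm{IMFCC}} \mid \lambda^{s}_{\mathrm{IMFCC}}\right) + \left(1 - w^{s}_{2}\right) \cdot \sum_{t=1}^{T} \log p\left(X_{t}^{\mathrm{MFCC}} \mid \lambda^{s}_{\mathrm{MFCC}}\right) \end{aligned} \tag{5.14} \]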

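Rearranging the above equation, we get the total effective weight for each stream. The rearranged equation is given by,

\[ L^{s}_{\mathrm{com}} = \left(1 + w^{s}_{1} - w^{s}_{2}\right) \cdot \sum_{t=1}^{T} \log p\left(X_{t}^{\mathrm{MFCC}} \mid \lambda^{s}_{\mathrm{MFCC}}\right) + \left(1 + w^{s}_{2} - w^{s}_{1}\right) \cdot \sum_{t=1}^{T} \log p\left(X_{t}^{\mathrm{IMFCC}} \mid \lambda^{s}_{\mathrm{IMFCC}}\right) \tag{5.15} \]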

Hence, the total weights for the MFCC and IMFCC streams are (1 + w^s_1 − w^s_2) and (1 + w^s_2 − w^s_1), respectively. The equation uses both weights with both the classifiers’ outputs. As a result, the score of each classifier appears twice in the equation. However, when a weight is assigned to a particular classifier, the difference between 1 and that weight is assigned to the other classifier. This pronounces the effect of the classifier whose weight is larger. The larger the weight used for a stream, the higher is its importance, consequently penalizing the other. Note that the maximum weight that can be assigned to a stream is 2; at the same time, the weight of the other stream will be zero. Therefore, in this case, the decision about a speaker’s identity relies more on the heavily weighted stream, which in turn is guided by a speaker’s score. Viewed in another manner, this fusion scheme discriminates a speaker based on the stream which is most suitable for that speaker. For example, a large corpus may contain some speakers who can be discriminated using the MFCC stream alone; similarly, the same database may contain other speakers for whom the IMFCC stream might be more appropriate. This is an area where further research can be conducted. The same fusion rule can be applied to the GMFCC-GIMFCC paradigm. This fusion rule extends naturally to the multi-stream case. For G streams, the equation can be written as,

\[ \begin{aligned} L^{s}_{\mathrm{com}} = {} & \Big(G - 1 + w^{s}_{1} - \sum_{i=1, i \neq 1}^{G} w^{s}_{i}\Big) \cdot \sum_{t=1}^{T} \log p\left(X_{t}^{1} \mid \lambda^{s}_{1}\right) \\ & + \Big(G - 1 + w^{s}_{2} - \sum_{i=1, i \neq 2}^{G} w^{s}_{i}\Big) \cdot \sum_{t=1}^{T} \log p\left(X_{t}^{2} \mid \lambda^{s}_{2}\right) + \dots \\ & + \Big(G - 1 + w^{s}_{G} - \sum_{i=1, i \neq G}^{G} w^{s}_{i}\Big) \cdot \sum_{t=1}^{T} \log p\left(X_{t}^{G} \mid \lambda^{s}_{G}\right) \end{aligned} \tag{5.16} \]

For G = 2, this reduces to Eqn. 5.15.
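The min-max weighting and the modified SUM rule can be sketched for any number of streams as follows. The function names and scores are hypothetical. The effective weight of a stream is taken here as (G − 1) plus its own weight minus the weights of the other streams, an assumption chosen so that two streams give exactly the weights (1 + w^s_1 − w^s_2) and (1 + w^s_2 − w^s_1) of Eqn. 5.15.

```python
def min_max_weights(scores):
    """Per-speaker weights of one stream: each score rescaled to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        raise ValueError("scores must not all be equal")
    return [(s - lo) / (hi - lo) for s in scores]


def effective_weights(stream_weights, speaker):
    """Effective weight of every stream for one speaker (assumed G - 1 offset)."""
    g_count = len(stream_weights)
    total = sum(w[speaker] for w in stream_weights)
    result = []
    for w in stream_weights:
        others = total - w[speaker]  # weights of all the other streams
        result.append((g_count - 1) + w[speaker] - others)
    return result


def min_max_fusion(streams):
    """Modified SUM rule; streams is a list of G lists of per-speaker scores."""
    stream_weights = [min_max_weights(scores) for scores in streams]
    fused = []
    for speaker in range(len(streams[0])):
        eff = effective_weights(stream_weights, speaker)
        fused.append(sum(e * scores[speaker] for e, scores in zip(eff, streams)))
    return fused, stream_weights


# Hypothetical summed log-likelihoods for three speakers in two streams.
scores_mfcc = [-1000.0, -1100.0, -1250.0]
scores_imfcc = [-1040.0, -1020.0, -1200.0]

fused, weights = min_max_fusion([scores_mfcc, scores_imfcc])
identified = max(range(len(fused)), key=lambda s: fused[s])
print(weights[0], identified)
```

With two streams the effective weights of a speaker always add up to 2, so a stream can gain importance only at the expense of the other one.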

Similarly, the maximum weight that can be assigned to a stream is G; in that case, the weights of the other streams are zero. The following tables (see Tables 5.6, 5.7, 5.8, 5.9) describe the results of various fusion strategies that include the equal weighted, winner-runner-up, and min-max operator based methods.

Table 5.6: SI accuracies using various fusion strategies applied on speaker models’ scores (MFCC-IMFCC, YOHO database).

M     MFCC     IMFCC    Different Fusion Schemes
                        Equal Weight   Winner Runner-up   Min-Max
2     74.31    78.04    82.95          83.37              83.97
4     84.86    86.50    90.87          91.12              91.50
8     90.69    91.99    94.91          94.98              95.16
16    94.20    94.15    96.30          96.34              96.45
32    95.67    94.22    97.26          97.36              97.39
64    96.79    94.76    97.68          97.81              97.97

From these tables, the following points can be observed:
• All the score level fusion strategies outperform the baseline MFCC irrespective of databases and feature sets used.
• The equal weighted fusion strategy generally performs worse than the other combining schemes; the only exceptions are against the winner-runner-up scheme at M = 8 (Tables 5.7 and 5.8) and M = 16 (Table 5.9).
• Out of the two proposed techniques, the min-max based weight finding technique


Table 5.7: SI accuracies using various fusion strategies applied on speaker models’ scores (MFCC-IMFCC, POLYCOST database).

M     MFCC     IMFCC    Different Fusion Schemes
                        Equal Weight   Winner Runner-up   Min-Max
2     63.93    55.96    69.09          70.56              70.56
4     72.94    68.04    76.79          77.06              77.85
8     77.85    76.26    81.63          81.30              81.70
16    77.85    77.06    81.70          81.83              82.93

Table 5.8: SI accuracies using various fusion strategies applied on speaker models’ scores (GMFCC-GIMFCC, YOHO database).

M     GMFCC    GIMFCC   Different Fusion Schemes
                        Equal Weight   Winner Runner-up   Min-Max
2     79.81    78.30    84.95          84.98              85.13
4     90.30    87.23    92.55          92.75              92.79
8     94.66    92.55    95.83          95.76              96.00
16    96.50    94.64    96.94          96.97              96.99
32    97.19    95.49    97.52          97.55              97.65
64    97.53    96.05    97.92          97.99              98.23

performs better in terms of SI accuracies than the winner-runner-up based fusion strategy; the two tie only at M = 2 for the POLYCOST database.
• The min-max based weight finding algorithm takes into account the dynamic range of the speakers’ scores, while the other proposition is based on the two top ranked speakers.
• For both the proposed methods, the weights for the two different streams vary when a new utterance comes for testing, unlike the fixed weight based fusion scheme. More specifically, the min-max based technique finds weights for each speaker individually in addition to finding weights for the two different streams;


Table 5.9: SI accuracies using various fusion strategies applied on speaker models’ scores (GMFCC-GIMFCC, POLYCOST database).

M     GMFCC    GIMFCC   Different Fusion Schemes
                        Equal Weight   Winner Runner-up   Min-Max
2     66.05    56.90    69.23          70.56              70.56
4     76.53    69.10    77.85          78.25              78.59
8     81.10    77.59    81.57          81.71              81.96
16    80.24    77.45    82.10          81.96              82.98

for this reason, the weights calculated by this technique are considered to be more optimal than those calculated by the winner-runner-up variant, which yields a weight for each stream irrespective of speakers.
• All the fusion schemes perform marginally better than the PQ based merging strategy (see section 2.6).

5.4 Combining Scores of Best Models’ obtained through SVD-QRcp

In this section, we fuse the scores obtained from the best speaker models, which are obtained through SVD followed by QRcp (refer section 4.3). For both the databases, the scores from the highest model orders have been used for the various fusions. It is observed from Tables 5.10 and 5.11 that, for all the fusion techniques, the combinations of the best speaker models outperform the combined systems that use all the features to develop their speaker models. It is also to be noted that the fused systems developed from the best speaker models perform better than the systems that use the same best speaker models but adopt PQ based fusion (see Tables 4.7, 4.8, 4.9, 4.10).


Table 5.10: SI accuracies using various fusion strategies applied on the best speaker models’ scores (MFCC-IMFCC, YOHO and POLYCOST databases) with highest model orders (i.e. 64 and 16).

Databases   MFCC     IMFCC    Different Fusion Schemes
                              Equal Weight   Winner Runner-up   Min-Max
YOHO        97.32    95.76    97.97          98.10              98.30
POLYCOST    79.58    77.72    81.80          82.00              83.21

Table 5.11: SI accuracies using various fusion strategies applied on the best speaker models’ scores (GMFCC-GIMFCC, YOHO and POLYCOST databases) with highest model orders (i.e. 64 and 16).

Databases   GMFCC    GIMFCC   Different Fusion Schemes
                              Equal Weight   Winner Runner-up   Min-Max
YOHO        97.55    96.05    98.12          98.23              98.45
POLYCOST    81.57    77.65    82.87          83.00              83.40

5.5 Conclusions

The principal conclusions of this chapter are as follows.
• In this chapter, we mainly focus on the fusion strategies at different levels for SI application.
• A weighted feature level fusion strategy has been proposed and applied on both MFCC-IMFCC & GMFCC-GIMFCC paradigms, where the weights have been found using divergence. Weight based concatenated feature vectors modeled using GMM show improved SI accuracy over both the baseline system and the system which uses equal weight based concatenated feature vectors.
• Two different weight finding techniques are proposed through the min-max based normalization method; one of them is completely dependent on the scores of the winner and the runner-up, while the other relies on all the speakers in the database.


The ‘winner-runner-up’ method does not consider the effect of each speaker and allocates equal weight to every speaker in the database for a particular stream. The other variant, the min-max based weight finding method, gives weights for each speaker individually; this guarantees a more optimal allocation of weights for the various classifiers concerned.
• This chapter also shows the combination of the best speaker models through the two proposed combination schemes under the weighted score level fusion strategy. The performances shown by the combination of the best models are better than those of the other systems proposed in the preceding chapters. This suggests that non-over fitted models can bring out the complementary nature of the two feature sets (MFCC-IMFCC or GMFCC-GIMFCC) under combination in a better manner. Maximum values of 98.45% and 83.40% are obtained for the YOHO and POLYCOST databases, respectively, when non-over fitted models (so-called best models) use GF based feature sets, provided they are combined using the Min-Max rule.

References

[1] R. Ramachandran, K. Farrell, R. Ramachandran, and R. Mammone, “Speaker recognition - general classifier approaches and data fusion methods,” Pattern Recognition, vol. 35, no. 12, pp. 2801-2821, Dec. 2002. (Cited in sections 5.1 and 5.3.)
[2] T. Kinnunen, V. Hautamäki, and P. Fränti, “On the fusion of dissimilarity based classifiers for speaker identification,” in Proc. 8th European Conference on Speech Communication and Technology (Eurospeech 2003), 2003, pp. 2641-2644. (Cited in section 5.1.)
[3] S. Lucey, T. Chen, S. Sridharan, and V. Chandran, “Integration strategies for audio-visual speech processing: Applied to text dependent speaker recognition,” IEEE Trans. Multimedia, vol. 7, no. 3, pp. 495-506, Jun. 2005. (Cited in section 5.1.)
[4] N. A. Fox and R. B. Reilly, “Audio-visual speaker identification based on the use of dynamic audio and visual features,” in Proc. 4th International Conf. Audio- and Video-Based Biometric Person Authentication (AVBPA 2003), 2003, pp. 743-751. (Cited in section 5.1.)
[5] N. A. Fox, R. Gross, J. F. Cohn, and R. B. Reilly, “Robust Biometric Person Identification Using Automatic Classifier Fusion of Speech, Mouth, and Face Experts,” IEEE Trans. Multimedia, vol. 9, no. 4, pp. 701-714, Jun. 2007. (Cited in section 5.1.)
[6] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior, “Recent advances in the automatic recognition of audiovisual speech,” Proc. IEEE, vol. 91, no. 9, pp. 1306-1324, Sept. 2003. (Cited in section 5.1.)
[7] S. Dupont and J. Luettin, “Audio-visual speech modeling for continuous speech recognition,” IEEE Trans. Multimedia, vol. 2, no. 3, pp. 141-151, Sept. 2000. (Cited in section 5.1.)
[8] S. Tamura, K. Iwano, and S. Furui, “A stream-weight optimization method for audio-visual speech recognition using multi-stream HMMs,” in Proc. IEEE International Conf. Acoustics, Speech, and Signal Processing (ICASSP 2004), 2004, vol. 1, pp. 857-860. (Cited in section 5.1.)
[9] C. Sanderson and K. K. Paliwal, “Identity verification using speech and face information,” Digital Signal Processing, vol. 14, no. 5, pp. 449-480, Sept. 2004. (Cited in section 5.1.)
[10] L. Xu, A. Krzyzak, and C. Y. Suen, “Methods of combining multiple classifiers and their applications to hand-written character recognition,” IEEE Trans. Systems Man Cybernet., vol. 22, no. 3, pp. 418-435, May-Jun. 1992. (Cited in section 5.1.)
[11] D. J. Mashao and M. Skosan, “Combining Classifier Decisions for Robust Speaker Identification,” Pattern Recog., vol. 39, no. 1, pp. 147-155, Jan. 2006. (Cited in sections 5.1, 5.2, 5.3 and 5.3.)
[12] T. Eriksson, S. Kim, H.-G. Kang, and C. Lee, “An Information-Theoretic Perspective on Feature Selection in Speaker Recognition,” IEEE Signal Process. Lett., vol. 12, no. 7, pp. 500-503, Jul. 2005. (Cited in section 5.1.)
[13] W. M. Campbell, J. P. Campbell, D. A. Reynolds, E. Singer, and P. A. Torres-Carrasquillo, “Support vector machines for speaker and language recognition,” Computer, Speech and Language, vol. 20, no. 2-3, pp. 210-229, Apr.-Jul. 2006. (Cited in section 5.1.)
[14] J. Kittler, M. Hatef, R. Duin, and J. Matas, “On combining classifiers,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226-239, Mar. 1998. (Cited in sections 5.1 and 5.3.)
[15] H. Altınçay and M. Demirekler, “Undesirable effects of output normalization in multiple classifier systems,” Pattern Recog. Lett., vol. 24, no. 9-10, pp. 1123-1650, Jun. 2003. (Cited in section 5.1.)
[16] A. Jain, K. Nandakumar, and A. Ross, “Score normalization in multimodal biometric systems,” Pattern Recog., vol. 38, no. 12, pp. 2270-2285, Dec. 2005. (Cited in section 5.1.)
[17] R. Zilca and Y. Bistritz, “Feature Concatenation for Speaker Identification,” in Proc. Conference records of the 10th European Signal Processing Conference (EUSIPCO-2000), 2000. (Cited in section 5.2.)
[18] C. Sanderson and K. K. Paliwal, “Information fusion for robust speaker verification,” in Proc. European Conf. Speech Communication and Technology (EUROSPEECH-2001), 2001, pp. 755-758. (Cited in section 5.2.)
[19] L. Ferrer, E. Shriberg, S. Kajarekar, and K. Sönmez, “Parameterization of Prosodic Feature Distributions for SVM Modeling in Speaker Recognition,” in Proc. IEEE International Conf. Acoustics, Speech, and Signal Processing (ICASSP 2007), 2007, pp. IV-233-IV-236. (Cited in section 5.2.)
[20] T. Kinnunen, V. Hautamäki, and P. Fränti, “Fusion of spectral feature sets for accurate speaker identification,” in Proc. of the International Conference on Speech and Computer (SPECOM 2004), 2004, pp. 361-365. (Cited in section 5.2.)
[21] X. Zhou and B. Bhanu, “Feature fusion of side face and gait for video-based human identification,” Pattern Recognition, vol. 41, no. 3, pp. 778-795, Mar. 2008. (Cited in section 5.2.)
[22] H. Altınçay and M. Demirekler, “Speaker identification by combining multiple classifiers using Dempster-Shafer theory of evidence,” Speech Commun., vol. 41, no. 4, pp. 531-547, Nov. 2003. (Cited in section 5.3.)
[23] K. Chen and H. Chi, “A method of combining multiple probabilistic classifiers through soft competition on different feature sets,” Neurocomputing, vol. 20, no. 1-3, pp. 227-252, Aug. 1998. (Cited in section 5.3.)
[24] T. Ho, J. Hull, and S. N. Srihari, “Decision combination in multiple classifier systems,” IEEE Trans. Pattern Anal. Machine Intell., vol. 16, no. 1, pp. 66-75, Jan. 1994. (Cited in section 5.3.)
[25] J. Kittler and F. M. Alkoot, “Sum versus Vote Fusion in Multiple Classifier Systems,” IEEE Trans. Pattern Anal. Machine Intell., vol. 25, no. 1, pp. 110-115, Jan. 2003. (Cited in section 5.3.)

CHAPTER 6

Conclusions

Preface

In this chapter, we summarize the contributions detailed in Chapters 2 to 5. Some important conclusions are drawn along with the summary, and important issues are discussed. Finally, the scope for future developments and possible extensions of the present work is discussed.

6.1 Summary of the Work

This dissertation embodies the results of our investigations on some feature extraction techniques, computational speed-up methods, selection of acoustic features, and fusion strategies for a text-independent SI application. The growing demand for SI has been the major motivation towards employing it efficiently as an add-on module with various speech related applications. In particular, our investigations have been directed towards improving the accuracy of an SI system, while parallel efforts are made to reduce the computation involved.

In Chapter 2, a complementary feature set, namely IMFCC, has been proposed. This feature set captures high frequency speaker specific cues using a reversed filter bank structure, which is diametrically opposite to the structure of the conventional MFCC based filter bank used for an SI task. The complementary nature of IMFCC and MFCC has been demonstrated with raw log-likelihood scores from speaker models. A comparison shows that the proposed feature set exhibits nearly equal performance to the baseline feature set when they are used for a closed set text-independent SI task dealing with more than 130 speakers. Two different public databases, namely YOHO (microphone, monolingual speech, 138 speakers) and POLYCOST (telephone, multilingual speech, 131 speakers), have been used throughout the thesis for conducting the experiments. Next, we have proposed a scheme for fusion that utilizes PQ. Using this fusion, models developed from the two complementary features (IMFCC & baseline MFCC) are merged at score level. The proposed fusion strategy helps the SI systems to achieve higher performance while maintaining the same computation as that of the baseline system. Another study has been carried out with the proposed fusion scheme to investigate the maximum speed-up gained by the system with increasing decimation rate without compromising the identification accuracy. Maximum speed-up factors of 3:1 and 8:1 have been achieved for the YOHO and POLYCOST databases, respectively. Note that the actual ‘cpu-time’ has been calculated to report the speed-up factor by running a simulation five times and averaging the total time spent.

Subsequently, a study has been carried out in Chapter 3 to observe the effect of using various window/filter shapes on identification accuracy. Three different filter shapes, namely Rectangular, Triangular and Gaussian, have been used for this study. Two different kinds of Gaussian filter based filter banks have been proposed here. For the first case, a filter bank is constructed with equal importance (i.e. scaling


factor for deriving the variance) given to all the filters. By importance, we mean how much an isolated Gaussian filter couples with the adjacent frequency components that lie in nearby subbands. For the other realization, each filter has been given its own importance (i.e. a variable scaling factor for calculation of the variance) depending on the position of the filter on the frequency axis. Note that each of these filters is placed on the mel-scale. From the experiments, we observe that cepstral coefficients obtained from the Rectangular filter bank show the worst performance, while those derived from the Gaussian filter bank (variable importance based filters) exhibit the best performance in terms of identification accuracy. In this study, a detailed discussion on the correlation between adjacent and non-adjacent subbands has been presented. This correlation signifies the evidence from the neighboring subbands, and it helps to improve the speaker identification task to a great extent. The idea of using Gaussian filters has also been tested on the inverted mel-scale, and we observe the same trend of improvement in accuracy. To check the performance of the individual feature sets under this study, a rudimentary measure, divergence, which is based on LDA, has been employed. The analysis validates our findings from the actual results. Finally, we combine the best performing Gaussian filter based feature sets obtained from the mel-scale and the inverted mel-scale by the PQ based fusion rule. The assumption is that these two new feature sets would also be complementary, like the MFCC-IMFCC pair. The fused scheme that uses Gaussian filter based feature sets has shown better performance than the one using MFCC-IMFCC.

In sequel, a simple and straightforward approach has been proposed in Chapter 4 to select the potential features from an acoustic feature set. SVD followed by QRcp has been adopted as the feature selection criterion. The idea is to select those features that explain different dimensions showing minimal similarities (or maximum acoustic variability) among them in an orthogonal sense. The procedure involves less computational complexity than any exhaustive search based technique. Here, speaker models have not been trained from the selected features; rather, the effective portions of the models are extracted from the full model, assuming that the features involved are independent. This also opens the door to parallel processing of feature selection and model building on a time sharing basis. Features are selected not only from MFCC, but also from the other proposed feature sets on the same two databases. All the experiments have been compared with the popular F-Ratio based feature selection technique, chosen for its discriminative nature and easy implementation. It has been found from the results that the proposed feature selection technique uses fewer features than the F-Ratio based scheme in order to yield the best performance for a particular feature set. It has also been noticed that lower order cepstral coefficients, which describe vocal tract behavior, are more useful than higher order cepstral coefficients, which portray pitch and vocal cord based information. In this study, the actual ‘cpu-time’ has been reported, and the effect of dimensionality reduction on identification time has been discussed and compared with the reduction of frames in an incoming unknown test utterance. It is found from the calculated time that pruning of features does not contribute significantly to reducing the time in the testing phase. It can also be concluded from this study that the identification time depends mainly on the parameters which are higher in dimension, i.e. number of frames (for T seconds of speech, we have T × 100 frames assuming 20 ms frame size and 50% overlap, which gives us 100 frames per second), number of speakers (greater than 100), and, to some extent, model complexity (e.g. the highest model order we have chosen is 64). However, useful information was gained about the features which contribute to the SI application.

The best speaker models (from feature sets developed from the mel and inverted mel-scale) obtained after feature selection are then combined using the PQ based fusion strategy. As the higher order cepstral coefficients are removed from both the feature sets, it can be said that the speaker models representing the retained features from the original sets describe the envelope of the spectrum rather than portraying pitch harmonics in the lower and higher parts of a spectrum.

Finally, in a separate study in Chapter 5, various fusion strategies have been applied at different levels in the SI system. Note that, for these studies, we have used all the frames of an unknown test utterance. First, a weighted feature level fusion has been proposed, where a feature set is weighted with its relative importance. The weighted concatenated feature set not only performs significantly better than the baseline system but also outperforms the system that uses the composite feature set with the equal weighted scheme. The weight suggests the relative strength of a feature set to demarcate one speaker from another, as these weights are directly determined from the divergence measure. The results have been presented with various model orders for both the databases. Towards the end of Chapter 5, speaker models are fused at score level via the weighted ‘SUM’ rule, as the ‘SUM’ rule outperforms other combination strategies due to its lower sensitivity to estimation errors. Weights have been calculated using normalization of the models’ scores across all the speakers in a corpus. The idea is to find the weights for the different streams by which the different scores are transformed into a compatible range for proper combination. Two different normalization techniques have been

proposed in this work to find the weights. First, a stream-specific normalization technique has been proposed, where the weights are determined solely by the classification ability of a particular stream and do not depend on the individual speakers in the database. In the second normalization procedure, every speaker’s score is involved in finding his/her own weight for a particular stream. In the latter case, the weights obtained are expected to be closer to optimal than in the former, as the evidence from each speaker has been taken into account. The results show that the latter outperforms both the former and the equal weighting scheme for score level fusion. Using the proposed normalization techniques for finding weights, we combine the best speaker models obtained through SVD-QRcp based selected features. The highest identification accuracies obtained are 98.45% for the YOHO database and 83.40% for POLYCOST, using the best normalization technique for finding the weights with Gaussian filter based cepstral feature sets as front-end modules. Compared with the single-stream MFCC based system, these highest accuracies correspond to significant relative reductions in SI error rate (47.04% and 16.60%) on the YOHO and POLYCOST databases, respectively. Finally, all the results embodied throughout the thesis demonstrate the superiority of our different propositions irrespective of data type, amount of data, and model order. We believe that the consistency of the results would also hold if other speaker models were used.
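The score level fusion summarized above can be illustrated with a minimal sketch. It is not the normalization proposed in this thesis: a plain min-max mapping is assumed here as a stand-in for bringing the streams into a compatible range, and the scores, weights and function names are hypothetical values chosen only for illustration.

```python
from typing import List, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Map one stream's per-speaker scores onto [0, 1] (assumed normalization)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # degenerate stream: every speaker scored the same
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def weighted_sum_fusion(streams: Sequence[Sequence[float]],
                        weights: Sequence[float]) -> List[float]:
    """Weighted SUM rule: fused[k] = sum_j w_j * normalized_score_j[k]."""
    if len(streams) != len(weights):
        raise ValueError("one weight per stream is required")
    normalized = [min_max_normalize(s) for s in streams]
    n_speakers = len(normalized[0])
    return [sum(w * stream[k] for w, stream in zip(weights, normalized))
            for k in range(n_speakers)]


def identify(streams: Sequence[Sequence[float]],
             weights: Sequence[float]) -> int:
    """Return the index of the speaker with the highest fused score."""
    fused = weighted_sum_fusion(streams, weights)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical log-likelihood scores of 4 enrolled speakers from two streams.
mel_scores = [-52.1, -48.3, -49.0, -55.7]
inv_mel_scores = [-61.0, -60.2, -57.9, -63.4]
print(identify([mel_scores, inv_mel_scores], weights=[0.6, 0.4]))
```

In this toy case the first stream alone would pick speaker 1, while the fused decision moves to speaker 2 because the second stream supports that speaker strongly. This is the kind of complementary evidence the weighted ‘SUM’ rule is meant to exploit.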

6.2 Future Research Directions

In this section, the future scope and possible extensions of the investigations carried out in this thesis are discussed.

• Although the PQ method helps SI systems achieve significantly higher speedups with only a minor degradation in accuracy, it might miss some of the important or “intelligent” vectors that are helpful for demarcating one speaker from another. If a suitable criterion is applied to check the incoming utterance beforehand, the intelligent vectors could be identified and then sent for testing.

• In the SR context, finding an optimal model complexity is still an open issue for future research, as there is no straightforward and direct approach for evaluating the optimum model order. In order to minimize the resources needed by a given SI system, one could think of using non-uniform model orders for the client speakers in a database. The idea is to use lower order models for those speakers who can be easily demarcated from the rest. For speakers whose voices are difficult to distinguish (e.g. two high pitched females), higher order models are to be used for better approximation of the data. Note that the scores obtained from the models do not depend on the order of the models, and therefore scores from several speaker models are comparable without any post processing.

• In general, for a GMM based speaker model, only a few modes provide higher scores/probabilities than the others. During blind testing, a test vector gets a score from each mode in a model, and all the contributed scores are summed to obtain the final one. If one knew, through a proper criterion using the evidence from the training set, which modes are really contributing and which are not, then the incoming test vectors need not be sent for evaluation to those modes that do not contribute at all or, in other words, are far away from the unknown vectors. It has already been shown in the literature for a GMM & Universal Background Model (UBM) based SV system that only the top five modes of the UBM and the corresponding components of the speaker models are useful out of a very large (≥256) model order.

• F-Ratio, a feature selection technique, performs well when the data for a speaker is not too sparse or multi-modal in nature. Future investigation could modify the F-Ratio so that it takes care of the various modes present in the data. A sensible solution is to find the cluster centers using the K-means algorithm and find the members that are nearest to the respective cluster heads. The cluster heads will then act as multiple means (i.e., multi-modes), unlike the single mean representation for a speaker. The method of finding intra and inter class variance for F-Ratio could possibly be extended to these labeled data, where the labels signify the disjoint Voronoi regions.

• For a database like POLYCOST, which is multilingual in nature, one feasible way to increase the identification rate is to cluster the speakers language-wise at the time of training. In the testing phase, a language identifier first identifies the language and routes the test utterance to the specific set of speakers who have already been categorized under that language. Next, the normal speaker identification is carried out. In this way, the performance of an SI system could be enhanced, as a smaller number of speakers are grouped under the same language. Note that the language recognizer should be robust, as any mistake by it will affect the overall system’s performance, since the language and speaker identifier blocks are arranged in cascade.
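The mode-pruning idea in the GMM bullet above can be sketched as follows. This is a minimal illustration of top-C scoring as used in GMM-UBM verification systems, not an implementation from this thesis. It assumes diagonal covariances and assumes that component i of the speaker model corresponds to component i of the UBM (as is the case when speaker models are adapted from the UBM). The `Gmm` layout, the function names and the toy models are all hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

# A diagonal-covariance GMM as (weights, means, variances); illustrative layout.
Gmm = Tuple[List[float], List[List[float]], List[List[float]]]


def log_component(x: Sequence[float], w: float,
                  mean: Sequence[float], var: Sequence[float]) -> float:
    """log( w * N(x; mean, diag(var)) ) for one mixture component."""
    ll = math.log(w)
    for xd, md, vd in zip(x, mean, var):
        ll += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
    return ll


def log_sum_exp(values: Sequence[float]) -> float:
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def frame_loglik(x: Sequence[float], gmm: Gmm,
                 modes: Optional[Sequence[int]] = None) -> float:
    """Frame log-likelihood over all modes, or only over the listed modes."""
    w, mu, var = gmm
    idx = range(len(w)) if modes is None else modes
    return log_sum_exp([log_component(x, w[i], mu[i], var[i]) for i in idx])


def top_modes(x: Sequence[float], ubm: Gmm, c: int) -> List[int]:
    """Indices of the c UBM modes that score highest for frame x."""
    w, mu, var = ubm
    scores = [log_component(x, w[i], mu[i], var[i]) for i in range(len(w))]
    return sorted(range(len(w)), key=scores.__getitem__, reverse=True)[:c]


def fast_score(x: Sequence[float], ubm: Gmm, speaker: Gmm, c: int = 5) -> float:
    """Approximate log-likelihood ratio using only the top-c modes."""
    modes = top_modes(x, ubm, c)
    return frame_loglik(x, speaker, modes) - frame_loglik(x, ubm, modes)


# Toy 1-D models with 4 modes; the speaker model differs only in mode 2.
ubm = ([0.25] * 4, [[-6.0], [-2.0], [2.0], [6.0]], [[1.0]] * 4)
spk = ([0.25] * 4, [[-6.0], [-2.0], [2.5], [6.0]], [[1.0]] * 4)
frame = [2.4]
full = frame_loglik(frame, spk) - frame_loglik(frame, ubm)
print(full, fast_score(frame, ubm, spk, c=1))
```

For a frame that lies close to one mode, the pruned score stays very close to the full score even with a single retained mode, which is why the remaining modes can be skipped in the speaker model. The UBM itself still has to be evaluated in full once per frame to find the top modes.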

APPENDIX A

E&M and Split VQ Algorithm

Preface

Appendix A presents the Expectation & Maximization (E&M) and Split VQ algorithms, which have been used for constructing speaker models from multidimensional features. First, Split VQ is applied to the data to obtain an approximation of the mean vectors. Then, using these mean vectors, Expectation & Maximization finds the covariances and prior probabilities (mixing proportions) of the multidimensional Gaussians.


E&M Algorithm:

The basic idea of the E&M algorithm is, beginning with an initial model $\lambda_s$, to estimate a new model $\bar{\lambda}_s$ such that $p(X|\bar{\lambda}_s) > p(X|\lambda_s)$. The new model then becomes the initial model for the next iteration, and the process is repeated until some convergence threshold is reached. This is the same basic technique used for estimating HMM parameters via the Baum-Welch reestimation algorithm [1]. On each E&M iteration, the following reestimation formulas are used, which guarantee a monotonic increase in the model’s likelihood value. For a speaker $s$, for whom $N_s$ training vectors are available, the reestimation formulas are as follows:

Mixture Weights:

$$\bar{p}_i^{s} = \frac{1}{N_s} \sum_{t=1}^{N_s} p^{s}(i \mid x_t, \lambda_s) \tag{A.1}$$

Means:

$$\bar{\mu}_i^{s} = \frac{\sum_{t=1}^{N_s} p^{s}(i \mid x_t, \lambda_s) \cdot x_t}{\sum_{t=1}^{N_s} p^{s}(i \mid x_t, \lambda_s)} \tag{A.2}$$

Variances:

$$\bar{\sigma}_i^{s\,2} = \frac{\sum_{t=1}^{N_s} p^{s}(i \mid x_t, \lambda_s) \cdot x_t^{2}}{\sum_{t=1}^{N_s} p^{s}(i \mid x_t, \lambda_s)} - \bar{\mu}_i^{s\,2} \tag{A.3}$$

where $\bar{\lambda}_s = \{\bar{p}_i^{s}, \bar{\mu}_i^{s}, \bar{\Sigma}_i^{s}\}_{i=1}^{M}$, with

$$\bar{\Sigma}_i^{s} = \begin{bmatrix} \bar{\sigma}_{i,11}^{s\,2} & 0 & \cdots & 0 \\ 0 & \bar{\sigma}_{i,22}^{s\,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \bar{\sigma}_{i,DD}^{s\,2} \end{bmatrix},$$

is the updated model after one iteration. Note that the other symbols have their usual meaning.

Split VQ Algorithm:

The LBG algorithm [2] is a finite sequence of steps in which, at every step, a new quantizer with a total distortion less than or equal to that of the previous one is produced. Initialization by splitting requires that the number of codewords be a power of 2. The procedure starts from a single codeword and recursively splits each codeword into two distinct codewords (Linde et al., 1980) [2]. More precisely, the generic m-th step consists of splitting all the vectors obtained at the end of the previous step. The splitting criterion is shown in Fig. A.1: a codeword $x$ is split into two close vectors $x + e$ and $x - e$, where $e$ is a fixed perturbation vector.

Figure A.1: Splitting of a codeword ($x$ is split into $x + e$ and $x - e$)

After the splitting, an optimization step is executed according to the method described next (see Algorithm A.1).

Algorithm A.1: Split-VQ Algorithm

/* Initialization of parameters */
$M$: number of codewords;
$\epsilon$: precision of the optimization process;
$Xtr_0$: initial codebook;
$Xtr$: input patterns;
Further, the following assignments are made: $iter = 0$; $Dist_{-1} = +\infty$;

/* Partition calculation */
Given the codebook $Xtr_{iter}$, the partition $P(Xtr_{iter})$ is calculated according to the Nearest Neighbor Condition [2];

/* Termination condition check */
The quantizer distortion $Dist_{iter} = \mathrm{Distortion}\{Xtr_{iter}, P(Xtr_{iter})\}$ is calculated according to the Mean Quantization Error [2]. If $|Dist_{iter-1} - Dist_{iter}| / Dist_{iter} \leq \epsilon$, then the optimization ends and $Xtr_{iter}$ is the final returned codebook;

/* New codebook calculation */
Given the partition $P(Xtr_{iter})$, the new codebook is calculated according to the Centroid Condition [2]. In symbols: $Xtr_{iter+1} = \mathrm{Mean}(P(Xtr_{iter}))$. Afterwards, the counter $iter$ is increased by one and the procedure resumes from the partition calculation.
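The two stages of this appendix can be put together in a short sketch: Split VQ supplies the mean vectors, and E&M then reestimates weights, means and variances following (A.1)-(A.3). This is an illustrative reading, not the code used in the thesis. The starting variances (global data variance), the uniform starting weights, the variance floor, the perturbation size and all function names are assumptions made for the example. Squared Euclidean distance is assumed for the distortion.

```python
import math
import random
from typing import List

Vec = List[float]


def sq_dist(a: Vec, b: Vec) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b))


def centroid(points: List[Vec]) -> Vec:
    return [sum(col) / len(points) for col in zip(*points)]


def split_vq(data: List[Vec], n_codewords: int,
             eps: float = 1e-4, perturb: float = 1e-2) -> List[Vec]:
    """LBG with initialization by splitting (n_codewords: a power of 2)."""
    codebook = [centroid(data)]
    while len(codebook) < n_codewords:
        # Split every codeword x into x + e and x - e.
        codebook = [[c + sign * perturb for c in cw]
                    for cw in codebook for sign in (1.0, -1.0)]
        prev = math.inf
        while True:
            # Partition by the Nearest Neighbor Condition.
            labels = [min(range(len(codebook)),
                          key=lambda k: sq_dist(x, codebook[k])) for x in data]
            dist = sum(sq_dist(x, codebook[k])
                       for x, k in zip(data, labels)) / len(data)
            if dist == 0.0 or abs(prev - dist) / dist <= eps:
                break  # termination condition of Algorithm A.1
            prev = dist
            # Centroid Condition; an empty cell keeps its old codeword.
            for k in range(len(codebook)):
                cell = [x for x, lab in zip(data, labels) if lab == k]
                if cell:
                    codebook[k] = centroid(cell)
    return codebook


def em_step(data, weights, means, variances, floor=1e-3):
    """One reestimation pass over (A.1)-(A.3), diagonal covariances."""
    m, d, n = len(weights), len(data[0]), len(data)
    r_sum = [0.0] * m
    s1 = [[0.0] * d for _ in range(m)]
    s2 = [[0.0] * d for _ in range(m)]
    for x in data:
        logp = [math.log(weights[i]) - 0.5 * sum(
                    math.log(2.0 * math.pi * variances[i][j])
                    + (x[j] - means[i][j]) ** 2 / variances[i][j]
                    for j in range(d)) for i in range(m)]
        top = max(logp)
        post = [math.exp(v - top) for v in logp]
        z = sum(post)
        for i in range(m):
            r = post[i] / z  # a posteriori probability p(i | x_t, lambda)
            r_sum[i] += r
            for j in range(d):
                s1[i][j] += r * x[j]
                s2[i][j] += r * x[j] ** 2
    new_w = [rs / n for rs in r_sum]                                      # (A.1)
    new_mu = [[s1[i][j] / r_sum[i] for j in range(d)] for i in range(m)]  # (A.2)
    # (A.3), with an assumed variance floor to avoid collapsing components
    new_var = [[max(s2[i][j] / r_sum[i] - new_mu[i][j] ** 2, floor)
                for j in range(d)] for i in range(m)]
    return new_w, new_mu, new_var


def train_gmm(data: List[Vec], n_modes: int, n_iter: int = 20):
    means = split_vq(data, n_modes)            # stage 1: mean vectors
    mu0 = centroid(data)
    var0 = [max(sum((x[j] - mu0[j]) ** 2 for x in data) / len(data), 1e-3)
            for j in range(len(mu0))]          # assumed start: global variance
    weights = [1.0 / n_modes] * n_modes        # assumed start: uniform priors
    variances = [list(var0) for _ in range(n_modes)]
    for _ in range(n_iter):                    # stage 2: E&M reestimation
        weights, means, variances = em_step(data, weights, means, variances)
    return weights, means, variances


random.seed(0)
toy = ([[random.gauss(0.0, 0.5), random.gauss(0.0, 0.5)] for _ in range(200)]
       + [[random.gauss(5.0, 0.5), random.gauss(5.0, 0.5)] for _ in range(200)])
w, mu, var = train_gmm(toy, n_modes=2)
print(sorted(mu))
```

On the two well separated toy clusters, the split codewords settle on the cluster centers after a few partition and centroid updates, and the E&M passes then shrink the initial global variance towards the per-cluster variance. The sketch does not guard against a component that receives no responsibility at all; a practical implementation would need to handle that case.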


References

[1] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, “A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains,” The Annals of Mathematical Statistics, vol. 41, no. 1, pp. 164-171, 1970. (Cited in section A.)

[2] Y. Linde, A. Buzo, and R. M. Gray, “An algorithm for vector quantizer design,” IEEE Trans. Commun., vol. 28, no. 1, pp. 84-95, Jan. 1980. (Cited in sections A and 5.)


Publications

Refereed Journals:

• S. Chakroborty, A. Roy, and G. Saha, “Improved Closed Set Text-Independent Speaker Identification by combining MFCC with Evidence from Flipped Filter Banks,” International Journal of Signal Processing, vol. 4, no. 2, pp. 114-121, Apr. 2007.

• S. Chakroborty and G. Saha, “Improved Text-Independent Speaker Identification using Fused MFCC & IMFCC Feature Sets based on Gaussian Filter,” International Journal of Signal Processing, vol. 5, no. 1, pp. 11-19, Jan. 2008.

• S. Chakroborty and G. Saha, “Feature Selection using Singular Value Decomposition and QR Factorization with Column Pivoting for Text-independent Speaker Identification,” Communicated to IEEE Trans. on Audio, Speech, and Language Process.

• S. Chakroborty and G. Saha, “Pre-quantization based fusion of MFCC and Speaker specific High Frequency Cues for Improved Text-Independent Speaker Identification,” Communicated to International Journal of Intelligent Systems Technologies and Applications.

• S. Chakroborty, S. Reddy, S. Senapati, and G. Saha, “Multi-Stream Speaker Modeling Technique for Vector Quantization Based Text-Independent Speaker Identification,” Communicated after first revision to Journal of Indian Academy of Science, SADHANA.

• G. Saha, S. Senapati, S. Chakroborty, and U. S. Yadhunandan, “Text Dependent Speaker Identification using Modified Mel-Frequency Cepstral Coefficients and Reduced Artificial Neural Network Classifier,” Communicated to IETE Journal of Research.

• S. Senapati, S. Chakroborty, and G. Saha, “Speech Enhancement by Joint Statistical Characterization in the Log Gabor Wavelet Domain,” Communicated to Speech Communication.


Conference Proceedings:

• G. Saha, P. Kumar, and S. Chakroborty, “A comparative study of feature extraction algorithms on ANN based speaker model for speaker recognition application,” in Proc. 11th International Conf. on Neural Information Processing (ICONIP 2004), vol. LNCS 3316, 2004, pp. 1192-1197.

• G. Saha, S. Chakroborty, and S. Senapati, “An F-Ratio Based Optimization Technique for Automatic Speaker Recognition Application,” in Proc. of INDICON 2004 (First IEEE Annual Indian Conference), 2004, pp. 70-73.

• G. Saha, S. Chakroborty, and S. Senapati, “A New Silence Removal and EndPoint Detection Algorithm for Speech And Speaker Recognition Application,” in Proc. National Conference on Communication (NCC 2005), 2005.

• G. Saha, S. Senapati, and S. Chakroborty, “An F-Ratio based Optimization on noisy data for Speaker Recognition Application,” in Proc. of IEEE India Annual International Conf. INDICON 2005, 2005, pp. 352-355.

• G. Saha, S. Chakroborty, and S. Senapati, “On Combining Classifier for Password Secured Speaker Recognition,” in Proc. of Thirteenth International Conference on Advanced Computing & Communications (ADCOM 2005), 2005, pp. 48-55.

• S. Senapati, S. Chakroborty, and G. Saha, “Robust Automatic Speaker Identification based on Singular Value Decomposition technique in adverse conditions,” in Proc. of Asian Conference on Intelligent Systems and Networks (AISN-2006), 2006, pp. 395-398.

• S. Senapati, S. Chakroborty, and G. Saha, “Log Gabor Wavelet and Maximum a Posteriori Estimator in Speaker Identification,” in Proc. of IEEE India Annual International Conf. INDICON 2006, 2006, pp. 1-6.

• S. Chakroborty, A. Roy, and G. Saha, “Fusion of a complementary Feature set with MFCC for Improved Closed Set Text-Independent Speaker Identification,” in Proc. of IEEE International Conference on Industrial Technology (ICIT-2006), 2006, pp. 387-390.

• S. Chakroborty, A. Roy, S. Majumdar, and G. Saha, “Capturing Complementary Information via Reversed Filter Bank and parallel implementation with MFCC for improved Text-Independent Speaker Identification,” in Proc. of International Conference on Computing: Theory and Application (ICCTA-2007), 2007, pp. 463-467.

• S. Chakroborty and G. Saha, “Improved Closed Set Text-Independent Speaker Identification by Gaussian Filter based Mel-Frequency Cepstral Coefficients,” in Proc. IEEE Annual International Conf. (INDICON 2007), 2007.

Workshop:

• G. Saha, S. Senapati, and S. Chakroborty, “Speaker identification using Modified Mel-Frequency Cepstral Coefficients and Reduced Artificial Neural Network Classifier,” EU-India workshop, Nov. 2005, IIT-Kharagpur.

Author’s Biography

Sandipan Chakroborty was born in Kolkata on 10th September, 1977. After finishing his schooling in 1997, he obtained a Bachelor of Engineering (B.E.) in Electronics from Nagpur University, India in 2001 and subsequently a Master of Engineering (M.E.) with specialization in Digital System and Instrumentation from Bengal Engineering and Science University, Shibpur, Howrah, India in 2003. During his Master’s programme, he obtained the highest honors among the candidates of his Department. He served as a lecturer in an engineering college for six months just after his Master’s degree. Since July 2003, he has been an Institute Research Fellow in the Department of Electronics and Electrical Communication Engineering¹, Indian Institute of Technology, Kharagpur², India. His current areas of research include Speaker Recognition, Speech processing, Audio signature analysis, Signal processing, Neural networks, Data fusion strategies, Soft computing, and other complex Pattern recognition problems. He is associated with live projects funded by Government agencies such as the Department of Science and Technology (DST), India and the Indian Space Research Organization (ISRO). During his research, he has presented papers in many National and International Conferences and participated in important workshops. He is also a student Member of IEEE. He can be contacted at: sandipan@ece.iitkgp.ernet.in & mail2sandi@gmail.com.

¹ http://www.iitkgp.ac.in/departments/home.php?deptcode=EC
² http://www.iitkgp.ac.in/

A Thesis on Speaker Identification

Published on Dec 6, 2011
