
On the combination of auditory and modulation frequency channels for ASR applications

Fabio Valente and Hynek Hermansky IDIAP Research Institute


Motivation 

The information in the speech signal is contained in both auditory frequency channels and modulation frequency channels. Separate processing of auditory frequencies is effective in noisy conditions (a.k.a. multi-band processing). Separate processing of modulation frequencies is also effective in noisy conditions (Valente and Hermansky, ICASSP 2008). What is the most effective way of combining classifiers trained on different frequency ranges (both auditory and modulation) for ASR?


Auditory and Modulation Spectrum Features

[Figure: feature extraction pipeline. Speech → |STFT| → Bark filter bank → auditory spectrum (over time); critical-band trajectory → |FFT| → modulation spectrum.]

Auditory spectrum represents the instantaneous frequency content of the signal. Modulation spectrum represents the dynamics of the speech signal. There are several ways to extract speech dynamics.


MRASTA filtering 

• Set of multiple-resolution filters applied to a critical-band log-energy trajectory.

• Filters are equally spaced on a logarithmic scale.

• Covers the whole range of modulation frequencies.

• Consistent with the perceptual model of [T. Dau et al., 1997]. The filter-bank is split into two sets of filters that emphasize slow and fast modulation frequencies, respectively (a code sketch follows this list).
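As a rough illustration, such a filter bank can be built from first and second derivatives of Gaussian windows with log-spaced widths. The sketch below follows that recipe; the sigma values, filter support, and normalization are assumptions for illustration, not the exact poster configuration.

```python
import numpy as np

def mrasta_filters(sigmas=(0.8, 1.6, 3.2, 6.4, 13.0), half_len=50):
    """Multi-resolution RASTA-style filter bank (illustrative parameters).

    Each filter is the first or second derivative of a Gaussian window,
    meant to be convolved with a critical-band log-energy trajectory
    sampled every 10 ms.  Log-spaced widths (sigma, in frames) make the
    bank span slow to fast modulation frequencies.
    """
    t = np.arange(-half_len, half_len + 1, dtype=float)
    first, second = [], []
    for s in sigmas:
        g = np.exp(-0.5 * (t / s) ** 2)
        d1 = -t / s**2 * g                    # 1st Gaussian derivative
        d2 = (t**2 / s**4 - 1.0 / s**2) * g   # 2nd Gaussian derivative
        first.append(d1 / np.abs(d1).sum())   # crude gain normalization
        second.append(d2 / np.abs(d2).sum())
    return np.array(first), np.array(second)

def apply_bank(trajectory, bank):
    """Filter one critical-band log-energy trajectory with every filter."""
    return np.stack([np.convolve(trajectory, f, mode="same") for f in bank])
```

Splitting the bank by width (narrow windows vs. wide windows) yields the fast- and slow-modulation filter sets shown next.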


MRASTA filtering II

[Figure: impulse responses of the slow-modulation and fast-modulation filter sets over time (±50 ms axes), and their frequency responses (dB) over 1-10 Hz modulation frequency.]


How to extract different frequency ranges?

Splitting both the modulation range (slow/fast) and the auditory range (high/low) generates four different frequency channels. What is the best processing for combining information from the auditory and the modulation frequency channels?


Parallel Processing

[Figure: each filter-bank produces a time-frequency representation processed by its own MLP; a further MLP combines the two outputs into posteriors.]

• A separate Multi-Layer Perceptron (MLP) is trained on each channel output.

• Outputs are combined using another MLP.

• Parallel processing does not imply an order in processing the different channels (see the sketch below).
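A minimal sketch of this parallel scheme, using scikit-learn MLPs as stand-ins for the poster's networks; the hidden sizes and training settings are hypothetical, and in practice the merger would be trained on held-out posteriors rather than training-set posteriors.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_parallel(channels, labels):
    """Parallel combination: one expert MLP per frequency channel, plus a
    merger MLP trained on the concatenated channel posteriors.  No channel
    ordering is implied."""
    experts, posteriors = [], []
    for feats in channels:          # e.g. [slow_mod_feats, fast_mod_feats]
        mlp = MLPClassifier(hidden_layer_sizes=(500,), max_iter=50)
        mlp.fit(feats, labels)      # labels: per-frame phoneme targets
        experts.append(mlp)
        posteriors.append(mlp.predict_proba(feats))
    merger = MLPClassifier(hidden_layer_sizes=(500,), max_iter=50)
    merger.fit(np.hstack(posteriors), labels)
    return experts, merger
```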


Hierarchical Processing

[Figure: a first MLP processes filter-bank 1's time-frequency representation; a second MLP takes filter-bank 2's representation together with the first MLP's output and produces posteriors.]

• Different frequency ranges are introduced in hierarchical fashion (see the sketch below).

• Sensitive to the order in which features are introduced.
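By contrast, a hierarchical sketch (same hypothetical MLP stand-ins as above) feeds the first network's posteriors, together with the second channel's features, into the second network, which is why the result depends on the order of introduction.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_hierarchical(first_feats, second_feats, labels):
    """Hierarchical combination: the second MLP sees the second channel's
    features concatenated with the first MLP's posteriors, so swapping
    the two channels changes the model."""
    first = MLPClassifier(hidden_layer_sizes=(500,), max_iter=50)
    first.fit(first_feats, labels)
    second_input = np.hstack([second_feats, first.predict_proba(first_feats)])
    second = MLPClassifier(hidden_layer_sizes=(500,), max_iter=50)
    second.fit(second_input, labels)
    return first, second
```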


Experimental Setup 

Experiments performed with the AMI RT05s first-pass system, on conversational meeting data acquired with headset microphones.

• 100 hours of training data from 4 different meeting rooms.
• Evaluated on the RT05 evaluation data.

The TANDEM approach is used to incorporate posteriors into a conventional LVCSR system.


TANDEM

[Figure: time-frequency representation → MLP → phoneme posteriors → Log/KLT (Gaussianization + decorrelation) → TANDEM features.]

TANDEM transforms the MLP output into features for conventional HMM/GMM systems, as sketched below.
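A minimal sketch of the Log/KLT step, assuming the standard TANDEM recipe (log of posteriors for Gaussianization, then a KLT/PCA projection for decorrelation); the output dimensionality is an arbitrary illustrative choice.

```python
import numpy as np

def tandem_features(posteriors, n_components=25, eps=1e-10):
    """Turn MLP phoneme posteriors into TANDEM features for HMM/GMM systems."""
    logp = np.log(posteriors + eps)   # log roughly Gaussianizes posteriors
    logp -= logp.mean(axis=0)         # center before estimating the KLT
    cov = np.cov(logp, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1][:n_components]
    return logp @ eigvec[:, order]    # project on the leading KLT bases
```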


Modulation frequency only

    System                        WER (%)
    PLP                           42.4
    MRASTA                        45.8
    Parallel Fast/Slow            41.4
    Hierarchical Fast to Slow     40.0
    Hierarchical Slow to Fast     45.8

• No splitting of auditory frequency channels.
• Separate processing significantly reduces WER w.r.t. a single classifier.
• Hierarchical processing outperforms parallel processing only when moving from fast to slow modulation frequencies.


Auditory frequency only

    System                        WER (%)
    PLP                           42.4
    MRASTA                        45.8
    Parallel High/Low             43.9
    Hierarchical High to Low      45.0
    Hierarchical Low to High      44.3

• No splitting of modulation frequency channels.
• Separate processing reduces WER w.r.t. a single classifier.
• Parallel processing outperforms hierarchical processing.
• Consistent with the conventional multi-band approach.


Some Conclusions

• Separate processing of different frequency ranges outperforms the single-classifier approach.
• When combining classifiers trained on different modulation frequency ranges, hierarchical processing outperforms parallel processing only when moving from fast to slow modulation frequencies. This is consistent with physiological evidence on auditory processing [L. Miller et al., 2002].
• When combining classifiers trained on different auditory frequency ranges, parallel processing outperforms hierarchical processing. This is consistent with the multi-band framework.

But how to combine both?


Parallel combination

[Figure: four MLPs, one per channel (G-High,F-Low), (G-High,F-High), (G-Low,F-Low), (G-Low,F-High), feed a merger MLP that produces the posteriors.]

    System             WER (%)
    MRASTA             45.8
    Parallel           40.7
    Only modulations   40.0


New architecture

[Figure: four per-channel MLPs over (G-High,F-Low), (G-High,F-High), (G-Low,F-Low), (G-Low,F-High); the auditory channels are split at the input and combined by merger MLPs, and the modulation channels are then split and combined through a further merger MLP that produces the posteriors.]

    System     WER (%)
    MRASTA     45.8
    Parallel   40.7
    New        39.6
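One plausible reading of the block diagram, as a sketch using the same hypothetical MLP stand-ins as above: the high/low auditory channels are merged in parallel inside each modulation range, and the modulation ranges are then combined hierarchically, fast to slow, matching the best ordering found earlier.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def _mlp():
    # hypothetical size/training budget, for illustration only
    return MLPClassifier(hidden_layer_sizes=(500,), max_iter=50)

def train_new_architecture(fast_high, fast_low, slow_high, slow_low, labels):
    """Auditory channels in parallel, modulation channels hierarchically."""
    # one expert per (auditory, modulation) channel
    fh, fl = _mlp().fit(fast_high, labels), _mlp().fit(fast_low, labels)
    sh, sl = _mlp().fit(slow_high, labels), _mlp().fit(slow_low, labels)
    # parallel merge of the auditory split within the fast modulation range
    fast_in = np.hstack([fh.predict_proba(fast_high),
                         fl.predict_proba(fast_low)])
    fast_merger = _mlp().fit(fast_in, labels)
    # hierarchical step: the slow-range channels enter together with the
    # fast merger's posteriors (fast-to-slow ordering)
    final_in = np.hstack([fast_merger.predict_proba(fast_in),
                          sh.predict_proba(slow_high),
                          sl.predict_proba(slow_low)])
    final_merger = _mlp().fit(final_in, labels)
    return (fh, fl, sh, sl), fast_merger, final_merger
```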


Conclusions

The proposed architecture takes advantage of the previous findings and processes:

• Modulation frequencies in hierarchical fashion.
• Auditory frequencies in parallel fashion.

It outperforms the parallel combination of all channels by about 1% absolute.


Thank you!


Combination of Auditory Frequencies

Only two auditory bands are considered.

    System                           WER (%)
    Band 1                           65.5
    Band 2                           60.6
    Parallel Band 1/2                43.9
    Hierarchical Band 1 to Band 2    45.0
    Hierarchical Band 2 to Band 1    44.3

When combining classifiers trained on different auditory frequencies, parallel combination outperforms hierarchical combination.



http://fvalente.zxq.net/presentations/joint.pdf
