On the combination of auditory and modulation frequency channels for ASR applications
Fabio Valente and Hynek Hermansky, IDIAP Research Institute
Information about the speech signal is contained in both auditory frequency channels and modulation frequency channels. Separate processing of auditory frequencies is effective in noisy conditions (a.k.a. multi-band). Separate processing of modulation frequencies is also effective in noisy conditions (Valente and Hermansky, ICASSP 2008). What is the most effective way of combining classifiers trained on different frequency ranges (both auditory and modulation) for ASR?
Auditory and Modulation Spectrum Features
[Figure: auditory spectrum of a speech signal over time]
Auditory spectrum represents the instantaneous frequency of the signal. Modulation spectrum represents the dynamics of the speech signal. There are several ways to extract speech dynamics.
Set of multiple resolution filters applied on a critical band log-energy trajectory.
Filters are equally spaced on a logarithmic scale.
Covers the full range of modulation frequencies.
Consistent with the perceptual model of [T. Dau et al., 1997]. The filter-bank is split into two sets of filters that emphasize slow and fast modulation frequencies, respectively.
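The multiple-resolution filtering described above can be illustrated with a minimal numpy sketch. MRASTA-style processing applies first- and second-derivative-of-Gaussian FIR filters to a critical-band log-energy trajectory; the specific sigma range, filter length, and normalization below are assumptions for illustration, not the exact poster configuration.

```python
import numpy as np

def gaussian_derivative_filters(sigmas, length=101):
    """First- and second-derivative-of-Gaussian FIR filters,
    one pair per temporal resolution sigma (in frames)."""
    t = np.arange(length) - length // 2
    bank = []
    for s in sigmas:
        g = np.exp(-0.5 * (t / s) ** 2)
        d1 = -t / s**2 * g                    # first derivative (band-pass)
        d2 = (t**2 / s**4 - 1 / s**2) * g     # second derivative
        bank.append(d1 / np.abs(d1).sum())    # simple gain normalization (assumed)
        bank.append(d2 / np.abs(d2).sum())
    return np.array(bank)

# Resolutions equally spaced on a log scale: small sigma captures fast
# modulations, large sigma slow modulations (range is an assumption).
sigmas = np.logspace(np.log10(0.8), np.log10(60.0), 8)
filters = gaussian_derivative_filters(sigmas)

# Apply each filter to one critical-band log-energy trajectory.
energy = np.random.randn(500)   # placeholder trajectory, 500 frames
features = np.array([np.convolve(energy, f, mode='same') for f in filters])
print(features.shape)           # (16, 500): 2 filters x 8 resolutions per band
```

Splitting the bank by sigma (small vs. large) yields the fast and slow modulation channels discussed next.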
MRASTA filtering II: Slow and Fast Modulations
[Figure: frequency responses of the slow and fast modulation filter sets; x-axis: Modulation Frequency (Hz)]
How are different frequency ranges extracted?
This generates four different frequency channels. What is the best processing for combining information from the auditory and the modulation frequency channels?
Parallel Processing
[Diagram: each filter-bank output feeds a separate MLP; the MLP outputs are merged by another MLP]
• A separate Multi-Layer Perceptron (MLP) is trained on each channel output.
• The outputs are combined using another MLP.
• Parallel processing does not imply an order in processing the different channels.
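The parallel scheme can be sketched as follows, assuming single-hidden-layer MLPs with random placeholder weights (in practice all nets are trained); dimensions and data are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mlp_forward(x, w1, w2):
    """Single-hidden-layer MLP producing class posteriors."""
    h = np.tanh(x @ w1)
    return softmax(h @ w2)

n_frames, dim, n_phones = 100, 16, 40
# One feature stream per channel (placeholder data).
slow = rng.standard_normal((n_frames, dim))
fast = rng.standard_normal((n_frames, dim))

# A separate MLP per channel (weights would normally be trained).
p_slow = mlp_forward(slow, rng.standard_normal((dim, 32)),
                     rng.standard_normal((32, n_phones)))
p_fast = mlp_forward(fast, rng.standard_normal((dim, 32)),
                     rng.standard_normal((32, n_phones)))

# Merger MLP combines the two posterior streams; no ordering is implied.
merged_in = np.concatenate([p_slow, p_fast], axis=1)
p_final = mlp_forward(merged_in, rng.standard_normal((2 * n_phones, 32)),
                      rng.standard_normal((32, n_phones)))
print(p_final.shape, np.allclose(p_final.sum(axis=1), 1.0))  # (100, 40) True
```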
Hierarchical Processing
[Diagram: filter-bank outputs are introduced one after another into a chain of MLPs over the time-frequency representation]
• Different frequency ranges are introduced in a hierarchical fashion.
• Sensitive to the order in which features are introduced.
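The hierarchical scheme differs from the parallel one in that the first classifier's posteriors are appended to the next channel's features, so the order of introduction matters. A minimal sketch with assumed dimensions and untrained placeholder weights:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mlp_forward(x, w1, w2):
    h = np.tanh(x @ w1)
    return softmax(h @ w2)

n_frames, dim, n_phones = 100, 16, 40
chan_fast = rng.standard_normal((n_frames, dim))  # fast-modulation stream (placeholder)
chan_slow = rng.standard_normal((n_frames, dim))  # slow-modulation stream (placeholder)

# Stage 1: classifier on the fast-modulation channel only.
p1 = mlp_forward(chan_fast, rng.standard_normal((dim, 32)),
                 rng.standard_normal((32, n_phones)))

# Stage 2: slow-modulation features plus stage-1 posteriors;
# swapping the two channels gives a different (slow-to-fast) system.
x2 = np.concatenate([chan_slow, p1], axis=1)
p2 = mlp_forward(x2, rng.standard_normal((dim + n_phones, 32)),
                 rng.standard_normal((32, n_phones)))
print(p2.shape)   # (100, 40)
```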
Experiments performed with AMI RT05s first pass system. Conversational meetings data acquired using headset microphones.
• 100hrs of training data from 4 different meeting rooms. • Evaluated on RT05 evaluation data.
TANDEM approach is used to incorporate posteriors into conventional LVCSR system.
[Diagram: time-frequency representation → MLP → TANDEM features]
TANDEM transforms MLP output into features for conventional HMM/GMM systems.
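The TANDEM transformation is commonly implemented as a log compression of the MLP posteriors followed by a PCA/KLT decorrelation, so the features suit diagonal-covariance GMMs. A sketch under those assumptions (the retained dimensionality of 25 is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
posteriors = rng.dirichlet(np.ones(40), size=1000)  # placeholder MLP outputs

# Log compresses the highly skewed posterior distribution.
logp = np.log(posteriors + 1e-10)

# PCA/KLT decorrelates the log-posteriors for the HMM/GMM back-end.
centered = logp - logp.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
tandem = centered @ vt[:25].T   # keep leading 25 components (assumed)
print(tandem.shape)             # (1000, 25)
```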
Modulation frequency only

System                        WER
PLP                           42.4
MRASTA                        45.8
Parallel Fast/Slow            41.4
Hierarchical Fast to Slow     40.0
Hierarchical Slow to Fast     45.8
No splitting of auditory frequency channels. Separate processing significantly reduces WER w.r.t. a single classifier. Hierarchical processing outperforms parallel processing only when moving from fast to slow modulation frequencies.
Auditory frequency only

System                        WER
PLP                           42.4
MRASTA                        45.8
Parallel High/Low             43.9
Hierarchical High to Low      45.0
Hierarchical Low to High      44.3

No splitting of modulation frequency channels. Separate processing reduces WER w.r.t. a single classifier. Parallel processing outperforms hierarchical processing, consistent with the conventional multi-band approach.
Separate processing of different frequency ranges outperforms the single-classifier approach. When combining classifiers trained on different modulation frequency ranges, hierarchical processing outperforms parallel processing only when moving from fast to slow modulation frequencies, consistent with physiological evidence on auditory processing [L. Miller et al., 2002]. When combining classifiers trained on different auditory frequency ranges, parallel processing outperforms hierarchical processing.
Consistent with the multi-band framework.
But how to combine both ?
Parallel combination
[Diagram: the four channels (G-High,F-Low), (G-High,F-High), (G-Low,F-Low), (G-Low,F-High) are combined in parallel into a single posterior]

System             WER
MRASTA             45.8
Parallel           40.7
Only modulations   40.0
New architecture
[Diagram: modulation channels are split hierarchically and auditory channels in parallel; the four channels (G-High,F-Low), (G-High,F-High), (G-Low,F-Low), (G-Low,F-High) feed into a single posterior]

System     WER
MRASTA     45.8
Parallel   40.7
New        39.6
The proposed architecture takes advantage of the previous findings and processes:
• Modulation frequencies in a hierarchical fashion.
• Auditory frequencies in a parallel fashion.
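The combined scheme can be sketched end-to-end: within each auditory band, modulation channels are chained hierarchically (fast to slow); the two auditory streams are then merged in parallel. Dimensions and random placeholder weights are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mlp(x, din, dout, hidden=32):
    """Untrained single-hidden-layer MLP (placeholder weights)."""
    return softmax(np.tanh(x @ rng.standard_normal((din, hidden)))
                   @ rng.standard_normal((hidden, dout)))

n, dim, phones = 100, 16, 40
# Four channels: auditory band (High/Low) x modulation range (Fast/Slow).
ch = {k: rng.standard_normal((n, dim)) for k in
      ('high_fast', 'high_slow', 'low_fast', 'low_slow')}

def hier_fast_to_slow(fast, slow):
    """Within one auditory band: fast-modulation posteriors feed the
    slow-modulation classifier (hierarchical, fast to slow)."""
    p_fast = mlp(fast, dim, phones)
    return mlp(np.concatenate([slow, p_fast], axis=1), dim + phones, phones)

p_high = hier_fast_to_slow(ch['high_fast'], ch['high_slow'])
p_low = hier_fast_to_slow(ch['low_fast'], ch['low_slow'])

# Across auditory bands: parallel merger with no implied order.
p_out = mlp(np.concatenate([p_high, p_low], axis=1), 2 * phones, phones)
print(p_out.shape)   # (100, 40)
```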
It outperforms the parallel combination of all channels by 1% absolute.
Thank you !
Combination of Auditory Frequencies
Only two auditory bands are considered.

System                          WER
Band 1                          65.5
Band 2                          60.6
Parallel Band 1/2               43.9
Hierarchical Band 1 to Band 2   45.0
Hierarchical Band 2 to Band 1   44.3
When combining classifiers trained on different auditory frequencies, parallel combination outperforms hierarchical combination.