Automatic Determination of Vocal Percussive Classes Using Unsupervised Learning

Identifier                       Pre-onset duration   Post-onset duration   Post-onset duration
                                 (samples)            (samples)             (ms) @44.1kHz
Control/Baseline                 2048                 2048                  23
Real-time – Low Latency          512                  256                   5.8
Real-time – Ultra Low Latency    256                  128                   2.9

Table II. Audio Duration Inputs for Hyperparameter Tuning
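The millisecond values for the two real-time rows of Table II follow from the post-onset sample counts and the 44.1 kHz sample rate. A minimal sketch of that conversion is given here; the function name `samples_to_ms` is illustrative and does not come from the paper.

```python
SAMPLE_RATE_HZ = 44_100  # sample rate stated in Table II


def samples_to_ms(n_samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Convert a duration in samples to milliseconds."""
    return 1000.0 * n_samples / sample_rate_hz


# Post-onset durations of the two real-time configurations.
for n in (256, 128):
    print(f"{n} samples -> {samples_to_ms(n):.1f} ms")
# 256 samples -> 5.8 ms
# 128 samples -> 2.9 ms
```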

4. Results

The following results demonstrate the feasibility of unsupervised classification of vocal percussion at real-time audio durations. The best-performing model achieved a cluster purity of 100% for the control audio duration, > 94% for real-time low latency and > 93% for real-time ultra-low latency. Complete results with the best-performing parameters are given in Table III. The conversion between accuracy and the number of samples correctly classified for the best-performing results is presented in Table IV.

three_perc, five_perc, six_perc (accuracy and parameters reported for each)

Baseline

Kmeans 100%
feature1: mfcc, feature2: mfcc∆, feature3: mfcc∆∆, n_mfcc: 12, scaling: standard, dimensions: 5
feature1: mfcc, n_mfcc: 12, scaling: standard, dimensions: 5
Kmeans 100%
feature1: mfcc, feature2: mfcc∆, feature3: mfcc∆∆, n_mfcc: 12, scaling: standard, dimensions: 5
feature1: mfcc, feature2: sc, n_mfcc: 12, scaling: standard, dimensions: 5
Kmeans 100%
feature1: mfcc, feature2: zcr, feature3: sc, n_mfcc: 15, scaling: standard, dimensions: 5
Kmeans 98%
feature1: mfcc, feature2: zcr, feature3: sc, n_mfcc: 15, scaling: standard, dimensions: 5
Kmeans 92%
feature1: mfcc, feature2: mfcc∆, feature3: mfcc∆∆, n_mfcc: 12, scaling: standard, dimensions: 5
feature1: mfcc, feature2: sc, feature3: zcr, n_mfcc: 12, scaling: standard, dimensions: 5
feature1: mfcc, feature2: sc, n_mfcc: 15, scaling: standard, n_dims: 5
GMM 100%
RealTime – Low Latency
Kmeans 100%
RealTime – Ultra Low Latency
Kmeans 98%
GMM 100%
GMM 100%
Kmeans 97%
GMM 97%
GMM 97%
GMM 100%
Kmeans 94%
GMM 94%
GMM 93%

Table III. Overall best performing results across three datasets and audio durations
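The purity and accuracy figures reported in this section can be related to counts of correctly classified samples with a short calculation. The sketch below assumes the standard definition of cluster purity, in which each cluster is credited with its most frequent ground-truth class; the paper's exact scoring procedure is not shown in this section, and the function names and toy labels are illustrative only.

```python
from collections import Counter, defaultdict


def cluster_purity(true_labels, cluster_ids):
    """Return (correct, total, purity) under majority-label purity."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have equal length")
    # Count how often each ground-truth class appears in each cluster.
    per_cluster = defaultdict(Counter)
    for label, cluster in zip(true_labels, cluster_ids):
        per_cluster[cluster][label] += 1
    # A sample counts as correct if it carries its cluster's majority class.
    correct = sum(max(counts.values()) for counts in per_cluster.values())
    total = len(true_labels)
    purity = correct / total if total else 0.0
    return correct, total, purity


def percentage_correct(correct, total):
    """Convert a count of correctly classified samples to a percentage."""
    return 100.0 * correct / total


# Toy example with hypothetical labels (not data from the paper).
truth = ["kick", "kick", "kick", "snare", "snare", "hat"]
clusters = [0, 0, 1, 1, 1, 2]
correct, total, purity = cluster_purity(truth, clusters)
print(f"{correct} of {total} samples correctly classified ({purity:.1%})")
# 5 of 6 samples correctly classified (83.3%)
```

Applied to the counts in Table IV, `percentage_correct(203, 216)` gives roughly 94% and `percentage_correct(201, 216)` roughly 93%, in line with the six_perc real-time figures.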

              Control/Baseline    Low Latency    Ultra-Low Latency
three_perc    108 of 108          108 of 108     108 of 108
five_perc     180 of 180          175 of 180     177 of 180
six_perc      216 of 216          203 of 216     201 of 216

All entries are numbers of samples correctly classified.

Table IV. Conversion from Accuracy scores to Number of Samples Correctly Classified

A combination of Mel Frequency Cepstral Coefficients (MFCC), Spectral Centroid (SC) and Zero-Crossing Rate (ZCR) has proven highly successful, particularly for the real-time audio durations. For longer audio durations, delta and delta-delta (acceleration) features, which model the trajectories of the MFCC over frames, have performed well. This can be attributed to the higher number of samples and therefore
