
5. Conclusion

Figure 20. Example of cluster formations that are not clearly defined (top), upon which the k-elbow metric performs poorly (bottom).
This paper focused on the analysis and automatic class determination of vocal percussive sounds for offline and online audio durations. First, three vocal percussive datasets were curated, containing three, five and six vocal percussive classes. Ground truth class tags were provided in the CSV files associated with the datasets to enable the use of the Rand-Index performance metric, which was crucial in configuring the clustering pipeline. Frame-based features were then extracted, engineered, and reduced in dimensionality to obtain a lower-dimensional principal subspace, upon which three unsupervised learning techniques were explored for clustering the data. Three audio durations were also compared, with a focus on real-time low latency. Finally, hyperparameter tuning via the Optuna framework allowed the hyperparameters to be optimised to maximise the Rand-Index score (i.e., to identify the hyperparameters that achieve the highest cluster purity). As far as is known, this is the first comparative study to evaluate model-based and centroid-based partitional clustering for the classification of HBB data.
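As an illustration of the metric summarised above, the Rand index can be written as the fraction of sample pairs on which the ground-truth tags and the predicted clusters agree (both place the pair together, or both place it apart). The following is a minimal sketch rather than the implementation used in this work; the function name and the toy labels are assumptions made purely for illustration.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Return the fraction of sample pairs on which two labelings agree."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("Both labelings must cover the same samples.")
    if len(labels_true) < 2:
        raise ValueError("At least two samples are needed to form a pair.")

    agreeing_pairs = 0
    total_pairs = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_in_truth = labels_true[i] == labels_true[j]
        same_in_prediction = labels_pred[i] == labels_pred[j]
        # A pair agrees when both labelings group it, or both separate it.
        if same_in_truth == same_in_prediction:
            agreeing_pairs += 1
        total_pairs += 1
    return agreeing_pairs / total_pairs


# Hypothetical ground-truth tags and cluster assignments (not from the datasets).
truth = ["kick", "kick", "snare", "snare", "hat", "hat"]
clusters = [0, 0, 1, 1, 1, 2]
print(rand_index(truth, clusters))  # 12 of the 15 pairs agree -> 0.8
```

Because the score depends only on pairwise agreement, the cluster identifiers do not need to match the class names, which is what makes it suitable as an objective for tuning an unsupervised pipeline against ground-truth tags.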
The planned next steps for this work focus on dataset testing and real-time implementation. The main priority of the testing phase will be the introduction of datasets with multiple participants, and the extension of the current datasets with the non-linguistic, inhaled and exhaled vocal events common in HBB. The aim of this testing phase will be to incrementally generalise the algorithm through exposure to multiple datasets and to both semi-professional and amateur vocal percussion, to account for a userbase of varying ability. The real-time implementation will require a development phase of fine-tuning the pipeline and translating it to a language suited to real-time use (e.g., C++). Considering the unpredictable results of the k-elbow method for ultra-low latency durations, there is an opportunity to explore alternative methods for predicting the number of components. The high performance of the model-based GMM clustering in this work makes a case for the implementation of GMM over k-means in a real-time application. In this case, there exists an alternative, high-accuracy method for predicting the number of components for the GMM algorithm: the Bayesian Information Criterion (BIC), a model-selection criterion also used in linear regression (McLachlan and Rathnayake, 2014).
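The BIC-based selection of the number of GMM components mentioned above could be prototyped along the following lines. This is a hedged sketch that assumes scikit-learn's `GaussianMixture` and synthetic two-dimensional data; the helper name `select_components_by_bic`, the candidate range and the toy clusters are illustrative assumptions, not part of the pipeline described in this paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def select_components_by_bic(features, candidate_counts, seed=0):
    """Fit one GMM per candidate count and return the count with the lowest BIC."""
    scores = {}
    for k in candidate_counts:
        gmm = GaussianMixture(
            n_components=k,
            covariance_type="full",
            n_init=3,  # several restarts reduce sensitivity to initialisation
            random_state=seed,
        )
        gmm.fit(features)
        # Lower BIC indicates a better trade-off between fit and complexity.
        scores[k] = gmm.bic(features)
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in for a reduced feature space: three well-separated clusters.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
features = np.vstack(
    [centre + rng.normal(scale=0.5, size=(100, 2)) for centre in centres]
)

best_k, scores = select_components_by_bic(features, range(1, 7))
print("Selected number of components:", best_k)
```

Unlike the elbow heuristic, this approach yields a single numerical criterion per candidate count, so the selection can be automated without visual inspection. Its behaviour on very short, ultra-low latency segments would still need to be tested.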
If the currently used models prove to be too computationally heavy for real-time use, Olukanmi, Nelwamondo and Marwala (2018) suggest a scalable adaptation of classic k-means entitled ‘k-means lite’. This model is based on an extension of the Central Limit Theorem (CLT) and targets large datasets in real-time applications. If the next dataset testing phase shows the current pipeline to be considerably less accurate in generalised cases, there is an opportunity to explore unsupervised, deep-learning-based alternatives, e.g., a pipeline that segments vocal percussive classes based on semantic meaning extracted using an autoencoder architecture (Chung et al., 2016).