

cluster_label = kmeans.cluster_centers_

Gaussian Mixture Model Implementation


GMM is implemented in much the same way as k-means, using the GaussianMixture class from scikit-learn's sklearn.mixture module. Once again the model is instantiated, this time passing the number of Gaussian distributions (which dictates the number of clusters) into the constructor, and the model is fitted to the principal subspace.

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3).fit(reduced_features)
y = gmm.predict(reduced_features)

In GMM each cluster is based upon a Gaussian distribution, so the equivalent of the k-means centroid is the mean of each distribution in the model. The gmm object has a member variable called means_ which stores the resulting mean of each Gaussian distribution; these can again be used to plot cluster centres on a visualisation.

cluster_label = gmm.means_
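
As an illustration, the cluster centres might be overlaid on the reduced features along these lines (a minimal sketch, assuming matplotlib and a two-dimensional principal subspace; variable names follow the earlier snippets):

import matplotlib.pyplot as plt

#Scatter the reduced features, coloured by predicted cluster
plt.scatter(reduced_features[:, 0], reduced_features[:, 1], c=y, s=10)

#Overlay the Gaussian means as cluster centres
plt.scatter(cluster_label[:, 0], cluster_label[:, 1], c="red", marker="x", s=100)
plt.show()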

The probabilistic properties of GMM also allow a call to gmm.predict_proba(pcas), which returns the probability of each datapoint belonging to each cluster. GMM is more computationally expensive than k-means; therefore, if the results obtained with GMM are not considerably more accurate than those obtained with the k-means model, there is no real advantage in choosing GMM over k-means (Everitt et al., 2011). This is especially the case in real-time applications, where computational time is costly and low latency is a requirement.
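For example, the soft assignments can be inspected as follows (a minimal sketch; pcas is taken from the prose above and assumed to hold the reduced features):

probabilities = gmm.predict_proba(pcas) #shape (n_samples, n_components)
#Each row sums to 1, e.g. probabilities[0] gives datapoint 0's membership
#probability for each of the three Gaussian components
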

3.10 Hyperparameter Optimisation

Hyperparameters are human-specified, application-specific variables that influence the outcome of machine learning models (Du, Xu and Zhu, 2021). The most basic method of finding an optimal outcome for any specific application is trial and error, i.e., changing the variables by hand and recording which combinations produce the best results. This approach may produce some interesting results, but it is time consuming, requires manual record keeping and, if there are many variables, optimal combinations may be missed due to human error.

Optuna is a powerful, open-source hyperparameter optimisation framework that removes the need for manually auditioning myriad variable combinations by automating the process. The outcome of each trial is recorded in the study's history as Optuna samples combinations of hyperparameters from specified ranges, attempting to either 'minimise' or 'maximise' a user-defined metric. Examples of hyperparameters are number_of_mfcc, scaling_type, feature_type and number_of_dimensions.

The Optuna objective function takes an Optuna trial as an argument, giving objective() access to the trial.suggest methods, and encases the machine learning pipeline and its suggested hyperparameters within. In this case, the body of the objective function contains code to create objects of (and pass data between) the classes that capture the full clustering pipeline. These classes are:

class VocalPercDataset(Dataset)
class FeatureExtractor()
class FeatureHandling()
class ClusterAnalysis()

#Optuna script to tune the hyperparameters that maximise the rand-index score

import optuna
import numpy as np
from sklearn.metrics import rand_score

def objective(trial):

    #Hyperparameters---------------------------------------------------------------

    feature_select_1 = trial.suggest_categorical("feature_select_1", ["mfcc", "delta_mfcc", "delta_delta_mfcc", "pstc", "sc", "zcr"])
    feature_select_2 = trial.suggest_categorical("feature_select_2", ["mfcc", "delta_mfcc", "delta_delta_mfcc", "pstc", "sc", "zcr"])
    feature_select_3 = trial.suggest_categorical("feature_select_3", ["mfcc", "delta_mfcc", "delta_delta_mfcc", "pstc", "sc", "zcr"])

    parameter_dict = {}
    parameter_dict[feature_select_1] = "stack"
    parameter_dict[feature_select_2] = "stack"
    parameter_dict[feature_select_3] = "stack"

    n_mfcc = trial.suggest_int("n_mfcc", 12, 40)
    feature_scale = trial.suggest_categorical("feature_scale", ["standard", "normal"])
    n_dims = trial.suggest_int("n_dims", 2, 8)

    #Objects-----------------------------------------------------------------------

    #Dataset loading
    ANNOTATIONS_FILE_ = "/Desktop/rachels_dataset/metadata/three_perc.csv"
    AUDIO_DIR_ = "/Users/philfasan/Desktop/rachels_dataset/audio/"
    SAMPLE_RATE_ = 44100

    vocal_percussion_dataset = VocalPercDataset(ANNOTATIONS_FILE_, AUDIO_DIR_, SAMPLE_RATE_) #create a vocal percussion dataset object

    #Empty containers for audio, labels, and features
    audios_ = []
    labels_ = []
    features_ = []

    #Loop to extract audio samples and labels from the VocalPercDataset class
    for i in range(len(vocal_percussion_dataset)):
        audio_, label_ = vocal_percussion_dataset[i]
        audio_ = audio_.squeeze().numpy()
        audios_.append(audio_)
        labels_.append(label_)
    audios_ = np.array(audios_)
    labels_ = np.array(labels_)

    #Feature Extraction

    #Loop to extract audio features from the FeatureExtractor class
    for x in audios_:
        feature_extractor_ = FeatureExtractor(x, parameter_dict, n_mfcc)
        temp_extract_ = feature_extractor_._perform_extraction()
        features_.append(temp_extract_)
    features_ = np.array(features_)

    #Feature Handling

    feature_handler_ = FeatureHandling(features_, n_dims, "PCA", feature_scale) #create a feature handling object
    feature_handler_._scale_feature() #call scale features on the object
    feature_handler_._get_reduction() #call get reduction on the object
    pcas_ = feature_handler_.scaled_feature.squeeze() #squeeze the scaled/reduced features

    #Clustering

    clustering_ = ClusterAnalysis("kmeans") #create a cluster analysis object
    cluster_predict_, centroids_ = clustering_._perform_clustering(pcas_, 3) #call perform clustering on the object

    return rand_score(labels_, cluster_predict_)

study = optuna.create_study(direction="maximize") #maximise the rand_score (best score is 1, worst is 0)
study.optimize(objective, n_trials=300)
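
Once the study completes, the best-scoring trial can be retrieved from the study object via Optuna's standard attributes:

print(study.best_value)  #the highest rand_score observed across all trials
print(study.best_params) #the hyperparameter combination that produced it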

For the assignment of the above hyperparameter suggestions, two methods of the Optuna trial object are utilised: trial.suggest_categorical and trial.suggest_int. The former allows a range of choices to be suggested for a categorical parameter, e.g., feature_select. The latter allows a low and high threshold to be specified for an integer parameter, e.g., n_mfcc. The metric used here (and throughout this work) to score the accuracy of the clustering is the rand_score. A score of 1.0 translates to 100% cluster accuracy when comparing ground truth classes and predicted classes. Accordingly, the objective function's return seeks to 'maximise' the rand_score between the ground truth classes, labels_ (gathered from the dataset's associated csv file), and the predicted classes, cluster_predict_ (generated within the objective function via a call to the _perform_clustering method of the ClusterAnalysis class).
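
A small worked example illustrates the metric (using scikit-learn's rand_score; cluster labels are compared up to permutation, so relabelled but identical groupings still score 1.0):

from sklearn.metrics import rand_score

rand_score([0, 0, 1, 1], [1, 1, 0, 0]) #returns 1.0 - identical groupings
rand_score([0, 0, 1, 1], [0, 1, 0, 1]) #returns ~0.33 - poor agreement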

Real-time Considerations

Whilst the fixed duration of 4096 samples has performed well for proof of concept, this research has been undertaken with a real-time application in mind. A worthwhile consideration is which durations are appropriate when classifying in real time. Accordingly, experiments were performed for various sample durations on each side of the audio onset. In a real-time application it is the trailing audio (after the onset) that is costly in terms of latency; the leading audio (before the onset) is free in this regard. It follows that the goal is to find a suitable balance between latency and clustering accuracy. This trade-off will be application specific. For example, a performance tool needs to feel instantaneous: the latency must not reach an audible level (3 ms for vocals, 6 ms for percussion) (Walker, 2005). With instant results less of a requirement, an application like query-by-vocalisation might instead favour classification accuracy over low latency. With shorter durations it can be beneficial to re-evaluate the feature sets to better represent the fewer samples the machine learning model will now have at its disposal. Below, the hyperparameter optimisation results are discussed for the three durations of audio presented in Table II.
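
To make the leading/trailing distinction concrete, a window around an onset might be sliced as follows (a hypothetical sketch: the audio array, onset_idx and the lead/trail split are illustrative assumptions, not values from Table II):

SAMPLE_RATE = 44100
lead, trail = 1024, 3072 #1024 + 3072 = 4096-sample window around the onset

segment = audio[onset_idx - lead : onset_idx + trail]

#Only the trailing samples add latency in a real-time setting
latency_ms = 1000 * trail / SAMPLE_RATE #approx. 69.7 ms at 44.1 kHz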
