3.9 Clustering

When instantiating the visualizer, the model, the range of k values to fit, and the scoring metric are passed into the constructor. Next, the data is fitted to the visualizer; in this case, the principal subspace data is passed in. Finally, the visualizer's show method is called to finalise and render the figure.

visualizer.fit(pcas)
visualizer.show()

There are a few scoring metrics by which the k-elbow can be evaluated:

Distortion – the default metric, which computes the sum of squared distances from each point to its assigned centre. This metric has performed sufficiently over a larger range of k, i.e., k ∈ ℝ | 3 ≤ k ≤ 12.

Silhouette – the silhouette score measures the 'goodness' of a clustering technique. To determine the optimal number of clusters with this metric, the mean silhouette coefficient of all samples is calculated. The silhouette coefficient is a value between −1 and 1, where 1 indicates clearly defined, distinct clusters and −1 indicates poor separation (Syakur et al., 2018). Based on this system of measurement, silhouette coefficients are most effective for determining the optimal number of clusters when the data points are very clearly segmented into groups. For this specific application the metric has proven inconsistent, especially within reduced ranges, i.e., k ∈ ℝ | 2 ≤ k ≤ 8.

Calinski–Harabasz – this metric for optimal k validation considers the levels of inter-cluster and intra-cluster dispersion and has shown promising, consistent results for 10 candidates of k. In the range k ∈ ℝ | 3 ≤ k ≤ 12, the Calinski–Harabasz index performs accurately on datasets of 3, 5, and 6 vocal percussive elements without depending on distinct cluster formations.
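The latter two metrics can also be computed directly with scikit-learn, outside the visualizer; the synthetic data and candidate range below are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Toy data with a known cluster structure
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(3, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    sil = silhouette_score(X, labels)        # mean silhouette coefficient, in [-1, 1]
    ch = calinski_harabasz_score(X, labels)  # inter- vs intra-cluster dispersion ratio
    print(f"k={k}  silhouette={sil:.3f}  calinski-harabasz={ch:.1f}")
```

The candidate k with the highest score is then taken as the optimum for that metric.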

There are many factors that influence the accuracy of the k-elbow method. The situations in which this method performs consistently well are documented in further detail in chapter 4.

The following cluster analysis functionality is encapsulated within the ClusterAnalysis class. The constructor takes the argument cluster_alg, which specifies the clustering algorithm applied when the _perform_clustering method is called. The _perform_clustering method takes the arguments reduced_features and n_clusters: the principal subspace passed as input to the clustering algorithm (the data to be clustered) and the number of output clusters, respectively.
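A minimal sketch of this structure; the supported algorithm names and the toy data in the demonstration call are hypothetical, as the original class may wrap different algorithms or hold additional state:

```python
from sklearn.cluster import KMeans, AgglomerativeClustering

class ClusterAnalysis:
    """Sketch of the cluster-analysis wrapper described above."""

    def __init__(self, cluster_alg="kmeans"):
        # cluster_alg selects the algorithm used by _perform_clustering
        self.cluster_alg = cluster_alg

    def _perform_clustering(self, reduced_features, n_clusters):
        # reduced_features: the principal subspace (the data to be clustered)
        # n_clusters: the number of output clusters
        if self.cluster_alg == "kmeans":
            model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        elif self.cluster_alg == "agglomerative":
            model = AgglomerativeClustering(n_clusters=n_clusters)
        else:
            raise ValueError(f"Unknown clustering algorithm: {self.cluster_alg}")
        return model.fit_predict(reduced_features)

# Demonstration on three well-separated point pairs
points = [[0, 0], [0, 1], [10, 10], [10, 11], [20, 20], [20, 21]]
labels = ClusterAnalysis("kmeans")._perform_clustering(points, n_clusters=3)
```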

K-Means Implementation

The k-means model used in this work is implemented with the sklearn.cluster module from the scikit-learn Python library. First, a k-means model is instantiated and a value for k (i.e., the number of clusters) is passed into the constructor; the model's default centroid initialisation is 'k-means++'. The principal subspace is then fitted to the model.

kmeans = KMeans(n_clusters=3)
kmeans.fit(reduced_features)

Once the model is fitted, the cluster labels can be obtained with the fit_predict method, which fits the model and returns the predicted class for each sample. The k-means model has a member variable called cluster_centers_, an array holding the final centroid placements; these are useful for plotting the centroids when visualising the final clustering.

y = kmeans.fit_predict(reduced_features)
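Putting these steps together; the two-dimensional toy data here stands in for the principal subspace:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for the principal subspace: three Gaussian groups in 2-D
rng = np.random.default_rng(0)
reduced_features = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ((0, 0), (5, 5), (0, 5))
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
y = kmeans.fit_predict(reduced_features)  # fit and return a label per sample

centres = kmeans.cluster_centers_         # array of final centroid placements
print(centres.shape)                      # one 2-D centroid per cluster
```

The centres array can then be overlaid on a scatter plot of the labelled points to visualise the final clustering.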
