
3.5 Feature Extraction
and Oxenham, 2010), (8) Pitch gathered by subharmonic summation and (9) Skew. The remaining three are novel feature sets proposed by Gonzalez as a way of approximating the Karhunen-Loève transform (KLT), commonly referred to as Principal Component Analysis (PCA), for obtaining principal components (PCs). Gonzalez proposes approximating the KLT with a Discrete Cosine Transform (DCT) to obtain the Principal Spectral Coefficients (PSC) of a signal; the reasoning is that principal components are guaranteed to optimally characterise the underlying structure of the data. The remaining two feature sets are extensions of PSC: Principal Cepstral Coefficients (PCC) and Principal Spectral-Temporal Coefficients (PSTC). The PSTC features were found to produce the best overall results (given adequate data), whilst PSC and PCC obtained better results than MFCC and MFCC+ on smaller datasets. Although first applied to supervised audio classification, these feature sets have been explored in this work for unsupervised learning on vocal percussion datasets. The justification and formulas for the proposed features are introduced in greater detail below.
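As a quick numerical check of this justification, the sketch below compares an orthonormal DCT-II basis against the KLT/PCA eigenvectors of a highly correlated first-order Markov (AR(1)) covariance, for which the DCT is known to be a close approximation. The AR(1) model, the basis size and the correlation value are illustrative assumptions, not taken from the project.

```python
import numpy as np

# Illustrative assumptions: 32-sample frames, AR(1) correlation of 0.95
N, rho = 32, 0.95

# Covariance matrix of a first-order Markov (AR(1)) process
C = rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))

# KLT/PCA basis: eigenvectors of the covariance, sorted by descending eigenvalue
eigvals, eigvecs = np.linalg.eigh(C)
eigvecs = eigvecs[:, ::-1]

# Orthonormal DCT-II basis (rows are basis vectors)
k, t = np.arange(N)[:, None], np.arange(N)[None, :]
D = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * t + 1) * k / (2 * N))
D[0] /= np.sqrt(2.0)

# |inner product| between matching basis vectors; values near 1 mean the
# fixed DCT basis closely matches the data-optimal KLT basis
match = np.abs(D @ eigvecs)
print(np.round(np.diag(match)[:8], 3))  # first few values close to 1.0
```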
Principal Spectral Coefficients (PSC) – The first few DCT coefficients of the magnitude spectrum of each frame, obtained via a short-time Fourier transform, given by (3)
$$\mathrm{PSC}(k) = \mathrm{DCT}\{\lvert \mathrm{FFT}\{x(t)\} \rvert\} \qquad (3)$$
Principal Cepstral Coefficients (PCC) – A variation on PSC that provides more even scaling of the feature set by whitening the spectrum with a log compression, given by (4)
$$\mathrm{PCC}(k) = \mathrm{DCT}\{\log \lvert \mathrm{FFT}\{x(t)\} \rvert\} \qquad (4)$$
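A minimal sketch of PSC and PCC for a single analysis frame, using numpy's FFT and scipy's DCT-II; the number of retained coefficients and the small log-floor constant are illustrative assumptions rather than values from the project.

```python
import numpy as np
from scipy.fft import dct

def psc(frame, n_coeffs=20):
    """PSC (eq. 3): first few DCT coefficients of the magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(frame))
    return dct(spectrum, type=2, norm="ortho")[:n_coeffs]

def pcc(frame, n_coeffs=20):
    """PCC (eq. 4): as PSC, but the log whitens the spectrum first."""
    log_spectrum = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)  # floor avoids log(0)
    return dct(log_spectrum, type=2, norm="ortho")[:n_coeffs]
```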
Principal Spectral-Temporal Coefficients (PSTC) – Captures the principal information contained in the time-frequency distribution of energy in the audio signal, given by (5)
$$\mathrm{PSTC}(m,n) = \mathrm{2D\text{-}DCT}\{\mathrm{STFT}\{x(t)\}\} \qquad (5)$$

At 1025 features per frame, the PSTC features result in a much larger feature set to describe the underlying audio samples when compared to the alternative features explored during the project. Stacking the 9 frames yields a feature vector of shape (1, 9225) per data member. Gonzalez (2013) proposed the features introduced above as "better performing" and "simpler to calculate" than MFCCs, and presented results showing this to be true for the specific classification problem to which the feature sets were applied.
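A minimal sketch of the PSTC computation follows. It assumes an FFT size of 2048 samples (giving the 1025 bins per frame quoted above), a hop of 512 samples, clips long enough to supply 9 frames, and scipy's dctn for the 2D DCT; the magnitude spectrogram is used here, consistent with the energy-distribution description. These parameters are inferred for illustration rather than taken from the project code.

```python
import numpy as np
from scipy.fft import dctn

def pstc(signal, n_fft=2048, hop=512, n_frames=9):
    """PSTC (eq. 5): 2D DCT over the magnitude STFT of the first few frames."""
    window = np.hanning(n_fft)
    # Magnitude STFT: 1025 frequency bins x n_frames time frames
    spec = np.stack(
        [np.abs(np.fft.rfft(window * signal[i * hop : i * hop + n_fft]))
         for i in range(n_frames)],
        axis=1,
    )
    # 2D DCT across the time-frequency plane, flattened to one feature vector
    return dctn(spec, type=2, norm="ortho").flatten()  # (1025 * 9,) = (9225,)
```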
In the initial stages of this work, features were gathered only from the first frame of each audio sample. Combining features across frames, a technique commonly referred to as 'stacking', is described by Meng (2006) in their PhD thesis as 'temporal feature integration' and involves the aggregation of frame-based features into a single feature vector. This technique can result in more reliable clusters by making use of the information gathered from features in other frames. Another frame-based aggregation method is averaging across frames to find a mean, median, min or max, though some information is discarded in the process (Knees and Schedl, 2016). Naturally, stacking features creates a larger feature vector: stacking 12 MFCCs with their 12 Δ and 12 ΔΔ coefficients (36 features) across 9 frames yields a 1 × 324 feature vector to represent each instance of vocal percussion. The shape of a feature vector (in the above example [1, 324]) is known as its dimensionality (Stowell, 2010). Within the six_perc dataset there are 216 vocal percussive recordings, resulting in a feature space of [216, 324] for the delta-accelerated MFCC features. A well-known difficulty arises in feature spaces containing many dimensions, a phenomenon referred to in the literature as 'the curse of dimensionality'. To address this, the dimensions of the feature space were reduced from 324 to 5 prior to clustering, in a process called dimensionality reduction (Winursito, Hidayat and Bejo, 2018).
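As a concrete illustration of stacking and the subsequent reduction step, the sketch below builds the 36-dimensional MFCC+Δ+ΔΔ frame features with librosa, stacks them across 9 frames into a 324-dimensional vector, and reduces the feature space to 5 dimensions with scikit-learn's PCA. The use of librosa and scikit-learn, their default hop/FFT settings, and the assumption that each clip yields at least 9 analysis frames are illustrative; the project's exact pipeline may differ.

```python
import numpy as np
import librosa
from sklearn.decomposition import PCA

def stacked_mfcc(y, sr, n_frames=9):
    """Stack 12 MFCCs plus their deltas and delta-deltas across frames.

    Returns a (324,) vector: 36 features x 9 frames, one per recording.
    """
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)          # (12, T)
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),             # Δ
                       librosa.feature.delta(mfcc, order=2)])   # ΔΔ -> (36, T)
    return feats[:, :n_frames].flatten(order="F")               # frame-wise stack

# Hypothetical dataset of (audio, sample_rate) pairs, e.g. 216 six_perc clips:
# X = np.stack([stacked_mfcc(y, sr) for y, sr in dataset])      # (216, 324)
# X_5d = PCA(n_components=5).fit_transform(X)                   # (216, 5)
```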