III. CONCLUSION

from Audio Onset Detection: A Brief History and Current Techniques

I. INTRODUCTION

2010 saw a paradigm shift for onset detection methods. Inspired by the work of Lacoste and Eck, Böck and colleagues presented an algorithm for onset detection based on a Bidirectional Recurrent Neural Network (BRNN) with Long Short-Term Memory (LSTM) (Eyben et al., 2010). This development defined the then state-of-the-art for onset detection algorithms. Eyben et al. reported that their novel, data-driven methodology showed superior performance and precision for both pitched and percussive onsets (2010). RNNs have a looping mechanism in the hidden layer that carries a hidden state from one time step to the next, which makes them well suited to modelling sequential data such as signals or time series and to making predictions from it (Phi, 2018). A weakness of RNNs is that they suffer from short-term memory due to the vanishing gradient effect (Phi, 2018). In this case the problem was addressed by introducing Long Short-Term Memory: LSTM networks function similarly to standard RNNs but are capable of learning long-term dependencies (Phi, 2018).
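The vanishing gradient effect mentioned above can be illustrated with a minimal sketch. It assumes a hypothetical one-unit recurrent network with a tanh activation and hand-picked weights (`w_rec`, `w_in`); it is not the network used by Eyben et al. The gradient of the last hidden state with respect to the first is a product of per-step factors, and when each factor is smaller than one the product shrinks rapidly with sequence length.

```python
import math


def rnn_forward(inputs, w_rec=0.5, w_in=1.0):
    """Run a one-unit recurrent network and return every hidden state.

    Update rule: h_t = tanh(w_rec * h_{t-1} + w_in * x_t).
    The weights are illustrative assumptions, not trained values.
    """
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_rec * h + w_in * x)
        states.append(h)
    return states


def gradient_wrt_first_state(states, w_rec=0.5):
    """Return |d h_T / d h_1| using the chain rule.

    Each later step contributes one factor w_rec * (1 - h_t ** 2),
    the derivative of the tanh update with respect to h_{t-1}.
    """
    grad = 1.0
    for h in states[1:]:
        grad *= w_rec * (1.0 - h * h)
    return abs(grad)


inputs = [0.1] * 50                      # a constant toy input sequence
states = rnn_forward(inputs)

short_range = gradient_wrt_first_state(states[:5])   # 5 time steps
long_range = gradient_wrt_first_state(states)        # 50 time steps

print(f"gradient across 5 steps:  {short_range:.3e}")
print(f"gradient across 50 steps: {long_range:.3e}")
```

With these weights every factor is at most 0.5, so the influence of the first time step on the last is many orders of magnitude smaller over 50 steps than over 5. LSTM units are designed to counter this by maintaining a separate cell state whose updates are controlled by gates.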

Neural networks had first been explored for onset detection in 2002, when Marolt et al. applied a Multi-Layer Perceptron (MLP) to a spectrogram, although the study was limited to the onsets of piano music (Marolt & Kavčič, 2003). Another historically significant breakthrough took place in the same year, when Long Short-Term Memory (LSTM) was first applied to Music Information Retrieval for the detection of temporal structure in Blues improvisation (Eck & Schmidhuber, 2002). Modern, data-driven onset detection algorithms still use spectral representations of the signal, but instead of calculating spectral flux manually, a spectrogram is fed to a neural network (Salamon & Bello, 2017).
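For contrast with the data-driven approach, the following is a minimal sketch of the manual spectral flux calculation. It assumes a magnitude spectrogram has already been computed and is stored as a list of frames; the function names, the toy data and the fixed threshold are illustrative assumptions, and the peak picking is deliberately simplistic.

```python
def spectral_flux(spectrogram):
    """Half-wave rectified spectral flux.

    spectrogram: list of frames, each a list of non-negative bin magnitudes.
    For every frame, sum only the increases in magnitude relative to the
    previous frame, so that decaying energy does not register as an onset.
    """
    flux = [0.0]  # the first frame has no predecessor
    for prev, curr in zip(spectrogram, spectrogram[1:]):
        flux.append(sum(max(c - p, 0.0) for p, c in zip(prev, curr)))
    return flux


def pick_peaks(odf, threshold):
    """Return frame indices that are local maxima above a fixed threshold."""
    peaks = []
    for i in range(1, len(odf) - 1):
        if odf[i] > threshold and odf[i] > odf[i - 1] and odf[i] >= odf[i + 1]:
            peaks.append(i)
    return peaks


# Toy magnitude spectrogram: 7 frames, 3 frequency bins.
toy = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [1.0, 0.8, 0.2],  # energy appears
    [0.8, 0.6, 0.1],  # decay
    [0.6, 0.4, 0.1],
    [0.6, 1.4, 0.9],  # energy rises again in the upper bins
    [0.5, 1.0, 0.6],
]

flux = spectral_flux(toy)
onsets = pick_peaks(flux, threshold=0.5)
print(onsets)
```

In the data-driven methods discussed here, the hand-designed flux function and threshold are replaced by a network that learns its own onset detection function from annotated spectrograms.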

Inspired by the use of Convolutional Neural Networks (CNNs) for edge detection in image processing, Böck et al. were the first to apply CNNs to onset detection (Schlüter & Böck, 2013). Comparing their previous use of RNNs for onset detection (Eyben et al., 2010) with the CNN method revealed a trade-off between pre-processing and computational load: the two methods performed similarly overall, but the CNN required less pre-processing at the cost of higher computational power (Schlüter & Böck, 2013).

In 2014, a study by Böck and Schlüter found that a CNN yielded improved results compared with the then state-of-the-art onset detection techniques, superseding their previous use of RNNs and CNNs for the same task. The then state-of-the-art was ‘OnsetDetector’, also the work of Böck et al., which was submitted to the MIREX Onset Detection evaluations in 2011 and 2012 and performed best in both years. Böck would go on to author or co-author the winning submissions up to the present day.

Böck and Schlüter began their study by stating that, up until that point, onset detection had been only partially solved for polyphonic signals (Schlüter & Böck, 2014). Their CNN implementation was able to overcome this long-standing weakness, as CNNs can distinguish between two musical events or onsets even when they mask each other in time and/or frequency (Salamon & Bello, 2017). As with their previous data-driven methodology, this CNN was trained on spectrograms. The grid-like representation of frequency as a function of time is ideal for a CNN, which handles spectro-temporal data efficiently (Schlüter & Böck, 2014).
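To make the grid analogy concrete, the sketch below slides a small two-dimensional kernel over a toy spectrogram, in the same way an edge detector slides over an image. The kernel here is set by hand to respond to a rise in energy between consecutive frames; in a trained CNN the kernel values are learned from data. The layout (rows as frequency bands, columns as time frames), the kernel and the data are all assumptions made for illustration, not the architecture of Schlüter and Böck.

```python
def conv2d_valid(grid, kernel):
    """Valid-mode 2D cross-correlation, the operation CNN layers apply.

    grid:   list of rows (frequency bands), each a list of time frames.
    kernel: small 2D list of weights slid over the grid without padding.
    """
    rows, cols = len(grid), len(grid[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    for r in range(rows - k_rows + 1):
        out_row = []
        for c in range(cols - k_cols + 1):
            acc = 0.0
            for i in range(k_rows):
                for j in range(k_cols):
                    acc += grid[r + i][c + j] * kernel[i][j]
            out_row.append(acc)
        out.append(out_row)
    return out


# Toy spectrogram: 3 frequency bands (rows) by 5 time frames (columns).
# Energy appears in the two lowest bands at the third frame.
spectrogram = [
    [0.0, 0.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]

# Hand-set "temporal edge" kernel spanning two bands and two frames:
# negative weight on the earlier frame, positive on the later one.
edge_kernel = [
    [-1.0, 1.0],
    [-1.0, 1.0],
]

feature_map = conv2d_valid(spectrogram, edge_kernel)
for row in feature_map:
    print(row)
```

The feature map is non-zero only where energy rises from one frame to the next, and the response is strongest where the rise spans both bands covered by the kernel. Because each output depends only on a local patch of time and frequency, events in different frequency regions can produce separate responses.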

III. CONCLUSION

Looking at the literature of the past decade, Böck emerges as the most prominent researcher in the field of onset detection. Together with Schlüter and Eyben, Böck has championed deep learning for onset detection, consistently defining or contributing to the state-of-the-art algorithms (Bello et al., 2005; Eyben et al., 2010; Schlüter & Böck, 2013).

The deep-learning-based approaches discussed in this paper have seemingly managed to overcome the main complications inherent to onset detection that researchers have faced time and again. These are:

• Soft onsets with slow transients, where the onset is harder to detect
• Vague or quiet onsets
• Onsets in polyphonic music

The CNN-based onset detection algorithm ‘CNNOnsetDetector’ that Böck and Schlüter presented at MIREX 2018 remains the best-performing algorithm for onset detection (Roebel, Jacques, & Aknin, 2018). However, attempts to replicate its results have so far been unsuccessful.

Reproduction of the results was attempted most recently by Lindqvist (2019), who reported being unable to replicate the results achieved by Böck and Schlüter. Gong and Serra (2018) were also unable to reproduce these results. This raises the question of how useful an algorithm is in the wider context if its results cannot be reproduced to the same quality. Lindqvist (2019) proposes raw waveforms as an alternative to spectrograms for training neural networks and suggests that this might improve performance.

Over the past two years, MIR papers proposing transformers as an improvement upon LSTM-based methods have been emerging (Agrawal, Ravi Shanker, & Alluri, 2021; Park, Choi, Jeon, Kim, & Park, 2019). Transformers could potentially be the next step for onset detection algorithms.
