dataset

FoleySet: A Multi-Level Human-Annotated Foley Sound Dataset

by Sunshiyu Wang

FoleySet is a new publicly available dataset of 10,000 human-annotated Foley audio clips designed to address a key gap in Foley research: the lack of structured, fine-grained, human-annotated Foley sound datasets. As evoloving machine learning approaches create new opportunities for Foley sound classification, retrieval, synchronization, and generation, high-quality datasets are essential for developing and evaluating reliable models. FoleySet provides this foundation through a two-level taxonomy containing 9 major categories and 73 sub-categories, covering sounds such as footsteps, clothing movement, human-produced sounds, liquid sounds, metal interactions, clicks, break/drop events, material interactions, and open/close mechanisms.

Because Foley does not have a universally unambiguous definition, we define it as sounds arising from human-related actions, including human-produced sounds and human interactions with materials, objects, and surfaces. Based on this definition, we developed a Foley-specific taxonomy by reviewing prior research, Foley practice, and commercial Foley sound libraries. Common Foley-related keywords were extracted, normalized, and manually refined into the final two-level taxonomy. Using this taxonomy, FoleySet was built through a multi-step pipeline: candidate clips were collected from Creative Commons Zero-licensed Freesound recordings, manually screened, standardized through audio processing, and annotated with major-category, sub-category, and one-shot/multi-shot labels. The final dataset includes training, validation, and test splits, along with metadata such as Freesound tags, descriptions, source IDs, uploader information, and URLs.

FoleySet: Content categories and statistics

We also provide baseline classification results using a PaSST-based audio embedding model. The model performs well on the 9-way major-category classification task, while the more fine-grained 73-way sub-category task remains challenging, especially for acoustically similar or underrepresented Foley sounds. Overall, FoleySet provides a practical dataset and Foley-specific taxonomy for developing models that retrieve, classify, and generate detailed sounds of human action and material interaction, with the goal of supporting future Foley research and contributing to broader audiovisual production workflows.

Dataset link: FoleySet

Paper link: Paper

dMelodies: A Music Dataset for Disentanglement Learning

by Ashis Pati

Motivation

In the field of machine learning, it is often required to learn low-dimensional representations which capture important aspects of given high-dimensional data. Learning compact and disentangled representations (see Figure 1) from given data, where important factors of variation are clearly separated, is considered especially useful for generative modeling.

Figure 1: Disentangled representation learning.

However, most of the current/previous studies on disentanglement have relied on datasets from the image/computer vision domain (such as the dSprites dataset).

We propose dMelodies, a standardized dataset for conducting disentanglement studies on symbolic music data which will allow:

researchers working on disentanglement algorithms evaluate their method on diverse domains.
systematic and comparable evaluation of methods meant specifically for music disentanglement.

dMelodies Dataset

To enable objective evaluation of disentanglement algorithms, one needs to either know the ground-truth values of the underlying factors of variation for each data point, or be able to synthesize the data points based on the values of these factors.

Design Principles

The following design principles were used to create the dataset:

It should have a simple construction with homogeneous data points and intuitive factors of variation.
The factors of variation should be independent, i.e., changing any one factor should not cause changes to other factors.
There should be a clear one-to-one mapping between the latent factors and the individual data points.
The factors of variation should be diverse and span different types such as discrete, ordinal, categorical and binary.
The generated dataset should be large enough to train deep neural networks.

Dataset Construction

Based on the design principles mentioned above, dMelodies is artificially generated dataset of simple 2-bar monophonic melodies generated using 9 independent latent factors of variation where each data point represents a unique melody based on the following constraints:

Each melody will correspond to a unique scale (major, harmonic minor, blues, etc.).
Each melody plays the arpeggios using the standard I-IV-V-I cadence chord pattern.
Bar 1 plays the first 2 chords (6 notes), Bar 2 plays the second 2 chords (6 notes).
Each played note is an 8th note.

A typical example is shown below in Figure 2.

Factors of Variation

The following factors of variation are considered:

1. Tonic (Root): 12 options from C to B
2. Octave: 3 options from C4 through C6
3. Mode/Scale: 3 options (Major, Minor, Blues)
4. Rhythm Bar 1: 28 options based on where the 6 note onsets are located in the first bar.
5. Rhythm Bar 2: 28 options based on where the 6 note onsets are located in the second bar.
6. Arpeggiation Direction Chord 1: 2 options (up/down) based on how the arpeggio is played
7. Arpeggiation Direction Chord 2: 2 options (up/down)
8. Arpeggiation Direction Chord 3: 2 options (up/down)
9. Arpeggiation Direction Chord 4: 2 options (up/down)

Consequently, the total number of data-points are 1,354,752.

Benchmarking Experiments

We conducted benchmarking experiments using 3 popular unsupervised algorithms (beta-VAE, Factor-VAE, and Annealed-VAE) on the dMelodies dataset and compared the result with those obtained using the dSprites dataset. Overall, we found that while disentanglement performance across different domains is comparable (see Figure 3), maintaining good reconstruction accuracy (see Figure 4) was particularly hard for dMelodies.

Figure 3: Disentanglement performance (higher is better) in terms of Mutual Information Gap (MIG) of different methods on the dMelodies and dSprites datasets.

Figure 4: Reconstruction accuracies (higher is better) of the different methods on the dMelodies and dSprites datasets. For reconstruction, the median accuracy for 2 methods fails to cross 50% which makes them unusable in a generative modeling setting.

Thus, methods which work well for image-based datasets do not extend directly to the music domain. This showcases the need for further research on domain-invariant algorithms for disentanglement learning. We hope this dataset is a step forward in that direction.

Resources

The dataset is available on our Github repository. The code for reproducing our benchmarking experiments is available here. Please see the full paper (to appear in ISMIR ’20) for a more in-depth discussion on the dataset design process and additional results from the benchmarking experiments.

Automatic Sample Detection in Polyphonic Music

by Siddharth Gururani

The term ‘sampling’ refers to the reuse of audio snippets from pre-existing digital recordings with appropriate modifications in new compositions in a way that it fits the musical context. Influential artists that have been sampled frequently by younger artists include, for example, James Brown, Stevie Wonder, and Michael Jackson. Since sampling is an important approach in at least some music genres, there are websites dedicated to linking samples to songs such as whosampled.com. The annotation, however, is done manually by fans and
music aficionados. A system that can automatically detect sampling can help automate this process and could also be used in large scale musicological studies of artist influence across time and geographical space.

The task of automatic sample detection has not been explored in much detail. Some papers proposed methods involving a modified audio fingerprinting method and Non-negative Matrix Factorization (NMF). The block diagram below gives a broad overview of the method used in this work.

flowchart of the sample detection process

The algorithm we present also utilizes NMF and adds a post-processing step with subsequence Dynamic Time Warping (DTW) to extract features that indicate a sample/song pair. The figure below shows a distance matrix for a song in which the sample is looped 4 times in 20 seconds as indicated by the diagonal lines. We extract features from the detected paths and use them to train a random forest classifier.

distance matrix showing 4 repetitions of the looped sample

A new dataset had to be created for the evaluation of the system as previous publications lack systematic evaluation. This dataset originates from whosampled.com and is now publicly available. Our evaluation results, presented in the paper, indicate that our algorithm is has reasonably high precision while suffering from low recall which may be attributed to absence of clear alignment paths in the distance matrix.

For details on the method, results and discussion, please refer to the published paper available here.

Mixing Secrets: A Multi-Track Dataset for Instrument Recognition

by Siddharth Gururani

Instrument recognition as a task in Music Information Retrieval has had a long history and several datasets have been introduced for public use. The RWC dataset and the UIOWA dataset, for instance, are standard datasets for evaluation of instrument recognition in monophonic audio. The IRMAS dataset is a large dataset for predominant instrument detection. There are however, not many datasets available for instrument detection in polyphonic mixtures.

Muti-track data comes in handy for such a task. Multi-track datasets contain the recording sessions of songs, which will normally include the raw tracks, the stems, and the final mix. This enables the usage of multi-track datasets for a variety of tasks such as source separation and multi-f0 tracking, but also instrument recognition.

MedleyDB is a widely known dataset that contains 250 multi-tracks with a well defined annotation format and instrument taxonomy. While this might be considered an overwhelming amount of data, new data-hungry algorithms such as deep neural networks are often in need of more data for training and testing. We release a new set of annotated multi-track data in a format that is compatible to MedleyDB. It contains 258 multi-tracks originating from the website for a book titled “Mixing Secrets For the Small Studio.”

The paper contains more details about how the data was cleaned and processed in order to make it consistent with MedleyDB’s annotations. The github repository contains the code and links to the data.

Guitar Solo Detection

by Ashis Pati

Over the course of the evolution of rock, electric guitar solos have developed into an important feature of any rock song. Their popularity among rock music fans is reflected by lists found online such as here and here. The ability to automatically detect guitar solos could, for example, be used by music browsing and streaming services (like Apple Music and Spotify) to create targeted previews of rock songs. Such an algorithm would also be useful as a pre-processing step for other tasks such as guitar playing style analysis.

What is a Solo?

Even though most listeners can easily identify the location of a guitar solo within a song, it is not a trivial problem for a machine. Looking at it from an audio signal perspective, solos can be very similar to some of the other techniques such as riffs or licks.

Therefore, we define a guitar solo as having the following characteristics:

The guitar is in the foreground compared to other instruments
The guitar plays improvised melodic phrases which don’t repeat over measures (differentiate from a riff)
The section is larger than a few measures (differentiate from a lick)

What about Data?

In the absence of any annotated dataset of guitar solos, we decided to create a pilot dataset containing 60 full-length rock songs and annotated the location of the guitar solos within the song. Some of the songs contained in the dataset include classics like “Stairway to Heaven,” “Alive,” and “Hotel California.” The sub-genre distribution of the dataset is shown in Fig. 1.

What Descriptors can be used to discriminate solos?

The widespread use of effect pedal boards and amps results in a plethora of different electric guitar “sounds,” possibly almost as large as the number of solos themselves. Hence, finding audio descriptors capable of discriminating a solo from a non-solo part is NOT a trivial task. To gauge how difficult this actually is, we implemented a Support Vector Machine (SVM) based supervised classification system (see the overall block diagram in Fig. 2).

In addition to the more ubiquitous spectral and temporal audio descriptors (such as Spectral Centroid, Spectral Flux, Mel-Frequency Cepstral Coefficients etc.), we examine two specific class of descriptors which intuitively should have better capacity to differentiate solo segments from non-solo segments.

Descriptors from Fundamental Pitch estimation:
A guitar solo is primarily a melodic improvisation and hence, can be expected to have a distinctive fundamental frequency component which would be different from that of another instrument (say a bass guitar). In addition, during a solo the guitar will have a stronger presence in the audio mix which can be measured using the strength of the fundamental frequency component.
Descriptors from Structural Segmentation:
A guitar solo generally doesn’t repeat in a song and hence, would not occur in repeated segments of a song (e.g., chorus, song). This allows to leverage existing structural segmentation algorithms in a novel way. A measure of the number of times a segment has been repeated in a song and the normalized length of the segment can serve as useful inputs to the classifier.

By using these features and post-processing to group the identified solo segments together, we obtain a detection accuracy of nearly 78%.

The main purpose of this study was to provide a framework against which more sophisticated solo detection algorithms can be examined. We use relatively simple features to perform a rather complicated task. The performance of features based on structural segmentation is encouraging and warrants further research into developing better features. For interested readers, the full paper as presented at the 2017 AES Conference on Semantic Audio can be found here.

MDB Drums dataset for automatic drum transcription

by Chih-Wei Wu

Data availability is the key to the success of many machine learning based Music Information Retrieval (MIR) systems. While there are different potential solutions to deal with insufficient data (for instance., semi-supervised learning, data augmentation, unsupervised learning, self-taught learning), the most direct way of tackling this problem is to create more annotated datasets.

Automatic Drum Transcription (ADT) is, similar to most of the MIR tasks, in need of more realistic and diverse datasets. However, the creation process of such datasets is usually difficult for the following reasons: 1) the synchronization between the drum strokes the onset times has to be exact. In previous work, this was done by installing triggers on the drum sets. However, the installation and the recording process also limits the size of the dataset. Another work around solution is to synthesize drum tracks with user defined onset time and drum samples; this will result in drum tracks with perfect ground truth, but the resulting music might be unrealistic and unrepresentative of the real-world music. 2) the variety of the drum sounds has to be high enough to cover a wide range from electronic to acoustic drum sounds. Most of the previous work only uses a small subset of drum sounds (e.g., certain drum machines or a few drum kits), which is not ideal in this regard. 3) the playing techniques can be hard to differentiate, especially on instruments such as snare drum. A majority of existing datasets only contain the annotations of the basic strikes for simplicity.

We try to address the above mentioned difficulties by introducing a new ADT dataset with a semi-automatic creation process.

Why MDB?

The goal of this project is to create a new dataset with minimum effort from the human annotators. In order to achieve this goal, a robust onset detection algorithm to locate the drum events is important. To ensure the robustness of the onset detector, we want the input signal to be as clean as possible. However, we also want to have a signal to be as realistic as possible (i.e., polyphonic mixtures of both melodic and percussive instruments). With these considerations in mind, we decided to avoid collecting a new dataset from scratch but rather work on the existing dataset with desirable properties.

As a result, the MusicDelta subset in the MedleyDB dataset is chosen for its:

Multitrack format. This can potentially increase the robustness of onset detector and facilitate the semi-automatic annotation process. In addition, the multitrack files can be mixed in any arbitrary combination, providing more possibilities for experimentation.
Real-world recordings. This means the recordings are more realistic and closer to the real use cases. Also, the diversity in terms of music genres offers a more representative sample pool.

Dataset creation

The processing flow of the creation of MDB Drum is shown in Fig. 1. First, the songs in the selected dataset are processed with an onset detector. This provides a consistent estimation of the onset locations. Next, the onsets are labeled with their corresponding instrument names (i.e., Hi-hat, Snare Drum, Kick Drum). This step inevitably requires manual annotation from the human experts. Following the manual annotation, a set of automatic checks were implemented to examine the annotations for common errors (e.g., typos, duplicates). Finally, the human experts went through an iterative process of cross-checking their annotations prior to the release of the dataset.

An dataset example is shown in Fig. 2. The original polyphonic mixture (top) appears noisy, and it is difficult to locate the drum events through both listening and visual inspection. The drum-only recording (middle), however, has sharp attacks and short decays in the waveform, providing a cleaner representation for the onset detector. Finally, the detected onsets (bottom), as marked in red, are relatively accurate, which greatly simplifies the process of manual annotation.

One example of the semi-automatic annotation process

The resulting dataset contains 23 tracks of real-world music with a diverse distribution of music genres (e.g., rock, disco, grunge, punk, reggae, jazz, funk, latin, country, britpop, to name just a few). The average duration of the tracks is around 54s, and the number of annotated instrument classes is 6 (only major classes) or 21 (with playing techniques). The users may choose to mix the multitracks in any combination (e.g., guitar + drum, bass + drum) due to the multitrack format of the original MedleyDB dataset.

All details can be found in our short paper here.