Learning to Traverse Latent Spaces for Musical Score Inpainting

Inpainting as a Creative Tool

Inpainting is the task of automatically filling in missing information in a piece of media (say, an image or an audio clip). Traditionally, inpainting algorithms have been used mostly for restorative purposes. However, in many cases, there can be multiple valid ways to perform an inpainting task. Hence, inpainting algorithms can also be used as a creative tool.

Figure 1:

Specifically for music, an inpainting model could be used for several applications in the field of interactive music creation such as:

  • to generate musical ideas in different styles,
  • to connect different musical sections, and
  • to modify and extend solos.


Previous work, e.g., Roberts et al.’s MusicVAE, has shown that Variational Auto-Encoder (VAE)-based music generation models exhibit interesting properties such as interpolation and attribute arithmetic (this article gives an excellent overview of these properties). Effectively, for music, we can train a low-dimensional latent space where each point maps to a measure of music (see Fig. 3 for an illustration).

Figure 2:

Now, while linear interpolation in latent spaces has shown interesting results, by definition, it cannot model repetition in music (joining two points in Euclidean space with a line will never contain the same point twice). The key motivation behind this research work was to explore whether we can learn to traverse complex trajectories in the latent space to perform music inpainting, i.e., given a sequence of points corresponding to the measures in the past and future musical contexts, can we find a path through this latent space which forms a coherent musical sequence (see Fig. 4)?
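As a small illustration of the argument above, the following sketch interpolates linearly between two points and checks that no point on the path is visited twice. The 2-D vectors and the `lerp` helper are made up for this example and are not taken from the model.

```python
def lerp(z_start, z_end, t):
    """Point at fraction t (0..1) on the straight line from z_start to z_end."""
    return tuple((1.0 - t) * a + t * b for a, b in zip(z_start, z_end))

# Hypothetical 2-D latent codes for two measures (real latent spaces are larger).
z_a = (0.0, 1.0)
z_b = (2.0, -1.0)

# Five evenly spaced steps along the line, endpoints included.
path = [lerp(z_a, z_b, i / 4) for i in range(5)]

# Every point is distinct, so the decoded sequence can never return
# to a latent point it has already visited.
assert len(set(path)) == len(path)
```

A learned trajectory is not restricted in this way: it may revisit the same latent point, which is what makes repetition possible.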

Figure 3:


To formalize the problem, consider a situation where we are given the beginning and end of a song (which we refer to as the past musical context Cp and future musical context Cf, respectively). The task here is to create a generative model which connects the two contexts in a musically coherent manner by generating an inpainting Ci. We further constrain the task by assuming that the contexts and the inpainting have a certain number of measures (see Fig. 2 for an illustration). So effectively, we want to train a generative model which maximizes the likelihood p(Ci|Cp, Cf).

Figure 4:

In order to achieve the above objective, we train recurrent neural networks to learn to traverse the latent space. This is accomplished using the following steps:

    • We first train a MeasureVAE model to reconstruct single measures of music. This creates our latent space for individual measures.
Figure 5:
    • The encoder of this model is used to obtain the sequence of latent vectors (Zp and Zf) corresponding to the past and future musical contexts.
Figure 6:
    • Next, we train a LatentRNN model which takes these latent vector sequences as input and learns to output another sequence of latent vectors Zi corresponding to the inpainted measures.
Figure 7:
    • Finally, we use the decoder of the MeasureVAE to map these latent vectors back to the music space to obtain our inpainting (Ci).
Figure 8:

More details regarding the model architectures and the training procedures are provided in our paper and our GitHub repository.
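The four steps above can be sketched as a simple data flow. This is only an outline under stated assumptions: `inpaint` and its three callables (`encode`, `traverse`, `decode`) are hypothetical names, and the toy stand-ins below exist solely so that the sketch runs. They are not the trained MeasureVAE or LatentRNN.

```python
def inpaint(past_measures, future_measures, n_inpaint, encode, traverse, decode):
    """Data flow only; the three callables stand in for trained models."""
    z_past = [encode(m) for m in past_measures]        # Zp from the encoder
    z_future = [encode(m) for m in future_measures]    # Zf from the encoder
    z_inpaint = traverse(z_past, z_future, n_inpaint)  # Zi from the latent model
    return [decode(z) for z in z_inpaint]              # Ci from the decoder

# Toy stand-ins (assumptions for illustration, with no musical meaning).
def toy_encode(measure):
    return (float(len(measure)),)

def toy_traverse(z_past, z_future, n):
    # Alternates between the last past and the first future latent vector.
    return [z_past[-1] if i % 2 == 0 else z_future[0] for i in range(n)]

def toy_decode(z):
    return ["rest"] * int(z[0])

past = [["D4", "E4"], ["F4", "G4", "A4"]]
future = [["D4"]]
inpainting = inpaint(past, future, 2, toy_encode, toy_traverse, toy_decode)
```

Passing the models in as callables keeps the sketch independent of any particular framework; the actual architectures are described in the paper.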


We trained and tested our proposed model on monophonic melodies in the Scottish and Irish style. For comparison, we used the AnticipationRNN model proposed by Hadjeres et al. and a variant of it based on our stochastic training procedure. Our model was able to beat both baselines in an objective test aimed at measuring how well a model is able to reconstruct missing measures in monophonic melodies.

Figure 9:

We also conducted a subjective listening test by asking human listeners to rate pairs of melodic sequences. In this test, our proposed model performed comparably to the baselines.

Figure 10:

While the method works fairly well, there were certain instances where the model produced pitches which were out of key. We think these “bad” pitch predictions were instrumental in reducing the perceptual rating of the inpaintings in the subjective listening test. This needs additional investigation. Interested readers can listen to some of these example inpaintings performed by our model in the audio examples list.

Overall, the method demonstrates the merit of learning complex trajectories in the latent spaces of deep generative models. Future avenues of research would include adapting this framework for polyphonic and multi-instrumental music.


An Attention Mechanism for Musical Instrument Recognition

Musical instrument recognition continues to be a challenging Music Information Retrieval (MIR) problem, especially for songs with multiple instruments. This has been an active area of research for the past decade or so. While the problem is generally considered solved for individual note or solo instrument classification, the occlusion or superposition of instruments and the large timbral variation within an instrument class make the task more difficult in music tracks with multiple instruments.

In addition to these acoustic challenges, data challenges also add to the difficulty of the task. Most methods for instrument classification are data-driven, i.e., they infer or ‘learn’ patterns from labeled training data. This adds a dependence on obtaining a reliable and reasonably large training dataset for instrument classification.

Data Challenges in Instrument Classification

In previous work, we discussed briefly how there was a data problem in the task of musical instrument recognition or classification in the multi-instrument/multi-timbral setting. We utilized strongly labeled multi-track datasets to overcome some of the challenges. This enabled us to train deep CNN-based architectures with short strongly labeled audio snippets to predict fine-grained instrument activity with a time resolution of 1 second.

In retrospect we can claim that, although instrument activity detection remains the ultimate goal, current datasets are both too small in scale and terribly imbalanced in terms of both genre and instruments as shown in this figure.

Distribution of instruments in one dataset

The OpenMIC dataset released in 2018 addresses some of these challenges by curating a larger-scale, somewhat genre-balanced set of music with 20 instrument labels and crowdsourced annotations for both positive (presence) and negative (absence) classes. The catch is that the audio clips here are 10 seconds long and the labels are assigned to the entire clip, i.e., fine-grained instrument activity annotations are missing. Such a dataset is what we call ‘weakly labeled.’

Previous approaches for instrument recognition are not designed to handle weakly labeled long clips of audio. Most of them are trained to detect instrument activity at short time-scales, i.e., the frame level up to 3 seconds. The same task with weakly labeled audio is tricky since instruments that are labeled as present may be present only instantaneously and could be left undetected by models that average over time.

In our proposed method, we utilize an attention mechanism and a multiple-instance learning (MIL) framework to address the challenge of weakly labeled instrument recognition. The MIL framework has been explored in the sound event detection literature as a means of classifying weakly labeled audio events, so we decided to apply the technique to instrument recognition as well.

Method Overview

In MIL, each data point is represented as a bag of multiple instances. In our case, each bag is a 10-second clip from the OpenMIC dataset. We divide the clip into 10 instances, each 1 second long. Each instance is represented by a 128-dimensional feature vector extracted using a pre-trained VGGish model. Thus, a clip input is 10×128 dimensional. As per the MIL framework, each bag is associated with a label. For a positive label, this implies that at least one instance in the bag is associated with the same label; we just don’t know which one. To learn in the MIL setting, algorithms perform a weighted sum of instance-level predictions to obtain the bag-level predictions. These predictions can be compared with the bag-level labels to train the algorithm. In our paper, we utilize learned attention weights for the aggregation of instance-level predictions.

attention model

As the model architecture in the figure above shows, the model estimates instance-level predictions and instance-level attention weights. We estimate one weight per instance per instrument label. The weights are normalized across instances to sum to one, so each weight can be interpreted as the relative contribution of that instance’s prediction to the final clip-level prediction for a particular instrument.
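The aggregation step can be sketched for a single instrument as follows. The numbers are invented, and using a softmax to normalize the weights is an assumption made for this sketch; the text above only states that the weights sum to one across instances.

```python
import math

def softmax(values):
    """Normalize raw scores so they are positive and sum to one."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(instance_probs, attention_scores):
    """Clip-level probability for one instrument as an attention-weighted sum."""
    weights = softmax(attention_scores)
    clip_prob = sum(w * p for w, p in zip(weights, instance_probs))
    return clip_prob, weights

# Hypothetical values for one instrument over the 10 one-second instances.
instance_probs = [0.1, 0.1, 0.9, 0.8, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
attention_scores = [0.0, 0.0, 3.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

clip_prob, weights = attention_pool(instance_probs, attention_scores)
mean_prob = sum(instance_probs) / len(instance_probs)  # plain mean pooling
```

In this toy case the instrument is confidently detected in only two instances. Attention pooling puts most of the weight on them, so the clip-level probability ends up well above what mean pooling would give.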


We compared this model architecture with mean pooling, recurrent neural networks, fully connected networks and binary random forests, and found that our attention-based model outperformed all the other methods, especially in terms of recall. We also visualized the attention weights to see whether the model was focusing on relevant parts of the audio as it was supposed to.

Visualization of attention weights

As you can see in the above image, while this model is not very adept at localizing the instruments (in the first example, violin is actually present in all instances, but the model applies high weights only to a couple of instances), it does a good job of seeking out easy-to-classify instances and focuses its weights on those.


In conclusion, we discussed the merits of weakly labeled audio datasets: they are easy to annotate and therefore scale well. In our opinion, this makes it important to develop methods capable of handling weakly labeled data. To that end, we introduced the MIL framework and briefly discussed the attention mechanism. Finally, we showed a visualization of how the attention weights provide some degree of interpretability to the model. Even though the model does not perfectly localize the instruments, it learns to focus on relevant instances for the final prediction.

The detailed paper can be found here.


Explicitly Conditioned Melody Generation

by Benjamin Genchel

The advent of modern deep learning has rapidly propelled progress and attracted interest in symbolic music generation, a research topic that encompasses methods of generating music in the form of sequences of discrete symbols. While a variety of techniques have been applied to tasks in this domain, the majority of approaches treat music as a temporal sequence of events and rely on Recurrent Neural Networks (RNNs), specifically the gated variants Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), to model this sequence. These approaches generally attempt only to learn to predict the next note event(s) in a series given previous note event(s). While results from these types of models have been promising, they train on hard mode relative to human musicians, who rely on explicit, discretely classed information such as harmony, feel (e.g. swing, straight), phrasing, meter, and beat to organize and contextualize while learning.

Our study aims to investigate how deep models, specifically gated RNNs, learn when provided with explicit musical information, and further, how particular types of explicit information affect the learning process. We describe the results of a case study in which we compared the effects on musical output of conditioning an RNN-based model like those mentioned above with various combinations of a select group of musical features.

Specifically, we separate musical sequences into sequences of discrete pitch and duration values, and model each sequence independently with its own LSTM-RNN. We refer to these as Pitch-RNN and Duration-RNN respectively. Then we train a pair of these models using combinations of the following:

  • Inter Conditioning: the pitch generation network is provided with the corresponding duration sequence as input and vice versa.
  • Chord Conditioning: both the pitch and duration networks are provided with a representation of the current chord for each input timestep.
  • Next Chord Conditioning: both the pitch and duration networks are provided with a representation of the chord following the current chord for each input timestep.
  • Bar Position Conditioning: both the pitch and duration networks are provided with a discrete value representing the position within a surrounding bar of the current input token.
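One simple way to provide such conditioning to a network is to concatenate the extra features onto the input vector at each timestep. The sketch below does this for the pitch network. Concatenation, the vocabulary sizes, the 12-dimensional pitch-class chord vector, and the function names are all assumptions for illustration, not a description of the actual implementation.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def build_pitch_input(pitch_idx, duration_idx=None, chord_vec=None,
                      next_chord_vec=None, bar_pos_idx=None,
                      n_pitches=128, n_durations=16, n_bar_positions=96):
    """Concatenate the pitch token with whichever conditioning signals are given."""
    features = one_hot(pitch_idx, n_pitches)
    if duration_idx is not None:      # inter conditioning
        features += one_hot(duration_idx, n_durations)
    if chord_vec is not None:         # chord conditioning
        features += list(chord_vec)
    if next_chord_vec is not None:    # next chord conditioning
        features += list(next_chord_vec)
    if bar_pos_idx is not None:       # bar position conditioning
        features += one_hot(bar_pos_idx, n_bar_positions)
    return features

# Hypothetical chords as 12-dimensional pitch-class vectors.
c_major = [1.0 if pc in (0, 4, 7) else 0.0 for pc in range(12)]
g_major = [1.0 if pc in (7, 11, 2) else 0.0 for pc in range(12)]

unconditioned = build_pitch_input(60)
fully_conditioned = build_pitch_input(60, duration_idx=4, chord_vec=c_major,
                                      next_chord_vec=g_major, bar_pos_idx=0)
```

Each conditioning factor can be switched on or off independently, which mirrors how the combinations listed above are compared.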

For a clearer picture of what we’re doing, check out this architecture diagram!


We trained our models using two datasets (Bebop Jazz and Folk) of lead sheets (data extracted from musicXML files), as lead sheets provide a simplified description of a song that is already segmented into explicit components.


Evaluating generative music is a challenging and precarious task. In lieu of an absolute metric for quality, we characterize the outputs of each trained model using a set of statistical methods:

  • Validation NLL: the final loss achieved by a model on the validation set.
  • BLEU Score: a metric originally designed for machine translation, but deployed fairly commonly for judging sequence generation in general. BLEU computes a geometric mean of the precisions of matching N-grams (specifically 1-, 2-, 3-, and 4-grams) between a set of ground truth data and a data set generated by a model. A BLEU score lies between 0 and 1, with 0 meaning nothing in common and 1 meaning everything in common (exactly matching).
  • MGEval: the MGEval toolbox, created by our own Li-Chia “Richard” Yang and Alexander Lerch, and described in an earlier blog post on this page, calculates KL-divergence and overlap scores between distributions of a number of music-specific statistical features calculated for a set of ground truth data and generated data. These scores give an idea of how much generated data differs from the ground truth, and further provide insight into how they differ in musical terms.
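To make the BLEU description above concrete, here is a minimal single-reference sketch of the standard definition (clipped N-gram precisions, geometric mean, brevity penalty). It is a simplification for illustration: the study compares sets of sequences, and the function names and toy pitch sequences are our own.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def bleu(candidate, reference, max_n=4):
    """Geometric mean of 1..max_n-gram precisions times the brevity penalty."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity * math.exp(log_mean)

# Toy MIDI pitch sequences (invented for this example).
reference = [60, 62, 64, 65, 67, 65, 64, 62]
identical = list(reference)
partial = [60, 62, 64, 65, 67, 69, 71, 72]
unrelated = [40, 41, 42, 43, 44, 45, 46, 47]

scores = [bleu(seq, reference) for seq in (identical, partial, unrelated)]
```

An exact copy of the reference scores 1, a sequence sharing only its opening scores somewhere in between, and a sequence with no pitches in common scores 0.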


Validation NLL

The general trend observed was, perhaps unsurprisingly, that the more conditioning information used, the lower the final loss. Additionally, models trained with more conditioning learned faster than those trained with less.

BLEU Score

Pitch models generally achieved a higher BLEU score with added conditioning factors, while duration models demonstrated the opposite trend: lower scores with added conditioning factors. The highest scoring pitch models had both chord and next-chord conditioning, though models trained on the Bebop dataset seemed to generally perform better with current-chord conditioning than next-chord conditioning, while models trained on the Folk dataset performed better with next-chord conditioning. The highest scoring duration models for both datasets had no conditioning at all, and the next highest scoring models had only one conditioning factor.
While a BLEU score close to 1 might be ideal for machine translation, for artistic tasks like music generation where there is no right answer, a score that high hints at overfitting, and thus a lack of ability to generalize what the model has learned and create novel melodies in unfamiliar contexts. While pitch models scored between 0.1 and 0.3, duration models scored from 0.5 to 0.9; that duration models with less conditioning scored higher than those with more may actually indicate that models with more conditioning generalized better, as their scores were closer to 0.5.


MGEval produces a lot of numbers: in our case, 66 for a single model (6 scores each for the 11 MGEval metrics we used). We aggregated these results in a way that gives each individual conditioning factor a probability indicating its likelihood of contributing to higher performance. We split our 11 metrics into two categories, pitch-based and duration-based, and performed this aggregation for each set.

In general, we observed that scores on MGEval metrics improve with the addition of conditioning features, but with these aggregation scores, we were able to ascertain (at a high level) which types of conditioning were affecting what. For both datasets, inter and chord conditioning had strong effects on pitch-based features, with bar position conditioning a close third. For duration-based features, however, the picture is less clear. We can see here that for models trained on the Folk set, it’s a bit of a grab bag, but for models trained on the Bebop set, chord and bar position clearly contribute the most to success. This is perhaps due to the much simpler and more repetitive rhythmic patterns found in Folk music as compared to Bebop Jazz music.


Solid takeaways from this study are that conditioning with chords is important not just for pitch prediction, but for duration prediction as well, though the usefulness of conditioning with the next chord vs. the current chord was harder to determine. More generally, we were also able to gain insight into the relative usefulness of our chosen conditioning set for pitch- and rhythm-based features. Additionally, we saw that the usefulness of particular features can also be genre dependent. Further investigation, along with a subjective evaluation via a listening test, is certainly worthwhile.

Check out our paper, the associated website, or the codebase for more details, audio samples, data and other information!

Music Informatics Group @ICML Machine Learning for Music Discovery Workshop

There will be 2 presentations by our group at the ICML Machine Learning for Music Discovery Workshop (ML4MD) this year:

  • Pati, Ashis; Lerch, Alexander: Latent Space Regularization for Explicit Control of Musical Attributes:
    Deep generative models for music are often restrictive since they do not allow users any meaningful control over the generated music. To address this issue, we propose a novel latent space regularization technique which is capable of structuring the latent space of a deep generative model by encoding musically meaningful attributes along specific dimensions of the latent space. This, in turn, can provide users with explicit control over these attributes during inference and thereby, help design intuitive musical interfaces to enhance creative workflows.
  • Gururani, Siddharth; Lerch, Alexander; Bretan, Mason: A Comparison of Music Input Domains for Self-Supervised Feature Learning:
    In music, using neural networks to learn effective feature spaces, or embeddings, that capture useful characteristics has been demonstrated in the symbolic and audio domains. In this work, we compare the symbolic and audio domains, attempting to identify the benefits of each, and whether incorporating both of the representations during learning has utility. We use a self-supervising siamese network to learn a low-dimensional representation of three-second music clips and evaluate the learned features on their ability to perform a variety of music tasks. We use a polyphonic piano performance dataset and directly compare the performance on these tasks with embeddings derived from synthesized audio and the corresponding symbolic representations.

From labeled to unlabeled data – on the data challenge in automatic drum transcription

by Chih-Wei Wu

Automatic Drum Transcription (ADT) is an ongoing research topic that concerns the extraction of drum events from music signals. After roughly three decades of research on this topic, many methods and several datasets have been proposed to address this problem. However, as in many other Music Information Retrieval (MIR) research topics, the availability of realistic and carefully curated datasets is one of the bottlenecks for advancing the performance of ADT systems.

In our previous blog post, we briefly discussed this challenge in the context of ADT. With a standard annotated dataset (i.e., ENST drums) and a small collection of unlabeled data, we demonstrated the possibility of harnessing unlabeled music data for improvements in the context of ADT.
In this paper, we explore this idea further in the following directions:

  1. Identify the major types of ADT systems and investigate generic methods for integrating unlabeled data into these systems accordingly
  2. Train the systems using a large scale unlabeled music dataset
  3. Evaluate all systems using multiple labeled datasets currently available

The intention is to validate the idea of using unlabeled data for ADT at a large scale. To achieve this goal, we present two approaches.


To show that the benefit of using unlabeled data can be generalized to most ADT systems, the first step is to identify the most popular ADT approaches. To this end, we reviewed existing ADT systems (for more information, please refer to this recent publication), and we found that the two most popular approaches can be categorized as Segment-and-classify (classification-based) and Separate-and-detect (activation-based).

Based on these two types of approaches, two learning paradigms for incorporating unlabeled data are evaluated. These are

    1. Feature Learning and
    2. Student Teacher Learning.

As shown in the figure below, both paradigms may extract information from unlabeled data and transfer it to ADT systems through different mechanisms; the feature learning paradigm learns a feature extractor that computes distinctive features from audio signals, whereas the student teacher learning paradigm focuses on generating “pseudo ground truth” (i.e., soft targets) using teacher models and passing it on to student models. Different variants of both paradigms are evaluated.

Results 1: we need more labeled data

In the first part of the experiment results, the averaged performance across all evaluated systems for each labeled dataset (e.g., ENST drums, MIREX 2005, MDB drums, RBMA) is presented. As shown in the following figure, for each individual drum instrument such as Kick Drum (KD), Snare Drum (SD), and HiHat (HH), the averaged performances differ from dataset to dataset. This result not only shows the relative difficulties of these datasets, but also implies the danger of relying solely on one dataset (which is exactly the case in many prior ADT studies). This result highlights the need for more diverse labeled datasets!

Results 2: unlabeled data is useful

In the second part of the experiment results, different systems for each learning paradigm are compared under controlled conditions (e.g., training methods, the number of unlabeled examples used, etc.). For the feature learning paradigm, as shown in the following table, both evaluated systems outperformed the baseline systems on SD in terms of the averaged F-Measure. This improvement was confirmed through a statistical check. This result suggests that Segment-and-classify ADT systems can successfully benefit from unlabeled data through feature learning.

Role       System        HH    BD    SD
Baseline   MFCC          0.61  0.62  0.40
Baseline   CONV-RANDOM   0.61  0.54  0.39
Evaluated  CONV-AE       0.61  0.62  0.42
Evaluated  CONV-DAE      0.61  0.61  0.42

For the student-teacher learning paradigm, encouraging results can also be found (this time on HH). The following table shows that all student models performed better on HH than both teacher models. This result indicates the possibility of obtaining students that are better than their teachers with the help of unlabeled data. Additionally, this result confirms the finding in our previous paper at a larger scale.

Role     System         HH    BD    SD
Teacher  PFNMF (SMT)    0.47  0.61  0.45
Teacher  PFNMF (200D)   0.47  0.67  0.40
Student  FC-200         0.56  0.57  0.44
Student  FC-ALL         0.53  0.59  0.42
Student  FC-ALL (ALT)   0.55  0.58  0.44


According to the results, both learning paradigms can potentially improve ADT performance with the addition of unlabeled data. However, each paradigm seems to benefit different instruments. In other words, it is not easy to conclude which paradigm is “the way to go” when it comes to harnessing unlabeled resources. Simply put, unlabeled data certainly has potential for improving ADT systems, and further investigation is worthwhile.

If you are interested in learning more details about this work, please refer to our paper. The code is available on GitHub.

On the evaluation of generative models in music

by Li-Chia Richard Yang

Among creative systems, generative modeling has attracted research interest in a wide variety of tasks. Just as deep learning has reshaped the whole field of artificial intelligence, it has reinvented generative modeling in recent years, e.g., in music or painting. Despite this research interest in generative systems, however, the assessment and evaluation of such systems has proven challenging.

In recent research on music generation, various data-driven models have shown promising results. As a quick example, here are two generated samples from two distinct systems:

Magenta (Attention RNN)
Magenta (Lookback RNN)


Now, how can we analyze and compare the behavior of these models?
As the ultimate judge of creative output is the human (listener or viewer), subjective evaluation is generally preferable in generative modeling. However, subjective evaluation has general drawbacks related to the amount of resources it requires and to the experiment design. Objective evaluation, in contrast, has the advantage of providing a systematic, repeatable measurement across a significant number of generated samples.

The proposed evaluation strategy

The proposed method does not aim at assessing musical pieces in the context of human-level creativity nor does it attempt to model the aesthetic perception of music.

Rather, it applies the concept of multi-criteria evaluation in order to provide metrics that assess basic technical properties of the generated music and help researchers identify issues and specific characteristics of both model and dataset. In a first step, we define two collections of samples as our datasets (in the case of objective evaluation, one dataset contains the generated samples, the other contains samples from the training dataset). Then, we extract a set of features based on musical domain knowledge for two main targets of the proposed evaluation strategy:

Absolute Measurements

Absolute measurements give insights into properties and characteristics of a generated or collected set of data.
During the model design phase of a generative system, it can be of interest to investigate absolute metrics from the output of different system iterations or of datasets as opposed to a relative evaluation. A typical example is the comparison of the generated results from two generative systems: although the model properties cannot be determined precisely for a data-driven approach, the observation of the generated samples can justify or invalidate a system design.
For instance, an absolute measurement can be a statistical analysis of the note length transition histogram of each sample in a given dataset. In the following figure, we can easily observe the difference in this feature between datasets from two different genres.

Relative Measurements

In order to enable the comparison of different sets of data, the relative measure generalizes the result among features with various dimensions.
We first perform pairwise exhaustive cross-validation for each feature and smooth the histogram results into probability density functions (PDFs) for a more generalizable representation. If the cross-validation is computed within one set of data, we will refer to it as intra-set distances. If each sample of one set is compared with all samples of the other set, we call it the inter-set distances.
Finally, we measure the similarity between these distributions for the application of evaluating music generative systems, and compute two metrics between the target dataset’s intra-set PDF and the inter-set PDF: the Kullback-Leibler Divergence (KLD) and overlapped area (OA) of two PDFs.
Take the following figure as an example: assume set1 is the training data, while set2 and set3 correspond to generated results from two systems. It can be easily observed that the system that generated set2 produces results more in line with the training data (in the context of this feature).
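The relative measurement described above can be sketched on a toy scalar feature. This is a rough approximation under stated assumptions: normalized histograms stand in for the smoothed PDFs, the KL direction (intra-set relative to inter-set) is our choice, and all numbers and function names are invented. The released toolbox may differ in all of these respects.

```python
import math
from itertools import combinations

def intra_set_distances(values):
    """Distance between every pair of samples within one set."""
    return [abs(a - b) for a, b in combinations(values, 2)]

def inter_set_distances(values_a, values_b):
    """Distance between every sample of one set and every sample of the other."""
    return [abs(a - b) for a in values_a for b in values_b]

def histogram(distances, n_bins, upper, eps=1e-6):
    """Normalized histogram on [0, upper]; eps keeps every bin non-zero."""
    counts = [eps] * n_bins
    for d in distances:
        idx = min(int(d / upper * n_bins), n_bins - 1)
        counts[idx] += 1.0
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """Discrete KL divergence of p relative to q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def overlapped_area(p, q):
    """Shared probability mass of two normalized histograms."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

# Toy scalar feature (e.g. a note count per sample); all values are invented.
set1 = [10, 12, 11, 13, 12]   # training data
set2 = [11, 12, 10, 13, 11]   # output of system A
set3 = [20, 25, 30, 22, 28]   # output of system B

intra1 = intra_set_distances(set1)
inter12 = inter_set_distances(set1, set2)
inter13 = inter_set_distances(set1, set3)
upper = max(intra1 + inter12 + inter13)

p_intra = histogram(intra1, 10, upper)
kld_a = kl_divergence(p_intra, histogram(inter12, 10, upper))
kld_b = kl_divergence(p_intra, histogram(inter13, 10, upper))
oa_a = overlapped_area(p_intra, histogram(inter12, 10, upper))
oa_b = overlapped_area(p_intra, histogram(inter13, 10, upper))
```

A lower KLD and a higher OA indicate that the generated set behaves more like the training data for this feature, which in this toy example is the case for set2.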

Find out more

Check out our paper for a detailed use-case demonstration and the released toolbox for further application.

Instrument Activity Detection in Polyphonic Music

by Siddharth Gururani

Most forms of music are rendered as a mixture of acoustic and electronic instruments. The human ear, for the most part, is able to discern the instruments being played in a song fairly easily. However, the same is not true for computers or machines. The task of recognizing instrumentation in music is still an unsolved and active area of research in Music Information Retrieval (MIR).

The applications of such a technology are manifold:

  • Metadata which includes instrumentation enables instrument-specific music discovery and recommendations.
  • Identifying regions of activity of specific instruments in a song allows easy browsing for users. For example, a user interested in a guitar solo or vocals in a song can easily browse to the relevant part.
  • Instrument activity detection may serve as a helpful pre-processing step for other MIR tasks such as automatic transcription and source separation.

In our work, we propose a neural network-based system to detect activity for 18 different instruments in polyphonic music.

Challenges in Instrument Activity Detection

A big challenge in building algorithms for instrument activity detection is the lack of appropriate datasets. Until very recently, the IRMAS dataset was used as the benchmark dataset for instrument recognition in polyphonic music. However, this dataset is not suitable for instrument activity detection for the following reasons:

  • The test set contains 3 to 10 second snippets of audio that are only labeled with instruments present instead of a fine-grained instrument activity annotation.
  • The training clips are labeled with a single ‘predominant’ instrument even if more than one instrument is active in the clip.

We overcome this challenge by leveraging multi-track datasets such as the MedleyDB and Mixing Secrets dataset. These multi-track datasets contain the mixes as well as the stems accompanying them. Therefore, annotations for fine-grained stem activity may be automatically obtained by applying envelope tracking on the instrument stems.
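As a rough illustration of how stem envelopes can yield activity labels, the sketch below computes a frame-wise RMS level and thresholds it. The use of RMS, the frame length, the threshold value, and the toy signal are all assumptions made for this example; they are not the settings used in our work.

```python
import math

def frame_rms(samples, frame_len):
    """Root-mean-square level of consecutive, non-overlapping frames."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        levels.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return levels

def stem_activity(samples, frame_len, threshold):
    """1 where the stem's envelope exceeds the threshold, else 0."""
    return [1 if level > threshold else 0
            for level in frame_rms(samples, frame_len)]

# Toy stem at 8 samples per "second": silence, two active seconds, silence.
silent = [0.0] * 8
active = [0.5, -0.5] * 4
stem = silent + active + active + silent

labels = stem_activity(stem, frame_len=8, threshold=0.1)
```

Running the same procedure on every stem of a multi-track gives one activity sequence per instrument, which can then serve as a fine-grained annotation for the mix.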

In addition, we identify metrics that allow easier comparison of models for instrument activity detection. Traditional metrics such as precision, recall and F1-score are threshold dependent and not ideal for multi-label classification scenarios. We use label-ranking average precision (LRAP) and area under the ROC curve (AUC) for comparison between different model architectures. Both of these metrics are threshold agnostic and suitable for multi-label classification.
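For readers unfamiliar with LRAP, here is a minimal pure-Python sketch of its usual definition: for each true label, the fraction of labels ranked at or above it that are also true, averaged over true labels and then over samples. Skipping samples without any true label is a simplification made here, and the example values are invented.

```python
def lrap(y_true, y_score):
    """Label-ranking average precision over samples with at least one true label."""
    per_sample = []
    for truths, scores in zip(y_true, y_score):
        relevant = [j for j, t in enumerate(truths) if t == 1]
        if not relevant:
            continue  # simplification: ignore samples without true labels
        precision_sum = 0.0
        for j in relevant:
            # Rank of label j: how many labels score at least as high.
            rank = sum(1 for s in scores if s >= scores[j])
            # How many of those are also true labels.
            hits = sum(1 for k in relevant if scores[k] >= scores[j])
            precision_sum += hits / rank
        per_sample.append(precision_sum / len(relevant))
    return sum(per_sample) / len(per_sample)

# Two clips, three instrument labels each.
y_true = [[1, 0, 0], [0, 0, 1]]
y_score = [[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]]

score = lrap(y_true, y_score)           # (1/2 + 1/3) / 2
perfect = lrap(y_true, [[0.9, 0.1, 0.2], [0.1, 0.2, 0.9]])
```

Because the metric only looks at how the labels are ranked within each sample, no decision threshold has to be chosen.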

Method and Models

We propose a rather simple pipeline for our instrument activity detection system. The block diagram below shows the high-level processing steps in our approach. First, we split all the multi-tracks into artist-conditional splits. We obtain 361 training tracks and 100 testing tracks. During training, the various models are fed with log-scaled mel-spectrograms of 1-second clips from the training tracks. We train these models to predict all the instruments present in a 1-second clip. We compare Fully Connected, Convolutional (CNN) and Convolutional-Recurrent (CRNN) Neural Networks in this work.

During testing, a track is split into 1 second clips and fed into the model. Once all 1 second level predictions are obtained from the model, we evaluate the predictions at different time-scales: 1 s, 5 s, 10 s and track-level. We aggregate over time by max-pooling the predictions and annotations for longer time-scale evaluation.
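The max-pooling aggregation described above can be sketched as follows. The prediction values are invented, and the handling of a shorter final window is an assumption of this sketch.

```python
def max_pool_over_time(frames, window):
    """Aggregate per-second vectors by taking the per-instrument maximum."""
    pooled = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]  # the last chunk may be shorter
        pooled.append([max(column) for column in zip(*chunk)])
    return pooled

# Hypothetical 1-second predictions for two instruments over a 6-second track.
predictions = [
    [0.1, 0.8],
    [0.2, 0.1],
    [0.9, 0.1],
    [0.1, 0.1],
    [0.1, 0.2],
    [0.3, 0.7],
]

five_second = max_pool_over_time(predictions, window=5)
track_level = max_pool_over_time(predictions, window=len(predictions))
```

Applying the same function to the binary annotations marks an instrument as present in a window if it is active in any of its 1-second clips.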


As expected, the CNN and CRNN models outperform the Fully Connected architectures. The CNN and the CRNN perform very similarly, and we attribute that to the choice of input time context. For only a 1-second input, there are only a few time-steps for the recurrent network to learn temporal features from, hence the insignificant change in performance over the CNN. An encouraging finding was that the models also perform well for rare instruments.

We also propose a method for visualizing confusions in a multi-label context, shown in the figure above. We visualize the distribution of false negatives for all instruments conditioned on a false positive of a particular instrument. For example, the first row in the matrix represents the distribution of false negatives of all instruments conditioned on the acoustic guitar false positives. We observe several cases of confusions that make sense musically, for example: different guitars, tabla and drums, synth and distorted guitars being confused.
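
One way such a matrix could be computed is sketched below. It rests on two assumptions that are not spelled out in this post: predictions have already been binarized with some threshold, and "conditioned on" means the false negative occurs in the same clip as the false positive. The function name `fp_conditioned_fn` is hypothetical.

```python
def fp_conditioned_fn(y_true, y_pred):
    """Row i: distribution of false negatives over all instruments, counted
    only in clips where instrument i is a false positive.

    y_true, y_pred: lists of 0/1 lists (clips x instruments).
    """
    n = len(y_true[0])
    counts = [[0] * n for _ in range(n)]
    for truth, pred in zip(y_true, y_pred):
        false_pos = [i for i in range(n) if pred[i] == 1 and truth[i] == 0]
        false_neg = [j for j in range(n) if pred[j] == 0 and truth[j] == 1]
        for i in false_pos:
            for j in false_neg:
                counts[i][j] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Normalize each row to a distribution; rows without any
        # co-occurring false negatives stay all-zero.
        matrix.append([c / total for c in row] if total else [0.0] * n)
    return matrix
```

A large value at row i, column j then suggests that the model tends to predict instrument i when instrument j is the one actually playing.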

For more details on the various processing steps, detailed results and discussion, please check out the paper here! Additionally, a three-and-a-half-minute lightning talk given at the ISMIR conference is accessible here.

Assessment of Student Music Performance using Deep Neural Networks

by Ashis Pati

Moving Towards Automatic Music Performance Assessment Systems

Improving one’s proficiency in performing a musical instrument often requires constructive feedback from a trained teacher regarding various aspects of a performance, e.g., its musicality, note accuracy, and rhythmic accuracy, which are often hard to define and evaluate. While the positive effects of a good teacher on the learning process are unquestionable, it is not always practical to have a teacher present during practice. This raises the question of whether we can design an autonomous system which can analyze a music performance and provide the necessary feedback to the student. Such a system would allow students without access to human teachers to learn music and would help them get the most out of their practice sessions.

What are the limitations of the current systems?

Most of the previous attempts (including our own) at automatic music performance assessment systems have relied on:

  1. Extracting standard audio features (e.g., Spectral Flux, Spectral Centroid, etc.) which may not contain relevant information pertaining to a musical performance.
  2. Designing hand-crafted features from music which are based on our (limited?) understanding of music performances and their perception.

Considering these limitations, relying on standard and hand-crafted features for music performance assessment tasks leads to sub-optimal results. Instead, feature learning techniques which have no “prejudice” and can learn relevant features from the data have shown promise at this task.

Deep Neural Networks (DNNs) form a special class of feature learning tools which are capable of learning complex relationships and functions from data. Over the last decade or so, they have emerged as the architecture-of-choice for a large number of discriminative tasks across multiple domains such as images, speech and music. Thus, in this study, we explore the possibility of using DNNs for assessing student music performances. Specifically, we evaluate their performance with different input representations and network architectures.

Input Representations and Network Architectures

We chose input representations at two different levels of abstraction: a) Pitch Contour, which captures high-level melodic information, and b) Mel-Spectrogram, which captures low-level information across several dimensions such as pitch, amplitude and timbre. The flow diagram for the computation of the input representations is shown in the figure below:

Flow diagram for computation of input representations. F0: Fundamental frequency, MIDI: Musical instrument digital interface

Three different model architectures were used: a) a fully convolutional model with Pitch Contour as input (PC-FCN), b) a convolutional recurrent model with Mel-Spectrogram as input (M-CRNN), and c) a hybrid model combining information from both input representations (PCM-CRNN). The three model architectures are shown below.

Experiments and Results

For data, we use the student performance recordings obtained from the Florida All-State auditions. Each performance is rated by experts along 4 different criteria: a) Musicality, b) Note Accuracy, c) Rhythmic Accuracy, and d) Tone Quality. Moreover, we consider two categories of students at different proficiency levels, a) Symphonic Band and b) Middle School, and design separate experiments for each category. Three instruments are considered: Alto Saxophone, Bb Clarinet and Flute.

The models are trained to predict the ratings (which are normalized between 0 and 1) given by the experts. As a baseline, we use a Support Vector Regression based model (SVR-BD) which relies on standard and hand-crafted features extracted from the audio signal. More details about the baseline model can be found in our previous blog post. The performance of the models at this regression task is summarized in the plot in Figure 6. The coefficient of determination (R²) is used as the evaluation metric (higher is better).
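
As a reminder of how the coefficient of determination is computed, here is a minimal sketch using the standard definition (one minus the ratio of the residual sum of squares to the total sum of squares). The function name `r2_score` and the example ratings in the note below are illustrative and are not taken from our experiments.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination between expert ratings and predictions.

    Assumes the ratings are not all identical (otherwise the total sum of
    squares is zero and the score is undefined).
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual error
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # variance around the mean
    return 1.0 - ss_res / ss_tot
```

A model that always predicts the mean rating scores 0, a perfect model scores 1, and a model that does worse than predicting the mean gets a negative score.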

Evaluation results showing R² metric for all assessment criteria. SVR-BD: Baseline Model, PC-FCN: Fully Convolutional Pitch Contour Model, M-CRNN: Convolutional Recurrent Model with Mel-Spectrogram, PCM-CRNN: Hybrid Model Combining Mel-Spectrogram and Pitch Contour. Left: Symphonic Band, Right: Middle School

The results clearly show that the DNN-based models outperform the baseline model across all 4 assessment criteria. In fact, the DNN models perform best on the Musicality criterion, which is arguably the most abstract and the hardest to define. In the absence of a clear definition, it is indeed difficult to design features to describe musicality. The success of the DNN models at modeling this criterion is, thus, extremely encouraging.

Another interesting observation is that the pitch contour based model (PC-FCN) outperforms every other model for the Symphonic Band students. This could indicate that the high-level melodic information encoded by the pitch contour is important to assess students at a higher proficiency level since one would expect that the differences between individual students would be finer. The same is not true for Middle School students where the best models use the Mel-Spectrogram as the input.

Way Forward

While the success of DNNs at this task is encouraging, the performance of the models is still not robust enough for practical applications. Some of the possible areas for future research include experimenting with other input representations (potentially raw audio), adding musical score information as input to the models, and training instrument-specific models. It is also important to develop better model analysis techniques which can allow us to understand and interpret the features learned by the model.

For interested readers, the full paper published in the Applied Sciences Journal can be found here.

Automatic Sample Detection in Polyphonic Music

by Siddharth Gururani

The term ‘sampling’ refers to the reuse of audio snippets from pre-existing digital recordings, with appropriate modifications, in new compositions in a way that fits the musical context. Influential artists that have been sampled frequently by younger artists include, for example, James Brown, Stevie Wonder, and Michael Jackson. Since sampling is an important approach in at least some music genres, there are websites, such as whosampled.com, dedicated to linking samples to songs. The annotation, however, is done manually by fans and music aficionados. A system that can automatically detect sampling can help automate this process and could also be used in large-scale musicological studies of artist influence across time and geographical space.

The task of automatic sample detection has not been explored in much detail. Some papers proposed methods involving a modified audio fingerprinting method and Non-negative Matrix Factorization (NMF). The block diagram below gives a broad overview of the method used in this work.

flowchart of the sample detection process

The algorithm we present also utilizes NMF and adds a post-processing step with subsequence Dynamic Time Warping (DTW) to extract features that indicate a sample/song pair. The figure below shows a distance matrix for a song in which the sample is looped 4 times in 20 seconds as indicated by the diagonal lines. We extract features from the detected paths and use them to train a random forest classifier.
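
To illustrate the subsequence DTW step, here is a minimal sketch of the textbook algorithm operating on a precomputed distance matrix. It is not our implementation: the function name `subsequence_dtw`, the three allowed step directions and the plain list-of-lists input are assumptions, and in the actual system the distances come from the NMF-based representation rather than from raw values.

```python
def subsequence_dtw(cost):
    """Find the best alignment of a short query inside a longer sequence.

    cost[i][j]: distance between sample frame i and song frame j.
    Returns (start, end, total_cost), where start and end are song frame
    indices delimiting the matched region.
    """
    n, m = len(cost), len(cost[0])
    acc = [[0.0] * m for _ in range(n)]
    # The query may start at any song frame at no extra cost.
    acc[0] = list(cost[0])
    for i in range(1, n):
        acc[i][0] = acc[i - 1][0] + cost[i][0]
        for j in range(1, m):
            acc[i][j] = cost[i][j] + min(acc[i - 1][j],      # vertical step
                                         acc[i][j - 1],      # horizontal step
                                         acc[i - 1][j - 1])  # diagonal step
    # The query may also end at any song frame: pick the cheapest one.
    end = min(range(m), key=lambda j: acc[n - 1][j])
    # Backtrack to the first query frame to recover the start position.
    i, j = n - 1, end
    while i > 0:
        if j == 0:
            i -= 1
            continue
        _, move = min((acc[i - 1][j - 1], 0), (acc[i - 1][j], 1), (acc[i][j - 1], 2))
        if move == 0:
            i, j = i - 1, j - 1
        elif move == 1:
            i -= 1
        else:
            j -= 1
    return j, end, acc[n - 1][end]
```

For a looped sample, removing the best path and searching again would give one low-cost path per repetition, matching the diagonal lines in the distance matrix.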

distance matrix showing 4 repetitions of the looped sample

A new dataset had to be created for the evaluation of the system because previous publications lack systematic evaluation. This dataset originates from whosampled.com and is now publicly available. Our evaluation results, presented in the paper, indicate that our algorithm has reasonably high precision but suffers from low recall, which may be attributed to the absence of clear alignment paths in the distance matrix.

For details on the method, results and discussion, please refer to the published paper available here.

Mixing Secrets: A Multi-Track Dataset for Instrument Recognition

by Siddharth Gururani

Instrument recognition as a task in Music Information Retrieval has a long history, and several datasets have been introduced for public use. The RWC dataset and the UIOWA dataset, for instance, are standard datasets for evaluation of instrument recognition in monophonic audio. The IRMAS dataset is a large dataset for predominant instrument detection. There are, however, not many datasets available for instrument detection in polyphonic mixtures.

Multi-track data comes in handy for such a task. Multi-track datasets contain the recording sessions of songs, which normally include the raw tracks, the stems, and the final mix. This enables the use of multi-track datasets for a variety of tasks such as source separation and multi-f0 tracking, but also instrument recognition.

MedleyDB is a widely known dataset that contains 250 multi-tracks with a well-defined annotation format and instrument taxonomy. While this might be considered an overwhelming amount of data, new data-hungry algorithms such as deep neural networks are often in need of more data for training and testing. We release a new set of annotated multi-track data in a format that is compatible with MedleyDB. It contains 258 multi-tracks originating from the website for a book titled “Mixing Secrets For the Small Studio.”

The paper contains more details about how the data was cleaned and processed in order to make it consistent with MedleyDB’s annotations. The GitHub repository contains the code and links to the data.