Multi-Task Learning for Instrument Activation Aware Music Source Separation

by Yun-Ning Hung


Music source separation performance has been improved dramatically in recent years due to the rise of deep learning, especially in the fully-supervised learning setting. However, one of the main problems of the current music source separation systems is the lack of large-scale datasets for training. Most of the open-source datasets either have limited size, or the number of included instruments and music genres is limited.

Moreover, many existing systems focus exclusively on the problem of source separation itself and ignore the utilization of other possibly related MIR tasks which could lead to additional quality gain.

In this research work, we leverage two open-source large-scale multi-track datasets, MedleyDB and Mixing Secrets, in addition to the standard MUSDB to evaluate on a large variety of separable instruments. We also propose a multitask learning structure to explore the combination of instrument activity detection and music source separation. The goal is that by training these two tasks in an end-to-end manner, the estimated instrument labels can be used during inference as a weight for each time frame. We refer to our method as instrument aware source separation (IASS).

Model structure

Figure 1: Multitask model structure for our proposed source separation system.

Our proposed model is a U-net based structure with residual blocks instead of CNNs at each layer. The reason for choosing the U-net structure is that it has been found useful in image decomposition, a task with general similarities to source separation. The residual block allows the information from the current layer to be fed into a layer 2 hops away and deepens the structure. A classifier is attached to the latent vector to predict the instrument activity. Mean Square Error (MSE) loss is used for source separation while Binary Cross-Entropy (BCE) loss is used for instrument activity prediction.
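
To make the combination of the two losses concrete, here is a minimal sketch in plain Python. It assumes the two losses are simply added; the balancing factor `bce_weight` and all function names are hypothetical and not taken from the paper.

```python
import math

def mse_loss(estimate, target):
    """Mean squared error between two equally sized lists of magnitudes."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)

def bce_loss(probabilities, labels, eps=1e-7):
    """Binary cross-entropy between predicted activity probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

def multitask_loss(est_spec, ref_spec, act_prob, act_label, bce_weight=1.0):
    """Separation loss plus weighted activity loss (bce_weight is an assumption)."""
    return mse_loss(est_spec, ref_spec) + bce_weight * bce_loss(act_prob, act_label)

# Toy values: two spectrogram bins and two frame-level activity predictions.
loss = multitask_loss([0.5, 0.2], [0.5, 0.0], [0.9, 0.1], [1, 0])
```

In a real training loop both terms would be computed on tensors and back-propagated jointly; the sketch only shows how the two objectives share one scalar.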

Instrument weight

Figure 2: Using instrument activation as a weight to filter the estimated mask, which will be used to multiply with the mixture of the magnitude spectrogram.

We use instrument activation as a weight to multiply with the estimated mask along the time dimension. By doing so, the instrument labels are able to suppress the frames not containing any target instrument. The instrument activation is first binarized by using a threshold. A median filter is then applied to smooth the estimated activation.
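
The weighting step can be sketched as follows. The threshold of 0.5, the kernel size of 3, and the edge padding are illustrative assumptions, and the function names are hypothetical.

```python
def binarize(activation, threshold=0.5):
    """Turn frame-wise activation probabilities into 0/1 labels."""
    return [1 if a >= threshold else 0 for a in activation]

def median_filter(values, kernel_size=3):
    """Median-smooth a sequence, repeating the edge values as padding."""
    half = kernel_size // 2
    padded = [values[0]] * half + list(values) + [values[-1]] * half
    return [sorted(padded[i:i + kernel_size])[half] for i in range(len(values))]

def apply_activation_weight(mask, activation, threshold=0.5, kernel_size=3):
    """Multiply each time frame of the mask by its smoothed binary activation.

    mask is a list of time frames, each a list of frequency-bin values.
    """
    weights = median_filter(binarize(activation, threshold), kernel_size)
    return [[w * m for m in frame] for w, frame in zip(weights, mask)]

# Five time frames with two frequency bins each.
mask = [[0.8, 0.6], [0.7, 0.5], [0.9, 0.4], [0.3, 0.2], [0.6, 0.1]]
activation = [0.9, 0.2, 0.8, 0.1, 0.05]
weights = median_filter(binarize(activation))
weighted = apply_activation_weight(mask, activation)
```

In this toy example the median filter fills the one-frame gap at the second frame and removes the isolated detection at the third, so only the first two frames of the mask survive.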

Experiments & Results

We compare our model with the baseline Open-Unmix model. Both models use the mixture phase without any post-processing. We train and evaluate both models on two datasets: the MUSDB-HQ dataset and the combination of Mixing Secrets and MedleyDB.

Figure 3: BSS metrics for Open-Unmix and IASS on the MUSDB-HQ dataset.

The results show that when training and evaluating on the MUSDB-HQ dataset, our model outperforms the Open-Unmix model on ‘Vocals’ and ‘Drums’, performs equally on ‘Other’, and performs slightly worse on ‘Bass’. This might be because ‘Bass’ is likely to appear throughout the songs; as a result, the improvement from using the instrument activation weight is limited.

Figure 4: SDR score for Open-Unmix, IASS, an ideal binary mask and input-SDR.

Figure 4 summarizes the results for the combination of the MedleyDB and Mixing Secrets datasets. The Ideal Binary Mask (IBM) represents the best-case scenario. The worst-case scenario is represented by the results for input-SDR, which is the SDR score when using the unprocessed mixture as the input. We can observe that our model also outperforms Open-Unmix on all instruments. Moreover, both models have higher scores on ‘Drums’, ‘Bass’, and ‘Vocals’ than on ‘Electrical Guitar’, ‘Piano’, and ‘Acoustic Guitar’. This might be attributed to the fact that ‘Electrical Guitar’, ‘Piano’, and ‘Acoustic Guitar’ have fewer training samples. Another possible reason is that the more complicated spectral structure of polyphonic instruments such as ‘Guitar’ and ‘Piano’ makes the separation task more challenging.
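
As a rough illustration of the input-SDR baseline, the sketch below uses the plain energy-ratio definition of SDR on toy time-domain signals. This is an assumption for illustration; the BSS metrics reported in the figures may be computed differently, and the name `sdr_db` is hypothetical.

```python
import math

def sdr_db(reference, estimate, eps=1e-12):
    """Signal-to-distortion ratio in dB: reference energy over error energy."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal + eps) / (error + eps))

# Toy target source plus an interfering source form the mixture.
target = [1.0, -1.0, 1.0, -1.0]
interference = [0.5, 0.5, -0.5, -0.5]
mixture = [t + i for t, i in zip(target, interference)]

# Input-SDR: score the unprocessed mixture as if it were the separated output.
input_sdr = sdr_db(target, mixture)
```

Any separation system should score above this value, since doing nothing at all already achieves it.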


This work presents a novel U-net based model that incorporates instrument activity detection with source separation. We also utilize a larger dataset to evaluate various instruments. The results show that our model achieves equal or better separation quality than the baseline model. Future extensions of this work include:

  • Increasing the amount of data by using a synthesized dataset,
  • Incorporating other tasks, such as multi-pitch estimation, into our current model, and
  • Exploring post-processing methods, such as Wiener filtering, to improve our system’s quality.


  • The code for reproducing experiments is available on our GitHub repository.
  • Please see the full paper (to appear in ISMIR ’20) for more details on the dataset and our experiment.

Music Informatics Group @ICML Machine Learning for Music Discovery Workshop

There will be 2 presentations by our group at the ICML Machine Learning for Music Discovery Workshop (ML4MD) this year:

  • Pati, Ashis; Lerch, Alexander: Latent Space Regularization for Explicit Control of Musical Attributes:
    Deep generative models for music are often restrictive since they do not allow users any meaningful control over the generated music. To address this issue, we propose a novel latent space regularization technique which is capable of structuring the latent space of a deep generative model by encoding musically meaningful attributes along specific dimensions of the latent space. This, in turn, can provide users with explicit control over these attributes during inference and thereby, help design intuitive musical interfaces to enhance creative workflows.
  • Gururani, Siddharth; Lerch, Alexander; Bretan, Mason: A Comparison of Music Input Domains for Self-Supervised Feature Learning:
    In music, the use of neural networks to learn effective feature spaces, or embeddings, that capture useful characteristics has been demonstrated in the symbolic and audio domains. In this work, we compare the symbolic and audio domains, attempting to identify the benefits of each, and whether incorporating both representations during learning has utility. We use a self-supervised siamese network to learn a low-dimensional representation of three-second music clips and evaluate the learned features on their ability to perform a variety of music tasks. We use a polyphonic piano performance dataset and directly compare the performance on these tasks with embeddings derived from synthesized audio and the corresponding symbolic representations.

Instrument Activity Detection in Polyphonic Music

by Siddharth Gururani

Most forms of music are rendered as a mixture of acoustic and electronic instruments. The human ear, for the most part, is able to discern the instruments being played in a song fairly easily. However, the same is not true for computers or machines. The task of recognizing instrumentation in music is still an unsolved and active area of research in Music Information Retrieval (MIR).

The applications of such a technology are manifold:

  • Metadata which includes instrumentation enables instrument-specific music discovery and recommendations.
  • Identifying regions of activity of specific instruments in a song allows easy browsing for users. For example, a user interested in a guitar solo or vocals in a song can easily browse to the relevant part.
  • Instrument activity detection may serve as a helpful pre-processing step for other MIR tasks such as automatic transcription and source separation.

In our work, we propose a neural network-based system to detect activity for 18 different instruments in polyphonic music.

Challenges in Instrument Activity Detection

A big challenge in building algorithms for instrument activity detection is the lack of appropriate datasets. Until very recently, the IRMAS dataset was used as the benchmark dataset for instrument recognition in polyphonic music. However, this dataset is not suitable for instrument activity detection for the following reasons:

  • The test set contains 3 to 10 second snippets of audio that are only labeled with instruments present instead of a fine-grained instrument activity annotation.
  • The training clips are labeled with a single ‘predominant’ instrument even if more than one instrument is active in the clip.

We overcome this challenge by leveraging multi-track datasets such as the MedleyDB and Mixing Secrets datasets. These multi-track datasets contain the mixes as well as the stems accompanying them. Therefore, annotations for fine-grained stem activity may be automatically obtained by applying envelope tracking on the instrument stems.
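
One simple way to picture envelope tracking on a stem is a frame-wise RMS envelope followed by a threshold. The sketch below assumes exactly that; the frame size, hop size, threshold, and function names are illustrative choices, not the settings used for the datasets.

```python
import math

def rms_envelope(samples, frame_size, hop_size):
    """Root-mean-square level of each complete frame of the signal."""
    envelope = []
    for start in range(0, len(samples) - frame_size + 1, hop_size):
        frame = samples[start:start + frame_size]
        envelope.append(math.sqrt(sum(x * x for x in frame) / frame_size))
    return envelope

def activity_from_stem(samples, frame_size=4, hop_size=4, threshold=0.1):
    """Label a frame as active (1) when its RMS level reaches the threshold."""
    return [1 if level >= threshold else 0
            for level in rms_envelope(samples, frame_size, hop_size)]

# Toy stem: silence, a loud passage, then low-level bleed or noise.
stem = [0.0] * 4 + [0.5, -0.5, 0.5, -0.5] + [0.01, -0.01, 0.01, -0.01]
labels = activity_from_stem(stem)
```

Running this on every stem of a multi-track yields one activity curve per instrument without any manual annotation.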

In addition, we identify metrics that allow easier comparison of models for instrument activity detection. Traditional metrics such as precision, recall and f1-score are both threshold dependent and not ideal for multi-label classification scenarios. We use label-ranking average precision (LRAP) and area under the ROC curve (AUC) for comparison between different model architectures. Both these metrics are threshold agnostic and are suitable for multi-label classification.
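
To show why LRAP needs no threshold, here is a small pure-Python sketch of the metric. It only looks at how the true labels rank among all predicted scores. Skipping clips without any active label is a simplification made here, and library implementations may handle that case differently.

```python
def label_ranking_average_precision(y_true, y_score):
    """Average, over clips, of the precision at the rank of each true label."""
    total = 0.0
    counted = 0
    for truth, scores in zip(y_true, y_score):
        relevant = [j for j, t in enumerate(truth) if t]
        if not relevant:
            continue  # simplification: ignore clips with no active instrument
        precision_sum = 0.0
        for j in relevant:
            # Rank of label j among all labels (1 = highest score).
            rank = sum(1 for s in scores if s >= scores[j])
            # Number of true labels ranked at or above label j.
            hits = sum(1 for k in relevant if scores[k] >= scores[j])
            precision_sum += hits / rank
        total += precision_sum / len(relevant)
        counted += 1
    return total / counted if counted else 0.0

# Two clips, three instruments each.
y_true = [[1, 0, 0], [0, 0, 1]]
y_score = [[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]]
lrap = label_ranking_average_precision(y_true, y_score)
```

In the first clip the true label is ranked second (precision 1/2), in the second clip it is ranked third (precision 1/3), so the score is their mean.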

Method and Models

We propose a rather simple pipeline for our instrument activity detection system. The block diagram below shows the high-level processing steps in our approach. First, we split all the multi-tracks into artist-conditional splits. We obtain 361 training tracks and 100 testing tracks. During training, the various models are fed with log-scaled mel-spectrograms of 1-second clips from the training tracks. We train these models to predict all the instruments present in a 1-second clip. We compare Fully Connected, Convolutional (CNN) and Convolutional-Recurrent (CRNN) Neural Networks in this work.
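
An artist-conditional split keeps all tracks by the same artist on one side of the split, so that the test set contains no artist seen during training. The sketch below is one possible way to do this; the `artist` field, the test fraction, and the seed are hypothetical and do not reproduce the actual 361/100 split.

```python
import random

def artist_conditional_split(tracks, test_fraction=0.25, seed=0):
    """Assign whole artists to either the training or the test set."""
    artists = sorted({t["artist"] for t in tracks})
    random.Random(seed).shuffle(artists)  # seeded for reproducibility
    n_test = max(1, round(len(artists) * test_fraction))
    test_artists = set(artists[:n_test])
    train = [t for t in tracks if t["artist"] not in test_artists]
    test = [t for t in tracks if t["artist"] in test_artists]
    return train, test

# Toy catalogue with four artists and five tracks.
tracks = [
    {"title": "a1", "artist": "A"},
    {"title": "a2", "artist": "A"},
    {"title": "b1", "artist": "B"},
    {"title": "c1", "artist": "C"},
    {"title": "d1", "artist": "D"},
]
train, test = artist_conditional_split(tracks)
```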

During testing, a track is split into 1-second clips and fed into the model. Once all 1-second-level predictions are obtained from the model, we evaluate the predictions at different time-scales: 1 s, 5 s, 10 s and track-level. We aggregate over time by max-pooling the predictions and annotations for longer time-scale evaluation.
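
The aggregation over time can be sketched as a max over consecutive clips, applied in the same way to predictions and annotations. The function name and the toy numbers are illustrative assumptions.

```python
def aggregate_max(clip_values, clips_per_window):
    """Max-pool per-clip values over consecutive windows of clips.

    clip_values is a list of clips, each a list with one value per instrument.
    The last window may contain fewer clips than clips_per_window.
    """
    pooled = []
    for start in range(0, len(clip_values), clips_per_window):
        window = clip_values[start:start + clips_per_window]
        pooled.append([max(column) for column in zip(*window)])
    return pooled

# Six 1-second clips with predictions for two instruments.
predictions = [[0.1, 0.9], [0.2, 0.3], [0.7, 0.1],
               [0.4, 0.2], [0.0, 0.8], [0.3, 0.6]]
five_second = aggregate_max(predictions, 5)
track_level = aggregate_max(predictions, len(predictions))
```

With max-pooling, an instrument counts as present in a longer window as soon as it is detected in any 1-second clip inside that window.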


As expected, the CNN and CRNN models outperform the Fully Connected architectures. The CNN and the CRNN perform very similarly, and we attribute that to the choice of input time context. For a 1-second input, there are only a few time-steps for the recurrent network to learn temporal features from, hence the insignificant change in performance over the CNN. An encouraging finding was that the models also perform well for rare instruments.

We also propose a method for visualizing confusions in a multi-label context, shown in the figure above. We visualize the distribution of false negatives for all instruments conditioned on a false positive of a particular instrument. For example, the first row in the matrix represents the distribution of false negatives of all instruments conditioned on the acoustic guitar false positives. We observe several cases of confusions that make sense musically, for example: different guitars, tabla and drums, synth and distorted guitars being confused.
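
A minimal sketch of how such a matrix could be computed is given below. It assumes binarized predictions and row-wise normalization; both are assumptions for illustration, and the function name is hypothetical.

```python
def conditional_confusions(y_true, y_pred):
    """Row i: distribution of false negatives in clips where instrument i is a false positive."""
    n = len(y_true[0])
    counts = [[0] * n for _ in range(n)]
    for truth, pred in zip(y_true, y_pred):
        false_pos = [i for i in range(n) if pred[i] and not truth[i]]
        false_neg = [j for j in range(n) if truth[j] and not pred[j]]
        for i in false_pos:
            for j in false_neg:
                counts[i][j] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows without any co-occurring false negative stay all zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Instruments: 0 = acoustic guitar, 1 = electric guitar, 2 = drums.
y_true = [[0, 1, 0], [0, 1, 1], [0, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 1], [1, 0, 0]]
matrix = conditional_confusions(y_true, y_pred)
```

In this toy data the acoustic guitar is falsely detected three times: twice while the electric guitar is missed and once while the drums are missed, giving a first row of 0, 2/3, 1/3.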

For more details on the various processing steps, detailed results and discussion, please check out the paper here! Additionally, a 3 and a half minute lightning talk given at the ISMIR conference is accessible here.

Automatic Sample Detection in Polyphonic Music

by Siddharth Gururani

The term ‘sampling’ refers to the reuse of audio snippets from pre-existing digital recordings in new compositions, with appropriate modifications so that they fit the musical context. Influential artists that have been sampled frequently by younger artists include, for example, James Brown, Stevie Wonder, and Michael Jackson. Since sampling is an important approach in at least some music genres, there are websites dedicated to linking samples to songs. The annotation, however, is done manually by fans and music aficionados. A system that can automatically detect sampling can help automate this process and could also be used in large-scale musicological studies of artist influence across time and geographical space.

The task of automatic sample detection has not been explored in much detail. Some papers proposed methods involving a modified audio fingerprinting method and Non-negative Matrix Factorization (NMF). The block diagram below gives a broad overview of the method used in this work.

flowchart of the sample detection process

The algorithm we present also utilizes NMF and adds a post-processing step with subsequence Dynamic Time Warping (DTW) to extract features that indicate a sample/song pair. The figure below shows a distance matrix for a song in which the sample is looped 4 times in 20 seconds as indicated by the diagonal lines. We extract features from the detected paths and use them to train a random forest classifier.
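
For readers unfamiliar with subsequence DTW, the sketch below aligns a short query against a longer sequence and lets the match start and end anywhere in the longer one. It works on toy 1-D sequences with an absolute-difference distance, which is an assumption for illustration; the system described here operates on a distance matrix derived from NMF, and the features extracted from the paths are not shown.

```python
def subsequence_dtw(query, sequence):
    """Return (cost, start, end) of the best match of query inside sequence."""
    n, m = len(query), len(sequence)
    inf = float("inf")
    # cost[i][j]: cheapest alignment of query[:i + 1] ending at sequence[j].
    cost = [[inf] * m for _ in range(n)]
    # start[i][j]: index in sequence where that cheapest alignment begins.
    start = [[0] * m for _ in range(n)]
    for j in range(m):
        cost[0][j] = abs(query[0] - sequence[j])  # a match may start anywhere
        start[0][j] = j
    for i in range(1, n):
        cost[i][0] = cost[i - 1][0] + abs(query[i] - sequence[0])
        for j in range(1, m):
            best_cost, best_start = min(
                (cost[i - 1][j - 1], start[i - 1][j - 1]),  # diagonal step
                (cost[i - 1][j], start[i - 1][j]),          # advance query only
                (cost[i][j - 1], start[i][j - 1]),          # advance sequence only
            )
            cost[i][j] = best_cost + abs(query[i] - sequence[j])
            start[i][j] = best_start
    # A match may also end anywhere: pick the cheapest cell in the last row.
    end = min(range(m), key=lambda j: cost[n - 1][j])
    return cost[n - 1][end], start[n - 1][end], end

match_cost, match_start, match_end = subsequence_dtw([1, 2, 3], [0, 0, 1, 2, 3, 0, 0])
```

Here the query is found with zero cost at positions 2 to 4 of the longer sequence. A looped sample would show up as several such low-cost paths.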

distance matrix showing 4 repetitions of the looped sample

A new dataset had to be created for the evaluation of the system as previous publications lack systematic evaluation. This dataset is now publicly available. Our evaluation results, presented in the paper, indicate that our algorithm has reasonably high precision while suffering from low recall, which may be attributed to the absence of clear alignment paths in the distance matrix.

For details on the method, results and discussion, please refer to the published paper available here.

Mixing Secrets: A Multi-Track Dataset for Instrument Recognition

by Siddharth Gururani

Instrument recognition as a task in Music Information Retrieval has had a long history and several datasets have been introduced for public use. The RWC dataset and the UIOWA dataset, for instance, are standard datasets for evaluation of instrument recognition in monophonic audio. The IRMAS dataset is a large dataset for predominant instrument detection. There are, however, not many datasets available for instrument detection in polyphonic mixtures.

Multi-track data comes in handy for such a task. Multi-track datasets contain the recording sessions of songs, which will normally include the raw tracks, the stems, and the final mix. This enables the usage of multi-track datasets for a variety of tasks such as source separation and multi-f0 tracking, but also instrument recognition.

MedleyDB is a widely known dataset that contains 250 multi-tracks with a well-defined annotation format and instrument taxonomy. While this might be considered an overwhelming amount of data, new data-hungry algorithms such as deep neural networks are often in need of more data for training and testing. We release a new set of annotated multi-track data in a format that is compatible with MedleyDB. It contains 258 multi-tracks originating from the website for a book titled “Mixing Secrets For the Small Studio.”

The paper contains more details about how the data was cleaned and processed in order to make it consistent with MedleyDB’s annotations. The GitHub repository contains the code and links to the data.

Objective descriptors for the assessment of student music performances

by Amruta Vidwans

Learning a musical instrument is difficult. It needs regular practice, expert advice, and supervision. Even today, musical training is largely driven by interaction between a student and a human teacher plus individual practice sessions at home.

Can technology improve this process and the learning experience? Can an algorithm perform an assessment of a student music performance? If yes, we are one step closer to a truly musically intelligent music tutoring system that will help students learn their instrument of choice by providing feedback on aspects like rhythmic correctness, note accuracy, etc. An automatic assessment is not only useful to students for their practice sessions but could also help band directors in the auditioning and (pre-)selection process. While there are a few commercial products for practicing instruments, the assessment in these products is usually either trivial or opaque to the user.

The realization of a musically intelligent system for music performance assessment requires knowledge from multiple disciplines such as digital signal processing, machine learning, audio content analysis, musicology, and music psychology. With recent advances in Music Information Retrieval (MIR), noticeable progress has been made in related research topics.

Despite these efforts, identifying a reliable and effective method for assessing music performances remains an unsolved problem. In our study, we explore the effectiveness of various objective descriptors by comparing three sets of features extracted from the audio recording of a music performance, (i) a baseline set with common low-level features (often used but hardly meaningful for this task), (ii) a score-independent set with designed performance features (custom-designed descriptors such as pitch deviation etc., but without knowledge of the musical score), and (iii) a score-based set with designed performance features (taking advantage of the known musical score). The goal is to identify a set of meaningful objective descriptors for the general assessment of student music performances. The data we used covers Alto Saxophone recordings of three years of student auditions (Florida state auditions) rated by experts in the assessment categories of musicality, note accuracy, rhythmic accuracy, and tone quality.

Table 1: Correlation (r) for the label ‘Musicality’ with feature sets E1 to E4.

Feature set        E1     E2     E3     E4
Correlation (r)    0.19   0.49   0.56   0.58

Our observations (as seen in Table 1) are that, as expected, the baseline features (E1) are not able to capture any qualitative aspects of the music performance, so that the regression model mostly fails to predict the expert assessments. Another expected result is that score-based features (E3) are able to represent the data generally better than score-independent features (E2) in all categories. The combination of score-independent and score-based features (E4) shows some trend to improve results, but the gain remains small, hinting at redundancies between the feature sets. With values between 0.5 and 0.65 for the correlation between the prediction and the human assessments, there is still a long way to go before computers will be able to reliably assess student music performance, but the results show that an automatic assessment is possible to a certain degree.
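
For reference, a correlation between predicted and expert ratings can be computed as sketched below. This assumes that r denotes the Pearson correlation coefficient; the ratings in the example are made-up illustrative numbers, not data from the study.

```python
import math

def pearson_r(predicted, expert):
    """Pearson correlation coefficient between two equally long rating lists."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_e = sum(expert) / n
    cov = sum((p - mean_p) * (e - mean_e) for p, e in zip(predicted, expert))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_e = sum((e - mean_e) ** 2 for e in expert)
    return cov / math.sqrt(var_p * var_e)

# Illustrative (made-up) ratings on a 0 to 1 scale.
r_perfect = pearson_r([0.1, 0.2, 0.3, 0.4], [0.2, 0.4, 0.6, 0.8])
r_noisy = pearson_r([0.2, 0.4, 0.6, 0.8], [0.1, 0.5, 0.4, 0.9])
```

A value of 1 means the predictions follow the expert ratings perfectly up to a linear rescaling, while values near 0 mean the regression model carries almost no information about the ratings.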

To learn more, please see the published paper for details.

Header image used with kind permission of Rachel Maness from


It was great to see alumni and current students meet at the International Society for Music Information Retrieval Conference (ISMIR) in Suzhou, China.

Contributions from the group at the conference:




The Georgia Tech Center for Music Technology (GTCMT) has shown a strong presence at the International Conference for Music Information Retrieval (ISMIR) with students, post-docs, and alumni.

Contributions from the group at the conference: