by Yiwei Ding
Due to the increasing use of neural networks, the last decade has seen dramatic improvements in a wide range of music classification tasks. However, the increased algorithmic complexity of these models requires more data during training. Therefore, transfer learning is applied: the model is first pre-trained on a large-scale dataset for a source task and then fine-tuned with a comparably small dataset for the target task. In addition, the increased model complexity makes inference computationally expensive, so knowledge distillation has been proposed, where a low-complexity (student) model is trained while leveraging the knowledge in a high-complexity (teacher) model.
In this study, we integrate ideas and approaches from both transfer learning and knowledge distillation and apply them to the training of low-complexity networks to show the effectiveness of knowledge transfer for music classification tasks. More specifically, we utilize pre-trained audio embeddings as teachers to regularize the feature space of low-complexity student networks during the training process.
Methods
Similar to knowledge distillation, we write the overall loss as a combination of a cross-entropy loss for the classification task and a regularization loss that measures the distance between the student network’s feature map and the pre-trained teacher embeddings; we investigate two different distance measures (cosine distance and distance correlation).
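As a rough illustration, here is a minimal PyTorch sketch of such a combined loss. The function names, the weighting factor `lam`, and the use of a binary cross-entropy term (both tasks in our experiments are multi-label) are assumptions for this sketch rather than the exact formulation in the paper; in practice, the student feature map is pooled (and, for cosine distance, projected) so that it can be compared with the teacher embedding.

```python
import torch.nn.functional as F

def cosine_distance(student_feat, teacher_emb):
    # One possible distance measure: 1 - cosine similarity, averaged over the batch.
    # Assumes both inputs have already been pooled/projected to the same shape (batch, dim).
    return 1.0 - F.cosine_similarity(student_feat, teacher_emb, dim=-1).mean()

def combined_loss(logits, labels, student_feat, teacher_emb,
                  distance_fn=cosine_distance, lam=0.5):
    # Classification term: OpenMIC and MagnaTagATune are multi-label tasks,
    # so a binary cross-entropy loss is used in this sketch.
    cls_loss = F.binary_cross_entropy_with_logits(logits, labels)
    # Regularization term: pull the student's feature map towards the frozen teacher embedding.
    reg_loss = distance_fn(student_feat, teacher_emb)
    return cls_loss + lam * reg_loss
```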
Experiments
We test the effectiveness of using pre-trained embeddings as teachers on two tasks: musical instrument classification with OpenMIC and music auto-tagging with MagnaTagATune. Four different embeddings are used as teachers: VGGish, OpenL3, PaSST, and PANNs.
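To make the setup more concrete, the snippet below shows one way a clip-level teacher embedding could be extracted with the open-source openl3 package; the file path and parameter choices (music content type, 512-dimensional embedding) are illustrative assumptions, not necessarily the settings used in our experiments.

```python
import openl3
import soundfile as sf

# Load an audio clip (hypothetical path) and extract frame-wise OpenL3 embeddings.
audio, sr = sf.read("example_clip.wav")
emb, timestamps = openl3.get_audio_embedding(
    audio, sr, content_type="music", embedding_size=512
)

# Average along the time axis to obtain a clip-level teacher embedding,
# as done for the TeacherLR baseline described below.
clip_embedding = emb.mean(axis=0)  # shape: (512,)
```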
The following systems are evaluated for comparison:
• Baseline: CP ResNet (on OpenMIC) and Mobile FCN (on MagnaTagATune) trained without any extra regularization loss.
• TeacherLR: logistic regression on the pre-trained embeddings (averaged along the time axis), which can be seen as one way to do transfer learning by freezing the whole model except for the classification head.
• KD: classical knowledge distillation where the soft targets are generated by the logistic regression.
• EAsTCos-Diff: feature space regularization that uses cosine distance difference and regularizes only the final feature map.
• EAsTFinal and EAsTAll: proposed systems based on distance correlation as the distance measure, regularizing either only at the final stage or at all stages, respectively (see the distance-correlation sketch after this list).
• EAsTKD: a combination of classical knowledge distillation and our method of using embeddings to regularize the feature space. The feature space regularization is done only at the final stage.
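To illustrate the distance-correlation-based regularization mentioned above, here is a minimal PyTorch sketch assuming batches of pooled student features and teacher embeddings. The loss formulation (1 minus distance correlation) and the function names are assumptions for this sketch rather than the exact implementation in the paper; note that, unlike cosine distance, distance correlation does not require the student features and teacher embeddings to have the same dimensionality.

```python
import torch

def distance_correlation(x, y, eps=1e-9):
    """Empirical distance correlation between two batches x: (n, p) and y: (n, q)."""
    # Pairwise Euclidean distance matrices.
    a = torch.cdist(x, x)
    b = torch.cdist(y, y)
    # Double-center each distance matrix.
    A = a - a.mean(dim=0, keepdim=True) - a.mean(dim=1, keepdim=True) + a.mean()
    B = b - b.mean(dim=0, keepdim=True) - b.mean(dim=1, keepdim=True) + b.mean()
    # Squared distance covariance and variances.
    dcov2 = (A * B).mean().clamp(min=0.0)
    dvar_x = (A * A).mean()
    dvar_y = (B * B).mean()
    dcorr2 = dcov2 / (torch.sqrt(dvar_x * dvar_y) + eps)
    return torch.sqrt(dcorr2 + eps)

def dcorr_regularization(student_feat, teacher_emb):
    # Encourage high distance correlation between student features and teacher embeddings.
    return 1.0 - distance_correlation(student_feat, teacher_emb)
```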
Results
Table 1 shows the results on the OpenMIC and MagnaTagATune datasets. We can make the following observations:
• Models trained with the extra regularization loss consistently outperform the non-regularized ones on both datasets, with all embeddings and all regularization methods.
• Regardless of whether the teachers themselves perform excellently (PaSST and PANNs) or not (VGGish and OpenL3), the students benefit from the additional knowledge in these embeddings, and the students’ performance is not upper-bounded by that of the teachers.
• With traditional knowledge distillation, the models perform well only with “strong” teachers such as PaSST and PANNs, which indicates that the method depends on high-quality soft targets generated by strong teachers.
• The combination system EAsTKD gives better results with PaSST and PANNs embeddings, while with VGGish and OpenL3 embeddings its performance falls short of EAsTFinal or EAsTAll in most cases.
Table 2 lists the number of parameters and rough inference speed measurements of the models. We can see that Mobile FCN and CP ResNet are much faster at inference than the pre-trained models.
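For context, such measurements could be obtained roughly as in the sketch below, which counts trainable parameters and times forward passes of a PyTorch model on a dummy input; the input shape and number of runs are arbitrary assumptions, and actual speeds depend on hardware, batch size, and input length.

```python
import time
import torch

def count_parameters(model):
    # Total number of trainable parameters.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

@torch.no_grad()
def average_inference_time(model, input_shape=(1, 1, 128, 1000), n_runs=50):
    model.eval()
    x = torch.randn(*input_shape)  # dummy mel-spectrogram-like input (assumed shape)
    for _ in range(5):             # warm-up runs
        model(x)
    start = time.perf_counter()
    for _ in range(n_runs):
        model(x)
    return (time.perf_counter() - start) / n_runs
```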
Conclusion
In this study, we explored the use of audio embeddings as teachers to regularize the feature space of low-complexity student networks during training. We investigated several different ways of implementing the regularization and tested their effectiveness on the OpenMIC and MagnaTagATune datasets. The results show that using embeddings as teachers enhances the performance of the low-complexity student models, and the results can be further improved by combining our method with a traditional knowledge distillation approach.
Resources
• Code: The code for reproducing the experiments is available in our GitHub repository.
• Paper: Please see the full paper for more details on our study.