How Robust are Audio Embeddings for Polyphonic Sound Event Tagging?

Abeßer J, Grollmisch S, Müller M (2023)


Publication Type: Journal article

Publication year: 2023

Journal: IEEE/ACM Transactions on Audio, Speech and Language Processing

Volume: 31

Page range: 2658-2667

DOI: 10.1109/TASLP.2023.3293032

Abstract

Sound classification algorithms are challenged by the natural variability of everyday sounds, particularly for large sound class taxonomies. In order to be applicable in real-life environments, such algorithms must also be able to handle polyphonic scenarios, where simultaneously occurring and overlapping sound events need to be classified. With the rapid progress of deep learning, several deep audio embeddings (DAEs) have been proposed as pre-trained feature representations for sound classification. In this article, we analyze the embedding spaces of two non-trainable audio representations (NTARs) and five DAEs for sound classification in polyphonic scenarios (sound event tagging) and make several contributions. First, we compare general properties like the inter-correlation between feature dimensions and the scattering of sound classes in the embedding spaces. Second, we test the robustness of the embeddings against several audio degradations and propose two sensitivity measures based on a class-agnostic and a class-centric view on the resulting drift in the embedding space. Finally, as a central contribution, we study how a blending between pairs of sounds maps to embedding space trajectories and how the path of these trajectories can cause classification errors due to their proximity to other sound classes. Throughout our analyses, the PANN embeddings have shown the best overall performance for low-polyphony sound event tagging.
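
The blending analysis described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_embedding` is a hypothetical stand-in for a pre-trained embedding (a band-averaged log-magnitude spectrum), the linear mixing rule, the sine-tone "sound classes", and the nearest-centroid check are all assumptions made for illustration. The sketch mixes two sounds with a coefficient alpha, embeds each mixture, and reports which class centroid each point of the resulting trajectory lies closest to.

```python
import numpy as np


def toy_embedding(signal, n_bins=16):
    """Hypothetical stand-in for a pre-trained audio embedding:
    band-averaged log-magnitude spectrum of the whole signal."""
    spectrum = np.abs(np.fft.rfft(signal))
    bands = np.array_split(spectrum, n_bins)
    return np.log1p(np.array([band.mean() for band in bands]))


def blend_trajectory(sound_a, sound_b, embed, n_steps=11):
    """Embed the mixtures (1 - alpha) * a + alpha * b for alpha in [0, 1]."""
    alphas = np.linspace(0.0, 1.0, n_steps)
    points = np.stack(
        [embed((1.0 - alpha) * sound_a + alpha * sound_b) for alpha in alphas]
    )
    return alphas, points


def nearest_class(points, centroids):
    """Index of the closest class centroid (Euclidean) per trajectory point."""
    distances = np.linalg.norm(
        points[:, None, :] - centroids[None, :, :], axis=-1
    )
    return distances.argmin(axis=1)


# Three synthetic one-second "sound classes" (assumed, for illustration only).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
sound_a = np.sin(2 * np.pi * 300 * t)
sound_b = np.sin(2 * np.pi * 3000 * t)
sound_c = np.sin(2 * np.pi * 1200 * t)  # a third class near neither endpoint

# One centroid per class; here each class has a single example.
centroids = np.stack([toy_embedding(s) for s in (sound_a, sound_b, sound_c)])

alphas, trajectory = blend_trajectory(sound_a, sound_b, toy_embedding)
labels = nearest_class(trajectory, centroids)

for alpha, label in zip(alphas, labels):
    print(f"alpha={alpha:.1f} -> nearest class index {label}")
```

A trajectory point whose nearest centroid is neither of the two blended classes would correspond to the kind of classification error the abstract attributes to proximity to other sound classes. Whether that happens depends on the embedding and the sounds involved.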

How to cite

APA:

Abeßer, J., Grollmisch, S., & Müller, M. (2023). How Robust are Audio Embeddings for Polyphonic Sound Event Tagging? IEEE/ACM Transactions on Audio, Speech and Language Processing, 31, 2658-2667. https://doi.org/10.1109/TASLP.2023.3293032

MLA:

Abeßer, Jakob, Sascha Grollmisch, and Meinard Müller. "How Robust are Audio Embeddings for Polyphonic Sound Event Tagging?" IEEE/ACM Transactions on Audio, Speech and Language Processing 31 (2023): 2658-2667.
