Learning Audio Representations for Cross-Version Retrieval of Western Classical Music

Zalkow F (2021)

Publication Language: English

Publication Type: Thesis

Publication year: 2021

URI: https://nbn-resolving.org/urn:nbn:de:bvb:29-opus4-167774


Ongoing digitization efforts lead to vast amounts of music data, e.g., audio and video recordings, symbolically encoded scores, or graphical sheet music. Accessing this data in a convenient way requires flexible retrieval strategies. One access paradigm is known as “query by example,” where a short music excerpt in a specific representation is given as a query. The task is to automatically retrieve documents from a music database that are similar to the query in certain parts or aspects. This thesis addresses two different cross-version retrieval scenarios of Western classical music, where the aim is to find the database’s audio recordings that are based on the same musical work as the query. Depending on the respective scenario, one requires task-specific audio representations to compare the query and the database documents. Various approaches for learning such audio representations with deep neural networks are proposed, leading to improvements in the efficiency of the search and the quality of the retrieval results.

In the first scenario, the query is a short audio snippet. The retrieval is based on audio shingles, which are short sequences of chroma features capturing properties of the harmonic and melodic content of the audio recordings. The comparison between the query and the recordings from the database is realized by a nearest-neighbor search of the audio shingles. The thesis contains various contributions to increase the efficiency of the retrieval procedure in this scenario. In order to reduce the dimensionality of the shingles, deep-learning-based embedding techniques are used. Furthermore, a graph-based index structure for efficient nearest-neighbor search is applied. These adaptations lead to substantial improvements in terms of runtime and memory requirements.
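
The shingle-based comparison described above can be illustrated with a minimal sketch. It uses an exhaustive Euclidean nearest-neighbor search over raw shingles, not the learned embeddings or the graph-based index from the thesis. The function names `make_shingles` and `nearest_shingle`, the shingle length of 20 frames, and the random stand-in chroma features are assumptions made for illustration only.

```python
import numpy as np

def make_shingles(chroma, length):
    """Stack `length` consecutive chroma frames into one flat vector per start frame.

    chroma: array of shape (12, num_frames). Returns (num_shingles, 12 * length).
    """
    num_frames = chroma.shape[1]
    shingles = [chroma[:, i:i + length].flatten()
                for i in range(num_frames - length + 1)]
    return np.array(shingles)

def nearest_shingle(query, database):
    """Brute-force search: index and distance of the closest database shingle."""
    dists = np.linalg.norm(database - query, axis=1)
    idx = int(np.argmin(dists))
    return idx, float(dists[idx])

# Hypothetical data: random values standing in for real chroma features.
rng = np.random.default_rng(0)
chroma = rng.random((12, 50))
shingles = make_shingles(chroma, length=20)

# A slightly perturbed copy of shingle 7 plays the role of the query snippet.
query = shingles[7] + 0.01 * rng.standard_normal(shingles.shape[1])
best_index, best_dist = nearest_shingle(query, shingles)
```

The exhaustive search compares the query against every database shingle, so its cost grows with both the database size and the shingle dimensionality. These are the two factors that the dimensionality reduction and the index structure mentioned above are meant to address.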

In the second scenario, a symbolically encoded monophonic musical theme is used as a query. The retrieval is based on a sequence-alignment algorithm relying on chroma-based audio features. This scenario is more challenging than the first one because the query (monophonic symbolic theme) and the database documents (audio recordings of polyphonic music) are fundamentally different from each other. The thesis contains various contributions to improve the retrieval results in this scenario. On the one hand, a novel dataset for musical themes is introduced that is helpful for evaluation purposes and supervised training procedures. On the other hand, various enhanced chroma representations are proposed for the retrieval task. In particular, a novel chroma-feature variant is introduced, where theme-like structures in the musical content of the audio recordings are enhanced by a deep neural network. The network is trained with the connectionist temporal classification (CTC) loss, which allows for aligning the themes to the audio recordings during the training procedure. The experiments described in this thesis show that the results of the theme-based retrieval task are substantially improved by using this representation.
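
To make the alignment step concrete, the following sketch matches a short theme against a longer recording, both given as chroma sequences. The abstract does not name the alignment algorithm, so subsequence dynamic time warping (DTW) is used here as an assumed, commonly used choice. The helper names, the cosine-based cost, and the one-hot toy sequences are hypothetical.

```python
import numpy as np

def cosine_cost(X, Y):
    """Pairwise cosine distance between the columns of X (12, N) and Y (12, M)."""
    Xn = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-9)
    Yn = Y / (np.linalg.norm(Y, axis=0, keepdims=True) + 1e-9)
    return 1.0 - Xn.T @ Yn

def subsequence_dtw(C):
    """Accumulated cost and end index of the best match of the query in the recording.

    C: cost matrix of shape (N, M), query frames along rows.
    """
    N, M = C.shape
    D = np.full((N, M), np.inf)
    D[0, :] = C[0, :]  # the match may start at any position of the recording
    for n in range(1, N):
        D[n, 0] = D[n - 1, 0] + C[n, 0]
        for m in range(1, M):
            D[n, m] = C[n, m] + min(D[n - 1, m], D[n, m - 1], D[n - 1, m - 1])
    end = int(np.argmin(D[-1, :]))  # the match may end at any position
    return float(D[-1, end]), end

# Toy example: one-hot chroma vectors for sequences of pitch classes.
theme = np.eye(12)[:, [0, 4, 7, 0]]
recording = np.eye(12)[:, [2, 5, 9, 0, 4, 7, 0, 11, 2]]

cost, end = subsequence_dtw(cosine_cost(theme, recording))
```

In this toy case the theme occurs verbatim in the recording, so the accumulated cost is close to zero. With real polyphonic audio, accompanying voices add energy to other chroma bands and raise the cost, which is the mismatch that the enhanced chroma representations described above aim to reduce.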

How to cite


Zalkow, F. (2021). Learning Audio Representations for Cross-Version Retrieval of Western Classical Music (Dissertation).


Zalkow, Frank. Learning Audio Representations for Cross-Version Retrieval of Western Classical Music. Dissertation, 2021.
