Classification at the accuracy limit: facing the problem of data ambiguity

Metzner C, Schilling A, Traxdorf M, Tziridis K, Maier A, Schulze H, Krauß P (2022)


Publication Type: Journal article

Publication year: 2022

Journal: Scientific Reports

Book Volume: 12

Article Number: 22121

Journal Issue: 1

DOI: 10.1038/s41598-022-26498-z

Abstract

Data classification, the process of analyzing data and organizing it into categories or clusters, is a fundamental computing task of natural and artificial information processing systems. Both supervised classification and unsupervised clustering work best when the input vectors are distributed over the data space in a highly non-uniform way. However, these tasks become challenging in weakly structured data sets, where a significant fraction of data points is located between the regions of high point density. We derive the theoretical limit for classification accuracy that arises from this overlap of data categories. Using a surrogate data generation model with adjustable statistical properties, we show that sufficiently powerful classifiers based on completely different principles, such as perceptrons and Bayesian models, all perform at this universal accuracy limit under ideal training conditions. Remarkably, the accuracy limit is not affected by certain non-linear transformations of the data, even if these transformations are non-reversible and drastically reduce the information content of the input data. We further compare the data embeddings that emerge from supervised and unsupervised training, using the MNIST data set and human EEG recordings during sleep. We find for MNIST that categories are significantly separated not only after supervised training with back-propagation, but also after unsupervised dimensionality reduction. A qualitatively similar cluster enhancement by unsupervised compression is observed for the EEG sleep data, but with a very small overall degree of cluster separation. We conclude that the handwritten digits in MNIST can be considered 'natural kinds', whereas EEG sleep recordings are a relatively weakly structured data set, so that unsupervised clustering will not necessarily recover the human-defined sleep stages.
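
The idea of an accuracy limit caused by overlapping categories can be illustrated with a deliberately simple toy case. The sketch below is not the surrogate data model used in the article; it assumes two equally likely classes drawn from one-dimensional Gaussians with a common standard deviation, for which the best achievable accuracy has a closed form. The function names and parameters (`bayes_accuracy_limit`, `empirical_accuracy`, `distance`, `sigma`) are hypothetical and chosen only for this illustration.

```python
import math
import random


def bayes_accuracy_limit(distance, sigma=1.0):
    """Best achievable accuracy for two equally likely 1D Gaussian classes.

    Assumption (toy model): class means are `distance` apart and both
    classes share the standard deviation `sigma`. The optimal decision
    threshold is then the midpoint between the means, and the accuracy
    equals the standard normal CDF evaluated at distance / (2 * sigma).
    """
    z = distance / (2.0 * sigma)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def empirical_accuracy(distance, sigma=1.0, n=200_000, seed=0):
    """Accuracy of the midpoint-threshold classifier on sampled data."""
    rng = random.Random(seed)
    threshold = distance / 2.0
    correct = 0
    for _ in range(n):
        label = rng.randrange(2)                # class 0 or 1, equal priors
        x = rng.gauss(label * distance, sigma)  # sample from that class
        predicted = 1 if x > threshold else 0
        correct += predicted == label
    return correct / n


if __name__ == "__main__":
    # Larger separation means less overlap and a higher accuracy limit.
    for d in (0.0, 1.0, 2.0, 4.0):
        print(d, round(bayes_accuracy_limit(d), 4), round(empirical_accuracy(d), 4))
```

In this toy setting no classifier can beat the closed-form value, because the points that fall on the wrong side of the midpoint are genuinely ambiguous; the sampled accuracy should approach the limit as the number of samples grows.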

How to cite

APA:

Metzner, C., Schilling, A., Traxdorf, M., Tziridis, K., Maier, A., Schulze, H., & Krauß, P. (2022). Classification at the accuracy limit: facing the problem of data ambiguity. Scientific Reports, 12(1), Article 22121. https://doi.org/10.1038/s41598-022-26498-z

MLA:

Metzner, Claus, et al. "Classification at the accuracy limit: facing the problem of data ambiguity." Scientific Reports 12.1 (2022).
