(Reference retrieved automatically from Web of Science through information on FAPESP grant and its corresponding number as mentioned in the publication by the authors.)

Coarse-refinement dilemma: On generalization bounds for data clustering

Author(s):
Vaz, Yule [1] ; de Mello, Rodrigo Fernandes [1] ; Grossi Ferreira, Carlos Henrique [1]
Total Authors: 3
Affiliation:
[1] Univ Sao Paulo, Inst Math & Comp Sci, Trabalhador Saocarlense Ave 400, BR-13560970 Sao Carlos, SP - Brazil
Total Affiliations: 1
Document type: Journal article
Source: EXPERT SYSTEMS WITH APPLICATIONS; v. 184, DEC 1 2021.
Web of Science Citations: 0
Abstract

The data clustering problem is of central importance to machine learning, given its usefulness in representing structural similarities among data in input spaces. However, the literature offers few theoretical frameworks with generalization guarantees for data clustering. In this context, this manuscript introduces a new concept, based on multidimensional persistent homology, to analyze the conditions under which a clustering model is capable of generalizing. As a first step, we propose a more general definition of the data clustering (DC) problem that relies on topological spaces instead of the metric spaces typically assumed in the literature. From that, we show that the data clustering problem presents a dilemma analogous to the bias-variance one, here referred to as the coarse-refinement dilemma, from which we conclude that: (i) highly refined partitions are associated with clustering instability (overfitting); and (ii) overly coarse partitions are associated with a lack of representativeness (underfitting). The coarse-refinement dilemma suggests the need for a relaxation of Kleinberg's richness axiom, since that axiom allows the production of unstable or unrepresentative partitions. Experimental exploration considering different clustering refinements can then reveal such partitions. (AU)
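
As a rough illustration of the coarse-refinement dilemma described in the abstract, the toy sketch below clusters one-dimensional samples at three refinement levels and compares how many clusters appear across repeated draws from the same distribution. It is not the authors' method, which relies on multidimensional persistent homology; it assumes a simple gap-threshold (single-linkage) rule, a two-mode Gaussian data source, and hypothetical names (`single_linkage_1d`, `sample`, `cluster_counts`) chosen only for this example.

```python
import random


def single_linkage_1d(points, radius):
    """Partition 1-D points: consecutive sorted points at most `radius`
    apart fall in the same cluster (single linkage cut at `radius`)."""
    pts = sorted(points)
    clusters = [[pts[0]]]
    for prev, cur in zip(pts, pts[1:]):
        if cur - prev <= radius:
            clusters[-1].append(cur)
        else:
            clusters.append([cur])
    return clusters


def sample(rng, n=100):
    """Assumed data source: two well-separated modes, at 0 and at 10."""
    half = n // 2
    return ([rng.gauss(0.0, 0.5) for _ in range(half)]
            + [rng.gauss(10.0, 0.5) for _ in range(half)])


def cluster_counts(radius, trials=20, seed=0):
    """Number of clusters found on `trials` independent samples."""
    rng = random.Random(seed)
    return [len(single_linkage_1d(sample(rng), radius)) for _ in range(trials)]


# Small radius: highly refined partitions; large radius: coarse partitions.
for radius in (0.01, 2.0, 50.0):
    counts = cluster_counts(radius)
    print(f"radius={radius}: min clusters={min(counts)}, max clusters={max(counts)}")
```

Under these assumptions, the smallest radius tends to yield many clusters whose number changes from sample to sample (instability), the largest radius merges both modes into a single cluster (lack of representativeness), and the intermediate radius recovers the two modes consistently.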

FAPESP's process: 17/16548-6 - Providing theoretical guarantees to the detection of concept drift in data streams
Grantee: Rodrigo Fernandes de Mello
Support Opportunities: Scholarships abroad - Research