Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection

Campello, Ricardo J. G. B.; Moulavi, Davoud; Zimek, Arthur; Sander, Joerg

Author(s):	Campello, Ricardo J. G. B. ^[1] ; Moulavi, Davoud ^[2] ; Zimek, Arthur ^[3] ; Sander, Joerg ^[2] Total number of authors: 4
Author affiliation(s):	^[1] Univ Sao Paulo, Dept Comp Sci, BR-05508 Sao Paulo - Brazil ^[2] Univ Alberta, Dept Comp Sci, Edmonton, AB T6G 2M7 - Canada ^[3] Univ Munich, D-81377 Munich - Germany Total number of affiliations: 3
Document type:	Scientific article
Source:	ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA; v. 10, n. 1 JUL 2015.
Web of Science citations:	70
Abstract
An integrated framework for density-based cluster analysis, outlier detection, and data visualization is introduced in this article. The main module consists of an algorithm to compute hierarchical estimates of the level sets of a density, following Hartigan's classic model of density-contour clusters and trees. Such an algorithm generalizes and improves existing density-based clustering techniques with respect to different aspects. It provides as a result a complete clustering hierarchy composed of all possible density-based clusters following the nonparametric model adopted, for an infinite range of density thresholds. The resulting hierarchy can be easily processed so as to provide multiple ways for data visualization and exploration. It can also be further postprocessed so that: (i) a normalized score of "outlierness" can be assigned to each data object, which unifies both the global and local perspectives of outliers into a single definition; and (ii) a "flat" (i.e., nonhierarchical) clustering solution composed of clusters extracted from local cuts through the cluster tree (possibly corresponding to different density thresholds) can be obtained, either in an unsupervised or in a semisupervised way. In the unsupervised scenario, the algorithm corresponding to this postprocessing module provides a globally optimal solution to the formal problem of maximizing the overall stability of the extracted clusters. If partially labeled objects or instance-level constraints are provided by the user, the algorithm can solve the problem by considering both constraint violations/satisfactions and cluster stability criteria. An asymptotic complexity analysis, both in terms of running time and memory space, is described. Experiments are reported that involve a variety of synthetic and real datasets, including comparisons with state-of-the-art, density-based clustering and (global and local) outlier detection methods. (AU)
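
The flat-clustering step mentioned in the abstract, where clusters are picked from local cuts through the cluster tree so that total stability is maximized, can be illustrated with a small bottom-up selection. The following is only a sketch under stated assumptions, not the authors' implementation: the tree, the node names, the stability values, and the function name `select_clusters` are all hypothetical, and the stability scores are taken as given rather than computed from data. The one constraint assumed here is that no selected cluster may be an ancestor of another selected cluster.

```python
from typing import Dict, List, Tuple


def select_clusters(
    children: Dict[str, List[str]],
    stability: Dict[str, float],
    root: str,
) -> Tuple[float, List[str]]:
    """Pick non-nested clusters from a tree so that summed stability is maximal.

    For each node, compare its own stability with the best total achievable
    inside its subtrees, and keep whichever is larger (ties keep the node).
    """

    def solve(node: str) -> Tuple[float, List[str]]:
        kids = children.get(node, [])
        if not kids:
            # A leaf can only be selected as itself.
            return stability[node], [node]

        # Best total obtainable from the descendants of this node.
        sub_total = 0.0
        sub_selected: List[str] = []
        for kid in kids:
            kid_total, kid_selected = solve(kid)
            sub_total += kid_total
            sub_selected.extend(kid_selected)

        # Keep the node itself only if it is at least as stable as the
        # best combination of clusters below it.
        if stability[node] >= sub_total:
            return stability[node], [node]
        return sub_total, sub_selected

    return solve(root)


# Hypothetical cluster tree: A splits into B and C; B splits into D and E.
children = {"A": ["B", "C"], "B": ["D", "E"]}
# Hypothetical stability scores (illustrative values only).
stability = {"A": 2.0, "B": 3.0, "C": 4.0, "D": 1.0, "E": 1.5}

total, selected = select_clusters(children, stability, "A")
print(total, sorted(selected))
```

In this toy tree, B is kept in preference to its children D and E (3.0 versus 2.5), while the root A is dropped in favor of B and C together (7.0 versus 2.0), so the selected clusters come from different depths of the tree. Each node is visited once, so the selection runs in time linear in the number of tree nodes.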

FAPESP grant:	10/20032-6 - Study and development of validation methods for density-based and graph-based data clustering techniques
Grantee:	Ricardo José Gabrielli Barreto Campello
Support type:	Scholarships Abroad - Research


FAPESP grant:	13/18698-4 - Methods and algorithms in unsupervised and semi-supervised machine learning
Grantee:	Ricardo José Gabrielli Barreto Campello
Support type:	Research Grants - Regular
