A study of K-Means-based algorithms for constrained clustering

Covoes, Thiago F.; Hruschka, Eduardo R.; Ghosh, Joydeep

Texto completo
Autor(es):	Covoes, Thiago F. ^{[1, 2]} ; Hruschka, Eduardo R. ^{[1, 2]} ; Ghosh, Joydeep ^[1] Número total de Autores: 3
Afiliação do(s) autor(es):	^[1] Univ Texas Austin, Austin, TX 78712 - USA ^[2] Univ Sao Paulo, BR-13560970 Sao Carlos, SP - Brazil Número total de Afiliações: 2
Tipo de documento:	Artigo Científico
Fonte:	Intelligent Data Analysis; v. 17, n. 3, p. 485-505, 2013.
Citações Web of Science:	14
Resumo
The problem of clustering with constraints has received considerable attention in the last decade. Indeed, several algorithms have been proposed, but only a few studies have (partially) compared their performances. In this work, three well-known algorithms for k-means-based clustering with soft constraints - Constrained Vector Quantization Error (CVQE), its variant named LCVQE, and the Metric Pairwise Constrained K-Means (MPCK-Means) - are systematically compared according to three criteria: Adjusted Rand Index, Normalized Mutual Information, and the number of violated constraints. Experiments were performed on 20 datasets, and for each of them 800 sets of constraints were generated. In order to provide some reassurance about the non-randomness of the obtained results, outcomes of statistical tests of significance are presented. In terms of accuracy, LCVQE has shown to be competitive with CVQE, while violating less constraints. In most of the datasets, both CVQE and LCVQE presented better accuracy compared to MPCK-Means, which is capable of learning distance metrics. In this sense, it was also observed that learning a particular distance metric for each cluster does not necessarily lead to better results than learning a single metric for all clusters. The robustness of the algorithms with respect to noisy constraints was also analyzed. From this perspective, the most interesting conclusion is that CVQE has shown better performance than LCVQE in most of the experiments. The computational complexities of the algorithms are also presented. Finally, a variety of (more specific) new experimental findings are discussed in the paper - e.g., deduced constraints usually do not help finding better data partitions. (AU)

Processo FAPESP:	09/17856-0 - Algoritmos híbridos para aprendizado de máquina não supervisionado
Beneficiário:	Eduardo Raul Hruschka
Modalidade de apoio:	Bolsas no Exterior - Pesquisa


Processo FAPESP:	09/17795-0 - Algoritmos Evolutivos para Problemas de Agrupamento de Dados com Restrições
Beneficiário:	Thiago Ferreira Covões
Modalidade de apoio:	Bolsas no Brasil - Doutorado

URL curto