Ensemble techniques for centralized and distributed clustering

Author(s):
Murilo Coelho Naldi
Total Authors: 1
Document type: Doctoral Thesis
Press: São Carlos.
Institution: Universidade de São Paulo (USP). Instituto de Ciências Matemáticas e de Computação (ICMC/SB)
Defense date:
Examining board members:
Ricardo José Gabrielli Barreto Campello; Francisco de Assis Tenório de Carvalho; Maria do Carmo Nicoletti; Solange Oliveira Rezende; Fernando José von Zuben
Advisor: Ricardo José Gabrielli Barreto Campello
Abstract

The large volume of data produced in different areas of knowledge creates the need for increasingly efficient and effective data mining techniques. Clustering techniques have been successfully applied in several areas, especially when there is no prior knowledge about how the data are organized. Nevertheless, different clustering algorithms, or variations of the same algorithm, can generate a wide variety of results, which raises the need for methods to assess and select good results. One way to evaluate these results is to use cluster validation indexes. However, a wide variety of validation indexes has been proposed in the literature, which can make choosing a single index challenging if the performance of the compared indexes is unknown for the application scenario. To obtain a consensus among the different options, a set of clustering results or validation indexes can be combined into a single final solution. Clustering ensembles have produced results that are robust to variations in the application scenario, which makes them an attractive alternative for finding solutions of reasonable quality according to different validation indexes. Moreover, a combination of validation indexes can provide a more powerful evaluation, since the majority of the combined indexes can compensate for the poor performance of individual ones. In some cases, it is not possible to work with a single centralized data set, for physical reasons or because of privacy concerns, which creates the need to distribute the mining process. Clustering ensembles can be extended to distributed data mining problems, since information about the data from distributed sources can be combined into a single global solution. The main objective of this research is to investigate combination techniques for validation indexes and clustering results, applied to clustering ensemble selection and distributed clustering.
Additionally, evolutionary clustering algorithms are studied as a means of selecting quality solutions among the obtained results. The techniques developed are scalable and have reduced computational complexity, allowing their use on large data sets or in scenarios with distributed data. (AU)