Clustering Data Streams with Automatic Estimation of Number of Clusters
Scalable descriptive models over extensive volumes of distributed data
Automatic clustering based on nature inspired metaheuristics
Author(s): | Jonathan de Andrade Silva (total authors: 1) |
Document type: | Doctoral Thesis |
Press: | São Carlos. |
Institution: | Universidade de São Paulo (USP). Instituto de Ciências Matemáticas e de Computação (ICMC/SB) |
Defense date: | 2015-03-04 |
Examining board members: | Eduardo Raul Hruschka; Heloisa de Arruda Camargo; André Carlos Ponce de Leon Ferreira de Carvalho; Ricardo Ribeiro Gudwin; Renato Tinós |
Advisor: | Eduardo Raul Hruschka |
Abstract | |
Several k-Means-based algorithms for clustering data streams have been proposed in the literature. However, most of them assume that the number of clusters, k, is known a priori by the user and can be kept fixed throughout the data analysis process. Besides the difficulty of choosing k, data stream clustering poses several other challenges, such as handling non-stationary, unbounded data that arrive in an online fashion. In data stream applications, the data must be accessed in arrival order and can be read only once or a small number of times. In this context, the main contributions of this thesis are to: (i) adapt algorithms that have been used successfully in data stream applications where k is known, so that they can estimate the number of clusters from the data; (ii) propose new clustering algorithms that estimate k automatically from the data stream; (iii) evaluate the proposed algorithms under different scenarios. Fourteen data stream clustering algorithms capable of estimating the number of clusters from data were developed. They were evaluated on six artificial datasets and two real-world datasets widely used in the literature. The developed algorithms are useful for several data mining tasks. The developed evolutionary algorithms showed the best trade-off between computational efficiency and data partition quality. (AU) | |
FAPESP's process: | 10/15049-7 - Clustering Data Streams with Automatic Estimation of Number of Clusters |
Grantee: | Jonathan de Andrade Silva |
Support Opportunities: | Scholarships in Brazil - Doctorate |
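
To make the idea in the abstract concrete, the following is a minimal sketch of how k might be estimated from one chunk of a stream. It is not one of the fourteen algorithms from the thesis. Several things here are assumptions made for illustration: the stream is consumed in fixed-size chunks, each read once; k-Means uses deterministic farthest-first seeding; and candidate values of k are scored with a simplified (centroid-based) silhouette. The function names `chunks`, `farthest_first`, `kmeans`, `simplified_silhouette`, and `estimate_k` are hypothetical.

```python
import math
from itertools import islice


def chunks(stream, size):
    """Yield successive lists of `size` items, reading the stream once, in order."""
    it = iter(stream)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def farthest_first(points, k):
    """Deterministic seeding (an assumption): start at the first point, then
    repeatedly add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iters=50):
    """Plain batch k-Means on one chunk; returns the final centroids."""
    centroids = farthest_first(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        updated = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centroids[j]  # keep the old centroid if a cluster is empty
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def simplified_silhouette(points, centroids):
    """Average of (b - a) / max(a, b), where a and b are the distances from a
    point to its nearest and second-nearest centroid."""
    total = 0.0
    for p in points:
        a, b = sorted(math.dist(p, c) for c in centroids)[:2]
        total += (b - a) / b if b > 0 else 0.0
    return total / len(points)


def estimate_k(points, k_max):
    """Try k = 2..k_max and return the k with the highest simplified silhouette."""
    best_k, best_score = 2, -1.0
    for k in range(2, min(k_max, len(points)) + 1):
        score = simplified_silhouette(points, kmeans(points, k))
        if score > best_score:  # strict comparison: ties favour the smaller k
            best_k, best_score = k, score
    return best_k
```

A caller could loop with `for chunk in chunks(stream, 500): k = estimate_k(chunk, k_max=10)`, where the points are tuples of floats and the chunk size and `k_max` are arbitrary illustrative values. Re-estimating k on every chunk is one simple way to let the number of clusters follow a non-stationary stream, at the cost of running k-Means once per candidate k.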