In 2009, the Brazilian Computer Society (SBC) met to define the grand challenges of computing in Brazil, with an outlook to 2020. One of the challenges identified was "how to increase our capacity to extract relevant information from data streams". Clustering is one of the most attractive subareas of data stream mining, since it does not require a specialist to supervise every example in the database. Scientific experiments in a wide range of fields traditionally produce databases with many attributes, which makes analysis harder. Most of the time, however, the desired clusters reside in a low-dimensional subspace, or manifold, embedded in the original high-dimensional space. The difficulty of analyzing data in such high-dimensional spaces, known as the curse of dimensionality, has limited the success of many machine learning techniques. Few papers in the data stream literature have addressed clustering in high-dimensional spaces. All of them, so far, use variance to determine the relevance of each dimension, compared against a fixed threshold that the user supplies a priori. This approach is a severe limitation, given the volatile nature of data streams. This project aims to study and propose information-quantification measures for determining feature relevance in high-dimensional data stream clustering. Such measures do not suffer from the problems of variance, since they are based on the probabilities of the data rather than on their scale. The project also aims to propose mechanisms that adapt the parameters used to determine feature relevance, which is essential given the volatile nature of data streams. The results are expected to make it possible to find clusters in scenarios that current techniques do not support.
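The contrast between a scale-dependent measure and a probability-based one can be illustrated with a minimal sketch. The abstract does not name the specific information measures the project will study, so the Shannon entropy of an equal-width histogram used here is only an assumed stand-in, and the function name `histogram_entropy`, the bin count, and the synthetic data are all hypothetical choices made for illustration.

```python
import math
import random
import statistics


def histogram_entropy(values, bins=10):
    """Shannon entropy (in bits) of an equal-width histogram over the observed range.

    Assumed stand-in for an information-based relevance measure; it depends on
    how the values are distributed across bins, not on their units.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # constant feature: all mass falls in a single bin
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the maximum into the last bin
        counts[idx] += 1
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)


rng = random.Random(0)

# Hypothetical feature with two tight groups (cluster structure).
clustered = [rng.gauss(0.2, 0.02) for _ in range(500)] + [
    rng.gauss(0.8, 0.02) for _ in range(500)
]
# Hypothetical feature spread evenly over its range (no cluster structure).
uniform = [rng.random() for _ in range(1000)]
# The clustered feature again, expressed in different units.
rescaled = [1000.0 * v for v in clustered]

for name, feature in [("clustered", clustered), ("rescaled", rescaled), ("uniform", uniform)]:
    print(name, statistics.pvariance(feature), histogram_entropy(feature))
```

In this sketch, changing the units multiplies the variance by the square of the scale factor, so a fixed variance threshold would treat `clustered` and `rescaled` differently even though they describe the same feature. The histogram entropy is essentially unchanged by the rescaling, and it is lower for the feature whose values concentrate in a few bins than for the evenly spread one.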