Correlation identification using the fractal theory

Author(s):
Elaine Parros Machado de Sousa
Total Authors: 1
Document type: Doctoral Thesis
Press: São Carlos.
Institution: Universidade de São Paulo (USP). Instituto de Ciências Matemáticas e de Computação (ICMC/SB)
Defense date:
Examining board members:
Caetano Traina Junior; Carlos Alberto Heuser; Claudia Maria Bauzer Medeiros; Maria Carolina Monard; Altigran Soares da Silva
Advisor: Caetano Traina Junior
Abstract

The volume of information processed by computer-based systems has grown not only in the amount of data but also in the number and complexity of attributes. In real-world datasets, uniform value distribution and independence between attributes are rather uncommon properties. In fact, real data is usually characterized by a large number of correlated attributes. Moreover, a dataset can present different types of correlations, such as linear, non-linear and non-polynomial. This scenario may degrade the performance of data management and, particularly, data analysis algorithms, as they need to deal with large amounts of data and a high number of attributes. Furthermore, correlations are usually unknown, which may jeopardize the efficacy of these algorithms. In this context, dimensionality reduction techniques can reduce the number of attributes in datasets, thus minimizing the problems caused by high dimensionality. Some of these techniques are based on correlation analysis and try to eliminate only attributes that are correlated to those remaining, aiming to limit the loss of relevant information caused by attribute removal. However, the techniques proposed so far usually analyze how each attribute is correlated to all the others, considering the attribute set as a whole and applying statistical analysis tools. This thesis presents a different approach, based on the Theory of Fractals, to detect the existence of correlations and to identify subsets of correlated attributes. In addition, the proposed technique makes it possible to identify which attributes best describe each correlation. Consequently, a subset of attributes relevant to representing the fundamental characteristics of the dataset is determined, based not only on global correlations but also on the particularities of correlations involving smaller attribute subsets.
The proposed technique works as a tool for the preprocessing steps of knowledge discovery activities, mainly in feature selection operations for dimensionality reduction. The correlation detection technique and its main concepts are validated through experimental studies with synthetic and real data. Finally, as an additional relevant contribution of this thesis, the basic concepts of the Theory of Fractals are also applied to analyze the behavior of data streams. (AU)
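
The abstract does not say which fractal measure the thesis relies on, so the following is only an illustrative sketch, not the author's method. It assumes the correlation fractal dimension D2, estimated by box counting, as one common way to see how correlations lower the intrinsic dimensionality of a dataset: two independent attributes should give an estimate near 2, while two linearly correlated attributes should give an estimate near 1. The function name `correlation_fractal_dimension` and the parameter `levels` are hypothetical.

```python
import math
import random
from collections import Counter


def correlation_fractal_dimension(points, levels=4):
    """Box-counting estimate of D2 (an assumed measure, not taken from the thesis).

    Each attribute is rescaled to [0, 1]. For grids with cell side r = 1/2, 1/4, ...,
    the sum of squared cell occupancies S(r) is computed, and D2 is the
    least-squares slope of log S(r) against log r.
    """
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    # A constant attribute has zero span; use 1.0 to avoid division by zero.
    spans = [(hi - lo) or 1.0 for lo, hi in zip(lows, highs)]

    log_r, log_s = [], []
    for level in range(1, levels + 1):
        cells = 2 ** level  # number of grid cells along each axis
        counts = Counter(
            tuple(
                # Clamp so that the maximum value falls in the last cell.
                min(int((p[d] - lows[d]) / spans[d] * cells), cells - 1)
                for d in range(dims)
            )
            for p in points
        )
        log_r.append(math.log(1.0 / cells))
        log_s.append(math.log(sum(c * c for c in counts.values())))

    # Least-squares slope of log S(r) versus log r.
    mean_x = sum(log_r) / len(log_r)
    mean_y = sum(log_s) / len(log_s)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_r, log_s))
    den = sum((x - mean_x) ** 2 for x in log_r)
    return num / den


rng = random.Random(0)
xs = [rng.random() for _ in range(20000)]
independent = [(x, rng.random()) for x in xs]  # two unrelated attributes
correlated = [(x, 2 * x + 1) for x in xs]      # second attribute depends on the first

d_ind = correlation_fractal_dimension(independent)  # expected to be near 2
d_cor = correlation_fractal_dimension(correlated)   # expected to be near 1
```

Under this assumption, an attribute whose addition leaves the estimate nearly unchanged carries little new information and is a candidate for removal during feature selection.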