Data pre-processing is one of the most important steps in the data mining process, and also one of the most neglected. Data collection may suffer from manual errors and equipment problems, producing inconsistent, noisy, or missing data. Other characteristics, such as class imbalance and overlapping classes, can further hinder the analysis. Ignoring these issues during learning can impair the induction of a suitable model, since traditional machine learning algorithms struggle to induce good models under these conditions. Furthermore, most of these problems are usually addressed independently, even though they are interrelated. The aim of this PhD project is to analyze and address noise, class imbalance, overlapping classes, and high dimensionality in an integrated manner, observing the relations between them. Data with these characteristics are often found in Molecular Biology, so the project plans to use molecular biology data in the analysis.