Uma abordagem para a indução de árvores de decisão voltada para dados de expressão gênica (An approach to decision tree induction for gene expression data)

Pedro Santoro Perez

Author(s):	Pedro Santoro Perez
Document type:	Master's Dissertation
Press:	São Paulo.
Institution:	Universidade de São Paulo (USP). Instituto de Matemática e Estatística (IME/SBI)
Defense date:	2012-04-18
Examining board members:	José Augusto Baranauskas; Fabricio Martins Lopes; Renato Tinós
Advisor:	José Augusto Baranauskas
Abstract
Gene expression studies have been of great importance, enabling the development of new therapies, diagnostic tests, and drugs, as well as the understanding of a variety of biological processes. Nevertheless, such studies face some obstacles: a huge number of genes, of which only a few are actually relevant to the problem at hand; noisy data; among others. This research project consists of: the study of decision tree induction algorithms; the definition of a methodology capable of handling gene expression data using decision trees; and the implementation of that methodology as algorithms that can extract knowledge from that kind of data. Decision tree induction searches the data for relevant characteristics that allow it to model a given concept accurately, while also keeping the generated model comprehensible, which helps specialists discover new knowledge, something very important in the medical and biological fields. On the other hand, such inducers are somewhat unstable: small changes in the training data may produce large changes in the generated model. This is one of the problems addressed in this Master's project. The main problem it addresses, however, is the behavior of those inducers on high-dimensional data, specifically gene expression data: irrelevant attributes may harm the learning process, and many models with similar performance may be generated. A variety of techniques have been explored to treat those problems, but this study focused on two of them: windowing, the most thoroughly explored technique, for which this project proposed some variations to improve its performance; and lookahead, which builds each node of a tree taking into account subsequent steps of the induction process.
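The abstract names windowing without spelling out the loop, so a minimal sketch of the classical form may help: train on a small subset (the window), test on the remaining training examples, add some of the misclassified ones to the window, and repeat. This is not the dissertation's implementation. The names `fit_stump`, `predict`, `windowing`, `init_size`, and `max_add` are hypothetical, the toy data are made up, and a one-split "stump" stands in for a full decision tree inducer to keep the example short.

```python
import random
from collections import Counter


def fit_stump(rows, labels):
    """Fit a one-split threshold classifier (stand-in for a real tree inducer)."""
    majority = Counter(labels).most_common(1)[0][0]
    best_errors, best_model = len(labels) + 1, None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            lo = Counter(left).most_common(1)[0][0]
            hi = Counter(right).most_common(1)[0][0] if right else majority
            errors = sum(y != lo for y in left) + sum(y != hi for y in right)
            if errors < best_errors:
                best_errors, best_model = errors, (f, t, lo, hi)
    return best_model


def predict(model, row):
    f, t, lo, hi = model
    return lo if row[f] <= t else hi


def windowing(rows, labels, fit, predict, init_size, max_add, max_iter=20, seed=0):
    """Classical windowing: grow the window with misclassified examples."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    window, rest = indices[:init_size], indices[init_size:]
    model = fit([rows[i] for i in window], [labels[i] for i in window])
    for _ in range(max_iter):
        # Test the current model on the examples outside the window.
        wrong = [i for i in rest if predict(model, rows[i]) != labels[i]]
        if not wrong:
            break  # the model is consistent with every example outside the window
        added = wrong[:max_add]
        window += added
        rest = [i for i in rest if i not in set(added)]
        model = fit([rows[i] for i in window], [labels[i] for i in window])
    return model, window


# Toy data (made up): feature 0 determines the class, features 1 and 2 are noise.
rng = random.Random(1)
rows = [[float(x), rng.random(), rng.random()] for x in range(20)]
labels = ["low" if x < 10 else "high" for x in range(20)]
model, window = windowing(rows, labels, fit_stump, predict, init_size=4, max_add=2)
print(len(window), "of", len(rows), "examples used; model:", model)
```

The window update criterion here is plain misclassification; a different criterion would replace the line that builds `wrong`.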
As for windowing, the study explored aspects related to the pruning of the trees generated during intermediate steps of the algorithm; the use of the estimated error instead of the training error; the use of the error weighted according to the size of the current window; and the use of the classification confidence as the window update criterion. As for lookahead, a 1-step version was implemented, i.e., in order to make the decision in the current iteration, the inducer takes into consideration the information gain ratio of the next iteration. The results show that the proposed algorithms outperform the classical ones, especially considering measures of complexity and comprehensibility of the induced models. (AU)
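To illustrate the 1-step lookahead idea, the sketch below compares a greedy choice by gain ratio with a choice that also considers the best gain ratio available one step later. The abstract does not say how the two quantities are combined, so the scoring rule used here (current gain ratio plus the size-weighted best gain ratio in each child) is an assumption, as are the function names and the toy data. Attributes are taken to be discrete, for example expression levels already discretized.

```python
from collections import Counter
from math import log2


def entropy(values):
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())


def partition(rows, labels, attr):
    """Group rows and labels by the value of one attribute."""
    parts = {}
    for row, y in zip(rows, labels):
        sub_rows, sub_labels = parts.setdefault(row[attr], ([], []))
        sub_rows.append(row)
        sub_labels.append(y)
    return parts


def gain_ratio(rows, labels, attr):
    split_info = entropy([row[attr] for row in rows])
    if split_info == 0:
        return 0.0  # the attribute takes a single value here, so it cannot split
    parts = partition(rows, labels, attr)
    remainder = sum(len(ys) / len(labels) * entropy(ys) for _, ys in parts.values())
    return (entropy(labels) - remainder) / split_info


def lookahead_score(rows, labels, attr, attrs):
    """Assumed rule: gain ratio now plus weighted best gain ratio one step later."""
    score = gain_ratio(rows, labels, attr)
    others = [a for a in attrs if a != attr]
    for sub_rows, sub_labels in partition(rows, labels, attr).values():
        if others:
            best_next = max(gain_ratio(sub_rows, sub_labels, a) for a in others)
            score += len(sub_labels) / len(labels) * best_next
    return score


def choose_attribute(rows, labels, attrs, lookahead=True):
    if lookahead:
        return max(attrs, key=lambda a: lookahead_score(rows, labels, a, attrs))
    return max(attrs, key=lambda a: gain_ratio(rows, labels, a))


# Toy data (made up): the class is the XOR of attributes 0 and 1;
# attribute 2 is only weakly correlated with the class.
rows = [(0, 0, 0), (0, 0, 0), (0, 1, 1), (0, 1, 0),
        (1, 0, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
labels = [0, 0, 1, 1, 1, 1, 0, 0]
attrs = [0, 1, 2]

greedy_choice = choose_attribute(rows, labels, attrs, lookahead=False)
lookahead_choice = choose_attribute(rows, labels, attrs, lookahead=True)
print("greedy:", greedy_choice, "lookahead:", lookahead_choice)
```

On this data each XOR attribute has zero gain ratio on its own, so the greedy choice falls on the weakly correlated attribute, while the lookahead score favors an attribute whose children can be split perfectly at the next step.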

FAPESP's process:	09/04511-4 - An Approach for Induction of Decision Trees towards Gene Expression Data
Grantee:	Pedro Santoro Perez
Support Opportunities:	Scholarships in Brazil - Master
