Feature selection and intrinsically multivariate prediction in gene regulatory networks identification

Author(s):
David Corrêa Martins Junior
Total Authors: 1
Document type: Doctoral Thesis
Press: São Paulo.
Institution: Universidade de São Paulo (USP). Instituto de Matemática e Estatística (IME/SBI)
Defense date:
Examining board members:
Roberto Marcondes Cesar Junior; Hugo Aguirre Armelin; Junior Barrera; Sandro José de Souza; Ricardo Zorzetto Nicoliello Vencio
Advisor: Roberto Marcondes Cesar Junior; Junior Barrera
Field of knowledge: Physical Sciences and Mathematics - Computer Science
Indexed in: Banco de Dados Bibliográficos da USP-DEDALUS; Biblioteca Digital de Teses e Dissertações - USP
Location: Universidade de São Paulo. Instituto de Matemática e Estatística. Biblioteca Carlos Benjamin de Lyra; IME-T QA862.T e.1; M386s
Abstract

Feature selection is a crucial topic in pattern recognition applications, especially in bioinformatics, where problems usually involve data with a large number of variables and a small number of observations. The present work addresses feature selection in the problem of identifying gene regulatory networks from expression profiles. In particular, we propose a probabilistic genetic network (PGN) model that recovers a network by repeatedly applying feature selection algorithms guided by a criterion function based on conditional entropy. This criterion embeds error estimation by penalizing rarely observed patterns. Results from applying this model to synthetic data and to real data sets obtained from microarrays of Plasmodium falciparum, a malaria agent, demonstrate the validity of the technique. The method was able not only to reproduce previously established knowledge, but also to produce other potentially relevant results. The intrinsically multivariate prediction (IMP) phenomenon was also investigated. In this phenomenon, a feature set is a good predictor of the objects under study, but none of its proper subsets can predict those objects satisfactorily. In this work, the conditions under which this phenomenon arises were obtained analytically for sets of 2 and 3 features with respect to a target variable. In the context of gene regulatory networks, evidence was obtained that target genes of IMP sets have great potential to perform vital functions in biological systems. The phenomenon known as canalization is particularly important in this context. In melanoma microarray data, we verified that the DUSP1 gene, known to have a canalizing function, was the one that took part in the largest number of IMP gene sets. We also verified that all of these sets have canalizing predictive logics.
Moreover, computational simulations that generate networks with 3 or more genes show that the territory size of a target gene can contribute positively to its IMP score with respect to its predictors. This may be evidence supporting the hypothesis that target genes of IMP sets tend to control several metabolic pathways essential to maintaining the vital functions of an organism. (AU)
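
The two ideas in the abstract, a conditional entropy criterion that penalizes rarely observed patterns and the IMP phenomenon, can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The function names, the `min_count` threshold, and the penalty rule (charging maximal entropy to patterns seen fewer than `min_count` times) are assumptions made for illustration, and `imp_gap` is only one plausible way to quantify how much a full feature set outperforms its proper subsets, not the IMP score defined in the thesis.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import log2


def conditional_entropy(rows, target, features, min_count=1):
    """Estimate H(target | features) in bits from discrete samples.

    rows: list of tuples; target and features are column indices.
    Assumed penalty: a feature pattern observed fewer than min_count
    times is charged the maximal entropy log2(number of target values).
    """
    n_classes = len({row[target] for row in rows})
    groups = defaultdict(Counter)
    for row in rows:
        pattern = tuple(row[f] for f in features)
        groups[pattern][row[target]] += 1

    total = len(rows)
    h = 0.0
    for counts in groups.values():
        n = sum(counts.values())
        if n < min_count:
            h_pattern = log2(n_classes)  # hypothetical penalty for rare patterns
        else:
            h_pattern = -sum((c / n) * log2(c / n) for c in counts.values())
        h += (n / total) * h_pattern
    return h


def imp_gap(rows, target, features, min_count=1):
    """Entropy of the best proper subset minus entropy of the full set.

    A large gap means the full set predicts the target well while no
    proper subset (including the empty set) does.
    """
    full = conditional_entropy(rows, target, features, min_count)
    best_subset = min(
        conditional_entropy(rows, target, subset, min_count)
        for k in range(len(features))
        for subset in combinations(features, k)
    )
    return best_subset - full


# XOR target: columns 0 and 1 are predictors, column 2 is the target.
xor_rows = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)] * 5
print(imp_gap(xor_rows, target=2, features=(0, 1)))
```

In the XOR example neither predictor alone reduces the uncertainty about the target, while the pair determines it completely, which is the situation the abstract describes for 2-feature IMP sets.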