

Machine Learning Tools for Bioinformatics Problems

Author(s):
Victor Alexandre Padilha
Total Authors: 1
Document type: Doctoral Thesis
Press: São Carlos.
Institution: Universidade de São Paulo (USP). Instituto de Ciências Matemáticas e de Computação (ICMC/SB)
Defense date:
Examining board members:
André Carlos Ponce de Leon Ferreira de Carvalho; Ricardo Cerri; Alexandre Rossi Paschoal; Adenilso da Silva Simão
Advisor: André Carlos Ponce de Leon Ferreira de Carvalho; Rolf Backofen
Abstract

In recent years, machine learning techniques have been used extensively in bioinformatics because they can solve hard problems by learning a function from a set of known examples and then using that function to make predictions for unseen data. Motivated by these successful applications, this thesis tackles three bioinformatics problems using machine learning techniques. The first problem concerns the use of coherence measures to analyze biclustering results on gene expression data. Specifically, we conducted a detailed investigation of the correlations between different bicluster coherence measures on a benchmark of 19 Saccharomyces cerevisiae datasets. We identified pairs of redundant measures and observed that the measures showed no relation to external knowledge available in the form of gene ontologies. The second problem concerns the classification of CRISPR cassettes into their subtypes and the prediction of potentially missing proteins. We proposed a novel tool, CRISPRcasIdentifier, which integrates classifiers and regressors for these tasks. It outperformed competing tools from the literature on the most recent benchmark dataset available and is the first tool able to recommend potentially missing proteins in CRISPR cassettes. The third problem concerns the automatic identification of CRISPR cassettes in bacterial and archaeal genomes. We introduced Casboundary, a new tool that detects CRISPR cassettes based on gene signatures and their relations with neighboring genes. Moreover, this tool can point out potentially new cas genes, as demonstrated by a case study. Finally, Casboundary can also decompose a CRISPR cassette into its modules, which are related to the different stages of CRISPR systems. (AU)

FAPESP's process: 19/21300-9 - Machine learning tools for bioinformatics problems
Grantee: Victor Alexandre Padilha
Support Opportunities: Scholarships in Brazil - Doctorate