Sifter-T: Um framework escalável para anotação filogenômica probabilística funcional de domínios protéicos

Danillo Cunha de Almeida e Silva

Full text
Author(s):	Danillo Cunha de Almeida e Silva Total Authors: 1
Document type:	Master's Dissertation
Press:	São Paulo.
Institution:	Universidade de São Paulo (USP). Instituto de Matemática e Estatística (IME/SBI)
Defense date:	2013-10-25
Examining board members:	Ricardo Zorzetto Nicoliello Vencio; João Carlos Setubal; Thiago Motta Venancio
Advisor:	Ricardo Zorzetto Nicoliello Vencio
Abstract
It is known that many software are no longer used due to their complex usability. Even tools known for their task execution quality are abandoned in favour of faster tools, simpler to use or install. In the functional annotation field, Sifter (v2.0) is regarded as one of the best when it comes to annotation quality. Recently it has been considered one of the best tools for functional annotation according to the \"Critical Assessment of Protein Function Annotation (CAFA) experiment. Nevertheless, it is still not widely used, probably due to issues with usability and suitability of the framework to a high throughput scale. The original workflow SIFTER consists of two main steps: The annotation recovery for a list of genes and the reconciled gene tree generation for the same list. Next, based on the gene tree, Sifter builds a Bayesian network structure in which its leaves represent genes. The known functional annotations are associated to the aforementioned leaves, and then the annotations are probabilistically propagated along the Bayesian network to the leaves without a priori information. At the end of the process, a list of Gene Ontology functions and their occurrence probabilities is generated for each unknown function gene. This work main goal is to optimize the original source code for better performance, potentially allowing it to be used in a genome-wide scale. Studying the pre-processing workflow we found opportunities for improvement and envisioned strategies to address them. Among the implemented strategies we have: The use of parallel threads; CPU load balancing, revised algorithms for best utilization of disk access, memory usage and runtime; source code adaptation to currently used biological databases; improved user accessibility; input types increase; automatic gene and species tree reconciliation process; sequence filtering to reduce analysis dimension, and other minor implementations. With these implementations we achieved great performance speed-ups. For example, we obtained 87-fold performance increase in the annotation recovering module and 72.3% speed increase in the gene tree generation module using quad-core machines. Additionally, significant memory usage decrease during the realignment phase was obtained. This implementation is presented as Sifter-T (Sifter Throughput-optimized), an open source tool with better usability, performance and annotation quality when compared to the Sifter\'s original workflow implementation. Sifter-T was written in a modular fashion using Python programming language; it is designed to simplify complete genomes and proteomes annotation tasks and the outputs are presented in order to make the researcher\'s work easier. (AU)

Short URL