Hierarchical classification of transposable elements using machine learning


Transposable elements (TEs) are DNA sequences that can move from one location to another within the genome of a cell. These elements contribute to the genetic diversity of species, and their transposition mechanisms may affect the functionality of genes. Correctly identifying and classifying these elements is useful for understanding their effects on the evolutionary process of genomes. TEs are organized in a hierarchical taxonomy of families and superfamilies. Usually, these elements are identified and classified with Bioinformatics tools based on homology, which compare a new sequence against a dataset of sequences containing previously identified TEs. Although this method is widely used, it has disadvantages: homology between sequences ignores many of their biochemical properties, as well as the relationships between the different TE families and superfamilies. This project will therefore investigate and propose different hierarchical classification methods for TEs using Machine Learning (ML) techniques. Different datasets will be constructed from nucleotide and amino acid sequences with previously identified TEs. To construct these datasets, Bioinformatics tools designed to extract biochemical characteristics from sequences will be used. Different strategies for converting sequences into attribute values suitable for ML techniques will also be investigated. The datasets will then be hierarchically structured according to the TE families and superfamilies to which the sequences belong. The proposed classification methods will be compared with existing methods from the literature and assessed using evaluation measures designed specifically for hierarchical classification problems. (AU)
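
As an illustration of two steps mentioned in the abstract, the sketch below shows one possible way to convert a sequence into attribute values and one possible hierarchical evaluation measure. Both choices are assumptions made for illustration, not the project's stated methods: the attributes are k-mer frequencies, and the measure is the hierarchical precision, recall, and F-measure commonly used in the hierarchical classification literature, computed over labels expanded with their ancestors. The function names, the `/`-separated label format, and the example labels (`LTR/Gypsy`, `LTR/Copia`) are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=3, alphabet="ACGT"):
    """Map a nucleotide sequence to a fixed-length vector of k-mer frequencies.

    Assumption: k-mer frequencies are only one of many possible encodings.
    """
    sequence = sequence.upper()
    # All possible k-mers over the alphabet, in a fixed (lexicographic) order.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # k-mers containing symbols outside the alphabet (e.g. N) are ignored.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def with_ancestors(label, sep="/"):
    """Expand a path-like label, e.g. 'LTR/Gypsy' -> {'LTR', 'LTR/Gypsy'}."""
    parts = label.split(sep)
    return {sep.join(parts[:i]) for i in range(1, len(parts) + 1)}


def hierarchical_f1(true_labels, predicted_labels):
    """Return hierarchical precision, recall and F1 over ancestor-augmented labels."""
    overlap = n_pred = n_true = 0
    for t, p in zip(true_labels, predicted_labels):
        t_set, p_set = with_ancestors(t), with_ancestors(p)
        overlap += len(t_set & p_set)
        n_pred += len(p_set)
        n_true += len(t_set)
    hp = overlap / n_pred if n_pred else 0.0
    hr = overlap / n_true if n_true else 0.0
    hf = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
    return hp, hr, hf


if __name__ == "__main__":
    features = kmer_frequencies("ACGTACGTTGCA", k=2)
    print(len(features))  # one attribute per possible 2-mer
    print(hierarchical_f1(["LTR/Gypsy", "LTR/Copia"], ["LTR/Gypsy", "LTR/Gypsy"]))
```

Under this measure, predicting the wrong family within the correct superfamily still earns partial credit, which a flat accuracy score would not give.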

Scientific publications
(References retrieved automatically from Web of Science and SciELO through information on FAPESP grants and their corresponding numbers as mentioned in the publications by the authors)
CERRI, RICARDO; BASGALUPP, MARCIO P.; BARROS, RODRIGO C.; DE CARVALHO, ANDRE C. P. L. F. Inducing Hierarchical Multi-label Classification rules with Genetic Algorithms. APPLIED SOFT COMPUTING, v. 77, p. 584-604, APR 2019. Web of Science Citations: 1.
SCHIETGAT, LEANDER; VENS, CELINE; CERRI, RICARDO; FISCHER, CARLOS N.; COSTA, EDUARDO; RAMON, JAN; CARARETO, CLAUDIA M. A.; BLOCKEEL, HENDRIK. A machine learning based framework to identify and classify long terminal repeat retrotransposons. PLOS COMPUTATIONAL BIOLOGY, v. 14, n. 4, APR 2018. Web of Science Citations: 3.
CERRI, RICARDO; BARROS, RODRIGO C.; DE CARVALHO, ANDRE C. P. L. F.; JIN, YAOCHU. Reduction strategies for hierarchical multi-label classification in protein function prediction. BMC BIOINFORMATICS, v. 17, SEP 15 2016. Web of Science Citations: 12.
