Active learning for protein subcellular localization

Grant number: 17/24807-1
Support type: Scholarships abroad - Research Internship - Scientific Initiation
Effective date (Start): March 01, 2018
Effective date (End): June 30, 2018
Field of knowledge: Physical Sciences and Mathematics - Computer Science
Principal Investigator: Ricardo Cerri
Grantee: Leonardo Utida Alcântara
Supervisor abroad: Isaac Triguero Velazquez
Home Institution: Centro de Ciências Exatas e de Tecnologia (CCET). Universidade Federal de São Carlos (UFSCAR). São Carlos, SP, Brazil
Research location: University of Nottingham, University Park, England
Associated to the scholarship: 16/25220-1 - Multi-label machine learning for protein subcellular localization, BP.IC

Abstract

Protein subcellular localization is an important classification task because the location of a protein inside a cell is directly related to that protein's function. Because many proteins reside in two or more cellular locations simultaneously, or move between locations, the problem is usually addressed with supervised multi-label classification (MLC) methods. This approach is well established in the literature; however, it has some disadvantages: (i) it needs a large number of labeled instances to train the classifier; (ii) it ignores the fact that unlabeled instances can provide valuable information for classification; and (iii) in many areas unlabeled data is abundant, but manually labeling an instance is expensive and time-consuming. Active learning (AL) is a subfield of semi-supervised learning that aims to build classification models from fewer labeled instances, complemented with the most representative unlabeled instances. To do this, the AL algorithm selects the most representative unlabeled instances to be labeled by an oracle, which can be, for example, a human specialist or an algorithm. The newly labeled instances then complement the original labeled set. The main goal of this project is to investigate the use of AL together with MLC for the protein subcellular localization prediction (PSLP) problem. The AL algorithm will be constructed, tested, and analyzed, and its results will be compared against the method proposed in the ongoing FAPESP project in Brazil. The tests will use the same data sets as that project. This is a substantial extension of the work currently being developed under the FAPESP scholarship and, since few works combining MLC and AL for PSLP were found, the project has the potential to make a significant impact on the literature.
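
The query step described in the abstract can be illustrated with a minimal sketch of one pool-based AL round for multi-label data. The abstract does not say how "most representative" is measured, so this sketch substitutes plain uncertainty sampling (per-label probabilities closest to 0.5) as an assumption. The function names, the probability values, and the location labels are all hypothetical and are not taken from the project.

```python
from typing import Callable, Dict, List, Sequence, Set

Labels = Set[str]  # one protein may carry several location labels


def uncertainty(label_probs: Sequence[float]) -> float:
    """Mean closeness of per-label probabilities to 0.5 (1.0 = most uncertain)."""
    return sum(1.0 - 2.0 * abs(p - 0.5) for p in label_probs) / len(label_probs)


def select_queries(pool_probs: Dict[int, Sequence[float]], batch_size: int) -> List[int]:
    """Return ids of the `batch_size` most uncertain unlabeled instances."""
    ranked = sorted(pool_probs, key=lambda i: uncertainty(pool_probs[i]), reverse=True)
    return ranked[:batch_size]


def active_learning_round(
    labeled: Dict[int, Labels],
    pool_probs: Dict[int, Sequence[float]],
    oracle: Callable[[int], Labels],
    batch_size: int,
) -> List[int]:
    """Query the oracle for the selected instances and move them to the labeled set."""
    queries = select_queries(pool_probs, batch_size)
    for i in queries:
        labeled[i] = oracle(i)  # the oracle supplies the full label set
        del pool_probs[i]       # the instance leaves the unlabeled pool
    return queries


# Hypothetical classifier outputs: P(nucleus), P(cytoplasm) per unlabeled protein.
pool_probs = {0: [0.95, 0.05], 1: [0.55, 0.45], 2: [0.90, 0.60], 3: [0.50, 0.10]}

# Hypothetical ground truth standing in for a human specialist.
true_labels = {0: {"nucleus"}, 1: {"nucleus", "cytoplasm"}, 2: {"nucleus"}, 3: {"nucleus"}}

labeled: Dict[int, Labels] = {}
queried = active_learning_round(labeled, pool_probs, true_labels.__getitem__, batch_size=2)
print(queried)  # ids sent to the oracle in this round
```

In a full loop, the classifier would be retrained on the enlarged labeled set after each round and the pool probabilities recomputed before the next query.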