Analyzing the diversity of public machine learning data repositories for meta-learning

Grant number: 19/20328-7
Support type: Scholarships abroad - Research
Effective date (Start): July 29, 2020
Effective date (End): January 29, 2021
Field of knowledge: Physical Sciences and Mathematics - Computer Science - Computing Methodologies and Techniques
Principal Investigator: Ana Carolina Lorena
Grantee: Ana Carolina Lorena
Host: Kate Smith-Miles
Home Institution: Divisão de Ciência da Computação (IEC). Instituto Tecnológico de Aeronáutica (ITA). Ministério da Defesa (Brasil). São José dos Campos, SP, Brazil
Research place: University of Melbourne, Australia
Associated research grant: 13/07375-0 - CeMEAI - Center for Mathematical Sciences Applied to Industry, AP.CEPID

Abstract

The areas of Meta-learning (MtL) and Automatic Machine Learning (AutoML) have emerged in recent years with successful solutions that ease the use of Machine Learning (ML) techniques by end-users with little ML expertise. MtL and AutoML solutions usually leverage knowledge from problems whose solutions are known, gathered in public repositories. One popular repository is OpenML, which also reports the predictive results achieved by several ML algorithms in benchmark experiments, a rich source of information for MtL and AutoML studies. Nonetheless, most of these studies select the datasets used to develop their solutions in an ad hoc manner. This may prevent an appropriate selection of diverse and challenging datasets and introduce bias into the dataset selection process. Building on the researcher's previous experience in studying the complexity of classification and regression problems from a data-driven perspective, we intend to perform a three-fold analysis of existing benchmark ML repositories: (i) to understand and characterize the diversity of such repositories, specifically for MtL purposes; (ii) to enrich the repositories by generating synthetic datasets with properties distinct from those already present; and (iii) to build a tool able to recommend a test bed of diverse datasets that meets the objectives of the MtL researcher. To this end, we expect to combine concepts from the recent related literature on complexity measures of classification and regression problems, from the proponent's side, and on instance space analysis of supervised ML problems, from the supervisor's side.