mdatagen: A python library for the artificial generation of missing data

Author(s):
Mangussi, Arthur Dantas ; Santos, Miriam Seoane ; Lopes, Filipe Loyola ; Pereira, Ricardo Cardoso ; Lorena, Ana Carolina ; Abreu, Pedro Henriques
Total number of authors: 6
Document type: Scientific article
Source: Neurocomputing; v. 625, 10 pp., 2025-01-29.
Abstract

Missing data is characterized by the presence of absent values in data (i.e., missing values) and it is currently categorized into three different mechanisms: Missing Completely at Random, Missing At Random, and Missing Not At Random. When performing missing data experiments and evaluating techniques to handle absent values, these mechanisms are often artificially generated (a process referred to as data amputation) to assess the robustness and behavior of the methods used. Due to the lack of a standard benchmark for data amputation, different implementations of the mechanisms are used in related research (some are often not disclosed), preventing the reproducibility of results and leading to an unfair or inaccurate comparison between existing and new methods. Moreover, for users outside the field, experimenting with missing data or simulating the appearance of missing values in real-world domains is unfeasible, impairing stress testing in machine learning systems. This work introduces mdatagen, an open source Python library for the generation of missing data mechanisms across 20 distinct scenarios, following different univariate and multivariate implementations of the established missing mechanisms. The package therefore fosters reproducible results across missing data experiments and enables the simulation of artificial missing data under flexible configurations, making it very versatile to mimic several real-world applications involving missing data. The source code and detailed documentation for mdatagen are available at https://github.com/ArthurMangussi/pymdatagen. (AU)
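
To make the three mechanisms named in the abstract concrete, the following is a minimal NumPy sketch of univariate data amputation. It does not use mdatagen and should not be read as that library's API or as the authors' implementations: the function names (`ampute_mcar`, `ampute_mar`, `ampute_mnar`), their parameters, and the "remove the lowest-ranked rows" rule are all assumptions chosen for illustration, and each mechanism has many other valid implementations.

```python
import numpy as np


def ampute_mcar(X, col, rate, rng):
    """MCAR sketch: each entry of `col` is removed with the same
    probability `rate`, independently of any data value."""
    X = np.asarray(X, dtype=float).copy()
    mask = rng.random(X.shape[0]) < rate
    X[mask, col] = np.nan
    return X


def ampute_mar(X, col, driver, rate):
    """MAR sketch: entries of `col` are removed in the rows where a
    different, fully observed feature `driver` takes its lowest values."""
    X = np.asarray(X, dtype=float).copy()
    n_missing = int(round(rate * X.shape[0]))
    rows = np.argsort(X[:, driver])[:n_missing]
    X[rows, col] = np.nan
    return X


def ampute_mnar(X, col, rate):
    """MNAR sketch: missingness depends on the unobserved value itself,
    so the lowest values of `col` are the ones removed."""
    return ampute_mar(X, col, driver=col, rate=rate)


# Hypothetical usage on a small synthetic matrix (illustrative data only).
rng = np.random.default_rng(seed=0)
data = rng.normal(size=(100, 3))
data_mcar = ampute_mcar(data, col=0, rate=0.2, rng=rng)
data_mar = ampute_mar(data, col=0, driver=1, rate=0.2)
data_mnar = ampute_mnar(data, col=0, rate=0.2)
```

In this sketch, the MAR and MNAR variants are deterministic given the data, whereas the MCAR variant removes a random number of entries that is only close to `rate` on average. Differences of this kind between implementations are what the abstract identifies as an obstacle to reproducibility.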

FAPESP grant: 23/13688-2 - An Autoencoder model for dealing with missing and noise data
Grantee: Arthur Dantas Mangussi
Support type: Scholarships abroad - Research Internship - Master's degree
FAPESP grant: 21/06870-3 - Beyond algorithm selection: meta-learning for the analysis and understanding of data and algorithms
Grantee: Ana Carolina Lorena
Support type: Research Grants - Young Investigators - Phase 2
FAPESP grant: 22/10553-6 - A unified approach for dealing with missing and noisy data
Grantee: Arthur Dantas Mangussi
Support type: Scholarships in Brazil - Master's degree