DyS: A Framework for Mixture Models in Quantification

Maletzke, Andre; dos Reis, Denis; Cherman, Everton; Batista, Gustavo; AAAI

Author(s):	Maletzke, Andre; dos Reis, Denis; Cherman, Everton; Batista, Gustavo; AAAI
Total number of authors: 5
Document type:	Scientific Article
Source:	THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE; v. N/A, p. 9-pg., 2019-01-01.
Abstract
Quantification is an expanding research topic in the Machine Learning literature. While in classification we are interested in obtaining the class of individual observations, in quantification we want to estimate the total number of instances that belong to each class. This subtle difference allows the development of several algorithms that incur smaller and more consistent errors than counting the classes issued by a classifier. Among such new quantification methods, one particular family stands out due to its accuracy, simplicity, and ability to operate with imbalanced training samples: Mixture Models (MM). Despite these desirable traits, MM, as a class of algorithms, lacks a more in-depth understanding concerning the influence of internal parameters on its performance. In this paper, we generalize MM with a base framework called DyS: Distribution y-Similarity. With this framework, we perform a thorough evaluation of the most critical design decisions of MM models. For instance, we assess 15 dissimilarity functions to compare histograms with varying numbers of bins from 2 to 110 and, for the first time, make a connection between quantification accuracy and test sample size, with experiments covering 24 public benchmark datasets. We conclude that, when tuned, Topsoe is the histogram distance function that consistently leads to smaller quantification errors and, therefore, is recommended for general use, notwithstanding Hellinger Distance's popularity. To rid MM models of the dependency on a choice for the number of histogram bins, we introduce two dissimilarity functions that can operate directly on observations. We show that SORD, one of these measures, presents performance that is slightly inferior to the tuned Topsoe, while not requiring the sensitive parameterization of the number of bins. (AU)
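
The mixture-model idea summarized in the abstract can be illustrated with a minimal sketch. It assumes a binary problem, classifier scores in [0, 1], and a plain grid search over the positive-class proportion; the function names (`histogram`, `topsoe`, `dys_estimate`), the default of 10 bins, and the grid search itself are illustrative assumptions, not the authors' implementation. The sketch builds histograms of the scores of positive and negative training instances, mixes them with weight alpha, and keeps the alpha whose mixture is closest to the test-sample histogram under the Topsoe distance.

```python
import numpy as np

def histogram(scores, bins):
    """Normalized histogram of classifier scores assumed to lie in [0, 1]."""
    counts, _ = np.histogram(scores, bins=bins, range=(0.0, 1.0))
    return counts / counts.sum()

def topsoe(p, q, eps=1e-12):
    """Topsoe distance between two discrete distributions.

    eps is a small smoothing constant (an assumption of this sketch)
    that avoids log(0) on empty bins.
    """
    p = p + eps
    q = q + eps
    m = p + q
    return float(np.sum(p * np.log(2 * p / m) + q * np.log(2 * q / m)))

def dys_estimate(pos_scores, neg_scores, test_scores, bins=10, grid=101):
    """Estimate the positive-class proportion in the test sample.

    Hypothetical helper: evaluates alpha on a uniform grid in [0, 1]
    and returns the alpha whose mixture histogram is closest to the
    test histogram.
    """
    h_pos = histogram(pos_scores, bins)
    h_neg = histogram(neg_scores, bins)
    h_test = histogram(test_scores, bins)
    alphas = np.linspace(0.0, 1.0, grid)
    dists = [topsoe(a * h_pos + (1 - a) * h_neg, h_test) for a in alphas]
    return float(alphas[int(np.argmin(dists))])

# Toy scores (made up for illustration): the test sample is built as
# 30% positive-like and 70% negative-like scores.
pos = np.repeat([0.85, 0.65], [80, 20])
neg = np.repeat([0.15, 0.35], [80, 20])
test = np.concatenate([np.repeat([0.85, 0.65], [24, 6]),
                       np.repeat([0.15, 0.35], [56, 14])])
print(dys_estimate(pos, neg, test))
```

On this toy input the estimate should be close to 0.3, the proportion used to build the test sample. Swapping `topsoe` for another dissimilarity function, or changing `bins`, corresponds to the design decisions the abstract says the paper evaluates.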

FAPESP Grant:	16/04986-6 - Intelligent traps and sensors: an innovative approach to control insect pests and disease vectors
Grantee:	Gustavo Enrique de Almeida Prado Alves Batista
Support type:	Research Grants - eScience and Data Science Program - Regular


FAPESP Grant:	17/22896-7 - Unsupervised context detection in data streams for classification
Grantee:	Denis Moreira dos Reis
Support type:	Scholarships in Brazil - Doctorate
