Evaluation, model selection and unsupervised outlier detection in data spaces and subspaces

Grant number: 15/06019-0
Support type: Scholarships in Brazil - Doctorate
Effective date (Start): July 01, 2015
Effective date (End): April 01, 2019
Field of knowledge: Physical Sciences and Mathematics - Computer Science
Cooperation agreement: Coordination of Improvement of Higher Education Personnel (CAPES)
Principal Investigator: Ricardo José Gabrielli Barreto Campello
Grantee: Henrique Oliveira Marques
Home Institution: Instituto de Ciências Matemáticas e de Computação (ICMC). Universidade de São Paulo (USP). São Carlos, SP, Brazil
Associated scholarship(s): 17/04161-0 - Evaluation, model selection and unsupervised outlier detection in subspaces, BE.EP.DR

Abstract

Outlier detection plays an important role in discovering patterns in data that can be considered exceptional in some sense. Detecting such patterns is relevant because, in many data mining applications, they represent extraordinary behaviors that are worth further analysis. An important distinction is that between supervised and unsupervised techniques. In this project we focus on unsupervised outlier detection techniques. There are dozens of algorithms in this category in the literature; however, each of them uses its own intuition to judge what should be considered an outlier, which is naturally a subjective concept. This substantially complicates the selection of a particular algorithm, as well as the choice of an appropriate parameter configuration for a given algorithm, in a practical application. It also makes it highly complex to evaluate the quality of the solution obtained by the algorithm or configuration adopted by the analyst, especially given the problem of defining a quality measure that is not tied to the criterion used by the algorithm itself. These issues are interrelated and refer, respectively, to the problems of model selection and evaluation (or validation) of results in unsupervised learning. These problems have been investigated for decades in the area of unsupervised data clustering, but only in the candidate's master's research was a pioneering internal and relative measure for the unsupervised evaluation of binary outlier detection solutions proposed, called IREOS (Internal, Relative Evaluation of Outlier Solutions).
Although the measure represents an important step forward in the state of the art in this area, measures for solutions that provide scores rather than labels for the observations (the type of solution produced by the vast majority of well-known unsupervised outlier detection algorithms) and for solutions consisting of outliers detected in subspaces (an area that has recently received considerable attention because of the high-dimensionality problem) remain well-known open problems in the area. The main objectives of this research project are to extend IREOS to evaluate the results produced by both categories of outlier detection algorithms, and to investigate improvements and applications that go beyond evaluation and model selection, such as the automatic determination of the number of outliers in a dataset. As a secondary objective, we intend to investigate whether the original principles used in the development of the IREOS index can be adapted to the development of new outlier detection algorithms, particularly in the context of subspaces. (AU)