Efficient outlier detection in numerical and categorical data

Author(s):
Cabral, Eugenio F. ; Vinces, Braulio V. Sanchez ; Silva, Guilherme D. F. ; Sander, Jorg ; Cordeiro, Robson L. F.
Total number of authors: 5
Document type: Scientific article
Source: DATA MINING AND KNOWLEDGE DISCOVERY; v. 39, n. 3, p. 46-pg., 2025-05-01.
Abstract

How to spot outliers in a large, unlabeled dataset with both numerical and categorical attributes? How to do it in a fast and scalable way? Outlier detection has many applications and is therefore covered by an extensive literature. Distance-based detectors are the most popular ones. However, they still have two major drawbacks: (a) the intensive neighborhood search, which takes hours or even days to complete in large data, and (b) the inability to process categorical attributes. This paper tackles both problems by presenting HySortOD: a new, fast and scalable detector for numerical and categorical data. Our main focus is the analysis of datasets with many instances and a low-to-moderate number of attributes. We studied dozens of real benchmark datasets with up to one million instances; HySortOD outperformed nine state-of-the-art competitors in runtime, being up to six orders of magnitude faster in large data, while maintaining high accuracy. Finally, we also performed an extensive experimental evaluation that confirms the ability of our method to obtain high-quality results from both real and synthetic datasets with categorical attributes. (AU)

FAPESP grant: 16/17078-0 - Mining, indexing and visualization of Big Data in the context of clinical decision support systems (MIVisBD)
Grantee: Agma Juci Machado Traina
Support type: Research Grant - Thematic
FAPESP grant: 20/07200-9 - Analyzing complex data related to COVID-19 to support decision making and prognosis
Grantee: Agma Juci Machado Traina
Support type: Research Grant - Regular
FAPESP grant: 18/05714-5 - Mining of frequent and high-dimensional data streams: a case study in digital games
Grantee: Robson Leonardo Ferreira Cordeiro
Support type: Research Grant - Regular