Efficient outlier detection in numerical and categorical data

Author(s):
Cabral, Eugenio F. ; Vinces, Braulio V. Sanchez ; Silva, Guilherme D. F. ; Sander, Jorg ; Cordeiro, Robson L. F.
Total Authors: 5
Document type: Journal article
Source: DATA MINING AND KNOWLEDGE DISCOVERY; v. 39, n. 3, p. 46-pg., 2025-05-01.
Abstract

How to spot outliers in a large, unlabeled dataset with both numerical and categorical attributes? How to do it in a fast and scalable way? Outlier detection has many applications; it is therefore covered by an extensive literature. Distance-based detectors are the most popular ones. However, they still have two major drawbacks: (a) the intensive neighborhood search, which takes hours or even days to complete in large data; and (b) the inability to process categorical attributes. This paper tackles both problems by presenting HySortOD: a new, fast and scalable detector for numerical and categorical data. Our main focus is the analysis of datasets with many instances and a low-to-moderate number of attributes. We studied dozens of real, benchmark datasets with up to one million instances; HySortOD outperformed nine competitors from the state of the art in runtime, being up to six orders of magnitude faster in large data, while maintaining high accuracy. Finally, we also performed an extensive experimental evaluation that confirms the ability of our method to obtain high-quality results from both real and synthetic datasets with categorical attributes. (AU)

FAPESP's process: 16/17078-0 - Mining, indexing and visualizing Big Data in clinical decision support systems (MIVisBD)
Grantee: Agma Juci Machado Traina
Support Opportunities: Research Projects - Thematic Grants
FAPESP's process: 20/07200-9 - Analyzing complex data from COVID-19 to support decision making and prognosis
Grantee: Agma Juci Machado Traina
Support Opportunities: Regular Research Grants
FAPESP's process: 18/05714-5 - Mining Frequent Data Streams of High Dimensionality with a Case Study in Digital Games
Grantee: Robson Leonardo Ferreira Cordeiro
Support Opportunities: Regular Research Grants