Unsupervised dimensionality reduction for very large datasets: Are we going to the right direction?

Monteiro Oliveira, Jadson Jose; Ferreira Cordeiro, Robson Leonardo

Author(s):	Monteiro Oliveira, Jadson Jose ^[1] ; Ferreira Cordeiro, Robson Leonardo ^[1]
Total Authors: 2
Affiliation:	^[1] Univ Sao Paulo, Inst Math & Comp Sci ICMC, Trabalhador Sao Carlense Ave 400, BR-13566590 Sao Carlos, SP - Brazil
Total Affiliations: 1
Document type:	Journal article
Source:	KNOWLEDGE-BASED SYSTEMS; v. 196, MAY 21 2020.
Web of Science Citations:	0
Abstract
Given a set of millions or even billions of complex objects for descriptive data mining, how to effectively reduce the data dimensionality? It must be performed in an unsupervised way. Unsupervised dimensionality reduction is essential for analytical tasks like clustering and outlier detection because it helps to overcome the drawbacks of the "curse of high dimensionality". The state-of-the-art approach is to preserve the data variance by means of well-known techniques, such as PCA, KPCA, SVD, and other techniques based on those that have been mentioned, such as PUFS. But, is it always the best strategy to follow? This paper presents an exploratory study performed to compare two distinct approaches: (a) the standard variance preservation, and; (b) one alternative, Fractal-based solution that is rarely used, for which we propose one fast and scalable Spark-based algorithm using a novel feature partitioning approach that allows it to tackle data of high dimensionality. Both strategies were evaluated by inserting into 11 real-world datasets, with up to 123.5 million elements and 518 attributes, at most 500 additional attributes formed by correlations of many kinds, such as linear, quadratic, logarithmic and exponential, and verifying their abilities to remove this redundancy. The results indicate that, at least for large datasets of dimensionality with up to ~1,000 attributes, our proposed Fractal-based algorithm is the best option. It accurately and efficiently removed the redundant attributes in nearly all cases, as opposed to the standard variance-preservation strategy that presented considerably worse results, even when applying the KPCA approach that is made for non-linear correlations. (C) 2020 Elsevier B.V. All rights reserved.
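The evaluation setup described in the abstract, appending redundant attributes derived from existing ones, can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `add_redundant_attributes`, the coefficients, the use of a single source attribute per new column, and the assumption that base values lie in [0, 1) are all hypothetical choices made for illustration. Only the four relation kinds come from the abstract.

```python
import math
import random

# Relation kinds named in the abstract; the coefficients are illustrative
# assumptions, not values taken from the paper.
RELATIONS = {
    "linear": lambda x: 2.0 * x + 1.0,
    "quadratic": lambda x: x * x,
    "logarithmic": lambda x: math.log(1.0 + x),  # requires x > -1
    "exponential": lambda x: math.exp(x),
}


def add_redundant_attributes(rows, n_extra, seed=0):
    """Return (augmented_rows, plan) with n_extra derived attributes appended.

    Each new attribute is a function of one randomly chosen existing
    attribute. `plan` lists (relation_kind, source_column) per new column,
    so a later check can tell which columns are redundant by construction.
    """
    rng = random.Random(seed)
    n_base = len(rows[0])
    kinds = list(RELATIONS)
    # Cycle through the relation kinds; pick the source column at random.
    plan = [(kinds[k % len(kinds)], rng.randrange(n_base)) for k in range(n_extra)]
    augmented = [
        list(row) + [RELATIONS[kind](row[src]) for kind, src in plan]
        for row in rows
    ]
    return augmented, plan


if __name__ == "__main__":
    data_rng = random.Random(42)
    # Toy base data in [0, 1); a stand-in for a real dataset.
    base = [[data_rng.random() for _ in range(3)] for _ in range(5)]
    augmented, plan = add_redundant_attributes(base, n_extra=4)
    print(len(augmented[0]))  # 3 original + 4 injected attributes
    print(plan)
```

Because the returned `plan` records which columns were injected, a reduction method can afterwards be judged by whether it discards those columns while keeping the original ones. How the paper scores this is not stated in the abstract.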

FAPESP's process:	16/17078-0 - Mining, indexing and visualizing Big Data in clinical decision support systems (MIVisBD)
Grantee:	Agma Juci Machado Traina
Support Opportunities:	Research Projects - Thematic Grants


FAPESP's process:	18/05714-5 - Mining Frequent Data Streams of High Dimensionality with a Case Study in Digital Games
Grantee:	Robson Leonardo Ferreira Cordeiro
Support Opportunities:	Regular Research Grants
