(Reference obtained automatically from the Web of Science, through the information on FAPESP funding and the corresponding grant number included in the publication by the authors.)

Unsupervised dimensionality reduction for very large datasets: Are we going to the right direction?

Author(s):
Monteiro Oliveira, Jadson Jose [1] ; Ferreira Cordeiro, Robson Leonardo [1]
Total Authors: 2
Author Affiliation(s):
[1] Univ Sao Paulo, Inst Math & Comp Sci ICMC, Trabalhador Sao Carlense Ave 400, BR-13566590 Sao Carlos, SP - Brazil
Total Affiliations: 1
Document type: Scientific Article
Source: KNOWLEDGE-BASED SYSTEMS; v. 196, MAY 21 2020.
Web of Science Citations: 0
Abstract

Given a set of millions or even billions of complex objects for descriptive data mining, how can we effectively reduce the data dimensionality? It must be performed in an unsupervised way. Unsupervised dimensionality reduction is essential for analytical tasks like clustering and outlier detection because it helps to overcome the drawbacks of the "curse of high dimensionality". The state-of-the-art approach is to preserve the data variance by means of well-known techniques, such as PCA, KPCA, and SVD, as well as techniques built upon them, such as PUFS. But is it always the best strategy to follow? This paper presents an exploratory study comparing two distinct approaches: (a) the standard variance preservation; and (b) a rarely used, Fractal-based alternative, for which we propose a fast and scalable Spark-based algorithm with a novel feature-partitioning approach that allows it to tackle data of high dimensionality. Both strategies were evaluated by inserting into 11 real-world datasets, with up to 123.5 million elements and 518 attributes, at most 500 additional attributes formed by correlations of many kinds, such as linear, quadratic, logarithmic, and exponential, and then verifying each strategy's ability to remove this redundancy. The results indicate that, at least for large datasets with up to ~1,000 attributes, our proposed Fractal-based algorithm is the best option. It accurately and efficiently removed the redundant attributes in nearly all cases, whereas the standard variance-preservation strategy presented considerably worse results, even with KPCA, which is designed for non-linear correlations. © 2020 Elsevier B.V. All rights reserved. (AU)
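The Fractal-based strategy the abstract refers to hinges on the correlation fractal dimension (D2) of the data: redundant attributes, even non-linearly correlated ones, do not increase D2, so they can be detected and discarded without supervision. The paper's actual algorithm is a fast, Spark-based implementation with a novel feature-partitioning scheme; the sketch below is only a minimal single-machine illustration of the underlying idea, not the authors' code. The function names, the box-counting estimator, the greedy elimination loop, and the tolerance tol are all illustrative assumptions.

    import numpy as np

    def correlation_fractal_dimension(X, n_scales=8):
        # Box-counting estimate of the correlation fractal dimension D2:
        # for grids of shrinking cell side r, S(r) = sum of count^2 over
        # occupied cells scales as r^D2, so D2 is the slope of log S(r)
        # versus log r. (Sketch only: a careful implementation would fit
        # the slope on the linear scaling region alone.)
        X = np.asarray(X, dtype=float)
        mins, maxs = X.min(axis=0), X.max(axis=0)
        span = np.where(maxs > mins, maxs - mins, 1.0)
        X = (X - mins) / span  # normalize every attribute to [0, 1]
        log_r, log_s = [], []
        for level in range(1, n_scales + 1):
            r = 2.0 ** -level
            # Cell coordinates of each point; clamp points sitting on the
            # upper boundary into the last cell.
            cells = np.minimum(np.floor(X / r), 2 ** level - 1).astype(np.int64)
            _, counts = np.unique(cells, axis=0, return_counts=True)
            log_r.append(np.log(r))
            log_s.append(np.log((counts.astype(float) ** 2).sum()))
        slope, _ = np.polyfit(log_r, log_s, 1)
        return slope

    def drop_redundant_attributes(X, tol=0.25):
        # Greedy backward elimination: an attribute is treated as redundant
        # when removing it leaves the estimated fractal dimension of the
        # remaining attributes essentially unchanged (tol is an arbitrary,
        # illustrative threshold).
        keep = list(range(X.shape[1]))
        d_full = correlation_fractal_dimension(X)
        for j in reversed(range(X.shape[1])):
            trial = [k for k in keep if k != j]
            if trial and abs(correlation_fractal_dimension(X[:, trial]) - d_full) < tol:
                keep = trial
        return keep

    if __name__ == "__main__":
        # Toy version of the paper's evaluation protocol: inject one
        # quadratically correlated (hence redundant) attribute and check
        # that it is removed while the independent ones are kept.
        rng = np.random.default_rng(42)
        base = rng.random((20_000, 3))
        X = np.hstack([base, base[:, :1] ** 2])
        print(drop_redundant_attributes(X))  # expected: [0, 1, 2]

In the same spirit, the variance-preservation baseline from the study could be reproduced with, for example, scikit-learn's PCA or KernelPCA on the same injected-redundancy data; the abstract's finding is that such baselines miss much of the non-linear redundancy that the fractal estimate captures.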

FAPESP Process: 16/17078-0 - Mining, indexing and visualizing Big Data in the context of clinical decision support systems (MIVisBD)
Grantee: Agma Juci Machado Traina
Support type: Research Grants - Thematic Grants
FAPESP Process: 18/05714-5 - Mining frequent and high-dimensionality data streams: a case study in digital games
Grantee: Robson Leonardo Ferreira Cordeiro
Support type: Research Grants - Regular Research Grants