Optimizing the Computation of the Normalized Compression Distance for Unsupervised Learning

Grant number: 25/12086-4
Support Opportunities: Scholarships in Brazil - Scientific Initiation
Start date: October 01, 2025
End date: September 30, 2026
Field of knowledge: Physical Sciences and Mathematics - Computer Science - Computer Systems
Principal Investigator: Paulo Sérgio Lopes de Souza
Grantee: João Pedro Hamata
Host Institution: Instituto de Ciências Matemáticas e de Computação (ICMC). Universidade de São Paulo (USP). São Carlos, SP, Brazil
Associated research grant: 19/26702-8 - Trends on high performance computing, from resource management to new computer architectures, AP.TEM

Abstract

Unsupervised analysis of large volumes of data poses significant computational challenges due to high algorithmic complexity and demanding preprocessing requirements. Methodologies such as DAMICORE are based on the Normalized Compression Distance (NCD), a metric derived from Kolmogorov complexity. They require no prior feature extraction and are applicable to heterogeneous data. However, the first and often most computationally expensive stage of this pipeline, the computation of the NCD distance matrix for all pairs of objects, has quadratic complexity in the number of objects and becomes a bottleneck in Big Data scenarios. This project aims to address this limitation in the context of the DAMICORE methodology by applying high-performance computing (HPC) techniques. It proposes a survey of existing approaches and, based on that survey, the analysis, design, implementation, and evaluation of optimized parallel algorithms for NCD matrix computation that exploit contemporary heterogeneous computing architectures: multicore CPUs, SIMD extensions, GPU accelerators, and computer clusters. The goal is to substantially reduce the cost of computing the NCD metric in the DAMICORE methodology, supported by quantitative analyses of speedup and scalability across different platforms, and to deliver an optimized, documented, and reusable software module for parallel NCD computation. The main scientific contribution lies in making NCD-based methods, such as the pipeline employed in DAMICORE, practical for domains that handle large-scale data, fostering advances in fields such as bioinformatics, complex networks, and natural language processing.
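
To make the bottleneck concrete, the following is a minimal sketch of the all-pairs NCD matrix, using the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. It is an illustration only, not the project's or DAMICORE's implementation. The choice of zlib as the compressor, the function names, the thread pool, and the `workers` parameter are all assumptions made for this example. The sketch also assumes the matrix can be mirrored across the diagonal and that the diagonal can be set to zero, although with a real compressor C(xy) and C(yx) may differ slightly and NCD(x, x) is small but not exactly zero.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def compressed_size(data: bytes) -> int:
    """Approximate C(x) by the length of the zlib-compressed bytes (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes, cx: int, cy: int) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_matrix(objects: list[bytes], workers: int = 4) -> list[list[float]]:
    """Compute the pairwise NCD matrix for a list of byte strings."""
    n = len(objects)
    # C(x) is needed by every pair, so compute it once per object.
    sizes = [compressed_size(o) for o in objects]
    matrix = [[0.0] * n for _ in range(n)]  # diagonal fixed at 0 by convention
    # Only the n * (n - 1) / 2 pairs above the diagonal are evaluated.
    pairs = list(combinations(range(n), 2))

    def work(pair: tuple[int, int]) -> float:
        i, j = pair
        return ncd(objects[i], objects[j], sizes[i], sizes[j])

    # Pairs are independent, so they can be distributed across workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for (i, j), d in zip(pairs, pool.map(work, pairs)):
            matrix[i][j] = matrix[j][i] = d  # mirrored: symmetry is assumed
    return matrix


if __name__ == "__main__":
    samples = [b"abc" * 200, b"abc" * 200 + b"x", bytes(range(256)) * 3]
    for row in ncd_matrix(samples):
        print(["%.3f" % value for value in row])
```

Each pair requires one extra compression of the concatenated objects, so the number of compressor calls grows quadratically with the number of objects. Because the pairs are independent of one another, the work divides naturally across cores, accelerators, or cluster nodes, which is the property the project sets out to exploit.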
