Exploring Alternative Clustering Algorithms Beyond k-means: The Example of Molecular Structures used for Hydrogen Production

Grant number: 25/10719-0
Support Opportunities: Scholarships in Brazil - Scientific Initiation
Start date: September 01, 2025
End date: August 31, 2026
Field of knowledge: Physical Sciences and Mathematics - Computer Science - Computing Methodologies and Techniques
Principal Investigator: Juarez Lopes Ferreira da Silva
Grantee: Marcos Vinicius Cota Rodrigues da Trindade
Host Institution: Instituto de Química de São Carlos (IQSC). Universidade de São Paulo (USP). São Carlos, SP, Brazil
Company: Universidade de São Paulo (USP). Instituto de Química de São Carlos (IQSC)
Associated research grant: 17/11631-2 - CINE: computational materials design based on atomistic simulations, meso-scale, multi-physics, and artificial intelligence for energy applications, AP.PCPE

Abstract

Data clustering is a cornerstone of data analysis in computational chemistry, enabling the identification of structural and energetic patterns within vast molecular datasets. Traditional algorithms such as k-means, while widely adopted for their simplicity, face significant limitations when applied to high-dimensional molecular systems, including the assumption of spherical clusters, sensitivity to initialization, and poor scalability. This project addresses these challenges through a systematic exploration of advanced clustering algorithms (such as Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), Ordering Points To Identify the Clustering Structure (OPTICS), Gaussian Mixture Models, Spectral Clustering, and Dynamical Particle-Based Clustering) to enhance the accuracy, efficiency, and scalability of data analysis for molecular structures used in hydrogen production.

The study focuses on optimizing the computational tool developed by Prof. Marcos G. Quiles, which currently employs k-means to cluster molecular configurations based on Coulomb matrix eigenvalues, total energy, bond lengths, and magnetic moments. By integrating alternative algorithms, we aim to overcome bottlenecks in handling large-scale datasets (e.g., millions of molecular configurations) and complex geometries, such as the non-convex or overlapping clusters prevalent in molecular dynamics simulations.

Key methodologies include a comprehensive review of the literature on clustering techniques; empirical evaluation of algorithmic performance using metrics such as silhouette scores, Dunn indices, and adjusted Rand indices; and computational optimizations such as parallelization, dimensionality reduction, and hyperparameter tuning. The project emphasizes computational efficiency, leveraging the QTNano group's high-performance computing infrastructure (2000+ cores) to test scalability.
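
As an illustration of one descriptor named in the abstract, the following is a minimal sketch of Coulomb matrix eigenvalues computed with NumPy. It is not the tool developed by Prof. Quiles. The function names, the use of atomic units, and the ordering of eigenvalues by decreasing absolute value are assumptions based on the commonly used Coulomb matrix definition (diagonal 0.5 Z^2.4, off-diagonal Z_i Z_j / |R_i - R_j|).

```python
import numpy as np


def coulomb_matrix(charges, positions):
    """Build the Coulomb matrix for one molecular configuration.

    charges:   nuclear charges Z_i, shape (n_atoms,)
    positions: Cartesian coordinates, shape (n_atoms, 3), assumed in bohr
    """
    z = np.asarray(charges, dtype=float)
    r = np.asarray(positions, dtype=float)
    # Pairwise interatomic distances |R_i - R_j|
    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    # Placeholder on the diagonal to avoid dividing by zero; overwritten below
    np.fill_diagonal(dist, 1.0)
    matrix = np.outer(z, z) / dist
    np.fill_diagonal(matrix, 0.5 * z ** 2.4)
    return matrix


def coulomb_eigenvalues(charges, positions):
    """Eigenvalues of the Coulomb matrix, sorted by decreasing magnitude.

    The spectrum does not change when atoms are reordered, which is why it
    can serve as a fixed-order feature vector for clustering.
    """
    eig = np.linalg.eigvalsh(coulomb_matrix(charges, positions))
    return eig[np.argsort(-np.abs(eig))]


# Hypothetical example: an H2 molecule with a 1.4 bohr bond length
h2_features = coulomb_eigenvalues([1, 1], [[0.0, 0.0, 0.0], [1.4, 0.0, 0.0]])
```

In a pipeline of the kind the abstract describes, such a vector would presumably be concatenated with total energy, bond lengths, and magnetic moments before clustering; how the actual tool combines them is not stated here.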
Performance will be compared with existing k-means implementations, with a focus on reducing runtime and memory usage while preserving cluster quality. Furthermore, the integration of density-based and hierarchical methods aims to improve robustness against noise and adaptability to irregular cluster shapes, addressing critical gaps in current unsupervised approaches.

The expected results include an enhanced computational tool capable of processing high-dimensional molecular data with greater precision, enabling researchers to identify representative configurations more efficiently. This advancement will directly support the QTNano group's mission to accelerate discoveries in quantum chemistry, particularly in energy materials and sustainable technologies. By making the tool accessible to the broader scientific community, the project seeks to foster collaborative innovation, streamline exploratory analyses, and contribute to the development of data-driven methodologies in computational chemistry.

The enhanced tool will provide deep insights into the structural organization of large molecular datasets and will directly contribute to optimizing hydrogen production processes by identifying more stable and reactive configurations. Its availability is expected to accelerate collaboration among researchers, driving discoveries in quantum chemistry and sustainable energy technologies, establishing robust practices in data-driven molecular science, and paving the way for future innovations in advanced materials development. (AU)
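
To make the notion of "preserving cluster quality" concrete, here is a minimal pure-Python sketch of two of the evaluation metrics named in the abstract: the mean silhouette score (an internal measure that needs no reference labels) and the adjusted Rand index (an agreement measure between two labelings, for example k-means versus an alternative algorithm). The function names and the toy data are hypothetical, and a production pipeline would more likely rely on an optimized library implementation.

```python
from collections import Counter
from math import comb, dist


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two labelings of the same points."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two points: treat as trivially identical
    # Pairs of points placed together in both, in A only, and in B only
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / total_pairs
    maximum = 0.5 * (pairs_a + pairs_b)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single cluster
    return (together - expected) / (maximum - expected)


def silhouette_score(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two well-separated pairs of 2D feature vectors
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_labels = [0, 0, 1, 1]
bad_labels = [0, 1, 0, 1]
```

The adjusted Rand index is invariant to how cluster labels are numbered, so it can compare the output of two different algorithms directly. The silhouette score, by contrast, favors compact convex clusters and may understate the quality of the irregular, density-based clusters the project targets, so it is best read alongside other measures.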
