(Reference retrieved automatically from Web of Science through information on FAPESP grant and its corresponding number as mentioned in the publication by the authors.)

Machine Learning Prediction of Nine Molecular Properties Based on the SMILES Representation of the QM9 Quantum-Chemistry Dataset

Author(s):
Pinheiro, Gabriel A. [1] ; Mucelini, Johnatan [2] ; Soares, Marinalva D. [3] ; Prati, Ronaldo C. [4] ; Da Silva, Juarez L. F. [2] ; Quiles, Marcos G. [5]
Total Authors: 6
Affiliation:
[1] Natl Inst Space Res, Associate Lab Comp & Appl Math, BR-12227010 Sao Jose Dos Campos, SP - Brazil
[2] Univ Sao Paulo, Sao Carlos Inst Chem, BR-13560970 Sao Carlos, SP - Brazil
[3] Fed Univ Sao Paulo UNIFESP, Inst Sci & Technol, BR-12247014 Sao Jose Dos Campos, SP - Brazil
[4] Fed Univ ABC, Ctr Math Computat & Cognit, BR-09210580 Santo Andre, SP - Brazil
[5] Univ Fed Sao Paulo, Inst Sci & Technol, BR-12247014 Sao Jose Dos Campos, SP - Brazil
Total Affiliations: 5
Document type: Journal article
Source: Journal of Physical Chemistry A; v. 124, n. 47, p. 9854-9866, NOV 25 2020.
Web of Science Citations: 1
Abstract

Machine learning (ML) models can potentially accelerate the discovery of tailored materials by learning a function that maps chemical compounds into their respective target properties. In this realm, a crucial step is encoding the molecular systems into the ML model, in which the molecular representation plays a key role. Most representations are based on atomic coordinates (structure); however, this can increase the computational cost of ML training and prediction. Herein, we investigate the impact of choosing coordinate-free descriptors based on the Simplified Molecular Input Line Entry System (SMILES) representation, which can substantially reduce the computational cost of ML predictions. To this end, we evaluate the prediction performance of a feed-forward neural network (FNN) model over five feature selection methods and nine ground-state properties (including energetic, electronic, and thermodynamic properties) from a public data set composed of approximately 130k organic molecules. Our best results reached a mean absolute error, close to chemical accuracy, of approximately 0.05 eV for the atomization energies (internal energy at 0 K, internal energy at 298.15 K, enthalpy at 298.15 K, and free energy at 298.15 K). Moreover, for the atomization energies, the out-of-sample error was nine times lower than that of the same FNN model trained with the Coulomb matrix, a traditional coordinate-based descriptor. Furthermore, our results showed how the model's accuracy is limited by employing such a low-computational-cost representation, which carries less information about the molecular structure than most state-of-the-art methods. (AU)
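As a rough illustration of what a coordinate-free descriptor can look like, the sketch below turns a SMILES string into a fixed-length vector of token counts and defines the mean absolute error used to report accuracy. It is not the authors' feature set or model: the tokenizer, the `VOCAB` list, and the function names are assumptions chosen for illustration only.

```python
import re
from collections import Counter

# Hypothetical tokenizer: bracket atoms, two-letter halogens, then any single character.
_TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|.")

# Illustrative vocabulary (an assumption, not the descriptors used in the paper).
VOCAB = ["C", "N", "O", "F", "c", "n", "o", "=", "#", "(", ")", "1", "2"]


def smiles_token_counts(smiles):
    """Count SMILES tokens; no atomic coordinates are required."""
    return Counter(_TOKEN.findall(smiles))


def featurize(smiles, vocabulary=VOCAB):
    """Map a SMILES string to a fixed-length count vector over `vocabulary`."""
    counts = smiles_token_counts(smiles)
    return [counts.get(token, 0) for token in vocabulary]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and predicted values."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Acetic acid: two C, two O, one double bond, one branch.
    print(featurize("CC(=O)O"))
```

Vectors of this kind could serve as inputs to a feed-forward network, and `mean_absolute_error` would then compare its predictions with the reference property values.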

FAPESP's process: 17/11631-2 - CINE: computational materials design based on atomistic simulations, meso-scale, multi-physics, and artificial intelligence for energy applications
Grantee: Juarez Lopes Ferreira da Silva
Support Opportunities: Research Grants - Research Centers in Engineering Program