CCS RCD Seminar – April 16, 2024

April 16, 2024
3:00 pm – Refreshments
3:30 pm – Presentation

Speakers:
(1) Chad Risko, Department of Chemistry, University of Kentucky

(2) Hunter Moseley, Department of Molecular and Cellular Biochemistry, University of Kentucky

Where:
Davis Marksbury Building – James F. Hardymon Theatre
(Zoom link: https://uky.zoom.us/s/84474671604)

Titles:
(1) Towards Machine-driven Discovery of Organic Materials

(2) A cautionary tale about properly vetting datasets used in supervised learning predicting metabolic pathway involvement

Abstracts:
(1) There is significant interest in the development of organic materials for applications that span new generations of electronic, optical, and energy generation and storage technologies. The chemical space to be explored for these materials, however, is tremendously large, and at the same time it can often be difficult to derive clear chemical building block-to-material structure–property relationships. As these hurdles have served as significant impediments to the commercial adoption of organic materials in these areas, there is growing interest in using computers and automation to aid in materials design and discovery. Here we will discuss recent advances in the development and use of high-throughput computational protocols, data infrastructures, and machine learning (ML) approaches that offer the potential to explore the wide and varied chemistries of organic materials.

(2) The mapping of metabolite-specific data to pathways within cellular metabolism is a major data analysis step needed for biochemical interpretation. A variety of machine learning approaches, particularly deep learning approaches, have been used to predict these metabolite-to-pathway mappings, utilizing a training dataset of known metabolite-to-pathway mappings. A few such training datasets have been derived from the Kyoto Encyclopedia of Genes and Genomes (KEGG). However, several previously published machine learning approaches utilized an erroneous KEGG-derived training dataset that used SMILES molecular representation strings (KEGG-SMILES dataset) and contained a sizable proportion (∼26%) of duplicate entries. The presence of so many duplicates taints the training and testing sets generated from k-fold cross-validation of the KEGG-SMILES dataset. Therefore, the k-fold cross-validation performance of the resulting machine learning models was grossly inflated by the erroneous presence of these duplicate entries. Here we describe and evaluate the KEGG-SMILES dataset so that others may avoid using it. We also identify the prior publications that utilized this erroneous KEGG-SMILES dataset so their machine learning results can be properly and critically evaluated. In addition, we demonstrate the reduction of model k-fold cross-validation performance after de-duplicating the KEGG-SMILES dataset. This is a cautionary tale about properly vetting previously published benchmark datasets before using them in machine learning approaches. We also present a new benchmark dataset for training and predicting metabolic pathway involvement as well as a new dataset and model design with superior performance. In addition, we present a new tool for troubleshooting and optimizing GPU-utilizing methods within a high-performance computing environment.
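To illustrate the leakage issue described in abstract (2), here is a minimal sketch, not the speakers' code. It assumes a dataset reduced to a plain list of SMILES strings; the toy entries and the helper names (k_fold_indices, leaked_fraction, deduplicate) are hypothetical. It measures how many test entries also appear in the training folds, before and after exact-string de-duplication.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Shuffle the item indices and deal them into k folds."""
    if not 1 < k <= n_items:
        raise ValueError("k must satisfy 1 < k <= n_items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def leaked_fraction(smiles, k=5, seed=0):
    """Mean fraction of test entries whose string also occurs in the training folds."""
    folds = k_fold_indices(len(smiles), k, seed)
    fractions = []
    for test_fold in folds:
        test_set = set(test_fold)
        # Every string outside the current test fold belongs to the training data.
        train_strings = {s for i, s in enumerate(smiles) if i not in test_set}
        leaked = sum(1 for i in test_fold if smiles[i] in train_strings)
        fractions.append(leaked / len(test_fold))
    return sum(fractions) / k


def deduplicate(smiles):
    """Drop exact duplicate strings, keeping the first occurrence of each."""
    return list(dict.fromkeys(smiles))


# Toy data (hypothetical, not taken from KEGG): two strings occur twice.
toy = ["CCO", "CC(=O)O", "c1ccccc1", "CCO", "CCN",
       "CC(=O)O", "CCC", "CCCC", "C=O", "CO"]

if __name__ == "__main__":
    print("leaked fraction, raw:", leaked_fraction(toy, k=5))
    print("leaked fraction, de-duplicated:", leaked_fraction(deduplicate(toy), k=4))
```

This sketch only catches exact string matches. The same molecule can be written as more than one SMILES string, so a real check would likely canonicalize the strings first with a cheminformatics toolkit.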

Slides:
(1) Chad Risko – 20240416-Risko-Seminar.pdf

(2) Hunter Moseley – 20240416-Moseley-Seminar.pdf
