PostEra

Submission JAR-IMP-dd656357

Topic automatically created for discussing the designs at:
https://covid.postera.ai/covid/submissions/JAR-IMP-dd656357

The design rationale, copied + pasted with the line breaks.


These structures were generated automatically using a Graph-Based Genetic-Algorithm (GA), which attempted to build mimic molecules of a reference structure.

The reference structure was the Transition State structure of the L-Q-S peptides identified by Ramos-Guzmán et al. [1]. This structure was kindly supplied in personal correspondence with Iñaki Tuñón.

The generative part of the method is based on Jan Jensen’s Python GB-GA (https://github.com/jensengroup/GB-GA), but with a custom similarity metric that provides a smooth, continuous score, including chemical specificity, between 3D structures. The metric additionally included a score of vdW dispersion chemical specificity (evaluated at the best electrostatic match); these multiple objectives were scalarised by taking a weighted geometric mean (electrostatic^0.8 * dispersion^0.2). This generation of the code continuously updated a central file with the elite structures found so far, which was used to start each GA run. This enables a massively parallel run. The number of conformers checked with each scoring was reduced to 4, which appears to protect against ‘lucky’ matches of large aliphatic chains with heteroatoms.
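
As a minimal sketch of the scalarisation step only (not the actual scoring code), the weighted geometric mean could be written as below. The function name `combined_score` and the assumption that both component scores are non-negative are illustrative and not taken from the original code; only the 0.8 / 0.2 exponents come from the description above.

```python
def combined_score(electrostatic: float, dispersion: float,
                   w_elec: float = 0.8, w_disp: float = 0.2) -> float:
    """Weighted geometric mean of two similarity scores.

    Hypothetical helper: assumes both inputs are non-negative
    similarity scores, with 1.0 taken to mean a perfect match.
    """
    if electrostatic < 0 or dispersion < 0:
        raise ValueError("scores must be non-negative")
    # electrostatic^0.8 * dispersion^0.2, as in the design rationale
    return (electrostatic ** w_elec) * (dispersion ** w_disp)
```

A property of the geometric mean is that a zero in either component gives a combined score of zero, so a molecule cannot compensate for a complete electrostatic mismatch with a good dispersion match.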

Initial GA runs had small populations (100) and a medium number of generations (25) to try to evolve a broad range of high scoring structures (avoiding evolutionary niching). This generated ~100k high scoring compounds. GA runs with a larger population (500) and very few generations were then used to combine and refine this broad population. For this refinement stage, each proposed molecule was put through an RDKit ‘problematic group’ filter, and the number of matches was used to attenuate the score as exp(-n_matches). This appears to suppress overly complex heterocycles and weird heteroatom substitutions, and to limit molecular weight, while retaining the ergodicity of the Monte-Carlo algorithm.
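
The attenuation can be illustrated with a short sketch. Here `n_matches` is taken as an already computed count of filter hits; the name `attenuated_score` is hypothetical, and the RDKit filtering itself is not reproduced.

```python
import math


def attenuated_score(raw_score: float, n_matches: int) -> float:
    """Scale a similarity score by exp(-n_matches).

    Hypothetical helper: n_matches is the number of
    'problematic group' filter hits for the molecule.
    """
    if n_matches < 0:
        raise ValueError("n_matches must be non-negative")
    return raw_score * math.exp(-n_matches)
```

Because the penalty is multiplicative and never reaches exactly zero, a flagged molecule is strongly down-weighted rather than rejected outright, which is presumably what keeps every structure reachable by the search.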

No analysis of stability was made, nor was there any inspection by a trained chemist. The algorithm independently re-finds the same structures with slight variations. DataWarrior was used to cluster the top 1000 scoring molecules by molecular similarity, and a high scoring representative of each cluster was chosen for submission.
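
The selection step was done in DataWarrior, not in code. Purely as an illustration, picking one representative per cluster could look like the sketch below. It assumes the cluster assignments have been exported as (SMILES, cluster id, score) records and that the single top-scoring member of each cluster is taken; both the record layout and the name `pick_representatives` are assumptions.

```python
def pick_representatives(records):
    """Return {cluster_id: smiles} for the best-scoring member of each cluster.

    Hypothetical helper: records is an iterable of
    (smiles, cluster_id, score) tuples.
    """
    best = {}  # cluster_id -> (smiles, score)
    for smiles, cluster_id, score in records:
        # Keep the first record seen unless a strictly higher score appears
        if cluster_id not in best or score > best[cluster_id][1]:
            best[cluster_id] = (smiles, score)
    return {cid: smiles for cid, (smiles, _) in best.items()}
```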

This work was done in collaboration with Kuano Ltd, and used computer time on the Imperial College Research Computing Service, DOI: https://doi.org/10.14469/hpc/2232 .

[1] Ramos-Guzmán, C. A., Ruiz-Pernía, J. J., Tuñón, I. (2020). Unraveling the SARS-CoV-2 Main Protease Mechanism Using Multiscale DFT/MM Methods. https://doi.org/10.26434/chemrxiv.12501734.v1