OEDocking docking/scoring/selection of submitted compounds (model #02084)

dmoustakas · May 15, 2020, 11:41pm

This post contains results of a docking model that was optimized using both the crystallographic fragment screening data, as well as the biochemical screening data published (5/7/2020) for ~120 compounds. This structure of SARS-CoV-2 Mpro used in this model was selected to optimize early recovery of known actives. Summary of the validation data is shown at the end of the post.

Overview of selected hits

2020-05-15_ranked_submissions_model_02084_FORMATTED.sdf (802.3 KB)
submission_02084.zip (158.5 KB)

This file contains a ranked list of 200 compounds, sorted by docking score (lower is better), and formatted in Fragalysis v1.2 format. The file submission_02084.zip contains the docking receptor in Fragalysis format. The SD file contains 3D coordinates, posed into the the following docking receptor, uploaded in the following links in both PDB format and in OEDocking receptor grid format:

DOCKING_MODEL_02084_COMPLEX.pdb (730.0 KB)
DOCKING_MODEL_02084_RECEPTOR.pdb (744.5 KB)
DOCKING_MODEL_02084_OEDOCKING.oeb.gz.zip (850.7 KB)

The complete set of docked poses (3649 molecules), containing 3D docked coordinates and a docking score (“Fred Chemgauss4 score”) is contained in the following file:

ALL_DOCKED_POSES_MODEL_02084.sdf.zip (2.9 MB)

@JohnChodera
All poses included here for the benefit of retrospective analysis and comparison of performance across multiple methods. It’s worth noting that this total set of all posed compounds contains both non-covalent and covalent designs.

Receptor Preparation

An X-ray crystal structure of SARS-CoV-2 Mpro complexed with a covalent inhibitor (PDB 6LU7) was used as a starting point for docking model development. The N3 ligand occupies a large volume of the active site, in the hopes of yielding a docking model with less bias for any particular pocket in the active site, since the goal of many of the submitted designs was to link/merge fragments binding to different active site pockets. The 6LU7 PDB model contains the covalent adduct of the N3 inhibitor to the catalytic Cys145 residue, so the non-covalent complex was modeled using the MOE software to break the ligand-Cys bond, generate the pre-reaction ligand species, assign protonation states and partial charges, and minimize any clashes. This structure contains the deprotonated Cys145 thiolate species.

A molecular dynamics simulation of the non-covalent complex model was performed using the Desmond software, using the OPLS3 force field and an SPC water box. A 1 us trajectory was used to generate an ensemble of 5000 receptor conformations, from which the final docking model was selected using an exhaustive docking method. OEDocking/FRED receptor grids were prepared for each of the 5000 structures using the Spruce toolkit with default values, and a training set of ligands were docked into each structure. The structures that yielded the highest early enrichment factor were selected for docking. Details of the optimization process are described in the Retrospective Analysis section.

Ligand Preparation & Docking

Structures of all submitted proposed designs were downloaded, and were filtered to remove any structures containing greater than 7 rotatable bonds or MW > 600, leaving ~3700 compounds. The OpenEye toolkits were used to assign the single most likely protonation state at neutral pH for all of the compounds. The Omega2 program, running with the following parameters:

-ewindow 10.0
-maxconfs 50000
-strictstereo false
-flipper true

was used to enumerate unspecified stereocenters, as well as conformations, yielding a total of ~8.6K unique species and ~4.3M conformers.

The Fred program (OEDocking) was used to dock the multiconformer library against the receptor grid, using the following settings:

-dock_resolution standard
-hitlist_size 0

Docked poses were written out in SD format, containing a field with the Chemgauss4 score value for each pose. For each compound, the top-scoring of the enumerated stereoisomers was retained, yielding a reasonably normal distribution of docking scores:

Since biochemical Mpro inhibition data had been published on 5/7/2020 for a number of non-covalent compounds contained in the overall set, I extracted the 12 actives from that preliminary data set and identified their docking scores in the following table:

Reassuringly, 6 of the 12 actives, representing 3 distinct chemical series, were found in the top-scoring tail of scores < -9.5 (shown in blue in the histogram above). Further increasing confidence in this model is the presence of a number of compounds scoring significantly better than the known references, which hopefully should increase the likelihood of identifying more hits of equal or greater binding affinity.

Hitlist Selection

The score distribution for the entire library had a mean value of -8.1, with a standard deviation of 1.2. 450 compounds with scores < -9.5 (greater than 1 SD below the mean) were selected (shown in blue in the histogram). Since the entire set of submitted compound designs contained both non-covalent compounds and those with covalent warheads, the compounds with warheads were removed, leaving 420 compounds in the set.

Visual analysis of this set revealed a large number of highly similar compound clusters, so a greedy clustering algorithm was used to filter out lower ranked near neighbors of higher ranked hits. The OEGraphSim toolkit was used to compute the tanimoto similarity (based on path fingerprints) for the first compound in the list to all lower-ranked compounds, filtering out all compounds with similarity greater than 0.7. The algorithm iterated in a greedy fashion through the ranked hitlist until all near neighbors were removed, leaving ~220 compounds, from which the top-ranked 200 compounds were saved and formatted into the Fragalysis V1.2 format.

Retrospective Analysis of Biochemical Data Set

The training set of reference ligands used for this model optimization were the non-covalent ligands assayed by Haim Barr at the Weizmann institute and published recently:

https://covid.postera.ai/covid/activity_data

Non-covalent ligands with < 8 rotatable bonds and < 600 MW were included in the docking (103 compounds):

2020-05-07_tested_ligands.sdf (232.4 KB)
2020-05-07_tested_ligands_CONFORMERS.oeb.gz.zip (1.1 MB)

The performance of each of the 5000 receptor conformation dockings was evaluated, using early enrichment as a metric to identify conformations that could be promising as docking models.

The best of these models yields an ROC curve shown below:
ROC_DOCKING_MODEL_02084_ref_ligs

with a favorable early enrichment profile. By comparison, docking into the 6LU7 X-ray crystal structure yields the following ROC curve:

ROC_XRAY_MODEL_6LU7_ref_ligs

While not dramatically different, comparison of these two plots shows that the optimization yielded a docking model with double the rate of early enrichment than the X-ray structure provides.

This procedure yielded a receptor conformation with significantly improved early enrichment for this data set. Due to the similarity between this first data set and the remaining submitted designs, this receptor model was considered to be ideal for virtual screening in this context. Since significant early enrichment was found in screening the top 10% of the retrospective data set, I’m hopeful that the prospective docking selections, drawn from the top 10% of the submitted compounds (~400 of ~4000) should yield a favorable hit profile.

It’s worth noting here that the actives recovered in the docking from the first data set dock with poses that appear to be consistent with the expected binding modes based on the fragment crystal structures. For example, compound TRY-UNI-714a760b-22 binds with the following predicted mode:

Which is similar to the binding of the nitrile-containing compounds x0305 and x1249, which also position the nitrile in the hydrophobic P2 pocket.

In general, this docking model/receptor conformation appears to favor compounds spanning the P2 -> P1 pockets, as exemplified by the Diamond structures x0434, x0678, x0967, x1093, x1249. This appears to be driven by the conformations of a number of key residues, including Asn142, Gln189, Met49, Cys44.

Therefore, this docking model/receptor structure may be ideal for designing/expanding on the these sorts of fragment structures.

alphalee · May 16, 2020, 7:27am

@dmoustakas this seems different to the dock scores that you sent @mc-robinson a few days ago? The ranking seems to changed, and compounds on the previous list (e.g. the top ranked one WAR-XCH-bdd24732-34) is not the top compound in the new list…which list should we use?

dmoustakas · May 16, 2020, 3:53pm

Hi @alphalee, you’re right that this is a completely new docking against a different receptor conformation. The docking model used in this docking (# 02084) was one I selected based on its high performance recovering the actives in the first data set of ~120 compounds released last week.

So this is the set to use please!

There are a few other differences in addition to the different docking receptor. Following Mark’s suggestion in last week’s meeting, I selected the top 200 compounds (which are in a separate SD file) from the complete set of all docked compounds - previously I had just submitted the total list of docked poses. In selecting these top-docked compounds, I carefully filtered out any with covalent warheads (e.g. acrylamides, vinyl sulfones, acid chlorides) which I didn’t do in the previous posts. Also, I applied a greedy clustering script to filter out near neighbors, so that the hit list wouldn’t be dominated with large clusters of extremely similar compounds. I explain that methodology in the Hitlist Selection section above.

So for these 3 reasons - docking into a different receptor conformation, filtering covalent warhead compounds out, filtering out near neighbors of higher ranked hits - the hit list will be quite different.

In case the post above was confusing - there were a lot of files included to try to document all the features of the model, this is the file to look at:

2020-05-15_ranked_submissions_model_02084_FORMATTED.sdf (802.3 KB)

It contains the top-ranked compounds, with the covalent warheads and near-neighbors filtered out.

dgomes · June 19, 2020, 2:20pm

@dmoustakas thank you and colleagues for sharing the results in such informative and reproducible way.

It remained unclear to me if DOCKING_MODEL_02084_RECEPTOR.pdb was the actual structure with the most favorable early enrichment among all 5k, also which monomer binding site was used for docking.

Could you coment a about how the clustering procedure was done to end up with 5k structures and will you kindly share the AUC from it’s ROC curves ?

dmoustakas · June 19, 2020, 7:20pm

Hi @dgomes, thanks for your question, and sorry the post was a bit wordy. There are 2 structure files as well as the OEDocking grid made from those files:

DOCKING_MODEL_02084_RECEPTOR.pdb
DOCKING_MODEL_02084_COMPLEX.pdb

both of those files contain the identical receptor structure, which had yielded the maximum AUC in my analysis. the _COMPLEX.pdb file contains the cognate ligand in one of the active sites, this was the same site that the docking grid was built into.

So you could use the _COMPLEX.pdb file and simply delete the N3 ligand atoms, or you could just open both in a viewer like pymol and use the position of the N3 ligand to identify the docking site.

Alternatively, if you’re using OEDocking/FRED, you could simply use the .oeb.gz receptor grid that’s attached as well.

Cheers,
Demetri

dgomes · August 14, 2020, 2:20pm

I’ve done the VS of the avaible compounds with activity data, and rescored with a number of scoring functions. However, while building my ROC curves I couldn’t find the metric, such as f_ic50, r_ic50 and activity threshold you used in this experiment to classify then as Actives and Inactives.

For consistency with this results could you share them ?

Best,
Diego.

dmoustakas · August 14, 2020, 6:57pm

Hi @dgomes, the criteria for defining actives in this benchmark work was >50% inhibition at 50 uM:

In some subsequent work I’ve done more recently (discussed briefly at the last comp chem meeting) I used a more stringent definition of actives based on IC50s, b/c there were more compounds with full curves measured at that later time

Does that help?