Scoring fragment merges based on greedy set covering

JohnChodera · May 11, 2020, 6:31am

I’ve taken another stab at the fragment merge scoring problem by implementing a simple script that greedily identifies a small fragment set providing maximum coverage of the docked molecule’s heavy atoms. The idea is that, for each hybrid docked molecule, we first identify the fragment with the largest number of heavy atoms within 0.6 Å of the docked molecule’s heavy atoms, then the fragment with the largest number of heavy atoms within 0.6 Å of any yet-unmatched atoms, and so on. We require at least three heavy atoms from each fragment to consider including it in the set of unique overlapping fragments.
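A minimal sketch of the greedy selection described above, not the actual script. It assumes heavy-atom coordinates are already available as plain (x, y, z) tuples, that the 0.6 Å cutoff is inclusive, and that the three-atom minimum applies to newly covered docked-molecule atoms (the post leaves this open). The function name and arguments are hypothetical.

```python
from math import dist


def greedy_fragment_cover(ligand_xyz, fragments, cutoff=0.6, min_atoms=3):
    """Greedily pick fragments that cover the most not-yet-matched ligand atoms.

    ligand_xyz: list of (x, y, z) heavy-atom coordinates of the docked molecule.
    fragments:  dict mapping fragment name -> list of (x, y, z) heavy atoms.
    Returns (chosen fragment names in pick order, set of covered ligand indices).
    """
    # For each fragment, the ligand atoms lying within `cutoff` of any fragment atom.
    matches = {
        name: {
            i
            for i, atom in enumerate(ligand_xyz)
            if any(dist(atom, frag_atom) <= cutoff for frag_atom in frag_xyz)
        }
        for name, frag_xyz in fragments.items()
    }

    chosen, covered = [], set()
    while True:
        best_name, best_new = None, set()
        for name, atoms in matches.items():
            new_atoms = atoms - covered  # only yet-unmatched atoms count
            if len(new_atoms) > len(best_new):
                best_name, best_new = name, new_atoms
        # Assumption: stop once the best fragment adds fewer than `min_atoms` atoms.
        if best_name is None or len(best_new) < min_atoms:
            break
        chosen.append(best_name)
        covered |= best_new
    return chosen, covered
```

Ties are broken by dictionary order here, which is an arbitrary choice; the number of unique overlapping fragments for a compound would then be `len(chosen)`.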

The resulting list is sorted by the number of unique overlapping fragments, and some pleasing fragment merges from the Moonshot compounds float to the top (green is always the docked Moonshot compound, other colors are fragments):

The resulting SDF file, which lists the overlapping_fragments for each compound, is here:

FoldingAtHome/covid-moonshot/blob/master/moonshot-submissions/covid_submissions_all_info-docked-overlap.sdf

MAK-UNK-9e4a73aa-2
  -OEChem-05102022583D

 26 29  0     0  0  0  0  0  0999 V2000
   12.9795    0.1141   24.5878 C   0  0  0  0  0  0  0  0  0  0  0  0
   13.0731   -0.6943   23.4550 C   0  0  0  0  0  0  0  0  0  0  0  0
   11.7275    0.4793   25.0827 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.6163    1.6035   19.1515 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.1776    0.8701   20.1909 C   0  0  0  0  0  0  0  0  0  0  0  0
   11.9146   -1.1375   22.8169 C   0  0  0  0  0  0  0  0  0  0  0  0
   10.5690    0.0361   24.4446 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.1324   -4.1589   22.8372 C   0  0  0  0  0  0  0  0  0  0  0  0
    9.4397   -3.2648   24.0983 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.0472    1.3357   17.8622 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.5017   -0.2871   18.5897 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.1391   -0.0951   19.9131 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.7246   -2.9459   22.4522 C   0  0  0  0  0  0  0  0  0  0  0  0
   10.6626   -0.7723   23.3117 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.8885   -2.4700   23.0702 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.6989   -4.9202   23.8540 C   0  0  0  0  0  0  0  0  0  0  0  0

This file has been truncated.

A corresponding CSV file is here:

FoldingAtHome/covid-moonshot/blob/master/moonshot-submissions/covid_submissions_all_info-docked-overlap.csv

SMILES,TITLE,creator,fragments,link,real_space,SCR,BB,extended_real_space,in_molport_or_mcule,in_ultimate_mcule,in_emolecules,covalent_frag,covalent_warhead,acrylamide,acrylamide_adduct,chloroacetamide,chloroacetamide_adduct,vinylsulfonamide,vinylsulfonamide_adduct,nitrile,nitrile_adduct,MW,cLogP,HBD,HBA,TPSA,num_criterion_violations,BMS,Dundee,Glaxo,Inpharmatica,LINT,MLSMR,PAINS,SureChEMBL,PostEra,ORDERED,MADE,ASSAYED,Hybrid2,docked_fragment,Mpro-x1418_dock,site,number_of_overlapping_fragments,overlapping_fragments,overlap_score,volume
c1ccc(cc1)n2c3cc(c(cc3c(=O)c(c2[O-])c4cccnc4)F)Cl,MAK-UNK-9e4a73aa-2,Maksym Voznyy,x1418,https://covid.postera.ai/covid/submissions/MAK-UNK-9e4a73aa,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,366.779,4.5189,0,3,50.27,0,PASS,beta-keto/anhydride,PASS,PASS,PASS,"Ketone, Dye 11",PASS,PASS,PASS,FALSE,FALSE,FALSE,-11.881256,x1418,1.206534,active-covalent,3,"x0434,x0678,x0830",3.2081238078931777,271.986083984375
Cc1ccncc1n2c(=O)ccc3c2CCCN3CC(=[NH2+])N,KIM-UNI-60f168f5-7,"Kim Tai Tran, University of Copenhagen","x0107,x0991",https://covid.postera.ai/covid/submissions/KIM-UNI-60f168f5,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,297.362,1.22949,2,5,88,0,PASS,"imine, imine",PASS,PASS,acyclic C=N-H,Imine 3,PASS,PASS,PASS,FALSE,FALSE,FALSE,-11.654112,x0107,,active-noncovalent,3,"x0107,x1412,x1392",4.753475003640889,232.8155059814453
c1ccc(cc1)n2c3cc(c(cc3c(=O)n(c2=O)c4cnccn4)F)Cl,MAK-UNK-9e4a73aa-14,Maksym Voznyy,x1418,https://covid.postera.ai/covid/submissions/MAK-UNK-9e4a73aa,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,368.755,2.7241,0,6,69.78,0,PASS,PASS,PASS,PASS,PASS,PASS,PASS,PASS,PASS,FALSE,FALSE,FALSE,-10.460650,x0678,2.716276,active-noncovalent,3,"x0678,x1412,x1392",5.520980356266177,266.688720703125
Cc1ccncc1N(C=C)[C@H]([C@@H](C)[C@@H]2CN=Cc3c2ccc(c3)OC)O,AUS-WAB-916db9c0-1,"Austin D. Chivington, Wabash College","x0107,x1077,x1374",https://covid.postera.ai/covid/submissions/AUS-WAB-916db9c0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,351.45,3.51932,1,5,57.95,0,non_ring_acetal,het-C-het not in ring,PASS,Filter10_Terminal_vinyl,PASS,PASS,PASS,PASS,PASS,FALSE,FALSE,FALSE,-9.516450,x0678,,active-noncovalent,3,"x0434,x0831,x0678",3.4465724450465522,284.1953125
c1ccc2c(c1)ncc(n2)/C=C/C(=O)c3cccc(c3)O,DRV-DNY-ae159ed1-12,"Dr. Vidya Desai, Dnyanprassarak Mandals College and Research Centre",x1249,https://covid.postera.ai/covid/submissions/DRV-DNY-ae159ed1,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,276.295,3.2315,1,4,63.08,0,PASS,PASS,PASS,Filter44_michael_acceptor2,PASS,"Ketone, Dye 9, vinyl michael acceptor1",PASS,PASS,PASS,FALSE,FALSE,FALSE,-9.243208,x0678,,active-noncovalent,3,"x0434,x0678,x0830",2.865146782341253,220.27542114257812
CCc1cccc(c1NC(=O)[C@@H](c2cccnc2)N(c3cc([nH]n3)C(C)(C)C)C(=O)C=C)C,LON-WEI-b8d98729-20,"London Lab, Weizmann Institute of Science",x0072,https://covid.postera.ai/covid/submissions/LON-WEI-b8d98729,FALSE,Z4439011584,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,445.567,4.87202,2,4,90.98,0,PASS,PASS,PASS,PASS,PASS,PASS,PASS,PASS,PASS,FALSE,TRUE,TRUE,-8.018552,x0831,,active-covalent,3,"x0967,x0759,x0874",3.4143355646305182,360.1330871582031
C[C@H](Cc1cc(n(c1)S(=O)(=O)c2cccnc2)c3ccccc3F)N4CCCc5c4cc(cc5)S(=O)(=O)N,MAK-UNK-9a6be56d-8,Maksym Voznyy,x0195,https://covid.postera.ai/covid/submissions/MAK-UNK-9a6be56d,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,554.669,3.9574,1,7,115.36,1,PASS,PASS,PASS,PASS,PASS,Dye 22,PASS,PASS,PASS,FALSE,FALSE,FALSE,-7.563654,x0434,,active-noncovalent,3,"x0967,x1418,x0678",6.29764155534655,408.8782958984375
Cc1ccnc(c1)N(Cc2nc(nc(n2)F)OC)C(=O)Nc3c(c(cnc3C=C)C)CCNC(=O)C,AGN-NEW-9d245c51-1,"Agnieszka K. Bronowska, Newcastle University","x0434,x0540",https://covid.postera.ai/covid/submissions/AGN-NEW-9d245c51,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,494.531,2.98644,2,8,135.12,0,PASS,PASS,PASS,PASS,PASS,"4-halopyridine, 2-halopyridine",PASS,PASS,PASS,FALSE,FALSE,FALSE,-6.934713,x0678,,active-noncovalent,3,"x0678,x0387,x0731",6.525864824403633,379.82623291015625
CC(C)(C)c1ccc(cc1)N([C@H](c2cccnc2)C(=O)Nc3ccc4c(c3)OCO4)C(=O)C=C,LON-WEI-adc59df6-43,"London Lab, Weizmann Institute of Science",x0072,https://covid.postera.ai/covid/submissions/LON-WEI-adc59df6,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,457.53,5.0068,1,5,80.76,1,PASS,PASS,PASS,PASS,PASS,PASS,PASS,PASS,PASS,FALSE,FALSE,FALSE,-5.399409,x1418,,active-covalent,3,"x1093,x1375,x1392",4.27814829570815,361.8192443847656
CC(C)(C)OC(=O)Nc1ccc([nH]c1=O)c2ccccc2CCC(=O)N[C@@H](C[C@@H]3CCNC3=O

This file has been truncated.

These files also contain a score of the (fractional) number of heavy atoms that overlap with any fragment heavy atoms (within 0.6 Å), as well as the total ligand volume, both of which may be useful.
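For anyone who wants to re-rank the CSV themselves, here is a small sketch using only the standard library. The column names (`number_of_overlapping_fragments`, `overlapping_fragments`, `overlap_score`) come from the CSV header; the function name, the tie-break on `overlap_score`, and treating empty cells as zero are assumptions for illustration.

```python
import csv


def rank_by_fragment_overlap(csv_file):
    """Read the overlap CSV from an open file object and rank the compounds.

    Sorts by number of overlapping fragments (descending), then by
    overlap_score (descending) as an assumed tie-break.
    """
    rows = list(csv.DictReader(csv_file))
    for row in rows:
        # Empty cells are treated as zero / no fragments (an assumption).
        row["number_of_overlapping_fragments"] = int(
            row["number_of_overlapping_fragments"] or 0
        )
        row["overlap_score"] = float(row["overlap_score"] or 0.0)
        row["overlapping_fragments"] = [
            frag for frag in row["overlapping_fragments"].split(",") if frag
        ]
    rows.sort(
        key=lambda r: (r["number_of_overlapping_fragments"], r["overlap_score"]),
        reverse=True,
    )
    return rows
```

Usage would look like `with open(path, newline="") as fh: ranked = rank_by_fragment_overlap(fh)`, where `path` points at a local copy of the CSV.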

I’ll see if these can be reformatted for Fragalysis upload, and will also investigate whether we can score very large synthetically accessible libraries by combining this with OE FastROCS searches.

Tagging @Waztom since @frankvondelft suggested he may be able to continue to refine this approach to prioritize designs that produce pleasing fragment merges. Also tagging @andrea who may have other good ideas about how to score designs for overlap with multiple fragments.

matteoferla · May 11, 2020, 1:43pm

About the scoring, here are some thoughts. For Fragmenstein, which also does fragment merging based on position (probably very similarly*), the in-protein minimised followup molecule was scored simply with the unaligned RMSD calculated from the concatenated set of contributing atom pairs between each hit and the followup (github.com/matteoferla/Fragmenstein#mrmsd), meaning that atoms from the followup that mapped to different hits were scored multiple times. There was a discussion of using the RMSD of the merged scaffold only (without the novel bits in the followup) vs. the placed followup, but @frankvondelft rightfully pointed out that an atom whose position is dictated by multiple hits should be weighted more. There is probably a mathematically more elegant solution, which would be really cool.
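To make the weighting concrete, here is a rough sketch of an unaligned RMSD over the concatenated atom pairs. It is not Fragmenstein's code: the function name and the input layout (per-hit coordinate lists plus explicit followup-to-hit index pairs) are assumptions for illustration.

```python
from math import dist, sqrt


def combined_rmsd(followup_xyz, hits):
    """Unaligned RMSD over the concatenated atom pairs from all hits.

    followup_xyz: list of (x, y, z) coordinates of the followup molecule.
    hits: list of (hit_xyz, pairs), where pairs is a list of
          (followup_index, hit_index) tuples mapping followup atoms to hit atoms.
    No superposition is done: coordinates are compared in the shared frame.
    """
    squared = []
    for hit_xyz, pairs in hits:
        for followup_idx, hit_idx in pairs:
            squared.append(dist(followup_xyz[followup_idx], hit_xyz[hit_idx]) ** 2)
    if not squared:
        raise ValueError("no atom pairs to score")
    # A followup atom mapped to several hits contributes once per hit,
    # so it is implicitly weighted more heavily.
    return sqrt(sum(squared) / len(squared))
```

Because the pairs are simply concatenated, an atom shared by two hits appears twice in the sum, which is the extra weighting described above.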

Also probably worth mentioning is the dead-end avenue of B-factor weighting. I was interested in the B-factors of the hits, which would be nice to use as non-linear inverse weights, but they are probably not consistent between structures, and apparently the XChem pipeline does not always provide them because of the manual steps. **

I did not use anything more sophisticated, as I am not too knowledgeable on the topic and I was worried about behind-the-scenes alignment; e.g. for SuCOS, I have not tested whether a molecule translated by N Å scores differently from one that isn’t.

(* but with a cutoff of 2 Å, which, as no atom can be present in two sets, works just as well as 0.6 Å except for issues with 5-ring to 6-ring, which are fixed by the minimiser (Egor))

** EDIT: the logic behind the interest in the B-factors is that sometimes the hits stick out of the protein without any interactions, and it would be nice to capture that. Here is an example (ignore the JS spiel; it was for a feature suggestion that would not have worked).