Duplicate rejections

In the past it was possible to submit molecules which were duplicates of suggestions from other researchers - they were simply flagged as duplicate (regardless of whether the rationale or fragments shared anything in common).

Now, when you submit a set of molecules, if one or more is inadvertently a duplicate of someone else’s submission, the whole batch is rejected without any indication of which molecule is the duplicate. This makes batch entry pointless unless one downloads the existing submissions and performs a duplicate check immediately before submitting.

Ideally, those of us performing virtual screening would like to upload a file of SMILES (or SD) structures. We currently have a big backlog of molecules to submit, and I’m really excited about some of the covalent ligands which interact with both CYS(145) and HIS(41), analogously (but not identically) to those in PDB code 4imq. (Most virtual screening/docking programmes are not designed to cope with covalent constraints.)

We had started to outsource this data entry to a student, but entering them one at a time, because of a reluctance to entertain duplicate molecules supported by different rationales, is soul-destroying … it also means retaining links to our unique identifiers or to those of commercial libraries becomes 100x more difficult!

Hi @EKDavies, sorry this has been annoying.

So I just tested it, and want to clarify that it only rejects duplicate designs when the duplicate is within your submission. Thus, I suspect that you may have mistakenly added a molecule twice within your own submission, though I may be missing something, so feel free to send over a counterexample.

As for bulk/file upload: This was actually intentional on our part. We have been approached multiple times by people wanting to dump tens of thousands of molecules from docking or generative models; however, we really appreciate focused submissions, since we do not have the bandwidth to order thousands upon thousands of compounds or evaluate every docking methodology in detail. Therefore, I suggest that you pick out the few molecules with those exciting interactions and focus on submitting a smaller number of those. It is much easier for us to consider a small number than to have to look through a large list for the most exciting ones, and you are in the best position to do that since you know the results and method very well.

Anyway, I am excited to see the designs!


In summary, it appears that the front end checks that your set of molecules does not include duplicates, and the backend server performs a further duplicate check. These give different error messages. The backend check appears to report duplicates when there are none present!

We were attempting to enter the following SMILES:

O=C1NC(=O)CCC1N3C(=O)c2cccc(N)c2C3
N#CC(c1(cc(cc(c1)Cn2(ncnc2))C(C#N)(C)C))(C)C
Clc2nc1n(cnc1c(n2)N)C3OC(CO)C(O)C3
O=C1N=C(N=CN1C2(OC(CO)C(O)C2))N
FC1C(O)C(OC1n2(cnc3(c2nc(Cl)nc3N)))CO
O=C1NC(=O)CN(C1)CC(N2(CC(=O)NC(=O)C2))C
O=C5OCC1=C(C=C3(N(C1=O)Cc2(cc4(c(nc23)ccc(O)c4CN(C)C))))C5(O)CC
O=C(NCCc1(cccc2(ccc(OC)cc12)))C
O=C(NCCC3(c2(c1(c(OCC1)ccc2CC3))))CC
O=S(=O)(c1(cc(c(OC)cc1N)C(=O)NCC2(N(CC)CCC2)))CC
FC1=CN(C(=O)NC1=O)C(=O)NCCCCCC
Clc1ccc(cc1)C(N2(CCN(CCOCC(=O)O)CC2))c3ccccc3
O=C(OC2(C(=O)N(c1(ccccc1SC2c3(ccc(OC)cc3)))CCN(C)C))C
O=S(=O)(Nc1(noc(c1)C))c2ccc(N)cc2
O=S(=O)(Nc1(onc(c1C)C))c2ccc(N)cc2
Clc3cc(C(=O)NC1(CCN(CCCOC)CC1))c2OCCc2c3N
FC1=CN(C(=O)NC1=O)C2OC(CO)C(O)C2
O=C1C(=COc2(cc(O)cc(O)c12))c3ccc(O)cc3
O=C(N)C(N1(C(=O)CCC1))CC
O=C1N(C(=O)c3(cccc2(cc(N)cc1c23)))CCN(C)C
Oc1cc(cc(O)c1)C=Cc2ccc(O)cc2
Fc1cc(ccc1N2(CCOCC2))N3C(=O)OC(C3)CNC(=O)C
O=S(=O)(OCC23(OCC1(OC(OC1C2OC(O3)(C)C)(C)C)))N
Fc1cncnc1C(C)C(O)(c2(ccc(F)cc2F))Cn3ncnc3
O=S(=O)(NC)CCc2ccc1[nH]cc(c1c2)C3CCN(C)CC3
O=C2N(c1(ncn(c1C(=O)N2C)CC(O)CO))C
O=C1N(C(=O)c2(c1cccc2N))C3C(=O)NC(=O)CC3
n1cn(nc1)Cc3ccc2[nH]cc(c2c3)CCN(C)C
O=S(=O)(Nc1(ncc(OC)cn1))c2ccc(N)cc2
O=C(O)C(c2(ccc1(cc(OC)ccc1c2)))C
Clc2cc1N=CNS(=O)(=O)c1cc2S(=O)(=O)N
O=C1OCC(N1)Cc3ccc2[nH]cc(c2c3)CCN(C)C
O=C2N(c1(nc[nH]c1C(=O)N2C))C
O=C(O)c1cc(N)ccc1O
O=C4C=C3CCC2C1CCC(O)(C(=O)CO)C1(C)CC(O)C2C3(C)CC4
O=C3Nc1c(nccc1C)N(c2(ncccc23))C4CC4
O=C(OCC#CCN(CC)CC)C(O)(c1(ccccc1))C2CCCCC2
O=S(=O)(Nc1(ncccn1))c2ccc(N)cc2
O=C1N=C(N=CN1C2(OC(CO)C(O)C2O))N
Clc1nc(c(nc1N)N)C(=O)NC(=N)N
O=C1C(=COc2(cc(O)ccc12))c3ccc(O)cc3
Oc1ccc(cc1)C(O)CNC
O=C(O)C(N)Cc1c[nH]c2ccc(O)cc12
O=C2C=C(Oc1(cc(O)cc(O)c12))c3ccc(OC)c(O)c3
Oc4c3OCC2(O)Cc1cc(O)c(O)cc1C2c3ccc4O
Oc1ccc(cc1O)C(O)CN
Fc1cc3c(nc1N2(CCNCC2))N(C=C(C(=O)O)C3=O)CC
O=C(c1(ccc(cc1)C(C)(C)C))CC(=O)c2ccc(OC)cc2
Fc1ccc(cc1)-c2sc(cc2)Cc3cc(ccc3C)C4OC(CO)C(O)C(O)C4O
O=C2NC(=O)C(c1(ccc(N)cc1))(CC2)CC
OC2C(O)C(O)C(OCCc1(ccc(O)cc1))OC2CO
s1c(nc2(c1CC(NCCC)CC2))N
O=C(OC2(CC1(N(C)C(CC1)C2)))C(c3(ccccc3))CO
O=C(O)CCCCC1SCC2NC(=O)NC12
O=C2C=C(Oc1(cc(O)cc(O)c12))c3ccc(O)cc3
O=C3c1c(O)cc(cc1C(=O)c2(cc(O)cc(O)c23))C
O=C5OCC1=C(C=C3(N(C1=O)Cc4(cc2(cc(O)ccc2nc34))))C5(O)CC
O=C1C(=COc2(cc(O)ccc12))c3ccc(OC)cc3
OC1C(O)C(O)OC(C)C1O
O=C2C=C(Oc1(cc(O)cc(O)c12))c3ccc(O)c(O)c3
O=C(O)C(O)Cc1ccc(O)c(O)c1
Clc2ccc1N(C(=O)Nc1c2)C3CCN(CC3)CCCN4C(=O)Nc5ccccc45
O=C(NC(C(O)C)C1(OC(SC)C(O)C(O)C1O))C2N(C)CC(C2)CCC
O=C(O)C(NC(=O)C1(CCC(CC1)C(C)C))Cc2ccccc2
O=C(O)Cc3cc2c(OCc1(ccccc1C2=CCCN(C)C))cc3
OC(c1(ccnc2(ccc(OC)cc12)))C3N4CCC(C3)C(C=C)C4
O=S(=O)(Nc1(ccc(cc1)C(O)CNC(C)C))C
O=C2CC(OC1(OC3(C(OC12O)C(NC)C(O)C(NC)C3O)))C
Oc1ccc(cc1O)C(O)CNC
Oc1ccc(cc1O)CCN
Oc1ccc(cc1)C(O)C(NCCc2(ccc(O)cc2))C
O=C4C=C3CCC2C1CCC(O)(C(=O)COC(=O)C)C1(C)CC(=O)C2C3(C)CC4
Oc1ccc(cc1O)C(O)CNC(C)C
OC(c1(cccc(O)c1))CNC
O=C2Nc1cccc(c1C2)CCN(CCC)CCC
Oc1ccc(cc1)C(O)CN
O=C2NC(C(=O)N1(CSCC1C(=O)O))CC2
OCc1cnc(c(O)c1CO)C
O=C1NC(=O)N(C=C1)C2OC(CO)C(O)C2O
O=C(O)C(NC(C(=O)N1(CCCC1C(=O)O))C)CCc2ccccc2
O=C(c1(c(OC)cc(OC)cc1OC))CCCN2CCCC2
OCCN1CC(O)C(O)C(O)C1CO
O=C1N=C(N)C=CN1C2OC(CO)C(O)C2O
FC1=CN(C(=O)NC1=O)C2OC(C)C(O)C2O
O=C1N(C(=O)c2(ccccc12))C3C(=O)NC(=O)CC3
O=S(=O)(Nc1(nc(ccn1)C))c2ccc(N)cc2
O=S(=O)(Nc1(nc(cc(n1)C)C))c2ccc(N)cc2
O(c1(cc(cc(OC)c1OC)Cc2(cnc(nc2N)N)))C
O=S(=O)(N)c1sc2c(c1)C(NCC)CN(CCCOC)S2(=O)=O
O=C(c1(ccc(O)c(O)c1))CNC
Clc1ccc(cc1)C(=O)NCCN2CCOCC2
O=S(=O)(Nc1(sccn1))c2ccc(N)cc2
Oc1ccc(cc1)C3Cc2ccc(O)cc2OC3
Oc1cc(ccc1O)C(O)CNC
N#CCC(n1(ncc(c1)-c2(ncnc3([nH]ccc23))))C4CCCC4
O=C(NCC(=O)O)c1ccc(N)cc1
O=C(NCC1(N(CC)CCC1))c2cc(ccc2OC)S(=O)(=O)N
O=C(NCCc2(c[nH]c1(ccc(OC)cc12)))C
O=S(=O)(Nc1(sc(nn1)C))c2ccc(N)cc2
O=C(N(CC)Cc1(ccncc1))C(c2(ccccc2))CO

Unless our algorithm is faulty, this list does not contain duplicates. When adding these 100 molecules, the front end did not report any duplicates. For the sake of clarity, we then tried to enter a duplicate and got the expected error message “Unable to add molecule. Cannot submit duplicate molecules”, and the duplicate was not added to the set. (See slide 4 in the attached OpenOffice file.)

On submission we got a further, different error message complaining that submitted molecules could not contain duplicates: “Unable to submit for synthesis. Cannot submit duplicate molecules”.

I also checked that there were no duplicates of earlier molecules we had submitted (a logical possibility). It is conceivable that there are different tautomers (I haven’t checked), but obviously there’s no scope for stereochemistry duplicates.

OpenOffice file (wouldn’t upload)

Hi @EKDavies, sorry this took some digging into.

Index 68 and index 93, Oc1ccc(cc1O)C(O)CNC and Oc1cc(ccc1O)C(O)CNC respectively, seem to be duplicates once the SMILES are canonicalized. I expect that is the issue. I will look into why our frontend did not catch that.
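To illustrate why two differently written SMILES can denote the same molecule, here is a minimal pure-Python sketch. It is not the site's actual canonicalization (the thread does not show that code); it is a stand-in that parses a restricted SMILES subset into a graph and derives an order-independent key by iterative neighbourhood hashing (Weisfeiler-Lehman style). The names `parse_smiles` and `graph_key` are hypothetical. The sketch ignores stereochemistry, charges, tautomers and aromaticity perception, and equal keys are strong evidence of identity rather than proof. In practice one would compare canonical SMILES from a cheminformatics toolkit such as RDKit.

```python
import hashlib
import re

# Restricted SMILES subset (an assumption of this sketch): bracket atoms,
# organic-subset atoms, bond symbols - = #, branches, single-digit ring closures.
TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]|[-=#]|[()]|\d")


def parse_smiles(smiles):
    """Return (atom labels, adjacency lists) for a restricted SMILES string."""
    atoms, adj = [], []
    branch_stack = []   # atoms to return to when a branch closes
    ring_open = {}      # ring digit -> (atom index, bond symbol)
    prev, bond, pos = None, "", 0

    def link(a, b, symbol):
        # Only double and triple bonds are distinguished; single and
        # aromatic bonds share one label (aromaticity lives on the atoms).
        label = symbol if symbol in ("=", "#") else "-"
        adj[a].append((b, label))
        adj[b].append((a, label))

    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unsupported SMILES syntax at position {pos}")
        tok, pos = match.group(), match.end()
        if tok == "(":
            branch_stack.append(prev)
        elif tok == ")":
            prev = branch_stack.pop()
        elif tok in ("-", "=", "#"):
            bond = tok
        elif tok.isdigit():
            if tok in ring_open:
                other, open_bond = ring_open.pop(tok)
                link(prev, other, bond or open_bond)
            else:
                ring_open[tok] = (prev, bond)
            bond = ""
        else:
            atoms.append(tok)
            adj.append([])
            if prev is not None:
                link(prev, len(atoms) - 1, bond)
            prev, bond = len(atoms) - 1, ""
    return atoms, adj


def graph_key(smiles):
    """Order-independent key: refine atom labels from their neighbourhoods."""
    atoms, adj = parse_smiles(smiles)
    labels = list(atoms)
    for _ in range(len(atoms)):
        labels = [
            hashlib.sha1(
                (lab + "|" + ",".join(sorted(b + labels[j] for j, b in adj[i]))).encode()
            ).hexdigest()
            for i, lab in enumerate(labels)
        ]
    return hashlib.sha1(",".join(sorted(labels)).encode()).hexdigest()


a = "Oc1ccc(cc1O)C(O)CNC"
b = "Oc1cc(ccc1O)C(O)CNC"
print(graph_key(a) == graph_key(b))  # prints True
```

A plain string comparison treats the two entries as different, which is consistent with a duplicate slipping past a check that compares the text as typed.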

Since our code didn’t spot the duplicate, I suspect the original published algorithm had a minor fault in how it deals with rings. We’d really like an error message from the backend that tells us which molecules cause the problem … and ideally one that does not remove them from the entry form! 🙂
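As a rough sketch of the kind of message requested here, the function below groups entries by a comparison key and reports the positions of every clash. This is an assumption about how such a report could be built, not the backend's code. `duplicate_groups`, `duplicate_report` and the `key` parameter are hypothetical names; `key` stands for whatever canonicalization function is available and defaults to comparing the strings as typed.

```python
from collections import defaultdict


def duplicate_groups(smiles_list, key=lambda s: s):
    """Group entry indices whose key(smiles) values coincide."""
    groups = defaultdict(list)
    for index, smiles in enumerate(smiles_list):
        groups[key(smiles)].append(index)
    # Keep only keys shared by more than one entry.
    return [indices for indices in groups.values() if len(indices) > 1]


def duplicate_report(smiles_list, key=lambda s: s):
    """Build one human-readable line per group of duplicate entries."""
    lines = []
    for indices in duplicate_groups(smiles_list, key):
        entries = ", ".join(f"#{i} {smiles_list[i]}" for i in indices)
        lines.append(f"Duplicate molecules: {entries}")
    return lines


batch = ["CCO", "CCN", "CCO"]
for line in duplicate_report(batch):
    print(line)  # prints: Duplicate molecules: #0 CCO, #2 CCO
```

Because the report names positions (counted from zero) instead of rejecting the whole batch, the submitter can delete the offending entries and keep the rest of the form.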

@EKDavies, thanks for spotting this! The frontend now canonicalizes SMILES so that it won’t let you add duplicate molecules that may have slipped through before.

Note that you may have to clear your browser’s cache.


Thanks. Greatly appreciated!