PostEra

Indole ring in SMILES

I may be wrong about this, but while looking at https://covid.postera.ai/covid/submissions/8640f307-999e-42bc-ba0b-5d9215d78435 I noticed that the indole ring is not drawn correctly, at least according to RDKit.
Namely, the kekulised C1=C2C=CC=CC2N=C1 is not aromatic unlike c1c2ccccc2[nH]c1.
So I checked (code below); luckily there are only 5 such cases: VIC-UNI-33d4332f-1, VIC-UNI-33d4332f-2, FRA-DIA-8640f307-1, FRA-DIA-8640f307-2, FRA-DIA-8640f307-3.

So this is not a problem, but I wanted to get a second opinion in case there are other ways to write a reduced indole, or other compounds to look out for. A quick check of the huge cheatsheet that is the figure on the Wikipedia page about heterocyclic compounds tells me that it's not a common issue either, but I may be wrong, hence my asking!

Code

# how common is the weird indole
from rdkit.Chem import PandasTools
import pandas as pd
from rdkit import Chem
from IPython.display import display

wrong = Chem.MolFromSmiles('C1=C2C=CC=CC2N=C1')
right = Chem.MolFromSmiles('c1c2ccccc2[nH]c1')

submissions = pd.read_csv('/Users/matteo/Coding/COVID_moonshot_submissions/covid_submissions_all_info.csv')
PandasTools.AddMoleculeColumnToFrame(submissions,'SMILES','molecule')

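# classify (rows whose SMILES failed to parse have molecule None and count as no match)
submissions['wrong_indole'] = submissions.molecule.apply(lambda mol: mol is not None and mol.HasSubstructMatch(wrong))
submissions['right_indole'] = submissions.molecule.apply(lambda mol: mol is not None and mol.HasSubstructMatch(right))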
# tally
display(submissions.wrong_indole.value_counts())
display(submissions.right_indole.value_counts())
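
The aromaticity difference itself can be checked without the submissions file. A minimal sketch, assuming RDKit's default aromaticity model; `count_aromatic_atoms` is just an illustrative helper name, not part of RDKit:

```python
from rdkit import Chem

def count_aromatic_atoms(smiles):
    """Number of atoms RDKit marks as aromatic after parsing the SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    return sum(atom.GetIsAromatic() for atom in mol.GetAtoms())

# the mis-drawn form has an sp3 carbon at the ring fusion, so nothing is aromatic
wrong_count = count_aromatic_atoms('C1=C2C=CC=CC2N=C1')
# the proper indole: all nine ring atoms are aromatic
right_count = count_aromatic_atoms('c1c2ccccc2[nH]c1')
print(wrong_count, right_count)
```

The mis-drawn SMILES is a perfectly valid molecule, which is why nothing complains when it is parsed.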

@matteoferla, very good catch! This is a really interesting issue. For example, I think we have another mis-drawn indole here: ORN-MSD-f9d8c68a-1

Unfortunately, nothing is flagged on submission because the SMILES is valid. Furthermore, no alerts come up, because the alerts look for popular functional groups that may be problematic, not for incorrect drawings. Thus, a different solution is needed. One solution I have been considering: flag substructures that are extremely unpopular in PubChem/ZINC/some other source. That way, you could at least alert the user if they are entering a very strange group.
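
A minimal sketch of that idea, assuming fused ring systems are the unit to count and using a tiny made-up reference list in place of PubChem or ZINC. The helper names (`ring_systems`, `rare_ring_systems`), the reference list and the `min_count` threshold are all hypothetical, not anything in the submission pipeline:

```python
from collections import Counter
from rdkit import Chem

def ring_systems(mol):
    """Fragment SMILES of each ring system (rings sharing any atom are merged)."""
    systems = []  # list of sets of atom indices
    for ring in mol.GetRingInfo().AtomRings():
        ring = set(ring)
        untouched = []
        for system in systems:
            if system & ring:   # shares an atom: same ring system
                ring |= system
            else:
                untouched.append(system)
        systems = untouched + [ring]
    return [Chem.MolFragmentToSmiles(mol, atomsToUse=sorted(s)) for s in systems]

# stand-in for a large reference collection (assumption: illustrative only)
REFERENCE_SMILES = ['c1ccccc1', 'Cc1ccccc1', 'Oc1ccccc1',
                    'c1ccc2[nH]ccc2c1', 'C1CCCCC1']
reference_counts = Counter(
    s for smi in REFERENCE_SMILES for s in ring_systems(Chem.MolFromSmiles(smi))
)

def rare_ring_systems(smiles, reference_counts, min_count=1):
    """Ring systems seen fewer than min_count times in the reference set."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    # a Counter returns 0 for ring systems it has never seen
    return [s for s in ring_systems(mol) if reference_counts[s] < min_count]

print(rare_ring_systems('Cc1ccccc1', reference_counts))          # toluene
print(rare_ring_systems('C1=C2C=CC=CC2N=C1', reference_counts))  # mis-drawn indole
```

With a real reference set the threshold would need tuning, and novel but legitimate scaffolds would be flagged too, so this could only ever be a prompt to double-check the drawing.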

Right now, unfortunately, the wrong info is propagated through everything.

Maybe a simpler flag, such as the presence of double bonds in a ring that do not resonate?

I did a quick check to see which submissions had rings with double bonds after reading them in RDKit and got about 500, which is a lot: https://www.well.ox.ac.uk/~matteo/unaromatic.html
But that is expected and most are correct.
There are not many quinones among the correct ones, but a lot of cyclopentenes (which must be the product of some reaction, vinyl or alkyne) and tetrahydropyridines (probably also a reaction product). Some, however, are wrong.

I am a biochemist, so I cannot really eyeball how many are wrong. But it looks to me like fewer than 1 in 10 of the 500 on the list, so not that many to worry about. Please, anyone, correct me!

Code

def has_double_in_ring(mol):
    # True if any ring bond is still a plain double bond after RDKit's aromaticity perception
    for bonds in mol.GetRingInfo().BondRings():
        for bi in bonds:
            if mol.GetBondWithIdx(bi).GetBondType() == Chem.BondType.DOUBLE:
                return True
    return False

submissions = pd.read_csv('/Users/matteo/Coding/COVID_moonshot_submissions/covid_submissions_all_info.csv')
PandasTools.AddMoleculeColumnToFrame(submissions,'SMILES','molecule')
submissions['double_in_ring'] = submissions.molecule.apply(has_double_in_ring)
display(submissions.double_in_ring.value_counts())
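
To see why this flag catches the mis-drawn indole but also many legitimate compounds, here is a small self-contained sketch that applies the same test to a few hand-picked SMILES. The example set is an assumption chosen for illustration, and the function is restated so the snippet runs on its own:

```python
from rdkit import Chem

def has_double_in_ring(mol):
    """True if any ring bond is a non-aromatic double bond."""
    return any(
        mol.GetBondWithIdx(bi).GetBondType() == Chem.BondType.DOUBLE
        for bonds in mol.GetRingInfo().BondRings()
        for bi in bonds
    )

examples = {
    'kekulised benzene': 'C1=CC=CC=C1',       # becomes aromatic on parsing
    'aromatic indole': 'c1c2ccccc2[nH]c1',
    'mis-drawn indole': 'C1=C2C=CC=CC2N=C1',
    'cyclopentene': 'C1=CCCC1',               # legitimately non-aromatic
}
flags = {name: has_double_in_ring(Chem.MolFromSmiles(smi))
         for name, smi in examples.items()}
for name, flagged in flags.items():
    print(name, flagged)
```

A correctly drawn Kekulé ring is not flagged because RDKit converts its bonds to aromatic on parsing, whereas cyclopentene is flagged even though nothing is wrong with it.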

Just saw this (wonder what setting I haven’t checked, so that I don’t get alerts to posts that refer to my posts).

As far as I can tell, the “problem” here is that there is a convention for how one must draw double bonds to indicate aromaticity, right? Because aromatic rings don’t have double bonds.

So @mc-robinson, now that @matteoferla has spotted the “mistake”, go ahead and fix the SMILES in the CSVs, because the submitter's intention was clear.

So is this the right one for FRA-DIA-8640f307-1?
c1c(F)cc(C(CN(Cc2cc(C)o[n]2)C(NC2CC2)=O)C2=NN=C(C)S2)c2[n]cc(CCNC(C)=O)c12
Here I’ve made the isoxazole explicitly aromatic - but there’s also the version with double bonds, I think. So where does one find out what’s “right”?

The corrected SMILES are in the discussion for topic FRA-DIA-8640f307-1, namely:

The indole in aromatic form is c1c2ccccc2[nH]c1. If you want to represent it kekulised, you can write it with the proton as C1-C=C-C=2-[NH]-C=[CH]-C=2C=1, or without the proton as C1=CC=C2-[N]=C-[CH-]-C2=C1 (worthy of a straitjacket).
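
One way to check that a kekulised form really is the same molecule as the aromatic one is to compare canonical SMILES after parsing. A minimal sketch, assuming RDKit's default sanitisation and aromaticity perception; the variable names are only illustrative, and only the protonated kekulised form is tested:

```python
from rdkit import Chem

aromatic = Chem.MolFromSmiles('c1c2ccccc2[nH]c1')
kekulised = Chem.MolFromSmiles('C1-C=C-C=2-[NH]-C=[CH]-C=2C=1')

# both inputs should collapse to one canonical SMILES if they are the same molecule
canonical_aromatic = Chem.MolToSmiles(aromatic)
canonical_kekulised = Chem.MolToSmiles(kekulised)
same_molecule = canonical_aromatic == canonical_kekulised
print(same_molecule)

# going the other way: let RDKit write a Kekulé form of the aromatic indole
kek_mol = Chem.Mol(aromatic)                      # work on a copy
Chem.Kekulize(kek_mol, clearAromaticFlags=True)
kekule_smiles = Chem.MolToSmiles(kek_mol, kekuleSmiles=True)
print(kekule_smiles)
```

So either notation is "right" as long as the parser ends up with the same aromatic molecule; the quickest test is to round-trip the SMILES and compare.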