I may be wrong about this, but while looking at https://covid.postera.ai/covid/submissions/8640f307-999e-42bc-ba0b-5d9215d78435 I noticed that the indole ring was not correct, at least according to RDKit.
Namely, the kekulised SMILES C1=C2C=CC=CC2N=C1 is not aromatic, unlike the usual indole (c1c2ccccc2[nH]c1 in the code below).
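To make the aromaticity point concrete, here is a minimal sketch that counts how many atoms RDKit flags as aromatic after its default sanitisation. The helper name `count_aromatic_atoms` is made up for illustration, and the third SMILES (an ordinary Kekulé form of indole) is my own addition for comparison. As written, the problematic SMILES has a ring-fusion carbon with three single bonds (an sp3 CH), so neither ring is conjugated all the way round; I'd expect zero aromatic atoms for it and nine for the other two.

```python
from rdkit import Chem


def count_aromatic_atoms(smiles: str) -> int:
    """Count atoms RDKit marks as aromatic after default sanitisation.

    Hypothetical helper, for illustration only.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse {smiles!r}")
    return sum(atom.GetIsAromatic() for atom in mol.GetAtoms())


wrong_smiles = 'C1=C2C=CC=CC2N=C1'      # the pattern seen in the submission
right_smiles = 'c1c2ccccc2[nH]c1'       # aromatic indole
kekule_smiles = 'C1=CC=C2C(=C1)C=CN2'   # indole in Kekulé form (added for comparison)

n_wrong = count_aromatic_atoms(wrong_smiles)
n_right = count_aromatic_atoms(right_smiles)
n_kekule = count_aromatic_atoms(kekule_smiles)
print(n_wrong, n_right, n_kekule)
```

If that holds, kekulisation by itself is not the issue; it is where the hydrogen sits.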
So I checked (code below); luckily, there are only 5 such cases.
So this is not a problem, but I wanted to get a second opinion in case there are other ways to write a reduced indole, or other compounds to look out for. A quick check of the huge cheatsheet that is the figure on the Wikipedia page about heterocyclic compounds tells me that it's not a common issue either, but I may be wrong, hence my asking!
```python
# how common is the weird indole
from rdkit.Chem import PandasTools
import pandas as pd
from rdkit import Chem
from IPython.display import display

wrong = Chem.MolFromSmiles('C1=C2C=CC=CC2N=C1')
right = Chem.MolFromSmiles('c1c2ccccc2[nH]c1')
submissions = pd.read_csv('/Users/matteo/Coding/COVID_moonshot_submissions/covid_submissions_all_info.csv')
PandasTools.AddMoleculeColumnToFrame(submissions,'SMILES','molecule')
# classify
submissions['wrong_indole'] = submissions.molecule.apply(lambda mol: mol.HasSubstructMatch(wrong))
submissions['right_indole'] = submissions.molecule.apply(lambda mol: mol.HasSubstructMatch(right))
# tally
display(submissions.wrong_indole.value_counts())
display(submissions.right_indole.value_counts())
```
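In case anyone wants to repeat the check without the CSV, here is a self-contained sketch of the same classification on a toy table. The toy SMILES and the helper `has_substructure` are made up for illustration; only the two indole patterns come from the code above. It also guards against rows whose SMILES RDKit cannot parse (the molecule is `None`), which the plain lambda would trip over, and it lists the offending SMILES rather than only tallying them.

```python
import pandas as pd
from rdkit import Chem

wrong = Chem.MolFromSmiles('C1=C2C=CC=CC2N=C1')
right = Chem.MolFromSmiles('c1c2ccccc2[nH]c1')

# Toy data for illustration only, NOT real submissions.
toy = pd.DataFrame({'SMILES': [
    'CC1=C2C=CC=CC2N=C1',   # methyl-substituted "weird" indole
    'Cc1c[nH]c2ccccc12',    # 3-methylindole, aromatic
    'c1ccccc1',             # benzene, no indole at all
    'not_a_smiles',         # unparsable on purpose
]})
# MolFromSmiles returns None for strings it cannot parse.
toy['molecule'] = toy.SMILES.apply(Chem.MolFromSmiles)


def has_substructure(mol, query) -> bool:
    """Hypothetical helper: substructure test that treats None as 'no match'."""
    return mol is not None and mol.HasSubstructMatch(query)


# classify
toy['wrong_indole'] = toy.molecule.apply(lambda mol: has_substructure(mol, wrong))
toy['right_indole'] = toy.molecule.apply(lambda mol: has_substructure(mol, right))

# list the offending SMILES instead of only counting them
offenders = toy.loc[toy.wrong_indole, 'SMILES'].tolist()
print(offenders)
```

On the real table the same `loc` selection on `submissions` should show which rows make up the 5 cases.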