Providing computed poses for others to look at

We’ve specced out an sdf format for people to contribute their calculated/computed binding poses for moonshot compounds (see below). This format is provided so that within the next couple of weeks, the compound sets can all be easily uploaded to fragalysis (fragalysis.diamond.ac.uk) so that they can be viewed easily alongside the x-ray hits.

Validation script and instructions: https://github.com/xchem/sdf_check

SDF files can be uploaded to this thread if you wish for them to eventually be shared on fragalysis, and for them to be considered for use in triaging designed compounds

Also attached to this post is the html for a Jupyter notebook, with some example rdkit code showing how to achieve the desired format. example_sdf_format-5.html.zip (50.2 KB)

Specification - ver_1.0

The upload format for compounds will be allowed in one of two ways:

  1. A single sdf file
  2. A single sdf file plus pdb files for the ligands to be loaded into in fragalysis.

The sdf files for these two options will have a standardised format, to allow the following options:

  • The fragments that inspired the design of each molecule can be specified
  • The protein (in pdb file format) for each molecule can be specified
  • Any number of ‘properties’ or ‘scores’ can be specified.

The (SDF) format is as follows:

The sdf file name will be: compound-set_<name>.sdf with <name> replaced with the name you wish to give it. e.g. compound-set_fragmenstein.sdf

A ‘blank’ molecule will be the first in the sdf:

  • This molecule will contain all of the same fields as the sdf, containing a description of those fields.
  • The 3D coordinates of this molecule can be anything - they will be ignored.
  • The name (title line) of this molecule should be the file format specification version e.g. ver_1.0 (as defined in this document)

Every other molecule in the sdf file will be assumed to be a molecule that is a computed molecule, and should:

  • Have the same properties as the blank molecule, but with their values instead of description.
  • Have a name that is meaningful, and will eventaully be displayed in Fragalysis - Use the PostEra submission ID, or if not available, a name that is meaningful (e.g. the PDB code, name used in publication, etc.)
  • Have the three following compulsary property fields:
    • ref_mols - a comma separated list of the fragments that inspired the design of the new molecule (codes as they appear in fragalysis - e.g. x0104_0,x0692_0)
    • ref_pdb - either (a) a filepath (relative to the sdf file) to an uploaded pdb file (e.g. Mpro-x0692_0/Mpro-x0692_0_apo.pdb) or (b) the code to the fragment pdb from fragalysis that should be used (e.g. x0692_0)
    • original SMILES - the original smiles of the compound before any computation was carried out

example sdf file: compound-set_fragmenstein.sdf (9.9 KB)

NB: only properties with numerical (or boolean) values will be displayed in fragalysis - this will be reviewed at a later date

2 Likes

Hi Rachel,

Looks useful. I’ll take a look at implementing a conversion script for GOLD data. Would anyone else find such a script useful?

Jason

1 Like

Specification - ver_1.1

The upload format for compounds will be allowed in one of two ways:

  1. A single sdf file
  2. A single sdf file plus pdb files for the ligands to be loaded into in fragalysis.

The sdf files for these two options will have a standardised format, to allow the following options:

  • The fragments that inspired the design of each molecule can be specified
  • The protein (in pdb file format) for each molecule can be specified
  • Any number of ‘properties’ or ‘scores’ can be specified.

The (SDF) format is as follows:

The sdf file name will be: compound-set_<name>.sdf with <name> replaced with the name you wish to give it. e.g. compound-set_fragmenstein.sdf

A ‘blank’ molecule will be the first in the sdf:

  • This molecule will contain all of the same fields as the sdf, containing a description of those fields.
  • The 3D coordinates of this molecule can be anything - they will be ignored.
  • The name (title line) of this molecule should be the file format specification version e.g. ver_1.1 (as defined in this document)
  • The molecule should have a field ‘ref_url’ containing a url that describes the method that was used in the work - this can be a link to the forum, or elsewhere if you have described the method somewhere else

Every other molecule in the sdf file will be assumed to be a molecule that is a computed molecule, and should:

  • Have the same properties as the blank molecule, but with their values instead of description. - for the ref_url field, you can leave this blank, as it will be ignored for molecules that are not the blank molecule
  • Have a name that is meaningful, and will eventaully be displayed in Fragalysis - Use the PostEra submission ID, or if not available, a name that is meaningful (e.g. the PDB code, name used in publication, etc.)
  • Have the three following compulsary property fields:
    • ref_mols - a comma separated list of the fragments that inspired the design of the new molecule (codes as they appear in fragalysis - e.g. x0104_0,x0692_0)
    • ref_pdb - either (a) a filepath (relative to the sdf file) to an uploaded pdb file (e.g. Mpro-x0692_0/Mpro-x0692_0_apo.pdb) or (b) the code to the fragment pdb from fragalysis that should be used (e.g. x0692_0)
    • original SMILES - the original smiles of the compound before any computation was carried out

NB: only properties with numerical (or boolean) values will be displayed in fragalysis - this will be reviewed at a later date

Hi Jason,

Yes, I imagine it would be super useful!

I can include them in the validation code repo, if you would like. If you give me a GitHub username, I can add it as a contributor.

I also saw your email yesterday with your modification of the validator - good catch! Thank you!

Rachael

Hi Rachel,

One further thought on this. The SD or PDB file thing is great, but one issue here is that some software will vary the protein conformation as well as the ligand. For GOLD, for example, in some docking modes this would mean every ligand pose would have its own associated separate PDB file. These are not the original PDBs as side chains can have been flipped, protons can have been rotated, depending on the protocol used.

I’m guessing that this could lead to a bit of a flood in the volume of data (e.g. for the runs I have here this would mean uploading ~4000 protein files which is about 2Gb of data, as opposed to the poses alone which is a lot less.

If think for now we can stick with referencing the associated PDB but we may need to ensure that when people look at the poses they understand this.

Specification - ver_1.2

The upload format for compounds will be allowed in one of two ways:

  1. A single sdf file
  2. A single sdf file plus pdb files for the ligands to be loaded into in fragalysis.

The sdf files for these two options will have a standardised format, to allow the following options:

  • The fragments that inspired the design of each molecule can be specified
  • The protein (in pdb file format) for each molecule can be specified
  • Any number of ‘properties’ or ‘scores’ can be specified.

The (SDF) format is as follows:

The sdf file name will be: compound-set_<name>.sdf with <name> replaced with the name you wish to give it. e.g. compound-set_fragmenstein.sdf

A ‘blank’ molecule will be the first in the sdf:

  • This molecule will contain all of the same fields as the sdf, containing a description of those fields.
  • The 3D coordinates of this molecule can be anything - they will be ignored.
  • The name (title line) of this molecule should be the file format specification version e.g. ver_1.2 (as defined in this document)
  • The molecule should have the following compulsory fields:
    • ref_url - the url to the forum post that describes the work
    • submitter_name - the name of the person submitting the compounds
    • submitter_email - the email address of the submitter
    • submitter_institution - the submitters institution
    • generation_date - the date that the file was generated in format yyyy-mm-dd
    • method - a name for the method used to generate the compound poses

NB: all of the compulsory fields for the blank mol can be included for the other molecules, but they will be ignored

Every other molecule in the sdf file will be assumed to be a molecule that is a computed molecule, and should:

  • Have the same properties as the blank molecule, but with their values instead of description. - for the ref_url field, you can leave this blank, as it will be ignored for molecules that are not the blank molecule
  • Have a name that is meaningful, and will eventaully be displayed in Fragalysis - Use the PostEra submission ID, or if not available, a name that is meaningful (e.g. the PDB code, name used in publication, etc.)
  • Have the three following compulsary property fields:
    • ref_mols - a comma separated list of the fragments that inspired the design of the new molecule (codes as they appear in fragalysis - e.g. x0104_0,x0692_0)
    • ref_pdb - either (a) a filepath (relative to the sdf file) to an uploaded pdb file (e.g. Mpro-x0692_0/Mpro-x0692_0_apo.pdb) or (b) the code to the fragment pdb from fragalysis that should be used (e.g. x0692_0)
    • original SMILES - the original smiles of the compound before any computation was carried out

NB: only properties with numerical (or boolean) values will be displayed in fragalysis - this will be reviewed at a later date

1 Like

OPTIMIZED_DOCKING_MODEL_6LU7_PART1.zip (3.8 MB) OPTIMIZED_DOCKING_MODEL_6LU7_PART2.zip (1.9 MB)

I’m posting this to distribute some work I’ve been doing to validate and optimize docking models for Mpro inhibitor design.

This work started with the 6LU7 complex structure, since I thought the conformation of the N3-bound structure would best accommodate growing the fragments from the Diamond screening hits. I began by modeling the 6LU7 non-covalent encounter complex (filenames 6LU7_noncov_complex_model.*) – using the deprotonated thiolate form of C145, which forms a good h-bond with the amide NH of the N3 ligand. I ran a 500 ns trajectory of the dimer with ligand bound in one site only (the B-chain), using Desmond MD with SPC waters. From this trajectory, I saved 5000 frame. I generated OEDocking/FRED grids for every frame (as well as for the xtal structure model), and docked the entire dsi_poised library (768 compounds) using the OpenEye FRED program.

I calculated the ROC curves for recovering the non-covalent actives in the dsi_poised set published on the Diamond website. I computed the early enrichment factor for the top 100 ranked compounds, figuring that if we’re aiming to synthesize 50 per round, we could filter down to 100 designs by docking, and use free energy methods (or some other more complete scoring scheme) to narrow the final list to 50.

There were 3 frames that stood out as providing significant improvement over the performance of the x-ray structure model (which in and of itself wasn’t terrible). Those ROC plots are included as PNG files, and I’ll include them here as well:

6LU7 x-ray structure model

(image found in zip file)

FRAME #2408

(image found in zip file)

FRAME #4839

(image found in zip file)

FRAME #4853

ROC_FRAME_4853

Please feel free to review and test the attached structures with your own docking methods. Apologies that I could only post one of the ROC images, as a new user to the forum, the post is limited to 1.

Demetri

1 Like

Specification - ver_1.2

edit: clarification on pdb.zip upload format

The upload format for compounds will be allowed in one of two ways:

  1. A single sdf file
  2. A single sdf file plus pdb files for the ligands to be loaded into in fragalysis.

The sdf files for these two options will have a standardised format, to allow the following options:

  • The fragments that inspired the design of each molecule can be specified
  • The protein (in pdb file format) for each molecule can be specified
  • Any number of ‘properties’ or ‘scores’ can be specified.

The (SDF) format is as follows:

The sdf file name will be: compound-set_<name>.sdf with <name> replaced with the name you wish to give it. e.g. compound-set_fragmenstein.sdf

A ‘blank’ molecule will be the first in the sdf:

  • This molecule will contain all of the same fields as the sdf, containing a description of those fields.
  • The 3D coordinates of this molecule can be anything - they will be ignored.
  • The name (title line) of this molecule should be the file format specification version e.g. ver_1.2 (as defined in this document)
  • The molecule should have the following compulsory fields:
    • ref_url - the url to the forum post that describes the work
    • submitter_name - the name of the person submitting the compounds
    • submitter_email - the email address of the submitter
    • submitter_institution - the submitters institution
    • generation_date - the date that the file was generated in format yyyy-mm-dd
    • method - a name for the method used to generate the compound poses

NB: all of the compulsory fields for the blank mol can be included for the other molecules, but they will be ignored

Every other molecule in the sdf file will be assumed to be a molecule that is a computed molecule, and should:

  • Have the same properties as the blank molecule, but with their values instead of description. - for the ref_url field, you can leave this blank, as it will be ignored for molecules that are not the blank molecule
  • Have a name that is meaningful, and will eventaully be displayed in Fragalysis - Use the PostEra submission ID, or if not available, a name that is meaningful (e.g. the PDB code, name used in publication, etc.)
  • Have the three following compulsary property fields:
    • ref_mols - a comma separated list of the fragments that inspired the design of the new molecule (codes as they appear in fragalysis - e.g. x0104_0,x0692_0)
    • ref_pdb - either:
      • the file path of the pdb file in the uploaded zip file:
        • Example: If you upload a file called references.zip that contains a pdb file new_protein.pdb, the corresponding path in the ref_pdb file would be references/new_protein.pdb
      • the code to the fragment pdb from fragalysis that should be used (e.g. x0692_0)
    • original SMILES - the original smiles of the compound before any computation was carried out

NB: only properties with numerical (or boolean) values will be displayed in fragalysis - this will be reviewed at a later date

To help, there is more useful information on Fragalysis uploads in this topic - posting a link to save time:

Hi all,

Just a quick note about a problem I found with submissions. It turns out that since many of the submissions don’t have fragment codes associated with their records, the validation script was excluding them when I tried to upload. So I solved this problem by creating 2 fields to store this info:

ref_mols_orig: stores the codes the submitters included (with _0 appended)

ref_mols: has a default value of X0072_0 to enable the upload

Also, when trying to get the upload validated, I discovered that some codes that appear in the spreadsheet of hits (e.g. X1458) were not being accepted by Fragalysis. I’m not sure why, but can imagine people might be struggling with this.

Cheers,
Demetri