A deep dive into our retrosynthesis-based synthetic accessibility score, now available via Manifold API.
Literature Survey
Broadly speaking, there are two classes of synthetic accessibility scores: those that are complexity-based and those that are retrosynthesis-based. Complexity-based scores, such as those by Ertl et al. (which we will refer to as ESA), Coley et al. (SCS), and Thakkar et al. (TRA), analyze molecular features without considering synthetic routes or available building blocks. They are typically simpler to configure and faster to run, but less reliable than retrosynthesis-based scores, as indicated by the distribution of ESA and SCS scores for readily available, purchasable molecules (Figure 1).
Figure 1. Distribution of normalized ESA, SCS, and TRA scores for over 35 thousand readily available molecules. A readily available molecule is defined as a molecule that can be ordered within 10 business days from at least 8 different suppliers. For ease of comparison, ESA, SCS, and TRA scores were normalized to 0-1 by linearly rescaling from their original ranges of 1-10, 0-5, and 1-0, respectively.
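The normalization described in the caption is a plain linear rescaling, and can be sketched as follows. The function names are ours for illustration only; the ranges are the ones stated in the caption. Note that the TRA range runs from 1 to 0, so that score is inverted as well as rescaled.

```python
def normalize(score: float, low: float, high: float) -> float:
    """Linearly map `score` so that `low` becomes 0 and `high` becomes 1.

    `low` may be larger than `high`, in which case the scale is inverted.
    """
    return (score - low) / (high - low)


# Hypothetical helper names; ranges taken from the Figure 1 caption.
def normalize_esa(score: float) -> float:
    return normalize(score, 1.0, 10.0)


def normalize_scs(score: float) -> float:
    return normalize(score, 0.0, 5.0)


def normalize_tra(score: float) -> float:
    # TRA is scaled from 1-0: a raw value of 1 maps to 0, and 0 maps to 1.
    return normalize(score, 1.0, 0.0)
```

After this rescaling, 0 corresponds to the "easy" end of each scale and 1 to the "hard" end, which is what makes the three distributions directly comparable.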
A synthetic accessibility scorer aware of available molecules would know that all of these molecules should have a synthetic accessibility score near 0, with a small amount of variance from price and lead time. Such a scorer would also know that molecules that are one robust reaction step away from these readily available molecules should have a low synthetic accessibility score, and so on.
TRA mostly satisfies this particular objective (the majority of the TRA distribution is near 0), but it has other limitations due to its nature as a classifier that we’ll see below (Figure 2).
Retrosynthesis-based scores have the potential to be more reliable than complexity-based scores because they iteratively decompose the query molecule into precursors and check those precursors directly against catalogs. Furthermore, they can be interpreted by inspecting the routes produced by the retrosynthesis solver. We expect this advantage to increase as building block catalogs continue to expand with more complex structures.
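To make "iteratively decomposing into precursors and checking catalogs" concrete, here is a toy sketch. It is not our production search: the `disconnections` lookup table stands in for a real reaction model, molecules are plain strings, and the function name and `max_depth` parameter are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple


def min_steps(
    target: str,
    catalog: Set[str],
    disconnections: Dict[str, List[Tuple[str, ...]]],
    max_depth: int = 5,
) -> Optional[int]:
    """Return the fewest reaction steps (longest linear sequence) needed to
    reach `target` from catalog molecules, or None if no route is found."""
    if target in catalog:
        return 0  # purchasable: nothing to synthesize
    if max_depth == 0:
        return None  # search budget exhausted (also guards against cycles)

    best: Optional[int] = None
    for precursors in disconnections.get(target, []):
        depths = [
            min_steps(p, catalog, disconnections, max_depth - 1)
            for p in precursors
        ]
        if any(d is None for d in depths):
            continue  # at least one precursor cannot be sourced or made
        steps = 1 + max(depths)
        if best is None or steps < best:
            best = steps
    return best


# Toy data: T is made from I and B; I is made from A; A and B are purchasable.
catalog = {"A", "B"}
disconnections = {"T": [("I", "B")], "I": [("A",)]}
print(min_steps("T", catalog, disconnections))
```

In this toy setting, a molecule in the catalog sits at depth 0 and a molecule one step away sits at depth 1, mirroring the intuition that accessibility should degrade gradually with distance from purchasable material.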
However, retrosynthesis-based scores are significantly more difficult to set up, requiring more compute to run and ongoing catalog maintenance. By offering a retrosynthesis-based synthetic accessibility score via API, PostEra takes responsibility for all of the setup and maintenance, providing API users an easy-to-use and reliable synthetic accessibility estimate.
PostEra’s Retrosynthetic Accessibility Score (RSA)
Our retrosynthesis-based score is powered by the same technology as Manifold Synthesis search. To ensure reaction validity, all reactions are checked by Molecular Transformer (MT), a deep learning approach with best-in-class reaction prediction accuracy that members of our team helped pioneer. The MT we currently employ in production improves upon the published model with higher data quality and faster serving speed. Reaction precursors are checked against databases containing tens of billions of purchasable molecules, which helps us find the shortest possible routes.
In addition to being comprehensive and robust, our retrosynthesis engine is fast. We ran it against the same 100 molecules that AiZynthFinder (AZF) used in their comparison to ASKCOS, and not only did we find routes for 85/100 molecules (vs 55 for AZF and 62 for ASKCOS), we found them faster with a mean time of 2.2s for the first route (vs 7.1s for AZF and 14s for ASKCOS).
Using our retrosynthesis engine, we compute a synthetic accessibility score based on the found routes, with a scoring function that balances several factors, including the cost/lead-time of the building blocks and how likely Molecular Transformer deems the reactions to proceed. If multiple routes are found, which is the typical case, then the score is discounted based on the viability and diversity of backup alternative routes.
If no routes are found, then we leverage the explored search space by promoting simple precursors (as determined by a fast complexity-based score) to building blocks and searching for routes in this virtual space. This allows us to return meaningful synthetic accessibility scores for the tail end of difficult-to-synthesize compounds.
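The route-based scoring described in the two preceding paragraphs can be illustrated with a deliberately simplified sketch. This is not the production scoring function: the `Route` fields, the independence assumption between steps, the `backup_weight` parameter, and the specific formulas are all assumptions chosen to show the shape of the idea. Route diversity and the no-route fallback are not modeled; when no routes are supplied, the sketch simply returns the maximum score.

```python
from dataclasses import dataclass
from math import prod
from typing import List


@dataclass
class Route:
    # Hypothetical inputs: per-step likelihood that each reaction proceeds
    # (e.g. from a reaction predictor), and a per-building-block penalty in
    # [0, 1] summarizing cost and lead time.
    reaction_probs: List[float]
    block_penalties: List[float]


def route_difficulty(route: Route) -> float:
    """Difficulty in [0, 1]; 0 means trivially accessible."""
    success = prod(route.reaction_probs)  # assumes independent steps
    worst_block = max(route.block_penalties, default=0.0)
    return 1.0 - success * (1.0 - worst_block)


def accessibility_score(routes: List[Route], backup_weight: float = 0.5) -> float:
    """Score the best route, then discount it when viable backups exist."""
    if not routes:
        return 1.0  # simplification: the fallback search is not modeled here
    difficulties = sorted(route_difficulty(r) for r in routes)
    best, backups = difficulties[0], difficulties[1:]
    # Treat each backup difficulty as a failure chance; prod([]) == 1, so a
    # single route receives no discount.
    all_backups_fail = prod(backups)
    discount = 1.0 - backup_weight * (1.0 - all_backups_fail)
    return best * discount
```

With these assumptions, a one-step route from free, in-stock building blocks with a certain reaction scores 0, and adding a second equally viable route lowers the score relative to having only one.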
By combining a robust framework for assessing synthetic accessibility with a highly optimized retrosynthesis engine, we deliver what we believe is a best-in-class synthetic accessibility score.
Manifold Case Study
As an example, let’s consider three highly similar molecules and see how they compare against three openly available synthetic accessibility measures: ESA, SCS, and TRA.
Figure 2. Three highly similar molecules. The molecules are ranked from top to bottom by the PostEra Retrosynthetic Synthetic Accessibility (RSA) score, with molecules A and B rated as relatively simple and molecule C scored as very hard to synthesize.
The ESA and SCS scores are surprisingly high for molecule B and low for molecule C. TRA gives all three molecules a score of effectively 0, largely because it is trained as a classifier and cannot discriminate between relative degrees of synthetic difficulty.
Molecules A and B in Figure 2, despite the presence of oxetanes, are relatively simple targets for which we estimate straightforward routes from in-stock building blocks. In contrast, molecule C contains a rare oxetane derivative that is likely difficult to synthesize and is not commercially available.
The PostEra Retrosynthetic Synthetic Accessibility (RSA) score follows the approximate order estimated by our chemists. ESA, by contrast, gives quite a high score to the analog containing a bridgehead carbon, even though the corresponding building block is a popular piperazine bioisostere. The bottom oxetane analog (molecule C) does not seem to be appropriately penalized by any of the complexity-based scorers.
As demonstrated, one advantage of the PostEra score is that it will not inappropriately penalize fused, bridged, or spiro systems if the corresponding building blocks are readily available.
Another advantage of the PostEra retrosynthetic accessibility score is that it is easily interpretable. To understand why these molecules are scored the way they are, you can simply search Manifold for routes.
Figure 3. Synthesis search results for the Figure 2A compound in Manifold. The top proposed synthetic route relies on two purchasable building blocks. This illustrates the high synthetic accessibility of this molecule and explains its low PostEra Retrosynthetic Synthetic Accessibility score.
Synthetic accessibility scores are especially useful when paired with generative chemistry approaches that produce more molecules than can be manually screened by a medicinal chemist. Using the most reliable synthetic accessibility score provides more room to optimize for the potency and ADMET properties required in modern hit-to-lead and lead-optimization stages. We at PostEra have successfully employed our synthetic accessibility score to prioritize one of our recent scaffold hopping campaigns (A, B) for our COVID Moonshot drug discovery project.
To get access to the retrosynthetic accessibility score API, please complete this form. We hope you find it as useful as we do!