The SAR Table is primarily designed for data entry by providing scaffolds and substituents, but it is also useful when starting with constructed molecules: these can be broken down into constituents with the assistance of automated scaffold matching.
The SAR Table app is designed to expedite the creation of tables of scaffolds, substituents, whole molecules and data, by providing a user interface that takes as much repetition as possible out of the process of specifying each of the fragment components. Providing a scaffold and substituents for each entry allows the whole molecule structure to be assembled automatically.
This process can be run in reverse: there are a number of scenarios in which a collection of molecular structures is available for a series, e.g. from an MDL SD file, but the corresponding decomposition into core scaffolds and pendant substituents is not available, or not known. Classifying existing molecules in terms of scaffolds and substituents can be made faster and more effective by a feature that was introduced in v1.1: scaffold matching.
The scaffold matching process is conceptually straightforward. Consider the following molecule, and a core scaffold fragment:
A substructure search of the proposed scaffold core within the molecule has a single unique result, which is obvious to a human, and quite obvious to an algorithm. By overlaying the core onto the molecule, disconnection points can be derived, and the scaffold core can be relabelled as a Markush-style scaffold:
Two new substituent placeholders have been invented - R1 and R2 - and fragment values for both of these are implied by the substructure match.
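The disconnection step can be illustrated with a minimal Python sketch. It assumes the substructure search has already been done, so the set of molecule atoms covered by the scaffold core is known. The function name `derive_substituents` and the plain index-pair representation of bonds are hypothetical choices for illustration, not the app's or the webservice's actual data model.

```python
from collections import defaultdict

def derive_substituents(bonds, core_atoms):
    """Split a molecule into substituent fragments around a matched core.

    bonds: iterable of (atom_a, atom_b) index pairs for the whole molecule.
    core_atoms: set of molecule atom indices covered by the scaffold match.
    Returns a list of (fragment_atoms, attachment_bonds); each attachment
    bond is a (core_atom, fragment_atom) pair, i.e. a disconnection point.
    """
    neighbours = defaultdict(set)
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set(core_atoms)
    fragments = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Flood-fill one connected group of non-core atoms.
        fragment, stack = set(), [start]
        seen.add(start)
        while stack:
            atom = stack.pop()
            fragment.add(atom)
            for nbr in neighbours[atom]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        # Bonds that cross from the core into this fragment.
        attachments = sorted(
            (nbr, atom)
            for atom in fragment
            for nbr in neighbours[atom]
            if nbr in core_atoms
        )
        fragments.append((sorted(fragment), attachments))
    return fragments

# Toy molecule: core atoms 0-1-2, one single-atom substituent on atom 0
# and one two-atom substituent on atom 2.
bonds = [(0, 1), (1, 2), (0, 3), (2, 4), (4, 5)]
fragments = derive_substituents(bonds, core_atoms={0, 1, 2})
# Invent placeholder labels in order of discovery: R1, R2, ...
labelled = {f"R{i}": frag for i, frag in enumerate(fragments, start=1)}
```

A fragment that reports more than one attachment bond is connected to the core at several positions, which this sketch records but does not otherwise treat specially.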
The assignment and derivation of annotated scaffolds and substituents can be done in an automated fashion using the SAR Table app. The user must provide at least the core of the scaffold, with or without R-group style annotations, and in cases of ambiguity, the user must decide which assignment is preferred. The workflow involves an iterative interplay between selecting or drawing structure fragments, and automated derivation and convenience features to minimise repetition.
The following video demo shows the scaffold matching features described in this article: SAR Table Demo Part 2: Scaffold matching.
The following workflow demonstrates a simple case that involves 11 compounds from
In this example, it is postulated that the structures are available in a file called JMC2011548174.sdf. In order to use these structures with the SAR Table app, the file needs to be made available to the device. Often the easiest way to do this is to send it to yourself by email or, if it is downloadable via a URL, access it from the mobile browser:
Opening with the SAR Table app will activate the import process, which will bring in the new table (left), which ought to be promptly renamed (right):
Opening the new table reveals three columns:
The Molecule column has been imported from the SD file, as has the id. Scaffold was auto-created during the import process, and each entry is blank. Note that in the bottom-right corner of each of the molecules there is a small padlock logo. This indicates that it is a construct molecule, and that it is locked, to prevent it from being gratuitously replaced.
To get started with assigning scaffolds, the first thing to do is to provide the scaffold core that is going to be used, which is in this case indole. One way to do this is to draw indole directly in the first scaffold cell. In this case, though, the molecules are all well-sketched and properly oriented, so it is easier to copy-and-paste the first molecule into the scaffold cell, then trim off the substituents:
Note that the scaffold has not been decorated with any substituent labels (e.g. R1, R2, etc.). The matching algorithm will create these automatically, so providing them is optional.
Activate the Scaffold Match button:
The confirmation dialog will appear:
The scaffold candidate and the molecule to match it to are both displayed. The match is carried out after pressing the Submit button, which uses a webservice remote procedure call to compute a suitable result:
There are two possible results returned, and each of them shows the core scaffold that we submitted, decorated with two R-group placeholders, located at the appropriate positions for the substituents. Since the placeholder labels were not provided as part of the query, the matching service picked labels of its own: R1 and R2. In this case the two results are equivalent, but the user may have a preference for the ordering of the labels, and so a choice is offered.
Select the first entry, and press Apply:
Several modifications have been made to the row:
- The scaffold has been updated with R-group labels.
- Two new columns have been added: R1 and R2, and both have been populated with fragments.
- The construct molecule value is no longer locked, which means that it will be automatically rebuilt from the scaffold and substituents whenever any of them is modified.
Note that the scaffold matching is carried out by a remotely hosted webservice, which defaults to molsync.com. If your device is not currently connected to the internet, this feature will be unavailable. If your data is highly confidential and you are uncomfortable with transmitting service requests across the internet, the service can be hosted internally on an intranet. Contact us for more information.
For the second row, which shares the same scaffold, we can take advantage of the substituent placeholders that were determined in the first step. Duplicate the scaffold, and submit the second row to the matching service:
This time, a single result is obtained - this is because the R-group placeholders have already been defined for the scaffold, and since the scaffold matches the molecule non-degenerately, a single outcome is implied:
If all molecular fragment combinations lacked symmetry or any other kind of combinatorial mapping degeneracy, cheminformatics would be simple, but they don't and it isn't. The following sections discuss some of the relevant issues that apply when series data is less clear-cut.
The SAR Table app considers any "atom" to be an attachment placeholder if its placeholder is not an element symbol, and it is not an inline abbreviation. For drug discovery research, the R-group system is very common, e.g. R, R1, R2, etc., but there are many others that are used for special cases, e.g. X, Z, E, etc.
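The rule above can be expressed as a small predicate. The following Python sketch is an assumption-laden illustration: the element list is deliberately partial, and the function name and the way abbreviations are passed in are hypothetical rather than taken from the app.

```python
# Deliberately partial set of element symbols, for illustration only;
# a real implementation would need the full periodic table.
ELEMENT_SYMBOLS = {
    "H", "B", "C", "N", "O", "F", "Si", "P", "S", "Cl", "Se", "Br", "I",
}

def is_attachment_placeholder(label, abbreviations=frozenset()):
    """Return True if an atom label should be treated as an attachment point.

    A label counts as a placeholder when it is neither an element symbol
    nor one of the inline abbreviations known to the table.
    """
    return label not in ELEMENT_SYMBOLS and label not in abbreviations

# R-group labels and special-case labels are placeholders ...
placeholders = [lbl for lbl in ["R", "R1", "X", "Z", "Cl", "N"]
                if is_attachment_placeholder(lbl)]
# ... unless the label has been defined as an inline abbreviation.
ph_is_placeholder = is_attachment_placeholder("Ph", abbreviations={"Ph"})
```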
During the matching process, the submitted scaffold core may encounter cases where additional labels are implied, i.e. there is not already an existing attachment placeholder ready to be used. If the scaffold query already had some attachment placeholders defined, the result may contain extra attachment placeholders. Note that when there are multiple results, those which do not require creating new attachment placeholders will be favoured, and others may be omitted.
When new attachment placeholders are required in order to formulate a result, they are numbered using the R-group system, continuing from the highest number already present, e.g. if the scaffold already uses R1, R2 and R3, the next label chosen will be R4.
In cases where multiple new placeholders need to be added, the possibilities are expanded out in a factorial set, up to a point, e.g. for two new attachments, two permutations, e.g. (R1,R2) and (R2,R1), will be returned; for three new attachments, 6 permutations will be returned. Since these are all as good as each other to the algorithm, but the chemist is likely to have a preference, all of them are offered as options.
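As a rough illustration of the numbering and permutation behaviour, the sketch below generates every assignment of new R-labels to a list of attachment sites. The function name and the use of arbitrary site identifiers are hypothetical, and the cut-off that the service applies for larger numbers of new attachments is not modelled.

```python
import re
from itertools import permutations

def new_label_assignments(existing_labels, sites):
    """Enumerate ways of giving new R-labels to unlabelled attachment sites.

    existing_labels: labels already present on the scaffold, e.g. ["R1", "X"].
    sites: identifiers of the attachment positions that need a new label.
    Returns one {site: label} dict per permutation of the new labels.
    """
    # Collect the numbers of existing labels of the form R<number>.
    used = [
        int(m.group(1))
        for lbl in existing_labels
        if (m := re.fullmatch(r"R(\d+)", lbl))
    ]
    first = max(used, default=0) + 1
    new_labels = [f"R{first + i}" for i in range(len(sites))]
    return [dict(zip(sites, perm)) for perm in permutations(new_labels)]

# Two new attachments on an unlabelled scaffold: two permutations.
two_sites = new_label_assignments([], ["a", "b"])
# One new attachment on a scaffold that already uses R1-R3: the label is R4.
one_site = new_label_assignments(["R1", "R2", "R3"], ["a"])
```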
For new substituents that are monodentate, i.e. have just one point of attachment to the scaffold, there are two options for labelling the connection point on the fragment structure: one is to use the same label as on the scaffold, so that they can be matched up literally (e.g. R1-to-R1); the other is to use an arbitrary label, such as R, the definition of which is not important because there is only one option. The latter style is favoured, partly because it is more convenient for cutting-and-pasting into different columns, but if existing substituents from other rows have already set a precedent for one style, it will be honoured.
Any practical form of substructure searching needs to take into account aromaticity, although there is no need to conflate the cheminformatics requirements with the physical observation of an aromatic ring current. The algorithm used for scaffold matching uses a very simple definition, which is based on alternating double-bonds within groups of one-or-more 6-membered rings: if the alternating single/double bond pattern cannot be localised unambiguously, then the constituent bonds are reclassified as aromatic. Aromatic bonds never match non-aromatic bonds, and vice versa.
The following compounds have 4 rings that are considered aromatic:
Pyridine and naphthalene are considered aromatic, and any alternating single/double bond combination will match any other. Quinone is not, and this is not a problem since there is one resonance form in general use. The same is true for conjugated cyclic amides and 5-membered rings such as thiophene, for which consistent bond localisation is almost universal.
Highly fused ring systems such as coronene are considered aromatic, but larger conjugated systems such as porphyrins are not detected, nor are ambiguous 5-membered ring systems such as imidazolium ions. These are current limitations of the system, and will likely be corrected in the near future.
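The simplest case of this definition, a single isolated 6-membered ring, can be sketched in a few lines of Python. This is only an illustration under that simplifying assumption: the function name is hypothetical, and fused systems such as naphthalene or coronene would need the rings to be considered as a group, which is not attempted here.

```python
def ring_is_alternating(bond_orders):
    """Check one isolated ring for an alternating single/double pattern.

    bond_orders: the bond orders of the ring, listed in cyclic order.
    Only 6-membered rings qualify; every adjacent pair of ring bonds
    (including the wrap-around pair) must be one single and one double.
    """
    if len(bond_orders) != 6:
        return False
    return all(
        {bond_orders[i], bond_orders[(i + 1) % 6]} == {1, 2}
        for i in range(6)
    )

# Both Kekule forms of a benzene-like ring are classified the same way,
# so either drawing would be reclassified as aromatic.
form_a = ring_is_alternating([2, 1, 2, 1, 2, 1])
form_b = ring_is_alternating([1, 2, 1, 2, 1, 2])
# A quinone-like ring has two adjacent single bonds, so it stays localised.
quinone_like = ring_is_alternating([1, 1, 2, 1, 1, 2])
```

Under this scheme, a query ring and a target ring drawn in different resonance forms both end up with aromatic bonds and can therefore match, whereas a ring that stays localised is compared by its literal bond orders.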
Consider the following match predicate shown for the second row:
There are 3 things to see:
- The whole molecule is oriented differently from the proposed scaffold, which was copied from a previous entry.
- The 6-membered aromatic ring shows a different resonance form.
- The amide substituent corresponding to R1 is flipped, relative to previous R1 assignments (e.g. previous row).
The matching process turns up a single result, which can be applied to the data:
Note that the fragment obtained for R1 has a different orientation from the input molecule: the structure of the fragment was found to be canonically equivalent to a substituent that already existed in the table, i.e. the amide substituent found in the first 2 rows, and so this sketch representation is used preferentially. Only when new substituents are encountered is the geometry from the submitted molecule used.
The molecule has been redrawn, by reconstructing it starting with the scaffold, then grafting on each of the substituents. Because the aromatic rings were properly matched, despite having different resonance patterns, the permutation found in the scaffold takes precedence.
The locked state has also been switched off, so any further modifications to the scaffold or substituents will be immediately reflected in the construct molecule.
Matched substituents do not need to be terminal. Consider the following table, for which two substituents are already defined from the results of the first row. The molecule in the second row is based on the same core scaffold, with a ring-closing substituent:
The result for the second row indicates that there is just one substituent fragment, and it is arbitrarily shoehorned into the R1 column:
The payload actually contains two substituent attachment points:
Note the tilde glyph that is shown in the R2 column, to indicate that the attachment point is handled by spill-over from another substituent.
Scaffold-substituent disconnects should generally not be used to disconnect and reconnect aromatic ring systems. The scaffold matching algorithm will not discover such matches, since breaking of rings shuts down the aromaticity detection.
Symmetry and ranking
Many scaffolds have internal symmetry, rotatable bonds, or other ways of matching degenerately, e.g. in molecules that have multiple occurrences of the scaffold. These cases typically lead to multiple solutions for scaffold matching, but they are not all necessarily equal, and there are often grounds for selecting favourites.
Consider a table that has two rows, the first of which is fully defined, and the second is about to be submitted for a match:
The pendant phenyl substituent is connected via a rotatable bond, so any molecule that is a superstructure of this core will be able to match in at least two ways, i.e. as-drawn, and rotated 180 degrees about the rotatable bond. However, the phenyl group contains two attachment placeholders - R1 and R2 - and the implied values for these can be used to differentiate the results:
As can be seen, there are two possible answers that come back from the scaffold match. The first result defines the R1 substituent as methoxy. Note that this is the same assignment for R1 as is seen on the first row:
It is not a coincidence that the match for which R1 = methoxy is listed first. The algorithm performs a series of reduction steps to consolidate degenerate results, then orders those which remain according to commonality with existing assignments. As a general rule, patterns of substituents should be grouped, although there are plenty of exceptions, so the user still has to make the decision.
There are also a number of other criteria for reducing the results. For example, when the incoming scaffold is decorated with attachment labels, the algorithm strongly favours matches which do not require any new auto-created attachment labels.
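The ordering by commonality with existing assignments might look something like the following Python sketch. It assumes each candidate result and each existing row can be reduced to a mapping from label to a canonical fragment name. The function name and the string fragment names are hypothetical, and the additional reduction criteria, such as disfavouring new auto-created labels, are not modelled.

```python
from collections import Counter

def rank_candidates(candidates, existing_rows):
    """Order candidate matches by agreement with assignments already in use.

    candidates: list of {label: fragment} dicts proposed for the new row.
    existing_rows: list of {label: fragment} dicts from rows already matched.
    Candidates that reuse more (label, fragment) pairs come first; ties keep
    their original order because Python's sort is stable.
    """
    seen = Counter(
        (label, frag) for row in existing_rows for label, frag in row.items()
    )

    def score(candidate):
        return sum(seen[(label, frag)] for label, frag in candidate.items())

    return sorted(candidates, key=score, reverse=True)

# The first row already has R1 = methoxy, so that assignment is preferred.
existing = [{"R1": "methoxy", "R2": "H"}]
candidates = [
    {"R1": "H", "R2": "methoxy"},
    {"R1": "methoxy", "R2": "H"},
]
ranked = rank_candidates(candidates, existing)
```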
The scaffold's label placeholders are used to indicate where the substituents are to be reattached, but there is a deeper significance when a label is reused: by definition, every occurrence of the same label must correspond to an identical substituent. This is easily enforced within the workings of the SAR Table operation, since each substituent label gets one and only one column to work with, but it is also strictly enforced by the scaffold matching service.
Consider a scaffold which is an amide functional group, where both sides have the attachment label R:
In the first row shown above, the whole molecule can be matched to the scaffold, because in both cases the resulting assignment is that R = phenyl. The app also helpfully annotates the column with x2, which shows that two units of this substituent are being consumed in order to build the construct molecule.
For the second row, though, the match fails: the only pre-candidate match deduces that R = phenyl, and also that R = o-tolyl, which violates the conservation of substituents.
The same applies when labels are connected to the same scaffold atom:
In the first case the match is successful, because X = F and Z = Cl, but in the second case no match is found, because none of the pre-candidates can satisfy the label equivalence condition.
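The label equivalence condition can be sketched as a consolidation step applied to each pre-candidate match. In this minimal Python illustration the function name is hypothetical, and fragments are represented by plain names standing in for canonical structures.

```python
def consolidate_assignments(pairs):
    """Merge the (label, fragment) pairs implied by one pre-candidate match.

    Returns {label: (fragment, count)}, where count is the number of times
    the substituent is consumed, or None if any label would have to stand
    for two different fragments.
    """
    result = {}
    for label, fragment in pairs:
        if label in result:
            known, count = result[label]
            if known != fragment:
                return None  # label equivalence violated: reject the match
            result[label] = (known, count + 1)
        else:
            result[label] = (fragment, 1)
    return result

# Both sides of the amide are phenyl: accepted, with the substituent used twice.
accepted = consolidate_assignments([("R", "phenyl"), ("R", "phenyl")])
# One side phenyl, the other o-tolyl: rejected.
rejected = consolidate_assignments([("R", "phenyl"), ("R", "o-tolyl")])
```

The count in the accepted result corresponds to the x2 annotation that the app displays on the column.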
Stereochemical constraints, in the form of sp3 tetrahedral atom centres and alkene-like restricted rotation double bonds, are honoured. Stereocentres that occur within the body of the scaffold core, or any substituent fragment, are compared for equivalence. When the stereocentre involves crossing between a scaffold- and substituent-atom, e.g. into an R-group label, the geometry and wedge bond of the scaffold is used to make the determination.
Consider the following two cases:
For the first row, the R-group must match the chloro substituent, and the wedges imply the same geometry. For the second row, the match is not consistent with the wedges, and so is rejected. Note that the substituent itself does not show a wedge bond for the bond to the attachment placeholder, because the up/down orientation is taken from the scaffold.
When matching stereochemistry for hydrogen-suppressed chiral centres, the results can be a little counter-intuitive: scaffold matches can in some cases be satisfied literally by creating new substituent labels, which is not always the result that was expected. It is up to the user to reject these proposals if they are not desired.
The same applies to cis/trans alkenes: the match must be consistent with the inputs.
The SAR Table app supports the use of inline abbreviations, and these can also be used for scaffold matching purposes. However, at the present time only the label is used for the atom-to-atom match, e.g. if the label "E" is used to abbreviate two different structures, they will still be considered equivalent. This is a known limitation that may be addressed soon.
Scaffold matching is a powerful addition to the SAR Table app, which makes it useful for deriving scaffold/substituent combinations from an arbitrary source of molecules. The underlying substructure scaffold matching algorithm can be used both to verify hypotheses and automate a large part of the work of converting a series into scaffold/substituent fragments.