The RDKit’s code for doing R-group decomposition (RGD) is quite flexible but also rather “undocumented”. Thanks to that fact, you may not be aware of some of the cool stuff that’s there. This post is an attempt to at least begin to remedy that by looking at some of the edge cases that come up while doing RGD.
I have another post coming in the near future which is a bit more of a tutorial, but here we’ll look at a number of difficult/interesting problems that arise all the time when doing RGD on real-world datasets:
Handling symmetric cores
Handling stereochemistry
Handling sidechains that attach to the core at more than one point
Handling multiple scaffolds or variable scaffolds
Some of these problems are really tricky to solve perfectly, so please expect that there will be bugs (particularly in the code for handling symmetrization). If you find something that seems wrong, please do file a bug report, ideally with the exact code and structures that you used.
The code in this blog post behaves correctly with v2019.09.1 and later of the RDKit. Older versions have bugs that generate different results for some of the examples here.
Those labels were automatically assigned and they aren’t consistent with what we had above. Notice, however, that the symmetry in the scaffold has been properly handled.
If we care about the R group labels, We can explicitly label the side chains:
# note: there's a bug in RDKit 2019.03.3 and 2019.03.4 that causes this to generate different# results with those versionsscaffold = Chem.MolFromSmiles('c1c([*:2])c([*:3])ccn1')groups,_ = rdRGroupDecomposition.RGroupDecompose([scaffold],mols,asSmiles=False,asRows=False) PandasTools.RGroupDecompositionToFrame(groups,mols,include_core=True)
Mol
Core
R2
R3
0
1
2
3
4
We’ve just been looking at compound images since that’s a bit more readable. Here’s what the raw output from the function looks like:
Making sure that the sidechains are labelled correctly on chiral centers can be a bit trickier.
Here’s a set of molecules we’ll be using. Some have a chiral center, some don’t. There are a few that have sidechains with dual attachment points (i.e. rings). We’ll look at those in the next section.
mols = [x for x in Chem.SDMolSupplier('../data/rgd_chiral.sdf')]Draw.MolsToGridImage(mols,molsPerRow=4)
Remove the examples with “ring” sidechains. We’ll get to those later
q = Chem.MolFromSmarts('[R2]')mols = [x for x in mols ifnot x.HasSubstructMatch(q)]Draw.MolsToGridImage(mols,molsPerRow=4)
We’ll define two scaffolds, the first with a chiral center, the second without. In this case we will add explicit markers for the substituents. This is currently (v2019.09) necessary to properly handle atomic chirality.
Start with doing a decomposition with the non-chiral scaffold. This matches all the molecules, but generates results that are not consistent with the chirality. The compounds in rows 2 and 3 (numbered from zero) demonstrate the problem clearly.
Try the chiral scaffold. This one will only match the chiral compounds, but it does the right thing with those:
groups,unmatched = rdRGroupDecomposition.RGroupDecompose([chiral_scaffold],mols,asSmiles=False,asRows=False)tmols = [mols[x] for x inrange(len(mols)) if x notin unmatched]PandasTools.RGroupDecompositionToFrame(groups,tmols,include_core=True,redraw_sidechains=True)
[13:50:06] No core matches
[13:50:06] No core matches
[13:50:06] No core matches
[13:50:06] No core matches
Mol
Core
R1
R2
0
1
2
3
4
5
Note that in each case the atom is assigned to the correct R group.
We can also combine the two scaffolds so that we can get the chiral and achiral cases. Order is important, so we include the more specific scaffold (the chiral one) first. In this case the stereochemistry determines the R1/R2 assignment for the chiral molecules. For the non-chiral molecules R1 and R2 are assigned using the standard symmetrization code.
This one is tricky, and there’s not really a right answer, this is just a demonstration of what the current code does
mols = [x for x in Chem.SDMolSupplier('../data/rgd_chiral.sdf')]q = Chem.MolFromSmarts('[R2]')mols = [x for x in mols if x.HasSubstructMatch(q)]Draw.MolsToGridImage(mols,molsPerRow=4)
What happens if there are small variations in the scaffold within the series, something that we see all the time in med chem work?
mols = [Chem.MolFromSmiles(smi) for smi in'c1c(F)cccn1 c1c(Cl)c(C)ccn1 c1c(F)cncn1 c1c(F)c(C)ccn1'.split()]Draw.MolsToGridImage(mols,molsPerRow=4)
scaffold = Chem.MolFromSmiles('c1c([*:1])c([*:2])ccn1')groups,unmatched = rdRGroupDecomposition.RGroupDecompose([scaffold],mols,asSmiles=True,asRows=False)# the second return value, unmatched, provides the indices of the molecules that did not match a scaffold:print(unmatched)groups
You can see that now we only get three results, the third molecule (index 2) didn’t end up in the output. Sometimes this is ok, but in cases like this it would be great if that molecule were also included in the R-group decomposition.
One solution to this is to provide two different scaffolds:
What about if we have multiple scaffolds which share a common SAR? Here we just provide them both and label the attachment points manually to show the correspondance.
# we'll use just the first 16 molecules to make things a bit smaller for this demom16 = mols[:16]groups,unmatched = rdRGroupDecomposition.RGroupDecompose([core],m16,asSmiles=False,asRows=False)PandasTools.RGroupDecompositionToFrame(groups,m16,include_core=True,redraw_sidechains=True)
We can exclude any molecules that have R groups in non-labelled positions:
params = rdRGroupDecomposition.RGroupDecompositionParameters()params.onlyMatchAtRGroups =Truegroups,unmatched = rdRGroupDecomposition.RGroupDecompose([lcore],m16,asSmiles=False,asRows=False,options=params)tmols = [x for i,x inenumerate(m16) if i notin unmatched]PandasTools.RGroupDecompositionToFrame(groups,tmols,include_core=True,redraw_sidechains=True)
[13:50:08] No core matches
[13:50:08] No core matches
[13:50:08] No core matches
[13:50:08] No core matches
[13:50:08] No core matches
[13:50:08] No core matches
Mol
Core
R1
R2
0
1
2
3
4
5
6
7
8
9
There are other useful parameters to control the calculation in that RGroupDecompositionParameters object, but this post is already getting pretty long, so I’m going to wrap up now and leave exploring those as an exercise for the reader. ;-)