R-group decomposition (RGD) is a standard approach for analyzing chemical datasets and doing SAR analysis. Of course the RDKit provides code supporting RGD. Since there are a lot of different RGD use cases, and it turns out that doing R-group decomposition and providing useful results on arbitrary datasets is considerably more complicated than it may initially seem, the RDKit’s RGD implementation needs to be quite flexible (and the implementation is fairly complex). Unfortunately the code is also quite “underdocumented”… this tutorial is a first pass at fixing that. It will eventually end up in the RDKit documentation.
Here I will demonstrate some of the more useful (in my opinion) options of the RGD code using some real-world datasets. I may revisit this topic at some point in the future to explain some of the more advanced topics. If there are pieces you think I should cover, please let me know in the comments!
The first two datasets I’ll use in this post are created using the SMILES provided in the Supplementary Information from a couple of open-acess J Med Chem papers (Aside: isn’t it great that J. Med. Chem. encourages authors to provide SMILES for the structures in their papers and so many authors actually do so? Isn’t it depressing that the journals which focus on computational topics don’t do so?).
from rdkit import Chemfrom rdkit.Chem import Drawfrom rdkit.Chem.Draw import IPythonConsoleimport pandas as pdfrom rdkit.Chem import PandasToolsPandasTools.RenderImagesInAllDataFrames(images=True)import rdkitprint(rdkit.__version__)
2022.09.1
import osbaseDir ='../data/RGD_sets'
The basics
Here’s our first dataset, from https://pubs.acs.org/doi/10.1021/acs.jmedchem.8b00883
These datasets include the core query as the first element in the file (I added this, it was not in the original paper), so pull that out:
core = mols.pop(0)core
Bring in the RGD module
from rdkit.Chem import rdRGroupDecomposition
Here’s a quick demo of the basic usage of the RGD code. The return value is a 2-tuple: 1. the RGD results 2. a list with the indices of molecules which did not match any of the cores
In this case the molecule at position 15 did not match our core:
mols[15]
The output is a list with one entry per matching molecule. Each entry in the list is a dictionary with the core match and R groups for the corresponding molecule:
rgd[:2]
[{'Core': <rdkit.Chem.rdchem.Mol at 0x7fbeb7aeac00>,
'R1': <rdkit.Chem.rdchem.Mol at 0x7fbeb7aeac70>,
'R2': <rdkit.Chem.rdchem.Mol at 0x7fbeb7aeace0>,
'R3': <rdkit.Chem.rdchem.Mol at 0x7fbeb7aead50>,
'R4': <rdkit.Chem.rdchem.Mol at 0x7fbeb7aeadc0>,
'R5': <rdkit.Chem.rdchem.Mol at 0x7fbeb7aeae30>},
{'Core': <rdkit.Chem.rdchem.Mol at 0x7fbeb7aeaea0>,
'R1': <rdkit.Chem.rdchem.Mol at 0x7fbeb7aeaf10>,
'R2': <rdkit.Chem.rdchem.Mol at 0x7fbeb7aeaf80>,
'R3': <rdkit.Chem.rdchem.Mol at 0x7fbeb7aeaff0>,
'R4': <rdkit.Chem.rdchem.Mol at 0x7fbeb7aeb060>,
'R5': <rdkit.Chem.rdchem.Mol at 0x7fbeb7aeb0d0>}]
If you want to quickly look at the results from the RGD, it can be convenient to just display them in a Pandas DataFrame. The PandasTools module includes a function to make this easy. The function, RGroupDecompositionToFrame(), needs the results from the RGD in a different format, which we can get by calling RGroupDecompose() with the keyword argument asRows=False:
crgd,fails = rdRGroupDecomposition.RGroupDecompose([core],mols,asRows=False)PandasTools.RGroupDecompositionToFrame(crgd,[mols[i] for i inrange(len(mols)) if i notin fails],).head()
[14:15:53] No core matches
Mol
R1
R2
R3
R4
R5
0
1
2
3
4
When called this way, RGroupDecompose() returns a dictionary of lists instead of a list of dictionaries:
The core we provided has three labelled R groups, but the RGD produced five different R labels. The other labels were automatically assigned when molecules with additional substituents were encountered. This behavior can be disabled using the onlyMatchAtRGroups parameter when calling RGroupDecompose:
[14:15:54] No core matches
[14:15:54] No core matches
[14:15:54] No core matches
[14, 15, 16]
Now there are three molecules which didn’t match: 15 from before and then 14 and 16, which had additional substituents. The other molecules only have R1-R3:
rgd[0]
{'Core': <rdkit.Chem.rdchem.Mol at 0x7fbeb5a9e5e0>,
'R1': <rdkit.Chem.rdchem.Mol at 0x7fbeb5a9e650>,
'R2': <rdkit.Chem.rdchem.Mol at 0x7fbeb5a9e6c0>,
'R3': <rdkit.Chem.rdchem.Mol at 0x7fbeb5a9e730>}
Working without R labels
It’s also possible to do RGD by providing a core without labeled R groups:
crgd,fails = rdRGroupDecomposition.RGroupDecompose([core_without_labels],mols,asRows=False)print(len(fails))PandasTools.RGroupDecompositionToFrame(crgd,[mols[i] for i inrange(len(mols)) if i notin fails],).head()
1
[14:15:54] No core matches
Mol
R1
R2
R3
R4
0
1
2
3
4
Here the R labels are automatically assigned, but the result is otherwise the same as what we saw above.
Handing of chirality
For this we’ll use a new dataset, from https://pubs.acs.org/doi/10.1021/acs.jmedchem.5b00752
Once again, I manually constructed the core when putting these datasets together.
crgd,fails = rdRGroupDecomposition.RGroupDecompose([core_cp],mols,asRows=False)print(len(fails))df = PandasTools.RGroupDecompositionToFrame(crgd,[mols[i] for i inrange(len(mols)) if i notin fails])df.head()
8
Mol
R1
R2
R3
R5
0
1
2
3
4
This time the usual R group symmetrization has been applied and we end up with the piperazines all in R3
It’s also worth pointing out that this result set has a lot less failures - 8 instead of 36 with the original query. We can see one of the extra results in this excerpt from the results table: the third row is a molecule where chirality hasn’t been specified for the central atom.
We can accomplish the same thing without actually modifying the core itself by telling the RGD code to ignore chirality when doing its substructure search.
In the interest of not having this post contain gigabytes of images, here I’m just showing the results as SMILES:
ps = rdRGroupDecomposition.RGroupDecompositionParameters()ps.substructMatchParams.useChirality=Falsergd,fails = rdRGroupDecomposition.RGroupDecompose([core],mols,asRows=True,asSmiles=True,options=ps)print(len(fails))for row in rgd[:5]: row.pop('Core')print(row)
ps = rdRGroupDecomposition.RGroupDecompositionParameters()ps.onlyMatchAtRGroups =Truecrgd,fails = rdRGroupDecomposition.RGroupDecompose([core1,core2],chembl_mols,options=ps,asRows=False)print(len(fails))df = PandasTools.RGroupDecompositionToFrame(crgd,[chembl_mols[i] for i inrange(len(chembl_mols)) if i notin fails])df.head()
12
Mol
R1
R2
0
1
2
3
4
Note that we can accomplish the same thing in this case by including a query feature in the core:
The second and third molecule show how multiple attachments to a single non-labelled atom in the core are handled.
If you are working with cores where you allow R groups at unlabelled points and would rather not have two groups combined into a single group like this, you can disable the behavior:
This does result in the cyclic sidechain of the third molecule appearing twice; that seems a bit strange, but that’s the most consistent behavior we could come up with.