from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem import Descriptors
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import Crippen
import rdkit
print(rdkit.__version__)
2024.03.1
April 17, 2023
Updated 15.04.2024 Added a section at the bottom about the EState_VSA
and VSA_EState
descriptors.
A couple of weeks ago I read a fun preprint from a collaboration between Microsoft Research and Novartis: “Learning chemical intuition from humans in the loop” and noticed that one of the high-ranking descriptors in the authors’ analysis of the model they built is SMR_VSA3
(look at figure 4 in the preprint). Since I’ve seen that descriptor, or other descriptors from the same family, show up repeatedly when analyzing models, and because I bet that most people don’t really have much (if any) idea of what exactly those descriptors are, I figured it’s worth a short blog post.
The descriptors were introduced in the paper “A widely applicable set of descriptors” published by Paul Labute in the Journal of Molecular Graphics and Modelling back in 2000. Here’s a link Random aside: I’m a bit surprised to see that JMGM still exists… it used to be on my standard reading list back in the day, but I haven’t thought about it in years. :-)
Added 19.04.2023 Esben Jannik Bjerrum pointed out that the original version of that article is still available via the wayback machine. Thanks Esben!
I won’t get deeply into the motivation and derivation, read the paper for that, but Paul wanted to come up with a set of descriptors which were generally useful for QSAR studies. He published a three sets of related descriptors: SlogP_VSAX
, SMR_VSAX
, and PEOE_VSAX
which are all based on the same idea: you calculate the contribution of each atom in the molecule to a molecular property (either LogP, MR, or the partial charge) along with the contribution of each atom to an approximate molecular surface area measure (this is the VSA part), assign the atoms to bins based on the property contributions, and then sum up the VSA contributions for each atom in a bin.
Sounds complicated, but it isn’t. Here’s a simple example, the molecule methylamine CN
atom | logp contribution | VSA contribution |
---|---|---|
C | -0.2035 | 7.048 |
N | -1.019 | 5.734 |
The boundaries of the LogP bins for the SlogP_VSA
descriptor that the RDKit uses are:
[-0.4, -0.2, 0, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6]
So the N would add a contribution of 5.734 to bin 1 and the C would add a contribution of 7.048 to bin 2 (for this descriptor the bins are labelled from 1).
The boundaries for SMR and PEOE are:
[1.29, 1.82, 2.24, 2.45, 2.75, 3.05, 3.63, 3.8, 4.0]
and
[-.3, -.25, -.20, -.15, -.10, -.05, 0, .05, .10, .15, .20, .25, .30]
For what it’s worth, Here’s where you can find those bin definitions in the RDKit source code: - SlogP_VSA
: https://github.com/rdkit/rdkit-orig/blob/master/Code/GraphMol/Descriptors/MolSurf.cpp#L256 - SMR_VSA
: https://github.com/rdkit/rdkit-orig/blob/master/Code/GraphMol/Descriptors/MolSurf.cpp#L281 - PEOE_VSA
: https://github.com/rdkit/rdkit-orig/blob/master/Code/GraphMol/Descriptors/MolSurf.cpp#L306
These descriptors do end up being quite useful, but they at first seem totally non-interpretable. It turns out that this isn’t completely true: with a bit of digging it’s possible to get a bit of a sense of what the descriptors actually mean.
So with that background, let’s go figure out what SMR_VSA3
actually means.
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem import Descriptors
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import Crippen
import rdkit
print(rdkit.__version__)
2024.03.1
Let’s start by looking at the docstring for SMR_VSA3
, because that will tell us what the bins are (we could also look them up above, but then we’d have to figure out which bins are where… this is easier:
So we know that SMR_VSA3
is the sum of the VSA contributions from atoms which contribute between 1.82 and 2.24 to the SMR value.
The SMR (and SlogP) contributions are calculated using a method published by Wildman and Crippen back in 1999: https://doi.org/10.1021/ci990307l) using SMARTS definitions.
Here’s the data that the RDKit uses for that:
# from: https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/Descriptors/Crippen.cpp#L194
# each line includes the atom label, the SMARTS, the logP contribution, the MR contribution, and an optional note
rdkit_data='''C1 [CH4] 0.1441 2.503
C1 [CH3]C 0.1441 2.503
C1 [CH2](C)C 0.1441 2.503
C2 [CH](C)(C)C 0 2.433
C2 [C](C)(C)(C)C 0 2.433
C3 [CH3][N,O,P,S,F,Cl,Br,I] -0.2035 2.753
C3 [CH2X4]([N,O,P,S,F,Cl,Br,I])[A;!#1] -0.2035 2.753
C4 [CH1X4]([N,O,P,S,F,Cl,Br,I])([A;!#1])[A;!#1] -0.2051 2.731
C4 [CH0X4]([N,O,P,S,F,Cl,Br,I])([A;!#1])([A;!#1])[A;!#1] -0.2051 2.731
C5 [C]=[!C;A;!#1] -0.2783 5.007
C6 [CH2]=C 0.1551 3.513
C6 [CH1](=C)[A;!#1] 0.1551 3.513
C6 [CH0](=C)([A;!#1])[A;!#1] 0.1551 3.513
C6 [C](=C)=C 0.1551 3.513
C7 [CX2]#[A;!#1] 0.0017 3.888
C8 [CH3]c 0.08452 2.464
C9 [CH3]a -0.1444 2.412
C10 [CH2X4]a -0.0516 2.488
C11 [CHX4]a 0.1193 2.582
C12 [CH0X4]a -0.0967 2.576
C13 [cH0]-[A;!C;!N;!O;!S;!F;!Cl;!Br;!I;!#1] -0.5443 4.041
C14 [c][#9] 0 3.257
C15 [c][#17] 0.245 3.564
C16 [c][#35] 0.198 3.18
C17 [c][#53] 0 3.104
C18 [cH] 0.1581 3.35
C19 [c](:a)(:a):a 0.2955 4.346
C20 [c](:a)(:a)-a 0.2713 3.904
C21 [c](:a)(:a)-C 0.136 3.509
C22 [c](:a)(:a)-N 0.4619 4.067
C23 [c](:a)(:a)-O 0.5437 3.853
C24 [c](:a)(:a)-S 0.1893 2.673
C25 [c](:a)(:a)=[C,N,O] -0.8186 3.135
C26 [C](=C)(a)[A;!#1] 0.264 4.305
C26 [C](=C)(c)a 0.264 4.305
C26 [CH1](=C)a 0.264 4.305
C26 [C]=c 0.264 4.305
C27 [CX4][A;!C;!N;!O;!P;!S;!F;!Cl;!Br;!I;!#1] 0.2148 2.693
CS [#6] 0.08129 3.243
H1 [#1][#6,#1] 0.123 1.057
H2 [#1]O[CX4,c] -0.2677 1.395
H2 [#1]O[!#6;!#7;!#8;!#16] -0.2677 1.395
H2 [#1][!#6;!#7;!#8] -0.2677 1.395
H3 [#1][#7] 0.2142 0.9627
H3 [#1]O[#7] 0.2142 0.9627
H4 [#1]OC=[#6,#7,O,S] 0.298 1.805
H4 [#1]O[O,S] 0.298 1.805
HS [#1] 0.1125 1.112
N1 [NH2+0][A;!#1] -1.019 2.262
N2 [NH+0]([A;!#1])[A;!#1] -0.7096 2.173
N3 [NH2+0]a -1.027 2.827
N4 [NH1+0]([!#1;A,a])a -0.5188 3
N5 [NH+0]=[!#1;A,a] 0.08387 1.757
N6 [N+0](=[!#1;A,a])[!#1;A,a] 0.1836 2.428
N7 [N+0]([A;!#1])([A;!#1])[A;!#1] -0.3187 1.839
N8 [N+0](a)([!#1;A,a])[A;!#1] -0.4458 2.819
N8 [N+0](a)(a)a -0.4458 2.819
N9 [N+0]#[A;!#1] 0.01508 1.725
N10 [NH3,NH2,NH;+,+2,+3] -1.95
N11 [n+0] -0.3239 2.202
N12 [n;+,+2,+3] -1.119
N13 [NH0;+,+2,+3]([A;!#1])([A;!#1])([A;!#1])[A;!#1] -0.3396 0.2604
N13 [NH0;+,+2,+3](=[A;!#1])([A;!#1])[!#1;A,a] -0.3396 0.2604
N13 [NH0;+,+2,+3](=[#6])=[#7] -0.3396 0.2604
N14 [N;+,+2,+3]#[A;!#1] 0.2887 3.359
N14 [N;-,-2,-3] 0.2887 3.359
N14 [N;+,+2,+3](=[N;-,-2,-3])=N 0.2887 3.359
NS [#7] -0.4806 2.134
O1 [o] 0.1552 1.08
O2 [OH,OH2] -0.2893 0.8238
O3 [O]([A;!#1])[A;!#1] -0.0684 1.085
O4 [O](a)[!#1;A,a] -0.4195 1.182
O5 [O]=[#7,#8] 0.0335 3.367
O5 [OX1;-,-2,-3][#7] 0.0335 3.367
O6 [OX1;-,-2,-2][#16] -0.3339 0.7774
O6 [O;-0]=[#16;-0] -0.3339 0.7774
O12 [O-]C(=O) -1.326 \"order flip here intentional\"
O7 [OX1;-,-2,-3][!#1;!N;!S] -1.189 0
O8 [O]=c 0.1788 3.135
O9 [O]=[CH]C -0.1526 0
O9 [O]=C(C)([A;!#1]) -0.1526 0
O9 [O]=[CH][N,O] -0.1526 0
O9 [O]=[CH2] -0.1526 0
O9 [O]=[CX2]=O -0.1526 0
O10 [O]=[CH]c 0.1129 0.2215
O10 [O]=C([C,c])[a;!#1] 0.1129 0.2215
O10 [O]=C(c)[A;!#1] 0.1129 0.2215
O11 [O]=C([!#1;!#6])[!#1;!#6] 0.4833 0.389
OS [#8] -0.1188 0.6865
F [#9-0] 0.4202 1.108
Cl [#17-0] 0.6895 5.853
Br [#35-0] 0.8456 8.927
I [#53-0] 0.8857 14.02
Hal [#9,#17,#35,#53;-] -2.996
Hal [#53;+,+2,+3] -2.996
Hal [+;#3,#11,#19,#37,#55] -2.996 \"Footnote h indicates these should be here?\"
P [#15] 0.8612 6.92
S2 [S;-,-2,-3,-4,+1,+2,+3,+5,+6] -0.0024 7.365 \"Order flip here is intentional\"
S2 [S-0]=[N,O,P,S] -0.0024 7.365 \"Expanded definition of (pseudo-)ionic S\"
S1 [S;A] 0.6482 7.591 \"Order flip here is intentional\"
S3 [s;a] 0.6237 6.691
Me1 [#3,#11,#19,#37,#55] -0.3808 5.754
Me1 [#4,#12,#20,#38,#56] -0.3808 5.754
Me1 [#5,#13,#31,#49,#81] -0.3808 5.754
Me1 [#14,#32,#50,#82] -0.3808 5.754
Me1 [#33,#51,#83] -0.3808 5.754
Me1 [#34,#52,#84] -0.3808 5.754
Me2 [#21,#22,#23,#24,#25,#26,#27,#28,#29,#30] -0.0025
Me2 [#39,#40,#41,#42,#43,#44,#45,#46,#47,#48] -0.0025
Me2 [#72,#73,#74,#75,#76,#77,#78,#79,#80] -0.0025 '''
Let’s parse that data so that we can do something with it:
from collections import namedtuple
CrippenTuple = namedtuple('CrippenTuple',('name','smarts','logp_contrib','mr_contrib','note'))
lines = [x.split('\t') for x in rdkit_data.split('\n')]
crippenData = []
for i,entry in enumerate(lines):
entry[2] = float(entry[2])
if entry[3]:
entry[3] = float(entry[3])
else:
entry[3] = None
crippenData.append(CrippenTuple(*entry))
print(crippenData[:3])
[CrippenTuple(name='C1', smarts='[CH4]', logp_contrib=0.1441, mr_contrib=2.503, note=''), CrippenTuple(name='C1', smarts='[CH3]C', logp_contrib=0.1441, mr_contrib=2.503, note=''), CrippenTuple(name='C1', smarts='[CH2](C)C', logp_contrib=0.1441, mr_contrib=2.503, note='')]
Now define two functions to look things up in that data:
import re
def find_contribs_for_bin(lower,upper,crippenData=crippenData,which='mr_contrib'):
' returns a list of Crippen contributions which are between lower and upper '
res = []
for tpl in crippenData:
v = getattr(tpl,which)
if v is not None and v>=lower and v<=upper:
res.append(tpl)
return res
def find_tuples_for_atom(symbol,crippenData=crippenData):
' returns a list of crippen contributions for a particular atomic symbol, i.e. "C" or "N"'
res = []
anum = Chem.GetPeriodicTable().GetAtomicNumber(symbol)
for tpl in crippenData:
if tpl.name.startswith(symbol) or re.match(f'\[[^\]]*#{anum}[^0-9]',tpl.smarts):
res.append(tpl)
return res
Now we can use that function to lookup the contributions which fall within the SMR_VSA3
SMR bins:
[CrippenTuple(name='N2', smarts='[NH+0]([A;!#1])[A;!#1]', logp_contrib=-0.7096, mr_contrib=2.173, note=''),
CrippenTuple(name='N7', smarts='[N+0]([A;!#1])([A;!#1])[A;!#1]', logp_contrib=-0.3187, mr_contrib=1.839, note=''),
CrippenTuple(name='N11', smarts='[n+0]', logp_contrib=-0.3239, mr_contrib=2.202, note=''),
CrippenTuple(name='NS', smarts='[#7]', logp_contrib=-0.4806, mr_contrib=2.134, note='')]
So the only atom types which can contribute to this bin are Ns.
The first two atom type definitions are reasonably specific: - N2: neutral aliphatic N with one H and two non-hydrogen aliphatic neighbors - N7: neutral aliphatic N with three non-hydrogen aliphatic neighbors
The last two are less specific: - N11: neutral aromatic N - NS: any nitrogen
These would only be matched to atoms which haven’t matched any of the more specific defintions for N.
Here’s the full set of N definitions:
[CrippenTuple(name='N1', smarts='[NH2+0][A;!#1]', logp_contrib=-1.019, mr_contrib=2.262, note=''),
CrippenTuple(name='N2', smarts='[NH+0]([A;!#1])[A;!#1]', logp_contrib=-0.7096, mr_contrib=2.173, note=''),
CrippenTuple(name='N3', smarts='[NH2+0]a', logp_contrib=-1.027, mr_contrib=2.827, note=''),
CrippenTuple(name='N4', smarts='[NH1+0]([!#1;A,a])a', logp_contrib=-0.5188, mr_contrib=3.0, note=''),
CrippenTuple(name='N5', smarts='[NH+0]=[!#1;A,a]', logp_contrib=0.08387, mr_contrib=1.757, note=''),
CrippenTuple(name='N6', smarts='[N+0](=[!#1;A,a])[!#1;A,a]', logp_contrib=0.1836, mr_contrib=2.428, note=''),
CrippenTuple(name='N7', smarts='[N+0]([A;!#1])([A;!#1])[A;!#1]', logp_contrib=-0.3187, mr_contrib=1.839, note=''),
CrippenTuple(name='N8', smarts='[N+0](a)([!#1;A,a])[A;!#1]', logp_contrib=-0.4458, mr_contrib=2.819, note=''),
CrippenTuple(name='N8', smarts='[N+0](a)(a)a', logp_contrib=-0.4458, mr_contrib=2.819, note=''),
CrippenTuple(name='N9', smarts='[N+0]#[A;!#1]', logp_contrib=0.01508, mr_contrib=1.725, note=''),
CrippenTuple(name='N10', smarts='[NH3,NH2,NH;+,+2,+3]', logp_contrib=-1.95, mr_contrib=None, note=''),
CrippenTuple(name='N11', smarts='[n+0]', logp_contrib=-0.3239, mr_contrib=2.202, note=''),
CrippenTuple(name='N12', smarts='[n;+,+2,+3]', logp_contrib=-1.119, mr_contrib=None, note=''),
CrippenTuple(name='N13', smarts='[NH0;+,+2,+3]([A;!#1])([A;!#1])([A;!#1])[A;!#1]', logp_contrib=-0.3396, mr_contrib=0.2604, note=''),
CrippenTuple(name='N13', smarts='[NH0;+,+2,+3](=[A;!#1])([A;!#1])[!#1;A,a]', logp_contrib=-0.3396, mr_contrib=0.2604, note=''),
CrippenTuple(name='N13', smarts='[NH0;+,+2,+3](=[#6])=[#7]', logp_contrib=-0.3396, mr_contrib=0.2604, note=''),
CrippenTuple(name='N14', smarts='[N;+,+2,+3]#[A;!#1]', logp_contrib=0.2887, mr_contrib=3.359, note=''),
CrippenTuple(name='N14', smarts='[N;-,-2,-3]', logp_contrib=0.2887, mr_contrib=3.359, note=''),
CrippenTuple(name='N14', smarts='[N;+,+2,+3](=[N;-,-2,-3])=N', logp_contrib=0.2887, mr_contrib=3.359, note=''),
CrippenTuple(name='NS', smarts='[#7]', logp_contrib=-0.4806, mr_contrib=2.134, note='')]
Let’s check our understanding…
Start by constructing a molecule with a neutral aliphatic N which has one H and two non-H aliphatic neighbors:
Calculate the Crippen contribs for that. We expect the second atom to be of type N2
, with an MR contribution (the second element of each of the 2-tuples in the list below) of 2.173:
Perfect!
What’s the VSA contribution of the atoms? For historical reasons we get these with the function rdMolDescriptors._CalcLabuteASAContribs()
:
[7.04767198267719, 5.316788604006331, 7.04767198267719]
Based on this, we’d expect SMR_VSA3
for this molecule to be 5.317:
Yep! That was right.
Let’s try another one… this time with a neutral aromatic N. We’ll use pyridine:
Since the SMR contribution of the N in pyridine is also in our bin, we know that the VSA contribution from that atom will be 4.984:
To come back to the original point: what does this tell us about the model from the preprint I mentioned at the beginning of the post?
The fact that SMR_VSA3
has a reasonably large negative correlation with the compound scores in the paper indicates that adding neutral N atoms has a tendency to make molecules more attractive to medicinal chemists. Take a look at the preprint to understand more about the model the authors built; it’s worth reading.
Note that the code above can be used to help decipher other SlogP_VSA
and SMR_VSA
descriptors. I hope it’s useful!
Added: 15 April, 2024
A question came up on the original version of this post as to whether or not the same analysis could be applied to the EState_VSA descriptors. The answer is “yes” and this short section shows how to.
The rdkit.Chem.EState_VSA
module provides two closely related families of descriptors EState_VSA*()
and VSA_Estate*()
. EState_VSA*()
is exactly analogous to SMR_VSA*()
except it sums VSA contributions using EState index bins instead of SMR bins. VSA_EState*()
is the same idea, but it sums EState indices over VSA contribution bins.
Aside: the reference for the EState indices is: Hall, Mohney and Kier. JCICS 31 76-81 (1991) https://pubs.acs.org/doi/pdf/10.1021/ci00001a012
Let’s look at the same simple molecule as above:
Start by getting the EState indices:
array([1.875, 2.75 , 1.875])
And the VSA contributions:
[7.04767198267719, 5.316788604006331, 7.04767198267719]
Here are the EState bins, used for EState_VSA*()
:
Based on these bins and the EState values of 1.875 and 2.75, we’d expect to see nonzero values for EState_VSA7()
(from the two C atoms) and EState_VSA8()
(from the N atom):
And here are the VSA bins:
Based on those and the VSA values of 7.05 and 5.32, we’d expect nonzero values for VSA_EState3()
(from the N atom) and VSA_EState9()
(from the two C atoms).