What are the VSA Descriptors?

descriptors
Making sense of something “uninterpretable”
Published

April 17, 2023

Updated 15.04.2024 Added a section at the bottom about the EState_VSA and VSA_EState descriptors.

A couple of weeks ago I read a fun preprint from a collaboration between Microsoft Research and Novartis: “Learning chemical intuition from humans in the loop” and noticed that one of the high-ranking descriptors in the authors’ analysis of the model they built is SMR_VSA3 (look at figure 4 in the preprint). Since I’ve seen that descriptor, or other descriptors from the same family, show up repeatedly when analyzing models, and because I bet that most people don’t really have much (if any) idea of what exactly those descriptors are, I figured it’s worth a short blog post.

The descriptors were introduced in the paper “A widely applicable set of descriptors” published by Paul Labute in the Journal of Molecular Graphics and Modelling back in 2000. Here’s a link Random aside: I’m a bit surprised to see that JMGM still exists… it used to be on my standard reading list back in the day, but I haven’t thought about it in years. :-)

Added 19.04.2023 Esben Jannik Bjerrum pointed out that the original version of that article is still available via the wayback machine. Thanks Esben!

I won’t get deeply into the motivation and derivation, read the paper for that, but Paul wanted to come up with a set of descriptors which were generally useful for QSAR studies. He published a three sets of related descriptors: SlogP_VSAX, SMR_VSAX, and PEOE_VSAX which are all based on the same idea: you calculate the contribution of each atom in the molecule to a molecular property (either LogP, MR, or the partial charge) along with the contribution of each atom to an approximate molecular surface area measure (this is the VSA part), assign the atoms to bins based on the property contributions, and then sum up the VSA contributions for each atom in a bin.

Sounds complicated, but it isn’t. Here’s a simple example, the molecule methylamine CN

atom logp contribution VSA contribution
C -0.2035 7.048
N -1.019 5.734

The boundaries of the LogP bins for the SlogP_VSA descriptor that the RDKit uses are:

[-0.4, -0.2, 0, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6]

So the N would add a contribution of 5.734 to bin 1 and the C would add a contribution of 7.048 to bin 2 (for this descriptor the bins are labelled from 1).

The boundaries for SMR and PEOE are:

[1.29, 1.82, 2.24, 2.45, 2.75, 3.05, 3.63, 3.8, 4.0]

and

[-.3, -.25, -.20, -.15, -.10, -.05, 0, .05, .10, .15, .20, .25, .30]

For what it’s worth, Here’s where you can find those bin definitions in the RDKit source code: - SlogP_VSA: https://github.com/rdkit/rdkit-orig/blob/master/Code/GraphMol/Descriptors/MolSurf.cpp#L256 - SMR_VSA: https://github.com/rdkit/rdkit-orig/blob/master/Code/GraphMol/Descriptors/MolSurf.cpp#L281 - PEOE_VSA: https://github.com/rdkit/rdkit-orig/blob/master/Code/GraphMol/Descriptors/MolSurf.cpp#L306

These descriptors do end up being quite useful, but they at first seem totally non-interpretable. It turns out that this isn’t completely true: with a bit of digging it’s possible to get a bit of a sense of what the descriptors actually mean.

So with that background, let’s go figure out what SMR_VSA3 actually means.

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem import Descriptors
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import Crippen
import rdkit
print(rdkit.__version__)
2024.03.1

Let’s start by looking at the docstring for SMR_VSA3, because that will tell us what the bins are (we could also look them up above, but then we’d have to figure out which bins are where… this is easier:

print(Descriptors.SMR_VSA3.__doc__)
MOE MR VSA Descriptor 3 ( 1.82 <= x <  2.24)

So we know that SMR_VSA3 is the sum of the VSA contributions from atoms which contribute between 1.82 and 2.24 to the SMR value.

The SMR (and SlogP) contributions are calculated using a method published by Wildman and Crippen back in 1999: https://doi.org/10.1021/ci990307l) using SMARTS definitions.

Here’s the data that the RDKit uses for that:

# from: https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/Descriptors/Crippen.cpp#L194
# each line includes the atom label, the SMARTS, the logP contribution, the MR contribution, and an optional note
rdkit_data='''C1    [CH4]   0.1441  2.503   
C1  [CH3]C  0.1441  2.503   
C1  [CH2](C)C   0.1441  2.503   
C2  [CH](C)(C)C 0   2.433   
C2  [C](C)(C)(C)C   0   2.433   
C3  [CH3][N,O,P,S,F,Cl,Br,I]    -0.2035 2.753   
C3  [CH2X4]([N,O,P,S,F,Cl,Br,I])[A;!#1] -0.2035 2.753   
C4  [CH1X4]([N,O,P,S,F,Cl,Br,I])([A;!#1])[A;!#1]    -0.2051 2.731   
C4  [CH0X4]([N,O,P,S,F,Cl,Br,I])([A;!#1])([A;!#1])[A;!#1]   -0.2051 2.731   
C5  [C]=[!C;A;!#1]  -0.2783 5.007   
C6  [CH2]=C 0.1551  3.513   
C6  [CH1](=C)[A;!#1]    0.1551  3.513   
C6  [CH0](=C)([A;!#1])[A;!#1]   0.1551  3.513   
C6  [C](=C)=C   0.1551  3.513   
C7  [CX2]#[A;!#1]   0.0017  3.888   
C8  [CH3]c  0.08452 2.464   
C9  [CH3]a  -0.1444 2.412   
C10 [CH2X4]a    -0.0516 2.488   
C11 [CHX4]a 0.1193  2.582   
C12 [CH0X4]a    -0.0967 2.576   
C13 [cH0]-[A;!C;!N;!O;!S;!F;!Cl;!Br;!I;!#1] -0.5443 4.041   
C14 [c][#9] 0   3.257   
C15 [c][#17]    0.245   3.564   
C16 [c][#35]    0.198   3.18    
C17 [c][#53]    0   3.104   
C18 [cH]    0.1581  3.35    
C19 [c](:a)(:a):a   0.2955  4.346   
C20 [c](:a)(:a)-a   0.2713  3.904   
C21 [c](:a)(:a)-C   0.136   3.509   
C22 [c](:a)(:a)-N   0.4619  4.067   
C23 [c](:a)(:a)-O   0.5437  3.853   
C24 [c](:a)(:a)-S   0.1893  2.673   
C25 [c](:a)(:a)=[C,N,O] -0.8186 3.135   
C26 [C](=C)(a)[A;!#1]   0.264   4.305   
C26 [C](=C)(c)a 0.264   4.305   
C26 [CH1](=C)a  0.264   4.305   
C26 [C]=c   0.264   4.305   
C27 [CX4][A;!C;!N;!O;!P;!S;!F;!Cl;!Br;!I;!#1]   0.2148  2.693   
CS  [#6]    0.08129 3.243   
H1  [#1][#6,#1] 0.123   1.057   
H2  [#1]O[CX4,c]    -0.2677 1.395   
H2  [#1]O[!#6;!#7;!#8;!#16] -0.2677 1.395   
H2  [#1][!#6;!#7;!#8]   -0.2677 1.395   
H3  [#1][#7]    0.2142  0.9627  
H3  [#1]O[#7]   0.2142  0.9627  
H4  [#1]OC=[#6,#7,O,S]  0.298   1.805   
H4  [#1]O[O,S]  0.298   1.805   
HS  [#1]    0.1125  1.112   
N1  [NH2+0][A;!#1]  -1.019  2.262   
N2  [NH+0]([A;!#1])[A;!#1]  -0.7096 2.173   
N3  [NH2+0]a    -1.027  2.827   
N4  [NH1+0]([!#1;A,a])a -0.5188 3   
N5  [NH+0]=[!#1;A,a]    0.08387 1.757   
N6  [N+0](=[!#1;A,a])[!#1;A,a]  0.1836  2.428   
N7  [N+0]([A;!#1])([A;!#1])[A;!#1]  -0.3187 1.839   
N8  [N+0](a)([!#1;A,a])[A;!#1]  -0.4458 2.819   
N8  [N+0](a)(a)a    -0.4458 2.819   
N9  [N+0]#[A;!#1]   0.01508 1.725   
N10 [NH3,NH2,NH;+,+2,+3]    -1.95       
N11 [n+0]   -0.3239 2.202   
N12 [n;+,+2,+3] -1.119      
N13 [NH0;+,+2,+3]([A;!#1])([A;!#1])([A;!#1])[A;!#1] -0.3396 0.2604  
N13 [NH0;+,+2,+3](=[A;!#1])([A;!#1])[!#1;A,a]   -0.3396 0.2604  
N13 [NH0;+,+2,+3](=[#6])=[#7]   -0.3396 0.2604  
N14 [N;+,+2,+3]#[A;!#1] 0.2887  3.359   
N14 [N;-,-2,-3] 0.2887  3.359   
N14 [N;+,+2,+3](=[N;-,-2,-3])=N 0.2887  3.359   
NS  [#7]    -0.4806 2.134   
O1  [o] 0.1552  1.08    
O2  [OH,OH2]    -0.2893 0.8238  
O3  [O]([A;!#1])[A;!#1] -0.0684 1.085   
O4  [O](a)[!#1;A,a] -0.4195 1.182   
O5  [O]=[#7,#8] 0.0335  3.367   
O5  [OX1;-,-2,-3][#7]   0.0335  3.367   
O6  [OX1;-,-2,-2][#16]  -0.3339 0.7774  
O6  [O;-0]=[#16;-0] -0.3339 0.7774  
O12 [O-]C(=O)   -1.326      \"order flip here intentional\"
O7  [OX1;-,-2,-3][!#1;!N;!S]    -1.189  0   
O8  [O]=c   0.1788  3.135   
O9  [O]=[CH]C   -0.1526 0   
O9  [O]=C(C)([A;!#1])   -0.1526 0   
O9  [O]=[CH][N,O]   -0.1526 0   
O9  [O]=[CH2]   -0.1526 0   
O9  [O]=[CX2]=O -0.1526 0   
O10 [O]=[CH]c   0.1129  0.2215  
O10 [O]=C([C,c])[a;!#1] 0.1129  0.2215  
O10 [O]=C(c)[A;!#1] 0.1129  0.2215  
O11 [O]=C([!#1;!#6])[!#1;!#6]   0.4833  0.389   
OS  [#8]    -0.1188 0.6865  
F   [#9-0]  0.4202  1.108   
Cl  [#17-0] 0.6895  5.853   
Br  [#35-0] 0.8456  8.927   
I   [#53-0] 0.8857  14.02   
Hal [#9,#17,#35,#53;-]  -2.996      
Hal [#53;+,+2,+3]   -2.996      
Hal [+;#3,#11,#19,#37,#55]  -2.996      \"Footnote h indicates these should be here?\"
P   [#15]   0.8612  6.92    
S2  [S;-,-2,-3,-4,+1,+2,+3,+5,+6]   -0.0024 7.365   \"Order flip here is intentional\"
S2  [S-0]=[N,O,P,S] -0.0024 7.365   \"Expanded definition of (pseudo-)ionic S\"
S1  [S;A]   0.6482  7.591   \"Order flip here is intentional\"
S3  [s;a]   0.6237  6.691   
Me1 [#3,#11,#19,#37,#55]    -0.3808 5.754   
Me1 [#4,#12,#20,#38,#56]    -0.3808 5.754   
Me1 [#5,#13,#31,#49,#81]    -0.3808 5.754   
Me1 [#14,#32,#50,#82]   -0.3808 5.754   
Me1 [#33,#51,#83]   -0.3808 5.754   
Me1 [#34,#52,#84]   -0.3808 5.754   
Me2 [#21,#22,#23,#24,#25,#26,#27,#28,#29,#30]   -0.0025     
Me2 [#39,#40,#41,#42,#43,#44,#45,#46,#47,#48]   -0.0025     
Me2 [#72,#73,#74,#75,#76,#77,#78,#79,#80]   -0.0025     '''

Let’s parse that data so that we can do something with it:

from collections import namedtuple

CrippenTuple = namedtuple('CrippenTuple',('name','smarts','logp_contrib','mr_contrib','note'))
lines = [x.split('\t') for x in rdkit_data.split('\n')]

crippenData = []
for i,entry in enumerate(lines):
    entry[2] = float(entry[2])
    if entry[3]:
        entry[3] = float(entry[3])
    else:
        entry[3] = None
    crippenData.append(CrippenTuple(*entry))
print(crippenData[:3])
[CrippenTuple(name='C1', smarts='[CH4]', logp_contrib=0.1441, mr_contrib=2.503, note=''), CrippenTuple(name='C1', smarts='[CH3]C', logp_contrib=0.1441, mr_contrib=2.503, note=''), CrippenTuple(name='C1', smarts='[CH2](C)C', logp_contrib=0.1441, mr_contrib=2.503, note='')]

Now define two functions to look things up in that data:

import re
def find_contribs_for_bin(lower,upper,crippenData=crippenData,which='mr_contrib'):
    ' returns a list of Crippen contributions which are between lower and upper '
    res = []
    for tpl in crippenData:
        v = getattr(tpl,which)
        if v is not None and v>=lower and v<=upper:
            res.append(tpl)
    return res
def find_tuples_for_atom(symbol,crippenData=crippenData):
    ' returns a list of crippen contributions for a particular atomic symbol, i.e. "C" or "N"'
    res = []
    anum = Chem.GetPeriodicTable().GetAtomicNumber(symbol)
    for tpl in crippenData:
        if tpl.name.startswith(symbol) or re.match(f'\[[^\]]*#{anum}[^0-9]',tpl.smarts):
            res.append(tpl)
    return res

Now we can use that function to lookup the contributions which fall within the SMR_VSA3 SMR bins:

find_contribs_for_bin(1.82,2.24)
[CrippenTuple(name='N2', smarts='[NH+0]([A;!#1])[A;!#1]', logp_contrib=-0.7096, mr_contrib=2.173, note=''),
 CrippenTuple(name='N7', smarts='[N+0]([A;!#1])([A;!#1])[A;!#1]', logp_contrib=-0.3187, mr_contrib=1.839, note=''),
 CrippenTuple(name='N11', smarts='[n+0]', logp_contrib=-0.3239, mr_contrib=2.202, note=''),
 CrippenTuple(name='NS', smarts='[#7]', logp_contrib=-0.4806, mr_contrib=2.134, note='')]

So the only atom types which can contribute to this bin are Ns.

The first two atom type definitions are reasonably specific: - N2: neutral aliphatic N with one H and two non-hydrogen aliphatic neighbors - N7: neutral aliphatic N with three non-hydrogen aliphatic neighbors

The last two are less specific: - N11: neutral aromatic N - NS: any nitrogen

These would only be matched to atoms which haven’t matched any of the more specific defintions for N.

Here’s the full set of N definitions:

find_tuples_for_atom('N')
[CrippenTuple(name='N1', smarts='[NH2+0][A;!#1]', logp_contrib=-1.019, mr_contrib=2.262, note=''),
 CrippenTuple(name='N2', smarts='[NH+0]([A;!#1])[A;!#1]', logp_contrib=-0.7096, mr_contrib=2.173, note=''),
 CrippenTuple(name='N3', smarts='[NH2+0]a', logp_contrib=-1.027, mr_contrib=2.827, note=''),
 CrippenTuple(name='N4', smarts='[NH1+0]([!#1;A,a])a', logp_contrib=-0.5188, mr_contrib=3.0, note=''),
 CrippenTuple(name='N5', smarts='[NH+0]=[!#1;A,a]', logp_contrib=0.08387, mr_contrib=1.757, note=''),
 CrippenTuple(name='N6', smarts='[N+0](=[!#1;A,a])[!#1;A,a]', logp_contrib=0.1836, mr_contrib=2.428, note=''),
 CrippenTuple(name='N7', smarts='[N+0]([A;!#1])([A;!#1])[A;!#1]', logp_contrib=-0.3187, mr_contrib=1.839, note=''),
 CrippenTuple(name='N8', smarts='[N+0](a)([!#1;A,a])[A;!#1]', logp_contrib=-0.4458, mr_contrib=2.819, note=''),
 CrippenTuple(name='N8', smarts='[N+0](a)(a)a', logp_contrib=-0.4458, mr_contrib=2.819, note=''),
 CrippenTuple(name='N9', smarts='[N+0]#[A;!#1]', logp_contrib=0.01508, mr_contrib=1.725, note=''),
 CrippenTuple(name='N10', smarts='[NH3,NH2,NH;+,+2,+3]', logp_contrib=-1.95, mr_contrib=None, note=''),
 CrippenTuple(name='N11', smarts='[n+0]', logp_contrib=-0.3239, mr_contrib=2.202, note=''),
 CrippenTuple(name='N12', smarts='[n;+,+2,+3]', logp_contrib=-1.119, mr_contrib=None, note=''),
 CrippenTuple(name='N13', smarts='[NH0;+,+2,+3]([A;!#1])([A;!#1])([A;!#1])[A;!#1]', logp_contrib=-0.3396, mr_contrib=0.2604, note=''),
 CrippenTuple(name='N13', smarts='[NH0;+,+2,+3](=[A;!#1])([A;!#1])[!#1;A,a]', logp_contrib=-0.3396, mr_contrib=0.2604, note=''),
 CrippenTuple(name='N13', smarts='[NH0;+,+2,+3](=[#6])=[#7]', logp_contrib=-0.3396, mr_contrib=0.2604, note=''),
 CrippenTuple(name='N14', smarts='[N;+,+2,+3]#[A;!#1]', logp_contrib=0.2887, mr_contrib=3.359, note=''),
 CrippenTuple(name='N14', smarts='[N;-,-2,-3]', logp_contrib=0.2887, mr_contrib=3.359, note=''),
 CrippenTuple(name='N14', smarts='[N;+,+2,+3](=[N;-,-2,-3])=N', logp_contrib=0.2887, mr_contrib=3.359, note=''),
 CrippenTuple(name='NS', smarts='[#7]', logp_contrib=-0.4806, mr_contrib=2.134, note='')]

Let’s check our understanding…

Start by constructing a molecule with a neutral aliphatic N which has one H and two non-H aliphatic neighbors:

m = Chem.MolFromSmiles('CNC')

Calculate the Crippen contribs for that. We expect the second atom to be of type N2, with an MR contribution (the second element of each of the 2-tuples in the list below) of 2.173:

rdMolDescriptors._CalcCrippenContribs(m)
[(-0.2035, 2.753), (-0.7096, 2.173), (-0.2035, 2.753)]

Perfect!

What’s the VSA contribution of the atoms? For historical reasons we get these with the function rdMolDescriptors._CalcLabuteASAContribs():

list(rdMolDescriptors._CalcLabuteASAContribs(m)[0])
[7.04767198267719, 5.316788604006331, 7.04767198267719]

Based on this, we’d expect SMR_VSA3 for this molecule to be 5.317:

Descriptors.SMR_VSA3(m)
5.316788604006331

Yep! That was right.

Let’s try another one… this time with a neutral aromatic N. We’ll use pyridine:

pyridine = Chem.MolFromSmiles('n1ccccc1')
Descriptors.SMR_VSA3(pyridine)
4.9839785209472085
rdMolDescriptors._CalcCrippenContribs(pyridine)[0]
(-0.3239, 2.202)

Since the SMR contribution of the N in pyridine is also in our bin, we know that the VSA contribution from that atom will be 4.984:

list(rdMolDescriptors._CalcLabuteASAContribs(pyridine)[0])[0]
4.9839785209472085

To come back to the original point: what does this tell us about the model from the preprint I mentioned at the beginning of the post?

The fact that SMR_VSA3 has a reasonably large negative correlation with the compound scores in the paper indicates that adding neutral N atoms has a tendency to make molecules more attractive to medicinal chemists. Take a look at the preprint to understand more about the model the authors built; it’s worth reading.

Note that the code above can be used to help decipher other SlogP_VSA and SMR_VSA descriptors. I hope it’s useful!

Addendum: EState_VSA descriptors

Added: 15 April, 2024

A question came up on the original version of this post as to whether or not the same analysis could be applied to the EState_VSA descriptors. The answer is “yes” and this short section shows how to.

The rdkit.Chem.EState_VSA module provides two closely related families of descriptors EState_VSA*() and VSA_Estate*(). EState_VSA*() is exactly analogous to SMR_VSA*() except it sums VSA contributions using EState index bins instead of SMR bins. VSA_EState*() is the same idea, but it sums EState indices over VSA contribution bins.

Aside: the reference for the EState indices is: Hall, Mohney and Kier. JCICS 31 76-81 (1991) https://pubs.acs.org/doi/pdf/10.1021/ci00001a012

Let’s look at the same simple molecule as above:

from rdkit.Chem import EState
from rdkit.Chem.EState import EState_VSA

Start by getting the EState indices:

m = Chem.MolFromSmiles('CNC')
contribs = EState.EStateIndices(m)
contribs
array([1.875, 2.75 , 1.875])

And the VSA contributions:

list(rdMolDescriptors._CalcLabuteASAContribs(m)[0])
[7.04767198267719, 5.316788604006331, 7.04767198267719]

Here are the EState bins, used for EState_VSA*():

EState_VSA.estateBins
[-0.39, 0.29, 0.717, 1.165, 1.54, 1.807, 2.05, 4.69, 9.17, 15.0]

Based on these bins and the EState values of 1.875 and 2.75, we’d expect to see nonzero values for EState_VSA7() (from the two C atoms) and EState_VSA8() (from the N atom):

EState_VSA.EState_VSA7(m),EState_VSA.EState_VSA8(m)
(14.09534396535438, 5.316788604006331)

And here are the VSA bins:

EState_VSA.vsaBins
[4.78, 5.0, 5.41, 5.74, 6.0, 6.07, 6.45, 7.0, 11.0]

Based on those and the VSA values of 7.05 and 5.32, we’d expect nonzero values for VSA_EState3() (from the N atom) and VSA_EState9() (from the two C atoms).

EState_VSA.VSA_EState3(m),EState_VSA.VSA_EState9(m)
(2.75, 3.75)