Intro to the molecule enumerator

tutorial
substructure
Using some advanced query features
Published

May 13, 2021

The V3000 mol file format allows a number of interesting and useful advanced query features. Here I’ll look at two of them: position variation bonds (a.k.a. variable attachment points) and link nodes.

This blog post uses features from the 2021.03.1 RDKit release; some of this will not work with older releases.

from rdkit import Chem
from rdkit.Chem.Draw import rdDepictor
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import rdMolEnumerator
import rdkit
print(rdkit.__version__)
2021.03.1

Position variation bonds

Here’s a molecule with a position variation bond:

pv1 = Chem.MolFromMolBlock('''
  Mrv2007 06232015292D          

  0  0  0     0  0            999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 9 8 0 0 0
M  V30 BEGIN ATOM
M  V30 1 C -1.7083 2.415 0 0
M  V30 2 C -3.042 1.645 0 0
M  V30 3 C -3.042 0.105 0 0
M  V30 4 N -1.7083 -0.665 0 0
M  V30 5 C -0.3747 0.105 0 0
M  V30 6 C -0.3747 1.645 0 0
M  V30 7 * -0.8192 1.3883 0 0
M  V30 8 O -0.8192 3.6983 0 0
M  V30 9 C 0.5145 4.4683 0 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 1 1 2
M  V30 2 2 2 3
M  V30 3 1 3 4
M  V30 4 2 4 5
M  V30 5 1 5 6
M  V30 6 2 1 6
M  V30 7 1 7 8 ENDPTS=(3 1 5 6) ATTACH=ANY
M  V30 8 1 8 9
M  V30 END BOND
M  V30 END CTAB
M  END''')
pv1

The query is describing a molecule consisting of a pyriding ring with an methoxy substituted either ortho, meta, or para to the N atom.

The RDKit includes functionality in the rdkit.Chem.rdMolEnumerator module which allows you enumerate all of the molecules which are described by this query.

The function rdMolEnumerator.Enumerate() is straightforward to use: given a molecule with supported query features it returns a MolBundle object which includes each possible expansion of the query:

pv1_bundle = rdMolEnumerator.Enumerate(pv1)
pv1_bundle
<rdkit.Chem.rdchem.MolBundle at 0x7fc138399b20>

We can render the molecules in the bundle using Draw.MolsToGridImage():

Draw.MolsToGridImage(pv1_bundle)

These are pretty ugly since the enumeration hasn’t generated new coordinates for the atom which correspond to the new connectivity.

I’ll use this convenience function to find the common core shared by the molecules in a bundle and generate 2D coordinates for all the molecules with the core oriented consistently:

from rdkit.Chem import rdFMCS
def align_bundle_coords(bndl):
    ps = rdFMCS.MCSParameters()
    for m in bndl:
        Chem.SanitizeMol(m)
    mcs = rdFMCS.FindMCS(bndl,completeRingsOnly=True)
    q = Chem.MolFromSmarts(mcs.smartsString)
    rdDepictor.Compute2DCoords(q)
    for m in bndl:
        rdDepictor.GenerateDepictionMatching2DStructure(m,q)

Now let’s apply that to our bundle:

pv1_bundle = rdMolEnumerator.Enumerate(pv1)
align_bundle_coords(pv1_bundle)
Draw.MolsToGridImage(pv1_bundle)

Of course a molecule can have more than one position variation bond:

pv2 = Chem.MolFromMolBlock('''
  Mrv2007 06242006032D          

  0  0  0     0  0            999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 10 8 0 0 0
M  V30 BEGIN ATOM
M  V30 1 C -1.7083 2.415 0 0
M  V30 2 C -3.042 1.645 0 0
M  V30 3 C -3.042 0.105 0 0
M  V30 4 N -1.7083 -0.665 0 0
M  V30 5 C -0.3747 0.105 0 0
M  V30 6 C -0.3747 1.645 0 0
M  V30 7 * -3.042 0.875 0 0
M  V30 8 F -5.0434 0.875 0 0
M  V30 9 * -1.0415 2.03 0 0
M  V30 10 Cl -1.0415 4.34 0 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 1 1 2
M  V30 2 2 2 3
M  V30 3 1 3 4
M  V30 4 2 4 5
M  V30 5 1 5 6
M  V30 6 2 1 6
M  V30 7 1 7 8 ENDPTS=(2 2 3) ATTACH=ANY
M  V30 8 1 9 10 ENDPTS=(2 1 6) ATTACH=ANY
M  V30 END BOND
M  V30 END CTAB
M  END
''')
pv2

This is also supported by the enumerator:

pv2_bundle = rdMolEnumerator.Enumerate(pv2)
align_bundle_coords(pv2_bundle)
Draw.MolsToGridImage(pv2_bundle)

Combining them

We can also combine link nodes and position variation bonds in the same molecule:

combined = Chem.MolFromMolBlock('''
  Mrv2108 05132110052D          

  0  0  0     0  0            999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 19 20 0 0 0
M  V30 BEGIN ATOM
M  V30 1 N -2.2078 4.3165 0 0
M  V30 2 C -2.9544 2.9695 0 0
M  V30 3 C -2.1612 1.6495 0 0
M  V30 4 C -0.6214 1.6763 0 0
M  V30 5 C 0.1252 3.0233 0 0
M  V30 6 C -0.668 4.3433 0 0
M  V30 7 C 1.6649 3.0501 0 0
M  V30 8 C -4.4941 2.9427 0 0
M  V30 9 C 2.4581 1.7301 0 0
M  V30 10 C 2.985 3.8433 0 0
M  V30 11 C 3.7781 2.5233 0 0
M  V30 12 C -6.3747 4.5774 0 0
M  V30 13 C -6.9764 3.1598 0 0
M  V30 14 C -5.8142 2.1495 0 0
M  V30 15 C -4.8405 4.4431 0 0
M  V30 16 F -7.1678 5.8974 0 0
M  V30 17 O 3.3575 5.3376 0 0
M  V30 18 * -1.1502 2.5564 0 0
M  V30 19 C -1.1502 0.2464 0 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 1 1 2
M  V30 2 2 2 3
M  V30 3 1 3 4
M  V30 4 2 4 5
M  V30 5 1 5 6
M  V30 6 2 1 6
M  V30 7 1 5 7
M  V30 8 1 2 8
M  V30 9 1 9 11
M  V30 10 1 10 11
M  V30 11 1 7 9
M  V30 12 1 7 10
M  V30 13 1 12 13
M  V30 14 1 13 14
M  V30 15 1 12 15
M  V30 16 1 14 8
M  V30 17 1 8 15
M  V30 18 1 12 16
M  V30 19 1 10 17
M  V30 20 1 18 19 ENDPTS=(3 6 3 4) ATTACH=ANY
M  V30 END BOND
M  V30 LINKNODE 1 2 2 10 7 10 11
M  V30 LINKNODE 1 2 2 12 13 12 15
M  V30 END CTAB
M  END
''')
combined

Enumerating that produces 12 molecules:

combined_bundle = rdMolEnumerator.Enumerate(combined)
align_bundle_coords(combined_bundle)
Draw.MolsToGridImage(combined_bundle,subImgSize=(300,250))

Final bit: input from CXSMILES

It’s also possible to read both variable attachment points and link nodes from CXSMILES:

m = Chem.MolFromSmiles('CO*.C1=CC=NC=C1 |c:2,4,6,m:2:3.5.4|')
m

As that example shows, the coordinate generation code is currently not great at setting the atom positions for these. That’s a ToDo for a future release.

m = Chem.MolFromSmiles('OC1CCCC1 |LN:1:1.4.2.5|')
m

The RDKit currently does not write either link nodes or variable attachment points to CXSMILES, that’s another ToDo for a future release.