What’s new in the 2023.09.1 release, part 1

release
documentation
Generalized substructure search
Published

October 17, 2023

This is the first of a few posts covering some of the new features added to the RDKit in the 2023.09.1 release.

The full release notes are available here.

Note: starting in this release cycle we are going to try changing the RDKit release model to include new features in minor releases, so the 2023.09.2 release may include new features. The thinking behind this change is described in this blog post.

from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
import rdkit
print(rdkit.__version__)
2023.09.1

Using generics and generalized substructure search with the PostgreSQL cartridge

The generalized substructure functionality is also available in the cartridge, as is the ability to search with “Beilstein” generic queries.

from rdkit import Chem
from rdkit.Chem import Draw
%load_ext sql

Start with a generic query:

Chem.MolFromSmiles('O=C(-*)(-*) |$;;ARY_p;ARY_p$|')

By default the generics in the query are treated as plain dummy atoms, so the results we get from the query ignore the “ARY” groups:

d = %sql postgresql://localhost/chembl_31 \
  select * from rdk.million_mols where m @>> mol_adjust_query_properties('O=C(-*)(-*) |$;;ARY_p;ARY_p$|'::mol, \
                        '{"adjustDegree": false, "makeDummiesQueries": true}')\
            order by molregno asc limit 10;
ms = [Chem.MolFromSmiles(y) for x,y in d]
Draw.MolsToGridImage(ms,molsPerRow=4)
10 rows affected.

But by passing setGenericQueryFromProperties to mol_adjust_query_properties() we tell the cartridge to use the generic groups:

%sql postgresql://localhost/chembl_31 \
  select  'CC(=O)C'::mol @>> mol_adjust_query_properties('O=C(-*)(-*) |$;;ARY_p;ARY_p$|'::mol, \
                        '{"setGenericQueryFromProperties": true, "adjustDegree": false}');
1 rows affected.
?column?
False
d = %sql postgresql://localhost/chembl_31 \
  select * from rdk.million_mols where m @>> mol_adjust_query_properties('O=C(-*)(-*) |$;;ARY_p;ARY_p$|'::mol, \
                        '{"setGenericQueryFromProperties": true, "adjustDegree": false, "makeDummiesQueries": true}')\
            order by molregno asc limit 10;
ms = [Chem.MolFromSmiles(y) for x,y in d]
Draw.MolsToGridImage(ms,molsPerRow=4)
10 rows affected.
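
The second argument to mol_adjust_query_properties() in the queries above is a JSON string, so its booleans have to be written as lowercase true/false. Here's a minimal sketch of building that argument from Python keyword arguments instead of typing it by hand. The helper name adjust_query_call is hypothetical (it is not part of the RDKit or the cartridge) and the only option names used are the ones that appear in the queries above:

```python
import json

def adjust_query_call(query, **options):
    """Build a mol_adjust_query_properties(...) SQL fragment (hypothetical helper)."""
    # json.dumps writes Python True/False as JSON true/false,
    # which is the form used in the queries above
    opts = json.dumps(options)
    # double any single quotes so both values stay valid SQL string literals
    query_sql = query.replace("'", "''")
    opts_sql = opts.replace("'", "''")
    return f"mol_adjust_query_properties('{query_sql}'::mol, '{opts_sql}')"

fragment = adjust_query_call(
    'O=C(-*)(-*) |$;;ARY_p;ARY_p$|',
    setGenericQueryFromProperties=True,
    adjustDegree=False,
    makeDummiesQueries=True,
)
print(fragment)
```

The printed fragment can be pasted into the where clause of a query like the ones above. For anything beyond interactive exploration it is safer to pass the query and the options as bound parameters rather than formatting them into the SQL text.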

Now let’s look at using generalized substructure search by starting with a query molecule drawn as a tautomer which doesn’t exist in ChEMBL:

mb = '''
     RDKit          2D

  0  0  0  0  0  0  0  0  0  0999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 7 7 1 0 0
M  V30 BEGIN ATOM
M  V30 1 C 1.208608 -2.457143 0.000000 0
M  V30 2 C 2.445787 -1.742857 0.000000 0
M  V30 3 C 2.445787 -0.314286 0.000000 0
M  V30 4 N 1.208608 0.400000 0.000000 0
M  V30 5 C -0.028571 -0.314286 0.000000 0
M  V30 6 C -0.028571 -1.742857 0.000000 0
M  V30 7 O -1.265751 0.400000 0.000000 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 1 1 2
M  V30 2 1 2 3
M  V30 3 1 3 4
M  V30 4 2 4 5
M  V30 5 1 5 6
M  V30 6 1 6 1
M  V30 7 1 5 7
M  V30 END BOND
M  V30 BEGIN SGROUP
M  V30 1 DAT 0 ATOMS=(1 7) FIELDDISP="    0.0000    0.0000    DR    ALL  0 0" -
M  V30 QUERYTYPE=SMARTSQ QUERYOP== FIELDDATA="[#8&X1]"
M  V30 END SGROUP
M  V30 END CTAB
M  END
$$$$
'''
m = Chem.MolFromMolBlock(mb)
m

Chem.MolToSmarts(m)
'[#6]1-[#6]-[#6]-[#7]=[#6](-[#6]-1)-[#8&X1]'

This doesn’t return any results:

d = %sql postgresql://localhost/chembl_31 \
  select * from rdk.million_mols where m @> mol_from_ctab(:mb)\
            order by molregno asc limit 10;
if not len(d):
    raise ValueError('no matches!')
ms = [Chem.MolFromSmiles(y) for x,y in d]
Draw.MolsToGridImage(ms,molsPerRow=4)
0 rows affected.
ValueError: no matches!

But we can enable generalized substructure search by calling mol_to_xqmol() (this creates an extended query molecule, discussed above) and using that for the substructure search:

d = %sql postgresql://localhost/chembl_31 \
  select * from rdk.million_mols where m @> mol_to_xqmol(mol_from_ctab(:mb))\
            order by molregno asc limit 10;
ms = [Chem.MolFromSmiles(y) for x,y in d]
Draw.MolsToGridImage(ms,molsPerRow=4)
10 rows affected.

Here’s a demo that this works with link nodes and tautomers:

qry = Chem.MolFromSmiles('OCc1nc2cccnc2[nH]1 |LN:1:2.3|')
qry

d = %sql postgresql://localhost/chembl_31 \
  select * from rdk.million_mols where m @> mol_to_xqmol('OCc1nc2cccnc2[nH]1 |LN:1:2.3|')\
            limit 10;
ms = [Chem.MolFromSmiles(y) for x,y in d]
Draw.MolsToGridImage(ms,molsPerRow=4)
6 rows affected.

And, as a final demo, link nodes, variable attachment points, and tautomers:

mb = '''qry 
  Mrv2305 09052314502D          

  0  0  0     0  0            999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 13 13 0 0 0
M  V30 BEGIN ATOM
M  V30 1 N -4.75 1.9567 0 0
M  V30 2 C -6.0837 1.1867 0 0
M  V30 3 C -6.0837 -0.3534 0 0
M  V30 4 C -4.75 -1.1234 0 0
M  V30 5 C -3.4163 -0.3534 0 0
M  V30 6 C -3.4163 1.1867 0 0
M  V30 7 N -1.9692 1.7134 0 0
M  V30 8 N -1.8822 -0.7768 0 0
M  V30 9 C -1.0211 0.4999 0 0
M  V30 10 C 0.5179 0.5536 0 0
M  V30 11 N 1.2409 1.9133 0 0
M  V30 12 * -5.6391 -0.0967 0 0
M  V30 13 C -5.6391 -2.4067 0 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 1 1 2
M  V30 2 2 2 3
M  V30 3 1 3 4
M  V30 4 2 4 5
M  V30 5 1 5 6
M  V30 6 2 1 6
M  V30 7 1 8 9
M  V30 8 1 7 6
M  V30 9 1 5 8
M  V30 10 2 7 9
M  V30 11 1 9 10
M  V30 12 1 10 11
M  V30 13 1 12 13 ENDPTS=(3 4 3 2) ATTACH=ANY
M  V30 END BOND
M  V30 LINKNODE 1 2 2 10 9 10 11
M  V30 END CTAB
M  END'''
Chem.MolFromMolBlock(mb)

d = %sql postgresql://localhost/chembl_31 \
  select * from rdk.million_mols where m @> mol_to_xqmol(mol_from_ctab(:mb))\
            limit 10;
ms = [Chem.MolFromSmiles(y) for x,y in d]
Draw.MolsToGridImage(ms,molsPerRow=4)
8 rows affected.