Constructing molecules more quickly

tutorial
optimization
Avoiding unnecessary work
Published

May 1, 2025

This is an updated and lightly revised version of a quite old post.

I have previously talked/posted about ways to construct RDKit molecules more quickly. This post revisits that topic.

By default the RDKit does a lot of work when constructing a molecule. The idea here is to set things up so that we only have to do that work once for a set of molecules that we’re going to work with repeatedly.

There’s also a diversion into thinking about what chemistry information is actually needed for things like substructure searching and tuning the molecule construction to only perceive that information.

Aside 1: the timing information shown below was all generated on my laptop, a 2.5 year old Lenovo X1 Carbon (2.1GHz Intel Core i7) running Windows. I am, as usual, using the conda-forge RDKit builds.

Aside 2: there’s another blog post looking in more detail at how long it takes to serialize/deserialize molecules with the RDKit

from rdkit import Chem
from rdkit import RDConfig
import os,gzip
from rdkit import rdBase
rdBase.rdkitVersion
'2025.03.1'

Setup

Start by reading in a set of ChEMBL molecules that we’ve used before and then reducing that to 50K examples.

ind = [x.strip().split() for x in open('../data/chembl16_2010-2012.smi')]
len(ind)
234681
import random
random.seed(0xf00d)
random.shuffle(ind)
ind = ind[:55000]
ms = []
for x in ind:
    m = Chem.MolFromSmiles(x[0])
    if m is None:
        continue
    ms.append(m)
    if len(ms)==50000:
        break
[09:46:30] Explicit valence for atom # 34 Cl, 4, is greater than permitted
[09:46:41] Explicit valence for atom # 29 Cl, 4, is greater than permitted

Let’s get RDKit-generated representations of the molecules: - A molecule’s ToBinary() method returns a serialized form of the molecule that can be saved and then later efficiently converted back into a molecule. I typically call this the “pickled” form of the molecule because it’s what python’s pickle machinery uses (though if you actually pickle molecules they are a big larger and take a bit longer to restore). - RDKit SMILES - The RDKit’s JSON format. This is derived from Matt Swain’s commonchem.

pkls = [x.ToBinary() for x in ms]
smis = [Chem.MolToSmiles(x) for x in ms]
jsons = [Chem.MolToJSON(x) for x in ms]

Timing of standard parsing

How long does it take to generate the molecules from SMILES?

%timeit [Chem.MolFromSmiles(x) for x in smis]
14.6 s ± 316 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

And from the binary format?

%timeit [Chem.Mol(x) for x in pkls]
2.14 s ± 116 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit [Chem.JSONToMols(x)[0] for x in jsons]
4.69 s ± 334 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That dramatic difference between SMILES and pickles isn’t really that surprising given that the binary format was designed to be easy to read in and that no chemistry perception needs to be done. The JSON format also avoids chemistry perception (all computed properties we need are stored in the JSON), but it does require some work to parse.

So the binary format is really efficient, but it’s not human readable and it’s only useable with the RDKit. It would be cool to be able to take advantage of the portability and readability of SMILES but to have the processing not take quite so long.

As an aside: another benefit of using SMILES, for at least some applications, is that they are an order of magnitude smaller than the binary format. The JSON format is even bigger, more than three times as big as the pickles:

print("binary:",sum(len(x) for x in pkls))
print("smiles:",sum(len(x) for x in smis))
print("JSON:",sum(len(x) for x in jsons))
binary: 22808200
smiles: 3029652
JSON: 78812598

Turning off the sanitization and chemistry perception when parsing from the SMILES makes a dramatic difference in processing time:

%timeit [Chem.MolFromSmiles(x,sanitize=False) for x in smis]
3.04 s ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Unfortunately, with no sanitization at all done, these molecules aren’t that useful in many of the RDKit algorithms.

Partial sanitization

One of the standard operations done on reading molecules from SMILES is a call to the stereochemistry assignment code, which also removes redundant or extraneous stereochemistry specifications. This can be computationally expensive and most likely not needed when reading from an RDKit-generated canonical SMILES since that already has had incorrect or redundant specifications removed.

Let’s see how long it takes if we skip that part (which is part of the call to MolFromSmiles()). We’ll make one other change here and also skip the various cleanup operations the RDKit does by default to fix problematic substructures during sanitization:

def partialSanit1(smi):
    m = Chem.MolFromSmiles(smi,sanitize=False)
    Chem.SanitizeMol(m,sanitizeOps=Chem.SANITIZE_ALL^Chem.SANITIZE_CLEANUP^\
                     Chem.SANITIZE_CLEANUPCHIRALITY^Chem.SANITIZE_CLEANUPATROPISOMERS^Chem.SANITIZE_CLEANUP_ORGANOMETALLICS)
    return m
%timeit [partialSanit1(x) for x in smis]
9.91 s ± 259 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That’s a solid reduction over the 14.6 seconds originally required, but is still a long way off the 3.0 seconds for the completely unsanitized version.

Since the RDKit SMILES contains information about aromaticity, we can also skip the kekulization and aromatization steps of the sanitization:

def partialSanit2(smi):
    m = Chem.MolFromSmiles(smi,sanitize=False)
    Chem.SanitizeMol(m,sanitizeOps=Chem.SANITIZE_ALL^Chem.SANITIZE_CLEANUP^\
                     Chem.SANITIZE_CLEANUPCHIRALITY^Chem.SANITIZE_CLEANUPATROPISOMERS^Chem.SANITIZE_CLEANUP_ORGANOMETALLICS^\
                     Chem.SANITIZE_KEKULIZE^Chem.SANITIZE_SETAROMATICITY)
    return m
%timeit [partialSanit2(x) for x in smis]
7.66 s ± 154 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Even better.

We are still calling the ring-finding code, and sometimes we don’t need information about rings (for example, all substructure queries from SMILES and many queries from SMILES), so what if we skip that?

def partialSanit3(smi):
    m = Chem.MolFromSmiles(smi,sanitize=False)
    Chem.SanitizeMol(m,sanitizeOps=Chem.SANITIZE_ALL^Chem.SANITIZE_CLEANUP^\
                     Chem.SANITIZE_CLEANUPCHIRALITY^Chem.SANITIZE_CLEANUPATROPISOMERS^Chem.SANITIZE_CLEANUP_ORGANOMETALLICS^\
                     Chem.SANITIZE_KEKULIZE^Chem.SANITIZE_SETAROMATICITY^\
                     Chem.SANITIZE_SYMMRINGS)
    return m
%timeit [partialSanit3(x) for x in smis]
5.11 s ± 233 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

If we’re just concerned about doing substructure searches or generating RDKit fingerprints, this is still doing some extra work. Let’s go to the bare minimum of sanitization: only updating the explicit and implicit valences of the atoms. Here it’s less typing to just call the one function we need:

def partialSanit4(smi):
    m = Chem.MolFromSmiles(smi,sanitize=False)
    m.UpdatePropertyCache()
    return m
%timeit [partialSanit4(x) for x in smis]
3.34 s ± 133 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That’s pretty fast, and those molecules are actually useful for many, many calculations.

We can add some ring information by calling FastFindRings(). This algorithm provides reliable information about whether or not atoms or bonds are in rings - and so can help with basic ring-membership queries on atoms and bonds in SMARTS or for generating Pattern fingerprints or standard Morgan fingerprints- but doesn’t help with the number of smallest rings that an atom/bond is in.

def partialSanit5(smi):
    m = Chem.MolFromSmiles(smi,sanitize=False)
    m.UpdatePropertyCache()
    Chem.FastFindRings(m)
    return m
%timeit [partialSanit5(x) for x in smis]
4.08 s ± 86.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I think those last two are pretty good. By thinking about the information we need and starting from a reliable SMILES (I often call these “trusted SMILES”) we can get molecules that are useful for many RDKit operations much more quickly than the default.