Timing methods for serializing molecules

reference
optimization
Quickly saving/restoring molecules from text formats
Published

December 9, 2022

This is just a short one, mainly to have the information online to use as a reference.

The RDKit has a number of different ways of serializing and deserializing molecules (converting them to and from strings). This post looks at how long it takes to do that with the three serialization approaches I normally recommend: 1. CXSMILES 2. The RDKit’s internal binary format 3. the RDKit’s variant of commonchem JSON.

I looked at just serializing the molecular structure, including atomic coordinates, and serializing a few propertis together with the molecule.

Note that this isn’t exactly comparing apples to apples: the binary and JSON formats both capture more or less the full perceived state of the molecule (aromaticity, ring systems, etc), while the CXSMILES variant doesn’t include the ring information.

Here are the timing results for 50K molecules from ChEMBL. I ran this on my normal desktop machine, a 3-year old Dell desktop with a 3.1GHz Intel Core9 CPU. I’m running the conda-forge RDKit build.

Times are in seconds

Writing

Method Mol Mol + coordinates Mol + properties
CXSMILES 3.9 6.4 N/A
Pickle 1.31 1.48 1.44
JSON 2.4 3.1 4.1

Reading

Method Mol Mol + coordinates Mol + properties
CXSMILES 1.56 3.78 N/A
Pickle 0.9 1.14 1.16
JSON 1.66 2.08 2.14

The binary format is, of course, the fastest. The JSON format is slower than that, but it’s still faster than using CXSMILES when serializing coordinates and roughly equivalent to CXSMILES when just storing the molecule (but, as noted above, the JSON contains more info than the CXSMILES does).

from rdkit import Chem
import time
import gzip
import rdkit
print(rdkit.__version__)
2022.09.1

Read in a set of 50K ChEMBL molecules that we’ll use for the testing.

mols = []
suppl = Chem.SmilesMolSupplier('../data/new_chembl_document_activity_set.smi')
while len(mols)<50000:
    try:
        m = next(suppl)
    except StopIteration:
        break;
    mols.append(m)
    
len(mols)
50000

This class makes timing our runs easier:

class timer(object):
    def __init__(self, *args, **kwargs):
        pass
    def __enter__(self):
        self.t1=time.time()
    def __exit__(self, type, value, traceback):
        delta = time.time()-self.t1
        print(f'That took {delta:.2f} seconds')
jsons[0]
'{"commonchem":{"version":10},"defaults":{"atom":{"z":6,"impHs":0,"chg":0,"nRad":0,"isotope":0,"stereo":"unspecified"},"bond":{"bo":1,"stereo":"unspecified"}},"molecules":[{"name":"436078","atoms":[{"z":8,"chg":-1},{"z":7,"chg":1},{"z":8},{},{"impHs":1},{"impHs":1},{},{"z":7},{},{"z":8},{"impHs":1,"stereo":"ccw"},{"impHs":1},{"impHs":2},{"impHs":2},{"impHs":1},{"z":8},{"impHs":1,"stereo":"ccw"},{},{"z":8},{},{"impHs":1},{"impHs":1},{"impHs":1},{"impHs":1},{}],"bonds":[{"atoms":[0,1]},{"bo":2,"atoms":[1,2]},{"atoms":[1,3]},{"bo":2,"atoms":[3,4]},{"atoms":[4,5]},{"bo":2,"atoms":[5,6]},{"atoms":[6,7]},{"atoms":[7,8]},{"bo":2,"atoms":[8,9]},{"atoms":[8,10]},{"atoms":[10,11]},{"atoms":[11,12]},{"atoms":[12,13]},{"atoms":[13,14]},{"atoms":[14,15]},{"atoms":[14,16]},{"atoms":[16,17]},{"bo":2,"atoms":[17,18]},{"atoms":[6,19]},{"bo":2,"atoms":[19,20]},{"atoms":[20,21]},{"bo":2,"atoms":[21,22]},{"atoms":[22,23]},{"bo":2,"atoms":[23,24]},{"atoms":[24,3]},{"atoms":[17,7]},{"atoms":[16,10]},{"atoms":[15,11]},{"atoms":[24,19]}],"properties":{"numAtoms":25,"numBonds":29,"numRings":5,"smiles":"O=C1[C@@H]2C3CCC(O3)[C@@H]2C(=O)N1c1ccc([N+](=O)[O-])c2ccccc12"},"extensions":[{"name":"rdkitRepresentation","formatVersion":2,"toolkitVersion":"2022.09.1","aromaticAtoms":[3,4,5,6,19,20,21,22,23,24],"aromaticBonds":[3,4,5,18,19,20,21,22,23,24,28],"cipRanks":[23,19,24,13,7,6,12,18,16,21,8,14,0,1,15,20,9,17,22,10,4,2,3,5,11],"cipCodes":[[10,"S"],[16,"R"]],"atomRings":[[3,24,19,6,5,4],[8,7,17,16,10],[12,11,15,14,13],[15,14,16,10,11],[20,21,22,23,24,19]]}]}]}'

Molecules without conformers

Generating CXSMILES:

with timer() as cm:
    smis = [Chem.MolToCXSmiles(m) for m in mols]
That took 3.88 seconds

Generating the RDKit’s binary format:

with timer() as cm:
    pkls = [m.ToBinary() for m in mols]
That took 1.31 seconds

Generating the RDKit’s variant of commonchem JSON:

with timer() as cm:
    jsons = [Chem.MolToJSON(m) for m in mols]
That took 2.40 seconds

Generating JSON for all of the molecules at once:

with timer() as cm:
    allJson = Chem.MolsToJSON(mols)
That took 2.83 seconds

Now look at reading the molecules.

Start from SMILES, but skip full sanitization since we know that the chemistry and aromaticity are correct:

with timer() as cm:
    for smi in smis:
        m = Chem.MolFromSmiles(smi,sanitize=False)
        m.UpdatePropertyCache()
That took 1.56 seconds

From the binary format:

with timer() as cm:
    for pkl in pkls:
        m = Chem.Mol(pkl)
That took 0.92 seconds

And the two variants for parsing JSON:

with timer() as cm:
    for js in jsons:
        m = Chem.JSONToMols(js)[0]
That took 1.66 seconds
with timer() as cm:
    _ = Chem.JSONToMols(allJson)
That took 1.49 seconds

Molecules with a conformer

Add a conformer to each molecule (which will be written to the output) and re-do the timing runs:

from rdkit.Chem import rdDepictor
for mol in mols:
    rdDepictor.Compute2DCoords(mol)
with timer() as cm:
    smis = [Chem.MolToCXSmiles(m) for m in mols]
That took 6.43 seconds
smis[0]
'O=C1[C@@H]2C3CCC(O3)[C@@H]2C(=O)N1c1ccc([N+](=O)[O-])c2ccccc12 |(-1.03854,-2.25072,;-1.65866,-0.884909,;-3.12813,-0.583792,;-4.49394,-1.20391,;-5.96341,-0.902797,;-6.13112,0.587798,;-4.7653,1.20792,;-3.75347,0.100581,;-3.29584,0.906803,;-1.93002,1.52692,;-1.62891,2.99639,;-0.918193,0.419586,;0.572402,0.587296,;1.17246,1.96204,;2.66305,2.12976,;3.55359,0.922718,;5.04419,1.09043,;5.93473,-0.116609,;5.64424,2.46518,;2.95354,-0.452031,;3.84407,-1.65907,;3.24402,-3.03382,;1.75342,-3.20153,;0.862885,-1.99449,;1.46294,-0.619741,)|'
with timer() as cm:
    pkls = [m.ToBinary() for m in mols]
That took 1.48 seconds
with timer() as cm:
    jsons = [Chem.MolToJSON(m) for m in mols]
That took 3.07 seconds
with timer() as cm:
    for smi in smis:
        m = Chem.MolFromSmiles(smi,sanitize=False)
        m.UpdatePropertyCache()
That took 3.78 seconds
with timer() as cm:
    for pkl in pkls:
        m = Chem.Mol(pkl)
That took 1.14 seconds
with timer() as cm:
    for js in jsons:
        m = Chem.JSONToMols(js)[0]
That took 2.08 seconds

Molecules with properties

Since all three formats can also save properties, see how those impact writing and parsing times:

for mol in mols:
    mol.RemoveAllConformers()
    mol.SetIntProp('numAtoms',mol.GetNumAtoms())
    mol.SetIntProp('numBonds',mol.GetNumBonds())
    mol.SetIntProp('numRings',mol.GetRingInfo().NumRings())
    mol.SetProp('smiles',Chem.MolToSmiles(mol))
    
with timer() as cm:
    pkls = [m.ToBinary(Chem.PropertyPickleOptions.MolProps) for m in mols]
That took 1.44 seconds
with timer() as cm:
    jsons = [Chem.MolToJSON(m) for m in mols]
That took 4.09 seconds
with timer() as cm:
    for pkl in pkls:
        m = Chem.Mol(pkl)
That took 1.16 seconds
with timer() as cm:
    for js in jsons:
        m = Chem.JSONToMols(js)[0]
That took 2.14 seconds