from rdkit import Chem
import time
import gzip
import rdkit
print(rdkit.__version__)
2022.09.1
December 9, 2022
This is just a short one, mainly to have the information online to use as a reference.
The RDKit has a number of different ways of serializing and deserializing molecules (converting them to and from strings). This post looks at how long it takes to do that with the three serialization approaches I normally recommend: 1. CXSMILES 2. The RDKit’s internal binary format 3. the RDKit’s variant of commonchem JSON.
I looked at just serializing the molecular structure, including atomic coordinates, and serializing a few propertis together with the molecule.
Note that this isn’t exactly comparing apples to apples: the binary and JSON formats both capture more or less the full perceived state of the molecule (aromaticity, ring systems, etc), while the CXSMILES variant doesn’t include the ring information.
Here are the timing results for 50K molecules from ChEMBL. I ran this on my normal desktop machine, a 3-year old Dell desktop with a 3.1GHz Intel Core9 CPU. I’m running the conda-forge RDKit build.
Times are in seconds
Writing
Method | Mol | Mol + coordinates | Mol + properties |
---|---|---|---|
CXSMILES | 3.9 | 6.4 | N/A |
Pickle | 1.31 | 1.48 | 1.44 |
JSON | 2.4 | 3.1 | 4.1 |
Reading
Method | Mol | Mol + coordinates | Mol + properties |
---|---|---|---|
CXSMILES | 1.56 | 3.78 | N/A |
Pickle | 0.9 | 1.14 | 1.16 |
JSON | 1.66 | 2.08 | 2.14 |
The binary format is, of course, the fastest. The JSON format is slower than that, but it’s still faster than using CXSMILES when serializing coordinates and roughly equivalent to CXSMILES when just storing the molecule (but, as noted above, the JSON contains more info than the CXSMILES does).
Read in a set of 50K ChEMBL molecules that we’ll use for the testing.
This class makes timing our runs easier:
'{"commonchem":{"version":10},"defaults":{"atom":{"z":6,"impHs":0,"chg":0,"nRad":0,"isotope":0,"stereo":"unspecified"},"bond":{"bo":1,"stereo":"unspecified"}},"molecules":[{"name":"436078","atoms":[{"z":8,"chg":-1},{"z":7,"chg":1},{"z":8},{},{"impHs":1},{"impHs":1},{},{"z":7},{},{"z":8},{"impHs":1,"stereo":"ccw"},{"impHs":1},{"impHs":2},{"impHs":2},{"impHs":1},{"z":8},{"impHs":1,"stereo":"ccw"},{},{"z":8},{},{"impHs":1},{"impHs":1},{"impHs":1},{"impHs":1},{}],"bonds":[{"atoms":[0,1]},{"bo":2,"atoms":[1,2]},{"atoms":[1,3]},{"bo":2,"atoms":[3,4]},{"atoms":[4,5]},{"bo":2,"atoms":[5,6]},{"atoms":[6,7]},{"atoms":[7,8]},{"bo":2,"atoms":[8,9]},{"atoms":[8,10]},{"atoms":[10,11]},{"atoms":[11,12]},{"atoms":[12,13]},{"atoms":[13,14]},{"atoms":[14,15]},{"atoms":[14,16]},{"atoms":[16,17]},{"bo":2,"atoms":[17,18]},{"atoms":[6,19]},{"bo":2,"atoms":[19,20]},{"atoms":[20,21]},{"bo":2,"atoms":[21,22]},{"atoms":[22,23]},{"bo":2,"atoms":[23,24]},{"atoms":[24,3]},{"atoms":[17,7]},{"atoms":[16,10]},{"atoms":[15,11]},{"atoms":[24,19]}],"properties":{"numAtoms":25,"numBonds":29,"numRings":5,"smiles":"O=C1[C@@H]2C3CCC(O3)[C@@H]2C(=O)N1c1ccc([N+](=O)[O-])c2ccccc12"},"extensions":[{"name":"rdkitRepresentation","formatVersion":2,"toolkitVersion":"2022.09.1","aromaticAtoms":[3,4,5,6,19,20,21,22,23,24],"aromaticBonds":[3,4,5,18,19,20,21,22,23,24,28],"cipRanks":[23,19,24,13,7,6,12,18,16,21,8,14,0,1,15,20,9,17,22,10,4,2,3,5,11],"cipCodes":[[10,"S"],[16,"R"]],"atomRings":[[3,24,19,6,5,4],[8,7,17,16,10],[12,11,15,14,13],[15,14,16,10,11],[20,21,22,23,24,19]]}]}]}'
Generating CXSMILES:
Generating the RDKit’s binary format:
Generating the RDKit’s variant of commonchem JSON:
Generating JSON for all of the molecules at once:
Now look at reading the molecules.
Start from SMILES, but skip full sanitization since we know that the chemistry and aromaticity are correct:
with timer() as cm:
for smi in smis:
m = Chem.MolFromSmiles(smi,sanitize=False)
m.UpdatePropertyCache()
That took 1.56 seconds
From the binary format:
And the two variants for parsing JSON:
Add a conformer to each molecule (which will be written to the output) and re-do the timing runs:
'O=C1[C@@H]2C3CCC(O3)[C@@H]2C(=O)N1c1ccc([N+](=O)[O-])c2ccccc12 |(-1.03854,-2.25072,;-1.65866,-0.884909,;-3.12813,-0.583792,;-4.49394,-1.20391,;-5.96341,-0.902797,;-6.13112,0.587798,;-4.7653,1.20792,;-3.75347,0.100581,;-3.29584,0.906803,;-1.93002,1.52692,;-1.62891,2.99639,;-0.918193,0.419586,;0.572402,0.587296,;1.17246,1.96204,;2.66305,2.12976,;3.55359,0.922718,;5.04419,1.09043,;5.93473,-0.116609,;5.64424,2.46518,;2.95354,-0.452031,;3.84407,-1.65907,;3.24402,-3.03382,;1.75342,-3.20153,;0.862885,-1.99449,;1.46294,-0.619741,)|'
Since all three formats can also save properties, see how those impact writing and parsing times:
That took 1.44 seconds