import rdkit
rdkit.__version__
'2024.09.1'
September 29, 2024
The 2024.09.1 version of the RDKit was released on the 27th of September. This is the first in a short series of posts providing brief introductions to new functionality added to the RDKit since the 2024.03.1 release.
The idea of the rdMolProcessing package is to make it easy and fast to carry out common operations on molecules read from files.
In this initial release the only function provided generates fingerprints, but we will add additional capabilities in future releases.
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
fpg = rdFingerprintGenerator.GetMorganGenerator()
%timeit suppl = Chem.SmilesMolSupplier(fname,delimiter='\t');fps = [fpg.GetFingerprint(m) for m in suppl if m is not None]
16.4 s ± 35.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
rdMolProcessing.GetFingerprintsForMolsInFile() does the same work, but operates entirely in C++ and uses multiple threads to read the molecules and generate the fingerprints in parallel.
2.73 s ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Here’s a larger demonstration: generating fingerprints for all 2.3 million molecules in the ChEMBL 31 SDF file. Notice that we don’t need to uncompress the SDF file.
import time
from rdkit import RDLogger
from rdkit.Chem import rdMolProcessing
RDLogger.DisableLog('rdApp.*')
t1 = time.time()
fps = rdMolProcessing.GetFingerprintsForMolsInFile('/scratch/Data/ChEMBL/chembl_31.sdf.gz')
t2 = time.time()
print(f'{t2-t1:.2f}')
81.94
The fingerprints are generated immediately after each molecule is parsed and then the molecule is discarded, so it’s only necessary to be able to store all of the fingerprints in memory, not all of the molecules.
GetFingerprintsForMolsInFile() can figure out the file format automatically (as we saw above), but you can, if necessary, provide options controlling how the file is read.
canonical_smiles molregno activity_id standard_value standard_units
N[C@@H]([C@@H]1CC[C@H](CC1)NS(=O)(=O)c2ccc(F)cc2F)C(=O)N3CC[C@H](F)C3 29272 671631 49000 nM
N[C@@H](C1CCCCC1)C(=O)N2CCSC2 29758 674222 28000 nM
opts = rdMolProcessing.SupplierOptions()
opts.delimiter = ' '
fps = rdMolProcessing.GetFingerprintsForMolsInFile('../data/herg_data.txt',options=opts)
len(fps)
1090
The default is to generate 2048-bit Morgan fingerprints with a radius of 3, but we can change this by providing a fingerprint generator to the call:
The new function GetAllConformerBestRMS() makes it easy to calculate the RMSDs between all of the conformers of a molecule.
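from rdkit import Chem
from rdkit.Chem import rdDistGeom
# generate 100 conformers with a fixed random seed so the results are reproducible
ps = rdDistGeom.srETKDGv3()
ps.randomSeed = 0xa100f
ps.numThreads = 6
m = Chem.AddHs(Chem.MolFromSmiles('N[C@@H]([C@@H]1CC[C@H](CC1)NS(=O)(=O)c2ccc(F)cc2F)C(=O)N3CC[C@H](F)C3'))
rdDistGeom.EmbedMultipleConfs(m,100,ps)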
m.GetNumConformers()
100
You can also run the calculation using multiple threads to speed things up: