What’s new in 2024.09.1, part I

documentation
release
First of a series of short posts describing new features since the 2024.03.1 release
Published: September 29, 2024

The 2024.09.1 version of the RDKit was released on the 27th of September. This is the first in a short series of posts providing brief introductions to new functionality added to the RDKit since the 2024.03.1 release.

import rdkit
rdkit.__version__
'2024.09.1'
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole

rdMolProcessing

The rdMolProcessing module is intended to make common operations on molecules read from files both easy and fast.

In this initial release the only function provided generates fingerprints, but we will add more capabilities in future releases.

from rdkit.Chem import rdMolProcessing
from rdkit.Chem import rdFingerprintGenerator
fname = '../data/BLSets_actives.txt'
fpg = rdFingerprintGenerator.GetMorganGenerator()
%timeit suppl = Chem.SmilesMolSupplier(fname,delimiter='\t');fps = [fpg.GetFingerprint(m) for m in suppl if m is not None]
16.4 s ± 35.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

rdMolProcessing.GetFingerprintsForMolsInFile() does the same work, but operates entirely in C++ and uses multiple threads to read the molecules and generate the fingerprints in parallel. In the timing below this is about six times faster than the Python loop above.

%timeit fps = rdMolProcessing.GetFingerprintsForMolsInFile(fname)
2.73 s ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Here’s a larger demonstration: generating fingerprints for all 2.3 million molecules in the ChEMBL 31 SDF file. Notice that we don’t need to uncompress the SDF file.

from rdkit import RDLogger
RDLogger.DisableLog('rdApp.*')
import time
t1 = time.time()
fps = rdMolProcessing.GetFingerprintsForMolsInFile('/scratch/Data/ChEMBL/chembl_31.sdf.gz')
t2 = time.time()
print(f'{t2-t1:.2f}')
81.94

Each fingerprint is generated immediately after its molecule is parsed, and the molecule is then discarded, so only the fingerprints, not the molecules, need to fit in memory.
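
To put rough numbers on that for the ChEMBL run above, here is a back-of-the-envelope sketch. It assumes the default 2048-bit fingerprints, takes the approximate molecule count and the timing quoted above at face value, and counts only the raw bits, so the real memory footprint will be somewhat higher once per-object overhead is included.

```python
# Rough lower bound on the memory needed to hold the ChEMBL fingerprints.
# Assumptions: ~2.3 million molecules, 2048 bits per fingerprint,
# no per-object overhead.
n_mols = 2_300_000
bits_per_fp = 2048
elapsed_seconds = 81.94  # wall-clock time reported above

bytes_per_fp = bits_per_fp // 8        # 256 bytes of raw bit data
total_bytes = n_mols * bytes_per_fp
mols_per_second = n_mols / elapsed_seconds

print(f'{total_bytes / 1e6:.0f} MB of raw fingerprint bits')
print(f'{mols_per_second:.0f} molecules per second')
```

Holding 2.3 million parsed molecules in memory at the same time would take considerably more than this.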

GetFingerprintsForMolsInFile() can figure out the file format automatically (as we saw above), but you can, if necessary, provide options controlling how the file is read.

!head -3 ../data/herg_data.txt
canonical_smiles molregno activity_id standard_value standard_units
N[C@@H]([C@@H]1CC[C@H](CC1)NS(=O)(=O)c2ccc(F)cc2F)C(=O)N3CC[C@H](F)C3 29272 671631 49000 nM
N[C@@H](C1CCCCC1)C(=O)N2CCSC2 29758 674222 28000 nM
opts = rdMolProcessing.SupplierOptions()
opts.delimiter = ' '
fps = rdMolProcessing.GetFingerprintsForMolsInFile('../data/herg_data.txt',options=opts)
len(fps)
1090
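
To illustrate what the delimiter controls, here is a small sketch that parses the three lines shown above with the ordinary Python-level SmilesMolSupplier instead of rdMolProcessing. The column handling (SMILES in column 0, name in column 1, a title line) reflects that supplier's defaults; we are assuming, not demonstrating, that the options object used above behaves the same way.

```python
from rdkit import Chem

# the first three lines of herg_data.txt, as shown above
data = (
    'canonical_smiles molregno activity_id standard_value standard_units\n'
    'N[C@@H]([C@@H]1CC[C@H](CC1)NS(=O)(=O)c2ccc(F)cc2F)C(=O)N3CC[C@H](F)C3 29272 671631 49000 nM\n'
    'N[C@@H](C1CCCCC1)C(=O)N2CCSC2 29758 674222 28000 nM\n'
)

suppl = Chem.SmilesMolSupplier()
# split each line on spaces; the first line supplies the column names
suppl.SetData(data, delimiter=' ', smilesColumn=0, nameColumn=1, titleLine=True)
mols = [m for m in suppl if m is not None]

for m in mols:
    # the name column ends up in _Name, the other columns become properties
    print(m.GetProp('_Name'), m.GetProp('standard_value'), m.GetProp('standard_units'))
```

With the wrong delimiter the whole line would be handed to the SMILES parser and the molecules would fail to parse.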

The default is to generate 2048-bit Morgan fingerprints with a radius of 3, but we can change this by providing a fingerprint generator to the call:

fpg = rdFingerprintGenerator.GetRDKitFPGenerator(maxPath=5)
fps = rdMolProcessing.GetFingerprintsForMolsInFile('../data/herg_data.txt',options=opts,generator=fpg)
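
To see what the two kinds of generator produce for a single molecule, here is a small sketch using the standard Python-level generator API rather than rdMolProcessing. The Morgan generator is built with the radius and size described above spelled out explicitly; whether the built-in default matches it in every other setting is an assumption.

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# second molecule from herg_data.txt above
mol = Chem.MolFromSmiles('N[C@@H](C1CCCCC1)C(=O)N2CCSC2')

# explicit version of the defaults described above: radius 3, 2048 bits
morgan_gen = rdFingerprintGenerator.GetMorganGenerator(radius=3, fpSize=2048)
# path-based RDKit fingerprint restricted to paths of at most 5 bonds
rdkit_gen = rdFingerprintGenerator.GetRDKitFPGenerator(maxPath=5)

morgan_fp = morgan_gen.GetFingerprint(mol)
rdkit_fp = rdkit_gen.GetFingerprint(mol)

# both are bit vectors of the same length, but they set different bits
print(morgan_fp.GetNumBits(), morgan_fp.GetNumOnBits())
print(rdkit_fp.GetNumBits(), rdkit_fp.GetNumOnBits())
```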

rdMolAlign.GetAllConformerBestRMS()

The new function GetAllConformerBestRMS() makes it easy to calculate the RMSDs between all pairs of conformers of a molecule.

from rdkit.Chem import rdDistGeom
from rdkit.Chem import rdMolAlign
ps = rdDistGeom.srETKDGv3()
ps.randomSeed = 0xa100f
ps.numThreads = 6

m = Chem.AddHs(Chem.MolFromSmiles('N[C@@H]([C@@H]1CC[C@H](CC1)NS(=O)(=O)c2ccc(F)cc2F)C(=O)N3CC[C@H](F)C3'))
rdDistGeom.EmbedMultipleConfs(m,100,ps)
m.GetNumConformers()
100
# generating RMSDs with Hs generally doesn't make sense:
m_noh = Chem.RemoveHs(m)
rmsds = rdMolAlign.GetAllConformerBestRMS(m_noh)
len(rmsds)
4950
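
The 4950 values correspond to one RMSD per unordered pair of the 100 conformers (100 × 99 / 2). Here is a minimal sketch of looking up the value for a particular pair. It assumes the results are ordered as the lower triangle of the RMSD matrix, (1,0), (2,0), (2,1), (3,0), and so on; check the function's docstring for your RDKit version before relying on this. The helper pair_index() is hypothetical and not part of the RDKit.

```python
def pair_index(i, j):
    """Position of the conformer pair (i, j) in a flat lower-triangle list.

    Assumes the ordering (1,0), (2,0), (2,1), (3,0), ...
    """
    if i == j:
        raise ValueError('no value is stored for a conformer paired with itself')
    if i < j:
        i, j = j, i  # the matrix is symmetric, so order does not matter
    return i * (i - 1) // 2 + j


n_confs = 100
n_pairs = n_confs * (n_confs - 1) // 2
print(n_pairs)

# with the rmsds tuple computed above, the RMSD between conformers 7 and 3
# would be: rmsds[pair_index(7, 3)]
print(pair_index(7, 3))
```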

You can also run the calculation using multiple threads to speed things up:

%timeit rmsds = rdMolAlign.GetAllConformerBestRMS(m_noh,numThreads=6)
6.17 ms ± 49 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)