from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from lwreg import utils
Using lwreg with the RDKit cartridge
We recently published an application note introducing lwreg, an open-source chemical registration system with a simple Python API that can be easily integrated into computational workflows. We think lwreg is really useful and hope others agree with us.
One of the key ideas behind lwreg is that it should not require any chemical intelligence in the database it uses to store its results. See the paper for more details, but the kinds of identity queries (i.e. “is this molecule or something closely related already registered?”) that a registration system needs to support can be answered with standard text queries. Having said that, there are times when it’s useful to ask questions like “Are there any registered molecules in the database that contain this substructure?”. One easy way to allow this kind of query is to install the RDKit cartridge in the database.
This post shows how to use the RDKit cartridge together with an lwreg instance that is using PostgreSQL as the database backend. Since installing PostgreSQL and the cartridge can be challenging for a normal user, I also show how to do that installation and get a database that’s suitable for local use. The instructions below should work on Windows, Linux, or the Mac.
Many thanks to Jessica Braun, the other main lwreg developer, for her feedback on a draft of this post, which will eventually end up, in some modified form, in the lwreg documentation.
Setting up a local PostgreSQL installation
You probably wouldn’t want to use this setup in any very serious way, but it’s great for learning and could certainly work as a way to configure a personal database. I put this notebook together in Windows on my laptop, but it should work on Linux or the Mac too.
Aside If you want to do this more seriously without a huge amount of complication, your best bet is probably to setup a docker container that runs postgresql and the cartridge and to use that within lwreg. I will try to put together a post showing how to do this in the not-too-distant future.
As usual, I assume that you’re using conda since it makes installing the RDKit and all other dependencies incredibly easy.
Create a conda environment using the lwreg environment.yml file:
conda env create -f https://github.com/rinikerlab/lightweight-registration/raw/refs/heads/main/environment.yml
This creates an environment named py311_lwreg
. Activate that and install lwreg, postgresql, and the rdkit-postgresql cartridge:
conda activate py311_lwreg
python -m pip install git+https://github.com/rinikerlab/lightweight-registration
conda install postgresql
conda install rdkit-postgresql
We need an environment for postgresql to use to store its database files, create that and then tell postgresql to initialize that directory, you will obviously need to pick a different path:
mkdir c:/users/glandrum/database
initdb -D /c/Users/glandrum/database/psql17
Start the postgresql server:
pg_ctl -D /c/Users/glandrum/database/psql17 -l /c/Users/glandrum/database/psql17.log start
Note If you run the server this way, it will be killed when the shell you launched it from exits. You will need to run this start command each time you want to start using lwreg.
Finally create the database that we’re going to work with here:
createdb gl_structuredemo
Now everything is set up, let’s get started!
Set up our the configuration we’re going to use for lwreg:
= utils.defaultConfig()
cfg 'host'] = 'localhost'
cfg['dbtype'] = 'postgresql'
cfg['dbname'] = 'gl_structuredemo'
cfg[# for this use case we're going to be registering 3D structures and want to keep the conformers:
'registerConformers'] = True
cfg[# we don't want lwreg to do any standardization of the molecules when we load them:
'standardization'] = 'none'
cfg[ utils.set_default_config(cfg)
utils.initdb()
This will destroy any existing information in the registration database.
are you sure? [yes/no]: yes
True
Start by loading some structures (these are some subsets of molecules I’ve extracted from the COD for another project). You obviously won’t have these files, but you can provide any SDF files that contain 3D molecules.
= utils.bulk_register(sdfile='c:/users/glandrum/ETH/Datasets/COD/COD_2024Feb14.organic.chembl_selected.sdf')
mrns = utils.bulk_register(sdfile='c:/users/glandrum/ETH/Datasets/COD/COD_2024Feb14.organic.macrocycles.sdf')
mrns = utils.bulk_register(sdfile='c:/users/glandrum/ETH/Datasets/COD/COD_2024Feb14.organic.cages.sdf') mrns
len(utils.get_all_registry_numbers())
2416
Using the RDKit cartridge with data from the lwreg tables.
We start by installing the cartridge in our database (if it’s not already installed) and create a new schema to hold the RDKit tables.
= utils.connect(config=cfg)
cn = cn.cursor()
curs 'create extension if not exists rdkit')
curs.execute('create schema if not exists rdk')
curs.execute( cn.commit()
Now create the rdk.mols
table to hold RDKit molecules, populate it with molecules created from the mol blocks in lwreg’s molblocks
table, and then create an index on the molecule column to allow us to quickly do substructure searches.
'drop table if exists rdk.mols cascade')
curs.execute(# the second argument to mol_from_ctab() here tells the cartridge to not store the molecule's conformers, we don't need them
'select molregno, mol_from_ctab(molblock::cstring,false) m into rdk.mols from molblocks')
curs.execute('create index molidx on rdk.mols using gist(m)')
curs.execute( cn.commit()
Now let’s do some queries.
Start by looking for one of the torsions that’s used for ETKDGv3:
= '[cH0:1][c:2]([cH1])!@;-[O:3][C:4]'
smi "select * from rdk.mols where m @> %s::qmol",(smi,))
curs.execute(= curs.fetchall()
d len(d)
99
And look at the results with just the torsion substructure highlighted:
= Chem.MolFromSmarts(smi)
q = [x.GetIdx() for x in q.GetAtoms() if x.GetAtomMapNum()]
labelledAts
= [Chem.MolFromSmiles(x[1]) for x in d]
ms = []
ats for m in ms:
= m.GetSubstructMatches(q)
matches = []
tl for tm in matches:
for i in labelledAts])
tl.extend([tm[i]
ats.append(tl)
6],subImgSize=(350,300),highlightAtomLists=ats) Draw.MolsToGridImage(ms[:
Now let’s do a query to retrieve conformers for the molecules that match that substructure:
"select molregno,conf_id,molblock from rdk.mols join conformers using (molregno) where m@>%s::qmol",(smi,))
curs.execute(= curs.fetchall()
d len(d)
106
And collect information about the values of the dihedral that we’re interested in across those conformers:
from rdkit.Chem import rdMolTransforms
= Chem.MolFromSmarts(smi)
q = [x.GetIdx() for x in q.GetAtoms() if x.GetAtomMapNum()]
labelledAts
= [Chem.MolFromMolBlock(x[2],removeHs=False) for x in d]
ms = []
vals for m in ms:
= m.GetSubstructMatches(q)
matches = m.GetConformer()
conf for tm in matches:
= [tm[i] for i in labelledAts]
aids *aids)) vals.append(rdMolTransforms.GetDihedralDeg(conf,
len(vals)
219
Do a histogram of the dihedral values:
from matplotlib import pyplot as plt
%matplotlib inline
=20);
plt.hist(vals,bins'dihedral')
plt.xlabel('count')
plt.ylabel(; plt.title(smi)
Keeping the RDKit tables up-to-date when molecules are added to lwreg
Our next step is to make sure that RDKit tables in the database are automatically updated when new molecules are added to lwreg. We can do this in PostgreSQL by adding a trigger
to the molblock
table that is called after a new row is inserted into the molblock
table. The trigger is called with the data from the new row, so we can just insert the new molregno and the molecule constructed from the new molblock into the rdk.mols
table.
Note: if using this in production, you probably would need to do some error handling in the trigger. That’s beyond the scope of this blog post (in this case that means: “I don’t know how to do it”).
The PostgreSQL docs have more information about creating triggers.
'''CREATE OR REPLACE FUNCTION copy_new_mol() RETURNS TRIGGER AS
curs.execute($BODY$
BEGIN
INSERT INTO
rdk.mols(molregno,m)
VALUES(new.molregno,mol_from_ctab(new.molblock::cstring,false));
RETURN new;
END;
$BODY$
language plpgsql;''')
'''CREATE OR REPLACE TRIGGER mol_insert_trigger
curs.execute( AFTER INSERT ON molblocks
FOR EACH ROW
EXECUTE FUNCTION copy_new_mol();''')
Demonstrate that this worked by creating two new molecules that share a core, generating a conformer for each one, and registering each one.
from rdkit.Chem import rdDistGeom
= Chem.AddHs(Chem.MolFromSmiles('c1c(O)csn1'))
newMol =0xf00d)
rdDistGeom.EmbedMolecule(newMol,randomSeed= Chem.RemoveHs(newMol)
tm =tm) utils.register(mol
(2792, 2792)
= Chem.AddHs(Chem.MolFromSmiles('c1c(OC)csn1'))
newMol =0xf00d)
rdDistGeom.EmbedMolecule(newMol,randomSeed= Chem.RemoveHs(newMol)
tm =tm) utils.register(mol
(2793, 2793)
Now show that we can retrieve those two just-registered molecules with a substructure query for the core:
= 'c1c(O)csn1'
sma "select * from rdk.mols where m@>%s::qmol limit 10",(sma,))
curs.execute(= curs.fetchall()
d len(d)
2
0] d[
(2792, 'Oc1cnsc1')
1] d[
(2793, 'COc1cnsc1')