Building a similarity comparison set

reference

similarity

Creating reference data for other projects

Published

March 16, 2025

This is a revised and expanded version of an old blog post which has already been updated once.

Goal: construct a set of molecular pairs that can be used to compare similarity methods to each other.

This works with ChEMBL35.

I want to start with molecules that have some connection to each other, so I will pick pairs that have a baseline similarity: a Tanimoto similarity using count based Morgan0 fingerprints of at least 0.65. I also create a second set of somewhat more closely related molecules where the baseline similarity is 0.55 with a Morgan1 fingerprint. The thresholds were selected based on the analysis in this blog post

If you are interested in using the sets generated in this post, you can find them in the source repo for this blog.

Creating the tables in PostgreSQL

I’m going to use ChEMBL as my data source, so I’ll start by adding a table with count-based morgan fingerprints. Here’s the SQL for that, assuming that you’ve installed the RDKit extension and setup an RDKit schema as described in the docs

select molregno,morgan_fp(m,0) mfp0,morgan_fp(m,1) mfp1,morgan_fp(m,2) mfp2 into rdk.countfps from rdk.mols;
create index cfps_mfp0 on rdk.countfps using gist(mfp0);
create index cfps_mfp1 on rdk.countfps using gist(mfp1);
create index cfps_mfp2 on rdk.countfps using gist(mfp2);

Fingerprints that only contains molecules with <= 50 heavy atoms and a single fragment (we recognize this because there is no ‘.’ in the SMILES):

select molregno,mfp0,mfp1 into table rdk.tfps_smaller from rdk.countfps join compound_properties using (molregno) join compound_structures using (molregno) where heavy_atoms<=50 and canonical_smiles not like '%.%';
create index sfps_mfp0_idx on rdk.tfps_smaller using gist(mfp0);
create index sfps_mfp1_idx on rdk.tfps_smaller using gist(mfp1);

And now I’ll build the set of pairs using Python. This is definitely doable in SQL, but my SQL-fu isn’t that strong.

Start by getting a set of 60K random small molecules:

from rdkit import Chem
import rdkit
print(rdkit.__version__)
import time
print(time.asctime())

2024.09.6
Sat Mar 15 18:36:20 2025

import psycopg2
cn = psycopg2.connect(host='localhost',dbname='chembl_35')
curs = cn.cursor()
curs.execute("select chembl_id,m from rdk.mols join rdk.tfps_smaller using (molregno)"
             " join chembl_id_lookup on (molregno=entity_id and entity_type='COMPOUND')"
             " order by random() limit 60000")
qs = curs.fetchall()

And now find one neighbor for 50K of those from the mfp0 table of smallish molecules:

cn.rollback()
curs.execute('set rdkit.tanimoto_threshold=0.65')

keep=[]
for i,row in enumerate(qs):
    curs.execute("select chembl_id,m from rdk.mols join (select chembl_id,molregno from rdk.tfps_smaller "
                 "join chembl_id_lookup on (molregno=entity_id and entity_type='COMPOUND') "
                 "where mfp0%%morgan_fp(%s,0) "
                 "and chembl_id!=%s limit 1) t2 using (molregno) "
                 "limit 1",(row[1],row[0]))
    d = curs.fetchone()
    if not d: 
        continue
    keep.append((row[0],row[1],d[0],d[1]))
    if len(keep)>=50000: 
        break
    if not i%1000: print('Done: %d'%i)

Done: 0
Done: 1000
Done: 2000
Done: 3000
Done: 4000
Done: 5000
Done: 6000
Done: 7000
Done: 8000
Done: 9000
Done: 10000
Done: 11000
Done: 12000
Done: 13000
Done: 14000
Done: 15000
Done: 16000
Done: 17000
Done: 18000
Done: 19000
Done: 20000
Done: 21000
Done: 22000
Done: 23000
Done: 24000
Done: 25000
Done: 26000
Done: 27000
Done: 28000
Done: 29000
Done: 30000
Done: 31000
Done: 32000
Done: 33000
Done: 34000
Done: 35000
Done: 36000
Done: 37000
Done: 38000
Done: 39000
Done: 40000
Done: 41000
Done: 42000
Done: 43000
Done: 44000
Done: 45000
Done: 46000
Done: 47000
Done: 48000
Done: 49000
Done: 50000

Finally, write those out to a file so that we can use them elsewhere:

import gzip
with gzip.open('../data/chembl35_50K.mfp0.pairs.txt.gz','wb+') as outf:
    for cid1,smi1,cid2,smi2 in keep: 
        outf.write(f'{cid1}\t{smi1}\t{cid2}\t{smi2}\n'.encode('UTF-8'))

Try molecules that are a bit more similar.

Use a similarity threshold for the pairs using MFP1 bits.

cn.rollback()
curs.execute('set rdkit.tanimoto_threshold=0.55')

keep=[]
for i,row in enumerate(qs):
    curs.execute("select chembl_id,m from rdk.mols join (select chembl_id,molregno from rdk.tfps_smaller "
                 "join chembl_id_lookup on (molregno=entity_id and entity_type='COMPOUND') "
                 "where mfp1%%morgan_fp(%s,1) "
                 "and chembl_id!=%s limit 1) t2 using (molregno) "
                 "limit 1",(row[1],row[0]))
    d = curs.fetchone()
    if not d: 
        continue
    keep.append((row[0],row[1],d[0],d[1]))
    if len(keep)>=50000: 
        break
    if not i%1000: print('Done: %d'%i)

Done: 0
Done: 1000
Done: 2000
Done: 3000
Done: 4000
Done: 5000
Done: 6000
Done: 7000
Done: 8000
Done: 9000
Done: 10000
Done: 11000
Done: 12000
Done: 13000
Done: 14000
Done: 15000
Done: 16000
Done: 17000
Done: 18000
Done: 19000
Done: 20000
Done: 21000
Done: 22000
Done: 23000
Done: 24000
Done: 25000
Done: 26000
Done: 27000
Done: 28000
Done: 29000
Done: 30000
Done: 31000
Done: 32000
Done: 33000
Done: 34000
Done: 35000
Done: 36000
Done: 37000
Done: 38000
Done: 39000
Done: 40000
Done: 41000
Done: 42000
Done: 43000
Done: 44000
Done: 45000
Done: 46000
Done: 47000
Done: 48000
Done: 49000
Done: 50000

import gzip
with gzip.open('../data/chembl35_50K.mfp1.pairs.txt.gz','wb+') as outf:
    for cid1,smi1,cid2,smi2 in keep: 
        outf.write(f'{cid1}\t{smi1}\t{cid2}\t{smi2}\n'.encode('UTF-8'))