Similarity-search hitlist overlap part 2

reference
similarity
fingerprints
Taking the noise level into account
Published

April 13, 2025

This post is a followup to a recent one exploring the amount of overlap in the hit sets returned by doing similarity searches with different fingerprint types.

At the end of the intro to that post I said:

One thing that is worth keeping in mind is that these results almost certainly consider similarity values which are down close to and probably in the region of similarities observed between random compounds. It may be worth refining the analysis in order to only consider similarities which are more significant, but that’s for a possible future post.

This is that “possible future post”. Here I filter the hit sets to only include similarity values greater than or equal to the 99% threshold for each fingerprint in the post on random similarities.

Other than skipping the initial examination of searches done with a single compounds, the rest of the analysis is the same as in earlier post. The results are not qualitatively different, but I think this is a more sensible approach overall.

from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit import DataStructs
import numpy as np

from matplotlib import pyplot as plt
plt.style.use('tableau-colorblind10')

%matplotlib inline
%load_ext sql

import rdkit
print(rdkit.__version__)
2025.03.1
import gzip
with gzip.open('../data/Pubchem_AID373_compounds.sdf.gz','rb') as inf:
    with Chem.ForwardSDMolSupplier(inf) as suppl:
        dbmols = [x for x in suppl if x is not None]
len(dbmols)
    
59788

Define the fingerprints and similarity thresholds we will use:

# tuple order: (fp name, generator, count_thresh, bit_thresh)
runs = [
    ('mfp2',rdFingerprintGenerator.GetMorganGenerator(radius=2), 0.30, 0.23),
    ('ffp2',rdFingerprintGenerator.GetMorganGenerator(radius=2,
                                                  atomInvariantsGenerator=rdFingerprintGenerator.GetMorganAtomInvGen()),
        0.32, 0.24),
    ('mfp3',rdFingerprintGenerator.GetMorganGenerator(radius=3), 0.23, 0.17),
    ('ffp3',rdFingerprintGenerator.GetMorganGenerator(radius=3,
                                                  atomInvariantsGenerator=rdFingerprintGenerator.GetMorganAtomInvGen()),
        0.25, 0.18),
    ('mfp1',rdFingerprintGenerator.GetMorganGenerator(radius=1), 0.42, 0.36),
    ('ffp1',rdFingerprintGenerator.GetMorganGenerator(radius=1,
                                                  atomInvariantsGenerator=rdFingerprintGenerator.GetMorganAtomInvGen()),
        0.46, 0.40),
    ('tt',rdFingerprintGenerator.GetTopologicalTorsionGenerator(), 0.26, 0.28),
    ('ap',rdFingerprintGenerator.GetAtomPairGenerator(), 0.32, 0.42),
    ('rdk5',rdFingerprintGenerator.GetRDKitFPGenerator(maxPath=5), 0.33, 0.37),
    ('rdk7',rdFingerprintGenerator.GetRDKitFPGenerator(maxPath=7), 0.20, 0.51),
]
from rdkit.Chem.Pharm2D import Gobbi_Pharm2D,Generate
def Gobbi2D_bits(mol,fpLen=2048):
    res = DataStructs.ExplicitBitVect(fpLen)
    for bit in Generate.Gen2DFingerprint(mol,Gobbi_Pharm2D.factory).GetOnBits():
        # the bits are not hashed, so we need to do so before we fold them:
        res.SetBit(hash((bit,))%fpLen)
    return res
from rdkit.Chem import rdMolDescriptors
from rdkit.Avalon import pyAvalonTools

func_runs = [
    ('gobbi2d',Gobbi2D_bits, 0.37),
    ('avalon',pyAvalonTools.GetAvalonFP, 0.59),
    ('avalon-c',pyAvalonTools.GetAvalonCountFP, 0.50),
    ('pattern',Chem.PatternFingerprint, 0.75),
]

In order to get some decent statistics, we will do similarity searches with 500 random molecules.

Running this cell takes a while.

import random
random.seed(0xbad5eed)
order = list(range(len(dbmols)))
random.shuffle(order)

nToDo = 500
order = order[:500]

accum = {}
for nm,fpg,cthresh,bthresh in runs:
    print(nm)
    fps = list(fpg.GetFingerprints(dbmols,numThreads=8))
    taccum = []
    for i in order:
        qry = fps[i]
        sims = [(sim,j) for j,sim in enumerate(DataStructs.BulkTanimotoSimilarity(qry,fps)) if sim>=bthresh and j!=i]
        sims = sorted(sims,reverse=True)[:1000]
        taccum.append(sims)
    accum[f'{nm}-b'] = taccum

    fps = list(fpg.GetCountFingerprints(dbmols,numThreads=8))
    taccum = []
    for i in order:
        qry = fps[i]
        sims = [(sim,j) for j,sim in enumerate(DataStructs.BulkTanimotoSimilarity(qry,fps)) if sim>=cthresh and j!=i]
        sims = sorted(sims,reverse=True)[:1000]
        taccum.append(sims)
    accum[f'{nm}-c'] = taccum

for nm,func,thresh in func_runs:
    print(nm)
    fps = [func(m) for m in dbmols]
    taccum = []
    for i in order:
        qry = fps[i]
        sims = [(sim,j) for j,sim in enumerate(DataStructs.BulkTanimotoSimilarity(qry,fps)) if sim>=thresh and j!=i]
        sims = sorted(sims,reverse=True)[:1000]
        taccum.append(sims)
    accum[nm] = taccum
mfp2
ffp2
mfp3
ffp3
mfp1
ffp1
tt
ap
rdk5
rdk7
gobbi2d
avalon
avalon-c
pattern
import pickle
import gzip
with gzip.open('./results/sim_overlaps_thresh.pkl.gz','wb+') as outf:
    pickle.dump(accum,outf)

Accumulate the overlaps:

from collections import defaultdict

nms = list(accum.keys())

ovls = defaultdict(dict)
topNs = [10,100,1000]
for i,nmi in enumerate(nms):
    nruns = len(accum[nmi])
    for j in range(i):
        nmj = nms[j]
        for topN in topNs:
            ovls[nmi,nmj][topN] = []
        for run in range(nruns):
            topi = [x for s,x in accum[nmi][run]]
            topj = [x for s,x in accum[nmj][run]]
            for topN in topNs:
                ovls[nmi,nmj][topN].append(len(set(topi[:topN]).intersection(topj[:topN])))
ovls = dict(ovls)

Find the most similar fingerprint types for each of the fingerprints, this time using the mean overlap across the 500 hit sets:

snms = sorted(nms)

for i in range(len(snms)):
    print('-'*70)
    print(snms[i])
    for cnt in (10,100,1000):
        row = []
        for j in range(len(snms)):
            if i==j:
                continue
            nmi = snms[i]
            nmj = snms[j]
            if (nmi,nmj) not in ovls:
                nmi,nmj = nmj,nmi
            row.append((np.mean(ovls[nmi,nmj][cnt]),j))
        row = sorted(row,reverse=True)
        nbrs = []
        for j in range(3):
            nbrs.append(f'{snms[row[j][1]]:8s}({row[j][0]:5.1f})')
        print(f'\t{cnt: 5d}',f'{nbrs[0]:17s}',f'{nbrs[1]:17s}',f'{nbrs[2]:16s}',)
----------------------------------------------------------------------
ap-b
       10 ap-c    (  7.8)   ffp1-c  (  4.9)   mfp1-c  (  4.8) 
      100 ap-c    ( 61.4)   ffp1-c  ( 34.6)   mfp1-c  ( 34.0) 
     1000 ap-c    (329.6)   ffp1-c  (189.2)   mfp1-c  (182.2) 
----------------------------------------------------------------------
ap-c
       10 ap-b    (  7.8)   ffp2-c  (  5.1)   ffp1-c  (  5.1) 
      100 ap-b    ( 61.4)   ffp1-c  ( 40.2)   mfp1-c  ( 39.0) 
     1000 ap-b    (329.6)   ffp1-c  (264.2)   mfp1-c  (252.7) 
----------------------------------------------------------------------
avalon
       10 avalon-c(  5.3)   rdk5-b  (  5.1)   rdk7-b  (  5.1) 
      100 rdk5-b  ( 42.7)   avalon-c( 40.3)   rdk7-c  ( 40.1) 
     1000 rdk7-c  (207.2)   avalon-c(195.1)   rdk5-b  (175.3) 
----------------------------------------------------------------------
avalon-c
       10 avalon  (  5.3)   rdk5-c  (  5.3)   rdk7-c  (  5.1) 
      100 rdk5-c  ( 42.5)   avalon  ( 40.3)   rdk7-c  ( 40.2) 
     1000 rdk7-c  (235.7)   rdk5-c  (224.9)   avalon  (195.1) 
----------------------------------------------------------------------
ffp1-b
       10 mfp1-b  (  7.2)   ffp2-b  (  6.7)   mfp2-b  (  6.2) 
      100 mfp1-b  ( 64.7)   ffp2-b  ( 62.6)   mfp2-b  ( 55.7) 
     1000 ffp2-b  (493.4)   mfp1-b  (467.8)   mfp2-b  (420.7) 
----------------------------------------------------------------------
ffp1-c
       10 mfp1-c  (  7.5)   ffp2-c  (  6.9)   mfp2-c  (  6.4) 
      100 mfp1-c  ( 64.1)   ffp2-c  ( 62.5)   mfp2-c  ( 55.2) 
     1000 ffp2-c  (465.9)   mfp1-c  (435.8)   ffp3-c  (410.9) 
----------------------------------------------------------------------
ffp2-b
       10 ffp3-b  (  8.2)   mfp2-b  (  7.9)   mfp3-b  (  7.5) 
      100 ffp3-b  ( 79.6)   mfp2-b  ( 73.6)   mfp3-b  ( 69.9) 
     1000 ffp3-b  (679.8)   mfp2-b  (575.7)   mfp3-b  (568.1) 
----------------------------------------------------------------------
ffp2-c
       10 ffp3-c  (  8.3)   mfp2-c  (  8.1)   mfp3-c  (  7.5) 
      100 ffp3-c  ( 79.6)   mfp2-c  ( 72.3)   mfp3-c  ( 69.1) 
     1000 ffp3-c  (631.5)   mfp3-c  (524.9)   mfp2-c  (519.3) 
----------------------------------------------------------------------
ffp3-b
       10 ffp2-b  (  8.2)   mfp3-b  (  8.1)   mfp2-b  (  7.5) 
      100 ffp2-b  ( 79.6)   mfp3-b  ( 74.3)   mfp2-b  ( 68.6) 
     1000 ffp2-b  (679.8)   mfp3-b  (589.2)   mfp2-b  (543.7) 
----------------------------------------------------------------------
ffp3-c
       10 ffp2-c  (  8.3)   mfp3-c  (  8.2)   mfp2-c  (  7.6) 
      100 ffp2-c  ( 79.6)   mfp3-c  ( 74.1)   mfp2-c  ( 68.9) 
     1000 ffp2-c  (631.5)   mfp3-c  (567.9)   ffp3-b  (539.2) 
----------------------------------------------------------------------
gobbi2d
       10 ap-c    (  3.8)   ap-b    (  3.7)   mfp3-c  (  3.5) 
      100 ap-c    ( 22.0)   mfp2-c  ( 21.6)   mfp3-c  ( 21.2) 
     1000 ap-c    ( 73.8)   mfp2-b  ( 72.1)   mfp3-c  ( 70.8) 
----------------------------------------------------------------------
mfp1-b
       10 ffp1-b  (  7.2)   mfp2-b  (  7.0)   ffp2-b  (  6.3) 
      100 mfp2-b  ( 65.5)   ffp1-b  ( 64.7)   ffp2-b  ( 56.9) 
     1000 mfp2-b  (515.2)   ffp1-b  (467.8)   mfp3-b  (448.4) 
----------------------------------------------------------------------
mfp1-c
       10 ffp1-c  (  7.5)   mfp2-c  (  7.1)   ffp2-c  (  6.5) 
      100 mfp2-c  ( 65.6)   ffp1-c  ( 64.1)   ffp2-c  ( 57.2) 
     1000 mfp2-c  (480.2)   mfp3-c  (436.2)   ffp1-c  (435.8) 
----------------------------------------------------------------------
mfp2-b
       10 mfp3-b  (  8.3)   ffp2-b  (  7.9)   ffp3-b  (  7.5) 
      100 mfp3-b  ( 80.3)   ffp2-b  ( 73.6)   ffp3-b  ( 68.6) 
     1000 mfp3-b  (669.9)   ffp2-b  (575.7)   ffp3-b  (543.7) 
----------------------------------------------------------------------
mfp2-c
       10 mfp3-c  (  8.3)   ffp2-c  (  8.1)   ffp3-c  (  7.6) 
      100 mfp3-c  ( 80.5)   ffp2-c  ( 72.3)   ffp3-c  ( 68.9) 
     1000 mfp3-c  (614.3)   ffp2-c  (519.3)   ffp3-c  (512.8) 
----------------------------------------------------------------------
mfp3-b
       10 mfp2-b  (  8.3)   ffp3-b  (  8.1)   ffp2-b  (  7.5) 
      100 mfp2-b  ( 80.3)   ffp3-b  ( 74.3)   mfp3-c  ( 70.0) 
     1000 mfp2-b  (669.9)   ffp3-b  (589.2)   ffp2-b  (568.1) 
----------------------------------------------------------------------
mfp3-c
       10 mfp2-c  (  8.3)   ffp3-c  (  8.2)   ffp2-c  (  7.5) 
      100 mfp2-c  ( 80.5)   ffp3-c  ( 74.1)   mfp3-b  ( 70.0) 
     1000 mfp2-c  (614.3)   ffp3-c  (567.9)   mfp3-b  (546.7) 
----------------------------------------------------------------------
pattern
       10 avalon-c(  4.7)   rdk5-c  (  4.2)   avalon  (  4.2) 
      100 avalon-c( 37.4)   rdk5-c  ( 35.7)   rdk7-c  ( 34.1) 
     1000 rdk5-c  (183.8)   rdk7-c  (182.3)   rdk5-b  (170.9) 
----------------------------------------------------------------------
rdk5-b
       10 rdk7-b  (  6.8)   rdk5-c  (  6.4)   rdk7-c  (  6.1) 
      100 rdk5-c  ( 61.9)   rdk7-c  ( 57.6)   rdk7-b  ( 47.3) 
     1000 rdk5-c  (383.5)   rdk7-c  (368.5)   ffp2-b  (228.8) 
----------------------------------------------------------------------
rdk5-c
       10 rdk7-c  (  7.5)   rdk5-b  (  6.4)   rdk7-b  (  5.8) 
      100 rdk7-c  ( 73.4)   rdk5-b  ( 61.9)   tt-c    ( 44.2) 
     1000 rdk7-c  (521.6)   rdk5-b  (383.5)   tt-c    (265.3) 
----------------------------------------------------------------------
rdk7-b
       10 rdk5-b  (  6.8)   rdk7-c  (  6.6)   rdk5-c  (  5.8) 
      100 rdk5-b  ( 47.3)   rdk7-c  ( 46.0)   rdk5-c  ( 42.3) 
     1000 rdk7-c  (163.0)   rdk5-b  (130.0)   rdk5-c  (125.0) 
----------------------------------------------------------------------
rdk7-c
       10 rdk5-c  (  7.5)   rdk7-b  (  6.6)   rdk5-b  (  6.1) 
      100 rdk5-c  ( 73.4)   rdk5-b  ( 57.6)   rdk7-b  ( 46.0) 
     1000 rdk5-c  (521.6)   rdk5-b  (368.5)   tt-c    (282.9) 
----------------------------------------------------------------------
tt-b
       10 tt-c    (  8.5)   ffp3-c  (  6.2)   mfp3-c  (  6.1) 
      100 tt-c    ( 77.6)   ffp2-b  ( 52.7)   mfp2-b  ( 51.4) 
     1000 tt-c    (470.1)   ffp2-b  (321.8)   mfp2-b  (308.9) 
----------------------------------------------------------------------
tt-c
       10 tt-b    (  8.5)   ffp3-c  (  6.4)   ffp2-c  (  6.3) 
      100 tt-b    ( 77.6)   ffp2-c  ( 52.6)   ffp3-c  ( 52.6) 
     1000 tt-b    (470.1)   ffp3-c  (322.5)   ffp2-c  (322.3) 

Look at histograms of the overlap sizes for a few different fingerprint pairs

def compare(prs,ovls=ovls):
    prs = list(prs)
    for i,pr in enumerate(prs):
        if pr not in ovls:
            pr = pr[1],pr[0]
        prs[i] = pr

    plt.figure(figsize=(12,5))
    for i,n in enumerate((100,1000)):
        plt.subplot(1,2,i+1)
        plt.hist([ovls[pr][n] for pr in prs],label=prs,bins=20);
        plt.xlim(0,n)
        plt.xlabel(f'n={n} overlap')
        #plt.title(pr)
        plt.legend()
    plt.tight_layout()

Start with FeatureMorgan and Morgan, where there is a reasonably large amount of overlap:

compare((('ffp2-b','mfp2-b'),('ffp3-b','mfp3-b')))

Now compare count-based and bit-based Morgan fingerprints. I expected the overlaps here to be higher:

compare((('mfp3-c','mfp3-b'),('mfp2-c','mfp2-b'),))

Comparing very different types of fingerprints: count-based Morgan3 and both Topological Torsions and Atom Pairs (these both use count simulation. Here there are significant differences. These are fingerprints that it would be interesting to use together.

compare((('tt-b','mfp3-c'),('ap-b','mfp3-c')))

Same thing with bit-baseed Morgan3 and RDK5. Again, These are nicely complementary fingerprints:

compare((('tt-b','mfp3-b'),('ap-b','mfp3-b'),('rdk5-b','mfp3-b')))

Finally compare Morgan3 with the pattern fingerprint (normally used for substructure screening, not similarity search), Gobbi2D (a 2D pharmacophore FP) and the Avalon FP. These are also nicely different from each other:

compare((('pattern','mfp3-b'),('gobbi2d','mfp3-b'),('avalon','mfp3-b')))