Colliding bits III, expanded

reference
fingerprints
Looking at numbers of collisions and their impact on similarity
Published

January 23, 2023

This is an updated and expanded version of an old post

I looked at the number of collisions in Morgan fingerprints in an earlier post. The topic came up again in discussions about the recent post on Morgan fingerprint stats, which used a much larger data set.

Here I repeat the earlier collision analysis using a dataset of 4 million compounds and look at a number of different fingerprint types – Morgan fingeprints of various radii, RDKit fingerprints, atom pair fingerprints, and topological torsion fingerprints– folded to a set of different sizes – 64, 128, 256, 512, 1K, 2K, 4K, 8K, 16K.

After evaluating the number of collisions, I look at the impact that fingerprint size has on computed similarity values.

TL;DR version: The conclusions match what I observed before, there are a fair number of collisions at fingerprint sizes below 4K. As you would expect, fingerprints with more bits set have more collisions.

Here’s a table with some of the comparisons between Tanimoto similarities calculated with folded fingerprint and those with unfolded fingerprints (the full table is towards the end of the post). In the table d refers to the difference between similarities calculated with the folded and unfolded (=“no” collisions) fingerprint at a particular size. Positive values of d indicate that the similarity with the folded fingerprint is higher than that with the unfolded fingerprint. SpearmanR is the Spearman correlation coefficient between the folded and unfolded similarities.

nBits fingerprint SpearmanR mean(d) std(d) 90% |d|
16384 mfp1 1.000 0.00023 0.0025 0
16384 mfp2 0.999 0.00085 0.0036 0.0034
16384 mfp3 0.999 0.0014 0.0038 0.0056
16384 rdk5 1.000 0.0054 0.0046 0.012
16384 hashap 0.997 0.0066 0.0072 0.016
16384 hashtt 0.999 0.0024 0.0074 0.011
1024 mfp1 0.989 0.0067 0.016 0.028
1024 mfp2 0.990 0.014 0.016 0.036
1024 mfp3 0.987 0.022 0.017 0.044
1024 rdk5 0.981 0.085 0.049 0.15
1024 hashap 0.874 0.1 0.051 0.17
1024 hashtt 0.983 0.025 0.026 0.061
512 mfp1 0.980 0.013 0.021 0.044
512 mfp2 0.979 0.028 0.023 0.059
512 mfp3 0.972 0.044 0.026 0.078
512 rdk5 0.929 0.16 0.09 0.28
512 hashap 0.702 0.19 0.083 0.3
512 hashtt 0.967 0.046 0.037 0.096
128 mfp1 0.918 0.049 0.043 0.11
128 mfp2 0.897 0.11 0.054 0.18
128 mfp3 0.839 0.17 0.066 0.25
128 rdk5 0.528 0.42 0.2 0.68
128 hashap 0.343 0.42 0.12 0.57
128 hashtt 0.828 0.17 0.085 0.28
64 mfp1 0.832 0.099 0.064 0.18
64 mfp2 0.771 0.2 0.083 0.31
64 mfp3 0.642 0.31 0.1 0.44
64 rdk5 0.235 0.5 0.22 0.77
64 hashap 0.158 0.51 0.13 0.65
64 hashtt 0.672 0.28 0.12 0.43

Down to a fingerprint size of 1024 bits all the fingerprints except hashap have an R of >0.90 and comparatively low deviations (last column). With the very small 128 bit fingerprints only the mfp1 and mfp2 (both of which don’t set many bits) are showing halfway decent performance, and with 64bit fingerprints the deviations are large for all fingerprint types (though the R values for mfp1 and mfp2 are surprisingly high).

There are a number of things worth further exploration here, but those will be other blog posts.

from rdkit import Chem,DataStructs
import time,random,gzip,pickle,copy
import numpy as np
from collections import defaultdict
from rdkit.Chem import Draw
from rdkit.Chem import rdFingerprintGenerator
from rdkit.Chem.Draw import IPythonConsole
from rdkit import DataStructs
import rdkit
%pylab inline
plt.style.use('tableau-colorblind10')
plt.rcParams['font.size'] = '14'
print(rdkit.__version__)
import time
print(time.asctime())
%pylab is deprecated, use %matplotlib inline and import the required libraries.
Populating the interactive namespace from numpy and matplotlib
2022.09.1
Wed Jan 18 13:22:54 2023
/localhome/glandrum/.conda/envs/py310_rdkit/lib/python3.10/site-packages/IPython/core/magics/pylab.py:162: UserWarning: pylab import has clobbered these variables: ['random', 'copy']
`%matplotlib` prevents importing * from pylab and numpy
  warn("pylab import has clobbered these variables: %s"  % clobbered +

Generating the data

For test data I’ll use the 4 million random compounds from PubChem with <=50 heavy atoms. I constructed the set of compounds by downloading the full pubchem compound set (on 8 Jan, 2023), and picking 10 million random lines using an awk script.

filen='/localhome/glandrum/Datasets/PubChem/pubchem_compound_random_10000000.txt.gz'

Loop over the molecules and build fingerprints of multiple radii and folded lengths.

gens = [('mfp1',lambda fpsize: rdFingerprintGenerator.GetMorganGenerator(radius=1,fpSize=fpsize)),
        ('mfp2',lambda fpsize: rdFingerprintGenerator.GetMorganGenerator(radius=2,fpSize=fpsize)),
         ('mfp3',lambda fpsize: rdFingerprintGenerator.GetMorganGenerator(radius=3,fpSize=fpsize)),
         ('rdk5',lambda fpsize: rdFingerprintGenerator.GetRDKitFPGenerator(maxPath=5,fpSize=fpsize)),
         ('hashap',lambda fpsize: rdFingerprintGenerator.GetAtomPairGenerator(fpSize=fpsize)),
         ('hashtt',lambda fpsize: rdFingerprintGenerator.GetTopologicalTorsionGenerator(fpSize=fpsize)),
         ]
from rdkit import RDLogger
RDLogger.DisableLog('rdApp.*')
import copy
history={} # we will use this to see how quickly the results converge
counts=defaultdict(lambda:defaultdict(int))
t1 = time.time()
i = 0
with gzip.open(filen,'rb') as inf:
    for inl in inf:
        try:
            nm,smi = inl.strip().split()
        except:
            break
        m = Chem.MolFromSmiles(smi)
        if m is None or m.GetNumHeavyAtoms()>50:
            continue
        i+=1
        for nm,fpg in gens:
            fpgen = fpg(1024) # doesn't matter here
            onbits=len(fpgen.GetSparseFingerprint(m).GetOnBits())
            counts[nm][onbits]+=1
            for l in 64,128,256,512,1024,2048,4096,8192,16384:
                fpnm = f'{nm}-{l}'
                fpgen = fpg(l)
                dbits = onbits-fpgen.GetFingerprint(m).GetNumOnBits()
                counts[fpnm][dbits]+=1
        if not i%50000:
            t2 = time.time()
            print(f"Done {i} in {t2-t1:.2f} sec")
        if not i%100000:
            history[i] = copy.deepcopy(counts)
        if i>=4000000:
            break
Done 50000 in 214.60 sec
Done 100000 in 475.92 sec
Done 150000 in 790.69 sec
Done 200000 in 1065.40 sec
Done 250000 in 1350.49 sec
Done 300000 in 1663.03 sec
Done 350000 in 1979.41 sec
Done 400000 in 2289.94 sec
Done 450000 in 2584.58 sec
Done 500000 in 2889.27 sec
Done 550000 in 3156.72 sec
Done 600000 in 3416.83 sec
Done 650000 in 3673.20 sec
Done 700000 in 3930.41 sec
Done 750000 in 4232.50 sec
Done 800000 in 4518.37 sec
Done 850000 in 4788.19 sec
Done 900000 in 5080.44 sec
Done 950000 in 5272.51 sec
Done 1000000 in 5507.97 sec
Done 1050000 in 5831.47 sec
Done 1100000 in 6124.22 sec
Done 1150000 in 6405.13 sec
Done 1200000 in 6699.62 sec
Done 1250000 in 7004.17 sec
Done 1300000 in 7264.67 sec
Done 1350000 in 7525.13 sec
Done 1400000 in 7779.24 sec
Done 1450000 in 8045.30 sec
Done 1500000 in 8314.90 sec
Done 1550000 in 8643.98 sec
Done 1600000 in 8978.91 sec
Done 1650000 in 9292.37 sec
Done 1700000 in 9596.19 sec
Done 1750000 in 9854.10 sec
Done 1800000 in 10134.48 sec
Done 1850000 in 10407.01 sec
Done 1900000 in 10729.17 sec
Done 1950000 in 11005.57 sec
Done 2000000 in 11183.76 sec
Done 2050000 in 11527.20 sec
Done 2100000 in 11831.40 sec
Done 2150000 in 12141.98 sec
Done 2200000 in 12413.20 sec
Done 2250000 in 12708.59 sec
Done 2300000 in 12979.10 sec
Done 2350000 in 13263.71 sec
Done 2400000 in 13504.37 sec
Done 2450000 in 13755.21 sec
Done 2500000 in 13974.78 sec
Done 2550000 in 14282.01 sec
Done 2600000 in 14586.37 sec
Done 2650000 in 14844.31 sec
Done 2700000 in 15153.73 sec
Done 2750000 in 15473.13 sec
Done 2800000 in 15783.14 sec
Done 2850000 in 16093.12 sec
counts.keys()
dict_keys(['mfp1', 'mfp1-64', 'mfp1-128', 'mfp1-256', 'mfp1-512', 'mfp1-1024', 'mfp1-2048', 'mfp1-4096', 'mfp1-8192', 'mfp1-16384', 'mfp2', 'mfp2-64', 'mfp2-128', 'mfp2-256', 'mfp2-512', 'mfp2-1024', 'mfp2-2048', 'mfp2-4096', 'mfp2-8192', 'mfp2-16384', 'mfp3', 'mfp3-64', 'mfp3-128', 'mfp3-256', 'mfp3-512', 'mfp3-1024', 'mfp3-2048', 'mfp3-4096', 'mfp3-8192', 'mfp3-16384', 'rdk5', 'rdk5-64', 'rdk5-128', 'rdk5-256', 'rdk5-512', 'rdk5-1024', 'rdk5-2048', 'rdk5-4096', 'rdk5-8192', 'rdk5-16384', 'hashap', 'hashap-64', 'hashap-128', 'hashap-256', 'hashap-512', 'hashap-1024', 'hashap-2048', 'hashap-4096', 'hashap-8192', 'hashap-16384', 'hashtt', 'hashtt-64', 'hashtt-128', 'hashtt-256', 'hashtt-512', 'hashtt-1024', 'hashtt-2048', 'hashtt-4096', 'hashtt-8192', 'hashtt-16384'])
pickle.dump(dict(counts),gzip.open('../data/fp_collision_counts.pkl.gz','wb+'))
for k,d in history.items():
    history[k] = dict(d)
pickle.dump(dict(history),gzip.open('../data/fp_collision_counts.history.pkl.gz','wb+'))
with gzip.open('../data/fp_collision_counts.pkl.gz','rb') as inf:
    counts = pickle.load(inf)
with gzip.open('../data/fp_collision_counts.history.pkl.gz','rb') as inf:
    history = pickle.load(inf)

Quantifying the number of collisions

Now plot histograms of the numbers of collisions along with the distributions of the number of bits set in the non-folded FPs

allCounts = (16384, 8192,4096,2048,1024,) #512,256,128,64)

figure(figsize=(16,5*len(allCounts)))

pidx=1
#----------------------------
for nbits in allCounts:
    subplot(len(allCounts),2,pidx)
    pidx+=1
    maxCollisions = max(counts[f'mfp3-{allCounts[-1]}'].keys())+1
    d1=np.zeros(maxCollisions,int)
    for k,v in counts[f'mfp1-{nbits}'].items():
        d1[k]=v
    d2=np.zeros(maxCollisions,int)
    for k,v in counts[f'mfp2-{nbits}'].items():
        d2[k]=v
    d3=np.zeros(maxCollisions,int)
    for k,v in counts[f'mfp3-{nbits}'].items():
        d3[k]=v
    barWidth=.25
    locs = np.array(range(maxCollisions))

    bar(locs,d1,barWidth,label="r=1")
    bar(locs+barWidth,d2,barWidth,label="r=2")
    bar(locs+2*barWidth,d3,barWidth,label="r=3")
    
    #_=hist((d1,d2,d3),bins=20,log=True,label=("r=1","r=2","r=3"))
    title('%d bits'%nbits)
    _=yscale("log")
    _=legend()
    if nbits == allCounts[-1]:
        xlabel('num collisions')
    
    subplot(len(allCounts),2,pidx)
    pidx+=1
    itms = list(sorted(counts['mfp1'].items()))
    plot([x for x,y in itms],[y for x,y in itms],'.-',label=
        "r=1")
    itms = list(sorted(counts['mfp2'].items()))
    plot([x for x,y in itms],[y for x,y in itms],'.-',label=
        "r=2")
    itms = list(sorted(counts['mfp3'].items()))
    plot([x for x,y in itms],[y for x,y in itms],'.-',label=
        "r=3")
    if nbits == allCounts[-1]:
        _=xlabel("num bits set")
    _=ylabel("count")
    _=legend()
    
    

So, there are definitely some collisions, particularly with radius 3 fingerprints with 1K bits.

The bimodal character of the mfp2 and mfp3 fingerprints is worth following up on: that wasn’t observed in the earlier post, which used a set of compounds from ZINC

Look the short Morgan fingerprints

allCounts = (512,256,128,64)

figure(figsize=(16,5*len(allCounts)))

pidx=1
#----------------------------
for nbits in allCounts:
    subplot(len(allCounts),2,pidx)
    pidx+=1
    maxCollisions = max(counts[f'mfp3-{allCounts[-1]}'].keys())+1
    d1=np.zeros(maxCollisions,int)
    for k,v in counts[f'mfp1-{nbits}'].items():
        d1[k]=v
    d2=np.zeros(maxCollisions,int)
    for k,v in counts[f'mfp2-{nbits}'].items():
        d2[k]=v
    d3=np.zeros(maxCollisions,int)
    for k,v in counts[f'mfp3-{nbits}'].items():
        d3[k]=v
    barWidth=.25
    locs = np.array(range(maxCollisions))

    bar(locs,d1,barWidth,label="r=1")
    bar(locs+barWidth,d2,barWidth,label="r=2")
    bar(locs+2*barWidth,d3,barWidth,label="r=3")
    
    #_=hist((d1,d2,d3),bins=20,log=True,label=("r=1","r=2","r=3"))
    title('%d bits'%nbits)
    _=yscale("log")
    _=legend()
    if nbits == allCounts[-1]:
        xlabel('num collisions')

    subplot(len(allCounts),2,pidx)
    pidx+=1
    itms = list(sorted(counts['mfp1'].items()))
    plot([x for x,y in itms],[y for x,y in itms],'.-',label=
        "r=1")
    itms = list(sorted(counts['mfp2'].items()))
    plot([x for x,y in itms],[y for x,y in itms],'.-',label=
        "r=2")
    itms = list(sorted(counts['mfp3'].items()))
    plot([x for x,y in itms],[y for x,y in itms],'.-',label=
        "r=3")
    if nbits == allCounts[-1]:
        _=xlabel("num bits set")
    _=ylabel("count")
    _=legend()
    
    

Look at the other fingerprint types. We’ll do this a little differently because of the scales.

Start with the number of bits set:

figsize(8,5)

itms = list(sorted(counts['hashtt'].items()))
plot([x for x,y in itms],[y for x,y in itms],'.-',label=
    "hashtt")
itms = list(sorted(counts['hashap'].items()))
plot([x for x,y in itms],[y for x,y in itms],'.-',label=
    "hashap")
itms = list(sorted(counts['rdk5'].items()))
plot([x for x,y in itms],[y for x,y in itms],'.-',label=
    "rdk5")
if nbits == allCounts[-1]:
    _=xlabel("num bits set")
_=ylabel("count")
_=legend()
    
    

Both the hashap and hadhtt distributions show more than one peak. Another thing to follow up on…

allCounts = (16384, 8192,4096,2048,1024,) #512,256,128,64)

figure(figsize=(16,5*len(allCounts)))

pidx=1
#----------------------------
for nbits in allCounts:
    for fpt in ('hashtt','hashap','rdk5'):
        subplot(len(allCounts),3,pidx)
        pidx+=1
        maxCollisions = max(counts[f'{fpt}-{allCounts[-1]}'].keys())+1
        d1=np.zeros(maxCollisions,int)
        for k,v in counts[f'{fpt}-{nbits}'].items():
            d1[k]=v
        barWidth=.25
        locs = np.array(range(maxCollisions))

        bar(locs,d1,barWidth,label=fpt)    
        title(f'{fpt}-{nbits}')
        _=yscale("log")
        _=legend()
        if nbits == allCounts[-1]:
            xlabel('num collisions')

The topological torsion FPs don’t have a huge number of collisions, but both atom pair and RDKit fingerprints really do have collisions, as one would expect given the number of set bits.

Look at the percentage of compounds with <= a particular number of collisions

bcounts = (64,128,256,512,1024,2048,4096,8192,16384)
ccounts = list(range(11))


row = ('length','fingerprint',)+tuple(str(x) for x in ccounts)
divider = ['-'*len(k) for k in row]
print('| '+' | '.join(row)+' |')
print('| '+' | '.join(divider)+' |')

for bc in reversed(bcounts):
    for nm,fpg in gens:
        row = [str(bc),nm]
        fpn = f'{nm}-{bc}'
        nmols = np.sum(np.fromiter((x for x in counts[fpn].values()),int))
        accum = 0
        for ccount in ccounts:
            accum += counts[fpn].get(ccount,0)
            row.append(f'{accum/nmols:.3f}')
        print('| '+' | '.join(row)+' |')
            
| length | fingerprint | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| ------ | ----------- | - | - | - | - | - | - | - | - | - | - | -- |
| 16384 | mfp1 | 0.991 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 16384 | mfp2 | 0.956 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 16384 | mfp3 | 0.900 | 0.993 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 16384 | rdk5 | 0.181 | 0.332 | 0.460 | 0.566 | 0.654 | 0.724 | 0.780 | 0.826 | 0.863 | 0.892 | 0.916 |
| 16384 | hashap | 0.335 | 0.519 | 0.651 | 0.746 | 0.816 | 0.867 | 0.905 | 0.931 | 0.950 | 0.964 | 0.974 |
| 16384 | hashtt | 0.898 | 0.988 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 8192 | mfp1 | 0.969 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 8192 | mfp2 | 0.906 | 0.994 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 8192 | mfp3 | 0.808 | 0.974 | 0.997 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 8192 | rdk5 | 0.076 | 0.177 | 0.266 | 0.346 | 0.420 | 0.486 | 0.545 | 0.597 | 0.645 | 0.687 | 0.725 |
| 8192 | hashap | 0.295 | 0.467 | 0.594 | 0.692 | 0.766 | 0.822 | 0.865 | 0.897 | 0.921 | 0.939 | 0.952 |
| 8192 | hashtt | 0.825 | 0.971 | 0.996 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 4096 | mfp1 | 0.950 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 4096 | mfp2 | 0.829 | 0.980 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 4096 | mfp3 | 0.673 | 0.922 | 0.985 | 0.997 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 4096 | rdk5 | 0.034 | 0.077 | 0.122 | 0.166 | 0.210 | 0.254 | 0.296 | 0.336 | 0.375 | 0.412 | 0.447 |
| 4096 | hashap | 0.156 | 0.270 | 0.370 | 0.455 | 0.529 | 0.592 | 0.646 | 0.693 | 0.733 | 0.768 | 0.798 |
| 4096 | hashtt | 0.765 | 0.951 | 0.992 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 2048 | mfp1 | 0.740 | 0.975 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 2048 | mfp2 | 0.564 | 0.874 | 0.971 | 0.994 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 2048 | mfp3 | 0.385 | 0.720 | 0.892 | 0.963 | 0.988 | 0.996 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 |
| 2048 | rdk5 | 0.017 | 0.037 | 0.058 | 0.080 | 0.101 | 0.124 | 0.146 | 0.169 | 0.191 | 0.214 | 0.236 |
| 2048 | hashap | 0.076 | 0.132 | 0.185 | 0.235 | 0.281 | 0.323 | 0.362 | 0.397 | 0.429 | 0.459 | 0.487 |
| 2048 | hashtt | 0.662 | 0.900 | 0.974 | 0.994 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 1024 | mfp1 | 0.674 | 0.942 | 0.992 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 1024 | mfp2 | 0.393 | 0.728 | 0.898 | 0.965 | 0.989 | 0.997 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 |
| 1024 | mfp3 | 0.207 | 0.476 | 0.694 | 0.835 | 0.917 | 0.960 | 0.981 | 0.991 | 0.996 | 0.998 | 0.999 |
| 1024 | rdk5 | 0.007 | 0.017 | 0.026 | 0.037 | 0.047 | 0.057 | 0.067 | 0.077 | 0.088 | 0.098 | 0.109 |
| 1024 | hashap | 0.037 | 0.066 | 0.095 | 0.122 | 0.148 | 0.174 | 0.199 | 0.223 | 0.246 | 0.269 | 0.291 |
| 1024 | hashtt | 0.448 | 0.724 | 0.872 | 0.944 | 0.977 | 0.991 | 0.996 | 0.999 | 0.999 | 1.000 | 1.000 |
| 512 | mfp1 | 0.501 | 0.835 | 0.956 | 0.989 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 512 | mfp2 | 0.184 | 0.445 | 0.671 | 0.824 | 0.911 | 0.957 | 0.980 | 0.991 | 0.996 | 0.999 | 0.999 |
| 512 | mfp3 | 0.074 | 0.209 | 0.371 | 0.527 | 0.662 | 0.767 | 0.844 | 0.899 | 0.936 | 0.960 | 0.976 |
| 512 | rdk5 | 0.004 | 0.009 | 0.013 | 0.018 | 0.024 | 0.029 | 0.035 | 0.040 | 0.046 | 0.051 | 0.056 |
| 512 | hashap | 0.013 | 0.025 | 0.037 | 0.050 | 0.063 | 0.077 | 0.091 | 0.105 | 0.119 | 0.133 | 0.147 |
| 512 | hashtt | 0.286 | 0.537 | 0.721 | 0.839 | 0.911 | 0.951 | 0.973 | 0.985 | 0.992 | 0.996 | 0.998 |
| 256 | mfp1 | 0.312 | 0.659 | 0.864 | 0.952 | 0.984 | 0.995 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 |
| 256 | mfp2 | 0.060 | 0.186 | 0.348 | 0.511 | 0.653 | 0.764 | 0.846 | 0.902 | 0.939 | 0.963 | 0.978 |
| 256 | mfp3 | 0.022 | 0.066 | 0.134 | 0.220 | 0.314 | 0.409 | 0.500 | 0.584 | 0.659 | 0.726 | 0.783 |
| 256 | rdk5 | 0.003 | 0.005 | 0.007 | 0.009 | 0.011 | 0.014 | 0.016 | 0.019 | 0.021 | 0.024 | 0.027 |
| 256 | hashap | 0.004 | 0.007 | 0.010 | 0.013 | 0.017 | 0.021 | 0.025 | 0.030 | 0.034 | 0.039 | 0.045 |
| 256 | hashtt | 0.105 | 0.234 | 0.373 | 0.501 | 0.612 | 0.706 | 0.782 | 0.842 | 0.887 | 0.920 | 0.944 |
| 128 | mfp1 | 0.117 | 0.324 | 0.546 | 0.726 | 0.846 | 0.918 | 0.958 | 0.980 | 0.991 | 0.996 | 0.998 |
| 128 | mfp2 | 0.015 | 0.049 | 0.106 | 0.184 | 0.275 | 0.373 | 0.469 | 0.560 | 0.642 | 0.714 | 0.777 |
| 128 | mfp3 | 0.007 | 0.018 | 0.036 | 0.062 | 0.095 | 0.136 | 0.183 | 0.235 | 0.290 | 0.345 | 0.399 |
| 128 | rdk5 | 0.002 | 0.002 | 0.003 | 0.004 | 0.005 | 0.006 | 0.007 | 0.008 | 0.009 | 0.010 | 0.011 |
| 128 | hashap | 0.002 | 0.003 | 0.005 | 0.006 | 0.008 | 0.010 | 0.012 | 0.014 | 0.016 | 0.018 | 0.021 |
| 128 | hashtt | 0.035 | 0.079 | 0.139 | 0.208 | 0.283 | 0.357 | 0.429 | 0.497 | 0.561 | 0.620 | 0.675 |
| 64 | mfp1 | 0.024 | 0.083 | 0.179 | 0.302 | 0.435 | 0.564 | 0.677 | 0.771 | 0.843 | 0.896 | 0.932 |
| 64 | mfp2 | 0.004 | 0.011 | 0.025 | 0.044 | 0.072 | 0.107 | 0.149 | 0.199 | 0.254 | 0.312 | 0.371 |
| 64 | mfp3 | 0.003 | 0.006 | 0.011 | 0.017 | 0.026 | 0.037 | 0.050 | 0.066 | 0.084 | 0.105 | 0.129 |
| 64 | rdk5 | 0.001 | 0.002 | 0.002 | 0.002 | 0.003 | 0.003 | 0.004 | 0.004 | 0.005 | 0.005 | 0.006 |
| 64 | hashap | 0.001 | 0.002 | 0.003 | 0.004 | 0.004 | 0.005 | 0.006 | 0.007 | 0.008 | 0.009 | 0.010 |
| 64 | hashtt | 0.016 | 0.034 | 0.059 | 0.092 | 0.132 | 0.177 | 0.227 | 0.279 | 0.331 | 0.382 | 0.431 |

Notice that the high-density fingerprints - rdk5 and hashap - have collisions for the majority of the compounds even with a 16K fingerprint.

Impact on similarity

Now that we have a sense of how many collisions there are, lets look at the impact those collisions have on molecular similarity.

As a data set we’ll use a set of 50000 pairs of ChEMBL30 molecules with a similarity of at least 0.55 using a count-based morgan fingerprint with radius 1 (MFP1). The jupyter notebook used to generate this data can be found here.

pairs = []
for l in gzip.open('../data/chembl30_50K.mfp1.pairs.txt.gz','rt'):
    l = l.split('\t')
    pairs.append([Chem.MolFromSmiles(l[1]),Chem.MolFromSmiles(l[3])])
sims = defaultdict(list)
for nm,fpgen in gens:
    print(f'Doing {nm}')
    fpg = fpgen(1024) # size is irrelevant here
    for m1,m2 in pairs:
        fp1 = fpg.GetSparseFingerprint(m1)
        fp2 = fpg.GetSparseFingerprint(m2)
        sims[nm].append(DataStructs.TanimotoSimilarity(fp1,fp2))
    for l in 64,128,256,512,1024,2048,4096,8192,16384:
        fpn = f'{nm}-{l}'
        fpg = fpgen(l)
        for m1,m2 in pairs:
            fp1 = fpg.GetFingerprint(m1)
            fp2 = fpg.GetFingerprint(m2)
            sims[fpn].append(DataStructs.TanimotoSimilarity(fp1,fp2))
Doing mfp1
Doing mfp2
Doing mfp3
Doing rdk5
Doing hashap
Doing hashtt
pickle.dump(dict(sims),gzip.open('../data/fp_collision_sims.pkl.gz','wb+'))
sims = pickle.load(gzip.open('../data/fp_collision_sims.pkl.gz','rb'))
rads = (1,2,3)
bcounts = (64,128,256,512,1024,2048,4096,8192,16384)
figsize(3*6,len(bcounts)*6)

plt = 1
for i,cnt in enumerate(reversed(bcounts)):
    for j,rad in enumerate(rads):
        subplot(len(bcounts),3,plt)
        plt += 1
        plot((0,1),(0,1),'k-')
        hexbin(sims[f'mfp{rad}'],sims[f'mfp{rad}-{cnt}'],bins='log')
        xlabel('no collisions')
        ylabel(f'{cnt} bits')
        title(f'radius = {rad}')

bcounts = (64,128,256,512,1024,2048,4096,8192,16384)
figsize(3*6,len(bcounts)*6)

plt = 1
for i,cnt in enumerate(reversed(bcounts)):
    for j,nm in enumerate(('hashtt','hashap','rdk5')):
        subplot(len(bcounts),3,plt)
        plt += 1
        plot((0,1),(0,1),'k-')
        hexbin(sims[f'{nm}'],sims[f'{nm}-{cnt}'],bins='log')       
        xlabel('no collisions')
        ylabel(f'{cnt} bits')
        title(f'{nm}')

Do a statistical comparison of the similarity distributions

from scipy.stats import spearmanr

row = ('nBits','fingerprint','SpearmanR','mean(d)','std(d)','90% \\|d\\|')
divider = ['-'*len(k) for k in row]
print('| '+' | '.join(row)+' |')
print('| '+' | '.join(divider)+' |')
        

rads = (1,2,3)
bcounts = (64,128,256,512,1024,2048,4096,8192,16384)
for i,cnt in enumerate(reversed(bcounts)):
    for j,(nm,fpg) in enumerate(gens):
        d1 = numpy.array(sims[nm])
        d2 = numpy.array(sims[f'{nm}-{cnt}'])
        delt = d2-d1
        row = [str(cnt),nm,f'{spearmanr(d1,d2).correlation:.3f}',f'{np.mean(delt):.2g}',
               f'{np.std(delt):.2g}',f'{np.quantile(np.abs(delt),0.9):.2g}']
        print('| '+' | '.join(row)+' |')
| nBits | fingerprint | SpearmanR | mean(d) | std(d) | 90% \|d\| |
| ----- | ----------- | --------- | ------- | ------ | --------- |
| 16384 | mfp1 | 1.000 | 0.00023 | 0.0025 | 0 |
| 16384 | mfp2 | 0.999 | 0.00085 | 0.0036 | 0.0034 |
| 16384 | mfp3 | 0.999 | 0.0014 | 0.0038 | 0.0056 |
| 16384 | rdk5 | 1.000 | 0.0054 | 0.0046 | 0.012 |
| 16384 | hashap | 0.997 | 0.0066 | 0.0072 | 0.016 |
| 16384 | hashtt | 0.999 | 0.0024 | 0.0074 | 0.011 |
| 8192 | mfp1 | 0.999 | 0.00089 | 0.005 | 0 |
| 8192 | mfp2 | 0.999 | 0.0017 | 0.0051 | 0.007 |
| 8192 | mfp3 | 0.999 | 0.0027 | 0.0054 | 0.011 |
| 8192 | rdk5 | 0.999 | 0.011 | 0.008 | 0.022 |
| 8192 | hashap | 0.996 | 0.0089 | 0.0084 | 0.02 |
| 8192 | hashtt | 0.997 | 0.0046 | 0.01 | 0.019 |
| 4096 | mfp1 | 0.998 | 0.0016 | 0.0069 | 0.0079 |
| 4096 | mfp2 | 0.998 | 0.0033 | 0.0073 | 0.014 |
| 4096 | mfp3 | 0.997 | 0.0054 | 0.0077 | 0.016 |
| 4096 | rdk5 | 0.998 | 0.022 | 0.014 | 0.041 |
| 4096 | hashap | 0.988 | 0.022 | 0.015 | 0.042 |
| 4096 | hashtt | 0.996 | 0.0068 | 0.013 | 0.024 |
| 2048 | mfp1 | 0.993 | 0.0042 | 0.013 | 0.02 |
| 2048 | mfp2 | 0.995 | 0.0069 | 0.012 | 0.023 |
| 2048 | mfp3 | 0.994 | 0.011 | 0.012 | 0.027 |
| 2048 | rdk5 | 0.995 | 0.044 | 0.026 | 0.08 |
| 2048 | hashap | 0.956 | 0.05 | 0.029 | 0.088 |
| 2048 | hashtt | 0.993 | 0.011 | 0.016 | 0.034 |
| 1024 | mfp1 | 0.989 | 0.0067 | 0.016 | 0.028 |
| 1024 | mfp2 | 0.990 | 0.014 | 0.016 | 0.036 |
| 1024 | mfp3 | 0.987 | 0.022 | 0.017 | 0.044 |
| 1024 | rdk5 | 0.981 | 0.085 | 0.049 | 0.15 |
| 1024 | hashap | 0.874 | 0.1 | 0.051 | 0.17 |
| 1024 | hashtt | 0.983 | 0.025 | 0.026 | 0.061 |
| 512 | mfp1 | 0.980 | 0.013 | 0.021 | 0.044 |
| 512 | mfp2 | 0.979 | 0.028 | 0.023 | 0.059 |
| 512 | mfp3 | 0.972 | 0.044 | 0.026 | 0.078 |
| 512 | rdk5 | 0.929 | 0.16 | 0.09 | 0.28 |
| 512 | hashap | 0.702 | 0.19 | 0.083 | 0.3 |
| 512 | hashtt | 0.967 | 0.046 | 0.037 | 0.096 |
| 256 | mfp1 | 0.962 | 0.026 | 0.029 | 0.065 |
| 256 | mfp2 | 0.954 | 0.057 | 0.035 | 0.1 |
| 256 | mfp3 | 0.932 | 0.087 | 0.041 | 0.14 |
| 256 | rdk5 | 0.785 | 0.28 | 0.15 | 0.48 |
| 256 | hashap | 0.539 | 0.3 | 0.11 | 0.43 |
| 256 | hashtt | 0.921 | 0.092 | 0.057 | 0.17 |
| 128 | mfp1 | 0.918 | 0.049 | 0.043 | 0.11 |
| 128 | mfp2 | 0.897 | 0.11 | 0.054 | 0.18 |
| 128 | mfp3 | 0.839 | 0.17 | 0.066 | 0.25 |
| 128 | rdk5 | 0.528 | 0.42 | 0.2 | 0.68 |
| 128 | hashap | 0.343 | 0.42 | 0.12 | 0.57 |
| 128 | hashtt | 0.828 | 0.17 | 0.085 | 0.28 |
| 64 | mfp1 | 0.832 | 0.099 | 0.064 | 0.18 |
| 64 | mfp2 | 0.771 | 0.2 | 0.083 | 0.31 |
| 64 | mfp3 | 0.642 | 0.31 | 0.1 | 0.44 |
| 64 | rdk5 | 0.235 | 0.5 | 0.22 | 0.77 |
| 64 | hashap | 0.158 | 0.51 | 0.13 | 0.65 |
| 64 | hashtt | 0.672 | 0.28 | 0.12 | 0.43 |