Similarity-search hitlist overlap

reference
similarity
fingerprints
Why it’s interesting to use different fingerprints for similarity searching
Published

March 21, 2025

One of the things I say when teaching about fingerprints and similarity searches is that there is no best fingerprint for every use case and that it’s often useful to do similarity searches using different types of fingerprints and combine the results. If you use fingerprints that use quite different features to determine similarity (e.g. Morgan3 and atom pairs), this is a good way to take advantage of the differences between the fingerprints.
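One simple way to combine the results from several fingerprints is rank-based fusion. The following is only a minimal sketch of that idea and is not taken from the analysis in this post: the function name `fuse_by_rank`, the toy similarity values, and the choice of summing ranks are all assumptions made for the example.

```python
def fuse_by_rank(score_lists):
    """Combine several similarity score lists by summing per-compound ranks.

    score_lists[k][i] is the similarity of compound i to the query
    when using fingerprint k. Returns compound indices ordered from
    best (lowest rank sum) to worst.
    """
    n = len(score_lists[0])
    rank_sum = [0] * n
    for scores in score_lists:
        # rank 0 is the most similar compound for this fingerprint
        order = sorted(range(n), key=lambda i: scores[i], reverse=True)
        for rank, idx in enumerate(order):
            rank_sum[idx] += rank
    return sorted(range(n), key=lambda i: rank_sum[i])


# toy similarities of five compounds to one query, for two fingerprints
sims_fp_a = [0.9, 0.2, 0.6, 0.4, 0.1]
sims_fp_b = [0.3, 0.1, 0.8, 0.7, 0.2]
fused = fuse_by_rank([sims_fp_a, sims_fp_b])
print(fused[:3])
```

Compounds that rank well with both fingerprints float to the top, while a compound that only one fingerprint likes can still make the list.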

The goal of this post is to explore (and try to quantitate) the differences between the hit sets returned by doing similarity searches with different fingerprint types.

For this post I will use the 59788 compounds tested in PubChem Bioassay ID 373. This is one of the PubChem screening assays, so the compounds are taken from a screening deck and should be expected to be reasonably diverse with some small SAR clusters. You can directly access and download the compound structures here.

One thing worth keeping in mind is that these results almost certainly include similarity values that are close to, and probably within, the range of similarities observed between random compounds. It may be worth refining the analysis to consider only similarities that are more significant, but that’s for a possible future post.
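As a rough sketch of what such a refinement could look like, the snippet below keeps only those top-N hits whose similarity clears a cutoff before computing the overlap. The function name `significant_top_n`, the toy values, and the cutoff of 0.4 are all hypothetical; a sensible cutoff would differ from fingerprint to fingerprint.

```python
def significant_top_n(sims, top_n, threshold):
    """Indices of the top_n most similar compounds whose similarity
    is at least `threshold`."""
    ranked = sorted(enumerate(sims), key=lambda p: p[1], reverse=True)
    return {idx for idx, sim in ranked[:top_n] if sim >= threshold}


# toy similarities of six compounds to a query for two fingerprints
sims_a = [0.82, 0.15, 0.55, 0.31, 0.12, 0.47]
sims_b = [0.30, 0.10, 0.61, 0.09, 0.35, 0.52]

# 0.4 is an arbitrary, illustrative cutoff
hits_a = significant_top_n(sims_a, top_n=4, threshold=0.4)
hits_b = significant_top_n(sims_b, top_n=4, threshold=0.4)
overlap = hits_a & hits_b
print(sorted(overlap))
```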

from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit import DataStructs
import numpy as np

from matplotlib import pyplot as plt
plt.style.use('tableau-colorblind10')

%matplotlib inline
%load_ext sql

import rdkit
print(rdkit.__version__)
2024.09.6
import gzip
with gzip.open('../data/Pubchem_AID373_compounds.sdf.gz','rb') as inf:
    with Chem.ForwardSDMolSupplier(inf) as suppl:
        dbmols = [x for x in suppl if x is not None]
len(dbmols)
    
59788
# query molecule: https://pubchem.ncbi.nlm.nih.gov/compound/666359
qry = Chem.MolFromSmiles('CCOC1=C(C=C(C=C1)C2=NOC(=N2)C3=CC=NC=C3)OCC')
qry

runs = [
    ('mfp2',rdFingerprintGenerator.GetMorganGenerator(radius=2)),
    # the ffp* entries use feature invariants (FeatureMorgan)
    ('ffp2',rdFingerprintGenerator.GetMorganGenerator(radius=2,
                                                  atomInvariantsGenerator=rdFingerprintGenerator.GetMorganFeatureAtomInvGen())),
    ('mfp3',rdFingerprintGenerator.GetMorganGenerator(radius=3)),
    ('ffp3',rdFingerprintGenerator.GetMorganGenerator(radius=3,
                                                  atomInvariantsGenerator=rdFingerprintGenerator.GetMorganFeatureAtomInvGen())),
    ('mfp1',rdFingerprintGenerator.GetMorganGenerator(radius=1)),
    ('ffp1',rdFingerprintGenerator.GetMorganGenerator(radius=1,
                                                  atomInvariantsGenerator=rdFingerprintGenerator.GetMorganFeatureAtomInvGen())),
    ('tt',rdFingerprintGenerator.GetTopologicalTorsionGenerator()),
    ('ap',rdFingerprintGenerator.GetAtomPairGenerator()),
    ('rdk5',rdFingerprintGenerator.GetRDKitFPGenerator(maxPath=5)),
    ('rdk7',rdFingerprintGenerator.GetRDKitFPGenerator(maxPath=7)),
]

accum = {}
for nm,fpg in runs:
    print(nm)
    fps = fpg.GetFingerprints(dbmols,numThreads=8)
    sims = DataStructs.BulkTanimotoSimilarity(fpg.GetFingerprint(qry),fps)
    accum[f'{nm}-b'] = sims

    fps = fpg.GetCountFingerprints(dbmols,numThreads=8)
    sims = DataStructs.BulkTanimotoSimilarity(fpg.GetCountFingerprint(qry),list(fps))
    accum[f'{nm}-c'] = sims
mfp2
ffp2
mfp3
ffp3
mfp1
ffp1
tt
ap
rdk5
rdk7
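Each fingerprint above is generated in both a bit-vector (`-b`) and a count-vector (`-c`) form. As a reminder of why the two can rank compounds differently, here is a small pure-Python sketch of Tanimoto similarity for both representations. It assumes the usual min/max generalization of Tanimoto for counts; the feature names and counts are made up.

```python
def tanimoto_bits(a, b):
    """Tanimoto similarity between two collections of on-bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def tanimoto_counts(a, b):
    """Tanimoto similarity between two {feature: count} dicts,
    computed as sum(min counts) / sum(max counts)."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return num / den if den else 0.0


# hypothetical feature counts for two molecules
mol1 = {'feat_a': 3, 'feat_b': 1, 'feat_c': 1}
mol2 = {'feat_a': 1, 'feat_b': 1, 'feat_d': 2}

bit_sim = tanimoto_bits(mol1, mol2)      # only presence/absence matters
count_sim = tanimoto_counts(mol1, mol2)  # differing counts lower the score
print(bit_sim, round(count_sim, 3))
```

The two molecules share half of their distinct features, but the count-based value is lower because the shared feature `feat_a` occurs a different number of times in each.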
from rdkit.Chem.Pharm2D import Gobbi_Pharm2D,Generate
def Gobbi2D_bits(mol,fpLen=2048):
    res = DataStructs.ExplicitBitVect(fpLen)
    for bit in Generate.Gen2DFingerprint(mol,Gobbi_Pharm2D.factory).GetOnBits():
        # the bits are not hashed, so we need to do so before we fold them:
        res.SetBit(hash((bit,))%fpLen)
    return res
from rdkit.Chem import rdMolDescriptors
from rdkit.Avalon import pyAvalonTools

func_runs = [
    ('gobbi2d',Gobbi2D_bits),
    ('avalon',pyAvalonTools.GetAvalonFP),
    ('avalon-c',pyAvalonTools.GetAvalonCountFP),
    ('pattern',Chem.PatternFingerprint),
]
for nm,func in func_runs:
    print(nm)
    fps = [func(m) for m in dbmols]
    sims = DataStructs.BulkTanimotoSimilarity(func(qry),fps)
    accum[nm] = sims
gobbi2d
avalon
avalon-c
pattern
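The `Gobbi2D_bits` helper above hashes each bit ID before folding it into a fixed-length vector. One plausible reason for doing so, sketched below with made-up bit IDs, is that unhashed IDs with a regular spacing would pile up on the same positions under a plain modulo. The helper name `fold_on_bits` is hypothetical.

```python
def fold_on_bits(on_bits, fp_len=2048):
    """Hash each bit id, then fold it into a fingerprint of fp_len bits."""
    return {hash((bit,)) % fp_len for bit in on_bits}


# made-up bit ids with a regular spacing of exactly fp_len
on_bits = [0, 2048, 4096, 6144]

plain = {bit % 2048 for bit in on_bits}   # plain modulo: every id lands on bit 0
hashed = fold_on_bits(on_bits)            # hashed ids are scrambled before folding
print(len(plain), len(hashed))
```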

Accumulate the overlaps between the hit sets:

from collections import defaultdict

nms = list(accum.keys())

ovls = defaultdict(dict)
topNs = [10,100,1000]
for i,nmi in enumerate(nms):
    topi = [x for s,x in sorted([(s,x) for x,s in enumerate(accum[nmi])],reverse=True)]
    for j in range(i):
        nmj = nms[j]
        topj = [x for s,x in sorted([(s,x) for x,s in enumerate(accum[nmj])],reverse=True)]
        for topN in topNs:
            ovls[nmi,nmj][topN] = set(topi[:topN]).intersection(topj[:topN])
ovls = dict(ovls)
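To make the overlap calculation concrete, here is the same idea on a toy example with six compounds and two sets of similarity values. The helper names and the numbers are invented for illustration; the ranking uses the same reverse sort of `(similarity, index)` tuples as the code above.

```python
def top_n_indices(sims, top_n):
    """Indices of the top_n highest similarities, ranked by a reverse
    sort of (similarity, index) tuples."""
    ranked = sorted(((s, i) for i, s in enumerate(sims)), reverse=True)
    return [i for s, i in ranked[:top_n]]


def hit_overlap(sims_a, sims_b, top_n):
    """Compounds that appear in the top_n hits of both similarity lists."""
    return set(top_n_indices(sims_a, top_n)) & set(top_n_indices(sims_b, top_n))


sims_a = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2]
sims_b = [0.2, 0.6, 0.9, 0.1, 0.8, 0.3]
shared = hit_overlap(sims_a, sims_b, 3)
print(sorted(shared))  # -> [2, 4]
```

Sorting each similarity list once and reusing the ranking, as `top_n_indices` would allow, also avoids re-sorting the same list for every pair of fingerprints.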

Here’s the full table of concordances between the fingerprints in the top 10, 100, and 1000 results:

from IPython.display import HTML

snms = sorted(nms)

html = "<table>"
ths = "".join(f"<th>{nm}</th>" for nm in snms)
html += f"<tr><td></td>{ths}</tr>\n"
ovls = dict(ovls)
for i,nm in enumerate(snms):
    row = f"<th>{nm}</th>"
    tds = []
    for j in range(len(snms)):
        if i==j:
            tds.append('<td></td>')
            continue
        nmi = snms[i]
        nmj = snms[j]
        if (nmi,nmj) not in ovls:
            nmi,nmj = nmj,nmi
        tds.append(f"<td>{len(ovls[nmi,nmj][10])}<br />{len(ovls[nmi,nmj][100])}<br />{len(ovls[nmi,nmj][1000])}</td>")
    tds = "".join(tds)
    row += tds
    html += f"<tr>{row}</tr>\n"

html += "</table>"

HTML(html)
(cells: top-10 / top-100 / top-1000 overlap)

          ap-b      ap-c      avalon    avalon-c  ffp1-b    ffp1-c    ffp2-b    ffp2-c    ffp3-b    ffp3-c    gobbi2d   mfp1-b    mfp1-c    mfp2-b    mfp2-c    mfp3-b    mfp3-c    pattern   rdk5-b    rdk5-c    rdk7-b    rdk7-c    tt-b      tt-c
ap-b      -         9/77/724  7/40/245  7/40/255  6/43/317  8/38/383  7/46/281  7/42/336  7/43/260  6/41/310  3/51/235  7/43/323  8/41/380  7/47/303  7/43/337  7/46/264  7/45/314  5/35/237  6/46/292  6/46/325  7/47/279  6/46/290  7/39/298  7/41/306
ap-c      9/77/724  -         6/46/271  6/49/291  6/42/350  8/45/445  7/48/312  7/44/378  7/45/289  7/44/349  3/49/238  6/46/349  7/48/424  6/47/312  6/48/359  6/47/280  6/48/335  5/41/295  6/52/326  6/51/364  6/50/310  7/48/317  7/41/337  7/47/359
avalon    7/40/245  6/46/271  -         9/70/588  7/57/512  8/40/399  8/57/428  9/43/357  8/51/354  8/43/320  5/37/290  8/65/552  9/48/451  9/60/529  9/50/443  8/54/429  9/51/382  6/56/518  7/83/625  8/74/502  8/73/502  8/68/487  8/54/399  8/60/446
avalon-c  7/40/255  6/49/291  9/70/588  -         7/50/430  8/50/402  8/46/379  9/41/371  8/44/339  8/40/344  5/37/289  8/59/459  9/57/466  9/52/458  9/50/469  8/47/399  9/46/430  6/62/532  7/63/494  8/69/478  8/61/424  8/68/458  8/50/361  8/57/393
ffp1-b    6/43/317  6/42/350  7/57/512  7/50/430  -         7/60/586  9/68/671  8/54/547  9/55/573  8/52/496  6/37/263  9/82/759  7/60/562  8/67/662  8/59/529  8/59/542  8/57/469  5/50/436  7/58/563  8/54/485  8/57/458  7/49/441  8/65/514  8/63/544
ffp1-c    8/38/383  8/45/445  8/40/399  8/50/402  7/60/586  -         8/60/560  9/70/722  8/55/507  8/65/633  5/39/292  7/59/539  9/80/674  8/59/527  8/68/598  7/60/461  8/64/532  6/48/385  8/41/449  8/44/473  7/40/405  8/43/442  9/60/543  9/62/579
ffp2-b    7/46/281  7/48/312  8/57/428  8/46/379  9/68/671  8/60/560  -         9/75/682  10/82/804 9/73/649  6/49/285  9/62/582  8/61/483  9/84/658  9/78/539  9/82/598  9/79/510  5/39/385  8/61/511  8/56/479  9/63/453  8/57/447  9/70/595  9/74/611
ffp2-c    7/42/336  7/44/378  9/43/357  9/41/371  8/54/547  9/70/722  9/75/682  -         9/78/669  9/90/848  6/41/314  8/52/483  9/65/564  9/71/559  9/81/655  8/76/553  9/80/625  6/32/347  8/47/430  9/45/441  8/47/387  9/44/413  9/64/578  9/67/603
ffp3-b    7/43/260  7/45/289  8/51/354  8/44/339  9/55/573  8/55/507  10/82/804 9/78/669  -         9/81/715  6/45/277  9/54/498  8/58/450  9/74/606  9/77/543  9/86/617  9/82/547  5/34/337  8/56/443  8/54/431  9/60/397  8/54/395  9/63/538  9/67/546
ffp3-c    6/41/310  7/44/349  8/43/320  8/40/344  8/52/496  8/65/633  9/73/649  9/90/848  9/81/715  -         6/41/307  8/50/440  8/63/519  9/69/535  9/79/648  8/79/565  9/82/645  6/32/314  8/47/394  8/45/413  8/47/363  9/44/383  9/60/548  9/63/570
gobbi2d   3/51/235  3/49/238  5/37/290  5/37/289  6/37/263  5/39/292  6/49/285  6/41/314  6/45/277  6/41/307  -         5/40/280  5/42/309  6/55/319  6/48/351  5/49/301  6/48/346  3/33/238  7/42/349  7/41/367  5/44/367  5/43/371  6/44/278  6/41/304
mfp1-b    7/43/323  6/46/349  8/65/552  8/59/459  9/82/759  7/59/539  9/62/582  8/52/483  9/54/498  8/50/440  5/40/280  -         8/66/645  9/68/729  9/60/582  9/59/580  9/57/505  5/57/443  7/63/581  7/57/494  9/60/455  7/52/444  8/61/461  8/60/507
mfp1-c    8/41/380  7/48/424  9/48/451  9/57/466  7/60/562  9/80/674  8/61/483  9/65/564  8/58/450  8/63/519  5/42/309  8/66/645  -         9/65/613  9/75/729  8/65/530  9/69/628  6/53/408  7/48/481  8/51/483  8/48/414  8/50/439  8/59/434  8/60/482
mfp2-b    7/47/303  6/47/312  9/60/529  9/52/458  8/67/662  8/59/527  9/84/658  9/71/559  9/74/606  9/69/535  6/55/319  9/68/729  9/65/613  -         10/81/694 9/83/780  10/80/643 6/44/449  8/61/605  8/57/534  9/64/504  8/57/490  9/74/480  9/75/521
mfp2-c    7/43/337  6/48/359  9/50/443  9/50/469  8/59/529  8/68/598  9/78/539  9/81/655  9/77/543  9/79/648  6/48/351  9/60/582  9/75/729  10/81/694 -         9/84/679  10/89/844 6/41/393  8/53/483  8/51/488  9/54/440  8/51/461  9/67/456  9/71/507
mfp3-b    7/46/264  6/47/280  8/54/429  8/47/399  8/59/542  7/60/461  9/82/598  8/76/553  9/86/617  8/79/565  5/49/301  9/59/580  8/65/530  9/83/780  9/84/679  -         9/89/735  5/39/382  7/58/516  7/57/470  9/63/459  7/58/447  8/67/444  8/70/477
mfp3-c    7/45/314  6/48/335  9/51/382  9/46/430  8/57/469  8/64/532  9/79/510  9/80/625  9/82/547  9/82/645  6/48/346  9/57/505  9/69/628  10/80/643 10/89/844 9/89/735  -         6/39/360  8/55/436  8/52/446  9/56/412  8/51/424  9/65/431  9/70/472
pattern   5/35/237  5/41/295  6/56/518  6/62/532  5/50/436  6/48/385  5/39/385  6/32/347  5/34/337  6/32/314  3/33/238  5/57/443  6/53/408  6/44/449  6/41/393  5/39/382  6/39/360  -         5/51/513  6/49/432  5/52/393  6/51/398  6/45/347  6/47/385
rdk5-b    6/46/292  6/52/326  7/83/625  7/63/494  7/58/563  8/41/449  8/61/511  8/47/430  8/56/443  8/47/394  7/42/349  7/63/581  7/48/481  8/61/605  8/53/483  7/58/516  8/55/436  5/51/513  -         8/83/736  7/84/705  7/73/663  9/61/504  9/67/548
rdk5-c    6/46/325  6/51/364  8/74/502  8/69/478  8/54/485  8/44/473  8/56/479  9/45/441  8/54/431  8/45/413  7/41/367  7/57/494  8/51/483  8/57/534  8/51/488  7/57/470  8/52/446  6/49/432  8/83/736  -         7/85/701  8/86/812  8/59/509  8/64/539
rdk7-b    7/47/279  6/50/310  8/73/502  8/61/424  8/57/458  7/40/405  9/63/453  8/47/387  9/60/397  8/47/363  5/44/367  9/60/455  8/48/414  9/64/504  9/54/440  9/63/459  9/56/412  5/52/393  7/84/705  7/85/701  -         7/85/776  8/63/468  8/67/490
rdk7-c    6/46/290  7/48/317  8/68/487  8/68/458  7/49/441  8/43/442  8/57/447  9/44/413  8/54/395  9/44/383  5/43/371  7/52/444  8/50/439  8/57/490  8/51/461  7/58/447  8/51/424  6/51/398  7/73/663  8/86/812  7/85/776  -         8/57/493  8/62/518
tt-b      7/39/298  7/41/337  8/54/399  8/50/361  8/65/514  9/60/543  9/70/595  9/64/578  9/63/538  9/60/548  6/44/278  8/61/461  8/59/434  9/74/480  9/67/456  8/67/444  9/65/431  6/45/347  9/61/504  8/59/509  8/63/468  8/57/493  -         10/87/853
tt-c      7/41/306  7/47/359  8/60/446  8/57/393  8/63/544  9/62/579  9/74/611  9/67/603  9/67/546  9/63/570  6/41/304  8/60/507  8/60/482  9/75/521  9/71/507  8/70/477  9/70/472  6/47/385  9/67/548  8/64/539  8/67/490  8/62/518  10/87/853 -
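To put these overlap counts in perspective, it can help to know what overlap to expect by chance. The sketch below assumes an idealized baseline in which the two fingerprints pick their top-N hits independently and uniformly at random from the 59788 compounds; this baseline is an assumption for illustration, not something measured in this post. Under that assumption the expected overlap is N*N/M.

```python
def expected_random_overlap(top_n, n_compounds):
    """Expected size of the intersection of two independent, uniformly
    random subsets of size top_n drawn from n_compounds items."""
    return top_n * top_n / n_compounds


n_compounds = 59788
for top_n in (10, 100, 1000):
    print(top_n, round(expected_random_overlap(top_n, n_compounds), 3))
```

Even for the top 1000 the chance expectation is only about 17 compounds, so the overlaps of several hundred seen in the table are far above this random baseline.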

Look at the highest overlaps (most similar results) for each FP:

snms = sorted(nms)

ovls = dict(ovls)
for i in range(len(snms)):
    print('-'*70)
    print(snms[i])
    for cnt in (10,100,1000):
        row = []
        for j in range(len(snms)):
            if i==j:
                continue
            nmi = snms[i]
            nmj = snms[j]
            if (nmi,nmj) not in ovls:
                nmi,nmj = nmj,nmi
            row.append((len(ovls[nmi,nmj][cnt]),j))
        row = sorted(row,reverse=True)
        nbrs = []
        for j in range(3):
            nbrs.append(f'{snms[row[j][1]]}({row[j][0]})')
        print(f'\t{cnt: 5d}',f'{nbrs[0]:13s}',f'{nbrs[1]:13s}',f'{nbrs[2]:13s}',)
----------------------------------------------------------------------
ap-b
       10 ap-c(9)       mfp1-c(8)     ffp1-c(8)    
      100 ap-c(77)      gobbi2d(51)   rdk7-b(47)   
     1000 ap-c(724)     ffp1-c(383)   mfp1-c(380)  
----------------------------------------------------------------------
ap-c
       10 ap-b(9)       ffp1-c(8)     tt-c(7)      
      100 ap-b(77)      rdk5-b(52)    rdk5-c(51)   
     1000 ap-b(724)     ffp1-c(445)   mfp1-c(424)  
----------------------------------------------------------------------
avalon
       10 mfp3-c(9)     mfp2-c(9)     mfp2-b(9)    
      100 rdk5-b(83)    rdk5-c(74)    rdk7-b(73)   
     1000 rdk5-b(625)   avalon-c(588) mfp1-b(552)  
----------------------------------------------------------------------
avalon-c
       10 mfp3-c(9)     mfp2-c(9)     mfp2-b(9)    
      100 avalon(70)    rdk5-c(69)    rdk7-c(68)   
     1000 avalon(588)   pattern(532)  rdk5-b(494)  
----------------------------------------------------------------------
ffp1-b
       10 mfp1-b(9)     ffp3-b(9)     ffp2-b(9)    
      100 mfp1-b(82)    ffp2-b(68)    mfp2-b(67)   
     1000 mfp1-b(759)   ffp2-b(671)   mfp2-b(662)  
----------------------------------------------------------------------
ffp1-c
       10 tt-c(9)       tt-b(9)       mfp1-c(9)    
      100 mfp1-c(80)    ffp2-c(70)    mfp2-c(68)   
     1000 ffp2-c(722)   mfp1-c(674)   ffp3-c(633)  
----------------------------------------------------------------------
ffp2-b
       10 ffp3-b(10)    tt-c(9)       tt-b(9)      
      100 mfp2-b(84)    mfp3-b(82)    ffp3-b(82)   
     1000 ffp3-b(804)   ffp2-c(682)   ffp1-b(671)  
----------------------------------------------------------------------
ffp2-c
       10 tt-c(9)       tt-b(9)       rdk7-c(9)    
      100 ffp3-c(90)    mfp2-c(81)    mfp3-c(80)   
     1000 ffp3-c(848)   ffp1-c(722)   ffp2-b(682)  
----------------------------------------------------------------------
ffp3-b
       10 ffp2-b(10)    tt-c(9)       tt-b(9)      
      100 mfp3-b(86)    mfp3-c(82)    ffp2-b(82)   
     1000 ffp2-b(804)   ffp3-c(715)   ffp2-c(669)  
----------------------------------------------------------------------
ffp3-c
       10 tt-c(9)       tt-b(9)       rdk7-c(9)    
      100 ffp2-c(90)    mfp3-c(82)    ffp3-b(81)   
     1000 ffp2-c(848)   ffp3-b(715)   ffp2-b(649)  
----------------------------------------------------------------------
gobbi2d
       10 rdk5-c(7)     rdk5-b(7)     tt-c(6)      
      100 mfp2-b(55)    ap-b(51)      mfp3-b(49)   
     1000 rdk7-c(371)   rdk7-b(367)   rdk5-c(367)  
----------------------------------------------------------------------
mfp1-b
       10 rdk7-b(9)     mfp3-c(9)     mfp3-b(9)    
      100 ffp1-b(82)    mfp2-b(68)    mfp1-c(66)   
     1000 ffp1-b(759)   mfp2-b(729)   mfp1-c(645)  
----------------------------------------------------------------------
mfp1-c
       10 mfp3-c(9)     mfp2-c(9)     mfp2-b(9)    
      100 ffp1-c(80)    mfp2-c(75)    mfp3-c(69)   
     1000 mfp2-c(729)   ffp1-c(674)   mfp1-b(645)  
----------------------------------------------------------------------
mfp2-b
       10 mfp3-c(10)    mfp2-c(10)    tt-c(9)      
      100 ffp2-b(84)    mfp3-b(83)    mfp2-c(81)   
     1000 mfp3-b(780)   mfp1-b(729)   mfp2-c(694)  
----------------------------------------------------------------------
mfp2-c
       10 mfp3-c(10)    mfp2-b(10)    tt-c(9)      
      100 mfp3-c(89)    mfp3-b(84)    mfp2-b(81)   
     1000 mfp3-c(844)   mfp1-c(729)   mfp2-b(694)  
----------------------------------------------------------------------
mfp3-b
       10 rdk7-b(9)     mfp3-c(9)     mfp2-c(9)    
      100 mfp3-c(89)    ffp3-b(86)    mfp2-c(84)   
     1000 mfp2-b(780)   mfp3-c(735)   mfp2-c(679)  
----------------------------------------------------------------------
mfp3-c
       10 mfp2-c(10)    mfp2-b(10)    tt-c(9)      
      100 mfp3-b(89)    mfp2-c(89)    ffp3-c(82)   
     1000 mfp2-c(844)   mfp3-b(735)   ffp3-c(645)  
----------------------------------------------------------------------
pattern
       10 tt-c(6)       tt-b(6)       rdk7-c(6)    
      100 avalon-c(62)  mfp1-b(57)    avalon(56)   
     1000 avalon-c(532) avalon(518)   rdk5-b(513)  
----------------------------------------------------------------------
rdk5-b
       10 tt-c(9)       tt-b(9)       rdk5-c(8)    
      100 rdk7-b(84)    rdk5-c(83)    avalon(83)   
     1000 rdk5-c(736)   rdk7-b(705)   rdk7-c(663)  
----------------------------------------------------------------------
rdk5-c
       10 ffp2-c(9)     tt-c(8)       tt-b(8)      
      100 rdk7-c(86)    rdk7-b(85)    rdk5-b(83)   
     1000 rdk7-c(812)   rdk5-b(736)   rdk7-b(701)  
----------------------------------------------------------------------
rdk7-b
       10 mfp3-c(9)     mfp3-b(9)     mfp2-c(9)    
      100 rdk7-c(85)    rdk5-c(85)    rdk5-b(84)   
     1000 rdk7-c(776)   rdk5-b(705)   rdk5-c(701)  
----------------------------------------------------------------------
rdk7-c
       10 ffp3-c(9)     ffp2-c(9)     tt-c(8)      
      100 rdk5-c(86)    rdk7-b(85)    rdk5-b(73)   
     1000 rdk5-c(812)   rdk7-b(776)   rdk5-b(663)  
----------------------------------------------------------------------
tt-b
       10 tt-c(10)      rdk5-b(9)     mfp3-c(9)    
      100 tt-c(87)      mfp2-b(74)    ffp2-b(70)   
     1000 tt-c(853)     ffp2-b(595)   ffp2-c(578)  
----------------------------------------------------------------------
tt-c
       10 tt-b(10)      rdk5-b(9)     mfp3-c(9)    
      100 tt-b(87)      mfp2-b(75)    ffp2-b(74)   
     1000 tt-b(853)     ffp2-b(611)   ffp2-c(603)  

Do some statistics

Those were just results for a single query molecule. Let’s try and get more robust results by picking 500 molecules and looking at statistics for the overlaps:

import random
random.seed(0xbad5eed)
order = list(range(len(dbmols)))
random.shuffle(order)


nToDo = 500
order = order[:nToDo]



accum = {}
for nm,fpg in runs:
    print(nm)
    fps = list(fpg.GetFingerprints(dbmols,numThreads=8))
    taccum = []
    for i in order:
        qry = fps[i]
        sims = [(sim,j) for j,sim in enumerate(DataStructs.BulkTanimotoSimilarity(qry,fps))]
        # remove the self similarity
        del sims[i]
        sims = sorted(sims,reverse=True)[:1000]
        taccum.append(sims)
    accum[f'{nm}-b'] = taccum

    fps = list(fpg.GetCountFingerprints(dbmols,numThreads=8))
    taccum = []
    for i in order:
        qry = fps[i]
        sims = [(sim,j) for j,sim in enumerate(DataStructs.BulkTanimotoSimilarity(qry,fps))]
        # remove the self similarity
        del sims[i]
        sims = sorted(sims,reverse=True)[:1000]
        taccum.append(sims)
    accum[f'{nm}-c'] = taccum

for nm,func in func_runs:
    print(nm)
    fps = [func(m) for m in dbmols]
    taccum = []
    for i in order:
        qry = fps[i]
        sims = [(sim,j) for j,sim in enumerate(DataStructs.BulkTanimotoSimilarity(qry,fps))]
        # remove the self similarity
        del sims[i]
        sims = sorted(sims,reverse=True)[:1000]
        taccum.append(sims)
    accum[nm] = taccum
mfp2
ffp2
mfp3
ffp3
mfp1
ffp1
tt
ap
rdk5
rdk7
gobbi2d
avalon
avalon-c
pattern
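The three loops above repeat the same steps: compute similarities to a query, drop the query itself, and keep the 1000 best hits. A possible refactoring is sketched below. The helper name `top_hits_without_self` is hypothetical, and `toy_bulk_similarity` is only a stand-in for `DataStructs.BulkTanimotoSimilarity` so that the example runs without the RDKit.

```python
def top_hits_without_self(sims, query_idx, n_keep=1000):
    """Return the n_keep best (similarity, index) pairs, excluding the
    entry for the query molecule itself."""
    pairs = [(sim, j) for j, sim in enumerate(sims) if j != query_idx]
    return sorted(pairs, reverse=True)[:n_keep]


def toy_bulk_similarity(query, others):
    """Stand-in for a bulk Tanimoto calculation, using sets of on-bits."""
    return [len(query & o) / len(query | o) for o in others]


# four toy "fingerprints" represented as sets of on-bits
fps = [{1, 2, 3}, {1, 2, 4}, {5, 6}, {1, 2, 3, 4}]
query_idx = 0
sims = toy_bulk_similarity(fps[query_idx], fps)
hits = top_hits_without_self(sims, query_idx, n_keep=2)
print(hits)
```

Filtering by index rather than deleting from the list keeps the original similarity list untouched, which matters if it is reused afterwards.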
import pickle
import gzip
with gzip.open('./results/sim_overlaps.pkl.gz','wb+') as outf:
    pickle.dump(accum,outf)
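Since the similarity calculations take a while, it is handy to be able to reload the saved results later. The round trip looks like the following sketch, which uses a small made-up dictionary and a temporary directory instead of the real `accum` data and `./results` path.

```python
import gzip
import os
import pickle
import tempfile

# toy stand-in for `accum`: fingerprint name -> list of per-query hit lists
toy_accum = {
    'mfp2-b': [[(0.8, 12), (0.7, 3)]],
    'mfp2-c': [[(0.9, 12), (0.6, 7)]],
}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'sim_overlaps.pkl.gz')
    # write the compressed pickle
    with gzip.open(path, 'wb') as outf:
        pickle.dump(toy_accum, outf)
    # read it back
    with gzip.open(path, 'rb') as inf:
        reloaded = pickle.load(inf)

print(reloaded == toy_accum)
```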

Accumulate the overlaps:

from collections import defaultdict

nms = list(accum.keys())

ovls = defaultdict(dict)
topNs = [10,100,1000]
for i,nmi in enumerate(nms):
    nruns = len(accum[nmi])
    for j in range(i):
        nmj = nms[j]
        for topN in topNs:
            ovls[nmi,nmj][topN] = []
        for run in range(nruns):
            topi = [x for s,x in accum[nmi][run]]
            topj = [x for s,x in accum[nmj][run]]
            for topN in topNs:
                ovls[nmi,nmj][topN].append(len(set(topi[:topN]).intersection(topj[:topN])))
ovls = dict(ovls)

Find the most similar fingerprint types for each of the fingerprints, this time using the mean overlap across the 500 hit sets:

snms = sorted(nms)

for i in range(len(snms)):
    print('-'*70)
    print(snms[i])
    for cnt in (10,100,1000):
        row = []
        for j in range(len(snms)):
            if i==j:
                continue
            nmi = snms[i]
            nmj = snms[j]
            if (nmi,nmj) not in ovls:
                nmi,nmj = nmj,nmi
            row.append((np.mean(ovls[nmi,nmj][cnt]),j))
        row = sorted(row,reverse=True)
        nbrs = []
        for j in range(3):
            nbrs.append(f'{snms[row[j][1]]:8s}({row[j][0]:5.1f})')
        print(f'\t{cnt: 5d}',f'{nbrs[0]:17s}',f'{nbrs[1]:17s}',f'{nbrs[2]:16s}',)
----------------------------------------------------------------------
ap-b
       10 ap-c    (  8.1)   ffp1-c  (  5.0)   mfp1-c  (  5.0) 
      100 ap-c    ( 74.9)   ffp1-c  ( 40.7)   mfp1-c  ( 39.7) 
     1000 ap-c    (732.9)   ffp1-c  (395.4)   mfp1-c  (370.4) 
----------------------------------------------------------------------
ap-c
       10 ap-b    (  8.1)   ffp2-c  (  5.1)   ffp1-c  (  5.1) 
      100 ap-b    ( 74.9)   ffp1-c  ( 43.6)   mfp1-c  ( 41.9) 
     1000 ap-b    (732.9)   ffp1-c  (432.4)   mfp1-c  (402.0) 
----------------------------------------------------------------------
avalon
       10 avalon-c(  5.6)   rdk7-b  (  5.4)   rdk5-b  (  5.3) 
      100 avalon-c( 49.7)   rdk5-b  ( 49.3)   rdk7-b  ( 48.1) 
     1000 avalon-c(452.4)   rdk5-b  (403.7)   rdk5-c  (370.2) 
----------------------------------------------------------------------
avalon-c
       10 avalon  (  5.6)   rdk5-c  (  5.4)   rdk7-c  (  5.1) 
      100 avalon  ( 49.7)   rdk5-c  ( 45.9)   rdk7-c  ( 43.2) 
     1000 avalon  (452.4)   rdk5-c  (378.0)   rdk7-c  (358.5) 
----------------------------------------------------------------------
ffp1-b
       10 mfp1-b  (  7.2)   ffp2-b  (  6.7)   mfp2-b  (  6.2) 
      100 mfp1-b  ( 66.2)   ffp2-b  ( 64.0)   mfp2-b  ( 56.8) 
     1000 ffp2-b  (645.2)   mfp1-b  (622.2)   mfp2-b  (545.6) 
----------------------------------------------------------------------
ffp1-c
       10 mfp1-c  (  7.6)   ffp2-c  (  7.0)   mfp2-c  (  6.5) 
      100 mfp1-c  ( 68.0)   ffp2-c  ( 66.5)   mfp2-c  ( 58.3) 
     1000 ffp2-c  (698.4)   mfp1-c  (652.9)   ffp3-c  (610.5) 
----------------------------------------------------------------------
ffp2-b
       10 ffp3-b  (  8.2)   mfp2-b  (  7.9)   mfp3-b  (  7.5) 
      100 ffp3-b  ( 80.2)   mfp2-b  ( 74.1)   mfp3-b  ( 70.4) 
     1000 ffp3-b  (797.6)   mfp2-b  (679.3)   mfp3-b  (651.9) 
----------------------------------------------------------------------
ffp2-c
       10 ffp3-c  (  8.3)   mfp2-c  (  8.1)   mfp3-c  (  7.5) 
      100 ffp3-c  ( 82.2)   mfp2-c  ( 74.4)   mfp3-c  ( 70.8) 
     1000 ffp3-c  (840.0)   mfp2-c  (700.4)   ffp1-c  (698.4) 
----------------------------------------------------------------------
ffp3-b
       10 ffp2-b  (  8.2)   mfp3-b  (  8.1)   mfp2-b  (  7.5) 
      100 ffp2-b  ( 80.2)   mfp3-b  ( 74.7)   ffp3-c  ( 69.5) 
     1000 ffp2-b  (797.6)   mfp3-b  (662.3)   ffp3-c  (660.8) 
----------------------------------------------------------------------
ffp3-c
       10 ffp2-c  (  8.3)   mfp3-c  (  8.2)   mfp2-c  (  7.6) 
      100 ffp2-c  ( 82.2)   mfp3-c  ( 75.4)   mfp2-c  ( 70.6) 
     1000 ffp2-c  (840.0)   mfp3-c  (692.7)   mfp2-c  (674.0) 
----------------------------------------------------------------------
gobbi2d
       10 ap-c    (  3.9)   ap-b    (  3.8)   mfp3-c  (  3.6) 
      100 ap-c    ( 27.7)   ap-b    ( 27.6)   mfp2-c  ( 27.0) 
     1000 ap-b    (253.4)   ap-c    (251.7)   mfp2-c  (218.5) 
----------------------------------------------------------------------
mfp1-b
       10 ffp1-b  (  7.2)   mfp2-b  (  7.0)   ffp2-b  (  6.3) 
      100 mfp2-b  ( 67.2)   ffp1-b  ( 66.2)   ffp2-b  ( 57.9) 
     1000 mfp2-b  (681.8)   ffp1-b  (622.2)   mfp3-b  (581.7) 
----------------------------------------------------------------------
mfp1-c
       10 ffp1-c  (  7.6)   mfp2-c  (  7.2)   ffp2-c  (  6.5) 
      100 mfp2-c  ( 69.8)   ffp1-c  ( 68.0)   ffp2-c  ( 60.0) 
     1000 mfp2-c  (727.6)   ffp1-c  (652.9)   mfp3-c  (644.3) 
----------------------------------------------------------------------
mfp2-b
       10 mfp3-b  (  8.3)   ffp2-b  (  7.9)   ffp3-b  (  7.5) 
      100 mfp3-b  ( 81.1)   ffp2-b  ( 74.1)   ffp3-b  ( 69.1) 
     1000 mfp3-b  (807.9)   mfp1-b  (681.8)   ffp2-b  (679.3) 
----------------------------------------------------------------------
mfp2-c
       10 mfp3-c  (  8.3)   ffp2-c  (  8.1)   ffp3-c  (  7.6) 
      100 mfp3-c  ( 83.4)   ffp2-c  ( 74.4)   ffp3-c  ( 70.6) 
     1000 mfp3-c  (847.8)   mfp1-c  (727.6)   ffp2-c  (700.4) 
----------------------------------------------------------------------
mfp3-b
       10 mfp2-b  (  8.3)   ffp3-b  (  8.1)   ffp2-b  (  7.5) 
      100 mfp2-b  ( 81.1)   ffp3-b  ( 74.7)   mfp3-c  ( 71.4) 
     1000 mfp2-b  (807.9)   mfp3-c  (678.4)   ffp3-b  (662.3) 
----------------------------------------------------------------------
mfp3-c
       10 mfp2-c  (  8.3)   ffp3-c  (  8.2)   ffp2-c  (  7.5) 
      100 mfp2-c  ( 83.4)   ffp3-c  ( 75.4)   mfp3-b  ( 71.4) 
     1000 mfp2-c  (847.8)   ffp3-c  (692.7)   ffp2-c  (678.8) 
----------------------------------------------------------------------
pattern
       10 avalon-c(  4.8)   avalon  (  4.3)   rdk5-c  (  4.2) 
      100 avalon-c( 42.3)   avalon  ( 39.9)   rdk5-c  ( 38.9) 
     1000 avalon-c(356.0)   rdk5-c  (345.9)   rdk5-b  (340.6) 
----------------------------------------------------------------------
rdk5-b
       10 rdk7-b  (  7.3)   rdk5-c  (  6.4)   rdk7-c  (  6.1) 
      100 rdk7-b  ( 68.4)   rdk5-c  ( 64.8)   rdk7-c  ( 59.3) 
     1000 rdk5-c  (651.3)   rdk7-c  (545.5)   rdk7-b  (535.0) 
----------------------------------------------------------------------
rdk5-c
       10 rdk7-c  (  7.5)   rdk5-b  (  6.4)   rdk7-b  (  6.1) 
      100 rdk7-c  ( 74.9)   rdk5-b  ( 64.8)   rdk7-b  ( 58.2) 
     1000 rdk7-c  (688.7)   rdk5-b  (651.3)   rdk7-b  (457.5) 
----------------------------------------------------------------------
rdk7-b
       10 rdk5-b  (  7.3)   rdk7-c  (  7.0)   rdk5-c  (  6.1) 
      100 rdk5-b  ( 68.4)   rdk7-c  ( 63.8)   rdk5-c  ( 58.2) 
     1000 rdk7-c  (538.0)   rdk5-b  (535.0)   rdk5-c  (457.5) 
----------------------------------------------------------------------
rdk7-c
       10 rdk5-c  (  7.5)   rdk7-b  (  7.0)   rdk5-b  (  6.1) 
      100 rdk5-c  ( 74.9)   rdk7-b  ( 63.8)   rdk5-b  ( 59.3) 
     1000 rdk5-c  (688.7)   rdk5-b  (545.5)   rdk7-b  (538.0) 
----------------------------------------------------------------------
tt-b
       10 tt-c    (  8.6)   ffp3-c  (  6.2)   mfp3-c  (  6.1) 
      100 tt-c    ( 83.4)   ffp2-b  ( 54.9)   mfp2-b  ( 53.5) 
     1000 tt-c    (824.8)   ffp2-b  (481.9)   ffp2-c  (468.5) 
----------------------------------------------------------------------
tt-c
       10 tt-b    (  8.6)   ffp3-c  (  6.4)   ffp2-c  (  6.3) 
      100 tt-b    ( 83.4)   ffp2-c  ( 54.9)   ffp3-c  ( 54.6) 
     1000 tt-b    (824.8)   ffp2-c  (494.2)   ffp2-b  (486.5) 
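The mean is only one summary of the per-query overlaps; two fingerprint pairs with the same mean can have quite different spreads. A small sketch of a fuller summary using the standard library is shown below. The helper name `summarize_overlaps` and the overlap counts are invented for illustration.

```python
import statistics


def summarize_overlaps(counts):
    """Basic summary statistics for a list of per-query overlap counts."""
    return {
        'mean': statistics.mean(counts),
        'median': statistics.median(counts),
        'stdev': statistics.stdev(counts),
        'min': min(counts),
        'max': max(counts),
    }


# made-up top-100 overlaps for five queries
toy_counts = [62, 80, 45, 91, 72]
summary = summarize_overlaps(toy_counts)
print(summary)
```

With the real data this would be called as, for example, `summarize_overlaps(ovls[nmi,nmj][100])` for a pair of fingerprint names present in `ovls`.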

Look at histograms of the overlap sizes for a few different fingerprint pairs

def compare(prs,ovls=ovls):
    prs = list(prs)
    for i,pr in enumerate(prs):
        if pr not in ovls:
            pr = pr[1],pr[0]
        prs[i] = pr

    plt.figure(figsize=(12,5))
    for i,n in enumerate((100,1000)):
        plt.subplot(1,2,i+1)
        plt.hist([ovls[pr][n] for pr in prs],label=prs,bins=20);
        plt.xlim(0,n)
        plt.xlabel(f'n={n} overlap')
        #plt.title(pr)
        plt.legend()
    plt.tight_layout()

Start with FeatureMorgan and Morgan, where there is a reasonably large amount of overlap:

compare((('ffp2-b','mfp2-b'),('ffp3-b','mfp3-b')))

Now compare count-based and bit-based Morgan fingerprints. I expected the overlaps here to be higher:

compare((('mfp3-c','mfp3-b'),('mfp2-c','mfp2-b'),))

Next, compare very different types of fingerprints: count-based Morgan3 against both Topological Torsions and Atom Pairs (these both use count simulation). Here there are significant differences. These are fingerprints that it would be interesting to use together.

compare((('tt-b','mfp3-c'),('ap-b','mfp3-c')))
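Count simulation, mentioned above, lets a bit vector carry some information about how often a feature occurs. The sketch below shows the general idea in pure Python: each feature gets one bit per count bound, and a bit is set when the feature's count reaches that bound. The bounds `(1, 2, 4, 8)` are assumed here to mirror the RDKit defaults and should be checked against the documentation for the version in use; the feature names and counts are made up.

```python
def simulate_counts(feature_counts, count_bounds=(1, 2, 4, 8)):
    """Expand {feature: count} into a set of (feature, bound) 'bits':
    one bit is set for every bound that the count reaches."""
    bits = set()
    for feature, count in feature_counts.items():
        for bound in count_bounds:
            if count >= bound:
                bits.add((feature, bound))
    return bits


def tanimoto(a, b):
    """Tanimoto similarity between two sets of bits."""
    return len(a & b) / len(a | b)


# two molecules with the same features but different counts
counts1 = {'pair_x': 5, 'pair_y': 1}
counts2 = {'pair_x': 2, 'pair_y': 1}

plain_sim = tanimoto(set(counts1), set(counts2))  # presence/absence only
sim_with_counts = tanimoto(simulate_counts(counts1), simulate_counts(counts2))
print(plain_sim, sim_with_counts)
```

Without count simulation the two molecules look identical; with it, the different counts of `pair_x` lower the similarity.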

The same comparison with bit-based Morgan3, this time also including RDK5. Again, these are nicely complementary fingerprints:

compare((('tt-b','mfp3-b'),('ap-b','mfp3-b'),('rdk5-b','mfp3-b')))

Finally compare Morgan3 with the pattern fingerprint (normally used for substructure screening, not similarity search), Gobbi2D (a 2D pharmacophore FP) and the Avalon FP. These are also nicely different from each other:

compare((('pattern','mfp3-b'),('gobbi2d','mfp3-b'),('avalon','mfp3-b')))