Some observations about similarity search thresholds

similarity
reference
Fingerprint efficiency
Published: May 26, 2021

Updated 08.06.2021 after I expanded the set of “related compounds”. The source of the previous version of the post is available on GitHub. The updates didn’t change the discussion much.

TL;DR

Based on the analysis here, it looks like the fingerprint provided by the RDKit that does the best job of efficiently retrieving chemically similar structures is the RDKit fingerprint with maxPath set to 6.

Intro / Results

I recently did a post presenting an approach for finding reasonable thresholds for similarity searches using the fingerprints the RDKit provides. This is a followup to that one, written after I spent some more time looking at the data. I want to come up with a suggestion for which fingerprint to use for similarity searches when the goal is retrieving as many chemically related compounds as possible. I’ll do that by looking at search efficiency, measured as the fraction of the total database retrieved when using a similarity threshold high enough to still return 90-95% of the related compounds. See the earlier post for an explanation of what “related compounds” means here and how the searches were done.
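
To make the efficiency measure concrete, here is a minimal sketch of how a threshold could be picked from two lists of similarity values. This is not the code from the earlier post: the function name `pick_threshold`, the 0.05 grid, the "highest grid value that still reaches the target" rule, and the toy numbers are all assumptions made for illustration.

```python
def pick_threshold(related_sims, background_sims, target=0.9, step=0.05):
    """Return (threshold, related_found, background_retrieved).

    related_sims:    similarities between pairs of related compounds
    background_sims: similarities between queries and background compounds

    Scans a grid of thresholds from 1.0 downwards and returns the first
    (highest) threshold that retrieves at least `target` of the related pairs.
    """
    n_steps = round(1 / step)
    for i in range(n_steps, -1, -1):
        threshold = round(i * step, 2)
        related_found = sum(s >= threshold for s in related_sims) / len(related_sims)
        if related_found >= target:
            background_retrieved = (
                sum(s >= threshold for s in background_sims) / len(background_sims)
            )
            return threshold, related_found, background_retrieved
    raise ValueError("target must be between 0 and 1")


# Toy numbers, invented for illustration only (not data from the post)
related = [0.92, 0.81, 0.74, 0.66, 0.63, 0.58, 0.55, 0.52, 0.47, 0.31]
background = [0.05, 0.08, 0.10, 0.12, 0.14, 0.15, 0.17, 0.18, 0.20, 0.21,
              0.22, 0.24, 0.25, 0.27, 0.28, 0.30, 0.33, 0.36, 0.41, 0.48]

threshold, found, retrieved = pick_threshold(related, background, target=0.9)
print(f"threshold={threshold}, related found={found:.2f}, "
      f"background retrieved={retrieved:.2f}")
```

The lower the background fraction at a given target, the more efficient the fingerprint is for this kind of search.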

As a reminder, this is how I presented the results in that post and how to interpret the data:

| Fingerprint | 0.95 noise level | 0.95 of related: threshold | 0.95 of related: db fraction / count per million | 0.9 of related: threshold | 0.9 of related: db fraction / count per million | 0.8 of related: threshold | 0.8 of related: db fraction / count per million | 0.5 of related: threshold | 0.5 of related: db fraction / count per million |
|---|---|---|---|---|---|---|---|---|---|
| Morgan2 (bits) | 0.27 | 0.4 | 0.00019 / 190 | 0.4 | 0.00019 / 190 | 0.45 | 0.00012 / 115 | 0.55 | 2.5e-05 / 25 |

The 0.95 noise level (from the previous analysis) for the Morgan2 (bits) fingerprint (MFP2) is 0.27. If I want to retrieve 95% of the related compounds I need to set the similarity threshold to 0.4. With this threshold I would retrieve ~190 compounds per million compounds in the database (0.019% of the database). Similarly, if I were willing to live with finding 50% of the related compounds I could set the search threshold to 0.55, in which case I’d only retrieve ~25 rows per million compounds in the database.
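
The conversion between the "db fraction" and "count per million" columns is simple arithmetic. Here is a small sketch using the Morgan2 (bits) values from the table; the helper name `summarize_db_fraction` and its `db_size` parameter are hypothetical and only there for illustration.

```python
def summarize_db_fraction(db_fraction, db_size=1_000_000):
    """Return (expected hits in a database of db_size, percent of the database)."""
    return db_fraction * db_size, db_fraction * 100


# db fractions for Morgan2 (bits), taken from the table above
for label, fraction in [("95% of related compounds", 0.00019),
                        ("50% of related compounds", 2.5e-05)]:
    hits, percent = summarize_db_fraction(fraction)
    print(f"{label}: ~{hits:.0f} per million ({percent:.4f}% of the database)")
```

Passing a different `db_size` gives a rough idea of how many background hits to expect from a database of that size, assuming it behaves like the one used here.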

I won’t reproduce the full results table from the post here, but here are the rows with the highest search efficiencies (lowest number of compounds returned from the “background database”) at 90% and 95% of related compounds found. I sorted the table by the efficiency at 90% of related compounds retrieved:

| Fingerprint | 0.95 noise level | 0.95 of related: threshold | 0.95 of related: db fraction / count per million | 0.9 of related: threshold | 0.9 of related: db fraction / count per million | 0.8 of related: threshold | 0.8 of related: db fraction / count per million | 0.5 of related: threshold | 0.5 of related: db fraction / count per million |
|---|---|---|---|---|---|---|---|---|---|
| RDKit 7 (bits) | 0.43 | 0.55 | 0.00051 / 510 | 0.6 | 8e-05 / 80 | 0.6 | 8e-05 / 80 | 0.7 | 3e-05 / 30 |
| Topological Torsions (counts) | 0.19 | 0.35 | 0.00049 / 489 | 0.4 | 0.00011 / 110 | 0.45 | 7.5e-05 / 75 | 0.55 | 2.5e-05 / 25 |
| linear RDKit 7 (bits) | 0.26 | 0.45 | 0.00053 / 535 | 0.5 | 0.00013 / 130 | 0.55 | 9e-05 / 90 | 0.65 | 3.5e-05 / 35 |
| RDKit 6 (bits) | 0.31 | 0.5 | 0.00021 / 210 | 0.55 | 0.00014 / 135 | 0.6 | 6e-05 / 60 | 0.7 | 3e-05 / 30 |
| Morgan2 (counts) | 0.25 | 0.4 | 0.00014 / 140 | 0.4 | 0.00014 / 140 | 0.45 | 8.5e-05 / 84 | 0.55 | 2e-05 / 20 |
| Avalon 1024 (bits) | 0.37 | 0.55 | 0.00075 / 750 | 0.6 | 0.00014 / 140 | 0.65 | 9e-05 / 90 | 0.75 | 2.5e-05 / 25 |
| Morgan3 (counts) | 0.20 | 0.3 | 0.00026 / 260 | 0.35 | 0.00015 / 154 | 0.35 | 0.00015 / 154 | 0.45 | 3.5e-05 / 35 |
| RDKit 5 (bits) | 0.29 | 0.5 | 0.00025 / 250 | 0.55 | 0.00016 / 155 | 0.6 | 6e-05 / 60 | 0.7 | 3e-05 / 30 |
| Topological Torsions (bits) | 0.22 | 0.4 | 0.00016 / 160 | 0.4 | 0.00016 / 160 | 0.45 | 0.00011 / 105 | 0.55 | 3.5e-05 / 35 |
| Morgan2 (bits) | 0.27 | 0.4 | 0.00019 / 190 | 0.4 | 0.00019 / 190 | 0.45 | 0.00012 / 115 | 0.55 | 2.5e-05 / 25 |
| FeatMorgan3 (counts) | 0.28 | 0.4 | 0.00022 / 220 | 0.4 | 0.00022 / 220 | 0.45 | 0.00013 / 130 | 0.55 | 3e-05 / 30 |
| linear RDKit 6 (bits) | 0.28 | 0.5 | 0.00022 / 220 | 0.5 | 0.00022 / 220 | 0.55 | 0.00014 / 140 | 0.7 | 3e-05 / 30 |

The threshold values are rounded to the nearest 0.05.

I’ve included count-based fingerprints in the above table, but they wouldn’t be my first choice for use in a real-world similarity search application. Calculating similarity for count-based fingerprints is significantly slower than for bit vector fingerprints, so they really aren’t practical for large datasets. Note that the RDKit has a method for approximating counts using bit vector fingerprints, which is used by the Atom Pair and Topological Torsion fingerprints and could also be an option for the other fingerprint types, but that’s a topic for another post.
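
To illustrate the difference between the two kinds of comparison, here is a minimal pure-Python sketch of Tanimoto similarity on bit-style and count-style fingerprints. It is not the RDKit’s implementation: the function names and the toy feature labels are hypothetical, and the count version uses the common sum-of-minima over sum-of-maxima generalization. The bit version only needs set sizes (which map onto fast bitwise operations on real bit vectors), while the count version has to compare every feature count individually.

```python
from collections import Counter


def tanimoto_bits(a, b):
    """Tanimoto similarity for two sets of on-bit indices (or feature labels)."""
    if not a and not b:
        return 0.0
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def tanimoto_counts(a, b):
    """Tanimoto similarity for two feature -> count mappings.

    Uses sum(min(count_a, count_b)) / sum(max(count_a, count_b)).
    """
    keys = set(a) | set(b)
    numerator = sum(min(a[k], b[k]) for k in keys)
    denominator = sum(max(a[k], b[k]) for k in keys)
    return numerator / denominator if denominator else 0.0


# Toy fingerprints with invented feature labels, for illustration only
counts_a = Counter({"C-C": 3, "C-O": 1, "C=O": 1})
counts_b = Counter({"C-C": 1, "C-O": 1, "C-N": 1})

bits_sim = tanimoto_bits(set(counts_a), set(counts_b))
counts_sim = tanimoto_counts(counts_a, counts_b)
print(f"bits: {bits_sim:.3f}, counts: {counts_sim:.3f}")
```

The two forms can give noticeably different similarity values for the same pair, which is one reason the thresholds in the table differ between the bit and count versions of a fingerprint.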

Based on these numbers (and, of course, the dataset I used) it looks like the RDKit fingerprint is the best choice for chemical similarity search. Taking the efficiency at both 90% and 95% into account, the version of the fingerprint with maxPath=6 is arguably better than the version with maxPath=7 (which is the default). There is no publication for the RDKit fingerprint, but it is described in detail in the RDKit documentation.

The Morgan3 fingerprint, which is what I kind of expected to be the best at this task, doesn’t do that well: the bit-vector form didn’t even make this list of top performers. The Morgan2 fingerprint, on the other hand, seems like another good choice. The Morgan fingerprints are the RDKit’s implementation of the circular fingerprints described in this publication.

A real surprise to me was how well the topological torsions fingerprint does at this chemical search. I had (I guess without much evidence) thought of it as more of a fuzzy (or “scaffold-hopping”) fingerprint, but the high efficiency on this chemical search problem makes me reconsider that. Topological torsions were introduced in this publication.

The Avalon fingerprint seems to be another decent choice, at least at 90%. This isn’t surprising to me, but I’ll probably remain resistant to making heavy use of it due to the complexity of the fingerprint itself. The only non-code description I’m aware of for the Avalon FP is in the supplementary material for this paper; it’s likely that the current version of the fingerprint, which was under active development for at least 10 years after that paper appeared, deviates from that.

Before getting any deeper into details with this kind of analysis, I think I would like to look into using more than 10K of the “related” molecules and increasing the size of the background database just to make sure the statistics are solid. I’ll do that in a separate post and leave the count-based fingerprints out.