Updated 08.06.2021 after I expanded the set of “related compounds”. The source of the previous version of the post is available in github

Prologue

If you’re interested in this topic and have questions, notice mistakes in my analysis, or have suggestions or ideas for improving this (particularly when it comes to the sets of “related compounds” I use), please either send me email or leave a comment.

Intro / Results

One of the RDKit blog posts I refer back to the most is the one where I tried to establish the Tanimoto similarity value which constitutes a “noise level” for each of the fingerprints the RDKit supports by looking at the distributions of similarities between randomly chosen molecules. I periodically update the post just in case the threshold values change with RDKit versions. Here’s the most recent version of the post and the associated jupyter notebook. I find it really useful to be able to say things like “The 95% noise level for Tanimoto similarity calculated with the the bit-based version of the RDKit’s MFP2 is 0.27.” Based on this I know that when doing similarity searches the threshold for MFP2 shouldn’t be set below 0.27 (I normally say 0.3) in most cases. But that analysis doesn’t tell me what I should set the threshold to.

Of course the answer to that question is “it depends”. Let’s assume that the database you’re searching contains a certain number of compounds which would actually be interesting for you and a much larger number of compounds which are not interesting (at least not for the search you’re currently running). And let’s further assume that the similarities between those interesting compounds and your query is generally above the noise level for the fingerprint you’re using. Any similarity search is going to return a mix of both interesting and non-interesting compounds and the proportions in that mix are generally going to be determined by the similarity threshould you use. Setting the similarity threshold high tends to give a larger proportion of interesting compounds at the cost of missing interesting compounds while a lower threshold will return more of the interesting compounds but a higher fraction of uninteresting compounds.

Basically how bad your FOMO is will determine how many results you need to look through.

This isn’t a big deal if you’re searching a small database and or if you’re going to be post-processing the results using some other computational tool, but if the idea is that you’re going to actually be looking at the results of the similarity search, then result sets with 10K or more rows are going to require a lot of patience.

This post is an attempt to come up with recommendations for reasonable threshold values for the common RDKit fingerprints so that you can make a more informed decision about what to use for a given search.

There’s a more complete description below along with links to the jupyter notebooks with the actual code, but here’s a quick summary of what I did:

  1. I started with 1047 groups of related compounds (~66K compounds in all). Each group is a set of 50-100 compounds from a single ChEMBL document.
  2. I calculated intra-group similarities within each of those 1047 groups using each of the fingerprint types to determine thresholds for retrieving various fractions of each group.
  3. I used a randomly selected subset of 10K of those compounds to do similarity searches on 100K molecules randomly selected from ChEMBL in order to determine what fraction of the database would be retrieved for various similarity thresholds.

Here are what the results look like for bit-based MFP2:

0.95 of related compounds 0.9 of related compounds 0.8 of related compounds 0.5 of related compounds
Fingerprint 0.95 noise level threshold db fraction / count per million threshold db fraction / count per million threshold db fraction / count per million threshold db fraction / count per million
Morgan2 (bits) 0.27 0.4 0.00019 / 190 0.4 0.00019 / 190 0.45 0.00012 / 115 0.55 2.5e-05 / 25

The 0.95 noise level (from the previous analysis) for this FP is 0.27. If I want to retrieve 95% of the related compounds I need to set the similarity threshold to 0.4. With this threshold I would retrieve ~190 compounds per million compounds in the database (0.4% of the database). Similarly, if I were willing to live with finding 50% of the related actives I could set the search threshold to 0.55, in which case I’d only retrieve ~25 rows per million compounds in the database.

I find this is a useful way of thinking about the thresholds: it makes the balance between recall (number of interesting compounds retrieved) and the overall result set size visible. For example, for the MFP2 results shown above, if I’m willing to live with retrieving about 90% of the interesting compounds instead of 95% I would only have to look through about 1/8th of the results from the database.

With that explained, here’s the full results table:

0.95 of related compounds 0.9 of related compounds 0.8 of related compounds 0.5 of related compounds
Fingerprint 0.95 noise level threshold db fraction / count per million threshold db fraction / count per million threshold db fraction / count per million threshold db fraction / count per million
MACCS 0.57 0.65 0.016 / 15820 0.65 0.016 / 15820 0.7 0.0019 / 1880 0.8 8e-05 / 80
Morgan0 (counts) 0.57 0.6 0.017 / 16990 0.6 0.017 / 16990 0.65 0.009 / 9040 0.75 0.00057 / 565
Morgan1 (counts) 0.36 0.5 0.0003 / 300 0.5 0.0003 / 300 0.55 0.00017 / 170 0.65 2.5e-05 / 25
Morgan2 (counts) 0.25 0.4 0.00014 / 140 0.4 0.00014 / 140 0.45 8.5e-05 / 84 0.55 2e-05 / 20
Morgan3 (counts) 0.20 0.3 0.00026 / 260 0.35 0.00015 / 154 0.35 0.00015 / 154 0.45 3.5e-05 / 35
Morgan0 (bits) 0.57 0.6 0.019 / 18550 0.6 0.019 / 18550 0.65 0.0099 / 9880 0.75 0.00063 / 629
Morgan1 (bits) 0.37 0.5 0.00036 / 360 0.5 0.00036 / 360 0.55 0.0002 / 200 0.65 2.5e-05 / 25
Morgan2 (bits) 0.27 0.4 0.00019 / 190 0.4 0.00019 / 190 0.45 0.00012 / 115 0.55 2.5e-05 / 25
Morgan3 (bits) 0.22 0.3 0.00057 / 570 0.35 0.00031 / 309 0.4 5e-05 / 50 0.5 2e-05 / 20
FeatMorgan0 (counts) 0.74 0.65 0.17 / 165672 0.7 0.073 / 72975 0.7 0.073 / 72975 0.8 0.0086 / 8620
FeatMorgan1 (counts) 0.51 0.55 0.021 / 21000 0.6 0.0024 / 2360 0.65 0.0012 / 1235 0.7 0.00011 / 110
FeatMorgan2 (counts) 0.36 0.45 0.0038 / 3782 0.5 0.00023 / 230 0.55 0.00014 / 135 0.65 2.5e-05 / 25
FeatMorgan3 (counts) 0.28 0.4 0.00022 / 220 0.4 0.00022 / 220 0.45 0.00013 / 130 0.55 3e-05 / 30
FeatMorgan0 (bits) 0.74 0.65 0.17 / 165672 0.7 0.073 / 72975 0.7 0.073 / 72975 0.8 0.0086 / 8620
FeatMorgan1 (bits) 0.51 0.55 0.023 / 22962 0.6 0.0027 / 2660 0.65 0.0014 / 1390 0.7 0.00012 / 120
FeatMorgan2 (bits) 0.38 0.45 0.006 / 6007 0.5 0.00031 / 310 0.55 0.00018 / 175 0.65 2.5e-05 / 25
FeatMorgan3 (bits) 0.30 0.4 0.00037 / 370 0.45 0.00021 / 210 0.45 0.00021 / 210 0.55 3.5e-05 / 35
RDKit 4 (bits) 0.33 0.5 0.00069 / 690 0.55 0.0004 / 400 0.6 0.00011 / 110 0.7 4e-05 / 40
RDKit 5 (bits) 0.29 0.5 0.00025 / 250 0.55 0.00016 / 155 0.6 6e-05 / 60 0.7 3e-05 / 30
RDKit 6 (bits) 0.31 0.5 0.00021 / 210 0.55 0.00014 / 135 0.6 6e-05 / 60 0.7 3e-05 / 30
RDKit 7 (bits) 0.43 0.55 0.00051 / 510 0.6 8e-05 / 80 0.6 8e-05 / 80 0.7 3e-05 / 30
linear RDKit 4 (bits) 0.35 0.5 0.0015 / 1470 0.55 0.00083 / 830 0.6 0.00019 / 190 0.7 5e-05 / 50
linear RDKit 5 (bits) 0.31 0.5 0.00046 / 455 0.55 0.00027 / 272 0.6 9e-05 / 90 0.7 3e-05 / 30
linear RDKit 6 (bits) 0.28 0.5 0.00022 / 220 0.5 0.00022 / 220 0.55 0.00014 / 140 0.7 3e-05 / 30
linear RDKit 7 (bits) 0.26 0.45 0.00053 / 535 0.5 0.00013 / 130 0.55 9e-05 / 90 0.65 3.5e-05 / 35
Atom Pairs (counts) 0.27 0.35 0.0037 / 3724 0.35 0.0037 / 3724 0.4 0.00016 / 160 0.5 3e-05 / 30
Topological Torsions (counts) 0.19 0.35 0.00049 / 489 0.4 0.00011 / 110 0.45 7.5e-05 / 75 0.55 2.5e-05 / 25
Atom Pairs (bits) 0.36 0.4 0.01 / 10380 0.45 0.0053 / 5250 0.5 0.00012 / 120 0.55 7e-05 / 70
Topological Torsions (bits) 0.22 0.4 0.00016 / 160 0.4 0.00016 / 160 0.45 0.00011 / 105 0.55 3.5e-05 / 35
Avalon 512 (bits) 0.51 0.65 0.0004 / 400 0.65 0.0004 / 400 0.7 8e-05 / 80 0.8 2e-05 / 20
Avalon 1024 (bits) 0.37 0.55 0.00075 / 750 0.6 0.00014 / 140 0.65 9e-05 / 90 0.75 2.5e-05 / 25
Avalon 512 (counts) 0.42 0.55 0.0028 / 2785 0.6 0.00028 / 280 0.65 0.00016 / 160 0.75 2.5e-05 / 25
Avalon 1024 (counts) 0.38 0.55 0.0012 / 1192 0.6 0.00017 / 170 0.6 0.00017 / 170 0.7 4e-05 / 40

The threshold values are rounded to the nearest 0.05.

Method

I won’t get into heavy detail here, the actual notebooks are linked below.

Similarity between random molecules

The workflow and dataset for this is described in a blog post. The very quick summary is that I generated statistics for the similarity distribution of 25K random pairs of reasonable sized (MW<600) molecules exported from ChEMBL.

This is really the central pillar of the post: how do we pick sets of compounds we can use to quantify similarity search performance?

One obvious possibility is to just take groups of molecules which are known to be active against the same targets. This is the classic similarity-based virtual screening use case and it’s one which has been done a lot in the literature. That’s an interesting (and important) use case and it’s something which I may come back to in a future post, but it requires a connection between chemical similarity and biological activity. That connection (or lack thereof) makes analysis of the threshold results more complex and introduces a significant amount of variability.

Here I want to look at a different use case: searching a database and retrieving compounds which are chemically similar to each other. For this I need to pick groups of chemically similar compounds without actually using a traditional approach to chemical similarity. The approach I used is to assume that the typical medicinal chemistry SAR paper includes a bunch of compounds which come from a small number of chemical series (typically one). These compounds are definitely related to each other and it’s not unreasonable to expect that a similarity search for one should return the others as results.

This led me back to an earlier blog post looking at identifying scaffolds from ChEMBL compounds tested in the same assay (given the structure of ChEMBL, this implies that the compounds are from the same paper). That post includes some pre-filtering of the results to try and get only SAR papers by only keeping assays (papers) where 50-100 compounds were measured. For this post I re-ran that analysis against ChEMBL28 and expanded my search criteria to include IC50 data as well as the Ki data used in the original set. The analysis produced results for 1396 groups (a group is the compounds tested in one assay); for this analysis I further filtered these down to the 1047 groups (70026 compounds in total) where the number of atoms in the scaffold is at least 50% of the average number of atoms for compounds in the group. I further filtered each group to only include compounds which have a substructure match to the fuzzy MCS which was found for the full set. The hope here is that this will limit us to only consider the compounds which are part of the chemical series being reported. This lowers the total number of compounds to 66577 across the 1047 groups.

So given these 1047 groups of chemically related compounds I was ready to start doing some searches.

Determining background retrieval rates

In order to get a sense of how many compounds would be retrieved from a database when using the related compounds, I randomly picked 100K molecules from ChEMBL28 to use as a background. I wanted a representative sample, so I didn’t apply MW filters when doing this selection.

I then queried the background compounds with each molecule in a random subset of the 66K members of the “related compounds” set, counted the number of results each returned for each fingerprint/similarity threshold combination, and did statistics based on those results.

Summarizing the data

Here’s an example of a graphical summary of the results presented in the final notebook listed below:

The violin plots show the distribution of similarity values required to match 50% of the related compound pairs for each of the fingerprints. The dark gray boxes show the noise level for the fingerprints. The red line shows the median fraction of the 100K ChEMBL compounds retrieved when using the median value from the violin plots as a similarity threshold.

The notebooks

Here are the github links for the notebooks I used: