Colliding bits II, revisited

The impact of bit collisions on machine-learning performance

December 25, 2022

Impact of bit collisions on learning performance

Note: This is a significantly revised version of an earlier post.

In an earlier post I looked at the minimal impact that bit collisions in the RDKit’s Morgan fingerprints have on calculated similarity between molecules. This time I’m going to look at the impact of fingerprint size (as a surrogate for number of collisions) on the performance of machine-learning algorithms.

I will use Datasets II from our model fusion paper to do the analysis. I’ve covered these datasets, which are available as part of the benchmarking platform, in some detail in an earlier post.

While working on this post I updated the benchmarking platform to work with Python 3 (the paper was a long time ago!) and added a few additional learning methods and one additional evaluation metric. Those will be merged back to the github repo soon (probably by the time this blog post actually appears). In the meantime the code is here (if that branch is missing, the code has already been merged to the main repo).

The fingerprints examined here:

  1. MFP2: Morgan fingerprint, radius = 2
  2. MFP3: Morgan fingerprint, radius = 3
  3. RDK5: RDKit fingerprint, max path length = 5
  4. HashAP: atom pairs, using count simulation
  5. HashTT: topological torsions, using count simulation

The benchmarking platform is pre-configured to support a short (1K) and long (16K) form of the fingerprints, so that part was easy.
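To get a feel for why fingerprint length works as a surrogate for the number of collisions, here is a toy simulation. It is not the RDKit's hashing scheme: it simply assigns each distinct feature a uniformly random bit position and counts how many features land on a bit that is already set. The function names and the choice of 200 features per molecule are made up for illustration.

```python
import random

def count_collisions(n_features, fp_size, seed=0):
    """Number of features that land on a bit which is already set.

    Toy model (an assumption, not how the RDKit folds fingerprints):
    each distinct feature gets a uniformly random bit position.
    """
    rng = random.Random(seed)
    occupied = set()
    collisions = 0
    for _ in range(n_features):
        bit = rng.randrange(fp_size)
        if bit in occupied:
            collisions += 1
        else:
            occupied.add(bit)
    return collisions

def expected_collisions(n_features, fp_size):
    """Analytic expectation for the same toy model."""
    expected_set_bits = fp_size * (1 - (1 - 1 / fp_size) ** n_features)
    return n_features - expected_set_bits

# compare the simulated average with the analytic value for both lengths
for size in (1024, 16384):
    simulated = sum(count_collisions(200, size, seed=s) for s in range(200)) / 200
    print(size, round(simulated, 2), round(expected_collisions(200, size), 2))
```

Under this model the 1K fingerprint loses roughly an order of magnitude more features to collisions than the 16K one, which is the effect the rest of the post tries to measure indirectly.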

The learning algorithms:

  1. LR: logistic regression, sklearn implementation
  2. LMNB: Laplacian Naive Bayes, NIBR implementation
  3. NB: Naive Bayes, sklearn implementation. Note that something went wrong with these calculations, so they won’t be included in any summaries.
  4. RF: random forest, sklearn implementation
  5. BRF: balanced random forest, imbalanced-learn implementation
  6. XGB: extreme gradient boosting, XGBoost implementation

Here’s a summary of the results using AUC as a metric (there’s a giant table at the bottom with the other metrics). The result column indicates whether the AUC value for the short fingerprint is usually less than (lt), the same as (same), or greater than (gt) the AUC value for the long fingerprint. The P column provides the P value for the difference (assessed using scipy’s Wilcoxon signed-rank test). The delt column has the median difference between the short-fingerprint AUC and long-fingerprint AUC. The pct_delt column has the median change in AUC relative to the short-fingerprint AUC, expressed as a fraction rather than a percentage (so -0.0142 corresponds to 1.42%).

| alg | fp | metric | result | P | delt | pct_delt |
| --- | -- | ------ | ------ | - | ---- | -------- |
| lr | mfp2 | AUC | lt | 8.68e-51 | -0.011 | -0.0142 |
| lr | mfp3 | AUC | lt | 2.6e-64 | -0.019 | -0.0243 |
| lr | rdk5 | AUC | lt | 2.09e-24 | -0.011 | -0.0143 |
| lr | hashap | AUC | lt | 8.77e-37 | -0.026 | -0.0325 |
| lr | hashtt | AUC | lt | 8.02e-54 | -0.022 | -0.0294 |
| lmnb | mfp2 | AUC | lt | 3.2e-104 | -0.042 | -0.0573 |
| lmnb | mfp3 | AUC | lt | 6.05e-107 | -0.077 | -0.109 |
| lmnb | rdk5 | AUC | lt | 4.32e-60 | -0.071 | -0.193 |
| lmnb | hashap | AUC | gt | 1.67e-18 | 0.031 | 0.0494 |
| lmnb | hashtt | AUC | lt | 7.52e-105 | -0.051 | -0.0767 |
| rf | mfp2 | AUC | lt | 1.45e-102 | -0.087 | -0.132 |
| rf | mfp3 | AUC | lt | 2.23e-92 | -0.074 | -0.115 |
| rf | rdk5 | AUC | gt | 4.27e-11 | 0.015 | 0.0212 |
| rf | hashap | AUC | same | 0.00227 | 0.0036 | 0.00503 |
| rf | hashtt | AUC | lt | 1.25e-63 | -0.046 | -0.0677 |
| brf | mfp2 | AUC | lt | 6.61e-37 | -0.024 | -0.0304 |
| brf | mfp3 | AUC | lt | 9.25e-40 | -0.031 | -0.04 |
| brf | rdk5 | AUC | gt | 2.77e-09 | 0.01 | 0.0131 |
| brf | hashap | AUC | gt | 5.32e-07 | 0.011 | 0.0152 |
| brf | hashtt | AUC | lt | 8.45e-29 | -0.027 | -0.0362 |
| xgb | mfp2 | AUC | same | 0.831 | 0.00097 | 0.00109 |
| xgb | mfp3 | AUC | same | 0.00162 | -0.0051 | -0.00719 |
| xgb | rdk5 | AUC | gt | 1.36e-07 | 0.012 | 0.0157 |
| xgb | hashap | AUC | gt | 1.49e-20 | 0.027 | 0.0358 |
| xgb | hashtt | AUC | same | 0.0265 | -0.0043 | -0.00645 |
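As a small worked example of how the delt and pct_delt columns are defined, here is a sketch that computes both from paired per-split scores. The helper name `summarize_deltas` and the AUC values are invented for illustration; only the definitions (short minus long, and that difference divided by the short-fingerprint value) come from the description above.

```python
import numpy as np

def summarize_deltas(short_vals, long_vals):
    """Median difference and median fractional change, short minus long."""
    short_vals = np.asarray(short_vals, dtype=float)
    long_vals = np.asarray(long_vals, dtype=float)
    delt = short_vals - long_vals
    # fractional change relative to the short-fingerprint value
    pct_delt = delt / short_vals
    return float(np.median(delt)), float(np.median(pct_delt))

# made-up AUC values for three train/test splits, purely for illustration
short_auc = [0.70, 0.80, 0.75]
long_auc = [0.72, 0.81, 0.78]
delt, pct_delt = summarize_deltas(short_auc, long_auc)
print(f'{delt:.2g} {pct_delt:.3g}')
```

Negative values therefore mean that the long fingerprint scored higher than the short one.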

The TL;DR from this: for most methods and fingerprints you can get a small but real improvement in model performance (as measured by AUC) by using the longer fingerprints. There are a few cases, e.g. LMNB+rdk5 and RF+mfp2, where the model built with the longer fingerprint is >10% better. Exceptions to the general rule, which could be interesting to investigate more deeply, include:

- The rdk5 and hashap fingerprints: for this use case it looks like the additional information in the longer fingerprints tends to degrade performance. The “simpler” learning algorithms - lr and lmnb - either don’t show this effect or show it less… a strong suggestion that the problem is overfitting.
- XGB doesn’t show improved performance for the longer fingerprint with any of the fingerprints.

Note that we’ve only looked at one type of data set here - training on reasonably homogeneous active molecules and then testing with diverse actives - so the results could well be different for models built/tested on things like screening data (where both the training and test sets are chemically diverse).

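from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
import numpy as np
# the plotting calls below are used unqualified (pylab-style), so import them explicitly
from matplotlib.pyplot import (boxplot, figure, legend, plot, scatter, subplot,
                               title, xlabel, xlim, ylabel, ylim)

import rdkit
import time

Datasets II results from the benchmarking platform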

Read in the results from running the benchmarking platform:

from collections import defaultdict
import glob
import pickle
import gzip
alld = {}
for fn in glob.glob('../data/bp_results/dsII_validation/ChEMBL/*.pkl.gz'):
    with gzip.open(fn,'rb') as inf:
        d = pickle.load(inf)
    # the assay id is the last underscore-separated piece of the file name
    fn = fn.split('_')[-1].split('.')[0]
    alld[fn] = d
alld.keys()
dict_keys(['100', '10188', '10193', '10260', '10280', '10434', '107', '108', '10980', '11140', '11365', '114', '11489', '11534', '11575', '11631', '121', '12209', '12252', '126', '12952', '130', '13001', '15', '165', '17045', '19905', '25', '259', '43', '51', '61', '65', '72', '87', '90', '93'])

These are the evaluation metrics we’ll use: AUC, EF1, EF5, and AUPRC (area under the precision-recall curve). Note that you have to be careful comparing EF1, EF5, and AUPRC across data sets with different active/inactive ratios, but that doesn’t affect us here since we only compare values within a given data set.
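To make the caveat about active/inactive ratios concrete, here is a sketch of an enrichment-factor calculation. It uses a common definition (hit rate in the top x% of the ranked list divided by the hit rate in the whole set), which is an assumption here; the benchmarking platform's own implementation may differ in details such as rounding. The function names and the data are invented for illustration.

```python
def enrichment_factor(scores, labels, fraction):
    """EF at `fraction` of the ranked list (0.01 for EF1, 0.05 for EF5).

    Assumed definition: hit rate in the top slice divided by the hit rate
    in the whole set. labels are 1 (active) or 0 (inactive).
    """
    n_total = len(labels)
    n_actives = sum(labels)
    n_top = max(1, int(round(fraction * n_total)))
    # rank molecules from highest to lowest score
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_actives = sum(label for _, label in ranked[:n_top])
    return (top_actives / n_top) / (n_actives / n_total)

def perfect_ef1(n_actives, n_total=1000):
    """EF1 for a perfect ranking: every active is scored above every inactive."""
    labels = [1] * n_actives + [0] * (n_total - n_actives)
    scores = list(range(n_total, 0, -1))
    return enrichment_factor(scores, labels, 0.01)

# the best achievable EF1 depends on the fraction of actives in the set
print(perfect_ef1(10), perfect_ef1(100))
```

Even a perfect model gets a very different EF1 in the two cases, which is why these values are only compared within a given data set.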

metrics = ('AUC','AUPRC','EF1','EF5',)

Detailed view of the results for a particular method

Start by reproducing some of the plots we used last time, though we’ll quickly switch to a view which is better suited to doing comparisons.

We’ll just show the plots for logistic regression:

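# per-assay summaries, keyed by assay id and then by metric
means = defaultdict(dict)
medians = defaultdict(dict)
delts = defaultdict(dict)
fdelts = defaultdict(dict)
for assay,d in alld.items():
    for metric in metrics:
        v1 = np.array([x[0] for x in d[metric]['lr_mfp2']])
        v2 = np.array([x[0] for x in d[metric]['lr_lmfp2']])
        # long minus short (sign inferred from the discussion of the min-delta curves below)
        delt = v2-v1
        means[assay][metric] = (np.mean(v1),np.mean(v2))
        medians[assay][metric] = (np.median(v1),np.median(v2))
        delts[assay][metric] = (np.mean(delt),np.min(delt),np.max(delt))
        fdelts[assay][metric] = delt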

Look at the median values of the metrics across the different targets.

for i,metric in enumerate(metrics):
    plot([x[metric][0] for x in medians.values()],c='b',label='mfp2')
    plot([x[metric][1] for x in medians.values()],c='r',label='lmfp2')

With LR, at least, the longer fingerprints (red lines) are, on average, slightly better with each metric.

Using per-target averages here, which are calculated across multiple datasets (papers) per target, isn’t the most accurate way to view the results; it’s more accurate to look at the deltas.

The same behavior is observed in the deltas, though the min delta curves do show that there are times that the longer fingerprints are slightly worse:

for i,metric in enumerate(metrics):
    plot([x[metric][0] for x in delts.values()],c='b')
    plot([x[metric][1] for x in delts.values()],c='r')
    plot([x[metric][2] for x in delts.values()],c='r')
    title(rf'$\Delta$ {metric}')  # raw string avoids the invalid '\D' escape

That’s a bit clearer in a box plot for the deltas:

for i,metric in enumerate(metrics):
    _=boxplot([x[metric] for x in fdelts.values()])
    title(rf'$\Delta$ {metric}')  # raw string avoids the invalid '\D' escape

More direct comparisons

Rather than doing a bunch more of the box and whisker plots showing the results broken down by dataset, let’s look at direct comparisons of the results for the various learning methods. We can do this because we know that all the method+fp combinations were trained/tested using the same data splits.

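def do_scatter_plots(k1,k2,alld=alld,metrics=metrics):
    # per-split values for the two method+fp keys, pooled over all assays
    delts = defaultdict(list)
    v1s = defaultdict(list)
    v2s = defaultdict(list)

    for assay,d in alld.items():
        for metric in metrics:
            v1 = np.array([x[0] for x in d[metric][k1]])
            v2 = np.array([x[0] for x in d[metric][k2]])
            # the accumulation step is missing from this listing; reconstructed from context
            delts[metric].extend(v1-v2)
            v1s[metric].extend(v1)
            v2s[metric].extend(v2)

    nmetrics = len(metrics)
    nrows = nmetrics//2 + nmetrics%2
    ncols = 2
    for i,metric in enumerate(metrics):
        # the plotting calls are missing from this listing; this is a minimal reconstruction
        subplot(nrows,ncols,i+1)
        scatter(v1s[metric],v2s[metric],marker='.')
        xmin,xmax = xlim()
        ymin,ymax = ylim()
        vmin = min(xmin,ymin)
        vmax = max(xmax,ymax)
        # diagonal: points above it are splits where k2 scores higher than k1
        plot((vmin,vmax),(vmin,vmax),'k-')
        title(metric)
        xlabel(k1)
        ylabel(k2)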


Logistic regression


Balanced random forests


Random forests




Laplacian Naive Bayes


Naive Bayes


Something isn’t right with the NB results… I need to look into that

Look at morgan 3

This increases the number of set bits






Look at RDK5

Now we really have a lot of set bits






Again, some of those enrichment plots look suspicious and should be looked into

Are there statistically significant differences?

Use the Wilcoxon signed-rank test to detect whether or not the differences we saw in the plots are statistically significant. We’ll do this for every method and fingerprint and make a table.
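Before the full function, here is a toy illustration of the test on synthetic paired data. The scores are randomly generated (not benchmark results), and the `classify` helper is a made-up name that mirrors the lt/same/gt logic described above: a two-sided test first, then the one-sided tests to decide the direction.

```python
import numpy as np
from scipy import stats

def classify(v1, v2, pthresh=0.001):
    """Return 'lt', 'gt', or 'same' for paired samples v1 and v2."""
    p_two_sided = stats.wilcoxon(v1, v2, alternative='two-sided').pvalue
    if p_two_sided < pthresh:
        # 'less' tests whether v1 - v2 tends to be negative
        if stats.wilcoxon(v1, v2, alternative='less').pvalue < pthresh:
            return 'lt'
        if stats.wilcoxon(v1, v2, alternative='greater').pvalue < pthresh:
            return 'gt'
    return 'same'

rng = np.random.default_rng(42)
# synthetic paired scores: 'long_' is 'short' plus a small, mostly positive shift
short = rng.uniform(0.6, 0.9, size=50)
long_ = short + rng.normal(0.02, 0.01, size=50)

print(classify(short, long_))   # short tends to be lower than long_
print(classify(long_, short))   # same data with the roles swapped
```

Because the test is paired, it only makes sense when both sets of values come from the same train/test splits, which is the case for the benchmarking platform results.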

from scipy import stats
def do_significance(alg,fp,tableOutput=False,showHeader=False,pthresh=0.001,alld=alld,metrics=metrics):
    k1 = f'{alg}_{fp}'
    k2 = f'{alg}_l{fp}'
    if alg == 'lmnb':
        k1 += '_2'
        k2 += '_2'

    v1s = defaultdict(list)
    v2s = defaultdict(list)
    delts = defaultdict(list)
    pct_delts = defaultdict(list)

    if tableOutput and showHeader:
        row = ('alg','fp','metric','result','P','delt','pct_delt')
        divider = ['-'*len(k) for k in row]
        print('| '+' | '.join(row)+' |')
        print('| '+' | '.join(divider)+' |')
    for assay,d in alld.items():
        for metric in metrics:
            v1 = np.array([x[0] for x in d[metric][k1]])
            v2 = np.array([x[0] for x in d[metric][k2]])
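            delt = v1-v2
            # accumulate the per-split values across assays; pct_delt is relative to the short fingerprint
            v1s[metric].extend(v1)
            v2s[metric].extend(v2)
            delts[metric].extend(delt)
            pct_delts[metric].extend(delt/v1)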
    for metric in metrics:
        w2s = stats.wilcoxon(v1s[metric],v2s[metric],alternative='two-sided').pvalue
        wlt = stats.wilcoxon(v1s[metric],v2s[metric],alternative='less').pvalue
        wgt = stats.wilcoxon(v1s[metric],v2s[metric],alternative='greater').pvalue
        result = 'same'
        which = w2s
        if w2s < pthresh:
            if wlt < pthresh:
                result = 'lt'
                which = wlt
            elif wgt < pthresh:
                result = 'gt'
                which = wgt
        row = (alg,fp,metric,result,f'{which:.3g}',f'{np.median(delts[metric]):.2g}',f'{np.median(pct_delts[metric]):.3g}')
        if not tableOutput:
            print(' '.join(row))
        else:
            print('| '+' | '.join(row)+' |')

# we skip 'nb' here since there's clearly something bogus with the results
for metric in metrics:
    print(f'## {metric}')
    for i,alg in enumerate(('lr','lmnb','rf','brf','xgb')):
        for j,fp in enumerate(('mfp2','mfp3','rdk5','hashap','hashtt')):
            do_significance(alg,fp,tableOutput=True,showHeader=(not i + j),metrics=(metric,))
## AUC
| alg | fp | metric | result | P | delt | pct_delt |
| --- | -- | ------ | ------ | - | ---- | -------- |
| lr | mfp2 | AUC | lt | 8.68e-51 | -0.011 | -0.0142 |
| lr | mfp3 | AUC | lt | 2.6e-64 | -0.019 | -0.0243 |
| lr | rdk5 | AUC | lt | 2.09e-24 | -0.011 | -0.0143 |
| lr | hashap | AUC | lt | 8.77e-37 | -0.026 | -0.0325 |
| lr | hashtt | AUC | lt | 8.02e-54 | -0.022 | -0.0294 |
| lmnb | mfp2 | AUC | lt | 3.2e-104 | -0.042 | -0.0573 |
| lmnb | mfp3 | AUC | lt | 6.05e-107 | -0.077 | -0.109 |
| lmnb | rdk5 | AUC | lt | 4.32e-60 | -0.071 | -0.193 |
| lmnb | hashap | AUC | gt | 1.67e-18 | 0.031 | 0.0494 |
| lmnb | hashtt | AUC | lt | 7.52e-105 | -0.051 | -0.0767 |
| rf | mfp2 | AUC | lt | 1.45e-102 | -0.087 | -0.132 |
| rf | mfp3 | AUC | lt | 2.23e-92 | -0.074 | -0.115 |
| rf | rdk5 | AUC | gt | 4.27e-11 | 0.015 | 0.0212 |
| rf | hashap | AUC | same | 0.00227 | 0.0036 | 0.00503 |
| rf | hashtt | AUC | lt | 1.25e-63 | -0.046 | -0.0677 |
| brf | mfp2 | AUC | lt | 6.61e-37 | -0.024 | -0.0304 |
| brf | mfp3 | AUC | lt | 9.25e-40 | -0.031 | -0.04 |
| brf | rdk5 | AUC | gt | 2.77e-09 | 0.01 | 0.0131 |
| brf | hashap | AUC | gt | 5.32e-07 | 0.011 | 0.0152 |
| brf | hashtt | AUC | lt | 8.45e-29 | -0.027 | -0.0362 |
| xgb | mfp2 | AUC | same | 0.831 | 0.00097 | 0.00109 |
| xgb | mfp3 | AUC | same | 0.00162 | -0.0051 | -0.00719 |
| xgb | rdk5 | AUC | gt | 1.36e-07 | 0.012 | 0.0157 |
| xgb | hashap | AUC | gt | 1.49e-20 | 0.027 | 0.0358 |
| xgb | hashtt | AUC | same | 0.0265 | -0.0043 | -0.00645 |

## AUPRC
| alg | fp | metric | result | P | delt | pct_delt |
| --- | -- | ------ | ------ | - | ---- | -------- |
| lr | mfp2 | AUPRC | lt | 9.58e-55 | -0.0079 | -0.0535 |
| lr | mfp3 | AUPRC | lt | 1.72e-71 | -0.011 | -0.0758 |
| lr | rdk5 | AUPRC | lt | 2.93e-38 | -0.0042 | -0.0508 |
| lr | hashap | AUPRC | lt | 1.11e-13 | -0.01 | -0.0857 |
| lr | hashtt | AUPRC | lt | 2.33e-60 | -0.0091 | -0.102 |
| lmnb | mfp2 | AUPRC | lt | 9.34e-31 | -0.009 | -0.101 |
| lmnb | mfp3 | AUPRC | lt | 5.43e-27 | -0.0099 | -0.135 |
| lmnb | rdk5 | AUPRC | gt | 0.000113 | 0.00025 | 0.00953 |
| lmnb | hashap | AUPRC | lt | 1.02e-18 | -0.0052 | -0.114 |
| lmnb | hashtt | AUPRC | lt | 2.39e-17 | -0.0039 | -0.0839 |
| rf | mfp2 | AUPRC | lt | 3.24e-46 | -0.017 | -0.174 |
| rf | mfp3 | AUPRC | lt | 3.09e-59 | -0.019 | -0.211 |
| rf | rdk5 | AUPRC | same | 0.41 | 1.1e-05 | 0.000428 |
| rf | hashap | AUPRC | gt | 2.54e-05 | 0.011 | 0.1 |
| rf | hashtt | AUPRC | lt | 8.39e-42 | -0.011 | -0.14 |
| brf | mfp2 | AUPRC | lt | 1.41e-22 | -0.0086 | -0.0915 |
| brf | mfp3 | AUPRC | lt | 4.72e-35 | -0.012 | -0.119 |
| brf | rdk5 | AUPRC | same | 0.455 | -0.00027 | -0.00484 |
| brf | hashap | AUPRC | gt | 0.00023 | 0.0058 | 0.0642 |
| brf | hashtt | AUPRC | lt | 3.38e-26 | -0.0075 | -0.121 |
| xgb | mfp2 | AUPRC | lt | 1.34e-12 | -0.0044 | -0.0558 |
| xgb | mfp3 | AUPRC | lt | 1.74e-25 | -0.0088 | -0.113 |
| xgb | rdk5 | AUPRC | lt | 1.32e-08 | -0.0034 | -0.0545 |
| xgb | hashap | AUPRC | same | 0.221 | 0.008 | 0.0872 |
| xgb | hashtt | AUPRC | lt | 1.07e-13 | -0.0044 | -0.0792 |

## EF1
| alg | fp | metric | result | P | delt | pct_delt |
| --- | -- | ------ | ------ | - | ---- | -------- |
| lr | mfp2 | EF1 | lt | 4.4e-23 | -1 | -0.0357 |
| lr | mfp3 | EF1 | lt | 1.24e-36 | -1 | -0.0625 |
| lr | rdk5 | EF1 | lt | 7.74e-16 | 0 | 0 |
| lr | hashap | EF1 | lt | 3.05e-06 | -1 | -0.0556 |
| lr | hashtt | EF1 | lt | 5.74e-36 | -1 | -0.0833 |
| lmnb | mfp2 | EF1 | lt | 2.32e-74 | -3 | -0.167 |
| lmnb | mfp3 | EF1 | lt | 9.4e-69 | -3 | -0.25 |
| lmnb | rdk5 | EF1 | gt | 1.97e-58 | 4 | nan |
| lmnb | hashap | EF1 | same | 0.147 | -1 | nan |
| lmnb | hashtt | EF1 | lt | 5.42e-48 | -2 | -0.2 |
| rf | mfp2 | EF1 | lt | 4.12e-11 | -1 | -0.0706 |
| rf | mfp3 | EF1 | lt | 1.94e-14 | -1 | -0.0833 |
| rf | rdk5 | EF1 | same | 0.0619 | 0 | 0 |
| rf | hashap | EF1 | gt | 3.15e-06 | 2 | 0.1 |
| rf | hashtt | EF1 | lt | 1.33e-09 | -1 | -0.0476 |
| brf | mfp2 | EF1 | lt | 2.84e-09 | -1 | -0.0421 |
| brf | mfp3 | EF1 | lt | 4.46e-13 | -1 | -0.0769 |
| brf | rdk5 | EF1 | same | 0.301 | 0 | nan |
| brf | hashap | EF1 | gt | 9.93e-09 | 1 | 0.0779 |
| brf | hashtt | EF1 | lt | 7.97e-14 | -1 | -0.0871 |
| xgb | mfp2 | EF1 | lt | 1.06e-14 | -1 | -0.075 |
| xgb | mfp3 | EF1 | lt | 2.22e-17 | -1 | -0.0625 |
| xgb | rdk5 | EF1 | lt | 4.54e-09 | -1 | nan |
| xgb | hashap | EF1 | same | 0.335 | 1 | 0.0769 |
| xgb | hashtt | EF1 | lt | 6.27e-09 | 0 | 0 |

## EF5
| alg | fp | metric | result | P | delt | pct_delt |
| --- | -- | ------ | ------ | - | ---- | -------- |
| lr | mfp2 | EF5 | lt | 4.32e-23 | -0.2 | -0.0357 |
| lr | mfp3 | EF5 | lt | 3.64e-38 | -0.4 | -0.0559 |
| lr | rdk5 | EF5 | lt | 8.96e-26 | -0.2 | -0.0469 |
| lr | hashap | EF5 | lt | 1.42e-08 | -0.2 | -0.0385 |
| lr | hashtt | EF5 | lt | 6.58e-38 | -0.4 | -0.0769 |
| lmnb | mfp2 | EF5 | lt | 1.44e-82 | -1 | -0.154 |
| lmnb | mfp3 | EF5 | lt | 2.57e-92 | -1.4 | -0.264 |
| lmnb | rdk5 | EF5 | same | 0.725 | -0.4 | nan |
| lmnb | hashap | EF5 | lt | 1.47e-64 | -1 | -0.217 |
| lmnb | hashtt | EF5 | lt | 6.85e-73 | -0.81 | -0.185 |
| rf | mfp2 | EF5 | lt | 3.72e-39 | -0.61 | -0.118 |
| rf | mfp3 | EF5 | lt | 3.14e-26 | -0.4 | -0.0948 |
| rf | rdk5 | EF5 | same | 0.988 | 0 | 0 |
| rf | hashap | EF5 | gt | 6.9e-14 | 0.61 | 0.0769 |
| rf | hashtt | EF5 | lt | 2.61e-19 | -0.4 | -0.0811 |
| brf | mfp2 | EF5 | lt | 4.02e-16 | -0.4 | -0.0476 |
| brf | mfp3 | EF5 | lt | 8.91e-25 | -0.6 | -0.0854 |
| brf | rdk5 | EF5 | same | 0.319 | 0 | 0 |
| brf | hashap | EF5 | gt | 9.01e-07 | 0.4 | 0.0674 |
| brf | hashtt | EF5 | lt | 1.18e-22 | -0.4 | -0.101 |
| xgb | mfp2 | EF5 | lt | 3.14e-05 | -0.2 | -0.0308 |
| xgb | mfp3 | EF5 | lt | 4.57e-13 | -0.4 | -0.0797 |
| xgb | rdk5 | EF5 | same | 0.00432 | -0.2 | -0.0354 |
| xgb | hashap | EF5 | gt | 1.24e-09 | 0.61 | 0.1 |
| xgb | hashtt | EF5 | lt | 0.000141 | -0.2 | -0.0199 |
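With this many rows it can help to tally the results programmatically. The sketch below parses a pipe-delimited table like the ones above and counts lt/same/gt per algorithm. The helper name `parse_rows` is made up, and only the lr and xgb AUC rows are pasted in to keep the example short.

```python
from collections import Counter

# lr and xgb rows copied from the AUC table above
table = """\
| alg | fp | metric | result | P | delt | pct_delt |
| --- | -- | ------ | ------ | - | ---- | -------- |
| lr | mfp2 | AUC | lt | 8.68e-51 | -0.011 | -0.0142 |
| lr | mfp3 | AUC | lt | 2.6e-64 | -0.019 | -0.0243 |
| lr | rdk5 | AUC | lt | 2.09e-24 | -0.011 | -0.0143 |
| lr | hashap | AUC | lt | 8.77e-37 | -0.026 | -0.0325 |
| lr | hashtt | AUC | lt | 8.02e-54 | -0.022 | -0.0294 |
| xgb | mfp2 | AUC | same | 0.831 | 0.00097 | 0.00109 |
| xgb | mfp3 | AUC | same | 0.00162 | -0.0051 | -0.00719 |
| xgb | rdk5 | AUC | gt | 1.36e-07 | 0.012 | 0.0157 |
| xgb | hashap | AUC | gt | 1.49e-20 | 0.027 | 0.0358 |
| xgb | hashtt | AUC | same | 0.0265 | -0.0043 | -0.00645 |
"""

def parse_rows(text):
    """Turn a pipe-delimited table into a list of dicts keyed by column name."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip().startswith('|')]
    header = [cell.strip() for cell in lines[0].strip('|').split('|')]
    rows = []
    for ln in lines[2:]:  # skip the header and the divider line
        cells = [cell.strip() for cell in ln.strip('|').split('|')]
        rows.append(dict(zip(header, cells)))
    return rows

rows = parse_rows(table)
tally = Counter((row['alg'], row['result']) for row in rows)
for (alg, result), count in sorted(tally.items()):
    print(alg, result, count)
```

The same function works on the full printed output if the `## metric` header lines are left in, since any line that does not start with `|` is ignored; in that case each metric's header and divider rows would need to be filtered out as well.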

And now display that:


I am sorely tempted to compare the individual methods and fingerprints to each other as well, but that’s not the point of this post, so I’ll hold that for another time.