An approximation to make working with count vectors more efficient
Published
July 6, 2021
Many of the RDKit’s fingerprints are available as either bit vectors or count vectors. Bit vectors track whether or not features appear in a molecule while count vectors track the number of times each feature appears. It seems intuitive that a count vector is a better representation of similarity than bit vectors, but we often use bit vector representations for computational expediency - bit vectors require less memory and are much faster to operate on.
What impact does using bit vectors have on computed similarity values and the ordering of similarities? This notebook attempts to provide at least a partial answer to that question and also examines a strategy for simulating counts using bit vectors. I look at the following fingerprints:
- Morgan 2
- Topological Torsion
- Atom Pair
- RDKit

And I use two sets of compounds:
- Random pairs of compounds taken from this blog post
- Pairs of “related compounds” taken from this blog post
Bit vector similarity vs count-based similarity
Let’s start with two molecules where this makes a big difference:
The calculated similarity with MFP2 and counts is 0.6 while with bits it’s 0.29. That’s easy to understand since with the bit-based fingerprints the long alkyl chains don’t make the large contribution to the similarity that they do when using counts.
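Here's a minimal sketch of why repeated features matter, in plain Python rather than with the RDKit. The feature names and counts are made up for illustration, and the count-based Tanimoto is written in its usual sum-of-minima over sum-of-maxima form:

```python
from collections import Counter


def tanimoto_bits(a, b):
    """Tanimoto on presence/absence: |A & B| / |A | B|."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def tanimoto_counts(a, b):
    """Count-based Tanimoto: sum of minima / sum of maxima over all features."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return num / den if den else 0.0


# Hypothetical feature counts: both molecules share a long chain
# (the 'CH2' feature occurs many times) but have different head groups.
mol_a = Counter({'CH2': 12, 'CH3': 1, 'acid': 1, 'ring_a': 1})
mol_b = Counter({'CH2': 10, 'CH3': 1, 'amine': 1, 'ring_b': 1})

bit_sim = tanimoto_bits(mol_a, mol_b)
count_sim = tanimoto_counts(mol_a, mol_b)
print(f'bits: {bit_sim:.2f}  counts: {count_sim:.2f}')
```

With bits, the shared chain counts as a single feature no matter how long it is, so the differing head groups dominate; with counts, every repeat of the chain feature adds to the overlap.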
To demonstrate that this isn’t all about long chains, here’s another pair where there’s a significant difference:
In this case the count-based similarity is 0.59 while with bits it’s 0.35.
Those were a couple of anecdotes, but let’s look at the differences across the entire datasets:
Here I’ve plotted bit-based similarity vs count-based similarity and included statistics on the correlation in the title. The left plot is for the random compound pairs and the right plot is for the related compound pairs. There are significant differences in similarity here, with the bit vector similarities being consistently lower than the count-based equivalent, but it’s worth pointing out that the rankings of the similarities (as measured by the Spearman rank-order correlation values) are reasonably equivalent, particularly for the related compound pairs.
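For readers who want to see what the three statistics measure, here's a small sketch in plain Python. The plotting code later in this post calls spearmanr, median_absolute_error, and mean_squared_error; the functions below are simplified stand-ins for those, and the similarity values are invented for illustration:

```python
import math
import statistics


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def median_abs_error(x, y):
    """Median of the absolute differences."""
    return statistics.median([abs(a - b) for a, b in zip(x, y)])


def rmse(x, y):
    """Root mean squared error."""
    return math.sqrt(statistics.fmean([(a - b) ** 2 for a, b in zip(x, y)]))


# Invented similarity values for five pairs of molecules
count_sims = [0.10, 0.25, 0.40, 0.60, 0.80]
bit_sims = [0.08, 0.15, 0.30, 0.29, 0.70]

print(f'spearman r={spearman(count_sims, bit_sims):.3f} '
      f'MAE={median_abs_error(count_sims, bit_sims):.3f} '
      f'RMSE={rmse(count_sims, bit_sims):.3f}')
```

The Spearman value only looks at the ordering of the pairs, so it can stay high even when the bit-based similarities are shifted well away from the count-based ones; the two error measures capture that shift.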
The equivalent plots for the RDKit fingerprint show the same qualitative behavior, with the difference that bit vector similarities tend to be higher than count-based similarities:
Simulating counts
The RDKit has a simple mechanism for simulating counts using bit vectors: set multiple bits for each feature where the number of bits set is determined by the count. The approach uses a fixed number of potential bits which each have a threshold value; if the count for the feature is at least the threshold value then the corresponding bit is set. Here’s a schematic illustration for count simulation with four bits and the thresholds 1, 2, 4, and 8:
The example shown, with the first two bits set for feature N, is what we’d get if feature N occurs either 2 or 3 times in a molecule. Note that we aren’t just using a binary representation of the count itself. In that case a feature which is present one time in the first molecule, representation 1000, and two times in the second molecule, representation 0100, would contribute zero to the overall similarity. That’s not desirable.
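The encoding can be sketched in a few lines of plain Python. This is an illustration of the scheme in the schematic, not the RDKit's own code; it reads "the bit is set" as count >= threshold, which is what the 2-or-3 example implies, and the function names are my own:

```python
def simulate_count_bits(count, thresholds=(1, 2, 4, 8)):
    """Return one bit per threshold; a bit is set when count >= threshold."""
    return [1 if count >= t else 0 for t in thresholds]


def binary_bits(count, nbits=4):
    """Plain binary encoding, least-significant bit first (for contrast)."""
    return [(count >> i) & 1 for i in range(nbits)]


def shared_bits(a, b):
    """Number of positions set in both encodings."""
    return sum(x & y for x, y in zip(a, b))


for count in (1, 2, 3, 4, 9):
    print(count, simulate_count_bits(count), binary_bits(count))

# A feature seen once in one molecule and twice in the other:
threshold_overlap = shared_bits(simulate_count_bits(1), simulate_count_bits(2))
binary_overlap = shared_bits(binary_bits(1), binary_bits(2))
print(threshold_overlap, binary_overlap)
```

With the threshold encoding the bits for a smaller count are always a subset of the bits for a larger count, so two molecules that both contain a feature always share at least one bit for it.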
Note that since the count simulation approach uses multiple bits per feature, it decreases the effective length of the fingerprint by a factor equal to the number of bits used. With the default setting of four bits per feature a 2048 bit fingerprint will have the same number of bit collisions as a 512 bit fingerprint without count simulation. This becomes more relevant the more bits a fingerprint tends to set. For example using count simulation to calculate similarity with the RDKit fingerprint, which sets a large number of bits, actually decreases the correlation with the similarity calculated with count vectors (see below for the plot) unless I also increase the overall length of the fingerprint.
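To get a feel for the size of the effect, here's a rough estimate in plain Python. It assumes features are hashed uniformly at random into the available slots, which is a simplification, and the figure of 100 distinct features per molecule is made up for illustration:

```python
def effective_slots(fp_size, bits_per_feature=1):
    """Number of distinct feature slots when each feature uses several bits."""
    return fp_size // bits_per_feature


def expected_collisions(n_features, n_slots):
    """Expected features lost to collisions, assuming uniform random hashing."""
    occupied = n_slots * (1.0 - (1.0 - 1.0 / n_slots) ** n_features)
    return n_features - occupied


n_features = 100  # hypothetical number of distinct features in a molecule
for fp_size, bits_per_feature in ((2048, 1), (2048, 4), (8192, 4)):
    slots = effective_slots(fp_size, bits_per_feature)
    lost = expected_collisions(n_features, slots)
    print(f'fpSize={fp_size} bits/feature={bits_per_feature} '
          f'slots={slots} expected collisions={lost:.2f}')
```

Under this model an 8192 bit fingerprint with four bits per feature has the same number of slots, and so the same expected collisions, as a 2048 bit fingerprint without count simulation.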
Results and discussion
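Here’s a summary of the results for the fingerprints I examine here. MAE is the median absolute error, matching the median_absolute_error call in the plotting code below, and all values compare against the similarity calculated with count vectors: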
Random pairs
| Fingerprint | bits Spearman r | bits MAE | bits RMSE | count-simulation Spearman r | count-simulation MAE | count-simulation RMSE | Note |
|---|---|---|---|---|---|---|---|
| Morgan 2 | 0.84 | 0.097 | 0.10 | 0.90 | 0.024 | 0.036 | |
| Topological torsions | 0.92 | 0.026 | 0.051 | 0.98 | 0.018 | 0.029 | |
| Topological torsions | 0.92 | 0.026 | 0.051 | 0.99 | 0.010 | 0.021 | 8192 bits for count simulation |
| Atom pairs | 0.82 | 0.031 | 0.049 | 0.90 | 0.055 | 0.066 | |
| Atom pairs | 0.82 | 0.031 | 0.049 | 0.96 | 0.014 | 0.023 | 8192 bits for count simulation |
| RDKit | 0.83 | 0.079 | 0.10 | 0.94 | 0.029 | 0.045 | 8192 bits for count simulation |
Related pairs
| Fingerprint | bits Spearman r | bits MAE | bits RMSE | count-simulation Spearman r | count-simulation MAE | count-simulation RMSE | Note |
|---|---|---|---|---|---|---|---|
| Morgan 2 | 0.94 | 0.043 | 0.062 | 0.98 | 0.019 | 0.028 | |
| Topological torsions | 0.90 | 0.050 | 0.079 | 0.98 | 0.021 | 0.035 | |
| Topological torsions | 0.90 | 0.050 | 0.079 | 0.98 | 0.018 | 0.032 | 8192 bits for count simulation |
| Atom pairs | 0.91 | 0.043 | 0.067 | 0.97 | 0.052 | 0.063 | |
| Atom pairs | 0.91 | 0.043 | 0.067 | 0.98 | 0.020 | 0.032 | 8192 bits for count simulation |
| RDKit | 0.91 | 0.077 | 0.11 | 0.98 | 0.034 | 0.053 | 8192 bits for count simulation |
Using the count simulation strategies does improve the match between similarities calculated with bit vectors and those calculated with count vectors. The differences are statistically significant (results not shown here) and large enough to potentially be meaningful. MAE and RMSE values for the various fingerprints typically decrease by at least a factor of two and Spearman rank-order correlation in general increases quite a bit. These conclusions hold for both randomly paired molecules and related pairs with more dramatic differences seen at the lower ends of the similarity scale (the random pairs).
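The "factor of two" statement can be checked directly against the random pairs table. The snippet below just divides the bits MAE by the count-simulation MAE for each row; the numbers are copied from that table, and rows without a note are labeled 2048 on the assumption that they use the 2048 bit count-simulation generator from the code:

```python
# (bits MAE, count-simulation MAE), copied from the "Random pairs" table
random_pairs_mae = {
    'Morgan 2 (2048)': (0.097, 0.024),
    'Topological torsions (2048)': (0.026, 0.018),
    'Topological torsions (8192)': (0.026, 0.010),
    'Atom pairs (2048)': (0.031, 0.055),
    'Atom pairs (8192)': (0.031, 0.014),
    'RDKit (8192)': (0.079, 0.029),
}


def improvement_factor(bits_err, countsim_err):
    """How many times smaller the error is with count simulation (>1 is better)."""
    return bits_err / countsim_err


factors = {name: improvement_factor(*vals)
           for name, vals in random_pairs_mae.items()}
for name, factor in factors.items():
    print(f'{name}: {factor:.1f}x')
```

The one row where the ratio drops below 1 is the atom pair fingerprint with 2048 bits; moving to 8192 bits for count simulation brings it back above 2, which fits the earlier point about the effective fingerprint length.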
Note that this analysis focuses solely on similarity. The extra information added by doing count simulation will most likely also influence the performance of machine learning models built using these fingerprints. But that’s for a future blog post.
The code to reproduce all of this, along with more plots, is below.
2021.09.1pre
Populating the interactive namespace from numpy and matplotlib
Some technical notes:
- This notebook uses a couple of features which did not work properly until v2021.03.4 of the RDKit (which will be released in July).
- Count simulation is only generally available when working with the “new” fingerprint generators, so those are used throughout this notebook.
- Count simulation is used by default for atom pair and topological torsion fingerprints, both with the “new” fingerprint generators and the older fingerprinting functions.
Construct the dataset.
Start with our standard similarity comparison set:
import gzip
# assumed import; the notebook's setup cell is not shown here
from rdkit import Chem
with gzip.open('../data/chembl21_25K.pairs.txt.gz','rt') as inf:
    ls = [x.split() for x in inf.readlines()]
ms = [(Chem.MolFromSmiles(x[1]),Chem.MolFromSmiles(x[3])) for x in ls]
That set is weighted towards lower similarity values, so also get some pairs from the related compounds set:
import pickle
from collections import namedtuple
MCSRes=namedtuple('MCSRes',('smarts','numAtoms','numMols','avgNumMolAtoms','mcsTime'))
data = pickle.load(open('../data/scaffolds_revisited_again.simplified.pkl','rb'))
data2 = pickle.load(open('../data/scaffolds_expanded.simplified.pkl','rb'))
data += data2
# keep only sets where the MCS was at least 50% of the average number of atoms:
keep = [x for x in data if x[2].numAtoms>=np.mean(x[2].avgNumMolAtoms)/2]
len(keep)
import random
random.seed(0xf00d)
related_pairs = []
# keep only molecules matching the MCS:
for i,tpl in enumerate(keep):
    assay,smis,mcs,svg = tpl
    patt = Chem.MolFromSmarts(mcs.smarts)
    smis = [(x,y) for x,y in smis if Chem.MolFromSmiles(y).HasSubstructMatch(patt)]
    ssmis = smis[:]
    random.shuffle(ssmis)
    related_pairs.extend([(x[0],x[1],y[0],y[1]) for x,y in zip(smis,ssmis)][:10])
print(f'{len(related_pairs)} related pairs')
related_ms = [(Chem.MolFromSmiles(x[1]),Chem.MolFromSmiles(x[3])) for x in related_pairs]
10470 related pairs
len(ms)
25000
import random
random.seed(0xf00d)
indices = list(range(len(ms)))
random.shuffle(indices)
random_pairs = [ms[x] for x in indices[:5000]]
indices = list(range(len(related_ms)))
random.shuffle(indices)
related_pairs = [related_ms[x] for x in indices[:5000]]
Performance of similarity comparisons
# assumed imports; the notebook's setup cell is not shown here
from rdkit import DataStructs
from rdkit.Chem import rdFingerprintGenerator
fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2,fpSize=2048,countSimulation=False)
bv_pairs = [(fpgen.GetFingerprint(x[0]),fpgen.GetFingerprint(x[1])) for x in random_pairs]
cv_pairs = [(fpgen.GetCountFingerprint(x[0]),fpgen.GetCountFingerprint(x[1])) for x in random_pairs]
%timeit _ = [DataStructs.TanimotoSimilarity(x,y) for x,y in bv_pairs]
5.06 ms ± 75 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit _ = [DataStructs.TanimotoSimilarity(x,y) for x,y in cv_pairs]
8.37 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Not a huge difference there, but what about a fingerprint which sets a much larger number of bits?
fpgen = rdFingerprintGenerator.GetRDKitFPGenerator(fpSize=2048,countSimulation=False)
bv_pairs = [(fpgen.GetFingerprint(x[0]),fpgen.GetFingerprint(x[1])) for x in random_pairs]
cv_pairs = [(fpgen.GetCountFingerprint(x[0]),fpgen.GetCountFingerprint(x[1])) for x in random_pairs]
%timeit _ = [DataStructs.TanimotoSimilarity(x,y) for x,y in bv_pairs]
6.22 ms ± 404 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit _ = [DataStructs.TanimotoSimilarity(x,y) for x,y in cv_pairs]
189 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Here the performance difference is quite noticeable.
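One way to think about where the gap comes from, sketched in plain Python: a bit vector comparison is a couple of bitwise operations plus bit counting over a fixed-size block, while a sparse count vector comparison has to visit the non-zero entries, so its cost grows with the number of features a fingerprint sets. This is a toy illustration under that assumption and not how the RDKit implements either data structure; the feature ids and counts are invented:

```python
def popcount(x):
    """Number of set bits in a non-negative int."""
    return bin(x).count('1')


def to_bitvect(counts):
    """Pack the feature ids of a {feature_id: count} dict into a Python int."""
    out = 0
    for idx in counts:
        out |= 1 << idx
    return out


def tanimoto_bitvect(a, b):
    """Tanimoto for bit vectors stored as ints: two bitwise ops and two popcounts."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def tanimoto_sparse_counts(a, b):
    """Tanimoto for sparse count vectors: one pass over the non-zero entries."""
    common = sum(min(c, b[k]) for k, c in a.items() if k in b)
    denom = sum(a.values()) + sum(b.values()) - common
    return common / denom if denom else 0.0


# Invented sparse count vectors: {feature_id: count}
cv_a = {1: 3, 5: 1, 9: 2}
cv_b = {1: 1, 5: 1, 12: 4}

bv_sim = tanimoto_bitvect(to_bitvect(cv_a), to_bitvect(cv_b))
cv_sim = tanimoto_sparse_counts(cv_a, cv_b)
print(f'bits: {bv_sim:.2f}  counts: {cv_sim:.2f}')
```

For a fingerprint that sets only a few features per molecule the two approaches do a similar amount of work, which would be consistent with the small difference seen for Morgan 2 and the much larger one for the RDKit fingerprint.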
Morgan 2
fpgen1 = rdFingerprintGenerator.GetMorganGenerator(radius=2,fpSize=2048,countSimulation=False)
fpsims = [DataStructs.TanimotoSimilarity(fpgen1.GetFingerprint(x[0]),fpgen1.GetFingerprint(x[1])) for x in ms]
countsims = [DataStructs.TanimotoSimilarity(fpgen1.GetCountFingerprint(x[0]),fpgen1.GetCountFingerprint(x[1])) for x in ms]
related_fpsims = [DataStructs.TanimotoSimilarity(fpgen1.GetFingerprint(x[0]),fpgen1.GetFingerprint(x[1])) for x in related_ms]
related_countsims = [DataStructs.TanimotoSimilarity(fpgen1.GetCountFingerprint(x[0]),fpgen1.GetCountFingerprint(x[1])) for x in related_ms]
fpgen2 = rdFingerprintGenerator.GetMorganGenerator(radius=2,fpSize=2048,countSimulation=True)
fpsims_countsim = [DataStructs.TanimotoSimilarity(fpgen2.GetFingerprint(x[0]),fpgen2.GetFingerprint(x[1])) for x in ms]
related_fpsims_countsim = [DataStructs.TanimotoSimilarity(fpgen2.GetFingerprint(x[0]),fpgen2.GetFingerprint(x[1])) for x in related_ms]
# pairs with the largest differences between count-based and bit-based similarity
delts = sorted([(countsims[i]-fpsims[i],i) for i in range(len(fpsims))])
print(delts[:5])
print(delts[-5:])
# assumed imports; the notebook's setup cell is not shown here
from scipy.stats import spearmanr
from sklearn.metrics import median_absolute_error, mean_squared_error
fpgen3 = rdFingerprintGenerator.GetMorganGenerator(radius=2,fpSize=8192,countSimulation=True)
fpsims_countsim2 = [DataStructs.TanimotoSimilarity(fpgen3.GetFingerprint(x[0]),fpgen3.GetFingerprint(x[1])) for x in ms]
related_fpsims_countsim2 = [DataStructs.TanimotoSimilarity(fpgen3.GetFingerprint(x[0]),fpgen3.GetFingerprint(x[1])) for x in related_ms]
figsize(18,9)
# left panel: random pairs
subplot(1,2,1)
x,y = countsims,fpsims_countsim2
hexbin(x,y,cmap='Blues',bins='log');
plot((0,1),(0,1),'k');
ylabel('count simulation 8192')
xlabel('count');
sr,p = spearmanr(x,y)
mae = median_absolute_error(x,y)
rmse = sqrt(mean_squared_error(x,y))
title(f'Random pairs, Morgan2 spearman r={sr:.3f} MAE={mae:.3f} RMSE={rmse:.3f}');
# right panel: related pairs
subplot(1,2,2)
x,y = related_countsims,related_fpsims_countsim2
hexbin(x,y,cmap='Blues',bins='log');
plot((0,1),(0,1),'k');
ylabel('count simulation 8192')
xlabel('count');
sr,p = spearmanr(x,y)
mae = median_absolute_error(x,y)
rmse = sqrt(mean_squared_error(x,y))
title(f'Related pairs, Morgan2 spearman r={sr:.3f} MAE={mae:.3f} RMSE={rmse:.3f}');
Topological Torsions
fpgen1 = rdFingerprintGenerator.GetTopologicalTorsionGenerator(fpSize=2048,countSimulation=False)
fpsims = [DataStructs.TanimotoSimilarity(fpgen1.GetFingerprint(x[0]),fpgen1.GetFingerprint(x[1])) for x in ms]
countsims = [DataStructs.TanimotoSimilarity(fpgen1.GetCountFingerprint(x[0]),fpgen1.GetCountFingerprint(x[1])) for x in ms]
related_fpsims = [DataStructs.TanimotoSimilarity(fpgen1.GetFingerprint(x[0]),fpgen1.GetFingerprint(x[1])) for x in related_ms]
related_countsims = [DataStructs.TanimotoSimilarity(fpgen1.GetCountFingerprint(x[0]),fpgen1.GetCountFingerprint(x[1])) for x in related_ms]
fpgen2 = rdFingerprintGenerator.GetTopologicalTorsionGenerator(fpSize=2048,countSimulation=True)
fpsims_countsim = [DataStructs.TanimotoSimilarity(fpgen2.GetFingerprint(x[0]),fpgen2.GetFingerprint(x[1])) for x in ms]
related_fpsims_countsim = [DataStructs.TanimotoSimilarity(fpgen2.GetFingerprint(x[0]),fpgen2.GetFingerprint(x[1])) for x in related_ms]
fpgen3 = rdFingerprintGenerator.GetTopologicalTorsionGenerator(fpSize=8192,countSimulation=True)
fpsims_countsim2 = [DataStructs.TanimotoSimilarity(fpgen3.GetFingerprint(x[0]),fpgen3.GetFingerprint(x[1])) for x in ms]
related_fpsims_countsim2 = [DataStructs.TanimotoSimilarity(fpgen3.GetFingerprint(x[0]),fpgen3.GetFingerprint(x[1])) for x in related_ms]
figsize(18,9)
subplot(1,2,1)
x,y = countsims,fpsims_countsim2
hexbin(x,y,cmap='Blues',bins='log');
plot((0,1),(0,1),'k');
ylabel('count simulation 8192')
xlabel('count');
sr,p = spearmanr(x,y)
mae = median_absolute_error(x,y)
rmse = sqrt(mean_squared_error(x,y))
title(f'Random pairs, TT spearman r={sr:.3f} MAE={mae:.3f} RMSE={rmse:.3f}');
subplot(1,2,2)
x,y = related_countsims,related_fpsims_countsim2
hexbin(x,y,cmap='Blues',bins='log');
plot((0,1),(0,1),'k');
ylabel('count simulation 8192')
xlabel('count');
sr,p = spearmanr(x,y)
mae = median_absolute_error(x,y)
rmse = sqrt(mean_squared_error(x,y))
title(f'Related pairs, TT spearman r={sr:.3f} MAE={mae:.3f} RMSE={rmse:.3f}');
Atom pairs
fpgen1 = rdFingerprintGenerator.GetAtomPairGenerator(fpSize=2048,countSimulation=False)
fpsims = [DataStructs.TanimotoSimilarity(fpgen1.GetFingerprint(x[0]),fpgen1.GetFingerprint(x[1])) for x in ms]
countsims = [DataStructs.TanimotoSimilarity(fpgen1.GetCountFingerprint(x[0]),fpgen1.GetCountFingerprint(x[1])) for x in ms]
related_fpsims = [DataStructs.TanimotoSimilarity(fpgen1.GetFingerprint(x[0]),fpgen1.GetFingerprint(x[1])) for x in related_ms]
related_countsims = [DataStructs.TanimotoSimilarity(fpgen1.GetCountFingerprint(x[0]),fpgen1.GetCountFingerprint(x[1])) for x in related_ms]
fpgen2 = rdFingerprintGenerator.GetAtomPairGenerator(fpSize=2048,countSimulation=True)
fpsims_countsim = [DataStructs.TanimotoSimilarity(fpgen2.GetFingerprint(x[0]),fpgen2.GetFingerprint(x[1])) for x in ms]
related_fpsims_countsim = [DataStructs.TanimotoSimilarity(fpgen2.GetFingerprint(x[0]),fpgen2.GetFingerprint(x[1])) for x in related_ms]
fpgen3 = rdFingerprintGenerator.GetAtomPairGenerator(fpSize=8192,countSimulation=True)
fpsims_countsim2 = [DataStructs.TanimotoSimilarity(fpgen3.GetFingerprint(x[0]),fpgen3.GetFingerprint(x[1])) for x in ms]
related_fpsims_countsim2 = [DataStructs.TanimotoSimilarity(fpgen3.GetFingerprint(x[0]),fpgen3.GetFingerprint(x[1])) for x in related_ms]
figsize(18,9)
subplot(1,2,1)
x,y = countsims,fpsims_countsim2
hexbin(x,y,cmap='Blues',bins='log');
plot((0,1),(0,1),'k');
ylabel('count simulation 8192')
xlabel('count');
sr,p = spearmanr(x,y)
mae = median_absolute_error(x,y)
rmse = sqrt(mean_squared_error(x,y))
title(f'Random pairs, AP spearman r={sr:.3f} MAE={mae:.3f} RMSE={rmse:.3f}');
subplot(1,2,2)
x,y = related_countsims,related_fpsims_countsim2
hexbin(x,y,cmap='Blues',bins='log');
plot((0,1),(0,1),'k');
ylabel('count simulation 8192')
xlabel('count');
sr,p = spearmanr(x,y)
mae = median_absolute_error(x,y)
rmse = sqrt(mean_squared_error(x,y))
title(f'Related pairs, AP spearman r={sr:.3f} MAE={mae:.3f} RMSE={rmse:.3f}');
RDKit Fingerprint
fpgen1 = rdFingerprintGenerator.GetRDKitFPGenerator(fpSize=2048,countSimulation=False)
fpsims = [DataStructs.TanimotoSimilarity(fpgen1.GetFingerprint(x[0]),fpgen1.GetFingerprint(x[1])) for x in ms]
countsims = [DataStructs.TanimotoSimilarity(fpgen1.GetCountFingerprint(x[0]),fpgen1.GetCountFingerprint(x[1])) for x in ms]
related_fpsims = [DataStructs.TanimotoSimilarity(fpgen1.GetFingerprint(x[0]),fpgen1.GetFingerprint(x[1])) for x in related_ms]
related_countsims = [DataStructs.TanimotoSimilarity(fpgen1.GetCountFingerprint(x[0]),fpgen1.GetCountFingerprint(x[1])) for x in related_ms]
fpgen2 = rdFingerprintGenerator.GetRDKitFPGenerator(fpSize=2048,countSimulation=True)
fpsims_countsim = [DataStructs.TanimotoSimilarity(fpgen2.GetFingerprint(x[0]),fpgen2.GetFingerprint(x[1])) for x in ms]
related_fpsims_countsim = [DataStructs.TanimotoSimilarity(fpgen2.GetFingerprint(x[0]),fpgen2.GetFingerprint(x[1])) for x in related_ms]