ChEMBL Document Similarity

exploration
similarity
Using compound similarity to define document similarity
Published

February 2, 2025

For a couple of other ideas I want to explore, I need some “clumpy” sets of compounds: sets containing a number of separate clusters of different sizes and cohesiveness (I’m not sure that’s the right word; I mean how tightly packed each cluster is). As I almost always do, I’m going to do this using data from ChEMBL, and I will use groups of compounds from individual med chem papers as my clumps. Since I want clumpiness at different similarity ranges, I want to be able to find groups of “related” papers: papers with similar compounds. This post is a first step towards finding those related papers.

There’s an old ChEMBL blog post about document similarity. Not too long after that, I did a blog post with a related approach. This time, instead of looking at the number of highly similar compounds between papers, I’m going to look at the median similarity between the compounds in the papers.
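The core of that measure is straightforward; here’s a minimal sketch (the helper name and variables are illustrative, not the exact code used below): collect the Tanimoto similarity of every compound in one document to every compound in the other, then take the median.

# Minimal sketch of the measure (illustrative helper, not the code used below)
import numpy as np
from rdkit import DataStructs

def median_cross_similarity(fps_a, fps_b):
    # all cross-document Tanimoto similarities, summarized by their median
    sims = []
    for fp in fps_a:
        sims.extend(DataStructs.BulkTanimotoSimilarity(fp, fps_b))
    return np.median(sims)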

I start by comparing the distribution of similarities between compounds within med chem papers to the distribution of similarities between papers, in order to demonstrate that there is a significant difference between them. To do this I pick a random sample of 20000 documents with between 20 and 100 compounds and use the usual Morgan fingerprint with radius 2 (there are plots below for radius 3 as well).

Here are the distributions of the median compound similarities within the papers and the median similarities between 500K random pairs of papers:

[Figure: distributions of median intra-paper and inter-paper compound similarities, with the “random” and “related” threshold lines]

The plot also contains lines showing the threshold values for similarity between random compounds and related compounds taken from this blog post. Unsurprisingly, the intra-paper similarity values tend to be significantly higher than the inter-paper similarity values. The intra-paper similarity values tend to be above the “related” threshold value while the inter-paper values are below the “random” threshold value.

Moving on to finding related documents: one simple measure of document similarity is to count how many compounds (exact matches) the two documents have in common. This isn’t particularly interesting for my purposes, so I chose to focus on pairs of documents that have a high fraction of compound similarities above the “related” threshold. Here’s the overall distribution of the fraction of exact matches, along with the fractions of pairs above the “random” and “related” thresholds, for a set of 20 million random document pairs:

[Figure: distributions of the exact-match fraction and the fractions of compound pairs above the “random” and “related” thresholds]
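For a single pair of documents, that “related” fraction can be computed along these lines (an illustrative sketch; the helper name is mine, and the 0.4 threshold is the mfp2 “related” value used later in the notebook):

# Illustrative sketch (not the exact code below): the fraction of cross-document
# compound pairs whose Tanimoto similarity is above the "related" threshold
def related_fraction(fps_a, fps_b, threshold=0.4):
    n_related = 0
    for fp in fps_a:
        sims = DataStructs.BulkTanimotoSimilarity(fp, fps_b)
        n_related += sum(1 for s in sims if s > threshold)
    return n_related / (len(fps_a) * len(fps_b))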

Looking through the papers with the highest “related” fraction shows that they contain peptides or other large molecules. This is less useful for me, so I’m going to come back to this in a follow-up post where I either limit the size of the compounds considered or use a count-based fingerprint to calculate similarity so that larger molecules don’t automatically have higher similarity scores.
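For reference, a count-based comparison along those lines might look something like this (just a sketch of the idea, not the code I’ll use in the follow-up):

# Sketch of a count-based comparison: count fingerprints track how many times
# each bit is set, so large molecules don't automatically look similar just
# because they set a lot of bits.
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

def count_fp_similarity(mol_a, mol_b, radius=2):
    gen = rdFingerprintGenerator.GetMorganGenerator(radius=radius)
    return DataStructs.TanimotoSimilarity(gen.GetCountFingerprint(mol_a),
                                          gen.GetCountFingerprint(mol_b))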

from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import rdDepictor
rdDepictor.SetPreferCoordGen(True)

from rdkit import DataStructs

import numpy as np

from matplotlib import pyplot as plt
plt.style.use('tableau-colorblind10')
%matplotlib inline
import psycopg2
%load_ext sql

import rdkit
print(rdkit.__version__)
2024.09.4
%config SqlMagic.feedback = False

Data collection

Start by getting a collection of documents with between 20 and 100 compounds; this is my normal heuristic for trying to get med chem papers.

%sql postgresql://localhost/chembl_35 \
    select doc_id,count(distinct molregno) cnt into temporary table doc_compound_counts \
      from docs join compound_records using (doc_id) join compound_structures using (molregno) \
      group by (doc_id);
[]
%sql \
    select count(*) from doc_compound_counts where cnt>=20 and cnt<=100;
 * postgresql://localhost/chembl_35
count
34821
docs = %sql \
   select doc_id from doc_compound_counts tablesample bernoulli(80) repeatable (8723346) \
    where cnt>=20 and cnt<=100 limit 20000;
len(docs)
 * postgresql://localhost/chembl_35
20000

Generate fingerprints for the compounds from the papers

fpg2 = rdFingerprintGenerator.GetMorganGenerator(radius=2)
fpg3 = rdFingerprintGenerator.GetMorganGenerator(radius=3)

# pull the molecules for each document from the database and generate
# radius-2 and radius-3 Morgan fingerprints
fps2 = {}
fps3 = {}
for doc_id in docs:
    doc_id = doc_id[0]
    pkls = %sql postgresql://localhost/chembl_35 \
        select molregno,mol_to_pkl(m) from rdk.mols join compound_records using (molregno) \
        where doc_id=:doc_id
    ms = [Chem.Mol(x[1].tobytes()) for x in pkls]
    fps2[doc_id] = fpg2.GetFingerprints(ms)
    fps3[doc_id] = fpg3.GetFingerprints(ms)

Intra- and inter-paper similarity distributions

Calculate intra-paper similarity distributions

# for each document: all pairwise intra-document similarities, summarized as
# the 0.5/0.9/0.95/0.99 quantiles
sims2 = {}
sims3 = {}
for idx,doc_id in enumerate(docs):
    if not idx%1000:
        print(f'Done {idx}')
    doc_id = doc_id[0]
    a = []
    fps = fps2[doc_id]
    for i in range(1,len(fps)):
        a.extend(DataStructs.BulkTanimotoSimilarity(fps[i],fps[0:i]))
    sims2[doc_id] = np.quantile(a,[0.5,0.9,0.95,0.99])
    a = []
    fps = fps3[doc_id]
    for i in range(1,len(fps)):
        a.extend(DataStructs.BulkTanimotoSimilarity(fps[i],fps[0:i]))
    sims3[doc_id] = np.quantile(a,[0.5,0.9,0.95,0.99])
    
Done 0
Done 1000
Done 2000
Done 3000
Done 4000
Done 5000
Done 6000
Done 7000
Done 8000
Done 9000
Done 10000
Done 11000
Done 12000
Done 13000
Done 14000
Done 15000
Done 16000
Done 17000
Done 18000
Done 19000
plt.figure(figsize=(8,8))

plt.subplot(2,1,1)
plt.hist([[x[0] for x in sims2.values()],[x[0] for x in sims3.values()]],bins=20,label=(['mfp2','mfp3']));
plt.xlabel('median intra-doc similarity');
plt.legend();

plt.subplot(2,1,2)
plt.hist([[x[2] for x in sims2.values()],[x[2] for x in sims3.values()]],bins=20,label=(['mfp2','mfp3']));
plt.xlabel('95% intra-doc similarity');
plt.legend();

Now calculate the inter-paper similarity distributions; I’ll choose 500,000 random pairs for this:

target = 500000

import random
import math
import time
# from: https://stackoverflow.com/a/55245866
# decode() maps a linear index i into the (row, column) coordinates of the
# lower triangle of a pairwise matrix, so rand_pairs() can sample m distinct
# pairs from the n*(n-1)/2 possibilities without enumerating all of them.
def decode(i):
    k = math.floor((1+math.sqrt(1+8*i))/2)
    return k,i-k*(k-1)//2

def rand_pairs(n,m):
    return [decode(i) for i in random.sample(range(n*(n-1)//2),m)]

pairs = rand_pairs(len(docs),target)
pairs = [(docs[i][0],docs[j][0]) for i,j in pairs]
       
        
# for each random document pair: quantiles of the cross-document similarities,
# plus counts of compound pairs above the "noise" and "related" thresholds
inter_sims2 = {}
related_sims2 = {}
inter_sims3 = {}
related_sims3 = {}
for idx in range(target):
    if not idx%(target//20):
        print(f'Done {idx}')
    di,dj = pairs[idx]
   
    a = []
    fpsi = fps2[di]
    fpsj = fps2[dj]
    for i in range(len(fpsi)):
        a.extend(DataStructs.BulkTanimotoSimilarity(fpsi[i],fpsj))
    inter_sims2[(di,dj)] = np.quantile(a,[0.5,0.9,0.95,0.99])
    related_sims2[(di,dj)] = (len([1 for x in a if x>0.27]),len([1 for x in a if x>0.4]),len(fpsi),len(fpsj))

    a = []
    fpsi = fps3[di]
    fpsj = fps3[dj]
    for i in range(len(fpsi)):
        a.extend(DataStructs.BulkTanimotoSimilarity(fpsi[i],fpsj))
    inter_sims3[(di,dj)] = np.quantile(a,[0.5,0.9,0.95,0.99])
    related_sims3[(di,dj)] = (len([1 for x in a if x>0.22]),len([1 for x in a if x>0.3]),len(fpsi),len(fpsj) )
Done 0
Done 25000
Done 50000
Done 75000
Done 100000
Done 125000
Done 150000
Done 175000
Done 200000
Done 225000
Done 250000
Done 275000
Done 300000
Done 325000
Done 350000
Done 375000
Done 400000
Done 425000
Done 450000
Done 475000

Compare the intra- and inter-paper distributions.

The line labeled “noise” in the plot below is the “noise level” for the fingerprint: the 95% threshold for similarity between random compounds. The dashed line labeled “related” is the threshold level for compounds which are related to each other. You can find more information about these definitions in this blog post.
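As a reminder of where a number like the 0.27 “noise” level comes from: the idea is to look at similarities between randomly chosen (and so presumably unrelated) compounds and take a high quantile as the noise level. Roughly (this is a sketch of the idea, not the exact procedure from that post):

# Rough sketch (not the exact procedure from the earlier post): estimate the
# "noise level" as the 95th percentile of similarities between randomly
# chosen compound pairs.
import random
random.seed(0xf00d)
all_fps = [fp for doc_fps in fps2.values() for fp in doc_fps]
rand_sims = [DataStructs.TanimotoSimilarity(*random.sample(all_fps, 2))
             for _ in range(50000)]
noise_level = np.quantile(rand_sims, 0.95)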

plt.figure(figsize=(8,8))

plt.subplot(2,1,1)
plt.hist([[x[0] for x in sims2.values()],[x[0] for x in inter_sims2.values()]],bins=20,
         label=(['intra-document','inter-document']),density=True);
ymax = plt.ylim()[1]
plt.plot([0.27,0.27],[0,ymax],'k-',label='noise')
plt.plot([0.4,0.4],[0,ymax],'k--',label='related')
plt.xlabel('median mfp2 similarity');
plt.legend();

plt.subplot(2,1,2)
plt.hist([[x[2] for x in sims2.values()],[x[2] for x in inter_sims2.values()]],bins=20,
         label=(['intra-document','inter-document']),density=True);
ymax = plt.ylim()[1]
plt.plot([0.22,0.22],[0,ymax],'k-',label='noise')
plt.plot([0.35,0.35],[0,ymax],'k--',label='related')
plt.xlabel('95% mfp2 similarity');
plt.legend();

Get just the first of those plots, because that’s what we’ll use in the summary above:

plt.figure(figsize=(8,4))

plt.hist([[x[0] for x in sims2.values()],[x[0] for x in inter_sims2.values()]],bins=20,
         label=(['intra-document','inter-document']),density=True);
ymax = plt.ylim()[1]
plt.plot([0.27,0.27],[0,ymax],'k-',label='noise')
plt.plot([0.4,0.4],[0,ymax],'k--',label='related')
plt.xlabel('median mfp2 similarity');
plt.legend();

np.quantile([x[0] for x in sims2.values()],[0.5,0.9,0.95,0.99])
array([0.48979592, 0.65753425, 0.7       , 0.78703901])
np.quantile([x[0] for x in inter_sims2.values()],[0.5,0.9,0.95,0.99])
array([0.10227273, 0.14      , 0.15306122, 0.18390805])