Using custom atom and bond invariants with fingerprint generators

fingerprints
tutorial
More fine-grained control over the details of how a fingerprint is generated.
Published: January 10, 2025

The RDKit fingerprint generators (there is a separate tutorial on using these) let you change the invariants that are used to describe the atoms and/or bonds. This post is a short intro to how to do that.

from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit import DataStructs
import rdkit
print(rdkit.__version__)
2024.09.4

For this blog post I’m going to use the Morgan fingerprint generator, but this approach works for any fingerprint generator.

fpg = rdFingerprintGenerator.GetMorganGenerator()

We’ll start by looking at atom invariants, so construct two molecules that differ in a single atom:

m1 = Chem.MolFromSmiles('c1ccccc1')
m2 = Chem.MolFromSmiles('c1ccccn1')

Generate fingerprints for those and calculate the similarity between them:

fps = [fpg.GetFingerprint(m) for m in (m1,m2)]
print(DataStructs.TanimotoSimilarity(fps[0],fps[1]))
0.2727272727272727
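As a reminder of what that number means: for bit-vector fingerprints, the Tanimoto similarity is the number of bits set in both fingerprints divided by the number of bits set in either. Here is a minimal pure-Python sketch of that formula. The `tanimoto` helper and the bit indices are made up for illustration; they are not the real Morgan bits of these molecules, and returning 1.0 for two empty sets is just a choice made in this sketch.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two sets of on-bit indices."""
    shared = len(bits_a & bits_b)   # bits set in both fingerprints
    either = len(bits_a | bits_b)   # bits set in at least one fingerprint
    # two empty fingerprints: treated as identical in this sketch
    return shared / either if either else 1.0

# hypothetical on-bit indices, purely for illustration
bits_a = {1, 5, 9}
bits_b = {5, 9, 12, 20}

print(tanimoto(bits_a, bits_b))  # 2 shared bits / 5 bits in total -> 0.4
```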

The GetFingerprint() method takes an optional argument, customAtomInvariants, that allows you to provide the atom invariants that are used: a list with one integer per atom, in atom order.

Here’s how that works. We’ll use a simple function that just uses the degree of the atom (its number of explicit neighbors), multiplied by 1000, as its invariant:

atomGen = lambda atom: atom.GetDegree()*1000

# generate fingerprints using the custom invariants:
fps_ats = [fpg.GetFingerprint(m,customAtomInvariants=[atomGen(at) for at in m.GetAtoms()]) for m in (m1,m2)]

print(DataStructs.TanimotoSimilarity(fps_ats[0],fps_ats[1]))
1.0

Now the fingerprints are identical, because the degree-only invariant no longer distinguishes carbon from nitrogen.
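To see where the 1.0 comes from, it can help to look at the invariant lists themselves. This is a small sketch that rebuilds the two molecules and prints the per-atom values; `degree_invariants` is a helper name I made up here, not part of the RDKit.

```python
from rdkit import Chem

m1 = Chem.MolFromSmiles('c1ccccc1')  # benzene
m2 = Chem.MolFromSmiles('c1ccccn1')  # pyridine

def degree_invariants(mol):
    """One invariant per atom, in atom order: degree * 1000."""
    return [atom.GetDegree() * 1000 for atom in mol.GetAtoms()]

inv1 = degree_invariants(m1)
inv2 = degree_invariants(m2)

print(inv1)
print(inv2)
print(inv1 == inv2)  # every ring atom has two explicit neighbors
```

Since the generator only sees these values for the atoms, and the bonds in the two rings are all aromatic, nothing is left to tell the molecules apart.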

We can do the same thing for bond invariants.

To simplify the demonstration, I will just kekulize a copy of the first molecule so that it has alternating single and double bonds instead of aromatic bonds:

m3 = Chem.Mol(m1)
Chem.Kekulize(m3,clearAromaticFlags=True)

By default, the similarity between these is quite low:

fps = [fpg.GetFingerprint(m) for m in (m1,m3)]

print(DataStructs.TanimotoSimilarity(fps[0],fps[1]))
0.14285714285714285

However, if we define a bond invariant which treats all conjugated bonds the same, the fingerprints are identical:

bondGen = lambda bond: 10 if bond.GetIsConjugated() else int(2*bond.GetBondTypeAsDouble())

fps_bnds = [fpg.GetFingerprint(m,customBondInvariants=[bondGen(b) for b in m.GetBonds()]) for m in (m1,m3)]

print(DataStructs.TanimotoSimilarity(fps_bnds[0],fps_bnds[1]))
1.0
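The same kind of check works for the bonds. The sketch below recreates the aromatic and kekulized forms and prints the invariant assigned to each bond; `bond_invariant` is just the lambda from above written as a named function.

```python
from rdkit import Chem

m1 = Chem.MolFromSmiles('c1ccccc1')
m3 = Chem.Mol(m1)                            # work on a copy
Chem.Kekulize(m3, clearAromaticFlags=True)   # aromatic -> single/double

def bond_invariant(bond):
    """10 for any conjugated bond, otherwise twice the bond order."""
    return 10 if bond.GetIsConjugated() else int(2 * bond.GetBondTypeAsDouble())

inv1 = [bond_invariant(b) for b in m1.GetBonds()]
inv3 = [bond_invariant(b) for b in m3.GetBonds()]

print([str(b.GetBondType()) for b in m3.GetBonds()])  # bond types after kekulization
print(inv1)
print(inv3)
print(inv1 == inv3)
```

The bond types differ between the two forms, but the conjugation flags set during sanitization are carried over to the copy, so both molecules end up with the same bond invariants.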

Finally, an example to show how to combine multiple components into the atom invariants. This is a silly one, but it hopefully demonstrates the idea.

The important thing here is to make sure that the different pieces of information are stored in different parts of the invariant, so we multiply the degree by 1000 (a step larger than any possible atomic number) and then add the atomic number.

# define an invariant generator that combines atom degree and atomic number:
atomGen = lambda atom: atom.GetDegree()*1000 + atom.GetAtomicNum()

# generate fingerprints using the custom invariants:
fps_ats = [fpg.GetFingerprint(m,customAtomInvariants=[atomGen(at) for at in m.GetAtoms()]) for m in (m1,m2)]

print(DataStructs.TanimotoSimilarity(fps_ats[0],fps_ats[1]))
0.2727272727272727
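To make the "different parts of the invariant" point concrete, here is a pure-Python sketch (no RDKit needed) comparing the packed form with a naive sum. The names `naive_sum`, `pack`, and `unpack` are made up for this illustration.

```python
def naive_sum(degree, atomic_num):
    """Bad idea: the two values share the same digits and can collide."""
    return degree + atomic_num

def pack(degree, atomic_num):
    """Degree goes in the thousands, atomic number in the lower digits."""
    return degree * 1000 + atomic_num

def unpack(invariant):
    """Recover (degree, atomic_num) from a packed invariant."""
    return divmod(invariant, 1000)

# a degree-2 carbon (Z=6) and a degree-1 nitrogen (Z=7)
print(naive_sum(2, 6), naive_sum(1, 7))  # 8 8 -> indistinguishable
print(pack(2, 6), pack(1, 7))            # 2006 1007 -> distinct
print(unpack(2006))                      # (2, 6)
```

Because the packed value can be split back into its components, two atoms get the same invariant only when both the degree and the atomic number match.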

That’s it for this one, I hope this brief intro was useful!