from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
import rdkit
print(rdkit.__version__)
2025.03.3
June 27, 2025
This post looks at the options available for controlling how much chemistry perception and checking (sanitization) is done when parsing molecules and how those options interact. This gets pretty deep into the weeds, but I think it’s useful to capture this information somewhere. I definitely need to find the appropriate place in the documentation to put this.
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
import rdkit
print(rdkit.__version__)
2025.03.3
Let’s start by disabling both sanitization and H removal.
This results in a molecule where Hs haven’t been removed and no chemistry perception has been done:
smiles_with_h = 'C1=CC=CC=C1[H]'
params = Chem.SmilesParserParams()
params.removeHs = False
params.sanitize = False
mol = Chem.MolFromSmiles(smiles_with_h, params=params)
mol.Debug()
Atoms:
0 6 C chg: 0 deg: 2 exp: N/A imp: N/A hyb:
1 6 C chg: 0 deg: 2 exp: N/A imp: N/A hyb:
2 6 C chg: 0 deg: 2 exp: N/A imp: N/A hyb:
3 6 C chg: 0 deg: 2 exp: N/A imp: N/A hyb:
4 6 C chg: 0 deg: 2 exp: N/A imp: N/A hyb:
5 6 C chg: 0 deg: 3 exp: N/A imp: N/A hyb:
6 1 H chg: 0 deg: 1 exp: N/A imp: 0 hyb:
Bonds:
0 0->1 order: 2
1 1->2 order: 1
2 2->3 order: 2
3 3->4 order: 1
4 4->5 order: 2
5 5->6 order: 1
6 5->0 order: 1
We can do sanitization without removing Hs:
params = Chem.SmilesParserParams()
params.removeHs = False
params.sanitize = True
mol = Chem.MolFromSmiles(smiles_with_h, params=params)
mol.Debug()
Atoms:
0 6 C chg: 0 deg: 2 exp: 3 imp: 1 hyb: SP2 arom?: 1
1 6 C chg: 0 deg: 2 exp: 3 imp: 1 hyb: SP2 arom?: 1
2 6 C chg: 0 deg: 2 exp: 3 imp: 1 hyb: SP2 arom?: 1
3 6 C chg: 0 deg: 2 exp: 3 imp: 1 hyb: SP2 arom?: 1
4 6 C chg: 0 deg: 2 exp: 3 imp: 1 hyb: SP2 arom?: 1
5 6 C chg: 0 deg: 3 exp: 4 imp: 0 hyb: SP2 arom?: 1
6 1 H chg: 0 deg: 1 exp: 1 imp: 0 hyb: S
Bonds:
0 0->1 order: a conj?: 1 aromatic?: 1
1 1->2 order: a conj?: 1 aromatic?: 1
2 2->3 order: a conj?: 1 aromatic?: 1
3 3->4 order: a conj?: 1 aromatic?: 1
4 4->5 order: a conj?: 1 aromatic?: 1
5 5->6 order: 1
6 5->0 order: a conj?: 1 aromatic?: 1
And we can remove Hs without doing sanitization:
params = Chem.SmilesParserParams()
params.removeHs = True
params.sanitize = False
mol = Chem.MolFromSmiles(smiles_with_h, params=params)
mol.Debug()
Atoms:
0 6 C chg: 0 deg: 2 exp: 3 imp: 1 hyb:
1 6 C chg: 0 deg: 2 exp: 3 imp: 1 hyb:
2 6 C chg: 0 deg: 2 exp: 3 imp: 1 hyb:
3 6 C chg: 0 deg: 2 exp: 3 imp: 1 hyb:
4 6 C chg: 0 deg: 2 exp: 3 imp: 1 hyb:
5 6 C chg: 0 deg: 2 exp: 4 imp: 0 hyb:
Bonds:
0 0->1 order: 2
1 1->2 order: 1
2 2->3 order: 2
3 3->4 order: 1
4 4->5 order: 2
5 5->0 order: 1
As we will see later, the SDF/Mol file parser does not remove Hs when sanitization is not enabled. This inconsistency is the type of thing that is difficult to fix; doing so would break a lot of other peoples’ code.
One of the other steps that’s carried out when parsing molecules is cleaning up bad stereochemistry specifications. There’s no option enabling direct control over this step, but let’s see how the sanitization options affect that.
If we provide a molecule that has chirality indicated on an atom that’s not a chiral center, the standard parsing optios remove the chirality spec:
smiles_with_bad_stereo = "C[C@H](F)C"
params = Chem.SmilesParserParams()
params.sanitize = True
mol = Chem.MolFromSmiles(smiles_with_bad_stereo, params=params)
mol.Debug()
Atoms:
0 6 C chg: 0 deg: 1 exp: 1 imp: 3 hyb: SP3
1 6 C chg: 0 deg: 3 exp: 3 imp: 1 hyb: SP3
2 9 F chg: 0 deg: 1 exp: 1 imp: 0 hyb: SP3
3 6 C chg: 0 deg: 1 exp: 1 imp: 3 hyb: SP3
Bonds:
0 0->1 order: 1
1 1->2 order: 1
2 1->3 order: 1
The stereo cleanup is done even when sanitization is disabled:
params = Chem.SmilesParserParams()
params.sanitize = False
mol = Chem.MolFromSmiles(smiles_with_bad_stereo, params=params)
mol.Debug()
Atoms:
0 6 C chg: 0 deg: 1 exp: 1 imp: 3 hyb:
1 6 C chg: 0 deg: 3 exp: 3 imp: 1 hyb:
2 9 F chg: 0 deg: 1 exp: 1 imp: 0 hyb:
3 6 C chg: 0 deg: 1 exp: 1 imp: 3 hyb:
Bonds:
0 0->1 order: 1
1 1->2 order: 1
2 1->3 order: 1
Notice that, in this case, explicit and implicit valences have been assigned even though sanitization is disabled. This happens because the AssignStereochemistry()
function needs valence information. In this case the values are calculated, but there’s no check to make sure that they make chemical sense (as would happen during normal sanitization):
valence_too_high = "C[C@H](FC)C"
params = Chem.SmilesParserParams()
params.sanitize = False
mol = Chem.MolFromSmiles(valence_too_high, params=params)
mol.Debug()
Atoms:
0 6 C chg: 0 deg: 1 exp: 1 imp: 3 hyb:
1 6 C chg: 0 deg: 3 exp: 3 imp: 1 hyb:
2 9 F chg: 0 deg: 2 exp: 2 imp: 0 hyb:
3 6 C chg: 0 deg: 1 exp: 1 imp: 3 hyb:
4 6 C chg: 0 deg: 1 exp: 1 imp: 3 hyb:
Bonds:
0 0->1 order: 1
1 1->2 order: 1
2 2->3 order: 1
3 1->4 order: 1
Here the two-valent F atom (atom 2) was accepted without complaint.
Skipping both sanitization and H removal also skips the call to AssignStereochemistry()
, so the bad chirality specification is not removed (and no valences are calculated):
params = Chem.SmilesParserParams()
params.sanitize = False
params.removeHs = False
mol = Chem.MolFromSmiles(smiles_with_bad_stereo, params=params)
mol.Debug()
Atoms:
0 6 C chg: 0 deg: 1 exp: N/A imp: N/A hyb:
1 6 C chg: 0 deg: 3 exp: N/A imp: 0 hyb: chi: CCW nbrs:[0 2 3]
2 9 F chg: 0 deg: 1 exp: N/A imp: N/A hyb:
3 6 C chg: 0 deg: 1 exp: N/A imp: N/A hyb:
Bonds:
0 0->1 order: 1
1 1->2 order: 1
2 1->3 order: 1
molb_with_h = '''
RDKit 2D
0 0 0 0 0 0 0 0 0 0999 V3000
M V30 BEGIN CTAB
M V30 COUNTS 7 7 0 0 0
M V30 BEGIN ATOM
M V30 1 C 1.500000 0.000000 0.000000 0
M V30 2 C 0.750000 -1.299038 0.000000 0
M V30 3 C -0.750000 -1.299038 0.000000 0
M V30 4 C -1.500000 0.000000 0.000000 0
M V30 5 C -0.750000 1.299038 0.000000 0
M V30 6 C 0.750000 1.299038 0.000000 0
M V30 7 H 1.500000 2.598076 0.000000 0
M V30 END ATOM
M V30 BEGIN BOND
M V30 1 2 1 2
M V30 2 1 2 3
M V30 3 2 3 4
M V30 4 1 4 5
M V30 5 2 5 6
M V30 6 1 6 7
M V30 7 1 6 1
M V30 END BOND
M V30 END CTAB
M END'''
molb_with_bad_stereo = '''
RDKit 2D
0 0 0 0 0 0 0 0 0 0999 V3000
M V30 BEGIN CTAB
M V30 COUNTS 4 3 0 0 0
M V30 BEGIN ATOM
M V30 1 C 0.000000 0.000000 0.000000 0
M V30 2 C 1.299038 0.750000 0.000000 0
M V30 3 F 1.299038 2.250000 0.000000 0
M V30 4 C 2.598076 -0.000000 0.000000 0
M V30 END ATOM
M V30 BEGIN BOND
M V30 1 1 2 1 CFG=3
M V30 2 1 2 3
M V30 3 1 2 4
M V30 END BOND
M V30 END CTAB
M END'''
As with SMILES, we can disable sanitization and H removal and then very little chemistry perception or cleanup is done:
Atoms:
0 6 C chg: 0 deg: 2 exp: 3 imp: N/A hyb:
1 6 C chg: 0 deg: 2 exp: 3 imp: N/A hyb:
2 6 C chg: 0 deg: 2 exp: 3 imp: N/A hyb:
3 6 C chg: 0 deg: 2 exp: 3 imp: N/A hyb:
4 6 C chg: 0 deg: 2 exp: 3 imp: N/A hyb:
5 6 C chg: 0 deg: 3 exp: 4 imp: N/A hyb:
6 1 H chg: 0 deg: 1 exp: 1 imp: N/A hyb:
Bonds:
0 0->1 order: 2
1 1->2 order: 1
2 2->3 order: 2
3 3->4 order: 1
4 4->5 order: 2
5 5->6 order: 1
6 5->0 order: 1
Unlike with the SMILES parser, setting removeHs to true with sanitize set to false has no effect:
Atoms:
0 6 C chg: 0 deg: 2 exp: 3 imp: N/A hyb:
1 6 C chg: 0 deg: 2 exp: 3 imp: N/A hyb:
2 6 C chg: 0 deg: 2 exp: 3 imp: N/A hyb:
3 6 C chg: 0 deg: 2 exp: 3 imp: N/A hyb:
4 6 C chg: 0 deg: 2 exp: 3 imp: N/A hyb:
5 6 C chg: 0 deg: 3 exp: 4 imp: N/A hyb:
6 1 H chg: 0 deg: 1 exp: 1 imp: N/A hyb:
Bonds:
0 0->1 order: 2
1 1->2 order: 1
2 2->3 order: 2
3 3->4 order: 1
4 4->5 order: 2
5 5->6 order: 1
6 5->0 order: 1
We can, of course sanitize without removing Hs:
Atoms:
0 6 C chg: 0 deg: 2 exp: 3 imp: 1 hyb: SP2 arom?: 1
1 6 C chg: 0 deg: 2 exp: 3 imp: 1 hyb: SP2 arom?: 1
2 6 C chg: 0 deg: 2 exp: 3 imp: 1 hyb: SP2 arom?: 1
3 6 C chg: 0 deg: 2 exp: 3 imp: 1 hyb: SP2 arom?: 1
4 6 C chg: 0 deg: 2 exp: 3 imp: 1 hyb: SP2 arom?: 1
5 6 C chg: 0 deg: 3 exp: 4 imp: 0 hyb: SP2 arom?: 1
6 1 H chg: 0 deg: 1 exp: 1 imp: 0 hyb: S
Bonds:
0 0->1 order: a conj?: 1 aromatic?: 1
1 1->2 order: a conj?: 1 aromatic?: 1
2 2->3 order: a conj?: 1 aromatic?: 1
3 3->4 order: a conj?: 1 aromatic?: 1
4 4->5 order: a conj?: 1 aromatic?: 1
5 5->6 order: 1
6 5->0 order: a conj?: 1 aromatic?: 1
Now let’s look at the mol block with the bad stereo specification.
As before, without sanitization, the invalid chirality is assigned:
Atoms:
0 6 C chg: 0 deg: 1 exp: 1 imp: N/A hyb:
1 6 C chg: 0 deg: 3 exp: 4 imp: 0 hyb: chi: CCW nbrs:[0 2 3]
2 9 F chg: 0 deg: 1 exp: 1 imp: 0 hyb:
3 6 C chg: 0 deg: 1 exp: 1 imp: 3 hyb:
Bonds:
0 1->0 order: 1
1 1->2 order: 1
2 1->3 order: 1
But doing sanitization results in AssignStereochemistry()
being called and the invalid stereo label is removed:
Atoms:
0 6 C chg: 0 deg: 1 exp: 1 imp: 3 hyb: SP3
1 6 C chg: 0 deg: 3 exp: 3 imp: 1 hyb: SP3
2 9 F chg: 0 deg: 1 exp: 1 imp: 0 hyb: SP3
3 6 C chg: 0 deg: 1 exp: 1 imp: 3 hyb: SP3
Bonds:
0 1->0 order: 1
1 1->2 order: 1
2 1->3 order: 1
If you want to read molecules from either SMILES or a Mol block without either sanitization or H removal, do some checks and standardizations of your own, and then end up with a molecule with the same information as if you had parsed it normally, the steps are: 1. Parse without sanitization or H removal. 2. Remove Hs if you want to 3. Sanitize if you haven’t removed Hs (by default RemoveHs()
will sanitize) 4. Call AssignStereochemistry()
Here’s an example of that:
params = Chem.SmilesParserParams()
params.sanitize = False
params.removeHs = False
mol = Chem.MolFromSmiles(smiles_with_h, params=params)
#
# do whatever cleanup and preprocessing you want to do
#
mol = Chem.RemoveHs(mol)
Chem.AssignStereochemistry(mol,force=True, cleanIt=True, flagPossibleStereoCenters=True)
mol.Debug()
Atoms:
0 6 C chg: 0 deg: 2 exp: 3 imp: 1 hyb: SP2 arom?: 1
1 6 C chg: 0 deg: 2 exp: 3 imp: 1 hyb: SP2 arom?: 1
2 6 C chg: 0 deg: 2 exp: 3 imp: 1 hyb: SP2 arom?: 1
3 6 C chg: 0 deg: 2 exp: 3 imp: 1 hyb: SP2 arom?: 1
4 6 C chg: 0 deg: 2 exp: 3 imp: 1 hyb: SP2 arom?: 1
5 6 C chg: 0 deg: 2 exp: 3 imp: 1 hyb: SP2 arom?: 1
Bonds:
0 0->1 order: a conj?: 1 aromatic?: 1
1 1->2 order: a conj?: 1 aromatic?: 1
2 2->3 order: a conj?: 1 aromatic?: 1
3 3->4 order: a conj?: 1 aromatic?: 1
4 4->5 order: a conj?: 1 aromatic?: 1
5 5->0 order: a conj?: 1 aromatic?: 1