from rdkit import Chem
import rdkit
print(rdkit.__version__)
2024.09.4
January 24, 2025
RDKit molecules, atoms, bonds, conformers, and reactions support what we call the property interface, a mechanism for storing arbitrary data. It is used a lot internally, but it can also be very useful in other code. This post provides a quick overview of how properties work and what you can do with them.
The properties are stored in a key:value data structure (similar to a dictionary in Python). The keys must be strings but the values can be various types.
One obvious use of properties is to store the additional data found in an SDF file on the molecule. Here’s an example of that:
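The SDF file used for the outputs in this post is not shown here, so the following is only a minimal sketch that builds a one-record SDF in memory instead. The tags EXAMPLE_ID and EXAMPLE_WEIGHT are made up for illustration; with a real file you would pass its path to Chem.SDMolSupplier().

```python
from rdkit import Chem

# Build a one-record SDF in memory: a mol block followed by two data items.
# EXAMPLE_ID and EXAMPLE_WEIGHT are hypothetical tags, not from a real file.
mol_block = Chem.MolToMolBlock(Chem.MolFromSmiles('CCO'))
sdf_text = (
    mol_block
    + '>  <EXAMPLE_ID>\n42\n\n'
    + '>  <EXAMPLE_WEIGHT>\n46.07\n\n'
    + '$$$$\n'
)

suppl = Chem.SDMolSupplier()
suppl.SetData(sdf_text)
m = suppl[0]

# Each SD data item is now available as a property on the molecule
names = list(m.GetPropNames())
print(names)
```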
We can get a list of the properties present:
['PUBCHEM_COMPOUND_CID',
'PUBCHEM_COMPOUND_CANONICALIZED',
'PUBCHEM_CACTVS_COMPLEXITY',
'PUBCHEM_CACTVS_HBOND_ACCEPTOR',
'PUBCHEM_CACTVS_HBOND_DONOR',
'PUBCHEM_CACTVS_ROTATABLE_BOND',
'PUBCHEM_CACTVS_SUBSKEYS',
'PUBCHEM_IUPAC_OPENEYE_NAME',
'PUBCHEM_IUPAC_CAS_NAME',
'PUBCHEM_IUPAC_NAME_MARKUP',
'PUBCHEM_IUPAC_NAME',
'PUBCHEM_IUPAC_SYSTEMATIC_NAME',
'PUBCHEM_IUPAC_TRADITIONAL_NAME',
'PUBCHEM_IUPAC_INCHI',
'PUBCHEM_IUPAC_INCHIKEY',
'PUBCHEM_XLOGP3_AA',
'PUBCHEM_EXACT_MASS',
'PUBCHEM_MOLECULAR_FORMULA',
'PUBCHEM_MOLECULAR_WEIGHT',
'PUBCHEM_OPENEYE_CAN_SMILES',
'PUBCHEM_OPENEYE_ISO_SMILES',
'PUBCHEM_CACTVS_TPSA',
'PUBCHEM_MONOISOTOPIC_WEIGHT',
'PUBCHEM_TOTAL_CHARGE',
'PUBCHEM_HEAVY_ATOM_COUNT',
'PUBCHEM_ATOM_DEF_STEREO_COUNT',
'PUBCHEM_ATOM_UDEF_STEREO_COUNT',
'PUBCHEM_BOND_DEF_STEREO_COUNT',
'PUBCHEM_BOND_UDEF_STEREO_COUNT',
'PUBCHEM_ISOTOPIC_ATOM_COUNT',
'PUBCHEM_COMPONENT_COUNT',
'PUBCHEM_CACTVS_TAUTO_COUNT',
'PUBCHEM_COORDINATE_TYPE',
'PUBCHEM_BONDANNOTATIONS']
And then retrieve the property values themselves with GetProp():
GetProp() returns the property values as strings, but we can also get them as specific types by asking for the type:
The retrieval functions currently supported on molecules are:
- GetProp() -> string
- GetDoubleProp() -> floating point
- GetIntProp() -> integer
- GetUnsignedProp() -> unsigned integer
- GetBoolProp() -> boolean
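As a small illustration of the typed getters listed above, here is a sketch using made-up property names. The values are stored as strings, which is how they arrive when read from an SDF, and then retrieved as specific types.

```python
from rdkit import Chem

m = Chem.MolFromSmiles('CCO')
# Store both values as strings, as they would be after reading an SDF.
# The property names are hypothetical.
m.SetProp('EXAMPLE_ID', '42')
m.SetProp('EXAMPLE_WEIGHT', '46.07')

as_string = m.GetProp('EXAMPLE_ID')            # string
as_int = m.GetIntProp('EXAMPLE_ID')            # converted to an integer
as_float = m.GetDoubleProp('EXAMPLE_WEIGHT')   # converted to a float
print(repr(as_string), as_int, as_float)
```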
It’s possible to retrieve all of the properties, with the correct types, in one call:
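A minimal sketch of that call on a small molecule, again with hypothetical property names. String values that look like numbers come back as numbers:

```python
from rdkit import Chem

m = Chem.MolFromSmiles('CCO')
# Hypothetical properties, all stored as strings
m.SetProp('EXAMPLE_ID', '42')
m.SetProp('EXAMPLE_WEIGHT', '46.07')
m.SetProp('EXAMPLE_NAME', 'ethanol')

# Returns a Python dict; numeric-looking strings are converted
props = m.GetPropsAsDict()
print(props)
```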
{'PUBCHEM_COMPOUND_CID': 4500001,
'PUBCHEM_COMPOUND_CANONICALIZED': 1,
'PUBCHEM_CACTVS_COMPLEXITY': 626,
'PUBCHEM_CACTVS_HBOND_ACCEPTOR': 4,
'PUBCHEM_CACTVS_HBOND_DONOR': 1,
'PUBCHEM_CACTVS_ROTATABLE_BOND': 7,
'PUBCHEM_CACTVS_SUBSKEYS': 'AAADceB7oABHAAAAAAAAAAAAGAAAAWAAAAAwYAAAAAAAAAAB0AAAHgYYAAAADQrF2ySz0IfMEAiqAidydACS0AthB7AdykA4ZoiIKCLBm5HEIAhgnALIyAcQgMAOhABQAAKAABQIAKAABQAAKAAAAAAAAA==',
'PUBCHEM_IUPAC_OPENEYE_NAME': '2-[[5-[2-(4-chlorophenyl)cyclopropyl]-4-ethyl-1,2,4-triazol-3-yl]sulfanyl]-N-(2,4,5-trichlorophenyl)acetamide',
'PUBCHEM_IUPAC_CAS_NAME': '2-[[5-[2-(4-chlorophenyl)cyclopropyl]-4-ethyl-1,2,4-triazol-3-yl]thio]-N-(2,4,5-trichlorophenyl)acetamide',
'PUBCHEM_IUPAC_NAME_MARKUP': '2-[[5-[2-(4-chlorophenyl)cyclopropyl]-4-ethyl-1,2,4-triazol-3-yl]sulfanyl]-<I>N</I>-(2,4,5-trichlorophenyl)acetamide',
'PUBCHEM_IUPAC_NAME': '2-[[5-[2-(4-chlorophenyl)cyclopropyl]-4-ethyl-1,2,4-triazol-3-yl]sulfanyl]-N-(2,4,5-trichlorophenyl)acetamide',
'PUBCHEM_IUPAC_SYSTEMATIC_NAME': '2-[[5-[2-(4-chlorophenyl)cyclopropyl]-4-ethyl-1,2,4-triazol-3-yl]sulfanyl]-N-[2,4,5-tris(chloranyl)phenyl]ethanamide',
'PUBCHEM_IUPAC_TRADITIONAL_NAME': '2-[[5-[2-(4-chlorophenyl)cyclopropyl]-4-ethyl-1,2,4-triazol-3-yl]thio]-N-(2,4,5-trichlorophenyl)acetamide',
'PUBCHEM_IUPAC_INCHI': 'InChI=1S/C21H18Cl4N4OS/c1-2-29-20(14-7-13(14)11-3-5-12(22)6-4-11)27-28-21(29)31-10-19(30)26-18-9-16(24)15(23)8-17(18)25/h3-6,8-9,13-14H,2,7,10H2,1H3,(H,26,30)',
'PUBCHEM_IUPAC_INCHIKEY': 'JGFAHFCLMPEOJM-UHFFFAOYSA-N',
'PUBCHEM_XLOGP3_AA': 6,
'PUBCHEM_EXACT_MASS': 515.992593,
'PUBCHEM_MOLECULAR_FORMULA': 'C21H18Cl4N4OS',
'PUBCHEM_MOLECULAR_WEIGHT': 516.3,
'PUBCHEM_OPENEYE_CAN_SMILES': 'CCN1C(=NN=C1SCC(=O)NC2=CC(=C(C=C2Cl)Cl)Cl)C3CC3C4=CC=C(C=C4)Cl',
'PUBCHEM_OPENEYE_ISO_SMILES': 'CCN1C(=NN=C1SCC(=O)NC2=CC(=C(C=C2Cl)Cl)Cl)C3CC3C4=CC=C(C=C4)Cl',
'PUBCHEM_CACTVS_TPSA': 85.1,
'PUBCHEM_MONOISOTOPIC_WEIGHT': 513.995543,
'PUBCHEM_TOTAL_CHARGE': 0,
'PUBCHEM_HEAVY_ATOM_COUNT': 31,
'PUBCHEM_ATOM_DEF_STEREO_COUNT': 0,
'PUBCHEM_ATOM_UDEF_STEREO_COUNT': 2,
'PUBCHEM_BOND_DEF_STEREO_COUNT': 0,
'PUBCHEM_BOND_UDEF_STEREO_COUNT': 0,
'PUBCHEM_ISOTOPIC_ATOM_COUNT': 0,
'PUBCHEM_COMPONENT_COUNT': 1,
'PUBCHEM_CACTVS_TAUTO_COUNT': -1,
'PUBCHEM_COORDINATE_TYPE': '1\n5\n255',
'PUBCHEM_BONDANNOTATIONS': '11 14 3\n12 15 3\n15 16 8\n15 17 8\n16 20 8\n17 21 8\n20 22 8\n21 22 8\n26 27 8\n26 28 8\n27 30 8\n28 29 8\n29 31 8\n30 31 8\n7 14 8\n7 19 8\n8 14 8\n8 9 8\n9 19 8'}
You can check whether or not a property is there:
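The check is done with HasProp(). The sketch below also wraps it in a small helper, get_prop_or_default, which is a hypothetical convenience function and not part of the RDKit.

```python
from rdkit import Chem

m = Chem.MolFromSmiles('CCO')
m.SetProp('EXAMPLE_ID', '42')  # hypothetical property name

present = bool(m.HasProp('EXAMPLE_ID'))
absent = bool(m.HasProp('foo'))
print(present, absent)

def get_prop_or_default(mol, name, default=None):
    """Hypothetical helper: return the property as a string, or a default if it is absent."""
    return mol.GetProp(name) if mol.HasProp(name) else default

value = get_prop_or_default(m, 'foo', default='n/a')
```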
Asking for a property that’s not present throws an exception:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[9], line 1
----> 1 m.GetProp('foo')

KeyError: 'foo'
And you can remove properties:
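Removal is done with ClearProp(). A short sketch with a made-up property name:

```python
from rdkit import Chem

m = Chem.MolFromSmiles('CCO')
m.SetProp('EXAMPLE_ID', '42')  # hypothetical property name
before = bool(m.HasProp('EXAMPLE_ID'))

# Remove the property from the molecule
m.ClearProp('EXAMPLE_ID')
after = bool(m.HasProp('EXAMPLE_ID'))
print(before, after)
```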
Properties whose names start with an underscore (_) are considered to be private, and any property can be marked as computed. Private and computed properties are not returned by default by calls to GetPropNames() or GetPropsAsDict() for molecules.
One frequently used private property is _Name, which is read from the header of mol files:
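To make the connection to the mol file header concrete, this sketch sets _Name by hand, writes a mol block, and reads it back. The name 'ethanol' is just an example value.

```python
from rdkit import Chem

m = Chem.MolFromSmiles('CCO')
m.SetProp('_Name', 'ethanol')

# The name ends up on the first line of the mol block
mol_block = Chem.MolToMolBlock(m)
header_line = mol_block.split('\n')[0]

# Parsing the mol block puts the header line back into _Name
m2 = Chem.MolFromMolBlock(mol_block)
name = m2.GetProp('_Name')
print(header_line, name)
```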
You can see the full list of property names by passing the includePrivate and includeComputed flags to GetPropNames() or GetPropsAsDict():
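A minimal sketch of the two flags on a small molecule. The property names used here are hypothetical.

```python
from rdkit import Chem

m = Chem.MolFromSmiles('CCC')
m.SetProp('public', 'a')                   # regular property
m.SetProp('_private', 'b')                 # private: name starts with an underscore
m.SetProp('derived', 'c', computed=True)   # marked as computed

default_names = set(m.GetPropNames())
all_names = set(m.GetPropNames(includePrivate=True, includeComputed=True))
print(sorted(default_names))
print(sorted(all_names))
```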
['__computedProps',
'_Name',
'_MolFileInfo',
'_MolFileComments',
'_MolFileChiralFlag',
'numArom',
'_StereochemDone',
'PUBCHEM_COMPOUND_CID',
'PUBCHEM_COMPOUND_CANONICALIZED',
'PUBCHEM_CACTVS_COMPLEXITY',
'PUBCHEM_CACTVS_HBOND_ACCEPTOR',
'PUBCHEM_CACTVS_HBOND_DONOR',
'PUBCHEM_CACTVS_ROTATABLE_BOND',
'PUBCHEM_CACTVS_SUBSKEYS',
'PUBCHEM_IUPAC_OPENEYE_NAME',
'PUBCHEM_IUPAC_CAS_NAME',
'PUBCHEM_IUPAC_NAME_MARKUP',
'PUBCHEM_IUPAC_NAME',
'PUBCHEM_IUPAC_SYSTEMATIC_NAME',
'PUBCHEM_IUPAC_TRADITIONAL_NAME',
'PUBCHEM_IUPAC_INCHI',
'PUBCHEM_IUPAC_INCHIKEY',
'PUBCHEM_XLOGP3_AA',
'PUBCHEM_EXACT_MASS',
'PUBCHEM_MOLECULAR_FORMULA',
'PUBCHEM_MOLECULAR_WEIGHT',
'PUBCHEM_OPENEYE_CAN_SMILES',
'PUBCHEM_OPENEYE_ISO_SMILES',
'PUBCHEM_CACTVS_TPSA',
'PUBCHEM_MONOISOTOPIC_WEIGHT',
'PUBCHEM_TOTAL_CHARGE',
'PUBCHEM_ATOM_DEF_STEREO_COUNT',
'PUBCHEM_ATOM_UDEF_STEREO_COUNT',
'PUBCHEM_BOND_DEF_STEREO_COUNT',
'PUBCHEM_BOND_UDEF_STEREO_COUNT',
'PUBCHEM_ISOTOPIC_ATOM_COUNT',
'PUBCHEM_COMPONENT_COUNT',
'PUBCHEM_CACTVS_TAUTO_COUNT',
'PUBCHEM_COORDINATE_TYPE',
'PUBCHEM_BONDANNOTATIONS']
I’m demonstrating this for molecules, but the same thing works for the other types.
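For instance, atoms and bonds expose the same setters and getters. A short sketch with made-up property names:

```python
from rdkit import Chem

m = Chem.MolFromSmiles('CCO')

# Atoms support the same typed setters as molecules
atom = m.GetAtomWithIdx(0)
atom.SetProp('atom_label', 'first carbon')
atom.SetIntProp('atom_rank', 1)

# So do bonds
bond = m.GetBondWithIdx(0)
bond.SetDoubleProp('bond_weight', 0.5)

atom_props = m.GetAtomWithIdx(0).GetPropsAsDict()
bond_props = m.GetBondWithIdx(0).GetPropsAsDict()
print(atom_props['atom_label'], atom_props['atom_rank'], bond_props['bond_weight'])
```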
m = Chem.MolFromSmiles('CCC')
m.SetProp('prop1','val1')
m.SetIntProp('prop2',2)
m.SetDoubleProp('prop3',3.14159)
m.GetPropsAsDict()
{'prop1': 'val1', 'prop2': 2, 'prop3': 3.14159}
m.SetProp('computed1','val', computed=True)
m.SetProp('_private1','val', computed=False)
m.GetPropsAsDict()
{'prop1': 'val1', 'prop2': 2, 'prop3': 3.14159}
m.GetPropsAsDict(includeComputed=True)
{'numArom': 0,
'prop1': 'val1',
'prop2': 2,
'prop3': 3.14159,
'computed1': 'val'}
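The copying and serialization examples in the rest of this post operate on a molecule that also carries an atom property. A minimal sketch of such a setup is shown here. The molecule and the property values are inferred from the outputs in this post, and the value of the atom property ('val') is an assumption, since it never appears in any output.

```python
from rdkit import Chem

m = Chem.MolFromSmiles('CC')
m.SetProp('prop1', 'v1')
m.SetProp('computed1', 'v2', computed=True)
# The atom property value is not visible in the outputs; 'val' is an assumed placeholder
m.GetAtomWithIdx(0).SetProp('aprop', 'val')

mol_names = list(m.GetPropNames(includeComputed=True))
atom_names = list(m.GetAtomWithIdx(0).GetPropNames(includeComputed=True))
print('mol:', mol_names)
print('atom:', atom_names)
```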
Properties are copied when molecules are copied, either using the RDKit’s recommended approach:
m2 = Chem.Mol(m)
print('mol:',list(m2.GetPropNames(includeComputed=True)))
print('atom:',list(m2.GetAtomWithIdx(0).GetPropNames(includeComputed=True)))
mol: ['numArom', 'prop1', 'computed1']
atom: ['aprop']
Or using the copy module:
import copy
m2 = copy.deepcopy(m)
print('mol:',list(m2.GetPropNames(includeComputed=True)))
print('atom:',list(m2.GetAtomWithIdx(0).GetPropNames(includeComputed=True)))
mol: ['numArom', 'prop1', 'computed1']
atom: ['aprop']
Properties are not, by default, captured when molecules are serialized (converted to binary):
m2 = Chem.Mol(m.ToBinary())
print('mol:',list(m2.GetPropNames(includeComputed=True)))
print('atom:',list(m2.GetAtomWithIdx(0).GetPropNames(includeComputed=True)))
mol: []
atom: []
But you can change this:
m2 = Chem.Mol(m.ToBinary(Chem.PropertyPickleOptions.AllProps))
print('mol:',list(m2.GetPropNames(includeComputed=True)))
print('atom:',list(m2.GetAtomWithIdx(0).GetPropNames(includeComputed=True)))
mol: ['numArom', 'prop1', 'computed1']
atom: ['aprop']
And finally, Python’s pickling tool does not serialize properties by default:
import pickle
m2 = pickle.loads(pickle.dumps(m))
print('mol:',list(m2.GetPropNames(includeComputed=True)))
print('atom:',list(m2.GetAtomWithIdx(0).GetPropNames(includeComputed=True)))
mol: []
atom: []
But you can change this with a global variable:
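A minimal sketch of that setting, using Chem.SetDefaultPickleProperties(). Note that this changes behavior for the whole Python session. The molecule and property values are example data, and the atom property value is an assumption.

```python
import pickle
from rdkit import Chem

m = Chem.MolFromSmiles('CC')
m.SetProp('prop1', 'v1')
m.SetProp('computed1', 'v2', computed=True)
m.GetAtomWithIdx(0).SetProp('aprop', 'val')  # assumed example value

# Session-wide setting: include all properties when molecules are pickled
Chem.SetDefaultPickleProperties(Chem.PropertyPickleOptions.AllProps)

m2 = pickle.loads(pickle.dumps(m))
mol_names = list(m2.GetPropNames(includeComputed=True))
atom_names = list(m2.GetAtomWithIdx(0).GetPropNames(includeComputed=True))
print('mol:', mol_names)
print('atom:', atom_names)
```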
Both the SDWriter and the SmilesWriter can write properties.
The SDWriter will by default write all non-private properties (including computed properties):
     RDKit          2D
  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.2990    0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
M  END
>  <prop1>  (1)
v1
>  <computed1>  (1)
v2
$$$$
But you can control which properties are written:
from io import StringIO
sio = StringIO()
with Chem.SDWriter(sio) as w:
    w.SetProps(['prop1'])
    w.write(m)
print(sio.getvalue())
     RDKit          2D
  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.2990    0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
M  END
>  <prop1>  (1)
v1
$$$$
The SmilesWriter doesn’t write properties by default, but we can tell it to:
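A minimal sketch, mirroring the SDWriter example: SetProps() selects which properties are added as extra columns. The molecule and property values are assumed example data.

```python
from io import StringIO
from rdkit import Chem

m = Chem.MolFromSmiles('CC')
m.SetProp('prop1', 'v1')
m.SetProp('computed1', 'v2', computed=True)

sio = StringIO()
with Chem.SmilesWriter(sio) as w:
    # Ask for prop1 to be written alongside the SMILES and name columns
    w.SetProps(['prop1'])
    w.write(m)
smiles_text = sio.getvalue()
print(smiles_text)
```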