A tutorial on properties

tutorial
Storing arbitrary data on molecules, atoms, bonds, etc.
Published January 24, 2025

RDKit molecules, atoms, bonds, conformers, and reactions support what we call the property interface: a way of storing arbitrary data that is used a lot internally but that can also be very useful in other code. This post provides a quick overview of how properties work and what you can do with them.

from rdkit import Chem

import rdkit
print(rdkit.__version__)
2024.09.4

Property basics

The properties are stored in a key:value data structure (similar to a dictionary in Python). The keys must be strings but the values can be various types.

One obvious use of properties is to store the additional data found in an SDF file on the molecule. Here’s an example of that:

import gzip
with gzip.open('/scratch/Data/PubChem/Compound_004500001_005000000.sdf.gz') as inf:
    suppl = Chem.ForwardSDMolSupplier(inf)
    ms = [next(suppl) for x in range(10)]
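
The PubChem file above is a local download, so here is a minimal, self-contained sketch of the same idea that needs no external file. It is only an illustration: the molecule, the property names `MW` and `HEAVY_ATOMS`, and the in-memory round trip through `SDWriter` and `SDMolSupplier.SetData()` are assumptions made up for this example, not part of the PubChem workflow.

```python
from io import StringIO
from rdkit import Chem

# Build a tiny SD record in memory (hypothetical data, not from PubChem).
mol = Chem.MolFromSmiles('CCO')
mol.SetProp('_Name', 'ethanol')     # ends up in the mol block header line
mol.SetProp('MW', '46.07')          # becomes an SD data field
mol.SetProp('HEAVY_ATOMS', '3')     # becomes an SD data field

sio = StringIO()
with Chem.SDWriter(sio) as w:
    w.write(mol)
sdf_text = sio.getvalue()

# Read the record back; the data fields show up as molecule properties.
suppl = Chem.SDMolSupplier()
suppl.SetData(sdf_text)
mols = [m for m in suppl if m is not None]

read_back = mols[0]
names = sorted(read_back.GetPropNames())
print(names)
print(read_back.GetProp('MW'))
```

The `if m is not None` filter is there because suppliers yield `None` for records they cannot parse.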

We can get a list of the properties present:

m = ms[0]
list(m.GetPropNames())
['PUBCHEM_COMPOUND_CID',
 'PUBCHEM_COMPOUND_CANONICALIZED',
 'PUBCHEM_CACTVS_COMPLEXITY',
 'PUBCHEM_CACTVS_HBOND_ACCEPTOR',
 'PUBCHEM_CACTVS_HBOND_DONOR',
 'PUBCHEM_CACTVS_ROTATABLE_BOND',
 'PUBCHEM_CACTVS_SUBSKEYS',
 'PUBCHEM_IUPAC_OPENEYE_NAME',
 'PUBCHEM_IUPAC_CAS_NAME',
 'PUBCHEM_IUPAC_NAME_MARKUP',
 'PUBCHEM_IUPAC_NAME',
 'PUBCHEM_IUPAC_SYSTEMATIC_NAME',
 'PUBCHEM_IUPAC_TRADITIONAL_NAME',
 'PUBCHEM_IUPAC_INCHI',
 'PUBCHEM_IUPAC_INCHIKEY',
 'PUBCHEM_XLOGP3_AA',
 'PUBCHEM_EXACT_MASS',
 'PUBCHEM_MOLECULAR_FORMULA',
 'PUBCHEM_MOLECULAR_WEIGHT',
 'PUBCHEM_OPENEYE_CAN_SMILES',
 'PUBCHEM_OPENEYE_ISO_SMILES',
 'PUBCHEM_CACTVS_TPSA',
 'PUBCHEM_MONOISOTOPIC_WEIGHT',
 'PUBCHEM_TOTAL_CHARGE',
 'PUBCHEM_HEAVY_ATOM_COUNT',
 'PUBCHEM_ATOM_DEF_STEREO_COUNT',
 'PUBCHEM_ATOM_UDEF_STEREO_COUNT',
 'PUBCHEM_BOND_DEF_STEREO_COUNT',
 'PUBCHEM_BOND_UDEF_STEREO_COUNT',
 'PUBCHEM_ISOTOPIC_ATOM_COUNT',
 'PUBCHEM_COMPONENT_COUNT',
 'PUBCHEM_CACTVS_TAUTO_COUNT',
 'PUBCHEM_COORDINATE_TYPE',
 'PUBCHEM_BONDANNOTATIONS']

And then retrieve the property values themselves with GetProp():

m.GetProp('PUBCHEM_MOLECULAR_WEIGHT')
'516.3'

GetProp() returns the property values as strings, but we can also get them as specific types by asking for the type:

m.GetDoubleProp('PUBCHEM_MOLECULAR_WEIGHT')
516.3
m.GetIntProp('PUBCHEM_HEAVY_ATOM_COUNT')
31

The retrieval functions currently supported on molecules are:

- GetProp() -> string
- GetDoubleProp() -> floating point
- GetIntProp() -> integer
- GetUnsignedProp() -> unsigned integer
- GetBoolProp() -> boolean
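
As a quick illustration of the typed getters listed above, here is a small sketch on a molecule built from SMILES. The property names (`weight`, `heavy_atoms`, `rank`, `is_solvent`) are hypothetical; the first two are stored as strings to mimic what comes out of an SD file.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles('CCO')

# Values stored as strings, as they are when read from an SD file
mol.SetProp('weight', '46.07')
mol.SetProp('heavy_atoms', '3')

# Values stored with an explicit type
mol.SetUnsignedProp('rank', 7)
mol.SetBoolProp('is_solvent', True)

as_text = mol.GetProp('weight')           # returned as a string
as_float = mol.GetDoubleProp('weight')    # the string is parsed to a float
as_int = mol.GetIntProp('heavy_atoms')    # the string is parsed to an int
rank = mol.GetUnsignedProp('rank')
flag = mol.GetBoolProp('is_solvent')
print(as_text, as_float, as_int, rank, flag)
```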

It’s possible to retrieve all of the properties, with the correct types, in one call:

m.GetPropsAsDict()
{'PUBCHEM_COMPOUND_CID': 4500001,
 'PUBCHEM_COMPOUND_CANONICALIZED': 1,
 'PUBCHEM_CACTVS_COMPLEXITY': 626,
 'PUBCHEM_CACTVS_HBOND_ACCEPTOR': 4,
 'PUBCHEM_CACTVS_HBOND_DONOR': 1,
 'PUBCHEM_CACTVS_ROTATABLE_BOND': 7,
 'PUBCHEM_CACTVS_SUBSKEYS': 'AAADceB7oABHAAAAAAAAAAAAGAAAAWAAAAAwYAAAAAAAAAAB0AAAHgYYAAAADQrF2ySz0IfMEAiqAidydACS0AthB7AdykA4ZoiIKCLBm5HEIAhgnALIyAcQgMAOhABQAAKAABQIAKAABQAAKAAAAAAAAA==',
 'PUBCHEM_IUPAC_OPENEYE_NAME': '2-[[5-[2-(4-chlorophenyl)cyclopropyl]-4-ethyl-1,2,4-triazol-3-yl]sulfanyl]-N-(2,4,5-trichlorophenyl)acetamide',
 'PUBCHEM_IUPAC_CAS_NAME': '2-[[5-[2-(4-chlorophenyl)cyclopropyl]-4-ethyl-1,2,4-triazol-3-yl]thio]-N-(2,4,5-trichlorophenyl)acetamide',
 'PUBCHEM_IUPAC_NAME_MARKUP': '2-[[5-[2-(4-chlorophenyl)cyclopropyl]-4-ethyl-1,2,4-triazol-3-yl]sulfanyl]-<I>N</I>-(2,4,5-trichlorophenyl)acetamide',
 'PUBCHEM_IUPAC_NAME': '2-[[5-[2-(4-chlorophenyl)cyclopropyl]-4-ethyl-1,2,4-triazol-3-yl]sulfanyl]-N-(2,4,5-trichlorophenyl)acetamide',
 'PUBCHEM_IUPAC_SYSTEMATIC_NAME': '2-[[5-[2-(4-chlorophenyl)cyclopropyl]-4-ethyl-1,2,4-triazol-3-yl]sulfanyl]-N-[2,4,5-tris(chloranyl)phenyl]ethanamide',
 'PUBCHEM_IUPAC_TRADITIONAL_NAME': '2-[[5-[2-(4-chlorophenyl)cyclopropyl]-4-ethyl-1,2,4-triazol-3-yl]thio]-N-(2,4,5-trichlorophenyl)acetamide',
 'PUBCHEM_IUPAC_INCHI': 'InChI=1S/C21H18Cl4N4OS/c1-2-29-20(14-7-13(14)11-3-5-12(22)6-4-11)27-28-21(29)31-10-19(30)26-18-9-16(24)15(23)8-17(18)25/h3-6,8-9,13-14H,2,7,10H2,1H3,(H,26,30)',
 'PUBCHEM_IUPAC_INCHIKEY': 'JGFAHFCLMPEOJM-UHFFFAOYSA-N',
 'PUBCHEM_XLOGP3_AA': 6,
 'PUBCHEM_EXACT_MASS': 515.992593,
 'PUBCHEM_MOLECULAR_FORMULA': 'C21H18Cl4N4OS',
 'PUBCHEM_MOLECULAR_WEIGHT': 516.3,
 'PUBCHEM_OPENEYE_CAN_SMILES': 'CCN1C(=NN=C1SCC(=O)NC2=CC(=C(C=C2Cl)Cl)Cl)C3CC3C4=CC=C(C=C4)Cl',
 'PUBCHEM_OPENEYE_ISO_SMILES': 'CCN1C(=NN=C1SCC(=O)NC2=CC(=C(C=C2Cl)Cl)Cl)C3CC3C4=CC=C(C=C4)Cl',
 'PUBCHEM_CACTVS_TPSA': 85.1,
 'PUBCHEM_MONOISOTOPIC_WEIGHT': 513.995543,
 'PUBCHEM_TOTAL_CHARGE': 0,
 'PUBCHEM_HEAVY_ATOM_COUNT': 31,
 'PUBCHEM_ATOM_DEF_STEREO_COUNT': 0,
 'PUBCHEM_ATOM_UDEF_STEREO_COUNT': 2,
 'PUBCHEM_BOND_DEF_STEREO_COUNT': 0,
 'PUBCHEM_BOND_UDEF_STEREO_COUNT': 0,
 'PUBCHEM_ISOTOPIC_ATOM_COUNT': 0,
 'PUBCHEM_COMPONENT_COUNT': 1,
 'PUBCHEM_CACTVS_TAUTO_COUNT': -1,
 'PUBCHEM_COORDINATE_TYPE': '1\n5\n255',
 'PUBCHEM_BONDANNOTATIONS': '11  14  3\n12  15  3\n15  16  8\n15  17  8\n16  20  8\n17  21  8\n20  22  8\n21  22  8\n26  27  8\n26  28  8\n27  30  8\n28  29  8\n29  31  8\n30  31  8\n7  14  8\n7  19  8\n8  14  8\n8  9  8\n9  19  8'}

You can check whether or not a property is there:

m.HasProp('foo')
0

Asking for a property that’s not present raises a KeyError:

m.GetProp('foo')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[9], line 1
----> 1 m.GetProp('foo')

KeyError: 'foo'

And you can remove properties:

m.ClearProp('PUBCHEM_HEAVY_ATOM_COUNT')
m.HasProp('PUBCHEM_HEAVY_ATOM_COUNT')
0
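
The three calls above (HasProp(), GetProp(), and ClearProp()) combine naturally into a "look up with a fallback" pattern. The sketch below is one way to write it; `get_prop_or_default` is a hypothetical helper, not an RDKit function, and the property names are invented for the example.

```python
from rdkit import Chem

def get_prop_or_default(mol, key, default=None):
    """Hypothetical helper: return a property as a string, or `default` if absent."""
    if mol.HasProp(key):
        return mol.GetProp(key)
    return default

mol = Chem.MolFromSmiles('CCO')
mol.SetProp('source', 'vendor_a')

present = get_prop_or_default(mol, 'source')
missing = get_prop_or_default(mol, 'foo', default='n/a')

# The exception-based style works too, since a missing key raises KeyError.
try:
    mol.GetProp('foo')
    raised = False
except KeyError:
    raised = True

# After ClearProp() the helper falls back to the default again.
mol.ClearProp('source')
after_clear = get_prop_or_default(mol, 'source', default='n/a')
print(present, missing, raised, after_clear)
```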

Special property types

Properties whose names start with an underscore (_) are considered to be private, and any property can be marked as computed. Private and computed properties are not returned by default by calls to GetPropNames() or GetPropsAsDict() on molecules.

One frequently used private property is _Name, which is read from the header of mol files:

m.GetProp('_Name')
'4500001'

You can see the full list of property names by passing the includePrivate and includeComputed flags to GetPropNames() or GetPropsAsDict():

list(m.GetPropNames(includePrivate=True, includeComputed=True))
['__computedProps',
 '_Name',
 '_MolFileInfo',
 '_MolFileComments',
 '_MolFileChiralFlag',
 'numArom',
 '_StereochemDone',
 'PUBCHEM_COMPOUND_CID',
 'PUBCHEM_COMPOUND_CANONICALIZED',
 'PUBCHEM_CACTVS_COMPLEXITY',
 'PUBCHEM_CACTVS_HBOND_ACCEPTOR',
 'PUBCHEM_CACTVS_HBOND_DONOR',
 'PUBCHEM_CACTVS_ROTATABLE_BOND',
 'PUBCHEM_CACTVS_SUBSKEYS',
 'PUBCHEM_IUPAC_OPENEYE_NAME',
 'PUBCHEM_IUPAC_CAS_NAME',
 'PUBCHEM_IUPAC_NAME_MARKUP',
 'PUBCHEM_IUPAC_NAME',
 'PUBCHEM_IUPAC_SYSTEMATIC_NAME',
 'PUBCHEM_IUPAC_TRADITIONAL_NAME',
 'PUBCHEM_IUPAC_INCHI',
 'PUBCHEM_IUPAC_INCHIKEY',
 'PUBCHEM_XLOGP3_AA',
 'PUBCHEM_EXACT_MASS',
 'PUBCHEM_MOLECULAR_FORMULA',
 'PUBCHEM_MOLECULAR_WEIGHT',
 'PUBCHEM_OPENEYE_CAN_SMILES',
 'PUBCHEM_OPENEYE_ISO_SMILES',
 'PUBCHEM_CACTVS_TPSA',
 'PUBCHEM_MONOISOTOPIC_WEIGHT',
 'PUBCHEM_TOTAL_CHARGE',
 'PUBCHEM_ATOM_DEF_STEREO_COUNT',
 'PUBCHEM_ATOM_UDEF_STEREO_COUNT',
 'PUBCHEM_BOND_DEF_STEREO_COUNT',
 'PUBCHEM_BOND_UDEF_STEREO_COUNT',
 'PUBCHEM_ISOTOPIC_ATOM_COUNT',
 'PUBCHEM_COMPONENT_COUNT',
 'PUBCHEM_CACTVS_TAUTO_COUNT',
 'PUBCHEM_COORDINATE_TYPE',
 'PUBCHEM_BONDANNOTATIONS']
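
One practical consequence of marking a property as computed is that it can be dropped in bulk. The sketch below assumes Mol.ClearComputedProps(), which is not otherwise covered in this post, and uses hypothetical property names; treat it as an illustration of the distinction rather than a recommended workflow.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles('CCO')
mol.SetProp('_Name', 'ethanol')                     # private: leading underscore
mol.SetProp('source', 'vendor_a')                   # regular property
mol.SetProp('cached_score', '0.5', computed=True)   # flagged as computed

had_score = bool(mol.HasProp('cached_score'))

# Remove everything that was flagged as computed; other properties stay.
mol.ClearComputedProps()

has_score = bool(mol.HasProp('cached_score'))
has_source = bool(mol.HasProp('source'))
has_name = bool(mol.HasProp('_Name'))
print(had_score, has_score, has_source, has_name)
```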

Adding your own properties

I’m demonstrating this for molecules, but the same thing works for the other types.

m = Chem.MolFromSmiles('CCC')
m.SetProp('prop1','val1')
m.SetIntProp('prop2',2)
m.SetDoubleProp('prop3',3.14159)

m.GetPropsAsDict()
{'prop1': 'val1', 'prop2': 2, 'prop3': 3.14159}
m.SetProp('computed1','val', computed=True)
m.SetProp('_private1','val', computed=False)

m.GetPropsAsDict()
{'prop1': 'val1', 'prop2': 2, 'prop3': 3.14159}
m.GetPropsAsDict(includeComputed=True)
{'numArom': 0,
 'prop1': 'val1',
 'prop2': 2,
 'prop3': 3.14159,
 'computed1': 'val'}
m.GetPropsAsDict(includePrivate=True,includeComputed=True)
{'__computedProps': <rdkit.rdBase._vectNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE at 0x7b88e3f5f040>,
 'numArom': 0,
 '_StereochemDone': 1,
 'prop1': 'val1',
 'prop2': 2,
 'prop3': 3.14159,
 'computed1': 'val',
 '_private1': 'val'}
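
To make the "same thing works for the other types" point concrete, here is a short sketch that sets and reads properties on an atom and a bond. The property names (`role`, `site_id`, `order_estimate`) are hypothetical.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles('CCO')

atom = mol.GetAtomWithIdx(2)      # the oxygen
atom.SetProp('role', 'acceptor')
atom.SetIntProp('site_id', 1)

bond = mol.GetBondWithIdx(0)      # the C-C bond
bond.SetDoubleProp('order_estimate', 1.0)

atom_role = atom.GetProp('role')
atom_site = atom.GetIntProp('site_id')
bond_value = bond.GetDoubleProp('order_estimate')

# Collect one property across all atoms, skipping atoms that lack it.
roles = {a.GetIdx(): a.GetProp('role')
         for a in mol.GetAtoms() if a.HasProp('role')}
print(atom_role, atom_site, bond_value, roles)
```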

Properties and copying/serialization/pickling

m = Chem.MolFromSmiles('CC')
m.SetProp('prop1','v1')
m.SetProp('computed1','v2')
m.GetAtomWithIdx(0).SetIntProp('aprop',1)

Properties are copied when molecules are copied, either using the RDKit’s recommended approach:

m2 = Chem.Mol(m)
print('mol:',list(m2.GetPropNames(includeComputed=True)))
print('atom:',list(m2.GetAtomWithIdx(0).GetPropNames(includeComputed=True)))
mol: ['numArom', 'prop1', 'computed1']
atom: ['aprop']

Or using the copy module:

import copy
m2 = copy.deepcopy(m)
print('mol:',list(m2.GetPropNames(includeComputed=True)))
print('atom:',list(m2.GetAtomWithIdx(0).GetPropNames(includeComputed=True)))
mol: ['numArom', 'prop1', 'computed1']
atom: ['aprop']

Properties are not, by default, captured when molecules are serialized (converted to binary):

m2 = Chem.Mol(m.ToBinary())
print('mol:',list(m2.GetPropNames(includeComputed=True)))
print('atom:',list(m2.GetAtomWithIdx(0).GetPropNames(includeComputed=True)))
mol: []
atom: []

But you can change this:

m2 = Chem.Mol(m.ToBinary(Chem.PropertyPickleOptions.AllProps))
print('mol:',list(m2.GetPropNames(includeComputed=True)))
print('atom:',list(m2.GetAtomWithIdx(0).GetPropNames(includeComputed=True)))
mol: ['numArom', 'prop1', 'computed1']
atom: ['aprop']
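
AllProps is not the only choice. As a sketch, and assuming the MolProps and AtomProps members of Chem.PropertyPickleOptions can be combined as bit flags, you can keep just the kinds of properties you need:

```python
from rdkit import Chem

mol = Chem.MolFromSmiles('CC')
mol.SetProp('prop1', 'v1')
mol.GetAtomWithIdx(0).SetIntProp('aprop', 1)

# Keep molecule-level properties only.
flags = Chem.PropertyPickleOptions.MolProps
mol_only = Chem.Mol(mol.ToBinary(flags))

# Keep molecule-level and atom-level properties.
flags = Chem.PropertyPickleOptions.MolProps | Chem.PropertyPickleOptions.AtomProps
mol_and_atoms = Chem.Mol(mol.ToBinary(flags))

print(list(mol_only.GetPropNames()),
      list(mol_only.GetAtomWithIdx(0).GetPropNames()))
print(list(mol_and_atoms.GetPropNames()),
      list(mol_and_atoms.GetAtomWithIdx(0).GetPropNames()))
```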

And finally, Python’s pickling tool does not serialize properties by default:

import pickle

m2 = pickle.loads(pickle.dumps(m))
print('mol:',list(m2.GetPropNames(includeComputed=True)))
print('atom:',list(m2.GetAtomWithIdx(0).GetPropNames(includeComputed=True)))
mol: []
atom: []

But you can change this with a global variable:

Chem.SetDefaultPickleProperties(Chem.PropertyPickleOptions.AllProps)

m2 = pickle.loads(pickle.dumps(m))
print('mol:',list(m2.GetPropNames(includeComputed=True)))
print('atom:',list(m2.GetAtomWithIdx(0).GetPropNames(includeComputed=True)))
mol: ['numArom', 'prop1', 'computed1']
atom: ['aprop']
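
Because this setting is global, it affects every pickle made afterwards in the same process. If you only want it for one block of code, one option is to save and restore the previous value. The sketch below does that with a context manager; `pickle_properties` is a hypothetical helper, and it assumes Chem.GetDefaultPickleProperties() returns a value that SetDefaultPickleProperties() accepts back.

```python
import pickle
from contextlib import contextmanager
from rdkit import Chem

@contextmanager
def pickle_properties(options):
    """Hypothetical helper: temporarily change the global pickle property setting."""
    previous = Chem.GetDefaultPickleProperties()
    Chem.SetDefaultPickleProperties(options)
    try:
        yield
    finally:
        # Restore whatever was configured before, even if an error occurred.
        Chem.SetDefaultPickleProperties(previous)

mol = Chem.MolFromSmiles('CC')
mol.SetProp('prop1', 'v1')
mol.GetAtomWithIdx(0).SetIntProp('aprop', 1)

before = Chem.GetDefaultPickleProperties()
with pickle_properties(Chem.PropertyPickleOptions.AllProps):
    restored = pickle.loads(pickle.dumps(mol))
after = Chem.GetDefaultPickleProperties()

print(restored.GetProp('prop1'), restored.GetAtomWithIdx(0).GetIntProp('aprop'))
```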

Writing properties

Both the SDWriter and the SmilesWriter can write properties.

from io import StringIO
m = Chem.MolFromSmiles('CC')
m.SetProp('prop1','v1')
m.SetProp('computed1','v2')

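By default the SDWriter writes the molecule's non-private properties. Note that computed1 was set above without computed=True, so despite its name it is a regular property; numArom, which really is a computed property, does not appear in the output: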

sio = StringIO()
with Chem.SDWriter(sio) as w:
    w.write(m)
print(sio.getvalue())

     RDKit          2D

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.2990    0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
M  END
>  <prop1>  (1) 
v1

>  <computed1>  (1) 
v2

$$$$

But you can control which properties are written:

sio = StringIO()
with Chem.SDWriter(sio) as w:
    w.SetProps(['prop1'])
    w.write(m)
print(sio.getvalue())

     RDKit          2D

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.2990    0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
M  END
>  <prop1>  (1) 
v1

$$$$

The SmilesWriter doesn’t write properties by default, but we can tell it to:

sio = StringIO()
with Chem.SmilesWriter(sio) as w:
    w.SetProps(m.GetPropNames())
    w.write(m)
print(sio.getvalue())
SMILES Name prop1 computed1
CC 0 v1 v2
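
The Name column above is 0 because the molecule has no _Name property, so the writer falls back to an index. As a sketch, setting _Name and passing a `delimiter` argument (assumed here to be accepted as a keyword by the SmilesWriter constructor) gives a small CSV-style table with only the columns you choose; `prop2` is a hypothetical extra property that is deliberately left out.

```python
from io import StringIO
from rdkit import Chem

mol = Chem.MolFromSmiles('CC')
mol.SetProp('_Name', 'ethane')
mol.SetProp('prop1', 'v1')
mol.SetProp('prop2', 'v2')

sio = StringIO()
with Chem.SmilesWriter(sio, delimiter=',') as w:
    w.SetProps(['prop1'])    # write only this property column
    w.write(mol)
csv_text = sio.getvalue()
lines = csv_text.strip().splitlines()
print(csv_text)
```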