What happens if you ask an LLM to generate an RDKit blog post?
Published
July 31, 2025
What is this post?
On the bus to the CADD GRC, Nadine suggested that I try using Copilot to generate ideas for an RDKit blog post. It seemed like an interesting experiment and I’m heading out for a couple of weeks of vacation, so I thought I’d go ahead and give it a try. This post is the result.
What’s below is the result of a chat session with Copilot, using Claude Sonnet 4. The initial prompt was: "suggest a topic for a new RDKit blog post." I made a couple of small edits at the very end, but otherwise the text is exactly as generated by the LLM. I hope to come back to this at some point in the future and dig into the code and content in detail, but right now I need to go pack my gear to head to the mountains for a couple of weeks of climbing and mountaineering.
I’m going to go ahead and upload this and post it sometime around July 31st (or whenever I have decent internet access). There won’t be a new blog post next week, but I will be back the week of August 11th.
RDKit and Large Language Models: Chemical Structure-Text Integration in the AI Era
The landscape of artificial intelligence has been dramatically transformed by the rise of large language models (LLMs), and the field of chemistry is no exception. From GPT-4’s ability to reason about molecular structures to specialized chemistry-focused models like ChemCrow and Galactica, we’re witnessing an unprecedented convergence of natural language processing and chemical informatics. This presents both exciting opportunities and unique challenges for computational chemists and cheminformatics practitioners.
The RDKit, as one of the most widely-used open-source cheminformatics toolkits, finds itself at a fascinating intersection in this new landscape. While LLMs excel at processing and generating text-based representations of chemical knowledge, they often struggle with the precise, structure-based reasoning that is second nature to dedicated cheminformatics tools. Conversely, traditional cheminformatics approaches excel at molecular manipulation and property calculation but have limited ability to understand and generate natural language descriptions of chemical concepts.
This complementary relationship suggests powerful synergies. LLMs can help democratize access to chemical knowledge by translating between technical chemical representations and human-readable explanations. Meanwhile, tools like RDKit provide the essential chemical “reality check”: validating molecular structures, calculating properties, and ensuring that AI-generated chemistry actually makes sense from a chemical perspective.
In this post, we’ll explore practical approaches for integrating RDKit with modern LLM workflows. We’ll cover how to:
Use RDKit to validate and process chemical structures generated by LLMs
Convert RDKit molecular representations into rich text descriptions suitable for training or prompting LLMs
Build robust pipelines that combine the strengths of both approaches
Handle the unique challenges that arise when bridging symbolic chemical representations with statistical language models
The goal is not to replace either approach, but rather to show how they can work together to create more powerful, reliable, and accessible chemical AI systems. Whether you’re building a chemical chatbot, curating training data for a chemistry-focused LLM, or simply trying to make your chemical data more searchable and interpretable, the patterns we’ll explore should provide a solid foundation.
Let’s start by setting up our environment and exploring some basic integration patterns.
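A minimal setup cell for following along, assuming the RDKit modules the later examples rely on (the exact cell used to produce the output below isn't shown), might look like this:

import time
import rdkit
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors

# Report the toolkit version and a timestamp, as in the output that follows
print(f"RDKit version: {rdkit.__version__}")
print(f"Generated: {time.asctime()}")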
RDKit version: 2025.03.4
Generated: Sat Jul 26 09:53:54 2025
# Additional imports for LLM integration examples
import json
import re
import pandas as pd
from typing import List, Dict, Optional, Tuple
import warnings
warnings.filterwarnings('ignore')
Structure Validation: RDKit as the Chemical Reality Check
One of the most immediate and practical applications of RDKit in LLM workflows is structure validation. LLMs, while impressive at generating chemical-looking text, often produce invalid SMILES strings, impossible molecular structures, or chemically nonsensical compounds. RDKit provides robust validation capabilities that can catch these errors and provide meaningful feedback.
Let’s explore some common validation scenarios and how to handle them systematically.
def validate_smiles_with_details(smiles: str) -> Dict[str, any]:
    """
    Comprehensive SMILES validation with detailed feedback.
    Returns a dictionary with validation results and diagnostic information.
    """
    result = {
        'smiles': smiles,
        'is_valid': False,
        'mol': None,
        'canonical_smiles': None,
        'errors': [],
        'warnings': [],
        'properties': {}
    }

    # Basic parsing
    try:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            result['errors'].append("Invalid SMILES: Could not parse structure")
            return result

        result['mol'] = mol
        result['is_valid'] = True

        # Get canonical SMILES
        try:
            result['canonical_smiles'] = Chem.MolToSmiles(mol)
        except:
            result['warnings'].append("Could not generate canonical SMILES")

        # Basic chemical checks
        num_atoms = mol.GetNumAtoms()
        if num_atoms == 0:
            result['errors'].append("Empty molecule")
            result['is_valid'] = False
        elif num_atoms > 200:
            result['warnings'].append(f"Very large molecule ({num_atoms} atoms)")

        # Check for unusual valences
        try:
            Chem.SanitizeMol(mol)
        except Exception as e:
            result['errors'].append(f"Sanitization failed: {str(e)}")
            result['is_valid'] = False

        # Calculate basic properties if valid
        if result['is_valid']:
            try:
                result['properties'] = {
                    'num_atoms': num_atoms,
                    'num_bonds': mol.GetNumBonds(),
                    'molecular_weight': Descriptors.MolWt(mol),
                    'num_rings': rdMolDescriptors.CalcNumRings(mol),
                    'num_aromatic_rings': rdMolDescriptors.CalcNumAromaticRings(mol)
                }
            except Exception as e:
                result['warnings'].append(f"Property calculation failed: {str(e)}")

    except Exception as e:
        result['errors'].append(f"Parsing error: {str(e)}")

    return result


# Test the function with some examples
test_smiles = [
    "CCO",              # ethanol - valid
    "c1ccccc1",         # benzene - valid
    "C[C@H](N)C(=O)O",  # alanine - valid with stereochemistry
    "CCO[invalid]",     # invalid SMILES
    "C1CCC",            # invalid - unclosed ring
    "C(C)(C)(C)(C)C",   # carbon with too many bonds
    "",                 # empty string
]

print("SMILES Validation Examples:")
print("=" * 50)

for smiles in test_smiles:
    result = validate_smiles_with_details(smiles)
    print(f"\nSMILES: {smiles}")
    print(f"Valid: {result['is_valid']}")
    if result['canonical_smiles']:
        print(f"Canonical: {result['canonical_smiles']}")
    if result['errors']:
        print(f"Errors: {', '.join(result['errors'])}")
    if result['warnings']:
        print(f"Warnings: {', '.join(result['warnings'])}")
    if result['properties']:
        props = result['properties']
        print(f"Properties: {props['num_atoms']} atoms, MW={props['molecular_weight']:.1f}")
[09:53:54] SMILES Parse Error: syntax error while parsing: CCO[invalid]
[09:53:54] SMILES Parse Error: check for mistakes around position 5:
[09:53:54] CCO[invalid]
[09:53:54] ~~~~^
[09:53:54] SMILES Parse Error: Failed parsing SMILES 'CCO[invalid]' for input: 'CCO[invalid]'
[09:53:54] SMILES Parse Error: unclosed ring for input: 'C1CCC'
[09:53:54] Explicit valence for atom # 0 C, 5, is greater than permitted
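A quick aside: the timestamped parse errors above are emitted by RDKit's own logging stream rather than raised as Python exceptions, which is why MolFromSmiles can simply return None while the messages still show up in the cell output. When bulk-validating noisy LLM output you may want to silence that logger and turn it back on afterwards; for example:

from rdkit import RDLogger

# Silence RDKit's warning/error log while bulk-validating LLM output
RDLogger.DisableLog('rdApp.*')

# ...and re-enable the error channel when you want the diagnostics back
RDLogger.EnableLog('rdApp.error')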
Batch Validation for LLM Outputs
When working with LLMs that generate multiple chemical structures, you’ll often need to validate batches of SMILES strings. Here’s a more robust approach that can handle large datasets efficiently:
def batch_validate_structures(smiles_list: List[str], include_properties: bool = True) -> pd.DataFrame:
    """
    Validate a batch of SMILES strings and return results as a DataFrame.
    Useful for processing LLM outputs or curating chemical datasets.
    """
    results = []

    for i, smiles in enumerate(smiles_list):
        result = validate_smiles_with_details(smiles)

        # Flatten the result for DataFrame storage
        row = {
            'index': i,
            'input_smiles': smiles,
            'is_valid': result['is_valid'],
            'canonical_smiles': result['canonical_smiles'],
            'num_errors': len(result['errors']),
            'num_warnings': len(result['warnings']),
            'error_messages': '; '.join(result['errors']) if result['errors'] else None,
            'warning_messages': '; '.join(result['warnings']) if result['warnings'] else None
        }

        # Add properties if requested and available
        if include_properties and result['properties']:
            row.update(result['properties'])

        results.append(row)

    return pd.DataFrame(results)


# Simulate some LLM-generated SMILES (mix of valid and invalid)
llm_generated_smiles = [
    "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",  # ibuprofen-like
    "C1=CC=C(C=C1)C(=O)O",            # benzoic acid
    "invalidsmiles123",               # clearly invalid
    "C1=CC=CC=C1",                    # benzene
    "CCO",                            # ethanol
    "C[C@H](N)C(=O)O",                # L-alanine
    "C1CCC1C",                        # methylcyclobutane
    "c1ccccc1c2ccccc2",               # biphenyl
    "CCCCCCCCCCCCCCCCO",              # long-chain alcohol
    "C1=CC=C2C(=C1)C=CC=C2",          # naphthalene
]

print("Processing batch of LLM-generated SMILES...")
df_results = batch_validate_structures(llm_generated_smiles)

# Display summary statistics
print(f"\nBatch Validation Summary:")
print(f"Total structures: {len(df_results)}")
print(f"Valid structures: {df_results['is_valid'].sum()}")
print(f"Invalid structures: {(~df_results['is_valid']).sum()}")
print(f"Structures with warnings: {(df_results['num_warnings'] > 0).sum()}")

# Show the results table
print(f"\nDetailed Results:")
display_cols = ['input_smiles', 'is_valid', 'canonical_smiles', 'molecular_weight', 'num_atoms']
available_cols = [col for col in display_cols if col in df_results.columns]
print(df_results[available_cols].to_string(index=False))

# Show error details for invalid structures
invalid_structures = df_results[~df_results['is_valid']]
if len(invalid_structures) > 0:
    print(f"\nError Details for Invalid Structures:")
    for _, row in invalid_structures.iterrows():
        print(f"'{row['input_smiles']}': {row['error_messages']}")
[09:53:54] SMILES Parse Error: syntax error while parsing: invalidsmiles123
[09:53:54] SMILES Parse Error: check for mistakes around position 1:
[09:53:54] invalidsmiles123
[09:53:54] ^
[09:53:54] SMILES Parse Error: Failed parsing SMILES 'invalidsmiles123' for input: 'invalidsmiles123'
Chemical Plausibility Checks
Beyond basic SMILES validation, we can implement more sophisticated checks to assess whether LLM-generated structures are chemically reasonable. This is particularly important because LLMs might generate syntactically valid but chemically implausible structures:
def assess_chemical_plausibility(mol: Chem.Mol) -> Dict[str, any]:
    """
    Assess the chemical plausibility of a molecule beyond basic validation.
    Returns flags for potential issues that might indicate AI-generated artifacts.
    """
    if mol is None:
        return {'plausible': False, 'issues': ['Invalid molecule']}

    issues = []
    warnings = []

    # Molecular weight checks
    mw = Descriptors.MolWt(mol)
    if mw < 16:  # Lighter than methane
        issues.append(f"Extremely low molecular weight: {mw:.1f}")
    elif mw > 2000:  # Very large for small molecule
        warnings.append(f"Very high molecular weight: {mw:.1f}")

    # Atom count checks
    num_atoms = mol.GetNumAtoms()
    if num_atoms > 150:  # Unusually large for typical organic molecules
        warnings.append(f"Very large molecule: {num_atoms} atoms")

    # Check for unusual atom types in organic chemistry context
    unusual_atoms = []
    for atom in mol.GetAtoms():
        symbol = atom.GetSymbol()
        if symbol not in ['C', 'N', 'O', 'S', 'P', 'F', 'Cl', 'Br', 'I', 'H']:
            unusual_atoms.append(symbol)
    if unusual_atoms:
        warnings.append(f"Unusual atoms present: {set(unusual_atoms)}")

    # Check carbon-to-heteroatom ratio
    carbon_count = sum(1 for atom in mol.GetAtoms() if atom.GetSymbol() == 'C')
    heteroatom_count = sum(1 for atom in mol.GetAtoms() if atom.GetSymbol() not in ['C', 'H'])
    if heteroatom_count > 0:
        c_to_hetero_ratio = carbon_count / heteroatom_count
        if c_to_hetero_ratio < 0.1:  # Too many heteroatoms
            warnings.append(f"Unusual C:heteroatom ratio: {c_to_hetero_ratio:.2f}")

    # Check for overly complex ring systems
    ring_info = mol.GetRingInfo()
    num_rings = ring_info.NumRings()
    if num_rings > 10:  # Many fused rings might be suspicious
        warnings.append(f"Complex ring system: {num_rings} rings")

    # Check for extremely high or low LogP (rough estimate)
    try:
        logp = Descriptors.MolLogP(mol)
        if logp > 8:
            warnings.append(f"Very high LogP: {logp:.2f}")
        elif logp < -5:
            warnings.append(f"Very low LogP: {logp:.2f}")
    except:
        warnings.append("Could not calculate LogP")

    # Check for drug-likeness violations (Lipinski's Rule of Five)
    violations = []
    if mw > 500:
        violations.append("MW > 500")
    if Descriptors.MolLogP(mol) > 5:
        violations.append("LogP > 5")
    if Descriptors.NumHDonors(mol) > 5:
        violations.append("H-donors > 5")
    if Descriptors.NumHAcceptors(mol) > 10:
        violations.append("H-acceptors > 10")
    if len(violations) >= 2:  # Allow one violation
        warnings.append(f"Multiple Lipinski violations: {violations}")

    return {
        'plausible': len(issues) == 0,
        'issues': issues,
        'warnings': warnings,
        'properties': {
            'molecular_weight': mw,
            'num_atoms': num_atoms,
            'num_rings': num_rings,
            'logp': Descriptors.MolLogP(mol) if mol else None,
            'lipinski_violations': len(violations)
        }
    }


# Test with some examples
test_molecules = [
    ("CCO", "Simple alcohol"),
    ("CC(C)CC1=CC=C(C=C1)C(C)C(=O)O", "Ibuprofen"),
    ("C" * 50, "Very long alkyl chain"),  # Unusual but valid
    ("CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC(=O)O", "Long fatty acid"),
    ("c1ccc2c(c1)c3ccccc3c4ccccc24", "Triphenylene (PAH)"),
]

print("Chemical Plausibility Assessment:")
print("=" * 50)

for smiles, description in test_molecules:
    mol = Chem.MolFromSmiles(smiles)
    assessment = assess_chemical_plausibility(mol)

    print(f"\nMolecule: {description}")
    print(f"SMILES: {smiles}")
    print(f"Plausible: {assessment['plausible']}")

    if assessment['issues']:
        print(f"Issues: {'; '.join(assessment['issues'])}")
    if assessment['warnings']:
        print(f"Warnings: {'; '.join(assessment['warnings'])}")

    props = assessment['properties']
    print(f"Properties: MW={props['molecular_weight']:.1f}, "
          f"Rings={props['num_rings']}, "
          f"LogP={props['logp']:.2f}, "
          f"Lipinski violations={props['lipinski_violations']}")
Putting It Together: A Complete Validation Pipeline
Here’s how you might integrate these validation functions into a complete pipeline for processing LLM-generated chemical structures:
class LLMChemicalValidator:
    """
    A comprehensive validator for LLM-generated chemical structures.
    Combines SMILES validation, chemical plausibility checks, and filtering.
    """

    def __init__(self, strict_mode: bool = False):
        self.strict_mode = strict_mode
        self.validation_stats = {
            'total_processed': 0,
            'valid_smiles': 0,
            'chemically_plausible': 0,
            'passed_filters': 0
        }

    def validate_structure(self, smiles: str) -> Dict[str, any]:
        """Validate a single SMILES string with full analysis."""
        result = validate_smiles_with_details(smiles)

        if result['is_valid'] and result['mol']:
            # Add plausibility assessment
            plausibility = assess_chemical_plausibility(result['mol'])
            result['plausibility'] = plausibility
            result['chemically_plausible'] = plausibility['plausible']

            # Apply filters based on mode
            if self.strict_mode:
                result['passes_filters'] = (plausibility['plausible'] and
                                            len(plausibility['warnings']) == 0)
            else:
                result['passes_filters'] = plausibility['plausible']
        else:
            result['chemically_plausible'] = False
            result['passes_filters'] = False

        # Update stats
        self.validation_stats['total_processed'] += 1
        if result['is_valid']:
            self.validation_stats['valid_smiles'] += 1
        if result.get('chemically_plausible', False):
            self.validation_stats['chemically_plausible'] += 1
        if result.get('passes_filters', False):
            self.validation_stats['passed_filters'] += 1

        return result

    def validate_batch(self, smiles_list: List[str]) -> List[Dict[str, any]]:
        """Validate a batch of SMILES strings."""
        return [self.validate_structure(smiles) for smiles in smiles_list]

    def get_filtered_structures(self, smiles_list: List[str]) -> List[str]:
        """Return only the structures that pass all validation checks."""
        results = self.validate_batch(smiles_list)
        return [r['canonical_smiles'] for r in results
                if r['passes_filters'] and r['canonical_smiles']]

    def get_validation_report(self) -> str:
        """Generate a summary report of validation statistics."""
        stats = self.validation_stats
        total = stats['total_processed']
        if total == 0:
            return "No structures processed yet."

        report = f"""Validation Report:
==================
Total structures processed: {total}
Valid SMILES: {stats['valid_smiles']} ({stats['valid_smiles']/total*100:.1f}%)
Chemically plausible: {stats['chemically_plausible']} ({stats['chemically_plausible']/total*100:.1f}%)
Passed all filters: {stats['passed_filters']} ({stats['passed_filters']/total*100:.1f}%)
Success rate: {stats['passed_filters']/total*100:.1f}%"""
        return report.strip()


# Demonstration with simulated LLM output
print("LLM Chemical Validator Demo")
print("=" * 40)

# Simulate a mix of good and problematic LLM-generated structures
llm_output = [
    "CCO",                                  # ethanol - should pass
    "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",        # ibuprofen - should pass
    "c1ccccc1",                             # benzene - should pass
    "invalidsmiles",                        # invalid SMILES
    "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC",  # very long chain
    "C1=CC=C(C=C1)C(=O)O",                  # benzoic acid - should pass
    "C[C@H](N)C(=O)O",                      # alanine - should pass
    "C1CCC",                                # invalid - unclosed ring
    "CC(C)(C)(C)(C)C",                      # too many bonds
    "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC",  # extremely long
]

# Test with normal mode
print("Testing with Normal Mode:")
validator_normal = LLMChemicalValidator(strict_mode=False)
filtered_normal = validator_normal.get_filtered_structures(llm_output)

print(f"Input structures: {len(llm_output)}")
print(f"Structures passing validation: {len(filtered_normal)}")
print("\nPassed structures:")
for i, smiles in enumerate(filtered_normal, 1):
    print(f"{i}. {smiles}")

print(validator_normal.get_validation_report())

# Test with strict mode
print("\n" + "=" * 40)
print("Testing with Strict Mode:")
validator_strict = LLMChemicalValidator(strict_mode=True)
filtered_strict = validator_strict.get_filtered_structures(llm_output)

print(f"Structures passing strict validation: {len(filtered_strict)}")
print(validator_strict.get_validation_report())
[09:53:54] SMILES Parse Error: syntax error while parsing: invalidsmiles
[09:53:54] SMILES Parse Error: check for mistakes around position 1:
[09:53:54] invalidsmiles
[09:53:54] ^
[09:53:54] SMILES Parse Error: Failed parsing SMILES 'invalidsmiles' for input: 'invalidsmiles'
[09:53:54] SMILES Parse Error: unclosed ring for input: 'C1CCC'
[09:53:54] Explicit valence for atom # 1 C, 6, is greater than permitted
[09:53:54] SMILES Parse Error: syntax error while parsing: invalidsmiles
[09:53:54] SMILES Parse Error: check for mistakes around position 1:
[09:53:54] invalidsmiles
[09:53:54] ^
[09:53:54] SMILES Parse Error: Failed parsing SMILES 'invalidsmiles' for input: 'invalidsmiles'
[09:53:54] SMILES Parse Error: unclosed ring for input: 'C1CCC'
[09:53:54] Explicit valence for atom # 1 C, 6, is greater than permitted
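The generated post never shows the LLM side of this pipeline. As a rough sketch of how the validator might sit in front of a model, assuming an OpenAI-style chat client, a placeholder model name, and illustrative prompt wording (none of which appear in the original), a generate-validate-retry loop could look like this:

from openai import OpenAI  # assumes the openai package; any chat-completion client would do

client = OpenAI()  # hypothetical setup; reads OPENAI_API_KEY from the environment

def generate_validated_smiles(request: str, max_attempts: int = 3) -> Optional[str]:
    """Ask the model for a SMILES string and retry until RDKit accepts it (or we give up)."""
    validator = LLMChemicalValidator(strict_mode=False)
    prompt = request
    for _ in range(max_attempts):
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        candidate = response.choices[0].message.content.strip()
        result = validator.validate_structure(candidate)
        if result.get('passes_filters'):
            return result['canonical_smiles']
        # Feed the validation errors back to the model and try again
        issues = '; '.join(result['errors']) or 'failed plausibility checks'
        prompt = (f"{request}\nYour previous answer '{candidate}' was rejected ({issues}). "
                  "Reply with a single corrected SMILES string only.")
    return None

# Example call (untested sketch):
# smiles = generate_validated_smiles("Suggest a SMILES for a para-substituted benzamide. Reply with SMILES only.")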
From Molecules to Text: Generating Rich Descriptions for LLM Training
While validation is crucial for processing LLM outputs, the reverse direction, converting molecular structures into natural language descriptions, is equally important for training chemical AI systems and creating meaningful prompts. RDKit’s rich chemical analysis capabilities make it possible to generate detailed, chemically informed text descriptions of molecules.
This approach is valuable for:
Creating training data for chemistry-focused LLMs
Building chemical chatbots with rich molecular knowledge
Making chemical databases more searchable with natural language
Generating explanatory text for chemical education applications
def extract_molecular_features(mol: Chem.Mol) -> Dict[str, any]:
    """
    Extract comprehensive molecular features suitable for text generation.
    This creates a rich feature dictionary that can be converted to natural language.
    """
    if mol is None:
        return {}

    try:
        # Basic molecular properties
        features = {
            'smiles': Chem.MolToSmiles(mol),
            'molecular_formula': rdMolDescriptors.CalcMolFormula(mol),
            'molecular_weight': round(Descriptors.MolWt(mol), 2),
            'num_atoms': mol.GetNumAtoms(),
            'num_heavy_atoms': mol.GetNumHeavyAtoms(),
            'num_bonds': mol.GetNumBonds(),
        }

        # Ring and aromaticity information
        features.update({
            'num_rings': rdMolDescriptors.CalcNumRings(mol),
            'num_aromatic_rings': rdMolDescriptors.CalcNumAromaticRings(mol),
            'num_saturated_rings': rdMolDescriptors.CalcNumSaturatedRings(mol),
            'num_aliphatic_rings': rdMolDescriptors.CalcNumAliphaticRings(mol),
        })

        # Physicochemical properties
        features.update({
            'logp': round(Descriptors.MolLogP(mol), 2),
            'tpsa': round(Descriptors.TPSA(mol), 2),
            'h_bond_donors': Descriptors.NumHDonors(mol),
            'h_bond_acceptors': Descriptors.NumHAcceptors(mol),
            'rotatable_bonds': Descriptors.NumRotatableBonds(mol),
        })

        # Atom type counts
        atom_counts = {}
        for atom in mol.GetAtoms():
            symbol = atom.GetSymbol()
            atom_counts[symbol] = atom_counts.get(symbol, 0) + 1
        features['atom_composition'] = atom_counts

        # Functional group analysis (simplified)
        functional_groups = []

        # Check for common functional groups using SMARTS
        fg_patterns = {
            'carboxylic_acid': '[CX3](=O)[OX2H1]',
            'ester': '[#6][CX3](=O)[OX2H0][#6]',
            'amide': '[CX3](=[OX1])[NX3H2]',
            'alcohol': '[OX2H]',
            'amine': '[NX3;H2,H1;!$(NC=O)]',
            'ketone': '[#6][CX3](=O)[#6]',
            'aldehyde': '[CX3H1](=O)[#6]',
            'ether': '[OD2]([#6])[#6]',
            'phenol': '[OX2H][cX3]:[c]',
            'nitro': '[NX3+](=O)[O-]',
            'aromatic_ring': 'c1ccccc1',
        }

        for fg_name, smarts in fg_patterns.items():
            pattern = Chem.MolFromSmarts(smarts)
            if pattern and mol.HasSubstructMatch(pattern):
                matches = mol.GetSubstructMatches(pattern)
                if matches:
                    functional_groups.append(f"{fg_name}({len(matches)})")

        features['functional_groups'] = functional_groups

        # Drug-likeness assessment
        lipinski_violations = []
        if features['molecular_weight'] > 500:
            lipinski_violations.append('MW > 500')
        if features['logp'] > 5:
            lipinski_violations.append('LogP > 5')
        if features['h_bond_donors'] > 5:
            lipinski_violations.append('H-donors > 5')
        if features['h_bond_acceptors'] > 10:
            lipinski_violations.append('H-acceptors > 10')

        features['lipinski_violations'] = lipinski_violations
        features['drug_like'] = len(lipinski_violations) <= 1

        return features

    except Exception as e:
        return {'error': f"Feature extraction failed: {str(e)}"}


# Test the function with some example molecules
test_molecules = [
    ("CCO", "Ethanol"),
    ("CC(C)CC1=CC=C(C=C1)C(C)C(=O)O", "Ibuprofen"),
    ("C1=CC=C(C=C1)C(=O)O", "Benzoic acid"),
    ("CC(=O)OC1=CC=CC=C1C(=O)O", "Aspirin"),
    ("CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "Caffeine"),
]

print("Molecular Feature Extraction Examples:")
print("=" * 50)

for smiles, name in test_molecules:
    mol = Chem.MolFromSmiles(smiles)
    features = extract_molecular_features(mol)

    print(f"\n{name} ({smiles}):")
    print(f"  Formula: {features.get('molecular_formula', 'N/A')}")
    print(f"  MW: {features.get('molecular_weight', 'N/A')} Da")
    print(f"  LogP: {features.get('logp', 'N/A')}")
    print(f"  TPSA: {features.get('tpsa', 'N/A')} Ų")
    print(f"  Rings: {features.get('num_rings', 0)} (aromatic: {features.get('num_aromatic_rings', 0)})")
    print(f"  H-bond donors/acceptors: {features.get('h_bond_donors', 0)}/{features.get('h_bond_acceptors', 0)}")
    print(f"  Functional groups: {', '.join(features.get('functional_groups', [])) or 'None detected'}")
    print(f"  Drug-like: {features.get('drug_like', False)}")
    if features.get('lipinski_violations'):
        print(f"  Lipinski violations: {', '.join(features['lipinski_violations'])}")
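Because these feature dictionaries contain only plain Python types, they serialize directly to JSON, which is a convenient way to hand an LLM grounded context rather than asking it to compute properties itself. A small sketch (the prompt wording here is purely illustrative):

mol = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")  # aspirin
features = extract_molecular_features(mol)

# json.dumps gives the model clean, machine-readable context to work from
prompt_context = json.dumps(features, indent=2)
prompt = ("Using only the computed properties below, write a two-sentence summary "
          "of this molecule:\n" + prompt_context)
print(prompt[:300])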
Now let’s create functions that transform these molecular features into natural language descriptions suitable for LLM training:
def generate_molecular_description(mol: Chem.Mol, style: str='detailed') ->str:""" Generate natural language descriptions of molecules in different styles. Args: mol: RDKit molecule object style: 'detailed', 'concise', 'technical', or 'educational' Returns: Natural language description of the molecule """if mol isNone:return"Invalid molecule structure." features = extract_molecular_features(mol)if'error'in features:returnf"Could not analyze molecule: {features['error']}"if style =='concise':return _generate_concise_description(features)elif style =='technical':return _generate_technical_description(features)elif style =='educational':return _generate_educational_description(features)else: # detailedreturn _generate_detailed_description(features)def _generate_concise_description(features: Dict) ->str:"""Generate a brief, factual description.""" desc =f"This is {features['molecular_formula']} with molecular weight {features['molecular_weight']} Da."if features['functional_groups']: fg_text =', '.join([fg.split('(')[0].replace('_', ' ') for fg in features['functional_groups']]) desc +=f" Contains {fg_text} functional groups."return descdef _generate_technical_description(features: Dict) ->str:"""Generate a technical description with precise chemical language.""" parts = []# Basic structure parts.append(f"Molecular formula {features['molecular_formula']} (MW: {features['molecular_weight']} Da)")# Ring systemsif features['num_rings'] >0: ring_desc =f"{features['num_rings']} ring(s)"if features['num_aromatic_rings'] >0: ring_desc +=f" including {features['num_aromatic_rings']} aromatic ring(s)" parts.append(ring_desc)# Physicochemical properties parts.append(f"LogP: {features['logp']}, TPSA: {features['tpsa']} Ų")# Hydrogen bondingif features['h_bond_donors'] >0or features['h_bond_acceptors'] >0: parts.append(f"H-bond donors: {features['h_bond_donors']}, acceptors: {features['h_bond_acceptors']}")# Drug-likenessif features['drug_like']: parts.append("Complies with Lipinski's Rule of Five")elif features['lipinski_violations']: parts.append(f"Violates Lipinski's Rule: {', '.join(features['lipinski_violations'])}")return". 
".join(parts) +"."def _generate_detailed_description(features: Dict) ->str:"""Generate a comprehensive description suitable for training data.""" parts = []# Introduction parts.append(f"This compound has the molecular formula {features['molecular_formula']} and a molecular weight of {features['molecular_weight']} daltons.")# Structural features structure_parts = []if features['num_heavy_atoms'] >0: structure_parts.append(f"{features['num_heavy_atoms']} heavy atoms")if features['num_rings'] >0:if features['num_aromatic_rings'] >0and features['num_saturated_rings'] >0: structure_parts.append(f"{features['num_rings']} rings ({features['num_aromatic_rings']} aromatic, {features['num_saturated_rings']} saturated)")elif features['num_aromatic_rings'] >0: structure_parts.append(f"{features['num_aromatic_rings']} aromatic ring(s)")else: structure_parts.append(f"{features['num_rings']} saturated ring(s)")if structure_parts: parts.append(f"The structure contains {', '.join(structure_parts)}.")# Atom composition atoms = features['atom_composition']iflen(atoms) >1: atom_desc = []for atom, count insorted(atoms.items()):if atom !='H': # Skip hydrogen for brevityif count ==1: atom_desc.append(f"one {atom}")else: atom_desc.append(f"{count}{atom} atoms")if atom_desc: parts.append(f"It is composed of {', '.join(atom_desc)}.")# Functional groupsif features['functional_groups']: fg_names = [fg.split('(')[0].replace('_', ' ') for fg in features['functional_groups']]iflen(fg_names) ==1: parts.append(f"The molecule contains a {fg_names[0]} functional group.")else: parts.append(f"Functional groups present include {', '.join(fg_names[:-1])} and {fg_names[-1]}.")# Physicochemical properties prop_desc = []if features['logp'] isnotNone:if features['logp'] >3: prop_desc.append("lipophilic character")elif features['logp'] <0: prop_desc.append("hydrophilic nature")else: prop_desc.append("moderate lipophilicity")if features['tpsa'] >90: prop_desc.append("high polar surface area")elif features['tpsa'] <30: prop_desc.append("low polar surface area")if prop_desc: parts.append(f"The compound exhibits {' and '.join(prop_desc)}.")# Drug-likeness assessmentif features['drug_like']: parts.append("This molecule satisfies Lipinski's Rule of Five, suggesting good oral bioavailability potential.")elif features['lipinski_violations']: parts.append(f"The compound violates Lipinski's Rule of Five due to {', '.join(features['lipinski_violations'])}, which may affect its drug-like properties.")return" ".join(parts)def _generate_educational_description(features: Dict) ->str:"""Generate an educational description explaining chemical concepts.""" parts = []# Start with basics parts.append(f"This molecule, with formula {features['molecular_formula']}, is an organic compound weighing {features['molecular_weight']} daltons.")# Explain ring systems in educational termsif features['num_aromatic_rings'] >0: parts.append(f"It contains {features['num_aromatic_rings']} aromatic ring(s), which are stable ring structures with delocalized electrons that give the molecule special stability and chemical properties.")elif features['num_rings'] >0: parts.append(f"The molecule has {features['num_rings']} ring structure(s), which can affect its shape and biological activity.")# Explain functional groupsif features['functional_groups']: parts.append("Functional groups are specific arrangements of atoms that give molecules their chemical reactivity.") fg_explanations = {'carboxylic_acid': "carboxylic acid groups (which can donate protons and are often 
acidic)",'alcohol': "alcohol groups (which can form hydrogen bonds)",'amine': "amine groups (which are basic and can accept protons)",'ketone': "ketone groups (which are reactive carbonyl groups)",'ester': "ester groups (often found in fats and can be hydrolyzed)",'ether': "ether groups (which are generally unreactive)" } explained_groups = []for fg in features['functional_groups']: fg_name = fg.split('(')[0]if fg_name in fg_explanations: explained_groups.append(fg_explanations[fg_name])if explained_groups: parts.append(f"This molecule contains {', '.join(explained_groups)}.")# Explain drug-likeness in simple termsif features['drug_like']: parts.append("The molecule's size and properties suggest it could potentially be developed as an oral medication.")elif features['lipinski_violations']: parts.append("The molecule is quite large or has properties that might make it challenging to develop as an oral drug.")return" ".join(parts)# Test the different description stylestest_molecules = [ ("CC(C)CC1=CC=C(C=C1)C(C)C(=O)O", "Ibuprofen"), ("CC(=O)OC1=CC=CC=C1C(=O)O", "Aspirin"), ("CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "Caffeine"),]styles = ['concise', 'technical', 'detailed', 'educational']print("Natural Language Description Examples:")print("="*60)for smiles, name in test_molecules: mol = Chem.MolFromSmiles(smiles)print(f"\n{name} ({smiles}):")print("-"* (len(name) +len(smiles) +4))for style in styles: description = generate_molecular_description(mol, style)print(f"\n{style.title()} style:")print(description)
Natural Language Description Examples:
============================================================
Ibuprofen (CC(C)CC1=CC=C(C=C1)C(C)C(=O)O):
------------------------------------------
Concise style:
This is C13H18O2 with molecular weight 206.28 Da. Contains carboxylic acid, alcohol, aromatic ring functional groups.
Technical style:
Molecular formula C13H18O2 (MW: 206.28 Da). 1 ring(s) including 1 aromatic ring(s). LogP: 3.07, TPSA: 37.3 Ų. H-bond donors: 1, acceptors: 1. Complies with Lipinski's Rule of Five.
Detailed style:
This compound has the molecular formula C13H18O2 and a molecular weight of 206.28 daltons. The structure contains 15 heavy atoms, 1 aromatic ring(s). It is composed of 13 C atoms, 2 O atoms. Functional groups present include carboxylic acid, alcohol and aromatic ring. The compound exhibits lipophilic character. This molecule satisfies Lipinski's Rule of Five, suggesting good oral bioavailability potential.
Educational style:
This molecule, with formula C13H18O2, is an organic compound weighing 206.28 daltons. It contains 1 aromatic ring(s), which are stable ring structures with delocalized electrons that give the molecule special stability and chemical properties. Functional groups are specific arrangements of atoms that give molecules their chemical reactivity. This molecule contains carboxylic acid groups (which can donate protons and are often acidic), alcohol groups (which can form hydrogen bonds). The molecule's size and properties suggest it could potentially be developed as an oral medication.
Aspirin (CC(=O)OC1=CC=CC=C1C(=O)O):
-----------------------------------
Concise style:
This is C9H8O4 with molecular weight 180.16 Da. Contains carboxylic acid, ester, alcohol, ether, aromatic ring functional groups.
Technical style:
Molecular formula C9H8O4 (MW: 180.16 Da). 1 ring(s) including 1 aromatic ring(s). LogP: 1.31, TPSA: 63.6 Ų. H-bond donors: 1, acceptors: 3. Complies with Lipinski's Rule of Five.
Detailed style:
This compound has the molecular formula C9H8O4 and a molecular weight of 180.16 daltons. The structure contains 13 heavy atoms, 1 aromatic ring(s). It is composed of 9 C atoms, 4 O atoms. Functional groups present include carboxylic acid, ester, alcohol, ether and aromatic ring. The compound exhibits moderate lipophilicity. This molecule satisfies Lipinski's Rule of Five, suggesting good oral bioavailability potential.
Educational style:
This molecule, with formula C9H8O4, is an organic compound weighing 180.16 daltons. It contains 1 aromatic ring(s), which are stable ring structures with delocalized electrons that give the molecule special stability and chemical properties. Functional groups are specific arrangements of atoms that give molecules their chemical reactivity. This molecule contains carboxylic acid groups (which can donate protons and are often acidic), ester groups (often found in fats and can be hydrolyzed), alcohol groups (which can form hydrogen bonds), ether groups (which are generally unreactive). The molecule's size and properties suggest it could potentially be developed as an oral medication.
Caffeine (CN1C=NC2=C1C(=O)N(C(=O)N2C)C):
----------------------------------------
Concise style:
This is C8H10N4O2 with molecular weight 194.19 Da.
Technical style:
Molecular formula C8H10N4O2 (MW: 194.19 Da). 2 ring(s) including 2 aromatic ring(s). LogP: -1.03, TPSA: 61.82 Ų. H-bond donors: 0, acceptors: 6. Complies with Lipinski's Rule of Five.
Detailed style:
This compound has the molecular formula C8H10N4O2 and a molecular weight of 194.19 daltons. The structure contains 14 heavy atoms, 2 aromatic ring(s). It is composed of 8 C atoms, 4 N atoms, 2 O atoms. The compound exhibits hydrophilic nature. This molecule satisfies Lipinski's Rule of Five, suggesting good oral bioavailability potential.
Educational style:
This molecule, with formula C8H10N4O2, is an organic compound weighing 194.19 daltons. It contains 2 aromatic ring(s), which are stable ring structures with delocalized electrons that give the molecule special stability and chemical properties. The molecule's size and properties suggest it could potentially be developed as an oral medication.
Creating Structured Training Data
For training LLMs, you often need structured datasets that pair chemical structures with their descriptions. Here’s how to create such datasets systematically:
def create_molecule_text_dataset(smiles_list: List[str], names: Optional[List[str]] =None, include_multiple_styles: bool=True) -> List[Dict[str, str]]:""" Create a structured dataset pairing molecules with text descriptions. Suitable for fine-tuning LLMs on chemical tasks. Args: smiles_list: List of SMILES strings names: Optional list of molecule names include_multiple_styles: Whether to generate multiple description styles Returns: List of dictionaries with 'structure', 'description', and metadata """ dataset = []for i, smiles inenumerate(smiles_list): mol = Chem.MolFromSmiles(smiles)if mol isNone:continue# Skip invalid molecules name = names[i] if names and i <len(names) elsef"Compound_{i+1}" canonical_smiles = Chem.MolToSmiles(mol)if include_multiple_styles: styles = ['concise', 'technical', 'detailed', 'educational']else: styles = ['detailed']for style in styles: description = generate_molecular_description(mol, style)# Create training example in various formats# Format 1: Direct description dataset.append({'task_type': 'molecule_description','style': style,'input_smiles': smiles,'canonical_smiles': canonical_smiles,'molecule_name': name,'description': description,'prompt': f"Describe the chemical structure {canonical_smiles}:",'response': description })# Format 2: Question-answer format dataset.append({'task_type': 'chemical_qa','style': style,'input_smiles': smiles,'canonical_smiles': canonical_smiles,'molecule_name': name,'description': description,'prompt': f"What can you tell me about the molecule with SMILES {canonical_smiles}?",'response': description })# Format 3: Named molecule descriptionif name !=f"Compound_{i+1}": dataset.append({'task_type': 'named_molecule_description','style': style,'input_smiles': smiles,'canonical_smiles': canonical_smiles,'molecule_name': name,'description': description,'prompt': f"Describe the chemical structure and properties of {name}:",'response': f"{name} ({canonical_smiles}) is a compound where {description.lower()}" })return datasetdef create_property_prediction_dataset(smiles_list: List[str], target_properties: List[str] =None) -> List[Dict[str, str]]:""" Create a dataset for training property prediction with natural language explanations. 
"""if target_properties isNone: target_properties = ['molecular_weight', 'logp', 'h_bond_donors', 'h_bond_acceptors', 'drug_like'] dataset = []for smiles in smiles_list: mol = Chem.MolFromSmiles(smiles)if mol isNone:continue features = extract_molecular_features(mol) canonical_smiles = Chem.MolToSmiles(mol)for prop in target_properties:if prop in features: value = features[prop]# Create natural language explanations for propertiesif prop =='molecular_weight': explanation =f"The molecular weight is {value} daltons, which "if value <200: explanation +="indicates a relatively small molecule."elif value >500: explanation +="suggests a large molecule that may have bioavailability issues."else: explanation +="is in a reasonable range for drug-like compounds."elif prop =='logp': explanation =f"The LogP value is {value}, indicating "if value >3: explanation +="high lipophilicity and potential membrane permeability."elif value <0: explanation +="hydrophilic character and good water solubility."else: explanation +="balanced lipophilicity suitable for oral drugs."elif prop =='h_bond_donors': explanation =f"This molecule has {value} hydrogen bond donor(s), which "if value ==0: explanation +="means it cannot donate hydrogen bonds."elif value <=5: explanation +="is within the acceptable range for drug-like molecules."else: explanation +="may limit its ability to cross biological membranes."elif prop =='drug_like': explanation =f"This molecule {'is'if value else'is not'} drug-like according to Lipinski's Rule of Five."else: explanation =f"The {prop.replace('_', ' ')} value is {value}." dataset.append({'task_type': 'property_prediction','property': prop,'input_smiles': smiles,'canonical_smiles': canonical_smiles,'property_value': value,'prompt': f"What is the {prop.replace('_', ' ')} of {canonical_smiles}?",'response': f"{value}",'explanation': explanation,'detailed_response': f"The {prop.replace('_', ' ')} of {canonical_smiles} is {value}. {explanation}" })return dataset# Example: Create training datasetsexample_molecules = [ ("CCO", "Ethanol"), ("CC(C)CC1=CC=C(C=C1)C(C)C(=O)O", "Ibuprofen"), ("CC(=O)OC1=CC=CC=C1C(=O)O", "Aspirin"), ("CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "Caffeine"), ("CC1=CC=C(C=C1)C(C)C(=O)O", "2-(4-methylphenyl)propanoic acid"),]smiles_only = [mol[0] for mol in example_molecules]names_only = [mol[1] for mol in example_molecules]print("Creating Molecule-Text Dataset...")mol_text_data = create_molecule_text_dataset(smiles_only, names_only, include_multiple_styles=False)print(f"Generated {len(mol_text_data)} training examples")print("\nExample training entries:")print("="*50)# Show a few examplesfor i, entry inenumerate(mol_text_data[:3]):print(f"\nExample {i+1}:")print(f"Task: {entry['task_type']}")print(f"Molecule: {entry['molecule_name']}")print(f"Prompt: {entry['prompt']}")print(f"Response: {entry['response'][:100]}{'...'iflen(entry['response']) >100else''}")print("\n"+"="*50)print("Creating Property Prediction Dataset...")prop_data = create_property_prediction_dataset(smiles_only[:3])print(f"Generated {len(prop_data)} property prediction examples")print("\nExample property prediction entries:")for i, entry inenumerate(prop_data[:3]):print(f"\nExample {i+1}:")print(f"Property: {entry['property']}")print(f"Molecule: {entry['canonical_smiles']}")print(f"Prompt: {entry['prompt']}")print(f"Response: {entry['detailed_response']}")
Creating Molecule-Text Dataset...
Generated 15 training examples
Example training entries:
==================================================
Example 1:
Task: molecule_description
Molecule: Ethanol
Prompt: Describe the chemical structure CCO:
Response: This compound has the molecular formula C2H6O and a molecular weight of 46.07 daltons. The structure...
Example 2:
Task: chemical_qa
Molecule: Ethanol
Prompt: What can you tell me about the molecule with SMILES CCO?
Response: This compound has the molecular formula C2H6O and a molecular weight of 46.07 daltons. The structure...
Example 3:
Task: named_molecule_description
Molecule: Ethanol
Prompt: Describe the chemical structure and properties of Ethanol:
Response: Ethanol (CCO) is a compound where this compound has the molecular formula c2h6o and a molecular weig...
==================================================
Creating Property Prediction Dataset...
Generated 15 property prediction examples
Example property prediction entries:
Example 1:
Property: molecular_weight
Molecule: CCO
Prompt: What is the molecular weight of CCO?
Response: The molecular weight of CCO is 46.07. The molecular weight is 46.07 daltons, which indicates a relatively small molecule.
Example 2:
Property: logp
Molecule: CCO
Prompt: What is the logp of CCO?
Response: The logp of CCO is -0.0. The LogP value is -0.0, indicating balanced lipophilicity suitable for oral drugs.
Example 3:
Property: h_bond_donors
Molecule: CCO
Prompt: What is the h bond donors of CCO?
Response: The h bond donors of CCO is 1. This molecule has 1 hydrogen bond donor(s), which is within the acceptable range for drug-like molecules.
Exporting Data for Different LLM Platforms
Different LLM training platforms expect different data formats. Here are functions to export your chemical training data in common formats:
def export_for_huggingface(dataset: List[Dict], output_file: str="chemical_training_data.jsonl"):""" Export dataset in HuggingFace datasets format (JSONL). Each line is a JSON object with 'text' field for language modeling or 'input'/'output' fields for instruction tuning. """withopen(output_file, 'w') as f:for entry in dataset:# Format for instruction tuning hf_entry = {'instruction': entry['prompt'],'input': entry.get('canonical_smiles', ''),'output': entry['response'],'metadata': {'task_type': entry.get('task_type', ''),'molecule_name': entry.get('molecule_name', ''),'style': entry.get('style', '') } } f.write(json.dumps(hf_entry) +'\n')print(f"Exported {len(dataset)} entries to {output_file} (HuggingFace format)")def export_for_openai(dataset: List[Dict], output_file: str="chemical_training_data_openai.jsonl"):""" Export dataset in OpenAI fine-tuning format. Each line has 'messages' with system/user/assistant format. """withopen(output_file, 'w') as f:for entry in dataset: openai_entry = {'messages': [ {'role': 'system','content': 'You are a helpful chemistry assistant that can analyze and describe chemical structures.' }, {'role': 'user', 'content': entry['prompt'] }, {'role': 'assistant','content': entry['response'] } ] } f.write(json.dumps(openai_entry) +'\n')print(f"Exported {len(dataset)} entries to {output_file} (OpenAI format)")def export_for_llama(dataset: List[Dict], output_file: str="chemical_training_data_llama.json"):""" Export dataset in LLaMA/Alpaca instruction format. """ llama_data = []for entry in dataset: llama_entry = {'instruction': entry['prompt'],'input': entry.get('canonical_smiles', ''),'output': entry['response'] } llama_data.append(llama_entry)withopen(output_file, 'w') as f: json.dump(llama_data, f, indent=2)print(f"Exported {len(dataset)} entries to {output_file} (LLaMA/Alpaca format)")def create_conversational_dataset(smiles_list: List[str], names: List[str] =None) -> List[Dict]:""" Create a conversational dataset simulating a chemistry chatbot interaction. """ conversations = []for i, smiles inenumerate(smiles_list): mol = Chem.MolFromSmiles(smiles)if mol isNone:continue name = names[i] if names and i <len(names) elsef"Compound_{i+1}" features = extract_molecular_features(mol)# Create a multi-turn conversation conversation = {'conversation_id': f"chem_chat_{i+1}",'molecule_smiles': smiles,'molecule_name': name,'turns': [] }# Turn 1: Initial structure inquiry conversation['turns'].append({'user': f"Can you analyze the structure {smiles}?",'assistant': generate_molecular_description(mol, 'detailed') })# Turn 2: Property questionif features.get('molecular_weight'): conversation['turns'].append({'user': "What's its molecular weight?",'assistant': f"The molecular weight is {features['molecular_weight']} daltons." })# Turn 3: Drug-likeness question conversation['turns'].append({'user': "Would this be a good drug candidate?",'assistant': f"{'Yes'if features.get('drug_like') else'It may face challenges'}, this molecule {'complies with'if features.get('drug_like') else'violates'} Lipinski's Rule of Five. {generate_molecular_description(mol, 'educational').split('.')[-2]}." 
}) conversations.append(conversation)return conversations# Demonstrate the export functionsprint("Exporting Training Data in Different Formats:")print("="*50)# Use the previously created datasetsample_dataset = mol_text_data[:5] # Use a small sample for demonstration# Export in different formatsexport_for_huggingface(sample_dataset, "demo_huggingface.jsonl")export_for_openai(sample_dataset, "demo_openai.jsonl") export_for_llama(sample_dataset, "demo_llama.json")# Create and show conversational datasetprint("\nCreating Conversational Dataset:")conv_data = create_conversational_dataset(smiles_only[:2], names_only[:2])print(f"Generated {len(conv_data)} conversations")print("\nExample conversation:")print("="*30)for i, turn inenumerate(conv_data[0]['turns']):print(f"Turn {i+1}:")print(f"User: {turn['user']}")print(f"Assistant: {turn['assistant'][:100]}{'...'iflen(turn['assistant']) >100else''}")print()# Show file contents (first few lines)print("Sample exported data:")print("-"*30)try:withopen("demo_huggingface.jsonl", 'r') as f: lines = f.readlines()[:2]for i, line inenumerate(lines):print(f"Entry {i+1}: {line[:100]}...")exceptFileNotFoundError:print("Demo files not created (running in demo mode)")
Exporting Training Data in Different Formats:
==================================================
Exported 5 entries to demo_huggingface.jsonl (HuggingFace format)
Exported 5 entries to demo_openai.jsonl (OpenAI format)
Exported 5 entries to demo_llama.json (LLaMA/Alpaca format)
Creating Conversational Dataset:
Generated 2 conversations
Example conversation:
==============================
Turn 1:
User: Can you analyze the structure CCO?
Assistant: This compound has the molecular formula C2H6O and a molecular weight of 46.07 daltons. The structure...
Turn 2:
User: What's its molecular weight?
Assistant: The molecular weight is 46.07 daltons.
Turn 3:
User: Would this be a good drug candidate?
Assistant: Yes, this molecule complies with Lipinski's Rule of Five. The molecule's size and properties sugges...
Sample exported data:
------------------------------
Entry 1: {"instruction": "Describe the chemical structure CCO:", "input": "CCO", "output": "This compound has...
Entry 2: {"instruction": "What can you tell me about the molecule with SMILES CCO?", "input": "CCO", "output"...
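If you go the HuggingFace route, the JSONL file written above can be loaded straight back with the datasets library (an extra dependency that isn't used elsewhere in this post); something along these lines:

# pip install datasets
from datasets import load_dataset

train_data = load_dataset("json", data_files="demo_huggingface.jsonl", split="train")
print(train_data)
print(train_data[0]["instruction"])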
Real-World Challenges and Solutions
While the integration of RDKit and LLMs offers exciting possibilities, real-world implementations face several challenges. This section addresses some common issues and outlines practical ways to handle them.
Challenge 1: Performance and Scalability
When processing large datasets or building real-time applications, performance becomes critical. Let’s explore strategies for optimization:
import multiprocessing as mpfrom concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutorimport timefrom functools import lru_cacheclass OptimizedChemicalProcessor:""" An optimized processor for handling large-scale chemical data processing with RDKit and LLM integration. """def__init__(self, max_workers: int=None, use_caching: bool=True):self.max_workers = max_workers or mp.cpu_count()self.use_caching = use_caching# Pre-compile common SMARTS patterns for efficiencyself.functional_group_patterns = { name: Chem.MolFromSmarts(smarts) for name, smarts in {'carboxylic_acid': '[CX3](=O)[OX2H1]','ester': '[#6][CX3](=O)[OX2H0][#6]','alcohol': '[OX2H]','amine': '[NX3;H2,H1;!$(NC=O)]','ketone': '[#6][CX3](=O)[#6]','aromatic_ring': 'c1ccccc1', }.items() }@lru_cache(maxsize=10000)def cached_mol_from_smiles(self, smiles: str):"""Cached molecule parsing to avoid repeated work."""return Chem.MolFromSmiles(smiles)def process_smiles_batch_parallel(self, smiles_list: List[str]) -> List[Dict]:""" Process a batch of SMILES in parallel for better performance. """# Use ProcessPoolExecutor for CPU-bound taskswith ProcessPoolExecutor(max_workers=self.max_workers) as executor:# Split into chunks to reduce overhead chunk_size =max(1, len(smiles_list) // (self.max_workers *4)) chunks = [smiles_list[i:i + chunk_size] for i inrange(0, len(smiles_list), chunk_size)]# Process chunks in parallel futures = [executor.submit(self._process_chunk, chunk) for chunk in chunks] results = []for future in futures: results.extend(future.result())return resultsdef _process_chunk(self, smiles_chunk: List[str]) -> List[Dict]:"""Process a chunk of SMILES strings.""" results = []for smiles in smiles_chunk:try:ifself.use_caching: mol =self.cached_mol_from_smiles(smiles)else: mol = Chem.MolFromSmiles(smiles)if mol: features =self._extract_features_optimized(mol) features['smiles'] = smiles results.append(features)else: results.append({'smiles': smiles, 'error': 'Invalid SMILES'})exceptExceptionas e: results.append({'smiles': smiles, 'error': str(e)})return resultsdef _extract_features_optimized(self, mol: Chem.Mol) -> Dict:"""Optimized feature extraction focusing on essential properties."""# Only calculate the most important features to save timetry: features = {'molecular_weight': round(Descriptors.MolWt(mol), 2),'logp': round(Descriptors.MolLogP(mol), 2),'num_atoms': mol.GetNumAtoms(),'num_rings': rdMolDescriptors.CalcNumRings(mol),'h_bond_donors': Descriptors.NumHDonors(mol),'h_bond_acceptors': Descriptors.NumHAcceptors(mol), }# Fast functional group detection using pre-compiled patterns functional_groups = []for fg_name, pattern inself.functional_group_patterns.items():if mol.HasSubstructMatch(pattern): functional_groups.append(fg_name) features['functional_groups'] = functional_groupsreturn featuresexceptExceptionas e:return {'error': f'Feature extraction failed: {str(e)}'}# Performance comparison demonstrationdef benchmark_processing_methods():"""Compare different processing approaches for performance."""# Generate test data test_smiles = ["CCO", "c1ccccc1", "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O","CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "CC(=O)OC1=CC=CC=C1C(=O)O" ] *200# 1000 molecules for testing processor = OptimizedChemicalProcessor()print("Performance Benchmark:")print("="*40)# Method 1: Sequential processing start_time = time.time() sequential_results = []for smiles in test_smiles: mol = Chem.MolFromSmiles(smiles)if mol: features = extract_molecular_features(mol) sequential_results.append(features) sequential_time = time.time() - 
start_time# Method 2: Parallel processing start_time = time.time() parallel_results = processor.process_smiles_batch_parallel(test_smiles) parallel_time = time.time() - start_time# Method 3: Cached processing start_time = time.time() cached_results = []for smiles in test_smiles: mol = processor.cached_mol_from_smiles(smiles)if mol: features = processor._extract_features_optimized(mol) cached_results.append(features) cached_time = time.time() - start_timeprint(f"Sequential processing: {sequential_time:.2f} seconds")print(f"Parallel processing: {parallel_time:.2f} seconds")print(f"Cached processing: {cached_time:.2f} seconds")print(f"Speedup (parallel): {sequential_time/parallel_time:.1f}x")print(f"Speedup (cached): {sequential_time/cached_time:.1f}x")# Verify results are consistentprint(f"\nResults consistency check:")print(f"Sequential results: {len(sequential_results)} molecules processed")print(f"Parallel results: {len([r for r in parallel_results if'error'notin r])} molecules processed")print(f"Cached results: {len(cached_results)} molecules processed")# Run the benchmarkbenchmark_processing_methods()
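Worth noting as an alternative to managing a process pool by hand: recent RDKit releases also ship a multithreaded SMILES supplier that parallelizes the parsing itself when reading from a file. A sketch, assuming a plain one-SMILES-per-line file named llm_output.smi (the filename and thread count are placeholders):

from rdkit import Chem

# Parses in background threads; entries that fail to parse come back as None
suppl = Chem.MultithreadedSmilesMolSupplier("llm_output.smi", titleLine=False, numWriterThreads=4)
mols = [m for m in suppl if m is not None]
print(f"Parsed {len(mols)} valid molecules")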
Challenge 2: Handling Messy and Inconsistent LLM Outputs
LLMs can produce inconsistent or partially correct chemical information. Here’s how to build robust systems that handle these issues:
class RobustChemicalParser:
    """
    A parser that can handle messy, inconsistent LLM outputs and extract
    useful chemical information with confidence scoring.
    """

    def __init__(self):
        # Common patterns for extracting chemical information from text
        self.smiles_patterns = [
            r'\b([A-Za-z0-9@+\-\[\]\(\)=#$:%\\\/\.]+)\b',             # General SMILES pattern
            r'SMILES[:\s]*([A-Za-z0-9@+\-\[\]\(\)=#$:%\\\/\.]+)',     # SMILES: prefix
            r'structure[:\s]*([A-Za-z0-9@+\-\[\]\(\)=#$:%\\\/\.]+)',  # structure: prefix
        ]
        self.molecular_weight_patterns = [
            r'molecular weight[:\s]*(\d+\.?\d*)\s*(?:Da|daltons?|g/mol)?',
            r'MW[:\s]*(\d+\.?\d*)',
            r'(\d+\.?\d*)\s*(?:Da|daltons?|g/mol)',
        ]
        self.property_patterns = {
            'logp': [r'LogP[:\s]*(-?\d+\.?\d*)', r'log\s*P[:\s]*(-?\d+\.?\d*)'],
            'tpsa': [r'TPSA[:\s]*(\d+\.?\d*)', r'polar surface area[:\s]*(\d+\.?\d*)'],
            'hbd': [r'H-bond donors?[:\s]*(\d+)', r'hydrogen bond donors?[:\s]*(\d+)'],
            'hba': [r'H-bond acceptors?[:\s]*(\d+)', r'hydrogen bond acceptors?[:\s]*(\d+)'],
        }

    def extract_chemical_entities(self, text: str) -> Dict[str, any]:
        """
        Extract chemical entities from potentially messy LLM text output.
        Returns results with confidence scores.
        """
        results = {
            'extracted_smiles': [],
            'molecular_weight': None,
            'properties': {},
            'confidence_scores': {},
            'raw_text': text
        }

        # Extract SMILES candidates
        smiles_candidates = self._extract_smiles_candidates(text)
        validated_smiles = []
        for candidate, confidence in smiles_candidates:
            validation = validate_smiles_with_details(candidate)
            if validation['is_valid']:
                validated_smiles.append({
                    'smiles': candidate,
                    'canonical_smiles': validation['canonical_smiles'],
                    'confidence': confidence,
                    'properties': validation['properties']
                })
        results['extracted_smiles'] = validated_smiles

        # Extract molecular weight
        mw_match = self._extract_molecular_weight(text)
        if mw_match:
            results['molecular_weight'] = mw_match['value']
            results['confidence_scores']['molecular_weight'] = mw_match['confidence']

        # Extract other properties
        for prop_name, patterns in self.property_patterns.items():
            prop_match = self._extract_property(text, patterns)
            if prop_match:
                results['properties'][prop_name] = prop_match['value']
                results['confidence_scores'][prop_name] = prop_match['confidence']

        return results

    def _extract_smiles_candidates(self, text: str) -> List[Tuple[str, float]]:
        """Extract SMILES candidates with confidence scores."""
        candidates = []
        for i, pattern in enumerate(self.smiles_patterns):
            matches = re.findall(pattern, text, re.IGNORECASE)
            for match in matches:
                # Basic heuristics for SMILES likelihood
                confidence = self._calculate_smiles_confidence(match, pattern_index=i)
                if confidence > 0.3:  # Minimum confidence threshold
                    candidates.append((match, confidence))

        # Remove duplicates and sort by confidence
        seen = set()
        unique_candidates = []
        for smiles, conf in sorted(candidates, key=lambda x: x[1], reverse=True):
            if smiles not in seen:
                seen.add(smiles)
                unique_candidates.append((smiles, conf))

        return unique_candidates[:5]  # Return top 5 candidates

    def _calculate_smiles_confidence(self, smiles: str, pattern_index: int) -> float:
        """Calculate confidence score for a SMILES candidate."""
        confidence = 0.5  # Base confidence

        # Pattern-based confidence adjustment
        if pattern_index == 1:  # Explicit SMILES: prefix
            confidence += 0.3
        elif pattern_index == 2:  # structure: prefix
            confidence += 0.2

        # Length-based heuristics
        if 5 <= len(smiles) <= 200:
            confidence += 0.2
        elif len(smiles) < 3 or len(smiles) > 500:
            confidence -= 0.3

        # Character composition heuristics
        valid_chars = set('CNOSPFClBrIH[]()=@+-.0123456789#\\/%$')
        if all(c in valid_chars for c in smiles):
            confidence += 0.2

        # Common SMILES patterns
        if any(pattern in smiles for pattern in ['c1ccccc1', 'CC', 'CN', 'CO']):
            confidence += 0.1

        return min(1.0, max(0.0, confidence))

    def _extract_molecular_weight(self, text: str) -> Optional[Dict]:
        """Extract molecular weight with confidence."""
        for pattern in self.molecular_weight_patterns:
            matches = re.findall(pattern, text, re.IGNORECASE)
            if matches:
                try:
                    value = float(matches[0])
                    if 10 <= value <= 5000:  # Reasonable MW range
                        return {
                            'value': value,
                            'confidence': 0.8 if 'molecular weight' in pattern else 0.6
                        }
                except ValueError:
                    continue
        return None

    def _extract_property(self, text: str, patterns: List[str]) -> Optional[Dict]:
        """Extract a property value with confidence."""
        for pattern in patterns:
            matches = re.findall(pattern, text, re.IGNORECASE)
            if matches:
                try:
                    value = float(matches[0])
                    return {'value': value, 'confidence': 0.7}
                except ValueError:
                    continue
        return None

    def cross_validate_extractions(self, text: str) -> Dict[str, any]:
        """
        Cross-validate extracted information against RDKit calculations.
        """
        extraction = self.extract_chemical_entities(text)
        validation_results = {
            'extraction': extraction,
            'validation': {},
            'discrepancies': [],
            'overall_confidence': 0.0
        }

        if extraction['extracted_smiles']:
            best_smiles = extraction['extracted_smiles'][0]
            mol = Chem.MolFromSmiles(best_smiles['canonical_smiles'])
            if mol:
                rdkit_props = extract_molecular_features(mol)

                # Cross-validate molecular weight
                if extraction['molecular_weight']:
                    rdkit_mw = rdkit_props['molecular_weight']
                    extracted_mw = extraction['molecular_weight']
                    mw_diff = abs(rdkit_mw - extracted_mw)
                    if mw_diff < 1.0:
                        validation_results['validation']['molecular_weight'] = 'MATCH'
                    elif mw_diff < 10.0:
                        validation_results['validation']['molecular_weight'] = 'CLOSE'
                        validation_results['discrepancies'].append(
                            f"MW discrepancy: extracted={extracted_mw}, calculated={rdkit_mw}"
                        )
                    else:
                        validation_results['validation']['molecular_weight'] = 'MISMATCH'
                        validation_results['discrepancies'].append(
                            f"Major MW discrepancy: extracted={extracted_mw}, calculated={rdkit_mw}"
                        )

                # Cross-validate other properties
                for prop in ['logp', 'hbd', 'hba']:
                    if prop in extraction['properties']:
                        # Add property validation logic here
                        pass

        # Calculate overall confidence
        confidence_scores = list(extraction['confidence_scores'].values())
        if confidence_scores:
            validation_results['overall_confidence'] = sum(confidence_scores) / len(confidence_scores)

        return validation_results


# Demonstration with messy LLM outputs
parser = RobustChemicalParser()

# Simulate various types of messy LLM outputs
messy_outputs = [
    "The compound has SMILES: CC(C)CC1=CC=C(C=C1)C(C)C(=O)O and molecular weight 206.3 Da",
    "This molecule (c1ccccc1) is benzene, MW: 78.11 g/mol, LogP: 2.13",
    "Structure: CCO, ethanol, with 2 H-bond donors and 1 H-bond acceptor",
    "The SMILES string is probably C1=CC=C(C=C1)C(=O)O but I'm not completely sure about the molecular weight being around 122",
    "Invalid SMILES: XYZ123ABC and some random text with molecular weight 999999 Da",
]

print("Robust Chemical Information Extraction:")
print("=" * 50)

for i, text in enumerate(messy_outputs):
    print(f"\nExample {i+1}:")
    print(f"Input: {text}")

    # Extract and validate
    result = parser.cross_validate_extractions(text)
    extraction = result['extraction']

    print(f"Extracted SMILES: {len(extraction['extracted_smiles'])} candidates")
    if extraction['extracted_smiles']:
        best = extraction['extracted_smiles'][0]
        print(f"  Best: {best['canonical_smiles']} (confidence: {best['confidence']:.2f})")

    if extraction['molecular_weight']:
        print(f"Molecular Weight: {extraction['molecular_weight']} (confidence: {extraction['confidence_scores'].get('molecular_weight', 0):.2f})")

    if result['discrepancies']:
        print(f"Discrepancies: {'; '.join(result['discrepancies'])}")

    print(f"Overall confidence: {result['overall_confidence']:.2f}")
    print("-" * 30)
Robust Chemical Information Extraction:
==================================================
Example 1:
Input: The compound has SMILES: CC(C)CC1=CC=C(C=C1)C(C)C(=O)O and molecular weight 206.3 Da
Extracted SMILES: 1 candidates
Best: CC(C)Cc1ccc(C(C)C(=O)O)cc1 (confidence: 1.00)
Molecular Weight: 206.3 (confidence: 0.80)
Overall confidence: 0.80
------------------------------
Example 2:
Input: This molecule (c1ccccc1) is benzene, MW: 78.11 g/mol, LogP: 2.13
Extracted SMILES: 1 candidates
Best: c1ccccc1 (confidence: 0.80)
Molecular Weight: 78.11 (confidence: 0.60)
Overall confidence: 0.65
------------------------------
Example 3:
Input: Structure: CCO, ethanol, with 2 H-bond donors and 1 H-bond acceptor
Extracted SMILES: 1 candidates
Best: CCO (confidence: 1.00)
Overall confidence: 0.00
------------------------------
Example 4:
Input: The SMILES string is probably C1=CC=C(C=C1)C(=O)O but I'm not completely sure about the molecular weight being around 122
Extracted SMILES: 1 candidates
Best: O=C(O)c1ccccc1 (confidence: 1.00)
Overall confidence: 0.00
------------------------------
Example 5:
Input: Invalid SMILES: XYZ123ABC and some random text with molecular weight 999999 Da
Extracted SMILES: 0 candidates
Overall confidence: 0.00
------------------------------
[09:53:55] SMILES Parse Error: syntax error while parsing: 206.3
[09:53:55] SMILES Parse Error: check for mistakes around position 1:
[09:53:55] 206.3
[09:53:55] ^
[09:53:55] SMILES Parse Error: Failed parsing SMILES '206.3' for input: '206.3'
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'compound' for input: 'compound'
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'SMILES' for input: 'SMILES'
[09:53:55] SMILES Parse Error: syntax error while parsing: molecular
[09:53:55] SMILES Parse Error: check for mistakes around position 1:
[09:53:55] molecular
[09:53:55] ^
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'molecular' for input: 'molecular'
[09:53:55] SMILES Parse Error: syntax error while parsing: 78.11
[09:53:55] SMILES Parse Error: check for mistakes around position 1:
[09:53:55] 78.11
[09:53:55] ^
[09:53:55] SMILES Parse Error: Failed parsing SMILES '78.11' for input: '78.11'
[09:53:55] SMILES Parse Error: syntax error while parsing: molecule
[09:53:55] SMILES Parse Error: check for mistakes around position 1:
[09:53:55] molecule
[09:53:55] ^
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'molecule' for input: 'molecule'
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'benzene' for input: 'benzene'
[09:53:55] SMILES Parse Error: syntax error while parsing: g/mol
[09:53:55] SMILES Parse Error: check for mistakes around position 1:
[09:53:55] g/mol
[09:53:55] ^
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'g/mol' for input: 'g/mol'
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'Structure' for input: 'Structure'
[09:53:55] SMILES Parse Error: syntax error while parsing: ethanol
[09:53:55] SMILES Parse Error: check for mistakes around position 1:
[09:53:55] ethanol
[09:53:55] ^
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'ethanol' for input: 'ethanol'
[09:53:55] SMILES Parse Error: syntax error while parsing: H-bond
[09:53:55] SMILES Parse Error: check for mistakes around position 1:
[09:53:55] H-bond
[09:53:55] ^
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'H-bond' for input: 'H-bond'
[09:53:55] SMILES Parse Error: syntax error while parsing: donors
[09:53:55] SMILES Parse Error: check for mistakes around position 1:
[09:53:55] donors
[09:53:55] ^
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'donors' for input: 'donors'
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'string' for input: 'string'
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'SMILES' for input: 'SMILES'
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'probably' for input: 'probably'
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'completely' for input: 'completely'
[09:53:55] SMILES Parse Error: syntax error while parsing: XYZ123ABC
[09:53:55] SMILES Parse Error: check for mistakes around position 1:
[09:53:55] XYZ123ABC
[09:53:55] ^
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'XYZ123ABC' for input: 'XYZ123ABC'
[09:53:55] SMILES Parse Error: syntax error while parsing: 999999
[09:53:55] SMILES Parse Error: check for mistakes around position 1:
[09:53:55] 999999
[09:53:55] ^
[09:53:55] SMILES Parse Error: Failed parsing SMILES '999999' for input: '999999'
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'Invalid' for input: 'Invalid'
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'SMILES' for input: 'SMILES'
[09:53:55] SMILES Parse Error: syntax error while parsing: random
[09:53:55] SMILES Parse Error: check for mistakes around position 1:
[09:53:55] random
[09:53:55] ^
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'random' for input: 'random'
Challenge 3: Data Quality and Contamination
Chemical datasets often contain errors, duplicates, and inconsistencies. Here’s how to build quality control systems:
NOTE FROM GREG: I needed to add an import to the following block
# EDIT BY GREG: add missing import
from rdkit import DataStructs


class ChemicalDataQualityController:
    """
    A comprehensive system for detecting and handling data quality issues
    in chemical datasets used for LLM training.
    """

    def __init__(self, similarity_threshold: float = 0.85):
        self.similarity_threshold = similarity_threshold
        self.quality_stats = {
            'total_processed': 0,
            'duplicates_found': 0,
            'invalid_structures': 0,
            'outliers_detected': 0,
            'cleaned_structures': 0
        }

    def comprehensive_quality_check(self, data: List[Dict]) -> Dict[str, any]:
        """
        Perform comprehensive quality checks on chemical dataset.

        Args:
            data: List of dictionaries with 'smiles' and optional 'name', 'description' keys

        Returns:
            Quality report with flagged issues and cleaned data
        """
        report = {
            'original_size': len(data),
            'issues': {
                'duplicates': [],
                'invalid_smiles': [],
                'structural_outliers': [],
                'text_inconsistencies': [],
                'suspicious_patterns': []
            },
            'cleaned_data': [],
            'recommendations': []
        }

        print("Running comprehensive data quality checks...")

        # Step 1: Basic validation and cleaning
        valid_entries = []
        for i, entry in enumerate(data):
            smiles = entry.get('smiles', '')
            validation = validate_smiles_with_details(smiles)
            if validation['is_valid']:
                # Store canonical SMILES for consistency
                entry['canonical_smiles'] = validation['canonical_smiles']
                entry['original_index'] = i
                valid_entries.append(entry)
            else:
                report['issues']['invalid_smiles'].append({
                    'index': i,
                    'smiles': smiles,
                    'errors': validation['errors']
                })

        print(f"  Valid structures: {len(valid_entries)}/{len(data)}")

        # Step 2: Duplicate detection
        duplicates = self._find_duplicates(valid_entries)
        report['issues']['duplicates'] = duplicates

        # Step 3: Structural outlier detection
        outliers = self._detect_structural_outliers(valid_entries)
        report['issues']['structural_outliers'] = outliers

        # Step 4: Text-structure consistency checks
        if any('description' in entry for entry in valid_entries):
            inconsistencies = self._check_text_structure_consistency(valid_entries)
            report['issues']['text_inconsistencies'] = inconsistencies

        # Step 5: Suspicious pattern detection
        suspicious = self._detect_suspicious_patterns(valid_entries)
        report['issues']['suspicious_patterns'] = suspicious

        # Step 6: Create cleaned dataset
        cleaned_data = self._create_cleaned_dataset(valid_entries, report['issues'])
        report['cleaned_data'] = cleaned_data
        report['final_size'] = len(cleaned_data)

        # Step 7: Generate recommendations
        report['recommendations'] = self._generate_recommendations(report)

        return report

    def _find_duplicates(self, entries: List[Dict]) -> List[Dict]:
        """Find duplicate structures using canonical SMILES and similarity."""
        print("  Checking for duplicates...")
        duplicates = []
        seen_smiles = {}

        # Exact duplicates (same canonical SMILES)
        for entry in entries:
            canonical = entry['canonical_smiles']
            if canonical in seen_smiles:
                duplicates.append({
                    'type': 'exact_duplicate',
                    'indices': [seen_smiles[canonical]['original_index'], entry['original_index']],
                    'smiles': canonical,
                    'names': [seen_smiles[canonical].get('name', 'Unknown'), entry.get('name', 'Unknown')]
                })
            else:
                seen_smiles[canonical] = entry

        # Near-duplicates (high structural similarity)
        if len(entries) < 1000:  # Only for smaller datasets due to O(n²) complexity
            fingerprints = []
            for entry in entries:
                mol = Chem.MolFromSmiles(entry['canonical_smiles'])
                if mol:
                    fp = Chem.RDKFingerprint(mol)
                    fingerprints.append((entry, fp))

            for i, (entry1, fp1) in enumerate(fingerprints):
                for j, (entry2, fp2) in enumerate(fingerprints[i+1:], i+1):
                    similarity = DataStructs.TanimotoSimilarity(fp1, fp2)
                    if similarity > self.similarity_threshold:
                        duplicates.append({
                            'type': 'near_duplicate',
                            'similarity': similarity,
                            'indices': [entry1['original_index'], entry2['original_index']],
                            'smiles': [entry1['canonical_smiles'], entry2['canonical_smiles']],
                            'names': [entry1.get('name', 'Unknown'), entry2.get('name', 'Unknown')]
                        })

        return duplicates

    def _detect_structural_outliers(self, entries: List[Dict]) -> List[Dict]:
        """Detect structural outliers that might indicate data quality issues."""
        print("  Detecting structural outliers...")
        outliers = []
        properties = []

        # Calculate properties for all molecules
        for entry in entries:
            mol = Chem.MolFromSmiles(entry['canonical_smiles'])
            if mol:
                props = {
                    'mw': Descriptors.MolWt(mol),
                    'logp': Descriptors.MolLogP(mol),
                    'num_atoms': mol.GetNumAtoms(),
                    'num_rings': rdMolDescriptors.CalcNumRings(mol),
                    'entry': entry
                }
                properties.append(props)

        if not properties:
            return outliers

        # Statistical outlier detection using IQR method
        for prop_name in ['mw', 'logp', 'num_atoms', 'num_rings']:
            values = [p[prop_name] for p in properties]
            q1 = pd.Series(values).quantile(0.25)
            q3 = pd.Series(values).quantile(0.75)
            iqr = q3 - q1
            lower_bound = q1 - 3 * iqr  # More permissive than standard 1.5*IQR
            upper_bound = q3 + 3 * iqr

            for props in properties:
                value = props[prop_name]
                if value < lower_bound or value > upper_bound:
                    outliers.append({
                        'type': f'{prop_name}_outlier',
                        'index': props['entry']['original_index'],
                        'smiles': props['entry']['canonical_smiles'],
                        'property': prop_name,
                        'value': value,
                        'expected_range': f'{lower_bound:.1f} - {upper_bound:.1f}'
                    })

        return outliers

    def _check_text_structure_consistency(self, entries: List[Dict]) -> List[Dict]:
        """Check consistency between text descriptions and calculated properties."""
        print("  Checking text-structure consistency...")
        inconsistencies = []

        for entry in entries:
            if 'description' not in entry:
                continue

            description = entry['description'].lower()
            mol = Chem.MolFromSmiles(entry['canonical_smiles'])
            if not mol:
                continue

            features = extract_molecular_features(mol)

            # Check molecular weight consistency
            if 'molecular weight' in description or 'mw' in description:
                # Extract numbers that might be molecular weights
                mw_numbers = re.findall(r'\b(\d{2,4}(?:\.\d+)?)\b', description)
                actual_mw = features['molecular_weight']

                found_matching_mw = False
                for num_str in mw_numbers:
                    try:
                        described_mw = float(num_str)
                        if abs(described_mw - actual_mw) < 5:  # 5 Da tolerance
                            found_matching_mw = True
                            break
                    except ValueError:
                        continue

                if not found_matching_mw and mw_numbers:
                    inconsistencies.append({
                        'type': 'molecular_weight_mismatch',
                        'index': entry['original_index'],
                        'smiles': entry['canonical_smiles'],
                        'described_mw': mw_numbers,
                        'calculated_mw': actual_mw,
                        'description_excerpt': description[:100] + '...'
                    })

            # Check functional group mentions
            mentioned_groups = []
            if 'alcohol' in description:
                mentioned_groups.append('alcohol')
            if 'carboxylic acid' in description or 'carboxyl' in description:
                mentioned_groups.append('carboxylic_acid')
            if 'amine' in description:
                mentioned_groups.append('amine')
            if 'aromatic' in description or 'benzene' in description:
                mentioned_groups.append('aromatic')

            detected_groups = [fg.split('(')[0] for fg in features.get('functional_groups', [])]

            for mentioned in mentioned_groups:
                if mentioned not in detected_groups and mentioned != 'aromatic':
                    inconsistencies.append({
                        'type': 'functional_group_mismatch',
                        'index': entry['original_index'],
                        'smiles': entry['canonical_smiles'],
                        'mentioned_group': mentioned,
                        'detected_groups': detected_groups,
                        'description_excerpt': description[:100] + '...'
                    })

        return inconsistencies

    def _detect_suspicious_patterns(self, entries: List[Dict]) -> List[Dict]:
        """Detect suspicious patterns that might indicate synthetic or corrupted data."""
        print("  Detecting suspicious patterns...")
        suspicious = []

        # Pattern 1: Repetitive SMILES patterns
        smiles_parts = {}
        for entry in entries:
            smiles = entry['canonical_smiles']
            # Look for repeated substrings
            for length in [3, 4, 5]:
                for i in range(len(smiles) - length + 1):
                    substring = smiles[i:i+length]
                    if substring.count(substring[0]) != len(substring):  # Not all same character
                        smiles_parts[substring] = smiles_parts.get(substring, 0) + 1

        # Flag SMILES with highly repetitive patterns
        for entry in entries:
            smiles = entry['canonical_smiles']
            repetitive_score = 0
            for substring, count in smiles_parts.items():
                if count > len(entries) * 0.1 and substring in smiles:  # Appears in >10% of dataset
                    repetitive_score += count

            if repetitive_score > len(entries) * 0.2:
                suspicious.append({
                    'type': 'repetitive_smiles_pattern',
                    'index': entry['original_index'],
                    'smiles': smiles,
                    'repetitive_score': repetitive_score
                })

        # Pattern 2: Unrealistic property combinations
        for entry in entries:
            mol = Chem.MolFromSmiles(entry['canonical_smiles'])
            if mol:
                mw = Descriptors.MolWt(mol)
                num_atoms = mol.GetNumAtoms()

                # Flag extremely dense molecules (too much mass for number of atoms)
                if num_atoms > 0 and mw / num_atoms > 50:  # Average atomic weight > 50
                    suspicious.append({
                        'type': 'unrealistic_density',
                        'index': entry['original_index'],
                        'smiles': entry['canonical_smiles'],
                        'mw_per_atom': mw / num_atoms
                    })

        return suspicious

    def _create_cleaned_dataset(self, entries: List[Dict], issues: Dict) -> List[Dict]:
        """Create a cleaned dataset by removing problematic entries."""
        print("  Creating cleaned dataset...")

        # Collect indices to remove
        indices_to_remove = set()

        # Remove exact duplicates (keep first occurrence)
        for dup in issues['duplicates']:
            if dup['type'] == 'exact_duplicate':
                indices_to_remove.add(dup['indices'][1])  # Remove second occurrence

        # Remove severe outliers
        for outlier in issues['structural_outliers']:
            if outlier['type'] in ['num_atoms_outlier', 'mw_outlier']:
                # Only remove extreme outliers
                if outlier['type'] == 'num_atoms_outlier' and outlier['value'] > 200:
                    indices_to_remove.add(outlier['index'])
                elif outlier['type'] == 'mw_outlier' and outlier['value'] > 2000:
                    indices_to_remove.add(outlier['index'])

        # Remove highly suspicious entries
        for susp in issues['suspicious_patterns']:
            if susp['type'] == 'unrealistic_density' and susp['mw_per_atom'] > 100:
                indices_to_remove.add(susp['index'])

        # Create cleaned dataset
        cleaned = []
        for entry in entries:
            if entry['original_index'] not in indices_to_remove:
                # Clean up the entry
                cleaned_entry = {
                    'smiles': entry['canonical_smiles'],  # Use canonical form
                    'name': entry.get('name', ''),
                    'description': entry.get('description', '')
                }
                cleaned.append(cleaned_entry)

        return cleaned

    def _generate_recommendations(self, report: Dict) -> List[str]:
        """Generate recommendations based on quality issues found."""
        recommendations = []

        # Duplicate recommendations
        if report['issues']['duplicates']:
            exact_dups = len([d for d in report['issues']['duplicates'] if d['type'] == 'exact_duplicate'])
            near_dups = len([d for d in report['issues']['duplicates'] if d['type'] == 'near_duplicate'])
            if exact_dups > 0:
                recommendations.append(f"Remove {exact_dups} exact duplicate structures")
            if near_dups > 0:
                recommendations.append(f"Review {near_dups} near-duplicate pairs (similarity > {self.similarity_threshold})")

        # Data size recommendations
        removal_rate = (report['original_size'] - report['final_size']) / report['original_size']
        if removal_rate > 0.2:
            recommendations.append(f"High removal rate ({removal_rate:.1%}) suggests significant data quality issues")

        # Outlier recommendations
        if report['issues']['structural_outliers']:
            recommendations.append("Review structural outliers - they may indicate data entry errors")

        # Consistency recommendations
        if report['issues']['text_inconsistencies']:
            recommendations.append("Text descriptions inconsistent with structures - consider regenerating descriptions")

        if not recommendations:
            recommendations.append("Dataset quality looks good! No major issues detected.")

        return recommendations


# Demonstration with a problematic dataset
def create_test_dataset_with_issues():
    """Create a test dataset with various quality issues for demonstration."""
    return [
        # Good entries
        {'smiles': 'CCO', 'name': 'Ethanol', 'description': 'Ethanol is an alcohol with molecular weight 46.07 Da'},
        {'smiles': 'c1ccccc1', 'name': 'Benzene', 'description': 'Benzene is an aromatic compound'},
        # Duplicates
        {'smiles': 'CCO', 'name': 'Ethyl alcohol', 'description': 'Another entry for ethanol'},  # Exact duplicate
        {'smiles': 'c1ccccc1', 'name': 'Benzene ring', 'description': 'Benzene ring structure'},  # Exact duplicate
        # Invalid SMILES
        {'smiles': 'INVALID123', 'name': 'Bad entry', 'description': 'This is not a valid SMILES'},
        {'smiles': 'C1CCC', 'name': 'Unclosed ring', 'description': 'Missing ring closure'},
        # Outliers
        {'smiles': 'C'*100, 'name': 'Very long chain', 'description': 'Extremely long alkyl chain'},
        # Text inconsistencies
        {'smiles': 'CCO', 'name': 'Ethanol', 'description': 'This alcohol has molecular weight 200 Da'},  # Wrong MW
        {'smiles': 'c1ccccc1', 'name': 'Benzene', 'description': 'Contains carboxylic acid groups'},  # Wrong functional group
        # Near duplicates
        {'smiles': 'CC(C)O', 'name': 'Isopropanol', 'description': 'Isopropyl alcohol'},
        # Good entries for padding
        {'smiles': 'CC(=O)O', 'name': 'Acetic acid', 'description': 'Acetic acid is a carboxylic acid'},
        {'smiles': 'CN', 'name': 'Methylamine', 'description': 'Simple amine compound'},
    ]


# Run quality control demonstration
print("Chemical Data Quality Control Demonstration")
print("=" * 50)

# Create test dataset with known issues
test_data = create_test_dataset_with_issues()
print(f"Test dataset created with {len(test_data)} entries")

# Run quality control
qc = ChemicalDataQualityController(similarity_threshold=0.8)
quality_report = qc.comprehensive_quality_check(test_data)

# Display results
print(f"\nQuality Control Results:")
print(f"Original dataset size: {quality_report['original_size']}")
print(f"Cleaned dataset size: {quality_report['final_size']}")
print(f"Removal rate: {(1 - quality_report['final_size']/quality_report['original_size'])*100:.1f}%")

print(f"\nIssues Found:")
for issue_type, issues in quality_report['issues'].items():
    if issues:
        print(f"  {issue_type.replace('_', ' ').title()}: {len(issues)} cases")

print(f"\nRecommendations:")
for i, rec in enumerate(quality_report['recommendations'], 1):
    print(f"  {i}. {rec}")

# Show example of cleaned data
print(f"\nFirst 3 entries of cleaned dataset:")
for i, entry in enumerate(quality_report['cleaned_data'][:3]):
    print(f"  {i+1}. {entry['name']}: {entry['smiles']}")
Chemical Data Quality Control Demonstration
==================================================
Test dataset created with 12 entries
Running comprehensive data quality checks...
Valid structures: 10/12
Checking for duplicates...
Detecting structural outliers...
Checking text-structure consistency...
Detecting suspicious patterns...
Creating cleaned dataset...
Quality Control Results:
Original dataset size: 12
Cleaned dataset size: 6
Removal rate: 50.0%
Issues Found:
Duplicates: 10 cases
Invalid Smiles: 2 cases
Structural Outliers: 3 cases
Text Inconsistencies: 2 cases
Suspicious Patterns: 6 cases
Recommendations:
1. Remove 4 exact duplicate structures
2. Review 6 near-duplicate pairs (similarity > 0.8)
3. High removal rate (50.0%) suggests significant data quality issues
4. Review structural outliers - they may indicate data entry errors
5. Text descriptions inconsistent with structures - consider regenerating descriptions
First 3 entries of cleaned dataset:
1. Ethanol: CCO
2. Benzene: c1ccccc1
3. Very long chain: CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
[09:55:35] SMILES Parse Error: Failed parsing SMILES 'INVALID123' for input: 'INVALID123'
[09:55:35] SMILES Parse Error: unclosed ring for input: 'C1CCC'
Advanced Contamination Detection
NOTE FROM GREG: the following section of code does not work. I’m leaving it here as the first example of something that didn’t actually work as generated by the LLM.
Data contamination is a critical issue when training LLMs on chemical data. Beyond basic quality checks, sophisticated contamination detection is essential:
Cross-Dataset Leakage Detection:
def detect_cross_dataset_leakage(train_smiles, test_smiles, validation_smiles):
    """Detect potential data leakage between training, test, and validation sets"""
    # Implementation would check for exact matches and high-similarity compounds
    pass
Temporal Contamination: Chemical databases often contain compounds discovered after certain dates. For historically-aware models, ensure temporal consistency:
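As a minimal sketch of what such a check could look like (not part of the generated code above), assume a hypothetical first_publication_dates lookup that maps canonical SMILES to a first-reported date; in practice you would have to build this from literature or registry metadata:

from datetime import date
from rdkit import Chem

# Hypothetical lookup: canonical SMILES -> first publication/registration date.
first_publication_dates = {
    'CCO': date(1800, 1, 1),
    'CC(=O)Nc1ccc(O)cc1': date(1878, 1, 1),
}

def filter_by_date(smiles_list, cutoff=date(2020, 1, 1)):
    """Keep only compounds first reported before the cutoff date."""
    kept, unknown = [], []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        canonical = Chem.MolToSmiles(mol)
        first_seen = first_publication_dates.get(canonical)
        if first_seen is None:
            unknown.append(canonical)  # no date information: flag for manual review
        elif first_seen < cutoff:
            kept.append(canonical)
    return kept, unknown

kept, unknown = filter_by_date(['CCO', 'CC(=O)Nc1ccc(O)cc1', 'c1ccncc1'])
print(f"kept {len(kept)}, needs review: {len(unknown)}")

Compounds without date information are returned separately rather than silently kept, so they can be reviewed by hand.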
Vendor Catalog Contamination: Many “novel” compounds in datasets are actually from commercial catalogs, which can lead to unrealistic performance expectations.
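One simple way to get a handle on this, assuming you can export a set of canonical SMILES from the vendor catalogs you care about (the vendor_catalog set below is just a stand-in), is to measure the exact-match overlap:

from rdkit import Chem

def vendor_overlap_fraction(dataset_smiles, vendor_catalog):
    """Fraction of a dataset that also appears in a vendor catalog (by canonical SMILES)."""
    canonical = set()
    for smi in dataset_smiles:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            canonical.add(Chem.MolToSmiles(mol))
    if not canonical:
        return 0.0
    return len(canonical & vendor_catalog) / len(canonical)

# Stand-in catalog; a real check would use canonical SMILES exported from vendor files
vendor_catalog = {'CCO', 'c1ccccc1', 'CC(C)O'}
print(f"{vendor_overlap_fraction(['CCO', 'CCN', 'c1ccccc1'], vendor_catalog):.0%} of the dataset is in the catalog")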
Synthetic Accessibility Bias: Datasets often over-represent easily synthesizable compounds, creating bias in LLM outputs.
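A quick way to look for this bias is to examine the distribution of synthetic accessibility scores across the dataset. The sketch below uses the SA_Score module shipped in the RDKit Contrib directory (scores run from 1, easy, to 10, hard); a mean sitting near the bottom of the scale suggests the set leans heavily toward easily synthesizable compounds:

import os
import sys

from rdkit import Chem
from rdkit.Chem import RDConfig

# The SA score implementation lives in the RDKit Contrib directory
sys.path.append(os.path.join(RDConfig.RDContribDir, 'SA_Score'))
import sascorer

def mean_sa_score(smiles_list):
    """Mean synthetic-accessibility score (1 = easy, 10 = hard) for a list of SMILES."""
    scores = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            scores.append(sascorer.calculateScore(mol))
    return sum(scores) / len(scores) if scores else None

print(f"mean SA score: {mean_sa_score(['CCO', 'c1ccccc1', 'CC(=O)Nc1ccc(O)cc1']):.2f}")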
class ContaminationDetector:
    """
    Advanced contamination detection for chemical datasets used in LLM training.
    """

    def __init__(self, similarity_threshold: float = 0.85):
        self.similarity_threshold = similarity_threshold

    def detect_cross_dataset_leakage(self, datasets: Dict[str, List[str]]) -> Dict:
        """
        Detect potential data leakage between multiple datasets.

        Args:
            datasets: Dictionary with dataset names as keys and lists of SMILES as values
                      e.g., {'train': [...], 'test': [...], 'validation': [...]}

        Returns:
            Dictionary with contamination analysis results
        """
        print("Detecting cross-dataset contamination...")

        results = {
            'exact_overlaps': {},
            'similarity_overlaps': {},
            'summary': {}
        }

        # Convert SMILES to canonical form and create fingerprints
        canonical_datasets = {}
        fingerprint_datasets = {}

        for name, smiles_list in datasets.items():
            canonical_smiles = []
            fingerprints = []
            for smiles in smiles_list:
                mol = Chem.MolFromSmiles(smiles)
                if mol:
                    canonical = Chem.MolToSmiles(mol)
                    canonical_smiles.append(canonical)
                    fingerprints.append(Chem.RDKFingerprint(mol))
            canonical_datasets[name] = canonical_smiles
            fingerprint_datasets[name] = fingerprints
            print(f"  {name}: {len(canonical_smiles)} valid structures")

        # Check exact overlaps
        dataset_names = list(datasets.keys())
        for i, name1 in enumerate(dataset_names):
            for name2 in dataset_names[i+1:]:
                overlap_key = f"{name1}_vs_{name2}"
                set1 = set(canonical_datasets[name1])
                set2 = set(canonical_datasets[name2])
                exact_overlap = set1.intersection(set2)

                results['exact_overlaps'][overlap_key] = {
                    'count': len(exact_overlap),
                    'percentage_of_first': len(exact_overlap) / len(set1) * 100 if set1 else 0,
                    'percentage_of_second': len(exact_overlap) / len(set2) * 100 if set2 else 0,
                    'overlapping_smiles': list(exact_overlap)[:10]  # First 10 examples
                }

        # Check similarity-based overlaps (computationally expensive, limited to smaller datasets)
        for i, name1 in enumerate(dataset_names):
            for name2 in dataset_names[i+1:]:
                overlap_key = f"{name1}_vs_{name2}"
                if len(fingerprint_datasets[name1]) < 1000 and len(fingerprint_datasets[name2]) < 1000:
                    similar_pairs = []
                    for idx1, fp1 in enumerate(fingerprint_datasets[name1]):
                        for idx2, fp2 in enumerate(fingerprint_datasets[name2]):
                            similarity = DataStructs.TanimotoSimilarity(fp1, fp2)
                            if similarity > self.similarity_threshold:
                                similar_pairs.append({
                                    'smiles1': canonical_datasets[name1][idx1],
                                    'smiles2': canonical_datasets[name2][idx2],
                                    'similarity': similarity
                                })
                    results['similarity_overlaps'][overlap_key] = {
                        'count': len(similar_pairs),
                        'pairs': similar_pairs[:5]  # First 5 examples
                    }
                else:
                    results['similarity_overlaps'][overlap_key] = {
                        'count': 'skipped_large_dataset',
                        'note': 'Dataset too large for pairwise comparison'
                    }

        # Generate summary
        total_exact_overlaps = sum(result['count'] for result in results['exact_overlaps'].values()
                                   if isinstance(result['count'], int))
        results['summary'] = {
            'total_exact_overlaps': total_exact_overlaps,
            'datasets_analyzed': len(datasets),
            'contamination_severity': self._assess_contamination_severity(results),
            'recommendations': self._generate_contamination_recommendations(results)
        }

        return results

    def detect_temporal_contamination(self, smiles_list: List[str], target_date: str = "2020-01-01") -> Dict:
        """
        Detect compounds that might be from after a target date.

        Note: This is a simplified version - real implementation would require
        a database of compound discovery/publication dates.
        """
        print(f"Checking for temporal contamination (compounds after {target_date})...")

        # This is a placeholder implementation
        # In practice, you'd need a database mapping SMILES to discovery dates
        suspicious_compounds = []

        for smiles in smiles_list:
            mol = Chem.MolFromSmiles(smiles)
            if mol:
                # Example heuristic: very complex molecules might be recent
                complexity_score = self._calculate_molecular_complexity(mol)
                if complexity_score > 20:  # Arbitrary threshold
                    suspicious_compounds.append({
                        'smiles': smiles,
                        'complexity_score': complexity_score,
                        'reason': 'high_complexity'
                    })

        return {
            'suspicious_count': len(suspicious_compounds),
            'suspicious_compounds': suspicious_compounds[:10],
            'recommendation': f"Manual review recommended for {len(suspicious_compounds)} complex compounds"
        }

    def detect_vendor_catalog_bias(self, smiles_list: List[str]) -> Dict:
        """
        Detect potential bias from commercial compound catalogs.
        """
        print("Detecting vendor catalog bias...")

        # Common patterns in commercial compounds
        vendor_patterns = {
            'simple_aromatics': r'c1ccccc1',                        # Simple benzene rings
            'common_protecting_groups': r'C\(=O\)OC\(C\)\(C\)C',    # Boc groups
            'standard_linkers': r'OCCOCCOC',                        # PEG-like linkers
        }

        pattern_counts = {pattern: 0 for pattern in vendor_patterns.keys()}
        flagged_compounds = []

        for smiles in smiles_list:
            mol = Chem.MolFromSmiles(smiles)
            if mol:
                # Check for overly simple structures
                if mol.GetNumAtoms() < 10 and mol.GetNumBonds() < 10:
                    flagged_compounds.append({
                        'smiles': smiles,
                        'reason': 'overly_simple',
                        'num_atoms': mol.GetNumAtoms()
                    })

                # Check for common patterns
                for pattern_name, pattern in vendor_patterns.items():
                    if re.search(pattern, smiles):
                        pattern_counts[pattern_name] += 1

        return {
            'pattern_counts': pattern_counts,
            'flagged_simple': len(flagged_compounds),
            'total_analyzed': len(smiles_list),
            'bias_indicators': {
                'high_simple_proportion': len(flagged_compounds) / len(smiles_list) > 0.3,
                'pattern_dominance': any(count > len(smiles_list) * 0.2 for count in pattern_counts.values())
            }
        }

    def _calculate_molecular_complexity(self, mol) -> float:
        """Calculate a simple molecular complexity score."""
        if not mol:
            return 0

        complexity = 0
        complexity += mol.GetNumAtoms() * 0.1
        complexity += mol.GetNumBonds() * 0.1
        complexity += rdMolDescriptors.CalcNumRings(mol) * 2
        complexity += rdMolDescriptors.CalcNumHeteroatoms(mol) * 1.5
        complexity += len(rdMolDescriptors.GetUSRScore(mol)) * 0.01 if hasattr(rdMolDescriptors, 'GetUSRScore') else 0

        return complexity

    def _assess_contamination_severity(self, results: Dict) -> str:
        """Assess the overall severity of contamination."""
        exact_overlaps = sum(result['count'] for result in results['exact_overlaps'].values()
                             if isinstance(result['count'], int))
        if exact_overlaps == 0:
            return "low"
        elif exact_overlaps < 10:
            return "moderate"
        else:
            return "high"

    def _generate_contamination_recommendations(self, results: Dict) -> List[str]:
        """Generate recommendations based on contamination analysis."""
        recommendations = []
        severity = self._assess_contamination_severity(results)

        if severity == "high":
            recommendations.append("CRITICAL: High level of contamination detected - review data splitting strategy")
            recommendations.append("Consider using scaffold-based or time-based splitting instead of random splitting")
        elif severity == "moderate":
            recommendations.append("Moderate contamination detected - remove overlapping compounds")
        else:
            recommendations.append("Low contamination level - acceptable for most applications")

        # Check for specific issues
        for overlap_key, overlap_data in results['exact_overlaps'].items():
            if overlap_data['count'] > 0:
                recommendations.append(f"Remove {overlap_data['count']} overlapping compounds between {overlap_key}")

        return recommendations


# Demonstration of contamination detection
print("Contamination Detection Demonstration")
print("=" * 40)

# Create example datasets with known contamination
example_datasets = {
    'train': [
        'CCO',          # ethanol
        'c1ccccc1',     # benzene
        'CC(=O)O',      # acetic acid
        'CC(C)O',       # isopropanol
        'CN',           # methylamine
    ],
    'test': [
        'CCO',          # CONTAMINATION: same as train
        'c1ccc(O)cc1',  # phenol
        'CC(=O)OC',     # methyl acetate
    ],
    'validation': [
        'CC(=O)O',      # CONTAMINATION: same as train
        'CC(C)(C)O',    # tert-butanol
        'c1cccnc1',     # pyridine
    ]
}

detector = ContaminationDetector(similarity_threshold=0.8)

# Run contamination analysis
contamination_results = detector.detect_cross_dataset_leakage(example_datasets)

print("\nContamination Analysis Results:")
print(f"Contamination severity: {contamination_results['summary']['contamination_severity']}")
print(f"Total exact overlaps: {contamination_results['summary']['total_exact_overlaps']}")

print("\nExact overlaps found:")
for overlap_key, overlap_data in contamination_results['exact_overlaps'].items():
    if overlap_data['count'] > 0:
        print(f"  {overlap_key}: {overlap_data['count']} compounds ({overlap_data['percentage_of_first']:.1f}% of first dataset)")
        print(f"    Examples: {overlap_data['overlapping_smiles']}")

print("\nRecommendations:")
for i, rec in enumerate(contamination_results['summary']['recommendations'], 1):
    print(f"  {i}. {rec}")

# Test temporal contamination detection
print(f"\nTemporal Contamination Check:")
temporal_results = detector.detect_temporal_contamination(example_datasets['train'])
print(f"Suspicious compounds: {temporal_results['suspicious_count']}")

# Test vendor bias detection
print(f"\nVendor Catalog Bias Check:")
bias_results = detector.detect_vendor_catalog_bias(example_datasets['train'])
print(f"Simple compounds flagged: {bias_results['flagged_simple']}/{bias_results['total_analyzed']}")
print(f"Bias indicators detected: {any(bias_results['bias_indicators'].values())}")
Contamination Detection Demonstration
========================================
Detecting cross-dataset contamination...
train: 5 valid structures
test: 3 valid structures
validation: 3 valid structures
Contamination Analysis Results:
Contamination severity: moderate
Total exact overlaps: 2
Exact overlaps found:
train_vs_test: 1 compounds (20.0% of first dataset)
Examples: ['CCO']
train_vs_validation: 1 compounds (20.0% of first dataset)
Examples: ['CC(=O)O']
Recommendations:
1. Moderate contamination detected - remove overlapping compounds
2. Remove 1 overlapping compounds between train_vs_test
3. Remove 1 overlapping compounds between train_vs_validation
Temporal Contamination Check:
Checking for temporal contamination (compounds after 2020-01-01)...
---------------------------------------------------------------------------
ArgumentError                             Traceback (most recent call last)
Cell In[15], line 271
    269 # Test temporal contamination detection
    270 print(f"\nTemporal Contamination Check:")
--> 271 temporal_results = detector.detect_temporal_contamination(example_datasets['train'])
    272 print(f"Suspicious compounds: {temporal_results['suspicious_count']}")
    274 # Test vendor bias detection

Cell In[15], line 122, in ContaminationDetector.detect_temporal_contamination(self, smiles_list, target_date)
    119 mol = Chem.MolFromSmiles(smiles)
    120 if mol:
    121     # Example heuristic: very complex molecules might be recent
--> 122     complexity_score = self._calculate_molecular_complexity(mol)
    123     if complexity_score > 20:  # Arbitrary threshold
    124         suspicious_compounds.append({
    125             'smiles': smiles,
    126             'complexity_score': complexity_score,
    127             'reason': 'high_complexity'
    128         })

Cell In[15], line 188, in ContaminationDetector._calculate_molecular_complexity(self, mol)
    186 complexity += rdMolDescriptors.CalcNumRings(mol) * 2
    187 complexity += rdMolDescriptors.CalcNumHeteroatoms(mol) * 1.5
--> 188 complexity += len(rdMolDescriptors.GetUSRScore(mol)) * 0.01 if hasattr(rdMolDescriptors, 'GetUSRScore') else 0
    190 return complexity

ArgumentError: Python argument types in
ArgumentError: Python argument types in
rdkit.Chem.rdMolDescriptors.GetUSRScore(Mol)
did not match C++ signature:
GetUSRScore(boost::python::api::object descriptor1, boost::python::api::object descriptor2, boost::python::api::object weights=[])
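The failure is an API mismatch in the complexity helper: as the error message shows, rdMolDescriptors.GetUSRScore compares two USR descriptor vectors, so passing a Mol doesn't match any signature (and computing a USR descriptor with rdMolDescriptors.GetUSR would itself require a 3D conformer). One possible fix, sketched here under that reading, is simply to drop the USR term from the score:

from rdkit.Chem import rdMolDescriptors

def calculate_molecular_complexity(mol) -> float:
    """Simple 2D complexity score; the mis-used USR term has been removed."""
    if not mol:
        return 0.0
    complexity = 0.0
    complexity += mol.GetNumAtoms() * 0.1
    complexity += mol.GetNumBonds() * 0.1
    complexity += rdMolDescriptors.CalcNumRings(mol) * 2
    complexity += rdMolDescriptors.CalcNumHeteroatoms(mol) * 1.5
    return complexity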