What happens if you ask an LLM to generate an RDKit blog post?
Published
July 31, 2025
What is this post?
On the bus to the CADD GRC, Nadine suggested that I try using Copilot to generate ideas for an RDKit blog post. It seemed like an interesting experiment and I’m heading out for a couple of weeks of vacation, so I thought I’d go ahead and give it a try. This post is the result.
What’s below is the result of a chat session with Copilot, using Claude Sonnet 4. The initial prompt was: "suggest a topic for a new RDKit blog post." I made a couple of small edits at the very end, but otherwise the text is exactly as generated by the LLM. I hope to come back to this at some point in the future and dig into the code and content in detail, but right now I need to go pack my gear to head to the mountains for a couple of weeks of climbing and mountaineering.
I’m going to go ahead and upload this and post it sometime around July 31st (or whenever I have decent internet access). There won’t be a new blog post next week, but I will be back the week of August 11th.
RDKit and Large Language Models: Chemical Structure-Text Integration in the AI Era
The landscape of artificial intelligence has been dramatically transformed by the rise of large language models (LLMs), and the field of chemistry is no exception. From GPT-4’s ability to reason about molecular structures to specialized chemistry-focused models like ChemCrow and Galactica, we’re witnessing an unprecedented convergence of natural language processing and chemical informatics. This presents both exciting opportunities and unique challenges for computational chemists and cheminformatics practitioners.
The RDKit, as one of the most widely-used open-source cheminformatics toolkits, finds itself at a fascinating intersection in this new landscape. While LLMs excel at processing and generating text-based representations of chemical knowledge, they often struggle with the precise, structure-based reasoning that is second nature to dedicated cheminformatics tools. Conversely, traditional cheminformatics approaches excel at molecular manipulation and property calculation but have limited ability to understand and generate natural language descriptions of chemical concepts.
This complementary relationship suggests powerful synergies. LLMs can help democratize access to chemical knowledge by translating between technical chemical representations and human-readable explanations. Meanwhile, tools like RDKit provide the essential chemical “reality check”: validating molecular structures, calculating properties, and ensuring that AI-generated chemistry actually makes sense from a chemical perspective.
In this post, we’ll explore practical approaches for integrating RDKit with modern LLM workflows. We’ll cover how to:
Use RDKit to validate and process chemical structures generated by LLMs
Convert RDKit molecular representations into rich text descriptions suitable for training or prompting LLMs
Build robust pipelines that combine the strengths of both approaches
Handle the unique challenges that arise when bridging symbolic chemical representations with statistical language models
The goal is not to replace either approach, but rather to show how they can work together to create more powerful, reliable, and accessible chemical AI systems. Whether you’re building a chemical chatbot, curating training data for a chemistry-focused LLM, or simply trying to make your chemical data more searchable and interpretable, the patterns we’ll explore should provide a solid foundation.
Let’s start by setting up our environment and exploring some basic integration patterns.
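A minimal setup cell for following along, assuming the RDKit modules the later examples rely on (the exact cell used to produce the output below isn't shown), might look like this:

import time
import rdkit
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors

# Report the toolkit version and a timestamp, as in the output that follows
print(f"RDKit version: {rdkit.__version__}")
print(f"Generated: {time.asctime()}")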
RDKit version: 2025.03.4
Generated: Sat Jul 26 09:53:54 2025
# Additional imports for LLM integration examples
import json
import re
import pandas as pd
from typing import List, Dict, Optional, Tuple
import warnings
warnings.filterwarnings('ignore')
Structure Validation: RDKit as the Chemical Reality Check
One of the most immediate and practical applications of RDKit in LLM workflows is structure validation. LLMs, while impressive at generating chemical-looking text, often produce invalid SMILES strings, impossible molecular structures, or chemically nonsensical compounds. RDKit provides robust validation capabilities that can catch these errors and provide meaningful feedback.
Let’s explore some common validation scenarios and how to handle them systematically.
def validate_smiles_with_details(smiles: str) -> Dict[str, any]:
    """
    Comprehensive SMILES validation with detailed feedback.
    Returns a dictionary with validation results and diagnostic information.
    """
    result = {
        'smiles': smiles,
        'is_valid': False,
        'mol': None,
        'canonical_smiles': None,
        'errors': [],
        'warnings': [],
        'properties': {}
    }

    # Basic parsing
    try:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            result['errors'].append("Invalid SMILES: Could not parse structure")
            return result

        result['mol'] = mol
        result['is_valid'] = True

        # Get canonical SMILES
        try:
            result['canonical_smiles'] = Chem.MolToSmiles(mol)
        except:
            result['warnings'].append("Could not generate canonical SMILES")

        # Basic chemical checks
        num_atoms = mol.GetNumAtoms()
        if num_atoms == 0:
            result['errors'].append("Empty molecule")
            result['is_valid'] = False
        elif num_atoms > 200:
            result['warnings'].append(f"Very large molecule ({num_atoms} atoms)")

        # Check for unusual valences
        try:
            Chem.SanitizeMol(mol)
        except Exception as e:
            result['errors'].append(f"Sanitization failed: {str(e)}")
            result['is_valid'] = False

        # Calculate basic properties if valid
        if result['is_valid']:
            try:
                result['properties'] = {
                    'num_atoms': num_atoms,
                    'num_bonds': mol.GetNumBonds(),
                    'molecular_weight': Descriptors.MolWt(mol),
                    'num_rings': rdMolDescriptors.CalcNumRings(mol),
                    'num_aromatic_rings': rdMolDescriptors.CalcNumAromaticRings(mol)
                }
            except Exception as e:
                result['warnings'].append(f"Property calculation failed: {str(e)}")

    except Exception as e:
        result['errors'].append(f"Parsing error: {str(e)}")

    return result


# Test the function with some examples
test_smiles = [
    "CCO",              # ethanol - valid
    "c1ccccc1",         # benzene - valid
    "C[C@H](N)C(=O)O",  # alanine - valid with stereochemistry
    "CCO[invalid]",     # invalid SMILES
    "C1CCC",            # invalid - unclosed ring
    "C(C)(C)(C)(C)C",   # carbon with too many bonds
    "",                 # empty string
]

print("SMILES Validation Examples:")
print("=" * 50)

for smiles in test_smiles:
    result = validate_smiles_with_details(smiles)
    print(f"\nSMILES: {smiles}")
    print(f"Valid: {result['is_valid']}")
    if result['canonical_smiles']:
        print(f"Canonical: {result['canonical_smiles']}")
    if result['errors']:
        print(f"Errors: {', '.join(result['errors'])}")
    if result['warnings']:
        print(f"Warnings: {', '.join(result['warnings'])}")
    if result['properties']:
        props = result['properties']
        print(f"Properties: {props['num_atoms']} atoms, MW={props['molecular_weight']:.1f}")
[09:53:54] SMILES Parse Error: syntax error while parsing: CCO[invalid]
[09:53:54] SMILES Parse Error: check for mistakes around position 5:
[09:53:54] CCO[invalid]
[09:53:54] ~~~~^
[09:53:54] SMILES Parse Error: Failed parsing SMILES 'CCO[invalid]' for input: 'CCO[invalid]'
[09:53:54] SMILES Parse Error: unclosed ring for input: 'C1CCC'
[09:53:54] Explicit valence for atom # 0 C, 5, is greater than permitted
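A quick aside: the timestamped parse errors above are emitted by RDKit's own logging stream rather than raised as Python exceptions, which is why MolFromSmiles can simply return None while the messages still show up in the cell output. When bulk-validating noisy LLM output you may want to silence that logger and turn it back on afterwards; for example:

from rdkit import RDLogger

# Silence RDKit's warning/error log while bulk-validating LLM output
RDLogger.DisableLog('rdApp.*')

# ...and re-enable the error channel when you want the diagnostics back
RDLogger.EnableLog('rdApp.error')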
Batch Validation for LLM Outputs
When working with LLMs that generate multiple chemical structures, you’ll often need to validate batches of SMILES strings. Here’s a more robust approach that can handle large datasets efficiently:
def batch_validate_structures(smiles_list: List[str], include_properties: bool = True) -> pd.DataFrame:
    """
    Validate a batch of SMILES strings and return results as a DataFrame.
    Useful for processing LLM outputs or curating chemical datasets.
    """
    results = []

    for i, smiles in enumerate(smiles_list):
        result = validate_smiles_with_details(smiles)

        # Flatten the result for DataFrame storage
        row = {
            'index': i,
            'input_smiles': smiles,
            'is_valid': result['is_valid'],
            'canonical_smiles': result['canonical_smiles'],
            'num_errors': len(result['errors']),
            'num_warnings': len(result['warnings']),
            'error_messages': '; '.join(result['errors']) if result['errors'] else None,
            'warning_messages': '; '.join(result['warnings']) if result['warnings'] else None
        }

        # Add properties if requested and available
        if include_properties and result['properties']:
            row.update(result['properties'])

        results.append(row)

    return pd.DataFrame(results)


# Simulate some LLM-generated SMILES (mix of valid and invalid)
llm_generated_smiles = [
    "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",  # ibuprofen-like
    "C1=CC=C(C=C1)C(=O)O",            # benzoic acid
    "invalidsmiles123",               # clearly invalid
    "C1=CC=CC=C1",                    # benzene
    "CCO",                            # ethanol
    "C[C@H](N)C(=O)O",                # L-alanine
    "C1CCC1C",                        # methylcyclobutane
    "c1ccccc1c2ccccc2",               # biphenyl
    "CCCCCCCCCCCCCCCCO",              # long-chain alcohol
    "C1=CC=C2C(=C1)C=CC=C2",          # naphthalene
]

print("Processing batch of LLM-generated SMILES...")
df_results = batch_validate_structures(llm_generated_smiles)

# Display summary statistics
print(f"\nBatch Validation Summary:")
print(f"Total structures: {len(df_results)}")
print(f"Valid structures: {df_results['is_valid'].sum()}")
print(f"Invalid structures: {(~df_results['is_valid']).sum()}")
print(f"Structures with warnings: {(df_results['num_warnings'] > 0).sum()}")

# Show the results table
print(f"\nDetailed Results:")
display_cols = ['input_smiles', 'is_valid', 'canonical_smiles', 'molecular_weight', 'num_atoms']
available_cols = [col for col in display_cols if col in df_results.columns]
print(df_results[available_cols].to_string(index=False))

# Show error details for invalid structures
invalid_structures = df_results[~df_results['is_valid']]
if len(invalid_structures) > 0:
    print(f"\nError Details for Invalid Structures:")
    for _, row in invalid_structures.iterrows():
        print(f"'{row['input_smiles']}': {row['error_messages']}")
[09:53:54] SMILES Parse Error: syntax error while parsing: invalidsmiles123
[09:53:54] SMILES Parse Error: check for mistakes around position 1:
[09:53:54] invalidsmiles123
[09:53:54] ^
[09:53:54] SMILES Parse Error: Failed parsing SMILES 'invalidsmiles123' for input: 'invalidsmiles123'
Chemical Plausibility Checks
Beyond basic SMILES validation, we can implement more sophisticated checks to assess whether LLM-generated structures are chemically reasonable. This is particularly important because LLMs might generate syntactically valid but chemically implausible structures:
def assess_chemical_plausibility(mol: Chem.Mol) -> Dict[str, any]:
    """
    Assess the chemical plausibility of a molecule beyond basic validation.
    Returns flags for potential issues that might indicate AI-generated artifacts.
    """
    if mol is None:
        return {'plausible': False, 'issues': ['Invalid molecule']}

    issues = []
    warnings = []

    # Molecular weight checks
    mw = Descriptors.MolWt(mol)
    if mw < 16:  # Lighter than methane
        issues.append(f"Extremely low molecular weight: {mw:.1f}")
    elif mw > 2000:  # Very large for small molecule
        warnings.append(f"Very high molecular weight: {mw:.1f}")

    # Atom count checks
    num_atoms = mol.GetNumAtoms()
    if num_atoms > 150:  # Unusually large for typical organic molecules
        warnings.append(f"Very large molecule: {num_atoms} atoms")

    # Check for unusual atom types in organic chemistry context
    unusual_atoms = []
    for atom in mol.GetAtoms():
        symbol = atom.GetSymbol()
        if symbol not in ['C', 'N', 'O', 'S', 'P', 'F', 'Cl', 'Br', 'I', 'H']:
            unusual_atoms.append(symbol)
    if unusual_atoms:
        warnings.append(f"Unusual atoms present: {set(unusual_atoms)}")

    # Check carbon-to-heteroatom ratio
    carbon_count = sum(1 for atom in mol.GetAtoms() if atom.GetSymbol() == 'C')
    heteroatom_count = sum(1 for atom in mol.GetAtoms() if atom.GetSymbol() not in ['C', 'H'])
    if heteroatom_count > 0:
        c_to_hetero_ratio = carbon_count / heteroatom_count
        if c_to_hetero_ratio < 0.1:  # Too many heteroatoms
            warnings.append(f"Unusual C:heteroatom ratio: {c_to_hetero_ratio:.2f}")

    # Check for overly complex ring systems
    ring_info = mol.GetRingInfo()
    num_rings = ring_info.NumRings()
    if num_rings > 10:  # Many fused rings might be suspicious
        warnings.append(f"Complex ring system: {num_rings} rings")

    # Check for extremely high or low LogP (rough estimate)
    try:
        logp = Descriptors.MolLogP(mol)
        if logp > 8:
            warnings.append(f"Very high LogP: {logp:.2f}")
        elif logp < -5:
            warnings.append(f"Very low LogP: {logp:.2f}")
    except:
        warnings.append("Could not calculate LogP")

    # Check for drug-likeness violations (Lipinski's Rule of Five)
    violations = []
    if mw > 500:
        violations.append("MW > 500")
    if Descriptors.MolLogP(mol) > 5:
        violations.append("LogP > 5")
    if Descriptors.NumHDonors(mol) > 5:
        violations.append("H-donors > 5")
    if Descriptors.NumHAcceptors(mol) > 10:
        violations.append("H-acceptors > 10")
    if len(violations) >= 2:  # Allow one violation
        warnings.append(f"Multiple Lipinski violations: {violations}")

    return {
        'plausible': len(issues) == 0,
        'issues': issues,
        'warnings': warnings,
        'properties': {
            'molecular_weight': mw,
            'num_atoms': num_atoms,
            'num_rings': num_rings,
            'logp': Descriptors.MolLogP(mol) if mol else None,
            'lipinski_violations': len(violations)
        }
    }


# Test with some examples
test_molecules = [
    ("CCO", "Simple alcohol"),
    ("CC(C)CC1=CC=C(C=C1)C(C)C(=O)O", "Ibuprofen"),
    ("C" * 50, "Very long alkyl chain"),  # Unusual but valid
    ("CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC(=O)O", "Long fatty acid"),
    ("c1ccc2c(c1)c3ccccc3c4ccccc24", "Triphenylene (PAH)"),
]

print("Chemical Plausibility Assessment:")
print("=" * 50)

for smiles, description in test_molecules:
    mol = Chem.MolFromSmiles(smiles)
    assessment = assess_chemical_plausibility(mol)

    print(f"\nMolecule: {description}")
    print(f"SMILES: {smiles}")
    print(f"Plausible: {assessment['plausible']}")

    if assessment['issues']:
        print(f"Issues: {'; '.join(assessment['issues'])}")
    if assessment['warnings']:
        print(f"Warnings: {'; '.join(assessment['warnings'])}")

    props = assessment['properties']
    print(f"Properties: MW={props['molecular_weight']:.1f}, "
          f"Rings={props['num_rings']}, "
          f"LogP={props['logp']:.2f}, "
          f"Lipinski violations={props['lipinski_violations']}")
Putting It Together: A Complete Validation Pipeline
Here’s how you might integrate these validation functions into a complete pipeline for processing LLM-generated chemical structures:
class LLMChemicalValidator:
    """
    A comprehensive validator for LLM-generated chemical structures.
    Combines SMILES validation, chemical plausibility checks, and filtering.
    """

    def __init__(self, strict_mode: bool = False):
        self.strict_mode = strict_mode
        self.validation_stats = {
            'total_processed': 0,
            'valid_smiles': 0,
            'chemically_plausible': 0,
            'passed_filters': 0
        }

    def validate_structure(self, smiles: str) -> Dict[str, any]:
        """Validate a single SMILES string with full analysis."""
        result = validate_smiles_with_details(smiles)

        if result['is_valid'] and result['mol']:
            # Add plausibility assessment
            plausibility = assess_chemical_plausibility(result['mol'])
            result['plausibility'] = plausibility
            result['chemically_plausible'] = plausibility['plausible']

            # Apply filters based on mode
            if self.strict_mode:
                result['passes_filters'] = (plausibility['plausible'] and
                                            len(plausibility['warnings']) == 0)
            else:
                result['passes_filters'] = plausibility['plausible']
        else:
            result['chemically_plausible'] = False
            result['passes_filters'] = False

        # Update stats
        self.validation_stats['total_processed'] += 1
        if result['is_valid']:
            self.validation_stats['valid_smiles'] += 1
        if result.get('chemically_plausible', False):
            self.validation_stats['chemically_plausible'] += 1
        if result.get('passes_filters', False):
            self.validation_stats['passed_filters'] += 1

        return result

    def validate_batch(self, smiles_list: List[str]) -> List[Dict[str, any]]:
        """Validate a batch of SMILES strings."""
        return [self.validate_structure(smiles) for smiles in smiles_list]

    def get_filtered_structures(self, smiles_list: List[str]) -> List[str]:
        """Return only the structures that pass all validation checks."""
        results = self.validate_batch(smiles_list)
        return [r['canonical_smiles'] for r in results
                if r['passes_filters'] and r['canonical_smiles']]

    def get_validation_report(self) -> str:
        """Generate a summary report of validation statistics."""
        stats = self.validation_stats
        total = stats['total_processed']
        if total == 0:
            return "No structures processed yet."

        report = f"""Validation Report:
==================
Total structures processed: {total}
Valid SMILES: {stats['valid_smiles']} ({stats['valid_smiles']/total*100:.1f}%)
Chemically plausible: {stats['chemically_plausible']} ({stats['chemically_plausible']/total*100:.1f}%)
Passed all filters: {stats['passed_filters']} ({stats['passed_filters']/total*100:.1f}%)
Success rate: {stats['passed_filters']/total*100:.1f}%"""
        return report.strip()


# Demonstration with simulated LLM output
print("LLM Chemical Validator Demo")
print("=" * 40)

# Simulate a mix of good and problematic LLM-generated structures
llm_output = [
    "CCO",                                  # ethanol - should pass
    "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",        # ibuprofen - should pass
    "c1ccccc1",                             # benzene - should pass
    "invalidsmiles",                        # invalid SMILES
    "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC",  # very long chain
    "C1=CC=C(C=C1)C(=O)O",                  # benzoic acid - should pass
    "C[C@H](N)C(=O)O",                      # alanine - should pass
    "C1CCC",                                # invalid - unclosed ring
    "CC(C)(C)(C)(C)C",                      # too many bonds
    "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC",  # extremely long
]

# Test with normal mode
print("Testing with Normal Mode:")
validator_normal = LLMChemicalValidator(strict_mode=False)
filtered_normal = validator_normal.get_filtered_structures(llm_output)

print(f"Input structures: {len(llm_output)}")
print(f"Structures passing validation: {len(filtered_normal)}")
print("\nPassed structures:")
for i, smiles in enumerate(filtered_normal, 1):
    print(f"{i}. {smiles}")

print(validator_normal.get_validation_report())

# Test with strict mode
print("\n" + "=" * 40)
print("Testing with Strict Mode:")
validator_strict = LLMChemicalValidator(strict_mode=True)
filtered_strict = validator_strict.get_filtered_structures(llm_output)

print(f"Structures passing strict validation: {len(filtered_strict)}")
print(validator_strict.get_validation_report())
[09:53:54] SMILES Parse Error: syntax error while parsing: invalidsmiles
[09:53:54] SMILES Parse Error: check for mistakes around position 1:
[09:53:54] invalidsmiles
[09:53:54] ^
[09:53:54] SMILES Parse Error: Failed parsing SMILES 'invalidsmiles' for input: 'invalidsmiles'
[09:53:54] SMILES Parse Error: unclosed ring for input: 'C1CCC'
[09:53:54] Explicit valence for atom # 1 C, 6, is greater than permitted
[09:53:54] SMILES Parse Error: syntax error while parsing: invalidsmiles
[09:53:54] SMILES Parse Error: check for mistakes around position 1:
[09:53:54] invalidsmiles
[09:53:54] ^
[09:53:54] SMILES Parse Error: Failed parsing SMILES 'invalidsmiles' for input: 'invalidsmiles'
[09:53:54] SMILES Parse Error: unclosed ring for input: 'C1CCC'
[09:53:54] Explicit valence for atom # 1 C, 6, is greater than permitted
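The generated post never shows the LLM side of this pipeline. As a rough sketch of how the validator might sit in front of a model, assuming an OpenAI-style chat client, a placeholder model name, and illustrative prompt wording (none of which appear in the original), a generate-validate-retry loop could look like this:

from openai import OpenAI  # assumes the openai package; any chat-completion client would do

client = OpenAI()  # hypothetical setup; reads OPENAI_API_KEY from the environment

def generate_validated_smiles(request: str, max_attempts: int = 3) -> Optional[str]:
    """Ask the model for a SMILES string and retry until RDKit accepts it (or we give up)."""
    validator = LLMChemicalValidator(strict_mode=False)
    prompt = request
    for _ in range(max_attempts):
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        candidate = response.choices[0].message.content.strip()
        result = validator.validate_structure(candidate)
        if result.get('passes_filters'):
            return result['canonical_smiles']
        # Feed the validation errors back to the model and try again
        issues = '; '.join(result['errors']) or 'failed plausibility checks'
        prompt = (f"{request}\nYour previous answer '{candidate}' was rejected ({issues}). "
                  "Reply with a single corrected SMILES string only.")
    return None

# Example call (untested sketch):
# smiles = generate_validated_smiles("Suggest a SMILES for a para-substituted benzamide. Reply with SMILES only.")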
From Molecules to Text: Generating Rich Descriptions for LLM Training
While validation is crucial for processing LLM outputs, the reverse direction, converting molecular structures into natural language descriptions, is equally important for training chemical AI systems and creating meaningful prompts. RDKit’s rich chemical analysis capabilities make it possible to generate detailed, chemically informed text descriptions of molecules.
This approach is valuable for:
Creating training data for chemistry-focused LLMs
Building chemical chatbots with rich molecular knowledge
Making chemical databases more searchable with natural language
Generating explanatory text for chemical education applications
def extract_molecular_features(mol: Chem.Mol) -> Dict[str, any]:
    """
    Extract comprehensive molecular features suitable for text generation.
    This creates a rich feature dictionary that can be converted to natural language.
    """
    if mol is None:
        return {}

    try:
        # Basic molecular properties
        features = {
            'smiles': Chem.MolToSmiles(mol),
            'molecular_formula': rdMolDescriptors.CalcMolFormula(mol),
            'molecular_weight': round(Descriptors.MolWt(mol), 2),
            'num_atoms': mol.GetNumAtoms(),
            'num_heavy_atoms': mol.GetNumHeavyAtoms(),
            'num_bonds': mol.GetNumBonds(),
        }

        # Ring and aromaticity information
        features.update({
            'num_rings': rdMolDescriptors.CalcNumRings(mol),
            'num_aromatic_rings': rdMolDescriptors.CalcNumAromaticRings(mol),
            'num_saturated_rings': rdMolDescriptors.CalcNumSaturatedRings(mol),
            'num_aliphatic_rings': rdMolDescriptors.CalcNumAliphaticRings(mol),
        })

        # Physicochemical properties
        features.update({
            'logp': round(Descriptors.MolLogP(mol), 2),
            'tpsa': round(Descriptors.TPSA(mol), 2),
            'h_bond_donors': Descriptors.NumHDonors(mol),
            'h_bond_acceptors': Descriptors.NumHAcceptors(mol),
            'rotatable_bonds': Descriptors.NumRotatableBonds(mol),
        })

        # Atom type counts
        atom_counts = {}
        for atom in mol.GetAtoms():
            symbol = atom.GetSymbol()
            atom_counts[symbol] = atom_counts.get(symbol, 0) + 1
        features['atom_composition'] = atom_counts

        # Functional group analysis (simplified)
        functional_groups = []

        # Check for common functional groups using SMARTS
        fg_patterns = {
            'carboxylic_acid': '[CX3](=O)[OX2H1]',
            'ester': '[#6][CX3](=O)[OX2H0][#6]',
            'amide': '[CX3](=[OX1])[NX3H2]',
            'alcohol': '[OX2H]',
            'amine': '[NX3;H2,H1;!$(NC=O)]',
            'ketone': '[#6][CX3](=O)[#6]',
            'aldehyde': '[CX3H1](=O)[#6]',
            'ether': '[OD2]([#6])[#6]',
            'phenol': '[OX2H][cX3]:[c]',
            'nitro': '[NX3+](=O)[O-]',
            'aromatic_ring': 'c1ccccc1',
        }

        for fg_name, smarts in fg_patterns.items():
            pattern = Chem.MolFromSmarts(smarts)
            if pattern and mol.HasSubstructMatch(pattern):
                matches = mol.GetSubstructMatches(pattern)
                if matches:
                    functional_groups.append(f"{fg_name}({len(matches)})")

        features['functional_groups'] = functional_groups

        # Drug-likeness assessment
        lipinski_violations = []
        if features['molecular_weight'] > 500:
            lipinski_violations.append('MW > 500')
        if features['logp'] > 5:
            lipinski_violations.append('LogP > 5')
        if features['h_bond_donors'] > 5:
            lipinski_violations.append('H-donors > 5')
        if features['h_bond_acceptors'] > 10:
            lipinski_violations.append('H-acceptors > 10')

        features['lipinski_violations'] = lipinski_violations
        features['drug_like'] = len(lipinski_violations) <= 1

        return features

    except Exception as e:
        return {'error': f"Feature extraction failed: {str(e)}"}


# Test the function with some example molecules
test_molecules = [
    ("CCO", "Ethanol"),
    ("CC(C)CC1=CC=C(C=C1)C(C)C(=O)O", "Ibuprofen"),
    ("C1=CC=C(C=C1)C(=O)O", "Benzoic acid"),
    ("CC(=O)OC1=CC=CC=C1C(=O)O", "Aspirin"),
    ("CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "Caffeine"),
]

print("Molecular Feature Extraction Examples:")
print("=" * 50)

for smiles, name in test_molecules:
    mol = Chem.MolFromSmiles(smiles)
    features = extract_molecular_features(mol)

    print(f"\n{name} ({smiles}):")
    print(f"  Formula: {features.get('molecular_formula', 'N/A')}")
    print(f"  MW: {features.get('molecular_weight', 'N/A')} Da")
    print(f"  LogP: {features.get('logp', 'N/A')}")
    print(f"  TPSA: {features.get('tpsa', 'N/A')} Ų")
    print(f"  Rings: {features.get('num_rings', 0)} (aromatic: {features.get('num_aromatic_rings', 0)})")
    print(f"  H-bond donors/acceptors: {features.get('h_bond_donors', 0)}/{features.get('h_bond_acceptors', 0)}")
    print(f"  Functional groups: {', '.join(features.get('functional_groups', [])) or 'None detected'}")
    print(f"  Drug-like: {features.get('drug_like', False)}")
    if features.get('lipinski_violations'):
        print(f"  Lipinski violations: {', '.join(features['lipinski_violations'])}")
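Because these feature dictionaries contain only plain Python types, they serialize directly to JSON, which is a convenient way to hand an LLM grounded context rather than asking it to compute properties itself. A small sketch (the prompt wording here is purely illustrative):

mol = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")  # aspirin
features = extract_molecular_features(mol)

# json.dumps gives the model clean, machine-readable context to work from
prompt_context = json.dumps(features, indent=2)
prompt = ("Using only the computed properties below, write a two-sentence summary "
          "of this molecule:\n" + prompt_context)
print(prompt[:300])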
Now let’s create functions that transform these molecular features into natural language descriptions suitable for LLM training:
def generate_molecular_description(mol: Chem.Mol, style: str='detailed') ->str:""" Generate natural language descriptions of molecules in different styles. Args: mol: RDKit molecule object style: 'detailed', 'concise', 'technical', or 'educational' Returns: Natural language description of the molecule """if mol isNone:return"Invalid molecule structure." features = extract_molecular_features(mol)if'error'in features:returnf"Could not analyze molecule: {features['error']}"if style =='concise':return _generate_concise_description(features)elif style =='technical':return _generate_technical_description(features)elif style =='educational':return _generate_educational_description(features)else: # detailedreturn _generate_detailed_description(features)def _generate_concise_description(features: Dict) ->str:"""Generate a brief, factual description.""" desc =f"This is {features['molecular_formula']} with molecular weight {features['molecular_weight']} Da."if features['functional_groups']: fg_text =', '.join([fg.split('(')[0].replace('_', ' ') for fg in features['functional_groups']]) desc +=f" Contains {fg_text} functional groups."return descdef _generate_technical_description(features: Dict) ->str:"""Generate a technical description with precise chemical language.""" parts = []# Basic structure parts.append(f"Molecular formula {features['molecular_formula']} (MW: {features['molecular_weight']} Da)")# Ring systemsif features['num_rings'] >0: ring_desc =f"{features['num_rings']} ring(s)"if features['num_aromatic_rings'] >0: ring_desc +=f" including {features['num_aromatic_rings']} aromatic ring(s)" parts.append(ring_desc)# Physicochemical properties parts.append(f"LogP: {features['logp']}, TPSA: {features['tpsa']} Ų")# Hydrogen bondingif features['h_bond_donors'] >0or features['h_bond_acceptors'] >0: parts.append(f"H-bond donors: {features['h_bond_donors']}, acceptors: {features['h_bond_acceptors']}")# Drug-likenessif features['drug_like']: parts.append("Complies with Lipinski's Rule of Five")elif features['lipinski_violations']: parts.append(f"Violates Lipinski's Rule: {', '.join(features['lipinski_violations'])}")return". 
".join(parts) +"."def _generate_detailed_description(features: Dict) ->str:"""Generate a comprehensive description suitable for training data.""" parts = []# Introduction parts.append(f"This compound has the molecular formula {features['molecular_formula']} and a molecular weight of {features['molecular_weight']} daltons.")# Structural features structure_parts = []if features['num_heavy_atoms'] >0: structure_parts.append(f"{features['num_heavy_atoms']} heavy atoms")if features['num_rings'] >0:if features['num_aromatic_rings'] >0and features['num_saturated_rings'] >0: structure_parts.append(f"{features['num_rings']} rings ({features['num_aromatic_rings']} aromatic, {features['num_saturated_rings']} saturated)")elif features['num_aromatic_rings'] >0: structure_parts.append(f"{features['num_aromatic_rings']} aromatic ring(s)")else: structure_parts.append(f"{features['num_rings']} saturated ring(s)")if structure_parts: parts.append(f"The structure contains {', '.join(structure_parts)}.")# Atom composition atoms = features['atom_composition']iflen(atoms) >1: atom_desc = []for atom, count insorted(atoms.items()):if atom !='H': # Skip hydrogen for brevityif count ==1: atom_desc.append(f"one {atom}")else: atom_desc.append(f"{count}{atom} atoms")if atom_desc: parts.append(f"It is composed of {', '.join(atom_desc)}.")# Functional groupsif features['functional_groups']: fg_names = [fg.split('(')[0].replace('_', ' ') for fg in features['functional_groups']]iflen(fg_names) ==1: parts.append(f"The molecule contains a {fg_names[0]} functional group.")else: parts.append(f"Functional groups present include {', '.join(fg_names[:-1])} and {fg_names[-1]}.")# Physicochemical properties prop_desc = []if features['logp'] isnotNone:if features['logp'] >3: prop_desc.append("lipophilic character")elif features['logp'] <0: prop_desc.append("hydrophilic nature")else: prop_desc.append("moderate lipophilicity")if features['tpsa'] >90: prop_desc.append("high polar surface area")elif features['tpsa'] <30: prop_desc.append("low polar surface area")if prop_desc: parts.append(f"The compound exhibits {' and '.join(prop_desc)}.")# Drug-likeness assessmentif features['drug_like']: parts.append("This molecule satisfies Lipinski's Rule of Five, suggesting good oral bioavailability potential.")elif features['lipinski_violations']: parts.append(f"The compound violates Lipinski's Rule of Five due to {', '.join(features['lipinski_violations'])}, which may affect its drug-like properties.")return" ".join(parts)def _generate_educational_description(features: Dict) ->str:"""Generate an educational description explaining chemical concepts.""" parts = []# Start with basics parts.append(f"This molecule, with formula {features['molecular_formula']}, is an organic compound weighing {features['molecular_weight']} daltons.")# Explain ring systems in educational termsif features['num_aromatic_rings'] >0: parts.append(f"It contains {features['num_aromatic_rings']} aromatic ring(s), which are stable ring structures with delocalized electrons that give the molecule special stability and chemical properties.")elif features['num_rings'] >0: parts.append(f"The molecule has {features['num_rings']} ring structure(s), which can affect its shape and biological activity.")# Explain functional groupsif features['functional_groups']: parts.append("Functional groups are specific arrangements of atoms that give molecules their chemical reactivity.") fg_explanations = {'carboxylic_acid': "carboxylic acid groups (which can donate protons and are often 
acidic)",'alcohol': "alcohol groups (which can form hydrogen bonds)",'amine': "amine groups (which are basic and can accept protons)",'ketone': "ketone groups (which are reactive carbonyl groups)",'ester': "ester groups (often found in fats and can be hydrolyzed)",'ether': "ether groups (which are generally unreactive)" } explained_groups = []for fg in features['functional_groups']: fg_name = fg.split('(')[0]if fg_name in fg_explanations: explained_groups.append(fg_explanations[fg_name])if explained_groups: parts.append(f"This molecule contains {', '.join(explained_groups)}.")# Explain drug-likeness in simple termsif features['drug_like']: parts.append("The molecule's size and properties suggest it could potentially be developed as an oral medication.")elif features['lipinski_violations']: parts.append("The molecule is quite large or has properties that might make it challenging to develop as an oral drug.")return" ".join(parts)# Test the different description stylestest_molecules = [ ("CC(C)CC1=CC=C(C=C1)C(C)C(=O)O", "Ibuprofen"), ("CC(=O)OC1=CC=CC=C1C(=O)O", "Aspirin"), ("CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "Caffeine"),]styles = ['concise', 'technical', 'detailed', 'educational']print("Natural Language Description Examples:")print("="*60)for smiles, name in test_molecules: mol = Chem.MolFromSmiles(smiles)print(f"\n{name} ({smiles}):")print("-"* (len(name) +len(smiles) +4))for style in styles: description = generate_molecular_description(mol, style)print(f"\n{style.title()} style:")print(description)
Natural Language Description Examples:
============================================================
Ibuprofen (CC(C)CC1=CC=C(C=C1)C(C)C(=O)O):
------------------------------------------
Concise style:
This is C13H18O2 with molecular weight 206.28 Da. Contains carboxylic acid, alcohol, aromatic ring functional groups.
Technical style:
Molecular formula C13H18O2 (MW: 206.28 Da). 1 ring(s) including 1 aromatic ring(s). LogP: 3.07, TPSA: 37.3 Ų. H-bond donors: 1, acceptors: 1. Complies with Lipinski's Rule of Five.
Detailed style:
This compound has the molecular formula C13H18O2 and a molecular weight of 206.28 daltons. The structure contains 15 heavy atoms, 1 aromatic ring(s). It is composed of 13 C atoms, 2 O atoms. Functional groups present include carboxylic acid, alcohol and aromatic ring. The compound exhibits lipophilic character. This molecule satisfies Lipinski's Rule of Five, suggesting good oral bioavailability potential.
Educational style:
This molecule, with formula C13H18O2, is an organic compound weighing 206.28 daltons. It contains 1 aromatic ring(s), which are stable ring structures with delocalized electrons that give the molecule special stability and chemical properties. Functional groups are specific arrangements of atoms that give molecules their chemical reactivity. This molecule contains carboxylic acid groups (which can donate protons and are often acidic), alcohol groups (which can form hydrogen bonds). The molecule's size and properties suggest it could potentially be developed as an oral medication.
Aspirin (CC(=O)OC1=CC=CC=C1C(=O)O):
-----------------------------------
Concise style:
This is C9H8O4 with molecular weight 180.16 Da. Contains carboxylic acid, ester, alcohol, ether, aromatic ring functional groups.
Technical style:
Molecular formula C9H8O4 (MW: 180.16 Da). 1 ring(s) including 1 aromatic ring(s). LogP: 1.31, TPSA: 63.6 Ų. H-bond donors: 1, acceptors: 3. Complies with Lipinski's Rule of Five.
Detailed style:
This compound has the molecular formula C9H8O4 and a molecular weight of 180.16 daltons. The structure contains 13 heavy atoms, 1 aromatic ring(s). It is composed of 9 C atoms, 4 O atoms. Functional groups present include carboxylic acid, ester, alcohol, ether and aromatic ring. The compound exhibits moderate lipophilicity. This molecule satisfies Lipinski's Rule of Five, suggesting good oral bioavailability potential.
Educational style:
This molecule, with formula C9H8O4, is an organic compound weighing 180.16 daltons. It contains 1 aromatic ring(s), which are stable ring structures with delocalized electrons that give the molecule special stability and chemical properties. Functional groups are specific arrangements of atoms that give molecules their chemical reactivity. This molecule contains carboxylic acid groups (which can donate protons and are often acidic), ester groups (often found in fats and can be hydrolyzed), alcohol groups (which can form hydrogen bonds), ether groups (which are generally unreactive). The molecule's size and properties suggest it could potentially be developed as an oral medication.
Caffeine (CN1C=NC2=C1C(=O)N(C(=O)N2C)C):
----------------------------------------
Concise style:
This is C8H10N4O2 with molecular weight 194.19 Da.
Technical style:
Molecular formula C8H10N4O2 (MW: 194.19 Da). 2 ring(s) including 2 aromatic ring(s). LogP: -1.03, TPSA: 61.82 Ų. H-bond donors: 0, acceptors: 6. Complies with Lipinski's Rule of Five.
Detailed style:
This compound has the molecular formula C8H10N4O2 and a molecular weight of 194.19 daltons. The structure contains 14 heavy atoms, 2 aromatic ring(s). It is composed of 8 C atoms, 4 N atoms, 2 O atoms. The compound exhibits hydrophilic nature. This molecule satisfies Lipinski's Rule of Five, suggesting good oral bioavailability potential.
Educational style:
This molecule, with formula C8H10N4O2, is an organic compound weighing 194.19 daltons. It contains 2 aromatic ring(s), which are stable ring structures with delocalized electrons that give the molecule special stability and chemical properties. The molecule's size and properties suggest it could potentially be developed as an oral medication.
Creating Structured Training Data
For training LLMs, you often need structured datasets that pair chemical structures with their descriptions. Here’s how to create such datasets systematically:
def create_molecule_text_dataset(smiles_list: List[str], names: Optional[List[str]] =None, include_multiple_styles: bool=True) -> List[Dict[str, str]]:""" Create a structured dataset pairing molecules with text descriptions. Suitable for fine-tuning LLMs on chemical tasks. Args: smiles_list: List of SMILES strings names: Optional list of molecule names include_multiple_styles: Whether to generate multiple description styles Returns: List of dictionaries with 'structure', 'description', and metadata """ dataset = []for i, smiles inenumerate(smiles_list): mol = Chem.MolFromSmiles(smiles)if mol isNone:continue# Skip invalid molecules name = names[i] if names and i <len(names) elsef"Compound_{i+1}" canonical_smiles = Chem.MolToSmiles(mol)if include_multiple_styles: styles = ['concise', 'technical', 'detailed', 'educational']else: styles = ['detailed']for style in styles: description = generate_molecular_description(mol, style)# Create training example in various formats# Format 1: Direct description dataset.append({'task_type': 'molecule_description','style': style,'input_smiles': smiles,'canonical_smiles': canonical_smiles,'molecule_name': name,'description': description,'prompt': f"Describe the chemical structure {canonical_smiles}:",'response': description })# Format 2: Question-answer format dataset.append({'task_type': 'chemical_qa','style': style,'input_smiles': smiles,'canonical_smiles': canonical_smiles,'molecule_name': name,'description': description,'prompt': f"What can you tell me about the molecule with SMILES {canonical_smiles}?",'response': description })# Format 3: Named molecule descriptionif name !=f"Compound_{i+1}": dataset.append({'task_type': 'named_molecule_description','style': style,'input_smiles': smiles,'canonical_smiles': canonical_smiles,'molecule_name': name,'description': description,'prompt': f"Describe the chemical structure and properties of {name}:",'response': f"{name} ({canonical_smiles}) is a compound where {description.lower()}" })return datasetdef create_property_prediction_dataset(smiles_list: List[str], target_properties: List[str] =None) -> List[Dict[str, str]]:""" Create a dataset for training property prediction with natural language explanations. 
"""if target_properties isNone: target_properties = ['molecular_weight', 'logp', 'h_bond_donors', 'h_bond_acceptors', 'drug_like'] dataset = []for smiles in smiles_list: mol = Chem.MolFromSmiles(smiles)if mol isNone:continue features = extract_molecular_features(mol) canonical_smiles = Chem.MolToSmiles(mol)for prop in target_properties:if prop in features: value = features[prop]# Create natural language explanations for propertiesif prop =='molecular_weight': explanation =f"The molecular weight is {value} daltons, which "if value <200: explanation +="indicates a relatively small molecule."elif value >500: explanation +="suggests a large molecule that may have bioavailability issues."else: explanation +="is in a reasonable range for drug-like compounds."elif prop =='logp': explanation =f"The LogP value is {value}, indicating "if value >3: explanation +="high lipophilicity and potential membrane permeability."elif value <0: explanation +="hydrophilic character and good water solubility."else: explanation +="balanced lipophilicity suitable for oral drugs."elif prop =='h_bond_donors': explanation =f"This molecule has {value} hydrogen bond donor(s), which "if value ==0: explanation +="means it cannot donate hydrogen bonds."elif value <=5: explanation +="is within the acceptable range for drug-like molecules."else: explanation +="may limit its ability to cross biological membranes."elif prop =='drug_like': explanation =f"This molecule {'is'if value else'is not'} drug-like according to Lipinski's Rule of Five."else: explanation =f"The {prop.replace('_', ' ')} value is {value}." dataset.append({'task_type': 'property_prediction','property': prop,'input_smiles': smiles,'canonical_smiles': canonical_smiles,'property_value': value,'prompt': f"What is the {prop.replace('_', ' ')} of {canonical_smiles}?",'response': f"{value}",'explanation': explanation,'detailed_response': f"The {prop.replace('_', ' ')} of {canonical_smiles} is {value}. {explanation}" })return dataset# Example: Create training datasetsexample_molecules = [ ("CCO", "Ethanol"), ("CC(C)CC1=CC=C(C=C1)C(C)C(=O)O", "Ibuprofen"), ("CC(=O)OC1=CC=CC=C1C(=O)O", "Aspirin"), ("CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "Caffeine"), ("CC1=CC=C(C=C1)C(C)C(=O)O", "2-(4-methylphenyl)propanoic acid"),]smiles_only = [mol[0] for mol in example_molecules]names_only = [mol[1] for mol in example_molecules]print("Creating Molecule-Text Dataset...")mol_text_data = create_molecule_text_dataset(smiles_only, names_only, include_multiple_styles=False)print(f"Generated {len(mol_text_data)} training examples")print("\nExample training entries:")print("="*50)# Show a few examplesfor i, entry inenumerate(mol_text_data[:3]):print(f"\nExample {i+1}:")print(f"Task: {entry['task_type']}")print(f"Molecule: {entry['molecule_name']}")print(f"Prompt: {entry['prompt']}")print(f"Response: {entry['response'][:100]}{'...'iflen(entry['response']) >100else''}")print("\n"+"="*50)print("Creating Property Prediction Dataset...")prop_data = create_property_prediction_dataset(smiles_only[:3])print(f"Generated {len(prop_data)} property prediction examples")print("\nExample property prediction entries:")for i, entry inenumerate(prop_data[:3]):print(f"\nExample {i+1}:")print(f"Property: {entry['property']}")print(f"Molecule: {entry['canonical_smiles']}")print(f"Prompt: {entry['prompt']}")print(f"Response: {entry['detailed_response']}")
Creating Molecule-Text Dataset...
Generated 15 training examples
Example training entries:
==================================================
Example 1:
Task: molecule_description
Molecule: Ethanol
Prompt: Describe the chemical structure CCO:
Response: This compound has the molecular formula C2H6O and a molecular weight of 46.07 daltons. The structure...
Example 2:
Task: chemical_qa
Molecule: Ethanol
Prompt: What can you tell me about the molecule with SMILES CCO?
Response: This compound has the molecular formula C2H6O and a molecular weight of 46.07 daltons. The structure...
Example 3:
Task: named_molecule_description
Molecule: Ethanol
Prompt: Describe the chemical structure and properties of Ethanol:
Response: Ethanol (CCO) is a compound where this compound has the molecular formula c2h6o and a molecular weig...
==================================================
Creating Property Prediction Dataset...
Generated 15 property prediction examples
Example property prediction entries:
Example 1:
Property: molecular_weight
Molecule: CCO
Prompt: What is the molecular weight of CCO?
Response: The molecular weight of CCO is 46.07. The molecular weight is 46.07 daltons, which indicates a relatively small molecule.
Example 2:
Property: logp
Molecule: CCO
Prompt: What is the logp of CCO?
Response: The logp of CCO is -0.0. The LogP value is -0.0, indicating balanced lipophilicity suitable for oral drugs.
Example 3:
Property: h_bond_donors
Molecule: CCO
Prompt: What is the h bond donors of CCO?
Response: The h bond donors of CCO is 1. This molecule has 1 hydrogen bond donor(s), which is within the acceptable range for drug-like molecules.
Exporting Data for Different LLM Platforms
Different LLM training platforms expect different data formats. Here are functions to export your chemical training data in common formats:
def export_for_huggingface(dataset: List[Dict], output_file: str="chemical_training_data.jsonl"):""" Export dataset in HuggingFace datasets format (JSONL). Each line is a JSON object with 'text' field for language modeling or 'input'/'output' fields for instruction tuning. """withopen(output_file, 'w') as f:for entry in dataset:# Format for instruction tuning hf_entry = {'instruction': entry['prompt'],'input': entry.get('canonical_smiles', ''),'output': entry['response'],'metadata': {'task_type': entry.get('task_type', ''),'molecule_name': entry.get('molecule_name', ''),'style': entry.get('style', '') } } f.write(json.dumps(hf_entry) +'\n')print(f"Exported {len(dataset)} entries to {output_file} (HuggingFace format)")def export_for_openai(dataset: List[Dict], output_file: str="chemical_training_data_openai.jsonl"):""" Export dataset in OpenAI fine-tuning format. Each line has 'messages' with system/user/assistant format. """withopen(output_file, 'w') as f:for entry in dataset: openai_entry = {'messages': [ {'role': 'system','content': 'You are a helpful chemistry assistant that can analyze and describe chemical structures.' }, {'role': 'user', 'content': entry['prompt'] }, {'role': 'assistant','content': entry['response'] } ] } f.write(json.dumps(openai_entry) +'\n')print(f"Exported {len(dataset)} entries to {output_file} (OpenAI format)")def export_for_llama(dataset: List[Dict], output_file: str="chemical_training_data_llama.json"):""" Export dataset in LLaMA/Alpaca instruction format. """ llama_data = []for entry in dataset: llama_entry = {'instruction': entry['prompt'],'input': entry.get('canonical_smiles', ''),'output': entry['response'] } llama_data.append(llama_entry)withopen(output_file, 'w') as f: json.dump(llama_data, f, indent=2)print(f"Exported {len(dataset)} entries to {output_file} (LLaMA/Alpaca format)")def create_conversational_dataset(smiles_list: List[str], names: List[str] =None) -> List[Dict]:""" Create a conversational dataset simulating a chemistry chatbot interaction. """ conversations = []for i, smiles inenumerate(smiles_list): mol = Chem.MolFromSmiles(smiles)if mol isNone:continue name = names[i] if names and i <len(names) elsef"Compound_{i+1}" features = extract_molecular_features(mol)# Create a multi-turn conversation conversation = {'conversation_id': f"chem_chat_{i+1}",'molecule_smiles': smiles,'molecule_name': name,'turns': [] }# Turn 1: Initial structure inquiry conversation['turns'].append({'user': f"Can you analyze the structure {smiles}?",'assistant': generate_molecular_description(mol, 'detailed') })# Turn 2: Property questionif features.get('molecular_weight'): conversation['turns'].append({'user': "What's its molecular weight?",'assistant': f"The molecular weight is {features['molecular_weight']} daltons." })# Turn 3: Drug-likeness question conversation['turns'].append({'user': "Would this be a good drug candidate?",'assistant': f"{'Yes'if features.get('drug_like') else'It may face challenges'}, this molecule {'complies with'if features.get('drug_like') else'violates'} Lipinski's Rule of Five. {generate_molecular_description(mol, 'educational').split('.')[-2]}." 
}) conversations.append(conversation)return conversations# Demonstrate the export functionsprint("Exporting Training Data in Different Formats:")print("="*50)# Use the previously created datasetsample_dataset = mol_text_data[:5] # Use a small sample for demonstration# Export in different formatsexport_for_huggingface(sample_dataset, "demo_huggingface.jsonl")export_for_openai(sample_dataset, "demo_openai.jsonl") export_for_llama(sample_dataset, "demo_llama.json")# Create and show conversational datasetprint("\nCreating Conversational Dataset:")conv_data = create_conversational_dataset(smiles_only[:2], names_only[:2])print(f"Generated {len(conv_data)} conversations")print("\nExample conversation:")print("="*30)for i, turn inenumerate(conv_data[0]['turns']):print(f"Turn {i+1}:")print(f"User: {turn['user']}")print(f"Assistant: {turn['assistant'][:100]}{'...'iflen(turn['assistant']) >100else''}")print()# Show file contents (first few lines)print("Sample exported data:")print("-"*30)try:withopen("demo_huggingface.jsonl", 'r') as f: lines = f.readlines()[:2]for i, line inenumerate(lines):print(f"Entry {i+1}: {line[:100]}...")exceptFileNotFoundError:print("Demo files not created (running in demo mode)")
Exporting Training Data in Different Formats:
==================================================
Exported 5 entries to demo_huggingface.jsonl (HuggingFace format)
Exported 5 entries to demo_openai.jsonl (OpenAI format)
Exported 5 entries to demo_llama.json (LLaMA/Alpaca format)
Creating Conversational Dataset:
Generated 2 conversations
Example conversation:
==============================
Turn 1:
User: Can you analyze the structure CCO?
Assistant: This compound has the molecular formula C2H6O and a molecular weight of 46.07 daltons. The structure...
Turn 2:
User: What's its molecular weight?
Assistant: The molecular weight is 46.07 daltons.
Turn 3:
User: Would this be a good drug candidate?
Assistant: Yes, this molecule complies with Lipinski's Rule of Five. The molecule's size and properties sugges...
Sample exported data:
------------------------------
Entry 1: {"instruction": "Describe the chemical structure CCO:", "input": "CCO", "output": "This compound has...
Entry 2: {"instruction": "What can you tell me about the molecule with SMILES CCO?", "input": "CCO", "output"...
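If you go the HuggingFace route, the JSONL file written above can be loaded straight back with the datasets library (an extra dependency that isn't used elsewhere in this post); something along these lines:

# pip install datasets
from datasets import load_dataset

train_data = load_dataset("json", data_files="demo_huggingface.jsonl", split="train")
print(train_data)
print(train_data[0]["instruction"])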
Real-World Challenges and Solutions
While the integration of RDKit and LLMs offers exciting possibilities, real-world implementations face several challenges. This section addresses some common issues and outlines practical ways to handle them.
Challenge 1: Performance and Scalability
When processing large datasets or building real-time applications, performance becomes critical. Let’s explore strategies for optimization:
import multiprocessing as mpfrom concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutorimport timefrom functools import lru_cacheclass OptimizedChemicalProcessor:""" An optimized processor for handling large-scale chemical data processing with RDKit and LLM integration. """def__init__(self, max_workers: int=None, use_caching: bool=True):self.max_workers = max_workers or mp.cpu_count()self.use_caching = use_caching# Pre-compile common SMARTS patterns for efficiencyself.functional_group_patterns = { name: Chem.MolFromSmarts(smarts) for name, smarts in {'carboxylic_acid': '[CX3](=O)[OX2H1]','ester': '[#6][CX3](=O)[OX2H0][#6]','alcohol': '[OX2H]','amine': '[NX3;H2,H1;!$(NC=O)]','ketone': '[#6][CX3](=O)[#6]','aromatic_ring': 'c1ccccc1', }.items() }@lru_cache(maxsize=10000)def cached_mol_from_smiles(self, smiles: str):"""Cached molecule parsing to avoid repeated work."""return Chem.MolFromSmiles(smiles)def process_smiles_batch_parallel(self, smiles_list: List[str]) -> List[Dict]:""" Process a batch of SMILES in parallel for better performance. """# Use ProcessPoolExecutor for CPU-bound taskswith ProcessPoolExecutor(max_workers=self.max_workers) as executor:# Split into chunks to reduce overhead chunk_size =max(1, len(smiles_list) // (self.max_workers *4)) chunks = [smiles_list[i:i + chunk_size] for i inrange(0, len(smiles_list), chunk_size)]# Process chunks in parallel futures = [executor.submit(self._process_chunk, chunk) for chunk in chunks] results = []for future in futures: results.extend(future.result())return resultsdef _process_chunk(self, smiles_chunk: List[str]) -> List[Dict]:"""Process a chunk of SMILES strings.""" results = []for smiles in smiles_chunk:try:ifself.use_caching: mol =self.cached_mol_from_smiles(smiles)else: mol = Chem.MolFromSmiles(smiles)if mol: features =self._extract_features_optimized(mol) features['smiles'] = smiles results.append(features)else: results.append({'smiles': smiles, 'error': 'Invalid SMILES'})exceptExceptionas e: results.append({'smiles': smiles, 'error': str(e)})return resultsdef _extract_features_optimized(self, mol: Chem.Mol) -> Dict:"""Optimized feature extraction focusing on essential properties."""# Only calculate the most important features to save timetry: features = {'molecular_weight': round(Descriptors.MolWt(mol), 2),'logp': round(Descriptors.MolLogP(mol), 2),'num_atoms': mol.GetNumAtoms(),'num_rings': rdMolDescriptors.CalcNumRings(mol),'h_bond_donors': Descriptors.NumHDonors(mol),'h_bond_acceptors': Descriptors.NumHAcceptors(mol), }# Fast functional group detection using pre-compiled patterns functional_groups = []for fg_name, pattern inself.functional_group_patterns.items():if mol.HasSubstructMatch(pattern): functional_groups.append(fg_name) features['functional_groups'] = functional_groupsreturn featuresexceptExceptionas e:return {'error': f'Feature extraction failed: {str(e)}'}# Performance comparison demonstrationdef benchmark_processing_methods():"""Compare different processing approaches for performance."""# Generate test data test_smiles = ["CCO", "c1ccccc1", "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O","CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "CC(=O)OC1=CC=CC=C1C(=O)O" ] *200# 1000 molecules for testing processor = OptimizedChemicalProcessor()print("Performance Benchmark:")print("="*40)# Method 1: Sequential processing start_time = time.time() sequential_results = []for smiles in test_smiles: mol = Chem.MolFromSmiles(smiles)if mol: features = extract_molecular_features(mol) sequential_results.append(features) sequential_time = time.time() - 
start_time# Method 2: Parallel processing start_time = time.time() parallel_results = processor.process_smiles_batch_parallel(test_smiles) parallel_time = time.time() - start_time# Method 3: Cached processing start_time = time.time() cached_results = []for smiles in test_smiles: mol = processor.cached_mol_from_smiles(smiles)if mol: features = processor._extract_features_optimized(mol) cached_results.append(features) cached_time = time.time() - start_timeprint(f"Sequential processing: {sequential_time:.2f} seconds")print(f"Parallel processing: {parallel_time:.2f} seconds")print(f"Cached processing: {cached_time:.2f} seconds")print(f"Speedup (parallel): {sequential_time/parallel_time:.1f}x")print(f"Speedup (cached): {sequential_time/cached_time:.1f}x")# Verify results are consistentprint(f"\nResults consistency check:")print(f"Sequential results: {len(sequential_results)} molecules processed")print(f"Parallel results: {len([r for r in parallel_results if'error'notin r])} molecules processed")print(f"Cached results: {len(cached_results)} molecules processed")# Run the benchmarkbenchmark_processing_methods()
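Worth noting as an alternative to managing a process pool by hand: recent RDKit releases also ship a multithreaded SMILES supplier that parallelizes the parsing itself when reading from a file. A sketch, assuming a plain one-SMILES-per-line file named llm_output.smi (the filename and thread count are placeholders):

from rdkit import Chem

# Parses in background threads; entries that fail to parse come back as None
suppl = Chem.MultithreadedSmilesMolSupplier("llm_output.smi", titleLine=False, numWriterThreads=4)
mols = [m for m in suppl if m is not None]
print(f"Parsed {len(mols)} valid molecules")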
Challenge 2: Handling Messy and Inconsistent LLM Outputs
LLMs can produce inconsistent or partially correct chemical information. Here’s how to build robust systems that handle these issues:
class RobustChemicalParser:
    """
    A parser that can handle messy, inconsistent LLM outputs and extract
    useful chemical information with confidence scoring.
    """

    def __init__(self):
        # Common patterns for extracting chemical information from text
        self.smiles_patterns = [
            r'\b([A-Za-z0-9@+\-\[\]\(\)=#$:%\\\/\.]+)\b',             # General SMILES pattern
            r'SMILES[:\s]*([A-Za-z0-9@+\-\[\]\(\)=#$:%\\\/\.]+)',     # SMILES: prefix
            r'structure[:\s]*([A-Za-z0-9@+\-\[\]\(\)=#$:%\\\/\.]+)',  # structure: prefix
        ]
        self.molecular_weight_patterns = [
            r'molecular weight[:\s]*(\d+\.?\d*)\s*(?:Da|daltons?|g/mol)?',
            r'MW[:\s]*(\d+\.?\d*)',
            r'(\d+\.?\d*)\s*(?:Da|daltons?|g/mol)',
        ]
        self.property_patterns = {
            'logp': [r'LogP[:\s]*(-?\d+\.?\d*)', r'log\s*P[:\s]*(-?\d+\.?\d*)'],
            'tpsa': [r'TPSA[:\s]*(\d+\.?\d*)', r'polar surface area[:\s]*(\d+\.?\d*)'],
            'hbd': [r'H-bond donors?[:\s]*(\d+)', r'hydrogen bond donors?[:\s]*(\d+)'],
            'hba': [r'H-bond acceptors?[:\s]*(\d+)', r'hydrogen bond acceptors?[:\s]*(\d+)'],
        }

    def extract_chemical_entities(self, text: str) -> Dict[str, any]:
        """
        Extract chemical entities from potentially messy LLM text output.
        Returns results with confidence scores.
        """
        results = {
            'extracted_smiles': [],
            'molecular_weight': None,
            'properties': {},
            'confidence_scores': {},
            'raw_text': text
        }

        # Extract SMILES candidates
        smiles_candidates = self._extract_smiles_candidates(text)
        validated_smiles = []
        for candidate, confidence in smiles_candidates:
            validation = validate_smiles_with_details(candidate)
            if validation['is_valid']:
                validated_smiles.append({
                    'smiles': candidate,
                    'canonical_smiles': validation['canonical_smiles'],
                    'confidence': confidence,
                    'properties': validation['properties']
                })
        results['extracted_smiles'] = validated_smiles

        # Extract molecular weight
        mw_match = self._extract_molecular_weight(text)
        if mw_match:
            results['molecular_weight'] = mw_match['value']
            results['confidence_scores']['molecular_weight'] = mw_match['confidence']

        # Extract other properties
        for prop_name, patterns in self.property_patterns.items():
            prop_match = self._extract_property(text, patterns)
            if prop_match:
                results['properties'][prop_name] = prop_match['value']
                results['confidence_scores'][prop_name] = prop_match['confidence']

        return results

    def _extract_smiles_candidates(self, text: str) -> List[Tuple[str, float]]:
        """Extract SMILES candidates with confidence scores."""
        candidates = []
        for i, pattern in enumerate(self.smiles_patterns):
            matches = re.findall(pattern, text, re.IGNORECASE)
            for match in matches:
                # Basic heuristics for SMILES likelihood
                confidence = self._calculate_smiles_confidence(match, pattern_index=i)
                if confidence > 0.3:  # Minimum confidence threshold
                    candidates.append((match, confidence))

        # Remove duplicates and sort by confidence
        seen = set()
        unique_candidates = []
        for smiles, conf in sorted(candidates, key=lambda x: x[1], reverse=True):
            if smiles not in seen:
                seen.add(smiles)
                unique_candidates.append((smiles, conf))

        return unique_candidates[:5]  # Return top 5 candidates

    def _calculate_smiles_confidence(self, smiles: str, pattern_index: int) -> float:
        """Calculate confidence score for a SMILES candidate."""
        confidence = 0.5  # Base confidence

        # Pattern-based confidence adjustment
        if pattern_index == 1:  # Explicit SMILES: prefix
            confidence += 0.3
        elif pattern_index == 2:  # structure: prefix
            confidence += 0.2

        # Length-based heuristics
        if 5 <= len(smiles) <= 200:
            confidence += 0.2
        elif len(smiles) < 3 or len(smiles) > 500:
            confidence -= 0.3

        # Character composition heuristics
        valid_chars = set('CNOSPFClBrIH[]()=@+-.0123456789#\\/%$')
        if all(c in valid_chars for c in smiles):
            confidence += 0.2

        # Common SMILES patterns
        if any(pattern in smiles for pattern in ['c1ccccc1', 'CC', 'CN', 'CO']):
            confidence += 0.1

        return min(1.0, max(0.0, confidence))

    def _extract_molecular_weight(self, text: str) -> Optional[Dict]:
        """Extract molecular weight with confidence."""
        for pattern in self.molecular_weight_patterns:
            matches = re.findall(pattern, text, re.IGNORECASE)
            if matches:
                try:
                    value = float(matches[0])
                    if 10 <= value <= 5000:  # Reasonable MW range
                        return {
                            'value': value,
                            'confidence': 0.8 if 'molecular weight' in pattern else 0.6
                        }
                except ValueError:
                    continue
        return None

    def _extract_property(self, text: str, patterns: List[str]) -> Optional[Dict]:
        """Extract a property value with confidence."""
        for pattern in patterns:
            matches = re.findall(pattern, text, re.IGNORECASE)
            if matches:
                try:
                    value = float(matches[0])
                    return {'value': value, 'confidence': 0.7}
                except ValueError:
                    continue
        return None

    def cross_validate_extractions(self, text: str) -> Dict[str, any]:
        """
        Cross-validate extracted information against RDKit calculations.
        """
        extraction = self.extract_chemical_entities(text)
        validation_results = {
            'extraction': extraction,
            'validation': {},
            'discrepancies': [],
            'overall_confidence': 0.0
        }

        if extraction['extracted_smiles']:
            best_smiles = extraction['extracted_smiles'][0]
            mol = Chem.MolFromSmiles(best_smiles['canonical_smiles'])
            if mol:
                rdkit_props = extract_molecular_features(mol)

                # Cross-validate molecular weight
                if extraction['molecular_weight']:
                    rdkit_mw = rdkit_props['molecular_weight']
                    extracted_mw = extraction['molecular_weight']
                    mw_diff = abs(rdkit_mw - extracted_mw)
                    if mw_diff < 1.0:
                        validation_results['validation']['molecular_weight'] = 'MATCH'
                    elif mw_diff < 10.0:
                        validation_results['validation']['molecular_weight'] = 'CLOSE'
                        validation_results['discrepancies'].append(
                            f"MW discrepancy: extracted={extracted_mw}, calculated={rdkit_mw}"
                        )
                    else:
                        validation_results['validation']['molecular_weight'] = 'MISMATCH'
                        validation_results['discrepancies'].append(
                            f"Major MW discrepancy: extracted={extracted_mw}, calculated={rdkit_mw}"
                        )

                # Cross-validate other properties
                for prop in ['logp', 'hbd', 'hba']:
                    if prop in extraction['properties']:
                        # Add property validation logic here
                        pass

        # Calculate overall confidence
        confidence_scores = list(extraction['confidence_scores'].values())
        if confidence_scores:
            validation_results['overall_confidence'] = sum(confidence_scores) / len(confidence_scores)

        return validation_results


# Demonstration with messy LLM outputs
parser = RobustChemicalParser()

# Simulate various types of messy LLM outputs
messy_outputs = [
    "The compound has SMILES: CC(C)CC1=CC=C(C=C1)C(C)C(=O)O and molecular weight 206.3 Da",
    "This molecule (c1ccccc1) is benzene, MW: 78.11 g/mol, LogP: 2.13",
    "Structure: CCO, ethanol, with 2 H-bond donors and 1 H-bond acceptor",
    "The SMILES string is probably C1=CC=C(C=C1)C(=O)O but I'm not completely sure about the molecular weight being around 122",
    "Invalid SMILES: XYZ123ABC and some random text with molecular weight 999999 Da",
]

print("Robust Chemical Information Extraction:")
print("=" * 50)

for i, text in enumerate(messy_outputs):
    print(f"\nExample {i+1}:")
    print(f"Input: {text}")

    # Extract and validate
    result = parser.cross_validate_extractions(text)
    extraction = result['extraction']

    print(f"Extracted SMILES: {len(extraction['extracted_smiles'])} candidates")
    if extraction['extracted_smiles']:
        best = extraction['extracted_smiles'][0]
        print(f"  Best: {best['canonical_smiles']} (confidence: {best['confidence']:.2f})")

    if extraction['molecular_weight']:
        print(f"Molecular Weight: {extraction['molecular_weight']} (confidence: {extraction['confidence_scores'].get('molecular_weight', 0):.2f})")

    if result['discrepancies']:
        print(f"Discrepancies: {'; '.join(result['discrepancies'])}")

    print(f"Overall confidence: {result['overall_confidence']:.2f}")
    print("-" * 30)
Robust Chemical Information Extraction:
==================================================
Example 1:
Input: The compound has SMILES: CC(C)CC1=CC=C(C=C1)C(C)C(=O)O and molecular weight 206.3 Da
Extracted SMILES: 1 candidates
Best: CC(C)Cc1ccc(C(C)C(=O)O)cc1 (confidence: 1.00)
Molecular Weight: 206.3 (confidence: 0.80)
Overall confidence: 0.80
------------------------------
Example 2:
Input: This molecule (c1ccccc1) is benzene, MW: 78.11 g/mol, LogP: 2.13
Extracted SMILES: 1 candidates
Best: c1ccccc1 (confidence: 0.80)
Molecular Weight: 78.11 (confidence: 0.60)
Overall confidence: 0.65
------------------------------
Example 3:
Input: Structure: CCO, ethanol, with 2 H-bond donors and 1 H-bond acceptor
Extracted SMILES: 1 candidates
Best: CCO (confidence: 1.00)
Overall confidence: 0.00
------------------------------
Example 4:
Input: The SMILES string is probably C1=CC=C(C=C1)C(=O)O but I'm not completely sure about the molecular weight being around 122
Extracted SMILES: 1 candidates
Best: O=C(O)c1ccccc1 (confidence: 1.00)
Overall confidence: 0.00
------------------------------
Example 5:
Input: Invalid SMILES: XYZ123ABC and some random text with molecular weight 999999 Da
Extracted SMILES: 0 candidates
Overall confidence: 0.00
------------------------------
[09:53:55] SMILES Parse Error: syntax error while parsing: 206.3
[09:53:55] SMILES Parse Error: check for mistakes around position 1:
[09:53:55] 206.3
[09:53:55] ^
[09:53:55] SMILES Parse Error: Failed parsing SMILES '206.3' for input: '206.3'
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'compound' for input: 'compound'
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'SMILES' for input: 'SMILES'
[09:53:55] SMILES Parse Error: syntax error while parsing: molecular
[09:53:55] SMILES Parse Error: check for mistakes around position 1:
[09:53:55] molecular
[09:53:55] ^
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'molecular' for input: 'molecular'
[09:53:55] SMILES Parse Error: syntax error while parsing: 78.11
[09:53:55] SMILES Parse Error: check for mistakes around position 1:
[09:53:55] 78.11
[09:53:55] ^
[09:53:55] SMILES Parse Error: Failed parsing SMILES '78.11' for input: '78.11'
[09:53:55] SMILES Parse Error: syntax error while parsing: molecule
[09:53:55] SMILES Parse Error: check for mistakes around position 1:
[09:53:55] molecule
[09:53:55] ^
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'molecule' for input: 'molecule'
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'benzene' for input: 'benzene'
[09:53:55] SMILES Parse Error: syntax error while parsing: g/mol
[09:53:55] SMILES Parse Error: check for mistakes around position 1:
[09:53:55] g/mol
[09:53:55] ^
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'g/mol' for input: 'g/mol'
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'Structure' for input: 'Structure'
[09:53:55] SMILES Parse Error: syntax error while parsing: ethanol
[09:53:55] SMILES Parse Error: check for mistakes around position 1:
[09:53:55] ethanol
[09:53:55] ^
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'ethanol' for input: 'ethanol'
[09:53:55] SMILES Parse Error: syntax error while parsing: H-bond
[09:53:55] SMILES Parse Error: check for mistakes around position 1:
[09:53:55] H-bond
[09:53:55] ^
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'H-bond' for input: 'H-bond'
[09:53:55] SMILES Parse Error: syntax error while parsing: donors
[09:53:55] SMILES Parse Error: check for mistakes around position 1:
[09:53:55] donors
[09:53:55] ^
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'donors' for input: 'donors'
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'string' for input: 'string'
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'SMILES' for input: 'SMILES'
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'probably' for input: 'probably'
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'completely' for input: 'completely'
[09:53:55] SMILES Parse Error: syntax error while parsing: XYZ123ABC
[09:53:55] SMILES Parse Error: check for mistakes around position 1:
[09:53:55] XYZ123ABC
[09:53:55] ^
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'XYZ123ABC' for input: 'XYZ123ABC'
[09:53:55] SMILES Parse Error: syntax error while parsing: 999999
[09:53:55] SMILES Parse Error: check for mistakes around position 1:
[09:53:55] 999999
[09:53:55] ^
[09:53:55] SMILES Parse Error: Failed parsing SMILES '999999' for input: '999999'
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'Invalid' for input: 'Invalid'
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'SMILES' for input: 'SMILES'
[09:53:55] SMILES Parse Error: syntax error while parsing: random
[09:53:55] SMILES Parse Error: check for mistakes around position 1:
[09:53:55] random
[09:53:55] ^
[09:53:55] SMILES Parse Error: Failed parsing SMILES 'random' for input: 'random'
Challenge 3: Data Quality and Contamination
Chemical datasets often contain errors, duplicates, and inconsistencies. Here’s how to build quality control systems:
NOTE FROM GREG: I needed to add an import to the following block
# EDIT BY GREG: add missing import
from rdkit import DataStructs


class ChemicalDataQualityController:
    """
    A comprehensive system for detecting and handling data quality issues
    in chemical datasets used for LLM training.
    """

    def __init__(self, similarity_threshold: float = 0.85):
        self.similarity_threshold = similarity_threshold
        self.quality_stats = {
            'total_processed': 0,
            'duplicates_found': 0,
            'invalid_structures': 0,
            'outliers_detected': 0,
            'cleaned_structures': 0
        }

    def comprehensive_quality_check(self, data: List[Dict]) -> Dict[str, any]:
        """
        Perform comprehensive quality checks on chemical dataset.

        Args:
            data: List of dictionaries with 'smiles' and optional 'name', 'description' keys

        Returns:
            Quality report with flagged issues and cleaned data
        """
        report = {
            'original_size': len(data),
            'issues': {
                'duplicates': [],
                'invalid_smiles': [],
                'structural_outliers': [],
                'text_inconsistencies': [],
                'suspicious_patterns': []
            },
            'cleaned_data': [],
            'recommendations': []
        }

        print("Running comprehensive data quality checks...")

        # Step 1: Basic validation and cleaning
        valid_entries = []
        for i, entry in enumerate(data):
            smiles = entry.get('smiles', '')
            validation = validate_smiles_with_details(smiles)
            if validation['is_valid']:
                # Store canonical SMILES for consistency
                entry['canonical_smiles'] = validation['canonical_smiles']
                entry['original_index'] = i
                valid_entries.append(entry)
            else:
                report['issues']['invalid_smiles'].append({
                    'index': i,
                    'smiles': smiles,
                    'errors': validation['errors']
                })

        print(f"  Valid structures: {len(valid_entries)}/{len(data)}")

        # Step 2: Duplicate detection
        duplicates = self._find_duplicates(valid_entries)
        report['issues']['duplicates'] = duplicates

        # Step 3: Structural outlier detection
        outliers = self._detect_structural_outliers(valid_entries)
        report['issues']['structural_outliers'] = outliers

        # Step 4: Text-structure consistency checks
        if any('description' in entry for entry in valid_entries):
            inconsistencies = self._check_text_structure_consistency(valid_entries)
            report['issues']['text_inconsistencies'] = inconsistencies

        # Step 5: Suspicious pattern detection
        suspicious = self._detect_suspicious_patterns(valid_entries)
        report['issues']['suspicious_patterns'] = suspicious

        # Step 6: Create cleaned dataset
        cleaned_data = self._create_cleaned_dataset(valid_entries, report['issues'])
        report['cleaned_data'] = cleaned_data
        report['final_size'] = len(cleaned_data)

        # Step 7: Generate recommendations
        report['recommendations'] = self._generate_recommendations(report)

        return report

    def _find_duplicates(self, entries: List[Dict]) -> List[Dict]:
        """Find duplicate structures using canonical SMILES and similarity."""
        print("  Checking for duplicates...")
        duplicates = []
        seen_smiles = {}

        # Exact duplicates (same canonical SMILES)
        for entry in entries:
            canonical = entry['canonical_smiles']
            if canonical in seen_smiles:
                duplicates.append({
                    'type': 'exact_duplicate',
                    'indices': [seen_smiles[canonical]['original_index'], entry['original_index']],
                    'smiles': canonical,
                    'names': [seen_smiles[canonical].get('name', 'Unknown'), entry.get('name', 'Unknown')]
                })
            else:
                seen_smiles[canonical] = entry

        # Near-duplicates (high structural similarity)
        if len(entries) < 1000:  # Only for smaller datasets due to O(n²) complexity
            fingerprints = []
            for entry in entries:
                mol = Chem.MolFromSmiles(entry['canonical_smiles'])
                if mol:
                    fp = Chem.RDKFingerprint(mol)
                    fingerprints.append((entry, fp))

            for i, (entry1, fp1) in enumerate(fingerprints):
                for j, (entry2, fp2) in enumerate(fingerprints[i+1:], i+1):
                    similarity = DataStructs.TanimotoSimilarity(fp1, fp2)
                    if similarity > self.similarity_threshold:
                        duplicates.append({
                            'type': 'near_duplicate',
                            'similarity': similarity,
                            'indices': [entry1['original_index'], entry2['original_index']],
                            'smiles': [entry1['canonical_smiles'], entry2['canonical_smiles']],
                            'names': [entry1.get('name', 'Unknown'), entry2.get('name', 'Unknown')]
                        })

        return duplicates

    def _detect_structural_outliers(self, entries: List[Dict]) -> List[Dict]:
        """Detect structural outliers that might indicate data quality issues."""
        print("  Detecting structural outliers...")
        outliers = []
        properties = []

        # Calculate properties for all molecules
        for entry in entries:
            mol = Chem.MolFromSmiles(entry['canonical_smiles'])
            if mol:
                props = {
                    'mw': Descriptors.MolWt(mol),
                    'logp': Descriptors.MolLogP(mol),
                    'num_atoms': mol.GetNumAtoms(),
                    'num_rings': rdMolDescriptors.CalcNumRings(mol),
                    'entry': entry
                }
                properties.append(props)

        if not properties:
            return outliers

        # Statistical outlier detection using IQR method
        for prop_name in ['mw', 'logp', 'num_atoms', 'num_rings']:
            values = [p[prop_name] for p in properties]
            q1 = pd.Series(values).quantile(0.25)
            q3 = pd.Series(values).quantile(0.75)
            iqr = q3 - q1
            lower_bound = q1 - 3 * iqr  # More permissive than standard 1.5*IQR
            upper_bound = q3 + 3 * iqr

            for props in properties:
                value = props[prop_name]
                if value < lower_bound or value > upper_bound:
                    outliers.append({
                        'type': f'{prop_name}_outlier',
                        'index': props['entry']['original_index'],
                        'smiles': props['entry']['canonical_smiles'],
                        'property': prop_name,
                        'value': value,
                        'expected_range': f'{lower_bound:.1f} - {upper_bound:.1f}'
                    })

        return outliers

    def _check_text_structure_consistency(self, entries: List[Dict]) -> List[Dict]:
        """Check consistency between text descriptions and calculated properties."""
        print("  Checking text-structure consistency...")
        inconsistencies = []

        for entry in entries:
            if 'description' not in entry:
                continue

            description = entry['description'].lower()
            mol = Chem.MolFromSmiles(entry['canonical_smiles'])
            if not mol:
                continue

            features = extract_molecular_features(mol)

            # Check molecular weight consistency
            if 'molecular weight' in description or 'mw' in description:
                # Extract numbers that might be molecular weights
                mw_numbers = re.findall(r'\b(\d{2,4}(?:\.\d+)?)\b', description)
                actual_mw = features['molecular_weight']

                found_matching_mw = False
                for num_str in mw_numbers:
                    try:
                        described_mw = float(num_str)
                        if abs(described_mw - actual_mw) < 5:  # 5 Da tolerance
                            found_matching_mw = True
                            break
                    except ValueError:
                        continue

                if not found_matching_mw and mw_numbers:
                    inconsistencies.append({
                        'type': 'molecular_weight_mismatch',
                        'index': entry['original_index'],
                        'smiles': entry['canonical_smiles'],
                        'described_mw': mw_numbers,
                        'calculated_mw': actual_mw,
                        'description_excerpt': description[:100] + '...'
                    })

            # Check functional group mentions
            mentioned_groups = []
            if 'alcohol' in description:
                mentioned_groups.append('alcohol')
            if 'carboxylic acid' in description or 'carboxyl' in description:
                mentioned_groups.append('carboxylic_acid')
            if 'amine' in description:
                mentioned_groups.append('amine')
            if 'aromatic' in description or 'benzene' in description:
                mentioned_groups.append('aromatic')

            detected_groups = [fg.split('(')[0] for fg in features.get('functional_groups', [])]

            for mentioned in mentioned_groups:
                if mentioned not in detected_groups and mentioned != 'aromatic':
                    inconsistencies.append({
                        'type': 'functional_group_mismatch',
                        'index': entry['original_index'],
                        'smiles': entry['canonical_smiles'],
                        'mentioned_group': mentioned,
                        'detected_groups': detected_groups,
                        'description_excerpt': description[:100] + '...'
                    })

        return inconsistencies

    def _detect_suspicious_patterns(self, entries: List[Dict]) -> List[Dict]:
        """Detect suspicious patterns that might indicate synthetic or corrupted data."""
        print("  Detecting suspicious patterns...")
        suspicious = []

        # Pattern 1: Repetitive SMILES patterns
        smiles_parts = {}
        for entry in entries:
            smiles = entry['canonical_smiles']
            # Look for repeated substrings
            for length in [3, 4, 5]:
                for i in range(len(smiles) - length + 1):
                    substring = smiles[i:i+length]
                    if substring.count(substring[0]) != len(substring):  # Not all same character
                        smiles_parts[substring] = smiles_parts.get(substring, 0) + 1

        # Flag SMILES with highly repetitive patterns
        for entry in entries:
            smiles = entry['canonical_smiles']
            repetitive_score = 0
            for substring, count in smiles_parts.items():
                if count > len(entries) * 0.1 and substring in smiles:  # Appears in >10% of dataset
                    repetitive_score += count

            if repetitive_score > len(entries) * 0.2:
                suspicious.append({
                    'type': 'repetitive_smiles_pattern',
                    'index': entry['original_index'],
                    'smiles': smiles,
                    'repetitive_score': repetitive_score
                })

        # Pattern 2: Unrealistic property combinations
        for entry in entries:
            mol = Chem.MolFromSmiles(entry['canonical_smiles'])
            if mol:
                mw = Descriptors.MolWt(mol)
                num_atoms = mol.GetNumAtoms()

                # Flag extremely dense molecules (too much mass for number of atoms)
                if num_atoms > 0 and mw / num_atoms > 50:  # Average atomic weight > 50
                    suspicious.append({
                        'type': 'unrealistic_density',
                        'index': entry['original_index'],
                        'smiles': entry['canonical_smiles'],
                        'mw_per_atom': mw / num_atoms
                    })

        return suspicious

    def _create_cleaned_dataset(self, entries: List[Dict], issues: Dict) -> List[Dict]:
        """Create a cleaned dataset by removing problematic entries."""
        print("  Creating cleaned dataset...")

        # Collect indices to remove
        indices_to_remove = set()

        # Remove exact duplicates (keep first occurrence)
        for dup in issues['duplicates']:
            if dup['type'] == 'exact_duplicate':
                indices_to_remove.add(dup['indices'][1])  # Remove second occurrence

        # Remove severe outliers
        for outlier in issues['structural_outliers']:
            if outlier['type'] in ['num_atoms_outlier', 'mw_outlier']:
                # Only remove extreme outliers
                if outlier['type'] == 'num_atoms_outlier' and outlier['value'] > 200:
                    indices_to_remove.add(outlier['index'])
                elif outlier['type'] == 'mw_outlier' and outlier['value'] > 2000:
                    indices_to_remove.add(outlier['index'])

        # Remove highly suspicious entries
        for susp in issues['suspicious_patterns']:
            if susp['type'] == 'unrealistic_density' and susp['mw_per_atom'] > 100:
                indices_to_remove.add(susp['index'])

        # Create cleaned dataset
        cleaned = []
        for entry in entries:
            if entry['original_index'] not in indices_to_remove:
                # Clean up the entry
                cleaned_entry = {
                    'smiles': entry['canonical_smiles'],  # Use canonical form
                    'name': entry.get('name', ''),
                    'description': entry.get('description', '')
                }
                cleaned.append(cleaned_entry)

        return cleaned

    def _generate_recommendations(self, report: Dict) -> List[str]:
        """Generate recommendations based on quality issues found."""
        recommendations = []

        # Duplicate recommendations
        if report['issues']['duplicates']:
            exact_dups = len([d for d in report['issues']['duplicates'] if d['type'] == 'exact_duplicate'])
            near_dups = len([d for d in report['issues']['duplicates'] if d['type'] == 'near_duplicate'])
            if exact_dups > 0:
                recommendations.append(f"Remove {exact_dups} exact duplicate structures")
            if near_dups > 0:
                recommendations.append(f"Review {near_dups} near-duplicate pairs (similarity > {self.similarity_threshold})")

        # Data size recommendations
        removal_rate = (report['original_size'] - report['final_size']) / report['original_size']
        if removal_rate > 0.2:
            recommendations.append(f"High removal rate ({removal_rate:.1%}) suggests significant data quality issues")

        # Outlier recommendations
        if report['issues']['structural_outliers']:
            recommendations.append("Review structural outliers - they may indicate data entry errors")

        # Consistency recommendations
        if report['issues']['text_inconsistencies']:
            recommendations.append("Text descriptions inconsistent with structures - consider regenerating descriptions")

        if not recommendations:
            recommendations.append("Dataset quality looks good! No major issues detected.")

        return recommendations


# Demonstration with a problematic dataset
def create_test_dataset_with_issues():
    """Create a test dataset with various quality issues for demonstration."""
    return [
        # Good entries
        {'smiles': 'CCO', 'name': 'Ethanol', 'description': 'Ethanol is an alcohol with molecular weight 46.07 Da'},
        {'smiles': 'c1ccccc1', 'name': 'Benzene', 'description': 'Benzene is an aromatic compound'},
        # Duplicates
        {'smiles': 'CCO', 'name': 'Ethyl alcohol', 'description': 'Another entry for ethanol'},  # Exact duplicate
        {'smiles': 'c1ccccc1', 'name': 'Benzene ring', 'description': 'Benzene ring structure'},  # Exact duplicate
        # Invalid SMILES
        {'smiles': 'INVALID123', 'name': 'Bad entry', 'description': 'This is not a valid SMILES'},
        {'smiles': 'C1CCC', 'name': 'Unclosed ring', 'description': 'Missing ring closure'},
        # Outliers
        {'smiles': 'C'*100, 'name': 'Very long chain', 'description': 'Extremely long alkyl chain'},
        # Text inconsistencies
        {'smiles': 'CCO', 'name': 'Ethanol', 'description': 'This alcohol has molecular weight 200 Da'},  # Wrong MW
        {'smiles': 'c1ccccc1', 'name': 'Benzene', 'description': 'Contains carboxylic acid groups'},  # Wrong functional group
        # Near duplicates
        {'smiles': 'CC(C)O', 'name': 'Isopropanol', 'description': 'Isopropyl alcohol'},
        # Good entries for padding
        {'smiles': 'CC(=O)O', 'name': 'Acetic acid', 'description': 'Acetic acid is a carboxylic acid'},
        {'smiles': 'CN', 'name': 'Methylamine', 'description': 'Simple amine compound'},
    ]


# Run quality control demonstration
print("Chemical Data Quality Control Demonstration")
print("=" * 50)

# Create test dataset with known issues
test_data = create_test_dataset_with_issues()
print(f"Test dataset created with {len(test_data)} entries")

# Run quality control
qc = ChemicalDataQualityController(similarity_threshold=0.8)
quality_report = qc.comprehensive_quality_check(test_data)

# Display results
print(f"\nQuality Control Results:")
print(f"Original dataset size: {quality_report['original_size']}")
print(f"Cleaned dataset size: {quality_report['final_size']}")
print(f"Removal rate: {(1 - quality_report['final_size']/quality_report['original_size'])*100:.1f}%")

print(f"\nIssues Found:")
for issue_type, issues in quality_report['issues'].items():
    if issues:
        print(f"  {issue_type.replace('_', ' ').title()}: {len(issues)} cases")

print(f"\nRecommendations:")
for i, rec in enumerate(quality_report['recommendations'], 1):
    print(f"  {i}. {rec}")

# Show example of cleaned data
print(f"\nFirst 3 entries of cleaned dataset:")
for i, entry in enumerate(quality_report['cleaned_data'][:3]):
    print(f"  {i+1}. {entry['name']}: {entry['smiles']}")
Chemical Data Quality Control Demonstration
==================================================
Test dataset created with 12 entries
Running comprehensive data quality checks...
Valid structures: 10/12
Checking for duplicates...
Detecting structural outliers...
Checking text-structure consistency...
Detecting suspicious patterns...
Creating cleaned dataset...
Quality Control Results:
Original dataset size: 12
Cleaned dataset size: 6
Removal rate: 50.0%
Issues Found:
Duplicates: 10 cases
Invalid Smiles: 2 cases
Structural Outliers: 3 cases
Text Inconsistencies: 2 cases
Suspicious Patterns: 6 cases
Recommendations:
1. Remove 4 exact duplicate structures
2. Review 6 near-duplicate pairs (similarity > 0.8)
3. High removal rate (50.0%) suggests significant data quality issues
4. Review structural outliers - they may indicate data entry errors
5. Text descriptions inconsistent with structures - consider regenerating descriptions
First 3 entries of cleaned dataset:
1. Ethanol: CCO
2. Benzene: c1ccccc1
3. Very long chain: CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
[09:55:35] SMILES Parse Error: Failed parsing SMILES 'INVALID123' for input: 'INVALID123'
[09:55:35] SMILES Parse Error: unclosed ring for input: 'C1CCC'
Advanced Contamination Detection
NOTE FROM GREG: the following section of code does not work. I’m leaving it here as the first example of something that didn’t actually work as generated by the LLM.
Data contamination is a critical issue when training LLMs on chemical data. Beyond basic quality checks, sophisticated contamination detection is essential:
Cross-Dataset Leakage Detection:
def detect_cross_dataset_leakage(train_smiles, test_smiles, validation_smiles):
    """Detect potential data leakage between training, test, and validation sets"""
    # Implementation would check for exact matches and high-similarity compounds
    pass
Temporal Contamination: Chemical databases often contain compounds discovered after certain dates. For historically-aware models, ensure temporal consistency:
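As a minimal sketch of what such a check could look like (not part of the generated code above), assume a hypothetical first_publication_dates lookup that maps canonical SMILES to a first-reported date; in practice you would have to build this from literature or registry metadata:

from datetime import date
from rdkit import Chem

# Hypothetical lookup: canonical SMILES -> first publication/registration date.
first_publication_dates = {
    'CCO': date(1800, 1, 1),
    'CC(=O)Nc1ccc(O)cc1': date(1878, 1, 1),
}

def filter_by_date(smiles_list, cutoff=date(2020, 1, 1)):
    """Keep only compounds first reported before the cutoff date."""
    kept, unknown = [], []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        canonical = Chem.MolToSmiles(mol)
        first_seen = first_publication_dates.get(canonical)
        if first_seen is None:
            unknown.append(canonical)  # no date information: flag for manual review
        elif first_seen < cutoff:
            kept.append(canonical)
    return kept, unknown

kept, unknown = filter_by_date(['CCO', 'CC(=O)Nc1ccc(O)cc1', 'c1ccncc1'])
print(f"kept {len(kept)}, needs review: {len(unknown)}")

Compounds without date information are returned separately rather than silently kept, so they can be reviewed by hand.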
Vendor Catalog Contamination: Many “novel” compounds in datasets are actually from commercial catalogs, which can lead to unrealistic performance expectations.
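One simple way to get a handle on this, assuming you can export a set of canonical SMILES from the vendor catalogs you care about (the vendor_catalog set below is just a stand-in), is to measure the exact-match overlap:

from rdkit import Chem

def vendor_overlap_fraction(dataset_smiles, vendor_catalog):
    """Fraction of a dataset that also appears in a vendor catalog (by canonical SMILES)."""
    canonical = set()
    for smi in dataset_smiles:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            canonical.add(Chem.MolToSmiles(mol))
    if not canonical:
        return 0.0
    return len(canonical & vendor_catalog) / len(canonical)

# Stand-in catalog; a real check would use canonical SMILES exported from vendor files
vendor_catalog = {'CCO', 'c1ccccc1', 'CC(C)O'}
print(f"{vendor_overlap_fraction(['CCO', 'CCN', 'c1ccccc1'], vendor_catalog):.0%} of the dataset is in the catalog")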
Synthetic Accessibility Bias: Datasets often over-represent easily synthesizable compounds, creating bias in LLM outputs.
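A quick way to look for this bias is to examine the distribution of synthetic accessibility scores across the dataset. The sketch below uses the SA_Score module shipped in the RDKit Contrib directory (scores run from 1, easy, to 10, hard); a mean sitting near the bottom of the scale suggests the set leans heavily toward easily synthesizable compounds:

import os
import sys

from rdkit import Chem
from rdkit.Chem import RDConfig

# The SA score implementation lives in the RDKit Contrib directory
sys.path.append(os.path.join(RDConfig.RDContribDir, 'SA_Score'))
import sascorer

def mean_sa_score(smiles_list):
    """Mean synthetic-accessibility score (1 = easy, 10 = hard) for a list of SMILES."""
    scores = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            scores.append(sascorer.calculateScore(mol))
    return sum(scores) / len(scores) if scores else None

print(f"mean SA score: {mean_sa_score(['CCO', 'c1ccccc1', 'CC(=O)Nc1ccc(O)cc1']):.2f}")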
class ContaminationDetector:
    """
    Advanced contamination detection for chemical datasets used in LLM training.
    """

    def __init__(self, similarity_threshold: float = 0.85):
        self.similarity_threshold = similarity_threshold

    def detect_cross_dataset_leakage(self, datasets: Dict[str, List[str]]) -> Dict:
        """
        Detect potential data leakage between multiple datasets.

        Args:
            datasets: Dictionary with dataset names as keys and lists of SMILES as values
                      e.g., {'train': [...], 'test': [...], 'validation': [...]}

        Returns:
            Dictionary with contamination analysis results
        """
        print("Detecting cross-dataset contamination...")

        results = {
            'exact_overlaps': {},
            'similarity_overlaps': {},
            'summary': {}
        }

        # Convert SMILES to canonical form and create fingerprints
        canonical_datasets = {}
        fingerprint_datasets = {}

        for name, smiles_list in datasets.items():
            canonical_smiles = []
            fingerprints = []
            for smiles in smiles_list:
                mol = Chem.MolFromSmiles(smiles)
                if mol:
                    canonical = Chem.MolToSmiles(mol)
                    canonical_smiles.append(canonical)
                    fingerprints.append(Chem.RDKFingerprint(mol))
            canonical_datasets[name] = canonical_smiles
            fingerprint_datasets[name] = fingerprints
            print(f"  {name}: {len(canonical_smiles)} valid structures")

        # Check exact overlaps
        dataset_names = list(datasets.keys())
        for i, name1 in enumerate(dataset_names):
            for name2 in dataset_names[i+1:]:
                overlap_key = f"{name1}_vs_{name2}"
                set1 = set(canonical_datasets[name1])
                set2 = set(canonical_datasets[name2])
                exact_overlap = set1.intersection(set2)

                results['exact_overlaps'][overlap_key] = {
                    'count': len(exact_overlap),
                    'percentage_of_first': len(exact_overlap) / len(set1) * 100 if set1 else 0,
                    'percentage_of_second': len(exact_overlap) / len(set2) * 100 if set2 else 0,
                    'overlapping_smiles': list(exact_overlap)[:10]  # First 10 examples
                }

        # Check similarity-based overlaps (computationally expensive, limited to smaller datasets)
        for i, name1 in enumerate(dataset_names):
            for name2 in dataset_names[i+1:]:
                overlap_key = f"{name1}_vs_{name2}"
                if len(fingerprint_datasets[name1]) < 1000 and len(fingerprint_datasets[name2]) < 1000:
                    similar_pairs = []
                    for idx1, fp1 in enumerate(fingerprint_datasets[name1]):
                        for idx2, fp2 in enumerate(fingerprint_datasets[name2]):
                            similarity = DataStructs.TanimotoSimilarity(fp1, fp2)
                            if similarity > self.similarity_threshold:
                                similar_pairs.append({
                                    'smiles1': canonical_datasets[name1][idx1],
                                    'smiles2': canonical_datasets[name2][idx2],
                                    'similarity': similarity
                                })
                    results['similarity_overlaps'][overlap_key] = {
                        'count': len(similar_pairs),
                        'pairs': similar_pairs[:5]  # First 5 examples
                    }
                else:
                    results['similarity_overlaps'][overlap_key] = {
                        'count': 'skipped_large_dataset',
                        'note': 'Dataset too large for pairwise comparison'
                    }

        # Generate summary
        total_exact_overlaps = sum(result['count'] for result in results['exact_overlaps'].values()
                                   if isinstance(result['count'], int))
        results['summary'] = {
            'total_exact_overlaps': total_exact_overlaps,
            'datasets_analyzed': len(datasets),
            'contamination_severity': self._assess_contamination_severity(results),
            'recommendations': self._generate_contamination_recommendations(results)
        }

        return results

    def detect_temporal_contamination(self, smiles_list: List[str], target_date: str = "2020-01-01") -> Dict:
        """
        Detect compounds that might be from after a target date.

        Note: This is a simplified version - real implementation would require
        a database of compound discovery/publication dates.
        """
        print(f"Checking for temporal contamination (compounds after {target_date})...")

        # This is a placeholder implementation
        # In practice, you'd need a database mapping SMILES to discovery dates
        suspicious_compounds = []

        for smiles in smiles_list:
            mol = Chem.MolFromSmiles(smiles)
            if mol:
                # Example heuristic: very complex molecules might be recent
                complexity_score = self._calculate_molecular_complexity(mol)
                if complexity_score > 20:  # Arbitrary threshold
                    suspicious_compounds.append({
                        'smiles': smiles,
                        'complexity_score': complexity_score,
                        'reason': 'high_complexity'
                    })

        return {
            'suspicious_count': len(suspicious_compounds),
            'suspicious_compounds': suspicious_compounds[:10],
            'recommendation': f"Manual review recommended for {len(suspicious_compounds)} complex compounds"
        }

    def detect_vendor_catalog_bias(self, smiles_list: List[str]) -> Dict:
        """
        Detect potential bias from commercial compound catalogs.
        """
        print("Detecting vendor catalog bias...")

        # Common patterns in commercial compounds
        vendor_patterns = {
            'simple_aromatics': r'c1ccccc1',                        # Simple benzene rings
            'common_protecting_groups': r'C\(=O\)OC\(C\)\(C\)C',    # Boc groups
            'standard_linkers': r'OCCOCCOC',                        # PEG-like linkers
        }

        pattern_counts = {pattern: 0 for pattern in vendor_patterns.keys()}
        flagged_compounds = []

        for smiles in smiles_list:
            mol = Chem.MolFromSmiles(smiles)
            if mol:
                # Check for overly simple structures
                if mol.GetNumAtoms() < 10 and mol.GetNumBonds() < 10:
                    flagged_compounds.append({
                        'smiles': smiles,
                        'reason': 'overly_simple',
                        'num_atoms': mol.GetNumAtoms()
                    })

                # Check for common patterns
                for pattern_name, pattern in vendor_patterns.items():
                    if re.search(pattern, smiles):
                        pattern_counts[pattern_name] += 1

        return {
            'pattern_counts': pattern_counts,
            'flagged_simple': len(flagged_compounds),
            'total_analyzed': len(smiles_list),
            'bias_indicators': {
                'high_simple_proportion': len(flagged_compounds) / len(smiles_list) > 0.3,
                'pattern_dominance': any(count > len(smiles_list) * 0.2 for count in pattern_counts.values())
            }
        }

    def _calculate_molecular_complexity(self, mol) -> float:
        """Calculate a simple molecular complexity score."""
        if not mol:
            return 0

        complexity = 0
        complexity += mol.GetNumAtoms() * 0.1
        complexity += mol.GetNumBonds() * 0.1
        complexity += rdMolDescriptors.CalcNumRings(mol) * 2
        complexity += rdMolDescriptors.CalcNumHeteroatoms(mol) * 1.5
        complexity += len(rdMolDescriptors.GetUSRScore(mol)) * 0.01 if hasattr(rdMolDescriptors, 'GetUSRScore') else 0

        return complexity

    def _assess_contamination_severity(self, results: Dict) -> str:
        """Assess the overall severity of contamination."""
        exact_overlaps = sum(result['count'] for result in results['exact_overlaps'].values()
                             if isinstance(result['count'], int))
        if exact_overlaps == 0:
            return "low"
        elif exact_overlaps < 10:
            return "moderate"
        else:
            return "high"

    def _generate_contamination_recommendations(self, results: Dict) -> List[str]:
        """Generate recommendations based on contamination analysis."""
        recommendations = []
        severity = self._assess_contamination_severity(results)

        if severity == "high":
            recommendations.append("CRITICAL: High level of contamination detected - review data splitting strategy")
            recommendations.append("Consider using scaffold-based or time-based splitting instead of random splitting")
        elif severity == "moderate":
            recommendations.append("Moderate contamination detected - remove overlapping compounds")
        else:
            recommendations.append("Low contamination level - acceptable for most applications")

        # Check for specific issues
        for overlap_key, overlap_data in results['exact_overlaps'].items():
            if overlap_data['count'] > 0:
                recommendations.append(f"Remove {overlap_data['count']} overlapping compounds between {overlap_key}")

        return recommendations


# Demonstration of contamination detection
print("Contamination Detection Demonstration")
print("=" * 40)

# Create example datasets with known contamination
example_datasets = {
    'train': [
        'CCO',          # ethanol
        'c1ccccc1',     # benzene
        'CC(=O)O',      # acetic acid
        'CC(C)O',       # isopropanol
        'CN',           # methylamine
    ],
    'test': [
        'CCO',          # CONTAMINATION: same as train
        'c1ccc(O)cc1',  # phenol
        'CC(=O)OC',     # methyl acetate
    ],
    'validation': [
        'CC(=O)O',      # CONTAMINATION: same as train
        'CC(C)(C)O',    # tert-butanol
        'c1cccnc1',     # pyridine
    ]
}

detector = ContaminationDetector(similarity_threshold=0.8)

# Run contamination analysis
contamination_results = detector.detect_cross_dataset_leakage(example_datasets)

print("\nContamination Analysis Results:")
print(f"Contamination severity: {contamination_results['summary']['contamination_severity']}")
print(f"Total exact overlaps: {contamination_results['summary']['total_exact_overlaps']}")

print("\nExact overlaps found:")
for overlap_key, overlap_data in contamination_results['exact_overlaps'].items():
    if overlap_data['count'] > 0:
        print(f"  {overlap_key}: {overlap_data['count']} compounds ({overlap_data['percentage_of_first']:.1f}% of first dataset)")
        print(f"    Examples: {overlap_data['overlapping_smiles']}")

print("\nRecommendations:")
for i, rec in enumerate(contamination_results['summary']['recommendations'], 1):
    print(f"  {i}. {rec}")

# Test temporal contamination detection
print(f"\nTemporal Contamination Check:")
temporal_results = detector.detect_temporal_contamination(example_datasets['train'])
print(f"Suspicious compounds: {temporal_results['suspicious_count']}")

# Test vendor bias detection
print(f"\nVendor Catalog Bias Check:")
bias_results = detector.detect_vendor_catalog_bias(example_datasets['train'])
print(f"Simple compounds flagged: {bias_results['flagged_simple']}/{bias_results['total_analyzed']}")
print(f"Bias indicators detected: {any(bias_results['bias_indicators'].values())}")
Contamination Detection Demonstration
========================================
Detecting cross-dataset contamination...
train: 5 valid structures
test: 3 valid structures
validation: 3 valid structures
Contamination Analysis Results:
Contamination severity: moderate
Total exact overlaps: 2
Exact overlaps found:
train_vs_test: 1 compounds (20.0% of first dataset)
Examples: ['CCO']
train_vs_validation: 1 compounds (20.0% of first dataset)
Examples: ['CC(=O)O']
Recommendations:
1. Moderate contamination detected - remove overlapping compounds
2. Remove 1 overlapping compounds between train_vs_test
3. Remove 1 overlapping compounds between train_vs_validation
Temporal Contamination Check:
Checking for temporal contamination (compounds after 2020-01-01)...
---------------------------------------------------------------------------
ArgumentError                             Traceback (most recent call last)
Cell In[15], line 271
    269 # Test temporal contamination detection
    270 print(f"\nTemporal Contamination Check:")
--> 271 temporal_results = detector.detect_temporal_contamination(example_datasets['train'])
    272 print(f"Suspicious compounds: {temporal_results['suspicious_count']}")
    274 # Test vendor bias detection

Cell In[15], line 122, in ContaminationDetector.detect_temporal_contamination(self, smiles_list, target_date)
    119 mol = Chem.MolFromSmiles(smiles)
    120 if mol:
    121     # Example heuristic: very complex molecules might be recent
--> 122     complexity_score = self._calculate_molecular_complexity(mol)
    123     if complexity_score > 20:  # Arbitrary threshold
    124         suspicious_compounds.append({
    125             'smiles': smiles,
    126             'complexity_score': complexity_score,
    127             'reason': 'high_complexity'
    128         })

Cell In[15], line 188, in ContaminationDetector._calculate_molecular_complexity(self, mol)
    186 complexity += rdMolDescriptors.CalcNumRings(mol) * 2
    187 complexity += rdMolDescriptors.CalcNumHeteroatoms(mol) * 1.5
--> 188 complexity += len(rdMolDescriptors.GetUSRScore(mol)) * 0.01 if hasattr(rdMolDescriptors, 'GetUSRScore') else 0
    190 return complexity

ArgumentError: Python argument types in
ArgumentError: Python argument types in
rdkit.Chem.rdMolDescriptors.GetUSRScore(Mol)
did not match C++ signature:
GetUSRScore(boost::python::api::object descriptor1, boost::python::api::object descriptor2, boost::python::api::object weights=[])
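The failure is an API mismatch in the complexity helper: as the error message shows, rdMolDescriptors.GetUSRScore compares two USR descriptor vectors, so passing a Mol doesn't match any signature (and computing a USR descriptor with rdMolDescriptors.GetUSR would itself require a 3D conformer). One possible fix, sketched here under that reading, is simply to drop the USR term from the score:

from rdkit.Chem import rdMolDescriptors

def calculate_molecular_complexity(mol) -> float:
    """Simple 2D complexity score; the mis-used USR term has been removed."""
    if not mol:
        return 0.0
    complexity = 0.0
    complexity += mol.GetNumAtoms() * 0.1
    complexity += mol.GetNumBonds() * 0.1
    complexity += rdMolDescriptors.CalcNumRings(mol) * 2
    complexity += rdMolDescriptors.CalcNumHeteroatoms(mol) * 1.5
    return complexity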