Constrained Molecular Graph Generation with Diffusion Models

ISLEK, RANA
2024/2025

Abstract

Generative models for molecular graphs have progressed quickly, yet most cannot guarantee structural compliance during sampling, which limits their reliability in scientific use. This thesis investigates hard constraints within discrete diffusion for molecular graph generation by adapting and extending the ConStruct framework to QM9. I introduce sampling-time projectors that impose symbolic graph constraints, namely upper bounds on the total ring count and on the maximum ring length, both defined over all simple cycles. The projectors act only during reverse diffusion, are edge-deletion invariant, and block precisely those edge insertions that would lead to a violation, leaving the learned score network unchanged. Across all settings, the method attains near-perfect satisfaction of the targeted constraints (essentially 100%) while keeping generative quality competitive: RDKit validity, uniqueness, and novelty remain high, and Fréchet ChemNet Distance stays close to the unconstrained baseline. When the constraints are non-binding, behavior matches the baseline, as expected. To make these effects measurable, I separate structural satisfaction at sampling time from chemical metrics computed post hoc, and provide diagnostics that track shifts in ring spectra and connectivity under different constraint levels. The analysis also characterizes the trade-offs introduced by tight constraints, such as mild distributional bias and occasional novelty loss. Overall, the results show that strict, interpretable constraints can be integrated into discrete diffusion without retraining, enabling controllable molecular generation. The same mechanism is intended to extend naturally to additional rules (for example, drug-likeness), suggesting a general template for constraint-aware graph generative models beyond molecules.
2024
diffusion models
graph generation
molecular generation
molecular datasets
deep learning
Files in this item:
RanaIslek_Thesis_Report.pdf (open access, Adobe PDF, 2.52 MB)

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/102115