Uncertainty quantification with graph neural networks for efficient molecular design

Molecular design benchmarks

To effectively evaluate molecular design strategies, tasks must be complex enough to reflect the challenges encountered in real-world applications. Our study provides a comprehensive assessment of different optimization approaches across 19 molecular property datasets spanning 10 single-objective and 6 multi-objective tasks (Table 1), derived from the Tartarus49 and GuacaMol48 platforms.

Table 1 Summary of the molecular design tasks investigated in this study

The first platform, Tartarus49, offers a sophisticated suite of benchmark tasks tailored to address practical molecular design challenges within the realms of materials science, pharmaceuticals, and chemical reactions. Utilizing well-established computational chemistry techniques, including force fields and density functional theory (DFT), Tartarus models complex molecular systems with high computational efficiency. The benchmarks encompass a wide array of applications, ranging from optimizing organic photovoltaics and discovering novel organic light-emitting diodes (OLEDs) to designing protein ligands and pioneering new chemical reactions. This breadth enables a comprehensive evaluation of various molecular design algorithms across multiple real-world simulation scenarios.

Three molecular design categories from Tartarus, comprising seven single-objective and two multi-objective tasks, are listed in Table 1. Each task employs specific computational methods: organic emitter design involves conformer sampling51, semi-empirical quantum mechanical methods for geometry optimization52,53, and time-dependent DFT for single-point energy calculations54. Protein ligand design utilizes docking pose searches to determine stable binding energies55, supplemented by empirical functions for final score calculations56. Reaction substrate design tasks employ force fields for optimizing reactant and product structures57, with transition state structures further refined using the SEAM method58. These methods include stochastic elements such as conformer search and docking site sampling, introducing variability in simulation outcomes due to the inherent randomness of geometry optimization. For multi-objective tasks, a typical approach might involve aggregating multiple objectives into a single composite score. However, this can lead to suboptimal compromises where certain objectives are sacrificed to maximize the overall score. In practical applications, molecules often need to satisfy multiple objectives simultaneously, which can be particularly challenging when these objectives are mutually constraining. To evaluate the efficacy of molecular design strategies under these conditions, we analyzed each objective within multi-objective tasks, choosing scenarios where objectives could potentially conflict. For example, the task of simultaneously minimizing both activation energy and reaction energy was excluded due to their positive correlation, as explained by the Bell–Evans–Polanyi principle59,60. Conversely, we included the task of simultaneously maximizing activation energy while minimizing reaction energy, as it poses a significant challenge by deviating from conventional expectations and thus aligns more closely with the aims of our study. These choices are detailed in Table 1, illustrating the structured approach to assessing molecular design algorithms against complex, real-world criteria.

The second molecular design platform, GuacaMol48, serves as a widely recognized benchmark in drug discovery and is extensively utilized in various molecular optimization studies. The design tasks include marketed drug rediscovery, similarity assessment, median molecule generation, and isomer generation. From these, we selected tasks suitable for molecular property optimization, comprising three single-objective tasks aimed at identifying structures similar to a specific drug and four multi-objective tasks focused on finding median molecules between two drugs or achieving multi-property optimization (MPO), as detailed in Table 1. Unlike the physical simulations in Tartarus, GuacaMol uses deterministic functions implemented in RDKit to compute property values, thereby eliminating data randomness. To simulate real-world scenarios where machine learning (ML) surrogate models are rarely perfect, we downsample the GuacaMol dataset to build ML surrogate models for fitness prediction during the genetic algorithm (GA) process. In this setup, the molecular design process initially relies on a potentially imperfect surrogate model to propose molecular structures, which are subsequently validated using the RDKit-based oracle functions.
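
To make this setup concrete, the sketch below illustrates one plausible form of the loop, with hypothetical stand-ins throughout: a random-forest regressor on Morgan fingerprints replaces the D-MPNN surrogate, RDKit's QED replaces a GuacaMol oracle, and the genetic operators are reduced to ranking a fixed candidate pool. It is a sketch of the idea, not the authors' implementation.

```python
# Minimal illustrative sketch of a surrogate-then-oracle loop; not the authors'
# implementation. A random forest on Morgan fingerprints stands in for the
# D-MPNN surrogate, and RDKit's QED stands in for a GuacaMol oracle function.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, QED
from sklearn.ensemble import RandomForestRegressor

def fingerprint(smiles: str) -> np.ndarray:
    mol = Chem.MolFromSmiles(smiles)
    return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024))

def oracle(smiles: str) -> float:
    # Deterministic RDKit-based property, playing the role of the oracle
    return QED.qed(Chem.MolFromSmiles(smiles))

# "Downsampled" labeled data used to train the (imperfect) surrogate
labeled = {s: oracle(s) for s in ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"]}
surrogate = RandomForestRegressor(n_estimators=100, random_state=0).fit(
    np.vstack([fingerprint(s) for s in labeled]), list(labeled.values()))

# GA step (placeholder): rank unseen candidates by surrogate fitness, then
# validate the proposed molecule with the true oracle before adding it back
pool = ["CCN", "c1ccncc1", "CC(C)Cc1ccc(cc1)C(C)C(=O)O"]
best = max(pool, key=lambda s: surrogate.predict(fingerprint(s)[None, :])[0])
labeled[best] = oracle(best)
```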

Uncertainty-aware and uncertainty-agnostic fitness functions

In conventional molecular design, the typical single-objective optimization approach focuses on maximizing a specific fitness function \({F}_{{{{\rm{DOM}}}}}\left(m\right)\) without considering uncertainty. This naïve approach, referred to as direct objective maximization (DOM) or the greedy method61, is defined as

$${F}_{{{{\rm{DOM}}}}}\left(m\right)=\eta \mu \left(m\right)$$

(1)

where \(\mu \left(m\right)\) represents the predicted property value of molecule \(m\) from the surrogate model, and \(\eta\) is a sign factor: +1 when a higher \(\mu\) is desired and −1 when a lower \(\mu\) is preferred. However, practical applications often do not necessitate driving the property values to their extremes. Instead, it is usually sufficient for the property to meet a certain threshold \(\delta\) that is deemed acceptable for a given application47,48.

In such scenarios, the goal should shift from merely optimizing the property value to ensuring that the property of molecule \(m\) exceeds this threshold \(\delta\). Assuming the property predicted by the surrogate model follows a Gaussian distribution with mean \(\mu \left(m\right)\) and variance \({\sigma }^{2}\left(m\right)\), the probabilistic improvement optimization (PIO) fitness function can be defined as

$${F}_{{{{\rm{PIO}}}}}\left({m;}\delta \right)=\eta {\int }_{\delta }^{\eta \infty }\frac{1}{\sigma \left(m\right)\sqrt{2{{\pi }}}}\exp \left(-\frac{1}{2}{\left(\frac{x-\mu \left(m\right)}{\sigma \left(m\right)}\right)}^{2}\right){{{\rm{d}}}}x$$

(2)

where \({F}_{{{{\rm{PIO}}}}}\left({m};\delta \right)\) ranges between 0 and 1 and quantifies the probability that the property value of molecule \(m\) will exceed the threshold \(\delta\). The PIO approach inherently incorporates the uncertainty (variance) of the prediction and mitigates the risk of extrapolating the surrogate model beyond its reliable range; it has recently been adopted in other works applying active learning to drug discovery61, co-cured polycyanurates62, organic semiconductors63, and boron–carbon–nitrogen crystal structure design64. By establishing a realistic threshold \(\delta\), this method significantly enhances the practicality and applicability of molecular design optimization in real-world settings.
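
Under the Gaussian assumption, the integral in Eq. 2 reduces to a normal cumulative distribution function (CDF). The snippet below is a minimal sketch of this closed form; the example values are illustrative, not taken from the study.

```python
# Closed form of Eq. (2): F_PIO = Phi(eta * (mu - delta) / sigma), where Phi is
# the standard normal CDF. eta = +1 targets values above delta, -1 below it.
from scipy.stats import norm

def pio_fitness(mu: float, sigma: float, delta: float, eta: int = 1) -> float:
    return norm.cdf(eta * (mu - delta) / sigma)

# Illustrative example: predicted mean 1.2, std 0.3, threshold 1.5 (maximize)
print(pio_fitness(mu=1.2, sigma=0.3, delta=1.5, eta=1))  # ~0.159
```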

An alternative approach to incorporating uncertainty into the fitness function is the expected improvement (EI) method, which evaluates the expected magnitude of the improvement46. Assuming the property predicted by the surrogate model follows a Gaussian distribution with mean \(\mu \left(m\right)\) and variance \({\sigma }^{2}\left(m\right)\), the fitness function for EI can be defined as

$${F}_{{{{\rm{EI}}}}}\left({m;}\delta \right)=\eta {\int }_{\delta }^{\eta \infty }\frac{\eta \left(x-\delta \right)}{\sigma \left(m\right)\sqrt{2{{\pi }}}}\exp \left(-\frac{1}{2}{\left(\frac{x-\mu \left(m\right)}{\sigma \left(m\right)}\right)}^{2}\right){{{\rm{d}}}}x$$

(3)

The primary difference between PIO (Eq. 2) and EI (Eq. 3) lies in whether the magnitude of the improvement at a given \(x\) enters the integration. The PIO method focuses solely on the likelihood of improvement at a specific threshold, making it a unitless probability value, whereas the EI method considers both the probability and magnitude of potential gains, yielding a value with the same unit as the target variable. Both approaches are commonly used as acquisition functions in Bayesian optimization (BO)46. In this work, we compare the performance of DOM with both PIO and EI for single-objective optimization tasks. This comparative analysis aims to highlight the advantages of incorporating uncertainty quantification (UQ) with graph neural networks (GNNs) in molecular design and to determine the most effective optimization strategy for various practical applications.
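
For reference, EI also has a standard closed form under the Gaussian assumption, sketched below with the same sign convention as Eq. 3; the example values are again illustrative.

```python
# Closed form of Eq. (3): with z = eta * (mu - delta) / sigma,
# F_EI = sigma * (z * Phi(z) + phi(z)), which carries the units of the target.
from scipy.stats import norm

def ei_fitness(mu: float, sigma: float, delta: float, eta: int = 1) -> float:
    z = eta * (mu - delta) / sigma
    return sigma * (z * norm.cdf(z) + norm.pdf(z))

print(ei_fitness(mu=1.2, sigma=0.3, delta=1.5, eta=1))  # ~0.025
```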

When considering multiple properties in molecular design, a common method, known as scalarization, aggregates all objectives into a single value using their corresponding weights. The weights should be carefully chosen to balance the contributions of each property in the objective function. One approach is to use the reciprocal of the standard deviation \(\widetilde{\sigma }\) of each target’s distribution in the dataset as weights65, scaling the contribution of each property according to the potential variability of its values:

$${F}_{{{{\rm{WS}}}}}^{{{{\rm{multi}}}}}\left(m\right)={\sum }_{i=1}^{k}\frac{{F}_{{{{\rm{DOM}}}}}^{i}\left(m\right)}{{\widetilde{\sigma }}_{i}}$$

(4)

where \(k\) is the number of properties considered. However, this weighted sum (WS) approach can still lead to suboptimal compromises where certain objectives are sacrificed to enhance the overall score65. In practical scenarios, molecules often need to meet multiple objectives simultaneously, typically represented by thresholds. To address this, we propose calculating the product of individual probabilities that each property surpasses its respective threshold, representing the overall probability of meeting all specified targets:

$${F}_{{{{\rm{PIO}}}}}^{{{{\rm{multi}}}}}\left({m};{\delta }_{1},{\delta }_{2},\ldots,{\delta }_{k}\right)={\prod }_{i=1}^{k}{F}_{{{{\rm{PIO}}}}}^{i}\left(m;{\delta }_{i}\right)$$

(5)

If any single probability approaches zero, the overall fitness score will also approach zero, regardless of high scores in other targets. This method emphasizes balancing trade-offs and aligns more closely with the complex demands of real-world applications66,67,68.
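
A minimal sketch of Eq. 5 under the same Gaussian assumption follows; note how a single near-zero factor vetoes an otherwise strong candidate (all values illustrative).

```python
# Eq. (5): the joint fitness is the product of per-property probabilities, so
# one property far from its cutoff drives the overall score toward zero.
import math
from scipy.stats import norm

def pio(mu, sigma, delta, eta):
    return norm.cdf(eta * (mu - delta) / sigma)

def multi_pio(mus, sigmas, deltas, etas):
    return math.prod(pio(*args) for args in zip(mus, sigmas, deltas, etas))

# Property 1 should be below 0.5, property 2 above 2.0
print(multi_pio([0.4, 2.3], [0.1, 0.2], [0.5, 2.0], [-1, 1]))  # ~0.79
print(multi_pio([0.9, 2.3], [0.1, 0.2], [0.5, 2.0], [-1, 1]))  # ~3e-5: vetoed
```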

One limitation of the WS method in Eq. 4 is that it does not account for specific optimization thresholds, which may lead to unbalanced solutions in multi-objective optimization. Inspired by the \(\varepsilon\)-constraint method69, which reformulates additional objectives as constraints with threshold values to ensure feasible trade-offs, this study explores alternative formulations, such as the normalized Manhattan distance (NMD) to the ideal threshold values \(\left({\delta }_{1},{\delta }_{2},\ldots,{\delta }_{k}\right)\) as the objective function70. This approach treats objective values that meet or exceed the thresholds as equally favorable, potentially reducing the risk of overemphasizing certain properties at the expense of others:

$${F}_{{{{\rm{NMD}}}}}^{{{{\rm{multi}}}}}\left({m};{\delta }_{1},{\delta }_{2},\ldots,{\delta }_{k}\right)={\sum }_{i=1}^{k}\frac{\min \left({\eta }_{i}\left({\mu }_{i}\left(m\right){-\delta }_{i}\right),0\right)}{{\widetilde{\sigma }}_{i}}$$

(6)

where \({F}_{{{{\rm{NMD}}}}\,}^{{{{\rm{multi}}}}}\le 0\). Both the NMD and \(\varepsilon\)-constraint methods aim to achieve balanced solutions by incorporating objective-specific limits. However, while NMD minimizes cumulative deviations to treat all objectives meeting thresholds as equally favorable, the \(\varepsilon\)-constraint method enforces strict feasibility by converting secondary objectives into constraints, resulting in a more rigid adherence to specified bounds. A key limitation of NMD, though, is that it restricts further optimization once all thresholds are met. To overcome this, we propose a hybrid fitness function that combines the NMD and the simple WS approach, transitioning from \({F}_{{{{\rm{NMD}}}}\,}^{{{{\rm{multi}}}}}\) to \({F}_{{{{\rm{WS}}}}}^{{{{\rm{multi}}}}}\) once all property values meet their respective thresholds

$${F}_{{{{\rm{NMD}}}}-{{{\rm{WS}}}}}^{{{{\rm{multi}}}}}\left({m};{\delta }_{1},{\delta }_{2},\ldots,{\delta }_{k}\right)=\begin{cases}{F}_{{{{\rm{NMD}}}}}^{{{{\rm{multi}}}}}\left({m};{\delta }_{1},{\delta }_{2},\ldots,{\delta }_{k}\right) & {{{\rm{if}}}}\ {F}_{{{{\rm{NMD}}}}}^{{{{\rm{multi}}}}}\left({m};{\delta }_{1},{\delta }_{2},\ldots,{\delta }_{k}\right) < 0\\ {F}_{{{{\rm{WS}}}}}^{{{{\rm{multi}}}}}\left(m\right) & {{{\rm{if}}}}\ {F}_{{{{\rm{NMD}}}}}^{{{{\rm{multi}}}}}\left({m};{\delta }_{1},{\delta }_{2},\ldots,{\delta }_{k}\right)=0\end{cases}$$

(7)

This hybrid approach (NMD-WS), similar to methods that combine \(\varepsilon\)-constraint and weighted sum techniques69, draws on the strengths of both, achieving a balanced optimization that respects the thresholds while allowing further improvement once the initial conditions are met.
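
The sketch below collects Eqs. 4, 6, and 7 in one place; the \(\widetilde{\sigma }\) values and thresholds are illustrative placeholders, not parameters used in the study.

```python
# Minimal sketch of the WS (Eq. 4), NMD (Eq. 6), and hybrid NMD-WS (Eq. 7)
# fitness functions; mu holds the surrogate's per-property mean predictions.
import numpy as np

def ws(mu, eta, sigma_tilde):
    return float(np.sum(eta * mu / sigma_tilde))            # Eq. (4)

def nmd(mu, delta, eta, sigma_tilde):
    shortfall = np.minimum(eta * (mu - delta), 0.0)         # 0 once a threshold is met
    return float(np.sum(shortfall / sigma_tilde))           # Eq. (6), always <= 0

def nmd_ws(mu, delta, eta, sigma_tilde):
    d = nmd(mu, delta, eta, sigma_tilde)
    return d if d < 0 else ws(mu, eta, sigma_tilde)         # Eq. (7): switch to WS

# Illustrative values: property 1 below 0.5, property 2 above 2.0
mu, delta = np.array([0.45, 2.2]), np.array([0.5, 2.0])
eta, sigma_tilde = np.array([-1.0, 1.0]), np.array([0.2, 0.5])
print(nmd_ws(mu, delta, eta, sigma_tilde))  # thresholds met, so WS value: 2.15
```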

Surrogate model and UQ performance

For effective computer-aided molecular design (CAMD), it is crucial that the surrogate model accurately represents the molecular properties of interest. To this end, we first assessed the performance of the directed message passing neural network (D-MPNN) model along with two UQ methods—deep ensemble combined with mean-variance estimation (MVE)71 and evidential learning72—on the target properties of the design tasks specified in Table 1. Our evaluations revealed that neither UQ method delivered consistent performance across all datasets. Notably, the MVE loss function exhibited a tendency to diverge when training models on the reactivity dataset, which occasionally led to the premature termination of training sessions. In contrast, the evidential loss function faced convergence issues during the training of models for the organic emitter dataset, resulting in reduced accuracy. These challenges in model training may be partly attributed to data noise inherent in the property values, a consequence of the non-deterministic computational procedures used to generate these data, as detailed in the Methods section and illustrated in Supplementary Figs. S1–S3. These observations highlight the critical need for further development and refinement of these UQ methods to enhance their robustness. In response to these findings, we selected the deep ensemble and MVE approach for the organic emitter dataset, while applying evidential regression for the other datasets in our molecular design experiments. The efficacy of these approaches was visually assessed using parity plots and confidence-based calibration curves, displayed in Figs. 2 and 3, respectively. These figures show that the D-MPNN effectively captures the trends in property values, with the estimated uncertainties generally well-calibrated against the test set.
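
For context, a common way to obtain a total predictive variance from a deep ensemble of MVE models, which we assume corresponds to the \({\sigma }_{{{{\rm{total}}}}}^{2}\) shown in Fig. 2, is to add the mean aleatoric variance to the variance of the ensemble means, as sketched below.

```python
# Assumed (standard) ensemble + MVE decomposition: total variance = mean of the
# per-model aleatoric variances plus the variance of the per-model means.
import numpy as np

def ensemble_mve(means: np.ndarray, variances: np.ndarray):
    """means, variances: per-model predictions for one molecule, shape (n_models,)."""
    aleatoric = variances.mean()        # data noise estimated by the MVE heads
    epistemic = means.var()             # disagreement among ensemble members
    return means.mean(), aleatoric + epistemic

mu, var_total = ensemble_mve(np.array([1.10, 1.25, 1.18]),
                             np.array([0.04, 0.05, 0.03]))
```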

Fig. 2: Parity plots comparing reference values with predictions from the directed message passing neural network (D-MPNN) surrogate models on the test set.

The color coding of the data points indicates the level of total uncertainty (\({\sigma }_{{{{\rm{total}}}}}^{2}\)) in the model predictions. Uncertainty quantification (UQ) varies across the panels: in a–c, ensemble and mean-variance estimation (MVE) methods were utilized; in d–s, the evidential method was applied. The molecular structure similarity is calculated using the Tanimoto similarity metric. Abs. diff. of VEE absolute difference of vertical excitation energy, \({{{{\rm{R}}}}}^{2}\) coefficient of determination, logP octanol-water partition coefficient, TPSA topological polar surface area. Source data are provided as a Source Data file.

Fig. 3: Confidence-based calibration curves (orange) for various models assessed using testing data.

The area under the calibration error curve (AUCE), or miscalibration area106 (gray area in this figure), is calculated, with perfect calibration indicated by an AUCE of 0. Uncertainty quantification (UQ) varies across the panels: in a–c, ensemble and mean-variance estimation (MVE) methods were utilized; in d–s, the evidential method was applied. The molecular structure similarity is calculated using the Tanimoto similarity metric. Abs. diff. of VEE absolute difference of vertical excitation energy, logP octanol-water partition coefficient, TPSA topological polar surface area. Source data are provided as a Source Data file.

It is important to recognize that prediction uncertainties may arise from multiple sources, such as data noise and model uncertainty, meaning that the residuals between predicted and reference values may not always follow a Gaussian distribution. Therefore, we validated the Gaussian distribution assumption by examining the actual distribution of residuals. This was achieved by using confidence-based calibration curves (Fig. 3)73,74. These curves assess the proportion of test data points that fall within a confidence interval around their predicted values. The intervals are calculated based on predicted variance under the Gaussian assumption, and the observed proportions are then compared to the expected confidence levels. Ideally, a perfect calibration curve would follow a diagonal line, where predicted probabilities align with observed proportions across various confidence levels. To quantify deviations from this ideal calibration, we calculated the area under the calibration error curve (AUCE), with higher AUCE values reflecting greater deviations from perfect calibration. As shown in Fig. 3, the calibration curves closely follow the diagonal line across all test sets, with AUCE values remaining below 0.1, suggesting that the residual distribution for the test data does not significantly deviate from Gaussian assumptions and aligns well with estimated variance. Nonetheless, further improvement in uncertainty estimation may be achieved through additional recalibration steps75,76 or by employing alternative UQ methods77,78 that do not rely on strong distributional assumptions. Incorporating these enhanced UQ methods with uncertainty-aware optimization presents a promising direction for future research.
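
A minimal sketch of this calibration check follows, assuming central Gaussian prediction intervals and a simple trapezoidal estimate of the AUCE; the discretization into 99 levels is an implementation choice, not the authors' exact procedure.

```python
# Confidence-based calibration curve and AUCE, assuming Gaussian predictive
# distributions: for each confidence level, compare the expected coverage with
# the observed fraction of residuals inside the central interval.
import numpy as np
from scipy.stats import norm

def calibration_curve(y_true, mu, sigma, n_levels=99):
    levels = np.linspace(0.01, 0.99, n_levels)      # expected confidence levels
    z_half = norm.ppf(0.5 + levels / 2)             # half-widths of central intervals
    z = np.abs(y_true - mu) / sigma                 # standardized residuals
    observed = np.array([(z <= h).mean() for h in z_half])
    return levels, observed

def auce(levels, observed):
    return np.trapz(np.abs(observed - levels), levels)  # area between the curves

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, 1000)                      # perfectly calibrated toy case
levels, observed = calibration_curve(y, np.zeros(1000), np.ones(1000))
print(auce(levels, observed))                       # close to 0
```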

Optimization results of single-objective task

This section evaluates the optimization results obtained using DOM (Eq. 1), PIO (Eq. 2), and EI (Eq. 3) fitness functions across ten single-objective tasks, focusing on the hit rate of molecules—i.e., their ability to exceed predetermined threshold values. This metric assesses whether the integration of uncertainty into the fitness function could improve the success rate of generating molecules that surpass these thresholds. As shown in Table 2, the PIO method consistently achieved the highest hit rates for most tasks. Additionally, Fig. 4 illustrates that, under the PIO approach, the top-100 molecules for these tasks exhibit a greater proportion of candidates meeting or exceeding the threshold, compared to those generated by DOM, further demonstrating the benefit of incorporating uncertainty in the optimization process. However, despite integrating uncertainty, the EI method does not consistently outperform the uncertainty-agnostic DOM method.
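
As a concrete reading of this metric, the sketch below computes a top-k hit rate, assuming a "hit" means the true (oracle) value of a fitness-ranked molecule passes the threshold in the desired direction; this definition is our reconstruction, not code from the study.

```python
# Assumed top-k hit-rate metric: rank molecules by fitness, then count how many
# of the k best pass the threshold according to their true property values.
import numpy as np

def top_k_hit_rate(fitness, y_true, delta, eta=1, k=100):
    top = np.argsort(fitness)[::-1][:k]             # indices of the k fittest
    return float(np.mean(eta * (y_true[top] - delta) >= 0))

# Illustrative usage with random toy numbers
rng = np.random.default_rng(0)
f, y = rng.random(1000), rng.normal(1.0, 0.5, 1000)
print(top_k_hit_rate(f, y, delta=1.5, k=100))
```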

Table 2 Comparison of top-k hit rates for single-objective optimization results across various methods
Fig. 4: Comparative distribution of true property values for the top-100 molecules generated by different methods.

a–j These plots show direct objective maximization (DOM, brown), expected improvement (EI, purple), and probabilistic improvement optimization (PIO, green) results. The black dotted line represents the cutoff values, while orange arrows illustrate the desired optimization direction. For the final three similarity optimization tasks, the structures of the target molecules are displayed within their respective figures. The molecular structure similarity is calculated using the Tanimoto similarity metric. Source data are provided as a Source Data file.

To understand why only the PIO method outperformed its uncertainty-agnostic counterpart while the EI method did not, we generated parity plots of the molecules optimized by each method (Fig. 5). These plots show that the leading molecules selected by EI tend to exhibit the highest uncertainties in the surrogate model compared to those identified by DOM and PIO. Conversely, DOM-selected molecules often display extreme predicted mean values—either lowest or highest for minimization or maximization tasks, respectively. This outcome is expected, as DOM focuses solely on optimizing the predicted mean without considering uncertainty, often pushing optimization toward extrapolative regions where predictions are less reliable. While EI does incorporate uncertainty, its performance in most single-objective optimization tasks was not particularly robust. This outcome likely stems from EI’s tendency to favor candidates with high uncertainty when their predicted mean values are similar, because its fitness value is the expected magnitude of improvement. Such a preference can lead to the selection of molecules with significant prediction uncertainties, which often causes discrepancies between predicted and actual properties, contributing to EI’s relatively unstable performance across tasks. It is worth noting that EI is widely used as an acquisition function in BO with Gaussian processes79,80,81,82,83, where it effectively identifies optimal solutions within smaller, more confined search spaces over fewer iterations. However, molecular design requires navigating a much larger chemical space, where D-MPNN surrogate models can assign considerable uncertainty to numerous candidate structures, inflating expected improvements and diminishing EI’s effectiveness in our test cases. In contrast, PIO focuses exclusively on the probability of improvement, yielding a bounded fitness value between 0 and 1, which makes it less susceptible to the issues of extreme variance. By emphasizing candidates with a higher probability of exceeding the threshold without overemphasizing uncertain regions, PIO achieves more stable and reliable performance. This balance enables PIO to identify candidates that meet cutoff criteria while maintaining lower uncertainties, leading to more reliable predictions.
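
The contrast described above can be reproduced with two hypothetical candidates sharing the same predicted mean above the threshold: EI rewards the more uncertain one, whereas PIO penalizes it (all values illustrative).

```python
# Toy comparison of Eqs. (2) and (3) for maximization (eta = +1): equal means,
# different uncertainties. EI ranks the high-variance candidate first; PIO does not.
from scipy.stats import norm

def pio(mu, sigma, delta):
    return norm.cdf((mu - delta) / sigma)

def ei(mu, sigma, delta):
    z = (mu - delta) / sigma
    return sigma * (z * norm.cdf(z) + norm.pdf(z))

delta = 1.5
for mu, sigma in [(1.6, 0.1), (1.6, 0.8)]:
    print(f"mu={mu}, sigma={sigma}: "
          f"PIO={pio(mu, sigma, delta):.3f}, EI={ei(mu, sigma, delta):.3f}")
# mu=1.6, sigma=0.1: PIO=0.841, EI=0.108
# mu=1.6, sigma=0.8: PIO=0.550, EI=0.372  <- EI prefers the uncertain candidate
```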

Fig. 5: Parity plots comparing reference values with predictions as well as uncertainties from the directed message passing neural network (D-MPNN) models.

a–j These plots show the top-50 candidate molecules generated based on the fitness values from direct objective maximization (DOM, brown), expected improvement (EI, purple), and probabilistic improvement optimization (PIO, green). Predictions are presented as means with standard deviations (error bars), capturing both aleatoric and epistemic uncertainty, as estimated by the D-MPNN model. The black dotted line represents the cutoffs, while the orange arrows illustrate the desired direction for optimization. The molecular structure similarity is calculated using the Tanimoto similarity metric. Source data are provided as a Source Data file.

However, certain challenging tasks reveal limitations across all methods—DOM, EI, and PIO—in identifying candidates that surpass thresholds. For example, tasks involving the singlet-triplet gap, the 1SYH score, and similarity to mestranol demonstrate cases where all of these approaches struggle to find candidates meeting the set criteria, highlighting areas for further improvement. The first limitation arises from the dependency of both PIO and EI on threshold-based guidance during the search process. When thresholds are set too stringently, far beyond the performance range of the current population, the fitness score remains zero regardless of optimization direction, which can impede the search for high-performing molecules. In our study, most thresholds were set near the performance of top molecules within each task’s original dataset (as shown in Supplementary Table S5). Consequently, the difficulty of exceeding these thresholds varies depending on the structural diversity in each dataset. The second limitation involves decreased model accuracy in predicting property mean and variance in extrapolated regions. This issue is evident in the mestranol similarity task, where the objective is to identify molecules resembling mestranol’s complex polycyclic structure, featuring four fused rings (Fig. 4j). Accurately capturing these complex ring structures remains challenging for D-MPNN, which would benefit from additional structural features—such as ring size indicators—to improve prediction accuracy for complex species in highly extrapolated regions84. Therefore, although D-MPNN performed reasonably well on the test set for this task (Fig. 2k), it struggled to identify similar molecules within the broader chemical space, consistently yielding similarity predictions for recommended candidates that deviated significantly from the true reference values (Fig. 5j). An additional concern is that these predictions frequently showed small uncertainty estimates, suggesting that D-MPNN may have inaccurately assessed uncertainty in these cases. This finding underscores a critical limitation: even well-calibrated models may struggle to generalize accurately during molecular optimization over an extensive chemical space, leading to unreliable predictions not only for mean values but also for variance estimates with current UQ methods. One approach to addressing these challenges is adaptive modeling, which iteratively incorporates newly validated molecules to refine predictions and improve uncertainty estimates. More fundamentally, however, improving the reliability of UQ methods remains essential to strengthen the robustness of molecular design workflows.

Optimization results of multi-objective tasks

In this subsection, we evaluate the impact of various fitness function designs on the performance of molecule generation for multi-objective tasks. These designs included the uncertainty-agnostic WS (Eq. 4), NMD (Eq. 6), and hybrid NMD-WS (Eq. 7) methods, as well as the uncertainty-aware PIO method (Eq. 5), which takes the product of the single-objective probabilities that each property exceeds its corresponding cutoff. A molecule was considered a hit in multi-objective tasks if it met all specified property cutoffs.

As detailed in Table 3, the PIO approach emerged as the most effective at identifying molecules that satisfied all multi-objective criteria, achieving the highest hit rates for most tasks. Among the uncertainty-agnostic methods, no single approach demonstrated consistent superiority across all tasks. The NMD method showed higher success rates in generating viable molecules for organic emitter designs and the fexofenadine MPO task, while the hybrid NMD-WS method outperformed other uncertainty-agnostic approaches in the remaining multi-objective tasks. In contrast, the WS method consistently struggled, failing to identify molecules that met all required thresholds in any multi-objective task.

Table 3 Comparison of top-k hit rates for multi-objective optimization results across various methods

A primary challenge in multi-objective tasks, as opposed to single-objective tasks, lies in balancing the contributions of different properties. For instance, in the organic emitter design task, there is a moderate positive correlation between the singlet-triplet gap and oscillator strength (Supplementary Fig. S24), which complicates the task: minimizing the singlet-triplet gap while maximizing oscillator strength creates conflicting optimization directions. This complexity was exacerbated by the disproportionate emphasis on oscillator strength, whose unbounded maximum value could lead the WS method to overly prioritize this trait while neglecting the other (Fig. 6). Similar challenges were observed in the fexofenadine and ranolazine MPO tasks, which involve maximizing similarity to target molecules while optimizing the octanol-water partition coefficient (logP) and topological polar surface area (TPSA). Here, the WS method tended to prioritize logP and TPSA optimization at the expense of similarity scores (Fig. 6). This observation aligns with previous research85, emphasizing the challenge of balancing each target’s contribution in the fitness function to prevent bias in multi-objective optimization scenarios. Methods such as NMD and NMD-WS, which incorporate cutoff values into their fitness functions, better address this balancing challenge. However, these uncertainty-agnostic methods can still lead to over-optimization in regions beyond the model’s predictive range, potentially resulting in discrepancies between predicted and actual outcomes. Consequently, the PIO method generally demonstrates a higher hit rate by incorporating uncertainty information and thus achieving a better balance across all targets.

Fig. 6: Parallel coordinate plots illustrating the true property values for the top-50 molecules derived from various optimization methods in multi-objective design tasks.

Each subplot displays molecules generated by the methods: a weighted sum (WS), b normalized Manhattan distance (NMD), c hybrid approach (NMD-WS), and d probabilistic improvement optimization (PIO), organized into six sections arranged from left to right, corresponding to the design tasks for organic emitters, reaction substrates, median molecules 1, median molecules 2, fexofenadine multi-property optimization (MPO), and ranolazine MPO. Blue lines represent molecules that failed to meet all established cutoffs, while orange lines signify those that met all criteria. Black dotted lines across the plots denote the cutoffs. Orange arrows indicate the desired direction for optimization. The molecular structure similarity is calculated using the Tanimoto similarity metric. Abs. diff. of VEE absolute difference of vertical excitation energy, \({{{{\rm{R}}}}}^{2}\) coefficient of determination, logP octanol-water partition coefficient, TPSA topological polar surface area. Source data are provided as a Source Data file.

Despite the overall success of the PIO method, one task presented challenges for all optimizers: the median molecules 2 task, which aimed to find molecules similar to both camphor and menthol. In this case, none of the optimization methods succeeded in identifying molecules with similarity scores exceeding the cutoff of 0.2. This difficulty is likely due to the low similarity scores in the original dataset, where the majority of scores fall below 0.1 for these target molecules (Supplementary Figs. S17 and S18). This task proved more challenging compared to the median molecules 1 task, where similarity scores with the target molecules (tadalafil and sildenafil) in the original data generally ranged between 0.1 and 0.2, closer to the target value of 0.2 (Supplementary Figs. S15 and S16).

Multi-objective optimization problems are prevalent in fields such as chemical, drug, and material design, where property cutoffs are often required to meet specific commercial objectives. The PIO method achieves the highest hit rates across most multi-objective tasks by integrating uncertainty information and balancing optimization across all targets. In the PIO fitness function (Eq. 5), any molecule deviating significantly from a target threshold receives a lower overall score, guiding the optimization process to consider all objectives equally. When certain objectives are of lower priority, cutoff values for these properties can be relaxed to minimum acceptable levels, reducing their impact on the overall fitness score as long as the values remain within acceptable ranges. Conversely, if a property approaches its minimum acceptable threshold, it appropriately impacts the fitness score, signaling the need for further optimization in that direction.
