Explore various dataset generation strategies on simplified chemical space #89
Comments
Let's define exactly what questions we want to answer. Our goal is to train a model that
The questions we want to answer are
We can use GFN2-xTB to generate a large amount of data very quickly. It would include many molecules of varying sizes, each with many conformations generated in different ways. We could then train models on various subsets to see how well each one achieves the goals. This would make for an interesting paper.
Here's a more concrete proposal for how this could be done. For every molecule, start by having RDKit generate five conformations. Starting from each one, generate ten conformations in each of several ways.
That would be a total of 550 conformations for each molecule. We would compute forces and energies with GFN2-xTB. We could then train models on a variety of subsets, evaluating each one to see how well it works on an independent test set (accuracy of forces and energies, stability of trajectories). Here are some tests we could do.
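To make the bookkeeping in this proposal concrete, here is a minimal sketch of how the conformations for one molecule could be labeled and split into training subsets. The five starting conformations and ten conformations per method come from the proposal; the method names are placeholders, and the count of eleven methods is only an assumption inferred from the stated total (5 × 10 × 11 = 550), since the list of methods is not reproduced here.

```python
from itertools import product

N_SEEDS = 5        # RDKit starting conformations per molecule (from the proposal)
N_PER_METHOD = 10  # conformations generated per seed and per method (from the proposal)

# Assumption: eleven placeholder method names, chosen only so the total is 550.
METHODS = [f"method_{i:02d}" for i in range(1, 12)]


def conformation_labels():
    """Return one (seed, method, index) label per generated conformation."""
    return list(product(range(N_SEEDS), METHODS, range(N_PER_METHOD)))


def subset_by_methods(labels, methods):
    """Keep only the conformations produced by the given generation methods."""
    wanted = set(methods)
    return [label for label in labels if label[1] in wanted]


labels = conformation_labels()
subset = subset_by_methods(labels, ["method_01", "method_02"])
print(len(labels), len(subset))
```

Each subset selected this way could then be used to train a separate model and be scored on the same independent test set, so that differences in accuracy can be attributed to the generation methods included.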
I agree with your assessment of the goals and questions (though I might add one more: "What is the best way to select molecules?"). The suggestion to first generate this data with GFN2-xTB seems reasonable, though there would appear to be value in subsequently repeating it at a true QM level of theory (even if just the faster OpenFF level of theory).
What is the rationale behind this choice? For generating conformers, I think the emphasis on keeping each dataset to 50 conformers/molecule hinders us from addressing some questions. In particular:
It's very cheap to do, especially compared to running dynamics with a semi-empirical method, or even worse doing an optimization with the full QM method. And maybe it works just as well. The goal is to find out.
That's one of the methods I suggested: an optimization trajectory with the full QM method, which in this case is just GFN2-xTB, but for a real dataset would be something more expensive.
As we continue to explore the best ways to generate data for future iterations of SPICE, it would be useful to apply a variety of dataset generation strategies to a simplified chemical space to enable experiments that can help identify the most useful strategies.
OpenFF has made extensive use of the AlkEthOH dataset that contains only three elements (C, H, O), making it feasible to relatively exhaustively explore the relevant chemical space. The name "AlkEthOH" refers to "alkanes, ethers, and alcohols (OH)".
A few subsets have already been generated at the OpenFF default level of theory by QCFractal here:
Examples are below:
AlkEthOH chain molecules
AlkEthOH_chain.pdf
AlkEthOH with rings
AlkEthOH_rings.pdf
PhAlkEthOH
PhEthOH.pdf
We could generate several kinds of datasets:
- OptimizationDataset from RDKit-enumerated conformers
- OptimizationDataset with the number of minimization steps limited to 3-4 steps (requires an argument to geomeTRIC to register success even if the convergence tolerance is not met)
- TorsionDriveDataset
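The idea behind the step-limited optimization can be illustrated with a toy example. The sketch below is not geomeTRIC's API; `capped_minimize` is a hypothetical helper that runs plain steepest descent on a one-dimensional energy function, stops after a fixed number of steps, and treats hitting the cap as a normal outcome rather than a failure, so the partially relaxed geometries can still be kept as data.

```python
def capped_minimize(gradient, x0, step_size=0.1, max_steps=4, tolerance=1e-6):
    """Run at most max_steps steepest-descent steps.

    Returns (trajectory, converged). Reaching max_steps without meeting
    the tolerance is reported through the flag, not raised as an error.
    """
    trajectory = [x0]
    x = x0
    for _ in range(max_steps):
        g = gradient(x)
        if abs(g) < tolerance:
            return trajectory, True
        x = x - step_size * g
        trajectory.append(x)
    # Step cap reached: report whether the final point happens to be converged.
    return trajectory, abs(gradient(x)) < tolerance


# Toy energy E(x) = (x - 1)^2, whose gradient is 2 (x - 1).
trajectory, converged = capped_minimize(lambda x: 2.0 * (x - 1.0), x0=3.0)
print(len(trajectory), converged)
```

In a real dataset each entry of the trajectory would be a molecular geometry with its energy and forces, and the unconverged end point would be accepted as a valid result.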