Usage Guide

This guide provides a general overview of how to work with STRUCTURES25 once the project is installed. For instructions on reproducing the experiments from our paper, refer to the REPLICATION_GUIDE.md in the repository.

We rely on Hydra to manage configurations. The main configuration files are located in configs/.
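
Any value in these configs can be overridden on the command line using the usual Hydra syntax: config groups are selected with group=option (or group/subgroup=option) and individual values with dotted keys. A purely illustrative combination of overrides that appear later in this guide:

    mldft_train data=<train_data_config> model=<model_config> data.target_key=tot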

Data Generation

Our datasets are available at Dryad.

  1. (Optional) Create your own dataset class or use the MISC dataset and provide XYZ files (see the example after this list) to define which molecules should be generated.

  2. (Optional) Create a config file in configs/datagen/dataset/.

  3. Run Kohn-Sham DFT on the dataset and create .chk files in $DFT_DATA/dataset/kohn_sham:

    mldft_ks dataset=<your_dataset_config_name> n_molecules=1000 start_idx=0
    
  4. Based on the Kohn-Sham results, perform density fitting, compute energy and gradients, and save them as labels for the machine learning model in $DFT_DATA/dataset/labels:

    mldft_labelgen dataset=<your_dataset_config_name> n_molecules=-1
    
  5. Split the dataset into train, validation, and test sets using mldft/utils/create_dataset_splits.py:

    python mldft/utils/create_dataset_splits.py <dataset_name>
    
  6. Create a training data config in configs/ml/data to link to the dataset. Ensure the dataset_name and atom types are set correctly.

  7. Transform the dataset into the desired basis representation (to reduce data-loading computation during training). For Graphformer models, use local_frames_global_natrep:

    python mldft/datagen/transform_dataset.py data=<your_train_data_config_name> data/transforms=local_frames_global_natrep
    
  8. Compute dataset statistics, making sure to do so for the transformation and target energy you plan to use.

    python mldft/ml/compute_dataset_statistics.py data=<your_dataset_config_name>
    
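The XYZ files referenced in step 1 use the standard plain-text format: the first line gives the number of atoms, the second line is a comment, and each following line lists an element symbol and its Cartesian coordinates in Ångström. A minimal, purely illustrative water.xyz (the molecule and coordinates are placeholders, not part of the project data):

    3
    water molecule
    O  0.00000  0.00000  0.11730
    H  0.00000  0.75720 -0.46920
    H  0.00000 -0.75720 -0.46920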

Training

Start training with:

mldft_train data=<train_data_config> model=<model_config>

Key settings:

  • data/transforms: Selects whether, and how, the data has been pre-transformed. The default, local_frames_global_natrep, applies both local frames and global natural reparametrization.

  • data.target_key: Determines the target you train on. The default kin_plus_xc trains on the total kinetic and exchange-correlation energy (and their gradients). Alternatives include kin_minus_apbe (delta learning relative to the APBE kinetic energy functional) and tot (total electronic energy).
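
For example, to train on data that has been pre-transformed with local frames and global natural reparametrization while targeting the total electronic energy (an illustrative combination of the keys above; substitute your own config names):

mldft_train data=<train_data_config> model=<model_config> \
    data/transforms=local_frames_global_natrep data.target_key=tot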

Density Optimization

On a Dataset

To run density optimization on a dataset in the project format:

mldft_denop run_path=<path_to_ml_model> \
    n_molecules=<number_of_molecules> device=<device> initialization=<initialization> num_devices=<num_devices>

  • run_path: Path to the model relative to DFT_MODELS.

  • n_molecules: Number of molecules to compute.

  • device: Target device (for example cuda or cpu).

  • initialization: Initialization strategy: sad, minao, or hückel. The sad option requires matching dataset statistics.

By default the command runs on the validation split of the dataset used during training. Override split_file_path to load a different split file and split to switch between the train, val, and test partitions. Results are written to density_optimization.pdf and density_optimization_summary.pdf.
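
For instance, to evaluate on the test split instead (an illustrative invocation with placeholder values; split_file_path only needs to be overridden if you want a split file other than the default one):

mldft_denop run_path=<path_to_ml_model> split=test n_molecules=100 device=cuda initialization=minao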

On Arbitrary Molecules

To optimize densities for molecules from standalone .xyz files:

mldft example.xyz --model /path/to/some/model
# view all options
mldft --help

--model must point to a directory containing hparams.yaml and a checkpoints/ directory with a last.ckpt checkpoint. Ensure the model was trained for all atom types present in the molecule. A log file with the same basename as the .xyz file and a .log suffix is created. When dataset statistics are available you can select the sad initialization; otherwise minao is used.
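
Concretely, the directory passed to --model is expected to look like this (file names as listed above):

/path/to/some/model
├── hparams.yaml
└── checkpoints/
    └── last.ckpt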

If you have installed the pretrained models using mldft_setup, you can reference them by name:

mldft xyzfile --model str25_qm9
# or
mldft xyzfile --model str25_qmugs

The optimization result is saved as a .pt file matching the basename of the input .xyz file.
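
Putting this together: assuming the pretrained str25_qm9 model has been installed via mldft_setup and example.xyz contains only atom types that model was trained on, a single run produces both output files next to the input:

mldft example.xyz --model str25_qm9
# writes example.log (optimization log) and example.pt (optimization result)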