Usage Guide

This guide provides a general overview of how to work with STRUCTURES25 once the project is installed. For instructions on reproducing the experiments from our paper, refer to the REPLICATION_GUIDE.md in the repository.

We rely on Hydra to manage configurations. The main configuration files are located in configs/.
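
Any value in these configs can be overridden on the command line using the usual Hydra syntax: config groups are selected with group=option (or group/subgroup=option) and individual values with dotted keys. A purely illustrative combination of overrides that appear later in this guide:

    mldft_train data=<train_data_config> model=<model_config> data.target_key=tot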

Data Generation

Our datasets are available at Dryad.

  1. (Optional) Create your own dataset class or use the MISC dataset and provide XYZ files (see the example after this list) to define which molecules should be generated.

  2. (Optional) Create a config file in configs/datagen/dataset/.

  3. Run Kohn-Sham DFT on the dataset and create .chk files in $DFT_DATA/dataset/kohn_sham:

    mldft_ks dataset=<your_dataset_config_name> n_molecules=1000 start_idx=0
    
  4. Based on the Kohn-Sham results, perform density fitting, compute energy and gradients, and save them as labels for the machine learning model in $DFT_DATA/dataset/labels:

    mldft_labelgen dataset=<your_dataset_config_name> n_molecules=-1
    
  5. Split the dataset into train, validation, and test sets using mldft/utils/create_dataset_splits.py:

    python mldft/utils/create_dataset_splits.py <dataset_name>
    
  6. Create a training data config in configs/ml/data to link to the dataset. Ensure the dataset_name and atom types are set correctly.

  7. Transform the dataset into the desired basis representation (to reduce data-loading computation during training). For Graphformer models, use local_frames_global_natrep:

    python mldft/datagen/transform_dataset.py data=<your_train_data_config_name> data/transforms=local_frames_global_natrep
    
  8. Compute dataset statistics, making sure to do so for the transformation and target energy you plan to use.

    python mldft/ml/compute_dataset_statistics.py data=<your_dataset_config_name>
    
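The XYZ files referenced in step 1 use the standard plain-text format: the first line gives the number of atoms, the second line is a comment, and each following line lists an element symbol and its Cartesian coordinates in Ångström. A minimal, purely illustrative water.xyz (the molecule and coordinates are placeholders, not part of the project data):

    3
    water molecule
    O  0.00000  0.00000  0.11730
    H  0.00000  0.75720 -0.46920
    H  0.00000 -0.75720 -0.46920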

Training

Start training with:

mldft_train data=<train_data_config> model=<model_config>

Key settings:

  • data/transforms: Selects whether, and how, the data has been pre-transformed. The default, local_frames_global_natrep, applies both local frames and global natural reparametrization.

  • data.target_key: Determines the target you train on. The default kin_plus_xc trains on the total kinetic and exchange-correlation energy (and their gradients). Alternatives include kin_minus_apbe (delta learning relative to the APBE kinetic energy functional) and tot (total electronic energy).
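
For example, to train on data that has been pre-transformed with local frames and global natural reparametrization while targeting the total electronic energy (an illustrative combination of the keys above; substitute your own config names):

mldft_train data=<train_data_config> model=<model_config> \
    data/transforms=local_frames_global_natrep data.target_key=tot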

Density Optimization

On a Dataset

To run density optimization on a dataset in the project format:

mldft_denop run_path=<path_to_ml_model> \
    n_molecules=<number_of_molecules> device=<device> initialization=<initialization> num_devices=<num_devices>

  • run_path: Path to the model relative to DFT_MODELS.

  • n_molecules: Number of molecules to compute.

  • device: Target device (for example cuda or cpu).

  • initialization: Initialization strategy: sad, minao, or hückel. The sad option requires matching dataset statistics.

By default the command runs on the validation split of the dataset used during training. Override split_file_path to load a different split file and split to switch between the train, val, and test partitions. Results are written to density_optimization.pdf and density_optimization_summary.pdf.
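
For instance, to evaluate on the test split instead (an illustrative invocation with placeholder values; split_file_path only needs to be overridden if you want a split file other than the default one):

mldft_denop run_path=<path_to_ml_model> split=test n_molecules=100 device=cuda initialization=minao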

On Arbitrary Molecules

To optimize densities for molecules from standalone .xyz files:

mldft example.xyz --model /path/to/some/model
# view all options
mldft --help

--model must point to a directory containing hparams.yaml and a checkpoints/ directory with a last.ckpt checkpoint. Ensure the model was trained for all atom types present in the molecule. A log file with the same basename as the .xyz file and a .log suffix is created. When dataset statistics are available you can select the sad initialization; otherwise minao is used.
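
Concretely, the directory passed to --model is expected to look like this (file names as listed above):

/path/to/some/model
├── hparams.yaml
└── checkpoints/
    └── last.ckpt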

If you have installed the pretrained models using mldft_setup, you can reference them by name:

mldft xyzfile --model str25_qm9
# or
mldft xyzfile --model str25_qmugs

The optimization result is saved as a .pt file matching the basename of the input .xyz file.
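
Putting this together: assuming the pretrained str25_qm9 model has been installed via mldft_setup and example.xyz contains only atom types that model was trained on, a single run produces both output files next to the input:

mldft example.xyz --model str25_qm9
# writes example.log (optimization log) and example.pt (optimization result)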