Usage Guide =========== This guide provides a general overview of how to work with STRUCTURES25 once the project is installed. For instructions on reproducing the experiments from our paper, refer to the `REPLICATION_GUIDE.md `_ in the repository. We rely on `Hydra `_ to manage configurations. The main configuration files are located in ``configs/``. Data Generation --------------- Our datasets are available at `Dryad `_. 1. (Optional) Create your own dataset class or use the MISC dataset and provide XYZ files to define which molecules should be generated. 2. (Optional) Create a config file in ``configs/datagen/dataset/``. 3. Run Kohn-Sham DFT on the dataset and create ``.chk`` files in ``$DFT_DATA/dataset/kohn_sham``: .. code-block:: bash mldft_ks dataset= n_molecules=1000 start_idx=0 4. Based on the Kohn-Sham results, perform density fitting, compute energy and gradients, and save them as labels for the machine learning model in ``$DFT_DATA/dataset/labels``: .. code-block:: bash mldft_labelgen dataset= n_molecules=-1 5. Split the file into train, validation, and test datasets using ``mldft/utils/create_dataset_splits.py``. .. code-block:: bash python mldft/utils/create_dataset_splits.py 6. Create a training data config in ``configs/ml/data`` to link to the dataset. Ensure the ``dataset_name`` and atom types are set correctly. 7. Transform the dataset into a basis (to reduce data loading computations during training). For ``Graphformer`` models, use ``local_frames_global_natrep``. .. code-block:: bash python mldft/datagen/transform_dataset.py data= data/transforms=local_frames_global_natrep 8. Compute dataset statistics, making sure to do so for the transformation and target energy you plan to use. .. code-block:: bash python mldft/ml/compute_dataset_statistics.py data= .. _training: Training -------- Start training with: .. code-block:: bash mldft_train data= model= Key settings: - ``data/transforms``: Selects whether the data has been pre-transformed. The default ``local_frames_global_natrep`` applies both local frames and global natural reparametrization. - ``data.target_key``: Determines the target you train on. The default ``kin_plus_xc`` trains on the total kinetic and exchange-correlation energy (and their gradients). Alternatives include ``kin_minus_apbe`` (delta learning relative to the APBE kinetic energy functional) and ``tot`` (total electronic energy). Density Optimization -------------------- On a Dataset ^^^^^^^^^^^^ To run density optimization on a dataset in the project format: .. code-block:: bash mldft_denop run_path= \ n_molecules= device= initialization= num_devices= - ``run_path``: Path to the model relative to ``DFT_MODELS``. - ``n_molecules``: Number of molecules to compute. - ``device``: Target device (for example ``cuda`` or ``cpu``). - ``initialization``: Initialization strategy: ``sad``, ``minao``, or ``hückel``. The ``sad`` option requires matching dataset statistics. By default the command runs on the validation split of the dataset used during training. Override ``split_file_path`` to load a different split file and ``split`` to switch between the ``train``, ``val``, and ``test`` partitions. Results are written to ``density_optimization.pdf`` and ``density_optimization_summary.pdf``. On Arbitrary Molecules ^^^^^^^^^^^^^^^^^^^^^^ To optimize densities for molecules from standalone ``.xyz`` files: .. code-block:: bash mldft example.xyz --model /path/to/some/model # view all options mldft --help ``--model`` must point to a directory containing ``hparams.yaml`` and a ``checkpoints/`` directory with a ``last.ckpt`` checkpoint. Ensure the model was trained for all atom types present in the molecule. A log file with the same basename as the ``.xyz`` file and a ``.log`` suffix is created. When dataset statistics are available you can select the ``sad`` initialization; otherwise ``minao`` is used. If you have installed the pretrained models using ``mldft_setup``, you can reference them by name: .. code-block:: bash mldft xyzfile --model str25_qm9 # or mldft xyzfile --model str25_qmugs The optimization result is saved as a ``.pt`` file matching the basename of the input ``.xyz`` file.