dataset

The base class for all datasets.

It implements methods for checking which ids have been computed, which ids still have to be computed, and for verifying the chk files. Subclasses have to implement the following methods:

  • download(): If the dataset is not yet downloaded, download it.

  • get_num_molecules(): Get the number of molecules in the dataset.

  • load_charges_and_positions(ids): Load nuclear charges and positions for the given molecule indices.

  • get_ids(): Get the indices of the molecules in the dataset.
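As a rough illustration of the four methods listed above, the sketch below implements them for a toy in-memory dataset. It does not import the real DataGenDataset: DatasetInterface is a hypothetical stand-in that mirrors only the abstract methods, and ToyDataset with its coordinates is made up for the example. Because the documented signature annotates ids as int, this sketch takes a single index.

```python
from abc import ABC, abstractmethod
from typing import Tuple

import numpy as np


class DatasetInterface(ABC):
    """Hypothetical stand-in mirroring only the four abstract methods."""

    @abstractmethod
    def download(self) -> None: ...

    @abstractmethod
    def get_num_molecules(self) -> int: ...

    @abstractmethod
    def load_charges_and_positions(self, ids: int) -> Tuple[np.ndarray, np.ndarray]: ...

    @abstractmethod
    def get_ids(self) -> np.ndarray: ...


class ToyDataset(DatasetInterface):
    """Toy dataset with two molecules and illustrative (made-up) coordinates."""

    def __init__(self) -> None:
        self._charges = [np.array([1, 1]), np.array([8, 1, 1])]
        self._positions = [
            np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.74]]),
            np.array([[0.0, 0.0, 0.0], [0.0, 0.76, 0.59], [0.0, -0.76, 0.59]]),
        ]

    def download(self) -> None:
        # Nothing to fetch: the toy data already lives in memory.
        return None

    def get_num_molecules(self) -> int:
        return len(self._charges)

    def get_ids(self) -> np.ndarray:
        # Assumption: ids are simply 0 .. num_molecules - 1.
        return np.arange(self.get_num_molecules())

    def load_charges_and_positions(self, ids: int) -> Tuple[np.ndarray, np.ndarray]:
        # Return (nuclear charges, positions) for one molecule index.
        return self._charges[ids], self._positions[ids]
```

A real subclass would additionally pass the directory arguments to the base class constructor and read its data from raw_data_dir.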

class DataGenDataset(raw_data_dir: str, kohn_sham_data_dir: str, label_dir: str, filename: str, name: str, num_processes: int = 1)[source]

Base class for all datasets.

name

Name of the dataset.

raw_data_dir

Path to the directory containing the raw data.

kohn_sham_data_dir

Path to the directory containing the Kohn-Sham data.

num_processes

Number of processes to use for the computation.

num_molecules

Number of molecules in the dataset.

__init__(raw_data_dir: str, kohn_sham_data_dir: str, label_dir: str, filename: str, name: str, num_processes: int = 1)[source]

Initialize the dataset by setting attributes.

Parameters:
  • raw_data_dir – Path to the directory containing the raw data.

  • kohn_sham_data_dir – Path to the directory containing the Kohn-Sham data.

  • label_dir – Path to the directory containing the labels.

  • filename – The filename to use for the output files.

  • name – Name of the dataset.

  • num_processes – Number of processes to use for dataset verifying or loading.

static check_chk_file(chk_file: Path, remove_broken_files: bool = True) None[source]

Check if the computation of the chk file is finished and remove it if it is not.

Parameters:
  • chk_file – Path to the chk file.

  • remove_broken_files – Whether to remove the broken files or raise an error.
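The remove-or-raise behaviour controlled by remove_broken_files can be sketched as follows. This is a minimal illustration, not the library's implementation: check_file, is_finished and non_empty are hypothetical names, and the way the real method decides that a chk file is finished is not described here, so the completeness test is passed in as a callable.

```python
from pathlib import Path
from typing import Callable


def check_file(
    path: Path,
    is_finished: Callable[[Path], bool],
    remove_broken_files: bool = True,
) -> None:
    """Keep a finished file; remove an unfinished one or raise an error."""
    if is_finished(path):
        return
    if remove_broken_files:
        # Delete the unfinished file so that it can be recomputed later.
        path.unlink()
    else:
        raise RuntimeError(f"Unfinished file: {path}")


def non_empty(path: Path) -> bool:
    # Hypothetical completeness test used only for this illustration.
    return path.stat().st_size > 0
```

Passing the predicate as an argument keeps the sketch independent of the file format.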

abstractmethod download() None[source]

Download the raw data.

get_all_atomic_numbers() ndarray[source]

Get the atomic numbers of all atoms in the dataset.

Returns:

Array of atomic numbers.

Return type:

np.ndarray

get_all_chk_files_from_id(id: int) Sequence[Path][source]

Get the paths to all possible chk files from an id, including those from external potential sampling.

Returns:

Sequence of paths to the chk files.

Return type:

Sequence[Path]

get_all_chk_files_from_ids(ids: ndarray) Sequence[Path][source]

Get the paths to all possible chk files from a list of ids, including those from external potential sampling.

Returns:

Sequence of paths to the chk files.

Return type:

Sequence[Path]

get_chk_file_from_id(id: int) Path[source]

Get the path to the chk file for the given molecule index.

Parameters:

id – Index of the molecule to compute.

Returns:

Path to the chk file.

Return type:

Path

abstractmethod get_ids() ndarray[source]

Get the indices of the molecules in the dataset.

get_ids_done_ks() ndarray[source]

Get the indices of the molecules for which the Kohn-Sham computation has already been done.

Returns:

Array of indices of the molecules that have already been computed.

Return type:

np.ndarray

get_ids_done_labelgen() ndarray[source]

Get the indices of the molecules for which label generation has already been done.

Returns:

Array of indices of the molecules that have already been computed.

Return type:

np.ndarray

get_ids_todo_ks(start_idx: int = 0, max_num_molecules: int = 1) ndarray[source]

Get the indices of the molecules for which the Kohn-Sham computation hasn’t been done yet, typically by comparing all indices with the indices of the already computed molecules.

Parameters:
  • start_idx – Index of the first molecule to compute.

  • max_num_molecules – Number of molecules to compute.

Returns:

Array of indices of the molecules that haven’t been computed.

Return type:

np.ndarray
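The comparison of all indices with the already computed ones can be sketched with a set difference. The function ids_todo is a hypothetical helper, not part of the documented API, and it assumes that start_idx and max_num_molecules select a window of the full id list before the finished ids are removed; the actual method may apply them in a different order.

```python
import numpy as np


def ids_todo(all_ids, done_ids, start_idx: int = 0, max_num_molecules: int = 1) -> np.ndarray:
    """Return the ids in the selected window that are not yet computed."""
    # Assumption: the window is taken from the full id list first.
    window = np.asarray(all_ids)[start_idx:start_idx + max_num_molecules]
    # setdiff1d returns the sorted unique ids in `window` that are not in `done_ids`.
    return np.setdiff1d(window, np.asarray(done_ids))


# With ten molecules, ids 2 and 3 done, and a window of five starting at 1,
# the remaining ids are 1, 4 and 5.
todo = ids_todo(np.arange(10), np.array([2, 3]), start_idx=1, max_num_molecules=5)
```

With the default max_num_molecules of 1, at most one id is returned per call.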

get_ids_todo_labelgen(start_idx: int = 0, max_num_molecules: int = 1) ndarray[source]

Get the indices of the molecules for which label generation hasn’t been done yet, typically by comparing all indices with the indices of the already computed molecules.

Parameters:
  • start_idx – Index of the first molecule to compute.

  • max_num_molecules – Number of molecules to compute.

Returns:

Array of indices of the molecules that haven’t been computed.

Return type:

np.ndarray

abstractmethod get_num_molecules()[source]

Get the number of molecules in the dataset.

abstractmethod load_charges_and_positions(ids: int) Tuple[ndarray, ndarray][source]

Load nuclear charges and positions for the given molecule indices.

Parameters:

ids – Array of indices of the molecules to compute.

load_molecule(id: int, basis: str) Mole[source]

Load the molecule with the given index as a Mole object, built from its nuclear charges and positions with the given basis set.

Parameters:
  • id – Index of the molecule to compute.

  • basis – Basis set to use for the molecule.

verify_files(remove_broken_files: bool = True) None[source]

Verify the chk files and, depending on remove_broken_files, either remove the files that are not finished or raise an error.

Parameters:

remove_broken_files – Whether to remove the broken files or raise an error.

delete_dataset(dataset: DataGenDataset) None[source]

Delete the dataset.

Deletes the raw data, Kohn-Sham data and label directories.

Parameters:

dataset – The dataset to delete.
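A minimal sketch of this kind of cleanup, assuming the dataset object exposes raw_data_dir, kohn_sham_data_dir and label_dir as path-like attributes (label_dir is a constructor parameter; its availability as an attribute is an assumption). The helper delete_dirs is hypothetical and is not the library's delete_dataset; ignoring missing directories is a choice made for this sketch.

```python
import shutil
from pathlib import Path
from types import SimpleNamespace


def delete_dirs(dataset) -> None:
    """Recursively remove the three data directories of a dataset-like object."""
    for attr in ("raw_data_dir", "kohn_sham_data_dir", "label_dir"):
        # ignore_errors=True makes the call a no-op for missing directories.
        shutil.rmtree(Path(getattr(dataset, attr)), ignore_errors=True)
```

Any object with the three attributes works here, for example a SimpleNamespace pointing at temporary directories, which makes the helper easy to try out without touching real data.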