dataset
Dataset class for machine learning.
- class OFDataset(paths: List[Path], basis_info: BasisInfo, transforms: MasterTransformation = None, limit_scf_iterations: int | list[int] | None = None, keep_initial_guess: bool = True, num_scf_iterations_per_path: List[int] | None = None, cache_in_memory: bool = False, **of_data_kwargs)
Dataset of OF-DFT data.
Each scf iteration of each molecule is a sample.
- __getitem__(item: int) → OFData
Returns the sample at the given index.
- Parameters:
item – Index of the sample.
- __init__(paths: List[Path], basis_info: BasisInfo, transforms: MasterTransformation = None, limit_scf_iterations: int | list[int] | None = None, keep_initial_guess: bool = True, num_scf_iterations_per_path: List[int] | None = None, cache_in_memory: bool = False, **of_data_kwargs) → None
- Parameters:
paths – List of paths to .zarr files.
basis_info – Basis information for all samples.
transforms – List of transforms to apply to each sample.
limit_scf_iterations – Which scf iterations to use. For an int s > 0, all scf iterations with index greater than or equal to s are used. For an int s < 0, the last |s| iterations are used. For a list of ints, only the scf iterations in the list are used.
keep_initial_guess – Whether to keep the initial guess in the dataset when filtering with limit_scf_iterations. This is useful if you want to train the energy model on the later scf iterations but also want to train the initial guess model on the minao label. Setting this to False does not by itself remove the initial guess; it is only removed when the filter also excludes it, for example with limit_scf_iterations=1 and keep_initial_guess=False.
num_scf_iterations_per_path – List of number of scf iterations per path. If None, the number of scf iterations is read from the paths directly which can take some time.
of_data_kwargs – Keyword arguments passed to OFData.from_file().
cache_in_memory – Whether to cache the dataset in memory. This is useful if the dataset fits into memory and is used multiple times, especially if expensive transforms are used. Warning: Do not enable if non-deterministic transforms are used.
- configure_scf_iterations(num_scf_iterations_per_path: list[int], keep_initial_guess: bool = True) → list[ndarray]
Determine for each path which scf iterations are used, depending on limit_scf_iterations.
If limit_scf_iterations is an int s > 0, all scf iterations with index greater than or equal to s are used. This can remove some geometries entirely (those with too few iterations), but only for large s; to select late iterations, prefer a negative value. If it is an int s < 0, the last |s| iterations are used. If it is a list of ints, only the scf iterations in the list are used, which can also remove some geometries.
- Parameters:
num_scf_iterations_per_path – List of number of scf iterations per path in the whole non-filtered dataset.
keep_initial_guess – Whether to keep the initial guess in the dataset when filtering. Setting this to False does not by itself remove the initial guess; it is only removed when the filter also excludes it, for example with limit_scf_iterations=1 and keep_initial_guess=False.
- Returns:
scf_iterations_per_path, a list of arrays containing the indices of the scf iterations to use for each path.
- Return type:
list[ndarray]
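The filtering rules described above can be sketched for a single path as follows. This is a minimal illustration, not the library's implementation: the helper name select_scf_iterations is hypothetical, plain Python lists stand in for the ndarrays the real method returns, and it assumes that the initial guess is iteration index 0 and that keep_initial_guess re-adds index 0 after filtering.

```python
def select_scf_iterations(num_iterations, limit=None, keep_initial_guess=True):
    """Return the iteration indices kept for one path (hypothetical helper)."""
    all_indices = list(range(num_iterations))
    if limit is None:
        # No filtering requested: every iteration is a sample.
        return all_indices
    if isinstance(limit, int):
        if limit >= 0:
            # s > 0: keep iterations with index >= s.
            kept = [i for i in all_indices if i >= limit]
        else:
            # s < 0: keep the last |s| iterations.
            kept = all_indices[limit:]
    else:
        # List of ints: keep only the listed iterations that exist.
        wanted = set(limit)
        kept = [i for i in all_indices if i in wanted]
    # Assumption: the initial guess is index 0 and is re-added on request.
    if keep_initial_guess and num_iterations > 0 and 0 not in kept:
        kept = [0] + kept
    return kept


# One entry per path, mirroring the per-path structure of the return value.
per_path = [select_scf_iterations(n, limit=-2) for n in (5, 3)]
```

Under these assumptions, a path with only two iterations and limit=3, keep_initial_guess=False yields an empty selection, which is how a geometry can drop out of the dataset.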
- classmethod from_directory(directory: str | Path, **kwargs) → OFDataset
Initialize the dataset from a directory containing .zarr files.
- Parameters:
directory – Path to the directory.
**kwargs – Keyword arguments passed to the constructor.
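As a rough illustration of the from_directory pattern, the sketch below collects .zarr entries from a directory and forwards the remaining keyword arguments to the constructor. ToyDataset is a hypothetical stand-in for OFDataset; the non-recursive glob and the sorted order are assumptions, since the source does not say how the directory is scanned.

```python
from pathlib import Path


class ToyDataset:
    """Hypothetical stand-in for OFDataset that only stores its arguments."""

    def __init__(self, paths, **kwargs):
        self.paths = list(paths)
        self.kwargs = kwargs

    @classmethod
    def from_directory(cls, directory, **kwargs):
        directory = Path(directory)
        # Assumption: only top-level entries ending in .zarr are collected,
        # in sorted order so that sample indices are reproducible.
        paths = sorted(directory.glob("*.zarr"))
        return cls(paths, **kwargs)
```

A call such as ToyDataset.from_directory("some_dir", cache_in_memory=True) would then behave like passing the discovered paths and cache_in_memory=True to the constructor directly.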
- getitem(item: int) → OFData
Returns the sample at the given index.
Note: This method is cached if cache_in_memory is True. As __getitem__() cannot be overwritten at runtime, this method is used instead.
- Parameters:
item – Index of the sample.
- Returns:
sample, an OFData object.
- Return type:
OFData
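The note on getitem() reflects a general Python rule: special methods such as __getitem__ are looked up on the class, so assigning a replacement on an instance has no effect on dataset[i]. A minimal sketch of the resulting delegation pattern is shown below. ToyDataset and load_count are hypothetical, and functools.lru_cache is an assumed caching mechanism; the source does not state which one is actually used.

```python
from functools import lru_cache


class ToyDataset:
    """Hypothetical dataset showing why caching is attached to getitem()."""

    def __init__(self, samples, cache_in_memory=False):
        self.samples = samples
        self.load_count = 0  # counts how often a sample is actually loaded
        if cache_in_memory:
            # Replacing a regular method on the instance works at runtime,
            # unlike replacing __getitem__, which is resolved on the class.
            self.getitem = lru_cache(maxsize=None)(self.getitem)

    def getitem(self, item):
        self.load_count += 1
        return self.samples[item]

    def __getitem__(self, item):
        # Delegate so that indexing picks up the (possibly cached) getitem.
        return self.getitem(item)


cached = ToyDataset([10, 20], cache_in_memory=True)
cached[0]
cached[0]  # served from the cache, so the sample is not loaded again
```

Because a cached sample is returned unchanged on every access, a random transform would be applied only once per index, which is consistent with the warning against combining cache_in_memory with non-deterministic transforms.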