qmugs
QMUGS datasets.
This module defines the full QMUGS dataset and several subsets of it.
- class QMUGS(raw_data_dir: str, kohn_sham_data_dir: str, label_dir: str, filename: str, name: str = 'QMUGS', num_processes: int = 1, allowed_atomic_numbers: tuple[int] = (1, 6, 7, 8, 9))
QMUGS dataset.
This includes the whole dataset with 2 million molecules. The ids of the molecules are given by 3 * chembl_id + conf_id, where conf_id is 0, 1 or 2.
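The id scheme above can be illustrated with a short sketch. The helper name `encode_id` is hypothetical and not part of the documented API; it only applies the stated formula 3 * chembl_id + conf_id.

```python
import numpy as np


def encode_id(chembl_id, conf_id):
    """Hypothetical helper: combine chembl_id and conf_id into one dataset id."""
    chembl_id = np.asarray(chembl_id)
    conf_id = np.asarray(conf_id)
    # The documentation states that conf_id is 0, 1 or 2.
    if np.any((conf_id < 0) | (conf_id > 2)):
        raise ValueError("conf_id must be 0, 1 or 2")
    return 3 * chembl_id + conf_id


ids = encode_id([10, 10, 11], [0, 2, 1])
print(ids)  # [30 32 34]
```

Because conf_id is always smaller than 3, every (chembl_id, conf_id) pair maps to a distinct id.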
- __init__(raw_data_dir: str, kohn_sham_data_dir: str, label_dir: str, filename: str, name: str = 'QMUGS', num_processes: int = 1, allowed_atomic_numbers: tuple[int] = (1, 6, 7, 8, 9))
Initialize the QMUGS dataset.
- get_ids() → ndarray
Get the indices of the molecules in the dataset.
- Returns:
Array of indices of the molecules in the dataset.
- Return type:
np.ndarray
- static id_to_chembl_conf_id(ids: ndarray) → tuple[ndarray, ndarray]
Convert the ids to chembl_id and conf_id.
- Parameters:
ids – Array of molecule ids to convert.
- Returns:
Tuple of two arrays: the chembl_ids and the conf_ids.
- Return type:
tuple[np.ndarray, np.ndarray]
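Given the documented id formula 3 * chembl_id + conf_id, the conversion back can be sketched with integer division and remainder. This is an illustration under that assumption, not the library's implementation; `decode_ids` is a hypothetical name.

```python
import numpy as np


def decode_ids(ids):
    """Hypothetical sketch: split ids into (chembl_ids, conf_ids)."""
    ids = np.asarray(ids)
    # Quotient by 3 recovers chembl_id, remainder recovers conf_id.
    chembl_ids, conf_ids = np.divmod(ids, 3)
    return chembl_ids, conf_ids


chembl_ids, conf_ids = decode_ids(np.array([30, 32, 34]))
print(chembl_ids)  # [10 10 11]
print(conf_ids)    # [0 2 1]
```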
- load_charges_and_positions(ids: ndarray) → tuple[ndarray, ndarray]
Load nuclear charges and positions for the given molecule indices.
- Parameters:
ids – Array of indices of the molecules to load.
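- Returns:
Tuple of two lists with one entry per molecule (N entries each): arrays of atomic numbers of shape (A,) and arrays of atomic positions of shape (A, 3), where A is the number of atoms in the molecule.
- Return type:
tuple[list, list]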
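Since each molecule has its own number of atoms, the returned lists are ragged. The sketch below uses stand-in data in the documented layout (it does not call the dataset) and shows one way such lists might be consumed, here to count heavy atoms. The helper `count_heavy_atoms` and the placeholder coordinates are assumptions made for illustration.

```python
import numpy as np

# Stand-in data in the documented layout: one array per molecule.
charges = [np.array([8, 1, 1]), np.array([6, 1, 1, 1, 1])]  # water, methane
positions = [np.zeros((3, 3)), np.zeros((5, 3))]  # placeholder coordinates


def count_heavy_atoms(charges):
    """Hypothetical helper: count atoms heavier than hydrogen per molecule."""
    return [int(np.count_nonzero(z > 1)) for z in charges]


num_heavy = count_heavy_atoms(charges)
print(num_heavy)  # [1, 1]

# Each positions array has one row of three coordinates per atom.
shapes_match = all(p.shape == (len(z), 3) for z, p in zip(charges, positions))
```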
- class QMUGSBin(raw_data_dir: str, kohn_sham_data_dir: str, label_dir: str, filename: str, name: str = 'QMUGS', num_processes: int = 1, bin: int = 0)
A subset of QMUGS containing molecules from a specific bin of heavy atoms.
- __init__(raw_data_dir: str, kohn_sham_data_dir: str, label_dir: str, filename: str, name: str = 'QMUGS', num_processes: int = 1, bin: int = 0)
Initialize the QMUGSBin dataset.
- Parameters:
raw_data_dir – Path to the raw data directory.
kohn_sham_data_dir – Path to the Kohn-Sham data directory.
label_dir – Path to the label directory.
filename – Filename to use for the output files.
name – Name of the dataset.
num_processes – Number of processes to use when verifying or loading the dataset.
bin – Index of the heavy-atom bin to include.
- class QMUGSLargeBins(raw_data_dir: str, kohn_sham_data_dir: str, label_dir: str, filename: str, name: str = 'QMUGS', num_processes: int = 1, use_original_ids: bool = True)
A subset of QMUGS containing molecules with more than 15 heavy atoms.
From each heavy-atom bin, 50 molecules are sampled at random with a fixed seed of 1, so the selection is reproducible.
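A seeded per-bin sampling of this kind can be sketched as follows. The function `sample_per_bin`, the use of NumPy's `default_rng`, and the sorting of the result are assumptions for illustration; the random number generator and selection order used by the library are not documented here, so this sketch will not reproduce the library's actual subset.

```python
import numpy as np


def sample_per_bin(ids, bins, per_bin=50, seed=1):
    """Hypothetical sketch: draw up to `per_bin` ids from each bin, reproducibly."""
    rng = np.random.default_rng(seed)
    ids = np.asarray(ids)
    bins = np.asarray(bins)
    selected = []
    for b in np.unique(bins):
        candidates = ids[bins == b]
        # Take the whole bin if it holds fewer than `per_bin` molecules.
        k = min(per_bin, candidates.size)
        selected.append(rng.choice(candidates, size=k, replace=False))
    if not selected:
        return np.array([], dtype=ids.dtype)
    return np.sort(np.concatenate(selected))


# Toy input: 300 ids spread evenly over three bins.
toy_ids = np.arange(300)
toy_bins = toy_ids // 100
subset = sample_per_bin(toy_ids, toy_bins)
print(subset.size)  # 150
```

Fixing the seed means that repeated calls with the same inputs return the same subset.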
- __init__(raw_data_dir: str, kohn_sham_data_dir: str, label_dir: str, filename: str, name: str = 'QMUGS', num_processes: int = 1, use_original_ids: bool = True)
Initialize the QMUGSLargeBins dataset.
- Parameters:
raw_data_dir – Path to the raw data directory.
kohn_sham_data_dir – Path to the Kohn-Sham data directory.
label_dir – Path to the label directory.
filename – Filename to use for the output files.
name – Name of the dataset.
num_processes – Number of processes to use when verifying or loading the dataset.
- get_bin_from_num_atoms(num_heavy_atoms: int)
Get the bin of the molecule based on the number of heavy atoms.
Fewer than 10 heavy atoms map to bin -1, 10-15 to bin 0, 16-20 to bin 1, and so on.
- Parameters:
num_heavy_atoms – Number of heavy atoms in the molecule.
- Returns:
The bin of the molecule.
- Return type:
int
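The mapping described above can be sketched as a small standalone function. It assumes that "and so on" means the bins keep a width of five heavy atoms (21-25 to 2, 26-30 to 3, ...), which the documentation does not state explicitly. The name `bin_from_num_heavy_atoms` is hypothetical and this is not the library's implementation.

```python
def bin_from_num_heavy_atoms(num_heavy_atoms: int) -> int:
    """Hypothetical sketch of the documented bin mapping."""
    if num_heavy_atoms < 10:
        return -1
    if num_heavy_atoms <= 15:
        return 0
    # Assumption: bins above 15 are five atoms wide (16-20 -> 1, 21-25 -> 2, ...).
    return (num_heavy_atoms - 11) // 5


for n in (9, 10, 15, 16, 20, 21):
    print(n, bin_from_num_heavy_atoms(n))
# 9 -1
# 10 0
# 15 0
# 16 1
# 20 1
# 21 2
```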