qmugs

QMUGS datasets.

Multiple possible subsets of the QMUGS dataset are defined.

class QMUGS(raw_data_dir: str, kohn_sham_data_dir: str, label_dir: str, filename: str, name: str = 'QMUGS', num_processes: int = 1, allowed_atomic_numbers: tuple[int] = (1, 6, 7, 8, 9))

QMUGS dataset.

This includes the whole dataset with about 2 million molecules (each conformer is a separate entry). The id of each entry is 3 * chembl_id + conf_id, where conf_id is 0, 1 or 2.
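The id scheme above can be sketched as follows. This is a minimal illustration, not the library's own code: the helper name chembl_conf_to_id is hypothetical, and chembl_id is assumed to be a non-negative integer.

```python
import numpy as np


def chembl_conf_to_id(chembl_id, conf_id):
    """Hypothetical helper: combine chembl_id and conf_id into one dataset id."""
    chembl_id = np.asarray(chembl_id)
    conf_id = np.asarray(conf_id)
    # The documented scheme only allows three conformers per chembl_id.
    if np.any((conf_id < 0) | (conf_id > 2)):
        raise ValueError("conf_id must be 0, 1 or 2")
    return 3 * chembl_id + conf_id


# Three entries: conformers 0 and 2 of chembl_id 1, conformer 1 of chembl_id 7.
ids = chembl_conf_to_id(np.array([1, 1, 7]), np.array([0, 2, 1]))
print(ids)  # [ 3  5 22]
```

Because conf_id is always smaller than 3, every (chembl_id, conf_id) pair maps to a distinct id.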

__init__(raw_data_dir: str, kohn_sham_data_dir: str, label_dir: str, filename: str, name: str = 'QMUGS', num_processes: int = 1, allowed_atomic_numbers: tuple[int] = (1, 6, 7, 8, 9)): Initialize the QMUGS dataset.

download() → None: Download the raw data.

get_all_atomic_numbers(): Get all atomic numbers in the dataset.

get_ids() → ndarray

Get the indices of the molecules in the dataset.

Returns:: Array of indices of the molecules in the dataset.
Return type:: np.ndarray

get_num_molecules(): Get the number of molecules in the dataset.

static id_to_chembl_conf_id(ids: ndarray) → tuple[ndarray, ndarray]

Convert the ids to chembl_id and conf_id.

Parameters:: ids – Array of molecule ids to convert.
Returns:: Tuple of two arrays: the chembl_ids and the conf_ids.
Return type:: tuple[np.ndarray, np.ndarray]
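Given the documented scheme id = 3 * chembl_id + conf_id, the conversion amounts to an integer division with remainder. The sketch below shows one way to write it with NumPy; the function name split_ids is hypothetical and the actual implementation may differ.

```python
import numpy as np


def split_ids(ids):
    """Hypothetical sketch: recover (chembl_ids, conf_ids) from dataset ids."""
    # Quotient is the chembl_id, remainder (0, 1 or 2) is the conf_id.
    chembl_ids, conf_ids = np.divmod(np.asarray(ids), 3)
    return chembl_ids, conf_ids


chembl_ids, conf_ids = split_ids(np.array([3, 5, 22]))
print(chembl_ids)  # [1 1 7]
print(conf_ids)    # [0 2 1]
```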

load_charges_and_positions(ids: ndarray) → tuple[ndarray, ndarray]

Load nuclear charges and positions for the given molecule indices.

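Parameters:: ids – Array of indices of the molecules to load.
Returns:: Two lists of length N (one entry per molecule): arrays of atomic numbers, each of shape (A,), and arrays of atomic positions, each of shape (A, 3), where A is the number of atoms in that molecule.
Return type:: tuple of two lists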
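Since A varies from molecule to molecule, the returned data is ragged: one array per molecule rather than a single rectangular array. The toy example below only mimics that layout with hand-written stand-in values (placeholder zero coordinates, no real QMUGS data) and shows how a heavy-atom count could be derived from it, assuming "heavy atom" means any atom other than hydrogen.

```python
import numpy as np

# Toy stand-in for the return value: two molecules with 3 and 5 atoms.
atomic_numbers = [np.array([8, 1, 1]), np.array([6, 1, 1, 1, 1])]
positions = [np.zeros((3, 3)), np.zeros((5, 3))]  # placeholder coordinates

# Each molecule has one position row per atom.
for z, r in zip(atomic_numbers, positions):
    assert r.shape == (z.shape[0], 3)

# Count atoms with atomic number greater than 1 (assumed heavy-atom definition).
num_heavy_atoms = [int(np.count_nonzero(z > 1)) for z in atomic_numbers]
print(num_heavy_atoms)  # [1, 1]
```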

class QMUGSBin(raw_data_dir: str, kohn_sham_data_dir: str, label_dir: str, filename: str, name: str = 'QMUGS', num_processes: int = 1, bin: int = 0)

A subset of QMUGS containing the molecules from one specific heavy-atom bin.

__init__(raw_data_dir: str, kohn_sham_data_dir: str, label_dir: str, filename: str, name: str = 'QMUGS', num_processes: int = 1, bin: int = 0)

Initialize the QMUGS dataset.

Parameters:

raw_data_dir – Path to the raw data directory.
kohn_sham_data_dir – Path to the Kohn-Sham data directory.
label_dir – Path to the label directory.
filename – Filename to use for the output files.
name – Name of the dataset.
num_processes – Number of processes to use for verifying or loading the dataset.
bin – The bin of heavy atoms to include.

class QMUGSLargeBins(raw_data_dir: str, kohn_sham_data_dir: str, label_dir: str, filename: str, name: str = 'QMUGS', num_processes: int = 1, use_original_ids: bool = True)

A subset of QMUGS containing molecules with more than 15 heavy atoms.

From each heavy-atom bin, 50 molecules are sampled randomly but deterministically (seed 1).
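A seeded per-bin subsample of this kind could look like the sketch below. It is an illustration under stated assumptions, not the class's implementation: the function name sample_per_bin is hypothetical, NumPy's default_rng stands in for whatever random generator the library actually uses (so the selected molecules will not match), and bins numbered 1 and above are taken to be the ones with more than 15 heavy atoms.

```python
import numpy as np


def sample_per_bin(bins, num_per_bin=50, seed=1, min_bin=1):
    """Hypothetical sketch: pick up to num_per_bin indices from each bin >= min_bin."""
    rng = np.random.default_rng(seed)  # fixed seed makes the draw reproducible
    bins = np.asarray(bins)
    selected = []
    for b in np.unique(bins):
        if b < min_bin:
            continue  # skip bins of molecules with 15 or fewer heavy atoms
        candidates = np.flatnonzero(bins == b)
        k = min(num_per_bin, candidates.size)  # small bins are taken whole
        selected.append(rng.choice(candidates, size=k, replace=False))
    if not selected:
        return np.array([], dtype=int)
    return np.sort(np.concatenate(selected))


# Toy bin labels: 100 molecules in bin 0, 80 in bin 1, 30 in bin 2.
bins = np.repeat([0, 1, 2], [100, 80, 30])
sel = sample_per_bin(bins)
print(sel.size)  # 80
```

Calling sample_per_bin again with the same seed returns the same indices, which is the property the fixed seed is meant to provide.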

__init__(raw_data_dir: str, kohn_sham_data_dir: str, label_dir: str, filename: str, name: str = 'QMUGS', num_processes: int = 1, use_original_ids: bool = True)

Initialize the QMUGS dataset.

Parameters:

raw_data_dir – Path to the raw data directory.
kohn_sham_data_dir – Path to the Kohn-Sham data directory.
label_dir – Path to the label directory.
filename – Filename to use for the output files.
name – Name of the dataset.
num_processes – Number of processes to use for verifying or loading the dataset.

get_bin_from_num_atoms(num_heavy_atoms: int)

Get the bin of the molecule based on the number of heavy atoms.

Molecules with fewer than 10 heavy atoms are mapped to -1, those with 10-15 to 0, 16-20 to 1, and so on.

Parameters:: num_heavy_atoms – Number of heavy atoms in the molecule.
Returns:: The bin of the molecule.
Return type:: int
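The mapping described for get_bin_from_num_atoms can be written out as a small standalone function. This is a sketch, not the library's code: the name bin_from_num_heavy_atoms is hypothetical, and "and so on" is assumed to mean that bins keep a width of five heavy atoms (21-25 to 2, 26-30 to 3, ...) with no upper limit.

```python
def bin_from_num_heavy_atoms(num_heavy_atoms: int) -> int:
    """Hypothetical sketch of the documented heavy-atom binning."""
    if num_heavy_atoms < 10:
        return -1
    if num_heavy_atoms <= 15:
        return 0  # the first bin is six wide: 10 through 15
    # Assumed continuation: bins of width five starting at 16.
    return (num_heavy_atoms - 16) // 5 + 1


for n in (9, 10, 15, 16, 20, 21):
    print(n, bin_from_num_heavy_atoms(n))
# 9 -1
# 10 0
# 15 0
# 16 1
# 20 1
# 21 2
```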