qmugs

QMUGS datasets.

Several subsets of the QMUGS dataset are defined.

class QMUGS(raw_data_dir: str, kohn_sham_data_dir: str, label_dir: str, filename: str, name: str = 'QMUGS', num_processes: int = 1, allowed_atomic_numbers: tuple[int] = (1, 6, 7, 8, 9))

QMUGS dataset.

This includes the whole dataset with 2 million molecules, where each conformer counts as a separate entry. The id of an entry is given by 3 * chembl_id + conf_id, where conf_id is 0, 1 or 2.

__init__(raw_data_dir: str, kohn_sham_data_dir: str, label_dir: str, filename: str, name: str = 'QMUGS', num_processes: int = 1, allowed_atomic_numbers: tuple[int] = (1, 6, 7, 8, 9))

Initialize the QMUGS dataset.

download() → None

Download the raw data.

get_all_atomic_numbers()

Get all atomic numbers in the dataset.

get_ids() → ndarray

Get the indices of the molecules in the dataset.

Returns:

Array of indices of the molecules in the dataset.

Return type:

np.ndarray

get_num_molecules()

Get the number of molecules in the dataset.

static id_to_chembl_conf_id(ids: ndarray) → tuple[ndarray, ndarray]

Convert the ids to chembl_id and conf_id.

Parameters:

ids – Array of indices of the molecules to convert.

Returns:

Tuple of two arrays: the chembl_ids and the conf_ids.

Return type:

tuple[np.ndarray, np.ndarray]
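The id scheme stated for QMUGS (3 * chembl_id + conf_id, with conf_id in 0, 1, 2) can be inverted with an integer division and a remainder. The following is a minimal sketch of that arithmetic only, not the library's implementation; the helper names `ids_to_chembl_conf` and `chembl_conf_to_ids` are hypothetical.

```python
import numpy as np


def ids_to_chembl_conf(ids: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Hypothetical helper: invert id = 3 * chembl_id + conf_id."""
    ids = np.asarray(ids)
    # Quotient is the chembl_id, remainder is the conf_id (0, 1 or 2).
    chembl_ids, conf_ids = np.divmod(ids, 3)
    return chembl_ids, conf_ids


def chembl_conf_to_ids(chembl_ids: np.ndarray, conf_ids: np.ndarray) -> np.ndarray:
    """Hypothetical helper: build ids from chembl_ids and conf_ids."""
    return 3 * np.asarray(chembl_ids) + np.asarray(conf_ids)


# Made-up ids, used only to illustrate the round trip.
ids = np.array([0, 4, 11])
chembl_ids, conf_ids = ids_to_chembl_conf(ids)
round_trip = chembl_conf_to_ids(chembl_ids, conf_ids)
```

Because conf_id is always smaller than 3, the decomposition is unique and the round trip returns the original ids.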

load_charges_and_positions(ids: ndarray) → tuple[ndarray, ndarray]

Load nuclear charges and positions for the given molecule indices.

Parameters:

ids – Array of indices of the molecules to load.

Returns:

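A pair of lists with one entry per requested molecule (N entries each): arrays of atomic numbers of shape (A,), and arrays of atomic positions of shape (A, 3), where A is the number of atoms in the molecule.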

Return type:

list
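Since each molecule has its own atom count A, the per-molecule arrays are ragged and are most easily handled one at a time. The sketch below uses made-up stand-in data in the documented shapes (it does not call the QMUGS class) and counts heavy atoms, meaning atoms with atomic number above 1. The helper `count_heavy_atoms` is hypothetical.

```python
import numpy as np

# Made-up stand-in data in the documented layout: one array per molecule.
atomic_numbers = [
    np.array([8, 1, 1]),                    # water, A = 3
    np.array([6, 6, 8, 1, 1, 1, 1, 1, 1]),  # ethanol, A = 9
]
# Placeholder coordinates of shape (A, 3); real positions would come from the dataset.
positions = [np.zeros((len(z), 3)) for z in atomic_numbers]


def count_heavy_atoms(z: np.ndarray) -> int:
    """Hypothetical helper: count atoms heavier than hydrogen."""
    return int(np.count_nonzero(z > 1))


heavy_counts = [count_heavy_atoms(z) for z in atomic_numbers]
```

A heavy-atom count obtained this way is the kind of quantity the bin-based subsets of QMUGS are organized by.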

class QMUGSBin(raw_data_dir: str, kohn_sham_data_dir: str, label_dir: str, filename: str, name: str = 'QMUGS', num_processes: int = 1, bin: int = 0)

A subset of QMUGS containing molecules from a specific bin of heavy atoms.

__init__(raw_data_dir: str, kohn_sham_data_dir: str, label_dir: str, filename: str, name: str = 'QMUGS', num_processes: int = 1, bin: int = 0)

Initialize the QMUGSBin dataset.

Parameters:
  • raw_data_dir – Path to the raw data directory.

  • kohn_sham_data_dir – Path to the Kohn-Sham data directory.

  • label_dir – Path to the label directory.

  • filename – Filename to use for the output files.

  • name – Name of the dataset.

  • num_processes – Number of processes to use when verifying or loading the dataset.

  • bin – Index of the heavy-atom bin to include.

class QMUGSLargeBins(raw_data_dir: str, kohn_sham_data_dir: str, label_dir: str, filename: str, name: str = 'QMUGS', num_processes: int = 1, use_original_ids: bool = True)

A subset of QMUGS containing molecules with more than 15 heavy atoms.

From each heavy-atom bin, 50 molecules are sampled at random with a fixed seed of 1, so the selection is deterministic.
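Fixed-seed sampling of this kind can be sketched as follows. The source does not say which random number generator is used, so the standard-library `random` module here is an assumption, and this sketch will not reproduce the actual QMUGSLargeBins selection. The function `sample_per_bin` and the `ids_by_bin` mapping are hypothetical.

```python
import random


def sample_per_bin(
    ids_by_bin: dict[int, list[int]], per_bin: int = 50, seed: int = 1
) -> dict[int, list[int]]:
    """Hypothetical helper: draw up to `per_bin` ids from every bin with a fixed seed."""
    rng = random.Random(seed)  # fixed seed makes repeated runs identical
    sampled = {}
    # Visit bins and candidates in sorted order so the result does not
    # depend on the insertion order of the input mapping.
    for bin_index in sorted(ids_by_bin):
        candidates = sorted(ids_by_bin[bin_index])
        k = min(per_bin, len(candidates))  # a bin may hold fewer than `per_bin` ids
        sampled[bin_index] = sorted(rng.sample(candidates, k))
    return sampled


# Made-up ids grouped by bin, for illustration only.
ids_by_bin = {1: list(range(100, 300)), 2: list(range(300, 340))}
first = sample_per_bin(ids_by_bin)
second = sample_per_bin(ids_by_bin)
```

Calling the function twice yields the same selection, which is the property the fixed seed is meant to guarantee.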

__init__(raw_data_dir: str, kohn_sham_data_dir: str, label_dir: str, filename: str, name: str = 'QMUGS', num_processes: int = 1, use_original_ids: bool = True)

Initialize the QMUGSLargeBins dataset.

Parameters:
  • raw_data_dir – Path to the raw data directory.

  • kohn_sham_data_dir – Path to the Kohn-Sham data directory.

  • label_dir – Path to the label directory.

  • filename – Filename to use for the output files.

  • name – Name of the dataset.

  • num_processes – Number of processes to use when verifying or loading the dataset.

get_bin_from_num_atoms(num_heavy_atoms: int)

Get the bin of the molecule based on the number of heavy atoms.

Counts below 10 are mapped to -1, 10-15 to 0, 16-20 to 1, and so on.

Parameters:

num_heavy_atoms – Number of heavy atoms in the molecule.

Returns:

The bin of the molecule.

Return type:

int
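The mapping described for get_bin_from_num_atoms can be written as a small piecewise function. This is a sketch based only on the stated examples: it assumes that "and so on" means the bins continue in steps of five heavy atoms (21-25 to 2, 26-30 to 3, ...), and the function name `bin_from_num_heavy_atoms` is hypothetical rather than the library's own code.

```python
def bin_from_num_heavy_atoms(num_heavy_atoms: int) -> int:
    """Hypothetical re-statement of the documented heavy-atom binning."""
    if num_heavy_atoms < 10:
        return -1  # below the smallest bin
    if num_heavy_atoms <= 15:
        return 0   # the first bin spans 10-15 (six values)
    # Assumption: every later bin spans five values (16-20, 21-25, ...).
    return (num_heavy_atoms - 11) // 5


# Bin index for a few heavy-atom counts around the documented boundaries.
examples = {n: bin_from_num_heavy_atoms(n) for n in (9, 10, 15, 16, 20, 21)}
```

Note that the first bin is wider than the rest, which is why it is handled as a separate case instead of being folded into the integer division.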