qm9

QM9 dataset.

Contains 129,133 molecules from the QM9 dataset. Each molecule's id is the index of its xyz file.

class QM9(raw_data_dir: str, kohn_sham_data_dir: str, label_dir: str, filename: str, name: str = 'QM9', num_processes: int = 1)

Class for the QM9 dataset.

name: Name of the dataset.

raw_data_dir: Path to the raw data directory.

kohn_sham_data_dir: Path to the kohn-sham data directory.

__init__(raw_data_dir: str, kohn_sham_data_dir: str, label_dir: str, filename: str, name: str = 'QM9', num_processes: int = 1)

Initialize the QM9 dataset.

Parameters:

raw_data_dir – Path to the raw data directory.
kohn_sham_data_dir – Path to the Kohn-Sham data directory.
label_dir – Path to the directory containing the labels.
filename – The filename to use for the output files.
name – Name of the dataset.
num_processes – Number of processes to use for verifying or loading the dataset.

Raises:

AssertionError – If the subset is not in the list of available subsets.

convert_xyz_files() → None: Convert the QM9 xyz files to write exponents in the form 1e-6 instead of 1*^-6, which pyscf cannot read.

download() → None: Download the raw data.

get_all_atomic_numbers() → ndarray

Get the atomic numbers of all atoms in the dataset.

Returns: Array of atomic numbers.
Return type: np.ndarray

get_ids() → ndarray

Get the indices of the molecules in the dataset.

Returns: Array of indices of the molecules in the dataset.
Return type: np.ndarray

get_num_molecules() → int

Get the number of molecules in the dataset.

Returns: Number of molecules in the dataset.
Return type: int

load_charges_and_positions(id: int) → tuple[list, list]

Load the nuclear charges and positions of the given molecule from its .xyz file.

Parameters:

id – Index of the molecule to load.

Returns: Atomic numbers (A) and atomic positions (A, 3).
Return type: tuple[list, list]
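
As an illustration of what loading charges and positions from an xyz file involves, here is a minimal sketch. It is not the library's implementation: the helper name read_charges_and_positions and the ATOMIC_NUMBERS table are hypothetical, and the sketch assumes a file whose first line is the atom count, whose second line is a comment, and whose atom lines start with an element symbol followed by x, y, z (any further columns are ignored).

```python
from pathlib import Path

# Atomic numbers for the elements that occur in QM9 (H, C, N, O, F).
ATOMIC_NUMBERS = {"H": 1, "C": 6, "N": 7, "O": 8, "F": 9}


def read_charges_and_positions(xyz_path):
    """Hypothetical helper: parse one xyz file into (charges, positions)."""
    lines = Path(xyz_path).read_text().splitlines()
    num_atoms = int(lines[0])
    charges, positions = [], []
    # Line 0 holds the atom count and line 1 a comment; atom records follow.
    for line in lines[2:2 + num_atoms]:
        fields = line.split()
        charges.append(ATOMIC_NUMBERS[fields[0]])
        # float() only accepts exponents such as 1e-6, so this assumes the
        # file no longer contains the 1*^-6 notation.
        positions.append([float(value) for value in fields[1:4]])
    return charges, positions
```

The sketch returns two plain lists, matching the tuple[list, list] annotation in the signature; wrapping them with np.asarray would give arrays of shape (A) and (A, 3).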

class QM9Test(raw_data_dir: str, kohn_sham_data_dir: str, label_dir: str, filename: str, name: str = 'QM9', num_processes: int = 1)

download() → None: Download the raw data.

load_charges_and_positions(id: int) → tuple[list, list]

Load the nuclear charges and positions of the given molecule from its .xyz file.

Parameters:

id – Index of the molecule to load.

Returns: Atomic numbers (A) and atomic positions (A, 3).
Return type: tuple[list, list]

convert_folder_sorted_parallel(in_folder: Path, out_folder: Path, num_processes: int) → None

Apply the conversion function to all xyz files in the folder in parallel.

Parameters:

in_folder – Path to the input folder
out_folder – Path to the output folder
num_processes – Number of processes to use

convert_string_format(xyz_file_path: Path, out_folder: Path) → None

Convert a QM9 xyz file to write exponents in the form 1e-6 instead of 1*^-6, which pyscf cannot read.

Parameters:

xyz_file_path – Path to the xyz file
out_folder – Path to the output folder
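
The conversion described above amounts to a text substitution. The following is a minimal sketch of that idea, not the library's code: convert_exponents and convert_file are hypothetical names, and the sketch assumes the sequence *^ appears in the file only as an exponent marker.

```python
from pathlib import Path


def convert_exponents(text):
    """Rewrite exponents such as 1*^-6 into the 1e-6 form."""
    return text.replace("*^", "e")


def convert_file(xyz_file_path, out_folder):
    """Hypothetical stand-in: write a converted copy of one file to out_folder."""
    xyz_file_path = Path(xyz_file_path)
    out_folder = Path(out_folder)
    out_folder.mkdir(parents=True, exist_ok=True)
    converted = convert_exponents(xyz_file_path.read_text())
    # Keep the original file name so molecule ids still match the file index.
    (out_folder / xyz_file_path.name).write_text(converted)
```

After the substitution, values such as 1.5*^-6 become 1.5e-6 and can be parsed with float().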