:py:mod:`mlair.run_modules.pre_processing`
==========================================

.. py:module:: mlair.run_modules.pre_processing

.. autoapi-nested-parse::

   Pre-processing module.


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   mlair.run_modules.pre_processing.PreProcessing


Functions
~~~~~~~~~

.. autoapisummary::

   mlair.run_modules.pre_processing.f_proc
   mlair.run_modules.pre_processing.f_proc_create_info_df
   mlair.run_modules.pre_processing.f_inspect_error


Attributes
~~~~~~~~~~

.. autoapisummary::

   mlair.run_modules.pre_processing.__author__
   mlair.run_modules.pre_processing.__date__


.. py:data:: __author__
   :annotation: = Lukas Leufen, Felix Kleinert

.. py:data:: __date__
   :annotation: = 2019-11-25

.. py:class:: PreProcessing

   Bases: :py:obj:`mlair.run_modules.run_environment.RunEnvironment`

   Pre-process your data by using this class.

   Schedule of pre-processing:

   #. load and check valid stations (either download or load from disk)
   #. split subsets (train, val, test, train & val)
   #. create a small report on data metrics

   Required objects [scope] from data store:

   * all elements from `DEFAULT_ARGS_LIST` in scope preprocessing for general data loading
   * all elements from `DEFAULT_ARGS_LIST` in scopes [train, val, test, train_val] for custom subset settings
   * `fraction_of_training` [.]
   * `experiment_path` [.]
   * `use_all_stations_on_all_data_sets` [.]

   Optional objects

   * all elements from `DEFAULT_KWARGS_LIST` in scope preprocessing for general data loading
   * all elements from `DEFAULT_KWARGS_LIST` in scopes [train, val, test, train_val] for custom subset settings

   Sets

   * `stations` in [., train, val, test, train_val]
   * `generator` in [train, val, test, train_val]
   * `transformation` [.]

   Creates

   * all input and output data in `data_path`
   * latex reports in `experiment_path/latex_report`

   .. py:method:: _run(self)

   .. py:method:: report_pre_processing(self)

      Log some metrics on data and create latex report.

   .. py:method:: create_latex_report(self)

      Create tables with information on the station metadata and a summary of the subset sample sizes.

      * station_sample_size.md: the table below, as markdown
      * station_sample_size.tex: the table below, as latex table
      * station_sample_size_short.tex: reduced table without any metadata besides the station ID, as latex table

      All tables are stored inside `experiment_path` in the folder `latex_report`. The table format (e.g. which
      metadata is highlighted) is currently hardcoded to keep the table style stable. If further styles are
      needed, it is better to add an additional style than to modify the existing table styles.

      +------------+-------------------------------------------+---------------+---------------+---------------+---------+-------+--------+
      | stat. ID   | station_name                              | station_lon   | station_lat   | station_alt   | train   | val   | test   |
      +============+===========================================+===============+===============+===============+=========+=======+========+
      | DEBW013    | Stuttgart Bad Cannstatt                   | 9.2297        | 48.8088       | 235           | 1434    | 712   | 1080   |
      +------------+-------------------------------------------+---------------+---------------+---------------+---------+-------+--------+
      | DEBW076    | Baden-Baden                               | 8.2202        | 48.7731       | 148           | 3037    | 722   | 710    |
      +------------+-------------------------------------------+---------------+---------------+---------------+---------+-------+--------+
      | DEBW087    | Schwäbische_Alb                           | 9.2076        | 48.3458       | 798           | 3044    | 714   | 1087   |
      +------------+-------------------------------------------+---------------+---------------+---------------+---------+-------+--------+
      | DEBW107    | Tübingen                                  | 9.0512        | 48.5077       | 325           | 1803    | 715   | 1087   |
      +------------+-------------------------------------------+---------------+---------------+---------------+---------+-------+--------+
      | DEBY081    | Garmisch-Partenkirchen/Kreuzeckbahnstraße | 11.0631       | 47.4764       | 735           | 2935    | 525   | 714    |
      +------------+-------------------------------------------+---------------+---------------+---------------+---------+-------+--------+
      | # Stations | nan                                       | nan           | nan           | nan           | 6       | 6     | 6      |
      +------------+-------------------------------------------+---------------+---------------+---------------+---------+-------+--------+
      | # Samples  | nan                                       | nan           | nan           | nan           | 12253   | 3388  | 4678   |
      +------------+-------------------------------------------+---------------+---------------+---------------+---------+-------+--------+

   .. py:method:: create_describe_df(df, percentiles=None, ignore_last_lines: int = 2)
      :staticmethod:

   .. py:method:: create_info_df(self, meta_cols, meta_round, names_of_set, precision)

   .. py:method:: split_train_val_test(self) -> None

      Split data into subsets.

      Currently: train, val, test, and train_val (the latter is only the merge of train and val, but kept as a
      separate data_collection). IMPORTANT: Do not change the order of execution of create_set_split. The train
      subset must always be processed first to set a proper transformation.

   .. py:method:: split_set_indices(total_length: int, fraction: float) -> Tuple[slice, slice, slice, slice]
      :staticmethod:

      Create the training, validation and test subset slice indices for the given total_length.

      The test data consists of (1 - fraction) of total_length (fraction * len : end). Train and validation data
      therefore are made from fraction of total_length (0 : fraction * len). Train and validation data are split
      by the factor 0.8 for train and 0.2 for validation. In addition, split_set_indices also returns the
      combination of the training and validation subsets.

      :param total_length: length of the list with all objects to split
      :param fraction: ratio between test data and the union of train/val data

      :return: slices for each subset in the order: train, val, test, train_val

   .. py:method:: create_set_split(self, index_list: slice, set_name: str) -> None

   .. py:method:: validate_station(self, data_handler: mlair.data_handler.AbstractDataHandler, set_stations, set_name=None, store_processed_data=True)

      Check if all given stations in `all_stations` are valid.

      Valid means that there is data available for the given time range (included in `kwargs`). The shape and the
      loading time are logged in debug mode.

      :return: corrected list containing only valid station IDs

   .. py:method:: store_data_handler_attributes(self, data_handler, collection)

   .. py:method:: _store_apriori(self)

   .. py:method:: _load_apriori(self)

   .. py:method:: transformation(self, data_handler: mlair.data_handler.AbstractDataHandler, stations)

   .. py:method:: _load_transformation(self)

      Try to load transformation options from file if transformation_file is provided.

   .. py:method:: _store_transformation(self, transformation_opts)

      Store transformation options locally inside experiment_path if they do not exist already.

   .. py:method:: prepare_competitors(self)

      Prepare competitor models already in the preprocessing stage.

      This is performed here because some models might need internet access, which, depending on the operating
      system, may not be possible during postprocessing. Currently, this method only checks whether the
      IntelliO3-ts-v1 model is requested as a competitor and downloads the data if required.

   .. py:method:: create_snapshot(self)

   .. py:method:: load_snapshot(self, file)

.. py:function:: f_proc(data_handler, station, name_affix, store, return_strategy='', tmp_path=None, **kwargs)

   Try to create a data handler for the given arguments.

   If the build fails, the station does not fulfil all requirements and f_proc therefore returns None as
   indication. On a successful build, f_proc returns the built data handler and the station that was used. This
   function must be implemented globally to work together with multiprocessing.

.. py:function:: f_proc_create_info_df(data, meta_cols)

.. py:function:: f_inspect_error(formatted)
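The slicing scheme documented for ``split_set_indices`` can be sketched in a few lines. This is a minimal re-implementation of the described behaviour, not the actual MLAir code; in particular, the truncation of the split points via ``int()`` is an assumption:

```python
def split_set_indices(total_length, fraction):
    # Sketch of the documented scheme (assumption, not the MLAir source):
    # the last (1 - fraction) of the data becomes the test set, the first
    # fraction is shared by train and val with an internal 0.8 / 0.2 split.
    pos_test_split = int(total_length * fraction)
    train_index = slice(0, int(pos_test_split * 0.8))
    val_index = slice(int(pos_test_split * 0.8), pos_test_split)
    test_index = slice(pos_test_split, total_length)
    # train_val is simply the union of the train and val ranges
    train_val_index = slice(0, pos_test_split)
    return train_index, val_index, test_index, train_val_index
```

For ``total_length=1000`` and ``fraction=0.8`` this yields ``slice(0, 640)`` for train, ``slice(640, 800)`` for val, ``slice(800, 1000)`` for test, and ``slice(0, 800)`` for train_val.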
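The contract of ``f_proc`` (return the built handler on success, ``None`` as failure indication, always together with the station) can be illustrated with a hypothetical sketch. The callable ``data_handler`` interface and the caught exception types are assumptions for illustration, not the actual MLAir implementation:

```python
def f_proc(data_handler, station, **kwargs):
    # Hypothetical sketch of the documented contract: on a successful build
    # return (handler, station); if the build fails, return (None, station)
    # so the caller can filter out invalid stations afterwards.
    try:
        res = data_handler(station, **kwargs)
    except Exception:
        res = None
    return res, station
```

Because the real function is defined at module level (not as a closure or method), it can be pickled and dispatched to worker processes by ``multiprocessing``.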