:py:mod:`mlair.run_modules.pre_processing`
==========================================

.. py:module:: mlair.run_modules.pre_processing

.. autoapi-nested-parse::

   Pre-processing module.


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   mlair.run_modules.pre_processing.PreProcessing


Functions
~~~~~~~~~

.. autoapisummary::

   mlair.run_modules.pre_processing.f_proc
   mlair.run_modules.pre_processing.f_proc_create_info_df
   mlair.run_modules.pre_processing.f_inspect_error


Attributes
~~~~~~~~~~

.. autoapisummary::

   mlair.run_modules.pre_processing.__author__
   mlair.run_modules.pre_processing.__date__


.. py:data:: __author__
   :annotation: = Lukas Leufen, Felix Kleinert

.. py:data:: __date__
   :annotation: = 2019-11-25

.. py:class:: PreProcessing

   Bases: :py:obj:`mlair.run_modules.run_environment.RunEnvironment`

   Pre-process your data by using this class.

   Schedule of pre-processing:

   #. load and check valid stations (either download or load from disk)
   #. split subsets (train, val, test, train & val)
   #. create a small report on data metrics

   Required objects [scope] from data store:

   * all elements from `DEFAULT_ARGS_LIST` in scope preprocessing for general data loading
   * all elements from `DEFAULT_ARGS_LIST` in scopes [train, val, test, train_val] for custom subset settings
   * `fraction_of_training` [.]
   * `experiment_path` [.]
   * `use_all_stations_on_all_data_sets` [.]

   Optional objects

   * all elements from `DEFAULT_KWARGS_LIST` in scope preprocessing for general data loading
   * all elements from `DEFAULT_KWARGS_LIST` in scopes [train, val, test, train_val] for custom subset settings

   Sets

   * `stations` in [., train, val, test, train_val]
   * `generator` in [train, val, test, train_val]
   * `transformation` [.]

   Creates

   * all input and output data in `data_path`
   * latex reports in `experiment_path/latex_report`

   .. py:method:: _run(self)

   .. py:method:: report_pre_processing(self)

      Log some metrics on data and create latex report.

   .. py:method:: create_latex_report(self)

      Create tables with information on the station metadata and a summary of the subset sample sizes.

      * station_sample_size.md: the table below, as markdown
      * station_sample_size.tex: the table below, as latex table
      * station_sample_size_short.tex: reduced table without any metadata besides the station ID, as latex table

      All tables are stored inside `experiment_path` in the folder `latex_report`. The table format (e.g. which
      metadata is highlighted) is currently hardcoded to keep the table style stable. If further styles are
      needed, it is better to add an additional style than to modify the existing table styles.

      +------------+-------------------------------------------+---------------+---------------+---------------+---------+-------+--------+
      | stat. ID   | station_name                              | station_lon   | station_lat   | station_alt   | train   | val   | test   |
      +============+===========================================+===============+===============+===============+=========+=======+========+
      | DEBW013    | Stuttgart Bad Cannstatt                   | 9.2297        | 48.8088       | 235           | 1434    | 712   | 1080   |
      +------------+-------------------------------------------+---------------+---------------+---------------+---------+-------+--------+
      | DEBW076    | Baden-Baden                               | 8.2202        | 48.7731       | 148           | 3037    | 722   | 710    |
      +------------+-------------------------------------------+---------------+---------------+---------------+---------+-------+--------+
      | DEBW087    | Schwäbische_Alb                           | 9.2076        | 48.3458       | 798           | 3044    | 714   | 1087   |
      +------------+-------------------------------------------+---------------+---------------+---------------+---------+-------+--------+
      | DEBW107    | Tübingen                                  | 9.0512        | 48.5077       | 325           | 1803    | 715   | 1087   |
      +------------+-------------------------------------------+---------------+---------------+---------------+---------+-------+--------+
      | DEBY081    | Garmisch-Partenkirchen/Kreuzeckbahnstraße | 11.0631       | 47.4764       | 735           | 2935    | 525   | 714    |
      +------------+-------------------------------------------+---------------+---------------+---------------+---------+-------+--------+
      | # Stations | nan                                       | nan           | nan           | nan           | 6       | 6     | 6      |
      +------------+-------------------------------------------+---------------+---------------+---------------+---------+-------+--------+
      | # Samples  | nan                                       | nan           | nan           | nan           | 12253   | 3388  | 4678   |
      +------------+-------------------------------------------+---------------+---------------+---------------+---------+-------+--------+

   .. py:method:: create_describe_df(df, percentiles=None, ignore_last_lines: int = 2)
      :staticmethod:

   .. py:method:: create_info_df(self, meta_cols, meta_round, names_of_set, precision)

   .. py:method:: split_train_val_test(self) -> None

      Split data into subsets.

      Currently: train, val, test, and train_val (the latter is only the merge of train and val, but kept as a
      separate data_collection). IMPORTANT: Do not change the order of execution of create_set_split. The train
      subset must always be processed first to set a proper transformation.

   .. py:method:: split_set_indices(total_length: int, fraction: float) -> Tuple[slice, slice, slice, slice]
      :staticmethod:

      Create the training, validation and test subset slice indices for the given total_length.

      The test data consists of (1 - fraction) of total_length (fraction * len : end). Train and validation data
      therefore are made from fraction of total_length (0 : fraction * len). Train and validation data are split
      by the factor 0.8 for train and 0.2 for validation. In addition, split_set_indices also returns the
      combination of the training and validation subsets.

      :param total_length: length of the list with all objects to split
      :param fraction: ratio between test data and the union of train/val data

      :return: slices for each subset in the order: train, val, test, train_val

   .. py:method:: create_set_split(self, index_list: slice, set_name: str) -> None

   .. py:method:: validate_station(self, data_handler: mlair.data_handler.AbstractDataHandler, set_stations, set_name=None, store_processed_data=True)

      Check if all given stations in `all_stations` are valid.

      Valid means that there is data available for the given time range (included in `kwargs`). The shape and the
      loading time are logged in debug mode.

      :return: corrected list containing only valid station IDs

   .. py:method:: store_data_handler_attributes(self, data_handler, collection)

   .. py:method:: _store_apriori(self)

   .. py:method:: _load_apriori(self)

   .. py:method:: transformation(self, data_handler: mlair.data_handler.AbstractDataHandler, stations)

   .. py:method:: _load_transformation(self)

      Try to load transformation options from file if transformation_file is provided.

   .. py:method:: _store_transformation(self, transformation_opts)

      Store transformation options locally inside experiment_path if they do not exist already.

   .. py:method:: prepare_competitors(self)

      Prepare competitor models already in the preprocessing stage.

      This is performed here because some models might need internet access, which, depending on the operating
      system, may not be possible during postprocessing. Currently, this method only checks whether the
      IntelliO3-ts-v1 model is requested as a competitor and downloads the data if required.

   .. py:method:: create_snapshot(self)

   .. py:method:: load_snapshot(self, file)

.. py:function:: f_proc(data_handler, station, name_affix, store, return_strategy='', tmp_path=None, **kwargs)

   Try to create a data handler for the given arguments.

   If the build fails, the station does not fulfil all requirements and f_proc therefore returns None as
   indication. On a successful build, f_proc returns the built data handler and the station that was used. This
   function must be implemented globally to work together with multiprocessing.

.. py:function:: f_proc_create_info_df(data, meta_cols)

.. py:function:: f_inspect_error(formatted)
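The slicing scheme documented for ``split_set_indices`` can be sketched in a few lines. This is a minimal re-implementation of the described behaviour, not the actual MLAir code; in particular, the truncation of the split points via ``int()`` is an assumption:

```python
def split_set_indices(total_length, fraction):
    # Sketch of the documented scheme (assumption, not the MLAir source):
    # the last (1 - fraction) of the data becomes the test set, the first
    # fraction is shared by train and val with an internal 0.8 / 0.2 split.
    pos_test_split = int(total_length * fraction)
    train_index = slice(0, int(pos_test_split * 0.8))
    val_index = slice(int(pos_test_split * 0.8), pos_test_split)
    test_index = slice(pos_test_split, total_length)
    # train_val is simply the union of the train and val ranges
    train_val_index = slice(0, pos_test_split)
    return train_index, val_index, test_index, train_val_index
```

For ``total_length=1000`` and ``fraction=0.8`` this yields ``slice(0, 640)`` for train, ``slice(640, 800)`` for val, ``slice(800, 1000)`` for test, and ``slice(0, 800)`` for train_val.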
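The contract of ``f_proc`` (return the built handler on success, ``None`` as failure indication, always together with the station) can be illustrated with a hypothetical sketch. The callable ``data_handler`` interface and the caught exception types are assumptions for illustration, not the actual MLAir implementation:

```python
def f_proc(data_handler, station, **kwargs):
    # Hypothetical sketch of the documented contract: on a successful build
    # return (handler, station); if the build fails, return (None, station)
    # so the caller can filter out invalid stations afterwards.
    try:
        res = data_handler(station, **kwargs)
    except Exception:
        res = None
    return res, station
```

Because the real function is defined at module level (not as a closure or method), it can be pickled and dispatched to worker processes by ``multiprocessing``.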