mlair.run_modules.pre_processing

Pre-processing module.

Module Contents

Classes

PreProcessing

Pre-process your data by using this class.

Functions

f_proc(data_handler, station, name_affix, store, return_strategy='', tmp_path=None, **kwargs)

Try to create a data handler for given arguments. If the build fails, the station does not fulfil all requirements and f_proc returns None.

f_proc_create_info_df(data, meta_cols)

f_inspect_error(formatted)

Attributes

__author__

__date__

mlair.run_modules.pre_processing.__author__ = Lukas Leufen, Felix Kleinert
mlair.run_modules.pre_processing.__date__ = 2019-11-25
class mlair.run_modules.pre_processing.PreProcessing

Bases: mlair.run_modules.run_environment.RunEnvironment

Pre-process your data by using this class.

Schedule of pre-processing:
  1. load and check valid stations (either download or load from disk)

  2. split subsets (train, val, test, train & val)

  3. create small report on data metrics

Required objects [scope] from data store:
  • all elements from DEFAULT_ARGS_LIST in scope preprocessing for general data loading

  • all elements from DEFAULT_ARGS_LIST in scopes [train, val, test, train_val] for custom subset settings

  • fraction_of_training [.]

  • experiment_path [.]

  • use_all_stations_on_all_data_sets [.]

Optional objects
  • all elements from DEFAULT_KWARGS_LIST in scope preprocessing for general data loading

  • all elements from DEFAULT_KWARGS_LIST in scopes [train, val, test, train_val] for custom subset settings

Sets
  • stations in [., train, val, test, train_val]

  • generator in [train, val, test, train_val]

  • transformation [.]

Creates
  • all input and output data in data_path

  • latex reports in experiment_path/latex_report

_run(self)
report_pre_processing(self)

Log some metrics on data and create latex report.

create_latex_report(self)

Create tables with information on the station meta data and a summary on subset sample sizes.

  • station_sample_size.md: see table below as markdown

  • station_sample_size.tex: same as table below as latex table

  • station_sample_size_short.tex: reduced size table without any meta data besides station ID, as latex table

All tables are stored inside experiment_path inside the folder latex_report. The table format (e.g. which meta data is highlighted) is currently hardcoded to have a stable table style. If further styles are needed, it is better to add an additional style than modifying the existing table styles.

| stat. ID | station_name | station_lon | station_lat | station_alt | train | val | test |
|---|---|---|---|---|---|---|---|
| DEBW013 | Stuttgart Bad Cannstatt | 9.2297 | 48.8088 | 235 | 1434 | 712 | 1080 |
| DEBW076 | Baden-Baden | 8.2202 | 48.7731 | 148 | 3037 | 722 | 710 |
| DEBW087 | Schwäbische_Alb | 9.2076 | 48.3458 | 798 | 3044 | 714 | 1087 |
| DEBW107 | Tübingen | 9.0512 | 48.5077 | 325 | 1803 | 715 | 1087 |
| DEBY081 | Garmisch-Partenkirchen/Kreuzeckbahnstraße | 11.0631 | 47.4764 | 735 | 2935 | 525 | 714 |
| # Stations | nan | nan | nan | nan | 6 | 6 | 6 |
| # Samples | nan | nan | nan | nan | 12253 | 3388 | 4678 |
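The report files above are plain tables written to disk. As a minimal sketch of how such a markdown table can be emitted (MLAir's actual implementation is not shown here and likely uses pandas; this helper is an illustrative assumption):

```python
def to_markdown_table(header, rows):
    """Render header and rows as a simple GitHub-style markdown table (sketch,
    not MLAir's actual report writer)."""
    lines = ["| " + " | ".join(header) + " |",
             "|" + "|".join("---" for _ in header) + "|"]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)

# Example: a two-column excerpt of the table above.
print(to_markdown_table(["stat. ID", "train"], [["DEBW013", 1434]]))
```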

static create_describe_df(df, percentiles=None, ignore_last_lines: int = 2)
create_info_df(self, meta_cols, meta_round, names_of_set, precision)
split_train_val_test(self) → None

Split data into subsets.

Currently: train, val, test and train_val (the latter is only the merge of train and val, but stored as a separate data_collection). IMPORTANT: Do not change the order of execution of create_set_split. The train subset must always be processed first to set a proper transformation.

static split_set_indices(total_length: int, fraction: float) → Tuple[slice, slice, slice, slice]

Create the training, validation and test subset slice indices for given total_length.

The test data consists of (1-fraction) of total_length (fraction*len:end). Train and validation data are therefore made from fraction of total_length (0:fraction*len) and are split by the factor 0.8 for train and 0.2 for validation. In addition, split_set_indices also returns the combination of the training and validation subsets.

Parameters
  • total_length – total number of objects to split

  • fraction – ratio between test and union of train/val data

Returns

slices for each subset in the order: train, val, test, train_val
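The slicing logic described above can be sketched as follows. This is a re-implementation from the docstring, not MLAir's actual code; rounding details may differ:

```python
from typing import Tuple

def split_set_indices(total_length: int, fraction: float) -> Tuple[slice, slice, slice, slice]:
    """Sketch of the split described above: test gets (1-fraction) of the data,
    the remaining fraction is split 0.8/0.2 into train and validation."""
    pos_test_split = int(total_length * fraction)   # end of the train/val block
    pos_train_split = int(pos_test_split * 0.8)     # 80 % of that block is train
    train = slice(0, pos_train_split)
    val = slice(pos_train_split, pos_test_split)
    test = slice(pos_test_split, total_length)
    train_val = slice(0, pos_test_split)            # union of train and val
    return train, val, test, train_val

stations = list(range(10))
train, val, test, train_val = split_set_indices(len(stations), 0.8)
# stations[train] -> [0, 1, 2, 3, 4, 5], stations[val] -> [6, 7], stations[test] -> [8, 9]
```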

create_set_split(self, index_list: slice, set_name: str) → None
validate_station(self, data_handler: mlair.data_handler.AbstractDataHandler, set_stations, set_name=None, store_processed_data=True)

Check if all given stations in all_stations are valid.

Valid means that data is available for the given time range (included in kwargs). The shape and the loading time are logged in debug mode.

Returns

Corrected list containing only valid station IDs.
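The filtering described above amounts to keeping only stations whose data handler builds without error. A minimal sketch of that pattern (the `build` callable and its failure mode are illustrative assumptions, not MLAir's actual interface):

```python
def validate_stations(build, set_stations):
    """Return only those station IDs for which build(station) succeeds (sketch)."""
    valid = []
    for station in set_stations:
        try:
            build(station)          # raises if requirements are not fulfilled
            valid.append(station)
        except Exception:
            continue                # drop invalid station from the corrected list
    return valid
```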

store_data_handler_attributes(self, data_handler, collection)
_store_apriori(self)
_load_apriori(self)
transformation(self, data_handler: mlair.data_handler.AbstractDataHandler, stations)
_load_transformation(self)

Try to load transformation options from file if transformation_file is provided.

_store_transformation(self, transformation_opts)

Store transformation options locally inside experiment_path if not exists already.
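The "store if not exists" behaviour can be sketched as below. The file name and JSON serialisation are assumptions for illustration; MLAir's actual storage format is not specified here:

```python
import json
import os

def store_transformation(experiment_path, transformation_opts):
    """Write transformation options inside experiment_path unless a stored
    version already exists (sketch; file name and format are assumed)."""
    path = os.path.join(experiment_path, "transformation.json")
    if os.path.exists(path):
        return path                     # keep the existing file untouched
    with open(path, "w") as f:
        json.dump(transformation_opts, f)
    return path
```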

prepare_competitors(self)

Prepare competitor models already in the preprocessing stage. This is performed here because some models might need internet access, which, depending on the operating system, may not be possible during postprocessing. Currently, this method only checks whether the Intelli03-ts-v1 model is requested as a competitor and downloads the data if required.

create_snapshot(self)
load_snapshot(self, file)
mlair.run_modules.pre_processing.f_proc(data_handler, station, name_affix, store, return_strategy='', tmp_path=None, **kwargs)

Try to create a data handler for the given arguments. If the build fails, the station does not fulfil all requirements and f_proc returns None as indication. On a successful build, f_proc returns the built data handler and the station that was used. This function must be defined at module level (globally) to work with multiprocessing.

mlair.run_modules.pre_processing.f_proc_create_info_df(data, meta_cols)
mlair.run_modules.pre_processing.f_inspect_error(formatted)