mlair.run_modules.experiment_setup

Module Contents

Classes

ExperimentSetup

Set up the model.

Attributes

__author__

__date__

formatter

mlair.run_modules.experiment_setup.__author__ = Lukas Leufen, Felix Kleinert
mlair.run_modules.experiment_setup.__date__ = 2019-11-15
class mlair.run_modules.experiment_setup.ExperimentSetup(experiment_date=None, stations: Union[str, List[str]] = None, variables: Union[str, List[str]] = None, statistics_per_var: Dict = None, start: str = None, end: str = None, window_history_size: int = None, target_var='o3', target_dim=None, window_lead_time: int = None, window_dim=None, dimensions=None, time_dim=None, iter_dim=None, interpolation_method=None, interpolation_limit=None, train_start=None, train_end=None, val_start=None, val_end=None, test_start=None, test_end=None, use_all_stations_on_all_data_sets=None, train_model: bool = None, fraction_of_train: float = None, experiment_path=None, plot_path: str = None, forecast_path: str = None, overwrite_local_data=None, sampling: str = None, create_new_model=None, bootstrap_path=None, permute_data_on_training=None, transformation=None, train_min_length=None, val_min_length=None, test_min_length=None, extreme_values: list = None, extremes_on_right_tail_only: bool = None, evaluate_feature_importance: bool = None, plot_list=None, feature_importance_n_boots: int = None, feature_importance_create_new_bootstraps: bool = None, feature_importance_bootstrap_method=None, feature_importance_bootstrap_type=None, data_path: str = None, batch_path: str = None, login_nodes=None, hpc_hosts=None, model=None, batch_size=None, epochs=None, early_stopping_epochs: int = None, restore_best_model_weights: bool = None, data_handler=None, data_origin: Dict = None, competitors: list = None, competitor_path: str = None, use_multiprocessing: bool = None, use_multiprocessing_on_debug: bool = None, max_number_multiprocessing: int = None, start_script: Union[Callable, str] = None, overwrite_lazy_data: bool = None, uncertainty_estimate_block_length: str = None, uncertainty_estimate_evaluate_competitors: bool = None, uncertainty_estimate_n_boots: int = None, do_uncertainty_estimate: bool = None, do_bias_free_evaluation: bool = None, model_display_name: str = None, transformation_file: str = None, calculate_fresh_transformation: bool = None, snapshot_load_path: str = None, create_snapshot: bool = None, snapshot_path: str = None, model_path: str = None, **kwargs)

Bases: mlair.run_modules.run_environment.RunEnvironment

Set up the model.

Schedule of experiment setup:
  • set up experiment path

  • set up data path (according to host system)

  • set up forecast, bootstrap and plot path (inside experiment path)

  • set all parameters given in args (or use default values)

  • check target variable

  • check variables and statistics_per_var parameter for consistency

Sets
  • data_path [.]

  • create_new_model [.]

  • bootstrap_path [.]

  • train_model [.]

  • fraction_of_training [.]

  • extreme_values [train]

  • extremes_on_right_tail_only [train]

  • upsampling [train]

  • permute_data [train]

  • experiment_name [.]

  • experiment_path [.]

  • plot_path [.]

  • forecast_path [.]

  • stations [.]

  • statistics_per_var [.]

  • variables [.]

  • start [.]

  • end [.]

  • window_history_size [.]

  • overwrite_local_data [preprocessing]

  • sampling [.]

  • transformation [., preprocessing]

  • target_var [.]

  • target_dim [.]

  • window_lead_time [.]

Creates
  • plot of model architecture in <model_name>.pdf

Parameters
  • parser_args – argument parser, currently only accepting experiment_date argument to be used for experiment’s name and path creation. Final experiment’s name is derived from given name and the time series sampling as <name>_network_<sampling>/ . All interim and final results, logging, plots, … of this run are stored in this directory if not explicitly provided in kwargs. Only the data itself and data for bootstrap investigations are stored outside this structure.

  • stations – list of stations or single station to use in experiment. If not provided, stations are set to default stations.

  • variables – list of all variables to use. Valid names can be found in Section 2.1 Parameters. If not provided, this parameter is filled with keys from statistics_per_var.

  • statistics_per_var

    dictionary with statistics to use per variable (if data is daily and loaded from JOIN). If not provided, the default statistics are applied. statistics_per_var is compared with the given variables, and unused entries are removed. statistics_per_var therefore needs to cover at least all variables listed in variables. For more details on available statistics, we refer to Section 3.3 List of statistics/metrics for stats service in the JOIN documentation. Valid parameter names can be found in Section 2.1 Parameters.

  • start – start date of overall data (default “1997-01-01”)

  • end – end date of overall data (default “2017-12-31”)

  • window_history_size – number of time steps to use for input data (default 13). Time steps t_0 - w to t_0 are used as input data (therefore actual data size is w+1).

  • target_var – target variable to predict by model, currently only a single target variable is supported. Because this framework was originally designed to predict ozone, default is “o3”.

  • target_dim – dimension of target variable (default “variables”).

  • window_lead_time – number of time steps to predict by model (default 3). Time steps t_0+1 to t_0+w are predicted.

  • dimensions

  • time_dim

  • interpolation_method – The method to use for interpolation.

  • interpolation_limit – The maximum number of subsequent time steps in a gap to fill by interpolation. If the gap exceeds this number, it is not filled by interpolation at all. The limit counts time steps independent of the sampling frequency: a limit of 2 allows either 2 hours or 2 days to be interpolated, depending on the set sampling rate.

  • train_start

  • train_end

  • val_start

  • val_end

  • test_start

  • test_end

  • use_all_stations_on_all_data_sets

  • train_model – train a new model from scratch or resume training with an existing model if True (default); if False, freeze the loaded model and do not perform any modification on it. train_model is set to True if create_new_model is True.

  • fraction_of_train – given value is used to split between test data and train data (including validation data). The value of fraction_of_train must be in (0, 1) but is recommended to be in the interval [0.6, 0.9]. Default value is 0.8. Split between train and validation is fixed to 80% - 20% and currently not changeable.

  • experiment_path

  • plot_path – path to save all plots. If left blank, this will be included in the experiment path (recommended). Otherwise customise the location to save all plots.

  • forecast_path – path to save all forecasts in files. It is recommended to leave this parameter blank; all forecasts are then stored in the directory forecasts inside the experiment path (default). For customisation, add your path here.

  • overwrite_local_data – Reload input and target data from web and replace local data if True (default False).

  • sampling – set the temporal sampling rate of the data. You can choose from daily (default), monthly, seasonal, vegseason, summer and annual for aggregated values and hourly for the actual values. Note that hourly values on JOIN are currently not accessible from outside. To access this data, you need to add your personal token in join settings and make sure to untrack this file!

  • create_new_model – determine whether a new model will be created (True, default) or not (False). If this parameter is set to False, make sure that a suitable model already exists in the experiment path. This model must fit in terms of input and output dimensions as well as window_history_size and window_lead_time, and must be implemented as a model class and imported in model setup. If create_new_model is True, parameter train_model is automatically set to True too.

  • bootstrap_path

  • permute_data_on_training – shuffle train data individually for each station if True. This shuffling is performed anew in each epoch, so that the sample order very likely differs from epoch to epoch. Train data permutation is disabled (False) by default. In the case of extreme value manifolding, data permutation is enabled anyway.

  • transformation – set transformation options in dictionary style. All information about transformation options can be found in setup transformation. If no transformation is provided, all options are set to default transformation.

  • train_min_length

  • val_min_length

  • test_min_length

  • extreme_values – augment rare target samples, indicated by their normalised deviation from the mean, by manifolding. These extreme values need to be indicated by a list of thresholds. For each entry in this list, all values outside a +/- interval are added to the training (and only the training) set a second time. If multiple values are given, a sample is added once per exceeded threshold. E.g. a sample with value=2.5 occurs twice in the training set for extreme_values=[2, 3], whereas a sample with value=5 occurs three times. By default, upsampling of extreme values is disabled (None). Upsampling can be restricted to values that are actually larger than the given thresholds (applied only on the right side of the distribution) by using extremes_on_right_tail_only. This can be useful for positively skewed variables.

  • extremes_on_right_tail_only – applies only if extreme_values are given. If extremes_on_right_tail_only is True, only manifold values that are larger than given extremes (apply upsampling only on right side of distribution). In default mode, this is set to False to manifold extremes on both sides.

  • evaluate_feature_importance

  • plot_list

  • feature_importance_n_boots

  • feature_importance_create_new_bootstraps

  • data_path – path to find and store meteorological and environmental / air quality data. Leave this parameter empty, if your host system is known and a suitable path was already hardcoded in the program (see prepare host).

  • experiment_date

  • window_dim – “temporal” dimension of the input and target data that is provided for each sample. The number of time steps provided in this dimension can be set using window_history_size for inputs and window_lead_time on the target side.

  • iter_dim

  • batch_path

  • login_nodes

  • hpc_hosts

  • model

  • batch_size

  • epochs – Number of epochs used in training. If training is resumed and the number of epochs of the already (partly) trained model is lower than this parameter, training is continued. If that number is higher than the given epochs parameter, no training is resumed. epochs is set to 20 by default, but this value is just a placeholder that should be adjusted for a meaningful training.

  • early_stopping_epochs – number of consecutive epochs with no improvement on val loss to stop training. When set to np.inf or not provided at all, training is not stopped before reaching epochs.

  • restore_best_model_weights – indicates whether to use the model state with the best val loss (if True) or the model state at the end of training (if False). The latter depends on the parameters epochs and early_stopping_epochs, which trigger stopping of training.

  • data_handler

  • data_origin

  • competitors – Provide names of reference models trained by MLAir that can be found in the competitor_path. These models will be used in the postprocessing for comparison.

  • competitor_path – The path where MLAir can find competing models. If not provided, this path is assumed to be a subdirectory of data_path called competitors (default).

  • use_multiprocessing – Enable parallel preprocessing (postprocessing not implemented yet) by setting this parameter to True (default). If set to False, the computation is performed serially. Multiprocessing is disabled when running in debug mode and cannot be switched on.

  • transformation_file – Use transformation options from this file for the transformation.

  • calculate_fresh_transformation – can either be True or False; indicates whether new transformation options should be calculated in any case (transformation_file is not used in this case!).

  • snapshot_path – path to store snapshot of current run (default inside experiment path)

  • create_snapshot – indicate if a snapshot is taken from current run or not (default False)

  • snapshot_load_path – path to load a snapshot from (default None). In contrast to snapshot_path, which is only for storing a snapshot, snapshot_load_path indicates where to load the snapshot from. If this parameter is not provided at all, no snapshot is loaded. Note that the workflow applies the default preprocessing without loading a snapshot only if this parameter is None!
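
The split described for fraction_of_train can be sketched with simple arithmetic. The following is a self-contained illustration of the documented behaviour (train/val split fixed at 80%/20%), not MLAir code; the function name split_sizes is hypothetical:

```python
# Illustration only (not MLAir code): how fraction_of_train divides
# n_samples into train, validation and test subsets. The split between
# train and validation is fixed at 80%/20%, as stated above.
def split_sizes(n_samples, fraction_of_train=0.8):
    train_val = round(n_samples * fraction_of_train)  # train incl. validation
    test = n_samples - train_val
    train = round(train_val * 0.8)                    # fixed 80/20 split
    val = train_val - train
    return train, val, test

# with the default fraction of 0.8, 100 samples split into 64/16/20
assert split_sizes(100) == (64, 16, 20)
```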
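
The upsampling arithmetic described for extreme_values can be made concrete with a small self-contained sketch (not MLAir code; total_occurrences is a hypothetical helper). It reproduces the worked example from the parameter description: one extra training copy per exceeded threshold.

```python
# Illustration only (not MLAir code): how often a training sample occurs
# after extreme-value upsampling. Each threshold the sample exceeds adds
# one extra copy to the training set.
def total_occurrences(value, extreme_values, right_tail_only=False):
    copies = 0
    for threshold in extreme_values:
        # by default the +/- interval applies to both tails; with
        # right_tail_only, only values above the threshold are manifolded
        exceeds = value > threshold if right_tail_only else abs(value) > threshold
        if exceeds:
            copies += 1
    return 1 + copies  # 1 = the original sample

assert total_occurrences(2.5, [2, 3]) == 2   # exceeds threshold 2 only
assert total_occurrences(5, [2, 3]) == 3     # exceeds both thresholds
assert total_occurrences(-5, [2, 3], right_tail_only=True) == 1  # left tail untouched
```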

_set_param(self, param: str, value: Any, default: Any = None, scope: str = 'general', apply: Callable = None) → Any

Set the given parameter and log it in debug. Use the apply parameter to adjust the stored value (e.g. to transform the value to a list, use apply=helpers.to_list).
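
A minimal self-contained sketch of the behaviour described above (not the MLAir implementation): a value falls back to a default, is optionally transformed by apply, and is stored under a scope. The names set_param, to_list and the dictionary store are stand-ins for illustration.

```python
# Illustration only (not MLAir code): store a parameter under a scope,
# falling back to `default` and optionally transforming with `apply`.
def set_param(store, param, value, default=None, scope="general", apply=None):
    value = default if value is None else value
    if apply is not None:
        value = apply(value)
    store.setdefault(scope, {})[param] = value
    return value

def to_list(x):  # stand-in for helpers.to_list
    return x if isinstance(x, list) else [x]

store = {}
assert set_param(store, "stations", "DEBW107", apply=to_list) == ["DEBW107"]
assert set_param(store, "epochs", None, default=20) == 20
assert store["general"]["epochs"] == 20
```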

static _store_start_script(start_script, store_path)
_compare_variables_and_statistics(self)

Compare variables and statistics.

  • raise error, if a variable is missing.

  • remove unused variables from statistics.
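
The two bullets above can be sketched as a self-contained check (not MLAir code; the function name and error message are illustrative): raise if a variable has no statistic, otherwise drop unused statistics.

```python
# Illustration only (not MLAir code) of the consistency check above:
# every requested variable must appear in statistics_per_var; statistics
# for variables that were not requested are removed.
def compare_variables_and_statistics(variables, statistics_per_var):
    missing = [v for v in variables if v not in statistics_per_var]
    if missing:
        raise ValueError(f"No statistics given for variables: {missing}")
    return {v: s for v, s in statistics_per_var.items() if v in variables}

stats = {"o3": "dma8eu", "temp": "maximum", "no2": "average_values"}
# "no2" is unused and therefore removed from the result
assert compare_variables_and_statistics(["o3", "temp"], stats) == {
    "o3": "dma8eu", "temp": "maximum"}
```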

_check_target_var(self)

Check if target variable is in statistics_per_var dictionary.
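
A one-line sketch of this check (not MLAir code; the function name and exception type are illustrative):

```python
# Illustration only (not MLAir code): the target variable must have an
# entry in statistics_per_var, otherwise the setup cannot proceed.
def check_target_var(target_var, statistics_per_var):
    if target_var not in statistics_per_var:
        raise KeyError(f"Target variable {target_var} is missing in statistics_per_var")

check_target_var("o3", {"o3": "dma8eu"})  # passes silently
```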

mlair.run_modules.experiment_setup.formatter = %(asctime)s - %(levelname)s: %(message)s [%(filename)s:%(funcName)s:%(lineno)s]