:py:mod:`mlair.run_modules.experiment_setup`
============================================

.. py:module:: mlair.run_modules.experiment_setup


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   mlair.run_modules.experiment_setup.ExperimentSetup


Attributes
~~~~~~~~~~

.. autoapisummary::

   mlair.run_modules.experiment_setup.__author__
   mlair.run_modules.experiment_setup.__date__
   mlair.run_modules.experiment_setup.formatter


.. py:data:: __author__
   :annotation: = Lukas Leufen, Felix Kleinert


.. py:data:: __date__
   :annotation: = 2019-11-15


.. py:class:: ExperimentSetup(experiment_date=None, stations: Union[str, List[str]] = None, variables: Union[str, List[str]] = None, statistics_per_var: Dict = None, start: str = None, end: str = None, window_history_size: int = None, target_var='o3', target_dim=None, window_lead_time: int = None, window_dim=None, dimensions=None, time_dim=None, iter_dim=None, interpolation_method=None, interpolation_limit=None, train_start=None, train_end=None, val_start=None, val_end=None, test_start=None, test_end=None, use_all_stations_on_all_data_sets=None, train_model: bool = None, fraction_of_train: float = None, experiment_path=None, plot_path: str = None, forecast_path: str = None, overwrite_local_data=None, sampling: str = None, create_new_model=None, bootstrap_path=None, permute_data_on_training=None, transformation=None, train_min_length=None, val_min_length=None, test_min_length=None, extreme_values: list = None, extremes_on_right_tail_only: bool = None, evaluate_feature_importance: bool = None, plot_list=None, feature_importance_n_boots: int = None, feature_importance_create_new_bootstraps: bool = None, feature_importance_bootstrap_method=None, feature_importance_bootstrap_type=None, data_path: str = None, batch_path: str = None, login_nodes=None, hpc_hosts=None, model=None, batch_size=None, epochs=None, early_stopping_epochs: int = None, restore_best_model_weights: bool = None, data_handler=None, data_origin: Dict = None, 
competitors: list = None, competitor_path: str = None, use_multiprocessing: bool = None, use_multiprocessing_on_debug: bool = None, max_number_multiprocessing: int = None, start_script: Union[Callable, str] = None, overwrite_lazy_data: bool = None, uncertainty_estimate_block_length: str = None, uncertainty_estimate_evaluate_competitors: bool = None, uncertainty_estimate_n_boots: int = None, do_uncertainty_estimate: bool = None, do_bias_free_evaluation: bool = None, model_display_name: str = None, transformation_file: str = None, calculate_fresh_transformation: bool = None, snapshot_load_path: str = None, create_snapshot: bool = None, snapshot_path: str = None, model_path: str = None, **kwargs)

   Bases: :py:obj:`mlair.run_modules.run_environment.RunEnvironment`

   Set up the model.

   Schedule of experiment setup:

   * set up experiment path
   * set up data path (according to host system)
   * set up forecast, bootstrap and plot path (inside experiment path)
   * set all parameters given in args (or use default values)
   * check target variable
   * check `variables` and `statistics_per_var` parameter for consistency

   Sets

   * `data_path` [.]
   * `create_new_model` [.]
   * `bootstrap_path` [.]
   * `train_model` [.]
   * `fraction_of_training` [.]
   * `extreme_values` [train]
   * `extremes_on_right_tail_only` [train]
   * `upsampling` [train]
   * `permute_data` [train]
   * `experiment_name` [.]
   * `experiment_path` [.]
   * `plot_path` [.]
   * `forecast_path` [.]
   * `stations` [.]
   * `statistics_per_var` [.]
   * `variables` [.]
   * `start` [.]
   * `end` [.]
   * `window_history_size` [.]
   * `overwrite_local_data` [preprocessing]
   * `sampling` [.]
   * `transformation` [., preprocessing]
   * `target_var` [.]
   * `target_dim` [.]
   * `window_lead_time` [.]

   Creates

   * plot of model architecture in `.pdf`

   :param parser_args: argument parser, currently only accepting the ``experiment_date`` argument, which is used
      for the experiment's name and path creation. The final experiment name is derived from the given name and
      the time series sampling as `_network_/` .
      All interim and final results, logging, plots, ... of this run are stored in this directory if not
      explicitly provided in kwargs. Only the data itself and data for bootstrap investigations are stored
      outside this structure.
   :param stations: list of stations or single station to use in experiment. If not provided, stations are set to
      :py:const:`default stations `.
   :param variables: list of all variables to use. Valid names can be found in
      `Section 2.1 Parameters `_. If not provided, this parameter is filled with keys from
      ``statistics_per_var``.
   :param statistics_per_var: dictionary with statistics to use for variables (if data is daily and loaded from
      JOIN). If not provided, :py:const:`default statistics ` is applied. ``statistics_per_var`` is compared
      with given ``variables`` and unused variables are removed. Therefore, statistics at least need to provide
      all variables from ``variables``. For more details on available statistics, we refer to
      `Section 3.3 List of statistics/metrics for stats service `_ in the JOIN documentation. Valid parameter
      names can be found in `Section 2.1 Parameters `_.
   :param start: start date of overall data (default `"1997-01-01"`)
   :param end: end date of overall data (default `"2017-12-31"`)
   :param window_history_size: number of time steps to use for input data (default 13). Time steps `t_0 - w` to
      `t_0` are used as input data (therefore actual data size is `w+1`).
   :param target_var: target variable to predict by model, currently only a single target variable is supported.
      Because this framework was originally designed to predict ozone, default is `"o3"`.
   :param target_dim: dimension of target variable (default `"variables"`).
   :param window_lead_time: number of time steps to predict by model (default 3). Time steps `t_0+1` to `t_0+w`
      are predicted.
   :param dimensions:
   :param time_dim:
   :param interpolation_method: The method to use for interpolation.
   :param interpolation_limit: The maximum number of subsequent time steps in a gap to fill by interpolation. If
      the gap exceeds this number, the gap is not filled by interpolation at all. The limit is a plain count of
      time steps whose duration depends on the `sampling` frequency. A limit of 2 means that either 2 hours or
      2 days are allowed to be interpolated, depending on the configured sampling rate.
   :param train_start:
   :param train_end:
   :param val_start:
   :param val_end:
   :param test_start:
   :param test_end:
   :param use_all_stations_on_all_data_sets:
   :param train_model: if `True` (default), train a new model from scratch or resume training with an existing
      model; if `False`, freeze the loaded model and do not modify it. ``train_model`` is set to `True` if
      ``create_new_model`` is `True`.
   :param fraction_of_train: given value is used to split between test data and train data (including validation
      data). The value of ``fraction_of_train`` must be in `(0, 1)` but is recommended to be in the interval
      `[0.6, 0.9]`. Default value is `0.8`. Split between train and validation is fixed to 80% - 20% and
      currently not changeable.
   :param experiment_path:
   :param plot_path: path to save all plots. If left blank, plots are saved inside the experiment path
      (recommended). Otherwise, set a custom location to save all plots.
   :param forecast_path: path to save all forecasts in files. It is recommended to leave this parameter blank;
      all forecasts are then stored in the directory `forecasts` inside the experiment path (default). For
      customisation, add your path here.
   :param overwrite_local_data: Reload input and target data from web and replace local data if `True` (default
      `False`).
   :param sampling: set temporal sampling rate of data. You can choose from daily (default), monthly, seasonal,
      vegseason, summer and annual for aggregated values and hourly for the actual values. Note that hourly
      values on JOIN are currently not accessible from outside.
      To access this data, you need to add your personal token in :py:mod:`join settings ` and make sure to
      untrack this file!
   :param create_new_model: determine whether a new model will be created (`True`, default) or not (`False`). If
      this parameter is set to `False`, make sure that a suitable model already exists in the experiment path.
      This model must fit in terms of input and output dimensions as well as ``window_history_size`` and
      ``window_lead_time`` and must be implemented as a :py:mod:`model class ` and imported in
      :py:mod:`model setup `. If ``create_new_model`` is `True`, parameter ``train_model`` is automatically set
      to `True` too.
   :param bootstrap_path:
   :param permute_data_on_training: shuffle train data individually for each station if `True`. This is performed
      anew in each iteration, so that each sample very likely differs from epoch to epoch. Train data permutation
      is disabled (`False`) by default. In the case of extreme value manifolding, data permutation is enabled
      anyway.
   :param transformation: set transformation options in dictionary style. All information about transformation
      options can be found in :py:meth:`setup transformation `. If no transformation is provided, all options are
      set to :py:const:`default transformation `.
   :param train_min_length:
   :param val_min_length:
   :param test_min_length:
   :param extreme_values: augment target samples with values of lower occurrences indicated by its normalised
      deviation from mean by manifolding. These extreme values need to be indicated by a list of thresholds. For
      each entry in this list, all values outside a +/- interval will be added in the training (and only the
      training) set for a second time to the sample. If multiple values are given, a sample is added once for
      each exceedance. E.g. a sample with `value=2.5` occurs twice in the training set for given
      `extreme_values=[2, 3]`, whereas a sample with `value=5` occurs three times in the training set.
      By default, upsampling of extreme values is disabled (`None`). Upsampling can be modified to manifold only
      values that are actually larger than given values from ``extreme_values`` (apply only on right side of
      distribution) by using ``extremes_on_right_tail_only``. This can be useful for positively skewed variables.
   :param extremes_on_right_tail_only: applies only if ``extreme_values`` are given. If
      ``extremes_on_right_tail_only`` is `True`, only values that are larger than the given extremes are
      manifolded (apply upsampling only on right side of distribution). In default mode, this is set to `False`
      to manifold extremes on both sides.
   :param evaluate_bootstraps:
   :param plot_list:
   :param number_of_bootstraps:
   :param create_new_bootstraps:
   :param data_path: path to find and store meteorological and environmental / air quality data. Leave this
      parameter empty if your host system is known and a suitable path was already hardcoded in the program (see
      :py:func:`prepare host `).
   :param experiment_date:
   :param window_dim: "Temporal" dimension of the input and target data that is provided for each sample. The
      number of samples provided in this dimension can be set using `window_history_size` for inputs and
      `window_lead_time` on the target side.
   :param iter_dim:
   :param batch_path:
   :param login_nodes:
   :param hpc_hosts:
   :param model:
   :param batch_size:
   :param epochs: Number of epochs used in training. If a training is resumed and the number of epochs of the
      already (partly) trained model is lower than this parameter, training is continued. In case this number is
      higher than the given epochs parameter, no training is resumed. Epochs is set to 20 by default, but this
      value is just a placeholder that should be adjusted for a meaningful training.
   :param early_stopping_epochs: number of consecutive epochs with no improvement on val loss to stop training.
      When set to `np.inf` or not provided at all, training is not stopped before reaching `epochs`.
   :param restore_best_model_weights: indicates whether to use model state with best val loss (if True) or model
      state on ending of training (if False). The latter depends on the parameters `epochs` and
      `early_stopping_epochs`, which trigger stopping of training.
   :param data_handler:
   :param data_origin:
   :param competitors: Provide names of reference models trained by MLAir that can be found in the
      `competitor_path`. These models will be used in the postprocessing for comparison.
   :param competitor_path: The path where MLAir can find competing models. If not provided, this path is assumed
      to be in the `data_path` directory as a subdirectory called `competitors` (default).
   :param use_multiprocessing: Enable parallel preprocessing (postprocessing not implemented yet) by setting this
      parameter to `True` (default). If set to `False`, the computation is performed in a serial approach.
      Multiprocessing is disabled when running in debug mode and cannot be switched on.
   :param transformation_file: Use transformation options from this file for transformation
   :param calculate_fresh_transformation: can either be True or False, indicates if new transformation options
      should be calculated in any case (transformation_file is not used in this case!).
   :param snapshot_path: path to store snapshot of current run (default inside experiment path)
   :param create_snapshot: indicate if a snapshot is taken from current run or not (default False)
   :param snapshot_load_path: path to load a snapshot from (default None). In contrast to `snapshot_path`, which
      is only for storing a snapshot, `snapshot_load_path` indicates where to load the snapshot from. If this
      parameter is not provided at all, no snapshot is loaded. Note, the workflow will apply the default
      preprocessing without loading a snapshot only if this parameter is None!

   .. py:method:: _set_param(self, param: str, value: Any, default: Any = None, scope: str = 'general', apply: Callable = None) -> Any

      Set given parameter and log in debug.
      Use the ``apply`` parameter to adjust the stored value (e.g. to transform the value to a list, use
      ``apply=helpers.to_list``).


   .. py:method:: _store_start_script(start_script, store_path)
      :staticmethod:


   .. py:method:: _compare_variables_and_statistics(self)

      Compare variables and statistics.

      * raise error, if a variable is missing.
      * remove unused variables from statistics.


   .. py:method:: _check_target_var(self)

      Check if target variable is in statistics_per_var dictionary.



.. py:data:: formatter
   :annotation: = %(asctime)s - %(levelname)s: %(message)s [%(filename)s:%(funcName)s:%(lineno)s]
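The counting rule given for ``extreme_values`` above (a sample with ``value=2.5`` occurs twice for ``extreme_values=[2, 3]``, a sample with ``value=5`` three times) can be sketched as follows. This is only an illustration of the documented behaviour, not MLAir's implementation: the function name ``occurrences_in_training_set`` is hypothetical, and the use of a strict ``>`` comparison is an assumption.

```python
from typing import Sequence


def occurrences_in_training_set(value: float,
                                extreme_values: Sequence[float],
                                extremes_on_right_tail_only: bool = False) -> int:
    """Count how often a normalised sample appears in the training set.

    Hypothetical helper: every sample is present once and is added one more
    time for each threshold it exceeds (strict comparison is an assumption).
    """
    count = 1  # the original sample itself
    for threshold in extreme_values:
        if extremes_on_right_tail_only:
            exceeds = value > threshold       # only the right side of the distribution
        else:
            exceeds = abs(value) > threshold  # outside the +/- interval
        if exceeds:
            count += 1
    return count


print(occurrences_in_training_set(2.5, [2, 3]))  # exceeds only the threshold 2
print(occurrences_in_training_set(5, [2, 3]))    # exceeds both thresholds
```

With ``extremes_on_right_tail_only=True``, a sample with ``value=-2.5`` would not be duplicated in this sketch, because only values on the right side of the distribution are considered.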
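To make the meaning of ``window_history_size`` and ``window_lead_time`` concrete, the following minimal sketch cuts a plain sequence into input and target windows as described above: inputs cover ``t_0 - w`` to ``t_0`` (``w+1`` steps), targets cover the steps following ``t_0``. The helper ``make_windows`` is hypothetical and works on a Python list; MLAir's data handlers operate on labelled multi-dimensional data and may differ in detail.

```python
from typing import List, Sequence, Tuple


def make_windows(series: Sequence[float],
                 window_history_size: int,
                 window_lead_time: int) -> List[Tuple[list, list]]:
    """Return (inputs, targets) pairs for every valid reference step t_0.

    Hypothetical illustration only, not MLAir's data handler.
    """
    samples = []
    # t_0 needs window_history_size steps before it and window_lead_time steps after it
    for t0 in range(window_history_size, len(series) - window_lead_time):
        inputs = list(series[t0 - window_history_size:t0 + 1])    # t_0 - w ... t_0
        targets = list(series[t0 + 1:t0 + 1 + window_lead_time])  # steps after t_0
        samples.append((inputs, targets))
    return samples


samples = make_windows(list(range(10)), window_history_size=2, window_lead_time=3)
print(len(samples), samples[0])
```

In this sketch each input window has ``window_history_size + 1`` entries, which matches the note that the actual data size is ``w+1``.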
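The consistency check between ``variables`` and ``statistics_per_var`` (raise an error if a variable is missing, remove unused variables from the statistics, fall back to the statistics keys if no variables are given) can be sketched as a free function. This is an assumption-laden illustration: the function name, the returned tuple and the choice of ``ValueError`` are hypothetical, and the statistic names in the usage lines are only example values.

```python
from typing import Dict, List, Optional, Tuple, Union


def compare_variables_and_statistics(
        variables: Optional[Union[str, List[str]]],
        statistics_per_var: Dict[str, str]) -> Tuple[List[str], Dict[str, str]]:
    """Hypothetical sketch of the documented consistency check."""
    if variables is None:
        # no variables given: use the keys of statistics_per_var
        return list(statistics_per_var), dict(statistics_per_var)
    if isinstance(variables, str):
        variables = [variables]  # a single variable name is allowed
    missing = [var for var in variables if var not in statistics_per_var]
    if missing:
        # the error type is an assumption; the docs only say "raise error"
        raise ValueError(f"No statistic given for variables: {missing}")
    # drop statistics for variables that are not used
    return list(variables), {var: statistics_per_var[var] for var in variables}


stats = {"o3": "dma8eu", "temp": "maximum"}  # example values only
print(compare_variables_and_statistics(["o3"], stats))
```

Returning new objects instead of modifying the inputs keeps the sketch free of side effects; the actual method works on the parameters stored in the run environment.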