:py:mod:`mlair`
===============

.. py:module:: mlair


Subpackages
-----------
.. toctree::
   :titlesonly:
   :maxdepth: 3

   configuration/index.rst
   data_handler/index.rst
   helpers/index.rst
   keras_legacy/index.rst
   model_modules/index.rst
   plotting/index.rst
   reference_models/index.rst
   run_modules/index.rst
   workflows/index.rst


Submodules
----------
.. toctree::
   :titlesonly:
   :maxdepth: 1

   run_script/index.rst


Package Contents
----------------

Classes
~~~~~~~

.. autoapisummary::

   mlair.RunEnvironment
   mlair.ExperimentSetup
   mlair.PreProcessing
   mlair.ModelSetup
   mlair.Training
   mlair.PostProcessing
   mlair.AbstractModelClass


Functions
~~~~~~~~~

.. autoapisummary::

   mlair.get_version


Attributes
~~~~~~~~~~

.. autoapisummary::

   mlair.__version_info__
   mlair.__version__
   mlair.__author__
   mlair.__email__


.. py:data:: __version_info__


.. py:class:: RunEnvironment(name=None, log_level_stream=None)

   Bases: :py:obj:`object`

   Basic run class to measure execution time.

   Either call this class in a 'with' statement or delete the class instance after finishing the measurement. The measured duration is logged.

   .. code-block:: python

      >>> with RunEnvironment():
      ...     pass  # your code goes here
      INFO: RunEnvironment started
      ...
      INFO: RunEnvironment finished after 00:00:04 (hh:mm:ss)

   If you want to embed your custom module in a RunEnvironment, you can simply call it inside the with statement. If you also want to exchange data between different modules, create your module as a subclass of RunEnvironment and call it after you have initialised the RunEnvironment itself.

   .. code-block:: python

      class CustomClass(RunEnvironment):

          def __init__(self):
              super().__init__()
              ...
          ...

      >>> with RunEnvironment():
      ...     CustomClass()
      INFO: RunEnvironment started
      INFO: CustomClass started
      INFO: CustomClass finished after 00:00:04 (hh:mm:ss)
      INFO: RunEnvironment finished after 00:00:04 (hh:mm:ss)

   All data that is stored in the data store will be available for all other modules that inherit from RunEnvironment as long as the RunEnvironment base class is running.
   If the base class is deleted either by hand or on exit of the with statement, this storage is cleared.

   .. code-block:: python

      import logging

      class CustomClassA(RunEnvironment):

          def __init__(self):
              super().__init__()
              # store a value in the shared data store
              self.data_store.set("testVar", 12)

      class CustomClassB(RunEnvironment):

          def __init__(self):
              super().__init__()
              # read the value that CustomClassA stored before
              self.test_var = self.data_store.get("testVar")
              logging.info(f"testVar = {self.test_var}")

      >>> with RunEnvironment():
      ...     CustomClassA()
      ...     CustomClassB()
      INFO: RunEnvironment started
      INFO: CustomClassA started
      INFO: CustomClassA finished after 00:00:01 (hh:mm:ss)
      INFO: CustomClassB started
      INFO: testVar = 12
      INFO: CustomClassB finished after 00:00:02 (hh:mm:ss)
      INFO: RunEnvironment finished after 00:00:03 (hh:mm:ss)

   .. py:attribute:: del_by_exit
      :annotation: = False

   .. py:attribute:: data_store

   .. py:attribute:: logger

   .. py:attribute:: tracker_list
      :annotation: = []

   .. py:method:: __del__(self)

      Finalise class.

      Only stop time tracking if it has not already been stopped by the exit method, to prevent duplicated logging (__exit__ is always executed before __del__ if this class was used in a with statement). If the instance is the base class itself and not a class inheriting from it, the log file is copied and the data store is cleared.

   .. py:method:: __enter__(self)

      Enter run environment.

   .. py:method:: __exit__(self, exc_type, exc_val, exc_tb)

      Exit run environment.

   .. py:method:: __move_log_file(self)

   .. py:method:: __save_tracking(self)

   .. py:method:: __plot_tracking(self)

   .. py:method:: __find_file_pattern(self, name)

   .. py:method:: update_datastore(cls, new_data_store: mlair.helpers.datastore.DataStoreByScope, excluded_params=None, apply_full_replacement=False)
      :classmethod:

   .. py:method:: do_stuff(length=2)
      :staticmethod:

      Just a placeholder method for testing; it serves no other purpose.

..
py:class:: ExperimentSetup(experiment_date=None, stations: Union[str, List[str]] = None, variables: Union[str, List[str]] = None, statistics_per_var: Dict = None, start: str = None, end: str = None, window_history_size: int = None, target_var='o3', target_dim=None, window_lead_time: int = None, window_dim=None, dimensions=None, time_dim=None, iter_dim=None, interpolation_method=None, interpolation_limit=None, train_start=None, train_end=None, val_start=None, val_end=None, test_start=None, test_end=None, use_all_stations_on_all_data_sets=None, train_model: bool = None, fraction_of_train: float = None, experiment_path=None, plot_path: str = None, forecast_path: str = None, overwrite_local_data=None, sampling: str = None, create_new_model=None, bootstrap_path=None, permute_data_on_training=None, transformation=None, train_min_length=None, val_min_length=None, test_min_length=None, extreme_values: list = None, extremes_on_right_tail_only: bool = None, evaluate_feature_importance: bool = None, plot_list=None, feature_importance_n_boots: int = None, feature_importance_create_new_bootstraps: bool = None, feature_importance_bootstrap_method=None, feature_importance_bootstrap_type=None, data_path: str = None, batch_path: str = None, login_nodes=None, hpc_hosts=None, model=None, batch_size=None, epochs=None, early_stopping_epochs: int = None, restore_best_model_weights: bool = None, data_handler=None, data_origin: Dict = None, competitors: list = None, competitor_path: str = None, use_multiprocessing: bool = None, use_multiprocessing_on_debug: bool = None, max_number_multiprocessing: int = None, start_script: Union[Callable, str] = None, overwrite_lazy_data: bool = None, uncertainty_estimate_block_length: str = None, uncertainty_estimate_evaluate_competitors: bool = None, uncertainty_estimate_n_boots: int = None, do_uncertainty_estimate: bool = None, do_bias_free_evaluation: bool = None, model_display_name: str = None, transformation_file: str = None, 
calculate_fresh_transformation: bool = None, snapshot_load_path: str = None, create_snapshot: bool = None, snapshot_path: str = None, model_path: str = None, **kwargs)

   Bases: :py:obj:`mlair.run_modules.run_environment.RunEnvironment`

   Set up the model.

   Schedule of experiment setup:

   * set up experiment path
   * set up data path (according to host system)
   * set up forecast, bootstrap and plot path (inside experiment path)
   * set all parameters given in args (or use default values)
   * check target variable
   * check `variables` and `statistics_per_var` parameter for consistency

   Sets

   * `data_path` [.]
   * `create_new_model` [.]
   * `bootstrap_path` [.]
   * `train_model` [.]
   * `fraction_of_training` [.]
   * `extreme_values` [train]
   * `extremes_on_right_tail_only` [train]
   * `upsampling` [train]
   * `permute_data` [train]
   * `experiment_name` [.]
   * `experiment_path` [.]
   * `plot_path` [.]
   * `forecast_path` [.]
   * `stations` [.]
   * `statistics_per_var` [.]
   * `variables` [.]
   * `start` [.]
   * `end` [.]
   * `window_history_size` [.]
   * `overwrite_local_data` [preprocessing]
   * `sampling` [.]
   * `transformation` [., preprocessing]
   * `target_var` [.]
   * `target_dim` [.]
   * `window_lead_time` [.]

   Creates

   * plot of model architecture in `.pdf`

   :param parser_args: argument parser, currently only accepting the ``experiment_date`` argument to be used for the experiment's name and path creation. The final experiment name is derived from the given name and the time series sampling as `_network_/`. All interim and final results, logging, plots, ... of this run are stored in this directory if not explicitly provided in kwargs. Only the data itself and data for bootstrap investigations are stored outside this structure.
   :param stations: list of stations or single station to use in experiment. If not provided, stations are set to :py:const:`default stations `.
   :param variables: list of all variables to use. Valid names can be found in `Section 2.1 Parameters `_.
       If not provided, this parameter is filled with keys from ``statistics_per_var``.
   :param statistics_per_var: dictionary with statistics to use for variables (if data is daily and loaded from JOIN). If not provided, :py:const:`default statistics ` are applied. ``statistics_per_var`` is compared with given ``variables`` and unused variables are removed. Therefore, statistics at least need to provide all variables from ``variables``. For more details on available statistics, we refer to `Section 3.3 List of statistics/metrics for stats service `_ in the JOIN documentation. Valid parameter names can be found in `Section 2.1 Parameters `_.
   :param start: start date of overall data (default `"1997-01-01"`)
   :param end: end date of overall data (default `"2017-12-31"`)
   :param window_history_size: number of time steps to use for input data (default 13). Time steps `t_0 - w` to `t_0` are used as input data (therefore actual data size is `w+1`).
   :param target_var: target variable to predict by model, currently only a single target variable is supported. Because this framework was originally designed to predict ozone, default is `"o3"`.
   :param target_dim: dimension of target variable (default `"variables"`).
   :param window_lead_time: number of time steps to predict by model (default 3). Time steps `t_0+1` to `t_0+w` are predicted.
   :param dimensions:
   :param time_dim:
   :param interpolation_method: The method to use for interpolation.
   :param interpolation_limit: The maximum number of consecutive time steps in a gap to fill by interpolation. If the gap exceeds this number, the gap is not filled by interpolation at all. The value is a plain number of time steps that is interpreted according to the `sampling` frequency. A limit of 2 means that either 2 hours or 2 days may be interpolated, depending on the configured sampling rate.
   :param train_start:
   :param train_end:
   :param val_start:
   :param val_end:
   :param test_start:
   :param test_end:
   :param use_all_stations_on_all_data_sets:
   :param train_model: if `True` (default), train a new model from scratch or resume training with an existing model; otherwise freeze the loaded model and do not modify it. ``train_model`` is set to `True` if ``create_new_model`` is `True`.
   :param fraction_of_train: given value is used to split between test data and train data (including validation data). The value of ``fraction_of_train`` must be in `(0, 1)` but is recommended to be in the interval `[0.6, 0.9]`. Default value is `0.8`. The split between train and validation is fixed to 80% - 20% and currently cannot be changed.
   :param experiment_path:
   :param plot_path: path to save all plots. If left blank, this will be included in the experiment path (recommended). Otherwise customise the location to save all plots.
   :param forecast_path: path to save all forecasts in files. It is recommended to leave this parameter blank, in which case all forecasts are stored in the directory `forecasts` inside the experiment path (default). For customisation, add your path here.
   :param overwrite_local_data: Reload input and target data from web and replace local data if `True` (default `False`).
   :param sampling: set temporal sampling rate of data. You can choose from daily (default), monthly, seasonal, vegseason, summer and annual for aggregated values and hourly for the actual values. Note that hourly values on JOIN are currently not accessible from outside. To access this data, you need to add your personal token in :py:mod:`join settings ` and make sure to untrack this file!
   :param create_new_model: determine whether a new model will be created (`True`, default) or not (`False`). If this parameter is set to `False`, make sure that a suitable model already exists in the experiment path.
       This model must fit in terms of input and output dimensions as well as ``window_history_size`` and ``window_lead_time`` and must be implemented as a :py:mod:`model class ` and imported in :py:mod:`model setup `. If ``create_new_model`` is `True`, parameter ``train_model`` is automatically set to `True` too.
   :param bootstrap_path:
   :param permute_data_on_training: shuffle train data individually for each station if `True`. The permutation is redone in each iteration, so that each sample very likely differs from epoch to epoch. Train data permutation is disabled (`False`) by default. In the case of extreme value manifolding, data permutation is enabled anyway.
   :param transformation: set transformation options in dictionary style. All information about transformation options can be found in :py:meth:`setup transformation `. If no transformation is provided, all options are set to :py:const:`default transformation `.
   :param train_min_length:
   :param val_min_length:
   :param test_min_length:
   :param extreme_values: augment target samples with rarely occurring values, identified by their normalised deviation from the mean, by manifolding. These extreme values need to be indicated by a list of thresholds. For each entry in this list, all values outside a +/- interval will be added to the training (and only the training) set for a second time. If multiple values are given, a sample is added once for each exceedance. E.g. a sample with `value=2.5` occurs twice in the training set for given `extreme_values=[2, 3]`, whereas a sample with `value=5` occurs three times in the training set. By default, upsampling of extreme values is disabled (`None`). Upsampling can be modified to manifold only values that are actually larger than given values from ``extreme_values`` (apply only on right side of distribution) by using ``extremes_on_right_tail_only``. This can be useful for positively skewed variables.
   :param extremes_on_right_tail_only: applies only if ``extreme_values`` are given. If ``extremes_on_right_tail_only`` is `True`, only values that are larger than the given extremes are manifolded (apply upsampling only on right side of distribution). By default, this is set to `False` to manifold extremes on both sides.
   :param evaluate_bootstraps:
   :param plot_list:
   :param number_of_bootstraps:
   :param create_new_bootstraps:
   :param data_path: path to find and store meteorological and environmental / air quality data. Leave this parameter empty if your host system is known and a suitable path was already hardcoded in the program (see :py:func:`prepare host `).
   :param experiment_date:
   :param window_dim: "Temporal" dimension of the input and target data that is provided for each sample. The number of samples provided in this dimension can be set using `window_history_size` for inputs and `window_lead_time` for targets.
   :param iter_dim:
   :param batch_path:
   :param login_nodes:
   :param hpc_hosts:
   :param model:
   :param batch_size:
   :param epochs: Number of epochs used in training. If a training is resumed and the number of epochs of the already (partly) trained model is lower than this parameter, training is continued. In case this number is higher than the given epochs parameter, no training is resumed. Epochs is set to 20 by default, but this value is just a placeholder that should be adjusted for a meaningful training.
   :param early_stopping_epochs: number of consecutive epochs with no improvement on val loss to stop training. When set to `np.inf` or not provided at all, training is not stopped before reaching `epochs`.
   :param restore_best_model_weights: indicates whether to use model state with best val loss (if True) or model state at the end of training (if False). The latter depends on the parameters `epochs` and `early_stopping_epochs`, which trigger stopping of training.
   :param data_handler:
   :param data_origin:
   :param competitors: Provide names of reference models trained by MLAir that can be found in the `competitor_path`. These models will be used in the postprocessing for comparison.
   :param competitor_path: The path where MLAir can find competing models. If not provided, this path is assumed to be in the `data_path` directory as a subdirectory called `competitors` (default).
   :param use_multiprocessing: Enable parallel preprocessing (postprocessing not implemented yet) by setting this parameter to `True` (default). If set to `False`, the computation is performed serially. Multiprocessing is disabled when running in debug mode and cannot be switched on.
   :param transformation_file: Use transformation options from this file for transformation.
   :param calculate_fresh_transformation: can either be True or False, indicates if new transformation options should be calculated in any case (transformation_file is not used in this case!).
   :param snapshot_path: path to store snapshot of current run (default inside experiment path)
   :param create_snapshot: indicate if a snapshot is taken from current run or not (default False)
   :param snapshot_load_path: path to load a snapshot from (default None). In contrast to `snapshot_path`, which is only for storing a snapshot, `snapshot_load_path` indicates where to load the snapshot from. If this parameter is not provided at all, no snapshot is loaded. Note that the workflow will apply the default preprocessing without loading a snapshot only if this parameter is None!

   .. py:method:: _set_param(self, param: str, value: Any, default: Any = None, scope: str = 'general', apply: Callable = None) -> Any

      Set given parameter and log in debug. Use the apply parameter to adjust the stored value (e.g. to transform value to a list use apply=helpers.to_list).

   .. py:method:: _store_start_script(start_script, store_path)
      :staticmethod:

   ..
py:method:: _compare_variables_and_statistics(self)

      Compare variables and statistics.

      * raise error, if a variable is missing.
      * remove unused variables from statistics.

   .. py:method:: _check_target_var(self)

      Check if target variable is in statistics_per_var dictionary.


.. py:class:: PreProcessing

   Bases: :py:obj:`mlair.run_modules.run_environment.RunEnvironment`

   Pre-process your data by using this class.

   Schedule of pre-processing:

   #. load and check valid stations (either download or load from disk)
   #. split subsets (train, val, test, train & val)
   #. create small report on data metrics

   Required objects [scope] from data store:

   * all elements from `DEFAULT_ARGS_LIST` in scope preprocessing for general data loading
   * all elements from `DEFAULT_ARGS_LIST` in scopes [train, val, test, train_val] for custom subset settings
   * `fraction_of_training` [.]
   * `experiment_path` [.]
   * `use_all_stations_on_all_data_sets` [.]

   Optional objects

   * all elements from `DEFAULT_KWARGS_LIST` in scope preprocessing for general data loading
   * all elements from `DEFAULT_KWARGS_LIST` in scopes [train, val, test, train_val] for custom subset settings

   Sets

   * `stations` in [., train, val, test, train_val]
   * `generator` in [train, val, test, train_val]
   * `transformation` [.]

   Creates

   * all input and output data in `data_path`
   * latex reports in `experiment_path/latex_report`

   .. py:method:: _run(self)

   .. py:method:: report_pre_processing(self)

      Log some metrics on data and create latex report.

   .. py:method:: create_latex_report(self)

      Create tables with information on the station meta data and a summary on subset sample sizes.

      * station_sample_size.md: see table below as markdown
      * station_sample_size.tex: same as table below as latex table
      * station_sample_size_short.tex: reduced size table without any meta data besides station ID, as latex table

      All tables are stored inside experiment_path inside the folder latex_report. The table format (e.g. which meta data is highlighted) is currently hardcoded to have a stable table style. If further styles are needed, it is better to add an additional style than to modify the existing table styles.

      +------------+-------------------------------------------+---------------+---------------+---------------+---------+-------+--------+
      | stat. ID   | station_name                              | station_lon   | station_lat   | station_alt   | train   | val   | test   |
      +============+===========================================+===============+===============+===============+=========+=======+========+
      | DEBW013    | Stuttgart Bad Cannstatt                   | 9.2297        | 48.8088       | 235           | 1434    | 712   | 1080   |
      +------------+-------------------------------------------+---------------+---------------+---------------+---------+-------+--------+
      | DEBW076    | Baden-Baden                               | 8.2202        | 48.7731       | 148           | 3037    | 722   | 710    |
      +------------+-------------------------------------------+---------------+---------------+---------------+---------+-------+--------+
      | DEBW087    | Schwäbische_Alb                           | 9.2076        | 48.3458       | 798           | 3044    | 714   | 1087   |
      +------------+-------------------------------------------+---------------+---------------+---------------+---------+-------+--------+
      | DEBW107    | Tübingen                                  | 9.0512        | 48.5077       | 325           | 1803    | 715   | 1087   |
      +------------+-------------------------------------------+---------------+---------------+---------------+---------+-------+--------+
      | DEBY081    | Garmisch-Partenkirchen/Kreuzeckbahnstraße | 11.0631       | 47.4764       | 735           | 2935    | 525   | 714    |
      +------------+-------------------------------------------+---------------+---------------+---------------+---------+-------+--------+
      | # Stations | nan                                       | nan           | nan           | nan           | 6       | 6     | 6      |
      +------------+-------------------------------------------+---------------+---------------+---------------+---------+-------+--------+
      | # Samples  | nan                                       | nan           | nan           | nan           | 12253   | 3388  | 4678   |
      +------------+-------------------------------------------+---------------+---------------+---------------+---------+-------+--------+

   .. py:method:: create_describe_df(df, percentiles=None, ignore_last_lines: int = 2)
      :staticmethod:

   .. py:method:: create_info_df(self, meta_cols, meta_round, names_of_set, precision)

   .. py:method:: split_train_val_test(self) -> None

      Split data into subsets.

      Currently: train, val, test and train_val (actually this is only the merge of train and val, but as a separate data_collection). IMPORTANT: Do not change the order of execution of create_set_split. The train subset always needs to be executed first, to set a proper transformation.

   .. py:method:: split_set_indices(total_length: int, fraction: float) -> Tuple[slice, slice, slice, slice]
      :staticmethod:

      Create the training, validation and test subset slice indices for given total_length.

      The test data consists of (1-fraction) of total_length (fraction*len:end). Train and validation data are therefore made from fraction of total_length (0:fraction*len). Train and validation data are split by the factor 0.8 for train and 0.2 for validation. In addition, split_set_indices also returns the combination of training and validation subset.

      :param total_length: total number of objects to split
      :param fraction: share of total_length used for the union of train and val data (the remainder is test data)

      :return: slices for each subset in the order: train, val, test, train_val

   .. py:method:: create_set_split(self, index_list: slice, set_name: str) -> None

   .. py:method:: validate_station(self, data_handler: mlair.data_handler.AbstractDataHandler, set_stations, set_name=None, store_processed_data=True)

      Check if all given stations in `all_stations` are valid.

      Valid means that there is data available for the given time range (is included in `kwargs`). The shape and the loading time are logged in debug mode.

      :return: Corrected list containing only valid station IDs.

   ..
py:method:: store_data_handler_attributes(self, data_handler, collection)

   .. py:method:: _store_apriori(self)

   .. py:method:: _load_apriori(self)

   .. py:method:: transformation(self, data_handler: mlair.data_handler.AbstractDataHandler, stations)

   .. py:method:: _load_transformation(self)

      Try to load transformation options from file if transformation_file is provided.

   .. py:method:: _store_transformation(self, transformation_opts)

      Store transformation options locally inside experiment_path if they do not exist already.

   .. py:method:: prepare_competitors(self)

      Prepare competitor models already in the preprocessing stage. This is performed here, because some models might need internet access, which, depending on the operating system, is not possible during postprocessing. This method currently only checks whether the Intelli03-ts-v1 model is requested as competitor and downloads the data if required.

   .. py:method:: create_snapshot(self)

   .. py:method:: load_snapshot(self, file)


.. py:class:: ModelSetup

   Bases: :py:obj:`mlair.run_modules.run_environment.RunEnvironment`

   Set up the model.

   Schedule of model setup:

   #. set channels (from variables dimension)
   #. build imported model
   #. plot model architecture
   #. load weights if enabled (e.g. to resume a training)
   #. set callbacks and checkpoint
   #. compile model

   Required objects [scope] from data store:

   * `experiment_path` [.]
   * `experiment_name` [.]
   * `train_model` [.]
   * `create_new_model` [.]
   * `generator` [train]
   * `model_class` [.]

   Optional objects

   * `lr_decay` [model]

   Sets

   * `channels` [model]
   * `model` [model]
   * `hist` [model]
   * `callbacks` [model]
   * `model_name` [model]
   * all settings from model class like `dropout_rate`, `initial_lr`, and `optimizer` [model]

   Creates

   * plot of model architecture `.pdf`

   .. py:method:: _run(self)

   .. py:method:: _set_model_path(self)

   .. py:method:: _set_shapes(self)

      Set input and output shapes from train collection.

   ..
py:method:: _set_num_of_training_samples(self)

      Set number of training samples, needed for example for Bayesian NNs.

   .. py:method:: compile_model(self)

      Compiles the keras model. Compile options are mandatory and have to be set by implementing the set_compile() method in a child class of AbstractModelClass.

   .. py:method:: _set_callbacks(self)

      Set all callbacks for the training phase.

      Add all callbacks with the .add_callback statement. Finally, the advanced model checkpoint is added.

   .. py:method:: copy_model(self)

      Copy external model to internal experiment structure.

   .. py:method:: load_model(self)

      Try to load model from disk or skip if not possible.

   .. py:method:: build_model(self)

      Build model using input and output shapes from data store.

   .. py:method:: broadcast_custom_objects(self)

      Broadcast custom objects to keras utils.

      This method is very important, because it adds the model's custom objects to the keras utils. By doing so, all custom objects can be treated as standard keras modules. Therefore, problems related to model or callback loading are solved.

   .. py:method:: get_model_settings(self)

      Load all model settings and store in data store.

   .. py:method:: plot_model(self)

      Plot model architecture as `.pdf`.

   .. py:method:: report_model(self)

   .. py:method:: _clean_name(orig_name: str)
      :staticmethod:


.. py:class:: Training

   Bases: :py:obj:`mlair.run_modules.run_environment.RunEnvironment`

   Train your model with this module.

   This module does not need to be run if only a fresh post-processing is performed. Either remove the training call from your run script or set create_new_model and train_model both to false.

   Schedule of training:

   #. set_generators(): set generators for training, validation and testing and distribute according to batch size
   #. make_predict_function(): create predict function before distribution on multiple nodes (detailed information in method description)
   #. train(): start or resume training of model and save callbacks
   #.
save_model(): save best model from training as final model Required objects [scope] from data store: * `model` [model] * `batch_size` [.] * `epochs` [.] * `callbacks` [model] * `model_name` [model] * `experiment_name` [.] * `experiment_path` [.] * `train_model` [.] * `create_new_model` [.] * `generator` [train, val, test] * `plot_path` [.] Optional objects * `permute_data` [train, val, test] * `upsampling` [train, val, test] Sets * `model` [.] Creates * `_model-best.h5` * `_model-best-callbacks-.h5` (all callbacks from CallbackHandler) * `history.json` * `history_lr.json` (optional) * `_history_.pdf` (different monitoring plots depending on loss metrics and callbacks) .. py:method:: _run(self) -> None Run training. Details in class description. .. py:method:: make_predict_function(self) -> None Create predict function. Must be called before distributing. This is necessary, because tf will compile the predict function just in the moment it is used the first time. This can cause problems, if the model is distributed on different workers. To prevent this, the function is pre-compiled. See discussion @ https://stackoverflow.com/questions/40850089/is-keras-thread-safe/43393252#43393252 .. py:method:: _set_gen(self, mode: str) -> None Set and distribute the generators for given mode regarding batch size. :param mode: name of set, should be from ["train", "val", "test"] .. py:method:: set_generators(self) -> None Set all generators for training, validation, and testing subsets. The called sub-method will automatically distribute the data according to the batch size. The subsets can be accessed as class variables train_set, val_set, and test_set. .. py:method:: train(self) -> None Perform training using keras fit(). Callbacks are stored locally in the experiment directory. Best model from training is saved for class variable model. 
      If the file path of checkpoint is not empty, this method assumes that this is not a new training starting from the very beginning, but a resumption of a previously started but interrupted training (or a stopped and now continued training). Train will automatically load the locally stored information and the corresponding model and proceed with the already started training.

   .. py:method:: save_model(self) -> None

      Save model in local experiment directory.

      Model is named as `_.h5`.

   .. py:method:: save_callbacks_as_json(self, history: tensorflow.keras.callbacks.Callback, lr_sc: tensorflow.keras.callbacks.Callback, epo_timing: tensorflow.keras.callbacks.Callback) -> None

      Save callbacks (history, learning rate) of training.

      * history.history -> history.json
      * lr_sc.lr -> history_lr.json

      :param history: history object of training
      :param lr_sc: learning rate object

   .. py:method:: create_monitoring_plots(self, history: tensorflow.keras.callbacks.Callback, lr_sc: tensorflow.keras.callbacks.Callback, epoch_best: int = None) -> None

      Create plots of history and learning rate as a function of the number of epochs.

      The plots are saved in the experiment's plot_path. The history plot is named `_history_loss_val_loss.pdf`, the learning rate plot `_history_learning_rate.pdf`.

      :param history: keras history object with losses to plot (must at least include `loss` and `val_loss`)
      :param lr_sc: learning rate decay object with 'lr' attribute
      :param epoch_best: number of best epoch (counting starts at 0)

   .. py:method:: report_training(self)


.. py:class:: PostProcessing

   Bases: :py:obj:`mlair.run_modules.run_environment.RunEnvironment`

   Perform post-processing for performance evaluation.

   Schedule of post-processing:

   #. train an ordinary least squares model (ols) for reference
   #. create forecasts for nn, ols, and persistence
   #. evaluate feature importance with bootstrapped predictions
   #. calculate skill scores
   #. create plots

   Required objects [scope] from data store:

   * `model` [.]
     or locally saved model plus `model_name` [model] and `model` [model]
   * `generator` [train, val, test, train_val]
   * `forecast_path` [.]
   * `plot_path` [postprocessing]
   * `model_path` [.]
   * `target_var` [.]
   * `sampling` [.]
   * `output_shape` [model]
   * `evaluate_feature_importance` [postprocessing] and if enabled:

     * `create_new_bootstraps` [postprocessing]
     * `bootstrap_path` [postprocessing]
     * `number_of_bootstraps` [postprocessing]

   Optional objects

   * `batch_size` [model]

   Creates

   * forecasts in `forecast_path` if enabled
   * bootstraps in `bootstrap_path` if enabled
   * plots in `plot_path`

   .. py:method:: _run(self)

   .. py:method:: estimate_sample_uncertainty(self, separate_ahead=False)

      Estimate sample uncertainty by using a bootstrap approach. Forecasts are split into individual blocks along time and randomly drawn with replacement. The resulting behaviour of the error indicates the robustness of each analyzed model and helps to quantify which model might be superior to the others.

   .. py:method:: report_sample_uncertainty(self, percentiles: list = None)

      Store raw results of the uncertainty estimate, calculate aggregate statistics, and store these as raw data as well as markdown and latex.

   .. py:method:: calculate_block_mse(self, evaluate_competitors=True, separate_ahead=False, block_length='1m')

      Transform data into blocks along time axis. Block length can be any frequency like '1m' or '7d'. Data are only split along the time axis, which means that a single block can vary strongly in the number of stations or the amount of actual data it contains. This is intended to analyze not only the robustness against time but also against the number of observations and the diversity of stations.

   .. py:method:: create_error_array(self, data)

      Calculate squared error of all given time series in relation to observation.

   .. py:method:: create_full_time_dim(data, dim, sampling, start, end)
      :staticmethod:

      Ensure time dimension to be equidistant. Sometimes dates are missing if missing values have been dropped.

   ..
py:method:: load_competitors(self, station_name: str) -> xarray.DataArray Load all requested and available competitors for a given station. Forecasts must be available in the competitor path like `//forecasts__test.nc`. The naming style is equal for all forecasts of MLAir, so that forecasts of a different experiment can easily be copied into the competitor path without any change. :param station_name: station indicator to load competitors for :return: a single xarray with all competing forecasts .. py:method:: calculate_feature_importance(self, create_new_bootstraps: bool, _iter: int = 0, bootstrap_type='singleinput', bootstrap_method='shuffle') -> None Calculate skill scores of bootstrapped data. Create bootstrapped data if create_new_bootstraps is true or a failure occurred during skill score calculation (this will happen by default, if no bootstrapped data is available locally). Set class attribute bootstrap_skill_scores. This method is implemented in a recursive fashion, but is only allowed to call itself once. :param create_new_bootstraps: calculate all bootstrap predictions and overwrite already available predictions :param _iter: internal counter to reduce unnecessary recursive calls (maximum number is 2, otherwise something went wrong). .. py:method:: create_feature_importance_bootstrap_forecast(self, bootstrap_type, bootstrap_method) -> None Create bootstrapped predictions for all stations and variables. These forecasts are saved in bootstrap_path with the names `bootstraps_{var}_{station}.nc` and `bootstraps_labels_{station}.nc`. .. py:method:: calculate_feature_importance_skill_scores(self, bootstrap_type, bootstrap_method) -> Dict[str, xarray.DataArray] Calculate skill score of bootstrapped variables. Use already created bootstrap predictions and the original predictions (the not-bootstrapped ones) and calculate skill scores for the bootstraps. 
The result is saved as an xarray DataArray in a dictionary structure separated for each station (keys of dictionary). :return: The result dictionary with station-wise skill scores .. py:method:: get_distinct_branches_from_bootstrap_iter(bootstrap_iter) :staticmethod: .. py:method:: rename_boot_var_with_branch(self, boot_var, bootstrap_type, branch_names=None, expected_len=0) .. py:method:: get_orig_prediction(self, path, file_name, prediction_name=None, reference_name=None) .. py:method:: repeat_data(data, number_of_repetition) :staticmethod: .. py:method:: _get_model_name(self) Return model name without path information. .. py:method:: _load_model(self) -> mlair.model_modules.AbstractModelClass Load NN model either from data store or from local path. :return: the model .. py:method:: plot(self) Create all plots. Plots are defined in experiment setup by `plot_list`. By default, all of the following plots are enabled: * :py:class:`PlotBootstrapSkillScore ` * :py:class:`PlotConditionalQuantiles ` * :py:class:`PlotStationMap ` * :py:class:`PlotMonthlySummary ` * :py:class:`PlotClimatologicalSkillScore ` * :py:class:`PlotCompetitiveSkillScore ` * :py:class:`PlotTimeSeries ` * :py:class:`PlotAvailability ` .. note:: Bootstrap plots are only created if bootstraps are evaluated. .. py:method:: calculate_test_score(self) Evaluate test score of model and save locally. .. py:method:: train_ols_model(self) Train ordinary least squares model on train data. .. py:method:: setup_persistence(self) Check if persistence is requested from competitors and store this information. .. py:method:: make_prediction(self, subset) Create predictions for NN, OLS, and persistence and add true observation as reference. Predictions are filled in an array with full index range. Therefore, predictions can have missing values. All predictions for a single station are stored locally under `__test.nc` and can be found inside `forecast_path`. ..
py:method:: _get_frequency(self) -> str Get frequency abbreviation. .. py:method:: _create_competitor_forecast(self, station_name: str, competitor_name: str) -> xarray.DataArray Load and format the competing forecast of a distinct model indicated by `competitor_name` for a distinct station indicated by `station_name`. The name of the competitor is set in the `type` axis as indicator. This method will raise either a `FileNotFoundError` or `KeyError` if no competitor could be found for the given station. Either there is no file provided in the expected path or no forecast for the given `competitor_name` in the forecast file. The forecast is trimmed to the interval between start and end of the test subset. :param station_name: name of the station to load data for :param competitor_name: name of the model :return: the forecast of the given competitor .. py:method:: _create_observation(self, data, _, transformation_func: Callable, normalised: bool) -> xarray.DataArray Create observation as ground truth from given data. Inverse transformation is applied to the ground truth to get the output in the original space. :param data: observation :param transformation_func: a callable function to apply inverse transformation :param normalised: transform ground truth in original space if false, or use normalised predictions if true :return: filled data array with observation .. py:method:: _create_ols_forecast(self, input_data: xarray.DataArray, ols_prediction: xarray.DataArray, transformation_func: Callable, normalised: bool) -> xarray.DataArray Create ordinary least squares model forecast with given input data. Inverse transformation is applied to the forecast to get the output in the original space.
:param input_data: transposed history from DataPrep :param ols_prediction: empty array in right shape to fill with data :param transformation_func: a callable function to apply inverse transformation :param normalised: transform prediction in original space if false, or use normalised predictions if true :return: filled data array with ols predictions .. py:method:: _create_persistence_forecast(self, data, persistence_prediction: xarray.DataArray, transformation_func: Callable, normalised: bool) -> xarray.DataArray Create persistence forecast with given data. Persistence is derived from the value at t=0 and applied to all following time steps (t+1, ..., t+window). Inverse transformation is applied to the forecast to get the output in the original space. :param data: observation :param persistence_prediction: empty array in right shape to fill with data :param transformation_func: a callable function to apply inverse transformation :param normalised: transform prediction in original space if false, or use normalised predictions if true :return: filled data array with persistence predictions .. py:method:: _create_nn_forecast(self, nn_output: xarray.DataArray, nn_prediction: xarray.DataArray, transformation_func: Callable, normalised: bool) -> xarray.DataArray Create NN forecast for given input data. Inverse transformation is applied to the forecast to get the output in the original space. Furthermore, only the output of the main branch is returned (not all minor branches, if the network has multiple output branches). The main branch is defined to be the last entry of all outputs. :param nn_output: Full NN model output :param nn_prediction: empty array in right shape to fill with data :param transformation_func: a callable function to apply inverse transformation :param normalised: transform prediction in original space if false, or use normalised predictions if true :return: filled data array with nn predictions ..
py:method:: _create_empty_prediction_arrays(target_data, count=1) :staticmethod: Create array to collect all predictions. Expand target data by a station dimension. .. py:method:: create_fullindex(df: Union[xarray.DataArray, pandas.DataFrame, pandas.DatetimeIndex], freq: str) -> pandas.DataFrame :staticmethod: Create full index from first and last date inside df and resample with given frequency. :param df: use time range of this data set :param freq: frequency of full index :return: empty data frame with full index. .. py:method:: create_forecast_arrays(index: pandas.DataFrame, ahead_names: List[Union[str, int]], time_dimension, ahead_dim='ahead', index_dim='index', type_dim='type', **kwargs) :staticmethod: Combine different forecast types into single xarray. :param index: index for forecasts (e.g. time) :param ahead_names: names of ahead values (e.g. hours or days) :param kwargs: as xarrays; data of forecasts :return: xarray of dimension 3: index, ahead_names, # predictions .. py:method:: _get_internal_data(self, station: str, path: str) -> Union[xarray.DataArray, None] Get internal data for given station. Internal data is defined as data that is already known to the model. From an evaluation perspective, this refers to data that is not test data, and therefore to train and val data. :param station: name of station to load internal data. .. py:method:: _get_external_data(self, station: str, path: str) -> Union[xarray.DataArray, None] Get external data for given station. External data is defined as data that is not known to the model. From an evaluation perspective, this refers to data that is not train or val data, and therefore to test data. :param station: name of station to load external data. .. py:method:: _combine_forecasts(self, forecast, competitor, dim=None) Combine forecast and competitor if both are xarray. If competitor is None, this returns forecast, and vice versa. .. py:method:: calculate_bias_free_error_metrics(self) ..
py:method:: calculate_error_metrics(self) -> Tuple[Dict, Dict, Dict, Dict] Calculate error metrics and skill scores of NN forecast. The competitive skill score compares the NN prediction with persistence and ordinary least squares forecasts. The climatological skill scores, in contrast, evaluate the NN prediction in terms of meaningfulness in comparison to different climatological references. :return: competitive and climatological skill scores, error metrics .. py:method:: calculate_average_skill_scores(scores, counts) :staticmethod: .. py:method:: calculate_average_errors(errors) :staticmethod: .. py:method:: report_feature_importance_results(self, results) Create a csv file containing all results from feature importance. .. py:method:: report_error_metrics(self, errors, tag=None) .. py:method:: store_errors(self, errors) .. py:class:: AbstractModelClass(input_shape, output_shape) Bases: :py:obj:`abc.ABC` The AbstractModelClass provides a unified skeleton for any model provided to the machine learning workflow. The model can always be accessed by calling ModelClass.model or directly by a model method without passing the model attribute name (e.g. ModelClass.model.compile -> ModelClass.compile). Besides the model, this class provides the corresponding loss function. .. py:attribute:: _requirements :annotation: = [] .. py:method:: load_model(self, name: str, compile: bool = False) -> None .. py:method:: __getattr__(self, name: str) -> Any Is called if __getattribute__ is not able to find requested attribute. Normally, the model class is saved into a variable like `model = ModelClass()`. To bypass a call like `model.model` to access the _model attribute, this method tries to search for the named attribute in the self.model namespace and returns this attribute if available. Therefore, the following expression is true: `ModelClass().compile == ModelClass().model.compile` as long as the called attribute/method is not part of the ModelClass itself.
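The attribute delegation described for `__getattr__` can be sketched in a few lines. This is a minimal illustration under stated assumptions, not MLAir's implementation: `DummyModel` and `ModelWrapper` are hypothetical stand-ins for a keras model and a child of `AbstractModelClass`, chosen so that the example runs without tensorflow.

```python
class DummyModel:
    """Hypothetical stand-in for a keras model (assumption, not part of MLAir)."""

    def compile(self, **kwargs):
        # Return the sorted keyword names so the call is easy to inspect.
        return sorted(kwargs)


class ModelWrapper:
    """Hypothetical wrapper that forwards unknown attributes to its model."""

    def __init__(self):
        self._model = DummyModel()

    @property
    def model(self):
        return self._model

    def describe(self):
        # Defined on the wrapper itself, so __getattr__ is never consulted.
        return "wrapper"

    def __getattr__(self, name):
        # Only reached when the normal attribute lookup has failed.
        return getattr(self.model, name)


wrapper = ModelWrapper()
# Both expressions resolve to the same bound method of the wrapped model.
same_target = wrapper.compile == wrapper.model.compile
options = wrapper.compile(optimizer="sgd", loss="mse")
```

Because Python calls `__getattr__` only after the regular lookup fails, attributes defined on the wrapper itself always take precedence over those of the wrapped model.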
:param name: name of the attribute or method to call :return: attribute or method from self.model namespace .. py:method:: model(self) -> tensorflow.keras.Model :property: The model property containing a keras.Model instance. :return: the keras model .. py:method:: custom_objects(self) -> Dict :property: The custom objects property collects all non-keras utilities that are used in the model class. To load such a customised and already compiled model (e.g. from local disk), this information is required. :return: custom objects in a dictionary .. py:method:: compile_options(self) -> Dict :property: The compile options property allows the user to use all keras.compile() arguments. They can either be passed as dictionary (1), as attributes without setting compile_options (2), or as a mixture of both (partly defined as instance attributes and partly passed as a dictionary) (3). The method will raise an error when the same parameter is set differently. Example (1) Recommended (includes check for valid keywords which are used as args in keras.compile) .. code-block:: python def set_compile_options(self): self.compile_options = {"optimizer": keras.optimizers.SGD(), "loss": keras.losses.mean_squared_error, "metrics": ["mse", "mae"]} Example (2) .. code-block:: python def set_compile_options(self): self.optimizer = keras.optimizers.SGD() self.loss = keras.losses.mean_squared_error self.metrics = ["mse", "mae"] Example (3) Correct: .. code-block:: python def set_compile_options(self): self.optimizer = keras.optimizers.SGD() self.loss = keras.losses.mean_squared_error self.compile_options = {"metrics": ["mse", "mae"]} Incorrect: (Will raise an error) ..
code-block:: python def set_compile_options(self): self.optimizer = keras.optimizers.SGD() self.loss = keras.losses.mean_squared_error self.compile_options = {"optimizer": keras.optimizers.Adam(), "metrics": ["mse", "mae"]} Note: * As long as the attribute and the dict value have exactly the same values, the setter method will not raise an error * For example (2), no check is implemented whether the attributes are valid compile options :return: .. py:method:: __extract_from_tuple(tup) :staticmethod: Return element of tuple if it contains only a single element. .. py:method:: __compare_keras_optimizers(first, second) :staticmethod: Check whether the optimisers and all of their settings are exactly equal. :return: True if optimisers are interchangeable, or False if optimisers are distinguishable. .. py:method:: get_settings(self) -> Dict Get all class attributes that are not protected in the AbstractModelClass as dictionary. :return: all class attributes .. py:method:: set_model(self) :abstractmethod: Abstract method to set model. .. py:method:: set_compile_options(self) :abstractmethod: This method only has to be defined in the child class when additional compile options should be used (other options than optimizer and loss). Has to be set as dictionary: {'optimizer': None, 'loss': None, 'metrics': None, 'loss_weights': None, 'sample_weight_mode': None, 'weighted_metrics': None, 'target_tensors': None } :return: .. py:method:: set_custom_objects(self, **kwargs) -> None Set custom objects that are not part of keras framework. These custom objects are needed if an already compiled model is loaded from disk. There is a special treatment for the Padding2D class, which is a base class for different padding types. For a correct behaviour, all supported subclasses are added as custom objects in addition to the given ones. :param kwargs: all custom objects, that should be saved ..
py:method:: requirements(cls) :classmethod: Return requirements and own arguments without duplicates. .. py:method:: own_args(cls, *args) :classmethod: Return all arguments (including kwonlyargs). .. py:method:: super_args(cls) :classmethod: .. py:function:: get_version() .. py:data:: __version__ .. py:data:: __author__ :annotation: = Lukas H. Leufen, Felix Kleinert .. py:data:: __email__ :annotation: = ['l.leufen@fz-juelich.de']
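The consistency rule described for `compile_options` (attributes and dictionary entries may be mixed, but the same parameter must not be set to different values) can be sketched as follows. `merge_compile_options` is a hypothetical helper written for this illustration and is not MLAir's setter. Plain strings stand in for keras optimizers and losses, plain equality replaces the dedicated optimiser comparison, and `ValueError` is an assumed error type.

```python
def merge_compile_options(attributes, options):
    """Merge attribute-style and dict-style options (hypothetical helper).

    Raises ValueError (assumed error type) if one key has two different values.
    """
    # Ignore attributes that were never set.
    merged = {key: value for key, value in attributes.items() if value is not None}
    for key, value in options.items():
        if key in merged and merged[key] != value:
            raise ValueError(
                f"Conflicting values for '{key}': {merged[key]!r} vs {value!r}"
            )
        merged[key] = value
    return merged


# Mixture of attributes and dictionary without overlap: accepted.
mixed = merge_compile_options(
    {"optimizer": "sgd", "loss": "mse"}, {"metrics": ["mse", "mae"]}
)

# Identical values given twice: accepted as well.
repeated = merge_compile_options({"optimizer": "sgd"}, {"optimizer": "sgd"})

# The same parameter set differently: rejected.
try:
    merge_compile_options({"optimizer": "sgd"}, {"optimizer": "adam"})
    conflict_detected = False
except ValueError:
    conflict_detected = True
```

Real keras optimizer instances generally do not compare equal with `==` even when configured identically, which is presumably why the class documents a separate optimiser comparison method.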