mlair
¶
Subpackages¶
mlair.configuration
mlair.data_handler
mlair.data_handler.abstract_data_handler
mlair.data_handler.data_handler_mixed_sampling
mlair.data_handler.data_handler_neighbors
mlair.data_handler.data_handler_single_station
mlair.data_handler.data_handler_with_filter
mlair.data_handler.default_data_handler
mlair.data_handler.input_bootstraps
mlair.data_handler.iterator
mlair.helpers
mlair.keras_legacy
mlair.model_modules
mlair.model_modules.abstract_model_class
mlair.model_modules.advanced_paddings
mlair.model_modules.branched_input_networks
mlair.model_modules.convolutional_networks
mlair.model_modules.flatten
mlair.model_modules.fully_connected_networks
mlair.model_modules.inception_model
mlair.model_modules.keras_extensions
mlair.model_modules.linear_model
mlair.model_modules.loss
mlair.model_modules.model_class
mlair.model_modules.probability_models
mlair.model_modules.recurrent_networks
mlair.model_modules.residual_networks
mlair.model_modules.u_networks
mlair.plotting
mlair.reference_models
mlair.run_modules
mlair.workflows
Submodules¶
Package Contents¶
Classes¶
RunEnvironment | Basic run class to measure execution time.
ExperimentSetup | Set up the model.
PreProcessing | Pre-process your data by using this class.
ModelSetup | Set up the model.
Training | Train your model with this module.
PostProcessing | Perform post-processing for performance evaluation.
AbstractModelClass | The AbstractModelClass provides a unified skeleton for any model provided to the machine learning workflow.
Functions¶
Attributes¶
-
mlair.
__version_info__
¶
-
class
mlair.
RunEnvironment
(name=None, log_level_stream=None)¶ Bases:
object
Basic run class to measure execution time.
Either call this class by ‘with’ statement or delete the class instance after finishing the measurement. The duration result is logged.
>>> with RunEnvironment():
        <your code>
INFO: RunEnvironment started
...
INFO: RunEnvironment finished after 00:00:04 (hh:mm:ss)
If you want to embed your custom module in a RunEnvironment, you can simply call it inside the with statement. If you additionally want to exchange data between different modules, create your module as a subclass of RunEnvironment and call it after you have initialised the RunEnvironment itself.
class CustomClass(RunEnvironment):

    def __init__(self):
        super().__init__()
        ...
    ...

>>> with RunEnvironment():
        CustomClass()
INFO: RunEnvironment started
INFO: CustomClass started
INFO: CustomClass finished after 00:00:04 (hh:mm:ss)
INFO: RunEnvironment finished after 00:00:04 (hh:mm:ss)
All data that is stored in the data store will be available for all other modules that inherit from RunEnvironment as long as the RunEnvironment base class is running. If the base class is deleted, either by hand or on exit of the with statement, this storage is cleared.
class CustomClassA(RunEnvironment):

    def __init__(self):
        super().__init__()
        self.data_store.set("testVar", 12)


class CustomClassB(RunEnvironment):

    def __init__(self):
        super().__init__()
        self.test_var = self.data_store.get("testVar")
        logging.info(f"testVar = {self.test_var}")

>>> with RunEnvironment():
        CustomClassA()
        CustomClassB()
INFO: RunEnvironment started
INFO: CustomClassA started
INFO: CustomClassA finished after 00:00:01 (hh:mm:ss)
INFO: CustomClassB started
INFO: testVar = 12
INFO: CustomClassB finished after 00:00:02 (hh:mm:ss)
INFO: RunEnvironment finished after 00:00:03 (hh:mm:ss)
-
del_by_exit
= False¶
-
data_store
¶
-
logger
¶
-
tracker_list
= []¶
-
__del__
(self)¶ Finalise class.
Only stop time tracking if it was not already stopped by the exit method, to prevent duplicated logging when this class was used in a with statement (__exit__ is always executed before __del__). If the instance is the base class itself and not a subclass of it, the log file is copied and the data store is cleared.
-
__enter__
(self)¶ Enter run environment.
-
__exit__
(self, exc_type, exc_val, exc_tb)¶ Exit run environment.
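The interplay of __enter__, __exit__, __del__ and the del_by_exit flag can be illustrated with a minimal, self-contained sketch. TimedEnvironment is a hypothetical stand-in written for this illustration; it is not the RunEnvironment implementation and it omits the data store, log file handling and tracking.

```python
import logging
import time


class TimedEnvironment:
    """Hypothetical stand-in for a run class that logs its duration once."""

    def __init__(self, name=None):
        self.name = name or self.__class__.__name__
        self.del_by_exit = False  # becomes True once __exit__ has reported
        self.duration = None
        self._start = time.perf_counter()
        logging.info(f"{self.name} started")

    def _stop(self):
        # Report the duration exactly once, whichever hook fires first.
        if self.duration is None:
            self.duration = time.perf_counter() - self._start
            logging.info(f"{self.name} finished after {self.duration:.2f} s")

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self._stop()
        self.del_by_exit = True

    def __del__(self):
        # Skip if __exit__ already logged, to prevent duplicated output.
        if not self.del_by_exit:
            self._stop()


with TimedEnvironment("demo") as env:
    time.sleep(0.01)
```

After the with block, env.duration holds the measured time and env.del_by_exit is True, so a later garbage collection of env does not log a second time.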
-
__move_log_file
(self)¶
-
__save_tracking
(self)¶
-
__plot_tracking
(self)¶
-
__find_file_pattern
(self, name)¶
-
classmethod
update_datastore
(cls, new_data_store: mlair.helpers.datastore.DataStoreByScope, excluded_params=None, apply_full_replacement=False)¶
-
static
do_stuff
(length=2)¶ Just a placeholder method for testing without any sense.
-
-
class
mlair.
ExperimentSetup
(experiment_date=None, stations: Union[str, List[str]] = None, variables: Union[str, List[str]] = None, statistics_per_var: Dict = None, start: str = None, end: str = None, window_history_size: int = None, target_var='o3', target_dim=None, window_lead_time: int = None, window_dim=None, dimensions=None, time_dim=None, iter_dim=None, interpolation_method=None, interpolation_limit=None, train_start=None, train_end=None, val_start=None, val_end=None, test_start=None, test_end=None, use_all_stations_on_all_data_sets=None, train_model: bool = None, fraction_of_train: float = None, experiment_path=None, plot_path: str = None, forecast_path: str = None, overwrite_local_data=None, sampling: str = None, create_new_model=None, bootstrap_path=None, permute_data_on_training=None, transformation=None, train_min_length=None, val_min_length=None, test_min_length=None, extreme_values: list = None, extremes_on_right_tail_only: bool = None, evaluate_feature_importance: bool = None, plot_list=None, feature_importance_n_boots: int = None, feature_importance_create_new_bootstraps: bool = None, feature_importance_bootstrap_method=None, feature_importance_bootstrap_type=None, data_path: str = None, batch_path: str = None, login_nodes=None, hpc_hosts=None, model=None, batch_size=None, epochs=None, early_stopping_epochs: int = None, restore_best_model_weights: bool = None, data_handler=None, data_origin: Dict = None, competitors: list = None, competitor_path: str = None, use_multiprocessing: bool = None, use_multiprocessing_on_debug: bool = None, max_number_multiprocessing: int = None, start_script: Union[Callable, str] = None, overwrite_lazy_data: bool = None, uncertainty_estimate_block_length: str = None, uncertainty_estimate_evaluate_competitors: bool = None, uncertainty_estimate_n_boots: int = None, do_uncertainty_estimate: bool = None, do_bias_free_evaluation: bool = None, model_display_name: str = None, transformation_file: str = None, calculate_fresh_transformation: bool = None, 
snapshot_load_path: str = None, create_snapshot: bool = None, snapshot_path: str = None, model_path: str = None, **kwargs)¶ Bases:
mlair.run_modules.run_environment.RunEnvironment
Set up the model.
- Schedule of experiment setup:
set up experiment path
set up data path (according to host system)
set up forecast, bootstrap and plot path (inside experiment path)
set all parameters given in args (or use default values)
check target variable
check variables and statistics_per_var parameter for consistency
- Sets
data_path [.]
create_new_model [.]
bootstrap_path [.]
train_model [.]
fraction_of_training [.]
extreme_values [train]
extremes_on_right_tail_only [train]
upsampling [train]
permute_data [train]
experiment_name [.]
experiment_path [.]
plot_path [.]
forecast_path [.]
stations [.]
statistics_per_var [.]
variables [.]
start [.]
end [.]
window_history_size [.]
overwrite_local_data [preprocessing]
sampling [.]
transformation [., preprocessing]
target_var [.]
target_dim [.]
window_lead_time [.]
- Creates
plot of model architecture in <model_name>.pdf
- Parameters
parser_args – argument parser, currently only accepting the experiment_date argument to be used for the experiment’s name and path creation. Final experiment’s name is derived from given name and the time series sampling as <name>_network_<sampling>/ . All interim and final results, logging, plots, … of this run are stored in this directory if not explicitly provided in kwargs. Only the data itself and data for bootstrap investigations are stored outside this structure.
stations – list of stations or single station to use in experiment. If not provided, stations are set to default stations.
variables – list of all variables to use. Valid names can be found in Section 2.1 Parameters. If not provided, this parameter is filled with keys from statistics_per_var.
statistics_per_var – dictionary with statistics to use for variables (if data is daily and loaded from JOIN). If not provided, default statistics is applied. statistics_per_var is compared with given variables and unused variables are removed. Therefore, statistics at least need to provide all variables from variables. For more details on available statistics, we refer to Section 3.3 List of statistics/metrics for stats service in the JOIN documentation. Valid parameter names can be found in Section 2.1 Parameters.
start – start date of overall data (default “1997-01-01”)
end – end date of overall data (default “2017-12-31”)
window_history_size – number of time steps to use for input data (default 13). Time steps t_0 - w to t_0 are used as input data (therefore actual data size is w+1).
target_var – target variable to predict by model, currently only a single target variable is supported. Because this framework was originally designed to predict ozone, default is “o3”.
target_dim – dimension of target variable (default “variables”).
window_lead_time – number of time steps to predict by model (default 3). Time steps t_0+1 to t_0+w are predicted.
dimensions –
time_dim –
interpolation_method – The method to use for interpolation.
interpolation_limit – The maximum number of subsequent time steps in a gap to fill by interpolation. If the gap exceeds this number, the gap is not filled by interpolation at all. The value is a plain number of time steps, so its meaning depends on the sampling frequency: a limit of 2 allows either 2 hours or 2 days to be interpolated, depending on the set sampling rate.
train_start –
train_end –
val_start –
val_end –
test_start –
test_end –
use_all_stations_on_all_data_sets –
train_model – train a new model from scratch or resume training with an existing model if True (default), or freeze the loaded model and do not perform any modification on it if False. train_model is set to True if create_new_model is True.
fraction_of_train – given value is used to split between test data and train data (including validation data). The value of fraction_of_train must be in (0, 1) but is recommended to be in the interval [0.6, 0.9]. Default value is 0.8. Split between train and validation is fixed to 80% - 20% and currently not changeable.
experiment_path –
plot_path – path to save all plots. If left blank, this will be included in the experiment path (recommended). Otherwise customise the location to save all plots.
forecast_path – path to save all forecasts in files. It is recommended to leave this parameter blank; all forecasts will then be stored in the directory forecasts inside the experiment path (default). For customisation, add your path here.
overwrite_local_data – Reload input and target data from web and replace local data if True (default False).
sampling – set temporal sampling rate of data. You can choose from daily (default), monthly, seasonal, vegseason, summer and annual for aggregated values and hourly for the actual values. Note that hourly values on JOIN are currently not accessible from outside. To access this data, you need to add your personal token in join settings and make sure to untrack this file!
create_new_model – determine whether a new model will be created (True, default) or not (False). If this parameter is set to False, make sure that a suitable model already exists in the experiment path. This model must fit in terms of input and output dimensions as well as window_history_size and window_lead_time and must be implemented as a model class and imported in model setup. If create_new_model is True, parameter train_model is automatically set to True too.
bootstrap_path –
permute_data_on_training – shuffle train data individually for each station if True. The shuffling is performed anew in each iteration, so that each sample very likely differs from epoch to epoch. Train data permutation is disabled (False) per default. In the case of extreme value manifolding, data permutation is enabled anyway.
transformation – set transformation options in dictionary style. All information about transformation options can be found in setup transformation. If no transformation is provided, all options are set to default transformation.
train_min_length –
val_min_length –
test_min_length –
extreme_values – augment target samples with values of lower occurrences, indicated by their normalised deviation from the mean, by manifolding. These extreme values need to be indicated by a list of thresholds. For each entry in this list, all values outside a +/- interval will be added to the training (and only the training) set for a second time. If multiple values are given, a sample is added once for each exceedance. E.g. a sample with value=2.5 occurs twice in the training set for given extreme_values=[2, 3], whereas a sample with value=5 occurs three times in the training set. By default, upsampling of extreme values is disabled (None). Upsampling can be modified to manifold only values that are actually larger than given values from extreme_values (apply only on right side of distribution) by using extremes_on_right_tail_only. This can be useful for positively skewed variables.
extremes_on_right_tail_only – applies only if extreme_values are given. If extremes_on_right_tail_only is True, only manifold values that are larger than given extremes (apply upsampling only on right side of distribution). In default mode, this is set to False to manifold extremes on both sides.
evaluate_bootstraps –
plot_list –
number_of_bootstraps –
create_new_bootstraps –
data_path – path to find and store meteorological and environmental / air quality data. Leave this parameter empty, if your host system is known and a suitable path was already hardcoded in the program (see prepare host).
experiment_date –
window_dim – “Temporal” dimension of the input and target data that is provided for each sample. The number of time steps provided in this dimension can be set using window_history_size for inputs and window_lead_time on the target side.
iter_dim –
batch_path –
login_nodes –
hpc_hosts –
model –
batch_size –
epochs – Number of epochs used in training. If a training is resumed and the number of epochs of the already (partly) trained model is lower than this parameter, training is continued. In case this number is higher than the given epochs parameter, no training is resumed. Epochs is set to 20 per default, but this value is just a placeholder that should be adjusted for a meaningful training.
early_stopping_epochs – number of consecutive epochs with no improvement on val loss to stop training. When set to np.inf or not provided at all, training is not stopped before reaching epochs.
restore_best_model_weights – indicates whether to use model state with best val loss (if True) or model state on ending of training (if False). The latter depends on the parameters epochs and early_stopping_epochs which trigger stopping of training.
data_handler –
data_origin –
competitors – Provide names of reference models trained by MLAir that can be found in the competitor_path. These models will be used in the postprocessing for comparison.
competitor_path – The path where MLAir can find competing models. If not provided, this path is assumed to be in the data_path directory as a subdirectory called competitors (default).
use_multiprocessing – Enable parallel preprocessing (postprocessing not implemented yet) by setting this parameter to True (default). If set to False, the computation is performed in a serial approach. Multiprocessing is disabled when running in debug mode and cannot be switched on.
transformation_file – Use transformation options from this file for transformation
calculate_fresh_transformation – can either be True or False, indicates if new transformation options should be calculated in any case (transformation_file is not used in this case!).
snapshot_path – path to store snapshot of current run (default inside experiment path)
create_snapshot – indicate if a snapshot is taken from current run or not (default False)
snapshot_load_path – path to load a snapshot from (default None). In contrast to snapshot_path, which is only for storing a snapshot, snapshot_load_path indicates where to load the snapshot from. If this parameter is not provided at all, no snapshot is loaded. Note, the workflow will apply the default preprocessing without loading a snapshot only if this parameter is None!
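The counting rule given for extreme_values can be checked with a small sketch. The function count_occurrences is hypothetical and only reproduces the arithmetic of the documented example on normalised values; it is not how MLAir augments its data collections internally.

```python
def count_occurrences(value, extreme_values, extremes_on_right_tail_only=False):
    """Return how often a normalised sample appears in the training set.

    Assumption for this sketch: a sample is kept once and added one more
    time for every threshold it exceeds.
    """
    # On both tails the absolute deviation counts; on the right tail only
    # values above the threshold are manifolded.
    magnitude = value if extremes_on_right_tail_only else abs(value)
    return 1 + sum(magnitude > threshold for threshold in extreme_values)


print(count_occurrences(2.5, [2, 3]))   # exceeds 2 only
print(count_occurrences(5, [2, 3]))     # exceeds 2 and 3
print(count_occurrences(-2.5, [2, 3], extremes_on_right_tail_only=True))
```

With extreme_values=[2, 3], the value 2.5 yields two occurrences and the value 5 yields three, matching the description of the parameter. The negative value is not duplicated when only the right tail is considered.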
-
_set_param
(self, param: str, value: Any, default: Any = None, scope: str = 'general', apply: Callable = None) → Any¶ Set given parameter and log in debug. Use apply parameter to adjust the stored value (e.g. to transform value to a list use apply=helpers.to_list).
-
static
_store_start_script
(start_script, store_path)¶
-
_compare_variables_and_statistics
(self)¶ Compare variables and statistics.
raise error, if a variable is missing.
remove unused variables from statistics.
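The two steps of this check can be sketched as a stand-alone function. The name, the signature and the choice of ValueError are assumptions made for illustration; the method itself works on the parameters held in the data store.

```python
def compare_variables_and_statistics(variables, statistics_per_var):
    """Hypothetical stand-alone version of the consistency check."""
    # Step 1: every requested variable needs a statistic.
    missing = [var for var in variables if var not in statistics_per_var]
    if missing:
        raise ValueError(f"No statistic given for variables: {missing}")
    # Step 2: drop statistics of variables that are not used.
    return {var: statistics_per_var[var] for var in variables}


stats = {"o3": "dma8eu", "temp": "maximum", "no2": "dma8eu"}
print(compare_variables_and_statistics(["o3", "temp"], stats))
```

The returned dictionary contains only the entries for o3 and temp; the unused no2 entry is removed.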
-
_check_target_var
(self)¶ Check if target variable is in statistics_per_var dictionary.
-
class
mlair.
PreProcessing
¶ Bases:
mlair.run_modules.run_environment.RunEnvironment
Pre-process your data by using this class.
- Schedule of pre-processing:
load and check valid stations (either download or load from disk)
split subsets (train, val, test, train & val)
create small report on data metrics
- Required objects [scope] from data store:
all elements from DEFAULT_ARGS_LIST in scope preprocessing for general data loading
all elements from DEFAULT_ARGS_LIST in scopes [train, val, test, train_val] for custom subset settings
fraction_of_training [.]
experiment_path [.]
use_all_stations_on_all_data_sets [.]
- Optional objects
all elements from DEFAULT_KWARGS_LIST in scope preprocessing for general data loading
all elements from DEFAULT_KWARGS_LIST in scopes [train, val, test, train_val] for custom subset settings
- Sets
stations in [., train, val, test, train_val]
generator in [train, val, test, train_val]
transformation [.]
- Creates
all input and output data in data_path
latex reports in experiment_path/latex_report
-
_run
(self)¶
-
report_pre_processing
(self)¶ Log some metrics on data and create latex report.
-
create_latex_report
(self)¶ Create tables with information on the station meta data and a summary on subset sample sizes.
station_sample_size.md: see table below as markdown
station_sample_size.tex: same as table below as latex table
station_sample_size_short.tex: reduced size table without any meta data besides station ID, as latex table
All tables are stored inside experiment_path inside the folder latex_report. The table format (e.g. which meta data is highlighted) is currently hardcoded to have a stable table style. If further styles are needed, it is better to add an additional style than modifying the existing table styles.
| stat. ID | station_name | station_lon | station_lat | station_alt | train | val | test |
|---|---|---|---|---|---|---|---|
| DEBW013 | Stuttgart Bad Cannstatt | 9.2297 | 48.8088 | 235 | 1434 | 712 | 1080 |
| DEBW076 | Baden-Baden | 8.2202 | 48.7731 | 148 | 3037 | 722 | 710 |
| DEBW087 | Schwäbische_Alb | 9.2076 | 48.3458 | 798 | 3044 | 714 | 1087 |
| DEBW107 | Tübingen | 9.0512 | 48.5077 | 325 | 1803 | 715 | 1087 |
| DEBY081 | Garmisch-Partenkirchen/Kreuzeckbahnstraße | 11.0631 | 47.4764 | 735 | 2935 | 525 | 714 |
| # Stations | nan | nan | nan | nan | 6 | 6 | 6 |
| # Samples | nan | nan | nan | nan | 12253 | 3388 | 4678 |
-
create_info_df
(self, meta_cols, meta_round, names_of_set, precision)¶
-
split_train_val_test
(self) → None¶ Split data into subsets.
Currently: train, val, test and train_val (actually this is only the merge of train and val, but as a separate data_collection). IMPORTANT: Do not change the order of execution of create_set_split. The train subset must always be executed first to set a proper transformation.
-
static
split_set_indices
(total_length: int, fraction: float) → Tuple[slice, slice, slice, slice]¶ Create the training, validation and test subset slice indices for given total_length.
The test data consists of (1-fraction) of total_length (fraction*len:end). Train and validation data are therefore made from fraction of total_length (0:fraction*len). Train and validation data are split by the factor 0.8 for train and 0.2 for validation. In addition, split_set_indices also returns the combination of training and validation subset.
- Parameters
total_length – total number of objects to split
fraction – ratio between test and union of train/val data
- Returns
slices for each subset in the order: train, val, test, train_val
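A minimal sketch of the described split is given below. It follows the documented behaviour (test after fraction*len, 0.8/0.2 split between train and val, return order train, val, test, train_val); rounding down with int() is an assumption of this sketch.

```python
from typing import Tuple


def split_set_indices(total_length: int, fraction: float) -> Tuple[slice, slice, slice, slice]:
    """Sketch of the subset slicing described above."""
    # Everything before this position belongs to train and val together.
    pos_test_split = int(total_length * fraction)
    train_val_index = slice(0, pos_test_split)
    test_index = slice(pos_test_split, total_length)
    # Train and val share the first part in a fixed 0.8 / 0.2 ratio.
    pos_train_split = int(pos_test_split * 0.8)
    train_index = slice(0, pos_train_split)
    val_index = slice(pos_train_split, pos_test_split)
    return train_index, val_index, test_index, train_val_index


stations = [f"station_{i:03d}" for i in range(100)]
train, val, test, train_val = split_set_indices(len(stations), 0.8)
print(len(stations[train]), len(stations[val]), len(stations[test]), len(stations[train_val]))
```

For 100 objects and fraction=0.8 this yields 64 train, 16 val and 20 test objects, with train_val covering the first 80.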
-
validate_station
(self, data_handler: mlair.data_handler.AbstractDataHandler, set_stations, set_name=None, store_processed_data=True)¶ Check if all given stations in all_stations are valid.
Valid means, that there is data available for the given time range (is included in kwargs). The shape and the loading time are logged in debug mode.
- Returns
Corrected list containing only valid station IDs.
-
store_data_handler_attributes
(self, data_handler, collection)¶
-
_store_apriori
(self)¶
-
_load_apriori
(self)¶
-
transformation
(self, data_handler: mlair.data_handler.AbstractDataHandler, stations)¶
-
_load_transformation
(self)¶ Try to load transformation options from file if transformation_file is provided.
-
_store_transformation
(self, transformation_opts)¶ Store transformation options locally inside experiment_path if not exists already.
-
prepare_competitors
(self)¶ Prepare competitor models already in the preprocessing stage. This is performed here because some models might need internet access, which, depending on the operating system, may not be possible during postprocessing. Currently, this method only checks whether the Intelli03-ts-v1 model is requested as a competitor and downloads the data if required.
-
create_snapshot
(self)¶
-
load_snapshot
(self, file)¶
-
class
mlair.
ModelSetup
¶ Bases:
mlair.run_modules.run_environment.RunEnvironment
Set up the model.
- Schedule of model setup:
set channels (from variables dimension)
build imported model
plot model architecture
load weights if enabled (e.g. to resume a training)
set callbacks and checkpoint
compile model
- Required objects [scope] from data store:
experiment_path [.]
experiment_name [.]
train_model [.]
create_new_model [.]
generator [train]
model_class [.]
- Optional objects
lr_decay [model]
- Sets
channels [model]
model [model]
hist [model]
callbacks [model]
model_name [model]
all settings from model class like dropout_rate, initial_lr, and optimizer [model]
- Creates
plot of model architecture <model_name>.pdf
-
_run
(self)¶
-
_set_model_path
(self)¶
-
_set_shapes
(self)¶ Set input and output shapes from train collection.
-
_set_num_of_training_samples
(self)¶ Set number of training samples - needed for example for Bayesian NNs
-
compile_model
(self)¶ Compiles the keras model. Compile options are mandatory and have to be set by implementing set_compile() method in child class of AbstractModelClass.
-
_set_callbacks
(self)¶ Set all callbacks for the training phase.
Add all callbacks with the .add_callback statement. Finally, the advanced model checkpoint is added.
-
copy_model
(self)¶ Copy external model to internal experiment structure.
-
load_model
(self)¶ Try to load model from disk or skip if not possible.
-
build_model
(self)¶ Build model using input and output shapes from data store.
-
broadcast_custom_objects
(self)¶ Broadcast custom objects to keras utils.
This method is very important, because it adds the model’s custom objects to the keras utils. By doing so, all custom objects can be treated as standard keras modules. Therefore, problems related to model or callback loading are solved.
-
get_model_settings
(self)¶ Load all model settings and store in data store.
-
plot_model
(self)¶ Plot model architecture as <model_name>.pdf.
-
report_model
(self)¶
-
class
mlair.
Training
¶ Bases:
mlair.run_modules.run_environment.RunEnvironment
Train your model with this module.
This module does not need to run if only a fresh post-processing is performed. Either remove the training call from your run script or set create_new_model and train_model both to False.
- Schedule of training:
set_generators(): set generators for training, validation and testing and distribute according to batch size
make_predict_function(): create predict function before distribution on multiple nodes (detailed information in method description)
train(): start or resume training of model and save callbacks
save_model(): save best model from training as final model
- Required objects [scope] from data store:
model [model]
batch_size [.]
epochs [.]
callbacks [model]
model_name [model]
experiment_name [.]
experiment_path [.]
train_model [.]
create_new_model [.]
generator [train, val, test]
plot_path [.]
- Optional objects
permute_data [train, val, test]
upsampling [train, val, test]
- Sets
model [.]
- Creates
<exp_name>_model-best.h5
<exp_name>_model-best-callbacks-<name>.h5 (all callbacks from CallbackHandler)
history.json
history_lr.json (optional)
<exp_name>_history_<name>.pdf (different monitoring plots depending on loss metrics and callbacks)
-
make_predict_function
(self) → None¶ Create predict function.
Must be called before distributing. This is necessary because tf compiles the predict function only at the moment it is used for the first time. This can cause problems if the model is distributed on different workers. To prevent this, the function is pre-compiled. See discussion @ https://stackoverflow.com/questions/40850089/is-keras-thread-safe/43393252#43393252
-
_set_gen
(self, mode: str) → None¶ Set and distribute the generators for given mode regarding batch size.
- Parameters
mode – name of set, should be from [“train”, “val”, “test”]
-
set_generators
(self) → None¶ Set all generators for training, validation, and testing subsets.
The called sub-method will automatically distribute the data according to the batch size. The subsets can be accessed as class variables train_set, val_set, and test_set.
-
train
(self) → None¶ Perform training using keras fit().
Callbacks are stored locally in the experiment directory. The best model from training is saved in the class variable model. If the file path of the checkpoint is not empty, this method assumes that this is not a new training starting from the very beginning, but a resumption of a previously started but interrupted training (or a stopped and now continued training). Train will automatically load the locally stored information and the corresponding model and proceed with the already started training.
-
save_model
(self) → None¶ Save model in local experiment directory. Model is named as <experiment_name>_<custom_model_name>.h5.
-
save_callbacks_as_json
(self, history: tensorflow.keras.callbacks.Callback, lr_sc: tensorflow.keras.callbacks.Callback, epo_timing: tensorflow.keras.callbacks.Callback) → None¶ Save callbacks (history, learning rate) of training.
history.history -> history.json
lr_sc.lr -> history_lr.json
- Parameters
history – history object of training
lr_sc – learning rate object
-
create_monitoring_plots
(self, history: tensorflow.keras.callbacks.Callback, lr_sc: tensorflow.keras.callbacks.Callback, epoch_best: int = None) → None¶ Create plots of history and learning rate as a function of the number of epochs.
The plots are saved in the experiment’s plot_path. History plot is named <exp_name>_history_loss_val_loss.pdf, the learning rate with <exp_name>_history_learning_rate.pdf.
- Parameters
history – keras history object with losses to plot (must at least include loss and val_loss)
lr_sc – learning rate decay object with ‘lr’ attribute
epoch_best – number of best epoch (starts counting as 0)
-
report_training
(self)¶
-
class
mlair.
PostProcessing
¶ Bases:
mlair.run_modules.run_environment.RunEnvironment
Perform post-processing for performance evaluation.
- Schedule of post-processing:
train an ordinary least squares model (ols) for reference
create forecasts for nn, ols, and persistence
evaluate feature importance with bootstrapped predictions
calculate skill scores
create plots
- Required objects [scope] from data store:
model [.] or locally saved model plus model_name [model] and model [model]
generator [train, val, test, train_val]
forecast_path [.]
plot_path [postprocessing]
model_path [.]
target_var [.]
sampling [.]
output_shape [model]
evaluate_feature_importance [postprocessing] and if enabled:
create_new_bootstraps [postprocessing]
bootstrap_path [postprocessing]
number_of_bootstraps [postprocessing]
- Optional objects
batch_size [model]
- Creates
forecasts in forecast_path if enabled
bootstraps in bootstrap_path if enabled
plots in plot_path
-
_run
(self)¶
-
estimate_sample_uncertainty
(self, separate_ahead=False)¶ Estimate sample uncertainty by using a bootstrap approach. Forecasts are split into individual blocks along time and randomly drawn with replacement. The resulting behaviour of the error indicates the robustness of each analyzed model to quantify which model might be superior compared to others.
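The bootstrap idea can be sketched with the standard library only. The function bootstrap_block_errors is hypothetical: it assumes that one error value per time block (for example a block MSE) is already available as a plain list and resamples these blocks with replacement. It does not reproduce the xarray-based processing of the method itself.

```python
import random
import statistics


def bootstrap_block_errors(block_errors, n_boots=1000, seed=0):
    """Return n_boots mean errors, each from blocks drawn with replacement."""
    rng = random.Random(seed)  # fixed seed only to keep the sketch reproducible
    n_blocks = len(block_errors)
    return [
        statistics.fmean(rng.choices(block_errors, k=n_blocks))
        for _ in range(n_boots)
    ]


# Made-up block errors of two models over the same five time blocks.
errors_model_a = [1.2, 0.9, 1.5, 1.1, 1.0]
errors_model_b = [1.4, 1.3, 1.6, 1.2, 1.5]
boots_a = sorted(bootstrap_block_errors(errors_model_a))
boots_b = sorted(bootstrap_block_errors(errors_model_b))
# The spread of each list indicates how robust the mean error is.
print(boots_a[0], boots_a[-1])
print(boots_b[0], boots_b[-1])
```

Comparing the two resulting distributions (for example via percentiles) gives an impression of whether one model is consistently better or whether the difference depends on which blocks happen to be drawn.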
-
report_sample_uncertainty
(self, percentiles: list = None)¶ Store raw results of uncertainty estimate and calculate aggregate statistics and store as raw data but also as markdown and latex.
-
calculate_block_mse
(self, evaluate_competitors=True, separate_ahead=False, block_length='1m')¶ Transform data into blocks along time axis. Block length can be any frequency like ‘1m’ or ‘7d’. Data are only split along the time axis, which means that a single block can vary widely in the number of stations or actual data contained. This is intended to analyze not only the robustness against the time but also against the number of observations and diversity of stations.
-
create_error_array
(self, data)¶ Calculate squared error of all given time series in relation to observation.
-
static
create_full_time_dim
(data, dim, sampling, start, end)¶ Ensure time dimension to be equidistant. Sometimes dates are missing if missing values have been dropped.
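What an equidistant time dimension means can be shown with a simplified sketch for daily data. The function fill_daily_gaps and the dictionary-based data layout are assumptions for illustration; the method itself receives a data object with a named dimension and a sampling argument.

```python
from datetime import date, timedelta


def fill_daily_gaps(values_by_date, start, end):
    """Return a mapping with one entry per day between start and end.

    Days without a value are kept as None, so the time axis has no holes.
    """
    n_days = (end - start).days + 1
    full_index = [start + timedelta(days=i) for i in range(n_days)]
    return {day: values_by_date.get(day) for day in full_index}


# 2 January is absent because its missing value was dropped beforehand.
sparse = {date(2020, 1, 1): 41.0, date(2020, 1, 3): 38.5}
full = fill_daily_gaps(sparse, date(2020, 1, 1), date(2020, 1, 4))
for day, value in full.items():
    print(day.isoformat(), value)
```

The result has four consecutive daily entries, with None on the days that had no value in the sparse input.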
-
load_competitors
(self, station_name: str) → xarray.DataArray¶ Load all requested and available competitors for a given station. Forecasts must be available in the competitor path like <competitor_path>/<target_var>/forecasts_<station_name>_test.nc. The naming style is equal for all forecasts of MLAir, so that forecasts of a different experiment can easily be copied into the competitor path without any change.
- Parameters
station_name – station indicator to load competitors for
- Returns
a single xarray with all competing forecasts
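The expected file location can be assembled as in the following sketch. The helper competitor_forecast_path and the example values are hypothetical; only the naming pattern <competitor_path>/<target_var>/forecasts_<station_name>_test.nc is taken from the description above.

```python
import os


def competitor_forecast_path(competitor_path, target_var, station_name):
    """Build the forecast file path following the documented naming pattern."""
    file_name = f"forecasts_{station_name}_test.nc"
    return os.path.join(competitor_path, target_var, file_name)


# Example values are made up for illustration.
print(competitor_forecast_path("data/competitors", "o3", "DEBW107"))
```

Because forecasts of any MLAir experiment follow the same file naming, copying such files into the corresponding <target_var> subdirectory is enough to make them available as competitors.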
-
calculate_feature_importance
(self, create_new_bootstraps: bool, _iter: int = 0, bootstrap_type='singleinput', bootstrap_method='shuffle') → None¶ Calculate skill scores of bootstrapped data.
Create bootstrapped data if create_new_bootstraps is true or a failure occurred during skill score calculation (this will happen by default, if no bootstrapped data is available locally). Set class attribute bootstrap_skill_scores. This method is implemented in a recursive fashion, but is only allowed to call itself once.
- Parameters
create_new_bootstraps – calculate all bootstrap predictions and overwrite already available predictions
_iter – internal counter to reduce unnecessary recursive calls (maximum number is 2, otherwise something went wrong).
-
create_feature_importance_bootstrap_forecast
(self, bootstrap_type, bootstrap_method) → None¶ Create bootstrapped predictions for all stations and variables.
These forecasts are saved in bootstrap_path with the names bootstraps_{var}_{station}.nc and bootstraps_labels_{station}.nc.
-
calculate_feature_importance_skill_scores
(self, bootstrap_type, bootstrap_method) → Dict[str, xarray.DataArray]¶ Calculate skill score of bootstrapped variables.
Use already created bootstrap predictions and the original predictions (the not-bootstrapped ones) and calculate skill scores for the bootstraps. The result is saved as an xarray DataArray in a dictionary structure separated for each station (keys of dictionary).
- Returns
The result dictionary with station-wise skill scores
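The description does not state which skill score formula is used. As an illustration only, the sketch below assumes the common MSE-based definition, 1 - MSE(forecast) / MSE(reference), with the original prediction as reference and the bootstrapped prediction as forecast. All names and numbers are hypothetical.

```python
def mse(forecast, observation):
    """Mean squared error of two equally long sequences."""
    return sum((f - o) ** 2 for f, o in zip(forecast, observation)) / len(observation)


def skill_score(forecast, reference, observation):
    """MSE-based skill score (assumed definition, see lead-in).

    Positive: forecast beats reference; negative: forecast is worse.
    Undefined if the reference is perfect (division by zero).
    """
    return 1.0 - mse(forecast, observation) / mse(reference, observation)


observation = [30.0, 42.0, 51.0, 38.0]
original = [32.0, 40.0, 49.0, 39.0]      # prediction with intact inputs
bootstrapped = [36.0, 36.0, 44.0, 43.0]  # prediction with one input shuffled
print(skill_score(bootstrapped, original, observation))
```

Under this definition, a strongly negative score for a bootstrapped variable means the prediction degrades a lot once that variable's information is destroyed, which hints at a high importance of the variable.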
-
static
get_distinct_branches_from_bootstrap_iter
(bootstrap_iter)¶
-
rename_boot_var_with_branch
(self, boot_var, bootstrap_type, branch_names=None, expected_len=0)¶
-
get_orig_prediction
(self, path, file_name, prediction_name=None, reference_name=None)¶
-
static
repeat_data
(data, number_of_repetition)¶
-
_get_model_name
(self)¶ Return model name without path information.
-
_load_model
(self) → mlair.model_modules.AbstractModelClass¶ Load NN model either from data store or from local path.
- Returns
the model
-
plot
(self)¶ Create all plots.
Plots are defined in the experiment setup by plot_list. By default, all of the following plots are enabled:
PlotBootstrapSkillScore
PlotConditionalQuantiles
PlotStationMap
PlotMonthlySummary
PlotClimatologicalSkillScore
PlotCompetitiveSkillScore
PlotTimeSeries
PlotAvailability
Note
Bootstrap plots are only created if bootstraps are evaluated.
-
calculate_test_score
(self)¶ Evaluate test score of model and save locally.
-
train_ols_model
(self)¶ Train ordinary least squares model on train data.
-
setup_persistence
(self)¶ Check if persistence is requested from competitors and store this information.
-
make_prediction
(self, subset)¶ Create predictions for NN, OLS, and persistence and add true observation as reference.
Predictions are filled in an array with full index range. Therefore, predictions can have missing values. All predictions for a single station are stored locally under <forecast/forecast_norm>_<station>_test.nc and can be found inside forecast_path.
-
_create_competitor_forecast
(self, station_name: str, competitor_name: str) → xarray.DataArray¶ Load and format the competing forecast of the model indicated by competitor_name for the station indicated by station_name. The name of the competitor is set as indicator in the type axis. This method raises either a FileNotFoundError or a KeyError if no competitor can be found for the given station: either no file is provided in the expected path, or the forecast file contains no forecast for the given competitor_name. The forecast is trimmed to the start and end of the test subset.
- Parameters
station_name – name of the station to load data for
competitor_name – name of the model
- Returns
the forecast of the given competitor
-
_create_observation
(self, data, _, transformation_func: Callable, normalised: bool) → xarray.DataArray¶ Create observation as ground truth from given data.
Inverse transformation is applied to the ground truth to get the output in the original space.
- Parameters
data – observation
transformation_func – a callable function to apply inverse transformation
normalised – transform ground truth in original space if false, or use normalised predictions if true
- Returns
filled data array with observation
-
_create_ols_forecast
(self, input_data: xarray.DataArray, ols_prediction: xarray.DataArray, transformation_func: Callable, normalised: bool) → xarray.DataArray¶ Create ordinary least squares model forecast with given input data.
Inverse transformation is applied to the forecast to get the output in the original space.
- Parameters
input_data – transposed history from DataPrep
ols_prediction – empty array in right shape to fill with data
transformation_func – a callable function to apply inverse transformation
normalised – transform prediction in original space if false, or use normalised predictions if true
- Returns
filled data array with ols predictions
-
_create_persistence_forecast
(self, data, persistence_prediction: xarray.DataArray, transformation_func: Callable, normalised: bool) → xarray.DataArray¶ Create persistence forecast with given data.
Persistence is derived from the value at t=0 and applied to all following time steps (t+1, …, t+window). Inverse transformation is applied to the forecast to get the output in the original space.
- Parameters
data – observation
persistence_prediction – empty array in right shape to fill with data
transformation_func – a callable function to apply inverse transformation
normalised – transform prediction in original space if false, or use normalised predictions if true
- Returns
filled data array with persistence predictions
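As a minimal sketch of the persistence idea, using plain Python lists instead of xarray and a hypothetical function name, the value at t=0 is simply repeated for every lead time:

```python
from typing import List


def persistence_forecast(value_at_t0: float, window: int) -> List[float]:
    """Repeat the last known value for all lead times t+1, ..., t+window."""
    return [value_at_t0] * window


history = [41.0, 43.5, 47.2]  # made-up observations, last entry is t=0
forecast = persistence_forecast(history[-1], window=4)
```

The inverse transformation mentioned above is left out of this sketch.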
-
_create_nn_forecast
(self, nn_output: xarray.DataArray, nn_prediction: xarray.DataArray, transformation_func: Callable, normalised: bool) → xarray.DataArray¶ Create NN forecast for given input data.
Inverse transformation is applied to the forecast to get the output in the original space. Furthermore, only the output of the main branch is returned (not all minor branches, if the network has multiple output branches). The main branch is defined to be the last entry of all outputs.
- Parameters
nn_output – Full NN model output
nn_prediction – empty array in right shape to fill with data
transformation_func – a callable function to apply inverse transformation
normalised – transform prediction in original space if false, or use normalised predictions if true
- Returns
filled data array with nn predictions
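Selecting the main branch could look like the following sketch. The function name and the representation of a multi-branch output as a Python list are assumptions for illustration.

```python
def select_main_branch(nn_output):
    """Return the last entry if the output is a list of branches, else the output itself."""
    if isinstance(nn_output, list):
        return nn_output[-1]  # main branch is defined as the last output
    return nn_output


single = select_main_branch("main")
multi = select_main_branch(["minor_1", "minor_2", "main"])
```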
-
static
_create_empty_prediction_arrays
(target_data, count=1)¶ Create array to collect all predictions. Expand target data by a station dimension.
-
static
create_fullindex
(df: Union[xarray.DataArray, pandas.DataFrame, pandas.DatetimeIndex], freq: str) → pandas.DataFrame¶ Create full index from first and last date inside df and resample with given frequency.
- Parameters
df – use time range of this data set
freq – frequency of full index
- Returns
empty data frame with full index.
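The idea of a full index can be sketched with the standard library only. The documented method works on pandas and xarray objects and takes a frequency string, whereas this stand-in (hypothetical function `full_index`) uses `datetime` values and a `timedelta` step.

```python
from datetime import datetime, timedelta
from typing import List


def full_index(timestamps: List[datetime], step: timedelta) -> List[datetime]:
    """Return every time step between the first and last timestamp (inclusive)."""
    current, end = min(timestamps), max(timestamps)
    index = []
    while current <= end:
        index.append(current)
        current += step
    return index


# made-up observations with a gap on 3 and 4 January
values = {datetime(2020, 1, 1): 1.0, datetime(2020, 1, 2): 2.0, datetime(2020, 1, 5): 5.0}
index = full_index(list(values), timedelta(days=1))
aligned = [values.get(ts) for ts in index]  # gaps show up as None
```

Aligning data to such an index is why predictions filled into the full index range can contain missing values.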
-
static
create_forecast_arrays
(index: pandas.DataFrame, ahead_names: List[Union[str, int]], time_dimension, ahead_dim='ahead', index_dim='index', type_dim='type', **kwargs)¶ Combine different forecast types into single xarray.
- Parameters
index – index for forecasts (e.g. time)
ahead_names – names of ahead values (e.g. hours or days)
kwargs – as xarrays; data of forecasts
- Returns
xarray of dimension 3: index, ahead_names, # predictions
-
_get_internal_data
(self, station: str, path: str) → Union[xarray.DataArray, None]¶ Get internal data for given station.
Internal data is defined as data that is already known to the model. From an evaluation perspective, this refers to all data that is not test data, i.e. train and val data.
- Parameters
station – name of station to load internal data.
-
_get_external_data
(self, station: str, path: str) → Union[xarray.DataArray, None]¶ Get external data for given station.
External data is defined as data that is not known to the model. From an evaluation perspective, this refers to all data that is not train or val data, i.e. test data.
- Parameters
station – name of station to load external data.
-
_combine_forecasts
(self, forecast, competitor, dim=None)¶ Combine forecast and competitor if both are xarray. If competitor is None, this returns forecast, and vice versa.
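The None handling can be sketched like this. Lists stand in for xarray objects, list concatenation stands in for combining along a dimension, and the function name `combine` is hypothetical.

```python
from typing import List, Optional


def combine(forecast: Optional[List[str]], competitor: Optional[List[str]]) -> Optional[List[str]]:
    """Return the combination of both inputs, or whichever one is not None."""
    if forecast is None:
        return competitor
    if competitor is None:
        return forecast
    return forecast + competitor


both = combine(["nn", "ols"], ["competitor_a"])
only_forecast = combine(["nn", "ols"], None)
```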
-
calculate_bias_free_error_metrics
(self)¶
-
calculate_error_metrics
(self) → Tuple[Dict, Dict, Dict, Dict]¶ Calculate error metrics and skill scores of NN forecast.
The competitive skill score compares the NN prediction with persistence and ordinary least squares forecasts, whereas the climatological skill score evaluates the NN prediction in terms of meaningfulness in comparison to different climatological references.
- Returns
competitive and climatological skill scores, error metrics
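To make the notion of a competitive skill score concrete, the sketch below uses the common MSE-based definition, 1 - MSE(forecast) / MSE(reference). Whether this method uses exactly this formula is an assumption; the function names and data are made up.

```python
from typing import Sequence


def mse(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Mean squared error between observations and predictions."""
    return sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs)


def skill_score(obs: Sequence[float], forecast: Sequence[float], reference: Sequence[float]) -> float:
    """Positive if the forecast beats the reference, 0 if equal, negative if worse.

    Assumes the reference is not perfect (its MSE must be non-zero).
    """
    return 1.0 - mse(obs, forecast) / mse(obs, reference)


obs = [1.0, 2.0, 3.0, 4.0]
nn = [1.0, 2.0, 3.0, 5.0]           # MSE = 0.25
persistence = [0.0, 1.0, 2.0, 3.0]  # MSE = 1.0
score = skill_score(obs, nn, persistence)
```

In this made-up case the score is 0.75, meaning the NN forecast reduces the squared error of the persistence reference by 75 percent.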
-
static
calculate_average_skill_scores
(scores, counts)¶
-
static
calculate_average_errors
(errors)¶
-
report_feature_importance_results
(self, results)¶ Create a csv file containing all results from feature importance.
-
report_error_metrics
(self, errors, tag=None)¶
-
store_errors
(self, errors)¶
-
class
mlair.
AbstractModelClass
(input_shape, output_shape)¶ Bases:
abc.ABC
The AbstractModelClass provides a unified skeleton for any model provided to the machine learning workflow.
The model can always be accessed by calling ModelClass.model or directly by a model method without passing the model attribute name (e.g. ModelClass.model.compile -> ModelClass.compile). Besides the model, this class provides the corresponding loss function.
-
_requirements
= []¶
-
__getattr__
(self, name: str) → Any¶ Called if __getattribute__ is not able to find the requested attribute.
Normally, the model class is saved into a variable like model = ModelClass(). To bypass a call like model.model to access the _model attribute, this method searches for the named attribute in the self.model namespace and returns this attribute if available. Therefore, the following expression is true: ModelClass().compile == ModelClass().model.compile, as long as the called attribute/method is not part of the ModelClass itself.
- Parameters
name – name of the attribute or method to call
- Returns
attribute or method from self.model namespace
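The delegation pattern can be shown with a small self-contained sketch. `FakeKerasModel` and `ModelClassSketch` are hypothetical stand-ins, so no keras installation is needed.

```python
class FakeKerasModel:
    """Hypothetical stand-in for a keras model."""

    def compile(self) -> str:
        return "compiled"


class ModelClassSketch:
    def __init__(self):
        self._model = FakeKerasModel()

    @property
    def model(self) -> FakeKerasModel:
        return self._model

    def __getattr__(self, name: str):
        # Only reached when normal attribute lookup fails, so attributes
        # defined on the class itself always take precedence.
        return getattr(self.model, name)


wrapper = ModelClassSketch()
same = wrapper.compile == wrapper.model.compile  # both resolve to the same bound method
```

A name that exists on neither object still raises AttributeError, because the inner `getattr` call fails on the wrapped model.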
-
property
model
(self) → tensorflow.keras.Model¶ The model property containing a keras.Model instance.
- Returns
the keras model
-
property
custom_objects
(self) → Dict¶ The custom objects property collects all non-keras utilities that are used in the model class.
To load such a customised and already compiled model (e.g. from local disk), this information is required.
- Returns
custom objects in a dictionary
-
property
compile_options
(self) → Dict¶ The compile options property allows the user to use all keras.compile() arguments. They can either be passed as a dictionary (1), as instance attributes without setting compile_options (2), or as a mixture of both, partly defined as instance attributes and partly passed as a dictionary (3). An error is raised when the same parameter is set differently.
Example (1) Recommended (includes check for valid keywords which are used as args in keras.compile)

.. code-block:: python

    def set_compile_options(self):
        self.compile_options = {"optimizer": keras.optimizers.SGD(),
                                "loss": keras.losses.mean_squared_error,
                                "metrics": ["mse", "mae"]}

Example (2)

.. code-block:: python

    def set_compile_options(self):
        self.optimizer = keras.optimizers.SGD()
        self.loss = keras.losses.mean_squared_error
        self.metrics = ["mse", "mae"]

Example (3) Correct:

.. code-block:: python

    def set_compile_options(self):
        self.optimizer = keras.optimizers.SGD()
        self.loss = keras.losses.mean_squared_error
        self.compile_options = {"metrics": ["mse", "mae"]}

Incorrect: (Will raise an error)

.. code-block:: python

    def set_compile_options(self):
        self.optimizer = keras.optimizers.SGD()
        self.loss = keras.losses.mean_squared_error
        self.compile_options = {"optimizer": keras.optimizers.Adam(),
                                "metrics": ["mse", "mae"]}

Note:
* As long as the attribute and the dict value have exactly the same values, the setter method will not raise an error
* For example (2) there is no check implemented, if the attributes are valid compile options
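How such a setter might merge attributes with a dictionary and reject conflicts is sketched below. This is an illustration under assumptions, not the MLAir code: the class name, the reduced keyword list, the use of ValueError, and the string stand-ins for keras objects are all hypothetical.

```python
class CompileOptionsSketch:
    """Hypothetical stand-in for a model class."""

    ALLOWED = ("optimizer", "loss", "metrics")  # reduced keyword list for brevity

    def __init__(self):
        self.optimizer = None
        self.loss = None
        self.metrics = None
        self._compile_options = None

    @property
    def compile_options(self) -> dict:
        return self._compile_options

    @compile_options.setter
    def compile_options(self, value: dict) -> None:
        unknown = set(value) - set(self.ALLOWED)
        if unknown:
            raise ValueError(f"invalid compile options: {sorted(unknown)}")
        merged = {}
        for key in self.ALLOWED:
            from_attr, from_dict = getattr(self, key), value.get(key)
            # the same parameter set differently in both places is a conflict
            if from_attr is not None and from_dict is not None and from_attr != from_dict:
                raise ValueError(f"conflicting values for {key!r}")
            merged[key] = from_attr if from_attr is not None else from_dict
        self._compile_options = merged


model = CompileOptionsSketch()
model.optimizer = "sgd"  # string stand-ins instead of keras objects
model.loss = "mse"
model.compile_options = {"metrics": ["mse", "mae"]}  # mixture as in example (3)
```

Setting `{"optimizer": "adam"}` on the same instance would raise, which mirrors the incorrect variant of example (3).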
- Returns
-
static
__extract_from_tuple
(tup)¶ Return element of tuple if it contains only a single element.
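A minimal sketch of this behaviour, written as a plain function with a hypothetical name rather than the private static method:

```python
def extract_from_tuple(tup):
    """Unpack a one-element tuple, return anything else unchanged."""
    if isinstance(tup, tuple) and len(tup) == 1:
        return tup[0]
    return tup


unpacked = extract_from_tuple((5,))
unchanged = extract_from_tuple((1, 2))
```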
-
static
__compare_keras_optimizers
(first, second)¶ Compare whether the optimisers and all of their settings are exactly equal.
- Returns
True if the optimisers are interchangeable, or False if the optimisers are distinguishable.
-
get_settings
(self) → Dict¶ Get all class attributes that are not protected in the AbstractModelClass as dictionary.
- Returns
all class attributes
-
abstract
set_model
(self)¶ Abstract method to set model.
-
abstract
set_compile_options
(self)¶ This method only has to be defined in a child class when additional compile options (options other than optimizer and loss) should be used. The options have to be set as a dictionary:
{'optimizer': None, 'loss': None, 'metrics': None, 'loss_weights': None, 'sample_weight_mode': None, 'weighted_metrics': None, 'target_tensors': None}
- Returns
-
set_custom_objects
(self, **kwargs) → None¶ Set custom objects that are not part of keras framework.
These custom objects are needed if an already compiled model is loaded from disk. There is a special treatment for the Padding2D class, which is a base class for different padding types. For a correct behaviour, all supported subclasses are added as custom objects in addition to the given ones.
- Parameters
kwargs – all custom objects, that should be saved
-
classmethod
requirements
(cls)¶ Return requirements and own arguments without duplicates.
-
classmethod
own_args
(cls, *args)¶ Return all arguments (including kwonlyargs).
-
classmethod
super_args
(cls)¶
-
-
mlair.
get_version
()¶
-
mlair.
__version__
¶
-
mlair.
__email__
= ['l.leufen@fz-juelich.de']¶