Changelog

All notable changes to this project will be documented in this file.

v2.4.0 - 2023-06-30 - IFS data and bias-corrected evaluation

general:

  • support IFS data (local) and ERA5 data (from toar db)

  • bias free evaluation

new features:

  • can load local IFS forecast data as input (#450)

  • can also load ERA5 data from toar db as alternative to locally stored data (#449)

  • new plot to show monthly data distributions in subsets (#445)

  • can load a DL model from external path (#448)

  • introduced option to bias-correct model’s and competitors’ forecasts (#442)

  • can use different interpolation methods when having CAMS as competitor (#444)

technical:

  • change toar statistics from api v1 to api v2 (#454)

  • now able to set configuration paths for local era5 and ifs data as experiment parameter (#457)

  • improved retry strategy when downloading data from toar db (#453)

  • updated packages (#452)

  • calculation of filter apriori is more robust now, properties are stored inside experiment folder (#447, #451)

v2.3.0 - 2022-11-25 - new models and plots

general:

  • new model classes for ResNet and U-Net

  • new plots and variations of existing plots

new features:

  • new model classes: ResNet (#419), U-Net (#423)

  • seasonal mse stack plot (#422)

  • new aggregated and line versions of Time Evolution Plot (#424, #427)

  • box-and-whisker plots are created for all error metrics (#431)

  • new split and frequency distribution versions of box-and-whisker plots for error metrics (#425, #434)

  • new evaluation metric: mean error / bias (#430)

  • conditional quantiles are now available for all competitors too (#435)

  • new map plot showing mse at locations (#432)

technical:

  • speed up in model setup (#421)

  • bugfix for boundary trim in FIR filter (#418)

  • persistence is now calculated only on demand (#426)

  • block mse are stored locally in a file (#428)

  • fix issue with boolean variables not recognized by argparse (#417)

  • renaming of ahead labels (#436)

v2.2.0 - 2022-08-16 - new data sources and python3.9

general:

  • new data sources: era5 data and ToarDB V2

  • CAMS competitor available

  • improved execution speed

  • MLAir is now updated to python3.9

new features:

  • new data loading method to load era5 data on Jülich systems (#393)

  • new data loading method to load data from ToarDB V2 (#396)

  • implemented competitor model using CAMS ensemble forecasts (#394)

  • OLS competitor is only calculated if provided in competitor list (#404)

  • experimental: snapshot creation to skip preprocessing stage (#346, #405, #406)

  • new workflow HyperSearchWorkflow stopping after training stage (#408)

technical:

  • fixed minor issues and improved execution speed in postprocessing (#401, #413)

  • improved speed in keras iterator creation (#409)

  • solved bug for very long competitor time series (#395)

  • updated python, HPC and CI environment (#402, #403, #407, #410)

  • fix for climateFIR data handler (#399)

  • fix for report model error (#416)

v2.1.0 - 2022-06-07 - new evaluation metrics and improved training

general:

  • new evaluation metrics, IOA and MNMB

  • advanced train options for early stopping

  • reduced execution time by refactoring

new features:

  • uncertainty estimation of MSE is now applied for each season separately (#374)

  • added different configurations of early stopping to use either last trained or best epoch (#378)

  • train monitoring plots now add a star for best epoch when using early stopping (#367)

  • new evaluation metric index of agreement, IOA (#376)

  • new evaluation metric modified normalised mean bias, MNMB (#380)

  • new plot available that shows temporal evolution of MSE for each station (#381)

technical:

  • reduced loading of forecast path from data store (#328)

  • bug fix for an uncaught error during transformation (#385)

  • bug fix for data handler with climate and FIR filter, which always calculated the transformation with the FIR filter (#387)

  • improved duration for latex report creation at end of preprocessing (#388)

  • enhanced speed for make prediction in postprocessing (#389)

  • fix to always create version badge from version and not from tag name (#382)
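
Note on the metrics added in v2.1.0: the sketch below shows the commonly used textbook definitions of the index of agreement (IOA, after Willmott) and the modified normalised mean bias (MNMB). It is an illustration only. The function names are hypothetical, and MLAir's own implementation (#376, #380) may differ in detail.

```python
from statistics import fmean


def index_of_agreement(obs, pred):
    """Willmott's index of agreement; 1.0 means perfect agreement.

    Hypothetical helper for illustration, not MLAir's API.
    """
    obs_mean = fmean(obs)
    # Sum of squared errors between forecast and observation.
    num = sum((p - o) ** 2 for o, p in zip(obs, pred))
    # "Potential error": spread of forecast and observation around the observed mean.
    den = sum((abs(p - obs_mean) + abs(o - obs_mean)) ** 2 for o, p in zip(obs, pred))
    return 1.0 - num / den


def modified_normalised_mean_bias(obs, pred):
    """MNMB, bounded by -2 and 2; 0.0 means no bias.

    Assumes obs + pred is never zero (e.g. positive concentrations).
    """
    return 2.0 * fmean([(p - o) / (p + o) for o, p in zip(obs, pred)])


obs = [30.0, 40.0, 50.0]
ioa_perfect = index_of_agreement(obs, obs)
mnmb_perfect = modified_normalised_mean_bias(obs, obs)
mnmb_doubled = modified_normalised_mean_bias(obs, [2 * o for o in obs])
```

A forecast identical to the observations gives an IOA of 1 and an MNMB of 0, while a forecast that doubles every observation gives an MNMB of 2/3.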

v2.0.0 - 2022-04-08 - tf2 usage, new model classes, and improved uncertainty estimate

general:

  • MLAir now uses tensorflow v2

  • new customisable model classes for CNN and RNN

  • improved uncertainty estimate

new features:

  • MLAir now depends on tensorflow v2 (#331)

  • new CNN class that can be configured layer-wise (#368)

  • new RNN class that can be configured in more detail (#361)

  • new branched-input CNN class (#368)

  • new branched-input RNN class (#362)

  • set custom model display name that is used in plots (#341)

  • specify names of input branches to use in feature importance plots (#356)

  • uncertainty estimate of model error is now calculated for each forecast step additionally (#359)

  • data transformation properties are stored locally and can be loaded into an experiment run (#345)

  • uncertainty estimate now includes a Mann-Whitney U rank test (#355)

  • data handlers can now have access to “future” data specified by new parameter extend_length_opts (#339)

technical:

  • MLAir now uses python3.8 on Jülich HPC systems (#375)

  • no support of MLAir for tensorflow v1.X, replaced by tf v2.X (#331)

  • all data handlers with filters can return data as branches (#370)

  • bug fix to force model name and competitor names to be unique (#366, #369)

  • fix to use only a single forecast step (#315)

  • CI pipeline adjustments (#340, #365)

  • new option to set the level of the print logging (#364)

  • advanced logging for batch data creation and in postprocessing (#350, #360)

  • batch data creation is skipped on disabled training (#341)

  • multiprocessing pools are now closed properly (#342)

  • bug fix if no competitor data is available (#343)

  • bug fix for model loading (#343)

  • models plotted by PlotSampleUncertaintyFromBootstrap are now ordered by mean error (#344)

  • fix for usage of lazy data that caused unintended reloading of data (#347)

  • fix for latex reports not showing all stations and competitors (#349)

  • refactoring of hard coded dimension names in skill scores calculation (#357)

  • bug fix for the order of bootstrap methods in feature importance calculation, which caused errors (#358)

  • distinguish now between window_history_offset (pos of last time step), window_history_size (total length of input sample), and extend_length_opts (“future” data that is available at given time) (#353)

v1.5.0 - 2021-11-11 - new uncertainty estimation

general:

  • introduces method to estimate sample uncertainty

  • improved multiprocessing

  • last release with tensorflow v1 support

new features:

  • test set sample uncertainty estimation during postprocessing (#333)

  • support of Kolmogorov-Zurbenko filter for data handlers with filters (#334)

technical:

  • new communication scheme for multiprocessing (#321, #322)

  • improved error reporting (#323)

  • feature importance now returns unaggregated results (#335)

  • error metrics are reported for all competitors (#332)

  • minor bugfixes and refactorings (#330, #326, #329, #325, #324, #320, #337)

v1.4.0 - 2021-07-27 - new model classes and data handlers, improved usability and transparency

general:

  • many technical adjustments to improve usability and transparency of MLAir

  • new FCN and CNN classes for easy NN model creation

  • new plots

new features:

  • new FCN class that can be customized in many ways (#284)

  • also new CNN class (#289)

  • added new bootstrap analysis method: mean bootstrapping (#300)

  • new data handler using FIR filters (#306)

  • performance measures are now stored in local files (#286)

  • histogram plots for inputs and targets (#299)

  • periodogram plots for filtered data (#298)

technical:

  • a calling run script can be stored inside the experiment folder if a reference to this script is passed as argument (#99)

  • new callback to track epoch-runtime (#312)

  • added switch to use multiprocessing (#297)

  • customize maximum number of parallel processes (#308)

  • support non-monotonic window lead times (#313)

  • resolved bug with FileExistsError (#311)

  • resolved bug if no chemical is used at all (#307)

  • min/max scaler now scales between -1 and 1 (#302)

  • added missing offset parameter to some data handlers (#305)

  • improved data store logging (#304)

  • improved logging message on station removal in preprocessing (#294)

  • limited number of retries in JOIN module (#296)

  • adjusted competing skill score plot (#301)

  • transformation parameter check (#295)

  • implemented lazy data preprocessing for selected data handlers (#292)

  • fix bug in separation of scales data handler (#290)
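
Note on the min/max scaler change in v1.4.0 (#302): a minimal sketch of scaling to the range -1 to 1, assuming the usual linear mapping. The function names are hypothetical and only illustrate the idea; MLAir's transformation code may be organised differently.

```python
def min_max_scale(values, v_min=None, v_max=None):
    """Linearly map values so that v_min -> -1 and v_max -> 1.

    Hypothetical helper for illustration, not MLAir's API.
    """
    v_min = min(values) if v_min is None else v_min
    v_max = max(values) if v_max is None else v_max
    if v_max == v_min:
        raise ValueError("cannot scale data with zero range")
    return [2.0 * (v - v_min) / (v_max - v_min) - 1.0 for v in values]


def min_max_inverse(scaled, v_min, v_max):
    """Undo min_max_scale using the stored minimum and maximum."""
    return [(s + 1.0) / 2.0 * (v_max - v_min) + v_min for s in scaled]


scaled = min_max_scale([0.0, 5.0, 10.0])
restored = min_max_inverse(scaled, 0.0, 10.0)
```

Storing the minimum and maximum alongside the scaled data is what allows the inverse step to recover the original values.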

v1.3.0 - 2021-02-24 - competitors and improved transformation

general:

  • release of official MLAir logo (#274)

  • new transformation schema for better independence of MLAir and data handler (#272)

  • competing models can be included in postprocessing for direct comparison (#198)

new features:

  • new helper functions for geographic issues (#280)

  • default data handler and inheritances can use min/max and log transformation (#276, #275)

  • include IntelliO3-ts model as reference via automatic download (#131)

technical:

  • experiment name now always includes target sampling type (#263)

  • competitive skill score plot is refactored (#260)

  • bug fix for climatological skill scores (#259)

  • bug fix for custom objects handling (#277)

  • bug fix for monitoring plots when multiple output branches are used (#278)

  • update requirements to newer version and dependencies (#262, #273)

  • HPC scripts are updated to work properly with parallel data processing (#281)

v1.2.1 - 2021-02-08 - bug fix for recursive import error

general:

  • applied bug fix

technical:

  • bug fix for recursive import error (#269)

v1.2.0 - 2020-12-18 - parallel preprocessing and improved data handlers

general:

  • new plots

  • parallelism for faster preprocessing

  • improved data handler with mixed sampling types

  • enhanced test coverage

new features:

  • station map plot now highlights subsets on the map and displays number of stations for each subset (#227, #231)

  • two new data availability plots PlotAvailabilityHistogram (#191, #192, #223)

  • introduced parallel code in preprocessing if system supports parallelism (#164, #224, #225)

  • data handler DataHandlerMixedSampling (and inheritances) supports an offset parameter to end inputs at a different time than 00 hours (#220)

  • args for data handler DataHandlerMixedSampling (and inheritances) that differ for input and target can now be passed as tuple (#229)

technical:

  • added templates for release and bug issues (#189)

  • improved test coverage (#236, #238, #239, #240, #241, #242, #243, #244, #245)

  • station map plot includes now number of stations for each subset (#231)

  • postprocessing plots are encapsulated in try except statements (#107)

  • updated git settings (#213)

  • bug fix for data handler (#235)

  • reordering and bug fix for preprocessing reporting (#207, #232)

  • bug fix for outdated system path style (#226)

  • new plots are included in default plot list (#211)

  • helpers/join connection to ToarDB (e.g. used by DefaultDataHandler) reports now which variable could not be loaded (#222)

  • plot PlotBootstrapSkillScore can now additionally highlight specific variables, but not included in postprocessing up to now (#201)

  • data handler DataHandlerMixedSampling has now a reduced data loading (#221)

v1.1.0 - 2020-11-18 - hourly resolution support and new data handlers

general:

  • MLAir can be used with 1H resolution data from JOIN

  • new data handlers to use the Kolmogorov-Zurbenko filter and mixed sampling types

new features:

  • new data handler DataHandlerKzFilter to use Kolmogorov-Zurbenko filter (kz filter) on inputs (#195)

  • new data handler DataHandlerMixedSampling that can use mixed sampling types for input and target (#197)

  • new data handler DataHandlerMixedSamplingWithFilter that uses kz filter and mixed sampling (#197)

  • new data handler DataHandlerSeparationOfScales to use filter-dependent time step sizes on filtered inputs using mixed sampling (#196)

technical:

  • bug fix for very short time series in TimeSeriesPlot (#215)

  • bug fix for variable dictionary when using hourly resolution (#212)

  • variable naming for data from JOIN interface harmonised (#206)

  • transformation setup is now separated for inputs and targets (#202)

  • bug fix in PlotClimatologicalSkillScore if only single station is used (#193)

  • preprocessed data is now stored inside experiment and not in the data folder
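
Note on the Kolmogorov-Zurbenko filter used by the v1.1.0 data handlers: a KZ filter is an iterated centred moving average. The sketch below is a generic illustration with hypothetical function names. The edge handling (averaging over whatever part of the window lies inside the series) is an assumption and not necessarily what DataHandlerKzFilter does.

```python
def moving_average(values, window):
    """Centred moving average; the window shrinks at the series edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def kz_filter(values, window, iterations):
    """Apply a moving average of odd length `window`, `iterations` times."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    result = list(values)
    for _ in range(iterations):
        result = moving_average(result, window)
    return result


spike = [0.0, 0.0, 3.0, 0.0, 0.0]
once = kz_filter(spike, window=3, iterations=1)
constant = kz_filter([2.0] * 5, window=3, iterations=2)
```

Each additional iteration smooths the series further, so a single spike is spread over its neighbours while a constant series is left unchanged.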

v1.0.0 - 2020-10-08 - official release of new version 1.0.0

general:

  • This is the first official release of MLAir ready for use

  • updated license, installation instruction

technical:

  • restructured order of packages in requirements

v0.12.2 - 2020-10-01 - HDFML support

general:

  • HDFML support

technical:

  • installation script for HDFML adjusted, #183

v0.12.1 - 2020-09-28 - examples in notebook

general:

  • introduced a notebook documentation for easy starting, #174

  • updated special installation instructions for the Juelich HPC systems, #172

new features:

  • names of input and output shape are renamed consistently to: input_shape, and output_shape, #175

technical:

  • it is possible to assign a custom name to a run module (e.g. used in logging), #173

v0.12.0 - 2020-09-21 - Documentation and Bugfixes

general:

  • improved documentation including installation instructions and many examples from the paper, #153

  • bugfixes (see technical)

new features:

  • MyLittleModel is now a pure feed-forward network (before it had a CNN part), #168

technical:

  • new compile options check to ensure its execution, #154

  • bugfix for key errors in time series plot, #169

  • bugfix for not used kwargs in DefaultDataHandler, #170

  • trainable parameter is renamed to train_model to prevent confusion with the tf trainable parameter, #162

  • fixed HPC installation failure, #159

v0.11.0 - 2020-08-24 - Advanced Data Handling for MLAir

general

  • Introduce advanced data handling with much more flexibility (independent of TOAR DB, custom data handling is pluggable), #144

  • default data handler is still using TOAR DB

new features

  • default data handler using TOAR DB refactored according to advanced data handling, #140, #141, #152

  • data sets are handled as collections, #142, and are iterable in a standard way (StandardIterator) and optimised for keras (KerasIterator), #143

  • automatically moving station map plot, #136

technical

  • model modules available from package, #139

  • renaming of parameter time dimension, #151

  • refactoring of README.md, #138

v0.10.0 - 2020-07-15 - MLAir is official name, Workflows, easy Model plug-in

general

  • Official project name is released: MLAir (Machine Learning on Air data)

  • a model class can now easily be plugged into MLAir, #121

  • introduced new concept of workflows, #134

new features

  • workflows are used to execute a sequence of run modules, #134

  • default workflows for standard and the Juelich HPC systems are available, custom workflows can be defined, #134

  • seasonal decomposition is available for conditional quantile plot, #112

  • map plot is created with coordinates, #108

  • flatten_tails are now more general and easier to customise, #114

  • model classes have custom compile options (replaces set_loss), #110

  • model can be set in ExperimentSetup from outside, #121

  • default experiment settings can be queried using get_defaults(), #123

  • training and model settings are reported as MarkDown and Tex tables, #145

technical

  • Juelich HPC systems are supported and installation scripts are available, #106

  • data store is tracked, I/O is saved and illustrated in a plot, #116

  • batch size, epoch parameter have to be defined in ExperimentSetup, #127, #122

  • automatic documentation with sphinx, #109

  • default experiment settings are updated, #123

  • refactoring of experiment path and its default naming, #124

  • refactoring of some parameter names, #146

  • preparation for package distribution with pip, #119

  • all run scripts are updated to run with workflows, #134

  • the experiment folder is restructured, #130

v0.9.0 - 2020-04-15 - faster bootstraps, extreme value upsampling

general

  • improved and faster bootstrap workflow

  • new plot PlotAvailability

  • extreme values upsampling

  • improved runtime environment

new features

  • entire bootstrap workflow has been refactored and is much faster now, can be skipped with evaluate_bootstraps=False, #60

  • upsampling of extreme values, set with parameter extreme_values=[your_values_standardised] (e.g. [1, 2]) and extremes_on_right_tail_only=<True/False> if only right tail of distribution is affected or both, #58, #87

  • minimal data length property (in total and for all subsets), #76

  • custom objects in model class to load customised model objects like padding class, loss, #72

  • new plot for data availability: PlotAvailability, #103

  • introduced (default) plot_list to specify which plots to draw

  • latex and markdown information on sample sizes for each station, #90

technical

  • implemented tests on gpu and from scratch for develop, release and master branches, #95

  • usage of tensorflow 1.13.1 (gpu / cpu), separated in 2 different requirements, #81

  • new abstract plot class to have uniform plot class design

  • New time tracking wrapper to use for functions or classes

  • improved logger (info on display, debug into file), #73, #85, #88

  • improved run environment, especially for error handling, #86

  • prefix general in data store scope is now optional and can be skipped. If given scope is not general, it is treated as subscope, #82

  • all 2D Padding classes are now selected by Padding2D(padding_name=<padding_type>) e.g. Padding2D(padding_name="SymPad2D"), #78

  • custom learning rate (or lr_decay) is optional now, #71