mlair.run_modules.pre_processing

Pre-processing module.

Module Contents

Classes

PreProcessing – Pre-process your data by using this class.

Functions

f_proc – Try to create a data handler for given arguments. If build fails, this station does not fulfil all requirements and therefore f_proc will return None as indication.
f_proc_create_info_df
f_inspect_error

Attributes

__date__
mlair.run_modules.pre_processing.__date__ = 2019-11-25
class mlair.run_modules.pre_processing.PreProcessing

Bases: mlair.run_modules.run_environment.RunEnvironment
Pre-process your data by using this class.
Schedule of pre-processing:

- load and check valid stations (either download or load from disk)
- split subsets (train, val, test, train & val)
- create small report on data metrics

Required objects [scope] from data store:

- all elements from DEFAULT_ARGS_LIST in scope preprocessing for general data loading
- all elements from DEFAULT_ARGS_LIST in scopes [train, val, test, train_val] for custom subset settings
- fraction_of_training [.]
- experiment_path [.]
- use_all_stations_on_all_data_sets [.]

Optional objects:

- all elements from DEFAULT_KWARGS_LIST in scope preprocessing for general data loading
- all elements from DEFAULT_KWARGS_LIST in scopes [train, val, test, train_val] for custom subset settings

Sets:

- stations in [., train, val, test, train_val]
- generator in [train, val, test, train_val]
- transformation [.]

Creates:

- all input and output data in data_path
- latex reports in experiment_path/latex_report
_run(self)
report_pre_processing(self)

Log some metrics on data and create latex report.
create_latex_report(self)

Create tables with information on the station meta data and a summary of subset sample sizes.

- station_sample_size.md: the table below, as markdown
- station_sample_size.tex: the same table, as latex table
- station_sample_size_short.tex: reduced-size table without any meta data besides the station ID, as latex table

All tables are stored inside experiment_path in the folder latex_report. The table format (e.g. which meta data is highlighted) is currently hardcoded to keep a stable table style. If further styles are needed, it is better to add an additional style than to modify the existing ones.
| stat. ID   | station_name                              | station_lon | station_lat | station_alt | train | val  | test |
|------------|-------------------------------------------|-------------|-------------|-------------|-------|------|------|
| DEBW013    | Stuttgart Bad Cannstatt                   | 9.2297      | 48.8088     | 235         | 1434  | 712  | 1080 |
| DEBW076    | Baden-Baden                               | 8.2202      | 48.7731     | 148         | 3037  | 722  | 710  |
| DEBW087    | Schwäbische_Alb                           | 9.2076      | 48.3458     | 798         | 3044  | 714  | 1087 |
| DEBW107    | Tübingen                                  | 9.0512      | 48.5077     | 325         | 1803  | 715  | 1087 |
| DEBY081    | Garmisch-Partenkirchen/Kreuzeckbahnstraße | 11.0631     | 47.4764     | 735         | 2935  | 525  | 714  |
| # Stations | nan                                       | nan         | nan         | nan         | 6     | 6    | 6    |
| # Samples  | nan                                       | nan         | nan         | nan         | 12253 | 3388 | 4678 |
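The two summary rows are plain aggregates over the station rows. A minimal sketch, using only the five station rows reproduced above (sample counts copied from the table):

```python
# Per-station sample sizes (train, val, test) taken from the table above.
sizes = {
    "DEBW013": (1434, 712, 1080),
    "DEBW076": (3037, 722, 710),
    "DEBW087": (3044, 714, 1087),
    "DEBW107": (1803, 715, 1087),
    "DEBY081": (2935, 525, 714),
}

n_stations = len(sizes)                                        # "# Stations" row
train, val, test = (sum(col) for col in zip(*sizes.values()))  # "# Samples" row
print(n_stations, train, val, test)  # → 5 12253 3388 4678
```

The column totals reproduce the "# Samples" row of the table (12253 / 3388 / 4678).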
create_info_df(self, meta_cols, meta_round, names_of_set, precision)
split_train_val_test(self) → None

Split data into subsets.

Currently: train, val, test and train_val (the latter is actually only the merge of train and val, but kept as a separate data_collection). IMPORTANT: Do not change the order of execution of create_set_split. The train subset must always be executed first in order to set a proper transformation.
static split_set_indices(total_length: int, fraction: float) → Tuple[slice, slice, slice, slice]

Create the training, validation and test subset slice indices for a given total_length.

The test data consists of (1 - fraction) of total_length (fraction*len:end). Train and validation data therefore are made from fraction of total_length (0:fraction*len). Train and validation data are split by the factors 0.8 for train and 0.2 for validation. In addition, split_set_indices also returns the combination of the training and validation subsets.

Parameters:
- total_length – total number of objects to split
- fraction – ratio between test and union of train/val data

Returns:
- slices for each subset in the order: train, val, test, train_val
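The slicing described above can be sketched as follows. This is a minimal re-implementation for illustration only; MLAir's actual split_set_indices may round the boundaries differently:

```python
from typing import Tuple

def split_set_indices(total_length: int, fraction: float) -> Tuple[slice, slice, slice, slice]:
    # Everything before fraction*len is the train/val block, the rest is test.
    pos_test_split = int(total_length * fraction)
    train_index = slice(0, int(pos_test_split * 0.8))             # 80% of the train/val block
    val_index = slice(int(pos_test_split * 0.8), pos_test_split)  # remaining 20%
    test_index = slice(pos_test_split, total_length)              # (1 - fraction) of the data
    train_val_index = slice(0, pos_test_split)                    # merge of train and val
    return train_index, val_index, test_index, train_val_index

stations = [f"S{i:02d}" for i in range(10)]
train, val, test, train_val = split_set_indices(len(stations), fraction=0.8)
print(len(stations[train]), len(stations[val]), len(stations[test]))  # → 6 2 2
```

With fraction=0.8 and ten stations, the last two stations form the test set and the first eight are split 80/20 into train and validation.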
validate_station(self, data_handler: mlair.data_handler.AbstractDataHandler, set_stations, set_name=None, store_processed_data=True)

Check if all given stations in set_stations are valid.

Valid means that data is available for the given time range (included in kwargs). The shape and the loading time are logged in debug mode.

Returns:
- Corrected list containing only valid station IDs.
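The validation loop can be pictured like this. DummyHandler and its build method are purely hypothetical stand-ins for the AbstractDataHandler interface, used only to make the sketch runnable:

```python
class DummyHandler:
    """Hypothetical stand-in for mlair.data_handler.AbstractDataHandler."""
    AVAILABLE = {"DEBW013", "DEBW076"}  # stations with data for the time range

    def build(self, station, **kwargs):
        if station not in self.AVAILABLE:
            raise KeyError(f"no data available for {station}")
        return object()  # a successfully built data handler

def validate_stations(data_handler, set_stations, set_name=None):
    # Keep only those stations for which a data handler can be built.
    valid = []
    for station in set_stations:
        try:
            data_handler.build(station, name_affix=set_name)
        except Exception:
            continue  # build failed -> station does not fulfil the requirements
        valid.append(station)
    return valid

print(validate_stations(DummyHandler(), ["DEBW013", "DEBW999", "DEBW076"]))
# → ['DEBW013', 'DEBW076']
```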
store_data_handler_attributes(self, data_handler, collection)
_store_apriori(self)

_load_apriori(self)
transformation(self, data_handler: mlair.data_handler.AbstractDataHandler, stations)
_load_transformation(self)

Try to load transformation options from file if transformation_file is provided.
_store_transformation(self, transformation_opts)

Store transformation options locally inside experiment_path if they do not exist already.
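The load/store pair described by the two methods above can be sketched with pickle. The file name, path layout and serialization format here are assumptions for illustration, not MLAir's actual on-disk format:

```python
import os
import pickle
import tempfile

def store_transformation(transformation_opts, experiment_path, filename="transformation.pickle"):
    # Store the options locally only if the file does not exist already.
    path = os.path.join(experiment_path, filename)
    if not os.path.exists(path):
        with open(path, "wb") as f:
            pickle.dump(transformation_opts, f)
    return path

def load_transformation(transformation_file):
    # Load options only if a file was provided and actually exists.
    if transformation_file and os.path.exists(transformation_file):
        with open(transformation_file, "rb") as f:
            return pickle.load(f)
    return None

with tempfile.TemporaryDirectory() as tmp:
    path = store_transformation({"mean": 0.0, "std": 1.0}, tmp)
    print(load_transformation(path))  # → {'mean': 0.0, 'std': 1.0}
```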
prepare_competitors(self)

Prepare competitor models already in the preprocessing stage. This is done here because some models might need internet access, which, depending on the operating system, is not possible during postprocessing. Currently, this method only checks whether the Intelli03-ts-v1 model is requested as a competitor and downloads the data if required.
create_snapshot(self)

load_snapshot(self, file)
mlair.run_modules.pre_processing.f_proc(data_handler, station, name_affix, store, return_strategy='', tmp_path=None, **kwargs)

Try to create a data handler for given arguments. If build fails, this station does not fulfil all requirements and therefore f_proc will return None as indication. On a successful build, f_proc returns the built data handler and the station that was used. This function must be implemented globally to work together with multiprocessing.
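The pattern described above, a module-level worker that returns None on a failed build so it can be pickled by multiprocessing, can be sketched as follows. The signature is simplified, and DummyBuilder and the build call are hypothetical stand-ins used only to make the sketch runnable:

```python
def f_proc(data_handler, station, name_affix=None, store=True, **kwargs):
    # Must live at module level so multiprocessing can pickle it.
    try:
        res = data_handler.build(station, name_affix=name_affix,
                                 store_processed_data=store, **kwargs)
    except Exception:
        return None, station  # build failed: the caller drops this station
    return res, station

class DummyBuilder:
    """Hypothetical stand-in for the data handler class."""
    def build(self, station, **kwargs):
        if station == "DEBW999":
            raise ValueError("station does not fulfil requirements")
        return f"handler({station})"

print(f_proc(DummyBuilder(), "DEBW013"))  # → ('handler(DEBW013)', 'DEBW013')
print(f_proc(DummyBuilder(), "DEBW999"))  # → (None, 'DEBW999')
```

Returning the station alongside the result lets the parent process match asynchronous results back to their stations.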
mlair.run_modules.pre_processing.f_proc_create_info_df(data, meta_cols)
mlair.run_modules.pre_processing.f_inspect_error(formatted)