`mlair.data_handler.data_handler_single_station`¶

Data Preparation class to handle data processing for machine learning.

Module Contents¶

Classes¶

DataHandlerSingleStation

param window_history_offset: used to shift t0 according to the specified value.

Attributes¶

`__author__`
`__date__`
`date`
`str_or_list`
`number`
`num_or_list`
`data_or_none`

mlair.data_handler.data_handler_single_station.__author__ = Lukas Leufen, Felix Kleinert¶

mlair.data_handler.data_handler_single_station.__date__ = 2020-07-20¶

mlair.data_handler.data_handler_single_station.date¶

mlair.data_handler.data_handler_single_station.str_or_list¶

mlair.data_handler.data_handler_single_station.number¶

mlair.data_handler.data_handler_single_station.num_or_list¶

mlair.data_handler.data_handler_single_station.data_or_none¶

class mlair.data_handler.data_handler_single_station.DataHandlerSingleStation(station, data_path, statistics_per_var=None, sampling: Union[str, Tuple[str]] = DEFAULT_SAMPLING, target_dim=DEFAULT_TARGET_DIM, target_var=DEFAULT_TARGET_VAR, time_dim=DEFAULT_TIME_DIM, iter_dim=DEFAULT_ITER_DIM, window_dim=DEFAULT_WINDOW_DIM, window_history_size=DEFAULT_WINDOW_HISTORY_SIZE, window_history_offset=DEFAULT_WINDOW_HISTORY_OFFSET, window_history_end=DEFAULT_WINDOW_HISTORY_END, window_lead_time=DEFAULT_WINDOW_LEAD_TIME, interpolation_limit: Union[int, Tuple[int]] = DEFAULT_INTERPOLATION_LIMIT, interpolation_method: Union[str, Tuple[str]] = DEFAULT_INTERPOLATION_METHOD, overwrite_local_data: bool = False, transformation=None, store_data_locally: bool = True, min_length: int = 0, start=None, end=None, variables=None, data_origin: Dict = None, lazy_preprocessing: bool = False, overwrite_lazy_data=False, era5_data_path=None, era5_file_names=None, ifs_data_path=None, ifs_file_names=None, **kwargs)¶

Bases: mlair.data_handler.abstract_data_handler.AbstractDataHandler

Parameters

window_history_offset – used to shift t0 according to the specified value.
window_history_end – used to set the last time step that is used to create a sample. A negative value indicates that not all values up to t0 are used; a positive value indicates usage of values at t>t0. Default is 0.

DEFAULT_VAR_ALL_DICT¶

DEFAULT_WINDOW_LEAD_TIME = 3¶

DEFAULT_WINDOW_HISTORY_SIZE = 13¶

DEFAULT_WINDOW_HISTORY_OFFSET = 0¶

DEFAULT_WINDOW_HISTORY_END = 0¶

DEFAULT_TIME_DIM = datetime¶

DEFAULT_TARGET_VAR = o3¶

DEFAULT_TARGET_DIM = variables¶

DEFAULT_ITER_DIM = Stations¶

DEFAULT_WINDOW_DIM = window¶

DEFAULT_SAMPLING = daily¶

DEFAULT_INTERPOLATION_LIMIT = 0¶

DEFAULT_INTERPOLATION_METHOD = linear¶

chem_vars = ['benzene', 'ch4', 'co', 'ethane', 'no', 'no2', 'nox', 'o3', 'ox', 'pm1', 'pm10', 'pm2p5',...¶

_hash = ['station', 'statistics_per_var', 'data_origin', 'sampling', 'target_dim', 'target_var',...¶

clean_up(self)¶

__str__(self)¶: Return str(self).

__len__(self)¶

property shape(self)¶

__repr__(self)¶: Return repr(self).

get_transposed_history(self) → xarray.DataArray¶

Return history.

Returns: history with dimensions datetime, window, Stations, variables.

get_transposed_label(self) → xarray.DataArray¶

Return label.

Returns: label with dimensions datetime*, window*, Stations, variables.

get_X(self, **kwargs)¶

get_Y(self, **kwargs)¶

get_coordinates(self)¶: Return coordinates as dictionary with keys lon and lat.

call_transform(self, inverse=False)¶

transform(self, data_in, dim: Union[str, int] = 0, inverse: bool = False, opts=None, transformation_dim=DEFAULT_TARGET_DIM)¶

Transform data according to given transformation settings.

This function transforms an xarray.DataArray (along dim) or a pandas.DataFrame (along axis) either to mean=0 and std=1 (method=standardise) or to mean=0 without changing the data scale (method=centre). It also sets an internal instance attribute for the later inverse transformation. The method raises an AssertionError if an internal transform method is already set (inverse=False) or if the internal transform method, internal mean, and internal standard deviation are not set (inverse=True).

Parameters

dim (string/int) – This param is not used for inverse transformation.
- for xarray.DataArray as string: name of dimension which should be standardised
- for pandas.DataFrame as int: axis of dimension which should be standardised
inverse – Switch between transformation and inverse transformation.

Returns

xarray.DataArrays or pandas.DataFrames:
1. mean: Mean of data
2. std: Standard deviation of data
3. data: Standardised data
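The two transformation methods named above can be illustrated without MLAir. The following is a minimal sketch in plain Python, with lists standing in for an xarray.DataArray; the helper names `standardise`, `centre`, and `inverse` are hypothetical and not part of the MLAir API, and the use of the population standard deviation is an assumption.

```python
from statistics import fmean, pstdev


def standardise(values):
    """Scale to mean 0 and standard deviation 1; return (mean, std, data)."""
    mean = fmean(values)
    std = pstdev(values)  # assumption: population std, chosen for this sketch only
    return mean, std, [(v - mean) / std for v in values]


def centre(values):
    """Shift to mean 0 without changing the scale; return (mean, None, data)."""
    mean = fmean(values)
    return mean, None, [v - mean for v in values]


def inverse(data, mean, std=None):
    """Undo either transformation using the stored statistics."""
    scale = 1.0 if std is None else std
    return [v * scale + mean for v in data]


raw = [2.0, 4.0, 6.0, 8.0]
mean, std, scaled = standardise(raw)
restored = inverse(scaled, mean, std)

c_mean, c_std, centred = centre(raw)
```

Keeping the mean and standard deviation next to the transformed data is what makes the later inverse step possible, which is the reason the documented method stores them internally.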

setup_samples(self)¶: Setup samples. This method prepares and creates samples X, and labels Y.

store_lazy(self)¶

_create_lazy_data(self)¶

load_lazy(self)¶

_extract_lazy(self, lazy_data)¶

make_input_target(self)¶

set_inputs_and_targets(self)¶

make_samples(self)¶

load_data(self, path, station, statistics_per_var, sampling, store_data_locally=False, data_origin: Dict = None, start=None, end=None)¶

Load data and meta data either from local disk (preferred) or download new data by using a custom download method.

Data is downloaded if no local data is available or if the parameter overwrite_local_data is true. In both cases, downloaded data is stored locally unless store_data_locally is disabled. If this parameter is not set, data is assumed to be saved locally.
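The rule described here reduces to two boolean decisions. As a rough sketch (the function `plan_load` and its return value are hypothetical and only restate the paragraph above, they are not MLAir code):

```python
def plan_load(local_data_exists, overwrite_local_data=False, store_data_locally=True):
    """Return (download, store) flags following the rule described in the text."""
    # Download when nothing is on disk or when the caller asks for a fresh copy.
    download = overwrite_local_data or not local_data_exists
    # Only freshly downloaded data can be written back to disk.
    store = download and store_data_locally
    return download, store


cached = plan_load(local_data_exists=True)
missing = plan_load(local_data_exists=False)
refreshed = plan_load(True, overwrite_local_data=True, store_data_locally=False)
```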

static check_station_meta(meta, station, data_origin, statistics_per_var)¶

Search for the entries in meta data and compare the value with the requested values.

Will raise a FileNotFoundError if the values mismatch.

check_for_negative_concentrations(self, data: xarray.DataArray, minimum: int = 0) → xarray.DataArray¶

Set all negative concentrations to zero.

Names of all concentrations are extracted from https://join.fz-juelich.de/services/rest/surfacedata/ #2.1 Parameters. Currently, this check is applied on “benzene”, “ch4”, “co”, “ethane”, “no”, “no2”, “nox”, “o3”, “ox”, “pm1”, “pm10”, “pm2p5”, “propane”, “so2”, and “toluene”.

Parameters

data – data array containing variables to check
minimum – minimum value, by default this should be 0

Returns

corrected data
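A minimal sketch of this check, assuming that values below `minimum` are replaced by `minimum` and that only the listed chemical variables are touched. A plain dict of lists stands in for the xarray.DataArray; `clip_concentrations` and the variable name `temp` are hypothetical.

```python
# Variable names taken from the list in the text above.
CHEM_VARS = {
    "benzene", "ch4", "co", "ethane", "no", "no2", "nox", "o3", "ox",
    "pm1", "pm10", "pm2p5", "propane", "so2", "toluene",
}


def clip_concentrations(data, minimum=0):
    """Raise values below `minimum` to `minimum` for chemical variables only."""
    return {
        name: [max(v, minimum) if name in CHEM_VARS else v for v in values]
        for name, values in data.items()
    }


# "temp" is a made-up non-chemical variable that must stay unchanged.
raw = {"o3": [12.0, -0.5, 3.0], "temp": [-4.0, 1.5]}
cleaned = clip_concentrations(raw)
```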

setup_data_path(self, data_path: str, sampling: str)¶

shift(self, data: xarray.DataArray, dim: str, window: int, offset: int = 0) → xarray.DataArray¶

Shift data multiple times to represent history (if window <= 0) or lead time (if window > 0).

Parameters

data – data set to shift
dim – dimension along which the shift is applied
window – number of steps to shift (corresponds to the window length)
offset – use offset to move the window by as many time steps as given in offset. This can be used, if the index time of a history element is not the last timestamp. E.g. you could use offset=23 when dealing with hourly data in combination with daily data (values from 00 to 23 are aggregated on 00 the same day).

Returns

shifted data
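To make the history and lead-time cases concrete, here is a sketch on a plain list. It assumes that a history window of `window <= 0` produces the window indices `window..0` and that a lead time of `window > 0` produces `1..window`, each moved by `offset`; both helper functions are hypothetical and the real method operates on an xarray.DataArray.

```python
def shift_series(values, steps):
    """Move values by `steps` positions (positive = later); gaps become None."""
    n = len(values)
    out = [None] * n
    for i, v in enumerate(values):
        j = i + steps
        if 0 <= j < n:
            out[j] = v
    return out


def shift_window(values, window, offset=0):
    """Return {window index: shifted copy of values}."""
    indices = range(window, 1) if window <= 0 else range(1, window + 1)
    # Index w holds, at position t, the value observed at t + w.
    return {w + offset: shift_series(values, -(w + offset)) for w in indices}


series = [10, 11, 12, 13, 14]
history = shift_window(series, window=-2)  # keys -2, -1, 0
lead = shift_window(series, window=2)      # keys 1, 2
```

Reading one position across all keys of `history` gives the values at t-2, t-1, and t, which is the per-sample history window; `lead` gives t+1 and t+2 in the same way.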

static create_index_array(index_name: str, index_value: Iterable[int], squeeze_dim: str) → xarray.DataArray¶

Create a 1D xr.DataArray with given index name and value.

Parameters

index_name – name of dimension
index_value – values of this dimension

Returns

this array

static _set_file_name(path, station, statistics_per_var)¶

static _set_meta_file_name(path, station, statistics_per_var)¶

interpolate(self, data, dim: str, method: str = 'linear', limit: int = None, use_coordinate: Union[bool, str] = True, sampling='daily', **kwargs)¶

Interpolate values according to different methods.

(Copied from xarray.DataArray.interpolate_na.)

Parameters

dim – Specifies the dimension along which to interpolate.
method – {‘linear’, ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘polynomial’, ‘barycentric’, ‘krog’, ‘pchip’, ‘spline’, ‘akima’}, optional. String indicating which method to use for interpolation:
- ‘linear’: linear interpolation (default). Additional keyword arguments are passed to numpy.interp.
- ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘polynomial’: are passed to scipy.interpolate.interp1d. If method==‘polynomial’, the order keyword argument must also be provided.
- ‘barycentric’, ‘krog’, ‘pchip’, ‘spline’, and ‘akima’: use their respective scipy.interpolate classes.
limit – default None. Maximum number of consecutive NaNs to fill. Must be greater than 0 or None for no limit.
use_coordinate – default True. Specifies which index to use as the x values in the interpolation formulated as y = f(x). If False, values are treated as if equally spaced along dim. If True, the IndexVariable dim is used. If use_coordinate is a string, it specifies the name of a coordinate variable to use as the index.
kwargs –

Returns

xarray.DataArray
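The effect of `limit` on linear interpolation can be sketched on a plain list. This is an illustration only: it assumes equally spaced values (the `use_coordinate=False` case), fills at most `limit` NaNs at the start of each interior gap, and leaves leading and trailing NaNs alone. The function `interpolate_linear` is hypothetical and does not call xarray.

```python
import math


def interpolate_linear(values, limit=None):
    """Linearly fill interior NaN gaps; fill at most `limit` NaNs per gap."""
    out = list(values)
    known = [i for i, v in enumerate(values) if not math.isnan(v)]
    # Walk over neighbouring pairs of known values and fill what lies between.
    for left, right in zip(known, known[1:]):
        gap = right - left - 1
        fill = gap if limit is None else min(gap, limit)
        slope = (values[right] - values[left]) / (right - left)
        for k in range(1, fill + 1):
            out[left + k] = values[left] + slope * k
    return out


nan = math.nan
series = [1.0, nan, nan, nan, 5.0, nan, 7.0]
filled_all = interpolate_linear(series)
filled_two = interpolate_linear(series, limit=2)
```

With `limit=2`, the third NaN of the first gap stays missing, so a later NaN-removal step can still discard samples that rely on long gaps.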

static create_full_time_dim(data, dim, sampling)¶: Ensure that the time dimension is equidistant. Dates can be missing if missing values have been dropped.
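As an illustration of what an equidistant time dimension means for daily sampling, the sketch below reindexes sparse observations onto every day between the first and last date. The function `full_daily_index` is hypothetical, handles only daily data, and uses a dict in place of an xarray.DataArray.

```python
from datetime import date, timedelta


def full_daily_index(observations):
    """Reindex {date: value} onto every day in the covered range; gaps become None."""
    first, last = min(observations), max(observations)
    n_days = (last - first).days + 1
    full = {}
    for n in range(n_days):
        day = first + timedelta(days=n)
        full[day] = observations.get(day)  # None marks a date that was dropped
    return full


sparse = {
    date(2020, 1, 1): 1.0,
    date(2020, 1, 2): 2.0,
    date(2020, 1, 5): 5.0,
}
dense = full_daily_index(sparse)
```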

make_history_window(self, dim_name_of_inputs: str, window: int, dim_name_of_shift: str) → None ¶

Create a xr.DataArray containing history data.

Shift the data window+1 times and build an xarray.DataArray with a new dimension ‘window’ containing the shifted data. This represents history in the data. The result is stored in the history attribute.

Parameters

dim_name_of_inputs – Name of dimension which contains the input variables
window – number of time steps to look back in history. Note: window is treated as a negative value, in agreement with looking back on a timeline. Positive values are allowed, but they are converted to their negative counterpart.
dim_name_of_shift – Dimension along which the shift will be applied

make_labels(self, dim_name_of_target: str, target_var: str_or_list, dim_name_of_shift: str, window: int) → None ¶

Create a xr.DataArray containing labels.

Labels are defined as the consecutive target values (t+1, …t+n) following the current time step t. Set label attribute.

Parameters

dim_name_of_target – Name of dimension which contains the target variable
target_var – Name of target variable in ‘dimension’
dim_name_of_shift – Name of dimension on which xarray.DataArray.shift will be applied
window – lead time of label

make_observation(self, dim_name_of_target: str, target_var: str_or_list, dim_name_of_shift: str) → None ¶

Create a xr.DataArray containing observations.

Observations are defined as the value at the current time step t. Set observation attribute.

Parameters

dim_name_of_target – Name of dimension which contains the observation variable
target_var – Name of observation variable(s) in ‘dimension’
dim_name_of_shift – Name of dimension on which xarray.DataArray.shift will be applied

remove_nan(self, dim: str) → None ¶

Remove all slices along dim that contain NaNs in history, label, and observation.

This is done to present only a full matrix to keras.fit. Update history, label, and observation attribute.

Parameters: dim – dimension along which the removal is performed.
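The idea is to keep only those time steps that are complete in all three collections. A minimal sketch, with dicts mapping a time step to its values in place of xarray.DataArrays (the function `valid_steps` and the sample data are hypothetical):

```python
import math


def valid_steps(history, label, observation):
    """Return the time steps that are free of NaN in all three collections."""
    def complete(rows):
        return {t for t, row in rows.items() if not any(math.isnan(v) for v in row)}

    # A step survives only if it is complete everywhere.
    return sorted(complete(history) & complete(label) & complete(observation))


nan = math.nan
history = {0: [1.0, 2.0], 1: [nan, 3.0], 2: [3.0, 4.0], 3: [4.0, 5.0]}
label = {0: [3.0], 1: [4.0], 2: [5.0], 3: [nan]}
observation = {0: [2.0], 1: [3.0], 2: [4.0], 3: [5.0]}

keep = valid_steps(history, label, observation)
history_clean = {t: history[t] for t in keep}
```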

_slice_prep(self, data: xarray.DataArray, start=None, end=None) → xarray.DataArray¶

Set start and end date for slicing and execute self._slice().

Parameters

data – data to slice
coord – name of axis to slice

Returns

sliced data

static _slice(data: xarray.DataArray, start: Union[date, str], end: Union[date, str], coord: str) → xarray.DataArray¶

Slice through a given data_item (for example select only values of 2011).

Parameters

data – data to slice
start – start date of slice
end – end date of slice
coord – name of axis to slice

Returns

sliced data
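The example from the description (select only values of 2011) can be sketched as follows. The sketch assumes that both bounds are inclusive and that dates may be given as `date` objects or ISO strings; `slice_by_date` is a hypothetical helper working on a dict rather than an xarray.DataArray.

```python
from datetime import date


def slice_by_date(records, start, end):
    """Keep entries with start <= date <= end (both bounds inclusive)."""
    def as_date(value):
        return date.fromisoformat(value) if isinstance(value, str) else value

    start, end = as_date(start), as_date(end)
    return {day: v for day, v in records.items() if start <= day <= end}


records = {
    date(2010, 12, 31): 0.1,
    date(2011, 1, 1): 0.2,
    date(2011, 12, 31): 0.3,
    date(2012, 1, 1): 0.4,
}
only_2011 = slice_by_date(records, "2011-01-01", date(2011, 12, 31))
```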

setup_transformation(self, transformation: Union[None, dict, Tuple]) → Tuple[Optional[dict], Optional[dict]]¶

Set up transformation by extracting all relevant information.

Either return new empty DataClass instances if given transformation arg is None,
or return given object twice if transformation is a DataClass instance,
or return the inputs and targets attributes if transformation is a TransformationClass instance (default design behaviour)

static check_inverse_transform_params(method: str, mean=None, std=None, min=None, max=None) → None ¶

Support inverse_transformation method.

Validate if all required statistics are available for given method. E.g. centering requires mean only, whereas normalisation requires mean and standard deviation. Will raise an AttributeError on missing requirements.

Parameters

mean – data with all mean values
std – data with all standard deviation values
method – name of transformation method
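A sketch of such a validation, limited to the two cases the description mentions: centring needs the mean, standardising needs mean and standard deviation. The mapping `REQUIRED` and the function `check_params` are hypothetical, and the requirements for `min` and `max` are left out because they are not stated here.

```python
# Statistics each method needs for its inverse, as far as the text states them.
REQUIRED = {
    "centre": ("mean",),
    "standardise": ("mean", "std"),
}


def check_params(method, **stats):
    """Raise AttributeError if a statistic required by `method` is missing."""
    missing = [name for name in REQUIRED[method] if stats.get(name) is None]
    if missing:
        raise AttributeError(
            f"Inverse transformation '{method}' requires: {', '.join(missing)}"
        )


check_params("centre", mean=1.0)  # passes silently

try:
    check_params("standardise", mean=1.0)  # std is missing
    error_message = None
except AttributeError as err:
    error_message = str(err)
```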

inverse_transform(self, data_in, opts, transformation_dim) → xarray.DataArray¶

Perform inverse transformation.

Raises an AssertionError if no transformation was performed before. First checks whether all required statistics are available for the inverse transformation. The class attributes data, mean, and std are overwritten by new data afterwards. Thereby, mean, std, and the private transform method are set to None to indicate that the current data is not transformed.

apply_transformation(self, data, base=None, dim=0, inverse=False)¶

Apply transformation on external data. Specify if transformation should be based on parameters related to input or target data using base. This method can also apply inverse transformation.

Parameters

data –
base –
dim –
inverse –

Returns

_hash_list(self)¶

_get_hash(self)¶
