mlair.data_handler.data_handler_single_station

Data Preparation class to handle data processing for machine learning.

Module Contents

Classes

DataHandlerSingleStation

Parameter window_history_offset: used to shift t0 according to the specified value.

Attributes

__author__

__date__

date

str_or_list

number

num_or_list

data_or_none

mlair.data_handler.data_handler_single_station.__author__ = Lukas Leufen, Felix Kleinert
mlair.data_handler.data_handler_single_station.__date__ = 2020-07-20
mlair.data_handler.data_handler_single_station.date
mlair.data_handler.data_handler_single_station.str_or_list
mlair.data_handler.data_handler_single_station.number
mlair.data_handler.data_handler_single_station.num_or_list
mlair.data_handler.data_handler_single_station.data_or_none
class mlair.data_handler.data_handler_single_station.DataHandlerSingleStation(station, data_path, statistics_per_var=None, sampling: Union[str, Tuple[str]] = DEFAULT_SAMPLING, target_dim=DEFAULT_TARGET_DIM, target_var=DEFAULT_TARGET_VAR, time_dim=DEFAULT_TIME_DIM, iter_dim=DEFAULT_ITER_DIM, window_dim=DEFAULT_WINDOW_DIM, window_history_size=DEFAULT_WINDOW_HISTORY_SIZE, window_history_offset=DEFAULT_WINDOW_HISTORY_OFFSET, window_history_end=DEFAULT_WINDOW_HISTORY_END, window_lead_time=DEFAULT_WINDOW_LEAD_TIME, interpolation_limit: Union[int, Tuple[int]] = DEFAULT_INTERPOLATION_LIMIT, interpolation_method: Union[str, Tuple[str]] = DEFAULT_INTERPOLATION_METHOD, overwrite_local_data: bool = False, transformation=None, store_data_locally: bool = True, min_length: int = 0, start=None, end=None, variables=None, data_origin: Dict = None, lazy_preprocessing: bool = False, overwrite_lazy_data=False, era5_data_path=None, era5_file_names=None, ifs_data_path=None, ifs_file_names=None, **kwargs)

Bases: mlair.data_handler.abstract_data_handler.AbstractDataHandler

Parameters
  • window_history_offset – used to shift t0 according to the specified value.

  • window_history_end – used to set the last time step that is used to create a sample. A negative value indicates that not all values up to t0 are used; a positive value indicates usage of values at t>t0. Default is 0.

DEFAULT_VAR_ALL_DICT
DEFAULT_WINDOW_LEAD_TIME = 3
DEFAULT_WINDOW_HISTORY_SIZE = 13
DEFAULT_WINDOW_HISTORY_OFFSET = 0
DEFAULT_WINDOW_HISTORY_END = 0
DEFAULT_TIME_DIM = datetime
DEFAULT_TARGET_VAR = o3
DEFAULT_TARGET_DIM = variables
DEFAULT_ITER_DIM = Stations
DEFAULT_WINDOW_DIM = window
DEFAULT_SAMPLING = daily
DEFAULT_INTERPOLATION_LIMIT = 0
DEFAULT_INTERPOLATION_METHOD = linear
chem_vars = ['benzene', 'ch4', 'co', 'ethane', 'no', 'no2', 'nox', 'o3', 'ox', 'pm1', 'pm10', 'pm2p5',...
_hash = ['station', 'statistics_per_var', 'data_origin', 'sampling', 'target_dim', 'target_var',...
clean_up(self)
__str__(self)

Return str(self).

__len__(self)
property shape(self)
__repr__(self)

Return repr(self).

get_transposed_history(self) → xarray.DataArray

Return history.

Returns

history with dimensions datetime, window, Stations, variables.

get_transposed_label(self) → xarray.DataArray

Return label.

Returns

label with dimensions datetime*, window*, Stations, variables.

get_X(self, **kwargs)
get_Y(self, **kwargs)
get_coordinates(self)

Return coordinates as dictionary with keys lon and lat.

call_transform(self, inverse=False)
transform(self, data_in, dim: Union[str, int] = 0, inverse: bool = False, opts=None, transformation_dim=DEFAULT_TARGET_DIM)

Transform data according to given transformation settings.

This function transforms an xarray.DataArray (along dim) or a pandas.DataFrame (along axis) either to mean=0 and std=1 (method=standardise) or centres the data to mean=0 without changing its scale (method=centre). It also sets internal instance attributes for a later inverse transformation. This method raises an AssertionError if an internal transform method was already set (inverse=False) or if the internal transform method, internal mean, and internal standard deviation were not set (inverse=True).

Parameters
  • dim (string/int) – This parameter is not used for inverse transformation.

    For xarray.DataArray, a string: name of the dimension which should be standardised.

    For pandas.DataFrame, an int: axis of the dimension which should be standardised.

  • inverse – Switch between transformation and inverse transformation.

Returns

xarray.DataArrays or pandas.DataFrames:

  1. mean: Mean of data

  2. std: Standard deviation of data

  3. data: Standardised data
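The two methods named above can be sketched with plain xarray arithmetic. This is a minimal illustration and not MLAir's implementation: the helper names standardise, centre, and standardise_inverse are hypothetical, and the statistics are simply returned instead of being stored on the instance.

```python
import numpy as np
import xarray as xr


def standardise(data: xr.DataArray, dim: str):
    """Scale data to mean=0 and std=1 along dim; return (mean, std, data)."""
    mean = data.mean(dim)
    std = data.std(dim)
    return mean, std, (data - mean) / std


def centre(data: xr.DataArray, dim: str):
    """Shift data to mean=0 along dim without changing its scale."""
    mean = data.mean(dim)
    return mean, None, data - mean


def standardise_inverse(data: xr.DataArray, mean, std):
    """Undo standardise() using the statistics it returned."""
    return data * std + mean


# Toy data: three time steps, two variables.
values = xr.DataArray(
    np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]),
    dims=("datetime", "variables"),
    coords={"variables": ["o3", "no2"]},
)
mean, std, scaled = standardise(values, "datetime")
_, _, centred = centre(values, "datetime")
restored = standardise_inverse(scaled, mean, std)
```

Keeping mean and std around is what makes the inverse transformation possible, which is why the documented method stores them internally.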

setup_samples(self)

Set up samples. This method prepares and creates samples X and labels Y.

store_lazy(self)
_create_lazy_data(self)
load_lazy(self)
_extract_lazy(self, lazy_data)
make_input_target(self)
set_inputs_and_targets(self)
make_samples(self)
load_data(self, path, station, statistics_per_var, sampling, store_data_locally=False, data_origin: Dict = None, start=None, end=None)

Load data and meta data either from local disk (preferred) or download new data by using a custom download method.

Data is downloaded if no local data is available or if the parameter overwrite_local_data is true. In both cases, downloaded data is stored locally unless store_data_locally is disabled. If this parameter is not set, it is assumed that data should be saved locally.
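The decision logic described above might look roughly like the following sketch. It is an assumption-laden illustration: load_or_download and the download callable are hypothetical stand-ins, and plain text files replace the real data and meta data formats.

```python
import os
import tempfile


def load_or_download(file_name, download, overwrite_local_data=False,
                     store_data_locally=True):
    """Return (data, source); prefer the local file unless an overwrite is requested."""
    if os.path.exists(file_name) and not overwrite_local_data:
        with open(file_name) as f:
            return f.read(), "local"
    data = download()  # hypothetical custom download method
    if store_data_locally:
        with open(file_name, "w") as f:
            f.write(data)
    return data, "download"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "station.csv")
    # First call: nothing on disk, so data is downloaded and stored.
    _, first_source = load_or_download(path, lambda: "o3\n42\n")
    # Second call: the local copy is preferred.
    _, second_source = load_or_download(path, lambda: "o3\n43\n")
    # Third call: the overwrite flag forces a fresh download.
    third_data, third_source = load_or_download(
        path, lambda: "o3\n44\n", overwrite_local_data=True
    )
```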

static check_station_meta(meta, station, data_origin, statistics_per_var)

Search for the entries in meta data and compare the value with the requested values.

Will raise a FileNotFoundError if the values mismatch.

check_for_negative_concentrations(self, data: xarray.DataArray, minimum: int = 0) → xarray.DataArray

Set all negative concentrations to zero.

Names of all concentrations are extracted from https://join.fz-juelich.de/services/rest/surfacedata/ #2.1 Parameters. Currently, this check is applied on “benzene”, “ch4”, “co”, “ethane”, “no”, “no2”, “nox”, “o3”, “ox”, “pm1”, “pm10”, “pm2p5”, “propane”, “so2”, and “toluene”.

Parameters
  • data – data array containing variables to check

  • minimum – minimum value, by default this should be 0

Returns

corrected data
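A minimal sketch of such a check, assuming the variables live on a dimension called variables (the documented DEFAULT_TARGET_DIM). The function name clip_negative_concentrations and the shortened variable list are hypothetical; missing values are left untouched.

```python
import numpy as np
import xarray as xr

CHEM_VARS = ["no2", "o3", "pm10"]  # shortened list, for illustration only


def clip_negative_concentrations(data, chem_vars, minimum=0, var_dim="variables"):
    """Raise values below minimum to minimum, but only for chemical variables."""
    is_chem = data.coords[var_dim].isin(list(chem_vars))
    too_low = (data < minimum) & is_chem  # NaN compares as False and is kept
    return data.where(~too_low, minimum)


data = xr.DataArray(
    np.array([[-1.0, -5.0], [2.0, 3.0], [np.nan, 1.0]]),
    dims=("datetime", "variables"),
    coords={"variables": ["o3", "temp"]},
)
cleaned = clip_negative_concentrations(data, CHEM_VARS)
```

In this toy example the negative o3 value becomes 0, while the negative temperature is preserved because temp is not a concentration.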

setup_data_path(self, data_path: str, sampling: str)
shift(self, data: xarray.DataArray, dim: str, window: int, offset: int = 0) → xarray.DataArray

Shift data multiple times to represent history (if window <= 0) or lead time (if window > 0).

Parameters
  • data – data set to shift

  • dim – dimension along which the shift is applied

  • window – number of steps to shift (corresponds to the window length)

  • offset – move the window by as many time steps as given in offset. This can be used if the index time of a history element is not the last timestamp. For example, you could use offset=23 when dealing with hourly data in combination with daily data (values from 00 to 23 are aggregated on 00 the same day).

Returns

shifted data
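The idea of shifting repeatedly and stacking the results along a window dimension can be sketched as follows. This is an illustration under assumptions, not the method's actual code: shift_window is a hypothetical name, the offset parameter is left out, and the window coordinate values (window..0 for history, 1..window for lead time) are a guess at a sensible labelling.

```python
import numpy as np
import pandas as pd
import xarray as xr


def shift_window(data, dim, window, window_dim="window"):
    """Stack shifted copies of data along a new window dimension."""
    if window <= 0:
        steps = list(range(window, 1))      # history: window, ..., -1, 0
    else:
        steps = list(range(1, window + 1))  # lead time: 1, ..., window
    # Shifting by -step puts the value from time t+step at position t.
    shifted = [data.shift({dim: -step}) for step in steps]
    return xr.concat(shifted, dim=pd.Index(steps, name=window_dim))


series = xr.DataArray(
    np.array([10.0, 11.0, 12.0, 13.0, 14.0]),
    dims="datetime",
    coords={"datetime": pd.date_range("2020-01-01", periods=5, freq="D")},
)
history = shift_window(series, "datetime", window=-2)
lead = shift_window(series, "datetime", window=2)
```

For the third time step, history holds the values at t-2, t-1, and t, while lead holds the values at t+1 and t+2. Positions without enough neighbours are filled with NaN.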

static create_index_array(index_name: str, index_value: Iterable[int], squeeze_dim: str) → xarray.DataArray

Create a 1D xr.DataArray with given index name and value.

Parameters
  • index_name – name of dimension

  • index_value – values of this dimension

Returns

this array

static _set_file_name(path, station, statistics_per_var)
static _set_meta_file_name(path, station, statistics_per_var)
interpolate(self, data, dim: str, method: str = 'linear', limit: int = None, use_coordinate: Union[bool, str] = True, sampling='daily', **kwargs)

Interpolate values according to different methods.

(Copied from xarray.DataArray.interpolate_na.)

Parameters
  • dim – Specifies the dimension along which to interpolate.

  • method

    {‘linear’, ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘polynomial’, ‘barycentric’, ‘krogh’, ‘pchip’, ‘spline’, ‘akima’}, optional

    String indicating which method to use for interpolation:

    • ‘linear’: linear interpolation (default). Additional keyword arguments are passed to numpy.interp.

    • ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘polynomial’: are passed to scipy.interpolate.interp1d. If method=‘polynomial’, the order keyword argument must also be provided.

    • ‘barycentric’, ‘krogh’, ‘pchip’, ‘spline’, and ‘akima’: use their respective scipy.interpolate classes.

  • limit – Maximum number of consecutive NaNs to fill. Must be greater than 0, or None for no limit. Default is None.

  • use_coordinate

    default True

    Specifies which index to use as the x values in the interpolation formulated as y = f(x). If False, values are treated as if equally spaced along dim. If True, the IndexVariable dim is used. If use_coordinate is a string, it specifies the name of a coordinate variable to use as the index.

  • kwargs

Returns

xarray.DataArray
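Since the parameters mirror xarray.DataArray.interpolate_na, the effect of method and limit can be illustrated by calling that xarray method directly. This sketch does not go through DataHandlerSingleStation.interpolate, and the toy series is made up.

```python
import numpy as np
import pandas as pd
import xarray as xr

series = xr.DataArray(
    [1.0, np.nan, np.nan, 4.0, np.nan, 6.0],
    dims="datetime",
    coords={"datetime": pd.date_range("2020-01-01", periods=6, freq="D")},
)

# No limit: every gap is filled by linear interpolation.
filled_all = series.interpolate_na(dim="datetime", method="linear")

# limit=1: at most one consecutive NaN is filled per gap.
filled_short = series.interpolate_na(dim="datetime", method="linear", limit=1)
```

With limit=1, the single missing value near the end is filled, whereas the two-step gap is only partly filled and keeps one NaN.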

static create_full_time_dim(data, dim, sampling)

Ensure that the time dimension is equidistant. Dates can be missing if missing values have been dropped.
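One way to restore an equidistant time axis is to reindex onto a complete date range. The sketch below is an assumption about the general technique, not the method's actual code; fill_time_gaps and its freq argument are hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr


def fill_time_gaps(data, dim, freq="1D"):
    """Reindex data onto a gap-free date range; new entries become NaN."""
    start = data.coords[dim].values[0]
    end = data.coords[dim].values[-1]
    full_index = pd.date_range(start, end, freq=freq)
    return data.reindex({dim: full_index})


# 2020-01-03 and 2020-01-04 are missing from the time axis.
sparse = xr.DataArray(
    [1.0, 2.0, 5.0],
    dims="datetime",
    coords={"datetime": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05"])},
)
full = fill_time_gaps(sparse, "datetime")
```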

make_history_window(self, dim_name_of_inputs: str, window: int, dim_name_of_shift: str) → None

Create a xr.DataArray containing history data.

Shift the data window+1 times and build an xarray.DataArray which has a new dimension ‘window’ containing the shifted data. This is used to represent history in the data. Results are stored in the history attribute.

Parameters
  • dim_name_of_inputs – Name of dimension which contains the input variables

  • window – number of time steps to look back in history. Note: window is treated as a negative value, in agreement with looking back on a time line. Positive values are allowed, but they are converted to their negative counterpart.

  • dim_name_of_shift – Dimension along which the shift will be applied

make_labels(self, dim_name_of_target: str, target_var: str_or_list, dim_name_of_shift: str, window: int) → None

Create a xr.DataArray containing labels.

Labels are defined as the consecutive target values (t+1, …, t+n) following the current time step t. Set label attribute.

Parameters
  • dim_name_of_target – Name of dimension which contains the target variable

  • target_var – Name of target variable in ‘dimension’

  • dim_name_of_shift – Name of dimension on which xarray.DataArray.shift will be applied

  • window – lead time of label

make_observation(self, dim_name_of_target: str, target_var: str_or_list, dim_name_of_shift: str) → None

Create a xr.DataArray containing observations.

Observations are defined as value of the current time step t. Set observation attribute.

Parameters
  • dim_name_of_target – Name of dimension which contains the observation variable

  • target_var – Name of observation variable(s) in ‘dimension’

  • dim_name_of_shift – Name of dimension on which xarray.DataArray.shift will be applied

remove_nan(self, dim: str) → None

Remove all slices along dim which contain NaNs in history, label, or observation.

This is done to present only a full matrix to keras.fit. Update history, label, and observation attribute.

Parameters

dim – dimension along which the removal is performed.
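The removal can be pictured as keeping only those entries along dim that are complete in every array. The following sketch assumes this intersection approach; common_valid_index is a hypothetical helper, and only two toy arrays stand in for history, label, and observation.

```python
import numpy as np
import pandas as pd
import xarray as xr


def common_valid_index(arrays, dim):
    """Return the labels along dim that are NaN-free in every given array."""
    index = None
    for arr in arrays:
        valid = arr.dropna(dim=dim, how="any").coords[dim].to_index()
        index = valid if index is None else index.intersection(valid)
    return index


time = pd.date_range("2020-01-01", periods=4, freq="D")
history = xr.DataArray(
    [[np.nan, 1.0], [1.0, 2.0], [2.0, 3.0], [3.0, 4.0]],
    dims=("datetime", "window"),
    coords={"datetime": time},
)
label = xr.DataArray([2.0, 3.0, 4.0, np.nan], dims="datetime", coords={"datetime": time})

keep = common_valid_index([history, label], "datetime")
history_clean = history.sel(datetime=keep)
label_clean = label.sel(datetime=keep)
```

Here the first time step is dropped because of the incomplete history and the last one because of the missing label, so both arrays end up with the same two time steps.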

_slice_prep(self, data: xarray.DataArray, start=None, end=None) → xarray.DataArray

Set start and end date for slicing and execute self._slice().

Parameters
  • data – data to slice

  • coord – name of axis to slice

Returns

sliced data

static _slice(data: xarray.DataArray, start: Union[date, str], end: Union[date, str], coord: str) → xarray.DataArray

Slice through a given data_item (for example select only values of 2011).

Parameters
  • data – data to slice

  • start – start date of slice

  • end – end date of slice

  • coord – name of axis to slice

Returns

sliced data
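Label-based slicing along a named time axis can be sketched with xarray's sel. This is an illustration of the general technique under assumptions, not the method's source; slice_by_date is a hypothetical name.

```python
import datetime as dt

import numpy as np
import pandas as pd
import xarray as xr


def slice_by_date(data, start, end, coord):
    """Select all entries between start and end (both inclusive) along coord."""
    return data.sel({coord: slice(str(start), str(end))})


data = xr.DataArray(
    np.arange(5.0),
    dims="datetime",
    coords={"datetime": pd.date_range("2010-12-30", periods=5, freq="D")},
)
# Select only values of 2011; start and end may be dates or strings.
only_2011 = slice_by_date(data, dt.date(2011, 1, 1), "2011-12-31", "datetime")
```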

setup_transformation(self, transformation: Union[None, dict, Tuple]) → Tuple[Optional[dict], Optional[dict]]

Set up transformation by extracting all relevant information.

  • Either return new empty DataClass instances if given transformation arg is None,

  • or return given object twice if transformation is a DataClass instance,

  • or return the inputs and targets attributes if transformation is a TransformationClass instance (default design behaviour)

static check_inverse_transform_params(method: str, mean=None, std=None, min=None, max=None) → None

Support inverse_transformation method.

Validate if all required statistics are available for given method. E.g. centering requires mean only, whereas normalisation requires mean and standard deviation. Will raise an AttributeError on missing requirements.

Parameters
  • mean – data with all mean values

  • std – data with all standard deviation values

  • method – name of transformation method
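A check of this kind can be sketched as a lookup of the statistics each method needs. The function check_required_statistics is hypothetical, the method names are taken from the transform description (standardise, centre), and raising ValueError for an unknown method is an assumption of this sketch.

```python
# Statistics required by each method, as described in the text above.
REQUIRED_STATISTICS = {
    "standardise": ("mean", "std"),
    "centre": ("mean",),
}


def check_required_statistics(method, mean=None, std=None):
    """Raise AttributeError if a statistic required by method is missing."""
    if method not in REQUIRED_STATISTICS:
        # Assumption: unknown methods are rejected explicitly.
        raise ValueError(f"unknown transformation method: {method}")
    available = {"mean": mean, "std": std}
    missing = [name for name in REQUIRED_STATISTICS[method] if available[name] is None]
    if missing:
        raise AttributeError(f"inverse of {method} requires: {', '.join(missing)}")


check_required_statistics("centre", mean=1.5)  # passes, std is not needed

try:
    check_required_statistics("standardise", mean=1.5)
    error_message = None
except AttributeError as err:
    error_message = str(err)
```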

inverse_transform(self, data_in, opts, transformation_dim) → xarray.DataArray

Perform inverse transformation.

Will raise an AssertionError if no transformation was performed before. Checks first whether all required statistics are available for inverse transformation. Class attributes data, mean, and std are overwritten by new data afterwards. Thereby, mean, std, and the private transform method are set to None to indicate that the current data is not transformed.

apply_transformation(self, data, base=None, dim=0, inverse=False)

Apply transformation on external data. Specify if transformation should be based on parameters related to input or target data using base. This method can also apply inverse transformation.

Parameters
  • data

  • base

  • dim

  • inverse

Returns

_hash_list(self)
_get_hash(self)