mlair.data_handler.data_handler_single_station
¶
Data Preparation class to handle data processing for machine learning.
Module Contents¶
Classes¶
Attributes¶
-
mlair.data_handler.data_handler_single_station.
__date__
= 2020-07-20¶
-
mlair.data_handler.data_handler_single_station.
date
¶
-
mlair.data_handler.data_handler_single_station.
str_or_list
¶
-
mlair.data_handler.data_handler_single_station.
number
¶
-
mlair.data_handler.data_handler_single_station.
num_or_list
¶
-
mlair.data_handler.data_handler_single_station.
data_or_none
¶
-
class
mlair.data_handler.data_handler_single_station.
DataHandlerSingleStation
(station, data_path, statistics_per_var=None, sampling: Union[str, Tuple[str]] = DEFAULT_SAMPLING, target_dim=DEFAULT_TARGET_DIM, target_var=DEFAULT_TARGET_VAR, time_dim=DEFAULT_TIME_DIM, iter_dim=DEFAULT_ITER_DIM, window_dim=DEFAULT_WINDOW_DIM, window_history_size=DEFAULT_WINDOW_HISTORY_SIZE, window_history_offset=DEFAULT_WINDOW_HISTORY_OFFSET, window_history_end=DEFAULT_WINDOW_HISTORY_END, window_lead_time=DEFAULT_WINDOW_LEAD_TIME, interpolation_limit: Union[int, Tuple[int]] = DEFAULT_INTERPOLATION_LIMIT, interpolation_method: Union[str, Tuple[str]] = DEFAULT_INTERPOLATION_METHOD, overwrite_local_data: bool = False, transformation=None, store_data_locally: bool = True, min_length: int = 0, start=None, end=None, variables=None, data_origin: Dict = None, lazy_preprocessing: bool = False, overwrite_lazy_data=False, era5_data_path=None, era5_file_names=None, ifs_data_path=None, ifs_file_names=None, **kwargs)¶ Bases:
mlair.data_handler.abstract_data_handler.AbstractDataHandler
- Parameters
window_history_offset – used to shift t0 according to the specified value.
window_history_end – used to set the last time step that is used to create a sample. A negative value indicates that not all values up to t0 are used; a positive value indicates usage of values at t>t0. Default is 0.
-
DEFAULT_VAR_ALL_DICT
¶
-
DEFAULT_WINDOW_LEAD_TIME
= 3¶
-
DEFAULT_WINDOW_HISTORY_SIZE
= 13¶
-
DEFAULT_WINDOW_HISTORY_OFFSET
= 0¶
-
DEFAULT_WINDOW_HISTORY_END
= 0¶
-
DEFAULT_TIME_DIM
= datetime¶
-
DEFAULT_TARGET_VAR
= o3¶
-
DEFAULT_TARGET_DIM
= variables¶
-
DEFAULT_ITER_DIM
= Stations¶
-
DEFAULT_WINDOW_DIM
= window¶
-
DEFAULT_SAMPLING
= daily¶
-
DEFAULT_INTERPOLATION_LIMIT
= 0¶
-
DEFAULT_INTERPOLATION_METHOD
= linear¶
-
chem_vars
= ['benzene', 'ch4', 'co', 'ethane', 'no', 'no2', 'nox', 'o3', 'ox', 'pm1', 'pm10', 'pm2p5',...¶
-
_hash
= ['station', 'statistics_per_var', 'data_origin', 'sampling', 'target_dim', 'target_var',...¶
-
clean_up
(self)¶
-
__str__
(self)¶ Return str(self).
-
__len__
(self)¶
-
property
shape
(self)¶
-
__repr__
(self)¶ Return repr(self).
-
get_transposed_history
(self) → xarray.DataArray¶ Return history.
- Returns
history with dimensions datetime, window, Stations, variables.
-
get_transposed_label
(self) → xarray.DataArray¶ Return label.
- Returns
label with dimensions datetime*, window*, Stations, variables.
-
get_X
(self, **kwargs)¶
-
get_Y
(self, **kwargs)¶
-
get_coordinates
(self)¶ Return coordinates as dictionary with keys lon and lat.
-
call_transform
(self, inverse=False)¶
-
transform
(self, data_in, dim: Union[str, int] = 0, inverse: bool = False, opts=None, transformation_dim=DEFAULT_TARGET_DIM)¶ Transform data according to given transformation settings.
This function transforms an xarray.DataArray (along dim) or a pandas.DataFrame (along axis), either by standardising to mean=0 and std=1 (method=standardise) or by centring to mean=0 without changing the data scale (method=centre). It also sets an internal instance attribute for the later inverse transformation. This method raises an AssertionError if an internal transform method was already set (‘inverse=False’) or if the internal transform method, internal mean, and internal standard deviation were not set (‘inverse=True’).
- Parameters
dim (string/int) – Not used for inverse transformation. For an xarray.DataArray, a string: the name of the dimension that should be standardised. For a pandas.DataFrame, an int: the axis that should be standardised.
inverse – Switch between transformation and inverse transformation.
- Returns
xarray.DataArrays or pandas.DataFrames: (1) mean: mean of data, (2) std: standard deviation of data, (3) data: standardised data.
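The two methods named above can be illustrated with a minimal NumPy sketch. It is not MLAir's implementation: the function names are hypothetical, plain arrays stand in for xarray.DataArray, and the statistics are returned to the caller instead of being stored on the instance.

```python
import numpy as np


def standardise(data: np.ndarray, axis: int = 0):
    """Scale data to mean=0 and std=1 along `axis` (hypothetical helper)."""
    mean = data.mean(axis=axis, keepdims=True)
    std = data.std(axis=axis, keepdims=True)
    return mean, std, (data - mean) / std


def centre(data: np.ndarray, axis: int = 0):
    """Shift data to mean=0 along `axis` without changing its scale."""
    mean = data.mean(axis=axis, keepdims=True)
    return mean, data - mean


def standardise_inverse(data: np.ndarray, mean: np.ndarray, std: np.ndarray):
    """Undo `standardise` using the statistics it returned."""
    return data * std + mean


# Two variables (columns) observed at three time steps (rows).
raw = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
mean, std, scaled = standardise(raw, axis=0)
restored = standardise_inverse(scaled, mean, std)
```

Keeping mean and std next to the transformed data is what makes the later inverse transformation possible, which is why the documented method stores them internally.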
-
setup_samples
(self)¶ Setup samples. This method prepares and creates samples X, and labels Y.
-
store_lazy
(self)¶
-
_create_lazy_data
(self)¶
-
load_lazy
(self)¶
-
_extract_lazy
(self, lazy_data)¶
-
make_input_target
(self)¶
-
set_inputs_and_targets
(self)¶
-
make_samples
(self)¶
-
load_data
(self, path, station, statistics_per_var, sampling, store_data_locally=False, data_origin: Dict = None, start=None, end=None)¶ Load data and meta data either from local disk (preferred) or download new data by using a custom download method.
Data is downloaded if no local data is available or if the parameter overwrite_local_data is true. In both cases, the downloaded data is stored locally unless store_data_locally is disabled. If this parameter is not set, it is assumed that data should be saved locally.
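The decision described above can be written as a small truth table. The sketch below is an assumption-laden illustration, not MLAir code: plan_loading and its return keys are hypothetical names, and the actual download and file handling are left out.

```python
def plan_loading(local_data_available: bool,
                 overwrite_local_data: bool = False,
                 store_data_locally: bool = True) -> dict:
    """Decide whether to download and whether to store (hypothetical helper)."""
    # Download when nothing is on disk or when the caller forces a refresh.
    download = overwrite_local_data or not local_data_available
    # Only freshly downloaded data is ever written to disk.
    store = download and store_data_locally
    return {"download": download, "store": store}


cached = plan_loading(local_data_available=True)
refresh = plan_loading(local_data_available=True, overwrite_local_data=True)
no_store = plan_loading(local_data_available=False, store_data_locally=False)
```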
-
static
check_station_meta
(meta, station, data_origin, statistics_per_var)¶ Search for the entries in meta data and compare the value with the requested values.
Will raise a FileNotFoundError if the values mismatch.
-
check_for_negative_concentrations
(self, data: xarray.DataArray, minimum: int = 0) → xarray.DataArray¶ Set all negative concentrations to zero.
Names of all concentrations are extracted from https://join.fz-juelich.de/services/rest/surfacedata/ #2.1 Parameters. Currently, this check is applied on “benzene”, “ch4”, “co”, “ethane”, “no”, “no2”, “nox”, “o3”, “ox”, “pm1”, “pm10”, “pm2p5”, “propane”, “so2”, and “toluene”.
- Parameters
data – data array containing variables to check
minimum – minimum value, by default this should be 0
- Returns
corrected data
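A minimal sketch of this check, assuming a plain 2D NumPy array with one column per variable in place of the xarray.DataArray. The function name and the shortened CHEM_VARS set are illustrative only. Non-chemical columns such as temperature are left untouched, and missing values stay missing.

```python
import numpy as np

# Shortened, illustrative subset of the concentration names listed above.
CHEM_VARS = {"no", "no2", "o3", "pm10"}


def clip_concentrations(values, names, minimum=0.0):
    """Raise concentration columns below `minimum` to `minimum`.

    values: 2D array shaped (time, variables); names: one name per column.
    """
    out = np.array(values, dtype=float)  # np.array copies, input stays unchanged
    cols = [i for i, name in enumerate(names) if name in CHEM_VARS]
    # np.maximum propagates NaN, so gaps are not silently replaced.
    out[:, cols] = np.maximum(out[:, cols], minimum)
    return out


raw = np.array([[-1.0, -5.0], [2.0, 3.0], [np.nan, -0.5]])
fixed = clip_concentrations(raw, names=["o3", "temp"])
```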
-
shift
(self, data: xarray.DataArray, dim: str, window: int, offset: int = 0) → xarray.DataArray¶ Shift data multiple times to represent history (if window <= 0) or lead time (if window > 0).
- Parameters
data – data set to shift
dim – dimension along which the shift is applied
window – number of steps to shift (corresponds to the window length)
offset – use offset to move the window by as many time steps as given in offset. This can be used, if the index time of a history element is not the last timestamp. E.g. you could use offset=23 when dealing with hourly data in combination with daily data (values from 00 to 23 are aggregated on 00 the same day).
- Returns
shifted data
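The following NumPy sketch illustrates the windowing idea under stated assumptions: a history window of size window <= 0 covers the steps window, ..., 0 (that is, |window| + 1 shifts), a lead-time window > 0 covers the steps 1, ..., window, and offset moves all steps by a constant. The function names are hypothetical and a 1D array stands in for the xarray.DataArray; the real method may differ in detail.

```python
import numpy as np


def window_offsets(window: int, offset: int = 0) -> list:
    """Relative time steps covered by a window (assumed convention)."""
    if window <= 0:
        steps = range(window, 1)       # window, ..., -1, 0
    else:
        steps = range(1, window + 1)   # 1, ..., window
    return [step + offset for step in steps]


def shift_stack(values: np.ndarray, window: int, offset: int = 0) -> np.ndarray:
    """Stack shifted copies of a 1D series.

    Row i holds, for every time t, the value at t + offsets[i];
    positions that fall outside the series are NaN.
    """
    offsets = window_offsets(window, offset)
    n = len(values)
    out = np.full((len(offsets), n), np.nan)
    for i, step in enumerate(offsets):
        src = np.arange(n) + step
        valid = (src >= 0) & (src < n)
        out[i, valid] = values[src[valid]]
    return out


series = np.arange(5, dtype=float)        # values 0..4 at t = 0..4
history = shift_stack(series, window=-2)  # rows for t-2, t-1, t
lead = shift_stack(series, window=2)      # rows for t+1, t+2
```

The NaN padding at the edges is what a later NaN-removal step would drop, so only time steps with a complete window remain.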
-
static
create_index_array
(index_name: str, index_value: Iterable[int], squeeze_dim: str) → xarray.DataArray¶ Create a 1D xr.DataArray with the given index name and value.
- Parameters
index_name – name of dimension
index_value – values of this dimension
- Returns
this array
-
static
_set_file_name
(path, station, statistics_per_var)¶
-
static
_set_meta_file_name
(path, station, statistics_per_var)¶
-
interpolate
(self, data, dim: str, method: str = 'linear', limit: int = None, use_coordinate: Union[bool, str] = True, sampling='daily', **kwargs)¶ Interpolate values according to different methods.
(Copied from xarray's DataArray.interpolate_na.)
- Parameters
dim – Specifies the dimension along which to interpolate.
method – {‘linear’, ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘polynomial’, ‘barycentric’, ‘krog’, ‘pchip’, ‘spline’, ‘akima’}, optional. String indicating which method to use for interpolation:
- ‘linear’: linear interpolation (default). Additional keyword arguments are passed to numpy.interp.
- ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘polynomial’: are passed to scipy.interpolate.interp1d. If method==‘polynomial’, the order keyword argument must also be provided.
- ‘barycentric’, ‘krog’, ‘pchip’, ‘spline’, and ‘akima’: use their respective scipy.interpolate classes.
limit – default None. Maximum number of consecutive NaNs to fill. Must be greater than 0, or None for no limit.
use_coordinate – default True. Specifies which index to use as the x values in the interpolation formulated as y = f(x). If False, values are treated as if equally spaced along dim. If True, the IndexVariable dim is used. If use_coordinate is a string, it specifies the name of a coordinate variable to use as the index.
kwargs –
- Returns
xarray.DataArray
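To make the effect of limit concrete, here is a minimal NumPy sketch of linear gap filling. It rests on several assumptions: values are treated as equally spaced (the use_coordinate=False case), only the ‘linear’ method is covered, at most the first limit NaNs of each gap are filled, and leading or trailing NaNs are never extrapolated. The function name is hypothetical and this is not the xarray or MLAir implementation.

```python
import numpy as np


def interpolate_linear(values, limit=None) -> np.ndarray:
    """Linearly fill NaN gaps in a 1D series, up to `limit` NaNs per gap."""
    values = np.asarray(values, dtype=float)
    x = np.arange(len(values))
    missing = np.isnan(values)
    if missing.all():
        return values.copy()

    filled = values.copy()
    filled[missing] = np.interp(x[missing], x[~missing], values[~missing])

    # np.interp repeats the edge values outside the observed range;
    # restore NaN there so that nothing is extrapolated.
    observed = x[~missing]
    filled[(x < observed[0]) | (x > observed[-1])] = np.nan

    if limit is not None:
        run = 0  # length of the current run of consecutive NaNs
        for i, is_missing in enumerate(missing):
            run = run + 1 if is_missing else 0
            if run > limit:
                filled[i] = np.nan
    return filled


gappy = [1.0, np.nan, np.nan, 4.0, np.nan]
no_limit = interpolate_linear(gappy)
limit_one = interpolate_linear(gappy, limit=1)
```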
-
static
create_full_time_dim
(data, dim, sampling)¶ Ensure that the time dimension is equidistant. Dates can be absent from the time axis if missing values have been dropped.
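As an illustration of what an equidistant time dimension means here, the sketch below rebuilds a gap-free daily axis and marks the re-inserted dates as NaN. It assumes daily sampling, sorted unique dates, and NumPy arrays in place of xarray objects; full_time_axis is a hypothetical name.

```python
import numpy as np


def full_time_axis(times, values):
    """Return a gap-free daily time axis and the values aligned to it."""
    times = np.asarray(times, dtype="datetime64[D]")
    step = np.timedelta64(1, "D")
    # The stop value is exclusive, so add one step to keep the last date.
    full = np.arange(times.min(), times.max() + step, step)
    aligned = np.full(len(full), np.nan)
    # Every original date exists in `full`, so searchsorted finds its position.
    aligned[np.searchsorted(full, times)] = values
    return full, aligned


dates = ["2020-01-01", "2020-01-02", "2020-01-05"]
full, aligned = full_time_axis(dates, [1.0, 2.0, 5.0])
```

With the gaps made explicit as NaN, a subsequent interpolation step can decide how many of them to fill.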
-
make_history_window
(self, dim_name_of_inputs: str, window: int, dim_name_of_shift: str) → None¶ Create an xr.DataArray containing history data.
Shift the data window+1 times and build an xarray.DataArray with a new dimension ‘window’ containing the shifted data. This is used to represent history in the data. Results are stored in the history attribute.
- Parameters
dim_name_of_inputs – Name of dimension which contains the input variables
window – number of time steps to look back in history. Note: window is treated as a negative value, in agreement with looking back on a timeline. Positive values are nonetheless allowed, but they are converted to their negative counterparts.
dim_name_of_shift – Dimension along which the shift will be applied
-
make_labels
(self, dim_name_of_target: str, target_var: str_or_list, dim_name_of_shift: str, window: int) → None¶ Create an xr.DataArray containing labels.
Labels are defined as the consecutive target values (t+1, …, t+n) following the current time step t. Sets the label attribute.
- Parameters
dim_name_of_target – Name of dimension which contains the target variable
target_var – Name of target variable in ‘dimension’
dim_name_of_shift – Name of dimension on which xarray.DataArray.shift will be applied
window – lead time of label
-
make_observation
(self, dim_name_of_target: str, target_var: str_or_list, dim_name_of_shift: str) → None¶ Create an xr.DataArray containing observations.
Observations are defined as the value at the current time step t. Sets the observation attribute.
- Parameters
dim_name_of_target – Name of dimension which contains the observation variable
target_var – Name of observation variable(s) in ‘dimension’
dim_name_of_shift – Name of dimension on which xarray.DataArray.shift will be applied
-
remove_nan
(self, dim: str) → None¶ Remove all slices along dim that contain NaNs in history, label, or observation.
This is done so that only a full matrix is presented to keras.fit. Updates the history, label, and observation attributes.
- Parameters
dim – dimension along which the removal is performed.
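A minimal sketch of the idea, assuming three NumPy arrays that share time on axis 0 instead of the xarray attributes. The helper name is hypothetical. A time step is kept only if it is complete in all three arrays, so the arrays stay aligned with each other.

```python
import numpy as np


def complete_time_steps(history, label, observation) -> np.ndarray:
    """Boolean mask of time steps (axis 0) without NaN in any array."""
    keep = np.ones(history.shape[0], dtype=bool)
    for arr in (history, label, observation):
        # Flatten everything except time, then flag rows containing NaN.
        flat = arr.reshape(arr.shape[0], -1)
        keep &= ~np.isnan(flat).any(axis=1)
    return keep


history = np.array([[np.nan, 1.0], [1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])
label = np.array([[2.0], [3.0], [4.0], [np.nan]])
observation = np.array([[1.0], [2.0], [3.0], [4.0]])

keep = complete_time_steps(history, label, observation)
history, label, observation = history[keep], label[keep], observation[keep]
```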
-
_slice_prep
(self, data: xarray.DataArray, start=None, end=None) → xarray.DataArray¶ Set start and end date for slicing and execute self._slice().
- Parameters
data – data to slice
coord – name of axis to slice
- Returns
sliced data
-
static
_slice
(data: xarray.DataArray, start: Union[date, str], end: Union[date, str], coord: str) → xarray.DataArray¶ Slice through a given data_item (for example select only values of 2011).
- Parameters
data – data to slice
start – start date of slice
end – end date of slice
coord – name of axis to slice
- Returns
sliced data
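The example mentioned above, selecting only the values of 2011, can be sketched with NumPy datetimes. This assumes that both bounds are inclusive, as is the case for label-based slicing in xarray. The function name is hypothetical and arrays stand in for the DataArray and its coordinate.

```python
import numpy as np


def slice_by_date(times, values, start, end):
    """Keep the entries whose date lies in [start, end], bounds included."""
    times = np.asarray(times, dtype="datetime64[D]")
    values = np.asarray(values)
    mask = (times >= np.datetime64(start)) & (times <= np.datetime64(end))
    return times[mask], values[mask]


times = ["2010-12-31", "2011-01-01", "2011-06-15", "2011-12-31", "2012-01-01"]
values = [0.0, 1.0, 2.0, 3.0, 4.0]
t_2011, v_2011 = slice_by_date(times, values, "2011-01-01", "2011-12-31")
```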
-
setup_transformation
(self, transformation: Union[None, dict, Tuple]) → Tuple[Optional[dict], Optional[dict]]¶ Set up transformation by extracting all relevant information.
Either return new empty DataClass instances if given transformation arg is None,
or return given object twice if transformation is a DataClass instance,
or return the inputs and targets attributes if transformation is a TransformationClass instance (default design behaviour)
-
static
check_inverse_transform_params
(method: str, mean=None, std=None, min=None, max=None) → None¶ Support inverse_transformation method.
Validate if all required statistics are available for given method. E.g. centering requires mean only, whereas normalisation requires mean and standard deviation. Will raise an AttributeError on missing requirements.
- Parameters
mean – data with all mean values
std – data with all standard deviation values
method – name of transformation method
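The validation described above can be sketched as a lookup of required statistics per method. The mapping below covers only the two cases stated in the text (mean for centring, mean and standard deviation for standardising), and it uses the method names centre and standardise from the transform description. The helper name and the error message are hypothetical.

```python
# Required statistics per method; only the two documented cases are listed.
REQUIRED_STATISTICS = {
    "centre": ("mean",),
    "standardise": ("mean", "std"),
}


def check_inverse_params(method: str, **statistics) -> None:
    """Raise AttributeError if a statistic needed by `method` is missing."""
    missing = [name for name in REQUIRED_STATISTICS[method]
               if statistics.get(name) is None]
    if missing:
        raise AttributeError(
            f"Inverse transformation '{method}' requires: {', '.join(missing)}"
        )


check_inverse_params("centre", mean=1.5)                 # passes silently
check_inverse_params("standardise", mean=1.5, std=0.3)   # passes silently
```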
-
inverse_transform
(self, data_in, opts, transformation_dim) → xarray.DataArray¶ Perform inverse transformation.
Raises an AssertionError if no transformation was performed before. First checks whether all required statistics are available for the inverse transformation. The class attributes data, mean, and std are overwritten by new data afterwards. Mean, std, and the private transform method are then set to None to indicate that the current data is not transformed.
-
apply_transformation
(self, data, base=None, dim=0, inverse=False)¶ Apply transformation on external data. Specify if transformation should be based on parameters related to input or target data using base. This method can also apply inverse transformation.
- Parameters
data –
base –
dim –
inverse –
- Returns
-
_hash_list
(self)¶
-
_get_hash
(self)¶