mlair.helpers.statistics

Collection of statistical methods: Transformation and Skill Scores.

Module Contents

Classes

SkillScores

Calculate different kinds of skill scores.

Functions

apply_inverse_transformation(data: Data, method: str = 'standardise', mean: Data = None, std: Data = None, max: Data = None, min: Data = None, feature_range: Data = None) → Data

Apply inverse transformation for given statistics.

standardise(data: Data, dim: Union[str, int]) → Tuple[Data, Dict[str, Data]]

Standardise a xarray.dataarray (along dim) or pandas.DataFrame (along axis) with mean=0 and std=1.

standardise_inverse(data: Data, mean: Data, std: Data) → Data

Apply the inverse of standardise to data, thereby reversing the standardisation.

standardise_apply(data: Data, mean: Data, std: Data) → Data

Apply standardise on data using given mean and std.

centre(data: Data, dim: Union[str, int]) → Tuple[Data, Dict[str, Data]]

Centre a xarray.dataarray (along dim) or pandas.DataFrame (along axis) to mean=0.

centre_inverse(data: Data, mean: Data) → Data

Apply the inverse of centre, i.e. add the given mean back to data.

centre_apply(data: Data, mean: Data) → Data

Apply centre on data using given mean.

min_max(data: Data, dim: Union[str, int], feature_range: Tuple = (0, 1)) → Tuple[Data, Dict[str, Data]]

Apply min/max scaling using (x - x_min) / (x_max - x_min). Returned data lies in the interval given by feature_range (default [0, 1]).

min_max_inverse(data: Data, _min: Data, _max: Data, feature_range: Tuple = (0, 1)) → Data

Apply inverse transformation of min_max scaling.

min_max_apply(data: Data, _min: Data, _max: Data, feature_range: Data = (0, 1)) → Data

Apply min_max scaling with given minimum and maximum.

log(data: Data, dim: Union[str, int]) → Tuple[Data, Dict[str, Data]]

Apply logarithmic transformation (and standardisation) to data. This method first applies the logarithm and then the standardise method.

log_inverse(data: Data, mean: Data, std: Data) → Data

Apply inverse log transformation (i.e. exponential transformation). Because log uses np.log1p, this method is based on np.expm1.

log_apply(data: Data, mean: Data, std: Data) → Data

Apply numpy’s log1p on given data. Further information can be found in description of log method.

mean_squared_error(a, b, dim=None)

Calculate mean squared error.

mean_absolute_error(a, b, dim=None)

Calculate mean absolute error.

mean_error(a, b, dim=None)

Calculate mean error where a is forecast and b the reference (e.g. observation).

index_of_agreement(a, b, dim=None)

Calculate index of agreement (IOA) where a is the forecast and b the reference (e.g. observation).

modified_normalized_mean_bias(a, b, dim=None)

Calculate modified normalized mean bias (MNMB) where a is the forecast and b the reference (e.g. observation).

calculate_error_metrics(a, b, dim)

Calculate MSE, ME, RMSE, MAE, IOA, and MNMB. Additionally, return number of used values for calculation.

get_error_metrics_units(base_unit)

get_error_metrics_long_name()

mann_whitney_u_test(data: pandas.DataFrame, reference_col_name: str, **kwargs)

Calculate Mann-Whitney U test. Uses pandas’ .apply() on scipy.stats.mannwhitneyu(x, y, …).

represent_p_values_as_asteriks(p_values: pandas.Series, threshold_representation: collections.OrderedDict = None)

Represent p-values as asterisks based on their values.

create_single_bootstrap_realization(data: xarray.DataArray, dim_name_time: str) → xarray.DataArray

Return a bootstrapped realization of data.

calculate_average(data: xarray.DataArray, **kwargs) → xarray.DataArray

Calculate mean of data.

create_n_bootstrap_realizations(data: xarray.DataArray, dim_name_time: str, dim_name_model: str, n_boots: int = 1000, dim_name_boots: str = 'boots', seasons: List = None) → Dict[str, xarray.DataArray]

Create n bootstrap realizations and calculate averages across realizations.

calculate_bias_free_data(data, time_dim='index', window_size=30)

Attributes

__author__

__date__

Data

mlair.helpers.statistics.__author__ = Lukas Leufen, Felix Kleinert
mlair.helpers.statistics.__date__ = 2019-10-23
mlair.helpers.statistics.Data
mlair.helpers.statistics.apply_inverse_transformation(data: Data, method: str = 'standardise', mean: Data = None, std: Data = None, max: Data = None, min: Data = None, feature_range: Data = None) → Data

Apply inverse transformation for given statistics.

Parameters
  • data – transform this data back

  • method – transformation method (optional)

  • mean – mean of transformation (optional)

  • std – standard deviation of transformation (optional)

  • max – maximum value for min/max transformation (optional)

  • min – minimum value for min/max transformation (optional)

  • feature_range – feature range of min/max transformation (optional)

Returns

inverse transformed data

mlair.helpers.statistics.standardise(data: Data, dim: Union[str, int]) → Tuple[Data, Dict[str, Data]]

Standardise a xarray.dataarray (along dim) or pandas.DataFrame (along axis) with mean=0 and std=1.

Parameters
  • data – data to standardise

  • dim – name (xarray) or axis (pandas) of dimension which should be standardised

Returns

standardised data, and dictionary with keys method, mean, and standard deviation
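A minimal pandas sketch of the (x - mean) / std pattern that standardise presumably implements along the given axis (the data and column names are illustrative, not part of the library):

```python
import numpy as np
import pandas as pd

# Standardise each column of a DataFrame along axis 0,
# i.e. subtract the column mean and divide by the column std.
df = pd.DataFrame({"o3": [10.0, 20.0, 30.0], "no2": [1.0, 2.0, 3.0]})

mean = df.mean(axis=0)
std = df.std(axis=0)
standardised = (df - mean) / std

# Each column now has mean 0 and (sample) standard deviation 1.
print(standardised.round(3))
```

standardise_apply and standardise_inverse then correspond to reusing the stored mean/std: `(data - mean) / std` and `data * std + mean`, respectively.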

mlair.helpers.statistics.standardise_inverse(data: Data, mean: Data, std: Data) → Data

Apply the inverse of standardise to data, thereby reversing the standardisation.

Parameters
  • data – standardised data

  • mean – mean of standardisation

  • std – standard deviation of transformation

Returns

inverse standardised data

mlair.helpers.statistics.standardise_apply(data: Data, mean: Data, std: Data) → Data

Apply standardise on data using given mean and std.

Parameters
  • data – data to transform

  • mean – mean to use for transformation

  • std – standard deviation for transformation

Returns

transformed data

mlair.helpers.statistics.centre(data: Data, dim: Union[str, int]) → Tuple[Data, Dict[str, Data]]

Centre a xarray.dataarray (along dim) or pandas.DataFrame (along axis) to mean=0.

Parameters
  • data – data to centre

  • dim – name (xarray) or axis (pandas) of dimension which should be centred

Returns

centred data, and dictionary with keys method, and mean

mlair.helpers.statistics.centre_inverse(data: Data, mean: Data) → Data

Apply the inverse of centre, i.e. add the given mean back to data.

Parameters
  • data – data to apply inverse centering

  • mean – mean to use for inverse transformation

Returns

inverted centering transformation data

mlair.helpers.statistics.centre_apply(data: Data, mean: Data) → Data

Apply centre on data using given mean.

Parameters
  • data – data to transform

  • mean – mean to use for transformation

Returns

transformed data

mlair.helpers.statistics.min_max(data: Data, dim: Union[str, int], feature_range: Tuple = (0, 1)) → Tuple[Data, Dict[str, Data]]

Apply min/max scaling using (x - x_min) / (x_max - x_min). Returned data lies in the interval given by feature_range (default [0, 1]).

Parameters
  • data – data to transform

  • dim – name (xarray) or axis (pandas) of dimension which should be scaled

  • feature_range – scale data to any interval given in feature range. Default is scaling on interval [0, 1].

Returns

transformed data, and dictionary with keys method, min, and max
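A sketch of the scaling including the feature_range stretch, an assumption based on the documented formula and parameter (function name and sample data are illustrative):

```python
import numpy as np

# Min/max scaling: map data onto [0, 1] via (x - x_min) / (x_max - x_min),
# then stretch the result onto the requested feature_range.
def min_max_sketch(x, feature_range=(0, 1)):
    lo, hi = feature_range
    scaled = (x - x.min()) / (x.max() - x.min())  # in [0, 1]
    return scaled * (hi - lo) + lo                # stretched to [lo, hi]

x = np.array([2.0, 4.0, 6.0, 10.0])
y = min_max_sketch(x, feature_range=(-1, 1))
print(y)  # minimum maps to -1, maximum maps to 1
```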

mlair.helpers.statistics.min_max_inverse(data: Data, _min: Data, _max: Data, feature_range: Tuple = (0, 1)) → Data

Apply inverse transformation of min_max scaling.

Parameters
  • data – data to apply inverse scaling

  • _min – minimum value to use for min/max scaling

  • _max – maximum value to use for min/max scaling

  • feature_range – scale data to any interval given in feature range. Default is scaling on interval [0, 1].

Returns

inverted min/max scaled data

mlair.helpers.statistics.min_max_apply(data: Data, _min: Data, _max: Data, feature_range: Data = (0, 1)) → Data

Apply min_max scaling with given minimum and maximum.

Parameters
  • data – data to apply scaling

  • _min – minimum value to use for min/max scaling

  • _max – maximum value to use for min/max scaling

  • feature_range – scale data to any interval given in feature range. Default is scaling on interval [0, 1].

Returns

min/max scaled data

mlair.helpers.statistics.log(data: Data, dim: Union[str, int]) → Tuple[Data, Dict[str, Data]]

Apply logarithmic transformation (and standardisation) to data. This method first transforms the data with the logarithm and then additionally applies the standardise method. Instead of the pure logarithm, numpy’s log1p (res = log(1 + x)) is used so that values of 0 can be handled as well.

Parameters
  • data – transform this data

  • dim – name (xarray) or axis (pandas) of dimension which should be transformed

Returns

transformed data, and dictionary with keys method, mean, and std

mlair.helpers.statistics.log_inverse(data: Data, mean: Data, std: Data) → Data

Apply inverse log transformation (i.e. exponential transformation). Because log uses np.log1p, this method is based on the equivalent np.expm1. Data is first rescaled using standardise_inverse and then passed to the exponential function.

Parameters
  • data – apply inverse log transformation on this data

  • mean – mean of the standardisation

  • std – std of the standardisation

Returns

inverted data
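The log transformation and its inverse round-trip as described above; a sketch (assuming plain numpy statistics for the standardise step):

```python
import numpy as np

# Forward: np.log1p handles zeros (log1p(0) == 0), then standardise.
x = np.array([0.0, 1.0, 10.0, 100.0])

logged = np.log1p(x)                 # log(1 + x), safe at x == 0
mean, std = logged.mean(), logged.std()
transformed = (logged - mean) / std  # standardise step

# Inverse: undo the standardisation, then apply np.expm1 (exp(x) - 1).
restored = np.expm1(transformed * std + mean)
print(np.allclose(restored, x))  # True
```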

mlair.helpers.statistics.log_apply(data: Data, mean: Data, std: Data) → Data

Apply numpy’s log1p on given data. Further information can be found in description of log method.

Parameters
  • data – transform this data

  • mean – mean of the standardisation

  • std – std of the standardisation

Returns

transformed data

mlair.helpers.statistics.mean_squared_error(a, b, dim=None)

Calculate mean squared error.

mlair.helpers.statistics.mean_absolute_error(a, b, dim=None)

Calculate mean absolute error.

mlair.helpers.statistics.mean_error(a, b, dim=None)

Calculate mean error where a is forecast and b the reference (e.g. observation).

mlair.helpers.statistics.index_of_agreement(a, b, dim=None)

Calculate index of agreement (IOA) where a is the forecast and b the reference (e.g. observation).
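The IOA formula is not spelled out here; a minimal sketch assuming the standard Willmott (1981) definition (the exact variant used by the module is an assumption):

```python
import numpy as np

# Index of agreement: 1 minus the ratio of the squared error sum to the
# "potential error" sum around the reference mean. 1.0 is a perfect match.
def ioa_sketch(a, b):
    """a: forecast, b: reference (e.g. observation)."""
    b_mean = b.mean()
    num = ((a - b) ** 2).sum()
    den = ((np.abs(a - b_mean) + np.abs(b - b_mean)) ** 2).sum()
    return 1.0 - num / den

obs = np.array([1.0, 2.0, 3.0, 4.0])
print(ioa_sketch(obs, obs))      # perfect forecast -> 1.0
print(ioa_sketch(obs + 1, obs))  # biased forecast -> below 1.0
```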

mlair.helpers.statistics.modified_normalized_mean_bias(a, b, dim=None)

Calculate modified normalized mean bias (MNMB) where a is the forecast and b the reference (e.g. observation).
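A sketch of the MNMB, assuming the definition common in air-quality model evaluation, MNMB = 2/N · Σ (f − o) / (f + o); whether the module uses exactly this form is an assumption:

```python
import numpy as np

# MNMB is bounded in [-2, 2], symmetric in over- and underprediction,
# and 0.0 for an unbiased forecast.
def mnmb_sketch(a, b):
    """a: forecast, b: reference (e.g. observation)."""
    return 2.0 * np.mean((a - b) / (a + b))

obs = np.array([10.0, 20.0, 30.0])
print(mnmb_sketch(obs, obs))      # unbiased forecast -> 0.0
print(mnmb_sketch(2 * obs, obs))  # overprediction -> positive
```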

mlair.helpers.statistics.calculate_error_metrics(a, b, dim)

Calculate MSE, ME, RMSE, MAE, IOA, and MNMB. Additionally, return number of used values for calculation.

Parameters
  • a – forecast data to calculate metrics for

  • b – reference (e.g. observation)

  • dim – dimension to calculate metrics along

Returns

dict with results for all metrics indicated by lowercase metric short name

mlair.helpers.statistics.get_error_metrics_units(base_unit)
mlair.helpers.statistics.get_error_metrics_long_name()
mlair.helpers.statistics.mann_whitney_u_test(data: pandas.DataFrame, reference_col_name: str, **kwargs)

Calculate Mann-Whitney U test. Uses pandas’ .apply() on scipy.stats.mannwhitneyu(x, y, …).

Parameters
  • data – data to test

  • reference_col_name – name of column which is used for comparison (y)

  • kwargs – additional keyword arguments passed on to scipy.stats.mannwhitneyu

mlair.helpers.statistics.represent_p_values_as_asteriks(p_values: pandas.Series, threshold_representation: collections.OrderedDict = None)

Represent p-values as asterisks based on their values.

Parameters
  • p_values – p-values to represent

  • threshold_representation – mapping from p-value thresholds to their asterisk representation (optional)

class mlair.helpers.statistics.SkillScores(external_data: Union[Data, None], models=None, observation_name='obs', ahead_dim='ahead', type_dim='type', index_dim='index')

Calculate different kinds of skill scores.

Skill score on MSE:

Calculate skill score based on MSE for given forecast, reference and observations.

\text{SkillScore} = 1 - \frac{\text{MSE(obs, for)}}{\text{MSE(obs, ref)}}

To run:

skill_scores = SkillScores(None).general_skill_score(data, observation_name, forecast_name, reference_name)
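The MSE-based skill score above can be sketched directly with numpy (the dimension handling of the actual class is omitted; the sample data is illustrative):

```python
import numpy as np

# SkillScore = 1 - MSE(obs, forecast) / MSE(obs, reference):
# > 0 means the forecast beats the reference, 1.0 is a perfect forecast.
def mse(x, y):
    return np.mean((x - y) ** 2)

obs = np.array([1.0, 2.0, 3.0, 4.0])
forecast = np.array([1.1, 2.1, 2.9, 4.2])
reference = np.full_like(obs, obs.mean())  # e.g. a climatological reference

skill_score = 1.0 - mse(obs, forecast) / mse(obs, reference)
print(skill_score)  # close to 1: forecast clearly beats the reference
```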
Competitive skill score:

Calculate skill scores to highlight differences between forecasts. This skill score is also based on the MSE. Currently required forecasts are CNN, OLS and persi, as well as the observation obs.

skill_scores_class = SkillScores(internal_data)  # must contain columns CNN, OLS, persi and obs.
skill_scores = skill_scores_class.skill_scores(window_lead_time=3)
Skill score according to Murphy:

Follow the climatological skill score definition of Murphy (1988). External data is data from a different time period than the internal data set given on initialisation. In other terms, the internal data should be the train and validation data, whereas the external data is the test data. This may sound counter-intuitive, but when the skill score of one model is evaluated against another, this must be performed on the test data set. Therefore, in this case the foreign data is the test data.

skill_scores_class = SkillScores(external_data)  # must contain columns obs and CNN.
skill_scores_clim = skill_scores_class.climatological_skill_scores(internal_data, window_lead_time=3)
models_default = ['cnn', 'persi', 'ols']
set_model_names(self, models: List[str]) → List[str]

Either use given models or use defaults.

static _reorder(model_list: List[str]) → List[str]

Move the elements persi and obs to the very end of the given list.

get_model_name_combinations(self)

Return all combinations of two models as tuple and string.

skill_scores(self) → Tuple[pandas.DataFrame, pandas.DataFrame]

Calculate skill scores for all combinations of model names.

Returns

skill score for each comparison and forecast step

climatological_skill_scores(self, internal_data: Data, forecast_name: str) → xarray.DataArray

Calculate climatological skill scores according to Murphy (1988).

Calculate all CASES I - IV and terms [ABC][I-IV]. Internal data has to be set by initialisation, external data is part of parameters.

Parameters
  • internal_data – internal data

  • forecast_name – name of the forecast to use for this calculation (must be available in data)

Returns

all CASES as well as all terms

_climatological_skill_score(self, internal_data, observation_name, forecast_name, mu_type=1, external_data=None)
general_skill_score(self, data: Data, forecast_name: str, reference_name: str, observation_name: str = None, dim: str = 'index') → numpy.ndarray

Calculate general skill score based on mean squared error.

Parameters
  • data – internal data containing data for observation, forecast and reference

  • observation_name – name of observation

  • forecast_name – name of forecast

  • reference_name – name of reference

Returns

skill score of forecast

get_count(self, data: Data, dim: str = 'index') → numpy.ndarray

Count data and return the number of values.

skill_score_pre_calculations(self, data: Data, observation_name: str, forecast_name: str) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, Data, Dict[str, Data]]

Calculate terms AI, BI, and CI, as well as mean, variance and Pearson’s correlation, and clean up data.

The additional information on mean, variance and Pearson’s correlation (and the p-value) is returned as a dictionary with the corresponding keys mean, sigma, r and p.

Parameters
  • data – internal data to use for calculations

  • observation_name – name of observation

  • forecast_name – name of forecast

Returns

Terms AI, BI, and CI, internal data without nans and mean, variance, correlation and its p-value

skill_score_mu_case_1(self, internal_data, observation_name, forecast_name)

Calculate CASE I.

skill_score_mu_case_2(self, internal_data, observation_name, forecast_name)

Calculate CASE II.

skill_score_mu_case_3(self, internal_data, observation_name, forecast_name, external_data=None)

Calculate CASE III.

skill_score_mu_case_4(self, internal_data, observation_name, forecast_name, external_data=None)

Calculate CASE IV.

create_monthly_mean_from_daily_data(self, data, columns=None, index=None)

Calculate average for each month and save as daily values with flag ‘X’.

Parameters
  • data – data to average

  • columns – columns to work on (all columns from given data are used if empty)

  • index – index of returned data (index of given data is used if empty)

Returns

data containing monthly means in daily resolution

mlair.helpers.statistics.create_single_bootstrap_realization(data: xarray.DataArray, dim_name_time: str) → xarray.DataArray

Return a bootstrapped realization of data.

Parameters
  • data – data from which to draw ONE bootstrap realization

  • dim_name_time – name of time dimension

Returns

bootstrapped realization of data
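A sketch of how one such bootstrap realization can be drawn, assuming the usual approach of sampling time steps with replacement (function name and random seed are illustrative):

```python
import numpy as np
import xarray as xr

# Draw n random indices with replacement along the time dimension and
# select them, yielding a realization with the same length as the input.
def single_bootstrap_sketch(data, dim_name_time, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    n = data.sizes[dim_name_time]
    idx = rng.integers(0, n, size=n)  # indices drawn with replacement
    return data.isel({dim_name_time: idx})

data = xr.DataArray(np.arange(10.0), dims=["index"])
boot = single_bootstrap_sketch(data, "index", rng=np.random.default_rng(0))
print(boot.sizes["index"])  # same length as the original time dimension
```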

mlair.helpers.statistics.calculate_average(data: xarray.DataArray, **kwargs) → xarray.DataArray

Calculate mean of data.

Parameters
  • data – data for which to calculate mean

Returns

mean of data

mlair.helpers.statistics.create_n_bootstrap_realizations(data: xarray.DataArray, dim_name_time: str, dim_name_model: str, n_boots: int = 1000, dim_name_boots: str = 'boots', seasons: List = None) → Dict[str, xarray.DataArray]

Create n bootstrap realizations and calculate averages across realizations.

Parameters
  • data – original data from which to create bootstrap realizations

  • dim_name_time – name of time dimension

  • dim_name_model – name of model dimension

  • n_boots – number of bootstrap realizations

  • dim_name_boots – name of bootstrap dimension

  • seasons – calculate errors for given seasons in addition (default None)

Returns

dictionary with averages across bootstrap realizations (per season, if seasons are given)

mlair.helpers.statistics.calculate_bias_free_data(data, time_dim='index', window_size=30)