mlair.helpers.statistics
Collection of statistical methods: Transformation and Skill Scores.
Module Contents
Classes
SkillScores | Calculate different kinds of skill scores.
Functions
apply_inverse_transformation | Apply inverse transformation for given statistics.
standardise | Standardise an xarray.DataArray (along dim) or pandas.DataFrame (along axis) to mean=0 and std=1.
standardise_inverse | Apply the inverse of standardise to data, thereby undoing the standardisation.
standardise_apply | Apply standardise to data using the given mean and std.
centre | Centre an xarray.DataArray (along dim) or pandas.DataFrame (along axis) to mean=0.
centre_inverse | Apply the inverse of centre by adding the given mean back to the data.
centre_apply | Apply centre to data using the given mean.
min_max | Apply min/max scaling using (x - x_min) / (x_max - x_min). With the default feature_range, returned data lies in the interval [0, 1].
min_max_inverse | Apply inverse transformation of min_max scaling.
min_max_apply | Apply min_max scaling with given minimum and maximum.
log | Apply logarithmic transformation (and standardisation) to data.
log_inverse | Apply inverse log transformation (i.e. exponential transformation).
log_apply | Apply numpy's log1p to given data. Further information can be found in the description of the log method.
mean_squared_error | Calculate mean squared error.
mean_absolute_error | Calculate mean absolute error.
mean_error | Calculate mean error where a is the forecast and b the reference (e.g. observation).
index_of_agreement | Calculate index of agreement (IOA) where a is the forecast and b the reference (e.g. observation).
modified_normalized_mean_bias | Calculate modified normalized mean bias (MNMB) where a is the forecast and b the reference (e.g. observation).
calculate_error_metrics | Calculate MSE, ME, RMSE, MAE, IOA, and MNMB. Additionally, return the number of values used for the calculation.
get_error_metrics_units |
get_error_metrics_long_name |
mann_whitney_u_test | Calculate the Mann-Whitney U test. Uses pandas' .apply() on scipy.stats.mannwhitneyu(x, y, …).
represent_p_values_as_asteriks | Represent p-values as asterisks based on their values.
create_single_bootstrap_realization | Return a bootstrapped realization of data.
calculate_average | Calculate mean of data.
create_n_bootstrap_realizations | Create n bootstrap realizations and calculate averages across realizations.
calculate_bias_free_data |
Attributes
-
mlair.helpers.statistics.
__date__
= 2019-10-23
-
mlair.helpers.statistics.
Data
-
mlair.helpers.statistics.
apply_inverse_transformation
(data: Data, method: str = 'standardise', mean: Data = None, std: Data = None, max: Data = None, min: Data = None, feature_range: Data = None) → Data
Apply inverse transformation for given statistics.
- Parameters
data – transform this data back
method – transformation method (optional)
mean – mean of transformation (optional)
std – standard deviation of transformation (optional)
max – maximum value for min/max transformation (optional)
min – minimum value for min/max transformation (optional)
- Returns
inverse transformed data
-
mlair.helpers.statistics.
standardise
(data: Data, dim: Union[str, int]) → Tuple[Data, Dict[str, Data]]
Standardise an xarray.DataArray (along dim) or pandas.DataFrame (along axis) to mean=0 and std=1.
- Parameters
data – data to standardise
dim – name (xarray) or axis (pandas) of dimension which should be standardised
- Returns
standardised data, and dictionary with keys method, mean, and std
-
mlair.helpers.statistics.
standardise_inverse
(data: Data, mean: Data, std: Data) → Data
Apply the inverse of standardise to data, thereby undoing the standardisation.
- Parameters
data – standardised data
mean – mean of standardisation
std – standard deviation of standardisation
- Returns
inverse standardised data
-
mlair.helpers.statistics.
standardise_apply
(data: Data, mean: Data, std: Data) → Data
Apply standardise to data using the given mean and std.
- Parameters
data – data to transform
mean – mean to use for transformation
std – standard deviation for transformation
- Returns
transformed data
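The relationship between standardise, standardise_apply and standardise_inverse can be sketched in plain Python. This is only an illustration on lists, not the mlair implementation (which works on xarray and pandas objects along a dimension). The use of the population standard deviation and the exact dictionary layout are assumptions made for this sketch.

```python
from statistics import fmean, pstdev


def standardise(values):
    """Standardise values and return the statistics needed to undo the step."""
    mean = fmean(values)
    std = pstdev(values)  # population std (ddof=0); an assumption of this sketch
    scaled = [(v - mean) / std for v in values]
    return scaled, {"method": "standardise", "mean": mean, "std": std}


def standardise_apply(values, mean, std):
    """Standardise values with statistics computed elsewhere (e.g. on training data)."""
    return [(v - mean) / std for v in values]


def standardise_inverse(values, mean, std):
    """Undo the standardisation."""
    return [v * std + mean for v in values]


train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled, opts = standardise(train)

# Reuse the training statistics on unseen data, then invert the training data.
test_scaled = standardise_apply([5.0, 7.0], opts["mean"], opts["std"])
restored = standardise_inverse(scaled, opts["mean"], opts["std"])
```

Returning the statistics together with the transformed data is what makes it possible to apply the identical transformation to other data sets and to invert it later.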
-
mlair.helpers.statistics.
centre
(data: Data, dim: Union[str, int]) → Tuple[Data, Dict[str, Data]]
Centre an xarray.DataArray (along dim) or pandas.DataFrame (along axis) to mean=0.
- Parameters
data – data to centre
dim – name (xarray) or axis (pandas) of dimension which should be centred
- Returns
centred data, and dictionary with keys method and mean
-
mlair.helpers.statistics.
centre_inverse
(data: Data, mean: Data) → Data
Apply the inverse of centre by adding the given mean back to the data.
- Parameters
data – data to apply inverse centring to
mean – mean to use for inverse transformation
- Returns
data with the centring transformation inverted
-
mlair.helpers.statistics.
centre_apply
(data: Data, mean: Data) → Data
Apply centre to data using the given mean.
- Parameters
data – data to transform
mean – mean to use for transformation
- Returns
transformed data
-
mlair.helpers.statistics.
min_max
(data: Data, dim: Union[str, int], feature_range: Tuple = (0, 1)) → Tuple[Data, Dict[str, Data]]
Apply min/max scaling using (x - x_min) / (x_max - x_min). With the default feature_range, returned data lies in the interval [0, 1].
- Parameters
data – data to transform
dim – name (xarray) or axis (pandas) of dimension which should be scaled
feature_range – scale data to any interval given in feature range. Default is scaling on interval [0, 1].
- Returns
transformed data, and dictionary with keys method, min, and max
-
mlair.helpers.statistics.
min_max_inverse
(data: Data, _min: Data, _max: Data, feature_range: Tuple = (0, 1)) → Data
Apply inverse transformation of min_max scaling.
- Parameters
data – data to apply inverse scaling
_min – minimum value to use for min/max scaling
_max – maximum value to use for min/max scaling
feature_range – scale data to any interval given in feature range. Default is scaling on interval [0, 1].
- Returns
inverted min/max scaled data
-
mlair.helpers.statistics.
min_max_apply
(data: Data, _min: Data, _max: Data, feature_range: Data = (0, 1)) → Data
Apply min_max scaling with given minimum and maximum.
- Parameters
data – data to apply scaling
_min – minimum value to use for min/max scaling
_max – maximum value to use for min/max scaling
feature_range – scale data to any interval given in feature range. Default is scaling on interval [0, 1].
- Returns
min/max scaled data
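A minimal sketch of min/max scaling with a feature range, written in plain Python on lists. The page only states the formula (x - x_min) / (x_max - x_min) for the default range. Mapping the result linearly onto feature_range as shown here is the usual convention and is an assumption about how mlair handles non-default ranges.

```python
def min_max_apply(values, v_min, v_max, feature_range=(0, 1)):
    """Scale values to feature_range using a given minimum and maximum."""
    low, high = min(feature_range), max(feature_range)
    return [(v - v_min) / (v_max - v_min) * (high - low) + low for v in values]


def min_max_inverse(values, v_min, v_max, feature_range=(0, 1)):
    """Map scaled values back to the original value range."""
    low, high = min(feature_range), max(feature_range)
    return [(v - low) / (high - low) * (v_max - v_min) + v_min for v in values]


def min_max(values, feature_range=(0, 1)):
    """Scale values and return the statistics needed to undo the step."""
    v_min, v_max = min(values), max(values)
    scaled = min_max_apply(values, v_min, v_max, feature_range)
    return scaled, {"method": "min_max", "min": v_min, "max": v_max}


data = [10.0, 15.0, 20.0]
scaled, opts = min_max(data)                            # [0.0, 0.5, 1.0]
scaled_sym = min_max_apply(data, 10.0, 20.0, (-1, 1))   # [-1.0, 0.0, 1.0]
restored = min_max_inverse(scaled_sym, 10.0, 20.0, (-1, 1))
```

Note that the same feature_range has to be passed to the inverse, otherwise the values are mapped back incorrectly.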
-
mlair.helpers.statistics.
log
(data: Data, dim: Union[str, int]) → Tuple[Data, Dict[str, Data]]
Apply logarithmic transformation (and standardisation) to data. This method first applies the logarithm and then additionally applies the standardise method. numpy's log1p (res = log(1+x)) is used instead of the pure logarithm so that the transformation is also applicable to values of 0.
- Parameters
data – transform this data
dim – name (xarray) or axis (pandas) of dimension which should be transformed
- Returns
transformed data, and options dictionary with keys method, mean, and std
-
mlair.helpers.statistics.
log_inverse
(data: Data, mean: Data, std: Data) → Data
Apply inverse log transformation (i.e. exponential transformation). Because log uses np.log1p, this method is based on the corresponding inverse np.expm1. Data are first rescaled using standardise_inverse and then passed to the exponential function.
- Parameters
data – apply inverse log transformation on this data
mean – mean of the standardisation
std – std of the standardisation
- Returns
inverted data
-
mlair.helpers.statistics.
log_apply
(data: Data, mean: Data, std: Data) → Data
Apply numpy's log1p to given data. Further information can be found in the description of the log method.
- Parameters
data – transform this data
mean – mean of the standardisation
std – std of the standardisation
- Returns
transformed data
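The order of operations described for log and log_inverse (log1p, then standardise; inverse standardise, then expm1) can be illustrated with the standard library, where math.log1p and math.expm1 play the role of the numpy functions. The function name log_transform and the use of the population standard deviation are assumptions of this sketch.

```python
import math
from statistics import fmean, pstdev


def log_transform(values):
    """Apply log(1 + x) first, then standardise the logged values."""
    logged = [math.log1p(v) for v in values]
    mean, std = fmean(logged), pstdev(logged)  # population std is an assumption
    scaled = [(v - mean) / std for v in logged]
    return scaled, {"method": "log", "mean": mean, "std": std}


def log_inverse(values, mean, std):
    """Undo the standardisation first, then apply exp(x) - 1."""
    return [math.expm1(v * std + mean) for v in values]


data = [0.0, 1.0, 10.0, 100.0]  # a value of 0 is fine because log1p(0) == 0
transformed, opts = log_transform(data)
restored = log_inverse(transformed, opts["mean"], opts["std"])
```

The mean and std stored in the options refer to the logged data, not to the raw data, which is why the inverse has to rescale before exponentiating.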
-
mlair.helpers.statistics.
mean_squared_error
(a, b, dim=None)
Calculate mean squared error.
-
mlair.helpers.statistics.
mean_absolute_error
(a, b, dim=None)
Calculate mean absolute error.
-
mlair.helpers.statistics.
mean_error
(a, b, dim=None)
Calculate mean error where a is the forecast and b the reference (e.g. observation).
-
mlair.helpers.statistics.
index_of_agreement
(a, b, dim=None)
Calculate index of agreement (IOA) where a is the forecast and b the reference (e.g. observation).
-
mlair.helpers.statistics.
modified_normalized_mean_bias
(a, b, dim=None)
Calculate modified normalized mean bias (MNMB) where a is the forecast and b the reference (e.g. observation).
-
mlair.helpers.statistics.
calculate_error_metrics
(a, b, dim)
Calculate MSE, ME, RMSE, MAE, IOA, and MNMB. Additionally, return the number of values used for the calculation.
- Parameters
a – forecast data to calculate metrics for
b – reference (e.g. observation)
dim – dimension to calculate metrics along
- Returns
dict with results for all metrics indicated by lowercase metric short name
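The six metrics can be sketched on plain lists as follows. The page does not give the formulas, so this sketch assumes the common textbook definitions: IOA after Willmott, with deviations taken from the reference mean, and MNMB as 2/N times the sum of (a - b) / (a + b). The key "n" for the number of values and the absence of NaN handling are also assumptions.

```python
import math


def calculate_error_metrics(a, b):
    """a: forecast, b: reference (e.g. observation); lists of equal length."""
    n = len(a)
    diff = [x - y for x, y in zip(a, b)]
    squared_sum = sum(d * d for d in diff)
    mse = squared_sum / n
    me = sum(diff) / n  # positive values mean the forecast is too high on average
    mae = sum(abs(d) for d in diff) / n
    b_mean = sum(b) / n
    # Willmott's index of agreement (assumed definition)
    potential = sum((abs(x - b_mean) + abs(y - b_mean)) ** 2 for x, y in zip(a, b))
    ioa = (1 - squared_sum / potential) if potential else 1.0
    # Modified normalized mean bias, bounded by -2 and 2 for positive data
    mnmb = 2 / n * sum((x - y) / (x + y) for x, y in zip(a, b))
    return {"mse": mse, "me": me, "rmse": math.sqrt(mse), "mae": mae,
            "ioa": ioa, "mnmb": mnmb, "n": n}


forecast = [2.0, 4.0, 6.0]
observation = [1.0, 4.0, 7.0]
metrics = calculate_error_metrics(forecast, observation)
```

In this example the errors cancel in the mean error (ME is 0) while MSE and MAE are both 2/3, which illustrates why ME alone is not a measure of accuracy.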
-
mlair.helpers.statistics.
get_error_metrics_units
(base_unit)
-
mlair.helpers.statistics.
get_error_metrics_long_name
()
-
mlair.helpers.statistics.
mann_whitney_u_test
(data: pandas.DataFrame, reference_col_name: str, **kwargs)
Calculate the Mann-Whitney U test. Uses pandas' .apply() on scipy.stats.mannwhitneyu(x, y, …).
- Parameters
reference_col_name – name of the column which is used for comparison (y)
-
mlair.helpers.statistics.
represent_p_values_as_asteriks
(p_values: pandas.Series, threshold_representation: collections.OrderedDict = None)
Represent p-values as asterisks based on their values.
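How p-values might be mapped to asterisks can be sketched as below. The default thresholds of represent_p_values_as_asteriks are not documented on this page, so the thresholds used here (0.05, 0.01, 0.001, and "ns" otherwise) are the widespread convention and purely an assumption, as is the function name of the sketch.

```python
from collections import OrderedDict


def p_values_as_asterisks(p_values, threshold_representation=None):
    """Return one label per p-value; thresholds are checked from loosest to strictest."""
    if threshold_representation is None:
        # Hypothetical defaults, not taken from mlair.
        threshold_representation = OrderedDict(
            [(0.05, "*"), (0.01, "**"), (0.001, "***")]
        )
    labels = []
    for p in p_values:
        label = "ns"  # "not significant" if no threshold is undercut
        for threshold, symbol in threshold_representation.items():
            if p < threshold:
                label = symbol  # a stricter threshold overwrites a looser one
        labels.append(label)
    return labels


labels = p_values_as_asterisks([0.2, 0.03, 0.004, 0.0002])
# labels == ["ns", "*", "**", "***"]
```

An ordered mapping matters here because the loop relies on stricter thresholds being visited after looser ones.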
-
class
mlair.helpers.statistics.
SkillScores
(external_data: Union[Data, None], models=None, observation_name='obs', ahead_dim='ahead', type_dim='type', index_dim='index')
Calculate different kinds of skill scores.
- Skill score on MSE:
Calculate skill score based on MSE for given forecast, reference and observations.
\text{SkillScore} = 1 - \frac{\text{MSE(obs, for)}}{\text{MSE(obs, ref)}}
To run:
skill_scores = SkillScores(None).general_skill_score(data, forecast_name=forecast_name, reference_name=reference_name, observation_name=observation_name)
- Competitive skill score:
Calculate skill scores to highlight differences between forecasts. This skill score is also based on the MSE. Currently required forecasts are CNN, OLS and persi, as well as the observation obs.
skill_scores_class = SkillScores(internal_data)  # must contain columns CNN, OLS, persi and obs.
skill_scores = skill_scores_class.skill_scores()
- Skill score according to Murphy:
Follow the climatological skill score definition of Murphy (1988). External data, which is passed on initialisation, comes from a different time period than the internal data. Here, the internal data should be the train and validation data, whereas the external data is the test data. This may sound counter-intuitive, but when one model is compared with another, the skill score must be evaluated on the test data set. In this case, the external (foreign) data is therefore the test data.
skill_scores_class = SkillScores(external_data)  # must contain columns obs and CNN.
skill_scores_clim = skill_scores_class.climatological_skill_scores(internal_data, forecast_name='CNN')
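The MSE-based skill score formula given above, 1 - MSE(obs, for) / MSE(obs, ref), can be checked by hand with a few numbers. The sketch works on plain lists and is not the SkillScores implementation. The data values are made up for illustration.

```python
def mean_squared_error(a, b):
    """Mean of the squared differences between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def general_skill_score(obs, forecast, reference):
    """1 means a perfect forecast, 0 means no improvement over the reference."""
    return 1 - mean_squared_error(obs, forecast) / mean_squared_error(obs, reference)


obs = [3.0, 5.0, 4.0, 6.0]
forecast = [3.5, 4.5, 4.0, 6.5]   # MSE against obs: 0.1875
reference = [2.0, 3.0, 5.0, 4.0]  # MSE against obs: 2.5

score = general_skill_score(obs, forecast, reference)      # close to 0.925
perfect = general_skill_score(obs, obs, reference)         # 1.0
no_gain = general_skill_score(obs, reference, reference)   # 0.0
```

Negative values are possible as well and indicate that the forecast performs worse than the reference.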
-
models_default
= ['cnn', 'persi', 'ols']
-
static
_reorder
(model_list: List[str]) → List[str]
Set elements persi and obs at the very end of given list.
-
get_model_name_combinations
(self)
Return all combinations of two models as tuple and string.
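A rough sketch of what reordering and pairing of model names could look like, using models_default as input. The relative order of persi and obs at the end of the list and the "first-second" string format are assumptions. The page only states that both a tuple and a string representation are returned.

```python
from itertools import combinations


def reorder(model_list):
    """Move 'persi' and 'obs' (if present) to the end, keeping the other names in order."""
    tail = ("persi", "obs")
    ordered = [m for m in model_list if m not in tail]
    ordered += [m for m in tail if m in model_list]
    return ordered


def model_name_combinations(models):
    """Return all pairs of models as tuples and as joined strings."""
    pairs = list(combinations(models, 2))
    names = [f"{first}-{second}" for first, second in pairs]  # hypothetical format
    return pairs, names


models = reorder(["cnn", "persi", "ols"])  # ['cnn', 'ols', 'persi']
pairs, names = model_name_combinations(models)
# pairs == [('cnn', 'ols'), ('cnn', 'persi'), ('ols', 'persi')]
```

Placing the persistence forecast last means it always appears as the second element of a pair, i.e. in the position that is presumably used as the reference.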
-
skill_scores
(self) → [pandas.DataFrame, pandas.DataFrame]
Calculate skill scores for all combinations of model names.
- Returns
skill score for each comparison and forecast step
-
climatological_skill_scores
(self, internal_data: Data, forecast_name: str) → xarray.DataArray
Calculate climatological skill scores according to Murphy (1988).
Calculate all CASES I - IV and terms [ABC][I-IV]. External data has to be set on initialisation; internal data is passed as a parameter.
- Parameters
internal_data – internal data
forecast_name – name of the forecast to use for this calculation (must be available in data)
- Returns
all CASES as well as all terms
-
_climatological_skill_score
(self, internal_data, observation_name, forecast_name, mu_type=1, external_data=None)
-
general_skill_score
(self, data: Data, forecast_name: str, reference_name: str, observation_name: str = None, dim: str = 'index') → numpy.ndarray
Calculate general skill score based on mean squared error.
- Parameters
data – internal data containing data for observation, forecast and reference
observation_name – name of observation
forecast_name – name of forecast
reference_name – name of reference
- Returns
skill score of forecast
-
get_count
(self, data: Data, dim: str = 'index') → numpy.ndarray
Count the values in data along dim and return the count.
-
skill_score_pre_calculations
(self, data: Data, observation_name: str, forecast_name: str) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, Data, Dict[str, Data]]
Calculate terms AI, BI, and CI as well as mean, variance and Pearson's correlation, and clean up data.
The additional information on mean, variance and Pearson's correlation (and the p-value) is returned as a dictionary with the corresponding keys mean, sigma, r and p.
- Parameters
data – internal data to use for calculations
observation_name – name of observation
forecast_name – name of forecast
- Returns
Terms AI, BI, and CI, internal data without NaNs, and mean, variance, correlation and its p-value
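The terms AI, BI and CI are not spelled out on this page. Assuming they follow the decomposition in Murphy (1988), where AI is the squared correlation, BI the conditional bias and CI the unconditional bias, they can be sketched on plain lists as follows. The sketch also checks numerically that AI - BI - CI equals the MSE skill score against the in-sample climatology (the observation mean).

```python
from statistics import fmean, pstdev


def murphy_terms(obs, forecast):
    """Return (AI, BI, CI) under the assumed Murphy (1988) definitions."""
    mean_o, mean_f = fmean(obs), fmean(forecast)
    sigma_o, sigma_f = pstdev(obs), pstdev(forecast)  # population std
    cov = fmean([(o - mean_o) * (f - mean_f) for o, f in zip(obs, forecast)])
    r = cov / (sigma_o * sigma_f)            # Pearson's correlation
    a_i = r ** 2                             # potential skill
    b_i = (r - sigma_f / sigma_o) ** 2       # conditional bias
    c_i = ((mean_f - mean_o) / sigma_o) ** 2 # unconditional bias
    return a_i, b_i, c_i


obs = [3.0, 5.0, 4.0, 6.0, 7.0]
forecast = [2.5, 5.5, 4.5, 5.0, 8.0]

a_i, b_i, c_i = murphy_terms(obs, forecast)
skill_from_terms = a_i - b_i - c_i

# Direct calculation: the reference forecast is the mean of the observations,
# whose MSE equals the population variance of the observations.
mse_forecast = fmean([(f - o) ** 2 for f, o in zip(forecast, obs)])
mse_climatology = pstdev(obs) ** 2
skill_direct = 1 - mse_forecast / mse_climatology
```

Both routes give the same value up to floating point error, but the decomposition additionally shows how much skill is lost to each type of bias.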
-
skill_score_mu_case_1
(self, internal_data, observation_name, forecast_name)
Calculate CASE I.
-
skill_score_mu_case_2
(self, internal_data, observation_name, forecast_name)
Calculate CASE II.
-
skill_score_mu_case_3
(self, internal_data, observation_name, forecast_name, external_data=None)
Calculate CASE III.
-
skill_score_mu_case_4
(self, internal_data, observation_name, forecast_name, external_data=None)
Calculate CASE IV.
-
create_monthly_mean_from_daily_data
(self, data, columns=None, index=None)
Calculate average for each month and save as daily values with flag ‘X’.
- Parameters
data – data to average
columns – columns to work on (all columns from given data are used if empty)
index – index of returned data (index of given data is used if empty)
- Returns
data containing monthly means in daily resolution
-
mlair.helpers.statistics.
create_single_bootstrap_realization
(data: xarray.DataArray, dim_name_time: str) → xarray.DataArray
Return a bootstrapped realization of data.
- Parameters
data – data from which to draw ONE bootstrap realization
dim_name_time – name of time dimension
- Returns
bootstrapped realization of data
-
mlair.helpers.statistics.
calculate_average
(data: xarray.DataArray, **kwargs) → xarray.DataArray
Calculate mean of data.
- Parameters
data – data for which to calculate mean
- Returns
mean of data
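The idea behind a bootstrap realization and its average can be sketched with the standard library: draw as many values as the original sample holds, with replacement, and take the mean. Repeating this many times gives a distribution of means. The sketch works on a plain list instead of an xarray.DataArray with a time dimension. The helper bootstrap_averages and the seed handling are hypothetical.

```python
import random
from statistics import fmean


def create_single_bootstrap_realization(values, rng):
    """Draw len(values) samples with replacement from values."""
    return rng.choices(values, k=len(values))


def calculate_average(values):
    """Mean of one realization."""
    return fmean(values)


def bootstrap_averages(values, n_boots=1000, seed=0):
    """Hypothetical helper: averages of n_boots independent realizations."""
    rng = random.Random(seed)  # a fixed seed keeps the sketch reproducible
    return [
        calculate_average(create_single_bootstrap_realization(values, rng))
        for _ in range(n_boots)
    ]


errors = [0.8, 1.1, 0.9, 1.4, 1.0, 0.7, 1.2]
boot_means = bootstrap_averages(errors, n_boots=200)
spread = max(boot_means) - min(boot_means)
```

The spread of the bootstrapped means indicates how uncertain the sample mean is, without assuming a particular distribution of the data.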
-
mlair.helpers.statistics.
create_n_bootstrap_realizations
(data: xarray.DataArray, dim_name_time: str, dim_name_model: str, n_boots: int = 1000, dim_name_boots: str = 'boots', seasons: List = None) → Dict[str, xarray.DataArray]
Create n bootstrap realizations and calculate averages across realizations.
- Parameters
data – original data from which to create bootstrap realizations
dim_name_time – name of time dimension
dim_name_model – name of model dimension
n_boots – number of bootstrap realizations
dim_name_boots – name of bootstrap dimension
seasons – calculate errors for given seasons in addition (default None)
- Returns
-
mlair.helpers.statistics.
calculate_bias_free_data
(data, time_dim='index', window_size=30)