:py:mod:`mlair.helpers.statistics`
==================================

.. py:module:: mlair.helpers.statistics

.. autoapi-nested-parse::

   Collection of statistical methods: transformations and skill scores.


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   mlair.helpers.statistics.SkillScores


Functions
~~~~~~~~~

.. autoapisummary::

   mlair.helpers.statistics.apply_inverse_transformation
   mlair.helpers.statistics.standardise
   mlair.helpers.statistics.standardise_inverse
   mlair.helpers.statistics.standardise_apply
   mlair.helpers.statistics.centre
   mlair.helpers.statistics.centre_inverse
   mlair.helpers.statistics.centre_apply
   mlair.helpers.statistics.min_max
   mlair.helpers.statistics.min_max_inverse
   mlair.helpers.statistics.min_max_apply
   mlair.helpers.statistics.log
   mlair.helpers.statistics.log_inverse
   mlair.helpers.statistics.log_apply
   mlair.helpers.statistics.mean_squared_error
   mlair.helpers.statistics.mean_absolute_error
   mlair.helpers.statistics.mean_error
   mlair.helpers.statistics.index_of_agreement
   mlair.helpers.statistics.modified_normalized_mean_bias
   mlair.helpers.statistics.calculate_error_metrics
   mlair.helpers.statistics.get_error_metrics_units
   mlair.helpers.statistics.get_error_metrics_long_name
   mlair.helpers.statistics.mann_whitney_u_test
   mlair.helpers.statistics.represent_p_values_as_asteriks
   mlair.helpers.statistics.create_single_bootstrap_realization
   mlair.helpers.statistics.calculate_average
   mlair.helpers.statistics.create_n_bootstrap_realizations
   mlair.helpers.statistics.calculate_bias_free_data


Attributes
~~~~~~~~~~

.. autoapisummary::

   mlair.helpers.statistics.__author__
   mlair.helpers.statistics.__date__
   mlair.helpers.statistics.Data


.. py:data:: __author__
   :annotation: = Lukas Leufen, Felix Kleinert


.. py:data:: __date__
   :annotation: = 2019-10-23


.. py:data:: Data


.. py:function:: apply_inverse_transformation(data: Data, method: str = 'standardise', mean: Data = None, std: Data = None, max: Data = None, min: Data = None, feature_range: Data = None) -> Data

   Apply inverse transformation for given statistics.

   :param data: transform this data back
   :param method: transformation method (optional)
   :param mean: mean of transformation (optional)
   :param std: standard deviation of transformation (optional)
   :param max: maximum value for min/max transformation (optional)
   :param min: minimum value for min/max transformation (optional)
   :param feature_range: feature range of min/max transformation (optional)
   :return: inverse transformed data


.. py:function:: standardise(data: Data, dim: Union[str, int]) -> Tuple[Data, Dict[str, Data]]

   Standardise a xarray.DataArray (along dim) or pandas.DataFrame (along axis) to mean=0 and std=1.

   :param data: data to standardise
   :param dim: name (xarray) or axis (pandas) of the dimension which should be standardised
   :return: standardised data, and dictionary with keys method, mean, and standard deviation


.. py:function:: standardise_inverse(data: Data, mean: Data, std: Data) -> Data

   Apply the inverse of `standardise` on data and thereby reverse the standardisation.

   :param data: standardised data
   :param mean: mean of the standardisation
   :param std: standard deviation of the standardisation
   :return: inverse standardised data


.. py:function:: standardise_apply(data: Data, mean: Data, std: Data) -> Data

   Apply `standardise` on data using given mean and std.

   :param data: data to transform
   :param mean: mean to use for transformation
   :param std: standard deviation to use for transformation
   :return: transformed data
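As a quick usage illustration, a minimal sketch of the round trip through ``standardise`` and ``standardise_inverse`` (the dictionary keys ``mean`` and ``std`` are an assumption based on the docstrings above):

.. code-block:: python

   import numpy as np
   import xarray as xr
   from mlair.helpers.statistics import standardise, standardise_inverse

   # toy data: 100 time steps for 3 variables
   data = xr.DataArray(np.random.randn(100, 3) * 5 + 2, dims=["datetime", "variables"])

   # standardise along the time dimension; opts stores the fitted statistics
   standardised, opts = standardise(data, dim="datetime")

   # invert the transformation with the stored statistics (key names assumed)
   restored = standardise_inverse(standardised, opts["mean"], opts["std"])
   np.testing.assert_allclose(restored.values, data.values)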
.. py:function:: centre(data: Data, dim: Union[str, int]) -> Tuple[Data, Dict[str, Data]]

   Centre a xarray.DataArray (along dim) or pandas.DataFrame (along axis) to mean=0.

   :param data: data to centre
   :param dim: name (xarray) or axis (pandas) of the dimension which should be centred
   :return: centred data, and dictionary with keys method and mean


.. py:function:: centre_inverse(data: Data, mean: Data) -> Data

   Apply the inverse of `centre` and therefore add the given mean back to the data.

   :param data: data to apply inverse centring on
   :param mean: mean to use for the inverse transformation
   :return: data with centring reversed


.. py:function:: centre_apply(data: Data, mean: Data) -> Data

   Apply `centre` on data using the given mean.

   :param data: data to transform
   :param mean: mean to use for transformation
   :return: transformed data


.. py:function:: min_max(data: Data, dim: Union[str, int], feature_range: Tuple = (0, 1)) -> Tuple[Data, Dict[str, Data]]

   Apply min/max scaling using (x - x_min) / (x_max - x_min). By default, the returned data lies in the interval [0, 1].

   :param data: data to transform
   :param dim: name (xarray) or axis (pandas) of the dimension which should be scaled
   :param feature_range: scale data to any interval given in feature range. Default is scaling on interval [0, 1].
   :return: transformed data, and dictionary with keys method, min, and max


.. py:function:: min_max_inverse(data: Data, _min: Data, _max: Data, feature_range: Tuple = (0, 1)) -> Data

   Apply the inverse transformation of `min_max` scaling.

   :param data: data to apply inverse scaling on
   :param _min: minimum value to use for min/max scaling
   :param _max: maximum value to use for min/max scaling
   :param feature_range: scale data to any interval given in feature range. Default is scaling on interval [0, 1].
   :return: inverted min/max scaled data


.. py:function:: min_max_apply(data: Data, _min: Data, _max: Data, feature_range: Data = (0, 1)) -> Data

   Apply `min_max` scaling with given minimum and maximum.

   :param data: data to apply scaling on
   :param _min: minimum value to use for min/max scaling
   :param _max: maximum value to use for min/max scaling
   :param feature_range: scale data to any interval given in feature range. Default is scaling on interval [0, 1].
   :return: min/max scaled data


.. py:function:: log(data: Data, dim: Union[str, int]) -> Tuple[Data, Dict[str, Data]]

   Apply a logarithmic transformation (followed by standardisation) to data.

   This method first transforms the data with the logarithm and then additionally applies the `standardise` method. Numpy's log1p (`res = log(1 + x)`) is used instead of the pure logarithm so that the transformation is also applicable to values of 0.

   :param data: transform this data
   :param dim: name (xarray) or axis (pandas) of the dimension which should be transformed
   :return: transformed data, and option dictionary with keys method, mean, and std


.. py:function:: log_inverse(data: Data, mean: Data, std: Data) -> Data

   Apply the inverse log transformation (i.e. an exponential transformation).

   Because `log` uses `np.log1p`, this method is based on its inverse `np.expm1`. Data are first rescaled using `standardise_inverse` and then passed to the exponential function.

   :param data: apply inverse log transformation on this data
   :param mean: mean of the standardisation
   :param std: std of the standardisation
   :return: inverted data
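A minimal sketch of min/max scaling to a custom feature range, assuming the returned dictionary exposes the fitted statistics under the keys ``min`` and ``max``:

.. code-block:: python

   import numpy as np
   import xarray as xr
   from mlair.helpers.statistics import min_max, min_max_inverse

   data = xr.DataArray(np.array([[1., 5.], [3., 7.], [5., 9.]]), dims=["datetime", "variables"])

   # scale each variable to [-1, 1] along the time dimension
   scaled, opts = min_max(data, dim="datetime", feature_range=(-1, 1))

   # invert with the stored minimum and maximum (key names assumed)
   restored = min_max_inverse(scaled, opts["min"], opts["max"], feature_range=(-1, 1))
   np.testing.assert_allclose(restored.values, data.values)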
.. py:function:: log_apply(data: Data, mean: Data, std: Data) -> Data

   Apply numpy's log1p on given data.

   Further information can be found in the description of the `log` method.

   :param data: transform this data
   :param mean: mean of the standardisation
   :param std: std of the standardisation
   :return: transformed data


.. py:function:: mean_squared_error(a, b, dim=None)

   Calculate mean squared error.


.. py:function:: mean_absolute_error(a, b, dim=None)

   Calculate mean absolute error.


.. py:function:: mean_error(a, b, dim=None)

   Calculate mean error, where a is the forecast and b the reference (e.g. observation).


.. py:function:: index_of_agreement(a, b, dim=None)

   Calculate index of agreement (IOA), where a is the forecast and b the reference (e.g. observation).


.. py:function:: modified_normalized_mean_bias(a, b, dim=None)

   Calculate modified normalized mean bias (MNMB), where a is the forecast and b the reference (e.g. observation).


.. py:function:: calculate_error_metrics(a, b, dim)

   Calculate MSE, ME, RMSE, MAE, IOA, and MNMB. Additionally, return the number of values used for the calculation.

   :param a: forecast data to calculate metrics for
   :param b: reference (e.g. observation)
   :param dim: dimension to calculate metrics along
   :returns: dict with results for all metrics indicated by lowercase metric short name


.. py:function:: get_error_metrics_units(base_unit)


.. py:function:: get_error_metrics_long_name()


.. py:function:: mann_whitney_u_test(data: pandas.DataFrame, reference_col_name: str, **kwargs)

   Calculate the Mann-Whitney U test.

   Uses pandas' .apply() on scipy.stats.mannwhitneyu(x, y, ...).

   :param data: data frame whose columns are compared against the reference column
   :param reference_col_name: name of the column which is used for comparison (y)
   :param kwargs: additional keyword arguments passed to scipy.stats.mannwhitneyu
   :return: test results


.. py:function:: represent_p_values_as_asteriks(p_values: pandas.Series, threshold_representation: collections.OrderedDict = None)

   Represent p-values as asterisks based on their values.

   :param p_values: p-values to represent
   :param threshold_representation: mapping of p-value thresholds to their asterisk representation (optional)
   :return: asterisk representation of the given p-values
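A minimal sketch of computing all error metrics at once; the exact result keys are assumed to be the lowercase metric short names mentioned above:

.. code-block:: python

   import numpy as np
   import xarray as xr
   from mlair.helpers.statistics import calculate_error_metrics

   obs = xr.DataArray(np.random.rand(100), dims=["index"])
   forecast = obs + 0.1 * np.random.randn(100)  # forecast with small noise

   # returns a dict of metrics calculated along the given dimension
   metrics = calculate_error_metrics(forecast, obs, dim="index")
   # key names assumed, e.g. metrics["rmse"], metrics["ioa"]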
.. py:class:: SkillScores(external_data: Union[Data, None], models=None, observation_name='obs', ahead_dim='ahead', type_dim='type', index_dim='index')

   Calculate different kinds of skill scores.

   Skill score on MSE:
       Calculate skill score based on MSE for given forecast, reference and observations.

       .. math::

          \text{SkillScore} = 1 - \frac{\text{MSE(obs, for)}}{\text{MSE(obs, ref)}}

       To run:

       .. code-block:: python

          skill_scores = SkillScores(None).general_skill_score(data, forecast_name=forecast_name,
                                                               reference_name=reference_name,
                                                               observation_name=observation_name)

   Competitive skill score:
       Calculate skill scores to highlight differences between forecasts. This skill score is also based on the MSE. Currently, the required forecasts are CNN, OLS and persi, as well as the observation obs.

       .. code-block:: python

          skill_scores_class = SkillScores(internal_data)  # must contain columns CNN, OLS, persi and obs.
          skill_scores = skill_scores_class.skill_scores(window_lead_time=3)

   Skill score according to Murphy:
       Follow the climatological skill score definition of Murphy (1988). External data is data from a different time period than the internal data set on initialisation. In other terms, the internal data should be the train and validation data, whereas the external data is the test data. This may sound counter-intuitive, but when a skill score is used to compare one model to another, this comparison must be performed on the test data set. Therefore, in this case the foreign data is the test data.

       .. code-block:: python

          skill_scores_class = SkillScores(external_data)  # must contain columns obs and CNN.
          skill_scores_clim = skill_scores_class.climatological_skill_scores(internal_data, window_lead_time=3)

   .. py:attribute:: models_default
      :annotation: = ['cnn', 'persi', 'ols']

   .. py:method:: set_model_names(self, models: List[str]) -> List[str]

      Either use the given models or fall back to the defaults.

   .. py:method:: _reorder(model_list: List[str]) -> List[str]
      :staticmethod:

      Move the elements persi and obs to the very end of the given list.

   .. py:method:: get_model_name_combinations(self)

      Return all combinations of two models as tuple and string.

   .. py:method:: skill_scores(self) -> [pandas.DataFrame, pandas.DataFrame]

      Calculate skill scores for all combinations of model names.

      :return: skill score for each comparison and forecast step

   .. py:method:: climatological_skill_scores(self, internal_data: Data, forecast_name: str) -> xarray.DataArray

      Calculate climatological skill scores according to Murphy (1988).

      Calculate all CASES I - IV and terms [ABC][I-IV]. External data has to be set on initialisation; internal data is passed as a parameter.

      :param internal_data: internal data
      :param forecast_name: name of the forecast to use for this calculation (must be available in `data`)
      :return: all CASES as well as all terms

   .. py:method:: _climatological_skill_score(self, internal_data, observation_name, forecast_name, mu_type=1, external_data=None)

   .. py:method:: general_skill_score(self, data: Data, forecast_name: str, reference_name: str, observation_name: str = None, dim: str = 'index') -> numpy.ndarray

      Calculate general skill score based on mean squared error.

      :param data: internal data containing data for observation, forecast and reference
      :param forecast_name: name of forecast
      :param reference_name: name of reference
      :param observation_name: name of observation
      :return: skill score of forecast

   .. py:method:: get_count(self, data: Data, dim: str = 'index') -> numpy.ndarray

      Count the data along the given dimension and return the number of values.

   .. py:method:: skill_score_pre_calculations(self, data: Data, observation_name: str, forecast_name: str) -> Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, Data, Dict[str, Data]]

      Calculate the terms AI, BI, and CI, as well as mean, variance and Pearson's correlation, and clean up the data.

      The additional information on mean, variance and Pearson's correlation (and the p-value) is returned as a dictionary with the corresponding keys mean, sigma, r and p.

      :param data: internal data to use for calculations
      :param observation_name: name of observation
      :param forecast_name: name of forecast
      :returns: terms AI, BI, and CI, internal data without nans, and mean, variance, correlation and its p-value

   .. py:method:: skill_score_mu_case_1(self, internal_data, observation_name, forecast_name)

      Calculate CASE I.

   .. py:method:: skill_score_mu_case_2(self, internal_data, observation_name, forecast_name)

      Calculate CASE II.

   .. py:method:: skill_score_mu_case_3(self, internal_data, observation_name, forecast_name, external_data=None)

      Calculate CASE III.

   .. py:method:: skill_score_mu_case_4(self, internal_data, observation_name, forecast_name, external_data=None)

      Calculate CASE IV.

   .. py:method:: create_monthly_mean_from_daily_data(self, data, columns=None, index=None)

      Calculate the average for each month and save it as daily values with flag 'X'.

      :param data: data to average
      :param columns: columns to work on (all columns of the given data are used if empty)
      :param index: index of the returned data (index of the given data is used if empty)
      :return: data containing monthly means in daily resolution
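A minimal usage sketch for the MSE-based skill score; the data layout below (a ``type`` dimension with named coordinates matching the class defaults) is an assumption for illustration, not the only supported layout:

.. code-block:: python

   import numpy as np
   import xarray as xr
   from mlair.helpers.statistics import SkillScores

   # toy data: 50 samples each of observation, forecast, and persistence reference
   data = xr.DataArray(
       np.random.randn(50, 3),
       dims=["index", "type"],
       coords={"index": np.arange(50), "type": ["obs", "model", "persi"]},
   )

   # skill of "model" against the persistence reference, measured on MSE
   skill = SkillScores(None).general_skill_score(
       data, forecast_name="model", reference_name="persi", observation_name="obs")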
.. py:function:: create_single_bootstrap_realization(data: xarray.DataArray, dim_name_time: str) -> xarray.DataArray

   Return a bootstrapped realization of data.

   :param data: data from which to draw ONE bootstrap realization
   :param dim_name_time: name of time dimension
   :return: bootstrapped realization of data


.. py:function:: calculate_average(data: xarray.DataArray, **kwargs) -> xarray.DataArray

   Calculate the mean of data.

   :param data: data for which to calculate the mean
   :return: mean of data


.. py:function:: create_n_bootstrap_realizations(data: xarray.DataArray, dim_name_time: str, dim_name_model: str, n_boots: int = 1000, dim_name_boots: str = 'boots', seasons: List = None) -> Dict[str, xarray.DataArray]

   Create n bootstrap realizations and calculate averages across realizations.

   :param data: original data from which to create bootstrap realizations
   :param dim_name_time: name of time dimension
   :param dim_name_model: name of model dimension
   :param n_boots: number of bootstrap realizations
   :param dim_name_boots: name of bootstrap dimension
   :param seasons: calculate errors for given seasons in addition (default None)
   :return: dictionary of averaged bootstrap realizations, with additional entries per season if seasons are given


.. py:function:: calculate_bias_free_data(data, time_dim='index', window_size=30)
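A minimal sketch for ``create_n_bootstrap_realizations``; the dimension names and data layout are illustrative only:

.. code-block:: python

   import numpy as np
   import pandas as pd
   import xarray as xr
   from mlair.helpers.statistics import create_n_bootstrap_realizations

   # toy error data: one year of daily values for two models
   errors = xr.DataArray(
       np.random.rand(365, 2),
       dims=["index", "type"],
       coords={"index": pd.date_range("2020-01-01", periods=365), "type": ["model", "persi"]},
   )

   # draw 100 bootstrap realizations along the time dimension and average them
   boot_avgs = create_n_bootstrap_realizations(
       errors, dim_name_time="index", dim_name_model="type", n_boots=100)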