:py:mod:`mlair.helpers.statistics`
==================================

.. py:module:: mlair.helpers.statistics

.. autoapi-nested-parse::

   Collection of statistical methods: transformations and skill scores.


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   mlair.helpers.statistics.SkillScores


Functions
~~~~~~~~~

.. autoapisummary::

   mlair.helpers.statistics.apply_inverse_transformation
   mlair.helpers.statistics.standardise
   mlair.helpers.statistics.standardise_inverse
   mlair.helpers.statistics.standardise_apply
   mlair.helpers.statistics.centre
   mlair.helpers.statistics.centre_inverse
   mlair.helpers.statistics.centre_apply
   mlair.helpers.statistics.min_max
   mlair.helpers.statistics.min_max_inverse
   mlair.helpers.statistics.min_max_apply
   mlair.helpers.statistics.log
   mlair.helpers.statistics.log_inverse
   mlair.helpers.statistics.log_apply
   mlair.helpers.statistics.mean_squared_error
   mlair.helpers.statistics.mean_absolute_error
   mlair.helpers.statistics.mean_error
   mlair.helpers.statistics.index_of_agreement
   mlair.helpers.statistics.modified_normalized_mean_bias
   mlair.helpers.statistics.calculate_error_metrics
   mlair.helpers.statistics.get_error_metrics_units
   mlair.helpers.statistics.get_error_metrics_long_name
   mlair.helpers.statistics.mann_whitney_u_test
   mlair.helpers.statistics.represent_p_values_as_asteriks
   mlair.helpers.statistics.create_single_bootstrap_realization
   mlair.helpers.statistics.calculate_average
   mlair.helpers.statistics.create_n_bootstrap_realizations
   mlair.helpers.statistics.calculate_bias_free_data


Attributes
~~~~~~~~~~

.. autoapisummary::

   mlair.helpers.statistics.__author__
   mlair.helpers.statistics.__date__
   mlair.helpers.statistics.Data


.. py:data:: __author__
   :annotation: = Lukas Leufen, Felix Kleinert


.. py:data:: __date__
   :annotation: = 2019-10-23


.. py:data:: Data


.. py:function:: apply_inverse_transformation(data: Data, method: str = 'standardise', mean: Data = None, std: Data = None, max: Data = None, min: Data = None, feature_range: Data = None) -> Data

   Apply inverse transformation for given statistics.

   :param data: transform this data back
   :param method: transformation method (optional)
   :param mean: mean of transformation (optional)
   :param std: standard deviation of transformation (optional)
   :param max: maximum value for min/max transformation (optional)
   :param min: minimum value for min/max transformation (optional)
   :param feature_range: feature range of min/max transformation (optional)
   :return: inverse transformed data


.. py:function:: standardise(data: Data, dim: Union[str, int]) -> Tuple[Data, Dict[str, Data]]

   Standardise a xarray.DataArray (along dim) or pandas.DataFrame (along axis) to mean=0 and std=1.

   :param data: data to standardise
   :param dim: name (xarray) or axis (pandas) of the dimension which should be standardised
   :return: standardised data, and dictionary with keys method, mean, and standard deviation


.. py:function:: standardise_inverse(data: Data, mean: Data, std: Data) -> Data

   Apply the inverse of `standardise` on data and thereby reverse the standardisation.

   :param data: standardised data
   :param mean: mean of the standardisation
   :param std: standard deviation of the standardisation
   :return: inverse standardised data


.. py:function:: standardise_apply(data: Data, mean: Data, std: Data) -> Data

   Apply `standardise` on data using given mean and std.

   :param data: data to transform
   :param mean: mean to use for transformation
   :param std: standard deviation to use for transformation
   :return: transformed data
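As a quick usage illustration, a minimal sketch of the round trip through ``standardise`` and ``standardise_inverse`` (the dictionary keys ``mean`` and ``std`` are an assumption based on the docstrings above):

.. code-block:: python

   import numpy as np
   import xarray as xr
   from mlair.helpers.statistics import standardise, standardise_inverse

   # toy data: 100 time steps for 3 variables
   data = xr.DataArray(np.random.randn(100, 3) * 5 + 2, dims=["datetime", "variables"])

   # standardise along the time dimension; opts stores the fitted statistics
   standardised, opts = standardise(data, dim="datetime")

   # invert the transformation with the stored statistics (key names assumed)
   restored = standardise_inverse(standardised, opts["mean"], opts["std"])
   np.testing.assert_allclose(restored.values, data.values)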
.. py:function:: centre(data: Data, dim: Union[str, int]) -> Tuple[Data, Dict[str, Data]]

   Centre a xarray.DataArray (along dim) or pandas.DataFrame (along axis) to mean=0.

   :param data: data to centre
   :param dim: name (xarray) or axis (pandas) of the dimension which should be centred
   :return: centred data, and dictionary with keys method and mean


.. py:function:: centre_inverse(data: Data, mean: Data) -> Data

   Apply the inverse of `centre` and therefore add the given mean back to the data.

   :param data: data to apply inverse centring on
   :param mean: mean to use for the inverse transformation
   :return: data with centring reversed


.. py:function:: centre_apply(data: Data, mean: Data) -> Data

   Apply `centre` on data using the given mean.

   :param data: data to transform
   :param mean: mean to use for transformation
   :return: transformed data


.. py:function:: min_max(data: Data, dim: Union[str, int], feature_range: Tuple = (0, 1)) -> Tuple[Data, Dict[str, Data]]

   Apply min/max scaling using (x - x_min) / (x_max - x_min). By default, the returned data lies in the interval [0, 1].

   :param data: data to transform
   :param dim: name (xarray) or axis (pandas) of the dimension which should be scaled
   :param feature_range: scale data to any interval given in feature range. Default is scaling on interval [0, 1].
   :return: transformed data, and dictionary with keys method, min, and max


.. py:function:: min_max_inverse(data: Data, _min: Data, _max: Data, feature_range: Tuple = (0, 1)) -> Data

   Apply the inverse transformation of `min_max` scaling.

   :param data: data to apply inverse scaling on
   :param _min: minimum value to use for min/max scaling
   :param _max: maximum value to use for min/max scaling
   :param feature_range: scale data to any interval given in feature range. Default is scaling on interval [0, 1].
   :return: inverted min/max scaled data


.. py:function:: min_max_apply(data: Data, _min: Data, _max: Data, feature_range: Data = (0, 1)) -> Data

   Apply `min_max` scaling with given minimum and maximum.

   :param data: data to apply scaling on
   :param _min: minimum value to use for min/max scaling
   :param _max: maximum value to use for min/max scaling
   :param feature_range: scale data to any interval given in feature range. Default is scaling on interval [0, 1].
   :return: min/max scaled data


.. py:function:: log(data: Data, dim: Union[str, int]) -> Tuple[Data, Dict[str, Data]]

   Apply a logarithmic transformation (followed by standardisation) to data.

   This method first transforms the data with the logarithm and then additionally applies the `standardise` method. Numpy's log1p (`res = log(1 + x)`) is used instead of the pure logarithm so that the transformation is also applicable to values of 0.

   :param data: transform this data
   :param dim: name (xarray) or axis (pandas) of the dimension which should be transformed
   :return: transformed data, and option dictionary with keys method, mean, and std


.. py:function:: log_inverse(data: Data, mean: Data, std: Data) -> Data

   Apply the inverse log transformation (i.e. an exponential transformation).

   Because `log` uses `np.log1p`, this method is based on its inverse `np.expm1`. Data are first rescaled using `standardise_inverse` and then passed to the exponential function.

   :param data: apply inverse log transformation on this data
   :param mean: mean of the standardisation
   :param std: std of the standardisation
   :return: inverted data
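A minimal sketch of min/max scaling to a custom feature range, assuming the returned dictionary exposes the fitted statistics under the keys ``min`` and ``max``:

.. code-block:: python

   import numpy as np
   import xarray as xr
   from mlair.helpers.statistics import min_max, min_max_inverse

   data = xr.DataArray(np.array([[1., 5.], [3., 7.], [5., 9.]]), dims=["datetime", "variables"])

   # scale each variable to [-1, 1] along the time dimension
   scaled, opts = min_max(data, dim="datetime", feature_range=(-1, 1))

   # invert with the stored minimum and maximum (key names assumed)
   restored = min_max_inverse(scaled, opts["min"], opts["max"], feature_range=(-1, 1))
   np.testing.assert_allclose(restored.values, data.values)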
.. py:function:: log_apply(data: Data, mean: Data, std: Data) -> Data

   Apply numpy's log1p on given data.

   Further information can be found in the description of the `log` method.

   :param data: transform this data
   :param mean: mean of the standardisation
   :param std: std of the standardisation
   :return: transformed data


.. py:function:: mean_squared_error(a, b, dim=None)

   Calculate mean squared error.


.. py:function:: mean_absolute_error(a, b, dim=None)

   Calculate mean absolute error.


.. py:function:: mean_error(a, b, dim=None)

   Calculate mean error, where a is the forecast and b the reference (e.g. observation).


.. py:function:: index_of_agreement(a, b, dim=None)

   Calculate index of agreement (IOA), where a is the forecast and b the reference (e.g. observation).


.. py:function:: modified_normalized_mean_bias(a, b, dim=None)

   Calculate modified normalized mean bias (MNMB), where a is the forecast and b the reference (e.g. observation).


.. py:function:: calculate_error_metrics(a, b, dim)

   Calculate MSE, ME, RMSE, MAE, IOA, and MNMB. Additionally, return the number of values used for the calculation.

   :param a: forecast data to calculate metrics for
   :param b: reference (e.g. observation)
   :param dim: dimension to calculate metrics along
   :returns: dict with results for all metrics indicated by lowercase metric short name


.. py:function:: get_error_metrics_units(base_unit)


.. py:function:: get_error_metrics_long_name()


.. py:function:: mann_whitney_u_test(data: pandas.DataFrame, reference_col_name: str, **kwargs)

   Calculate the Mann-Whitney U test.

   Uses pandas' .apply() on scipy.stats.mannwhitneyu(x, y, ...).

   :param data: data frame whose columns are compared against the reference column
   :param reference_col_name: name of the column which is used for comparison (y)
   :param kwargs: additional keyword arguments passed to scipy.stats.mannwhitneyu
   :return: test results


.. py:function:: represent_p_values_as_asteriks(p_values: pandas.Series, threshold_representation: collections.OrderedDict = None)

   Represent p-values as asterisks based on their values.

   :param p_values: p-values to represent
   :param threshold_representation: mapping of p-value thresholds to their asterisk representation (optional)
   :return: asterisk representation of the given p-values
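A minimal sketch of computing all error metrics at once; the exact result keys are assumed to be the lowercase metric short names mentioned above:

.. code-block:: python

   import numpy as np
   import xarray as xr
   from mlair.helpers.statistics import calculate_error_metrics

   obs = xr.DataArray(np.random.rand(100), dims=["index"])
   forecast = obs + 0.1 * np.random.randn(100)  # forecast with small noise

   # returns a dict of metrics calculated along the given dimension
   metrics = calculate_error_metrics(forecast, obs, dim="index")
   # key names assumed, e.g. metrics["rmse"], metrics["ioa"]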
.. py:class:: SkillScores(external_data: Union[Data, None], models=None, observation_name='obs', ahead_dim='ahead', type_dim='type', index_dim='index')

   Calculate different kinds of skill scores.

   Skill score on MSE:
       Calculate skill score based on MSE for given forecast, reference and observations.

       .. math::

          \text{SkillScore} = 1 - \frac{\text{MSE(obs, for)}}{\text{MSE(obs, ref)}}

       To run:

       .. code-block:: python

          skill_scores = SkillScores(None).general_skill_score(data, forecast_name=forecast_name,
                                                               reference_name=reference_name,
                                                               observation_name=observation_name)

   Competitive skill score:
       Calculate skill scores to highlight differences between forecasts. This skill score is also based on the MSE. Currently, the required forecasts are CNN, OLS and persi, as well as the observation obs.

       .. code-block:: python

          skill_scores_class = SkillScores(internal_data)  # must contain columns CNN, OLS, persi and obs.
          skill_scores = skill_scores_class.skill_scores(window_lead_time=3)

   Skill score according to Murphy:
       Follow the climatological skill score definition of Murphy (1988). External data is data from a different time period than the internal data set on initialisation. In other terms, the internal data should be the train and validation data, whereas the external data is the test data. This may sound counter-intuitive, but when a skill score is used to compare one model to another, this comparison must be performed on the test data set. Therefore, in this case the foreign data is the test data.

       .. code-block:: python

          skill_scores_class = SkillScores(external_data)  # must contain columns obs and CNN.
          skill_scores_clim = skill_scores_class.climatological_skill_scores(internal_data, window_lead_time=3)

   .. py:attribute:: models_default
      :annotation: = ['cnn', 'persi', 'ols']

   .. py:method:: set_model_names(self, models: List[str]) -> List[str]

      Either use the given models or fall back to the defaults.

   .. py:method:: _reorder(model_list: List[str]) -> List[str]
      :staticmethod:

      Move the elements persi and obs to the very end of the given list.

   .. py:method:: get_model_name_combinations(self)

      Return all combinations of two models as tuple and string.

   .. py:method:: skill_scores(self) -> [pandas.DataFrame, pandas.DataFrame]

      Calculate skill scores for all combinations of model names.

      :return: skill score for each comparison and forecast step

   .. py:method:: climatological_skill_scores(self, internal_data: Data, forecast_name: str) -> xarray.DataArray

      Calculate climatological skill scores according to Murphy (1988).

      Calculate all CASES I - IV and terms [ABC][I-IV]. External data has to be set on initialisation; internal data is passed as a parameter.

      :param internal_data: internal data
      :param forecast_name: name of the forecast to use for this calculation (must be available in `data`)
      :return: all CASES as well as all terms

   .. py:method:: _climatological_skill_score(self, internal_data, observation_name, forecast_name, mu_type=1, external_data=None)

   .. py:method:: general_skill_score(self, data: Data, forecast_name: str, reference_name: str, observation_name: str = None, dim: str = 'index') -> numpy.ndarray

      Calculate general skill score based on mean squared error.

      :param data: internal data containing data for observation, forecast and reference
      :param forecast_name: name of forecast
      :param reference_name: name of reference
      :param observation_name: name of observation
      :return: skill score of forecast

   .. py:method:: get_count(self, data: Data, dim: str = 'index') -> numpy.ndarray

      Count the data along the given dimension and return the number of values.

   .. py:method:: skill_score_pre_calculations(self, data: Data, observation_name: str, forecast_name: str) -> Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, Data, Dict[str, Data]]

      Calculate the terms AI, BI, and CI, as well as mean, variance and Pearson's correlation, and clean up the data.

      The additional information on mean, variance and Pearson's correlation (and the p-value) is returned as a dictionary with the corresponding keys mean, sigma, r and p.

      :param data: internal data to use for calculations
      :param observation_name: name of observation
      :param forecast_name: name of forecast
      :returns: terms AI, BI, and CI, internal data without nans, and mean, variance, correlation and its p-value

   .. py:method:: skill_score_mu_case_1(self, internal_data, observation_name, forecast_name)

      Calculate CASE I.

   .. py:method:: skill_score_mu_case_2(self, internal_data, observation_name, forecast_name)

      Calculate CASE II.

   .. py:method:: skill_score_mu_case_3(self, internal_data, observation_name, forecast_name, external_data=None)

      Calculate CASE III.

   .. py:method:: skill_score_mu_case_4(self, internal_data, observation_name, forecast_name, external_data=None)

      Calculate CASE IV.

   .. py:method:: create_monthly_mean_from_daily_data(self, data, columns=None, index=None)

      Calculate the average for each month and save it as daily values with flag 'X'.

      :param data: data to average
      :param columns: columns to work on (all columns of the given data are used if empty)
      :param index: index of the returned data (index of the given data is used if empty)
      :return: data containing monthly means in daily resolution
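A minimal usage sketch for the MSE-based skill score; the data layout below (a ``type`` dimension with named coordinates matching the class defaults) is an assumption for illustration, not the only supported layout:

.. code-block:: python

   import numpy as np
   import xarray as xr
   from mlair.helpers.statistics import SkillScores

   # toy data: 50 samples each of observation, forecast, and persistence reference
   data = xr.DataArray(
       np.random.randn(50, 3),
       dims=["index", "type"],
       coords={"index": np.arange(50), "type": ["obs", "model", "persi"]},
   )

   # skill of "model" against the persistence reference, measured on MSE
   skill = SkillScores(None).general_skill_score(
       data, forecast_name="model", reference_name="persi", observation_name="obs")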
.. py:function:: create_single_bootstrap_realization(data: xarray.DataArray, dim_name_time: str) -> xarray.DataArray

   Return a bootstrapped realization of data.

   :param data: data from which to draw ONE bootstrap realization
   :param dim_name_time: name of time dimension
   :return: bootstrapped realization of data


.. py:function:: calculate_average(data: xarray.DataArray, **kwargs) -> xarray.DataArray

   Calculate the mean of data.

   :param data: data for which to calculate the mean
   :return: mean of data


.. py:function:: create_n_bootstrap_realizations(data: xarray.DataArray, dim_name_time: str, dim_name_model: str, n_boots: int = 1000, dim_name_boots: str = 'boots', seasons: List = None) -> Dict[str, xarray.DataArray]

   Create n bootstrap realizations and calculate averages across realizations.

   :param data: original data from which to create bootstrap realizations
   :param dim_name_time: name of time dimension
   :param dim_name_model: name of model dimension
   :param n_boots: number of bootstrap realizations
   :param dim_name_boots: name of bootstrap dimension
   :param seasons: calculate errors for given seasons in addition (default None)
   :return: dictionary of averaged bootstrap realizations, with additional entries per season if seasons are given


.. py:function:: calculate_bias_free_data(data, time_dim='index', window_size=30)
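A minimal sketch for ``create_n_bootstrap_realizations``; the dimension names and data layout are illustrative only:

.. code-block:: python

   import numpy as np
   import pandas as pd
   import xarray as xr
   from mlair.helpers.statistics import create_n_bootstrap_realizations

   # toy error data: one year of daily values for two models
   errors = xr.DataArray(
       np.random.rand(365, 2),
       dims=["index", "type"],
       coords={"index": pd.date_range("2020-01-01", periods=365), "type": ["model", "persi"]},
   )

   # draw 100 bootstrap realizations along the time dimension and average them
   boot_avgs = create_n_bootstrap_realizations(
       errors, dim_name_time="index", dim_name_model="type", n_boots=100)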