pytesmo package

Subpackages

Submodules

pytesmo.cdf_matching module

class pytesmo.cdf_matching.CDFMatching(nbins: int = 100, percentiles: Sequence | None = None, minobs: int | None = None, linear_edge_scaling: bool = False, combine_invalid: bool = False)

Bases: RegressorMixin, BaseEstimator

Predicts a variable from a single other variable by CDF matching.

Parameters:
  • nbins (int, optional) – Number of bins to use for the empirical CDF. Default is 100. If minobs is set, this might be reduced in case there’s not enough data in each bin.

  • percentiles (sequence, optional) – Percentile values to use. If this is given, nbins is ignored. The percentiles might still be changed if minobs is given and the number of observations per bin is lower. Default is None.

  • linear_edge_scaling (bool, optional) – Whether to derive the edge parameters via linear regression (more robust, see Moesinger et al. (2020) for more info). Default is False. Note that this way only the outliers in the reference (y) CDF are handled. Outliers in the input data (x) will not be removed and will still show up in the data.

  • minobs (int, optional) – Minimum desired number of observations in a bin. If a bin contains fewer observations, the number of bins is reduced. Default is None (no resizing).

  • combine_invalid (bool, optional) – Optional feature to combine the masks of invalid data (NaN, Inf) of both source (X) and reference (y) data passed to fit. This only makes sense if X and y are both time series corresponding to the same index. In this case, this ensures that data is only used if values for both X and y are available, so that seasonal patterns in missing values in one of them do not lead to distortions. (For example, if X is available the whole year, but y is only available during summer, the distribution of y should not be matched against the whole-year CDF of X, because that could introduce systematic seasonal biases.) The default is False.
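The joint masking that combine_invalid describes can be illustrated with a minimal, standard-library sketch. The helper name joint_valid_mask and the sample values are hypothetical and are not part of pytesmo; the sketch only shows the idea of keeping a pair when both values are finite.

```python
import math


def joint_valid_mask(x, y):
    """Return booleans that are True only where both x and y are finite.

    Hypothetical helper for illustration; not part of pytesmo.
    """
    if len(x) != len(y):
        raise ValueError("x and y must share the same index (same length)")
    return [math.isfinite(a) and math.isfinite(b) for a, b in zip(x, y)]


nan = float("nan")
x = [0.10, 0.20, nan, 0.40, 0.50]            # source values
y = [nan, 0.25, 0.30, float("inf"), 0.55]    # reference values

mask = joint_valid_mask(x, y)
# Keep only the positions where both series have usable data.
x_valid = [a for a, keep in zip(x, mask) if keep]
y_valid = [b for b, keep in zip(y, mask) if keep]
```

With the mask applied to both series, the two empirical CDFs are built from the same set of time steps.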

x_perc_

The percentile values derived from the source (X) data. If the number of bins was reduced during fitting due to insufficient data, it is right-padded with NaNs.

Type:

np.ndarray (nbins,)

y_perc_

The percentile values derived from the reference (y) data. If the number of bins was reduced during fitting due to insufficient data, it is right-padded with NaNs.

Type:

np.ndarray (nbins,)

Notes

This implementation does not do any temporal matching of the reference and source datasets. If this is required, this has to be done beforehand.

fit(X: ndarray | Series | DataFrame, y: ndarray | Series)

Derive the CDF matching parameters.

Parameters:
  • X (array_like) – An array/pd.Series or a matrix/pd.DataFrame with a single column.

  • y (array_like) – An array/pd.Series of reference data.

predict(X)

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → CDFMatching

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object

pytesmo.df_metrics module

pytesmo.scaling module

Created on Apr 17, 2013

@author: Christoph Paulik christoph.paulik@geo.tuwien.ac.at

pytesmo.scaling.add_scaled(df, method='linreg', label_in=None, label_scale=None, **kwargs)

Takes a DataFrame and appends a scaled time series to it. If no labels are given, the first column is scaled to the second column of the DataFrame.

Parameters:
  • df (pandas.DataFrame) – input dataframe

  • method (string) – scaling method

  • label_in (string, optional) – The column of the dataframe that should be scaled to the column given by label_scale. Default is the first column.

  • label_scale (string, optional) – The column of the dataframe that the label_in column should be scaled to. Default is the second column.

Returns:

df – input dataframe with a new column labeled label_in + '_scaled_' + method

Return type:

pandas.DataFrame

pytesmo.scaling.cdf_beta_match(*args, **kwargs)

pytesmo.scaling.cdf_match(src, ref, nbins=100, minobs=20, linear_edge_scaling=True, percentiles=None, combine_invalid=True, max_val=None, min_val=None)

Rescales by CDF matching.

This calculates the empirical CDFs for source and reference dataset using a specified number of bins. In case of non-unique percentile values, a beta distribution is fitted to the CDF. For more robust estimation of the lower and upper bins, linear edge scaling is used (see Moesinger et al., 2020 for details).

Parameters:
  • src (numpy.array) – input dataset which will be scaled

  • ref (numpy.array) – src will be scaled to this dataset

  • nbins (int, optional) – Number of bins to use for estimation of the CDF

  • percentiles (sequence, optional) – Percentile values to use. If this is given, nbins is ignored. The percentiles might still be changed if minobs is given and the number of observations per bin is lower. Default is None.

  • minobs (int, optional) – Minimum desired number of observations in a bin for bin resizing. If it is None bins will not be resized. Default is 20.

  • linear_edge_scaling (bool, optional) – Whether to derive the edge parameters via linear regression (more robust, see Moesinger et al. (2020) for more info). Default is True. Note that this way only the outliers in the reference (ref) CDF are handled. Outliers in the input data (src) will not be removed and will still show up in the data.

  • combine_invalid (bool, optional) – Optional feature to combine the masks of invalid data (NaN, Inf) of both source (src) and reference (ref) data. This only makes sense if src and ref are both time series corresponding to the same index. In this case, this ensures that data is only used if values for both src and ref are available, so that seasonal patterns in missing values in one of them do not lead to distortions. (For example, if src is available the whole year, but ref is only available during summer, the distribution of ref should not be matched against the whole-year CDF of src, because that could introduce systematic seasonal biases.) Default is True.

  • max_val (float, optional) – Maximum value to enforce.

  • min_val (float, optional) – Minimum value to enforce.

Returns:

CDF matched values – dataset src rescaled so that its CDF matches that of ref

Return type:

numpy.array
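The core idea of empirical CDF matching described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not pytesmo's implementation: it omits bin resizing via minobs, the beta-distribution fit for non-unique percentiles, linear edge scaling, and the min_val/max_val limits. The names percentile, interp and cdf_match_sketch are hypothetical.

```python
from bisect import bisect_right


def percentile(sorted_vals, q):
    """Percentile q (0..100) of pre-sorted data, linearly interpolated between ranks."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def interp(x, xp, fp):
    """Piecewise-linear interpolation, clamped at both ends; xp must be non-decreasing."""
    if x <= xp[0]:
        return fp[0]
    if x >= xp[-1]:
        return fp[-1]
    i = bisect_right(xp, x) - 1          # xp[i] <= x < xp[i + 1]
    t = (x - xp[i]) / (xp[i + 1] - xp[i])
    return fp[i] + t * (fp[i + 1] - fp[i])


def cdf_match_sketch(src, ref, nbins=4):
    """Map each src value onto the ref value found at the same percentile."""
    qs = [100.0 * k / nbins for k in range(nbins + 1)]
    src_sorted, ref_sorted = sorted(src), sorted(ref)
    src_perc = [percentile(src_sorted, q) for q in qs]
    ref_perc = [percentile(ref_sorted, q) for q in qs]
    return [interp(v, src_perc, ref_perc) for v in src]


src = [float(v) for v in range(9)]       # 0, 1, ..., 8
ref = [10.0 + 2.0 * v for v in src]      # a linearly transformed copy of src
matched = cdf_match_sketch(src, ref)
```

Because ref is a linear transform of src in this toy example, the matched values reproduce ref.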

pytesmo.scaling.get_scaling_function(method)

Get scaling function based on method name.

Parameters:

method (string) – method name as string

Returns:

scaling_func – function(src: numpy.ndarray, ref: numpy.ndarray) -> scaled_src: numpy.ndarray

Return type:

function

Raises:

KeyError – if method is not found

pytesmo.scaling.get_scaling_method_lut()

Get all defined scaling methods and their function names.

Returns:

lut – key: scaling method name, value: function

Return type:

dictionary
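The lookup-table pattern behind get_scaling_method_lut and get_scaling_function can be sketched as follows. The method names "identity" and "mean_shift", the two stand-in scaling functions, and the _sketch suffix are all hypothetical; the real table maps pytesmo's own method names to its scaling functions.

```python
def _identity(src, ref):
    """Stand-in scaling function: returns src unchanged."""
    return list(src)


def _mean_shift(src, ref):
    """Stand-in scaling function: shifts src so that its mean equals that of ref."""
    offset = sum(ref) / len(ref) - sum(src) / len(src)
    return [v + offset for v in src]


def get_scaling_method_lut_sketch():
    """Return a dict mapping method name -> function(src, ref)."""
    return {"identity": _identity, "mean_shift": _mean_shift}


def get_scaling_function_sketch(method):
    """Look up a scaling function by name; an unknown name raises KeyError."""
    lut = get_scaling_method_lut_sketch()
    try:
        return lut[method]
    except KeyError:
        raise KeyError(f"unknown scaling method: {method!r}") from None


func = get_scaling_function_sketch("mean_shift")
shifted = func([1.0, 2.0, 3.0], [11.0, 12.0, 13.0])
```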

pytesmo.scaling.linreg(src, ref, **kwargs)

Scales the input dataset src to ref using linear regression.

Parameters:
  • src (numpy.array) – input dataset which will be scaled

  • ref (numpy.array) – src will be scaled to this dataset

Returns:

scaled dataset – dataset scaled using linear regression

Return type:

numpy.array

pytesmo.scaling.linreg_params(src, ref)

Calculate additive and multiplicative correction parameters based on linear regression models.

Parameters:
  • src (numpy.array) – Candidate data (to which the corrections apply)

  • ref (numpy.array) – Reference data (which candidate is scaled to)

Returns:

  • slope (float) – Multiplicative correction value

  • intercept (float) – Additive correction value

pytesmo.scaling.linreg_stored_params(src, slope, intercept)

Scale the input data with the passed correction values.

Parameters:
  • src (numpy.array) – Candidate values to be scaled

  • slope (float) – Multiplicative correction value

  • intercept (float) – Additive correction value

Returns:

src_scaled – The scaled input values

Return type:

numpy.array
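The split between deriving correction parameters and applying stored ones can be sketched with an ordinary least-squares fit in plain Python. This is an illustration under that assumption only; the functions below are hypothetical stand-ins, and the library's own fitting routine and handling of edge cases may differ.

```python
def linreg_params_sketch(src, ref):
    """Ordinary least-squares fit of ref = slope * src + intercept."""
    n = len(src)
    mean_src = sum(src) / n
    mean_ref = sum(ref) / n
    cov = sum((s - mean_src) * (r - mean_ref) for s, r in zip(src, ref))
    var = sum((s - mean_src) ** 2 for s in src)
    slope = cov / var                       # multiplicative correction
    intercept = mean_ref - slope * mean_src  # additive correction
    return slope, intercept


def apply_params_sketch(src, slope, intercept):
    """Apply previously derived correction values to candidate data."""
    return [slope * s + intercept for s in src]


src = [1.0, 2.0, 3.0, 4.0]
ref = [3.0, 5.0, 7.0, 9.0]                   # exactly 2 * src + 1
slope, intercept = linreg_params_sketch(src, ref)
scaled = apply_params_sketch(src, slope, intercept)
```

Storing slope and intercept separately allows the same correction to be applied later to new candidate data without access to the reference.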

pytesmo.scaling.mean_std(src, ref, **kwargs)

Scales the input dataset src so that it has the same mean and standard deviation as ref afterwards.

Parameters:
  • src (numpy.array) – input dataset which will be scaled

  • ref (numpy.array) – src will be scaled to this dataset

Returns:

scaled dataset – dataset src with same mean and standard deviation as ref

Return type:

numpy.array
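Matching mean and standard deviation amounts to standardising src and then rescaling it with the statistics of ref. A minimal standard-library sketch, assuming the population standard deviation is used (the library's choice may differ); mean_std_sketch is a hypothetical name:

```python
import statistics


def mean_std_sketch(src, ref):
    """Rescale src so that its mean and standard deviation equal those of ref."""
    mu_src, mu_ref = statistics.fmean(src), statistics.fmean(ref)
    sd_src, sd_ref = statistics.pstdev(src), statistics.pstdev(ref)
    # Standardise src, then stretch and shift it to the statistics of ref.
    return [(v - mu_src) / sd_src * sd_ref + mu_ref for v in src]


src = [1.0, 2.0, 3.0, 4.0, 5.0]
ref = [10.0, 20.0, 30.0, 40.0, 50.0]
scaled = mean_std_sketch(src, ref)
```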

pytesmo.scaling.min_max(src, ref, **kwargs)

Scales the input dataset src so that it has the same minimum and maximum as ref afterwards.

Parameters:
  • src (numpy.array) – input dataset which will be scaled

  • ref (numpy.array) – src will be scaled to this dataset

Returns:

scaled dataset – dataset src with same maximum and minimum as ref

Return type:

numpy.array
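Min-max scaling normalises src to the range [0, 1] and then stretches it to the range of ref. A minimal sketch in plain Python; min_max_sketch is a hypothetical name, and the sketch assumes src is not constant (otherwise the division fails):

```python
def min_max_sketch(src, ref):
    """Rescale src linearly so that its minimum and maximum equal those of ref."""
    lo_src, hi_src = min(src), max(src)
    lo_ref, hi_ref = min(ref), max(ref)
    return [
        (v - lo_src) / (hi_src - lo_src) * (hi_ref - lo_ref) + lo_ref
        for v in src
    ]


src = [0.0, 5.0, 10.0]
ref = [100.0, 150.0, 300.0]
scaled = min_max_sketch(src, ref)
```

Only the extremes of ref enter the result, so a single outlier in ref changes the whole scaling.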

pytesmo.scaling.scale(df, method='linreg', reference_index=0, **kwargs)

Takes a pandas.DataFrame and scales all columns to the column specified by reference_index with the chosen method.

Parameters:
  • df (pandas.DataFrame) – containing matched time series that should be scaled

  • method (string, optional) – Name of the scaling method. It has to be a function in globals() that takes two numpy.array inputs and returns one numpy.array of the same length.

  • reference_index (int, optional) – default 0, column index of reference dataset in dataframe

Returns:

scaled data – all time series of the input DataFrame scaled to the one specified by reference_index

Return type:

pandas.DataFrame

pytesmo.temporal_matching module

Provides functions for temporally collocating data from multiple dataframes.

pytesmo.temporal_matching.combined_temporal_collocation(reference, others, window, method='nearest', dropduplicates=False, dropna=False, combined_dropna=False, flag=None, checkna=False, use_invalid=False, add_ref_data=False)

Temporally collocates multiple dataframes to reference times.

Parameters:
  • reference (pd.DataFrame, pd.Series, or pd.DatetimeIndex) – The reference onto which other should be collocated. If this is a DataFrame or a Series, the index must be a DatetimeIndex. If the index is timezone-naive, UTC will be assumed.

  • others (list/tuple of pd.DataFrame or pd.Series) – DataFrames/Series to be collocated. Each entry must have a pd.DatetimeIndex as index. If the index is timezone-naive, the timezone of the reference data will be assumed.

  • window (pd.Timedelta or float) – Window around reference timestamps in which to look for data. Floats are interpreted as number of days.

  • method (str, optional) –

    Which method to use for the temporal collocation:

    • "nearest" (default): Uses the nearest valid neighbour. When this method is used, entries with duplicate index values in other will be dropped, and only the first of the duplicates is kept.

    • "mean": Takes the mean over the given window around the reference times.

  • dropduplicates (bool, optional) – Whether to drop duplicated timestamps in others. Default is False, except when method="nearest", in which case this is enforced to be True.

  • dropna (bool, optional) – Whether to drop NaNs from the resulting dataframe (arising for example from duplicates with duplicates_nan=True or from missing values). Default is False.

  • combined_dropna (str or bool, optional) – Whether and how to drop NaNs from the resulting combined DataFrame. Can be "any", "all", True or False. "any" makes sure that the output dataframe only has values at times where all input frames had values, while "all" only drops lines where all values are NaN. True is the same as "any", and False (default) disables dropping NaNs.

  • checkna (bool, optional) – Whether to check if only NaNs are returned (i.e. no match has been found). If set to True, raises a UserWarning in case no match has been found. Default is False.

  • flag (np.ndarray or None, optional) – Flag column as array. If this is given, the column will be interpreted as validity indicator. Any nonzero values mark the row as invalid. Default is None.

  • use_invalid (bool, optional) – Whether to use invalid values marked by flag in case no valid values are available. Default is False.

  • add_ref_data (bool, optional) – If reference is a DataFrame or Series, add the data to the final collocated dataframe.

Returns:

collocated – Temporally collocated DataFrame with variables from all input frames merged together.

Return type:

pd.DataFrame or pd.Series
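The difference between the "any" and "all" settings of combined_dropna can be illustrated on a small table represented as a list of row tuples. This is a plain-Python sketch of the semantics described above, not the library's pandas-based code; drop_nan_rows is a hypothetical helper.

```python
import math


def drop_nan_rows(rows, combined_dropna=False):
    """Drop rows containing NaNs according to "any", "all", True or False."""
    if combined_dropna is False:
        return list(rows)                 # dropping disabled
    if combined_dropna is True:
        combined_dropna = "any"           # True is the same as "any"
    if combined_dropna not in ("any", "all"):
        raise ValueError("combined_dropna must be 'any', 'all', True or False")
    test = any if combined_dropna == "any" else all
    return [row for row in rows if not test(math.isnan(v) for v in row)]


nan = float("nan")
rows = [
    (1.0, 2.0),   # complete row: always kept
    (nan, 3.0),   # partially missing: dropped by "any", kept by "all"
    (nan, nan),   # fully missing: dropped by both
]
kept_any = drop_nan_rows(rows, "any")
kept_all = drop_nan_rows(rows, "all")
kept_none_dropped = drop_nan_rows(rows, False)
```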

pytesmo.temporal_matching.temporal_collocation(reference, other, window, method='nearest', return_index=False, return_distance=False, dropduplicates=False, dropna=False, checkna=False, flag=None, use_invalid=False)

Temporally collocates values to reference.

Parameters:
  • reference (pd.DataFrame, pd.Series, or pd.DatetimeIndex) – The reference onto which other should be collocated. If this is a DataFrame or a Series, the index must be a DatetimeIndex. If the index is timezone-naive and other is not, the timezone of other will be assumed.

  • other (pd.DataFrame or pd.Series) – Data to be collocated. Must have a pd.DatetimeIndex as index. If the index is timezone-naive and reference is not, the timezone of the reference data will be assumed.

  • window (pd.Timedelta or float) – Window around reference timestamps in which to look for data. Floats are interpreted as number of days.

  • method (str, optional) –

    Which method to use for the temporal collocation:

    • "nearest" (default): Uses the nearest valid neighbour. When this method is used, entries with duplicate index values in other will be dropped, and only the first of the duplicates is kept.

    • "mean": Takes the mean over the given window around the reference times.

  • return_index (boolean, optional) – Include index of other in matched dataframe (default: False). Only used with method="nearest". The index will be added as a separate column with the name "index_other".

  • return_distance (boolean, optional) – Include distance information between reference and other in matched dataframe (default: False). This is only used with method="nearest", and implies return_index=True. The distance will be added as a separate column with the name "distance_other".

  • dropduplicates (bool, optional) – Whether to drop duplicated timestamps in other. Default is False, except when method="nearest", in which case this is enforced to be True.

  • dropna (bool, optional) – Whether to drop NaNs from the resulting dataframe (arising for example from duplicates with duplicates_nan=True or from missing values). This uses how="all", that is, only rows where all values are NaN are dropped. Default is False.

  • checkna (bool, optional) – Whether to check if only NaNs are returned (i.e. no match has been found). If set to True, raises a UserWarning in case no match has been found. Default is False.

  • flag (np.ndarray, str or None, optional) – Flag column as array or name of the flag column in other. If this is given, the column will be interpreted as validity indicator. Any nonzero values mark the row as invalid. Default is None.

  • use_invalid (bool, optional) – Whether to use invalid values marked by flag in case no valid values are available. Default is False.

Returns:

collocated – Temporally collocated version of other.

Return type:

pd.DataFrame or pd.Series
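The "nearest" method can be illustrated with a standard-library sketch that, for each reference timestamp, picks the closest timestamp in other and accepts it only if it lies within the window. The function collocate_nearest is hypothetical; it assumes sorted, duplicate-free timestamps, treats window as the maximum allowed distance on either side, and ignores flags, NaNs and time zones, all of which the library handles separately.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def collocate_nearest(ref_times, other_times, other_values, window):
    """Return, per reference time, the nearest other value within window, else None."""
    collocated = []
    for t in ref_times:
        i = bisect_left(other_times, t)
        # Only the neighbours directly before and after t can be the nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(other_times)]
        best = min(candidates, key=lambda j: abs(other_times[j] - t), default=None)
        if best is not None and abs(other_times[best] - t) <= window:
            collocated.append(other_values[best])
        else:
            collocated.append(None)        # no match inside the window
    return collocated


ref_times = [datetime(2020, 1, 1, h) for h in (0, 6, 12)]
other_times = [
    datetime(2020, 1, 1, 0, 20),
    datetime(2020, 1, 1, 5, 30),
    datetime(2020, 1, 1, 9, 0),
]
other_values = [0.11, 0.22, 0.33]

result = collocate_nearest(ref_times, other_times, other_values, timedelta(hours=1))
```

The last reference time has no neighbour within one hour, so it stays unmatched, which corresponds to a NaN row in the collocated output.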

pytesmo.utils module

Module containing utility functions that do not fit into other modules

pytesmo.utils.array_dropna(*arrs)

Drop elements from input arrays where ANY array is NaN

Parameters:

*arrs (np.array(s)) – One or multiple numpy arrays of the same length that contain nans

Returns:

arrs_dropna – Input arrays without NaNs

Return type:

np.array
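The "drop where ANY array is NaN" behaviour can be sketched on plain lists. The function array_dropna_sketch is a hypothetical stand-in that works on lists of floats instead of numpy arrays:

```python
import math


def array_dropna_sketch(*arrs):
    """Drop every position at which at least one of the input lists is NaN."""
    keep = [not any(math.isnan(v) for v in vals) for vals in zip(*arrs)]
    return tuple([v for v, k in zip(arr, keep) if k] for arr in arrs)


nan = float("nan")
a = [1.0, nan, 3.0, 4.0]
b = [10.0, 20.0, nan, 40.0]

a_clean, b_clean = array_dropna_sketch(a, b)
```

Positions 1 and 2 are removed from both lists, so the outputs stay aligned element by element.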

pytesmo.utils.deprecated(message: str | None = None)

Decorator for class methods or functions to mark them as deprecated. If the decorator is applied without a specific message (@deprecated()), the default warning is shown when using the function/class. To specify a custom message, use it like:

@deprecated("Don't use this function anymore!")

Parameters:

message (str, optional (default: None)) – Custom message to show with the DeprecationWarning.
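A decorator of this kind is commonly built as a decorator factory around warnings.warn. The following is a generic sketch of that pattern, not pytesmo's source; deprecated_sketch, its default message text and old_add are hypothetical.

```python
import functools
import warnings


def deprecated_sketch(message=None):
    """Decorator factory that emits a DeprecationWarning on every call."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Fall back to a generic text when no custom message was given.
            text = message if message is not None else f"{func.__name__} is deprecated."
            warnings.warn(text, category=DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)

        return wrapper

    return decorator


@deprecated_sketch("Don't use this function anymore!")
def old_add(a, b):
    return a + b


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")       # make sure the warning is not filtered out
    result = old_add(1, 2)
```

The parentheses in @deprecated() are required in this pattern even without a message, because the outer function has to be called to produce the actual decorator.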

pytesmo.utils.element_iterable(el)

Test if an element is iterable.

Parameters:

el (object) –

Returns:

iterable – True if el is iterable, False otherwise

Return type:

boolean

pytesmo.utils.ensure_iterable(el)

Ensure that an object is iterable by putting it into a list. Strings are handled slightly differently: they are technically iterable, but we want to keep them whole.

Parameters:

el (object) –

Returns:

iterable – [el]

Return type:

list
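The special treatment of strings can be sketched as follows. Both functions are hypothetical stand-ins; in particular, returning already-iterable, non-string inputs unchanged is an assumption of this sketch, and the library's iterability check may be implemented differently.

```python
def element_iterable_sketch(el):
    """Return True if el can be iterated over, False otherwise."""
    try:
        iter(el)
    except TypeError:
        return False
    return True


def ensure_iterable_sketch(el):
    """Wrap strings and non-iterables in a list; pass other iterables through."""
    if isinstance(el, str) or not element_iterable_sketch(el):
        return [el]      # keep the string (or scalar) whole as a single item
    return el            # assumption: iterables are returned as they are


wrapped_string = ensure_iterable_sketch("soil_moisture")
wrapped_scalar = ensure_iterable_sketch(5)
unchanged_list = ensure_iterable_sketch([1, 2])
```

Without the string check, looping over the result for a single column name would iterate over its characters instead of the name itself.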

pytesmo.utils.interp_uniq(src)

Replace non-unique values by their linearly interpolated values. This method interpolates iteratively, as is done in IDL.

Parameters:

src (numpy.array) – array to ensure uniqueness of

Returns:

src – interpolated unique values in array of same size as src

Return type:

numpy.array

pytesmo.utils.rootdir() → Path

Module contents