pytesmo.cdf_matching module

class pytesmo.cdf_matching.CDFMatching(nbins: int = 100, percentiles: Sequence | None = None, minobs: int | None = None, linear_edge_scaling: bool = False, combine_invalid: bool = False)

Bases: RegressorMixin, BaseEstimator

Predicts a variable from a single other variable by CDF matching.
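As a rough illustration of the idea, CDF matching can be sketched with plain NumPy: compute the same percentiles for source and reference data, then map new source values onto the reference percentiles by piecewise-linear interpolation. The function name cdf_match_sketch and the evenly spaced percentiles are assumptions made for this example; this is not pytesmo's implementation and it omits minobs, linear_edge_scaling, and combine_invalid.

```python
import numpy as np


def cdf_match_sketch(x_fit, y_fit, x_new, nbins=100):
    """Hypothetical helper: map x_new onto the distribution of y_fit."""
    # Assumption: nbins evenly spaced percentile values between 0 and 100.
    percentiles = np.linspace(0, 100, nbins)
    # Empirical percentile values of source and reference, ignoring NaNs.
    x_perc = np.nanpercentile(x_fit, percentiles)
    y_perc = np.nanpercentile(y_fit, percentiles)
    # Piecewise-linear map from source percentiles to reference percentiles.
    # np.interp expects x_perc to be increasing (heavily tied data would need
    # de-duplication first) and clamps inputs outside the fitted range.
    return np.interp(x_new, x_perc, y_perc)


rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 5000)   # source data
y = rng.normal(10.0, 2.0, 5000)  # reference data
x_scaled = cdf_match_sketch(x, y, x)
```

After scaling, x_scaled follows approximately the same distribution as y, while the rank order of the source values is preserved.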

Parameters:

nbins (int, optional) – Number of bins to use for the empirical CDF. Default is 100. If minobs is set, this might be reduced in case there’s not enough data in each bin.
percentiles (sequence, optional) – Percentile values to use. If this is given, nbins is ignored. The percentiles might still be changed if minobs is given and the number of observations per bin is lower than minobs. Default is None.
linear_edge_scaling (bool, optional) – Whether to derive the edge parameters via linear regression (more robust, see Moesinger et al. (2020) for more info). Default is False. Note that this only handles outliers in the reference (y) CDF. Outliers in the input data (x) are not removed and will still show up in the data.
minobs (int, optional) – Minimum desired number of observations in a bin. If a bin contains fewer observations, the number of bins is reduced. Default is None (no resizing).
combine_invalid (bool, optional) – Optional feature to combine the masks of invalid data (NaN, Inf) of both the source (X) and reference (y) data passed to fit. This only makes sense if X and y are both time series that share the same index. In this case, it makes sure that data is only used where values for both X and y are available, so that seasonal patterns in the missing values of one of them do not lead to distortions. (For example, if X is available the whole year but y is only available during summer, the distribution of y should not be matched against the whole-year CDF of X, because that could introduce systematic seasonal biases.) Default is False.
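The masking described for combine_invalid can be pictured as follows. This is a minimal NumPy sketch of the idea, assuming X and y are aligned arrays of equal length; combined_valid_mask is a hypothetical helper and not part of pytesmo.

```python
import numpy as np


def combined_valid_mask(x, y):
    """Hypothetical helper: True where both x and y are finite."""
    # np.isfinite is False for NaN, +Inf and -Inf.
    return np.isfinite(x) & np.isfinite(y)


# x is available for every time step, y only for the last three.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([np.nan, np.nan, np.inf, 40.0, 50.0, 60.0])

mask = combined_valid_mask(x, y)
# Only time steps where both series are valid contribute to the CDFs.
x_used, y_used = x[mask], y[mask]
```

Here the CDF of x would be built from the last three values only, so it covers the same period as the CDF of y.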

x_perc_

The percentile values derived from the source (X) data. If the number of bins was reduced during fitting due to insufficient data, it is right-padded with NaNs.

Type: np.ndarray (nbins,)

y_perc_

The percentile values derived from the reference (y) data. If the number of bins was reduced during fitting due to insufficient data, it is right-padded with NaNs.

Type: np.ndarray (nbins,)
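To picture the right-padding, the sketch below builds an array of length nbins from a shorter set of percentile values and then strips the padding again, which is how one might inspect x_perc_ or y_perc_ after fitting. The helper pad_right_with_nan and the example values are assumptions made for illustration, not pytesmo internals.

```python
import numpy as np


def pad_right_with_nan(values, length):
    """Hypothetical helper: right-pad a 1-D array with NaN to `length`."""
    padded = np.full(length, np.nan)
    padded[: len(values)] = values
    return padded


nbins = 6
# Suppose only three percentile values remained after reducing the bins.
reduced = np.array([0.1, 0.4, 0.9])
x_perc = pad_right_with_nan(reduced, nbins)

# Drop the NaN padding to recover the percentile values actually in use.
valid = x_perc[~np.isnan(x_perc)]
```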

Notes

This implementation does not do any temporal matching of the reference and source datasets. If temporal matching is required, it has to be done beforehand.
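One simple way to do such matching beforehand, assuming both datasets carry unique timestamps, is to keep only the time steps present in both. The sketch below uses np.intersect1d; the variable names and sample values are made up for the example, and other strategies (for instance nearest-neighbour matching within a time window) may be more appropriate for real data.

```python
import numpy as np

# Source observations with their (hypothetical) timestamps.
t_x = np.array(
    ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"],
    dtype="datetime64[D]",
)
x = np.array([0.10, 0.20, 0.30, 0.40])

# Reference observations on a different, partly overlapping time axis.
t_y = np.array(["2020-01-02", "2020-01-04", "2020-01-05"], dtype="datetime64[D]")
y = np.array([12.0, 14.0, 15.0])

# Timestamps present in both series, plus their positions in each array.
common, ix, iy = np.intersect1d(t_x, t_y, return_indices=True)
x_matched, y_matched = x[ix], y[iy]
```

The arrays x_matched and y_matched are now aligned element by element and could be passed to fit.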

fit(X: ndarray | Series | DataFrame, y: ndarray | Series)

Derive the CDF matching parameters.

Parameters:

X (array_like) – An array/pd.Series or a matrix/pd.DataFrame with a single column.
y (array_like) – An array/pd.Series of reference data.

predict(X)

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → CDFMatching

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). See the scikit-learn User Guide for how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in scikit-learn version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns: self – The updated object.
Return type: object