pytesmo.cdf_matching module
- class pytesmo.cdf_matching.CDFMatching(nbins: int = 100, percentiles: Sequence | None = None, minobs: int | None = None, linear_edge_scaling: bool = False, combine_invalid: bool = False)
Bases: RegressorMixin, BaseEstimator
Predicts a variable from a single other variable by CDF matching.
- Parameters:
nbins (int, optional) – Number of bins to use for the empirical CDF. Default is 100. If minobs is set, this might be reduced if there is not enough data in each bin.
percentiles (sequence, optional) – Percentile values to use. If this is given, nbins is ignored. The percentiles might still be changed if minobs is given and the number of observations per bin is lower. Default is None.
linear_edge_scaling (bool, optional) – Whether to derive the edge parameters via linear regression (more robust, see Moesinger et al. (2020) for more info). Default is False. Note that this way only the outliers in the reference (y) CDF are handled. Outliers in the input data (x) will not be removed and will still show up in the data.
minobs (int, optional) – Minimum desired number of observations in a bin. If there is less data for a bin, the number of bins is reduced. Default is None (no resizing).
combine_invalid (bool, optional) – Optional feature to combine the masks of invalid data (NaN, Inf) of both source (X) and reference (y) data passed to fit. This only makes sense if X and y are both timeseries data corresponding to the same index. In this case, it makes sure that data is only used if values for both X and y are available, so that seasonal patterns in missing values in one of them do not lead to distortions. (For example, if X is available the whole year, but y is only available during summer, the distribution of y should not be matched against the whole-year CDF of X, because that could introduce systematic seasonal biases.) Default is False.
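The effect described for combine_invalid can be illustrated with a small NumPy sketch. This is not pytesmo's internal code; the helper name combined_valid_mask is hypothetical, and the sketch assumes X and y are already aligned on the same index.

```python
import numpy as np


def combined_valid_mask(x, y):
    """Hypothetical helper: True only where both x and y are finite."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # np.isfinite is False for NaN, +Inf and -Inf
    return np.isfinite(x) & np.isfinite(y)


# Source (x) and reference (y) on the same index, each with its own gaps
x = np.array([0.1, 0.2, np.nan, 0.4, 0.5])
y = np.array([1.0, np.inf, 3.0, 4.0, np.nan])

mask = combined_valid_mask(x, y)
# Only positions where both series are valid contribute to the CDFs
x_valid, y_valid = x[mask], y[mask]
```

With such a mask, both empirical CDFs are computed from the same time steps, which is the situation the seasonal example above is meant to guarantee.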
- x_perc_
The percentile values derived from the source (X) data. If the number of bins was reduced during fitting due to insufficient data, it is right-padded with NaNs.
- Type: np.ndarray (nbins,)
- y_perc_
The percentile values derived from the reference (y) data. If the number of bins was reduced during fitting due to insufficient data, it is right-padded with NaNs.
- Type: np.ndarray (nbins,)
Notes
This implementation does not do any temporal matching of the reference and source datasets. If temporal matching is required, it has to be done before calling fit.
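One possible way to do the temporal matching beforehand is to keep only the timestamps present in both datasets. The following is a minimal sketch using plain NumPy; the variable names and sample values are made up for illustration, and real data may need a tolerance-based matching rather than exact timestamp equality.

```python
import numpy as np

# Illustrative source and reference series with different timestamps
src_times = np.array(
    ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"],
    dtype="datetime64[D]",
)
src_values = np.array([0.10, 0.20, 0.30, 0.40])

ref_times = np.array(
    ["2020-01-02", "2020-01-04", "2020-01-05"], dtype="datetime64[D]"
)
ref_values = np.array([1.5, 2.5, 3.5])

# Timestamps present in both series, plus their positions in each input
common, src_idx, ref_idx = np.intersect1d(
    src_times, ref_times, return_indices=True
)

# Temporally matched arrays that could then be passed to fit as X and y
X = src_values[src_idx]
y = ref_values[ref_idx]
```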
- fit(X: ndarray | Series | DataFrame, y: ndarray | Series)
Derive the CDF matching parameters.
- Parameters:
X (array_like) – An array/pd.Series or a matrix/pd.DataFrame with a single column.
y (array_like) – An array/pd.Series of reference data.
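To make the idea of CDF matching parameters concrete, here is a simplified NumPy sketch of percentile-based matching. It is an illustration of the general technique only, not pytesmo's implementation: it ignores minobs, linear_edge_scaling and invalid-value handling, and the function names fit_percentiles and apply_matching are hypothetical. The returned arrays play a role analogous to x_perc_ and y_perc_.

```python
import numpy as np


def fit_percentiles(x, y, nbins=100):
    """Hypothetical helper: percentile values of source x and reference y."""
    percentiles = np.linspace(0, 100, nbins + 1)
    x_perc = np.percentile(x, percentiles)
    y_perc = np.percentile(y, percentiles)
    return x_perc, y_perc


def apply_matching(x_new, x_perc, y_perc):
    """Map source values onto the reference CDF by linear interpolation."""
    # Values outside [x_perc[0], x_perc[-1]] are clamped to the edge values
    return np.interp(x_new, x_perc, y_perc)


rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=5000)  # source data
y = rng.normal(loc=5.0, scale=2.0, size=5000)  # reference data

x_perc, y_perc = fit_percentiles(x, y)
matched = apply_matching(x, x_perc, y_perc)
```

The mapping is monotonic, so the rank order of the source values is preserved while their distribution is moved towards that of the reference.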
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → CDFMatching
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.