pytesmo package
Subpackages
- pytesmo.colormaps package
- pytesmo.grid package
- pytesmo.interpolate namespace
- pytesmo.metrics package
- pytesmo.time_series package
- pytesmo.timedate package
- pytesmo.validation_framework package
- Submodules
- pytesmo.validation_framework.adapters module
- pytesmo.validation_framework.data_manager module
- pytesmo.validation_framework.data_scalers module
- pytesmo.validation_framework.error_handling module
- pytesmo.validation_framework.metric_calculators module
- pytesmo.validation_framework.metric_calculators_adapters module
- pytesmo.validation_framework.results_manager module
- pytesmo.validation_framework.start_validation module
- pytesmo.validation_framework.temporal_matchers module
- pytesmo.validation_framework.upscaling module
- pytesmo.validation_framework.validation module
- Validation: Validation.calc(), Validation.dummy_validation_result(), Validation.get_data_for_result_tuple(), Validation.get_processing_jobs(), Validation.k_datasets_from(), Validation.mask_dataset(), Validation.perform_validation(), Validation.temporal_match_datasets(), Validation.temporal_match_masking_data()
- args_to_iterable()
- Module contents
Submodules
pytesmo.cdf_matching module
- class pytesmo.cdf_matching.CDFMatching(nbins: int = 100, percentiles: Sequence | None = None, minobs: int | None = None, linear_edge_scaling: bool = False, combine_invalid: bool = False)[source]
Bases: RegressorMixin, BaseEstimator
Predicts a variable from a single other variable by CDF matching.
- Parameters:
nbins (int, optional) – Number of bins to use for the empirical CDF. Default is 100. If minobs is set, this might be reduced in case there is not enough data in each bin.
percentiles (sequence, optional) – Percentile values to use. If this is given, nbins is ignored. The percentiles might still be changed if minobs is given and the number of data points per bin is lower. Default is None.
linear_edge_scaling (bool, optional) – Whether to derive the edge parameters via linear regression (more robust, see Moesinger et al. (2020) for more info). Default is False. Note that this way only the outliers in the reference (y) CDF are handled. Outliers in the input data (x) will not be removed and will still show up in the data.
minobs (int, optional) – Minimum desired number of observations in a bin. If there is less data for a bin, the number of bins is reduced. Default is None (no resizing).
combine_invalid (bool, optional) – Optional feature to combine the masks of invalid data (NaN, Inf) of both source (X) and reference (y) data passed to fit. This only makes sense if X and y are both time series data corresponding to the same index. In this case, this makes sure that data is only used if values for both X and y are available, so that seasonal patterns in missing values in one of them do not lead to distortions. (For example, if X is available the whole year, but y is only available during summer, the distribution of y should not be matched against the whole-year CDF of X, because that could introduce systematic seasonal biases.) The default is False.
- x_perc_
The percentile values derived from the source (X) data. If the number of bins was reduced during fitting due to insufficient data, it is right-padded with NaNs.
- Type:
np.ndarray (nbins,)
- y_perc_
The percentile values derived from the reference (y) data. If the number of bins was reduced during fitting due to insufficient data, it is right-padded with NaNs.
- Type:
np.ndarray (nbins,)
Notes
This implementation does not do any temporal matching of the reference and source datasets. If this is required, this has to be done beforehand.
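The effect of combine_invalid described above can be illustrated with a small, library-independent sketch. It is not pytesmo's implementation: the helper name combine_invalid_masks is hypothetical, and plain Python lists stand in for the arrays passed to fit.

```python
import math


def combine_invalid_masks(x, y):
    """Keep only positions where both x and y are finite (neither NaN nor Inf).

    Hypothetical helper for illustration; not part of pytesmo.
    """
    keep = [math.isfinite(a) and math.isfinite(b) for a, b in zip(x, y)]
    x_valid = [a for a, k in zip(x, keep) if k]
    y_valid = [b for b, k in zip(y, keep) if k]
    return x_valid, y_valid


x = [0.1, 0.2, float("nan"), 0.4]
y = [1.0, float("inf"), 3.0, 4.0]
x_valid, y_valid = combine_invalid_masks(x, y)
```

Only the first and last positions survive here, because every other position is invalid in at least one of the two series.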
- fit(X: ndarray | Series | DataFrame, y: ndarray | Series)[source]
Derive the CDF matching parameters.
- Parameters:
X (array_like) – An array/pd.Series or a matrix/pd.DataFrame with a single column.
y (array_like) – An array/pd.Series of reference data.
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') CDFMatching
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
pytesmo.df_metrics module
pytesmo.scaling module
Created on Apr 17, 2013
@author: Christoph Paulik christoph.paulik@geo.tuwien.ac.at
- pytesmo.scaling.add_scaled(df, method='linreg', label_in=None, label_scale=None, **kwargs)[source]
Takes a DataFrame and appends a scaled time series to it. If no labels are given, the first column is scaled to the second column of the DataFrame.
- Parameters:
df (pandas.DataFrame) – input dataframe
method (string) – scaling method
label_in (string, optional) – the column of the DataFrame that should be scaled to the column given by label_scale. Default is the first column.
label_scale (string, optional) – the column of the DataFrame that the label_in column should be scaled to. Default is the second column.
- Returns:
df – input DataFrame with a new column labeled label_in + '_scaled_' + method
- Return type:
- pytesmo.scaling.cdf_match(src, ref, nbins=100, minobs=20, linear_edge_scaling=True, percentiles=None, combine_invalid=True, max_val=None, min_val=None)[source]
Rescales by CDF matching.
This calculates the empirical CDFs for source and reference dataset using a specified number of bins. In case of non-unique percentile values, a beta distribution is fitted to the CDF. For more robust estimation of the lower and upper bins, linear edge scaling is used (see Moesinger et al., 2020 for details).
- Parameters:
src (numpy.array) – input dataset which will be scaled
ref (numpy.array) – src will be scaled to this dataset
nbins (int, optional) – Number of bins to use for estimation of the CDF
percentiles (sequence, optional) – Percentile values to use. If this is given, nbins is ignored. The percentiles might still be changed if minobs is given and the number of data points per bin is lower. Default is None.
minobs (int, optional) – Minimum desired number of observations in a bin for bin resizing. If it is None, bins will not be resized. Default is 20.
linear_edge_scaling (bool, optional) – Whether to derive the edge parameters via linear regression (more robust, see Moesinger et al. (2020) for more info). Default is True. Note that this way only the outliers in the reference (y) CDF are handled. Outliers in the input data (x) will not be removed and will still show up in the data.
combine_invalid (bool, optional) – Optional feature to combine the masks of invalid data (NaN, Inf) of both source (X) and reference (y) data passed to fit. This only makes sense if X and y are both time series data corresponding to the same index. In this case, this makes sure that data is only used if values for both X and y are available, so that seasonal patterns in missing values in one of them do not lead to distortions. (For example, if X is available the whole year, but y is only available during summer, the distribution of y should not be matched against the whole-year CDF of X, because that could introduce systematic seasonal biases.) Default is True.
max_val (float, optional) – Maximum value to enforce.
min_val (float, optional) – Minimum value to enforce.
- Returns:
CDF matched values – dataset src with CDF as ref
- Return type:
numpy.array
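As a rough illustration of the idea behind CDF matching, the following pure-Python sketch computes percentiles of the source and reference data and maps each source value by piecewise-linear interpolation between them. It is a simplified stand-in and not pytesmo's code: the names percentile and simple_cdf_match are hypothetical, and bin resizing (minobs), linear edge scaling, the beta-distribution fit, and min_val/max_val are all left out.

```python
from bisect import bisect_right


def percentile(sorted_vals, p):
    """Percentile p (0-100) using linear interpolation between order statistics."""
    pos = (len(sorted_vals) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def simple_cdf_match(src, ref, nbins=4):
    """Map src onto the distribution of ref (hypothetical, simplified sketch)."""
    percs = [100.0 * i / nbins for i in range(nbins + 1)]
    src_sorted, ref_sorted = sorted(src), sorted(ref)
    src_perc = [percentile(src_sorted, p) for p in percs]
    ref_perc = [percentile(ref_sorted, p) for p in percs]
    out = []
    for v in src:
        # Find the bin [src_perc[i], src_perc[i + 1]] that contains v.
        i = min(max(bisect_right(src_perc, v) - 1, 0), nbins - 1)
        width = src_perc[i + 1] - src_perc[i]
        frac = (v - src_perc[i]) / width if width > 0 else 0.0
        # Interpolate linearly between the matching reference percentiles.
        out.append(ref_perc[i] + frac * (ref_perc[i + 1] - ref_perc[i]))
    return out


src = [0.0, 1.0, 2.0, 3.0, 4.0]
ref = [10.0, 12.0, 14.0, 16.0, 18.0]
matched = simple_cdf_match(src, ref)
```

With these inputs the source percentiles line up exactly with the reference percentiles, so each source value is mapped to the reference value at the same rank.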
- pytesmo.scaling.get_scaling_function(method)[source]
Get scaling function based on method name.
- Parameters:
method (string) – method name as string
- Returns:
scaling_func – function(src: numpy.ndarray, ref: numpy.ndarray) -> scaled_src: numpy.ndarray
- Return type:
function
- Raises:
KeyError – if method is not found
- pytesmo.scaling.get_scaling_method_lut()[source]
Get all defined scaling methods and their function names.
- Returns:
lut – key: scaling method name, value: function
- Return type:
dictionary
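A name-to-function lookup table of the kind described above can be sketched as follows. This is a generic illustration of the pattern, not pytesmo's source: the names SCALING_LUT, identity, offset and get_scaling_function_sketch are hypothetical, and the two toy scaling functions exist only to populate the table.

```python
def identity(src, ref):
    """Toy scaling function: return src unchanged."""
    return list(src)


def offset(src, ref):
    """Toy scaling function: shift src so that its mean equals the mean of ref."""
    shift = sum(ref) / len(ref) - sum(src) / len(src)
    return [v + shift for v in src]


# Hypothetical lookup table: method name -> function(src, ref)
SCALING_LUT = {"identity": identity, "offset": offset}


def get_scaling_function_sketch(method):
    """Return the function registered under `method`; unknown names raise KeyError."""
    return SCALING_LUT[method]


scaled = get_scaling_function_sketch("offset")([1.0, 2.0], [3.0, 4.0])

try:
    get_scaling_function_sketch("does_not_exist")
    unknown_raises = False
except KeyError:
    unknown_raises = True
```

Relying on the dictionary lookup itself means an unknown method name surfaces as a KeyError without any extra checks.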
- pytesmo.scaling.linreg(src, ref, **kwargs)[source]
Scales the input dataset to the reference using linear regression.
- Parameters:
src (numpy.array) – input dataset which will be scaled
ref (numpy.array) – src will be scaled to this dataset
- Returns:
scaled dataset – dataset scaled using linear regression
- Return type:
numpy.array
- pytesmo.scaling.linreg_params(src, ref)[source]
Calculate additive and multiplicative correction parameters based on linear regression models.
- Parameters:
src (numpy.array) – Candidate data (to which the corrections apply)
ref (numpy.array) – Reference data (which candidate is scaled to)
- Returns:
slope (float) – Multiplicative correction value
intercept (float) – Additive correction value
- pytesmo.scaling.linreg_stored_params(src, slope, intercept)[source]
Scale the input data with passed correction values
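The split between deriving correction parameters and applying stored ones can be sketched in plain Python. The sketch assumes an ordinary least-squares fit of ref on src and the correction slope * src + intercept; the function names ending in _sketch are hypothetical and the code is not taken from pytesmo.

```python
def linreg_params_sketch(src, ref):
    """Least-squares slope and intercept of ref regressed on src (assumption)."""
    n = len(src)
    mean_s = sum(src) / n
    mean_r = sum(ref) / n
    cov = sum((s - mean_s) * (r - mean_r) for s, r in zip(src, ref))
    var = sum((s - mean_s) ** 2 for s in src)
    slope = cov / var
    intercept = mean_r - slope * mean_s
    return slope, intercept


def linreg_stored_params_sketch(src, slope, intercept):
    """Apply previously derived multiplicative and additive corrections."""
    return [slope * s + intercept for s in src]


src = [1.0, 2.0, 3.0, 4.0]
ref = [3.0, 5.0, 7.0, 9.0]
slope, intercept = linreg_params_sketch(src, ref)
scaled = linreg_stored_params_sketch(src, slope, intercept)
```

Keeping the two steps separate allows parameters derived on one period to be reused on data from another period.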
- pytesmo.scaling.mean_std(src, ref, **kwargs)[source]
Scales the input dataset so that it has the same mean and standard deviation as the reference afterwards.
- Parameters:
src (numpy.array) – input dataset which will be scaled
ref (numpy.array) – src will be scaled to this dataset
- Returns:
scaled dataset – dataset src with same mean and standard deviation as ref
- Return type:
numpy.array
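A minimal sketch of mean and standard deviation matching, written with the standard library only. It assumes the usual formula (src - mean(src)) / std(src) * std(ref) + mean(ref) with the population standard deviation; mean_std_sketch is a hypothetical name and the code is not pytesmo's implementation.

```python
from statistics import mean, pstdev


def mean_std_sketch(src, ref):
    """Rescale src so that its mean and standard deviation match those of ref."""
    m_src, m_ref = mean(src), mean(ref)
    s_src, s_ref = pstdev(src), pstdev(ref)
    return [(v - m_src) / s_src * s_ref + m_ref for v in src]


src = [1.0, 2.0, 3.0]
ref = [10.0, 20.0, 30.0]
scaled = mean_std_sketch(src, ref)
```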
- pytesmo.scaling.min_max(src, ref, **kwargs)[source]
Scales the input dataset so that it has the same minimum and maximum as the reference afterwards.
- Parameters:
src (numpy.array) – input dataset which will be scaled
ref (numpy.array) – src will be scaled to this dataset
- Returns:
scaled dataset – dataset src with same maximum and minimum as ref
- Return type:
numpy.array
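Min-max scaling can be sketched in the same way. The formula used here, (src - min(src)) / (max(src) - min(src)) * (max(ref) - min(ref)) + min(ref), is the textbook form and is assumed rather than copied from pytesmo; min_max_sketch is a hypothetical name.

```python
def min_max_sketch(src, ref):
    """Rescale src linearly so that its minimum and maximum match those of ref."""
    s_min, s_max = min(src), max(src)
    r_min, r_max = min(ref), max(ref)
    return [(v - s_min) / (s_max - s_min) * (r_max - r_min) + r_min for v in src]


src = [0.0, 5.0, 10.0]
ref = [0.1, 0.2, 0.5]
scaled = min_max_sketch(src, ref)
```

Note that a constant src (maximum equal to minimum) would cause a division by zero in this sketch.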
- pytesmo.scaling.scale(df, method='linreg', reference_index=0, **kwargs)[source]
Takes a pandas.DataFrame and scales all columns to the column specified by reference_index with the chosen method.
- Parameters:
df (pandas.DataFrame) – containing matched time series that should be scaled
method (string, optional) – method definition, has to be a function in globals() that takes 2 numpy.array as input and returns one numpy.array of same length
reference_index (int, optional) – default 0, column index of reference dataset in dataframe
- Returns:
scaled data – all time series of the input DataFrame scaled to the one specified by reference_index
- Return type:
pytesmo.temporal_matching module
Provides functions for temporally collocating data from multiple dataframes.
- pytesmo.temporal_matching.combined_temporal_collocation(reference, others, window, method='nearest', dropduplicates=False, dropna=False, combined_dropna=False, flag=None, checkna=False, use_invalid=False, add_ref_data=False)[source]
Temporally collocates multiple dataframes to reference times.
- Parameters:
reference (pd.DataFrame, pd.Series, or pd.DatetimeIndex) – The reference onto which other should be collocated. If this is a DataFrame or a Series, the index must be a DatetimeIndex. If the index is timezone-naive, UTC will be assumed.
others (list/tuple of pd.DataFrame or pd.Series) – DataFrames/Series to be collocated. Each entry must have a pd.DatetimeIndex as index. If the index is timezone-naive, the timezone of the reference data will be assumed.
window (pd.Timedelta or float) – Window around reference timestamps in which to look for data. Floats are interpreted as number of days.
method (str, optional) – Which method to use for the temporal collocation:
- "nearest" (default): Uses the nearest valid neighbour. When this method is used, entries with duplicate index values in other will be dropped, and only the first of the duplicates is kept.
- "mean": Takes the mean over the given window around the reference times.
dropduplicates (bool, optional) – Whether to drop duplicated timestamps in others. Default is False, except when method="nearest", in which case this is enforced to be True.
dropna (bool, optional) – Whether to drop NaNs from the resulting dataframe (arising for example from duplicates with duplicates_nan=True or from missing values). Default is False.
combined_dropna (str or bool, optional) – Whether and how to drop NaNs from the resulting combined DataFrame. Can be "any", "all", True or False. "any" makes sure that the output dataframe only has values at times where all input frames had values, while "all" only drops lines where all values are NaN. True is the same as "any", and False (default) disables dropping NaNs.
checkna (bool, optional) – Whether to check if only NaNs are returned (i.e. no match has been found). If set to True, raises a UserWarning in case no match has been found. Default is False.
flag (np.ndarray or None, optional) – Flag column as array. If this is given, the column will be interpreted as validity indicator. Any nonzero values mark the row as invalid. Default is None.
use_invalid (bool, optional) – Whether to use invalid values marked by flag in case no valid values are available. Default is False.
add_ref_data (bool, optional) – If reference is a DataFrame or Series, add the data to the final collocated dataframe.
- Returns:
collocated – Temporally collocated DataFrame with variables from all input frames merged together.
- Return type:
pd.DataFrame or pd.Series
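The difference between the "any" and "all" settings of combined_dropna can be shown with a small stand-alone sketch. Rows are modelled as plain tuples rather than DataFrame rows, and drop_rows is a hypothetical helper, so this only mirrors the semantics described above.

```python
import math


def drop_rows(rows, how):
    """Drop rows containing NaN: "any" drops rows with at least one NaN,
    "all" drops only rows that consist entirely of NaNs."""
    def is_nan(v):
        return isinstance(v, float) and math.isnan(v)

    if how == "any":
        return [r for r in rows if not any(is_nan(v) for v in r)]
    if how == "all":
        return [r for r in rows if not all(is_nan(v) for v in r)]
    raise ValueError("how must be 'any' or 'all'")


nan = float("nan")
rows = [(1.0, 2.0), (nan, 3.0), (nan, nan)]
kept_any = drop_rows(rows, "any")
kept_all = drop_rows(rows, "all")
```

With "any", only the fully populated first row remains; with "all", the partially filled second row is kept as well.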
- pytesmo.temporal_matching.temporal_collocation(reference, other, window, method='nearest', return_index=False, return_distance=False, dropduplicates=False, dropna=False, checkna=False, flag=None, use_invalid=False)[source]
Temporally collocates values to reference.
- Parameters:
reference (pd.DataFrame, pd.Series, or pd.DatetimeIndex) – The reference onto which other should be collocated. If this is a DataFrame or a Series, the index must be a DatetimeIndex. If the index is timezone-naive and other is not, the timezone of other will be assumed.
other (pd.DataFrame or pd.Series) – Data to be collocated. Must have a pd.DatetimeIndex as index. If the index is timezone-naive and reference is not, the timezone of the reference data will be assumed.
window (pd.Timedelta or float) – Window around reference timestamps in which to look for data. Floats are interpreted as number of days.
method (str, optional) – Which method to use for the temporal collocation:
- "nearest" (default): Uses the nearest valid neighbour. When this method is used, entries with duplicate index values in other will be dropped, and only the first of the duplicates is kept.
- "mean": Takes the mean over the given window around the reference times.
return_index (boolean, optional) – Include index of other in matched dataframe (default: False). Only used with method="nearest". The index will be added as a separate column with the name "index_other".
return_distance (boolean, optional) – Include distance information between reference and other in matched dataframe (default: False). This is only used with method="nearest", and implies return_index=True. The distance will be added as a separate column with the name "distance_other".
dropduplicates (bool, optional) – Whether to drop duplicated timestamps in other. Default is False, except when method="nearest", in which case this is enforced to be True.
dropna (bool, optional) – Whether to drop NaNs from the resulting dataframe (arising for example from duplicates with duplicates_nan=True or from missing values). This uses how="all", that is, only rows where all values are NaN are dropped. Default is False.
checkna (bool, optional) – Whether to check if only NaNs are returned (i.e. no match has been found). If set to True, raises a UserWarning in case no match has been found. Default is False.
flag (np.ndarray, str or None, optional) – Flag column as array or name of the flag column in other. If this is given, the column will be interpreted as validity indicator. Any nonzero values mark the row as invalid. Default is None.
use_invalid (bool, optional) – Whether to use invalid values marked by flag in case no valid values are available. Default is False.
- Returns:
collocated – Temporally collocated version of other.
- Return type:
pd.DataFrame or pd.Series
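The "nearest" method can be pictured with a simplified sketch based on the standard library. It works on sorted lists of datetime objects instead of pandas indexes, treats the window as inclusive (an assumption), and ignores flags, duplicates and time zones; nearest_collocation is a hypothetical name, not a pytesmo function.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def nearest_collocation(ref_times, other_times, other_values, window):
    """For each reference time, take the value of the nearest timestamp in
    other_times (which must be sorted) if it lies within `window`, else NaN."""
    out = []
    for t in ref_times:
        i = bisect_left(other_times, t)
        # The nearest neighbour is either just before or just after position i.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(other_times)]
        best = min(candidates, key=lambda j: abs(other_times[j] - t), default=None)
        if best is not None and abs(other_times[best] - t) <= window:
            out.append(other_values[best])
        else:
            out.append(float("nan"))
    return out


ref_times = [datetime(2020, 1, 1), datetime(2020, 1, 2), datetime(2020, 1, 3)]
other_times = [datetime(2020, 1, 1, 1), datetime(2020, 1, 2, 20)]
other_values = [0.25, 0.5]
matched = nearest_collocation(
    ref_times, other_times, other_values, timedelta(hours=6)
)
```

The second reference time has no neighbour within six hours, so it receives NaN, while the first and third pick up the values one hour and four hours away.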
pytesmo.utils module
Module containing utility functions that do not fit into other modules
- pytesmo.utils.array_dropna(*arrs)[source]
Drop elements from input arrays where ANY array is NaN
- Parameters:
*arrs (np.array(s)) – One or multiple numpy arrays of the same length that contain nans
- Returns:
arrs_dropna – Input arrays without NaNs
- Return type:
np.array
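The behaviour described for array_dropna, dropping a position from every array if any array is NaN there, can be sketched with plain lists. array_dropna_sketch is a hypothetical stand-in that returns lists rather than NumPy arrays.

```python
import math


def array_dropna_sketch(*arrs):
    """Drop positions at which ANY of the equally long input sequences is NaN."""
    keep = [not any(math.isnan(v) for v in vals) for vals in zip(*arrs)]
    return [[v for v, k in zip(arr, keep) if k] for arr in arrs]


nan = float("nan")
a, b = array_dropna_sketch([1.0, nan, 3.0, 4.0], [10.0, 20.0, nan, 40.0])
```

Both outputs keep the same length, so values that belonged together before the filtering still line up afterwards.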
- pytesmo.utils.deprecated(message: str | None = None)[source]
Decorator for class methods or functions to mark them as deprecated. If the decorator is applied without a specific message (@deprecated()), the default warning is shown when using the function/class. To specify a custom message, use it like:
@deprecated("Don't use this function anymore!")
- Parameters:
message (str, optional (default: None)) – Custom message to show with the DeprecationWarning.
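A decorator with an optional message, as described above, is commonly written along the following lines. This is a generic sketch using the warnings module, not pytesmo's source: deprecated_sketch and old_add are hypothetical names, and the wording of the default message is an assumption.

```python
import functools
import warnings


def deprecated_sketch(message=None):
    """Decorator factory that emits a DeprecationWarning on every call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Fall back to a generic message if none was given (assumed wording).
            text = message or f"{func.__name__} is deprecated."
            warnings.warn(text, DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated_sketch("Don't use this function anymore!")
def old_add(a, b):
    return a + b


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_add(1, 2)
```

The decorated function still returns its normal result; the warning is only a side effect of the call.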
- pytesmo.utils.element_iterable(el)[source]
Test if an element is iterable
- Parameters:
el (object) –
- Returns:
iterable – True if el is iterable, False otherwise
- Return type:
boolean
- pytesmo.utils.ensure_iterable(el)[source]
Ensure that an object is iterable by putting it into a list. Strings are handled slightly differently: they are technically iterable, but we want to keep them whole.
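The special handling of strings can be illustrated with a short sketch. It assumes that objects which are already iterable (other than strings) are returned unchanged, which the description does not state explicitly; ensure_iterable_sketch is a hypothetical name.

```python
def ensure_iterable_sketch(el):
    """Wrap non-iterables and strings in a list; pass other iterables through."""
    if isinstance(el, str):
        # Strings are iterable, but should be kept as one item.
        return [el]
    try:
        iter(el)
    except TypeError:
        return [el]
    return el


wrapped_str = ensure_iterable_sketch("soil_moisture")
wrapped_int = ensure_iterable_sketch(5)
unchanged = ensure_iterable_sketch([1, 2])
```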
- pytesmo.utils.interp_uniq(src)[source]
Replace non-unique values by their linearly interpolated value. This method interpolates iteratively, as is done in IDL.
- Parameters:
src (numpy.array) – array to ensure uniqueness of
- Returns:
src – interpolated unique values in array of same size as src
- Return type:
numpy.array