{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Temporal Collocation of irregularly sampled timeseries\n",
    "\n",
    "Satellite observations usually have an irregular temporal sampling pattern (intervals between 6-36 hours), which is mostly controlled by the orbit of the satellite and the instrument measurement geometry. On the other hand, in-situ instruments or land surface models generally sample on regular time intervals (commonly every 1, 3, 6, 12 or 24 hours). \n",
    "In order to compute error/performance statistics (such as RMSD, bias, correlation) between the time series coming from different sources, it is required that observation pairs (or triplets, etc.) are found which (nearly) coincide in time.\n",
    "\n",
    "A simple way to identify such pairs is by using a nearest neighbor search. First, one time series needs to be selected as temporal reference (i.e. all other time series will be matched to this reference) and second, a tolerance window (typically around 1-12 hours) has to be defined characterizing the temporal correlation of neighboring observation (i.e. observations outside of the tolerance window are no longer be considered as representative neighbors). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Temporal collocation in pytesmo\n",
    "\n",
    "Pytesmo contains the function `pytesmo.temporal_matching.temporal_collocation` for temporally collocating timeseries. Currently, it implements nearest neighbour matching and a windowed mean. It requires a reference index (can also be a DataFrame or a Series), a DataFrame (or Series) to be collocated, and a window.\n",
    "\n",
    "```\n",
    "collocated = temporal_collocation(reference, input_frame, window)\n",
    "```\n",
    "\n",
    "The window argument corresponds to the time intervals that are included in the nearest neighbour search in each direction, e.g. if the reference time is $t$ and the window $\\Delta$, the nearest neighbour inside $[t-\\Delta, t+\\Delta]$ is returned. If no neighbour is found `np.nan` is used as replacement. NaNs can be dropped from the returned dataframe by providing the optional keyword argument ``dropna=True`` to the function.\n",
    "\n",
    "Below are two simple examples which demonstrate the usage. The first example assumes that the index of data to be collocated is shifted by 3 hours with respect to the reference, while using a 6 hour window. The second example uses an index that is randomly shifted by $\\pm12$ hours with respect to the reference. The second example also uses a 6 hour window, which results in some missing values in the resulting dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "from pytesmo.temporal_matching import temporal_collocation, combined_temporal_collocation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "                     shifted_0  shifted_1  shifted_2\n2020-01-01 03:00:00  -0.687795  -0.626649   0.237109\n2020-01-02 03:00:00  -0.514778  -1.981137   0.354644\n2020-01-03 03:00:00  -0.600629  -0.761766   0.169777\n2020-01-04 03:00:00  -0.650058  -0.548499   0.548560\n2020-01-05 03:00:00   1.331785  -1.611482  -1.325902",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>shifted_0</th>\n      <th>shifted_1</th>\n      <th>shifted_2</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>2020-01-01 03:00:00</th>\n      <td>-0.687795</td>\n      <td>-0.626649</td>\n      <td>0.237109</td>\n    </tr>\n    <tr>\n      <th>2020-01-02 03:00:00</th>\n      <td>-0.514778</td>\n      <td>-1.981137</td>\n      <td>0.354644</td>\n    </tr>\n    <tr>\n      <th>2020-01-03 03:00:00</th>\n      <td>-0.600629</td>\n      <td>-0.761766</td>\n      <td>0.169777</td>\n    </tr>\n    <tr>\n      <th>2020-01-04 03:00:00</th>\n      <td>-0.650058</td>\n      <td>-0.548499</td>\n      <td>0.548560</td>\n    </tr>\n    <tr>\n      <th>2020-01-05 03:00:00</th>\n      <td>1.331785</td>\n      <td>-1.611482</td>\n      <td>-1.325902</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# create reference time series\n",
    "ref = pd.date_range(\"2020-01-01\", \"2020-12-31\", freq=\"D\")\n",
    "# temporal_collocation can also take a DataFrame or Series as reference input,\n",
    "# in case their index is a DatetimeIndex.\n",
    "\n",
    "# create other time series as dataframe\n",
    "values = np.random.randn(len(ref), 3)\n",
    "shifted = pd.DataFrame(values, index=ref + pd.Timedelta(hours=3), \n",
    "                       columns=list(map(lambda x: f\"shifted_{x}\", range(3))))\n",
    "random_shift = np.random.uniform(-12, 12, len(ref))\n",
    "random = pd.DataFrame(values, index=ref + pd.to_timedelta(random_shift, \"H\"),\n",
    "                      columns=list(map(lambda x: f\"random_{x}\", range(3))))\n",
    "\n",
    "shifted.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "                               random_0  random_1  random_2\n2019-12-31 21:00:12.990837600 -0.687795 -0.626649  0.237109\n2020-01-02 03:18:11.512674000 -0.514778 -1.981137  0.354644\n2020-01-03 04:15:20.703038399 -0.600629 -0.761766  0.169777\n2020-01-03 22:47:25.851343200 -0.650058 -0.548499  0.548560\n2020-01-05 01:02:46.090482000  1.331785 -1.611482 -1.325902",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>random_0</th>\n      <th>random_1</th>\n      <th>random_2</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>2019-12-31 21:00:12.990837600</th>\n      <td>-0.687795</td>\n      <td>-0.626649</td>\n      <td>0.237109</td>\n    </tr>\n    <tr>\n      <th>2020-01-02 03:18:11.512674000</th>\n      <td>-0.514778</td>\n      <td>-1.981137</td>\n      <td>0.354644</td>\n    </tr>\n    <tr>\n      <th>2020-01-03 04:15:20.703038399</th>\n      <td>-0.600629</td>\n      <td>-0.761766</td>\n      <td>0.169777</td>\n    </tr>\n    <tr>\n      <th>2020-01-03 22:47:25.851343200</th>\n      <td>-0.650058</td>\n      <td>-0.548499</td>\n      <td>0.548560</td>\n    </tr>\n    <tr>\n      <th>2020-01-05 01:02:46.090482000</th>\n      <td>1.331785</td>\n      <td>-1.611482</td>\n      <td>-1.325902</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "random.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now match the shifted timeseries to the reference index by using a 6-hour window, either for a nearest neighbour search, or for taking a windowed mean. Both should return unchanges timeseries, except for the index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "            shifted_0  shifted_1  shifted_2\n2020-01-01  -0.687795  -0.626649   0.237109\n2020-01-02  -0.514778  -1.981137   0.354644\n2020-01-03  -0.600629  -0.761766   0.169777\n2020-01-04  -0.650058  -0.548499   0.548560\n2020-01-05   1.331785  -1.611482  -1.325902\n...               ...        ...        ...\n2020-12-27  -1.030740  -0.520146  -0.049081\n2020-12-28  -1.221772  -0.770726   0.585857\n2020-12-29   0.420476  -2.558351  -0.797550\n2020-12-30   0.146506   1.014053  -0.629454\n2020-12-31  -1.296130   2.000808  -1.084081\n\n[366 rows x 3 columns]",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>shifted_0</th>\n      <th>shifted_1</th>\n      <th>shifted_2</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>2020-01-01</th>\n      <td>-0.687795</td>\n      <td>-0.626649</td>\n      <td>0.237109</td>\n    </tr>\n    <tr>\n      <th>2020-01-02</th>\n      <td>-0.514778</td>\n      <td>-1.981137</td>\n      <td>0.354644</td>\n    </tr>\n    <tr>\n      <th>2020-01-03</th>\n      <td>-0.600629</td>\n      <td>-0.761766</td>\n      <td>0.169777</td>\n    </tr>\n    <tr>\n      <th>2020-01-04</th>\n      <td>-0.650058</td>\n      <td>-0.548499</td>\n      <td>0.548560</td>\n    </tr>\n    <tr>\n      <th>2020-01-05</th>\n      <td>1.331785</td>\n      <td>-1.611482</td>\n      <td>-1.325902</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>2020-12-27</th>\n      <td>-1.030740</td>\n      <td>-0.520146</td>\n      <td>-0.049081</td>\n    </tr>\n    <tr>\n      <th>2020-12-28</th>\n      <td>-1.221772</td>\n      <td>-0.770726</td>\n      <td>0.585857</td>\n    </tr>\n    <tr>\n      <th>2020-12-29</th>\n      <td>0.420476</td>\n      <td>-2.558351</td>\n      <td>-0.797550</td>\n    </tr>\n    <tr>\n      <th>2020-12-30</th>\n      <td>0.146506</td>\n      <td>1.014053</td>\n      <td>-0.629454</td>\n    </tr>\n    <tr>\n      <th>2020-12-31</th>\n      <td>-1.296130</td>\n      <td>2.000808</td>\n      <td>-1.084081</td>\n    </tr>\n  </tbody>\n</table>\n<p>366 rows × 3 columns</p>\n</div>"
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# match the regularly shifted data\n",
    "window = pd.Timedelta(hours=6)\n",
    "matched_shifted_nn = temporal_collocation(ref, shifted, window, method=\"nearest\")\n",
    "matched_shifted_mean = temporal_collocation(ref, shifted, window, method=\"mean\")\n",
    "\n",
    "# the data should be the same before and after matching for both methods\n",
    "assert np.all(shifted.values == matched_shifted_nn.values)\n",
    "assert np.all(shifted.values == matched_shifted_mean.values)\n",
    "\n",
    "matched_shifted_nn\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can do the same for the randomly shifted timeseries. Here we should see some changes, because sometimes there's no value inside the window that we are looking at. However, the result of mean and nearest neighbour should be the same"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "            random_0  random_1  random_2\n2020-01-01 -0.687795 -0.626649  0.237109\n2020-01-02 -0.514778 -1.981137  0.354644\n2020-01-03 -0.600629 -0.761766  0.169777\n2020-01-04 -0.650058 -0.548499  0.548560\n2020-01-05  1.331785 -1.611482 -1.325902\n...              ...       ...       ...\n2020-12-27 -1.030740 -0.520146 -0.049081\n2020-12-28 -1.221772 -0.770726  0.585857\n2020-12-29  0.420476 -2.558351 -0.797550\n2020-12-30  0.146506  1.014053 -0.629454\n2020-12-31 -1.296130  2.000808 -1.084081\n\n[366 rows x 3 columns]",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>random_0</th>\n      <th>random_1</th>\n      <th>random_2</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>2020-01-01</th>\n      <td>-0.687795</td>\n      <td>-0.626649</td>\n      <td>0.237109</td>\n    </tr>\n    <tr>\n      <th>2020-01-02</th>\n      <td>-0.514778</td>\n      <td>-1.981137</td>\n      <td>0.354644</td>\n    </tr>\n    <tr>\n      <th>2020-01-03</th>\n      <td>-0.600629</td>\n      <td>-0.761766</td>\n      <td>0.169777</td>\n    </tr>\n    <tr>\n      <th>2020-01-04</th>\n      <td>-0.650058</td>\n      <td>-0.548499</td>\n      <td>0.548560</td>\n    </tr>\n    <tr>\n      <th>2020-01-05</th>\n      <td>1.331785</td>\n      <td>-1.611482</td>\n      <td>-1.325902</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>2020-12-27</th>\n      <td>-1.030740</td>\n      <td>-0.520146</td>\n      <td>-0.049081</td>\n    </tr>\n    <tr>\n      <th>2020-12-28</th>\n      <td>-1.221772</td>\n      <td>-0.770726</td>\n      <td>0.585857</td>\n    </tr>\n    <tr>\n      <th>2020-12-29</th>\n      <td>0.420476</td>\n      <td>-2.558351</td>\n      <td>-0.797550</td>\n    </tr>\n    <tr>\n      <th>2020-12-30</th>\n      <td>0.146506</td>\n      <td>1.014053</td>\n      <td>-0.629454</td>\n    </tr>\n    <tr>\n      <th>2020-12-31</th>\n      <td>-1.296130</td>\n      <td>2.000808</td>\n      <td>-1.084081</td>\n    </tr>\n  </tbody>\n</table>\n<p>366 rows × 3 columns</p>\n</div>"
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# match the randomly shifted data\n",
    "matched_random_mean = temporal_collocation(ref, random, window, method=\"mean\")\n",
    "matched_random_nn = temporal_collocation(ref, random, window, method=\"nearest\")\n",
    "\n",
    "# the data should be the same as before matching at the locations where the shift\n",
    "# was below 6 hours, and should be np.nan when shift was larger\n",
    "should_be_nan = np.abs(random_shift) > 6\n",
    "\n",
    "assert np.all(matched_random_nn[~should_be_nan].values == random[~should_be_nan].values)\n",
    "assert np.all(np.isnan(matched_random_nn[should_be_nan].values))\n",
    "\n",
    "for c in matched_random_nn.columns:\n",
    "    df = pd.DataFrame(index=range(366),\n",
    "                      data={'mean': matched_random_mean[c].values,\n",
    "                            'nn': matched_random_nn[c].values}).dropna()\n",
    "    np.testing.assert_almost_equal(\n",
    "        df.diff(axis=1).iloc[:, 1].values,\n",
    "        np.zeros(df.index.size),\n",
    "        decimal=4)\n",
    "\n",
    "matched_random_nn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Returning the original index\n",
    "\n",
    "`temporal_collocation` can also return the original index of the data that was matched as a separate column in the resulting DataFrame, if required, and can additionally also calculate the distance to the reference. The column names are \"index_other\" and \"distance_other\", respectively."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "            shifted_0  shifted_1  shifted_2         index_other  \\\n2020-01-01  -0.687795  -0.626649   0.237109 2020-01-01 03:00:00   \n2020-01-02  -0.514778  -1.981137   0.354644 2020-01-02 03:00:00   \n2020-01-03  -0.600629  -0.761766   0.169777 2020-01-03 03:00:00   \n2020-01-04  -0.650058  -0.548499   0.548560 2020-01-04 03:00:00   \n2020-01-05   1.331785  -1.611482  -1.325902 2020-01-05 03:00:00   \n...               ...        ...        ...                 ...   \n2020-12-27  -1.030740  -0.520146  -0.049081 2020-12-27 03:00:00   \n2020-12-28  -1.221772  -0.770726   0.585857 2020-12-28 03:00:00   \n2020-12-29   0.420476  -2.558351  -0.797550 2020-12-29 03:00:00   \n2020-12-30   0.146506   1.014053  -0.629454 2020-12-30 03:00:00   \n2020-12-31  -1.296130   2.000808  -1.084081 2020-12-31 03:00:00   \n\n            distance_other  \n2020-01-01 0 days 03:00:00  \n2020-01-02 0 days 03:00:00  \n2020-01-03 0 days 03:00:00  \n2020-01-04 0 days 03:00:00  \n2020-01-05 0 days 03:00:00  \n...                    ...  \n2020-12-27 0 days 03:00:00  \n2020-12-28 0 days 03:00:00  \n2020-12-29 0 days 03:00:00  \n2020-12-30 0 days 03:00:00  \n2020-12-31 0 days 03:00:00  \n\n[366 rows x 5 columns]",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>shifted_0</th>\n      <th>shifted_1</th>\n      <th>shifted_2</th>\n      <th>index_other</th>\n      <th>distance_other</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>2020-01-01</th>\n      <td>-0.687795</td>\n      <td>-0.626649</td>\n      <td>0.237109</td>\n      <td>2020-01-01 03:00:00</td>\n      <td>0 days 03:00:00</td>\n    </tr>\n    <tr>\n      <th>2020-01-02</th>\n      <td>-0.514778</td>\n      <td>-1.981137</td>\n      <td>0.354644</td>\n      <td>2020-01-02 03:00:00</td>\n      <td>0 days 03:00:00</td>\n    </tr>\n    <tr>\n      <th>2020-01-03</th>\n      <td>-0.600629</td>\n      <td>-0.761766</td>\n      <td>0.169777</td>\n      <td>2020-01-03 03:00:00</td>\n      <td>0 days 03:00:00</td>\n    </tr>\n    <tr>\n      <th>2020-01-04</th>\n      <td>-0.650058</td>\n      <td>-0.548499</td>\n      <td>0.548560</td>\n      <td>2020-01-04 03:00:00</td>\n      <td>0 days 03:00:00</td>\n    </tr>\n    <tr>\n      <th>2020-01-05</th>\n      <td>1.331785</td>\n      <td>-1.611482</td>\n      <td>-1.325902</td>\n      <td>2020-01-05 03:00:00</td>\n      <td>0 days 03:00:00</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>2020-12-27</th>\n      <td>-1.030740</td>\n      <td>-0.520146</td>\n      <td>-0.049081</td>\n      <td>2020-12-27 03:00:00</td>\n      <td>0 days 03:00:00</td>\n    </tr>\n    <tr>\n      <th>2020-12-28</th>\n      <td>-1.221772</td>\n      <td>-0.770726</td>\n      <td>0.585857</td>\n      <td>2020-12-28 03:00:00</td>\n      <td>0 days 03:00:00</td>\n    </tr>\n    <tr>\n      <th>2020-12-29</th>\n      <td>0.420476</td>\n      <td>-2.558351</td>\n      <td>-0.797550</td>\n      <td>2020-12-29 03:00:00</td>\n      <td>0 days 03:00:00</td>\n    </tr>\n    <tr>\n      <th>2020-12-30</th>\n      <td>0.146506</td>\n      <td>1.014053</td>\n      <td>-0.629454</td>\n      <td>2020-12-30 03:00:00</td>\n      <td>0 days 03:00:00</td>\n    </tr>\n    <tr>\n      <th>2020-12-31</th>\n      <td>-1.296130</td>\n      <td>2.000808</td>\n      <td>-1.084081</td>\n      <td>2020-12-31 03:00:00</td>\n      <td>0 days 03:00:00</td>\n    </tr>\n  </tbody>\n</table>\n<p>366 rows × 5 columns</p>\n</div>"
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# also return original index and distance\n",
    "matched_shifted = temporal_collocation(\n",
    "    ref, shifted, window, \n",
    "    method=\"nearest\", \n",
    "    return_index=True, \n",
    "    return_distance=True\n",
    ")\n",
    "\n",
    "# the index should be the same as unmatched, and the distance should be 3  hours\n",
    "assert np.all(matched_shifted[\"index_other\"].values == shifted.index.values)\n",
    "assert np.all(matched_shifted[\"distance_other\"] == pd.Timedelta(hours=3))\n",
    "\n",
    "matched_shifted"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Flags\n",
    "\n",
    "Satellite data often contains flags indicating quality issues with the data. With `temporal_collocation` it is possible to use this information. Flags can either be provided as array (of the same length as the input DataFrame), or the name of a column in the DataFrame to be used as flag can be provided as string. Any non-zero flag is interpreted as indicating invalid data. By default this will not be used, but when passing ``use_invalid=True``, the invalid values will be used in case no valid match was found.\n",
    "\n",
    "For the following example, we reuse the input data shifted by 3 hours, but we will now assume that the first 3 observations had quality issues."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "array([ True,  True,  True, False, False, False, False, False, False,\n       False])"
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# flag the first 3 observations as invalid\n",
    "flag = np.zeros(len(ref), dtype=bool)\n",
    "flag[0:3] = True\n",
    "flag[0:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "            shifted_0  shifted_1  shifted_2\n2020-01-01        NaN        NaN        NaN\n2020-01-02        NaN        NaN        NaN\n2020-01-03        NaN        NaN        NaN\n2020-01-04  -0.650058  -0.548499   0.548560\n2020-01-05   1.331785  -1.611482  -1.325902\n...               ...        ...        ...\n2020-12-27  -1.030740  -0.520146  -0.049081\n2020-12-28  -1.221772  -0.770726   0.585857\n2020-12-29   0.420476  -2.558351  -0.797550\n2020-12-30   0.146506   1.014053  -0.629454\n2020-12-31  -1.296130   2.000808  -1.084081\n\n[366 rows x 3 columns]",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>shifted_0</th>\n      <th>shifted_1</th>\n      <th>shifted_2</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>2020-01-01</th>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>2020-01-02</th>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>2020-01-03</th>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>2020-01-04</th>\n      <td>-0.650058</td>\n      <td>-0.548499</td>\n      <td>0.548560</td>\n    </tr>\n    <tr>\n      <th>2020-01-05</th>\n      <td>1.331785</td>\n      <td>-1.611482</td>\n      <td>-1.325902</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>2020-12-27</th>\n      <td>-1.030740</td>\n      <td>-0.520146</td>\n      <td>-0.049081</td>\n    </tr>\n    <tr>\n      <th>2020-12-28</th>\n      <td>-1.221772</td>\n      <td>-0.770726</td>\n      <td>0.585857</td>\n    </tr>\n    <tr>\n      <th>2020-12-29</th>\n      <td>0.420476</td>\n      <td>-2.558351</td>\n      <td>-0.797550</td>\n    </tr>\n    <tr>\n      <th>2020-12-30</th>\n      <td>0.146506</td>\n      <td>1.014053</td>\n      <td>-0.629454</td>\n    </tr>\n    <tr>\n      <th>2020-12-31</th>\n      <td>-1.296130</td>\n      <td>2.000808</td>\n      <td>-1.084081</td>\n    </tr>\n  </tbody>\n</table>\n<p>366 rows × 3 columns</p>\n</div>"
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "matched_flagged = temporal_collocation(ref, shifted, window, flag=flag)\n",
    "\n",
    "# the first 3 values should be NaN, otherwise the result should be the same as matched_shifted\n",
    "assert np.all(np.isnan(matched_flagged.values[0:3, :]))\n",
    "assert np.all(matched_flagged.values[3:, :] == matched_shifted.values[3:, 0:3])  # excluding additonal columns\n",
    "matched_flagged"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "            shifted_0  shifted_1  shifted_2 my_flag\n2020-01-01        NaN        NaN        NaN     NaN\n2020-01-02        NaN        NaN        NaN     NaN\n2020-01-03        NaN        NaN        NaN     NaN\n2020-01-04  -0.650058  -0.548499   0.548560   False\n2020-01-05   1.331785  -1.611482  -1.325902   False\n...               ...        ...        ...     ...\n2020-12-27  -1.030740  -0.520146  -0.049081   False\n2020-12-28  -1.221772  -0.770726   0.585857   False\n2020-12-29   0.420476  -2.558351  -0.797550   False\n2020-12-30   0.146506   1.014053  -0.629454   False\n2020-12-31  -1.296130   2.000808  -1.084081   False\n\n[366 rows x 4 columns]",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>shifted_0</th>\n      <th>shifted_1</th>\n      <th>shifted_2</th>\n      <th>my_flag</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>2020-01-01</th>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>2020-01-02</th>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>2020-01-03</th>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>2020-01-04</th>\n      <td>-0.650058</td>\n      <td>-0.548499</td>\n      <td>0.548560</td>\n      <td>False</td>\n    </tr>\n    <tr>\n      <th>2020-01-05</th>\n      <td>1.331785</td>\n      <td>-1.611482</td>\n      <td>-1.325902</td>\n      <td>False</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>2020-12-27</th>\n      <td>-1.030740</td>\n      <td>-0.520146</td>\n      <td>-0.049081</td>\n      <td>False</td>\n    </tr>\n    <tr>\n      <th>2020-12-28</th>\n      <td>-1.221772</td>\n      <td>-0.770726</td>\n      <td>0.585857</td>\n      <td>False</td>\n    </tr>\n    <tr>\n      <th>2020-12-29</th>\n      <td>0.420476</td>\n      <td>-2.558351</td>\n      <td>-0.797550</td>\n      <td>False</td>\n    </tr>\n    <tr>\n      <th>2020-12-30</th>\n      <td>0.146506</td>\n      <td>1.014053</td>\n      <td>-0.629454</td>\n      <td>False</td>\n    </tr>\n    <tr>\n      <th>2020-12-31</th>\n      <td>-1.296130</td>\n      <td>2.000808</td>\n      <td>-1.084081</td>\n      <td>False</td>\n    </tr>\n  </tbody>\n</table>\n<p>366 rows × 4 columns</p>\n</div>"
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# This also works when the flag is already in the input frame, but note that\n",
    "# in the output frame the nonzero flag values are replaced by NaN\n",
    "flagged = shifted.assign(my_flag=flag)\n",
    "matched_flagged = temporal_collocation(ref, flagged, window, flag=\"my_flag\")\n",
    "\n",
    "# the first 3 values should be NaN, otherwise the result should be the same as matched_shifted\n",
    "assert np.all(np.isnan(matched_flagged.iloc[0:3, 0:3].values))\n",
    "assert np.all(matched_flagged.iloc[3:, 0:3].values == matched_shifted.values[3:, 0:3])  # excluding additonal columns\n",
    "matched_flagged"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Combined collocation\n",
    "\n",
    "It is also possible to match multiple timeseries together against a reference dataset using the function `pytesmo.temporal_matching.combined_temporal_collocation`. With the keyword argument `combined_dropna` it's possible to drop data where one of the input datasets has missing values from all other input datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "            random_0  random_1  random_2  shifted_0  shifted_1  shifted_2\n2020-01-01 -0.687795 -0.626649  0.237109  -0.687795  -0.626649   0.237109\n2020-01-02 -0.514778 -1.981137  0.354644  -0.514778  -1.981137   0.354644\n2020-01-03 -0.600629 -0.761766  0.169777  -0.600629  -0.761766   0.169777\n2020-01-04 -0.650058 -0.548499  0.548560  -0.650058  -0.548499   0.548560\n2020-01-05  1.331785 -1.611482 -1.325902   1.331785  -1.611482  -1.325902\n...              ...       ...       ...        ...        ...        ...\n2020-12-27 -1.030740 -0.520146 -0.049081  -1.030740  -0.520146  -0.049081\n2020-12-28 -1.221772 -0.770726  0.585857  -1.221772  -0.770726   0.585857\n2020-12-29  0.420476 -2.558351 -0.797550   0.420476  -2.558351  -0.797550\n2020-12-30  0.146506  1.014053 -0.629454   0.146506   1.014053  -0.629454\n2020-12-31 -1.296130  2.000808 -1.084081  -1.296130   2.000808  -1.084081\n\n[366 rows x 6 columns]",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>random_0</th>\n      <th>random_1</th>\n      <th>random_2</th>\n      <th>shifted_0</th>\n      <th>shifted_1</th>\n      <th>shifted_2</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>2020-01-01</th>\n      <td>-0.687795</td>\n      <td>-0.626649</td>\n      <td>0.237109</td>\n      <td>-0.687795</td>\n      <td>-0.626649</td>\n      <td>0.237109</td>\n    </tr>\n    <tr>\n      <th>2020-01-02</th>\n      <td>-0.514778</td>\n      <td>-1.981137</td>\n      <td>0.354644</td>\n      <td>-0.514778</td>\n      <td>-1.981137</td>\n      <td>0.354644</td>\n    </tr>\n    <tr>\n      <th>2020-01-03</th>\n      <td>-0.600629</td>\n      <td>-0.761766</td>\n      <td>0.169777</td>\n      <td>-0.600629</td>\n      <td>-0.761766</td>\n      <td>0.169777</td>\n    </tr>\n    <tr>\n      <th>2020-01-04</th>\n      <td>-0.650058</td>\n      <td>-0.548499</td>\n      <td>0.548560</td>\n      <td>-0.650058</td>\n      <td>-0.548499</td>\n      <td>0.548560</td>\n    </tr>\n    <tr>\n      <th>2020-01-05</th>\n      <td>1.331785</td>\n      <td>-1.611482</td>\n      <td>-1.325902</td>\n      <td>1.331785</td>\n      <td>-1.611482</td>\n      <td>-1.325902</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>2020-12-27</th>\n      <td>-1.030740</td>\n      <td>-0.520146</td>\n      <td>-0.049081</td>\n      <td>-1.030740</td>\n      <td>-0.520146</td>\n      <td>-0.049081</td>\n    </tr>\n    <tr>\n      <th>2020-12-28</th>\n      <td>-1.221772</td>\n      <td>-0.770726</td>\n      <td>0.585857</td>\n      <td>-1.221772</td>\n      <td>-0.770726</td>\n      <td>0.585857</td>\n    </tr>\n    <tr>\n      <th>2020-12-29</th>\n      <td>0.420476</td>\n      <td>-2.558351</td>\n      <td>-0.797550</td>\n      <td>0.420476</td>\n      <td>-2.558351</td>\n      <td>-0.797550</td>\n    </tr>\n    <tr>\n      <th>2020-12-30</th>\n      <td>0.146506</td>\n      <td>1.014053</td>\n      <td>-0.629454</td>\n      <td>0.146506</td>\n      <td>1.014053</td>\n      <td>-0.629454</td>\n    </tr>\n    <tr>\n      <th>2020-12-31</th>\n      <td>-1.296130</td>\n      <td>2.000808</td>\n      <td>-1.084081</td>\n      <td>-1.296130</td>\n      <td>2.000808</td>\n      <td>-1.084081</td>\n    </tr>\n  </tbody>\n</table>\n<p>366 rows × 6 columns</p>\n</div>"
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "combined_match = combined_temporal_collocation(ref, (random, shifted), window, combined_dropna=True)\n",
    "# matched dataframe should have same length as matched_random_nn without NaNs\n",
    "assert len(combined_match == len(matched_random_nn.dropna()))\n",
    "combined_match"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:pytesmo] *",
   "language": "python",
   "name": "conda-env-pytesmo-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}