{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "742c2576-32cc-4e7a-b083-9b88ed986dc8",
   "metadata": {},
   "source": [
    "# Example application\n",
    "The example (code and data) shown here are available in the \n",
    "`/docs/example` directory of the [HydroBM repository](https://github.com/wknoben/hydrobm).\n",
    "\n",
    "\n",
    "## Example data\n",
    "This example uses data obtained from the Caravan data set (Kratzert et al., 2023). To reduce repository size only the \"total_precipitation_sum\", \"temperature_2m_mean\", and \"streamflow\" variables are retained in `camels_01022500_minimal.nc`.\n",
    "\n",
    "### Subsetting code\n",
    "For full reproducibility, download the original Caravan data and use this code to subset.\n",
    "\n",
    "```\n",
    "import numpy as np\n",
    "import xarray as xr\n",
    "\n",
    "# Load the data\n",
    "data_file = 'camels_01022500.nc'\n",
    "mini_file = 'camels_01022500_minimal.nc'\n",
    "\n",
    "data = xr.open_dataset(data_file)\n",
    "keep = ['date', 'streamflow', 'total_precipitation_sum', 'temperature_2m_mean']\n",
    "mini = data.drop_vars([var for var in data.variables if not var in keep])\n",
    "mask = np.isnan(mini['streamflow'])\n",
    "mini = mini.isel(date=~mask)\n",
    "mini.to_netcdf(mini_file)\n",
    "\n",
    "```\n",
    "\n",
    "### References\n",
    "Kratzert, F., Nearing, G., Addor, N., Erickson, T., Gauch, M., Gilon, O., Gudmundsson, L., Hassidim, A., Klotz, D., Nevo, S., Shalev, G., and Matias, Y.: Caravan - A global community dataset for large-sample hydrology, Scientific Data, 10, 61, https://doi.org/10.1038/s41597-023-01975-w, 2023.\n",
    "\n",
    "## Example application"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "83956102-4529-4550-b768-e0921e568baf",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from hydrobm.calculate import calc_bm\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib.dates as mdates\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import xarray as xr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "82731184-d82d-4bc8-9d14-d36c7d3a753d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get the example data\n",
    "data_file = './camels_01022500_minimal.nc'\n",
    "data = xr.open_dataset(data_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c1770f15-1522-436a-b2c7-9cfe65780efc",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Create an exploratory plot\n",
    "This shows that snow likely plays a role in this basin. \n",
    "\n",
    "Note that `matplotlib` is not a dependency of `hydrobm` and thus will not be automatically installed if it isn't already present on your system."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a11c73ca-aac4-4e1d-b883-b4caba71dd97",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "fig,ax = plt.subplots(1,1)\n",
    "data['total_precipitation_sum'].plot(ax=ax, label='precipitation')\n",
    "data['streamflow'].plot(ax=ax, label='streamflow')\n",
    "data['temperature_2m_mean'].plot(ax=ax, label='temperature')\n",
    "plt.legend();"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "befbb18a-9ead-44cf-b016-7e0ce191a58d",
   "metadata": {},
   "source": [
    "### Run HydroBM\n",
    "For this example, we'll calculate all benchmarks and all metrics, as well as estimate a snow melt flux from the precipitation and temperature data. \n",
    "\n",
    "First, we'll specify time masks for benchmark calculation and validation. These are arbitrary choices and will depend on your data availability and study purpose."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1208f0eb-c0a6-4d0a-aeb5-b2d28e045693",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Specify the calculation and evaluation periods, as boolean masks\n",
    "cal_mask = data['date'].values < np.datetime64('1999-01-01')\n",
    "val_mask = ~cal_mask"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1718d776",
   "metadata": {},
   "source": [
    "Next, we'll specify the benchmarks we want to calculate. \n",
    "We'll go for the full ensemble available in HydroBM at the time of its initial release.\n",
    "\n",
    "This will trigger a warning when we run HydroBM, because the `annual_mean_flow` and `annual_median_flows` shouldn't be used in combination with a `val_mask`, but we'll accept that for this example. As a consequence, we'll also see some NumPy warnings when HydroBM runs it's benchmark validation for these two benchmarks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ec3a47b5-4cab-4e90-af50-c33cd8518ee1",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Specify the benchmarks and metrics to calculate\n",
    "benchmarks = [\n",
    "        # Streamflow benchmarks\n",
    "        # =====================\n",
    "        \"mean_flow\",\n",
    "        \"median_flow\",\n",
    "        \"annual_mean_flow\",\n",
    "        \"annual_median_flow\",\n",
    "        \"monthly_mean_flow\",\n",
    "        \"monthly_median_flow\",\n",
    "        \"daily_mean_flow\",\n",
    "        \"daily_median_flow\",\n",
    "    \n",
    "        # Precipitation and streamflow benchmarks\n",
    "        # =====================\n",
    "        # Long-term rainfall-runoff ratio benchmarks\n",
    "        \"rainfall_runoff_ratio_to_all\",\n",
    "        \"rainfall_runoff_ratio_to_annual\",\n",
    "        \"rainfall_runoff_ratio_to_monthly\",\n",
    "        \"rainfall_runoff_ratio_to_daily\",\n",
    "        \"rainfall_runoff_ratio_to_timestep\",\n",
    "        \"scaled_precipitation_benchmark\",  # equivalent to \"rainfall_runoff_ratio_to_daily\"\n",
    "\n",
    "         # Short-term rainfall-runoff ratio benchmarks\n",
    "        \"monthly_rainfall_runoff_ratio_to_monthly\",\n",
    "        \"monthly_rainfall_runoff_ratio_to_daily\",\n",
    "        \"monthly_rainfall_runoff_ratio_to_timestep\",\n",
    "\n",
    "        # Precipitation deviation benchmarks\n",
    "        \"annual_scaled_daily_mean_flow\",\n",
    "        \"monthly_scaled_daily_mean_flow\",\n",
    "    \n",
    "        # Parsimonious models\n",
    "        # =====================\n",
    "        \"eckhardt_baseflow\",\n",
    "    \n",
    "        # Schaefli & Gupta (2007) benchmarks\n",
    "        \"adjusted_precipitation_benchmark\",\n",
    "        \"adjusted_smoothed_precipitation_benchmark\", # from Schaefli & Gupta (2007)\n",
    "     ]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "094db2fd",
   "metadata": {},
   "source": [
    "We'll also calculate values for the four statistics available in HydroBM at its initial release. \n",
    "\n",
    "Note that automatically calculating certain metric scores is included in HydroBM as a quality-of-life feature, but HydroBM's usage is not limited to these metrics only. HydroBMs main purpose is to create the benchmark simulation time series and return these to the user. If your metric of choice is not (yet) available within HydroBM, simply use HydroBM to generate the benchmark model timeseries and calculate the metric yourself from those.\n",
    "\n",
    "Of course, you could always consider contributing your metric code to the HydroBM package through its [GitHub reposity](https://github.com/wknoben/hydrobm) :)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ea367ecc",
   "metadata": {},
   "outputs": [],
   "source": [
    "metrics = ['nse', 'kge', 'mse', 'rmse']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d04914fc",
   "metadata": {},
   "source": [
    "Now we're ready to run the whole thing. Because we only have precipitation data, but know that snow is important, we'll also run HydroBM's simple snow model (`calc_snowmelt = True`). When running everything through `calc_bm`, enabling the snow routine will automatically internally use the resulting snow-plus-melt flux to calculate any requested benchmarks.\n",
    "\n",
    "Note that if you have an external snow model you like better, you can simply run that and feed that timeseries into HydroBM by changing the `precipitation` variable name to that of your own rain-plus-melt timeseries in `data`.\n",
    "\n",
    "We'll stick with the default values for `snowmelt_threshold` (threshold between rain and snow), `snowmelt_rate` (how quickly does accumulated snow melt) and `optimization_method` (how do we find optimal lag/smoothing values for the APB and ASPB benchmarks), but I've included them here anyway to show how/that they can be adjusted. See the `Usage` part of the documentation for further details.\n",
    "\n",
    "This is the part where all the work is done (run snow model, create benchmark timeseries, calculate metric values) and running this will take a minute or two. Note that we'll get some warnings from both HydroBM and NumPy because we're combining `val_mask` with two benchmarks that by definition are incompatible with the concept of unseen data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "982484d4-6d4d-4990-8726-abb1b4a32d80",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Calculate the benchmarks and scores\n",
    "benchmark_flows,scores = calc_bm(\n",
    "                            data,\n",
    "                \n",
    "                            # Time period selection\n",
    "                            cal_mask,\n",
    "                            val_mask=val_mask,\n",
    "                \n",
    "                            # Variable names in 'data'\n",
    "                            precipitation=\"total_precipitation_sum\",\n",
    "                            streamflow=\"streamflow\",\n",
    "                \n",
    "                            # Benchmark choices\n",
    "                            benchmarks=benchmarks,\n",
    "                            metrics=metrics,\n",
    "                            optimization_method=\"brute_force\",\n",
    "                \n",
    "                            # Snow model inputs\n",
    "                            calc_snowmelt=True,\n",
    "                            temperature=\"temperature_2m_mean\",\n",
    "                            snowmelt_threshold=0.0,\n",
    "                            snowmelt_rate=3.0,\n",
    "                        )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29b7e1e0-31c4-40f7-8f6c-01b5a39aa1d9",
   "metadata": {},
   "source": [
    "### Visualize the benchmarks\n",
    "First, lets visualze the timeseries of each benchmark. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5b53c89c-02e3-4bcb-adb4-5971e050397e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Subset the benchmark flows between 1997 (calibration period) and 2000 (validation period)\n",
    "bm_sub = benchmark_flows.loc[\"1997\":\"2000\"]\n",
    "# Subset the observed flow for the same period\n",
    "qobs_sub = data.sel(date=slice(\"1997\", \"2000\"))[\"streamflow\"].to_series()\n",
    "# Extract list of benchmarks\n",
    "bm_cols = bm_sub.columns.tolist()\n",
    "\n",
    "# Format plots\n",
    "n_cols = 2\n",
    "n_rows = -(-len(bm_cols) // n_cols)\n",
    "# Find the last visible plot index in each column (for x-axis labels)\n",
    "bottom_axes = set()\n",
    "for c in range(n_cols):\n",
    "    col_indices = [c + r * n_cols for r in range(n_rows) if c + r * n_cols < len(bm_cols)]\n",
    "    if col_indices:\n",
    "        bottom_axes.add(col_indices[-1])\n",
    "fig, axes = plt.subplots(n_rows, n_cols, figsize=(12, n_rows * 3), constrained_layout=True)\n",
    "axes = axes.flatten()\n",
    "\n",
    "split_date = pd.Timestamp(\"1999-01-01\")\n",
    "\n",
    "for i, name in enumerate(bm_cols):\n",
    "    ax = axes[i]\n",
    "    l1, = ax.plot(qobs_sub.index, qobs_sub.values, color=\"steelblue\", linewidth=0.9, label=\"Observed\")\n",
    "    l2, = ax.plot(bm_sub.index, bm_sub[name].values, color=\"darkorange\", linewidth=0.8, label=\"Benchmark\")\n",
    "    title = name.replace(\"bm_\", \"\").replace(\"_\", \" \").title()\n",
    "    ax.set_title(title, fontsize=10, fontweight=\"bold\", pad=3)\n",
    "    ax.xaxis.set_major_locator(mdates.MonthLocator(bymonth=[1, 7]))\n",
    "    ax.xaxis.set_major_formatter(mdates.DateFormatter(\"%b %Y\"))\n",
    "    ax.tick_params(axis=\"y\", labelsize=9)\n",
    "    ax.set_ylabel(\"Streamflow\", fontsize=9)\n",
    "    ax.spines[[\"top\", \"right\"]].set_visible(False)\n",
    "\n",
    "    # Vertical line at calibration/validation split\n",
    "    ax.axvline(split_date, color=\"black\", linewidth=1.0, linestyle=\"--\", zorder=3)\n",
    "\n",
    "    # Cal / Val labels\n",
    "    trans = ax.get_xaxis_transform()\n",
    "    ax.text(split_date - pd.Timedelta(days=500), 0.9,\n",
    "            \"Calibration\", transform=trans, ha=\"right\", va=\"top\", fontsize=9, color=\"black\")\n",
    "    ax.text(split_date + pd.Timedelta(days=500), 0.9,\n",
    "            \"Validation\", transform=trans, ha=\"left\", va=\"top\", fontsize=9, color=\"black\")\n",
    "\n",
    "    if i in bottom_axes:\n",
    "        ax.tick_params(axis=\"x\", labelsize=9, rotation=30)\n",
    "        ax.set_xlabel(\"Date\", fontsize=9)\n",
    "    else:\n",
    "        ax.tick_params(axis=\"x\", labelbottom=False)\n",
    "\n",
    "for j in range(len(bm_cols), len(axes)):\n",
    "    axes[j].set_visible(False)\n",
    "\n",
    "fig.suptitle(\"Benchmark Flows vs Observed Streamflow  |  1997–2000\", fontsize=14, fontweight=\"bold\")\n",
    "fig.legend(handles=[l1, l2], labels=[\"Observed\", \"Benchmark\"],\n",
    "           loc=\"lower center\", ncol=2, fontsize=10,\n",
    "           frameon=False, bbox_to_anchor=(0.5, -0.02))\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4666900f-b839-4300-b666-efd021f538dd",
   "metadata": {},
   "source": [
    "Each benchmark produces a different timeseries. The Annual Mean and Median Flow benchmarks only produce values for the calibration period (1998), as they cannot be used on unseen data. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8589760e-f356-4508-aa7a-7b8845351fbe",
   "metadata": {},
   "source": [
    "### Analyze the results\n",
    "Here's some basic analysis of the resulting benchmarks and their scores. Feel free to build your own analysis from this!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "faf127e9-ab33-4b4d-b66c-f4e838cd4305",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Print the scores with some basic formatting applied\n",
    "for key,val in scores.items():\n",
    "    if key == 'benchmarks':\n",
    "        print(f'{key}: {val}')\n",
    "    else:\n",
    "        pm = [f'{num:.2f}' for num in val]\n",
    "        print(f'{key}: {pm}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ebd5261-c8dd-432f-bbda-726491b622f2",
   "metadata": {},
   "source": [
    "Not very clear. Let's create some hydrographs as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c74d3205-2d23-4861-9f9d-3dc0f18891ce",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Select the four best benchmarks for plotting\n",
    "def top_n_indices_and_values(values_list, n=4):\n",
    "    arr = np.array(values_list) # numpy array\n",
    "    nan_idx = np.where(np.isnan(arr)) # find nan values\n",
    "    arr_sort = arr.argsort() # sort the full array, nans go at the end\n",
    "    arr_sort = arr_sort[~np.isin(arr_sort, nan_idx)] # remove nans\n",
    "    indices = arr_sort[-n:] # get the top n indices\n",
    "    values = arr[indices] # get the values\n",
    "    return indices.tolist(), values.tolist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dcd58db7-99b7-4027-9630-3e4e182909c2",
   "metadata": {},
   "outputs": [],
   "source": [
    "idx,vals = top_n_indices_and_values(scores['kge_val'], 4)\n",
    "top_benchmarks = [scores['benchmarks'][id] for id in idx]\n",
    "top_kge_vals = [scores['kge_val'][id] for id in idx]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c9d63142-2a4a-49ee-a54a-67ce5d71c2b2",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Print streamflow along with the four best benchmarks\n",
    "fig,ax = plt.subplots(4,1, figsize=(14,14))\n",
    "for i,(bm,kge) in enumerate(zip(top_benchmarks,top_kge_vals)):\n",
    "    data['streamflow'].plot(ax=ax[i], linewidth=2, label='streamflow')\n",
    "    benchmark_flows[f'bm_{bm}'].plot(ax=ax[i], label=bm)\n",
    "    ax[i].legend(loc='upper left')\n",
    "    ax[i].set_title(f'{bm} (KGE: {kge:.2f})')\n",
    "    ax[i].set_xlabel('') # drops 'Date'\n",
    "    \n",
    "plt.tight_layout()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bc8e5613-ceaa-4e94-9f55-fa8338b94cea",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reformat the scores a bit for cleaner saving \n",
    "col_names = scores.pop(\"benchmarks\", None)\n",
    "df = pd.DataFrame(scores, index=col_names)\n",
    "df = df.T"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f4f8830",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Uncomment these to save the benchmark models and scores\n",
    "#  No point doing that here while building the documentation.\n",
    "#df.to_csv(\"benchmark_scores.csv\")\n",
    "#benchmark_flows.to_csv(\"benchmark_flows.csv\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}