By Jerom Aerts
Here we present a non-exhaustive collection of useful (open-source) programming resources and tips for conducting research within the broader hydrological sciences. After supervising multiple students, a recurring question is where to start and what tools to use. Therefore, we created this collection with the needs of beginning to advanced level data scientists in mind.
We aim to regularly update this post with newly developed and/or missing resources. This blog post is not tied to a single programming language. However, the first posts will focus on Python, as this is the language most commonly used amongst the H3S committee members. A compilation of hydrology resources for R programming is also available.
If you feel that we missed resources and tips that can benefit this post, please contact us through the contact page on this website or through social media.
General Resources and Tips
The internet contains many tutorials on how to use Python for research, and the community is very helpful when you are stuck debugging code. Some useful websites include:
– https://stackoverflow.com/ – Tip: Post your questions, the community will help.
– https://w3schools.com/python/ – Tutorials that cover the basics.
– https://towardsdatascience.com/ – Blog style short tutorials.
– https://earthdatascience.org/courses – Tutorials that cover earth data science.
How to start? A great place to start is by creating a solid basis that circumvents problems with conflicting packages and other headaches. One approach is the use of Anaconda Python (https://docs.anaconda.com/anaconda/install/).
Anaconda is a particular distribution of Python that includes a suite of core Python packages that are essential for data analysis. Installing Anaconda also sets paths correctly such that Python can easily find packages on your system. With the Anaconda Python installation there is an option to install the program Spyder. Spyder is an integrated development environment (IDE), where you can write and run code that has the Python file extension (.py).
A helpful tool that Anaconda offers is the creation of so-called environments. The user can install packages that are specific to a given study, which helps prevent packages from conflicting with each other. In addition, environments can be exported and shared with other users, which is useful for project collaboration.
After installing Anaconda and activating an environment, new packages can be installed using the Anaconda prompt with the command “conda install package-name”, where package-name is the name of the package to install.
The Anaconda installation contains the Spyder IDE and sets up the system such that it is ready to use.
Other examples of IDEs that can streamline your workflow are:
– PyCharm (https://www.jetbrains.com/pycharm-edu/)
– Microsoft Visual Studio Code (https://code.visualstudio.com/)
Jupyter Notebooks and/or JupyterLab environments are now actively used in education and research, as they enhance the readability of programming code and show output directly below the code. Of particular interest are widgets, which can make your notebooks interactive.
Jupyter notebooks can be installed using Anaconda or pip (https://jupyter.org/install).
GitHub provides an excellent environment for storing code, collaborating on code, keeping track of versions, and publishing. To interface with GitHub one could use the website, a desktop program, and/or the command line interface provided by git. There are many useful guides to get started, for example (https://opensource.com/article/18/1/step-step-guide-git).
Dataframe data (Tables, Excel):
A go-to for many researchers when it comes to tables, csv, or Excel files is the powerful Python package called Pandas. Pandas has a built-in plotting tool and many capabilities for cleaning and interpolating messy data. It offers data structures and operations for manipulating numerical tables and time series.
A useful tutorial on how to get started can be found here: https://pandas.pydata.org/pandas-docs/version/0.15/tutorials.html.
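As a minimal sketch of the cleaning and interpolation capabilities mentioned above, the example below builds a short daily series with gaps, fills the gaps by interpolating along the time index, and aggregates the result to two-day means. The values and the column name `discharge` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily discharge series (m3/s) with two missing values.
dates = pd.date_range("2020-01-01", periods=6, freq="D")
df = pd.DataFrame(
    {"discharge": [10.0, np.nan, 14.0, 16.0, np.nan, 20.0]},
    index=dates,
)

# Fill the gaps by linear interpolation along the time index.
df["discharge_filled"] = df["discharge"].interpolate(method="time")

# Aggregate the gap-filled series to two-day means.
two_day_mean = df["discharge_filled"].resample("2D").mean()

print(df)
print(two_day_mean)
```

The same pattern applies to data read from a file with `pd.read_csv` or `pd.read_excel`, provided the date column is parsed into a datetime index first.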
Raster Data (Arrays)
NumPy is a fast and powerful Python package written for working with n-dimensional arrays. NumPy contains functions for linear algebra, Fourier transforms, and matrix operations. The great thing about NumPy is that its suite of functions is so extensive that in many cases a dedicated function is available for your needs.
A tutorial covering the basics can be found here: https://numpy.org/doc/stable/user/quickstart.html.
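To give an impression of array-based work with NumPy, here is a small sketch that treats a 2-D array as a raster, masks a no-data value, and computes statistics over the valid cells only. The elevation values and the no-data marker -9999 are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical 3 x 3 elevation raster (m); -9999 marks missing cells.
nodata = -9999.0
dem = np.array([
    [12.0, 15.0, nodata],
    [11.0, 13.0, 14.0],
    [10.0, nodata, 13.0],
])

# Replace the no-data value with NaN so it is ignored in the statistics.
dem = np.where(dem == nodata, np.nan, dem)

# Mean over valid cells only, and the number of valid cells.
mean_elevation = np.nanmean(dem)
valid_cells = np.count_nonzero(~np.isnan(dem))

# Boolean mask of cells above the mean (NaN cells compare as False).
above_mean = dem > mean_elevation

print(mean_elevation, valid_cells, above_mean.sum())
```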
Vector Data (Polygon, Shapefiles)
Shapely is a useful Python package for manipulating and analysing the vector geometries that are most commonly used in GIS software. It works with points, curves, and surfaces. Shapely operates on planar geometric objects and leaves file reading and coordinate systems to other packages, such as Fiona and pyproj.
Fiona complements Shapely by reading and writing vector data files, and many find it intuitive to use. It describes itself as GDAL's neat and nimble vector API for Python programmers. Fiona can read and write real-world data using multi-layered GIS formats and zipped virtual file systems, and it integrates readily with other Python GIS packages such as pyproj, Rtree, and Shapely.
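The following sketch illustrates the kind of geometric operations Shapely offers on points and polygons. The square "catchment" and the gauge coordinates are hypothetical and expressed in arbitrary planar units.

```python
from shapely.geometry import Point, Polygon

# Hypothetical catchment outline: a square of 10 x 10 units.
catchment = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])

# Two hypothetical gauge locations.
gauge_inside = Point(5, 5)
gauge_outside = Point(12, 5)

# Planar area in squared coordinate units.
area = catchment.area

# Point-in-polygon tests.
inside = catchment.contains(gauge_inside)
outside = catchment.contains(gauge_outside)

# Shortest distance from the outside gauge to the catchment boundary.
distance = catchment.distance(gauge_outside)

print(area, inside, outside, distance)
```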
Instructions on how to use Fiona can be found here: https://fiona.readthedocs.io/en/latest/manual.html.
GeoPandas is a high-level interface that extends the datatypes used by Pandas to allow spatial operations on geometric types such as points, lines, and polygons. It is easy to use and highly recommended for beginning Python users.
Tutorials on how to use GeoPandas can be found here:
A standard approach to plotting with Python is to use the package Matplotlib (https://matplotlib.org/gallery/index.html). Other useful packages are:
Seaborn is a higher-level interface on top of Matplotlib. This package is more intuitive for beginning Python users. Seaborn is an easy way to make your figures look cleaner and more sophisticated than ‘standard Python figures’, due to the ability to import the Seaborn style guide.
The Seaborn website hosts a great tutorial: https://seaborn.pydata.org/tutorial.html.
Altair is a declarative statistical visualization library for Python, meaning that at a higher level the user declares links between data columns and channels like x-axis, y-axis, and colour. Altair works well with the Pandas data structures and can handle georeferenced objects through Shapely and GeoPandas.
A tutorial can be found here: https://github.com/altair-viz/altair-tutorial.
Plotly is a powerful visualization tool that supports interactive figures and is best used in combination with Jupyter Notebooks. Its capabilities range from basic statistical figures and georeferenced maps to machine learning visualizations.
Tutorials on these visualizations can be found here: https://plotly.com/python/.
Bokeh can be best used in combination with Jupyter Notebooks because it contains a suite of widgets. It works well with common tools like NumPy, Pandas and scikit-learn. Intuitive and fast are the best words to describe Bokeh. Plots, dashboards, and apps can be published in web pages or Jupyter Notebooks.
Tutorials can be found here: https://nbviewer.jupyter.org/github/bokeh/bokeh-notebooks/blob/master/index.ipynb#Tutorial.
HoloViews builds a higher-level interface on top of Bokeh and is best used with Jupyter Notebooks. By reducing the lines of code required to visualize data, more time is left for exploring and analysing data.
GDAL is a translator library for raster and vector geospatial data formats. GDAL contains a very large suite of raster and vector drivers, allowing it to work with many data formats. Some of the tool's capabilities are reprojection, resampling to a different resolution, raster calculation, and more. GDAL can be used directly from Python or from the command line.
Tutorials on how to use GDAL can be found here: https://gdal.org/tutorials/index.html
Rasterio is more ‘Pythonic’ and easier to use than GDAL. Rasterio covers all the tools necessary for manipulating and storing gridded georeferenced raster datasets. Besides a simple Python interface, Rasterio contains a powerful command line tool called Rio.
A quickstart tutorial can be found here: https://rasterio.readthedocs.io/en/latest/quickstart.html
Advanced topics tutorial here:
Pyproj is an essential library of cartographic projections and coordinate transformations.
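As a brief sketch of a coordinate transformation with pyproj, the example below converts a longitude/latitude pair to UTM coordinates and back. The station location and the choice of UTM zone 31N (EPSG:32631) are assumptions made for illustration.

```python
from pyproj import Transformer

# Transformers between geographic WGS84 (EPSG:4326) and UTM zone 31N (EPSG:32631).
# always_xy=True fixes the axis order to (longitude, latitude) and (easting, northing).
to_utm = Transformer.from_crs("EPSG:4326", "EPSG:32631", always_xy=True)
to_wgs84 = Transformer.from_crs("EPSG:32631", "EPSG:4326", always_xy=True)

# Hypothetical station location in the Netherlands.
lon, lat = 4.37, 52.0

# Forward transformation to projected coordinates in metres.
easting, northing = to_utm.transform(lon, lat)

# Inverse transformation as a round-trip check.
lon_back, lat_back = to_wgs84.transform(easting, northing)

print(f"UTM 31N: {easting:.1f} m E, {northing:.1f} m N")
print(f"Round trip: {lon_back:.6f}, {lat_back:.6f}")
```

Setting `always_xy=True` avoids a common source of confusion, because some coordinate reference systems define their axes in latitude/longitude order.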
Xarray handles N-D labelled arrays and is a valuable tool for handling geospatial data. The interface borrows heavily from Pandas and is compatible with it. A great aspect of Xarray is that it integrates with Dask and therefore allows for efficient parallel computing and lazy loading of data, meaning that not all of the data is loaded into memory at once. Working with the NetCDF file format in particular is easy and intuitive.
Tip: the open_mfdataset function and its preprocess argument allow multiple NetCDF files to be handled at the same time without creating memory issues.
A very useful tutorial on how to use xarray can be found here: https://geohackweek.github.io/nDarrays/00-datasets/
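To show what working with labelled arrays looks like in practice, here is a minimal in-memory sketch. The grid, dates, and precipitation values are invented for illustration; with real data, the array would typically come from a NetCDF file instead.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical daily precipitation (mm/day) on a 2 x 3 grid for 4 days.
time = pd.date_range("2020-01-01", periods=4, freq="D")
lat = [52.0, 52.5]
lon = [4.0, 4.5, 5.0]
values = np.arange(24, dtype=float).reshape(4, 2, 3)

precip = xr.DataArray(
    values,
    coords={"time": time, "lat": lat, "lon": lon},
    dims=("time", "lat", "lon"),
    name="precip",
)

# Select a grid cell by coordinate label rather than by integer position.
cell_series = precip.sel(lat=52.5, lon=4.5)

# Reduce over named dimensions: the spatial mean for every time step.
spatial_mean = precip.mean(dim=("lat", "lon"))

# Accumulate over time: total precipitation per grid cell.
total = precip.sum(dim="time")

print(cell_series.values)
print(spatial_mean.values)
print(total.values)
```

Because dimensions are addressed by name, the code stays readable even when the order of the axes differs between datasets.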
Iris is a powerful, format-agnostic, and community-driven Python library for analysing and visualising Earth science data.
An introduction to Iris can be found here: https://ourcodingclub.github.io/tutorials/iris-python-data-vis/
ESMPy is a Python interface to the powerful regridding tool ESMF.
A tutorial on how to use ESMPy can be found here: https://github.com/nawendt/esmpy-tutorial/blob/master/esmpy_tutorial.ipynb
xESMF works through a higher-level interface than ESMPy on top of ESMF and can therefore be more intuitive for beginning Python users. The interface works similarly to Xarray and is compatible with Xarray and NumPy. xESMF supports Dask, enabling out-of-core and parallel computation.
A tutorial on how to use xESMF can be found here: https://xesmf.readthedocs.io/en/latest/notebooks/Rectilinear_grid.html
Domain Specific Resources and Tips
Credits to Raoul Collenteur for compiling this list of open-source Python packages.
CMF – Catchment Modelling Framework, a hydrologic modelling toolbox.
TopoFlow – Spatial hydrologic model (D8-based, fully BMI-compliant).
VIC – The Variable Infiltration Capacity (VIC) Macroscale Hydrologic Model.
Xanthos – Xanthos is an open-source hydrologic model, written in Python, designed to quantify and analyze global water availability.
WRF-Hydro – wrfhydropy is a Python API for the WRF-Hydro modelling system.
EXP-HYDRO – a catchment scale hydrological model that operates at a daily time-step. It takes as inputs the daily values of precipitation, air temperature, and potential evapotranspiration, and simulates daily streamflow at the catchment outlet.
RRMPG – Rainfall-Runoff modelling playground.
LHMP – Lumped Hydrological Models Playground.
SMARTPy – Python implementation of the rainfall-runoff model SMART
PyStream – Python implementation of the STREAM hydrological rainfall-runoff model.
HydPy – A framework for the development and application of hydrological models based on Python.
Catchmod – CATCHMOD is a widely used rainfall runoff model in the United Kingdom. It was introduced by Wilby (1994).
wflow – wflow consists of a set of Python programs that can be run on the command line to perform hydrological simulations. The models are based on the PCRaster Python framework.
PyTOPKAPI – PyTOPKAPI is a BSD licensed Python library implementing the TOPKAPI Hydrological model (Liu and Todini, 2002).
mhmpy – A Python-API for the mesoscale Hydrological Model.
SuperflexPy – SuperflexPy: A new open source framework for building conceptual hydrological models
NeuralHydrology – Python library to train neural networks with a strong focus on hydrological applications.
MetPy – MetPy is a collection of tools in Python for reading, visualizing and performing calculations with weather data.
PyEto – PyETo is a Python library for calculating reference crop evapotranspiration (ETo), sometimes referred to as potential evapotranspiration (PET). The library provides numerous functions for estimating missing meteorological data.
Improver – IMPROVER is a library of algorithms for meteorological post-processing and verification.
MetSim – MetSim is a meteorological simulator and forcing disaggregator for hydrologic modeling and climate applications.
MELODIST – MELODIST is an open-source toolbox written in Python for disaggregating daily meteorological time series to hourly time steps.
PyCat – Climate Analysis Tool written in Python.
PySteps – pySTEPS is a community-driven initiative for developing and maintaining an easy to use, modular, free and open source Python framework for short-term ensemble prediction systems.
Evaporation – Calculation of evaporation and transpiration.
rainymotion – Python library for radar-based precipitation nowcasting based on optical flow techniques.
Pytesmo – Python Toolbox for the Evaluation of Soil Moisture Observations.
Phydrus – Python implementation of the HYDRUS-1D unsaturated zone model
Flopy – The Python interface to MODFLOW.
imod-python – Make massive MODFLOW models.
Idfpy – A simple module for reading and writing iMOD IDF files. IDF is a simple binary format used by the iMOD groundwater modelling software.
WellApplication – Set of tools for groundwater level and water chemistry analysis.
TIMML – A Multi-Layer, Analytic Element Model.
TTim – A Multi-Layer, Transient, Analytic Element Model.
PyHELP – A Python library for the assessment of spatially distributed groundwater recharge and hydrological components with HELP.
PyRecharge – Spatially distributed groundwater recharge and depletion modeling framework in Python
Anaflow – A Python package containing analytical solutions for the groundwater flow equation.
WellTestPy – A Python package for handling well-based field campaigns.
Time Series (Analysis):
Hydropy – Analysis of hydrological-oriented time series.
Pastas – Analysis of hydrological time series using time series models.
Hydrostats – Tools for use in comparison studies, specifically for use in the field of hydrology.
htimeseries – This module provides the HTimeseries class, which is a layer on top of Pandas, offering a little more functionality.
EFlowCalc – Calculator of Streamflow Characteristics.
Hydrofunctions – A suite of convenience functions for working with hydrology data in an interactive Python session.
Hydrobox – Hydrological preprocessing and analysis toolbox built upon Pandas and NumPy.
Optimization, Uncertainty, Statistics:
LMFIT – Non-Linear Least Squares Minimization, with flexible Parameter settings, based on scipy.optimize.leastsq, and with many additional classes and methods for curve fitting.
SPOTpy – A Statistical Parameter Optimization Tool for Python.
PyGLUE – Generalised Likelihood Uncertainty Estimation (GLUE) Framework.
Pyemu – Python modules for model-independent uncertainty analyses, data-worth analyses, and interfacing with PEST(++).
HPGL – High Performance Geostatistics Library.
HydroErr – Goodness of Fit metrics for use in comparison studies, specifically in the field of hydrology.
Climate-indices – Climate indices for drought monitoring, community reference implementations in Python.
HydroLM – The HydroLM package contains a class and functions for automating ordinary least squares (OLS) linear regressions for hydrologists.
PySDI – pysdi is a set of open source scripts that compute non-parametric standardized drought indices (SDI) using raster data sets as input data.
PCRaster – A collection of software targeted at the development and deployment of spatio-temporal environmental models.
PyGeoprocessing – a Python/Cython based library that provides a set of commonly used raster, vector, and hydrological operations for GIS processing.
Pysheds – Simple and fast watershed delineation in Python.
Lidar – Terrain and hydrological analysis based on LiDAR-derived digital elevation models (DEM).
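Several of the packages listed above, such as HydroErr and Hydrostats, provide goodness-of-fit metrics for comparing simulated and observed series. As a minimal sketch of what such a metric computes, the function below implements the Nash-Sutcliffe efficiency with NumPy. The function name and the sample values are illustrative and are not taken from any of the listed packages.

```python
import numpy as np


def nash_sutcliffe(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 minus the ratio of the squared model
    error to the variance of the observations around their mean."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    squared_error = np.sum((observed - simulated) ** 2)
    observed_variance = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - squared_error / observed_variance


# Hypothetical observed and simulated streamflow (m3/s).
obs = [1.0, 2.0, 3.0, 4.0, 5.0]
sim = [1.1, 1.9, 3.2, 3.8, 5.0]

nse = nash_sutcliffe(obs, sim)
print(f"NSE = {nse:.3f}")
```

A value of 1 indicates a perfect match, while a value of 0 means the simulation performs no better than using the mean of the observations as a predictor.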