The Meandering River of Data-Driven Research: The Underappreciated Skill of Looking for Datasets

Seth Adelsperger and Danyka Byrnes
May 11, 2023

In geoscience, “data” has been the buzzword for a few years now. Everyone wants to use it to gain insight into various processes. Researchers and companies are now focusing on data-driven work and machine learning and need experts to facilitate this transition. Anecdotally, many of these positions go to computer scientists and people in related fields. But geoscientists are in a prime position to lead this work because our domain knowledge gives us valuable insight into the data. With this insight, we can make better decisions at every step: we can choose the right data, use better-informed data cleaning approaches, and frame results in the context of what we know.

The glaring issue is that geoscience students are not taught computer literacy and how to “do data”.

So, let’s take a step back, start at square one, and answer the question: How do I find data?

We liken looking for datasets to a float down a river. The process is a journey, and the effort required to find what you need ebbs and flows. Any student who has dipped their toes into work that requires finding datasets and harmonizing data can relate to the frustration of the process. There are a lot of skills and approaches involved in looking for data.

I, Danyka, work at the national to global scale. I use data-driven approaches to understand how nitrogen has impacted water quality over the last 100 years. My work heavily relies on harmonizing disparate datasets to create a century of annual nitrogen input estimates. I also have to wrangle water quality data to estimate how much nitrogen comes from watersheds. With this, we can infer the fate of nitrogen in watersheds and what drives these processes.

I, Seth, work from the regional scale down to the watershed scale. My research involves understanding how agricultural practices, predominantly tile drainage, interact with watershed characteristics and climate to influence the streamflow regime. This approach requires a variety of datasets in three categories: streamflow, climate, and land use/land cover (LULC). I then use them to produce outputs that further our understanding of how agricultural practices have altered streamflow.

Needless to say, both of us have spent many hours searching for the data and ancillary information we need. Despite working at different scales and on different research questions, we both follow a similar multi-step process to find data. Here are some approaches we suggest when starting on your journey to find data.

Where to look?

The variety of ways to look for data can be overwhelming, so here are a few avenues we use.

Tapping your network: Ask your advisor, committee member(s), lab member(s), or friends.

This is the most intuitive place to start and arguably the lowest effort. You can use your network to ask, “Do you know of a dataset with XYZ?” This approach can be extremely fruitful. So, don’t be shy. Ask around.

Pros: Quick, and they can warn you about issues with datasets.
Cons: Some ability to find your own data will be needed eventually.

Like Attracts Like: Turn to your peers’ papers

The method sections of papers or supplemental documents can be a treasure trove for data explorers. This avenue will probably take time but may yield datasets not found through a blind search.

Pros: The source will also show how researchers have used the data and describe the limitations of the dataset.
Cons: Due to the time it takes to publish new work, if it’s used in a paper already, the data is likely to be a few years old.

Lean on an old friend: Using search engines and data repositories.

One obvious answer is to use Google (or Google Dataset Search) to find data. A general search will likely return millions of results, so use filters to narrow them toward what you are looking for. Some “key” data producers in our field are constantly generating data. Familiarize yourself with agencies, organizations, and platforms like USGS, NASA, Google Earth Engine, World Resources Institute, Our World in Data, USDA, FAO, and Resource Watch, as well as specific research labs, so you know what types of data they generate. These sources provide the “household” datasets, including GRACE, gridMET, PRISM, streamflow, NLCD, and HydroSHEDS, to name a few.

Pros: There is plenty of data on these sites, and you can find what you are looking for, provided you are looking in the right place.
Cons: Some of these agencies have interfaces that are challenging to navigate. Usually, they have a tag or search function, but these can be finicky depending on how data is labeled.

Straight from the source: Subscribing to Google Scholar notifications and data journal tables of contents

This is a standard passive method for finding data. Typically, a handful of researchers in your field are considered the “go-to” data people. These researchers will often publish their datasets in papers. This has become far more common as the number of journals dedicated to publishing scientific datasets has grown. You can set up notifications in Google Scholar, which will alert you when a researcher you follow authors a paper. You can also subscribe to the tables of contents of journals or data repositories and receive notifications of all new publications. That way, you passively receive daily emails listing the new publications and can pursue new datasets quickly.

Pros: Passive, efficient, and allows you to keep up with the literature.
Cons: This yields a large quantity of content, so your inbox can get bombarded with emails. To mitigate this, we suggest creating a “Publication folder” in your email and filtering all incoming data emails to that folder.

Make it a routine activity

We know we all have a million and one things on our plates. But new data are released daily, so you want to keep an eye on what is rolling in.

We suggest you look weekly and keep a running list of datasets relevant to your work. There are two types of data you’ll want to save: (1) datasets that will benefit your current projects, and (2) datasets that can be used to inspire future research (side projects, postdocs, proposals, etc.).

You can compile them into a Google Doc or Sheet with (a) dataset name, (b) dataset description, and (c) DOI or URL. You’ll also want to record key metadata, including the type of data (time series data, point data, or gridded spatial data), time range, and spatial extent. Essentially, aim to include the “need to know” information for each dataset, enough to evaluate whether it would suit your needs. This way, when time has passed and you’ve forgotten about the dataset (and you probably will), you can quickly survey the list and glean everything you need to judge its usability.
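If you would rather keep the running list as a plain file than as a Google Sheet, the same record can be sketched in a few lines of Python. This is only one possible layout, not a prescribed tool: the column names, the allowed values for data type and purpose, and the helper names `add_entry` and `find_entries` are our own assumptions for illustration, and the example entry is a placeholder rather than a real dataset.

```python
import csv
import tempfile
from dataclasses import dataclass, asdict, fields
from pathlib import Path

# Assumed categories, taken from the data types and the two kinds of
# datasets described in the text. Adjust them to suit your own list.
DATA_TYPES = {"time series", "point", "gridded"}
PURPOSES = {"current project", "future idea"}


@dataclass
class DatasetEntry:
    """One row of the running dataset list."""
    name: str
    description: str
    link: str            # DOI or URL
    data_type: str       # one of DATA_TYPES
    start_year: int
    end_year: int
    spatial_extent: str  # free text, e.g. "global" or a region name
    purpose: str         # one of PURPOSES

    def __post_init__(self):
        # Catch typos early so the list stays easy to filter later.
        if self.data_type not in DATA_TYPES:
            raise ValueError(f"data_type must be one of {sorted(DATA_TYPES)}")
        if self.purpose not in PURPOSES:
            raise ValueError(f"purpose must be one of {sorted(PURPOSES)}")
        if self.start_year > self.end_year:
            raise ValueError("start_year must not be after end_year")


def add_entry(catalog_path, entry):
    """Append one entry to the CSV, writing the header if the file is new."""
    path = Path(catalog_path)
    write_header = not path.exists() or path.stat().st_size == 0
    column_names = [f.name for f in fields(DatasetEntry)]
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=column_names)
        if write_header:
            writer.writeheader()
        writer.writerow(asdict(entry))


def find_entries(catalog_path, keyword="", year=None):
    """Return rows whose name or description contains the keyword and,
    if a year is given, whose time range covers that year."""
    matches = []
    with Path(catalog_path).open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            text = f"{row['name']} {row['description']}".lower()
            if keyword.lower() not in text:
                continue
            if year is not None:
                if not int(row["start_year"]) <= year <= int(row["end_year"]):
                    continue
            matches.append(row)
    return matches


if __name__ == "__main__":
    # Demo in a temporary folder; point catalog at a real file to keep it.
    with tempfile.TemporaryDirectory() as folder:
        catalog = Path(folder) / "dataset_catalog.csv"
        add_entry(catalog, DatasetEntry(
            name="Example precipitation grid",      # placeholder, not a real dataset
            description="Daily gridded precipitation",
            link="https://example.org/dataset",     # placeholder URL
            data_type="gridded",
            start_year=1980,
            end_year=2020,
            spatial_extent="global",
            purpose="current project",
        ))
        for row in find_entries(catalog, keyword="precipitation", year=2000):
            print(row["name"], row["link"])
```

Because the result is an ordinary CSV file, it can still be opened in a spreadsheet or shared with lab members, and the search helper makes it quick to survey the list months later.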

The geosciences are coming of age in their creation and analysis of large datasets. New datasets are developed daily by a plethora of models and researchers, but it takes skill to understand them and to draw meaningful insight from them. While finding and gathering data is an incredibly useful skill that needs to be developed, it is but the tip of the iceberg. Once the data has been found, it has to be understood and used, and each of those steps presents its own challenges. Given the prevalence of new and existing data, what are the barriers for new graduate students to gather and use data meaningfully?