Your Data Hygienist Is Calling and It’s a Robot


Studio X Team


September 1, 2021

AI data hygiene has the power to change the geoscience industry. Here’s how.

.   .   .

In an ideal world, our applications would take care of our data management for us. Humans are, by nature, poor at keeping meticulous records. To save our data – and spare future generations the frustration – we need tools that continuously record what we are interpreting, how, and when.

The Difficulties of Legacy Data

Internal surveys of geoscientists have consistently surfaced the same problem: “Where is my data?!” Finding the right piece of data, and being able to trust it, is a huge problem for every geoscientist. And while the tools for managing our current data are coming, they won’t help with the decades of past documents that still need to be understood.

Oftentimes, geoscientists will throw out legacy data and do the work over. This is not because they think they are better than past geoscientists, but because they have no idea how the data was generated and therefore can’t design a project from it or defend it in reviews.

Endless Connections

And so, a solution has been proposed: an AI data hygienist. An AI data hygienist would be able to fill in the gaps in a piece of data’s backstory by taking in the context of the data. By understanding the author, tools used, and location of the data on Earth, a machine should be able to provide lost details to create a richer history. This history, or metadata, can be searched or used as data itself to provide better materials for geoscientists to use. There are hundreds of thousands of great insights to be found in journals, textbooks, and company reports of the past. AI data hygiene would allow us to build connections between these documents to better understand our world. 

Consider, for example, the coffee cup sitting on your desk right now. It’s a simple object, and yet it contains hundreds of details and connections. The store it was bought from. The event that brought it into your life. The gallons of coffee it has held in its life to date. These are just a few elements of the cup’s history – imagine how many more connections lie within a single rock or a byte of geoscience data. The more of these connections we can map out and understand, the deeper we’ll be able to go in our understanding of the Earth and how we can harness energy from it.

AI Data Hygiene Tools

So what does AI data hygiene look like? One aspect is building tools that extract geoscience-aware phrases from the entire document corpus. Those phrases can then be used to isolate analogous documents and relevant data types, especially in legacy studies. In layman’s terms, these tools can take a piece of geoscience jargon, such as “subangular, subrounded greywacke,” and connect it to all the different pieces of geoscience knowledge and history that reference it. Another example is Document AI, an augmented analog of Google’s Document AI tool, which integrates document geolocation, image and table extraction, and ontology-focused (or domain-focused) search of the data within a document.
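As a rough illustration of the phrase-extraction idea, here is a minimal Python sketch. It is not the Studio X tooling: the vocabulary, the sample documents, and the function names are all hypothetical, and a real system would rely on a curated geoscience ontology rather than a five-word list. The sketch finds known terms in each document and ranks documents by how many terms they share with a query.

```python
import re

# Hypothetical mini-vocabulary; a real tool would use a curated geoscience ontology.
GEO_TERMS = ["subangular", "subrounded", "greywacke", "sandstone", "shale"]

# Hypothetical legacy documents, keyed by an invented document id.
DOCS = {
    "report_1987": "Core shows subangular, subrounded greywacke over shale.",
    "paper_2003": "Greywacke beds are interbedded with sandstone.",
    "memo_1995": "Drilling schedule and budget update.",
}


def extract_terms(text, vocabulary=GEO_TERMS):
    """Return the vocabulary terms that appear as whole words in the text."""
    lowered = text.lower()
    return {
        term
        for term in vocabulary
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    }


def build_index(documents):
    """Map each term to the sorted list of document ids that mention it."""
    index = {}
    for doc_id, text in documents.items():
        for term in extract_terms(text):
            index.setdefault(term, set()).add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def analogous_documents(query, documents):
    """Rank documents by the number of geoscience terms shared with the query."""
    query_terms = extract_terms(query)
    scores = {}
    for doc_id, text in documents.items():
        shared = query_terms & extract_terms(text)
        if shared:
            scores[doc_id] = len(shared)
    # Most shared terms first; ties are broken alphabetically by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    print(analogous_documents("subangular, subrounded greywacke", DOCS))
    print(build_index(DOCS)["greywacke"])
```

Exact word matching is the simplest possible choice here. It misses synonyms and spelling variants, such as “graywacke,” which is where an ontology or a learned language model would presumably come in.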

This is all part of Knowledge Map Graphing. Remember our coffee cup from earlier? Imagine all the relationships between different coffee cups displayed as a graph. This graph could also allow you to find all the other coffees in town that came from the same bean-growing location, judge the quality of a bean by the farm it came from, and so on.
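To make the coffee graph concrete, here is a small Python sketch that stores facts as (subject, relation, object) triples and answers the “same bean location” question. Every name in it, including the cups, the coffees, the farms, and the relation labels, is invented for illustration. It shows the general shape of a knowledge graph, not any particular Studio X implementation.

```python
# Hypothetical facts as (subject, relation, object) triples.
TRIPLES = [
    ("cup_a", "holds", "house_blend"),
    ("cup_b", "holds", "dark_roast"),
    ("cup_c", "holds", "decaf"),
    ("house_blend", "grown_at", "farm_north"),
    ("dark_roast", "grown_at", "farm_north"),
    ("decaf", "grown_at", "farm_south"),
]


def build_graph(triples):
    """Index triples as graph[subject][relation] -> set of objects."""
    graph = {}
    for subject, relation, obj in triples:
        graph.setdefault(subject, {}).setdefault(relation, set()).add(obj)
    return graph


def neighbors(graph, node, relation):
    """Objects linked to `node` by `relation`; empty set if there are none."""
    return graph.get(node, {}).get(relation, set())


def same_origin(graph, coffee):
    """Other coffees that share at least one farm with `coffee`."""
    farms = neighbors(graph, coffee, "grown_at")
    return sorted(
        other
        for other in graph
        if other != coffee and farms & neighbors(graph, other, "grown_at")
    )


if __name__ == "__main__":
    graph = build_graph(TRIPLES)
    print(same_origin(graph, "house_blend"))
    print(sorted(neighbors(graph, "cup_a", "holds")))
```

Swap the cups for wells, cores, or reports and the farms for basins or formations, and the same lookup would link legacy geoscience documents that describe the same piece of the Earth.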

Studio X and Data Hygiene

We believe data management for current and legacy data is an important endeavor. That’s why we are committed to data cleanliness, both now and in the future. Currently, XCover creates detailed records of the new data that geoscientists generate in assignments. On the Xeek platform, we maintain specific requirements around clean document code for all submissions to challenges. Going forward, the Studio X Data Science Team and Xeek Tools will be developing an AI data hygienist. AI data hygiene has the potential to uncover and fully digitize the global document data corpus, creating near-endless connections in geoscience and unleashing the next generation of data insights.