Xeek Tools: Table Extractor
Meet the tool that allows you to export legacy geology data within seconds.
Geoscientists have been collecting data and placing it in tables for decades. There is a tremendous wealth of information in legacy documents that can be used to help evaluate opportunities for the energy transition, such as the reservoir quality of carbon capture targets or analyses of groundwater samples. Unfortunately, these data stay stuck unless a geoscientist sacrifices a portion of their day to type them out manually. This is a well-known pain point for geoscientists, so when Studio X started planning our first tranche of tools, a Table Extractor was at the top of the priority list.
Who Is the Table Extractor For?
Our goal was to create a tool that is helpful for any technical worker. It is quite common for a region of the world to see a resurgence in interest many years after it was first studied, and this resurgence is especially prevalent in teams working on the energy transition. Legacy knowledge is key for understanding the potential targets for carbon capture, the geochemistry of subsurface fluids for geothermal or rare elements, and new sources of energy, such as green hydrogen. The original sponsor team for the Table Extractor tool was a group of oil and gas geochemists who needed a way to build a new database of water samples, as they saw water management becoming their primary focus in the decades to come. Having a tool that can extract data trapped in tables from old documents is transformative for these scientists’ digital workflows.
How it Works
To use the Table Extractor, a user first uploads PDF reports that contain data tables. Once the reports are uploaded, an algorithm runs through each page of the documents and identifies whether the page has a table on it. If a table is present, the page is extracted and passed to the Optical Character Recognition (OCR) phase. A separate algorithm in this phase recognizes the rows, columns, headers, and text of the table, then converts them to a digital format.
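To make the two-stage flow concrete, here is a minimal Python sketch of the same shape: first decide whether a page contains a table, then recover its header and rows. It is not the Table Extractor's actual code. The real tool relies on machine learning models and OCR, while this sketch assumes the page text is already available as lines and swaps in a simple whitespace heuristic for both stages. Every name here (split_columns, page_has_table, parse_table, extract_tables, ExtractedTable) and the sample values are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass

# Assumption: two or more whitespace characters (or a tab) separate columns.
COLUMN_GAP = re.compile(r"\s{2,}|\t")


@dataclass
class ExtractedTable:
    page_number: int
    header: list
    rows: list


def split_columns(line):
    """Split one line of page text into cells on wide whitespace gaps."""
    return [cell for cell in COLUMN_GAP.split(line.strip()) if cell]


def page_has_table(lines, min_rows=3):
    """Stage 1 stand-in: flag a page when at least `min_rows` consecutive
    lines share the same column count (greater than one)."""
    run = 0
    previous_width = 0
    for line in lines:
        width = len(split_columns(line))
        if width > 1 and width == previous_width:
            run += 1
        elif width > 1:
            run = 1
        else:
            run = 0
        previous_width = width
        if run >= min_rows:
            return True
    return False


def parse_table(page_number, lines):
    """Stage 2 stand-in: treat the first multi-column line as the header and
    keep only the following lines whose column count matches it."""
    multi_column = [cells for cells in map(split_columns, lines) if len(cells) > 1]
    header, body = multi_column[0], multi_column[1:]
    rows = [cells for cells in body if len(cells) == len(header)]
    return ExtractedTable(page_number, header, rows)


def extract_tables(pages):
    """Run both stages over every page; page numbers start at 1."""
    tables = []
    for page_number, lines in enumerate(pages, start=1):
        if page_has_table(lines):
            tables.append(parse_table(page_number, lines))
    return tables


# Invented sample pages, purely illustrative.
sample_pages = [
    ["Regional overview of the basin.", "No tabular data on this page."],
    [
        "Table 3. Formation water analyses",
        "Sample  Depth_m  pH",
        "W-1  1200  7.1",
        "W-2  1350  6.8",
    ],
]
tables = extract_tables(sample_pages)
```

Splitting the work this way means the cheaper page-level check filters out pages without tables, so the more expensive structure recognition only runs where it is needed.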
All of these steps happen within seconds. From the user’s perspective, they simply see a processing page while this work is being done. Once digitization is complete, the results are presented in a clean interface that lets the user select which document to review first and lists every page that contains a table alongside its new digital output for quality control. Depending on the age or quality of the document, OCR extraction can make mistakes, so the tool allows the user to flip between the digitized table and an image of the original to check the result. Finally, the user can select “Export” and get a series of CSV files of the extracted tables to use in their next processing steps.
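For readers who want to picture the export step, the sketch below writes one CSV file per extracted table using only the Python standard library. It is an illustration under stated assumptions, not the tool's export code: the function names (table_to_csv_text, export_tables), the dictionary layout keyed by document name and page number, and the file naming pattern are all hypothetical.

```python
import csv
import io
from pathlib import Path


def table_to_csv_text(header, rows):
    """Serialize one table to CSV text; the csv module quotes any cell
    that contains a comma or a quote character."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def export_tables(tables, out_dir):
    """Write each table to its own file and return the paths written.

    `tables` maps (document_name, page_number) -> (header, rows).
    The "<document>_page<N>.csv" naming is an assumption of this sketch.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for (document, page), (header, rows) in tables.items():
        path = out_dir / f"{document}_page{page}.csv"
        # newline="" stops the file layer from rewriting line endings.
        with path.open("w", newline="", encoding="utf-8") as handle:
            handle.write(table_to_csv_text(header, rows))
        written.append(path)
    return written
```

Keeping one file per table mirrors the "series of CSV files" described above and makes it easy to trace a file back to the page it came from during quality control.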
Experience the Table Extractor
The Studio X team took great care to create a tool that will actually make a difference for geoscientists and their work. The Table Extractor went through several iterations and machine learning models to find the most reliable output, especially for poorer-quality documents. We used a mix of open-source and proprietary Python packages to tackle this problem, and the Data Science Team and Designers worked together on a layout and flow that make it easy for users to jump into the tool and start getting value out of it.
The Table Extractor is free to use for anyone who signs up on xeek.ai. This is the first iteration of the tool, and we want as many users as possible to have access to it. We want our users to break it, find gaps, and point out our errors! Together we can refine the Table Extractor and create a tool that is a huge benefit to the geoscience community.