Meta Data Enrichment of Open Datasets | Data Science Dojo Blog

datasciencedojo · February 18, 2019, 6:05pm

The last few years have seen great advancement in AI technologies for data science and analytics. According to McKinsey, “Data is now a critical corporate asset—and its value is tied to its ultimate use … Value is likely to accrue to the owners of scarce data, [and] to players that aggregate data in unique ways.” With analytics engines capable of ingesting and analyzing almost any amount and type of data, the bottleneck has shifted from the technology to the data itself.

Addressing the need for reliable and diverse big data sources

There is growing availability of open data sets from governmental and other data sources. To be usable and actionable, these data sets first need to be discoverable. Discovery starts with the metadata, traditionally the Achilles heel of open data. Open data sets are published on thousands of disparate websites, usually with poor (or even wrong) metadata. They are there to be used and their value realized, but locating the relevant ones is often difficult or even impossible.

The Schema.org framework is an important step in the right direction. It calls for upgrading and standardizing metadata so that data sets become easier to discover. Eventually, we hope to see data professionals able to discover data sets according to more detailed parameters, such as granular locations, company names, professional terms and other information found within the data sets.
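To make this concrete, here is a minimal sketch of what a Schema.org description of a data set might look like when generated from Python. The property names (`name`, `description`, `keywords`, `spatialCoverage`, `distribution`) come from the Schema.org `Dataset` type; the helper function `build_dataset_jsonld`, the data set itself and its URL are hypothetical and exist only for illustration.

```python
import json


def build_dataset_jsonld(name, description, keywords, spatial_coverage, download_url):
    """Return a Schema.org Dataset description as a JSON-LD string."""
    record = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "keywords": keywords,
        # Granular location information helps location-based discovery.
        "spatialCoverage": {"@type": "Place", "name": spatial_coverage},
        "distribution": {
            "@type": "DataDownload",
            "encodingFormat": "text/csv",
            "contentUrl": download_url,
        },
    }
    return json.dumps(record, indent=2)


# Hypothetical data set: every value below is made up for illustration.
example = build_dataset_jsonld(
    name="City bike lane inventory",
    description="Locations and lengths of bike lanes, updated yearly.",
    keywords=["transportation", "cycling", "infrastructure"],
    spatial_coverage="Seattle, WA",
    download_url="https://example.org/data/bike-lanes.csv",
)
print(example)
```

A publisher would typically embed the resulting JSON-LD in the data set's landing page so that crawlers can read it. The richer and more accurate these fields are, the more a search tool has to work with.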

To make the most of the Schema.org framework, publishers need to adhere to the format, and there is no guarantee that they will. While regulation is slow in this area, initiatives like the Federal Open Data Policy push to make data sets increasingly discoverable, reusable and comparable.

The way we access open data is changing

We’re delighted to see that many independent and third-party solutions are pushing for deeper metadata enrichment, making it easier for data professionals to benefit from a wealth of relevant data. We believe metadata enrichment is the most important element, and it is no easy task when building a centralized repository that must cope with the sheer volume of data we generate.

Some techniques could include:

- Smart phrase extraction from within the data files, to essentially enable users to conduct “in-file search” for the most relevant terms.

- Contextual enrichment: understanding the data set and the context of the data in order to properly tag and classify it. For example, does the word ‘Jordan’ refer to the country, the river, or the NBA icon?

- Understanding the relationships between data sets, for example similar data sets, complementary ones that expand on your knowledge, and other types of related data sets.

- Crowd / peer wisdom and usage-based metadata enrichment, for example, user input on data quality and usability.
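As a rough illustration of the first and third techniques, the sketch below extracts the most frequent terms from the text of a few files and then scores how related two data sets are by the overlap of those terms (Jaccard similarity). It is a deliberately simple stand-in, not a description of how any particular search engine works. The file names, file contents, stopword list and the functions `extract_terms` and `jaccard` are all hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "of", "and", "in", "for", "a", "to", "by", "on"}


def extract_terms(text, top_n=5):
    """Return the most frequent non-stopword terms found in a file's text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [term for term, _ in counts.most_common(top_n)]


def jaccard(terms_a, terms_b):
    """Share of terms two data sets have in common (0.0 to 1.0)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical file contents, made up for illustration.
files = {
    "rainfall_2018.csv": "station,month,rainfall_mm\nAmman,January,63\nAmman,February,58\nIrbid,January,91",
    "rainfall_2019.csv": "station,month,rainfall_mm\nAmman,January,70\nIrbid,February,85",
    "nba_scoring.csv": "player,season,points\nMichael Jordan,1996,2491\nScottie Pippen,1996,1496",
}

# "In-file" terms that could be added to each data set's metadata.
terms = {name: extract_terms(text) for name, text in files.items()}

related = jaccard(terms["rainfall_2018.csv"], terms["rainfall_2019.csv"])
unrelated = jaccard(terms["rainfall_2018.csv"], terms["nba_scoring.csv"])
print(terms)
print(related, unrelated)
```

The extracted terms also hint at context: a file whose terms include "player" and "points" is more likely about the NBA icon than the country. Real contextual enrichment would need far more than term counts, but the principle of using what is inside the file is the same.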

With the demand for data going through the roof, third-party open data discovery tools are starting to emerge that address the limitations search engines have had until now. Some are focused on a specific area, such as Enigma for the financial market. Others are coming from the big players, such as Google Dataset Search.

We believe that a continued focus on metadata enrichment will allow data professionals to see the incredible potential value of open data come to fruition.

About the Author

Assaf Katan is the CEO & Co-Founder of Apertio, the first open data deep search engine. Assaf is an accomplished executive with 20 years of experience in both startup and corporate environments, and is passionate about closing deals and desert hiking.

This is a companion discussion topic for the original entry at https://blog.datasciencedojo.com/open-datasets/