What type of Data Catalog do you really want?
Today's Data Catalog products are built around the way we worked with data some time ago. The Data Catalog exists because most companies have very siloed data. The introduction of the Data Lake changed nothing: we still “need” a way to know what tables, files, and databases we have lying around our workplace, and then we need to run data preparation on them to realise their potential. This type of Data Catalog will soon be needed far less than it is today. Why? Because this way of treating data is the exact reason why we can’t get value out of our data today.
Let’s walk through the normal process of using a Data Catalog and then highlight the wins and flaws of this approach. Things usually start with the business requesting some data, e.g. “We need to know all our customers across all locations.” The IT team then types the word “customers” into the Data Catalog, and a list of results comes back with the Tables, Databases, Applications, and Columns that have “customer” in their name.
Here is the first fatal flaw. We are assuming that all systems call it “Customers”, and we don’t have to guess whether that holds: they don’t. Let’s just assume the search actually did work and that we have been a good company with consistent naming. Now we need to blend the data together. The IT team now needs to be business domain experts to figure out how to blend the different customer tables and databases into one list of unified customers.
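To make the naming flaw concrete, here is a minimal sketch of a name-based catalog search. Everything in it is an assumption made for illustration: the system names, the table and column names, and the `search_catalog` helper are hypothetical and do not come from any particular Data Catalog product.

```python
# Hypothetical catalog entries: (system, object type, object name).
# All of these names are invented for illustration.
CATALOG = [
    ("CRM", "table", "Customers"),
    ("ERP", "table", "Accounts"),
    ("Billing", "table", "client_master"),
    ("Webshop", "column", "customer_email"),
    ("Support", "table", "Contacts"),
]


def search_catalog(keyword, catalog=CATALOG):
    """Return the entries whose name contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [entry for entry in catalog if needle in entry[2].lower()]


# What the IT team sees when it searches for "customer".
hits = search_catalog("customer")

# What the search never surfaces, although it also holds customer data.
missed = [entry for entry in CATALOG if entry not in hits]

for system, kind, name in hits:
    print(f"found : {system}.{name} ({kind})")
for system, kind, name in missed:
    print(f"missed: {system}.{name} ({kind})")
```

In this toy catalog the search finds the CRM table and the Webshop column, while the ERP, Billing, and Support systems stay invisible because they use a different word for the same thing. Finding them requires someone who already knows that “Accounts”, “client_master”, and “Contacts” hold customers, which is exactly the business domain knowledge the search was supposed to replace.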