April 22nd, 2022 | 11 min read
Driving data science with a data quality pipeline
High-quality, trusted data is the foundation of Machine Learning (ML) and Artificial Intelligence (AI). It is essential to the accuracy and success of ML models. In this article, we’ll discover how CluedIn contributes to driving your Data Science efforts by delivering the high-quality data you need.
CluedIn not only provides your teams with tooling that improves the quality of the data fed to your ML models, but also simplifies the iterations by which you can evaluate their effectiveness.
The five Vs of Data Quality
The term “data quality” is overused and can mean many things. As Machine Learning and Big Data are still both evolutionary fields with developments in each complementing the other, we’ll approach it from an angle you may already be familiar with – the five Vs of Big Data (Volume, Variety, Velocity, Value and Veracity).
Volume

Put simply, the accuracy of any statistical model improves with the amount of data you feed into it – the more you have, the better the accuracy. In technical terms, the larger your sample size, the smaller the margin of error (and the narrower the confidence interval). The challenge for most enterprises, however, is how to consolidate data that comes from several legacy and new systems in a consistent manner. Those systems will vary in terms of technology stack, stakeholders, security levels and data structures, which means that putting them together could, in theory, take years.
CluedIn simplifies the process of collecting large amounts of data by providing several out-of-the-box data crawlers that take into account aspects like batching. Most of our crawlers are open source, so you can potentially build one in a matter of hours.
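CluedIn’s crawlers are, of course, more fully featured than this, but the batching idea they take into account can be sketched in a few lines of Python. The record source and batch size below are hypothetical, purely for illustration:

```python
from typing import Dict, Iterator, List

def crawl_in_batches(records: Iterator[Dict], batch_size: int = 100) -> Iterator[List[Dict]]:
    """Group a stream of source records into fixed-size batches for ingestion."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch

# Example: 250 crawled records emitted in batches of 100 (100, 100, 50)
source = ({"id": i} for i in range(250))
batches = list(crawl_in_batches(source, batch_size=100))
```

Batching like this keeps memory use flat no matter how large the source system is, which is what makes high-volume collection practical in the first place.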
Additionally, CluedIn is built on a combination of microservices and can autoscale the computing power needed while a high volume of data is being processed.
Finally, all these different systems from which we plan to get data are likely to have different models and naming conventions. CluedIn lets you follow an ELT (Extract, Load, Transform) approach across unlimited external enrichers and your own data sources, which means that, based on your business priorities, you can model only the relevant extracted data and take care of the rest later.
Variety

We’ve seen that the volume of data has an impact on the quality of an ML model. There are also hidden types of data flowing through your organization that can contribute to that volume. The fact is that most traditional MDM solutions don’t make it easy for you to ingest them.
One of the reasons for this is that these MDM systems are heavily focused on ingesting structured data – i.e. database tables. But what about email messages, presentations, documents, etc.? These unstructured data sources contain valuable reference points too – companies, people, projects and several other types of entities/domains that are important to your business. Using a mix of NLP (Natural Language Processing) and NER (Named Entity Recognition), CluedIn can help you to identify them and make the respective records more robust, all while increasing the volume of data as discussed above.
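To make the idea concrete, here is a deliberately toy sketch of entity extraction from unstructured text. It uses a fixed lookup table rather than a statistical NLP model, and the entities and email are invented; it is not how CluedIn’s NER works internally, only an illustration of the kind of output such a step produces:

```python
import re

# Toy gazetteer of known entities; a real NER model generalises
# far beyond a fixed lookup table like this one.
KNOWN_ENTITIES = {
    "Acme Corp": "Organization",
    "Jane Smith": "Person",
    "Project Falcon": "Project",
}

def extract_entities(text: str) -> list:
    """Return (surface form, entity type) pairs found in unstructured text."""
    found = []
    for surface, entity_type in KNOWN_ENTITIES.items():
        if re.search(re.escape(surface), text):
            found.append((surface, entity_type))
    return found

email_body = "Jane Smith asked Acme Corp for an update on Project Falcon."
entities = extract_entities(email_body)
```

Each extracted pair becomes a reference point that can be linked back to the corresponding company, person or project record, enriching it with evidence from sources a table-oriented MDM would never touch.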
Additionally, there is nothing to stop you from integrating other linguistic algorithm services with CluedIn to make it even more powerful.
Semi-structured formats can also contain valuable information. JSON and XML structures were born to be flexible in order to accommodate change over time. The good news is that CluedIn chose a graph data structure as the foundation of its main store, which means the platform can easily accommodate the changes your business will inevitably go through.
Velocity

Depending on your industry, your models can change very quickly. Agility, automation and collaboration are going to be key here. Your Data Science teams should not have to worry about how to get valuable data into their ML models. They need a way of working with engineering teams to easily centralise data in a place where business users can perform semantic modelling and cleaning before the data is handed over to Data Science teams.
CluedIn offers virtually infinite ways of ingesting data so your engineers just need to pick the simplest one: databases, automated webhook endpoints, static files and custom connectors.
This enables business users, who are the experts in their domain, to model and clean that new data in readiness for the Data Scientists, who can then focus on tweaking and testing the models’ efficiency instead of spending a huge amount of time acquiring domain knowledge and cleaning the data themselves.
CluedIn offers several mechanisms for identifying issues in data as well as cleaning it. Once a cleaning action is performed, CluedIn can even automatically create Rules so that any future values that meet the same criteria will be fixed with zero human effort.
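The shape of such a rule, a condition paired with a fix, can be illustrated with a small sketch. The `Rule` class and the country-normalisation example are hypothetical, not CluedIn’s actual rule engine:

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Rule:
    """A cleaning action captured once, then applied to all future values."""
    field: str
    condition: Callable[[str], bool]  # does this value need fixing?
    fix: Callable[[str], str]         # how to fix it

    def apply(self, record: Dict[str, str]) -> Dict[str, str]:
        value = record.get(self.field, "")
        if self.condition(value):
            return {**record, self.field: self.fix(value)}
        return record

# A one-off cleanup ("normalise country names") becomes a reusable rule
rule = Rule(
    field="country",
    condition=lambda v: v.strip().lower() in {"uk", "u.k."},
    fix=lambda v: "United Kingdom",
)

fixed = rule.apply({"name": "Acme", "country": "UK"})
```

The point is that the human effort is spent once, when the rule is defined; every matching value that arrives afterwards is corrected automatically.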
Finally, as discussed, autoscaling can also play an important role to get data moving as quickly as possible to where it needs to be.
Value

The famous cliché “Garbage In, Garbage Out” (GIGO, for short) is used so often because it is absolutely true. How do you ensure that the output of your ML models is something you can derive actionable business value from?
Let’s think about the process of iteratively putting an ML model into Production:
- Define business problem
- Acquire and prepare data
- Develop the model
- Deploy the model
- Monitor the model's performance
CluedIn’s security features, user interface and even pricing model have been built to allow users from all backgrounds to work together to define a business problem (step 1) and acquire, govern and clean the relevant data (step 2), ensuring that it is prepared for the steps that follow.
As you can see, we have the right person at the right desk throughout the project cycle. Business users can define which data points are outliers, which fall within thresholds and which conform to defined standards. But they may not be able to work on the technical models further down the line.
At step 5, when we monitor a model’s performance, almost inevitably over time there will be a need for more data. That means you need to go through the process again, and CluedIn will make that easier.
This leads to another point: the value of your data also relates to where it is. In other words, how easily it can be obtained for your ML initiatives.
If you have CluedIn, you are not starting from scratch with your data preparation. For example, imagine you have all your data in a lake. You have a file called customers.csv in a raw format and you want to predict churn. A Data Engineer will break out their best Data Science and Engineering toolset. The project goes out the door and everyone is happy. Two weeks later, another initiative related to customer data begins. With CluedIn, because you have this data in an operational store and not just as a raw file on a lake, your second initiative can piggyback on all the good work that has already been done to prepare the data; hence accelerating your time to value for the second insight project.
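As an illustration of that reuse, here is a minimal sketch with a hypothetical `prepare_customers` helper: preparation work (deduplication, normalisation) that the first initiative pays for once, and the second simply inherits:

```python
def prepare_customers(raw_rows):
    """Clean raw customer rows: trim and title-case names, drop duplicate ids."""
    seen, prepared = set(), []
    for row in raw_rows:
        if row["id"] in seen:
            continue  # same customer already seen from another source
        seen.add(row["id"])
        prepared.append({**row, "name": row["name"].strip().title()})
    return prepared

raw = [
    {"id": 1, "name": "  alice jones "},
    {"id": 1, "name": "Alice Jones"},   # duplicate from a second source
    {"id": 2, "name": "BOB LEE"},
]

# Initiative 1 (churn prediction) pays the preparation cost once...
customers = prepare_customers(raw)
# ...and initiative 2 (say, segmentation) starts from the clean result.
```

With a raw file on a lake, each initiative repeats this work; with an operational store, the cleaned `customers` is simply where the second project begins.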
Veracity

The further away you are from the data source, the harder it is to trust its veracity.
You can solve this in three ways:
First, CluedIn Quality Metrics such as accuracy and uniformity contribute to your understanding of the credibility of the data at the global, source and record levels. If any of these falls outside an acceptable threshold, you can automatically stop ingesting or sending the data.
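The metric names and threshold values below are assumptions chosen for illustration; the sketch simply shows how a quality floor could automatically gate a data flow:

```python
# Hypothetical quality floors per metric (not CluedIn's actual defaults)
THRESHOLDS = {"accuracy": 0.90, "uniformity": 0.80}

def should_block(metrics: dict) -> bool:
    """Block ingestion or export when any metric falls below its floor."""
    return any(metrics.get(name, 0.0) < floor for name, floor in THRESHOLDS.items())

healthy = {"accuracy": 0.95, "uniformity": 0.85}
degraded = {"accuracy": 0.60, "uniformity": 0.85}
```

A gate like this turns data quality from a report you read after the fact into a control that stops bad data from ever reaching downstream consumers.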
Second, we have customers that automatically “tag” records according to which business validation procedures they have been through. So, if a record has not yet been marked as “good to go”, your CluedIn Stream can block it.
Third, as mentioned, CluedIn encourages business users to participate in the process of building datasets as opposed to only technical stakeholders. This way, the likelihood of getting untrusted data downstream is mitigated or can be easily fixed without involving too many people. For example, if a Power BI report fed by CluedIn has inconsistent data, the business user can go ahead and fix that in CluedIn on their own. No product owners, no developers, no sprints.
Machine Learning models are only as useful as the data used to train them. We have described how the five Vs of Big Data are a great reference for achieving the required data quality, and how CluedIn contributes to each of those individual aspects.
From a talent perspective, we acknowledge that data comes from several places, involves different technologies, diverse types, and spans across several business domains. This makes it quite difficult to find the breadth of skills required in one person. Therefore, data science is a team sport, not an individual effort.