August 3rd, 2022 | 9 min read
Why the time has come to retire your reference data
In the world of traditional data management, reference data and master data are treated as two different categories of data. Reference data is used to classify or categorize other data, while master data is business-critical data that is shared by multiple systems, applications, and processes. Conventionally speaking, examples of master data include customer data, product records and vendor data. Reference data includes code lists, taxonomies, and hierarchies of data, amongst other things.
But times have changed, and the advent of modern Master Data Management (MDM) – an approach which does away with old-fashioned classifications and hierarchies – essentially means that there is little to no difference between reference data and master data any more. In many ways, reference data is a relic of technology that forced us to denormalise models and treat data in an unnatural way. Everything is considered a lookup today, including what was traditionally thought of as reference data, such as colours, countries and currencies. In reality, reference data is just master data, and master data is just... data. You can see where this is going.
Just as countries appear in lots of data, so do Domains in general. In the world of Graph (which is pivotal to modern MDM), Entities connect to Entities, rather than Entities to Properties, which is how reference data is typically modelled.
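To make the distinction concrete, here is a minimal Python sketch using a toy in-memory graph. The `Entity` class, its `connect` and `neighbours` helpers, the relationship names and the sample customer are all hypothetical and do not come from any particular MDM product. It contrasts a country held as a property value with a country modelled as an Entity that other Entities connect to.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A node in a toy in-memory graph (hypothetical, for illustration only)."""
    domain: str
    name: str
    properties: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (relationship, Entity) pairs

    def connect(self, relationship: str, other: "Entity") -> None:
        """Add a directed, named edge from this Entity to another Entity."""
        self.edges.append((relationship, other))

    def neighbours(self, relationship: str) -> list:
        """Return the Entities reached through edges with the given name."""
        return [entity for rel, entity in self.edges if rel == relationship]


# Reference-data style: the country is only a code stored in a property.
customer_as_row = Entity("Customer", "Acme Ltd", properties={"country_code": "DK"})

# Graph style: the country and the currency are Entities in their own right.
denmark = Entity("Country", "Denmark", properties={"iso_code": "DK"})
krone = Entity("Currency", "Danish krone", properties={"iso_code": "DKK"})
denmark.connect("HAS_CURRENCY", krone)

customer = Entity("Customer", "Acme Ltd")
customer.connect("LOCATED_IN", denmark)

# Because Entities connect to Entities, we can walk from customer to currency.
currency_names = [
    currency.name
    for country in customer.neighbours("LOCATED_IN")
    for currency in country.neighbours("HAS_CURRENCY")
]
print(currency_names)
```

In the property-based version there is nothing to traverse: the code `"DK"` has to be joined back to a lookup list before it means anything, whereas the Entity version carries its relationships with it.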
In fact, all arguments to maintain reference data can easily be quashed by the more modern approach. Reference data muddies the water and overcomplicates the MDM discussion. It could even be argued that master data does the same.
If master data is slow-moving, then reference data is even slower. Historically, reference data has been managed differently because it is very static and rarely changes. Why does that even matter? In classic database design, you don't call tables different things just because of the data they contain. You call them tables.
Metadata that refers to reference data sets may document:
- The meaning and purpose of each reference data value domain
- The reference tables and databases where the reference data appears
- The source of the data in each table
- The version of the reference data that is currently available
- When the reference data was last updated
- Maintenance description for the reference data
- Business data stewardship information for the reference data
In the world of ontologies, this is no longer needed, and it always makes sense to remove unnecessary steps in a process. Wikipedia is the best example of this. Wikipedia is a web of objects that talk to each other. There is no differentiation between reference and master data: data is data and objects are objects. A country is a country. A currency is its own thing that has relationships to other objects.
Master data is data that relates to the business entities that provide context for business transactions. Unlike reference data, master data values are not usually limited to predefined domain values. Business rules typically dictate the format and permitted ranges of master data values. Common organizational master data includes data concerning:
- Categories such as individuals, organizations, roles, customers, citizens, patients, vendors, suppliers, business partners, competitors, employees and students.
- Products, internal and external, inventory, and related concepts.
- Financial structures, including general ledger accounts, cost centres, profit centres, etc.
- Location concepts, for the organizations and individuals and other entities that concern the enterprise.
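The traditional contrast drawn above, between reference data limited to predefined domain values and master data governed by business rules on format and range, can be sketched in Python as follows. This is an illustration only: the three currency codes are a small sample, and the customer ID format and credit limit range are invented business rules, not taken from any real system.

```python
import re

# Reference data: a closed list of predefined domain values (sample codes only).
CURRENCY_CODES = {"EUR", "GBP", "USD"}


def is_valid_currency(code: str) -> bool:
    """A reference value is valid only if it appears in the predefined list."""
    return code in CURRENCY_CODES


# Master data: open-ended values constrained by business rules.
# Both rules below are hypothetical and exist purely for illustration.
CUSTOMER_ID_FORMAT = re.compile(r"C\d{6}")  # "C" followed by six digits
CREDIT_LIMIT_RANGE = (0, 1_000_000)         # permitted range, inclusive


def is_valid_customer(record: dict) -> bool:
    """A master record is valid if it satisfies the format and range rules."""
    low, high = CREDIT_LIMIT_RANGE
    return (
        CUSTOMER_ID_FORMAT.fullmatch(record["customer_id"]) is not None
        and low <= record["credit_limit"] <= high
    )


print(is_valid_currency("EUR"))
print(is_valid_customer({"customer_id": "C000123", "credit_limit": 5000}))
```

Both checks are ordinary validation over ordinary data. The only difference is whether the rule is a membership test or a format and range test, which is part of why the split between the two categories looks increasingly artificial.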
In the context of a classic Relational Database, the idea of denormalising a Countries table so that it can be used as a reference for other tables sounds like a good idea. However, the future of MDM is widely acknowledged to be based on the Graph. In the Graph world, you do not denormalise to tables, you denormalise to records. With this flexibility, each record can evolve in its own way and provide its own schema. It is not tied to an expected schema that has to match all other Countries, for example.
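The per-record flexibility described above can be sketched in a few lines of Python. This assumes records are plain dictionaries; the attribute names (`cantons`, `states`) and the `schema_of` helper are hypothetical choices made for the example, not a prescribed model.

```python
# Each Country record carries only the attributes that make sense for it.
countries = [
    {"domain": "Country", "name": "Denmark", "iso_code": "DK"},
    {"domain": "Country", "name": "Switzerland", "iso_code": "CH", "cantons": 26},
    {"domain": "Country", "name": "United States", "iso_code": "US", "states": 50},
]


def schema_of(record: dict) -> set:
    """The schema of a record is simply the set of attributes it provides."""
    return set(record)


# Attributes every Country record happens to share.
shared = set.intersection(*(schema_of(record) for record in countries))

# Attributes that individual records have evolved on their own.
extras = {
    record["name"]: sorted(schema_of(record) - shared) for record in countries
}

print(sorted(shared))
print(extras)
```

A fixed Countries table would need nullable `cantons` and `states` columns on every row, or a schema change each time one country needs something new. Here the shared schema is discovered from the records rather than imposed on them.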
In many ways, the sooner we stop talking about master data, the better. What should we be saying instead?
We should be speaking in Domains, and that is it. Domains are consumable and understandable by all. As soon as you talk about master data, the first question that usually crops up is "What is considered master data?" Why add that extra layer of complexity? Domains are a key part of MDM, but they appear in ALL data projects. MDM does not have the monopoly on Domains.
What was traditionally considered MDM has confused a subject that should be explained very differently.
Whether you call it Data Mesh, Data Fabric or modern MDM, there is definitely a need for SOMETHING to translate all of the data that sits across your business in an easy, scalable and agile manner. Unfortunately, MDM has traditionally placed extremely tight and rigid demands on data, inherently taking the approach of "nobody change a thing!" Guess what: everyone changed everything, and your upfront, schema-driven, top-down approach didn't work!
The traditional Data Warehouse also promised this but, like traditional MDM, it leans towards having rigid domain tables to rule them all.
Managing reference data properly is important to any organization since reference data carries the context of data transactions through its semantic content (code value descriptions, location data, and other contextual information). Reference data can be used to drive business logic that helps execute a business process, designate an application to perform specific actions, or provide meaningful segmentation to analyse transaction data. Also, mapping reference data often requires human judgement, so the need for intervention by business data stewards in the reference data management process cannot be overlooked.
Reference data management was traditionally thought of as important for several reasons. Reference data:
- Describes the structures used in the organization (internal department codes, internal product codes, employee organization codes, internal location codes, etc.)
- Describes the common data used in organizations that are external but connected to the organization (e.g., geographical, currency, country, diagnosis coding structures)
- Provides assistance and support to analytics and business intelligence (e.g., classification codes).
Organizations with a high demand for data entry, including healthcare, insurance, and government entities, experience significant data quality challenges due to improper coding of reference data values. These errors can be costly in several ways. Additionally, many organizations rely on hundreds of individually developed reference files or tables, and each one requires updating and periodic quality review. In fact, this is a big reason why we still see companies managing reference data in Excel! Since most organizations do not have sufficient staff to perform these reference data tasks, the tasks may not be performed at all. The reference data then becomes outdated, causing errors in application performance and data integration.
So where do we go from here? If we look five years into the future, the modern MDM movement will make it clear that reference data is a relic of the past. Reference data is just data, master data is just data. However, just talking in data is still too abstract. The sooner we steer the data discussion towards speaking about Domains, the easier it will be to generate insights with our data. Having initiatives to move the needle concerning Domains such as Customers, Products, Issues and Vendors will move companies closer to insight and further away from unnecessary complexity.