The recent Data Quality Blog Olympics with Charles Blyth, Jim Harris, Henrik Liliendahl Sørensen got me thinking about this more in depth. From a pure technical standpoint, can you have a single version of the truth? Does it necessarily have to be a shared version of the truth? I’m thinking you can have a single version of the truth but it doesn’t have to be a shared version of the truth. I think what you can’t have is a single interpretation of the truth.
In simplistic terms, data is the data – wherever you get it from. You need to define it and define it well. For instance, the color of the product – as I define it – is “blue”. The height is 6 inches. I define my fields for the height and the unit of measure for that height as well as the color. Those are absolute values. They are the truth, per se.
Let’s try a few more relatively simple ideas. Pretend I am run a store. In my store I like to assign a “customer number” to each of my customers because that’s what I started doing dozens of years ago. My preference is to use their social security number (yes, I know it is a bad practice, but I choose to do this). However, some customers don’t have a social security number. I need to have another method of assigning them a number so that I can uniquely identify them. I now have to have a method to
- capture and store the social security number
- identify that the social security number is the customer number (perhaps I store it twice – once in SS# field and once in the customer number field, perhaps I use a pointer, perhaps my business logic looks to the SS# field and only looks to customer number field if SS# is blank)
- create a customer number if SS# is not available.
- make sure that the created customer number can be uniquely identified as not being a SS#
Now, I know what the customer’s SS# is if they have one (assuming they gave it to me). I also know that my customer number is – which may or may not be their social security number. My single version of the truth for the customer number is whatever number I have for them – SS# or my created number. I need to interpret – perhaps – whether it is a SS# or not.
Perhaps that’s not a good example. What about transient data like “ship to address”. Is the current address accurate? By accurate do we mean “do they live their now?” or do we mean “did we spell it correctly?” Do we keep previous versions of the address so we know history? Still, the information is absolute – whatever the answer is. And we know it is kept separately from “billing address” though the two might be the same. If we want to know who we ship products to, we can know based on ship to or billing address – and doing sales analysis separately against both can yield different results. But the actual data is absolute. And we know when each sale was placed for each ship to and billing address.
Marketing, sales and the finance department might all look at the same data set and interpret it to mean different things, but the absolute numbers are still the same thing. They may aggregate different pieces of information – like color, size, and price by ship to address or color size and cost by customer number. All those pieces of data may be absolute, but the end result of the use of that data is a different interpretation of the impact to the business.
I’m still having problems with the various case studies for why you can’t accurately and definitively assign absolute values – a single version of the truth – to a field of data. Perhaps its free form text fields that are left to interpretation? Perhaps it is that some of the data entered may be inaccurate? If it is only entered in one place and then electronically shared to other secondary instances of the field (in other systems) you still have a single version of the truth – it just may not be accurate. Is it because we just don’t want to capture the same information in 5 different ways to satisfy the different usages people will have for it? I know that different states and different railroads refer to rail crossing information fields differently. But if the authoritative body defines the meaning of the fields that it uses, the various other entities may interpret this information in their own “language” but it is still defined by the authoritative body.
Perhaps I’ve been lucky in my work so far, as the problems seem to always be in executing on the rules once defined (someone put width in the height field). Yes, initial agreement on definition can be tough, but agreement can and should be reached. It may be a leadership issue rather than a data issue if you can’t.
What are your thoughts? Give me some examples of where you can’t have a uniquely defined field in your line of work. Are there different industries, lines of business, or organizations where this is more difficult?