Trying to simplify the data quality debate just a bit.
So, I read the three posts by Jim Harris, Henrik Liliendahl Sorensen and Charles Blyth in their good-natured debate on data quality. The problem is that they seemed so theoretical and – in some ways – abstract that the meaning got lost in the messages. I found Jim’s to be the most understandable, though I think Charles’ entry was perhaps a bit more in line with what I see as data quality. Still, they all left me wanting in some ways.
I think my view is a bit simplistic compared to those of these three data quality thought leaders, but here are my thoughts on data quality with an eye toward their posts.
- Information is data in context. If you ensure that data is defined accurately at a granular level and have an objective way to verify that the data is complete, accurate and normalized, the users of the data will put it in context – as long as they have access to, and understand, the definitions of that granularly defined data. This “granular definition” is the key point I like from Charles’ post. In my mind, information “in context” is the key “subjective” dimension mentioned by Jim.
- If the data is defined at an appropriately granular level, you should be able to arrive at a single version of the truth. This may result in many, many granular definitions – many more than any one user might care for. In fact, any individual user may choose to roll up groups of granularly defined data into meta-groups. And that is fine if that data fits the context for which they want to use it. But it first has to be defined at the most granular level possible.
- Once the data has been defined at the granular level, a single system of record needs to be chosen for every defined piece of data. If a piece of data can be entered into more than one system, that needs to stop. All the secondary systems (those that are not the system of record) need to be cut off from manual updates to help 1) ensure accuracy across all instances and 2) prevent “redefinition” by someone with a silo-influenced view. While this might be thought of as an MDM or governance-related concept, I think it is core to the needs of quality data. Both the definition and the value of the data must be made sacrosanct. A rough sketch of what this could look like follows this list.
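To make the system-of-record idea a bit more concrete, here is a minimal sketch (in Python, with entirely hypothetical field and system names) of a registry that records which system owns each granularly defined piece of data and rejects manual updates coming from anywhere else. It isn’t meant to reflect any particular MDM product – just the shape of the rule described above.

```python
# Hypothetical sketch: a registry of granular data definitions, each with a
# single designated system of record. Manual updates from any other system
# are rejected, which keeps both the definition and the values sacrosanct.

from dataclasses import dataclass

@dataclass(frozen=True)
class DataDefinition:
    name: str              # granular, unambiguous name
    description: str       # the agreed-upon definition users must understand
    system_of_record: str  # the only system allowed to accept manual updates

REGISTRY = {
    "item_net_weight_kg": DataDefinition(
        name="item_net_weight_kg",
        description="Net weight of a single sellable unit, in kilograms, "
                    "excluding all packaging.",
        system_of_record="ERP",
    ),
}

def apply_manual_update(field: str, value, source_system: str) -> None:
    """Allow a manual update only if it comes from the system of record."""
    definition = REGISTRY[field]
    if source_system != definition.system_of_record:
        raise PermissionError(
            f"{field} is owned by {definition.system_of_record}; "
            f"manual updates from {source_system} are not allowed."
        )
    # ...write the value to the system of record here...
    print(f"{field} updated to {value} in {definition.system_of_record}")

apply_manual_update("item_net_weight_kg", 0.45, "ERP")    # accepted
# apply_manual_update("item_net_weight_kg", 0.45, "CRM")  # raises PermissionError
```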
With data appropriately defined at a granular level, and with both the definitions and the actual data values protected, users are free to put that data into any context they wish. From where I sit, if the user understands the definitions, it is their prerogative – and in some cases their job – to leverage the data they have access to in different ways to move the organization forward. They put the data into context and create information. If they are properly informed, those users can decide for themselves if the data is “fit for their intended purposes,” as Henrik mentions.
In many ways I think the subjective use of the data should be separated from the objective process of securing the accuracy of the individual values. The data is either objectively accurate or it is not. The key to this, though, is making sure the users are fully informed about – and understand – the definitions of the data. In fact, this may be the most difficult aspect of data quality. Technically, I think defining data at a granular level is relatively easy compared to making sure users understand those definitions.
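As a rough illustration of that separation, the objective side could be imagined as a set of checks run against each granular definition, independent of any downstream use. The field names, ranges and rules below are invented for illustration; the point is only that “objectively accurate or not” can be evaluated without knowing the user’s context.

```python
# Hypothetical sketch: objective accuracy checks tied to granular definitions.
# Whether the data is "fit for purpose" is a separate, subjective question
# answered later by the informed user.

CHECKS = {
    # field name -> rule that returns True when the value is objectively valid
    "item_net_weight_kg": lambda v: isinstance(v, (int, float)) and 0 < v < 1000,
    "country_of_origin":  lambda v: isinstance(v, str) and len(v) == 2,  # ISO-style code
}

def failed_fields(record: dict) -> list:
    """Return the fields that fail their definition-level checks."""
    return [
        field
        for field, rule in CHECKS.items()
        if field in record and not rule(record[field])
    ]

record = {"item_net_weight_kg": -2.0, "country_of_origin": "US"}
print(failed_fields(record))  # ['item_net_weight_kg']
```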
Then there is the cross-enterprise use of data – and whether your definitions align with the rest of the world. A recent study conducted by AMR Research and GXS shows that about 1/3 of all data originates outside an organization (40% in automotive, 30% in high tech, about 34% in retail). Industry standard definitions come in handy, but everyone has to use them to bring real value to the various enterprises (see the retail supply chain’s use of the Global Data Synchronization Network as an example of good definitions but limited adoption). We will still probably need to translate definitions between organizations, and that is where the breakdown comes. Like much of the success in the RFID world, data quality is best managed in a closed-loop environment. Once the data leaves your four walls (or your back-office systems), you can’t easily control 1) its continued accuracy, 2) its definition, or 3) the interpretation of that definition by external users. Even with a global standard, GDSN participants had trouble figuring out how to measure their products, so height, depth and width have been transposed many times.
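On the translation point, a very small sketch of what mapping internal definitions to an external or industry-standard vocabulary might look like is below. The standard field names, units and dimension handling shown are invented for illustration, not taken from GDSN or any other specification.

```python
# Hypothetical sketch: translating internal field definitions to a partner's
# (or industry standard's) vocabulary. The explicit, per-field mapping matters
# because transposed height/width/depth is exactly the kind of thing that
# slips through when each party interprets the definition differently.

INTERNAL_TO_STANDARD = {
    # internal field -> (assumed standard field, unit conversion)
    "pkg_height_cm": ("packageHeight_mm", lambda v: v * 10),
    "pkg_width_cm":  ("packageWidth_mm",  lambda v: v * 10),
    "pkg_depth_cm":  ("packageDepth_mm",  lambda v: v * 10),
}

def to_standard(record: dict) -> dict:
    """Translate an internal record into the (assumed) standard vocabulary."""
    out = {}
    for internal_name, (standard_name, convert) in INTERNAL_TO_STANDARD.items():
        if internal_name in record:
            out[standard_name] = convert(record[internal_name])
    return out

internal = {"pkg_height_cm": 30.0, "pkg_width_cm": 20.0, "pkg_depth_cm": 10.0}
print(to_standard(internal))
# {'packageHeight_mm': 300.0, 'packageWidth_mm': 200.0, 'packageDepth_mm': 100.0}
```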
I think, though, that companies that focus on internal data accuracy will be in a much better position to achieve cross-enterprise accuracy as well. The key here is that eliminating the internal challenges will reduce the occurrence of problems and help narrow the list of places to look if any do arise. And having that good data can help make an ERP Firewall much more valuable.
Of course, this topic is something that could take thousands of pages to cover and years to write – and still miss most of the challenges, I’m sure. But sticking to the three points above for internal data quality can go a long way towards making better decisions, improving the bottom line, and making the overall enterprise more effective.