The Data Quality Blog Olympics

Trying to simplify the data quality debate just a bit.

So, I read the three posts by Jim Harris, Henrik Liliendahl Sorensen and Charles Blyth in their good-natured debate on data quality.  The problem is that they seemed so theoretical and – in some ways – abstract that I found the meaning got lost in the messages.  I found Jim’s to be the most understandable, though I think Charles’ entry was perhaps a bit more in line with what I see as data quality.  Still, they all left me wanting in some ways.

My Take

I think my view is a bit simplistic compared to these three data quality thought leaders, but these are my thoughts on data quality with an eye towards these three blogs.

  1. Information is data in context.  If you ensure that data is defined accurately at a granular level and you have an objective way to ensure your data is complete, accurate and normalized, the users of the data will put it in context – as long as they have access to, and understand, the definitions of that granularly defined data.  This “granular definition” is the key point I like from Charles’ post.  In my mind, information “in context” is the key “subjective” dimension mentioned by Jim.
  2. If the data is defined at an appropriately granular level, you should be able to arrive at a single version of the truth.  This may result in many, many granular definitions – many more than any one user might care for.  In fact, any individual user may choose to roll up groups of granularly defined data into meta-groups.  And that is fine if that data fits the context for which they want to use it.  But it first has to be defined at the most granular level possible.
  3. Once the data has been defined at the granular level, a single system of record needs to be chosen for every defined piece of data.  If it can be entered into more than one system, that needs to stop.  All the secondary systems (those other than the system of record) need to be cut off from manual updates to help ensure 1) accuracy across all instances and 2) that no one with a silo-influenced view can “redefine” the data.  While this might be thought of as an MDM or governance related concept, I think it is core to the needs of quality data.  Both the definition and the value of the data must be made sacrosanct.
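The roll-up idea in point 2 can be sketched in a few lines of Python.  The records and field names here are hypothetical, purely to show the shape of the idea: each row is a granularly defined fact, and different users build their own meta-groups from the same protected values.

```python
# Hypothetical granularly defined sales records: each row is one fact
# whose definition (color, region, units) is fixed at the source.
from collections import defaultdict

sales = [
    {"color": "red",   "region": "east", "units": 400},
    {"color": "red",   "region": "west", "units": 600},
    {"color": "green",  "region": "east", "units": 250},
    {"color": "blue",  "region": "west", "units": 150},
]

# One user's meta-group: total widgets sold, regardless of color or region.
total_units = sum(row["units"] for row in sales)

# Another user's roll-up: units sold by color.
by_color = defaultdict(int)
for row in sales:
    by_color[row["color"]] += row["units"]

print(total_units)       # 1400
print(dict(by_color))    # {'red': 1000, 'green': 250, 'blue': 150}
```

Both users put the same granular data into different contexts, but neither can change the underlying values or their definitions – which is the point.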

With appropriately defined data at a granular level that has both the definition and the actual data values protected, the users can be free to put that data into any contexts they wish to.  From where I sit, if the user understands the definitions, it is their prerogative – and in some cases their job – to leverage that data which they have access to in different ways to move the organization forward.  They put the data into context and create information.  If they are properly informed, those users can decide for themselves if the data is “fit for their intended purposes” as Henrik mentions.

In many ways I think the subjective use of the data should be separated from the objective process of securing the accuracy of the individual values.  The data is either objectively accurate or it is not.  The key to this, though, is making sure the users are fully informed about – and understand – the definitions of the data.  In fact, this may be the most difficult aspect of data quality.  Technically, defining data at a granular level may be relatively easy compared to making sure users understand those definitions.

Then there is the cross-enterprise use of data – and whether your definitions align with the rest of the world.  A recent study conducted by AMR Research and GXS shows that about one-third of all data originates outside an organization (40% in automotive, 30% in high tech, about 34% in retail). Industry standard definitions come in handy, but everyone has to use them to bring real value to the various enterprises (see the retail supply chain’s use of the Global Data Synchronization Network as an example of good definitions but limited adoption).  We will still probably need to translate definitions between organizations.  That is where the breakdown comes.  Like much of the success in the RFID world, data quality is best managed in a closed-loop environment.  Once data leaves your four walls (or your back office systems) you can’t easily control 1) its continued accuracy, 2) its definition, or 3) the interpretation of that definition by external users.  Even with a global standard, GDSN participants had trouble figuring out how to measure their products, so height, depth and width have been transposed many times.

I think, though, that companies that focus on internal data accuracy will be in a much better position to have cross-enterprise accuracy as well.  The key here is that eliminating the internal challenges will reduce the occurrences of problems and help narrow the list of places to look if any do arise.  And having that good data can help make an ERP Firewall much more valuable.

Of course, this topic is something that could take thousands of pages to cover and years to write – and still miss most of the challenges, I’m sure.  But sticking to the three points above for internal data quality can go a long way towards making better decisions, improving the bottom line, and making the overall enterprise more effective.



5 comments on “The Data Quality Blog Olympics”

  1. Jim Harris says:


    Thank you very much for taking the time to participate in the discussion sparked by the Data Quality Blog Olympics. It is greatly appreciated, especially since that was the goal of the “contest.”

    Yes, I admit I was concerned about being overly theoretical and abstract. I have a general tendency to do that in many of my blog posts. But I often try to first reach an understanding with regards to theory before delving into practice – because if my theories are off-center, then my practical advice probably wouldn’t help very much.

    I think that your emphasis on context and understandable granular definitions is terrific.

    I also appreciate you helping raise the critical point about external data (on which Henrik also raised great points), especially since the need to synthesize the enterprise’s version(s) of the truth with (as you well-worded it) “cross-enterprise accuracy” was entirely absent from my blog post.

    Thanks and Best Regards,


  2. Pradheep Sampath says:

    Bryan, great post. Glad to see data quality discussions unfold in Olympic proportions! I don’t think companies can afford to wait to get proficient in perfecting the quality of their internal data before looking externally for incremental benefits. As you observe, a large percentage of the data that drives business applications – be it ERP, SCM, WMS or TMS – originates externally. Thus, an integration platform that is neutral to both the origin and destination of such data is best positioned to accomplish what I consider to be the 3Rs of transactional data quality – Reject, Remediate & Resubmit – so that internal data quality burdens don’t snowball.

    • Pradheep, I agree wholeheartedly with the idea of the integration platform. I’d quibble just a bit and say that it is hard to make an ERP Firewall that is integration platform-bound without having some good data (as well as business rules) to drive the validation.

      That said, integration is always preferable to manual data entry – where you know you will have problems.

  3. Bryan, as I mentioned yesterday, thank you for joining in the debate. I said I would respond so here goes …

    I would like to pick up on two points here (paraphrased):

    1) You talk about users of data putting it into context. Is this not risking multiple versions of the truth? Certainly users will take information and create derived analysis from it; that’s why information exists, and, as you say, it is how companies move forward. However, I fear that if you take the context out of the granular definition of the data you risk ambiguity and ‘free world’ data definition, which goes against Data Governance principles. I feel that you should define the data in context at the source, i.e. Data Warehouse, MDM Hub, Source System etc., thus avoiding any possibility of doubt.

    2) You should aim to have only one source of data entry. This would be ideal, a Utopia for Data Governance experts and resources; however, is that not what SAP, Oracle etc. have attempted to do, and failed at? MDM is born out of the fact that companies are disparate, and so are their systems – ERP, CRM etc. However, even if you do have a single point of data entry, you will still need Data Governance, because you cannot get away from the fact that on virtually every occasion there will be a human or free-form process entering data into the system, and they need to be governed.

    As Jim says, it is great that you have raised the point on external data – one of the reasons why I liked Henrik’s entry in the debate. This is an area that needs further discussion and its profile raised.

    Thanks again for joining in, I love the opportunity to debate these topics with new people with different view points.



    • Charles,

      I agree that the data, when defined, has to be defined in context. Absolutely. It can’t be defined, I believe, without context. However, most data is then aggregated with other data to create the context in which the user wants to use it. For example, 1000 red widgets were sold – but where and when did they sell, and to whom? The context for the absolute value of the number of widgets sold is one thing. The user might want to know how many widgets of all types (red, green and blue) were sold. So, while each piece has its own context, the aggregate is the user putting all that data into the context in which they wish to view it. There are probably two layers of context: the first is the absolute value’s context, which is really part of its definition; the second is how the user leverages or utilizes the data and the meaning they give it in that usage.

      Making a gross generality, the financial person might care about total numbers sold and revenue while the buyer might care about numbers sold, where they were sold, what demographic and the colors so they can fine tune the purchase and sale of the products.

      The absolute values will remain absolute. Once each piece of data is put next to other pieces, the whole becomes more subjective in context and – as you mention – potentially ambiguous.

      Regarding point two, there are certain areas where the data is unique enough that perhaps you can have a single place to enter data. But I appreciate these are few and far between. Still, when possible, companies need to strive for reduced manual data entry. With 2-5% of all keystrokes being in error, it doesn’t take long for a record of 10 ten-character fields to reach a 50% field-level error rate: if one character in each of 5 fields is entered incorrectly, 5 fields out of 10 are in error – even though the per-character rate is only 5%.
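      The arithmetic behind that field-level rate can be checked with a quick sketch, using the 5% per-character figure from the comment above:

      ```python
      # Probability that a 10-character field contains at least one keying
      # error, assuming a 5% per-character error rate (independent errors).
      per_char_error = 0.05
      chars_per_field = 10

      field_error = 1 - (1 - per_char_error) ** chars_per_field
      print(round(field_error, 3))  # 0.401: roughly 4 of every 10 fields bad

      # Expected number of bad fields in a 10-field record:
      print(round(10 * field_error, 1))  # 4.0
      ```

      So even under a simple independence assumption, a 5% keystroke error rate pushes roughly 40-50% of fields into error – close to the 5-in-10 scenario above.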

      The migration of B2B away from EDI and other electronic means of transacting and to manual portals has helped drop at least two industry leaders in the consumer electronics space out of their top positions (they blamed supply chain issues but they somehow forgot to mention they had moved their transactions to portals).

      Governance must continue to be practiced – absolutely. But implementing manual procedures because a portal is easy (in Global Data Synchronization, most early adopters – and, I believe, most current users – are still manually entering data into portals or spreadsheets as opposed to doing machine-to-machine transfer of product data) just breaks all those best practices.

      Much appreciation for your comments!

