Bad Data – When to Let it Go?

My friend Cliff recently approached me with a problem.  His organization has tasked him and his team with analyzing, amongst other things, the depth of their bad data problems in advance of replacing their financial systems.  Initial indications are that their data is not just bad, it is very, very bad.  His question?  When is it ok to leave the data behind and not port it over to the new system?  When should he just “let it go”?

In looking at his problem, it is obvious that many of the issues stem from decisions made long before the current financial system was implemented.  In fact, at first glance, it looks like decisions made as long as 20 years ago are still shaping the current system and threaten to render the new system useless from the get-go.  If you’ve been around long enough, you know that storage used to be very costly, so fields were sized to just the right length for then-current use.  In Cliff’s case, certain practices were adopted that saved space at the time but led to – with the help of additional bad practices – worse problems later on.

When we sat down to look at the issues, some didn’t look quite as bad as they initially appeared.  In fact, certain scenarios can be fixed in a properly managed migration.  For example, for some reason bank routing numbers are stored in an integer field.  This means that leading zeros are dropped.  To work around this, scripts were written to move the leading zeros to the end of the routing number before storing it in the field.  Though I haven’t seen it, I’ve got to assume that any downstream use of that same field includes the reverse procedure in order to create a legitimate bank routing number.  Of course, when a real bank routing number and a re-combined number end up being the same, there are problems.  He hasn’t yet determined whether any such collisions exist.  If not, then migrating this data should be relatively easy.
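For what it’s worth, here is a rough sketch of how that reversal might be attempted during migration.  The rotation scheme and the function names are my assumptions about how the legacy scripts behave, not anything I’ve seen in Cliff’s system; the one hard fact baked in is the standard ABA checksum that valid 9-digit routing numbers must satisfy.

```python
ABA_WEIGHTS = (3, 7, 1, 3, 7, 1, 3, 7, 1)


def passes_aba_checksum(rn: str) -> bool:
    """True if rn is a 9-digit string that satisfies the standard ABA checksum."""
    return (
        len(rn) == 9
        and rn.isdigit()
        and sum(w * int(d) for w, d in zip(ABA_WEIGHTS, rn)) % 10 == 0
    )


def candidate_routing_numbers(stored_value: int) -> list[str]:
    """Possible original routing numbers for an integer-stored, rotated value.

    Assumes the legacy scripts moved leading zeros to the end before storing
    (e.g. '021000089' stored as 210000890).  Each trailing zero is rotated
    back to the front in turn, and a candidate is kept if it passes the ABA
    checksum.  More than one survivor is exactly the collision problem
    described above and needs human review.
    """
    rotated = str(stored_value).zfill(9)  # defensive padding; rotated values should already be 9 digits
    candidates = []
    for _ in range(9):
        if passes_aba_checksum(rotated) and rotated not in candidates:
            candidates.append(rotated)
        if not rotated.endswith("0"):
            break  # no trailing zero left to rotate back to the front
        rotated = "0" + rotated[:-1]  # move one trailing zero back to the front
    return candidates
```

If the function returns exactly one candidate, the migration can restore the routing number automatically.  Zero or multiple candidates are precisely the collision cases Cliff hasn’t yet sized up, so those rows would need to go to a manual review queue.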

Another example is the long-ago decision to limit shipping reference numbers to 10 digits.  This presents two challenges for him.  The first is that many shipping numbers they generate or receive today are 13-digit numbers.  The second is that they generate many small package shipments in sequence, so the last 3 digits often really, really matter.  When reference numbers grew beyond the 10 digits the original programmers thought would be enough, a bright soul decided to repurpose a rarely used 3-digit reference field to store the remaining digits.  Probably not a bad problem when they had few significant sequential shipments.  However, since the current system has no way to natively report on the two fields in combination – and for some reason no one was able to write a report to do so – every shipment must be manually tracked or referenced with special look-ups each time they need to check on one.  Once again, this problem should be fixable by combining the fields when migrating the data to their new system, although certain dependencies may make it a bit more difficult.
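Conceptually, the recombination step in the migration could be as small as the sketch below.  The field names, and the assumption that the original field holds the first 10 digits while the repurposed field holds the trailing 3, are mine rather than details from Cliff’s system.

```python
def combined_shipping_reference(main_ref: str | None, overflow_ref: str | None) -> str | None:
    """Reassemble a shipping reference split across two legacy fields.

    Assumption: the original 10-digit field (main_ref) holds the leading
    digits and the repurposed 3-digit field (overflow_ref) holds the final
    three digits of a 13-digit reference; plain 10-digit references leave
    the overflow field empty.  These are illustrative names, not the
    actual schema.
    """
    main = (main_ref or "").strip()
    if not main:
        return None  # nothing to carry forward for this row
    tail = (overflow_ref or "").strip()
    if not tail:
        return main  # an old 10-digit reference, migrated as-is
    return main + tail.zfill(3)  # zfill keeps leading zeros in the sequential tail


# Invented values, purely to show the intent:
assert combined_shipping_reference("1234567890", "042") == "1234567890042"
assert combined_shipping_reference("1234567890", "") == "1234567890"
```

The dependencies he mentioned are the real work: any report, interface, or downstream process still expecting the two separate fields has to be found and repointed at the combined value.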

Unfortunately, there are so many manual aspects to the current system that 10 years of fat-fingered data entry have led to some version of data neuropathy – where the data is so damaged the organization has trouble sensing it.  This numbness becomes organizationally painful in day-to-day functioning.

Early on in my data quality life, another friend told me “Missing data is easy. Bad data is hard.”  He meant that if the data was missing, you knew what you had to do to fix it.  You had to find the data or accept it was not going to be there.  But bad data?  Heck, anything you look at could be bad – and how would you know?  That’s difficult.

So, this is Cliff’s challenge.  The data isn’t missing, it is bad.  But how bad?  The two scenarios above were the easy fixes – or could be.  The rest of the system is just as messed up or worse.  Because it is a financial system, it is hard to imagine getting rid of anything.  Yet bringing anything forward could corrupt the entire new system.  And trying to clean all the data looks to be an impossible task.  Another friend was familiar with an organization that recently faced a similar situation.  For them, the answer was to keep the old system running as long as they needed it for reference purposes – it will be around for years to come.  They stood up the new system and migrated only a limited, select, known-good set of data to help jump-start the new system.

This approach sounds reasonable on the surface, but there may be years of manual cross-referencing between the systems – and the users of the old system will need to be suspicious of everything they see in that system.  Still, they have a new, pristine system that, with the right data firewalls, may stay relatively clean for years to come.  How did they know when enough was enough?  How did they know when to let their bad data go?

I’d love to see your thoughts and experiences in this area.

2 comments on “Bad Data – When to Let it Go?”

  1. garymdm says:

    Data migrations are, in my opinion, the perfect opportunity to clean your data – as discussed in this post http://dataqualitymatters.wordpress.com/2011/09/23/data-migrations-good-opportunity-to-improve-data-quality/

    When I published that post I got a lot of flak – mostly from system integrators arguing that the budget must be used only for the functional delivery of the new system, and that it would be wasted on frivolous data issues.

    As you point out, simply taking all the data across replicates the problems of the original system. At the same time, not all issues can be resolved, so a compromise must be found.

    One environment I know just took balances across to a new system – they now have no way to defend billings when clients query them, because they have no history. Rather short-sighted advice from the system integrator.

    A good place to start is with an understanding of what the new system must be able to achieve, combined with a proper data quality audit using a decent profiling tool like Trillium Software Discovery to create a baseline. Data cleansing procedures can then be planned to migrate and clean the necessary data, address issues that must be resolved manually, and identify data that will be left behind (because it is not fit for purpose).

    Incidentally, the idea of running the old system in parallel is not outrageous – particularly if it is used to update the migrated data over time until they are in synch.

    One bonus of using automated data cleansing processes is that they can be reused to maintain the data in the new system on an ongoing basis – fewer problems going forward.

    One last point: if the reason they want to replace their financial system is just bad data, then they may want to consider fixing the data in the old system. It may not be fashionable and it won’t get the new system’s sales rep a new car – but it may solve the problem at a fraction of the cost of a new system. If they don’t fix the data, the new system may give them the same problems.

    • Great information, GaryMDM. Thanks for taking the time to provide such deep feedback. I’ve already discussed your thoughts with my friend and I’ve sent him your post.

      Once again, much appreciated.

      Bryan
