In my second project, one year into my career as an ERP consultant, I was tasked to participate in a data migration project. One of our clients had acquired a competitor and had to migrate them from their current (soon-to-be legacy) system to the global M3 solution of the mother company. As the target solution was already defined, it was believed that all that had to be done was to work backwards from the data architecture of the new solution and find the corresponding data in the legacy system.
Me and a couple of peers from iStone were brought in as target system experts and set to collaborate with two legacy system experts in mapping where the required data resided in the legacy data structure so that it could be extracted, transformed and loaded to the new solution. It was all supposed to be pretty much one-to-one. Well, it wasn’t.
In the initial workshops all seemed fine. Almost all the required data was available in the legacy system. Mapping rules were written and sent to developers for implementation. Loading tests started, but abruptly failed. Several of the fields that were mandatory in the target system were empty. Through defaulting and various level of manual manipulation we managed to squeeze a fair chunk of data in to allow process testing to start, but the results were devastatingly poor. Calamity, re-work, delays and finger pointing ensued.
So what went wrong? In one word: assumptions.
It was assumed that the legacy solution was configured adhering to the system standard, and that the stored data was complete and of good quality. However, through a mix of system modifications, user mistakes, workarounds and sloppiness the legacy data had taken on a shape very different from what was documented, and completely unrecognizable from the system standard. To fix this mess, we had to resort to data quality tactics to find and remedy the problems with the source data before it was migration ready. First of all, we had to start with a thorough data profiling session.
In retrospect it seems obvious that we would have to investigate the actual source data, as opposed to relying on documentation and naive ideal-world descriptions of what it was supposed to look like. Profiling data doesn’t have to be complicated at all and should have been the first step on the roadmap.
Data profiling software is rapidly being improved and is very useful for quickly getting to grips with new and unknown data sources. There are currently several free options available (Bloor research have made a good summary here). More familiar data structures can even be profiled very efficiently using just an excel sheet or other standard software.
Our approach was to arrange a series of workshops by data area (customer data, items, pricing etc) involving super users and process owners from the business together with target and source system consultants. Together we examined an excel extract of the source data and column by column we went through a sequence of steps:
- The source system expert would explain the purpose of the field in the legacy solution (since column names from the data base are rarely particularly descriptive for business users).
- The business super users would validate if this was how the field was actually used in practice.
- The target system expert would validate if the same data was stored by the target system. If not, this would be documented as a solution gap (since the data would not be available in the new solution it most likely means there will be process or at least document output differences from the legacy solution, some of which could be critical).
- Data was sample checked to find deviations in format, patterns and level of completeness (number of fields with blank, NULL or non-sense values entered, like a single dot in the CustomerLastName field).
- The target system expert would then present the data structure of the new solution, column by column.
- The process owner would explain the usage and purpose of the data. The legacy system super user could then verify if they had corresponding data in their system. If not, this was noted in the issue log.
The lessons learned
This slightly tedious but simple exercise helped reveal not only missing or erroneous source data but also several gaps in the global target solution that would cause the acquired business unit huge problems if a solution was not defined. Some data errors would have been caught by the automatic syntax validation of the target systems API’s, but many would have been accepted and caused havoc after go-live. This applies in particular to data used by the business in the legacy system but which lacks an equivalent in the new solution.
As a side benefit we also got much better user involvement and business ownership of the data, two other areas where we had been struggling. We were far from done, but at least we had a full overview of the complexity and tasks to start chewing through.
Incorrectly defined migration rules are probably the most common source of data quality problems
This all goes to show that all areas of data management are tightly related. A data migration might be a one-off project, but will be very dependent on the data quality of the source system as well as have a great impact on the quality of the target system data. Incorrectly defined migration rules are probably the most common source of data quality problems, even before manual entry errors.
These were my humble beginnings in the data quality area. It is a tale of beginner naivety, but hopefully there are lessons to be learned and mistakes to be avoided for others.