Where Is Your Bad Data Coming From?

As Kimball documents in The Data Warehouse Lifecycle Toolkit (available in all good book stores), there are five concepts that, taken together, define data quality:

Accuracy – The correctness of values contained in each field of each database record.

Completeness – Users must know the minimum data required for a record to be considered complete and to contain enough information to be useful to the business.

Consistency – High-level or summarized information is in agreement with the lower-level detail.

Timeliness – Data must be up-to-date, and users should be made aware of any problems by use of a standard update schedule.

Uniqueness – One business or consumer must correspond to only one entity in your data. For example, Jim Smyth and James Smith at the same address should somehow be merged as these records represent the same consumer in reality.

So using Kimball’s list, we might know what kind of data we want in the database. Unfortunately, despite our best intentions, there are forces conspiring against good data quality. It doesn’t take a forensics degree to find them, but there are so many sources of poor data that you may not even know where to look. For that, we’ve come up with our own list. Let’s take a look…

1. Data Entry Mistakes.

The most obvious of the bad data sources, these take the form of simple mistakes that employees make when entering data into the system: typos, data entered into the wrong fields, and inconsistent variations on certain data elements. Even under ideal circumstances these mistakes are easy to make and therefore extremely common, and unfortunately they can be the source of high numbers of duplicate records. But why is it so hard to get the data right? Consider these circumstances that can exacerbate your data entry process:

  • Poorly trained staff with no expectations for data entry
  • High employee turnover
  • Under-resourcing of call centres that leads to rushing customer exchanges
  • Forms that do not allow room for all the relevant info
  • Unenforced business rules because bad data is not tracked down to its source

2. Lazy Customers.

Let’s face it. Customers are a key source of bad data. Whether they are providing information over the phone to a representative or completing a transaction online, customers can deliberately and inadvertently provide inaccurate or incomplete data. But you know this already. Here are a few specific circumstances to look out for, especially in retail settings:

  • In-store business rules that permit staff to enter store addresses or phone numbers in place of the real customer info
  • Multiple ‘rewards cards’ per household or family that are not linked together
  • Use of store rewards cards that link purchases to different accounts
  • Customers that use multiple emails, nicknames or addresses without realizing it
  • Web forms that allow incorrectly formatted data elements such as phone numbers or zip codes
  • Customers pushed for time who then skip or cheat on certain data elements
  • Security concerns of web transactions that lead customers to leave out certain data or simply lie to protect their personal information

3. Bad Form

Web forms. CRMs. ERP systems. The way they are designed can impact data quality. How? Some CRM systems are inflexible and may not allow easy implementation of data rules, leading to required fields being left blank or containing incomplete data. Indeed, many web forms allow gibberish to be entered into any field, which can immediately contaminate the database. Not enough space for relevant info, or systems and forms that have not been updated to match the business process, also pose a challenge. Many systems simply do not perform an address check at entry, allowing invalid addresses into the system. When it comes to data quality, good form is everything.
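Basic field validation at entry time closes much of this gap. As a minimal sketch (the field names and US-style patterns here are illustrative assumptions, not taken from any particular system):

```python
import re

# Hypothetical validation rules for a US-style web form.
# Field names and patterns are illustrative, not from any specific CRM.
RULES = {
    "zip_code": re.compile(r"\d{5}(-\d{4})?"),
    "phone": re.compile(r"\d{3}-\d{3}-\d{4}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}

def validate(record):
    """Return the names of fields whose values fail their pattern."""
    return [field for field, pattern in RULES.items()
            if not pattern.fullmatch(record.get(field, ""))]

bad = validate({"zip_code": "90210", "phone": "gibberish", "email": "a@b.com"})
# 'phone' is flagged; a missing or empty field would be flagged too
```

Rejecting malformed values at the form, rather than discovering them in the database later, is the cheapest data quality check available.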

4. Customization Simply Reroutes Bad Data

All businesses have processes and data items unique to that business or industry sector. Unfortunately, when systems do not provide genuine flexibility and extensibility, IT will customize the system as necessary. For example, a CRM system may be adjusted to allow a full range of user-defined data (e.g. to allow a software company to store multiple licence details for each customer). Where this happens, the hacks and workarounds can lead to a lack of data integrity: you end up storing data in fields designed for other data types (dates in character fields, for example).
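The dates-in-character-fields case is worth spelling out, because the damage is silent. This illustrative sketch assumes US-style MM/DD/YYYY text dates; nothing here comes from a specific CRM:

```python
from datetime import datetime

# Illustrative only: dates stored as text in a repurposed character field
# sort lexically, not chronologically, and the field happily accepts junk
# that a real date column would have rejected at entry.
stored = ["12/01/2023", "02/15/2024", "13/45/20xx"]  # last value is junk

def parse_date(value):
    """Return a real date, or None for values no date column would accept."""
    try:
        return datetime.strptime(value, "%m/%d/%Y").date()
    except ValueError:
        return None

# The string sort puts Feb 2024 before Dec 2023:
print(sorted(stored[:2]))  # ['02/15/2024', '12/01/2023']

valid = [d for d in map(parse_date, stored) if d is not None]
print(sorted(valid))       # chronological order, with the junk filtered out
```

Every report, filter or dedupe pass that touches the repurposed field inherits this problem, which is why the workaround reroutes bad data rather than eliminating it.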

5. Data Erosion is Beyond Your Control

Businesses and consumers move. People get married and change their names. Business names change too, and contacts get promoted or replaced. Email addresses and phone numbers are constantly evolving. People die. No matter how sophisticated your systems are, some measure of data erosion is simply unavoidable. While good business rules will assist in updating data at relevant checkpoints, maintaining the best quality data means refreshing it from reliable data sources on a regular basis.

6. New Data. Bad Data. Duplicate Data.

Many businesses regularly source new prospect lists that are subsequently loaded into the CRM. These can come from a variety of places including list vendors, trade shows, publications, outbound marketing campaigns and even internal customer communications and surveys. Although it’s exciting to consider procuring a new, large database of prospects, there are two ways this addition of data can go horribly wrong. First, the data itself is always suspect, falling prey to all the potential issues of data entry, data erosion and customer error. But even if you can corroborate or cleanse the data before entering, there is still a chance you will be entering duplicate records that won’t always be quickly identified.
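Screening an incoming list before it ever reaches the CRM catches the obvious duplicates. A minimal sketch, assuming a hypothetical pipeline keyed on a normalised email address (real pipelines would match on more fields than this):

```python
# Illustrative pre-load screen for a newly purchased prospect list.
# Field names are hypothetical; matching on email alone is a simplification.
def normalise(record):
    """Collapse trivial variants (case, stray whitespace) into one key."""
    return record["email"].strip().lower()

def screen_new_records(existing, incoming):
    """Return only the incoming records not already present, in order."""
    seen = {normalise(r) for r in existing}
    fresh = []
    for rec in incoming:
        key = normalise(rec)
        if key not in seen:
            seen.add(key)
            fresh.append(rec)
    return fresh

crm = [{"email": "jim@example.com"}]
trade_show_list = [{"email": " Jim@Example.com "},   # duplicate in disguise
                   {"email": "ann@example.com"}]     # genuinely new
# screen_new_records(crm, trade_show_list) keeps only the second record
```

A screen like this only catches duplicates that normalise to the same key; records that differ in spelling rather than formatting need the fuzzier matching discussed below.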

7. Overconfidence

OK. So this may not be a true ‘source’ of bad data, but it is the most important precipitating factor. You may think that by implementing business rules or by using a CRM’s built-in duplicate detection tools you are covered. In practice, business rules are important and valuable but are never foolproof, and they require constant enforcement, evaluation and updates. Moreover, built-in data quality features are typically limited in scope, often able only to detect exact matches. They are simply not powerful enough to do the heavy lifting of a more sophisticated fuzzy and phonetic matching engine that will catch the subtle errors that lead to major data quality issues. This false sense of confidence means you can easily overlook sources of poor data and neglect to perform critical data quality checks.
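To see why exact matching falls short, consider the Jim Smyth / James Smith pair from earlier. This rough sketch pairs a simplified Soundex code for surnames with a similarity ratio for given names; production matching engines use far more robust algorithms than either:

```python
from difflib import SequenceMatcher

# Simplified Soundex consonant codes (illustrative, not a full implementation)
_CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
          **dict.fromkeys("dt", "3"), "l": "4",
          **dict.fromkeys("mn", "5"), "r": "6"}

def soundex(word):
    """First letter plus up to three consonant codes, padded with zeros."""
    word = word.lower()
    out, prev = word[0].upper(), _CODES.get(word[0], "")
    for ch in word[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]

first_a, last_a = "Jim", "Smyth"
first_b, last_b = "James", "Smith"

exact = (first_a, last_a) == (first_b, last_b)   # False: exact match misses it
phonetic = soundex(last_a) == soundex(last_b)    # True: both surnames code to S530
similarity = SequenceMatcher(None, first_a.lower(), first_b.lower()).ratio()
# 'jim' vs 'james' scores 0.5: low, but enough to flag the pair for review
```

The exact comparison that built-in tools rely on returns nothing here, while the phonetic and fuzzy signals together flag the pair as a probable duplicate.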

So if you keep these seven bad data sources in mind – are you home free? Unfortunately not. These are simply the building blocks of bad data. When even just some of these conditions occur simultaneously, the risk of bad data multiplies exponentially. The only true way to achieve the five-pronged data quality ideal outlined by Kimball (accuracy, completeness, consistency, timeliness and uniqueness) is through a comprehensive data quality firewall that addresses each of these components individually.

Stay tuned for more information on Best Practices in data quality that pinpoint specific business rules and software solutions to achieve true real-time data quality.

1 reply
  1. Andrew Dean says:

    Nice summary, though I think you missed one: what about data processing errors? IT departments are frequently called upon to process data in some way. Coming from a programming background, I know how easy it is to get this wrong, particularly if test procedures are not all that they should be.
