Where Is Your Bad Data Coming From?

As Kimball documents in The Data Warehouse Lifecycle Toolkit (available in all good bookstores), there are five concepts that, taken together, can be considered to define data quality:

Accuracy – The correctness of values contained in each field of each database record.

Completeness – Users must know the minimum data required for a record to be considered complete and to contain enough information to be useful to the business.

Consistency – High-level or summarized information is in agreement with the lower-level detail.

Timeliness – Data must be up-to-date, and users should be made aware of any problems by use of a standard update schedule.

Uniqueness – One business or consumer must correspond to only one entity in your data. For example, Jim Smyth and James Smith at the same address should somehow be merged as these records represent the same consumer in reality.

So using Kimball’s list, we might know what kind of data we want in the database, but unfortunately, despite our best intentions, there are forces conspiring against good data quality. It doesn’t take a forensics degree to track them down, but there are so many sources of poor data that you may not even know where to look. For that, we’ve come up with our own list. Let’s take a look…

1. Data Entry Mistakes.

The most obvious of the bad data sources, these take the form of simple mistakes that employees make when entering data into the system, e.g. typos, entering data into the wrong fields, or using variations on certain data elements. Even under ideal circumstances, these are easy mistakes to make and therefore extremely common, but unfortunately they can be the source of high numbers of duplicate records. But why is it so hard to get the data right? Consider these circumstances that can exacerbate your data entry process:

  • Poorly trained staff with no expectations for data entry
  • High employee turnover
  • Under-resourcing of call centres that leads to rushing customer exchanges
  • Forms that do not allow room for all the relevant info
  • Unenforced business rules because bad data is not tracked down to its source

2. Lazy Customers.

Let’s face it. Customers are a key source of bad data. Whether they are providing information over the phone to a representative or completing a transaction online, customers can deliberately or inadvertently provide inaccurate or incomplete data. But you know this already. Here are a few specific circumstances to look out for, especially in retail settings:

  • In-store business rules that permit staff to enter store addresses or phone numbers in place of the real customer info
  • Multiple ‘rewards cards’ per household or family that are not linked together
  • Use of store rewards cards that link purchases to different accounts
  • Customers who use multiple emails, nicknames or addresses without realizing it
  • Web forms that allow incorrectly formatted data elements such as phone numbers or zip codes
  • Customers pushed for time who then skip or cheat on certain data elements
  • Security concerns of web transactions that lead customers to leave out certain data or simply lie to protect their personal information

3. Bad Form

Web forms. CRMs. ERP systems. The way they are designed can impact data quality. How? Some CRM systems are inflexible and may not allow easy implementation of data rules, leading to required fields being left blank or containing incomplete data. Many web forms allow any kind of gibberish to be entered into any field, which can immediately contaminate the database. Forms that don’t provide enough space for relevant info, or systems and forms that have not been updated to match the business process, also pose a challenge. Many systems simply do not perform an address check at entry, allowing invalid addresses into the system. When it comes to data quality, good form is everything.
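To make “good form” concrete, here is a minimal, hypothetical sketch (in Python) of the kind of field-level checks a capture form could run before accepting a submission. The field names and the US-style ZIP and phone patterns are assumptions for illustration only; a production form would also verify the address itself against postal reference data rather than rely on pattern checks.

    import re

    # Illustrative field-level checks only; real forms should also verify the
    # address against postal reference data. Field names and patterns are assumed.
    ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")          # US ZIP or ZIP+4
    PHONE_RE = re.compile(r"^\D*(\d\D*){10}$")        # exactly 10 digits, any punctuation

    def validate_submission(form: dict) -> list[str]:
        """Return a list of problems found in a submitted form."""
        problems = []
        if not form.get("last_name", "").strip():
            problems.append("last_name is required")
        if not ZIP_RE.match(form.get("zip", "")):
            problems.append("zip is not a valid ZIP/ZIP+4 code")
        if not PHONE_RE.match(form.get("phone", "")):
            problems.append("phone does not contain 10 digits")
        return problems

    print(validate_submission({"last_name": "Smith", "zip": "1234", "phone": "asdf"}))
    # ['zip is not a valid ZIP/ZIP+4 code', 'phone does not contain 10 digits']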

4. Customization Simply Reroutes Bad Data

All businesses have processes and data items unique to that business or industry sector. Unfortunately, when systems do not provide genuine flexibility and extensibility, IT will customize the system as necessary. For example, a CRM system may be adjusted to allow a full range of user-defined data (e.g. to allow a software company to store multiple license details for each customer). Where this happens, the hacks and workarounds can lead to a lack of data integrity in the system, because you end up storing data in fields designed for other data types (e.g. dates in character fields).
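As a simple illustration of how this kind of workaround can be spotted after the fact, the sketch below scans a user-defined character field for values that parse as dates, so they can be migrated to a proper date column. The field contents and the accepted date formats are assumed for the example.

    from datetime import datetime

    # Hypothetical example: a character field that has been co-opted to hold dates.
    DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%d-%b-%Y")

    def looks_like_date(value: str) -> bool:
        for fmt in DATE_FORMATS:
            try:
                datetime.strptime(value.strip(), fmt)
                return True
            except ValueError:
                pass
        return False

    custom_field = ["Gold support", "03/15/2024", "2023-11-02", "N/A"]
    print([v for v in custom_field if looks_like_date(v)])
    # ['03/15/2024', '2023-11-02']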

5. Data Erosion is Beyond Your Control

Businesses and consumers change addresses. People get married and change their names. Business names change too, and contacts get promoted or replaced. Email addresses and phone numbers are constantly evolving. People die. No matter how sophisticated your systems are, some measure of data erosion is simply unavoidable. While good business rules will assist in updating data at relevant checkpoints, to maintain the best quality data it’s important to refresh it from reliable data sources on a regular basis.

6. New Data. Bad Data. Duplicate Data.

Many businesses regularly source new prospect lists that are subsequently loaded into the CRM. These can come from a variety of places including list vendors, trade shows, publications, outbound marketing campaigns and even internal customer communications and surveys. Although it’s exciting to consider procuring a new, large database of prospects, there are two ways this addition of data can go horribly wrong. First, the data itself is always suspect, falling prey to all the potential issues of data entry, data erosion and customer error. But even if you can corroborate or cleanse the data before loading it, there is still a chance you will be introducing duplicate records that won’t always be quickly identified.

7. Overconfidence

OK. So this may not be a true ‘source’ of bad data, but it is the most important precipitating factor. You may think that by implementing business rules or by using a CRM’s built-in duplicate detection tools, you are covered. In practice, business rules are important and valuable but are never foolproof, and they require constant enforcement, evaluation and updates. Moreover, built-in data quality features are typically limited in scope, often able to detect only exact matches. They are simply not powerful enough to do the heavy lifting of a more sophisticated fuzzy and phonetic matching engine that will catch the subtle errors that can lead to major data quality issues. This false sense of confidence means you can easily overlook sources of poor data and neglect to perform critical data quality checks.
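For a sense of the difference, here is a rough sketch contrasting exact matching with a simple fuzzy and phonetic check, using a small illustrative Soundex implementation and a string-similarity ratio. It is not the algorithm used by any particular product, and the 0.85 similarity threshold is an arbitrary assumption.

    from difflib import SequenceMatcher

    def soundex(name: str) -> str:
        """Very small American Soundex implementation (illustrative only)."""
        name = "".join(c for c in name.upper() if c.isalpha())
        if not name:
            return ""
        codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
                 **dict.fromkeys("DT", "3"), "L": "4",
                 **dict.fromkeys("MN", "5"), "R": "6"}
        encoded = name[0]
        prev = codes.get(name[0], "")
        for ch in name[1:]:
            code = codes.get(ch, "")
            if code and code != prev:
                encoded += code
            if ch not in "HW":        # H and W do not reset the previous code
                prev = code
        return (encoded + "000")[:4]

    def is_probable_duplicate(a: str, b: str) -> bool:
        # Exact-match tools stop at a == b; a fuzzy/phonetic engine also
        # considers sound-alike codes and overall string similarity.
        return (a.lower() == b.lower()
                or soundex(a) == soundex(b)
                or SequenceMatcher(None, a.lower(), b.lower()).ratio() > 0.85)

    print(is_probable_duplicate("Smith", "Smyth"))     # True (sound-alike)
    print(is_probable_duplicate("Jonsen", "Johnson"))  # True (sound-alike)
    print(is_probable_duplicate("Smith", "Jones"))     # False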

So if you keep these seven bad data sources in mind – are you home free? Unfortunately not. These are simply the building blocks of bad data. When even just some of these conditions occur simultaneously, the risk of bad data multiplies exponentially. The only true way to achieve the five-pronged data quality ideal outlined by Kimball (accuracy, completeness, consistency, timeliness and uniqueness) is through a comprehensive data quality firewall that addresses each of these components individually.

Stay tuned for more information on Best Practices in data quality that pinpoint specific business rules and software solutions to achieve true real-time data quality.

Data Quality and the Spill Chucker

One of my favorite software tools is the spell checker, due to its entertainment value. It’s colloquially known as the spill chucker because if you mistype “spell checker” as “spill chucker”, the spell checker sees that both “spill” and “chucker” are valid words and raises no objection – it has no concept of context. I was reminded of this the other day, when I received a resume from someone who had two stints as an “Account Manger” and was then promoted to “Senior Account Manger” 🙂 It would be very useful if the spell checker dictionary were more easily customizable, because then most business users (and probably all job applicants) would no doubt remove “Manger” from the dictionary, as they have no need for the word, or use it so infrequently that they’re happy for the spell checker to question it.

We have the same challenges with Data Quality – most data items are only correct if they are in the right context. For example, if you have a column in a table that contains last names, and then find a record that contains a company name in the last name column, it is out of context and is poor quality data. Another example I encountered nearly 20 years ago was reported in a computer magazine – a major computer company addressed a letter to:

Mr David A Wilson
Unemployed At Moment
15 Lower Rd
Farnborough
Hants
GU14 7BQ

Someone had faithfully entered what Mr. Wilson had written in the job title field rather than entering it in a Notes field – maybe the database designer hadn’t allowed for notes.

Effective Data Quality tools must allow for poorly structured data – they must be able to recognize data that is in the wrong place and relocate it to the right place. You can’t match records, correct addresses and so on effectively unless you can improve the structure of poorly structured data. Of course, the context can depend on the language – even British English and American English are different in this respect. I remember when we at helpIT first Americanized our software over 10 years ago, coming across a test case where Earl Jones was given a salutation of “My Lord” rather than simply “Mr. Jones”! Of course, “Earl” is almost certainly a first name in the US but more likely to be a title in the UK. Often, it isn’t easy programming what we humans know instinctively. Salutations for letters derived from unstructured data can be a major source of discomfort and merriment: MS Society is an organization, not to be addressed as “Dear Ms Society”; the landlord at The Duke of Wellington pub shouldn’t receive a letter starting “My Lord”; and “Victoria and Albert Museum” is an organization, not “Mr & Mrs Museum”, even if it hasn’t been entered in the Organization column.
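A very rough sketch of this kind of context check might look like the following: it flags values in a last-name column that look like organization names so they can be relocated to the organization field. The keyword list and the token-count rule are illustrative assumptions, not a complete solution.

    # Flag last-name values that are probably organization names (illustrative only).
    ORG_KEYWORDS = {"museum", "society", "ltd", "llc", "inc", "pub", "church",
                    "club", "school", "university", "association"}

    def probably_an_organization(last_name: str) -> bool:
        tokens = last_name.lower().replace("&", " and ").split()
        return any(tok.strip(".,") in ORG_KEYWORDS for tok in tokens) or len(tokens) > 3

    for value in ["Smith", "Victoria and Albert Museum", "MS Society", "O'Brien"]:
        print(value, "->", probably_an_organization(value))
    # Smith -> False, Victoria and Albert Museum -> True, MS Society -> True, O'Brien -> False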

But going back to spell checkers, maybe they’re sometimes more intelligent than we give them credit for? Just the other day, mine changed what I was attempting to type from “project milestones” to “project millstones”. I did wonder whether it knew more than I did, or maybe it was just feeling pretty negative that day…

helpIT Systems is Driving Data Quality

For most of us around the US, the Department of Motor Vehicles is a dreaded place, bringing with it a reputation for long lines, mountains of paperwork and drawn-out processes. As customers, we loathe the trip to the DMV, and while standing in line we may not give it much thought, but the reality is that poor data quality is a common culprit behind some of these DMV woes. While it may seem unlikely that an organization as large and bureaucratic as the DMV can right the ship, today DMVs around the country are fighting back with calculated investments in data quality.

While improving the quality of registered driver data is not a new concept, technology systems implemented 15-20 years ago have long been a barrier for DMVs to actually take corrective action. However, as more DMVs begin to modernize their IT infrastructure, data quality projects are becoming more of a reality. Over the past year, helpIT has begun work with several DMVs to implement solutions designed to cleanse driver data, eliminate duplicate records, update addresses and even improve the quality of incoming data.

From a batch perspective, setting up a solution to cleanse the existing database paves the way for DMVs to implement other types of operational efficiencies like putting the license renewal process online, offering email notification of specific deadlines and reducing the waste associated with having (and trying to work with) bad data.

In addition to cleaning up existing state databases, some DMVs are taking the initiative a step further and working with helpIT to take more proactive measures by incorporating real-time address validation into their systems. This ‘real-time data quality’ creates a firewall of sorts, facilitating the capture of accurate data by DMV representatives – while you provide it (via phone or at a window). With typedown technology embedded directly within DMV data entry forms, if there is a problem with your address, or you accidentally forgot to provide information that affects its accuracy, like your apartment number or a street directional (North vs. South), the representatives are empowered to prompt you for clarification.
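As a crude illustration only (real typedown verification works against an authoritative postal reference file), a completeness check at entry might look something like the sketch below. The flags indicating that a building is multi-unit or that a street needs a directional are assumed to come from that reference data.

    # Illustrative prompt logic; the two flags would come from postal reference data.
    UNIT_WORDS = {"apt", "apartment", "unit", "ste", "suite", "#"}
    DIRECTIONALS = {"n", "s", "e", "w", "ne", "nw", "se", "sw",
                    "north", "south", "east", "west"}

    def follow_up_questions(street_line: str, is_multi_unit: bool,
                            needs_directional: bool) -> list[str]:
        tokens = [t.strip(".,").lower() for t in street_line.split()]
        questions = []
        if is_multi_unit and not any(t in UNIT_WORDS for t in tokens):
            questions.append("Which apartment or suite number?")
        if needs_directional and not any(t in DIRECTIONALS for t in tokens):
            questions.append("Is that North or South?")
        return questions

    print(follow_up_questions("123 Main St", is_multi_unit=True, needs_directional=True))
    # ['Which apartment or suite number?', 'Is that North or South?']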

Getting your contact data accurate from the start means your new license is provided immediately, without you having to make another visit, or call and wait on hold for 30 minutes just to resolve a problem that could have been no more than a simple typo.

Having met several DMV employees over the past year, it’s obvious that they want you to have an excellent experience. Better data quality is a great place to start. Even while DMV budgets are slashed year after year, modest investments in data quality software are yielding big results in customer experience.


If you want to learn more about improving the quality of your data, contact us at 866.332.7132 for a free demo of our comprehensive suite of data quality products.

Golden Records Need Golden Data: 7 Questions to Ask

If you’ve found yourself reading this blog then you’re no doubt already aware of the importance of maintaining data quality through processes such as data verification, suppression screening, and duplicate detection. In this post I’d like to look a bit closer at how you draw value from, and make the best use of, the results of the hard work you invest into tracking down duplicates within your data.

The great thing about fuzzy matching is that it enables us to identify groups of two or more records that pertain to the same entity but that don’t necessarily contain exactly the same information. Records in a group of fuzzy matches will normally contain similar information with slight variations from one record to the next. For example, one record may contain a full forename whilst another contains just an abbreviated version or even none at all. You will also frequently encounter fuzzy matches where incorrectly spelt or poorly input data is matched against its accurate counterpart.
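To illustrate one small piece of this, the sketch below treats a full forename, an initial, a nickname form, and a missing forename as compatible with one another. The nickname table is a tiny illustrative assumption; real matching engines use far richer equivalence data alongside the other name and address elements.

    # Minimal forename-compatibility check (illustrative nickname pairs only).
    NICKNAMES = {"jim": "james", "bill": "william", "bob": "robert", "liz": "elizabeth"}

    def normalize(forename: str) -> str:
        f = forename.strip().lower().rstrip(".")
        return NICKNAMES.get(f, f)

    def forenames_compatible(a: str, b: str) -> bool:
        a, b = normalize(a), normalize(b)
        if not a or not b:               # a missing forename is compatible with anything
            return True
        if len(a) == 1 or len(b) == 1:   # initial versus full name
            return a[0] == b[0]
        return a == b

    print(forenames_compatible("Jim", "James"))   # True
    print(forenames_compatible("J.", "James"))    # True
    print(forenames_compatible("", "James"))      # True
    print(forenames_compatible("John", "James"))  # False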

Once you’ve identified these groups of fuzzy matches, what do you do with them? Ultimately you want to end up with only unique records within your data, but there are a couple of ways that you can go about reaching that goal. One approach is to try and determine the best record in a group of matches and discard all of the records that matched against it. Other times, you may find that you are able to draw more value from your data by taking the most accurate, complete, and relevant information from a group of matched records and merging it together, so that you’re left with a single hybrid record containing a better set of data than was available in any of the individual records from which it was created.

Regardless of the approach you take, you’ll need to establish some rules to use when determining the best record or best pieces of information from multiple records. Removing the wrong record or information could actually end up making your data worse so this decision warrants a bit of thought. The criteria you use for this purpose will vary from one job to the next, but the following is a list of 7 questions that target the desirable attributes you’ll want to consider when deciding what data should be retained:

  1. How current is the data?
    You’ll most likely want to keep data that was most recently acquired.
  2. How complete is the data?
    How many fields are populated, and how well are those fields populated?
  3. Is the data valid?
    Have dates been entered in the required format? Does an email address contain an at sign?
  4. Is the data accurate?
    Has it been verified (e.g. address verified against PAF)?
  5. How reliable is the data?
    Has it come from a trusted source?
  6. Is the data relevant?
    Is the data appropriate for its intended use (e.g. keep female contacts over male if compiling a list of recipients for a women’s clothing catalogue)?
  7. Is there a predetermined hierarchy?
    Do you have a business rule in place that requires one set of data is always used over another?

When you have such a large range of competing criteria to consider, how do you apply all of these rules simultaneously? The approach we at helpIT use in our software is to allow the user to weight each item or collection of data, so they can choose what aspects are the most important in their business context. This isn’t necessarily whether an item is present or not, or how long it is, but could be whether it was an input value or derived from supplied information, or whether it has been verified by reference to an external dataset such as a Postal Address File. Once the master record has been selected, the user may also want to transfer data from records being deleted to the master record e.g. to copy a job title from a duplicate to a master record which contains fuller/better name and address information, but no job title. By creating a composite record, you ensure that no data is lost.
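Here is a rough sketch of that weighted approach: score each record in a matched group, keep the highest-scoring record as the master, and backfill its gaps from the duplicates to form a composite. The field names, weights, and scoring rules are assumptions chosen to mirror the questions above, not a prescription.

    from datetime import date

    # Assumed weights and fields; adjust to your own business context.
    WEIGHTS = {"recency": 3.0, "completeness": 1.0, "address_verified": 5.0}

    def score(record: dict) -> float:
        days_old = (date.today() - record["last_updated"]).days
        recency = max(0.0, 1.0 - days_old / 3650)            # decays over ~10 years
        completeness = sum(bool(record.get(f)) for f in
                           ("forename", "surname", "address", "email", "job_title")) / 5
        verified = 1.0 if record.get("address_verified") else 0.0
        return (WEIGHTS["recency"] * recency
                + WEIGHTS["completeness"] * completeness
                + WEIGHTS["address_verified"] * verified)

    def merge_group(records: list[dict]) -> dict:
        master = max(records, key=score)
        composite = dict(master)
        for rec in records:                                   # backfill gaps from duplicates
            for field, value in rec.items():
                if value and not composite.get(field):
                    composite[field] = value
        return composite

    group = [
        {"forename": "James", "surname": "Smith", "address": "15 Lower Rd",
         "email": "", "job_title": "", "address_verified": True,
         "last_updated": date(2011, 6, 1)},
        {"forename": "Jim", "surname": "Smyth", "address": "",
         "email": "jim@example.com", "job_title": "Account Manager",
         "address_verified": False, "last_updated": date(2012, 1, 10)},
    ]
    print(merge_group(group))   # verified master, with email and job title backfilled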

Hopefully this post will have given you something to think about when deciding how to deal with the duplicates you’ve identified in your data. I’d welcome any comments or questions.

The Retail Single Customer View

One of the more interesting aspects of working for a data quality company is the challenge associated with solving real world business issues through effective data management. In order to provide an end-to-end solution, several moving parts must be taken into consideration, data quality software being just one of them.

A few months back, helpIT was given the challenge of working with a multi-channel retail organization seeking a Single Customer View to improve the effectiveness of their marketing efforts. They received customer data from several channels, including Point of Sale, Website, and Call Center. Their hope was to link up every transaction over the past 5 years to a single instance of the right customer.

In a vacuum, this is a pretty straightforward job (see the sketch after the list):

  1. Standardize and clean all of the data
  2. Identify the priority of transaction data sources and merge all sources together based on individual contact information. Develop a single list or table of all unique customers and assign a unique ID
  3. Take all transaction data sources and then match against the unique customers and assign the ID of the unique customer to the transaction.
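A bare-bones sketch of steps 2 and 3 might look like the following. The matchkey here (surname, postcode and email) is deliberately naive, purely to show the shape of the workflow; real matching would work on standardized, fuzzy-matched data.

    import uuid

    # Hypothetical sketch: build a unique-customer table, then stamp transactions
    # with the matching customer ID. Field names and the matchkey are assumed.
    def matchkey(rec: dict) -> str:
        return "|".join(rec.get(f, "").strip().lower() for f in ("surname", "postcode", "email"))

    def build_golden_table(customers: list[dict]) -> dict:
        golden = {}
        for rec in customers:                    # customers assumed in source-priority order
            key = matchkey(rec)
            if key not in golden:
                golden[key] = {**rec, "customer_id": str(uuid.uuid4())}
        return golden

    def link_transactions(transactions: list[dict], golden: dict) -> None:
        for txn in transactions:
            match = golden.get(matchkey(txn))
            txn["customer_id"] = match["customer_id"] if match else None

    golden = build_golden_table([
        {"surname": "Smith", "postcode": "GU14 7BQ", "email": "j.smith@example.com"},
        {"surname": "SMITH", "postcode": "gu14 7bq", "email": "J.Smith@example.com"},
    ])
    txns = [{"surname": "Smith", "postcode": "GU14 7BQ", "email": "j.smith@example.com"}]
    link_transactions(txns, golden)
    print(len(golden), txns[0]["customer_id"] is not None)   # 1 True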

With live business data it’s rarely if ever that simple. Here are some of the issues we had to circumvent:

  • Very high rate of repeat customers across multiple channels
  • Different data governance rules and standards at all points of capture:
    ◦ Point of Sale – only name and email were required; when no address was provided, the store address was used; “Not Provided” was also acceptable, causing other incomplete data
    ◦ Website – entered however the customer preferred
    ◦ Call Center – typed into the CRM application, with no customer lookup process
  • No “My Account” section on the website, which meant that all orders were treated as new customers
  • Archaic Point-of-Sale application that was too expensive to replace for this project
  • Newly created SQL Server environment that acted as a central data repository but had no starting point for unique customers

To come up with a solution that would enable the client to both develop and maintain a Single Customer View, we proposed the following workflow, which could be used for both purposes.

Website

This was immediately identified as the best source of information, because the customers enter it themselves and have a genuine desire to receive their delivery or be contacted if there is an issue with their orders. The process started with an initial batch clean-up of all historical web orders, as follows:

  1. Run all orders through Address Validation and National Change of Address (NCOA) to get the most up-to-date information on the contacts
  2. Standardize all data points using the matchIT SQL casing and parsing engine
  3. Perform contact-level deduplication with matchIT SQL, using a combination of exact and fuzzy matching routines and including confidence scores for all matches.
  4. Our client identified the confidence thresholds separating confident matches to commit automatically, matches that required manual review, and unique records (see the sketch after this list). They completed the manual review and incorporated some further logic based on these matches to prevent future review of the same issues. The final step in their process was to identify a future score threshold for committing transactions from other sources to the customer record.
  5. The deduplication process was finalized and a new SQL Server table was created with standardized, accurate, and unique customer data that would act as the “golden record”.
  6. All transactions from the Web history file were updated with a new column containing the unique customer ID from the “golden record” table.
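Step 4 boils down to routing candidate matches by confidence score. A minimal sketch is below; the threshold values are placeholders, since the client set their own.

    # Route candidate match pairs by score (threshold values are assumed).
    AUTO_COMMIT = 90
    REVIEW = 75

    def route(candidate_pairs: list[dict]) -> dict:
        buckets = {"commit": [], "manual_review": [], "unique": []}
        for pair in candidate_pairs:
            if pair["score"] >= AUTO_COMMIT:
                buckets["commit"].append(pair)
            elif pair["score"] >= REVIEW:
                buckets["manual_review"].append(pair)
            else:
                buckets["unique"].append(pair)
        return buckets

    pairs = [{"id_a": 1, "id_b": 7, "score": 96},
             {"id_a": 2, "id_b": 9, "score": 81},
             {"id_a": 3, "id_b": 4, "score": 40}]
    print({k: len(v) for k, v in route(pairs).items()})
    # {'commit': 1, 'manual_review': 1, 'unique': 1}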

Call Center

This was the next area to tackle, and it was essentially a repetition of the process used on the Website data, except that the final step matched the cleaned Call Center data against the “golden record” table.

After performing the overlap matching, any unique customers from the Call Center were added to the “golden record” table and assigned an ID. All overlapping Call Center customers received the matching ID from the “golden record” table, which was then appended to the related Call Center transactions.

Store

This was the tricky part!

Some of the data contained full and accurate customer information, but nearly 30% of the customer transaction data captured at the store level contained the store address information.

So how did we overcome this?

  1. We created a suppression table that contained only their store addresses
  2. All store transactions with the minimum information for capture (at least a name and address) were standardized and then matched against the store suppression file, yielding a list of matches (customers with store info as their address) and non-matches (customers that provided their own address information)
  3. For the customers that provided their own address, the process then followed the same procedure used on the Call Center data
  4. For the customers with store information, we had to use a different set of matching logic that ignored the address information and instead looked at the other data elements, like name, email, phone, credit card number and date of birth. Several matchkeys were required because of the inconsistency in which elements were available to match on (see the sketch after this list).
  5. For the remaining portion of customers in the Store file (3%), the client decided to put those customers in a hold table until some other piece of information surfaced that would allow the transaction to be bridged to a new transaction.
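A simplified sketch of the alternate matchkey logic in step 4 is shown below. The field names, the key recipes, and the use of a tokenized card value (rather than a raw card number) are assumptions for illustration.

    # Build matchkeys from non-address elements when the address is a store address.
    def store_matchkeys(rec: dict) -> set[str]:
        name = (rec.get("forename", "")[:1] + rec.get("surname", "")).lower()
        keys = set()
        if rec.get("email"):
            keys.add("email|" + rec["email"].lower())
        if rec.get("phone"):
            keys.add("phone|" + "".join(ch for ch in rec["phone"] if ch.isdigit()) + "|" + name)
        if rec.get("dob"):
            keys.add("dob|" + rec["dob"] + "|" + name)
        if rec.get("card_token"):                # tokenized card reference, never the raw number
            keys.add("card|" + rec["card_token"])
        return keys

    a = {"forename": "James", "surname": "Smith", "email": "JSmith@example.com",
         "phone": "(212) 555-0147", "dob": "1980-03-15", "card_token": "tok_91f2"}
    b = {"forename": "Jim", "surname": "Smith", "email": "jsmith@example.com",
         "phone": "212-555-0147", "dob": "", "card_token": ""}
    print(store_matchkeys(a) & store_matchkeys(b))   # overlap on the email and phone keys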

[Workflow diagram of the store process]

The key to the whole strategy was to identify a way to isolate records with alternative matching requirements on an automated basis. Once we separated the records with store addresses, we were free to develop the ideal logic for each scenario, providing an end-to-end solution for a very complicated but frequently occurring data issue.

If you are attempting to establish the ever-elusive single customer view, remember that there are several moving parts to the process that go well beyond the implementation of data quality software. The solution may well require a brand new workflow to reach the desired objective.


For more information about establishing a Single Customer View or to sign up for a Free CRM Data Analysis, click here.

Process Centric Data Quality

When I read that, even today, contact data quality issues are still costing US businesses billions, and that industry-average duplicate rates run around 1 in every 20 records for a typical database, we have to accept that there has got to be a better way to resolve the issue.

In my role, I’ve had the opportunity to hear a lot of CIOs, DBAs, and business stakeholders describe the challenges of trying to deal with data quality issues. The one question that consistently comes up is: “How do I enforce, throughout my company, policies and practices which ensure that the data users enter into our database is clean and duplicate-free?”

The answer starts with establishing a simple universal truth: for any company, data quality starts at the point of capture. This is the moment a record is entered into your database. It doesn’t matter if it is entered by someone in a call center, a salesperson, an account manager, billing, support, or via a web-generated lead or sale. This is the opportunity to get it right.

Between the CRM and ERP systems in any given company, nearly every employee either looks up records, adds them, or modifies their details. Even the website connects to the database, handling new web leads or new customer e-commerce purchases, billing and shipping details. Data providers such as JigSaw Data, InfoUSA and SalesGenie have made a lot of data readily available at little to no cost, and it is being sucked into company databases. While all of this data has enormous benefits for business and profits, it creates a lot of work for IT departments trying to keep it all clean and linked to existing accounts or records. For its part, the data quality industry has been diligent in coming up with new processes and methodologies like MDM, CDC, CDI and data stewardship, which have certainly helped many companies understand and make improvements to the data quality dilemma.

If you look at the data quality industry as a whole, little has changed over the years. Backend “batch” data quality processing is still the predominant way IT departments deal with correcting poor inputs and linking duplicate records. Yes, processing has moved from the mainframe to the workstation, and costs have certainly come down to a point where it is reasonable for every company to seriously consider acquiring these tools. But we are still using tools and building thought processes based on 1990s technology to deal with 21st-century data realities.

Admittedly, there will always be a place for backend maintenance, correction and analysis. It’s fundamental. But in most cases, batch processing is performed offline, days, weeks, months, or for some companies even years after the record was created. Speaking for helpIT systems, we have put a lot of design effort into simplifying complex data cleansing functions and extending them into robust, fully functional batch data cleansing tools. These tools are critical in effectively supporting the IT department in maintaining a single customer view across the enterprise.

But poor data quality is, in the first instance, a process problem, not a technology problem. Properly applied, however, technology can help each user and the organization eliminate, or at least mitigate, the human impact on data capture.

To quote Rob Karel, Principal Analyst with Forrester Research: “Investments in downstream batch data quality solutions help to mitigate data quality issues, but there is no replacement for capturing the right data in the first place.”

This is where our new data quality framework fits into a company’s data quality initiatives.

findIT S2 is a real-time data quality framework designed to be integrated into frontend applications, residing between the data entry task and the database. findIT S2 reports suspect duplicates to the UI and calls a postal reference database to ensure that addresses are complete, accurate and entered in fewer keystrokes. The rest of the data quality engine extrapolates further reference data, standardizes the data and posts the information back to the underlying database.

Essentially, with findIT S2 we’ve empowered every user to be a deputized data steward and a functional part of the data quality process. Instead of the human element being the problem, the user is an integral part of the solution. Clean data is entered into the systems in real time, allowing business decisions, actions and reporting to be based on more accurate data.

Additionally, the findIT S2 UI can be modified or customized, and can be linked to multiple internal or external data sources or web services. This allows findIT S2 to extend its core functionality to also provide customized data enhancement and appending, immediate reporting of fraud or red-flag warnings, or synchronous matching and updating between internal systems.

By applying a process-centric data quality approach, you directly reduce the amount of work necessary downstream. It’s the old adage at work: an ounce of prevention is worth a pound of cure.