helpIT Systems is Driving Data Quality

For most of us around the US, the Department of Motor Vehicles is a dreaded place, bringing with it a reputation for long lines, mountains of paperwork and drawn-out processes. As customers, we loathe the trip to the DMV, and while standing in line we may not give it much thought, but the reality is that poor data quality is a common culprit behind some of these DMV woes. While it may seem unlikely that an organization as large and bureaucratic as the DMV can right the ship, today DMVs around the country are fighting back with calculated investments in data quality.

While improving the quality of registered driver data is not a new concept, technology systems implemented 15-20 years ago have long been a barrier for DMVs to actually take corrective action. However, as more DMVs begin to modernize their IT infrastructure, data quality projects are becoming more of a reality. Over the past year, helpIT has begun work with several DMVs to implement solutions designed to cleanse driver data, eliminate duplicate records, update addresses and even improve the quality of incoming data.

From a batch perspective, setting up a solution to cleanse the existing database paves the way for DMVs to implement other types of operational efficiencies like putting the license renewal process online, offering email notification of specific deadlines and reducing the waste associated with having (and trying to work with) bad data.

In addition to cleaning up existing state databases, some DMVs are taking the initiative a step further and working with helpIT to incorporate real-time address validation into their systems. This ‘real-time data quality’ creates a firewall of sorts, helping DMV representatives capture accurate data while you provide it (via phone or at a window). With typedown technology embedded directly within DMV data entry forms, if there is a problem with your address, or you accidentally omit information that affects its accuracy, such as your apartment number or a street directional (North vs. South), the representatives are prompted to request clarification.
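
As a rough illustration of the idea (not helpIT’s actual typedown API), a capture form can check a submitted address for missing or ambiguous elements and tell the representative what to ask before the record is saved. The field names and rules below are hypothetical:

    # Hypothetical sketch of a capture-time address check; the field names and the
    # is_multi_unit/needs_directional flags are illustrative assumptions, not a real API.
    def validation_prompts(address: dict) -> list[str]:
        prompts = []
        for field in ("street", "city", "state", "zip"):
            if not address.get(field):
                prompts.append(f"Please provide the {field}.")
        if address.get("is_multi_unit") and not address.get("unit"):
            prompts.append("This building has multiple units - which apartment number?")
        if address.get("needs_directional") and not address.get("directional"):
            prompts.append("Is that North or South on this street?")
        return prompts

    # The representative is prompted for anything missing before committing the record.
    record = {"street": "123 Main St", "city": "Albany", "state": "NY",
              "zip": "", "is_multi_unit": True, "unit": ""}
    for prompt in validation_prompts(record):
        print(prompt)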

Getting your contact data accurate from the start means your new license is provided immediately, without you having to make another visit or call and wait on hold for 30 minutes to resolve a problem that could have been no more than a simple typo.

Having met several DMV employees over the past year, it’s obvious that they want you to have an excellent experience. Better data quality is a great place to start. Even while DMV budgets are slashed year after year, modest investments in data quality software are yielding big results in customer experience.


If you want to learn more about improving the quality of your data, contact us at 866.332.7132 for a free demo of our comprehensive suite of data quality products.

Creating Your Ideal Test Data

Every day we work with customers to begin the process of evaluating helpIT data quality software (along with other vendors they are looking at). That process can be daunting for a variety of reasons from identifying the right vendors to settling on an implementation strategy, but one of the big hurdles that occurs early on in the process is running an initial set of data through the application.

Once you’ve gotten a trial of a few applications (hopefully including helpIT’s) and you are poised to start your evaluation to determine which one will generate the best results, you’ll need to develop a sample data set to run through the software. This is an important step not to be overlooked: you want to be sure that the software you invest in can deliver the highest quality matches, so you can effectively dedupe your database and, most importantly, TRUST that the resulting data is as clean as it can possibly be. So how do you create the ideal test data?

The first word of advice – use real data.

Many software trials come preinstalled with sample or demo data designed primarily to showcase the features of the software. While this sample data can give you examples of generic match results, it will not be a clear reflection of the match results you will see on your own data. This is why it is best to run an evaluation of the software on your own data whenever possible. Using the guidelines below, we suggest identifying a real dataset that is representative of the challenges you typically see within your actual database. That dataset will tell you whether the software can find your more challenging matches, and how well it can do so.

For fuzzy matching features, you may like to consider whether the data that you test with includes these situations:

  • phonetic matches (e.g. Naughton and Norton)
  • reading errors (e.g. Horton and Norton)
  • typing errors (e.g. Notron, Noron, Nortopn and Norton)
  • one record has title and initial and the other has first name with no title
    (e.g. Mr J Smith and John Smith)
  • one record has missing name elements (e.g. John Smith and Mr J R Smith)
  • names are reversed (e.g. John Smith and Smith, John)
  • one record has missing address elements (e.g. one record has the village or house
    name and the other address just has the street number or town)
  • one record has the full postal code and the other a partial postal code or no postal code

When matching company names data, consider including the following challenges:

  • acronyms (e.g. IBM, I B M, I.B.M., International Business Machines)
  • one record has missing name elements, e.g.
    1. The Crescent Hotel, Crescent Hotel
    2. Breeze Ltd, Breeze
    3. Deloitte & Touche, Deloitte, Deloittes

You should also ensure that you have groups of records where the data that matches exactly varies between pairs within the group. For example, in a group of three records for the same customer, one pair might match exactly on postal code but have different spellings of the last name, while another pair matches exactly on last name but differs in the postal code.

If you don’t have these scenarios all represented, you can doctor your real data to create them, as long as you start with real records that are as close as possible to the test cases and make one or at the most two changes to each record. In the real world, matching records will have something in common – not every field will be slightly different.
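
As a minimal sketch of that ‘one or two controlled changes’ approach (the record layout and the specific changes below are hypothetical, and last names are assumed to be at least three characters long), you could derive fuzzy test cases from real rows along these lines:

    import random

    # Hypothetical record layout; start from a real row and change at most one or two
    # fields so the pair still clearly refers to the same person.
    def make_fuzzy_variant(record: dict) -> dict:
        variant = dict(record)
        change = random.choice(["typo", "initial", "reversed", "partial_postcode"])
        if change == "typo":
            name = variant["last_name"]
            i = random.randrange(1, len(name) - 1)
            variant["last_name"] = name[:i] + name[i + 1] + name[i] + name[i + 2:]  # Norton -> Notron
        elif change == "initial":
            variant["first_name"] = variant["first_name"][0]                        # John -> J
        elif change == "reversed":
            variant["first_name"], variant["last_name"] = variant["last_name"], variant["first_name"]
        else:
            variant["postcode"] = variant["postcode"][:3]                           # full -> partial postal code
        return variant

    real = {"first_name": "John", "last_name": "Norton",
            "address": "12 High Street", "postcode": "90210"}
    print(real)
    print(make_fuzzy_variant(real))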

With regard to size, it’s better to work with a reasonable sample of your data than a whole database or file, otherwise the mass of information runs the risk of obscuring important details and test runs take longer than they need to. We recommend that you take two selections from your data – one for a specific postal code or geographic area, and one (if possible) an alphabetical range by last name. Join these selections together and then eliminate all the exact matches – if you can’t do this easily, one of the solutions that you’re evaluating can probably do it for you.
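
If your data sits somewhere you can query with Python, a quick way to pull that kind of sample might look like the following; the file and column names are assumptions about your schema, and the postal-code and last-name ranges are just examples:

    import pandas as pd

    # Hypothetical file and column names; adjust to your own schema.
    customers = pd.read_csv("customers.csv", dtype=str)

    # One selection by postal code, one by an alphabetical range of last names.
    postal_slice = customers[customers["postcode"].str.startswith("902", na=False)]
    name_slice = customers[customers["last_name"].str.upper().str.startswith(("M", "N"), na=False)]

    # Join the selections, then eliminate the exact duplicates so the fuzzier matches remain.
    sample = pd.concat([postal_slice, name_slice], ignore_index=True).drop_duplicates()
    sample.to_csv("evaluation_sample.csv", index=False)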

Ultimately, you should have a reasonably sized sample with the obvious matches removed, containing a good number of fuzzier matches (e.g. matches where the first character of the postal code or last name differs between two records that otherwise match, matches with phonetic variations of the last name, etc.).

__________________________________________________________________________

For more information on data quality vendor evaluations, please download our Practical Guide to Data Quality Vendor Selection.

The Retail Single Customer View

One of the more interesting aspects of working for a data quality company is the challenge associated with solving real world business issues through effective data management. In order to provide an end-to-end solution, several moving parts must be taken into consideration, data quality software being just one of them.

A few months back, helpIT was given the challenge of working with a multi-channel retail organization seeking a Single Customer View to improve the effectiveness of their marketing efforts. They received customer data from several channels including: Point of Sale, Website, and Call Center. Their hope was to link up every transaction over the past 5 years to a single instance of the right customer.

In a vacuum, this is a pretty straightforward job:

  1. Standardize and clean all of the data
  2. Identify the priority of transaction data sources and merge all sources together based on individual contact information. Develop a single list or table of all unique customers and assign a unique ID
  3. Take all transaction data sources, match them against the unique customers, and assign the ID of the unique customer to each transaction (a rough sketch of this flow follows).
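
In outline, and only in outline, those three steps could be sketched like this; standardize() and find_match() below are trivial stand-ins for the real cleansing and matching engine, not its API:

    import uuid

    def standardize(record: dict) -> dict:
        """Trivial stand-in for a casing/parsing engine."""
        return {k: str(v).strip().upper() for k, v in record.items()}

    def find_match(record: dict, golden: list):
        """Trivial stand-in for the exact and fuzzy contact matching."""
        for candidate in golden:
            if (record["last_name"], record["postcode"]) == (candidate["last_name"], candidate["postcode"]):
                return candidate
        return None

    def build_golden_records(sources: list):
        golden, transactions = [], []
        # 1. Standardize and clean all of the data (sources listed in priority order)
        for source in sources:
            for record in map(standardize, source):
                # 2. Merge into a single table of unique customers, each with a unique ID
                match = find_match(record, golden)
                if match is None:
                    match = {**record, "customer_id": str(uuid.uuid4())}
                    golden.append(match)
                # 3. Assign the unique customer's ID back to the transaction
                transactions.append({**record, "customer_id": match["customer_id"]})
        return golden, transactions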

With live business data it’s rarely if ever that simple. Here are some of the issues we had to circumvent:

  • Very high rate of repeat customers across multiple channels
  • Different data governance rules and standards at each point of capture:
    • Point of Sale – only name and email were required; when no address was provided, the store address was used, and “Not Provided” was also accepted, leaving other data incomplete
    • Website – entered however the customer preferred
    • Call Center – typed into the CRM application, with no customer lookup process
  • No “My Account” section on the website, which means that all orders are treated as new customers
  • Archaic Point-of-Sale application that was too expensive to replace for this project
  • Newly created SQL Server environment that acts as a central data repository but had no starting point for unique customers

To enable the client to develop, and then maintain, a Single Customer View, we proposed the following workflow, which can be used for both.

Website

This was immediately identified as the best source of information because customers enter the data themselves and have a genuine desire to receive their delivery, or to be contacted if there is an issue with their order. The process started with an initial batch clean-up of all historical web orders, as follows:

  1. Run all orders through Address Validation and National Change of Address (NCOA) to get the most up to date information on the contacts
  2. Standardize all data points using the matchIT SQL casing and parsing engine
  3. Perform contact-level deduplication with matchIT SQL using a combination of exact and fuzzy matching routines, with confidence scores included for all matches.
  4. Our client identified the confidence thresholds that separated confident matches to commit automatically, matches that required manual review, and unique records. They completed the manual review and incorporated some further logic based on these matches to prevent repeat reviews of the same issues. The final step in their process was to identify a future score threshold for committing transactions from other sources to the customer record (a simplified sketch of this threshold logic follows the list).
  5. The deduplication process was finalized and a new SQL Server table was created with standardized, accurate, and unique customer data that would act as the “golden record”.
  6. All transactions from the Web history file were updated with a new column containing the unique customer ID from the “golden record” table.
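
As referenced in step 4, here is a simplified sketch of the threshold logic; the scores and cut-offs are invented for illustration, since the real confidence scores come from matchIT SQL and the real cut-offs were chosen by the client:

    # Illustrative thresholds only; the real cut-offs were chosen by the client after
    # reviewing the confidence scores produced by matchIT SQL.
    AUTO_COMMIT = 90     # at or above: commit the match automatically
    REVIEW_FLOOR = 70    # between the two values: route to manual review

    def disposition(score: int) -> str:
        if score >= AUTO_COMMIT:
            return "commit"
        if score >= REVIEW_FLOOR:
            return "manual_review"
        return "unique"

    for pair, score in [(("Mr J Smith", "John Smith"), 94),
                        (("Jon Smyth", "John Smith"), 78),
                        (("Jane Smith", "John Smith"), 40)]:
        print(pair, disposition(score))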

Call Center

This was the next area to undertake and was essentially a repetition of the process used on the Website data with the exception of the final steps of matching the cleaned version of the Call Center data with the “golden record” table.

After performing the overlap, any unique customers from the call center were then added to the “golden record” table and assigned an ID. All the overlapping Call Center customers received the overlapped ID from the “golden record” table which was then appended to the related Call Center transactions.
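
In data-frame terms, the overlap-and-append step might look something like this; the files, the match_key column and the ID scheme are hypothetical stand-ins for the client’s actual tables and matching output:

    import pandas as pd

    # Hypothetical tables: both carry a match_key produced by the earlier dedupe step.
    golden = pd.read_csv("golden_record.csv", dtype=str)
    call_center = pd.read_csv("call_center_clean.csv", dtype=str)

    # Overlap: matching Call Center rows inherit the existing golden-record ID.
    merged = call_center.merge(golden[["match_key", "customer_id"]],
                               on="match_key", how="left")

    # Append: unique Call Center customers get a new ID and join the golden table.
    new_rows = merged[merged["customer_id"].isna()].copy()
    new_rows["customer_id"] = "CC-" + new_rows.index.astype(str)
    golden = pd.concat([golden, new_rows], ignore_index=True)

    # The IDs (existing or new) are then carried back onto the Call Center transactions.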

Store

This was the tricky part!

Some of the data contained full and accurate customer information, but nearly 30% of the customer transaction data captured at the store level contained the store address information.

So how did we overcome this?

  1. We created a suppression table that contained only their store addresses
  2. All store transactions with the minimum information for capture (at least a name and address) were standardized and then matched against the store suppression file, yielding a list of matches (customers with store info as their address) and non-matches (customers who provided their own address information)
  3. For the customers who provided their own address, the process then followed the same procedure run on the Call Center data
  4. For the customers with store information, we had to use a different set of matching logic that ignored the address and instead looked to the other data elements, such as name, email, phone, credit card number and date of birth. Several matchkeys were required because of the inconsistency in which elements any given match would be found on (a simplified sketch of this routing follows the list).
  5. For the remaining portion of customers in the Store file (3%), the client decided to place those customers in a hold table until some other piece of information surfaced that would allow the transaction to be bridged to a future transaction.
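
As referenced in step 4, here is a simplified sketch of that routing and the alternate matchkeys; the store addresses, field names and key definitions are illustrative, not the production configuration:

    # Illustrative only: route store transactions based on whether the captured
    # address is actually one of the retailer's own store addresses.
    STORE_ADDRESSES = {("1 RETAIL PLAZA", "10001"), ("55 MARKET ST", "94105")}

    def normalize(rec: dict) -> tuple:
        return (rec.get("address", "").strip().upper(), rec.get("zip", "").strip())

    def alternate_matchkeys(rec: dict) -> list:
        """Matchkeys that ignore the address and rely on name plus another identifier."""
        name = (rec.get("last_name", "").upper(), rec.get("first_name", "")[:1].upper())
        candidates = (name + (rec.get("email", "").lower(),),
                      name + (rec.get("phone", ""),),
                      name + (rec.get("dob", ""),))
        return [key for key in candidates if all(key)]

    def route(record: dict) -> str:
        if normalize(record) in STORE_ADDRESSES:
            return "match_on_alternate_keys"    # step 4: name + email/phone/DOB etc.
        return "standard_address_matching"      # step 3: same procedure as the Call Center data

    print(route({"last_name": "Norton", "first_name": "John", "address": "1 Retail Plaza",
                 "zip": "10001", "email": "jn@example.com"}))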


The key to the whole strategy was to identify a way to isolate records with alternative matching requirements on an automated basis. Once we separated the records with store addresses we were free to develop the ideal logic for each scenario, providing an end to end solution for a very complicated but frequently occurring data issue.

If you are attempting to establish the ever-elusive single customer view, remember that there are several moving parts to the process that go well beyond the implementation of data quality software. The solution may well require a brand new workflow to reach the desired objective.


For more information about establishing a Single Customer View or to sign up for a Free CRM Data Analysis, click here.

Any Advance on Soundex?

A lot has been written about “phonetic algorithms” since Soundex was created for the US Census in (I think) 1880, but the world seemed to stand fairly still until computer software started to implement name matching in the 70’s. The strange thing is that Soundex seems to have remained the de facto standard until well into the 90’s, even up to the present day – strange because Soundex is manifestly prone to false matches as well as missing some typical true name matches. To explain why this happens, let’s look at how Soundex works:

Soundex constructs a crude non-phonetic key by keeping the initial letter of the name, then removing all vowels, plus the letters H, W and Y, and translating the remaining letters to numbers e.g. Tootill and Toothill both become T340. It gives the same number to letters that can be confused e.g. ‘m’ and ‘n’ both become 5. It also drops repeated consonants and consecutive letters that give the same number e.g. S and C. To illustrate some of the issues: Soundex translates Brady, Beard and Broad all to B630, and Wilkins and Wilson both to W425, and yet doesn’t match Knot and Nott – let alone more challenging examples like Dayton and Deighton.
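
To make this concrete, here is a minimal sketch of American Soundex (it is not the code helpIT uses); it reproduces the collisions and the missed matches described above.

    def soundex(name: str) -> str:
        """Minimal American Soundex: keep the first letter, ignore H and W, drop vowels,
        map the remaining letters to digits and collapse runs of the same digit."""
        codes = {}
        for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                               ("L", "4"), ("MN", "5"), ("R", "6")):
            for ch in letters:
                codes[ch] = digit

        name = "".join(ch for ch in name.upper() if ch.isalpha())
        first = name[0]
        digits, prev = [], codes.get(first, "")
        for ch in name[1:]:
            if ch in "HW":               # H and W neither code nor break a run
                continue
            digit = codes.get(ch, "")    # vowels (and Y) get "" and break the run
            if digit and digit != prev:
                digits.append(digit)
            prev = digit
        return (first + "".join(digits) + "000")[:4]

    # The collisions and misses described above:
    assert soundex("Tootill") == soundex("Toothill") == "T340"
    assert soundex("Brady") == soundex("Beard") == soundex("Broad") == "B630"
    assert soundex("Wilkins") == soundex("Wilson") == "W425"
    assert soundex("Knot") != soundex("Nott")        # K530 vs N300
    assert soundex("Dayton") != soundex("Deighton")  # D350 vs D235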

A lot of the work done in the 70’s and 80’s focused on shoring up Soundex’s more obvious symptoms, rather than addressing the root problem: it doesn’t attempt to understand how people pronounce names. In the last 20 years, various “phonetic” algorithms have entered the public domain, but Metaphone 3 is the only one that I know of that doesn’t invariably disregard the vowels after the first letter of the name.

Much of the material I read back in 1995 (when searching for a phonetic algorithm that worked better than Soundex) started off on the wrong tack by adopting a similar approach to Soundex. Often, the authors supported their solutions with sounds that are rarely present in people’s names, e.g. the different “ua” sounds in “persuade” and “casual”, or the “pt” in “Ptolemy”. Even when I’ve revisited the subject in the years since, there has been little advance. Back in 1995, I decided that we at helpIT would write our own genuinely phonetic algorithm, and I laid down these requirements:

  • It must produce phonetic codes that represent typical pronunciations
  • It should focus on “proper names” and not consider other words
  • It should be loose enough to allow for British and American regional differences in pronunciation (e.g. “Shaw” and “Shore”) but not so loose as to equate names that sound completely different
  • It should not try to address other forms of fuzzy matching that arise from keying or reading errors and inconsistencies – phonetic matching should, by definition, simply address words that may sound the same.

The last point is important – the most effective way to match names is to use a combination of techniques to allow for different kinds of error, rather than try to create one algorithm that is a jack of all trades. I will describe our phonetic algorithm in a later post, but software development, like data quality, is a journey and not a destination – so we’re always looking to improve. If any of our readers know of any other algorithms that attempt to genuinely represent the phonetic attributes of names, I would be delighted to hear from you.
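
As a toy example of that combination (reusing the Soundex sketch above purely as a stand-in for a phonetic key, paired with a simple similarity ratio for keying errors; this is not helpIT’s scoring), each technique catches errors the other misses:

    from difflib import SequenceMatcher

    # Toy combination only: the soundex() sketch defined earlier stands in for a phonetic key.
    def names_match(a: str, b: str, threshold: float = 0.8) -> bool:
        if soundex(a) == soundex(b):      # sound-alikes, e.g. Smith / Smyth
            return True
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold  # typos, e.g. Nortopn / Norton

    print(names_match("Smith", "Smyth"))     # True via the phonetic key
    print(names_match("Nortopn", "Norton"))  # True via the similarity ratio
    print(names_match("Smith", "Jones"))     # False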

Steve Tootill (stEv tWtyl)

Real-Time ERP Data? Yes.

Over the past several months our findIT S2 application has been gaining some significant traction as a real-time duplicate prevention and record linking solution on websites, CRM and ERP applications.

Just last month, helpIT added our first Epicor client seeking these same capabilities directly within their ERP. The acquisition of findIT S2 comes on the heels of their Epicor ERP purchase, validating our belief that there is a strategic opportunity for organizations to take advantage of a real-time data quality firewall, even for those already utilizing some of the world’s leading strategic CRM and ERP applications.

In each interaction with our prospects and customers, we are finding new ways that the power of real-time matching is helping companies. Some of the more novel examples include:

  • Providing a single customer view in both CRM and ERP applications, despite technical incompatibilities between systems
  • Linking captured web leads to credit bureau information to identify applicable credit offers
  • Preventing the overextension of credit to customers by finding terms previously offered

The most common scenario, of course, is organizations who understand that the prevention of duplication can lead to very significant improvements in efficiency and decisioning when compared to retroactive data quality resolution.

helpIT systems is proactively working with a range of enterprise partners to extend the value of exceptional data quality through our best-in-class front-end and back-end matching technologies.

To explore partnership or direct opportunities, please contact Josh Buckler at 866.628.2448 or connect with him via Twitter @bucklerjosh.