Golden Records Need Golden Data: 7 Questions to Ask

If you’ve found yourself reading this blog then you’re no doubt already aware of the importance of maintaining data quality through processes such as data verification, suppression screening, and duplicate detection. In this post I’d like to take a closer look at how you draw value from, and make the best use of, the results of the hard work you invest in tracking down duplicates within your data.

The great thing about fuzzy matching is that it enables us to identify groups of two or more records that pertain to the same entity but that don’t necessarily contain exactly the same information. Records in a group of fuzzy matches will normally contain similar information with slight variations from one record to the next. For example, one record may contain a full forename whilst another contains just an abbreviated version or even none at all. You will also frequently encounter fuzzy matches where incorrectly spelt or poorly input data is matched against its accurate counterpart.

Once you’ve identified these groups of fuzzy matches, what do you do with them? Ultimately you want to end up with only unique records in your data, but there are a couple of ways to reach that goal. One approach is to determine the best record in a group of matches and discard all of the records that matched against it. Alternatively, you may find that you can draw more value from your data by taking the most accurate, complete, and relevant information from a group of matched records and merging it together, so that you’re left with a single hybrid record containing better data than was available in any of the individual records from which it was created.

Regardless of the approach you take, you’ll need to establish some rules for determining the best record, or the best pieces of information, from multiple records. Removing the wrong record or information could actually make your data worse, so this decision warrants some thought. The criteria you use will vary from one job to the next, but the following seven questions target the attributes you’ll want to consider when deciding what data should be retained:

  1. How current is the data?
    You’ll most likely want to keep data that was most recently acquired.
  2. How complete is the data?
    How many fields are populated, and how well are those fields populated?
  3. Is the data valid?
    Have dates been entered in the required format? Does an email address contain an at sign?
  4. Is the data accurate?
    Has it been verified (e.g. address verified against PAF)?
  5. How reliable is the data?
    Has it come from a trusted source?
  6. Is the data relevant?
    Is the data appropriate for its intended use (e.g. keep female contacts over male if compiling a list of recipients for a women’s clothing catalogue)?
  7. Is there a predetermined hierarchy?
    Do you have a business rule in place that requires that one set of data always be used over another?

When you have such a large range of competing criteria to consider, how do you apply all of these rules simultaneously? The approach we at helpIT use in our software is to allow the user to weight each item or collection of data, so they can choose which aspects are the most important in their business context. This isn’t necessarily whether an item is present or how long it is; it could be whether it was an input value or was derived from supplied information, or whether it has been verified by reference to an external dataset such as a Postal Address File. Once the master record has been selected, the user may also want to transfer data from the records being deleted to the master record, e.g. to copy a job title from a duplicate onto a master record that contains fuller name and address information but no job title. By creating a composite record, you ensure that no data is lost.
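To make this concrete, here is a minimal sketch in Python of how weighted survivorship rules like these might be applied. The field names, weights, and scoring rules are all invented for illustration, and this is not how matchIT itself scores records; the point is simply that every record in a match group is scored against the chosen criteria, the highest scorer becomes the master, and its gaps are filled from the discarded duplicates.

```python
from datetime import date

# Hypothetical weights reflecting which criteria matter most in a given business context.
WEIGHTS = {"recency": 3, "completeness": 2, "verified_address": 5}

def score(record):
    """Score one record against the weighted survivorship criteria."""
    completeness = sum(1 for v in record.values() if v) / len(record)
    age_days = (date.today() - record["last_updated"]).days
    recency = 1 / (1 + age_days / 365)              # newer data scores higher
    verified = 1 if record.get("paf_verified") else 0
    return (WEIGHTS["recency"] * recency
            + WEIGHTS["completeness"] * completeness
            + WEIGHTS["verified_address"] * verified)

def merge_group(group):
    """Pick the best record in a group of fuzzy matches and fill its gaps
    from the records being discarded, producing a composite record."""
    master = max(group, key=score)
    composite = dict(master)
    for record in group:
        for field, value in record.items():
            if value and not composite.get(field):  # e.g. copy a job title only a duplicate holds
                composite[field] = value
    return composite

group = [
    {"name": "J Smith", "job_title": "Buyer", "email": "",
     "last_updated": date(2010, 6, 1), "paf_verified": False},
    {"name": "John Smith", "job_title": "", "email": "j.smith@example.com",
     "last_updated": date(2011, 11, 3), "paf_verified": True},
]
print(merge_group(group))
```

In this toy example the second record wins on recency, completeness, and address verification, and the job title is then copied across from the first, so no information is lost.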

Hopefully this post will have given you something to think about when deciding how to deal with the duplicates you’ve identified in your data. I’d welcome any comments or questions.

The Retail Single Customer View

One of the more interesting aspects of working for a data quality company is the challenge associated with solving real world business issues through effective data management. In order to provide an end-to-end solution, several moving parts must be taken into consideration, data quality software being just one of them.

A few months back, helpIT was given the challenge of working with a multi-channel retail organization seeking a Single Customer View to improve the effectiveness of their marketing efforts. They received customer data from several channels, including Point of Sale, Website, and Call Center. Their hope was to link every transaction from the past 5 years to a single instance of the right customer.

In a vacuum, this is a pretty straightforward job (a rough sketch in code follows the list):

  1. Standardize and clean all of the data
  2. Identify the priority of transaction data sources and merge all sources together based on individual contact information. Develop a single list or table of all unique customers and assign a unique ID
  3. Take all transaction data sources and then match against the unique customers and assign the ID of the unique customer to the transaction.
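A toy Python sketch of that idealised flow is below. The field names, the crude normalise step, and the exact name-and-postcode matching key are all invented for brevity; a real implementation would use proper parsing, address validation, and fuzzy matching rather than an exact key.

```python
def normalise(contact):
    """Step 1: crude standardisation (a real job would use proper casing,
    parsing and address validation)."""
    return {k.upper(): str(v).strip().upper() for k, v in contact.items()}

def build_unique_customers(sources):
    """Step 2: merge the sources in priority order and assign a unique ID
    to each distinct customer."""
    unique, next_id = {}, 1
    for source in sources:                        # sources already ordered by priority
        for contact in source:
            c = normalise(contact)
            key = (c["NAME"], c["POSTCODE"])      # stand-in for real fuzzy matching
            if key not in unique:
                unique[key] = {"customer_id": next_id, **c}
                next_id += 1
    return unique

def link_transactions(transactions, unique):
    """Step 3: stamp every transaction with the ID of its unique customer."""
    for t in transactions:
        c = normalise(t)
        t["customer_id"] = unique.get((c["NAME"], c["POSTCODE"]), {}).get("customer_id")
    return transactions

web = [{"name": "Ann Lee", "postcode": "AB1 2CD"}]
pos = [{"name": "ANN LEE ", "postcode": "ab1 2cd"}]          # same customer, different channel
customers = build_unique_customers([web, pos])               # one unique customer, one ID
orders = link_transactions([{"name": "Ann Lee", "postcode": "AB1 2CD", "amount": 42}], customers)
print(orders)
```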

With live business data it’s rarely, if ever, that simple. Here are some of the issues we had to work around:

  • Very high rate of repeat customers across multiple channels
  • Different data governance rules and standards at each point of capture:
      • Point of Sale – only name and email were required; when no address was provided, the store address was used, and “Not Provided” was also accepted, causing further incomplete data
      • Website – entered however the customer preferred
      • Call Center – typed into a CRM application, with no customer lookup process
  • No “My Account” section on the website, which means that all orders are treated as new customers
  • Archaic Point-of-Sale application that was too expensive to replace for this project
  • Newly created SQL Server environment that acts as a central data repository but had no starting point for unique customers

To provide a solution that would let the client both develop and maintain a Single Customer View, we proposed the following workflow.

Website

This was immediately identified as the best source of information because the customers enter it themselves and have a genuine desire to receive their deliveries, or to be contacted if there is an issue with their orders. The process started with an initial batch clean-up of all historical web orders, as follows:

  1. Run all orders through Address Validation and National Change of Address (NCOA) to get the most up to date information on the contacts
  2. Standardize all data points using the matchIT SQL casing and parsing engine
  3. Perform contact-level deduplication with matchIT SQL, using a combination of exact and fuzzy matching routines and producing confidence scores for all matches.
  4. Our client identified confidence thresholds that divided the results into confident matches to commit, matches requiring manual review, and unique records (a simple sketch of this routing follows the list). They completed the manual review and incorporated further logic based on these matches to avoid reviewing the same issues again in future. The final step in their process was to set a score threshold at which transactions from other sources would in future be committed to the customer record.
  5. The deduplication process was finalized and a new SQL Server table was created with standardized, accurate, and unique customer data that would act as the “golden record”.
  6. All transactions from the Web history file were updated with a new column containing the unique customer ID from the “golden record” table.
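As a minimal illustration of the threshold step (step 4), the routing logic might look something like this in Python. The threshold values and the 0-100 score range are purely illustrative; in the real project they came from the client’s review of sample matches.

```python
# Illustrative thresholds only - in practice they are tuned by reviewing sample matches.
AUTO_COMMIT_THRESHOLD = 90   # at or above this score: merge automatically
REVIEW_THRESHOLD = 70        # between the two thresholds: send to manual review

def classify_match(score):
    """Route a scored match pair to the appropriate queue."""
    if score >= AUTO_COMMIT_THRESHOLD:
        return "commit"
    if score >= REVIEW_THRESHOLD:
        return "manual_review"
    return "treat_as_unique"

scored_pairs = [("rec_001", "rec_002", 95),
                ("rec_003", "rec_004", 78),
                ("rec_005", "rec_006", 40)]
for a, b, score in scored_pairs:
    print(a, b, classify_match(score))
```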

Call Center

This was the next area to tackle and was essentially a repetition of the process used on the Website data, except that the final step was to match the cleaned Call Center data against the “golden record” table.

After performing the overlap, any unique customers from the Call Center were added to the “golden record” table and assigned an ID. All of the overlapping Call Center customers received the matching ID from the “golden record” table, which was then appended to the related Call Center transactions.
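A rough Python sketch of that overlap step is below. The find_match function here is a deliberately crude stand-in (an exact match on email or phone) for the real fuzzy-matching engine, and all field names are invented.

```python
def find_match(record, golden_records):
    """Stand-in for the real fuzzy-matching step: here just an exact match on
    email or phone, purely for illustration."""
    for g in golden_records:
        if record.get("email") and record["email"] == g.get("email"):
            return g
        if record.get("phone") and record["phone"] == g.get("phone"):
            return g
    return None

def merge_call_center(call_center_records, golden_records, next_id):
    """Add unique Call Center contacts to the golden table; give overlapping
    contacts the existing customer_id so it can be appended to their transactions."""
    for record in call_center_records:
        match = find_match(record, golden_records)
        if match is None:
            record["customer_id"] = next_id        # brand new customer
            golden_records.append(dict(record))
            next_id += 1
        else:
            record["customer_id"] = match["customer_id"]   # reuse the overlapped ID
    return golden_records, next_id
```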

Store

This was the tricky part!

Some of the data contained full and accurate customer information, but nearly 30% of the customer transaction data captured at the store level contained the store’s address rather than the customer’s.

So how did we overcome this?

  1. We created a suppression table that contained only their store addresses
  2. All store transactions with the minimum information for capture (at least a name and address) were standardized and then matched against the store suppression file, yielding a list of matches (customers with store information as their address) and non-matches (customers who provided their own address information)
  3. For the customers who provided their own address, the process then followed the same procedure run on the Call Center data
  4. For the customers with store information, we had to use a different set of matching logic that ignored the address and instead looked to other data elements such as name, email, phone, credit card number and date of birth. Several matchkeys were required because the fields that were populated varied from record to record (see the sketch after this list).
  5. For the remaining portion of customers in the Store file (3%), the client decided to place those customers in a hold table until another piece of information surfaced that would allow the transaction to be bridged to a new transaction.
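Here is a hedged Python sketch of that routing decision. The store-address lookup, the field names, and the hold rule are simplified stand-ins for the real standardisation and matching steps.

```python
def is_store_address(record, store_addresses):
    """Check the transaction's address against the store-address suppression table."""
    key = (record.get("address1", "").upper(), record.get("postcode", "").upper())
    return key in store_addresses

def route_store_transaction(record, store_addresses):
    """Decide which matching logic a store transaction should go through."""
    if not record.get("name"):
        return "hold"                          # below the minimum capture requirement
    if is_store_address(record, store_addresses):
        # Ignore the address; rely on alternative matchkeys instead.
        alt_keys = [record.get(f) for f in ("email", "phone", "card_token", "date_of_birth")]
        return "alternate_matchkeys" if any(alt_keys) else "hold"
    return "standard_address_matching"         # same route as the Call Center data

store_addresses = {("1 HIGH STREET", "AB1 2CD")}       # invented example entry
txn = {"name": "J Smith", "address1": "1 High Street", "postcode": "AB1 2CD",
       "email": "j.smith@example.com"}
print(route_store_transaction(txn, store_addresses))   # -> alternate_matchkeys
```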


The key to the whole strategy was to identify a way to isolate records with alternative matching requirements on an automated basis. Once we had separated the records with store addresses, we were free to develop the ideal logic for each scenario, providing an end-to-end solution for a very complicated but frequently occurring data issue.

If you are attempting to establish the ever-elusive single customer view, remember that there are several moving parts to the process that go well beyond the implementation of data quality software. The solution may well require a brand new workflow to reach the desired objective.

 

For more information about establishing a Single Customer View or to sign up for a Free CRM Data Analysis, click here.

Royal Mail to Charge VAT on Bulk Mail

Important news from the Royal Mail: because of recent changes in VAT law, from 2 April 2012 all bulk mail services will be subject to VAT. You can read more about this on the DMA website or at Royal Mail, but unfortunately this is just another in a long line of trends that are driving up the cost of mailing. While most mailers will be able to reclaim the VAT from their customers, it will worsen cash flow for the mailing sponsor, and the bottom line is that companies need to be even shrewder with their marketing and mailing strategy to keep costs under control.

So what can you do?

Know who you’re mailing. Make sure the postage on each mail piece is well spent by ensuring it’s not a duplicate and that the recipient is a prospect or customer likely to be worth the investment. Whether you are a mail house or a marketer, here are some quick tips to make sure you are not paying for mailings that are unlikely to generate ROI:

For Marketers:

  • First, always remember to ask your mailer to perform deduplication and suppression on your lists. This may not be standard, and while you will spend a little extra to run these processes, the cost savings usually far outweigh the spend. You can ask for a cost estimate for these processes first if you prefer.
  • Always send suppression files with your mailing list to remove customers who have opted out or who are unlikely to be a good investment for this mailing. Remember that your ideal target audience will change from campaign to campaign, so send the appropriate suppression list for each one.
  • Consider using NCOA and Deceased/Gone Away suppression to reduce your list size even further – removing people who have died or moved, and updating any addresses that may have changed – and of course MPS suppression not only saves money on unresponsive contacts but is also a requirement for DMA members. (A rough sketch of the suppression step follows this list.)
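As a very rough illustration of what applying those suppression files amounts to, here is a Python sketch. The exact-match key on name and postcode is a simplification; a real mailing house would use fuzzy matching and licensed NCOA, deceased, gone-away and MPS data.

```python
def suppression_key(record):
    """Crude matching key for illustration only - real suppression uses fuzzy matching."""
    return (record["name"].strip().upper(), record["postcode"].replace(" ", "").upper())

def apply_suppressions(mailing_list, *suppression_files):
    """Drop any mail piece whose contact appears on an opt-out, deceased,
    gone-away or MPS file before it incurs postage."""
    suppressed = {suppression_key(r) for f in suppression_files for r in f}
    return [r for r in mailing_list if suppression_key(r) not in suppressed]

mailing_list = [{"name": "A Jones", "postcode": "SW1A 1AA"},
                {"name": "B Smith", "postcode": "EC1A 1BB"}]
opt_outs = [{"name": "B Smith", "postcode": "EC1A 1BB"}]
print(apply_suppressions(mailing_list, opt_outs))   # only A Jones remains
```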

For Mailers:

  • Remember to remind your customers about the various types of suppression. Although it does take an extra step and it reduces the total number of mail pieces, your reputation and retention of business in the long run will be bolstered when customers know you are looking out for their bottom line.
  • Create a check list for customers to prompt them for certain types of target market suppression – Have you suppressed customers who have recently purchased this product or service? Have you targeted your list to the price range of this particular product or service?
  • Promote NCOA and other new-address services where possible to help your customers target the right address.

For more information on Address Validation and Suppression services that can help you reduce mailing waste, visit www.helpIT.com or contact our sales team at [email protected] 

 

Any Advance on Soundex?

A lot has been written about “phonetic algorithms” since Soundex was patented back in 1918 and put to work indexing US Census records, but the world seemed to stand fairly still until computer software started to implement name matching in the 1970s. The strange thing is that Soundex seems to have remained the de facto standard until well into the 1990s, even up to the present day – strange because Soundex is manifestly prone to false matches as well as missing some typical true name matches. To explain why this happens, let’s look at how Soundex works:

Soundex constructs a crude non-phonetic key by keeping the initial letter of the name, then removing all vowels, plus the letters H, W and Y, and translating the remaining letters to numbers e.g. Tootill and Toothill both become T340. It gives the same number to letters that can be confused e.g. ‘m’ and ‘n’ both become 5. It also drops repeated consonants and consecutive letters that give the same number e.g. S and C. To illustrate some of the issues: Soundex translates Brady, Beard and Broad all to B630, and Wilkins and Wilson both to W425, and yet doesn’t match Knot and Nott – let alone more challenging examples like Dayton and Deighton.
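For readers who have never looked under the bonnet, here is a compact Python rendering of the classic Soundex rules just described (the textbook algorithm, not helpIT’s own code); running it reproduces the collisions and the misses mentioned above.

```python
# Letter-to-digit table for classic American Soundex.
CODES = {c: str(d) for d, letters in enumerate(
    ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1) for c in letters}

def soundex(name):
    """Return the classic Soundex key: the first letter plus three digits."""
    name = "".join(c for c in name.upper() if c.isalpha())
    if not name:
        return ""
    key = name[0]
    previous = CODES.get(name[0], "0")   # code of the first letter, for the adjacency rule
    for c in name[1:]:
        if c in "HW":                    # H and W are ignored completely
            continue
        code = CODES.get(c)
        if code is None:                 # vowels (and Y) break a run of identical codes
            previous = "0"
        elif code != previous:
            key += code
            previous = code
        if len(key) == 4:
            break
    return key.ljust(4, "0")

for name in ("Tootill", "Toothill", "Brady", "Beard", "Broad",
             "Wilkins", "Wilson", "Knot", "Nott", "Dayton", "Deighton"):
    # Brady/Beard/Broad all collide on B630; Knot/Nott and Dayton/Deighton do not match.
    print(name, soundex(name))
```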

A lot of the work done in the ’70s and ’80s focused on shoring up Soundex’s more obvious symptoms rather than addressing the root problem – it doesn’t attempt to understand how people pronounce names. In the last 20 years, various “phonetic” algorithms have entered the public domain, but Metaphone 3 is the only one I know of that doesn’t invariably disregard the vowels after the first letter of the name.

Much of the material I read back in 1995 (when searching for a phonetic algorithm that worked better than Soundex) started off on the wrong tack by adopting a similar approach to Soundex. Often, the authors supported their solutions with sounds that are rarely present in people’s names, e.g. the different “ua” sounds in “persuade” and “casual”, or the “pt” in “Ptolemy”. Even when I’ve revisited the subject in the years since then, there has been little advance. Back in 1995, I decided that we at helpIT would write our own genuinely phonetic algorithm, and I laid down these requirements:

  • It must produce phonetic codes that represent typical pronunciations
  • It should focus on “proper names” and not consider other words
  • It should be loose enough to allow for British and American regional differences in pronunciation (e.g. “Shaw” and “Shore”) but not so loose as to equate names that sound completely different
  • It should not try and address other forms of fuzzy matching that arise from keying or reading errors and inconsistencies – phonetic matching should, by definition, simply address words that may sound the same.

The last point is important – the most effective way to match names is to use a combination of techniques to allow for different kinds of error, rather than try to create one algorithm that is a jack of all trades. I will describe our phonetic algorithm in a later post, but software development, like data quality, is a journey and not a destination – so we’re always looking to improve. If any of our readers know of any other algorithms that attempt to genuinely represent the phonetic attributes of names, I would be delighted to hear from you.
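Purely as an illustration of that “combination of techniques” point, and emphatically not a description of helpIT’s algorithm, a matcher can treat two surnames as candidates if they agree on a phonetic key or are nearly identical as strings. The sketch below reuses the soundex() function from earlier in this post, and the 0.85 similarity threshold is invented.

```python
from difflib import SequenceMatcher

def names_match(a, b, similarity_threshold=0.85):
    """Candidate match if the names agree phonetically (Soundex here, purely as a
    stand-in) OR are nearly identical strings, catching keying errors such as a
    dropped letter that a phonetic key alone would miss."""
    if soundex(a) == soundex(b):
        return True
    return SequenceMatcher(None, a.upper(), b.upper()).ratio() >= similarity_threshold

print(names_match("Tootill", "Toothill"))   # True: same Soundex code
print(names_match("Roberts", "Robets"))     # True: caught by string similarity, not by Soundex
```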

Steve Tootill (stEv tWtyl)