Weighing up the Cost of Bad Data

In a recent survey conducted by helpIT systems, almost 25 percent of respondents cited finances as the biggest hindrance to maintaining superior contact databases.  We get it.  Data quality solutions can carry what may seem to be a hefty price tag, and they won’t show up two days later in a nicely wrapped package like an Amazon Prime purchase.  As such, like any other expensive and complicated decision, data quality may well get pushed to the bottom of the pile.

Then again, just like going to the gym or eating salad instead of steak, the toughest behaviors to adopt are usually the most beneficial.  Because even though database management may be something we’d rather forget about, 40 percent of those same respondents stated that their companies were losing tens of thousands of dollars each year due to poor contact data quality.  So while the solution may not be cheap and easy, the cost of living without it does not appear to be either.  The Data Warehousing Institute found that the cost of bad data to US businesses is more than $600 billion each year.  Is that a number your company can afford to ignore?

Many businesses do notice these dollars disappearing and choose to do something about it.  Unfortunately, however, this is often simply a “quick fix”.  They look at their messy databases, pay someone to “clean them up”, and then everyone gets a pat on the back for a job well done.  And it is.  Until someone enters a new record in the CRM, a customer moves, or perhaps even dares to get a new phone number.  And I will shock everyone by reporting that this happens all the time.  Studies indicate up to a 2 percent degradation each month…even in a perfect database.
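
To put that monthly figure in perspective, here is a quick back-of-the-envelope calculation (a simple compounding model of our own, not a figure from the survey): at 2 percent degradation per month, even a database that starts perfectly clean has roughly a fifth of its records go stale within a year.

```python
# Rough illustration: how a 2% monthly decay rate compounds over 12 months.
# The record count and rate are illustrative assumptions, not survey data.
records = 100_000
monthly_decay = 0.02

still_accurate = records * (1 - monthly_decay) ** 12
degraded = records - still_accurate
print(f"Accurate after 12 months: {still_accurate:,.0f} of {records:,}")
print(f"Degraded records: {degraded:,.0f} (about {degraded / records:.0%})")
```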

Right now you’re probably picking up on the fact that maintaining good data is going to cost money.  You’re right.  But the fact is, avoiding that cost is only going to cost more in the long run.  Just like having a well-trained sales team, a finely-targeted marketing plan, or a boss with years of experience…great results are an investment of time and resources rather than a happy accident.

Companies that choose to invest in good data quality, and to view it as an ongoing process rather than a simple one-time fix, are finding that the benefits far outweigh the initial costs.  Advertising dollars are reaching their intended audiences and sales calls are reaching the right recipients, with customer satisfaction going through the roof.  Today’s consumer expects the personal touches that can only come from an accurate and up-to-date Single Customer View, and it is good data quality solutions that will deliver it.

How Ashley Madison Can Inspire Your Business

As each new name and every illicit detail is revealed, the 37 million members of Ashley Madison, a website promoting extramarital affairs, are scrambling to save their marriages, careers, and reputations.  This list, which is now available to anyone aware of the existence of Google, reportedly includes the names and sexual fantasies of members of the armed services, United Nations, and even the Vatican.  Looks like someone’s prayers weren’t heard this week.

As the extent of the contact information becomes more easily accessible, a new breed of data analyst is emerging.  Creative thinkers are using the information to win custody battles, deduce which cities have the most cheaters, and even get a leg up over another candidate for a job promotion.

If everyone from neglected housewives to tawdry tabloid writers is capable of using data to form opinions and make well-informed decisions, the question is… why aren’t you?

Now I’m not talking about crawling through Ashley Madison’s troves of cheaters, I’m talking about your company.  Your data.  Demographics, geographic locations, purchasing behavior… your contact records say a million things about your customers.  A million patterns are lying in wait, holding the key to better marketing, better operations, and better business decisions.  Whereas for Ashley Madison data spelled disaster, for you it should spell potential.

Customer data, when compromised, can be a company’s worst nightmare.  When used intelligently, customer data can increase profits and reduce the guessing game so many businesses play on a day-to-day basis.

In order to use your data intelligently, you must be confident that it is accurate and up-to-date.  If your records indicate you have 14 Jeremiah Whittinglys living in Chicago, you can either double your production of Jeremiah Whittingly personalized baseball caps, or perhaps take a closer look at how clean your data is.  I’m personally leaning towards the second option.

However, beefing up marketing efforts in Juneau, where your database says 10 percent of your client base is located, is a smart idea.  Unless your data entry employee didn’t realize ‘AK’ was the postal code abbreviation for Alaska rather than Arkansas.  In which case, polar bears stand a better chance of appreciating your new billboard than your target market.

Ridding your database of duplicate, incorrect, or incomplete records is the first step in recognizing the power of customer data.  The next step is figuring out what this data means for you and your company, and if every talk show host and dark web hacker can do it with the right tools, so can you.

Why Customers Must Be More Than Numbers

I read with some amazement a story in the London Daily Telegraph this week about a customer of NatWest Bank who sent £11,200 last month via online banking to an unknown company instead of his wife. Although Paul Sampson had correctly entered his wife’s name, sort code and account number when he first made an online payment to her HSBC account, he wasn’t aware that she had subsequently closed the account.

Mr Sampson thought he was transferring £11,200 to his wife: he clicked Margaret’s name among a list of payees saved in his NatWest banking profile and confirmed the transaction, but the payment went to a business in Leeds. Mr Sampson believes that HSBC had reissued his wife’s old account number to someone else, a company whose name they refused to tell him. NatWest told Mr Sampson it was powerless to claw the money back.

HSBC said it had contacted its customer, but it had no obligation regarding the money. HSBC insisted that the account number in question was not “recycled”, saying Mr Sampson must have made a typing error when he first saved the details, which he disputes. Although the money was in fact returned after the newspaper contacted HSBC, a very large issue has not been resolved.

Although news to most of us, it is apparently a common practice among banks in the UK to recycle account numbers, presumably because banking systems are so entrenched around 8 or 9 digit account numbers that they are concerned about running out of numbers. Apparently a recent code of practice suggests that banks should warn the customer making the payment if they haven’t sent money to this payee for 13 months, but according to the Daily Telegraph “No major high street bank could confirm that it followed this part of the code”.

The Daily Telegraph goes on to state that the recipients of electronic payments are identified by account numbers only. The names are not checked in the process, so even if they do not match, the transaction can proceed. “This is now a major issue when you can use something as basic as a mobile phone number to transfer money,” said Mike Pemberton, of solicitors Stephensons. “If you get one digit wrong there’s no other backup check, like a person’s name – once it’s gone it’s gone.” If you misdirect an online payment, your bank should contact the other bank within two working days of your having informed them of the error, but they have no legal obligation to help.

Mr Sampson obviously expected that the bank’s software would check that the account number belonged to the account name he had stored in his online payee list, but apparently UK banking software doesn’t do this. Why on earth not? Surely it’s not unreasonable for banks, with all the money they spend on computer systems, to perform this safety check? It’s not good enough to point to the problems that can arise when a name is entered in different ways such as Sheila Jones, Mrs S Jones, Sheila M Jones, SM Jones, Mrs S M Jones, Mrs Sheila Mary Jones etc.

These are all elementary examples for intelligent name matching software.  More challenging are typos, nicknames and other inconsistencies such as those caused by poor handwriting, which would all occur regularly should banks check the name belonging to the account number. But software such as matchIT Hub is easily available to cope with these challenges too, as well as the even more challenging job of matching joint names and business names.
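
To show why the variants above are the easy cases, here is a toy normalisation sketch of my own (emphatically not how matchIT Hub or any bank’s software works): strip the titles and compare on surname plus first initial, and every form of Sheila Jones above collapses to the same key.

```python
# Toy name normaliser: strip common titles, then key on surname + first initial.
# A simplified sketch for illustration only, not real matching logic.
TITLES = {"mr", "mrs", "ms", "miss", "dr"}

def name_key(name: str) -> tuple[str, str]:
    parts = [p.strip(".").lower() for p in name.replace(",", " ").split()]
    parts = [p for p in parts if p not in TITLES]
    first = parts[0] if parts else ""
    surname = parts[-1] if parts else ""
    # Taking only the first character also copes with run-together initials like "SM".
    return (surname, first[:1])

variants = ["Sheila Jones", "Mrs S Jones", "Sheila M Jones",
            "SM Jones", "Mrs S M Jones", "Mrs Sheila Mary Jones"]
print({name_key(v) for v in variants})   # all six collapse to {('jones', 's')}
```

Typos, nicknames and joint names need far more than this, which is exactly the point of using dedicated matching software rather than an exact string comparison.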

There are also issues in the USA with banking software matching names. I remember when I first wanted to transfer money from my Chase account to my Citibank account, I could only do so if the two accounts had exactly the same name – these were joint accounts and the names had to match exactly, letter for letter, so I had to either change the name on one of the accounts or open a new one! Having been an enthusiastic user of the system in the USA for sending money to someone electronically using just their email address, I’m now starting to worry about the wisdom of this…

We banking customers should perhaps question our banks more closely about the checks that they employ when we make online payments!

Who is the “Current Resident”, anyway?

Our CEO received an extremely heavy, expensive-looking catalog in the mail the other day, from an upmarket retailer and addressed to a previous occupant of his house “Or Current Resident”. When you receive catalogs in the mail that are addressed to the previous homeowner or the “current resident”, do you read them or toss them? Obviously the company hopes that whoever receives a catalog at this address will take a gander at what’s being offered. But is this a cost-effective supposition? When you consider the resources wasted on shipping a catalog to anyone who lives at a particular address, you have to wonder whether this is a smart strategy or just a cop-out from cleansing a database.

Using address verification software that includes the National Change of Address (NCOA) service would help catalog senders increase their return on investment by updating their databases as frequently as needed. The NCOA service would ensure that databases are updated with the customer’s current address information, or warn of deceased customers or customers who moved without giving a forwarding address.

NCOA relies on customers filling out a Change of Address form and on the USPS’s internal databases, which keep track of that information and relay it to the NCOA service. Rather than make use of NCOA data, many companies add “Or Current Resident” to the name from their databases, as the most timely and least expensive way of allowing for the possibility that the addressee may no longer be there.

Set against the convenience of this tactic, these factors should also be considered:

  • The expense of shipping items to an old address
  • The much reduced chance of the new resident making a purchase
  • Losing track of a past customer
  • Alienating the new mail recipient

But does the NCOA process take that much time, or add to the expense of the mailing? Well, the answer is “no” on both counts! A whole spectrum of NCOA options is available, from desktop software that can be used by marketers and software integrated into the corporate database (both contacting an NCOA service under the hood), to online bureaus that take your data and return the updated file a few hours later. The cost depends on data volumes, but even if you only have a few thousand records in your mailing file, you can always find an option that saves you money compared with print and mail costs – especially if your catalog is bulky.

Sending catalogs to the “current resident” might sound like easy advertising, but it doesn’t deliver return on investment for the costs of printing and mailing and it doesn’t help your brand. It really is easy and much smarter to keep track of customers with NCOA services, stop shipments to non-existent customers and even save money to reinvest in other positive, more effective marketing efforts.

6 Reasons Companies Ignore Data Quality Issues

When lean businesses encounter data quality issues, managers may be tempted to leverage existing CRM platforms or similar tools to try to meet the perceived data cleansing needs. They might also default to reinforcing some existing business processes and educating users in support of good data. While these approaches might be a piece of the data quality puzzle, it would be naive to think that they will resolve the problem. In fact, ignoring the problem while trying half-hearted approaches can actually amplify what you’ll eventually have to deal with later. So why do they do it? Here are some of the reasons we have heard for why businesses have stuck their heads in the proverbial data quality sand:

1. “We don’t need it. We just need to reinforce the business rules.”

Even in companies that run the tightest of ships, reinforcing business rules and standards won’t prevent all your problems. First, not all data quality errors are attributable to lazy or untrained employees. Consider nicknames, multiple legitimate addresses and variations on foreign spellings, to mention just a few. Plus, while getting your process and team in line is always a good habit, it still leaves the challenge of cleaning up what you’ve got.

2. “We already have it. We just need to use it.”

Stakeholders often mistakenly think that data quality tools are inherent in existing applications or are a modular function that can be added on. Managers with sophisticated CRM or ERP tools in place may find it particularly hard to believe that their expensive investment doesn’t account for data quality. While customizing or extending existing ERP applications may take you part of the way, we are constantly talking to companies that have used up valuable time, funds and resources trying to squeeze a sufficient data quality solution out of one of their other software tools, and it rarely goes well.

3. “We have no resources.”

When human, IT and financial resources are maxed out, the thought of adding a major initiative such as data quality can seem foolhardy. Even defining business requirements is challenging unless a knowledgeable data steward is on board. With no clear approach, some businesses tread water instead of mounting a formal assault. It’s important to keep in mind, though, that procrastinating on a data quality issue can cost more resources in the long run, because the time it takes staff to navigate data with inherent problems takes a serious toll on efficiency.

4. “Nobody cares about data quality.”

Unfortunately, when it comes to advocating for data quality, there is often only a lone voice on the team, speaking up for something that no one else really seems to care about. The key is to find the people who get it. They are there; the problem is they are rarely asked. They are usually in the trenches, trying to work with the data or struggling to keep up with the maintenance. They are not empowered to change any systems to resolve the data quality issues and may not even realize the extent of the issues, but they definitely care, because it impacts their ability to do their job.

5. “It’s in the queue.”

Businesses may recognize the importance of data quality but just can’t think about it until after some other major implementation, such as a data migration, integration or warehousing project. It’s hard to know where data quality fits into the equation and when and how that tool should be implemented but it’s a safe bet to say that the time for data quality is before records move to a new environment. Put another way: garbage in = garbage out. Unfortunately for these companies, the unfamiliarity of a new system or process compounds the challenge of cleansing data errors that have migrated from the old system.

6. “I can’t justify the cost.”

One of the biggest challenges we hear about in our industry is the struggle to justify a data quality initiative with an ROI that is difficult to quantify. However, just because you can’t capture the cost of bad data in a single number doesn’t mean that it’s not affecting your bottom line. If you are faced with the dilemma of ‘justifying’ a major purchase but can’t find the figures to back it up, try to justify doing nothing. It may be easier to argue against sticking your head in the sand than to fight ‘for’ the solution you know you need.

Is your company currently sticking its head in the sand when it comes to data quality? What other reasons have you heard?

Remember, bad data triumphs when good managers do nothing.

8 Ways to Save Your Data Quality Project

Let’s face it, if data quality were easy, everyone would have good data and it wouldn’t be such a hot topic. On the contrary, despite all the tools and advice out there, selecting and implementing a comprehensive data quality solution still presents some hefty challenges. So how does a newly appointed Data Steward NOT mess up the data quality project? Here are a few pointers on how to avoid failure.

1. DON’T FORGET THE LITTLE PEOPLE

As with other IT projects, the top challenge for data quality projects is securing business stakeholder engagement throughout the process. But this doesn’t just mean C-level executives. Stakeholders for a data quality initiative should also include department managers and even end-users within the company who must deal with the consequences of bad data as well as the impact of system changes. Marketing, for example, relies on data accuracy to reach the correct audience and maintain a positive image. Customer Service depends on completeness and accuracy of a record to meet their specific KPIs. Finance, logistics and even manufacturing may need to leverage the data for effective operations or even to feed future decisions. When it comes to obtaining business buy-in, it is critical for Data Stewards to think outside the box regarding how the organization uses (or could use) the data and then seek input from the relevant team members. While the instinct might be to avoid decision by committee, in the end, it’s not worth the risk of developing a solution that does not meet business expectations.

2. BEWARE OF THE “KITCHEN SINK” SOLUTION

The appeal of an ‘umbrella’ data management solution can lure both managers and IT experts, offering the ease and convenience of one-stop shopping. In fact, contact data quality can often be an add-on toolset offered by a major MDM or BI vendor – simply to check the box. However, when your main concern is contact data, be sure to measure all your options against a best-of-breed standard before deciding on a vendor. That means understanding the difference between match quality and match quantity, determining the intrinsic value (for your organization) of integrated data quality processes and not overlooking features (or quality) that might seem like nice-to-haves now but which, down the line, can make or break the success of your overall solution.  Once you know the standard you are looking for with regards to contact deduplication, address validation, and single customer view, you can effectively evaluate whether those larger-scale solutions will have the granularity needed to achieve the best possible contact data cleansing for your company. While building that broader data strategy is a worthy goal, now is the time to be conscious of not throwing data quality out with the proverbial bathwater.

3. JUST BECAUSE YOU CAN, DOESN’T MEAN YOU SHOULD

When it comes to identifying the right contact data quality solution, most companies not only compare vendors to one another but also consider developing a solution in-house. In fact, if you have a reasonably well-equipped IT department (or consultant team), it is entirely possible that an in-house solution will appear cheaper to develop, and several factors may push organizations in that direction, including the desire to have more control over the data or to eliminate security and privacy concerns.

There is a flip side, however, to these perceived advantages that begs to be considered before jumping in. First, ask yourself: does your team really have the knowledge AND bandwidth necessary to pull this off? Contact data cleansing is both art and science. Best-of-breed applications have been developed over years of trial and error and come with very deep knowledge bases and sophisticated match algorithms that can take a data quality project from 80% accuracy to 95% or greater. When you are dealing with millions or even billions of records, that extra percentage matters. Keep in mind that even the best-intentioned developers may be all too eager to prove they can build a data quality solution, without much thought as to whether or not they should. Even if the initial investment is less expensive than a purchased solution, how much revenue is lost (or not gained) by diverting resources to this initiative rather than to something more profitable?  In-house solutions can be viable, as long as they are chosen for the right reasons and nothing is sacrificed in the long run.

4. NEVER USE SOMEONE ELSE’S YARDSTICK

Every vendor you evaluate will basically tell you to measure by the benchmarks they perform best at. So the only way to make a truly unbiased decision is to know ALL the benchmarks, decide for yourself which are most important to your company, and not be fooled by the fine print. For example:

  • The number of duplicates found is often touted as a key measure of an application’s efficacy, but that figure is only valuable if they are all TRUE duplicates. Check this in an actual trial of your own data and go for the tool that delivers the greater number of TRUE duplicates while minimizing false matches.
  • Speed matters too, but make sure you know the run speeds on your data and on your equipment.
  • More ‘versatile’ solutions are great, as long as your users will really be able to take advantage of all the bells and whistles.
  • Likewise, the volume of records processed should cover you for today and for what you expect to be processing in the next two to five years, as this solution is not going to be something you want to implement and then change within a short time frame. Hence, scalability matters as well.

So, use your own data file, test several software options and compare the results in your own environment, with your own users. Plus, remember the intangibles, like how long it will take to get the solution up and running and users trained, the quality of reports, and so on. These very targeted parameters should be the measure of success for your chosen solution – not what anyone else dictates.
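
One way to keep any vendor’s yardstick honest is to hand-label a small sample of match pairs from your own data and score every candidate tool against it, counting true duplicates, false matches and misses. A minimal sketch of that scoring, with invented pair IDs and labels purely for illustration:

```python
# Score a tool's reported duplicate pairs against a hand-labelled sample.
# 'truth' marks which pairs a human reviewer agreed are genuine duplicates.
truth = {("A1", "A7"): True, ("B2", "B9"): True, ("C3", "C4"): False,
         ("D5", "D8"): True, ("E1", "E6"): False}
reported = [("A1", "A7"), ("C3", "C4"), ("D5", "D8")]   # pairs the tool flagged

true_dupes  = sum(1 for pair in reported if truth.get(pair))
false_dupes = len(reported) - true_dupes
missed      = sum(truth.values()) - true_dupes

print(f"true dupes: {true_dupes}, false matches: {false_dupes}, missed: {missed}")
print(f"precision: {true_dupes / len(reported):.0%}, recall: {true_dupes / sum(truth.values()):.0%}")
```

Run the same labelled sample through each tool on trial and the “number of duplicates found” claim quickly resolves into something you can actually compare.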

5. MIND YOUR OWN BUSINESS (TEST CASES, THAT IS)

Not all matching software is created equal, and the only way to effectively determine which software will address your specific needs is to develop test cases that serve as relevant and appropriate examples of the kinds of data quality issues your organization is experiencing. These should be used as the litmus test to determine which applications will best be able to resolve those examples. Be detailed in developing these test cases so you can get down to the granular features in the software which address them. Here are a few examples to consider:

  • Do you have contact records with phonetic variations in their names?
  • Are certain fields prone to missing or incorrect data?
  • Do your datasets consistently have data in the wrong fields (e.g. names in address lines, postal code in city fields, etc)?
  • Is business name matching a major priority?
  • Do customers often have multiple addresses?

Once you have identified a specific list of recurring challenges within your data, pull several real-world examples from your actual database and use them in any data sample you send to vendors for trial cleansing. When reviewing the results, make sure the solutions you are considering can find these matches on a trial. Each test case will require specific features and strengths that not all data quality software offers. Without this granular level of information about the names, addresses, emails, zip codes and phone numbers that are in your system, you will not be able to fully evaluate whether a software can resolve them or not.

6. REMEMBER IT’S NOT ALL BLACK AND WHITE

Contact data quality solutions are often presented as binary – they either find the match or they don’t. In fact, as we mentioned earlier, some vendors will tout the number of matches found as the key benchmark for efficacy. The problem with this perception is that matching is not black and white – there is always a gray area of matches that ‘might be the same, but you can’t really be sure without inspecting each match pair’, so it is important to anticipate how large your gray area will be and have a plan for addressing it. This is where the false match/true match discussion comes into play.

True matches are just what they sound like, while false matches are contact records that look and sound alike to the matching engine but are, in fact, different. While it’s great when a software package can find lots of matches, the scary part is deciding what to do with them. Do you merge and purge them all? What if they are false matches? Which one do you treat as a master record?  What info will you lose? What other consequences flow from an incorrect decision?

The bottom line is: know how your chosen data quality vendor or solution will address the gray area. Ideally, you’ll want a solution that allows the user to set the threshold of match strictness. A mass marketing mailing may err on the side of removing records in the gray area to minimize the risk of mailing dupes whereas customer data integration may require manual review of gray records to ensure they are all correct. If a solution doesn’t mention the gray area or have a way of addressing it, that’s a red flag indicating they do not understand data quality.
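
In practice, that usually means a solution exposing two thresholds rather than one: everything scoring above the upper threshold is merged automatically, everything below the lower one is left alone, and the gray band in between is routed for review. A hedged sketch of the idea, with made-up scores and thresholds rather than anything from a real product:

```python
# Route match pairs by score into auto-merge, manual review, or no action.
# Threshold values are illustrative; tune them per use case
# (looser for a mass mailing, stricter for customer data integration).
AUTO_MERGE_AT = 0.92
REVIEW_AT = 0.75

def route(score: float) -> str:
    if score >= AUTO_MERGE_AT:
        return "auto-merge"
    if score >= REVIEW_AT:
        return "manual review"          # the gray area
    return "leave as separate records"

for score in (0.97, 0.88, 0.60):
    print(f"{score:.2f} -> {route(score)}")
```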

7. DON’T FORGET ABOUT FORMAT

Most companies do not have the luxury of one nice, cleanly formatted database where everyone follows the rules of entry. In fact, most companies have data stored in a variety of places with incoming files muddying the waters on a daily basis. Users and customers are creative in entering information. Legacy systems often have inflexible data structures. Ultimately, every company has a variety of formatting anomalies that need to be considered when exploring data cleansing tools. To avoid finding out too late, make sure to pull together data samples from all your sources and run them during your trial. The data quality solution needs to handle data amalgamation from systems with different structures and standards. Otherwise, inconsistencies will migrate and continue to cause systemic quality problems.
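
A simple way to surface those formatting anomalies before the trial is to map every source into one common layout first and see what breaks. A minimal sketch, where the source systems and field names are purely hypothetical:

```python
# Normalise records from two differently structured sources into one schema
# before running them through a data quality trial. Field names are invented.
def from_crm(rec: dict) -> dict:
    return {"name": f'{rec["first"]} {rec["last"]}',
            "address1": rec["street"],
            "zip": rec["postal_code"],
            "source": "crm"}

def from_legacy(rec: dict) -> dict:
    return {"name": rec["contact"],
            "address1": rec["addr_line_1"],
            "zip": rec["zip5"],
            "source": "legacy"}

combined = [
    from_crm({"first": "Ann", "last": "Lee", "street": "1 Main St", "postal_code": "01801"}),
    from_legacy({"contact": "Ann Lee", "addr_line_1": "1 Main Street", "zip5": "01801"}),
]
print(combined)   # same person from two systems, now in one layout, ready for matching
```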

8. DON’T BE SHORT-SIGHTED

Wouldn’t it be nice if once data is cleansed, the record set remains clean and static? Well, it would be nice but it wouldn’t be realistic. On the contrary, information constantly evolves, even in the most closed-loop system. Contact records represent real people with changing lives and as a result, decay by at least 4 percent per year through deaths, moves, name changes, postal address changes or even contact preference updates. Business-side changes such as acquisitions/mergers, system changes, upgrades and staff turnover also drive data decay. The post-acquisition company often faces the task of either hybridizing systems or migrating data into the chosen solution. Project teams must not only consider record integrity, but they must update business rules and filters that can affect data format and cleansing standards.

Valid data being entered into the system during the normal course of business (either by CSR reps or by customers themselves) also contributes to ongoing changes within the data. New forms and data elements may be added by marketing and will need to be accounted for in the database. Incoming lists or big data sources will muddy the water. Expansion of sales will result in new audiences and languages providing data in formats you haven’t anticipated. Remember, the only constant in data quality is change. If you begin with this assumption, you skyrocket your project’s likelihood of success. Identify the ways that your data changes over time so you can plan ahead and establish a solution or set of business processes that will scale with your business.

Data quality is hard. Unfortunately, there is no one-size-fits-all approach, and there isn’t even a single vendor that can solve all your data quality problems. However, by being aware of some of the common pitfalls and doing a thorough and comprehensive evaluation of any vendors involved, you can get your initiative off to the right start and give yourself the best possible chance of success.

What I Learned About Data Quality From Vacation

Over the 12 hours it took us to get from NY to the beaches of North Carolina, I had plenty of time to contemplate how our vacation was going to go. I mentally planned our week out and tried to anticipate the best ways for us to ‘relax’ as a family. What relaxes me is not having to clean up.  So to facilitate this, I set about implementing a few ‘business rules’ so that we could manage our mess in real time, which, I knew deep down, would be better for everyone.  The irony of this, as it relates to my role as the Director of Marketing for a data quality company, did not escape me, but I didn’t realize there would be fodder for a blog post in here until I realized business rules actually can work. Really and truly. This is how.

1. We Never Got Too Comfortable.

We were staying in someone else’s house and it wasn’t our stuff. It dawned on me that we take much more liberty with our own things than we apparently do with someone else’s, and I believe this applies to data as well. Some departments feel like they are the ‘owners’ of specific data. I know from direct experience that marketing, in many cases, takes responsibility for customer contact data, and as a result, we often take liberties knowing ‘we’ll remember what we changed’ or ‘we can always deal with it later’. The reality is, there are lots of other people who use and interact with that data, and each business user would benefit from following a “Treat It Like It’s Someone Else’s” approach.

2. Remember the Buck Stops With You.

In our rental, there was no daily cleaning lady and we didn’t have the freedom of leaving it messy when we left (in a mere 7 days). So essentially, the buck stopped with us. Imagine how much cleaner your organization’s data would be if each person who touched it took responsibility for leaving it in good condition. Business rules that communicate to each user that they will be held accountable for the integrity of each data element, along with clarity on what level of maintenance is expected, can help develop this sense of responsibility.

3. Maintain a Healthy Sense of Urgency.

On vacation, we had limited time before we’d have to atone for any messy indiscretions. None of us wanted to face a huge mess at the end of the week so it made us more diligent about dealing with it on the fly. To ‘assist’ the kids with this, we literally did room checks and constantly reminded each other that we had only a few days left – if they didn’t do it now, they’d have to do it later. Likewise, if users are aware that regular data audits will be performed and that they will be the ones responsible for cleaning up the mess, the instinct to proactively manage data may be just a tad stronger.

So when it comes to vacation (and data quality), there is good reason not to put off important cleansing activities that can be made more manageable by simply doing them regularly in small batches.

Phonetic Matching Matters!

by Steve Tootill (Tootle, Toothill, Tutil, Tootil, Tootal)

In a recent blog entry, Any Advance on Soundex?, I promised to describe our phonetic algorithm, soundIT. To recap, here’s what we think a phonetic algorithm for contact data matching should do:

  • Produce phonetic codes that represent typical pronunciations
  • Focus on “proper names” and not consider other words
  • Be loose enough to allow for regional differences in pronunciation but not so loose as to equate names that sound completely different.

We don’t think it should also try and address errors that arise from keying or reading errors and inconsistencies, as that is best done by other algorithms focused on those types of issues.

To design our algorithm, I decided to keep it in the family: my father Geoff Tootill is a linguist, classics scholar and computer pioneer, who played a leading role in the development of the Manchester Small-Scale Experimental Machine in 1947-48, now popularly known as the “Baby” – the first computer that stored programs in electronic memory.

The first program stored in electronic memory

Geoff was an obvious choice to grapple with the problem of how to design a program that understands pronunciation… We called the resultant algorithm “soundIT”.

So, how does it work?

soundIT derives phonetic codes that represent typical pronunciation of names. It takes account of vowel sounds and determines the stressed syllable in the name. This means that “Batten” and “Batton” sound the same according to soundIT, as the different letters fall in the unstressed syllable, whilst “Batton” and “Button” sound different, as it is the stressed syllable which differs. Clearly, “Batton” and “Button” are a fuzzy match, just not a phonetic match. My name is often misspelled as “Tootle”, “Toothill”, “Tutil”, “Tootil” and “Tootal”, all of which soundIT equates to the correct spelling of “Tootill” – probably why I’m so interested in fuzzy matching of names! Although “Toothill” could be pronounced as “tooth-ill” rather than “toot-hill”, most people treat the “h” as part of “hill” but don’t stress it, hence it sounds like “Tootill”. Another advantage of soundIT is that it can recognize silent consonants – thus it can equate “Shaw” and “Shore”, “Wight” and “White”, “Naughton” and “Norton”, “Porter” and “Porta”, “Moir” and “Moya” (which are all reasonably common last names in the UK and USA).
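
To make the stressed-syllable idea concrete, here is a deliberately crude sketch of my own devising (it is emphatically not soundIT, and it handles none of the silent-consonant cases like Shaw/Shore): key each name on its consonant skeleton plus the vowel of the first syllable, treated as the stressed one. Even this toy reproduces the Batten/Batton/Button behaviour described above.

```python
# Toy pronunciation-style key: consonant skeleton + vowel of an assumed-stressed
# first syllable. A sketch to illustrate the idea only; not the soundIT algorithm.
import re

def toy_phonetic_key(name: str) -> str:
    s = re.sub(r"[^a-z]", "", name.lower())
    s = re.sub(r"(.)\1+", r"\1", s)         # collapse doubled letters: "tt" -> "t"
    vowel_groups = re.findall(r"[aeiou]+", s)
    stressed_vowel = vowel_groups[0] if vowel_groups else ""
    skeleton = re.sub(r"[aeiou]+", "", s)   # drop all vowels to get the consonants
    return f"{skeleton}:{stressed_vowel}"

for name in ("Batten", "Batton", "Button"):
    print(name, "->", toy_phonetic_key(name))
# Batten and Batton share a key (the difference is in the unstressed syllable);
# Button gets a different key because its stressed vowel differs.
```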

There are always going to be challenges with representing pronunciation of English names e.g. the city of “Reading” rhymes with “bedding” not “weeding”, to say nothing of the different pronunciations of “ough” represented in “A rough-coated dough-faced ploughboy strode coughing and hiccoughing thoughtfully through the streets of the borough”. Although there are no proper names in this sentence, the challenges of “ough” are represented in place names like “Broughton”, “Poughkeepsie” and “Loughborough”. Fortunately, these challenges only occur in limited numbers and we have found in practice that non-phonetic fuzzy matching techniques, together with matching on other data for a contact or company, allow for the occasional ambiguity in pronunciation of names and places. These exceptions don’t negate the need for a genuine phonetic algorithm in your data matching arsenal.

We implemented soundIT within our dedupe package (matchIT) fairly easily and then proceeded to feed through vast quantities of data to identify any weaknesses and improvements required. soundIT proved very successful in its initial market in the UK and then in the USA. There are algorithms that focus on other languages, such as Beider-Morse Phonetic Matching for Germanic and Slavic languages, but as helpIT systems’ market focus is on English and Pan-European data, we developed a generic form of soundIT for European languages. We also use a looser version of the algorithm for identifying candidate matches than we do for actually allocating similarity scores.

Of course, American English pronunciation of names can be subtly different – a point that was brought home to us when an American customer passed on the comment from one of his team “Does Shaw really sound like Shore?” As I was reading this in an email, and as I am a Brit, I was confused! I rang a friend in Texas who laughed and explained that I was reading it wrong – he read it back to me in a Texan accent and I must admit, they did sound different! But then he explained to me that if you are from Boston, Shaw and Shore do sound very similar, so he felt that we were quite right to flag them as a potential match.

No program is ever perfect, so we continue to develop and tweak soundIT to this day, but it has stood the test of time remarkably well – apart from Beider-Morse, I still don’t know of another algorithm that takes this truly phonetic approach, let alone as successfully as soundIT has done.

Steve Tootill (stEv tWtyl)

Creating Your Ideal Test Data

Every day we work with customers to begin the process of evaluating helpIT data quality software (along with other vendors they are looking at). That process can be daunting for a variety of reasons from identifying the right vendors to settling on an implementation strategy, but one of the big hurdles that occurs early on in the process is running an initial set of data through the application.

Once you’ve gotten a trial of a few applications (hopefully including helpIT’s) and you are poised to start your evaluation to determine which one is going to generate the best result – you’ll need to develop a sample data set to run on the software. This is an important step not to be overlooked because you want to be sure that the software you invest in can deliver the highest quality matches so you can effectively dedupe your database and most importantly, TRUST that the resulting data is as clean as it possibly can be with the least possible wiggle room. So how do you create the ideal test data?

The first word of advice – use real data.

Many software trials will come preinstalled with sample or demo data designed primarily to showcase the features of the software. While this sample data can give you examples of generic match results, it will not be a clear reflection of your match results. This is why it is best to run an evaluation of the software on your own data whenever possible. Using the guidelines below, we suggest identifying a real dataset that is representative of the challenges you will typically see within your actual database. That dataset will tell you if the software can find your more challenging matches, and how well it can do that.

For fuzzy matching features, you may like to consider whether the data that you test with includes these situations:

  • phonetic matches (e.g. Naughton and Norton)
  • reading errors (e.g. Horton and Norton)
  • typing errors (e.g. Notron, Noron, Nortopn and Norton)
  • one record has title and initial and the other has first name with no title
    (e.g. Mr J Smith and John Smith)
  • one record has missing name elements (e.g. John Smith and Mr J R Smith)
  • names are reversed (e.g. John Smith and Smith, John)
  • one record has missing address elements (e.g. one record has the village or house
    name and the other address just has the street number or town)
  • one record has the full postal code and the other a partial postal code or no postal code

When matching company name data, consider including the following challenges (see the sketch after these examples):

  • acronyms e.g. IBM, I B M, I.B.M., International Business Machines
  • one record has missing name elements e.g.
  1. The Crescent Hotel, Crescent Hotel
  2. Breeze Ltd, Breeze
  3. Deloitte & Touche, Deloitte, Deloittes.
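
To turn lists like these into a repeatable evaluation, it helps to encode each scenario as a labelled test pair and run every candidate tool over the same set. Here is a minimal harness sketch; difflib is only a naive stand-in for whichever matcher you are trialling, and the pairs are drawn from the examples above:

```python
# A tiny test harness: scenario-labelled pairs that a good matcher should find.
# difflib's string similarity is just a placeholder "matcher" to show the shape;
# substitute the scored output of each tool you are evaluating.
from difflib import SequenceMatcher

TEST_PAIRS = [
    ("Naughton", "Norton",                        "phonetic variation"),
    ("Mr J Smith", "John Smith",                  "title + initial vs first name"),
    ("John Smith", "Smith, John",                 "reversed name"),
    ("I.B.M.", "International Business Machines", "acronym vs full company name"),
    ("The Crescent Hotel", "Crescent Hotel",      "missing company name element"),
]

def naive_score(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for a, b, scenario in TEST_PAIRS:
    print(f"{scenario:32s} {naive_score(a, b):.2f}")
# A plain string-similarity score copes with some of these and fails on others,
# which is exactly the kind of gap your test cases should expose for each vendor.
```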

You should also ensure that you have groups of records where the data that matches exactly varies for pairs within the group.

If you don’t have these scenarios all represented, you can doctor your real data to create them, as long as you start with real records that are as close as possible to the test cases and make one or at the most two changes to each record. In the real world, matching records will have something in common – not every field will be slightly different.

With regard to size, it’s better to work with a reasonable sample of your data than a whole database or file; otherwise the mass of information runs the risk of obscuring important details, and test runs take longer than they need to. We recommend that you take two selections from your data – one for a specific postal code or geographic area, and one (if possible) for an alphabetical range by last name. Join these selections together and then eliminate all the exact matches – if you can’t do this easily, one of the solutions that you’re evaluating can probably do it for you.
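
If you would rather prepare the sample yourself, the join-and-deduplicate step is straightforward in something like pandas. A sketch under assumed conditions: the file names and column names below are hypothetical, so adjust them to your own extracts.

```python
# Build the trial sample: two slices of real data joined, with exact duplicates
# removed up front so the trial focuses on the fuzzier matches.
import pandas as pd

by_postcode = pd.read_csv("customers_zip_902xx.csv")       # hypothetical extract
by_surname = pd.read_csv("customers_surname_m_to_p.csv")   # hypothetical extract

sample = pd.concat([by_postcode, by_surname], ignore_index=True)
sample = sample.drop_duplicates(subset=["first_name", "last_name", "address1", "zip"])
sample.to_csv("trial_sample.csv", index=False)
print(len(sample), "records in the trial sample")
```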

Ultimately, you should have a reasonably sized sample without too many obvious matches, which should contain a reasonable number of fuzzier matches (e.g. matches where the first character of the postal code or last name is different between two records that otherwise match, matches with phonetic variations of last name, etc.)

__________________________________________________________________________

For more information on data quality vendor evaluations, please download our Practical Guide to Data Quality Vendor Selection.

Golden Records Need Golden Data: 7 Questions to Ask

If you’ve found yourself reading this blog then you’re no doubt already aware of the importance of maintaining data quality through processes such as data verification, suppression screening, and duplicate detection. In this post I’d like to look a bit closer at how you draw value from, and make the best use of, the results of the hard work you invest into tracking down duplicates within your data.

The great thing about fuzzy matching is that it enables us to identify groups of two or more records that pertain to the same entity but that don’t necessarily contain exactly the same information. Records in a group of fuzzy matches will normally contain similar information with slight variations from one record to the next. For example, one record may contain a full forename whilst another contains just an abbreviated version or even none at all. You will also frequently encounter fuzzy matches where incorrectly spelt or poorly input data is matched against its accurate counterpart.

Once you’ve identified these groups of fuzzy matches, what do you do with them? Ultimately you want to end up with only unique records within your data, but there are a couple of ways that you can go about reaching that goal. One approach is to try and determine the best record in a group of matches and discard all of the records that matched against it. Other times, you may find that you are able to draw more value from your data by taking the most accurate, complete, and relevant information from a group of matched records and merging it together, so that you’re left with a single hybrid record containing a better set of data than was available in any of the individual records from which it was created.

Regardless of the approach you take, you’ll need to establish some rules to use when determining the best record or best pieces of information from multiple records. Removing the wrong record or information could actually end up making your data worse so this decision warrants a bit of thought. The criteria you use for this purpose will vary from one job to the next, but the following is a list of 7 questions that target the desirable attributes you’ll want to consider when deciding what data should be retained:

  1. How current is the data?
    You’ll most likely want to keep data that was most recently acquired.
  2. How complete is the data?
    How many fields are populated, and how well are those fields populated?
  3. Is the data valid?
    Have dates been entered in the required format? Does an email address contain an at sign?
  4. Is the data accurate?
    Has it been verified (e.g. address verified against PAF)?
  5. How reliable is the data?
    Has it come from a trusted source?
  6. Is the data relevant?
    Is the data appropriate for its intended use (e.g. keep female contacts over male if compiling a list of recipients for a women’s clothing catalogue)?
  7. Is there a predetermined hierarchy?
    Do you have a business rule in place that requires that one set of data is always used over another?

When you have such a large range of competing criteria to consider, how do you apply all of these rules simultaneously? The approach we at helpIT use in our software is to allow the user to weight each item or collection of data, so they can choose what aspects are the most important in their business context. This isn’t necessarily whether an item is present or not, or how long it is, but could be whether it was an input value or derived from supplied information, or whether it has been verified by reference to an external dataset such as a Postal Address File. Once the master record has been selected, the user may also want to transfer data from records being deleted to the master record e.g. to copy a job title from a duplicate to a master record which contains fuller/better name and address information, but no job title. By creating a composite record, you ensure that no data is lost.
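
As a rough illustration of how weighted selection plus data transfer can fit together, here is a sketch of my own; the weights, field names and scoring are illustrative assumptions, not how helpIT’s software actually scores records.

```python
# Sketch of weighted master-record selection: score each record in a match group
# against user-chosen weights, keep the winner, then fill its gaps from the rest.
from datetime import date

WEIGHTS = {"verified_address": 5, "has_email": 2, "has_job_title": 1}

def score(rec: dict) -> float:
    s = WEIGHTS["verified_address"] * bool(rec.get("address_verified"))
    s += WEIGHTS["has_email"] * bool(rec.get("email"))
    s += WEIGHTS["has_job_title"] * bool(rec.get("job_title"))
    # Tiny recency bonus so newer data wins ties.
    s += rec.get("last_updated", date.min).toordinal() / 1_000_000
    return s

def golden_record(group: list[dict]) -> dict:
    ranked = sorted(group, key=score, reverse=True)
    master = dict(ranked[0])
    for other in ranked[1:]:                     # copy fields the master is missing
        for field, value in other.items():
            if value and not master.get(field):
                master[field] = value
    return master

group = [
    {"name": "Sheila Jones", "email": "", "job_title": "CFO",
     "address_verified": False, "last_updated": date(2014, 3, 1)},
    {"name": "Mrs S Jones", "email": "sjones@example.com", "job_title": "",
     "address_verified": True, "last_updated": date(2015, 6, 12)},
]
print(golden_record(group))   # verified record wins; job title carried over from the duplicate
```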

Hopefully this post will have given you something to think about when deciding how to deal with the duplicates you’ve identified in your data. I’d welcome any comments or questions.