Posts

Where Big Data, Contact Data and Data Quality come together

We’ve been working in an area of untapped potential for Big Data for the last couple of years, which can best be summed up by the phrase “Contact Big Data Quality”. It doesn’t exactly roll off the tongue, so we’ll probably have to create yet another acronym, CBDQ… What do we mean by this? Well, our thought process started when we wondered exactly what people mean when they use the phrase “Big Data” and what, if anything, companies are doing in that arena. The more we looked into it, the more we concluded that although there are many different interpretations of “Big Data”, the one thing that underpins all of them is the need for new techniques to enable enhanced knowledge and decision making. I think the challenges are best summed up by the Forrester definition:

“Big Data is the frontier of a firm’s ability to store, process, and access (SPA) all the data it needs to operate effectively, make decisions, reduce risks, and serve customers. To remember the pragmatic definition of Big Data, think SPA — the three questions of Big Data:

  • Store. Can you capture and store the data?
  • Process. Can you cleanse, enrich, and analyze the data?
  • Access. Can you retrieve, search, integrate, and visualize the data?”

http://blogs.forrester.com/mike_gualtieri/12-12-05-the_pragmatic_definition_of_big_data

As part of our research, we sponsored a study by The Information Difference (available here) which answered such questions as:

  • How many companies have actually implemented Big Data technologies, and in what areas?
  • How much money and effort are organisations investing in it?
  • Which areas of the business are driving investment?
  • What benefits are they seeing?
  • What data volumes are being handled?

We concluded that plenty of technology is available to Store and Access Big Data, and many of the tools that provide Access also Analyze the data – but there is a dearth of solutions to Cleanse and Enrich Big Data, at least in terms of contact data, which is where we focus. There are two key hurdles to overcome:

  1. Understanding the contact attributes in the data, i.e. being able to parse, match and link contact information. If you can do this, you can cleanse contact data (remove duplication, correct and standardize information) and enrich it by adding attributes from reference data files (e.g. voter rolls, profiling sources, business information).
  2. Being able to do this for very high volumes of data spread across multiple database platforms.

The first of these should be addressed by standard data cleansing tools, but most only work well on structured data, perhaps even requiring data of a uniform standard – and Big Data, by definition, will contain plenty of unstructured data of widely varying standards and degrees of completeness. At helpIT systems, we’ve always developed software that doesn’t expect data to be well structured and doesn’t rely on data being complete before we can work with it, so we’re already in pretty good shape for clearing this hurdle – although semantic annotation of Big Data is more akin to a journey than a destination!
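To make the first hurdle a little more concrete, here is a minimal Python sketch of the kind of normalization and match-keying involved. It is an illustration only, not how matchIT works: the field names and the tiny nickname table are invented, and a real engine adds fuzzy and phonetic comparison on top of the simple key collisions shown here.

```python
import re
from collections import defaultdict

# Tiny nickname table, invented for illustration; a real solution would use a
# much larger reference set plus fuzzy and phonetic comparison.
NICKNAMES = {"jim": "james", "bob": "robert", "bill": "william"}

def match_key(record: dict) -> tuple:
    """Build a simple match key from free-text name and address fields."""
    first = record["first_name"].strip().lower()
    first = NICKNAMES.get(first, first)
    last = record["last_name"].strip().lower()
    # strip punctuation and collapse whitespace in the address line
    addr = re.sub(r"[^a-z0-9 ]", "", record["address"].lower())
    addr = re.sub(r"\s+", " ", addr).strip()
    postcode = record["postcode"].replace(" ", "").upper()
    return (first, last, addr, postcode)

def group_duplicates(records: list) -> dict:
    """Group records whose match keys collide; these are candidate duplicates."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec)
    return {key: recs for key, recs in groups.items() if len(recs) > 1}

if __name__ == "__main__":
    contacts = [
        {"first_name": "Jim", "last_name": "Smyth",
         "address": "12 High St.", "postcode": "ab1 2cd"},
        {"first_name": "James", "last_name": "Smyth",
         "address": "12  High St", "postcode": "AB1 2CD"},
    ]
    print(group_duplicates(contacts))  # both records collapse onto one key
```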

The second hurdle is the one that we have been focused on for the last couple of years, and we believe that we’ve now got the answer – using in-memory processing for our proven parsing/matching engine to achieve super-fast and scalable performance on data from any source. Our new product, matchIT Hub, will be launching later this month, and we’re all very excited by the potential it has not just for Big Data exploitation, but also for:

  • increasing the number of matches that can safely be automated in enterprise Data Quality applications, and
  • providing matching results across the enterprise that are always available and up-to-date.

In the next post, I’ll write about the potential of in-memory matching coupled with readily available ETL tools.

The 12 Days of Shopping

According to IBM’s real-time reporting unit, Black Friday online sales were up close to 20% this year over the same period in 2012. As for Cyber Monday, sales increased 30.3% in 2012 compared to the previous year and are expected to grow another 15% in 2013. Mobile transactions are at an all-time high, and combined with in-store sales, the National Retail Federation expects retail sales to pass the $600 billion mark during the last two months of the year alone.

While that might sound like music to a retailer’s ears, as the holiday shopping season goes into full swing on this Cyber Monday, the pressure to handle the astronomical influx of data collected at dozens of possible transaction points is mounting. From websites and storefronts to kiosks and catalogues, every scarf or video game purchased this season brings with it a variety of data points that must be appropriately stored, linked, referenced and, hopefully, leveraged. Add to that a blinding amount of big data now being collected (such as social media activity or mobile tracking), and it all amounts to a holiday nightmare for the IT and data analysis teams. So how much data are we talking about, and how does it actually manifest itself? In the spirit of keeping things light, we offer you The 12 Days of Shopping…

On the first day of shopping my data gave to me,
1 million duplicate names.

On the second day of shopping my data gave to me,
2 million transactions, and
1 million duplicate names.

On the third day of shopping my data gave to me,
30,000 credit apps,
2 million transactions, and
1 million duplicate names.

On the fourth day of shopping my data gave to me,
40 returned shipments,
30,000 credit apps,
2 million transactions, and
1 million duplicate names.

On the fifth day of shopping my data gave to me,
5 new marketing lists,
40 returned shipments,
30,000 credit apps,
2 million transactions, and
1 million duplicate names.

On the sixth day of shopping my data gave to me,
6,000 bad addresses,
5 new marketing lists,
40 returned shipments,
30,000 credit apps,
2 million transactions, and
1 million duplicate names.

On the seventh day of shopping my data gave to me,
7,000 refunds,
6,000 bad addresses,
5 new marketing lists,
40 returned shipments,
30,000 credit apps,
2 million transactions, and
1 million duplicate names.

On the eighth day of shopping my data gave to me,
8,000 new logins,
7,000 refunds,
6,000 bad addresses,
5 new marketing lists,
40 returned shipments,
30,000 credit apps,
2 million transactions, and
1 million duplicate names.

On the ninth day of shopping my data gave to me,
90,000 emails,
8,000 new logins,
7,000 refunds,
6,000 bad addresses,
5 new marketing lists,
40 returned shipments,
30,000 credit apps,
2 million transactions, and
1 million duplicate names.

On the tenth day of shopping my data gave to me,
10,000 tweets,
90,000 emails,
8,000 new logins,
7,000 refunds,
6,000 bad addresses,
5 new marketing lists,
40 returned shipments,
30,000 credit apps,
2 million transactions, and
1 million duplicate names.

On the eleventh day of shopping my data gave to me,
11 new campaigns,
10,000 tweets,
90,000 emails,
8,000 new logins,
7,000 refunds,
6,000 bad addresses,
5 new marketing lists,
40 returned shipments,
30,000 credit apps,
2 million transactions, and
1 million duplicate names.

On the twelfth day of shopping my data gave to me,
12 fraud alerts,
11 new campaigns,
10,000 tweets,
90,000 emails,
8,000 new logins,
7,000 refunds,
6,000 bad addresses,
5 new marketing lists,
40 returned shipments,
30,000 credit apps,
2 million transactions, and
1 million duplicate names.

While we joke about the enormity of it all, if you are a retailer stumbling under the weight of all this data, there is hope. Over the next few weeks, we’ll dive a bit deeper into these figures to show how you can get control of the incoming data and, most importantly, leverage it in a meaningful way.

Sources:
http://techcrunch.com/2013/11/29/black-friday-online-sales-up-7-percent-mobile-is-37-percent-of-all-traffic-and-21-5-percent-of-all-purchases/

http://www.pfsweb.com/blog/cyber-monday-2012-the-results/

http://www.foxnews.com/us/2013/11/29/retailers-usher-in-holiday-shopping-season-as-black-friday-morphs-into/

UK Regulatory Pressure to Contact Customers Increases

In recent weeks, UK government and financial services organisations have come under increasing political and regulatory pressure to make greater efforts to proactively notify policy holders and account owners of their rights and savings information. To avoid the threat of regulatory fines, organisations have quickly moved data quality initiatives to the top of the list, but in reality the benefits of data suppression and enhancement go far beyond avoiding fines: they make for stronger business models, more trustworthy brands and better customer service.

What’s New

A report in July by the House of Commons Public Accounts Committee quoted Treasury estimates that between 200,000 and 236,000 victims of the collapse of Equitable Life may miss out on compensation payments, because the Treasury may not be able to trace 17%-20% of policyholders in time. The committee urged the Treasury to take urgent action to track down as many former policyholders of the failed insurer as possible (many of whom are elderly) before the March 2014 deadline. Payments totalling £370 million are due to be made by that date.

More recently still, there has been discussion of the huge number of interest rate reductions made without savers being notified – banks and building societies last month announced a further 120 cuts to rates on savings accounts, some as large as 0.5%, on top of 750 made to existing easy access accounts this year. According to the Daily Telegraph, “around 17 million households are believed to have cash in an easy access account”. While savings providers are able to make cuts of up to 0.25% without notifying customers, a spokesman for the regulator, the Financial Conduct Authority (FCA), told The Telegraph that “it is keeping a close eye on the activity of banks as the blizzard of rate reductions continues.”

Case in Point

To avoid the risk of potentially massive future penalties, a variety of organisations have taken up the challenge of contacting large numbers of customers to provide the requisite communication. In fact, a financial services organisation which was recently advised by the FCA to make reasonable efforts to contact all its customers retained a helpIT client to run a suppression job which netted significant savings: of the initial mailing file of over seven million customers, half a million new addresses were supplied, half a million gone-aways were removed and over 200,000 deceased names were suppressed. In this instance, the actual and potential savings for the organisation were enormous and went well beyond the cost of non-compliance – to say nothing of the benefit to brand reputation in the eyes of new occupants and relatives of the deceased.

Easy Options

Fortunately, the right software makes it easy to compare customer data to an assortment of third party suppression files in different formats, keyed according to different standards. In fact, huge savings can be achieved by employing standard “gone away” and deceased suppression screening, as well as increasing the success rate in contacting old customers by finding their new addresses. While there used to be only a couple of broad coverage “gone away” files, these days there is a wealth of data available to mailers to enable them to reach their customers, going far beyond Royal Mail’s NCOA (National Change of Address) and Experian’s Absolute Movers lists. This “new address” data is in many cases pooled by financial services companies via reference agencies such as Equifax (in the reConnect file) and by property agencies via firms such as Wilmington Millennium (Smartlink). Similarly, deceased data is now much more comprehensive and more readily available than ever before.
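To make the mechanics concrete, here is a minimal Python sketch of a suppression screen. It is an assumption-laden illustration, not a description of any particular product: real suppression files are keyed in different ways (name, address, date of birth) and need fuzzy matching rather than the exact surname-plus-postcode key used here, and the CSV column names are invented.

```python
import csv

def suppression_key(surname: str, postcode: str) -> tuple:
    """Normalise the fields that both files are keyed on."""
    return (surname.strip().upper(), postcode.replace(" ", "").upper())

def load_suppression_keys(path: str) -> set:
    """Read a third-party "gone away"/deceased file into a set of keys."""
    with open(path, newline="") as f:
        return {suppression_key(row["surname"], row["postcode"])
                for row in csv.DictReader(f)}

def screen_mailing_file(mailing_path: str, suppression_path: str) -> list:
    """Return only the mailing records that do not appear on the suppression file."""
    suppressed = load_suppression_keys(suppression_path)
    with open(mailing_path, newline="") as f:
        return [row for row in csv.DictReader(f)
                if suppression_key(row["surname"], row["postcode"]) not in suppressed]
```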

New address, gone away and deceased data is also easy to access, either as a web-based service or downloaded onto the organisation’s own servers. Costs have come down with competition, so it’s certainly cheaper now to run gone away and deceased suppression than it is to print and mail letters to the “disappeared”.

Although coverage is never going to be 100%, data and software tools do exist to make it easy for organisations to take reasonable steps to fulfil their obligations cost-effectively, even for names that might be considered low value and that might ordinarily have been forgotten about.

Bottom Line

These numbers should give pause for thought to organisations of any type that are tempted to “spray and pray” or decide to keep silent about something their customers would really like to know about, regardless of regulation. What’s more, the value to the business, the customers and the brand goes far beyond the regulations with which they need to comply.

helpIT Feedback to Royal Mail PAF® Consultation

On 14 June 2013, Royal Mail launched a consultation on proposed changes to the Postcode Address File (PAF®) licensing scheme and invited contributions from anyone affected. Although the changes are said to “simplify…the licensing and pricing regime”, helpIT has concerns that they would negatively impact direct mailers. As a provider of data quality software to more than 100 organisations that would be affected by such changes, helpIT systems notified customers, collated their input and drafted a response on their behalf. The Consultation is now closed but you can read more about the PAF® licensing options here.

Below is a summary of the feedback submitted to Royal Mail, together with the kind of feedback received from our customers, which mirrors our own concerns.

Q.1: Do you agree with the principles underpinning PAF® Licence simplification?

We are a major provider of PAF address verification software for batch usage – our users are a mixture of service providers and end users who use PAF software embedded within our broader data cleansing solutions. Our response incorporates feedback from many of our users who replied directly to our notification of the consultation rather than via your portal.

We agree with the principles except for no. 6, “to ensure that current levels of income derived from PAF® licensing are maintained for Royal Mail”. In addition, although we support no. 8, “to seek swift deployment of a PAF® Public Sector Licence”, we feel that free usage should be extended to the private sector, or at least made available to all private sector organisations at a small flat fee of no more than is necessary to cover administration of the licence and to discourage users without a real need.

Q.2 Are there other principles that you believe should underpin PAF® licence simplification?

Royal Mail should follow the example of postal providers in other countries that have made their address files free for users, which (unsurprisingly) has proven to result in improved address quality and lower sortation and delivery costs through higher levels of automation. We believe that in the UK too, these reduced costs will far outweigh the income lost by eliminating or reducing PAF licensing fees.

Q.3 Do you agree that these are an accurate reflection of market needs?

The market needs an efficient and cost-effective mail system – this principle is not mentioned! Royal Mail’s approach should be to encourage use of direct mail and delivery of goods by mail. It should focus on reduction in handling costs to more effectively compete with other carriers, rather than increase prices in a vain effort to improve profitability.

Q.5 Is the emergence of ‘Licensee by Usage’ as a preferred model reasonable when assessed against the principles, market needs and evaluation criteria?

For the reasons stated above, this model does not fit the market needs, or the fundamental interests of Royal Mail and the UK economy. If a usage-based charging model is adopted for batch use of PAF, at the least we would not expect to see a transaction charge applied to a record whose address and postcode are not changed as part of a batch process, as in our opinion this will deter usage of PAF for batch cleansing and directly lead to a lower return on investment for use of mail. Even if this refinement is accepted, it will increase work for solution and service providers, end users and Royal Mail in recording changed addresses/postcodes and auditing. We have a large, established user base for which PAF, particularly for batch address verification, is essential to maintaining data quality standards. Any increase in charges to our user base will result in decreased usage; the more significant the increase, the higher the dropout rate amongst our current users and the lower the take-up from new users.

Typical feedback from an end user is as follows:

We currently use a Marketing Data Warehouse which is fed from transactional databases for Web, Call Centre and Shop transactions. The addresses captured in these different systems are of variable quality, and include historical data from other systems since replaced. Much of it is unmailable without PAF enhancement, but we are unable to load enhanced/corrected address data back to the transactional systems for operational reasons. This Marketing Data Warehouse is used to mail around 6 million pieces a year via Royal Mail, in individual mailings of up to 600,000, as well as smaller mailings. The quality of the data is crucial to us in making both mailings and customer purchases deliverable. Our Marketing Data Warehouse is rebuilt each weekend from the transactional systems, and as part of this build we PAF process all records each weekend and load the corrected data into the database alongside the original data. It’s not an ideal solution, but it is a pragmatic response to the restrictions of our environment, and it enables us to mail good quality addresses and to remove duplicate records (over 100,000). If we simply count the number of addresses processed per week, at 1p per unit this would be completely unaffordable. Should this happen, we would have to re-engineer our operations to remove redundant processing. Also, when a new PAF file was available we would still have to process the whole file (currently around 2.6 million records), at a cost of £26,000 assuming the minimum cost of 1p per record. This is again unaffordable. It is not in Royal Mail’s interests to price users out of PAF processing in this way. We therefore urge Royal Mail to reconsider their proposals to ensure our costs do not rise significantly.

Typical feedback from a service provider is as follows:

95% of our PAF usage is to achieve maximum postage discount for our clients – we either enhance an address or add a DPS suffix to it. Therefore, the primary purpose of PAF is to assist with the automation of the postal process. Reading through the consultation document, there is very little discussion surrounding PAF and the postal system; all the worked examples are for call centres. In paragraph 10 of the consultation document, Royal Mail acknowledges the wider use of PAF in areas such as database marketing, e-commerce and fraud management. However, these areas provide no additional benefit to Royal Mail. On the traditional mail side, Royal Mail directly benefits from the automation of the postal system through the use of PAF validated addresses. If Royal Mail wish to promote mail and strive for full automation in the postal system, then they should be encouraging the use of PAF validation by mail customers.

There is also a potential conflict of interest for Royal Mail: the more changes they make to PAF, the more revenue they could generate from address updates. It would be worthwhile having some limits on the number of addresses that can be changed in a year, or at least some independent checking of the necessity of the address changes. I believe there is a conflict of interest with Royal Mail being both the provider and an end user of PAF (through its mailing system); it would be better to have the administration and selling of PAF handled by an independent organisation.

Q.6 Do you believe that a different model would better meet the principles that underpin licence simplification?

Yes, a flat rate payment model.

Q.9 Are there any further simplification or changes that might be required?

Due to the short notice given for the consultation, which falls during a holiday period, and the lack of notice provided proactively to us as a solutions provider, we can’t currently comment on this except to say that further changes will probably be required.

Q.10 Are the ways you use PAF® covered by the proposed terms?

Same answer as Q9.

Q.13 Do you think Transactional pricing is an appropriate way to price PAF®?

As explained above and made crystal clear in the typical responses from two of our users, transactional pricing is NOT an appropriate way to price PAF for batch usage. It will simply lead to a large exodus by batch users of PAF and a significant reduction in the use of direct mail and delivery by mail.

Q.14 Do you think ‘by Transaction’ is an appropriate way of measuring usage?

There are significant systems and auditing problems associated with measuring usage by transaction.

Q.15 Does your organisation have the capability to measure ‘Usage by Transaction’?

Our software does not measure volume of usage and it will not be possible to do this in a foolproof way. It will also lead to significant challenges for audit.

Q.16 Are there situations or Types of Use that you don’t think suit transactional measurement?

Batch database and mailing list cleansing.

 

What I Learned About Data Quality From Vacation

Over the 12 hours it took us to get from NY to the beaches of North Carolina, I had plenty of time to contemplate how our vacation was going to go. I mentally planned our week out and tried to anticipate the best ways for us to ‘relax’ as a family. What relaxes me is not having to clean up. So to facilitate this, I set about implementing a few ‘business rules’ so that we could manage our mess in real time, which, I knew deep down, would be better for everyone. The irony of this, as it relates to my role as the Director of Marketing for a Data Quality company, did not escape me, but I didn’t realize there was fodder for a blog post in here until I saw that business rules actually can work. Really and truly. This is how.

1. We Never Got Too Comfortable.

We were staying in someone else’s house and it wasn’t our stuff. It dawned on me that we take much more liberty with our own things than we do with someone else’s, and I believe this applies to data as well. Some departments feel like they are the ‘owners’ of specific data. I know from direct experience that marketing, in many cases, takes responsibility for customer contact data, and as a result we often take liberties, knowing ‘we’ll remember what we changed’ or ‘we can always deal with it later’. The reality is, there are lots of other people who use and interact with that data, and each business user would benefit from following a “Treat It Like It’s Someone Else’s” approach.

2. Remember the Buck Stops With You.

In our rental, there was no daily cleaning lady and we didn’t have the freedom of leaving it messy when we left (in a mere 7 days). So essentially, the buck stopped with us. Imagine how much cleaner your organization’s data would be if each person who touched it took responsibility for leaving it in good condition. Business rules that communicate to each user that they will be held accountable for the integrity of each data element, along with clarity on what level of maintenance is expected, can help develop this sense of responsibility.

3. Maintain a Healthy Sense of Urgency.

On vacation, we had limited time before we’d have to atone for any messy indiscretions. None of us wanted to face a huge mess at the end of the week so it made us more diligent about dealing with it on the fly. To ‘assist’ the kids with this, we literally did room checks and constantly reminded each other that we had only a few days left – if they didn’t do it now, they’d have to do it later. Likewise, if users are aware that regular data audits will be performed and that they will be the ones responsible for cleaning up the mess, the instinct to proactively manage data may be just a tad stronger.

So when it comes to vacation (and data quality), there is good reason not to put off important cleansing activities that can be made more manageable by simply doing them regularly in small batches.

The New Paradigm in Healthcare Data Quality

Nowhere is managing customer information more important than when making decisions on health care. While most markets are busy striving for a ‘single customer view’ to improve customer service KPIs or marketing campaign results, healthcare organizations must focus on establishing a ‘single patient view’, making sure a full patient history is attached to a single, correct contact. Unlike in traditional CRM solutions, healthcare data is inherently disparate and is managed by a wide variety of patient systems that, in addition to collecting and managing contact data, also track thousands of patient data points including electronic health records, insurance coverage, provider names, prescriptions and more. Needless to say, establishing the relationships between patients and their healthcare providers, insurers, brokers, pharmacies and the like, or even grouping families and couples together, is a significant challenge. The obstacles include issues with maiden/married last names, migration of individuals between family units and insurance plans, keying errors at point of entry and even deliberate attempts by consumers to defraud the healthcare system.

In many cases, the single patient view can be handled through unique identifiers, such as those for group health plans or for individuals within their provider network. This was an accepted practice at a Kaiser Permanente location I visited recently, where a gentleman went to the counter and reeled off his nine-digit patient number before saying “hello”. But while patient ID numbers are standard identifiers, they differ between suppliers, and patients can’t be relied on to use them as their first method of identification. This is where accuracy and access to other collected data points (i.e. SSN, DOB and current address) become critical.
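As a rough illustration of that fallback, the sketch below cascades through match rules when patient IDs don’t line up. The field names and the rule order are assumptions made for illustration; a real implementation would add fuzzy and phonetic comparison and a manual review queue for borderline cases.

```python
# Illustrative only: a cascade of match rules for linking patient records when
# IDs differ between systems. Field names and rule order are assumptions.
def patients_match(a: dict, b: dict) -> str:
    """Return the strongest rule that links two patient records, or 'no match'."""
    if a.get("patient_id") and a.get("patient_id") == b.get("patient_id"):
        return "exact ID match"
    if a.get("ssn") and a.get("ssn") == b.get("ssn"):
        return "SSN match"
    same_person = (a.get("dob") == b.get("dob")
                   and a.get("last_name", "").upper() == b.get("last_name", "").upper())
    same_address = (a.get("zip") == b.get("zip")
                    and a.get("address", "").upper() == b.get("address", "").upper())
    if same_person and same_address:
        return "DOB + name + address match"
    return "no match"
```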

While healthcare organizations have done a decent job so far of attempting to establish and utilize this ‘single patient view’, the healthcare data quality paradigm is shifting once again. For example, the Patient Protection and Affordable Care Act (PPACA) means that healthcare organizations will now have to deal with more data from more sources, and face tougher regulations on how to manage and maintain that data. The ObamaCare Health Insurance Exchange Pool means that more Americans can potentially benefit from health insurance coverage, increasing the number with coverage by around 30 million. Through these new initiatives, consumers will also have greater choice of both coverage and services – all further distributing the data that desperately needs to be linked.

With such inherent change – how do you effectively service patients at the point-of-care? And, do you want your trained medics and patient management team to be responsible for the data quality audit before such care can even begin?

So what are the new dynamics that healthcare companies need to plan for?

  • Addition of new patients into a system without prior medical coverage or records
  • Frequent movement of consumers between healthcare plans under the choice offered by the affordable care scheme
  • Increased mobility of individuals through healthcare systems as they use different vendors and services

This increased transactional activity means healthcare data managers must go beyond the existing efforts of linking internal data, start to look at how to share data across systems (both internal and external) and invest in technology that will facilitate this critical information exchange. Granted, this will be a significant challenge given that many organizations have several proprietary systems, contract requirements and privacy concerns, but oddly enough, it all begins with best practices in managing contact data effectively.

Over the last year, I’ve worked with an increasing number of customers on the issue of managing the introduction of new data into healthcare databases. Just like healthcare, data quality is both preventative and curative. Curative measures include triage on existing poor quality data and investigating the latent symptoms of unidentified relationships in the data. The preventative measures are to introduce a regimen of using DQ tools to capture new information accurately and efficiently at the point of entry, and to help identify existing customers quickly and accurately.

For healthcare customers, we’ve managed to do just this by implementing helpIT systems’ technology: matchIT SQL to deal with the back-end data matching, validation and merging, and findIT S2 to empower users to quickly and accurately identify existing patients or validate new patient details with the minimum of keystrokes. This complementary approach gives a huge return on investment, allowing clinical end-users to focus on the task at hand rather than repeatedly dealing with data issues.

Whenever there is movement in data or new sources of information, data quality issues will arise. But when it comes to healthcare data quality, I’m sure healthcare DBAs and other administrators are fully aware of the stakes at hand. Improving and streamlining data capture, plus tapping into the various technology connectors that will give physicians and service providers access to all patient data, will have a profound effect on patient care, healthcare costs, physician workloads and access to relevant treatment. Ultimately, this is the desired outcome.

I’m delighted to be engaged further on this subject so if you have more insight to share, please comment on this or drop me a line.


Data Quality Makes the Best Neighbor

So this week’s #dataqualityblunder is brought to you by the insurance industry and demonstrates that data quality issues can manifest themselves in a variety of ways and have unexpected impacts on the business entity.

Case in point – State Farm. Big company. Tons of agents. Working hard at a new, bold advertising campaign. It’s kind of common knowledge that they have regional agents (you see the billboards throughout the NY Tri-State area) and it’s common to get repeated promotional materials from your regional agent.

But, what happens when agents start competing for the same territory? That appears to be the situation for a recent set of mailings I received. On the same day, I got the same letter from two different agents in neighboring regions.

Same offer. Same address. So, who do I call? And how long will it take for me to get annoyed by getting two sets of the same marketing material? Although it may be obvious, there are a few impacts from this kind of blunder:

  • First of all – wasted dollars. Not sure who foots the bills here – State Farm or the agents themselves, but either way, someone is spending more money than they need to.
  • Brand equity suffers. When one local agent promotes themselves to me, I get a warm fuzzy feeling that he is somehow reaching out to his ‘neighbor’. He lives in this community and will understand my concerns and needs. This is his livelihood and it matters to him. But when I get the exact same mailing from two agents in different offices, I realize there is a machine behind this initiative. The warm feelings are gone, and the brand State Farm has worked so hard to develop loses its luster.
  • Painful inefficiency. I am just one person who got stuck on two mailing lists. How many more are there? And how much more successful would each agent be if they focused their time, money and energy on a unique territory instead of overlapping ones?

There are lots of lessons in this one and a variety of possible reasons for this kind of blunder. A quick call to one of the agents revealed that most of the lists come from the parent organization, though some agents do supplement with additional lists – but they assured me this kind of overlap was not expected or planned. That means there is a step (or tool) missing from the process. It could require a change in business rules for agent marketing. It’s possible they have the rules in place but that they require greater enforcement. It could just be a matter of implementing the right deduplication tools across their multiple data sources. There are plenty of ways to insure against this kind of #dataqualityblunder once the issue is highlighted and data quality becomes a priority.

Data Quality and Gender Bending

We have all heard the story about the man who was sent a mailing for an expectant mother. Obviously this exposed the organization sending it to a good deal of ridicule, but there are plenty of more subtle examples of incorrect targeting based on getting the gender wrong. Today I was amused to get another in a series of emails from gocompare.com addressed to stevetoo[email protected] The subject was “Eileen, will the ECJ gender ruling affect your insurance premiums?” 🙂 The email went on to explain that from December, insurers in the EU will no longer be able to use a person’s gender to calculate a car insurance quote, “which may be good news for men, but what about women…” They obviously think that my first name is Eileen and therefore I must be female.
Now, I know that my mother had plans to call me Stephanie, but I think that was only because she already had two sons and figured it was going to be third time lucky. Since I actually emerged noisily into the world, I have gotten completely used to Stephen or Steve and never had anyone get it wrong – unlike my last name, Tootill, which has (amongst other variations) been miskeyed as:

• Toothill
• Tootil
• Tootle
• Tootal
• Tutil
• Tooil
• Foothill
• Toohill
• Toosti
• Stoolchill

“Stephen” and “Steve” are obviously equivalent, but to suddenly become Eileen is a novel and entertaining experience. In fact, it’s happened more than once so it’s clear that the data here has never been scrubbed to remedy the situation.
Wouldn’t it be useful then if there was some software to scan email addresses to pick out the first and/or last names, or initial letters, so it would be clear that the salutation for [email protected] is not Eileen?
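A crude version of that check might look like the sketch below. The splitting rules are invented for illustration, the example address is a made-up placeholder, and a production email validation tool would obviously go much further.

```python
import re

def name_tokens(email: str) -> list:
    """Split an email local part into likely name tokens, e.g. 'steve.tootill' -> ['steve', 'tootill']."""
    local = email.split("@", 1)[0].lower()
    return [t for t in re.split(r"[._\-\d]+", local) if t]

def salutation_plausible(email: str, salutation: str) -> bool:
    """True if the salutation looks consistent with some token of the address."""
    s = salutation.strip().lower()
    return any(t == s or t.startswith(s) or s.startswith(t) for t in name_tokens(email))

# Placeholder address, not the real one:
print(salutation_plausible("steve.tootill@example.com", "Eileen"))  # False
print(salutation_plausible("steve.tootill@example.com", "Steve"))   # True
```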

Yes, helpIT systems does offer email validation software, but the real reason for highlighting this is that we just hate it when innovative marketing is compromised by bad data.  That’s why we’re starting a campaign to highlight data quality blunders, with a Twitter hash tag of #DATAQUALITYBLUNDER. Let’s raise the profile of Data Quality and raise a smile at the same time! If you have any examples that you’d like us to share, please comment on this post or send them to [email protected].

Note: As I explained in a previous blog (Phonetic Matching Matters!), the first four variations above are phonetic matches for the correct spelling, whereas the next four are fuzzy phonetic matches. “Toosti” and “Stoolchill” were one-offs and so off-the-wall that it would be a mistake to design a fuzzy matching algorithm to pick them up.
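For anyone who wants to experiment, here is a minimal sketch using the classic Soundex algorithm as a stand-in. Soundex is not the phonetic algorithm our software uses, so its groupings won’t exactly mirror the note above; the point is simply that a phonetic code maps many spellings of a name onto the same key, while the more off-the-wall variants fall outside it.

```python
def soundex(name: str) -> str:
    """Simplified Soundex: keep the first letter, encode the rest as digits,
    collapse adjacent duplicate codes, drop vowels/H/W/Y, pad to four characters."""
    mapping = {}
    for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in letters:
            mapping[ch] = digit
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    digits = [mapping.get(ch, "0") for ch in name]
    collapsed = [digits[0]]
    for d in digits[1:]:
        if d != collapsed[-1]:
            collapsed.append(d)
    tail = [d for d in collapsed[1:] if d != "0"]
    return (name[0] + "".join(tail) + "000")[:4]

target = soundex("Tootill")
print("Tootill ->", target)
for variant in ["Toothill", "Tootil", "Tootle", "Tootal", "Tutil",
                "Tooil", "Foothill", "Toohill", "Toosti", "Stoolchill"]:
    code = soundex(variant)
    print(f"{variant:<10} {code}  {'same code' if code == target else 'different code'}")
```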

When Data Quality Goes Wrong…

Whether you are a data steward or not, we’ve all experienced the unfortunate consequences of data quality gone terribly awry. Multiple catalogues to the same name and address. Purchasing a product through an online retailer only to find you have three different accounts with three different user names. Long, frustrating phone calls with customer service who can’t help you because they don’t have access to all the relevant info.

As the Director of Marketing for a data quality company, it brings me exceptional pain to see bad data quality in action. Such inefficiency is what gives marketing a bad reputation. It can ruin brands, destroy customer loyalty, waste opportunities and…let’s face it, it also kills trees.

Indeed, throughout our entire company, the water cooler occasionally buzzes with stories of bad data quality. So what’s a data quality company to do with all these DQ “blunders”? Call them out!

So this summer we’re going to dig through our box of examples and showcase a few #dataqualityblunders. We’ll try to be nice about it of course but the important part is that we’ll also highlight the ways that a good data quality strategy could have addressed these indiscretions. Because where there is bad data, there is also a clean data solution.

Have a #dataqualityblunder you’re just dying to spill?

We know that you’ve seen your fair share of data quality blunders. Send them in and win a $10 Starbucks gift card! Just email [email protected]!

 

Where Is Your Bad Data Coming From?

As Kimball documents in The Data Warehouse Lifecycle Toolkit (available in all good book stores), there are five concepts that, taken together, can be considered to define data quality:

Accuracy – The correctness of values contained in each field of each database record.

Completeness – Users must be aware of what data is the minimum required for a record to be considered complete and to contain enough information to be useful to the business.

Consistency – High-level or summarized information is in agreement with the lower-level detail.

Timeliness – Data must be up-to-date, and users should be made aware of any problems by use of a standard update schedule.

Uniqueness – One business or consumer must correspond to only one entity in your data. For example, Jim Smyth and James Smith at the same address should somehow be merged as these records represent the same consumer in reality.

So using Kimball’s list, we might know what kind of data we want in the database, but unfortunately, despite our best intentions, there are forces conspiring against good data quality. It doesn’t take a forensics degree to find them, but there are so many sources of poor data that you may not even know where to look. For that, we’ve come up with our own list. Let’s take a look…

1. Data Entry Mistakes.

The most obvious of the bad data sources, these take the form of mistakes that employees make when entering data into the system, e.g. simple typos, data entered into the wrong fields, or variations used for certain data elements. Even under ideal circumstances, these are easy mistakes to make and therefore extremely common, but unfortunately they can be the source of high numbers of duplicate records. But why is it so hard to get the data right? Consider these circumstances that can exacerbate your data entry process:

  • Poorly trained staff with no expectations for data entry
  • High employee turnover
  • Under-resourcing of call centres that leads to rushed customer exchanges
  • Forms that do not allow room for all the relevant info
  • Unenforced business rules because bad data is not tracked down to its source

2. Lazy Customers.

Let’s face it. Customers are a key source of bad data. Whether they are providing information over the phone to a representative or completing a transaction online, customers can deliberately and inadvertently provide inaccurate or incomplete data. But you know this already. Here are a few specific circumstances to look out for, especially in retail settings:

  • In-store business rules that permit staff to enter store addresses or phone numbers in place of the real customer info
  • Multiple ‘rewards cards’ per household or family that are not linked together
  • Use of store rewards cards that link purchases to different accounts
  • Customers who use multiple emails, nicknames or addresses without realizing it
  • Web forms that allow incorrectly formatted data elements such as phone numbers or zip codes
  • Customers pushed for time who then skip or cheat on certain data elements
  • Security concerns about web transactions that lead customers to leave out certain data or simply lie to protect their personal information

3. Bad Form

Web forms. CRMs. ERP systems. The way they are designed can impact data quality. How? Some CRM systems are inflexible and may not allow easy implementation of data rules, leading to required fields being left blank or containing incomplete data. Indeed, many web forms allow any kind of gibberish to be entered into any field, which can immediately contaminate the database. Not enough space for relevant info, or systems and forms that have not been updated to match the business process, also pose a challenge. Many systems also simply do not perform an address check at entry – allowing invalid addresses into the system. When it comes to data quality, good form is everything.
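As a sketch of what “good form” can mean in practice, the snippet below applies simple point-of-entry checks. The field rules (a US ZIP code, a loose phone pattern) are examples only, and real address validation would go further and verify the address against a postal reference file.

```python
import re

# Minimal point-of-entry checks for a hypothetical web form. The patterns are
# illustrative; real address validation would verify against a reference file.
RULES = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "zip": re.compile(r"^\d{5}(-\d{4})?$"),
    "phone": re.compile(r"^\+?[\d\s\-()]{7,15}$"),
}

def validate_form(form: dict) -> dict:
    """Return a dict of field -> error message for anything that fails its rule."""
    errors = {}
    for field, pattern in RULES.items():
        value = form.get(field, "").strip()
        if not value:
            errors[field] = "required field is blank"
        elif not pattern.fullmatch(value):
            errors[field] = "value does not match the expected format"
    return errors

print(validate_form({"email": "jim@example.com", "zip": "9021", "phone": "555-0100"}))
# -> {'zip': 'value does not match the expected format'}
```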

4. Customization Simply Reroutes Bad Data

All businesses have processes and data items unique to that business or industry sector. Unfortunately, when systems do not provide genuine flexibility and extensibility, IT will customize the system as necessary. For example, a CRM system may be adjusted to allow a full range of user-defined data (e.g. to allow a software company to store multiple licence details for each customer). Where this happens, the hacks and workarounds can lead to a lack of data integrity in the system – for example, you end up storing data in fields designed for other data types, such as dates in character fields.

5. Data Erosion is Beyond Your Control

Businesses and consumers change address. People get married and change their names. Business names change too, and contacts get promoted or replaced. Email addresses and phone numbers are constantly evolving. People die. No matter how sophisticated your systems are, some measure of data erosion is simply unavoidable. While good business rules will assist in updating data at relevant checkpoints, to maintain the best quality data it’s important to update the data from reliable data sources on a regular basis.

6. New Data. Bad Data. Duplicate Data.

Many businesses regularly source new prospect lists that are subsequently loaded into the CRM. These can come from a variety of places including list vendors, trade shows, publications, outbound marketing campaigns and even internal customer communications and surveys. Although it’s exciting to consider procuring a new, large database of prospects, there are two ways this addition of data can go horribly wrong. First, the data itself is always suspect, falling prey to all the potential issues of data entry, data erosion and customer error. But even if you can corroborate or cleanse the data before entering, there is still a chance you will be entering duplicate records that won’t always be quickly identified.

7. Overconfidence

OK. So this may not be a true ‘source’ of bad data, but it is the most important precipitating factor. You may think that by implementing business rules or by using a CRM’s built-in duplicate detection tools you are covered. In practice, business rules are important and valuable, but they are never foolproof and require constant enforcement, evaluation and updates. Moreover, built-in data quality features are typically fairly limited in scope, with the ability only to detect exact matches. They are simply not powerful enough to do the heavy lifting of a more sophisticated fuzzy and phonetic matching engine that will catch the subtle errors that can lead to major data quality issues. This false sense of confidence means you can easily overlook sources of poor data and neglect to perform critical data quality checks.
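To illustrate the gap, the snippet below uses Python’s standard-library difflib as a stand-in for a proper fuzzy matching engine: an exact comparison misses the pair entirely, while even a crude similarity score flags it for review. The records and the threshold are illustrative only.

```python
from difflib import SequenceMatcher

# Exact comparison versus a simple fuzzy score. difflib is only a stand-in for
# a real fuzzy/phonetic matching engine, and the 0.6 threshold is illustrative.
a = "James Smith, 12 High Street"
b = "Jim Smyth, 12 High St"

print("exact match:", a == b)  # False
score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
print("similarity:", round(score, 2), "-> review" if score > 0.6 else "-> ignore")
```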

So if you keep these seven bad data sources in mind – are you home free? Unfortunately not. These are simply the building blocks of bad data. When even just some of these conditions occur simultaneously, the risk of bad data multiplies exponentially. The only true way to achieve the five-pronged data quality ideal outlined by Kimball (accuracy, completeness, consistency, timeliness and uniqueness) is through a comprehensive data quality firewall that addresses each of these components individually.

Stay tuned for more information on Best Practices in data quality that pinpoint specific business rules and software solutions to achieve true real-time data quality.