Data Quality and Gender Bending

We have all heard the story about the man who was sent a mailing for an expectant mother. Obviously this exposed the organization sending it to a good deal of ridicule, but there are plenty of more subtle examples of incorrect targeting based on getting the gender wrong. Today I was amused to get another in a series of emails from gocompare.com addressed to [email protected]. The subject was “Eileen, will the ECJ gender ruling affect your insurance premiums?” 🙂 The email went on to explain that from December, insurers in the EU will no longer be able to use a person’s gender to calculate a car insurance quote, “which may be good news for men, but what about women…” They obviously think that my first name is Eileen and therefore that I must be female.
Now, I know that my mother had plans to call me Stephanie, but I think that was only because she already had two sons and figured it was going to be third time lucky. Since I actually emerged noisily into the world, I have gotten completely used to Stephen or Steve and never had anyone get it wrong – unlike my last name, Tootill, which has (amongst other variations) been miskeyed as:

• Toothill
• Tootil
• Tootle
• Tootal
• Tutil
• Tooil
• Foothill
• Toohill
• Toosti
• Stoolchill

“Stephen” and “Steve” are obviously equivalent, but to suddenly become Eileen is a novel and entertaining experience. In fact, it’s happened more than once, so it’s clear that the data here has never been scrubbed to remedy the situation.
Wouldn’t it be useful, then, if there were software to scan email addresses and pick out first and/or last names, or initial letters, so it would be clear that the salutation for [email protected] is not Eileen?
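Here’s a minimal sketch of the idea in Python. The parsing heuristics and the example address are purely illustrative assumptions, not how any production email validation software actually works:

```python
import re

def name_tokens(email: str) -> list:
    """Split the local part of an address on dots, underscores, and hyphens."""
    local = email.split("@")[0]
    return [t for t in re.split(r"[._\-]+", local.lower()) if t.isalpha()]

def salutation_plausible(email: str, first_name: str) -> bool:
    """Does the salutation name (or its initial) appear in the address?"""
    tokens = name_tokens(email)
    name = first_name.lower()
    return any(t == name or t == name[0] for t in tokens)

# Hypothetical address used for illustration only.
print(salutation_plausible("steve.tootill@example.com", "Eileen"))  # False
print(salutation_plausible("steve.tootill@example.com", "Steve"))   # True
```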

Yes, helpIT systems does offer email validation software, but the real reason for highlighting this is that we just hate it when innovative marketing is compromised by bad data. That’s why we’re starting a campaign to highlight data quality blunders, with the Twitter hashtag #DATAQUALITYBLUNDER. Let’s raise the profile of Data Quality and raise a smile at the same time! If you have any examples that you’d like us to share, please comment on this post or send them to [email protected].

Note: As I explained in a previous blog (Phonetic Matching Matters!), the first four variations above are phonetic matches for the correct spelling, whereas the next four are fuzzy phonetic matches. “Toosti” and “Stoolchill” were one-offs and so off-the-wall that it would be a mistake to design a fuzzy matching algorithm to pick them up.
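For a quick illustration of the difference, here is a toy Soundex implementation in Python. Soundex is just one classic phonetic algorithm (our own matching is more sophisticated), but it shows why the first variants match phonetically while a variant like “Foothill” needs fuzzy phonetic matching to be caught:

```python
def soundex(name: str) -> str:
    """Classic American Soundex: first letter plus up to three digit codes."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    out, prev = name[0].upper(), codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":  # h and w don't break a run of identical codes
            prev = code
    return (out + "000")[:4]

for variant in ["Toothill", "Tootil", "Tootle", "Tootal", "Foothill"]:
    print(variant, soundex(variant), soundex(variant) == soundex("Tootill"))
# The first four encode as T340, like "Tootill"; "Foothill" encodes as
# F340, so the different first letter defeats plain Soundex.
```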

Data Quality and the Spill Chucker

One of my favorite software tools is the spell checker, due to its entertainment value. It’s colloquially known as the “spill chucker” because if you mistype “spell checker” as “spill chucker”, the spell checker lets it pass: both “spill” and “chucker” are valid words, and the spell checker has no concept of context. I was reminded of this the other day, when I received a resume from someone who had two stints as an “Account Manger” and was then promoted to “Senior Account Manger” 🙂 It would be very useful if the spell checker dictionary were more easily customizable, because then most business users (and probably all job applicants) would no doubt remove “Manger” from the dictionary, since they either have no need to use the word or use it so infrequently that they’re happy for the spell checker to question it.
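As a minimal sketch of what a customizable dictionary buys you (the word lists here are made up for illustration):

```python
# Words the checker accepts. "manger" is a real word, which is exactly
# why the typo in "Account Manger" normally slips through.
DICTIONARY = {"senior", "account", "manager", "manger", "spell", "checker"}

# A business user who never writes about mangers removes the word,
# so the checker will question it instead of silently accepting it.
CUSTOM_REMOVALS = {"manger"}

def questionable(text: str, dictionary: set, removed: set) -> list:
    """Return the words the spell checker should question."""
    active = dictionary - removed
    return [w for w in text.lower().split() if w not in active]

print(questionable("Senior Account Manger", DICTIONARY, CUSTOM_REMOVALS))
# -> ['manger']: the promotion to Senior Account Manager is safe at last
```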

We have the same challenges with Data Quality – most data items are only correct if they are in the right context. For example, if you have a column in a table that contains last names, and then find a record with a company name in that column, the data is out of context and therefore poor quality. Another example, which I encountered nearly 20 years ago, was reported in a computer magazine – a major computer company addressed a letter to:

Mr David A Wilson
Unemployed At Moment
15 Lower Rd
Farnborough
Hants
GU14 7BQ

Someone had faithfully entered what Mr. Wilson had written into the job title field rather than entering it in a Notes field – maybe the database designer hadn’t allowed for notes.

Effective Data Quality tools must allow for poorly structured data – they must be able to recognize data that is in the wrong place and relocate it to the right place. You can’t match records or correct addresses effectively unless you can improve the structure of poorly structured data. Of course, the context can depend on the language – even British English and American English differ in this respect. I remember, when we at helpIT first Americanized our software over 10 years ago, coming across a test case where Earl Jones was given a salutation of “My Lord” rather than simply “Mr. Jones”! “Earl” is almost certainly a first name in the US but more likely to be a title in the UK. Often, it isn’t easy programming what we humans know instinctively. Salutations for letters derived from unstructured data can be a major source of discomfort and merriment: “MS Society” is an organization, not to be addressed as “Dear Ms Society”; the landlord at The Duke of Wellington pub shouldn’t receive a letter starting “My Lord”; and “Victoria and Albert Museum” is an organization, not “Mr & Mrs Museum”, even if it hasn’t been entered in the Organization column.
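To make the problem concrete, here is a toy set of context rules in Python. The keyword lists and rules are illustrative assumptions only; real salutation logic (ours included) relies on far richer reference data:

```python
# Illustrative word lists; a production system uses extensive reference data.
ORG_KEYWORDS = {"museum", "society", "ltd", "inc", "pub"}
UK_TITLES = {"earl", "duke", "lord"}  # titles in the UK, often first names in the US

def salutation(name: str, locale: str = "US") -> str:
    words = name.lower().split()
    # Organization clues mean no personal salutation should be derived.
    if set(words) & ORG_KEYWORDS or " and " in name.lower():
        return "Dear Sir or Madam"
    first, last = words[0], words[-1]
    # Context matters: "Earl" is a noble title in the UK, a first name in the US.
    if first in UK_TITLES and locale == "UK":
        return "My Lord"
    return f"Dear Mr. {last.title()}"  # a real system derives the title too

print(salutation("Earl Jones", "US"))            # Dear Mr. Jones
print(salutation("Earl Jones", "UK"))            # My Lord
print(salutation("Victoria and Albert Museum"))  # Dear Sir or Madam
print(salutation("MS Society"))                  # Dear Sir or Madam
```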

But going back to spell checkers, maybe they’re sometimes more intelligent than we give them credit for? Just the other day, mine changed what I was attempting to type, “project milestones”, to “project millstones”. I did wonder whether it knew more than I did, or maybe it was just feeling pretty negative that day…

Process-Centric Data Quality

When I read that, even today, contact data quality issues are still costing US businesses billions, and that industry-average duplicate rates run at around 1 in every 20 records for a typical database, I have to accept that there has got to be a better way to resolve the issue.

In my role, I’ve had the opportunity to speak with a lot of CIOs, DBAs, and business stakeholders about the challenges of trying to deal with data quality issues. The one question that consistently comes up is: “How do I enforce, throughout my company, policies and practices that ensure the data users enter into our database is clean and duplicate-free?”

The answer starts with establishing a simple universal truth: for any company, data quality starts at the point of capture. This is the moment a record is entered into your database. It doesn’t matter whether it is entered by someone in a call center, a salesperson, an account manager, billing, or support, or arrives as a web-generated lead or sale. This is the opportunity to get it right.

Between the CRM and ERP systems in any given company, nearly every employee looks up records, adds them, or modifies their details. Even the website connects to the database, handling new web leads and new customers’ e-commerce purchases, billing, and shipping details. Data providers such as JigSaw Data, InfoUSA and SalesGenie have made a lot of data readily available at little to no cost, and it is being sucked into company databases. While all of this data has enormous benefits for business and profits, it creates a lot of work for IT departments trying to keep it all clean and linked to existing accounts or records. For its part, the data quality industry has been diligent in coming up with new processes and methodologies – MDM, CDC, CDI, data stewardship, etc. – which have certainly helped many companies understand and make improvements to the data quality dilemma.

If you look at the data quality industry as a whole, little has changed over the years. Backend “batch” data quality processing is still the predominant way IT departments deal with correcting poor inputs and linking duplicate records. Yes, processing has moved from the mainframe to the workstation, and costs have certainly come down to the point where it is reasonable for every company to seriously consider acquiring these tools. But we are still using tools, and building thought processes, based on 1990s technology to deal with 21st-century data realities.

Admittedly, there will always be a place for backend maintenance, correction, and analysis. It’s fundamental. But in most cases, batch processing is performed offline – days, weeks, months, or, for some companies, years after the record was created. Speaking for helpIT systems, we have done a lot of design work around simplifying the processes behind complex data cleansing functions and extending them into robust, fully functional batch data cleansing tools. These tools are critical in effectively supporting the IT department in maintaining a single customer view across the enterprise.

But poor data quality is, in the first instance, a process problem, not a technology problem. Properly applied, however, technology can help each user and the organization eliminate, or at least mitigate, the human impact on data capture.

To quote Rob Karel, Principal Analyst with Forrester Research: “Investments in downstream batch data quality solutions help to mitigate data quality issues, but there is no replacement for capturing the right data in the first place.”

This is where our new data quality framework fits into a company’s data quality initiatives.

findIT S2 is a real-time data quality framework designed to be integrated into frontend applications, sitting between the data entry task and the database. findIT S2 reports suspect duplicates to the UI and calls a postal reference database to ensure that addresses are complete, accurate, and entered in fewer keystrokes. The rest of the data quality engine extrapolates further reference data, standardizes the data, and posts the information back to the underlying database.
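As a rough sketch of this point-of-capture pattern (the function names, matching rule, and schema below are illustrative assumptions, not findIT S2’s actual API):

```python
import sqlite3

def normalize(record: dict) -> dict:
    """Standardize fields before matching (trim whitespace, case-fold)."""
    return {k: v.strip().lower() for k, v in record.items()}

def suspect_duplicates(conn, record: dict) -> list:
    """Return existing rows that look like the record being keyed in.
    A real engine would use phonetic and fuzzy matching, not equality."""
    r = normalize(record)
    cur = conn.execute("SELECT name, postcode FROM contacts WHERE postcode = ?",
                       (r["postcode"],))
    return [row for row in cur.fetchall() if row[0] == r["name"]]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, postcode TEXT)")
conn.execute("INSERT INTO contacts VALUES ('steve tootill', 'gu14 7bq')")

# The hook runs while the user is still on the entry form, so the
# suspect match can be confirmed or rejected before the insert happens.
new_entry = {"name": " Steve Tootill ", "postcode": "GU14 7BQ"}
dupes = suspect_duplicates(conn, new_entry)
if dupes:
    print("Suspect duplicate(s) to show in the UI:", dupes)
else:
    clean = normalize(new_entry)
    conn.execute("INSERT INTO contacts VALUES (?, ?)",
                 (clean["name"], clean["postcode"]))
```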

Essentially, with findIT S2 we’ve empowered every user to be a deputized data steward and a functional part of the data quality process. Instead of the human element being the problem, the user is an integral part of the solution. Clean data is entered into the systems in real time, allowing business decisions, actions, and reporting to be based on more accurate data.

Additionally, the findIT S2 UI can be modified or customized, and can be linked to multiple internal or external data sources or web services. This allows findIT S2 to extend its core functionality to provide customized data enhancement and append, immediate fraud or red-flag warnings, or synchronous matching and updating between internal systems.

By applying a process-centric data quality approach, you directly reduce the amount of work needed downstream. It’s the old adage at work: an ounce of prevention is worth a pound of cure.