The doctor won’t see you now – NHS data needs a health check!

On BBC Radio 4 the other day, I heard that people who have not been to see their local GP in the last 5 years could face being ‘struck-off’ from the register and denied access until they re-register – the story is also covered in most of the national press, including The Guardian. It’s an effort to save money on NHS England’s £9bn annual expenditure on GP practices, but is it the most cost-effective and patient-friendly approach for updating NHS records?

Under the contract, an NHS supplier (Capita) will write every year to all patients who have not been in to see their local doctor or practice nurse in the last five years. This is aimed at removing those who have moved away or died – every name on the register costs the NHS on average around £136 (as at 2013/14) in payments to the GP. After Capita receives the list of names from the GP practice, they’ll send out two letters, the first within ten working days and the next within six months. If they get no reply, the person will be removed from the list. Of course, as well as those who have moved away or died, this will end up removing healthy people who have not seen the GP and don’t respond to either letter. An investigation in 2013 by Pulse, the magazine for GPs, revealed that “over half of patients removed from practice lists in trials in some areas have been forced to re-register with their practice, with GP’s often blamed for the administrative error. PCTs (Primary Care Trusts) are scrambling to hit the Government’s target of removing 2.5 million patients from practice lists, often targeting the most vulnerable patients, including those with learning disabilities, the very elderly and children.” According to Pulse, the average proportion forced to re-register was 9.8%.

This problem of so-called ‘ghost patients’ falsely inflating GP patient lists, and therefore practice incomes, has been an issue for NHS primary care management since at least the 1990s, and probably long before that. What has almost certainly increased over the last twenty years is the number of temporary residents (e.g. from the rest of the EU) who are very difficult to track.

A spokesperson for the BMA on the radio was quite eloquent on why the NHS scheme was badly flawed, but had no effective answer when the interviewer asked what alternatives there were – that’s what I want to examine here, an analytical approach to a typical Data Quality challenge.

First, what do we know about the current systems? There is a single UK NHS number database, against which all GP practice database registers are automatically reconciled on a regular basis, so that transfers when people move and register with a new GP are well handled. Registered deaths, people imprisoned and those enlisting in the armed forces are also regularly reconciled. Extensive efforts are made to manage common issues such as naming conventions in different cultures, misspelling, etc. but it’s not clear how effective these are.

But if the GP databases are reconciled against the national NHS number database regularly, how is it that according to the Daily Mail “latest figures from the Health and Social Care Information Centre show there are 57.6 million patients registered with a GP in England compared to a population of 55.1 million”? There will be a small proportion of this excess due to inadequacies in matching algorithms or incorrect data being provided, but given that registering a death and registering at a new GP both require provision of the NHS number, any inadequacies here aren’t likely to cause many of the excess registrations. It seems likely that the two major causes are:

  • People who have moved out of the area and not yet registered with a new practice.
  • As mentioned above, temporary residents with NHS numbers that have left the country.

To Data Quality professionals, the obvious solution for the first cause is to use specialist list cleansing software and services to identify people who are known to have moved, using readily available data from Royal Mail, Equifax and other companies. This is how many commercial organisations keep their databases up to date and it is far more targeted than writing to every “ghost patient” at their registered address and relying on them to reply. New addresses can be provided for a large proportion of movers so their letters can be addressed accordingly – if they have moved within the local area, their address should be updated rather than the patient be removed. Using the same methods, Capita can also screen for deaths against third party deceased lists, which will probably pick up more deceased names than the NHS system – simple trials will establish what proportion of patients are tracked to a new address, have moved without the new address being known, or have died.
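
To make the screening step concrete, here is a purely illustrative sketch in SQL – the table names, columns and join keys are all hypothetical, and a real list-cleansing service would use fuzzy name and address matching rather than a simple equi-join:

```sql
-- Hypothetical illustration: flag register entries that appear on third-party
-- deceased or change-of-address files before writing to anyone.
SELECT r.PatientID,
       r.LastName,
       r.Postcode,
       CASE WHEN d.LastName IS NOT NULL THEN 'Reported deceased'
            WHEN m.NewPostcode IS NOT NULL THEN 'Moved - new address known'
            ELSE 'No change found - candidate for a mailing'
       END AS ScreeningResult
FROM dbo.PracticeRegister r
LEFT JOIN dbo.ThirdPartyDeceased d
       ON d.LastName = r.LastName
      AND d.DateOfBirth = r.DateOfBirth
      AND d.Postcode = r.Postcode
LEFT JOIN dbo.ThirdPartyMovers m
       ON m.LastName = r.LastName
      AND m.OldPostcode = r.Postcode;
```

Simple trials on the output of a query like this would show what proportion of each practice’s list falls into each category before any letters are sent.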

Next, Capita could target the other category, the potential temporary residents from abroad, by writing to adults whose NHS number was issued in the last (say) 10 years.

The remainder of the list can be further segmented, using the targeted approach that the NHS already uses for screening or immunisation requests: for example, elderly people may have gone to live with other family members or moved into a care home, and young people may be registered at university or be sharing accommodation with friends – letters and other communications can be tailored accordingly to solicit the best response.

What remains after sending targeted letters in each category above probably represents people in a demographic that should still be registered with the practice. Further trials would establish the best approach (in terms of cost and accuracy) for this group: maybe it is cost-effective to write to them and remove non-responders, but if this resulted in only removing a small number, some of these wrongly, maybe it is not worth mailing them.

The bottom line is that well-established Data Quality practices of automatic suppression and change of address, allied with smart targeting, can reduce the costs of the exercise and will make sure that the NHS doesn’t penalise healthy people simply for… being healthy!

New Year’s Resolutions

2016 New Year’s Resolutions – The Stats

The New Year has arrived, and with January 1st comes the obligatory New Year’s Resolutions.  Almost 50 percent of adults consistently make these resolutions, according to the Statistics Brain Research Institute.  That is surprising, considering that statistically we are more likely to fail than to succeed.  Yet human perseverance prevails each year as people vow to change their lives in every way, from getting into shape to falling in love.

Because we love data, we were especially interested in these numbers, as well as those found in the new study released by the University of Scranton stating that only 1 in 8 people achieve their New Year’s Resolutions.  It sounds much worse than it is.  Consider some other odds:

  • Odds of being audited by the IRS:  1 in 175
  • Odds of finding a pearl in an oyster:  1 in 12,000
  • Odds of dating a supermodel:  1 in 88,000
  • Odds of becoming a billionaire:  1 in 7,000,000
  • Odds of winning $1000 in the McDonald’s Monopoly game:  1 in 36,950,005 (that’s a lot of Big Macs!)

So all odds considered, keeping your New Year’s Resolutions seems very doable.  At least we think so.  And the study goes on to say that people who make resolutions are 10 times more likely to change their lives than those who don’t.  Which means that succeed or fail, trying is half the battle.

To help you kick off the New Year, helpIT systems made a few resolutions of our own to inspire our colleagues and ourselves.  We want to prioritize database quality management in 2016 for a more profitable and productive New Year.  Statistically, 1 in 8 of you will take our challenge.  Will you be that one?

New Year’s Resolution #1:  Getting Organized

Getting Organized is the second most popular New Year’s Resolution, right after shaping back up to pre-holiday jeans size.  The average office employee spends around 1.5 hours per day (over six hours a working week) looking for things!  According to the authors of Book of Odds, From Lightning Strikes to Love at First Sight, men at home are constantly looking for clean socks, the remote control, the wedding album, car keys (guilty!), and their driver’s license.  Women are always on the hunt for their favorite shoes, a child’s toy, their wallet, lipstick, and the remote control.

Let contact data records be the one thing you are not looking for this year.  If we spend that many hours looking for the remote control, imagine how many hours of productivity are lost each year by employees sifting through CRMs for contact data.  “Dirty data” not only sucks hours of productivity out of your day; it also affects the success of marketing efforts, the sales process, and the bottom line.

Getting organized in your contact database is the first step towards a #CleanData2016.  Since we know database management can be a daunting task, like so many New Year’s Resolutions, we at helpIT systems – with 25 years’ experience – are here to help.

This month helpIT systems is offering a FREE analysis of your company’s contact database by one of our data quality experts.  The analysis will review the effectiveness of current data quality initiatives, pinpoint weaknesses, and run a free data deduplication and matching test on your own data.

Don’t miss out on your chance to kick the New Year off right – click here to claim your free analysis.

New Year’s Resolution #2:  Saving Money

Who wouldn’t like to have a few extra dollars or pounds in their pocket this year?  The second most popular New Year’s Resolution is to save money.  People go about this all sorts of ways, from cutting back on designer handbag purchases to taking public transit instead of their car.  But here are a few ways you might be wasting money without even realizing it:

  1. Small fees that add up:  Credit card interest, paying for speedy shipping, and ATM fees all add up over time.  So while it might seem worth it to have that Amazon purchase overnighted or to hit the ATM at a concert rather than skip out on purchasing a Dave Matthews Band t-shirt, remember that a few years ago ATM fees totaled $7 billion.  To put that in perspective, the average ATM fee is $3.  Let’s say you use an ATM twice a month…that’s THREE Dave Matthews t-shirts from Amazon…maybe four if you don’t pay for the expedited shipping.
  2. Bad habits:  nearly half of Americans consume soft drinks daily, and their fast food consumption totaled $117 billion last year.  That’s almost $400 per person!  Throw in the cost of alcohol, cigarettes, and that daily $5 mocha cappuccino grande at your local Starbucks, and these habits are adding up quick.
  3. Good habits that you’re not actually doing:  While we had the best of intentions when joining that gym, signing up for Spanish lessons, or purchasing the Daily Deal for unlimited monthly meditation sessions, how often have you used it?  Studies show that gyms sell memberships expecting only 18 percent of members to use their facilities on the regular.  Take a look at what you are paying for and ask, “Am I really using this?”
  5. Gambling:  With the American Powerball topping out at a record $1.3 billion this week, spending some of your hard-earned cash on a lottery ticket might seem like a worthwhile investment.  The sight of all those zeros makes normally sane people forget they have a better chance of being struck by lightning, becoming President of the United States, or being attacked by a shark.  There is probably a better chance of all three of those happening at the same time before you hit those lucky numbers.
  5. Waste:  A mind-boggling 33 percent of the world’s food is thrown away each year.  The math works out to about $529 per person.  That’s a nice start towards a down payment on a car or a beach vacation.  Most households could also cut energy costs by a third if they followed recommended guidelines.

So as you are looking to save money this year, consider all the places your money is going, rather than just the obvious few.  Here in the data quality world, we see this happen all the time.  Companies know exactly how much money they are losing due to employee turnover or loss of market share.  However when it comes to how much they are losing due to poor data quality, most are in the dark.  Studies suggest that companies are losing billions of dollars each year from poor data quality.  Don’t hide behind 2015’s denials, whether it’s how much that cup of coffee is really costing you or the effects of dirty data on your organization.  It’s time for a #CleanData2016.

New Year’s Resolution #3:  Be Healthy

This is a big one.  Whether it is to get fit, join a gym, meditate, or eat better, many people focus their New Year’s resolutions on improving their health.  One healthy habit can unintentionally permeate into other behaviors, often changing many aspects of a person’s life for the better.

The tricky part is making these goals stick.  Studies have shown that on average, a person needs to maintain a behavior for 66 days before it becomes a habit.  Some behaviors are harder to change than others, meaning that the 66 day rule is just a guideline rather than an absolute.

Change is hard.  Anyone who has been on a diet or quit smoking knows this.  But the great part, the biggest relief, is that it gets easier.  In his book The One Thing, Gary Keller states that, “Success is actually a short race—a sprint fueled by discipline just long enough for habit to kick in and take over.”  Meaning that we don’t have to be this disciplined forever, we just have to do it long enough for it to become a habit.  Maybe that is 66 days.  Maybe it is 246 days.  But once it is a habit, the effort needed to keep eating veggies or meditating daily will decrease substantially.  After all, how much thought do you put into brushing your teeth in the morning or buckling your seatbelt?  Habits are sometimes done without us even realizing it!

Here are a few tips to help keep you building habits in 2016:

Track yourself.  Imagine a bowl of M&M’s was on your desk right now.  It is mid-afternoon, the sun is streaming in your window, and the to-do list does not seem to be getting any shorter as the minutes slowly tick towards 5:00.  Would you reward your hard efforts so far with one M&M?  Two?  Perhaps a handful.  After all, they’re small.

Now imagine that for every M&M you ate, you had to pull out a little journal and write “1 M&M – 25 Calories”.   Would you still eat a handful?  Probably not.  For whatever behavior you are trying to eliminate or add to your life, write it down.  Every minute, every calorie, every dollar spent.  Darren Hardy advocates for this method in his book The Compound Effect.  As the name implies, these little actions add up big over time.

Mix it up.  Everyone gets into a rut.  Dr. Frank Farley, a professor of psychological studies in education at Philadelphia’s Temple University, tells Wall Street Journal that making the same resolutions year after year can lead to boredom and failure as a result. Want to lose 20 pounds?  Try pledging to walk 3 miles every day instead.  Focusing on adding a healthy behavior rather than the end result can help you feel a sense of daily accomplishment.  Each day of completing your walking resolution will bring you closer to your underlying goal.

Let others help.  No one accomplishes anything alone.  The world’s most successful people had advisors, mentors, and colleagues in their corner that made their achievements possible.  God had Moses.  Barnum had Bailey.  Let others in on what you are trying to accomplish.  Even better, find someone who has the same goals as you so you can encourage each other.

New Year’s Resolution #4:  Stop Procrastinating

In the madness that ensues during the holidays, the calm of January often leaves many people confused.  Where did all this time come from?  And more importantly, what in the world do we do with it?  Several of you are already dreaming of a Star Wars movie marathon or the chance to conquer the next level of Angry Birds.

Yet many people are shrugging off those comfortable time-killers and resolving to make 2016 a productive year both personally and professionally.  This could mean finally training for that 10K run, spending more time with family and friends, or even chasing a passion like watercolors or writing a great novel.

Companies are putting a stop to the procrastination of 2015 and seeking a more effective data quality plan for the New Year.  While cleaning up millions of contact data records and stopping the influx of bad data can seem like a daunting task, it is one situation that will not improve by delaying the process.  For every year companies procrastinate, bad records pile up in CRMs, and the effects are staggering.  Departments ranging from marketing to customer service see money and time wasted due to poor data quality.

How do companies accomplish a task of this magnitude?  The same way you eat an elephant…one bite at a time.  Let us help by knocking out a few of those last year excuses:

  • I don’t have the time for a project like that.  Do you have 20 minutes?  Yes?  Twenty minutes will get you started on a free data quality analysis with one of our database experts.  Do you have 20 minutes tomorrow?  If you could spend 20 minutes of each working day on data quality, by the end of 2016 you would have put in around 86 hours.  That’s almost FOUR full 24-hour days!  A lot can be done in 86 hours.
  • I don’t know where to start.  Start with those 20 minutes on the phone with one of our data quality experts.  They will talk about your data, your company’s goals, and solutions tailored for you.  While many companies sell a one-size-fits-all solution, there is no “one-size” company.  Let our knowledgeable staff build a solution that is best for your company individually.
  • We tried that last year and the problem just came back.  Data quality is a habit to be maintained, not a one-time accomplishment.  Just as eating jelly doughnuts will undo last year’s workout goals, dirty data will creep up on you if the correct systems are not in place.  helpIT systems offers our clients complete data solutions with long-term results rather than a few quick fixes.
  • I don’t have the money.  Sure you do.  Except you are throwing it away in wasted marketing spend and lost productivity each year.  We work with hundreds of companies that originally thought “we don’t have the money” who have since discovered that not only do they have the money for a data quality solution, they have much more.  The profits realized from clean contact databases enabled them to accomplish many other projects that had been on the back burner as well.

Don’t delay.  This is our last week of offering FREE Data Quality Analysis.  Request yours here.

12 Days of Data Quality

The holidays are finally here.  They always seem so far away and then, as the days grow short and temperatures fall, they tend to jump out at us in a surprise attack like a kid in a spooky costume on Halloween.  And once they are here, if you blink, they are over.  The anticipated smells of gingerbread baking in the oven, the joy of seeing a loved one open a carefully selected present, the glow of thousands of twinkling Christmas lights… all over before we were able to slow down and truly appreciate the holiday season.

So before December disappears under a pile of wrapping paper, we are inviting you to take the time to be merry, revel in the holidays, and perhaps still get a bit of work done.

Welcome to helpIT systems’ 12 Days of Data Quality:

On the first Day of Data Quality, helpIT gave to me:

A Single Customer View (In a Pear Tree)

The first gift in this classic holiday carol is a Partridge in a Pear Tree.  The partridge sits alone high above the rest of the world.  Regally.  Eating pears (I imagine it eating pears) while looking down on all the lesser beings that have to see the world from ground level.

Your organization can be that partridge, sitting high above the rest.  Except in the world of database management, we are seeking a truly accurate Single Customer View, rather than a belly full of pears.  We all want the ability to look down on one contact record and obtain accurate, up-to-date information, each and every time.  Having one complete record for each customer ensures that they will receive the correct marketing materials at the correct address.  Salespeople will know a customer’s complete purchase history to analyze likely future purchases.  Customer service reps will be aware of address changes, name changes, as well as any other personal details in order to make the customer feel like they matter.  Which they do.  A lot.

Each customer in your contact database makes up a limb of your “pear tree”.  In the song, no matter how many gifts of drummers or pipers or ladies milking cows are given, it always comes back to the pear tree.  The tree is the center of everything, holding up even the partridge.  Just as your customers hold up your organization.  Make your customers feel this importance by respecting them as individuals, and as the base of your success, rather than lumping them in with the rest in your database.  The first step in doing this is by having a strong data quality solution and system in place.

On the second Day of Data Quality, helpIT gave to me:

2 Matched Records

On the second day of Christmas, my true love gave to me two turtle doves.  Which was great, in medieval times, when the doves symbolized true love’s endurance, mainly because they mated for life.  Everyone from the Bible to Shakespeare has made mention of them.

This December, give yourself another sort of true match.  Matching contact records in your database is the first step to cleaning up dirty data and obtaining a Single Customer View.  And there is no one-size-fits-all solution.

The important thing to consider when matching and deduping your database is the methodology used in the process.  Some software only matches exact records, so ‘John Smith’ and ‘John Smith’ would show up as a duplicate.  However, ‘John Smith’ and ‘Jon Smith’ would not.  So if you want a truly accurate database, you have to employ a more sophisticated method.
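
As a rough illustration (the table and column names here are invented), exact matching amounts to little more than grouping on identical values – which is exactly why ‘Jon Smith’ never pairs up with ‘John Smith’:

```sql
-- Exact-match duplicate detection: only records that agree character for
-- character are grouped, so 'Jon Smith' and 'John Smith' are never paired.
SELECT FirstName, LastName, Postcode, COUNT(*) AS Copies
FROM dbo.Contacts
GROUP BY FirstName, LastName, Postcode
HAVING COUNT(*) > 1;
```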

helpIT systems’ matching software compares all the data fields in one contact record against the rest of your database.  John Smith’s address, birthday, phone number, email, or whatever other data points you use are all taken into consideration when pinpointing matches.  This process often picks up 20-80 percent more matches than other software.  When you multiply that by 200 million records, that’s a lot of matches.

The biggest mistake organizations make when matching records is to view it as a “one and done” solution.  Data matching, like any long-term relationship, is something that must be constantly tweaked, adapted, and carried out on a regular basis.  Although as the turtle doves can attest, this type of devotion does come with big rewards.

On the third Day of Data Quality, helpIT gave to me:

3 Frenchmen

Rather than the French Hens in the traditional song, let us meet a Frenchman whose name is Dr. Mathieu Arment. He loves to purchase designer scarves from your company, Parisian Scarves. During his first online purchase, he entered his information as follows:

Matheiu Arment
27 Rue Pasteur
14390 Cabourg
FRANCE

The Parisian Scarves marketing department then sends him a catalogue for the Spring Collection. He flips through it while sitting at a local café sipping a latte and finds a handsome purple plaid scarf that he absolutely must have, but he has forgotten his laptop. So he calls in the order. The Parisian Scarves customer service rep does not see an account under the name she types in, Mattheiu Armond, so she creates a new account record and places his order.

Later, a second customer service rep is handling an issue with his order and decides to send a coupon as a gift for all of his trouble. The coupon is sent to Mathis Amiot. And the rep slightly misspelled his address on Rue Pasteur. Upon receiving the misguided coupon as well as two of the same catalogue addressed to slight variations of his name, Dr. Arment realizes that he is not just one Frenchman, but rather 3 separate Frenchmen in the eyes of Parisian Scarves. Feeling annoyed and undervalued that his favorite scarf company cannot even spell his name correctly, not to mention they also forgot his birthday, Dr. Arment takes his scarf shopping to another business who appreciates him as an individual.

Not all data matching software is created equal. While some software compares only exact matches, helpIT systems’ unique phonetic matching will pull out similar-sounding pieces of data as well as similar spellings. This creates a higher match rate, allowing fewer duplicates to slip through into your database.

In this instance, the Parisian Scarves customer service rep would have typed in Mattheiu Armond, only to have the record Matheiu Arment show up as a possible match. She would have noted the similar addresses and concurred, correctly, that these were the same customer.

Accurate data matching creates a data quality firewall, preventing bad data from entering the system at point-of-entry, as well as filtering it out on a regularly scheduled check-up. So Dr. Arment can stay one Frenchman, and more importantly, he will stay a customer of Parisian Scarves.

On the fourth Day of Data Quality, helpIT gave to me:

4 Calling Salespeople

Sales is a unique industry in which every minute can translate into profits, if that minute is spent efficiently and effectively. Salespeople are constantly seeking better ways of doing things in order to increase your company’s profits, as well as their commissions. Which means every minute wasted clicking through the CRM, either in search of leads or trying to obtain accurate client data, is valuable time lost. Every phone call they make is either costing you money or making you money. What decides whether a sales team is a drain or an asset? The quality of the leads they are contacting.

A CRM that has effective data quality measures in place is filled with accurate contact records. These records can be analyzed to obtain valuable information by all arms of your organization, especially the sales department.

A good salesperson can use CRM data to offer the right products to the right potential buyers, as well as dedicate more time to leads that are statistically more likely to turn into sales. They will be able to quickly obtain the correct point-of-contact and contact information without fishing through multiple records for the same lead. A salesperson will also look more knowledgeable as they are able to talk easily with a client about their business needs.

Give your salespeople the resources they need to be a profitable addition to your company by having an accurate, up-to-date CRM.

On the fifth Day of Data Quality, helpIT gave to me:

5 Golden Reasons to Trial matchIT SQL

helpIT’s ‘12 Days of Data Quality’ continues with 5 Golden Reasons to Trial matchIT SQL. Perhaps not the golden rings the lady received in the song, but really, who needs five golden rings? Sounds like a pickpocket’s dream come true. So instead, we here at helpIT are presenting you with five reasons to try our matching system.

We hope by now that you are starting to understand how important a strong data quality management system is to the success of your organization. It can increase profits and productivity in all arms of your organization. Yet sometimes it is hard to get the ball rolling, especially if you have a lot of chiefs who are part of this decision. So consider these five reasons why a helpIT systems trial is a good place to start:

1. Quick Installation. Be processing data in less than an hour!
2. Run data cleansing processes on your own data in your own environment (even address validation).
3. Customize the matching process and fine-tune your results with dedicated Trial Customer Support.
4. Run large volumes of data to see real performance results.
5. Get the real-world examples you need to justify your business case for SQL data.

This holiday season, try matchIT SQL for 30 days for absolutely nothing! We know you will love it, but if you don’t, we will give you 5 golden rings. Or maybe just one. Or a thank-you email. Yes, if you don’t love it, we will send you an email thanking you for your time. Happy trialing!

On the sixth Day of Data Quality, helpIT gave to me:

6 Companies a-Laying

Our countdown to Christmas and better data quality measures continues! In the song, his sweetheart received 6 geese laying eggs. Which might get some odd looks around the office. Instead, consider the importance of laying a strong foundation when beginning your quest for clean data.

All geese lay eggs. But the goose that laid the golden egg got a lot more attention than the rest. Like that golden-egg laying goose, the company that lays the strongest foundation in regards to data quality will garner the most attention and achieve the best results.

Most organizations think of clearing out dirty data as something to be dealt with when absolutely necessary. When in reality, database maintenance is a process that should be consistently tweaked, monitored, and exercised. Contact data is constantly entering your system. Contacts are frequently relocating, changing names, or passing away. Which means a good database administrator is diligent in tracking these changes.

Laying the foundation for strong data quality measures is often labeled too time consuming to be dealt with. But the time invested originally will pay off in piles of golden eggs in the future.

On the seventh Day of Data Quality, helpIT gave to me:

7 Sales a-Swimming

Or rather, floundering. Whether you want to admit it or not, the odds are you are floundering in bad data, working hard just to stay afloat amid the changes that occur in your contact database on a daily basis. Each sale relies on every member of your team being able to swim seamlessly through the CRM to obtain the information they need to make a client feel valued and understood.

Companies today report data analysis as one of the most effective tools for developing marketing campaigns and targeting sales leads. Many organizations use data analysis on a daily basis. However, if they are analyzing inaccurate or out-of-date data, the analysis is all but pointless. A database that does not have systems in place for catching bad data at point-of-entry, as well as a regular cleansing schedule, is a hindrance rather than a help in regards to data analysis.

This holiday season, give your data analysis the gift of a life raft. Make sure your team is swimming, rather than floundering, in the sea of contact data. Accurate data analysis will increase marketing effectiveness, reduce marketing spend, and increase productivity in all aspects of your business that work in the CRM.

On the eighth Day of Data Quality, helpIT gave to me:

8 Maids E-Mailing

While your business is probably not made up of maids, it does most likely contain many people that rely on email communications on a day-to-day basis. Email is an important means to reach prospects, current customers, and vendors. How these messages are delivered, as well as the content in them, is a strong reflection on the quality of your business model.

Do the emails look polished and professional? Or lazy and sloppy? Most organizations unintentionally accomplish the latter. A lack of data quality management systems has caused incorrect contact information to reside in their database. So Joe Smith gets an email addressed to Jo Smith. Or Jo Smith becomes a Mrs. instead of a Mr. And that’s all assuming that the email is even delivered.

Email deliverability is a key concern to many businesses, especially in regards to marketing. A great marketing campaign is irrelevant if the message is not received by the intended recipient. New email addresses are often mistyped. Another possibility is that a wrong address was given intentionally. Either way, the organization has lost a sales lead because the incorrect address is not reachable.

Email validation is a valuable and effective piece of the data quality puzzle. It will greatly increase the number of leads passed onwards to your sales team as well as ensure that marketing communications arrive to the person they were intended for. It is easy to implement, and the rewards far outweigh the costs.

On the ninth Day of Data Quality, helpIT gave to me:

9 Ladies Dating

One of the biggest challenges in your database can come from name changes. Sometimes it is from 9 ladies dating and then deciding to tie the knot. And while marriage is normally considered a wonderfully celebrated occasion, to the database administrator it means the possibility of error. Because it is almost a certainty that Ms. Smith is not calling her 17 magazine subscriptions from her honeymoon to let them know she married Mr. Clark and moved into his duplex in the Heights.

The new Mrs. Clark is a valued customer. So treat her as such by recognizing these changes as quickly as possible. Name changes and new addresses are easily dealt with when you have a proper data quality system in place. Stay tuned for tomorrow’s blog for some tips on keeping up with Mrs. Clark.

On the tenth Day of Data Quality, helpIT gave to me:

10 Lords a-Moving

The original Lords from the song might be a-leaping, but most of your customers are getting around via U-Haul trucks and airplanes. They are leaping across town, across the state, and sometimes across the world. In an average year, over 40 million people move. Keeping up with them can seem even harder than remembering the words to the 12 Days of Christmas.

Keep your contact database accurate and up-to-date with National Change of Address (NCOA). In one easy process your current contact address data is compared to USPS CASS and DPV certified data, correcting any typing errors and appending additional information.

On the eleventh Day of Data Quality, helpIT gave to me:

11 Pipers Piping

The Pied Piper was a character in German folklore who tried to sway a town to pay him to rid their village of rats. His pipe music would lure the rats out of hiding and they would follow him out of town. When the villagers refused to pay for this service, he piped away their children instead. Not the noblest use of his talents, but the ability to lead others is a powerful trait nonetheless.

Be the pied piper at your organization, only use your powers for good instead of evil. Make 2016 the year your organization makes data quality solutions a priority and others will be glad they followed you. Often the only thing holding a company back from reducing the costs of bad data is the knowledge and the leadership to move forwards. helpIT systems offers a full range of customer support solutions so that you and your company can feel confident about your next move.

On the twelfth Day of Data Quality, helpIT gave to me:

12 DBAs Drumming

More often than not, the squeaky wheel gets the grease. The loudest drummers in your office this season should be those making noise about the importance of data deduplication. Improper deduplication can account for upwards of 60 percent of the dirty data in your contact database.

Those incorrect contacts are receiving marketing materials (which cost money), taking up manpower to organize and sift through in the CRM (which costs time), and getting calls from your sales people (which cost money and time).

This month alone I have received mail for 4 different past residents of my current apartment. You know what I do with it? I throw it away. So Horace will never get that credit card offer. Zachary will not be donating to the Salesian Missions. And Monique will not be showing up in court for her fifth and final notice to appear. (Feeling a little guilty about that last one.)

We hope you enjoyed our unique spin on the traditional “12 Days of Christmas”. While the holidays are nearing an end, helpIT systems is here to answer your data quality questions 365 days a year. We hope to make 2016 your best data quality year ever. Give us a call or visit our website at www.helpit.com to find out more information.

Remembering the helpIT Legacy

“You’ve come a long way, Baby”: Remembering the world’s first stored program computer

Last Friday was the 65th anniversary of the first successful execution of the world’s first software program and it was great to see the occasion marked by a post and specially commissioned video on Google’s official blog, complete with an interview earlier this month with my father, Geoff Tootill. The Manchester Small-Scale Experimental Machine (SSEM), nicknamed Baby, was the world’s first stored-program computer i.e. the first computer that you could program for different tasks without rewiring or physical reconfiguration. The program was a routine to determine the highest proper factor of any number. Of course, because nobody had written one before, the word “program” wasn’t used to describe it and “software” was a term that nobody had coined. The SSEM was designed by the team of Frederic C. Williams, Tom Kilburn and Geoff Tootill, and ran its first program on 21st June 1948.

I have heard first-hand my father’s stories about being keen to work winter overtime: it was during post-war coal rationing, and the SSEM generated so much heat that it was much the cosiest place to be! Also, his habit of keeping one hand in his pocket when touching any of the equipment to prevent electric shocks. Before going to work on the Manchester machine, my father worked on wartime development and commissioning of radar, which he says was the most responsible job he ever had (at the age of just 21), despite his work at Manchester and (in the ’60s) as Head of Operations at the European Space Research Organisation. Although he is primarily an engineer, a hardware man, my father graduated in Mathematics from Cambridge University and had all the attributes to make an excellent programmer. I like to think that my interest in and aptitude for software stemmed from him in both nature and nurture – although aptitude for hardware and electronics didn’t seem to rub off on me. He was extremely interested in the software that I initially wrote for fuzzy matching of names and addresses, as it appealed to him both as a computer scientist and as a linguist. My father then went on to design the uniquely effective phonetic algorithm, soundIT, which powers much of the fuzzy matching in helpIT’s software today, as I have written about in my blog post on the development of our phonetic routine.

The Manchester computing pioneers have not had enough recognition previously, and I’m delighted that Google has paid tribute to my father and his colleagues for their contribution to the modern software era – and to be able to acknowledge my father’s place in the evolution of our company.


Click & Collect – How To Do It Successfully?

In the UK this Christmas, the most successful retailers have been those that sell online but allow collection by the shopper – in fact, these companies have represented a large proportion of the retailers that had a good festive season. One innovation has been the rise of online retailers paying convenience stores to take delivery and provide a convenient collection point for the shopper, but two of the country’s biggest retailers, John Lewis and Next, reckon that click and collect has been the key to their Christmas sales figures – and of course they both have high volume e-commerce sites as well as many bricks and mortar stores.

The article here by the Daily Telegraph explains why “click and collect” is proving so popular, especially in a holiday period. The opportunities for major retailers are  obvious, especially as they search for ways to respond to the Amazon threat – but how do they encourage their customers to shop online and also promote in store shopping? The key is successful data-driven marketing: know your customer, incentivize them to use loyalty programs and target them with relevant offers. However, this also presents a big challenge – the disparity and inconsistency in the data that the customer provides when they shop in these different places.

In store, they may not provide any information, or they may provide name and phone number, or they may use a credit card and/or their loyalty card. Online they’ll provide name, email address and (if the item is being delivered) credit card details and their address. If they are collecting in store, they may just provide name and email address and pay on collection – and hopefully they’ll enter their loyalty card number, if they have one. To complicate matters further, people typically have multiple phone numbers (home, office, mobile), multiple addresses (home and office, especially if they have items delivered to their office) and even multiple email addresses. This can be a nightmare for the marketing and IT departments in successfully matching this disparate customer data in order to establish a Single Customer View. To do this, they need software that can fulfill multiple sophisticated requirements, including:

  • Effective matching of customer records without being thrown off by data that is different or missing.
  • Sophisticated fuzzy matching to allow for keying mistakes and inconsistencies between data input by sales representatives in store and in call centers, and customers online.
  • The ability to recognize data that should be ignored – for example, the in-store purchase records where the rep keyed in the address of the store because the system demanded an address and they didn’t have time to ask for the customer’s address, or the customer didn’t want to provide it.
  • Address verification using postal address files to ensure that when the customer does request delivery, the delivery address is valid – and even when they don’t request delivery, to assist the matching process by standardizing the address.
  • The ability to match records (i) in real time, in store or on the website, (ii) off-line, record by record as orders are fed through for fulfillment, and (iii) as a batch process, typically overnight as data from branches is fed through. The important point to note here is that the retailer needs to be able to use the same matching engine in all three matching modes, to ensure that inconsistencies in matching results don’t compromise the effectiveness of the processing.
  • Effective grading of matches so that batch and off-line matching can be fully automated without missing lots of good matches or mismatching records. With effective grading of matching records, the business can choose to flag matches that aren’t good enough for automatic processing so they can be reviewed by users later.
  • Recognition of garbage data, particularly data collected from the web site, to avoid it entering the marketing database and compromising its effectiveness.
  • Often, multiple systems are used to handle the different types of purchase and fulfillment. The software must be able to connect to multiple databases storing customer data in different formats for the different systems.

With a wide range of data quality solutions on the market, it’s often difficult to find a company that can check all of these boxes. That’s where helpIT systems comes in. If you are a multi-channel retailer currently facing these challenges, contact helpIT systems for a Free Data Analysis and an in depth look at how you can achieve a Single Customer View.

Data Quality and Gender Bending

We have all heard the story about the man who was sent a mailing for an expectant mother. Obviously this exposed the organization sending it to a good deal of ridicule, but there are plenty of more subtle examples of incorrect targeting based on getting the gender wrong. Today I was amused to get another in a series of emails from gocompare.com addressed to [email protected] The subject was “Eileen, will the ECJ gender ruling affect your insurance premiums?” 🙂 The email went on to explain that from December, insurers in the EU will no longer be able to use a person’s gender to calculate a car insurance quote, “which may be good news for men, but what about women…” They obviously think that my first name is Eileen and therefore I must be female.
Now, I know that my mother had plans to call me Stephanie, but I think that was only because she already had two sons and figured it was going to be third time lucky. Since I actually emerged noisily into the world, I have gotten completely used to Stephen or Steve and never had anyone get it wrong – unlike my last name, Tootill, which has (amongst other variations) been miskeyed as:

• Toothill                    • Tootil
• Tootle                      • Tootal
• Tutil                         • Tooil
• Foothill                    • Toohill
• Toosti                       • Stoolchill

“Stephen” and “Steve” are obviously equivalent, but to suddenly become Eileen is a novel and entertaining experience. In fact, it’s happened more than once so it’s clear that the data here has never been scrubbed to remedy the situation.
Wouldn’t it be useful, then, if there were some software to scan email addresses and pick out first and/or last names, or initial letters, so it would be clear that the salutation for [email protected] is not Eileen?
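
As a back-of-the-envelope sketch of the idea (column names are invented, and real email parsing needs to handle initials, dots and nicknames), you could at least flag records where the stored first name never appears in the part of the address before the ‘@’:

```sql
-- Flag contacts whose first name does not appear anywhere in the local part
-- of their email address - candidates for a salutation check.
SELECT ContactID, FirstName, Email
FROM dbo.Contacts
WHERE CHARINDEX(FirstName,
                LEFT(Email, NULLIF(CHARINDEX('@', Email), 0) - 1)) = 0;
```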

Yes, helpIT systems does offer email validation software, but the real reason for highlighting this is that we just hate it when innovative marketing is compromised by bad data.  That’s why we’re starting a campaign to highlight data quality blunders, with a Twitter hash tag of #DATAQUALITYBLUNDER. Let’s raise the profile of Data Quality and raise a smile at the same time! If you have any examples that you’d like us to share, please comment on this post or send them to [email protected].

Note: As I explained in a previous blog (Phonetic Matching Matters!), the first four variations above are phonetic matches for the correct spelling, whereas the next four are fuzzy phonetic matches. “Toosti” and “Stoolchill” were one-offs and so off-the-wall that it would be a mistake to design a fuzzy matching algorithm to pick them up.

Keep your SQL Server data clean – efficiently!

Working with very large datasets (for example, when identifying duplicate records using matching software) can frequently throw up performance problems if you are running queries that return large volumes of data. However, there are some tips and tricks that you can use to ensure your SQL code works as efficiently as possible.

In this blog post, I’m going to focus on just a few of these – there are many other useful methods, so feel free to comment on this blog and suggest additional techniques that you have seen deliver benefit.

Physical Ordering of Data and Indices

Indices and the actual physical order of your data can be very important. Suppose, for example, that you are using matching software to run a one-off internal dedupe, looking to compare all records on several different match keys.  Let’s assume that one of those keys is zip or postal code and it’s significantly slower than the others.

If you put your data into the physical postal code/zip order, then your matching process may run significantly faster since the underlying disk I/O will be much more efficient as the disk head won’t be jumping around (assuming that you’re not using solid state drives).  If you are also verifying the address data using post office address files, then again having it pre-ordered by postal code/zip will be a big benefit.

So how would you put your data into postcode order ready for processing?

There are a couple of options:

  • Create a clustered index on the postcode/zip field – this will cause the data to be stored in postcode/zip order.
  • If the table is in use and already has a clustered index, then the first option won’t be possible. However, you may still see improved overall performance if you run a “select into” query pulling out the fields required for matching, and ordering the results by postal code/zip. Select this data into a working table and then use that for the matching process, having added any other additional non-clustered indices needed – see the sketch below.
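
A minimal sketch of the second option, with illustrative table and column names:

```sql
-- Pull only the fields needed for matching into a working table...
SELECT RecordID, FirstName, LastName, Address1, Postcode
INTO dbo.MatchingWork
FROM dbo.Customers;

-- ...then cluster it on postcode so the rows are physically stored in postcode order.
CREATE CLUSTERED INDEX IX_MatchingWork_Postcode
    ON dbo.MatchingWork (Postcode);

-- Add any further non-clustered indices the matching process needs, for example:
CREATE NONCLUSTERED INDEX IX_MatchingWork_Name
    ON dbo.MatchingWork (LastName, FirstName);
```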

Avoid  SELECT *

Only select the fields you need.  SELECT * is potentially very inefficient when  working with large databases (due to the large amount of memory needed). If you only need to select a couple of fields of data (where those fields are in a certain range), and those fields are indexed, then selecting only those fields allows the index to be scanned and the data returned.  If you use SELECT *, then the DBMS will join the index table with the main data table – which is going to be a lot slower with a large dataset.
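
For example (illustrative names again), prefer listing just the columns you need so a narrow index can satisfy the query:

```sql
-- Can be answered entirely from an index on (Postcode) INCLUDE (LastName)...
SELECT Postcode, LastName
FROM dbo.Customers
WHERE Postcode LIKE 'AB1%';

-- ...whereas SELECT * forces lookups back into the full-width table rows:
-- SELECT * FROM dbo.Customers WHERE Postcode LIKE 'AB1%';
```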

Clustered Index and Non-clustered Indices

Generally when working with large tables, you should ensure that your table has a clustered index on the primary key (a clustered index ensures that the data is ordered by the index – in this case the primary key).

For the best performance, clustered indices ought to be rebuilt at regular intervals to minimise disk fragmentation – especially if there are a lot of transactions occurring.  Note that non-clustered indices will also be rebuilt at the same time – so if you have numerous indices then it can be time consuming.
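
Rebuilding is a one-liner per table (names are illustrative; run it in a maintenance window):

```sql
-- Rebuild every index on a heavily-updated table to reduce fragmentation.
ALTER INDEX ALL ON dbo.Customers REBUILD;

-- Or target a single index:
ALTER INDEX IX_Customers_Postcode ON dbo.Customers REBUILD;
```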

Use Appropriate Non-clustered Indices to Boost Query Efficiency

Non-clustered indices can assist with the performance of your queries – by way of example, non-clustered indices may benefit the following types of query:

  •  Columns that contain a large number of distinct values, such as a combination of last name and first name. If there are very few unique values,  most queries will not use the index because a table scan is typically more efficient.
  • Queries that do not return large result sets.
  • Columns frequently involved in the search criteria of a query (the WHERE clause) that return exact matches.
  • Decision-support-system applications for which joins and grouping are frequently required. Create multiple non-clustered indexes on columns involved in join and grouping operations, and a clustered index on any foreign key columns.
  • Covering all columns from one table in a given query. This eliminates accessing the table or clustered index altogether.

In terms of the best priority for creating indices, I would recommend the following (an example follows the list):

1.) fields used in the WHERE condition

2.) fields used in table JOINS

3.) fields used in the ORDER BY clause

4.) fields used in the SELECT section of the query.
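
For instance, a query that filters on postcode, joins on CustomerID and returns name columns might be supported by indices like these (names are illustrative):

```sql
-- Supports the JOIN on CustomerID:
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON dbo.Orders (CustomerID);

-- Supports the WHERE clause, and 'covers' the selected name columns so the
-- base table need not be touched at all:
CREATE NONCLUSTERED INDEX IX_Customers_Postcode
    ON dbo.Customers (Postcode)
    INCLUDE (LastName, FirstName);
```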

Also make sure that you use the tools within SQL Server to view the query plan for expensive queries and use that information to help refine your indices to boost the efficiency of the query plan.

Avoid Using Views

Views on active databases will perform slower in general, so try to avoid views. Also bear in mind that if you create indices on the view, and the data in the base tables change in some way, then the indices on both the base table and view will need updating – which creates an obvious performance hit.  In general, views are useful in data warehouse type scenarios where the main usage of the data is simply reporting and querying, rather than a lot of database updates.

Make use of Stored Procedures in SQL Server

Stored procedure code is compiled and its execution plan cached, which should lead to performance benefits. That said, you need to be aware of parameter sniffing and design your stored procedures in such a way that SQL Server doesn’t cache an inefficient query execution plan.  There are various techniques that can be used (two of them are sketched below):

  • Optimising for specific parameter values
  • Recompiling on every execution
  • Copying parameters into local variables
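
Here is a brief sketch of the last two techniques (procedure and table names are illustrative):

```sql
-- 1. Ask for a fresh plan on every execution, so no single sniffed parameter
--    value dictates the cached plan:
CREATE PROCEDURE dbo.GetCustomersByPostcodeArea
    @Area varchar(8)
AS
BEGIN
    SELECT CustomerID, LastName, Postcode
    FROM dbo.Customers
    WHERE Postcode LIKE @Area + '%'
    OPTION (RECOMPILE);
END;
GO

-- 2. Copy the parameter into a local variable so the optimiser plans for an
--    "average" value instead of the first value it happens to see:
CREATE PROCEDURE dbo.GetCustomersByPostcodeArea_Local
    @Area varchar(8)
AS
BEGIN
    DECLARE @LocalArea varchar(8) = @Area;
    SELECT CustomerID, LastName, Postcode
    FROM dbo.Customers
    WHERE Postcode LIKE @LocalArea + '%';
END;
GO
```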

For those interested, there’s a more in-depth, but easy to follow description of these techniques covered on the following page of SQLServerCentral.com

http://www.sqlservercentral.com/blogs/practicalsqldba/2012/06/25/sql-server-parameter-sniffing/

Queries to compare two tables or data sources

When using matching software to identify matches between two different data sources, you may encounter scenarios where one of the tables is small relative to another, very large, table, or where both tables are of similar sizes. We have  found that some techniques for comparing across the two tables run fine where both tables are not too large (say under ten million records), but do not scale if one or both of the tables are much larger than that.  Our eventual solution gets a little too detailed to describe effectively here, but feel free to contact us for information about how we solved it in our matchIT SQL application.

And Finally

Finally I’d recommend ensuring that you keep an eye on the disks housing your SQL Server database files: ensure that there’s at least 30% storage space free and that the disks are not highly fragmented; regularly doing this produces better performance.

In summary, by making the effort to optimise the performance of your data cleansing operations, you will reduce the load on your database server, allow regular use of the applications needed to keep your data clean – and, as a result, keep your users happy.

Phonetic Matching Matters!

by Steve Tootill (Tootle, Toothill, Tutil, Tootil, Tootal)

In a recent blog entry, Any Advance on Soundex?, I promised to describe our phonetic algorithm, soundIT. To recap, here’s what we think a phonetic algorithm for contact data matching should do:

  • Produce phonetic codes that represent typical pronunciations
  • Focus on “proper names” and not consider other words
  • Be loose enough to allow for regional differences in pronunciation but not so loose as to equate names that sound completely different.

We don’t think it should also try and address errors that arise from keying or reading errors and inconsistencies, as that is best done by other algorithms focused on those types of issues.

To design our algorithm, I decided to keep it in the family: my father, Geoff Tootill, is a linguist, classics scholar and computer pioneer, who developed the logic design for the world’s first stored-program computer at Manchester University in 1948 – the first computer that stored programs in electronic memory.

The first program stored in electronic memory

Geoff was an obvious choice to grapple with the problem of how to design a program that understands pronunciation… We called the resultant algorithm “soundIT”.

So, how does it work?

soundIT derives phonetic codes that represent typical pronunciation of names. It takes account of vowel sounds and determines the stressed syllable in the name. This means that “Batten” and “Batton” sound the same according to soundIT, as the different letters fall in the unstressed syllable, whilst “Batton” and “Button” sound different, as it is the stressed syllable which differs. Clearly, “Batton” and “Button” are a fuzzy match, just not a phonetic match. My name is often misspelled as “Tootle”, “Toothill”, “Tutil”, “Tootil” and “Tootal”, all of which soundIT equates to the correct spelling of “Tootill” – probably why I’m so interested in fuzzy matching of names! Although “Toothill” could be pronounced as “tooth-ill” rather than “toot-hill”, most people treat the “h” as part of “hill” but don’t stress it, hence it sounds like “Tootill”. Another advantage of soundIT is that it can recognize silent consonants – thus it can equate “Shaw” and “Shore”, “Wight” and “White”, “Naughton” and “Norton”, “Porter” and “Porta”, “Moir” and “Moya” (which are all reasonably common last names in the UK and USA).
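
For comparison only – this is SQL Server’s built-in SOUNDEX, not soundIT – you can see how a simple Soundex code handles the same examples: it lumps “Button” in with “Batten” and “Batton”, yet fails to equate “Shaw” and “Shore”:

```sql
SELECT SOUNDEX('Batten') AS Batten,  -- B350
       SOUNDEX('Batton') AS Batton,  -- B350: equated, as it should be
       SOUNDEX('Button') AS Button,  -- B350: also equated, although it sounds different
       SOUNDEX('Shaw')   AS Shaw,    -- S000
       SOUNDEX('Shore')  AS Shore;   -- S600: the silent-consonant match is missed
```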

There are always going to be challenges with representing pronunciation of English names e.g. the city of “Reading” rhymes with “bedding” not “weeding”, to say nothing of the different pronunciations of “ough” represented in “A rough-coated dough-faced ploughboy strode coughing and hiccoughing thoughtfully through the streets of the borough”. Although there are no proper names in this sentence, the challenges of “ough” are represented in place names like “Broughton”, “Poughkeepsie” and “Loughborough”. Fortunately, these challenges only occur in limited numbers and we have found in practice that non-phonetic fuzzy matching techniques, together with matching on other data for a contact or company, allow for the occasional ambiguity in pronunciation of names and places. These exceptions don’t negate the need for a genuine phonetic algorithm in your data matching arsenal.

We implemented soundIT within our dedupe package (matchIT) fairly easily and then proceeded to feed through vast quantities of data to identify any weaknesses and improvements required. soundIT proved very successful in its initial market in the UK and then in the USA. There are algorithms that focus on other languages, such as Beider-Morse Phonetic Matching for Germanic and Slavic languages, but as helpIT systems’ market focus is on English and Pan-European data, we developed a generic form of soundIT for European languages. We also use a looser version of the algorithm for identifying candidate matches than we do for actually allocating similarity scores.

Of course, American English pronunciation of names can be subtly different – a point that was brought home to us when an American customer passed on the comment from one of his team “Does Shaw really sound like Shore?” As I was reading this in an email, and as I am a Brit, I was confused! I rang a friend in Texas who laughed and explained that I was reading it wrong – he read it back to me in a Texan accent and I must admit, they did sound different! But then he explained to me that if you are from Boston, Shaw and Shore do sound very similar, so he felt that we were quite right to flag them as a potential match.

No program is ever perfect, so we continue to develop and tweak soundIT to this day, but it has stood the test of time remarkably well – apart from Beider-Morse, I still don’t know of another algorithm that takes this truly phonetic approach, let alone one that does it as successfully as soundIT has.

Steve Tootill (stEv tWtyl)

Where Is Your Bad Data Coming From?

As Kimball documents in The Data Warehouse Lifecycle Toolkit (available in all good book stores), there are five concepts that, taken together, can be considered to define data quality:

Accuracy – The correctness of values contained in each field of each database record.

Completeness – Users must be aware of what data is the minimum required for a record to be considered complete and to contain enough information to be useful to the business.

Consistency – High-level or summarized information is in agreement with the lower-level detail.

Timeliness – Data must be up-to-date, and users should be made aware of any problems by use of a standard update schedule.

Uniqueness – One business or consumer must correspond to only one entity in your data. For example, Jim Smyth and James Smith at the same address should somehow be merged as these records represent the same consumer in reality.
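
To make a couple of these dimensions concrete, here is a hedged sketch of what completeness and uniqueness checks might look like in code. The field names, rules and records are invented for illustration, and the uniqueness key is far cruder than a real fuzzy matching engine:

```python
# Illustrative records only; field names and rules are invented, not Kimball's.
records = [
    {"first": "Jim",   "last": "Smyth", "address": "1 High St", "email": ""},
    {"first": "James", "last": "Smith", "address": "1 High St", "email": "js@example.com"},
]

REQUIRED = ("first", "last", "address")      # completeness: the minimum useful record

def is_complete(rec: dict) -> bool:
    return all(rec.get(field, "").strip() for field in REQUIRED)

def dedupe_key(rec: dict) -> tuple:
    # Uniqueness: a real engine would use fuzzy and phonetic matching here;
    # this crude key only catches some same-household variants.
    return (rec["last"][0].lower(), rec["address"].lower().replace(" ", ""))

incomplete = [r for r in records if not is_complete(r)]
seen, possible_dupes = {}, []
for rec in records:
    key = dedupe_key(rec)
    if key in seen:
        possible_dupes.append((seen[key], rec))   # Jim Smyth / James Smith, same address
    else:
        seen[key] = rec

print(len(incomplete), "incomplete record(s);", len(possible_dupes), "possible duplicate pair(s)")
```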

So using Kimball’s list, we might know what kind of data we want in the database, but unfortunately, despite our best intentions, there are forces conspiring against good data quality. It doesn’t take a forensics degree to track them down, but there are so many sources of poor data that you may not even know where to look. For that, we’ve come up with our own list. Let’s take a look…

1. Data Entry Mistakes.

The most obvious of the bad data sources, these are the mistakes employees make when entering data into the system, e.g. simple typos, entering data into the wrong fields, or using variations on certain data elements. Even under ideal circumstances, these are easy mistakes to make and therefore extremely common, and unfortunately they can be the source of high numbers of duplicate records. But why is it so hard to get the data right? Consider these circumstances that can exacerbate your data entry process:

  • Poorly trained staff with no expectations for data entry
  • High employee turnover
  • Under-resourced call centres, leading to rushed customer exchanges
  • Forms that do not allow room for all the relevant info
  • Unenforced business rules because bad data is not tracked down to its source

2. Lazy Customers.

Let’s face it. Customers are a key source of bad data. Whether they are providing information over the phone to a representative or completing a transaction online, customers can deliberately and inadvertently provide inaccurate or incomplete data. But you know this already. Here are a few specific circumstances to look out for, especially in retail settings:

  • In-store business rules that permit staff to enter store addresses or phone numbers in place of the real customer info
  • Multiple ‘rewards cards’ per household or family that are not linked together
  • Use of store rewards cards that link purchases to different accounts
  • Customers who use multiple emails, nicknames or address variants without realizing it
  • Web forms that allow incorrectly formatted data elements such as phone numbers or zip codes
  • Customers pushed for time who then skip or cheat on certain data elements
  • Security concerns of web transactions that lead customers to leave out certain data or simply lie to protect their personal information

3. Bad Form

Web forms. CRMs. ERP systems. The way they are designed can impact data quality. How? Some CRM systems are inflexible and may not allow easy implementation of data rules, leading to required fields being left blank or containing incomplete data. Indeed, many web forms allow any kind of gibberish to be entered into any field, which can immediately contaminate the database. Not enough space for relevant info, or systems and forms that have not been updated to match the business process, also pose a challenge. Many systems also simply do not perform an address check at entry – allowing invalid addresses to enter the system. When it comes to data quality, good form is everything.
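
As an illustration of what even basic entry-time validation can catch, here is a minimal sketch. The field names, patterns and example payload are invented, and a real deployment would also call an address verification service at this point, which isn’t shown:

```python
import re

# Hypothetical web-form rules; field names and patterns are illustrative only.
RULES = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "zip":   re.compile(r"^\d{5}(-\d{4})?$"),            # US ZIP or ZIP+4
    "phone": re.compile(r"^\+?[\d\s().-]{7,20}$"),
}
REQUIRED = ("first_name", "last_name", "email", "zip")

def validate(form: dict) -> list:
    """Return a list of problems instead of silently accepting gibberish."""
    errors = []
    for field in REQUIRED:
        if not form.get(field, "").strip():
            errors.append(f"{field} is required")
    for field, pattern in RULES.items():
        value = form.get(field, "").strip()
        if value and not pattern.match(value):
            errors.append(f"{field} looks invalid: {value!r}")
    return errors

# A payload a permissive form would happily store; here it is rejected with reasons.
print(validate({"first_name": "Ann", "last_name": "", "email": "ann@", "zip": "1234", "phone": "n/a"}))
```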

4. Customization Simply Reroutes Bad Data

All businesses have processes and data items unique to that business or industry sector. Unfortunately, when systems do not provide genuine flexibility and extensibility, IT will customize the system as necessary. For example, a CRM system may be adjusted to allow a full range of user-defined data (e.g. to allow a software company to store multiple licence details for each customer). Where this happens, the hacks and workarounds can lead to a lack of data integrity in the system – e.g. you end up storing data in fields designed for other data types, such as dates in character fields.
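
To show the kind of integrity loss this causes, here is a small illustration using a hypothetical user-defined character field holding licence expiry dates – once the values are free text, you can no longer rely on them parsing, sorting or comparing as dates:

```python
from datetime import datetime

# Hypothetical character field pressed into service for licence expiry dates.
licence_expiry_raw = ["2024-06-30", "30/06/2024", "June 30 2024", "TBC", ""]

def parse_expiry(value: str):
    """Try the formats users actually typed; return None when the text isn't a date at all."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%B %d %Y"):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            pass
    return None

parsed = [parse_expiry(v) for v in licence_expiry_raw]
print(sum(p is None for p in parsed), "of", len(parsed), "values are not usable dates")
```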

5. Data Erosion is Beyond Your Control

Businesses and consumers change address. People get married and change their name. Business names change too, and contacts get promoted or replaced. Email addresses and phone numbers are constantly evolving. People die. No matter how sophisticated your systems are, some measure of data erosion is simply unavoidable. While good business rules will assist in updating data at relevant checkpoints, to maintain the best quality data it’s important to refresh it from reliable data sources on a regular basis.

6. New Data. Bad Data. Duplicate Data.

Many businesses regularly source new prospect lists that are subsequently loaded into the CRM. These can come from a variety of places including list vendors, trade shows, publications, outbound marketing campaigns and even internal customer communications and surveys. Although it’s exciting to consider procuring a new, large database of prospects, there are two ways this addition of data can go horribly wrong. First, the data itself is always suspect, falling prey to all the potential issues of data entry, data erosion and customer error. But even if you can corroborate or cleanse the data before entering, there is still a chance you will be entering duplicate records that won’t always be quickly identified.

7. Overconfidence

OK. So this may not be a true ‘source’ of bad data, but it is the most important precipitating factor. You may think that by implementing business rules or by using a CRM’s built-in duplicate detection tools you are covered. In practice, business rules are important and valuable but are never foolproof, and they require constant enforcement, evaluation and updates. Moreover, built-in data quality features are typically limited in scope, often able to detect only exact matches. They are simply not powerful enough to do the heavy lifting of a more sophisticated fuzzy and phonetic matching engine that will catch the subtle errors that lead to major data quality issues. This false sense of confidence means you can easily overlook sources of poor data and neglect to perform critical data quality checks.

So if you keep these seven bad data sources in mind – are you home free? Unfortunately not. These are simply the building blocks of bad data. When even just some of these conditions occur simultaneously, the risk of bad data multiplies exponentially. The only true way to achieve the five-pronged data quality ideal outlined by Kimball (accuracy, completeness, consistency, timeliness and uniqueness) is through a comprehensive data quality firewall that addresses each of these components individually.

Stay tuned for more information on Best Practices in data quality that pinpoint specific business rules and software solutions to achieve true real-time data quality.

Data Quality and the Spill Chucker

One of my favorite software tools is the spell checker, due to its entertainment value. It’s colloquially known as the spill chucker because, if you mistype “spell checker” as “spill chucker”, the spell checker accepts both “spill” and “chucker” as valid words – it has no concept of context. I was reminded of this the other day, when I received a resume from someone who had two stints as an “Account Manger” and was then promoted to “Senior Account Manger” 🙂 It would be very useful if the spell checker dictionary were more easily customizable, because then most business users (and probably all job applicants) would no doubt remove “Manger” from the dictionary, as they have no need for the word, or use it so infrequently that they’re happy for the spell checker to question it.

We have the same challenges with Data Quality – most data items are only correct if they are in the right context. For example, if you have a column in a table that contains last names, and then find a record that contains a company name in the last name column, it is out of context and is poor quality data. Another example I encountered nearly 20 years ago was reported in a computer magazine – a major computer company addressed a letter to:

Mr David A Wilson
Unemployed At Moment
15 Lower Rd
Farnborough
Hants
GU14 7BQ

Someone had faithfully entered what Mr. Wilson had written in the job title field rather than enter it in a Notes field – maybe the database designer hadn’t allowed for notes.

Effective Data Quality tools must allow for poorly structured data – they must be able to recognize data that is in the wrong place and relocate it to the right place. You can’t match records or correct addresses effectively unless you can improve the structure of poorly structured data. Of course, the context can depend on the language – even British English and American English are different in this respect. I remember when we at helpIT first Americanized our software over 10 years ago, coming across a test case where Earl Jones was given a salutation of “My Lord” rather than simply “Mr. Jones”! Of course, “Earl” is almost certainly a first name in the US but more likely to be a title in the UK. Often, it isn’t easy programming what we humans know instinctively. Salutations for letters derived from unstructured data can be a major source of discomfort and merriment, e.g. “MS Society” is an organization, not to be addressed as “Dear Ms Society”. The landlord at The Duke of Wellington pub shouldn’t receive a letter starting “My Lord”. “Victoria and Albert Museum” is an organization, not “Mr & Mrs Museum”, even if it hasn’t been entered in the Organization column.
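
As a toy illustration of the first step – spotting data that is in the wrong place – the sketch below flags likely organization names filed in a person’s name field and falls back to a neutral salutation. The keyword list is tiny and invented; real tools rely on large reference tables and structural clues rather than a handful of words:

```python
# Toy heuristic only: real data quality tools use extensive reference data.
ORG_WORDS = {"society", "museum", "ltd", "llc", "inc", "pub", "club", "association"}

def looks_like_organization(name: str) -> bool:
    words = {w.strip(".,").lower() for w in name.split()}
    return bool(words & ORG_WORDS) or "&" in name or " and " in name.lower()

def salutation(first: str, last: str) -> str:
    full = f"{first} {last}".strip()
    if looks_like_organization(full):
        return "Dear Sir or Madam"        # safer than "Dear Ms Society"
    return f"Dear Mr/Ms {last}"

print(salutation("MS", "Society"))                  # Dear Sir or Madam
print(salutation("Victoria and Albert", "Museum"))  # Dear Sir or Madam
print(salutation("David", "Wilson"))                # Dear Mr/Ms Wilson
```
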

But going back to spell checkers, maybe they’re sometimes more intelligent than we give them credit for? Just the other day, mine changed what I was attempting to type: “project milestones” to “project millstones”. I did wonder whether it knew more than I did, or maybe it was just feeling pretty negative that day…