Weighing up the Cost of Bad Data

In a recent survey conducted by helpIT systems, almost 25 percent of respondents cited finances as the biggest hindrance to maintaining superior contact databases. We get it. Data quality solutions can carry what may seem like a hefty price tag, and they won't show up two days later in a nicely wrapped package like an Amazon Prime purchase. As such, like any other expensive and complicated decision, data quality may well get pushed to the bottom of the pile.

Then again, just like going to the gym or eating salad instead of steak, the toughest behaviors to adopt are usually the most beneficial. Even though database management may be something we'd rather forget about, 40 percent of those same respondents stated that their companies were losing tens of thousands of dollars each year due to poor contact data quality. So while the solution may not be cheap and easy, the cost of living without it does not appear to be either. The Data Warehousing Institute found that bad data costs US businesses more than $600 billion each year. Is that a number your company can afford to ignore?

Many businesses do notice these dollars disappearing and choose to do something about it. Unfortunately, however, this is often just a quick fix. They look at their messy databases, pay someone to clean them up, and then everyone gets a pat on the back for a job well done. And it is, until someone enters a new record in the CRM, a customer moves, or perhaps even dares to get a new phone number. And I will shock everyone by reporting that this happens all the time: studies indicate up to 2 percent degradation each month, even in a perfect database.

Right now you're probably picking up on the fact that maintaining good data is going to cost money. You're right. But avoiding that cost will only cost you more in the long run. Just like having a well-trained sales team, a finely targeted marketing plan, or a boss with years of experience, great results are an investment of time and resources rather than a happy accident.

Companies that choose to invest in good data quality, and to view it as an ongoing process rather than a one-time fix, are finding that the benefits far outweigh the initial costs. Advertising dollars reach their intended audiences, sales calls reach the right recipients, and customer satisfaction goes through the roof. Today's consumer expects the personal touches that can only come from an accurate and up-to-date Single Customer View, and it is good data quality solutions that deliver it.

Taking a Look at Data Quality Services in SQL Server 2012

As part of our ongoing research and development, I've been looking at the new Data Quality Services (DQS) functionality available in SQL Server 2012 and how it can be used with your data to complement your matchIT SQL experience. We previously published an overview before the 2012 release ("What to Expect from DQS in SQL Server 2012").

At the heart of DQS is the knowledge base: essentially the repository of data and rules that enables DQS to make decisions about the data it is presented with.

A very broad overview of how you might use DQS for data cleansing follows. When using the data cleansing aspect of the product, each of your columns is mapped to a user-defined domain (for example, you might set up a domain called Firstname). The knowledge base then allows you to build up a set of valid/invalid rules for the Firstname domain, which can be used to process your data. Once you have created some domains, you can also build composite domains; for example, a Fullname composite domain could be constructed from the domains Prefix, Firstname and Lastname. When running a DQS cleansing project with the defined knowledge base, you can then map in an addressee field and split out prefixes, first names and last names.
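To make the shape of the data concrete, here is a minimal T-SQL sketch of the kind of source and output tables involved. The table and column names (ContactStaging, ContactCleansed, Addressee, etc.) are purely illustrative and not part of DQS itself; the domain mapping and the export of cleansed results are done through the DQS cleansing project, not in T-SQL.

```sql
-- Illustrative source table: raw addressee strings that a DQS cleansing project
-- could map to a Fullname composite domain (Prefix + Firstname + Lastname).
CREATE TABLE dbo.ContactStaging (
    ContactID INT IDENTITY(1,1) PRIMARY KEY,
    Addressee NVARCHAR(100) NOT NULL
);

INSERT INTO dbo.ContactStaging (Addressee)
VALUES (N'Mr John Smith'),
       (N'Dr Jane Doe'),
       (N'ms emily o''brien');   -- inconsistent casing that domain rules could standardize

-- Illustrative shape of the cleansed output once the DQS project results are
-- exported back to SQL Server, with the composite domain split into its parts.
CREATE TABLE dbo.ContactCleansed (
    ContactID INT PRIMARY KEY,
    Prefix    NVARCHAR(20),
    Firstname NVARCHAR(50),
    Lastname  NVARCHAR(50)
);
```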

This works well (albeit slowly) if your data is already well structured and not high volume, but name and address data is frequently not well structured, and you may be working with millions of records.

Once the knowledge discovery process is complete, you can review the results and the decisions made, and tweak and amend them. A large drawback, however, is that the normalization rules are value-based rather than pattern-based. For example, in one case the value '#1' was parsed as a premise. I actually wanted the domain to translate '#1' to '1', so I made the necessary amendment to the rule (i.e. marked it as invalid and specified a value to correct it to). Further occurrences, such as '#5' and '#10', were also amended in the domain to translate them to their integer versions. After re-running the process, I found that other values in the same format that had not been manually amended were not corrected; the implication is that you'd have to create correction rules for every single numerical possibility. To try to get around this limitation, I attempted to apply a Term-Based Relation (effectively a find-and-replace option in DQS), but because my hash signs have no space before the following number (e.g. '#10'), the hash was not picked up as a separate term to be replaced. If anyone has found a way of achieving this, I'd be really interested to hear about it. What's really needed is support for regular expressions within the Term-Based Relation functionality, but this didn't seem to work when I tried it, and I can find no reference in the documentation to it being permissible.
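For what it's worth, the pattern-based correction I was after can be approximated by pre-processing the data in T-SQL before it reaches DQS. This is only a sketch, assuming the premise values sit in a single column (the table and column names are illustrative), but it covers every '#<number>' variant with one statement rather than one value rule per number:

```sql
-- Strip a leading '#' from premise values such as '#1', '#5' or '#10'
-- before the data is presented to DQS.
UPDATE dbo.ContactStaging
SET    Premise = STUFF(Premise, 1, 1, '')   -- remove the first character ('#')
WHERE  Premise LIKE N'#[0-9]%';             -- only values that start '#<digit>'
```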

I think that, as a data analyst or administrator, you could spend an awful lot of time tuning DQS to handle name and address cleansing and perhaps still not be satisfied with the results. However, if you used a dedicated contact data cleansing tool and combined it with DQS to clean other data (product names, order details, etc.), you would potentially have a very powerful combination of products.

Having looked at the initial data cleansing possible in DQS, I moved on to the built-in matching functionality it provides. I'll cover this in another post soon, but let's take a quick look at performance: DQS was much slower than I expected, even when processing a relatively small table of data. For example, running a simple 100,000 record file through DQS matching took 7 minutes on my machine, but only 30 seconds through our matchIT SQL product on the same machine. 100,000 records is fairly small for a production database; large organizations typically have millions of customers. DQS would handle batches of new records fairly quickly, but as it will only match within a table, it isn't possible to match a batch against existing data without adding it to the existing main table and incurring the longer run time. I've yet to run DQS on a very large database, and I suspect it will be extremely time consuming. However, DQS does have great potential for handling some customer-specific data patterns that packaged software would also need tuning for: the advantage is that the DQS user will be more self-sufficient when dealing with the customer-specific patterns, and can then use the packaged software with less tuning to its out-of-the-box configuration.
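To make the point about matching within a single table concrete, here is a minimal sketch of combining a new batch with the existing data before a DQS matching project is run. The table and column names are assumptions for illustration; the matching project itself is still configured and run from the Data Quality Client against the combined table.

```sql
-- DQS matching only operates within a single table or view, so a new batch
-- has to be appended to the main customer table before it can be matched
-- against the existing records, which incurs the full-table run time.
INSERT INTO dbo.Customers (Firstname, Lastname, Address1, Postcode)
SELECT Firstname, Lastname, Address1, Postcode
FROM   dbo.NewCustomerBatch;

-- The DQS matching project is then run against dbo.Customers as a whole.
```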

If you are interested in expanding DQS capabilities with an integrated Data Quality toolset, you can take a look at our recent matchIT SQL Demo.