
Taking a Look at Data Quality Services in SQL Server 2012

As part of our ongoing research and development, I've been looking at the new Data Quality Services (DQS) functionality that's available in SQL Server 2012 and how it can be used with your data to complement your matchIT SQL experience. We previously published an overview before the release of 2012 ("What to Expect from DQS in SQL Server 2012").

At the heart of DQS is the knowledge base. It is essentially the memory bank that enables DQS to make decisions about the data it is presented with.

A very broad overview of how you might use DQS for data cleansing follows. Essentially, when using the data cleansing aspect of the product, each of your columns is mapped to a user-defined domain (for example, you might set up a domain called Firstname). The knowledge base functionality then allows you to build up a set of valid/invalid rules for the Firstname domain, which can be used to process your data. Once you have created some domains, you can also build composite domains, e.g. a Fullname composite domain constructed from the domains Prefix, Firstname and Lastname. Now, when running a DQS cleansing project with the defined knowledge base, you can map in an addressee field and split out prefixes, first names and last names.
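To make the domain idea concrete, here's a minimal Python sketch of what a Fullname composite domain conceptually does when it splits an addressee field. This is purely illustrative – DQS domains are built in the Data Quality Client, not in code, and the domain values below are invented:

```python
# Hypothetical domain values; in DQS these would live in the knowledge base.
PREFIXES = {"mr", "mrs", "ms", "dr"}
FIRST_NAMES = {"john", "jane", "steve"}

def split_fullname(addressee):
    """Split an addressee string into (prefix, firstname, lastname)."""
    tokens = addressee.split()
    prefix = tokens.pop(0) if tokens and tokens[0].lower().rstrip(".") in PREFIXES else ""
    first = tokens.pop(0) if tokens and tokens[0].lower() in FIRST_NAMES else ""
    last = " ".join(tokens)  # whatever remains is treated as the last name
    return prefix, first, last

print(split_fullname("Mr John Smith"))  # ('Mr', 'John', 'Smith')
```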

This works well (albeit slowly) if your data is already well structured and not large in volume, but name and address data is frequently not well structured, and you may be working with millions of records.

Once the knowledge discovery process is complete, you can review the results and the decisions made, and amend them as needed. A large drawback, however, is that the normalization rules are value-based rather than pattern-based. For example, in one case the value #1 was parsed as a premise. I actually wanted the domain to translate the '#1' to a '1', so I made the necessary amendment to the rule (i.e. marked it as invalid and specified a value to correct it to). I amended further occurrences, such as #5 and #10, in the domain to translate them to their integer versions. After re-running the process, I found that other values in the same format that I had not manually amended were not corrected; the implication is that you'd have to create a correction rule for every single numerical possibility.

To try and get round this limitation, I attempted to apply a Term-Based Relation (which is effectively a find-and-replace option in DQS); however, because the hash has no space before the following number (e.g. #10), the hash was not picked up as a separate term to be replaced. If anyone has found a way of achieving this, I'd be really interested to hear about it. What's really needed is support for regular expressions within the Term-Based Relation functionality, but this didn't work when I tried it and I can find no reference in the documentation to it being supported.
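For what it's worth, the pattern-based rule I was hoping for is trivial to express as a regular expression. Here's a minimal Python sketch (illustrative only – DQS exposes no such hook, and the function name is mine):

```python
import re

# One pattern-based rule covers every "#<number>" premise value, where
# DQS value-based corrections need a separate rule per literal value
# (#1, #5, #10, ...).
def normalize_premise(value):
    """Strip a leading hash from values like '#10', yielding '10'."""
    return re.sub(r"^#(\d+)$", r"\1", value.strip())

for v in ["#1", "#5", "#10", "22"]:
    print(v, "->", normalize_premise(v))  # '#1 -> 1', ..., '22 -> 22'
```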

I think that as a data analyst or administrator, you could spend an awful lot of time tuning DQS to handle name and address cleansing, and perhaps still not be satisfied with the results. But if, for example, you used a dedicated contact data cleansing tool and combined it with DQS to clean other data (product names, order details etc.), then you potentially have a very powerful combination of products.

Having looked at the initial data cleansing that is possible in DQS, I then moved on to the inbuilt matching functionality that DQS provides. I'll cover this in another post soon, but let's just take a quick look at performance: DQS processing was much slower than I expected, even on a relatively small table of data. For example, running a simple 100,000 record file through DQS matching took 7 minutes on my machine, but only 30 seconds through our matchIT SQL product on the same machine. 100,000 records is fairly small for a production database, given that large organizations typically have millions of customers. DQS would handle batches of new records fairly quickly, but as it will only match within a single table, you can't match new records against existing data without adding them to the main table and incurring the longer run time. I've yet to run DQS on a very large database, but I suspect it will be extremely time consuming.

However, DQS does have great potential for handling some customer-specific data patterns that packaged software would also need tuning for: the DQS user will be more self-sufficient when dealing with those patterns, and can then use the packaged software with less tuning of its out-of-the-box configuration.


If you are interested in expanding DQS capabilities with an integrated Data Quality toolset, you can take a look at our recent matchIT SQL Demo.

Phonetic Matching Matters!

by Steve Tootill (Tootle, Toothill, Tutil, Tootil, Tootal)

In a recent blog entry, Any Advance on Soundex?, I promised to describe our phonetic algorithm, soundIT. To recap, here’s what we think a phonetic algorithm for contact data matching should do:

  • Produce phonetic codes that represent typical pronunciations
  • Focus on “proper names” and not consider other words
  • Be loose enough to allow for regional differences in pronunciation but not so loose as to equate names that sound completely different

We don’t think it should also try and address errors that arise from keying or reading errors and inconsistencies, as that is best done by other algorithms focused on those types of issues.

To design our algorithm, I decided to keep it in the family: my father, Geoff Tootill, is a linguist, classics scholar and computer pioneer who played a leading role in the development of the Manchester Small-Scale Experimental Machine in 1947-48, now popularly known as the "Baby" – the first computer that stored programs in electronic memory.

[Image: The first program stored in electronic memory]

Geoff was an obvious choice to grapple with the problem of how to design a program that understands pronunciation… We called the resultant algorithm “soundIT”.

So, how does it work?

soundIT derives phonetic codes that represent typical pronunciation of names. It takes account of vowel sounds and determines the stressed syllable in the name. This means that “Batten” and “Batton” sound the same according to soundIT, as the different letters fall in the unstressed syllable, whilst “Batton” and “Button” sound different, as it is the stressed syllable which differs. Clearly, “Batton” and “Button” are a fuzzy match, just not a phonetic match. My name is often misspelled as “Tootle”, “Toothill”, “Tutil”, “Tootil” and “Tootal”, all of which soundIT equates to the correct spelling of “Tootill” – probably why I’m so interested in fuzzy matching of names! Although “Toothill” could be pronounced as “tooth-ill” rather than “toot-hill”, most people treat the “h” as part of “hill” but don’t stress it, hence it sounds like “Tootill”. Another advantage of soundIT is that it can recognize silent consonants – thus it can equate “Shaw” and “Shore”, “Wight” and “White”, “Naughton” and “Norton”, “Porter” and “Porta”, “Moir” and “Moya” (which are all reasonably common last names in the UK and USA).
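For readers who like to see ideas in code, here's a deliberately tiny Python toy – emphatically not soundIT itself, whose rules are far more extensive – that illustrates the two ideas above: rewriting spellings into rough pronunciations (including silent consonants) and keeping only the first, typically stressed, vowel group. The substitution rules are invented just to cover these examples:

```python
import re

# Toy pronunciation rewrites, applied in order. Uppercase symbols mark
# sounds already handled, so later lowercase rules leave them alone.
RULES = [
    (r"ough|augh", "O"),  # Naughton -> nOton
    (r"gh", ""),          # silent gh: Wight -> wit
    (r"^wh", "w"),        # White -> wite
    (r"sh", "S"),         # one symbol for the 'sh' sound
    (r"oo", "u"),         # Tootill -> tutill
    (r"aw", "O"),         # Shaw -> SO
    (r"ore$", "O"),       # Shore -> SO (non-rhotic)
    (r"or", "O"),         # Norton -> nOton
    (r"h", ""),           # unstressed h: Toothill -> tootill
    (r"e$", ""),          # silent final e
    (r"e?r$", ""),        # non-rhotic final (e)r: Porter -> pOt
]
VOWELS = set("aeiouyO")

def toy_code(name):
    s = name.lower()
    for pattern, repl in RULES:
        s = re.sub(pattern, repl, s)
    # Keep the first (typically stressed) vowel group; drop later,
    # unstressed vowels, so Batten and Batton converge on 'batn'.
    out, groups, prev_vowel = [], 0, False
    for ch in s:
        is_vowel = ch in VOWELS
        if is_vowel and not prev_vowel:
            groups += 1
        if not is_vowel or groups == 1:
            out.append(ch)
        prev_vowel = is_vowel
    return re.sub(r"(.)\1+", r"\1", "".join(out))  # collapse doubles

pairs = [("Batten", "Batton"), ("Shaw", "Shore"), ("Wight", "White"),
         ("Naughton", "Norton"), ("Porter", "Porta"), ("Tootill", "Toothill")]
for a, b in pairs:
    print(a, b, toy_code(a) == toy_code(b))      # all True
print(toy_code("Batton") == toy_code("Button"))  # False: stressed vowel differs
```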

There are always going to be challenges with representing pronunciation of English names e.g. the city of “Reading” rhymes with “bedding” not “weeding”, to say nothing of the different pronunciations of “ough” represented in “A rough-coated dough-faced ploughboy strode coughing and hiccoughing thoughtfully through the streets of the borough”. Although there are no proper names in this sentence, the challenges of “ough” are represented in place names like “Broughton”, “Poughkeepsie” and “Loughborough”. Fortunately, these challenges only occur in limited numbers and we have found in practice that non-phonetic fuzzy matching techniques, together with matching on other data for a contact or company, allow for the occasional ambiguity in pronunciation of names and places. These exceptions don’t negate the need for a genuine phonetic algorithm in your data matching arsenal.

We implemented soundIT within our dedupe package (matchIT) fairly easily, and then fed vast quantities of data through it to identify weaknesses and required improvements. soundIT proved very successful in its initial market in the UK and then in the USA. There are algorithms that focus on other languages, such as Beider-Morse Phonetic Matching for Germanic and Slavic languages, but as helpIT systems' market focus is on English and Pan-European data, we developed a generic form of soundIT for European languages. We also use a looser version of the algorithm for identifying candidate matches than we do for actually allocating similarity scores.

Of course, American English pronunciation of names can be subtly different – a point that was brought home to us when an American customer passed on the comment from one of his team “Does Shaw really sound like Shore?” As I was reading this in an email, and as I am a Brit, I was confused! I rang a friend in Texas who laughed and explained that I was reading it wrong – he read it back to me in a Texan accent and I must admit, they did sound different! But then he explained to me that if you are from Boston, Shaw and Shore do sound very similar, so he felt that we were quite right to flag them as a potential match.

No program is ever perfect, so we continue to develop and tweak soundIT to this day, but it has stood the test of time remarkably well – apart from Beider-Morse, I still don't know of another algorithm that takes this truly phonetic approach, let alone one that does so as successfully as soundIT.

Steve Tootill (stEv tWtyl)

Any Advance on Soundex?

A lot has been written about “phonetic algorithms” since Soundex was created for the US Census in (I think) 1880, but the world seemed to stand fairly still until computer software started to implement name matching in the 70’s. The strange thing is that Soundex seems to have remained the de facto standard until well into the 90’s, even up to the present day – strange because Soundex is manifestly prone to false matches as well as missing some typical true name matches. To explain why this happens, let’s look at how Soundex works:

Soundex constructs a crude non-phonetic key by keeping the initial letter of the name, then removing all vowels, plus the letters H, W and Y, and translating the remaining letters to numbers e.g. Tootill and Toothill both become T340. It gives the same number to letters that can be confused e.g. ‘m’ and ‘n’ both become 5. It also drops repeated consonants and consecutive letters that give the same number e.g. S and C. To illustrate some of the issues: Soundex translates Brady, Beard and Broad all to B630, and Wilkins and Wilson both to W425, and yet doesn’t match Knot and Nott – let alone more challenging examples like Dayton and Deighton.
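Here's the scheme in runnable form – a straightforward Python implementation of classic American Soundex, which reproduces the examples above:

```python
def soundex(name):
    """Classic American Soundex: initial letter plus up to three digits."""
    codes = {}
    for digit, letters in [("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                           ("4", "l"), ("5", "mn"), ("6", "r")]:
        for ch in letters:
            codes[ch] = digit
    name = name.lower()
    prev = codes.get(name[0], "")   # a repeat of the initial's code is dropped
    digits = []
    for ch in name[1:]:
        if ch in "hw":
            continue                # h and w are ignored entirely
        code = codes.get(ch, "")    # vowels (and y) code to "" and reset prev
        if code and code != prev:
            digits.append(code)
        prev = code
    return (name[0].upper() + "".join(digits) + "000")[:4]

for n in ["Tootill", "Toothill", "Brady", "Beard", "Broad",
          "Wilkins", "Wilson", "Knot", "Nott"]:
    print(n, soundex(n))
# Tootill and Toothill -> T340; Brady, Beard and Broad -> B630;
# Wilkins and Wilson -> W425; but Knot -> K530 while Nott -> N300
```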

A lot of the work done in the 70's and 80's focused on shoring up Soundex's more obvious symptoms, rather than addressing the root problem – it doesn't attempt to understand how people pronounce names. In the last 20 years, various "phonetic" algorithms have entered the public domain, but Metaphone 3 is the only one I know of that doesn't invariably disregard the vowels after the first letter of the name.

Much of the material I read back in 1995 (when searching for a phonetic algorithm that worked better than Soundex) started off on the wrong tack by adopting a similar approach to Soundex. Often, the authors quoted various sounds that are rarely present in people's names to support their solutions, e.g. the different "ua" sounds in "persuade" and "casual", or the "pt" in "Ptolemy". Even when I've revisited the subject in the years since, there has been little advance. Back in 1995, I decided that we at helpIT would write our own genuinely phonetic algorithm, and I laid down these requirements:

  • It must produce phonetic codes that represent typical pronunciations
  • It should focus on “proper names” and not consider other words
  • It should be loose enough to allow for British and American regional differences in pronunciation (e.g. “Shaw” and “Shore”) but not so loose as to equate names that sound completely different
  • It should not try and address other forms of fuzzy matching that arise from keying or reading errors and inconsistencies – phonetic matching should, by definition, simply address words that may sound the same.

The last point is important – the most effective way to match names is to use a combination of techniques to allow for different kinds of error, rather than try to create one algorithm that is a jack of all trades. I will describe our phonetic algorithm in a later post, but software development, like data quality, is a journey and not a destination – so we’re always looking to improve. If any of our readers know of any other algorithms that attempt to genuinely represent the phonetic attributes of names, I would be delighted to hear from you.
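As a sketch of that combination idea, here's one way to pair the Soundex code above with a plain string-similarity test from Python's standard library, so each technique catches errors the other misses (the 0.8 threshold is an arbitrary illustration, not a recommended setting):

```python
from difflib import SequenceMatcher

# Hypothetical combination rule: flag a candidate match if the names agree
# phonetically OR look like keying variants of each other as strings.
def name_match(a, b, threshold=0.8):
    phonetic = soundex(a) == soundex(b)  # soundex() from the sketch above
    similar = SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
    return phonetic or similar

print(name_match("Batten", "Batton"))     # True: phonetic agreement
print(name_match("Thomson", "Thompson"))  # True: Soundex differs (T525 vs
                                          # T512), but string similarity
                                          # catches the keying variant
```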

Steve Tootill (stEv tWtyl)