
Phonetic Matching Matters!

by Steve Tootill (Tootle, Toothill, Tutil, Tootil, Tootal)

In a recent blog entry, Any Advance on Soundex?, I promised to describe our phonetic algorithm, soundIT. To recap, here’s what we think a phonetic algorithm for contact data matching should do:

  • Produce phonetic codes that represent typical pronunciations
  • Focus on “proper names” and not consider other words
  • Be loose enough to allow for regional differences in pronunciation but not so loose as to equate names that sound completely different.

We don’t think it should also try to address inconsistencies that arise from keying or reading errors, as that is best done by other algorithms focused on those types of issues.

To design our algorithm, I decided to keep it in the family: my father Geoff Tootill is a linguist, classics scholar and computer pioneer, who developed the logic design for the first commercial stored-program computer at Manchester University in 1948, the first computer that stored programs in electronic memory.

[Image: The first program stored in electronic memory]

Geoff was an obvious choice to grapple with the problem of how to design a program that understands pronunciation… We called the resultant algorithm “soundIT”.

So, how does it work?

soundIT derives phonetic codes that represent typical pronunciation of names. It takes account of vowel sounds and determines the stressed syllable in the name. This means that “Batten” and “Batton” sound the same according to soundIT, as the different letters fall in the unstressed syllable, whilst “Batton” and “Button” sound different, as it is the stressed syllable which differs. Clearly, “Batton” and “Button” are a fuzzy match, just not a phonetic match. My name is often misspelled as “Tootle”, “Toothill”, “Tutil”, “Tootil” and “Tootal”, all of which soundIT equates to the correct spelling of “Tootill” – probably why I’m so interested in fuzzy matching of names! Although “Toothill” could be pronounced as “tooth-ill” rather than “toot-hill”, most people treat the “h” as part of “hill” but don’t stress it, hence it sounds like “Tootill”. Another advantage of soundIT is that it can recognize silent consonants – thus it can equate “Shaw” and “Shore”, “Wight” and “White”, “Naughton” and “Norton”, “Porter” and “Porta”, “Moir” and “Moya” (which are all reasonably common last names in the UK and USA).
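To make the stressed-syllable idea concrete, here is a toy Python sketch. It is emphatically not soundIT: the function name `toy_stress_key` is made up for illustration, and it rests on the crude assumption that the first vowel group in a name is the stressed one. It keeps that vowel group, ignores every later vowel, and collapses doubled letters.

```python
import re


def toy_stress_key(name: str) -> str:
    """Toy key (NOT soundIT): keep only the first vowel group.

    Assumption: the first vowel group is the stressed syllable, so
    vowels after it are treated as unstressed and dropped.
    """
    s = re.sub(r"[^A-Z]", "", name.upper())
    m = re.match(r"([^AEIOUY]*)([AEIOUY]+)(.*)", s)
    if m is None:
        # No vowels at all: just collapse doubled letters.
        return re.sub(r"(.)\1+", r"\1", s)
    head, stressed, rest = m.groups()
    rest = re.sub(r"[AEIOUY]", "", rest)  # drop the unstressed vowels
    # Collapse doubled letters such as "TT" into a single letter.
    return re.sub(r"(.)\1+", r"\1", head + stressed + rest)


print(toy_stress_key("Batten"), toy_stress_key("Batton"), toy_stress_key("Button"))
# BATN BATN BUTN
```

Even this crude key separates “Batton” from “Button” while equating “Batten” and “Batton”. A real phonetic algorithm has to work out where the stress actually falls and how each letter group is pronounced, which is where the hard work lies.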

There are always going to be challenges with representing pronunciation of English names e.g. the city of “Reading” rhymes with “bedding” not “weeding”, to say nothing of the different pronunciations of “ough” represented in “A rough-coated dough-faced ploughboy strode coughing and hiccoughing thoughtfully through the streets of the borough”. Although there are no proper names in this sentence, the challenges of “ough” are represented in place names like “Broughton”, “Poughkeepsie” and “Loughborough”. Fortunately, these challenges only occur in limited numbers and we have found in practice that non-phonetic fuzzy matching techniques, together with matching on other data for a contact or company, allow for the occasional ambiguity in pronunciation of names and places. These exceptions don’t negate the need for a genuine phonetic algorithm in your data matching arsenal.

We implemented soundIT within our dedupe package (matchIT) fairly easily and then proceeded to feed through vast quantities of data to identify any weaknesses and improvements required. soundIT proved very successful in its initial market in the UK and then in the USA. There are algorithms that focus on other languages, such as Beider-Morse Phonetic Matching for Germanic and Slavic languages, but as helpIT systems’ market focus is on English and Pan-European data, we developed a generic form of soundIT for European languages. We also use a looser version of the algorithm for identifying candidate matches than we do for actually allocating similarity scores.
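The general pattern of a loose first pass followed by a stricter scoring pass can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how matchIT does it: `loose_key`, `strict_score`, `candidate_pairs` and the 0.8 threshold are all hypothetical, the loose key is a simple consonant skeleton rather than a phonetic code, and the scoring step uses the standard library’s `difflib.SequenceMatcher` in place of a real similarity model.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def loose_key(name: str) -> str:
    """Hypothetical loose key: first letter plus later consonants, doubles collapsed."""
    s = "".join(c for c in name.upper() if c.isalpha())
    if not s:
        return ""
    key = s[0]
    for c in s[1:]:
        # Skip vowels and H, W, Y; skip a letter that repeats the previous key letter.
        if c not in "AEIOUYHW" and c != key[-1]:
            key += c
    return key


def strict_score(a: str, b: str) -> float:
    """Stand-in similarity score in the range 0..1 (character-based, not phonetic)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_pairs(names, threshold=0.8):
    """Group names by the loose key, then score only the pairs inside each group."""
    blocks = defaultdict(list)
    for n in names:
        blocks[loose_key(n)].append(n)
    pairs = []
    for group in blocks.values():
        for a, b in combinations(group, 2):
            score = strict_score(a, b)
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return pairs


print(candidate_pairs(["Batten", "Batton", "Button", "Shaw"]))
```

The loose key deliberately over-generates candidates (here “Batten”, “Batton” and “Button” all land in the same group), and the stricter score then decides which of those candidates are worth keeping.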

Of course, American English pronunciation of names can be subtly different. This was brought home to us when an American customer passed on a comment from one of his team: “Does Shaw really sound like Shore?” As I was reading this in an email, and as I am a Brit, I was confused! I rang a friend in Texas, who laughed and explained that I was reading it wrong. He read it back to me in a Texan accent and, I must admit, they did sound different! But then he explained that if you are from Boston, Shaw and Shore do sound very similar, so he felt that we were quite right to flag them as a potential match.

No program is ever perfect, so we continue to develop and tweak soundIT to this day, but it has stood the test of time remarkably well. Apart from Beider-Morse, I still don’t know of another algorithm that takes this truly phonetic approach, let alone as successfully as soundIT has done.

Steve Tootill (stEv tWtyl)

Any Advance on Soundex?

A lot has been written about “phonetic algorithms” since Soundex was patented in 1918 (it was later used to index US Census records from 1880 onwards), but the world seemed to stand fairly still until computer software started to implement name matching in the 1970s. The strange thing is that Soundex seems to have remained the de facto standard until well into the 1990s, even up to the present day. This is strange because Soundex is manifestly prone to false matches as well as missing some typical true name matches. To explain why this happens, let’s look at how Soundex works:

Soundex constructs a crude non-phonetic key by keeping the initial letter of the name, then removing all vowels, plus the letters H, W and Y, and translating the remaining letters to numbers, padding or truncating the result to one letter and three digits, e.g. Tootill and Toothill both become T340. It gives the same number to letters that can be confused, e.g. ‘m’ and ‘n’ both become 5. It also drops repeated consonants and consecutive letters that give the same number, e.g. S and C. To illustrate some of the issues: Soundex translates Brady, Beard and Broad all to B630, and Wilkins and Wilson both to W425, and yet doesn’t match Knot and Nott, let alone more challenging examples like Dayton and Deighton.
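For readers who want to try these examples, here is a short Python sketch of the commonly described American Soundex rules. The function name `soundex` and the table layout are my own choices for illustration; variants of Soundex differ in small details (such as how H and W are handled between letters with the same number), so treat this as one reasonable reading rather than a reference implementation.

```python
# Letter groups that share a Soundex digit.
CODES = {}
for letters, digit in (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return a four-character Soundex key (letter plus three digits)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")  # the first letter's own code counts for duplicates
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W are ignored and do not separate duplicate codes
        code = CODES.get(ch, "")  # vowels and Y have no code
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return (first + "".join(digits) + "000")[:4]


for n in ("Tootill", "Toothill", "Brady", "Beard", "Broad",
          "Wilkins", "Wilson", "Knot", "Nott", "Dayton", "Deighton"):
    print(n, soundex(n))
```

Running this reproduces the keys quoted above: T340 for Tootill and Toothill, B630 for Brady, Beard and Broad, and W425 for Wilkins and Wilson, while Knot (K530) and Nott (N300), and Dayton (D350) and Deighton (D235), end up with different keys.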

A lot of the work done in the 1970s and 1980s focused on treating Soundex’s more obvious symptoms, rather than addressing the root problem: it doesn’t attempt to understand how people pronounce names. In the last 20 years, various “phonetic” algorithms have entered the public domain, but Metaphone 3 is the only one that I know of that doesn’t invariably disregard the vowels after the first letter of the name.

Much of the material I read back in 1995 (when searching for a phonetic algorithm that worked better than Soundex) started off on the wrong tack by adopting a similar approach to Soundex. Often, the authors supported their solutions by citing sounds that are rarely present in people’s names, e.g. the different “ua” sounds in “persuade” and “casual”, or the “pt” in “Ptolemy”. Even when I’ve revisited the subject at intervals since then, there has been little advance. Back in 1995, I decided that we at helpIT would write our own genuinely phonetic algorithm, and I laid down these requirements:

  • It must produce phonetic codes that represent typical pronunciations
  • It should focus on “proper names” and not consider other words
  • It should be loose enough to allow for British and American regional differences in pronunciation (e.g. “Shaw” and “Shore”) but not so loose as to equate names that sound completely different
  • It should not try to address other forms of fuzzy matching that arise from keying or reading errors and inconsistencies: phonetic matching should, by definition, simply address words that may sound the same.

The last point is important – the most effective way to match names is to use a combination of techniques to allow for different kinds of error, rather than try to create one algorithm that is a jack of all trades. I will describe our phonetic algorithm in a later post, but software development, like data quality, is a journey and not a destination – so we’re always looking to improve. If any of our readers know of any other algorithms that attempt to genuinely represent the phonetic attributes of names, I would be delighted to hear from you.

Steve Tootill (stEv tWtyl)