by Steve Tootill (Tootle, Toothill, Tutil, Tootil, Tootal)
In a recent blog entry, Any Advance on Soundex?, I promised to describe our phonetic algorithm, soundIT. To recap, here’s what we think a phonetic algorithm for contact data matching should do:
- Produce phonetic codes that represent typical pronunciations
- Focus on “proper names” and not consider other words
- Be loose enough to allow for regional differences in pronunciation but not so loose as to equate names that sound completely different.
We don’t think it should also try to address keying or reading errors and inconsistencies, as those are best handled by other algorithms focused on those types of issues.
To design our algorithm, I decided to keep it in the family: my father Geoff Tootill is a linguist, classics scholar and computer pioneer, who played a leading role in the development of the Manchester Small-Scale Experimental Machine in 1947-48, popularly known now as the “Baby” – the first computer that stored programs in electronic memory.
Geoff was an obvious choice to grapple with the problem of how to design a program that understands pronunciation… We called the resultant algorithm “soundIT”.
So, how does it work?
soundIT derives phonetic codes that represent typical pronunciation of names. It takes account of vowel sounds and determines the stressed syllable in the name. This means that “Batten” and “Batton” sound the same according to soundIT, as the different letters fall in the unstressed syllable, whilst “Batton” and “Button” sound different, as it is the stressed syllable which differs. Clearly, “Batton” and “Button” are a fuzzy match, just not a phonetic match. My name is often misspelled as “Tootle”, “Toothill”, “Tutil”, “Tootil” and “Tootal”, all of which soundIT equates to the correct spelling of “Tootill” – probably why I’m so interested in fuzzy matching of names! Although “Toothill” could be pronounced as “tooth-ill” rather than “toot-hill”, most people treat the “h” as part of “hill” but don’t stress it, hence it sounds like “Tootill”. Another advantage of soundIT is that it can recognize silent consonants – thus it can equate “Shaw” and “Shore”, “Wight” and “White”, “Naughton” and “Norton”, “Porter” and “Porta”, “Moir” and “Moya” (which are all reasonably common last names in the UK and USA).
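To make the stressed-syllable idea concrete, here is a toy sketch in Python. It is not soundIT: the function name `toy_key` is made up for illustration, and it makes the crude assumption that the stress always falls on the first syllable. It keeps the first vowel group, drops later vowels and collapses doubled letters, which is just enough to reproduce the “Batten”/“Batton”/“Button” behaviour described above.

```python
VOWELS = set("aeiouy")


def toy_key(name: str) -> str:
    """Toy phonetic key (hypothetical, NOT the soundIT algorithm).

    Assumption: the first syllable is the stressed one, so only the
    first vowel group is kept; vowels in later syllables are dropped.
    """
    # Keep letters only and collapse runs of the same letter ("tt" -> "t").
    collapsed = []
    for ch in name.lower():
        if ch.isalpha() and (not collapsed or collapsed[-1] != ch):
            collapsed.append(ch)

    key = []
    in_first_vowel_group = False
    stressed_vowel_done = False
    for ch in collapsed:
        if ch in VOWELS:
            if not stressed_vowel_done:
                # Still inside the (assumed) stressed syllable: keep it.
                key.append(ch)
                in_first_vowel_group = True
            # Vowels after the stressed syllable are ignored.
        else:
            if in_first_vowel_group:
                # A consonant closes the first vowel group.
                stressed_vowel_done = True
                in_first_vowel_group = False
            key.append(ch)
    return "".join(key)


if __name__ == "__main__":
    for n in ("Batten", "Batton", "Button"):
        print(n, toy_key(n))
```

With this sketch, “Batten” and “Batton” share a key while “Button” gets a different one. It works on letters rather than sounds, so it knows nothing about silent consonants or where the stress really falls, and it will not equate pairs like “Shaw” and “Shore”.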
There are always going to be challenges with representing pronunciation of English names e.g. the city of “Reading” rhymes with “bedding” not “weeding”, to say nothing of the different pronunciations of “ough” represented in “A rough-coated dough-faced ploughboy strode coughing and hiccoughing thoughtfully through the streets of the borough”. Although there are no proper names in this sentence, the challenges of “ough” are represented in place names like “Broughton”, “Poughkeepsie” and “Loughborough”. Fortunately, these challenges only occur in limited numbers and we have found in practice that non-phonetic fuzzy matching techniques, together with matching on other data for a contact or company, allow for the occasional ambiguity in pronunciation of names and places. These exceptions don’t negate the need for a genuine phonetic algorithm in your data matching arsenal.
We implemented soundIT within our dedupe package (matchIT) fairly easily and then fed through vast quantities of data to identify any weaknesses and improvements required. soundIT proved very successful in its initial market in the UK and then in the USA. There are algorithms that focus on other languages, such as Beider-Morse Phonetic Matching for Germanic and Slavic languages, but as helpIT systems’ market focus is on English and pan-European data, we developed a generic form of soundIT for European languages. We also use a looser version of the algorithm for identifying candidate matches than we do for actually allocating similarity scores.
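The idea of a looser key for finding candidates and a stricter one for scoring can be sketched as follows. Everything here is an assumption made for illustration: `loose_key`, `strict_key`, `candidate_pairs`, `score` and the 1.0/0.5 score values are hypothetical and are not how matchIT or soundIT actually work.

```python
from collections import defaultdict
from itertools import combinations

VOWELS = set("aeiouy")


def loose_key(name: str) -> str:
    """Loose key (hypothetical): consonants only, adjacent repeats collapsed."""
    out = []
    prev = ""
    for ch in name.lower():
        if ch.isalpha() and ch not in VOWELS and ch != prev:
            out.append(ch)
        prev = ch
    return "".join(out)


def strict_key(name: str) -> tuple:
    """Strict key (hypothetical): first vowel letter plus the loose key."""
    first_vowel = next((ch for ch in name.lower() if ch in VOWELS), "")
    return (first_vowel, loose_key(name))


def candidate_pairs(names):
    """Group names by loose key and yield every pair within a group."""
    blocks = defaultdict(list)
    for name in names:
        blocks[loose_key(name)].append(name)
    for group in blocks.values():
        yield from combinations(group, 2)


def score(a: str, b: str) -> float:
    """Illustrative score: 1.0 strict match, 0.5 loose-only match, else 0.0."""
    if strict_key(a) == strict_key(b):
        return 1.0
    if loose_key(a) == loose_key(b):
        return 0.5
    return 0.0


if __name__ == "__main__":
    names = ["Batten", "Batton", "Button", "Shaw"]
    for a, b in candidate_pairs(names):
        print(a, b, score(a, b))
```

In this sketch all three “B-t-n” names land in the same candidate group, but only “Batten” and “Batton” get the top score. “Shaw” never gets compared with them at all, which is the point of using a loose key first: it cuts down the number of comparisons before the stricter scoring step.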
Of course, American English pronunciation of names can be subtly different – a point that was brought home to us when an American customer passed on the comment from one of his team “Does Shaw really sound like Shore?” As I was reading this in an email, and as I am a Brit, I was confused! I rang a friend in Texas who laughed and explained that I was reading it wrong – he read it back to me in a Texan accent and I must admit, they did sound different! But then he explained to me that if you are from Boston, Shaw and Shore do sound very similar, so he felt that we were quite right to flag them as a potential match.
No program is ever perfect, so we continue to develop and tweak soundIT to this day, but it has stood the test of time remarkably well – apart from Beider-Morse, I still don’t know of another algorithm that takes this truly phonetic approach, let alone as successfully as soundIT has done.
Steve Tootill (stEv tWtyl)