If you’ve found yourself reading this blog then you’re no doubt already aware of the importance of maintaining data quality through processes such as data verification, suppression screening, and duplicate detection. In this post I’d like to look a bit closer at how you draw value from, and make the best use of, the results of the hard work you invest into tracking down duplicates within your data.
The great thing about fuzzy matching is that it enables us to identify groups of two or more records that pertain to the same entity but that don’t necessarily contain exactly the same information. Records in a group of fuzzy matches will normally contain similar information with slight variations from one record to the next. For example, one record may contain a full forename whilst another contains just an abbreviated version or even none at all. You will also frequently encounter fuzzy matches where incorrectly spelt or poorly input data is matched against its accurate counterpart.
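To make the idea concrete, here is a minimal sketch of fuzzy string comparison using Python's standard-library difflib. The names and the 0-to-1 similarity ratio are illustrative assumptions only; commercial matching engines use far more sophisticated, field-aware techniques than a single string ratio.

```python
# Illustrative only: a simple similarity check in the spirit of fuzzy
# matching, using Python's standard-library difflib.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two strings (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical record pairs of the kinds described above.
pairs = [
    ("Jonathan Smith", "Jon Smith"),         # abbreviated forename
    ("Catherine Jones", "Katherine Jones"),  # spelling variation
    ("Jonathan Smith", "Mary Brown"),        # unrelated records
]
for a, b in pairs:
    print(f"{a!r} vs {b!r}: {similarity(a, b):.2f}")
```

In practice a matching engine would compare each field (forename, surname, address line, postcode) separately and combine the results, but even this crude ratio separates the genuine variants from the unrelated pair.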
Once you’ve identified these groups of fuzzy matches, what do you do with them? Ultimately you want to end up with only unique records within your data, but there are a couple of ways to reach that goal. One approach is to determine the best record in a group of matches and discard all of the records that matched against it. Alternatively, you may find that you can draw more value from your data by taking the most accurate, complete, and relevant information from a group of matched records and merging it together, leaving you with a single hybrid record containing a better set of data than was available in any of the individual records from which it was created.
Regardless of the approach you take, you’ll need to establish some rules for determining the best record, or the best pieces of information, from multiple records. Removing the wrong record or information could actually make your data worse, so this decision warrants some thought. The criteria you use will vary from one job to the next, but the following seven questions target the desirable attributes you’ll want to consider when deciding what data should be retained:
- How current is the data?
You’ll most likely want to keep data that was most recently acquired.
- How complete is the data?
How many fields are populated, and how well are those fields populated?
- Is the data valid?
Have dates been entered in the required format? Does an email address contain an at sign?
- Is the data accurate?
Has it been verified (e.g. address verified against PAF)?
- How reliable is the data?
Has it come from a trusted source?
- Is the data relevant?
Is the data appropriate for its intended use (e.g. keep female contacts over male if compiling a list of recipients for a women’s clothing catalogue)?
- Is there a predetermined hierarchy?
Do you have a business rule in place that requires that one set of data always be used over another?
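The questions above can be combined into a single score per record. The sketch below is an illustration of that idea, not any vendor's actual implementation; the field names, weights, and scoring rules are all assumptions chosen for the example.

```python
# Hedged sketch: score each record in a group of fuzzy matches against
# weighted criteria (recency, completeness, accuracy, reliability) and
# keep the highest scorer as the master. Weights and fields are invented.
from datetime import date

WEIGHTS = {
    "recency": 3.0,        # how current is the data?
    "completeness": 2.0,   # how many fields are populated?
    "verified": 4.0,       # e.g. address verified against PAF
    "trusted_source": 2.5, # has it come from a trusted source?
}

FIELDS = ["forename", "surname", "email", "job_title"]

def score(record: dict) -> float:
    s = 0.0
    # Recency: newer last-updated dates score higher (scaled roughly 0..1).
    s += WEIGHTS["recency"] * min((record["updated"].year - 2000) / 25, 1.0)
    # Completeness: fraction of fields that are non-empty.
    s += WEIGHTS["completeness"] * (
        sum(bool(record.get(f)) for f in FIELDS) / len(FIELDS)
    )
    # Accuracy and reliability: flat bonuses for verification and source.
    if record.get("address_verified"):
        s += WEIGHTS["verified"]
    if record.get("source") == "crm":
        s += WEIGHTS["trusted_source"]
    return s

group = [
    {"forename": "Jon", "surname": "Smith", "email": "", "job_title": "",
     "updated": date(2018, 5, 1), "address_verified": False, "source": "web"},
    {"forename": "Jonathan", "surname": "Smith",
     "email": "jsmith@example.com", "job_title": "",
     "updated": date(2023, 2, 14), "address_verified": True, "source": "crm"},
]
master = max(group, key=score)
print(master["forename"])  # the fuller, verified CRM record wins
```

A predetermined hierarchy (the last question above) would typically short-circuit this scoring entirely: if a business rule says CRM data always wins, you apply that rule first and only fall back to scoring when it doesn't decide the matter.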
When you have such a large range of competing criteria to consider, how do you apply all of these rules simultaneously? The approach we at helpIT use in our software is to allow the user to weight each item or collection of data, so they can choose which aspects are the most important in their business context. The deciding factor isn’t necessarily whether an item is present or how long it is; it could be whether the value was input directly or derived from supplied information, or whether it has been verified against an external dataset such as a Postal Address File.

Once the master record has been selected, the user may also want to transfer data from the records being deleted to the master record, for example copying a job title from a duplicate to a master record that contains fuller name and address information but no job title. By creating a composite record in this way, you ensure that no useful data is lost.
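That last step, filling gaps in the chosen master from the duplicates being discarded, can be sketched as below. This is a simplified illustration under assumed field names, not a description of any particular product's merge logic.

```python
# Hedged sketch: build a composite record by copying any non-empty value
# from a discarded duplicate into an empty field on the master, so that
# useful data (such as a job title) survives the deduplication.
def merge_into_master(master: dict, duplicates: list[dict]) -> dict:
    composite = dict(master)  # don't mutate the original master
    for dup in duplicates:
        for field, value in dup.items():
            if value and not composite.get(field):
                composite[field] = value  # take the first non-empty value
    return composite

master = {"forename": "Jonathan", "surname": "Smith",
          "address": "1 High St, London", "job_title": ""}
duplicate = {"forename": "Jon", "surname": "Smith",
             "address": "", "job_title": "Head of Marketing"}

merged = merge_into_master(master, [duplicate])
print(merged["job_title"])  # the job title is rescued from the duplicate
```

Note that the master's own populated fields are never overwritten here; only its gaps are filled. A real merge would also need rules for conflicting non-empty values, which is where the weighting criteria discussed above come back into play.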
Hopefully this post will have given you something to think about when deciding how to deal with the duplicates you’ve identified in your data. I’d welcome any comments or questions.