Home » Best Practices » Creating Your Ideal Test Data
formats

Creating Your Ideal Test Data

Every day we work with customers to begin the process of evaluating helpIT data quality software (along with other vendors they are looking at). That process can be daunting for a variety of reasons from identifying the right vendors to settling on an implementation strategy, but one of the big hurdles that occurs early on in the process is running an initial set of data through the application.

Once you’ve gotten a trial of a few applications (hopefully including helpIT’s) and you are poised to start your evaluation to determine which one is going to generate the best result - you’ll need to develop a sample data set to run on the software. This is an important step not to be overlooked because you want to be sure that the software you invest in can deliver the highest quality matches so you can effectively dedupe your database and most importantly, TRUST that the resulting data is as clean as it possibly can be with the least possible wiggle room. So how do you create the ideal test data?

The first word of advice – use real data.

Many software trials will come preinstalled with sample or demo data designed primarily to showcase the features of the software. While this sample data can give you examples of generic match results, they will not be a clear reflection of your match results. This is why it is best to run an evaluation of the software on your own data whenever possible. Using the guidelines below, we suggest ‘identifying’ a real dataset that is representative of the challenges you will typically see within your actual database. That dataset will tell you if the software can find your more challenging matches, and how well it can do that.

For fuzzy matching features, you may like to consider whether the data that you test with includes these situations:

  • phonetic matches (e.g. Naughton and Norton)
  • reading errors (e.g. Horton and Norton)
  • typing errors (e.g. Notron, Noron, Nortopn and Norton)
  • one record has title and initial and the other has first name with no title
    (e.g. Mr J Smith and John Smith)
  • one record has missing name elements (e.g. John Smith and Mr J R Smith)
  • names are reversed (e.g. John Smith and Smith, John)
  • one record has missing address elements (e.g. one record has the village or house
    name and the other address just has the street number or town)
  • one record has the full postal code and the other a partial postal code or no postal code

When matching company names data, consider including the following challenges:

  • acronyms e.g. IBM, I B M, I.B.M., International Business Machines
  • one record has missing name elements e.g.
    1. The Crescent Hotel, Crescent Hotel
    2. Breeze Ltd, Breeze
    3. Deloitte & Touche, Deloitte, Deloittes.

You should also ensure that you have groups of records where the data that matches exactly, varies for pairs within the group. For example:

 

If you don’t have these scenarios all represented, you can doctor your real data to create them, as long as you start with real records that are as close as possible to the test cases and make one or at the most two changes to each record. In the real world, matching records will have something in common – not every field will be slightly different.

With regard to size, it’s better to work with a reasonable sample of your data than a whole database or file, otherwise the mass of information runs the risk of obscuring important details and test runs take longer than they need to. We recommend that you take two selections from your data – one for a specific postal code or geographic area, and one (if possible) an alphabetical range by last name. Join these selections together and then eliminate all the exact matches – if you can’t do this easily, one of the solutions that you’re evaluating can probably do it for you.

Ultimately, you should have a reasonable size sample without so many obvious matches, which should contain a reasonable number of fuzzier matches (e.g. matches where the first character of the postal code or last name is different between two records that otherwise match, matches with phonetic variations of last name, etc.)

__________________________________________________________________________

For more information on data quality vendor evaluations, please download our Practical Guide to Data Quality Vendor Selection.

 

One Response

  1. Good post Chris.

    I totally agree on using your own live data.

    About size you may want to test the speed of a product, as this can be a real surprise if you have a lot of rows to be processed.

    Also evaluating the test result will be a test of the user interface for interactive inspection, an area where many tools falls short.

    Some while ago I made a post with more on these subjects. The post is called Testing a Data Matching Tool .

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

© Copyright 2012
credit