As part of our ongoing research and development, I’ve been looking at the new Data Quality Services functionality that’s available in SQL Server 2012 and how it can be used with your data to complement your matchIT SQL experience. We previously issued an overview before the release of 2012 (“What to Expect from DQS in SQL Server 2012”).
At the heart of DQS is the knowledge base. It is essentially the memory bank of data that enables the DQS process to make decisions when dealing with the data it is presented with.
A very broad overview of how you might use DQS for data cleansing follows. Essentially when using the data cleansing aspect of the product, each of your columns is mapped to a user-defined domain (for example you may set up a domain called Firstname). The knowledge base functionality then allows you to build up a set of valid/invalid rules for the Firstname domain, which can be used to process your data. Once you have created some domains, you can also build composite domains e.g. a Fullname composite domain could be constructed from the domains Prefix, Firstname and Lastname. Now when running a DQS cleansing project with the defined knowledge base, you can map in an addressee field and split out prefixes, first names and last names.
This works well (albeit slowly) if your data is already well structured and not large volume, but name and address data is frequently not well structured and you may be working with millions of records.
Once the knowledge discovery process is complete, you can review the results and decisions made, and tweak and make amendments to the results. A large drawback however, is that the normalization rules are value rather than pattern-based. For example, in one case, the value #1 was parsed as a premise. In this case, I actually wanted the domain to translate the ‘#1’ to a ‘1’, so I made the necessary amendment to the rule (i.e. marked it as invalid, and specified a value to correct it to). Further occurrences of such examples, such as #5 and #10 were also amended in the domain to translate them to their integer versions. After re-running the process, I found that other values in the same format that were not manually amended were not corrected; the implication here is that you’d have to create correction rules for every single numerical possibility. To try and get round this limitation, I attempted to apply a Term-Based-Relation (which is effectively a find and replace option in DQS), however because my hashes have no space prior to the following number (e.g. #10), the hash was not picked up as a separate term to be replaced. If anyone has found a way of achieving this, then I’d be really interested to hear about it. What’s really needed is to allow regular expressions within the Term-Based Relation functionality, but this didn’t seem to work when I tried and I can find no reference to this being permissible in the documentation.
I think that as a data analyst or administrator, you could spend an awful lot of time tuning DQS to handle name and address cleansing, and perhaps still not be satisfied with the results – but if for example you used a dedicated contact data cleansing tool and combined that with DQS to clean other data (for example product names, order details etc.) then potentially you have a very powerful combination of products.
Having looked at the initial data cleansing that was possible in DQS, I then moved on to have a look at the inbuilt matching functionality that DQS provides. I’ll cover this in another post soon, but let’s just take a quick look at performance: with DQS the processing performance was much slower than I expected, even when just processing a relatively small table of data. For example, running a simple 100,000 record file through DQS matching took 7 minutes on my machine – but only 30 seconds when processing through our matchIT SQL product on the same machine. 100,000 records is fairly small for a production database, when large organizations typically have millions of customers. DQS would handle batches of new records fairly quickly, but as it will only match within a table, it won’t be possible to match this against existing data without adding it to the existing main table and incurring the longer run time. At this point, I’ve yet to run DQS on a very large volume database and I suspect that it will be extremely time consuming. However, DQS does have great potential for handling some customer-specific data patterns that packaged software would also need tuning for – the advantage being that the DQS user will be more self-sufficient when dealing with the customer-specific patterns, and then be able to use the packaged software with less tuning required to its out-of-the-box configuration.
If you are interested in expanding DQS capabilities with an integrated Data Quality toolset, you can take a look at our recent matchIT SQL Demo.