Taking a Look at Data Quality Services in SQL Server 2012

As part of our ongoing research and development, I’ve been looking at the new Data Quality Services functionality that’s available in SQL Server 2012 and how it can be used with your data to complement your matchIT SQL experience. We previously issued an overview before the release of 2012 (“What to Expect from DQS in SQL Server 2012”).

At the heart of DQS is the knowledge base. It is essentially the store of reference data and rules that enables DQS to make decisions about the data it is presented with.

A very broad overview of how you might use DQS for data cleansing follows. When using the data cleansing side of the product, each of your columns is mapped to a user-defined domain (for example, you might set up a domain called Firstname). The knowledge base functionality then allows you to build up a set of valid/invalid values and correction rules for the Firstname domain, which can be used to process your data. Once you have created some domains, you can also build composite domains: for example, a Fullname composite domain could be constructed from the Prefix, Firstname and Lastname domains. When you then run a DQS cleansing project with the defined knowledge base, you can map in an addressee field and split it out into prefixes, first names and last names.

This works well (albeit slowly) if your data is already well structured and modest in volume, but name and address data is frequently not well structured, and you may be working with millions of records.

Once the knowledge discovery process is complete, you can review the results and the decisions made, and tweak or amend them. A large drawback, however, is that the normalization rules are value-based rather than pattern-based. For example, in one case the value #1 was parsed as a premise. I actually wanted the domain to translate '#1' to '1', so I made the necessary amendment to the rule (i.e. marked the value as invalid and specified the value to correct it to). Further occurrences, such as #5 and #10, were also amended in the domain to translate them to their integer versions. After re-running the process, I found that other values in the same format that had not been manually amended were not corrected; the implication is that you would have to create a correction rule for every single numerical possibility. To try to get round this limitation, I attempted to apply a Term-Based Relation (effectively a find-and-replace option in DQS); however, because my hash signs have no space before the following number (e.g. #10), the hash was not picked up as a separate term to be replaced. If anyone has found a way of achieving this, I'd be really interested to hear about it. What's really needed is support for regular expressions within the Term-Based Relation functionality, but this didn't seem to work when I tried it, and I can find no reference in the documentation to it being permissible.
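As a workaround, a pattern-based correction is straightforward to apply in plain T-SQL before (or after) the data goes through DQS. The sketch below is illustrative only – the table and column names (dbo.ContactStaging, Premise) are hypothetical – but it shows the kind of single rule that covers every '#<number>' value:

-- Normalise values like '#1', '#5', '#10' to '1', '5', '10' with one
-- pattern-based rule, rather than one DQS correction per distinct value.
-- Table and column names are hypothetical.
UPDATE dbo.ContactStaging
SET    Premise = STUFF(Premise, 1, 1, '')   -- strip the leading '#'
WHERE  Premise LIKE '#[0-9]%'               -- '#' followed by at least one digit
  AND  Premise NOT LIKE '#%[^0-9]%';        -- and nothing but digits after the '#'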

I think that, as a data analyst or administrator, you could spend an awful lot of time tuning DQS to handle name and address cleansing and perhaps still not be satisfied with the results – but if, for example, you used a dedicated contact data cleansing tool and combined it with DQS to clean other data (product names, order details and so on), then you would potentially have a very powerful combination of products.

Having looked at the initial data cleansing that is possible in DQS, I moved on to look at the inbuilt matching functionality that DQS provides. I'll cover this in another post soon, but let's take a quick look at performance: DQS processing was much slower than I expected, even when processing a relatively small table of data. For example, running a simple 100,000 record file through DQS matching took 7 minutes on my machine – but only 30 seconds when processing through our matchIT SQL product on the same machine. 100,000 records is fairly small for a production database, given that large organizations typically have millions of customers. DQS would handle batches of new records fairly quickly, but as it will only match within a single table, it isn't possible to match new records against existing data without adding them to the existing main table and incurring the longer run time. I've yet to run DQS against a very large database, and I suspect it will be extremely time consuming. However, DQS does have great potential for handling some customer-specific data patterns that packaged software would also need tuning for – the advantage being that the DQS user will be more self-sufficient when dealing with the customer-specific patterns, and can then use the packaged software with less tuning of its out-of-the-box configuration.

If you are interested in expanding DQS capabilities with an integrated Data Quality toolset, you can take a look at our recent matchIT SQL Demo.

Keep your SQL Server data clean – efficiently!

Working with very large datasets (for example, when identifying duplicate records using matching software) can frequently throw up performance problems, particularly if you are running queries that return large volumes of data. However, there are some tips and tricks that you can use to ensure your SQL code works as efficiently as possible.

In this blog post, I’m going to focus on just a few of these – there are many other useful methods, so feel free to comment on this blog and suggest additional techniques that you have seen deliver benefit.

Physical Ordering of Data and Indices

Indices and the actual physical order of your data can be very important. Suppose, for example, that you are using matching software to run a one-off internal dedupe, comparing all records using several different match keys. Let's assume that one of those keys is zip or postal code and that it's significantly slower than the other keys.

If you put your data into the physical postal code/zip order, then your matching process may run significantly faster since the underlying disk I/O will be much more efficient as the disk head won’t be jumping around (assuming that you’re not using solid state drives).  If you are also verifying the address data using post office address files, then again having it pre-ordered by postal code/zip will be a big benefit.

So how would you put your data into postcode order ready for processing?

There are a couple of options:

  • Create a clustered index on the postcode/zip field – this will cause the data to be stored in postcode/zip order;
  • If the table is in use and already has a clustered index, then the first option won't be possible. However, you may still see improved overall performance if you run a “select into” query pulling out just the fields required for matching, ordered by postal code/zip. Select this data into a working table and then use that for the matching process – having added any other non-clustered indices needed (see the sketch after this list).
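A minimal sketch of both options, assuming a hypothetical dbo.Customer table with a Postcode column – note that it's the clustered index on the working table, rather than an ORDER BY on the “select into” itself, that actually guarantees the physical ordering:

-- Option 1: cluster the existing table on postcode/zip
-- (only possible if it doesn't already have a clustered index).
CREATE CLUSTERED INDEX IX_Customer_Postcode ON dbo.Customer (Postcode);

-- Option 2: copy just the fields needed for matching into a working table,
-- then cluster the working table on postcode/zip.
SELECT CustomerID, Firstname, Lastname, Address1, Postcode
INTO   dbo.MatchingWork
FROM   dbo.Customer;

CREATE CLUSTERED INDEX IX_MatchingWork_Postcode ON dbo.MatchingWork (Postcode);
-- Add any other non-clustered indices the matching process needs here.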

Avoid SELECT *

Only select the fields you need. SELECT * is potentially very inefficient when working with large databases, due to the amount of data read and memory required. If you only need a couple of fields (for rows where those fields fall in a certain range), and those fields are indexed, then selecting only those fields allows the query to be answered by scanning the index alone. If you use SELECT *, the database engine has to go back to the main table (or clustered index) for every qualifying row – which is going to be a lot slower on a large dataset.
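For example – with a hypothetical dbo.Customer table and a non-clustered index on Postcode that also includes Lastname – the first query below can be satisfied entirely from the index, while the second forces a lookup into the base table (or a full scan):

-- Can be answered from the index alone (index on Postcode INCLUDE (Lastname)):
SELECT Postcode, Lastname
FROM   dbo.Customer
WHERE  Postcode BETWEEN 'AB1 0AA' AND 'AB9 9ZZ';

-- Needs every column, so each matching row requires a lookup into the base table:
SELECT *
FROM   dbo.Customer
WHERE  Postcode BETWEEN 'AB1 0AA' AND 'AB9 9ZZ';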

Clustered Index and Non-clustered Indices

Generally when working with large tables, you should ensure that your table has a clustered index on the primary key (a clustered index ensures that the data is ordered by the index – in this case the primary key).

For the best performance, clustered indices ought to be rebuilt at regular intervals to minimise fragmentation – especially if there are a lot of transactions occurring. Note that if you drop and re-create a clustered index, the non-clustered indices are rebuilt as well – so if you have numerous indices, maintenance can be time consuming.
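A minimal sketch, using a hypothetical dbo.Customer table:

-- Cluster the table on its primary key.
ALTER TABLE dbo.Customer
    ADD CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED (CustomerID);

-- Periodically rebuild all indices on the table to remove fragmentation.
ALTER INDEX ALL ON dbo.Customer REBUILD;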

Use Appropriate Non-clustered Indices to Boost Query Efficiency

Non-clustered indices can assist with the performance of your queries – by way of example, non-clustered indices may benefit the following types of query (an example index follows the list):

  •  Columns that contain a large number of distinct values, such as a combination of last name and first name. If there are very few unique values,  most queries will not use the index because a table scan is typically more efficient.
  • Queries that do not return large result sets.
  • Columns frequently involved in the search criteria of a query (the WHERE clause) that return exact matches.
  • Decision-support-system applications for which joins and grouping are frequently required. Create multiple non-clustered indexes on columns involved in join and grouping operations, and a clustered index on any foreign key columns.
  • Covering all columns from one table in a given query. This eliminates accessing the table or clustered index altogether.
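As an illustration of the last point, here is a hypothetical covering index: the key columns support searches on last name and first name, and the INCLUDE columns mean that a query requesting only these four fields never has to touch the base table at all:

CREATE NONCLUSTERED INDEX IX_Customer_Name
ON dbo.Customer (Lastname, Firstname)
INCLUDE (Postcode, Address1);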

In terms of the best priority for creating indices, I would recommend the following (an example follows the list):

1.) fields used in the WHERE condition

2.) fields used in table JOINS

3.) fields used in the ORDER BY clause

4.) fields used in the SELECT section of the query.
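To make that concrete, here is an illustrative query against hypothetical Customer and Orders tables, annotated with where each field sits in that priority order, followed by the indices that would support it:

SELECT   c.CustomerID, c.Lastname, o.OrderDate             -- 4) SELECT fields
FROM     dbo.Customer AS c
JOIN     dbo.Orders   AS o ON o.CustomerID = c.CustomerID  -- 2) JOIN fields
WHERE    c.Postcode = 'AB1 2CD'                            -- 1) WHERE fields
ORDER BY o.OrderDate;                                      -- 3) ORDER BY fields

CREATE NONCLUSTERED INDEX IX_Customer_Postcode ON dbo.Customer (Postcode);
CREATE NONCLUSTERED INDEX IX_Orders_Customer   ON dbo.Orders (CustomerID, OrderDate);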

Also make sure that you use the tools within SQL Server to view the query plan for expensive queries, and use that information to refine your indices and improve the efficiency of the plan.
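In SQL Server Management Studio you can simply click “Include Actual Execution Plan”; alternatively, the session settings below report the I/O and CPU cost of a query, so you can compare before and after an index change (the query and table names are illustrative):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT CustomerID, Lastname
FROM   dbo.Customer
WHERE  Postcode = 'AB1 2CD';

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;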

Avoid Using Views

Views on busy transactional databases will generally perform slower, so try to avoid them. Also bear in mind that if you create indices on a view, then whenever the data in the base tables changes, the indices on both the base table and the view need updating – which creates an obvious performance hit. In general, views are most useful in data warehouse type scenarios where the main use of the data is reporting and querying rather than frequent updates.
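For context, an indexed view looks like the hypothetical sketch below – it must be schema-bound and carry a unique clustered index, and it is that index which SQL Server has to maintain on every insert, update or delete against the base table:

CREATE VIEW dbo.vCustomerOrderCounts
WITH SCHEMABINDING
AS
SELECT o.CustomerID, COUNT_BIG(*) AS order_count
FROM   dbo.Orders AS o
GROUP BY o.CustomerID;
GO

CREATE UNIQUE CLUSTERED INDEX IX_vCustomerOrderCounts
ON dbo.vCustomerOrderCounts (CustomerID);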

Make use of Stored Procedures in SQL Server

Putting your SQL into stored procedures means the code is compiled and the execution plan cached, which should lead to performance benefits. That said, you need to be aware of parameter sniffing and design your stored procedures in such a way that SQL Server doesn't cache an inefficient query execution plan. There are various techniques that can be used (sketched after this list):

  • Optimising for specific parameter values (OPTION (OPTIMIZE FOR …))
  • Recompiling on every execution (WITH RECOMPILE or OPTION (RECOMPILE))
  • Copying parameters into local variables
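Here is a hypothetical procedure showing all three techniques in one place (in practice you would pick just one); the table and column names are illustrative:

CREATE PROCEDURE dbo.GetCustomersByPostcodeArea
    @Area varchar(4)
AS
BEGIN
    -- 1) Optimise for a representative value (or UNKNOWN for "average" statistics):
    SELECT CustomerID, Lastname
    FROM   dbo.Customer
    WHERE  PostcodeArea = @Area
    OPTION (OPTIMIZE FOR (@Area UNKNOWN));

    -- 2) Force a fresh plan for this statement on every execution:
    SELECT CustomerID, Lastname
    FROM   dbo.Customer
    WHERE  PostcodeArea = @Area
    OPTION (RECOMPILE);

    -- 3) Copy the parameter into a local variable so the cached plan isn't
    --    based on the first sniffed value:
    DECLARE @LocalArea varchar(4) = @Area;
    SELECT CustomerID, Lastname
    FROM   dbo.Customer
    WHERE  PostcodeArea = @LocalArea;
END;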

For those interested, there's a more in-depth but easy-to-follow description of these techniques on the following page of SQLServerCentral.com:

http://www.sqlservercentral.com/blogs/practicalsqldba/2012/06/25/sql-server-parameter-sniffing/

Queries to Compare Two Tables or Data Sources

When using matching software to identify matches between two different data sources, you may encounter scenarios where one of the tables is small relative to another, very large, table, or where both tables are of a similar size. We have found that some techniques for comparing across the two tables run fine where neither table is too large (say under ten million records), but do not scale if one or both of the tables are much larger than that. Our eventual solution gets a little too detailed to describe effectively here, but feel free to contact us for information about how we solved it in our matchIT SQL application.

And Finally

I'd also recommend keeping an eye on the disks housing your SQL Server database files: ensure that there's at least 30% of the storage space free and that the disks are not highly fragmented. Checking this regularly helps maintain good performance.
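Disk defragmentation itself is an operating system job, but the free-space check can be run from T-SQL. This is a minimal sketch using the sys.dm_os_volume_stats DMV (available in SQL Server 2012) to report free space on every volume hosting a database file:

SELECT DISTINCT
       vs.volume_mount_point,
       vs.total_bytes / 1048576      AS total_mb,
       vs.available_bytes / 1048576  AS free_mb,
       CAST(vs.available_bytes * 100.0 / vs.total_bytes AS decimal(5, 1)) AS free_pct
FROM   sys.master_files AS mf
CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.file_id) AS vs;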

In summary, by making the effort to optimise the performance of your data cleansing operations, you will reduce the load on your database server and allow regular use of the applications needed to keep your data clean – and, as a result, keep your users happy.