Working with very large datasets (for example, when identifying duplicate records using matching software) can frequently cause performance problems, particularly when you run queries that return large volumes of data. However, there are some tips and tricks you can use to ensure your SQL code works as efficiently as possible.
In this blog post, I’m going to focus on just a few of these – there are many other useful methods, so feel free to comment on this blog and suggest additional techniques that you have seen deliver benefit.
Physical Ordering of Data and Indices
Indices and the actual physical order of the data in your database can be very important. Suppose, for example, that you are using matching software to run a one-off internal dedupe, looking to compare all records using several different match keys. Let's assume that one of those keys is zip or postal code, and that matching on it is significantly slower than on the other keys.
If you put your data into the physical postal code/zip order, then your matching process may run significantly faster since the underlying disk I/O will be much more efficient as the disk head won’t be jumping around (assuming that you’re not using solid state drives). If you are also verifying the address data using post office address files, then again having it pre-ordered by postal code/zip will be a big benefit.
So how would you put your data into postcode order ready for processing?
There are a couple of options:
- Create a clustered index on the postcode/zip field – this will cause the data to be stored in postcode/zip order.
- If the table is in use and already has a clustered index, then the first option won't be possible. However, you may still see improved overall performance if you run a "select into" query, pulling out only the fields required for matching and ordering the results by postal code/zip. Select this data into a working table and use that for the matching process, adding any additional non-clustered indices needed.
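Both options above can be sketched in T-SQL as follows. The table and column names (dbo.Customers, PostCode and so on) are purely illustrative; note also that the only reliable way to guarantee the physical order of the working table is the clustered index itself, so it's safest to create one on the working table too:

```sql
-- Option 1: a clustered index stores the rows in postcode order
CREATE CLUSTERED INDEX IX_Customers_PostCode
    ON dbo.Customers (PostCode);

-- Option 2: copy just the fields needed for matching into a working table...
SELECT CustomerID, LastName, FirstName, Address1, PostCode
INTO   dbo.Customers_Matching
FROM   dbo.Customers;

-- ...then cluster the working table on postcode to fix the physical order,
-- and add any other non-clustered indices the matching process needs
CREATE CLUSTERED INDEX IX_Matching_PostCode
    ON dbo.Customers_Matching (PostCode);
CREATE NONCLUSTERED INDEX IX_Matching_Name
    ON dbo.Customers_Matching (LastName, FirstName);
```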
Avoid SELECT *
Only select the fields you need. SELECT * is potentially very inefficient when working with large databases, due to the large amount of memory needed. If you only need a couple of fields (say, where those fields fall within a certain range), and those fields are indexed, then selecting only those fields allows the query to be answered by scanning the index alone. If you use SELECT *, the DBMS must also look up every matching row in the main data table, which is going to be a lot slower on a large dataset.
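As a quick sketch (table and index names are illustrative, assuming a non-clustered index on OrderDate):

```sql
-- Narrow select: can be satisfied from the OrderDate index alone
SELECT OrderID, OrderDate
FROM   dbo.Orders
WHERE  OrderDate >= '2021-01-01' AND OrderDate < '2021-02-01';

-- SELECT *: every matching row must also be fetched from the base table
SELECT *
FROM   dbo.Orders
WHERE  OrderDate >= '2021-01-01' AND OrderDate < '2021-02-01';
```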
Clustered Index and Non-clustered Indices
Generally when working with large tables, you should ensure that your table has a clustered index on the primary key (a clustered index ensures that the data is ordered by the index – in this case the primary key).
For the best performance, clustered indices ought to be rebuilt at regular intervals to minimise disk fragmentation – especially if there are a lot of transactions occurring. Note that non-clustered indices will also be rebuilt at the same time, so if you have numerous indices the rebuild can be time-consuming.
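A rebuild is a one-liner in T-SQL (table and index names illustrative; run it during a quiet period, as rebuilding all indices on a large table is expensive):

```sql
-- Rebuild every index on the table, clustered and non-clustered alike
ALTER INDEX ALL ON dbo.Customers REBUILD;

-- Or rebuild a single index
ALTER INDEX IX_Customers_PostCode ON dbo.Customers REBUILD;
```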
Use Appropriate Non-clustered Indices to Boost Query Efficiency
Non-clustered indices can assist with the performance of your queries – by way of example, non-clustered indices may benefit the following types of query:
- Columns that contain a large number of distinct values, such as a combination of last name and first name. If there are very few unique values, most queries will not use the index because a table scan is typically more efficient.
- Queries that do not return large result sets.
- Columns frequently involved in the search criteria of a query (the WHERE clause) that return exact matches.
- Decision-support-system applications for which joins and grouping are frequently required. Create multiple non-clustered indexes on columns involved in join and grouping operations, and a clustered index on any foreign key columns.
- Covering all columns from one table in a given query. This eliminates accessing the table or clustered index altogether.
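The last point above, a covering index, can be sketched like this (names are illustrative). The INCLUDE clause adds the extra columns at the leaf level of the index only, so the query never has to touch the base table:

```sql
-- Covers a query that filters on LastName and returns FirstName and PostCode
CREATE NONCLUSTERED INDEX IX_Customers_LastName_Covering
    ON dbo.Customers (LastName)
    INCLUDE (FirstName, PostCode);

-- This query can now be answered entirely from the index
SELECT LastName, FirstName, PostCode
FROM   dbo.Customers
WHERE  LastName = 'Smith';
```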
In terms of the best priority for creating indices, I would recommend the following:
1.) fields used in the WHERE condition
2.) fields used in table JOINS
3.) fields used in the ORDER BY clause
4.) fields used in the SELECT section of the query.
Also make sure that you use the tools within SQL Server to view the query plan for expensive queries and use that information to help refine your indices to boost the efficiency of the query plan.
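For example, in SQL Server Management Studio you can enable "Include Actual Execution Plan" (Ctrl+M) before running a query, or turn on I/O and timing statistics to see exactly how much work a query does (the query below uses illustrative table names):

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT c.CustomerID, c.LastName
FROM   dbo.Customers AS c
JOIN   dbo.Orders    AS o ON o.CustomerID = c.CustomerID
WHERE  c.PostCode = 'AB1 2CD'
ORDER  BY c.LastName;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

The logical reads reported in the Messages tab are a good measure of whether a new index has actually reduced the work the query does.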
Avoid Using Views
Views on active databases will perform slower in general, so try to avoid views. Also bear in mind that if you create indices on the view, and the data in the base tables change in some way, then the indices on both the base table and view will need updating – which creates an obvious performance hit. In general, views are useful in data warehouse type scenarios where the main usage of the data is simply reporting and querying, rather than a lot of database updates.
Make use of Stored Procedures in SQL Server
When you use a stored procedure, the code is compiled and cached by SQL Server, which should lead to performance benefits. That said, you need to be aware of parameter sniffing, and design your stored procedures in such a way that SQL Server doesn't cache an inefficient query execution plan. There are various techniques that can be used:
- Optimising for specific parameter values
- Recompiling on every execution
- Copying parameters into local variables
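The three techniques above can be sketched in a single illustrative procedure (you would normally pick just one; all names are hypothetical):

```sql
CREATE PROCEDURE dbo.GetCustomersByPostCode
    @PostCode varchar(10)
AS
BEGIN
    -- 1) Optimise the cached plan for a representative value
    --    rather than whatever value was sniffed on first execution
    SELECT CustomerID, LastName
    FROM   dbo.Customers
    WHERE  PostCode = @PostCode
    OPTION (OPTIMIZE FOR (@PostCode = 'AB1 2CD'));

    -- 2) Alternatively, force a fresh plan on every execution by appending
    --    OPTION (RECOMPILE) to the statement

    -- 3) Or copy the parameter into a local variable, so the optimiser
    --    uses average column statistics instead of the sniffed value:
    -- DECLARE @LocalPostCode varchar(10) = @PostCode;
    -- SELECT ... WHERE PostCode = @LocalPostCode;
END;
```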
For those interested, there’s a more in-depth but easy-to-follow description of these techniques on SQLServerCentral.com.
Queries to compare two tables or data sources
When using matching software to identify matches between two different data sources, you may encounter scenarios where one of the tables is small relative to another, very large, table, or where both tables are of similar sizes. We have found that some techniques for comparing across the two tables run fine where both tables are not too large (say under ten million records), but do not scale if one or both of the tables are much larger than that. Our eventual solution gets a little too detailed to describe effectively here, but feel free to contact us for information about how we solved it in our matchIT SQL application.
Finally, I’d recommend keeping an eye on the disks housing your SQL Server database files: ensure that at least 30% of the storage space is free and that the disks are not heavily fragmented. Checking this regularly produces better performance.
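Index fragmentation inside the database can be checked with the sys.dm_db_index_physical_stats dynamic management view; the sketch below lists indices in the current database that are more than 30% fragmented (the threshold is a common rule of thumb, not a hard rule):

```sql
SELECT OBJECT_NAME(ips.object_id)        AS table_name,
       i.name                            AS index_name,
       ips.avg_fragmentation_in_percent
FROM   sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN   sys.indexes AS i
       ON  i.object_id = ips.object_id
       AND i.index_id  = ips.index_id
WHERE  ips.avg_fragmentation_in_percent > 30;
```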
In summary, by making the effort to optimise the performance of your data cleaning operations, you will reduce the load on your database server and allow regular use of the applications needed to keep your data clean – and, as a result, keep your users happy.