Bridging the skills gap

TechMarketView’s UKHotViews© post “Are you hiding from your skills crisis?” last week really struck a chord. Kate Hanaghan gave some interesting feedback on Microsoft’s Cloud Skills Report (which surveyed 250 mid-sized to large UK organisations), but in our experience many of the same issues apply to moving from proprietary in-house systems or legacy packaged software to industry-standard data platforms such as SQL Server.

According to Kate, “individuals themselves are not always keen to move away from the technologies they have spent years working with” and suppliers need to “convince technologists (who tend to be middle aged and highly experienced) they must re-skill in certain areas to support the business as it attempts to grow in digital areas”.

Although, as Kate says, many legacy technologies will be around for many years to come, I think that with the increasing pace of technological change, individuals are unwise to ignore opportunities to embrace new technologies. Movement to the cloud is now so rapid that cost and competitive pressures will force many organisations that are currently steadfastly “on premise” to start moving across sooner rather than later – particularly marketing services companies, where demand is elastic. Companies and individuals who try to move from 20-year-old, non-standard technology straight to the cloud will struggle, whereas companies with more modern infrastructure and techies with more modern skills will have more of an evolutionary, beaten path.

Apart from competitive pressures, there are many other sound reasons for moving from such ageing systems to industry-standard data platforms, as we wrote in “Data cleansing – inside the database or on your desktop?” One of the key reasons is that a platform like SQL Server is much more extensible – for example, in the marketing services sector, our matchIT SQL package can connect directly with MIS products upstream and document composition products downstream using shared data, so all the data is processed within SQL Server. For the company, data is more secure and both errors and turnaround time are greatly reduced. For IT staff, it means they can enhance their CVs with sought-after skills and be ready to embrace the next opportunity a rapidly changing world gives them – such as using Microsoft Azure or Apache Spark for cloud deployment.

I’ll leave the last word to Kate, who wrote to me about her post: “In some ways I just find it so hard to understand. Who wouldn’t want to future-proof their career?! I mean, we’re going to be working till we’re 80!!”

 

matchIT SQL – finalist at the DBM Awards 2016

We were thrilled to be selected as finalists for this prestigious industry award, and to join our peers, including some of the industry’s finest, at the 2016 DBM Awards. The event was hosted at an impressive and historic London venue, The Brewery.

Whilst helpIT systems offers a range of deployments for its class-leading matchIT Data Quality Solutions, it was our SQL Server-integrated solution, matchIT SQL, that was recognised as standing out from the crowd in the fiercely competitive “Innovation in database marketing-related software” category.

matchIT SQL’s shortlisting was in part due to our innovative and seamless integration with Microsoft SQL Server, and its unique ability to bring together a comprehensive array of contact data quality functions, all available and accessed natively within Microsoft SQL Server. From highly accurate and intelligent data matching, UK and international address validation, suppression screening and integrated email validation, through to Power BI reporting and a blend of local, Microsoft Azure and/or web service deployment, matchIT SQL is clearly hitting the mark.

Perhaps it is because marketing agencies and data professionals alike have realised how matchIT SQL helps them build and maintain highly accurate single customer views and comply with the new GDPR. Or that it enables the seamless preparation of targeted and accurate marketing and campaign data, as well as intelligently managing and monitoring data workflows and data feeds coming into a business.

Whatever it was that resulted in matchIT SQL being shortlisted, we are very proud of our team here at helpIT systems, who are behind this innovative technology and who have been providing class-leading software in the Contact Data Quality space for over 25 years.

On the night, the award went to Purple Agency for a bespoke solution created for the Financial Times: a system to capture and hold every nugget of customer data in support of complex goals. While matchIT SQL may not have been chosen as the winner on this occasion, we’re very proud of our selection as finalists and the recognition that helpIT systems is still punching well above its weight as a pioneer in developing innovative Database Marketing-related solutions.

Thank you to the team at the DBM Awards, and in particular to our fantastic hosts Stuart Cal, Anthony Begley and the team, who made the evening the great success it was.

Data cleansing – inside the database or on your desktop?

We’ve been gathering feedback recently from some of our service provider clients about the impact of change on their businesses. Organisational change and new demands from new customers come up as often as they always have, but what about change forced on them from outside the organisation?

Two aspects that clients frequently mentioned are the new EU General Data Protection Regulation (GDPR) and the competitive need to keep up with advances in technology. That’s why we were particularly pleased to get some great feedback from one of our clients who recently switched from a desktop data cleansing/mailing package to matchIT SQL. They wanted to switch to matchIT SQL because they already use Microsoft SQL Server for all storage and manipulation of their customer data, and saw a lot of potential for increased automation with matchIT SQL – leveraging their familiarity with SQL Server Integration Services (SSIS) and Stored Procedures to achieve this. However, the benefits they gained were far more fundamental and far-reaching than they envisaged…

Reducing the potential for errors

Having to export data from one system and import it into another (perhaps multiple times) introduces opportunities for mistakes: keeping record layouts in line, splitting and combining files and so on. By processing data in one system instead of several, the potential for these kinds of mistakes is eliminated and the corresponding time taken for QA (and sometimes repetition of steps) is greatly reduced. Automating regular jobs also had a major impact on reducing errors and reruns.

Reducing data processing time

By fully utilising the power of their hardware, automatically spreading processing over all the machine cores available, they increased throughput dramatically.  Automation also allowed them to introduce more performance tuning – set up once and then repeated easily as often as required, without any additional user effort.  All told, our client is now able to turn round large jobs involving millions of records 4 times faster than they were able to with the old desktop system.  As they are finding that the data they get from their customers tends to be less consistent and more fragmented these days, this saving on turnaround time allows them to stick to deadlines which used to be in great danger of slipping.

Protecting data by processing it within a secure database

The increasing concern these days around data security, and the increased burden of the EU GDPR regarding data accuracy and the right of consumers to get a copy of their data, can cause many a sleepless night.  Two of the big benefits of processing all data within SQL Server are the access control that it provides and the auditability.  When you are exporting data from one system to another, perhaps via a flat file, it is very difficult to enforce such strict security while the data is in flight – and any security system is only as good as its weakest link.
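As a rough illustration of the kind of controls this makes possible (a minimal sketch only – the role, schema and file path below are invented, and your own security policy will differ), SQL Server lets you both restrict who can touch the data being cleansed and record who actually did:

    -- Access control: only a dedicated role may read or change the staging schema
    CREATE ROLE DataCleansingOperators;
    GRANT SELECT, INSERT, UPDATE ON SCHEMA::Staging TO DataCleansingOperators;

    -- Auditability: capture reads and updates against the staging schema
    -- (the server audit is created in master, the specification in the user database)
    CREATE SERVER AUDIT CleansingAudit TO FILE (FILEPATH = 'D:\Audit\');
    ALTER SERVER AUDIT CleansingAudit WITH (STATE = ON);

    CREATE DATABASE AUDIT SPECIFICATION CleansingAuditSpec
        FOR SERVER AUDIT CleansingAudit
        ADD (SELECT, UPDATE ON SCHEMA::Staging BY public)
        WITH (STATE = ON);

Nothing comparable exists once the data has been exported to a flat file on someone’s desktop.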

Extensibility

Using Microsoft SQL Server together with all the off-the-shelf tools available means future enhancements can be made much more easily. Additionally, as hardware performance and technology improve further, SQL Server as a platform will keep pace with those advances.

Bottom line

Working with a single solution inside a secure, high-performance database system can have a huge impact on your business. Whether it’s SQL Server or another database system that you use, you owe it to yourself to find out what the options are for switching to a data cleansing solution that lives inside your database. Your IT staff will be happy to work with new server-based technology and will get fewer out-of-hours calls, it will be easier for you to hire new staff with relevant skills – and you too will sleep more soundly!

Ready to find out how you can start benefiting too? Arrange a free 30 Day Trial of matchIT SQL – simply click here.

 

The New Paradigm in Healthcare Data Quality

There is no higher importance in managing customer information than when making decisions on health care. While most markets are busy striving for a ‘single customer view’ to improve customer service KPIs or marketing campaign results, healthcare organizations must focus on establishing a ‘single patient view’, making sure a full patient history is attached to a single, correct contact. Unlike in traditional CRM solutions, healthcare data is inherently disparate and is managed by a wide variety of patient systems that, in addition to collecting and managing contact data, also track thousands of patient data points including electronic health records, insurance coverage, provider names, prescriptions and more. Needless to say, establishing the relationships between patients and their healthcare providers, insurers, brokers, pharmacies and the like, or even grouping families and couples together, is a significant challenge. Complicating factors include maiden/married last names, migration of individuals between family units and insurance plans, keying errors at the point of entry, and even deliberate attempts by consumers to defraud the healthcare system.

In many cases, the single patient view can be handled through unique identifiers, such as those for group health plans or for individuals within their provider network. This was an accepted practice at a recent Kaiser Permanente location I visited, where a gentleman went to the counter and reeled off his nine-digit patient number before saying “hello”. But while patient ID numbers are standard identifiers, they differ between suppliers, and patients can’t be relied on to use them as their first method of identification. This is where accuracy and access to other collected data points (i.e. SSN, DOB and current address) become critical.

While healthcare organizations have done a decent job so far of attempting to establish and utilize this ‘single patient view’, the healthcare data quality paradigm is shifting once again. For example, the Patient Protection and Affordable Care Act (PPACA) means that healthcare organizations will now have to deal with more data, from more sources, and face tougher regulations on how to manage and maintain that data. The ObamaCare Health Insurance Exchange Pool means that more Americans can potentially benefit from health insurance coverage, increasing the number with coverage by around 30 million. Through these new initiatives, consumers will also have greater choice for both coverage and services – all further distributing the data that desperately needs to be linked.

With such inherent change – how do you effectively service patients at the point-of-care? And, do you want your trained medics and patient management team to be responsible for the data quality audit before such care can even begin?

So what are the new dynamics that healthcare companies need to plan for?

  • Addition of new patients into a system without prior medical coverage or records
  • Frequent movement of consumers between healthcare plans under the choice offered by the affordable care scheme
  • Increased mobility of individuals through healthcare systems as they move between different vendors and services

This increased transactional activity means healthcare data managers must go beyond the existing efforts of linking internal data and start to look at how to share data across systems (both internal and external) and invest in technology that will facilitate this critical information exchange. Granted, this will be a significant challenge given the fact that many organizations have several proprietary systems, contract requirements and privacy concerns but oddly enough, this begins with best practices in managing contact data effectively.

Over the last year, I’ve worked with an increasing number of customers on the issue of managing the introduction of new data into healthcare databases. Just like healthcare, data quality is both preventative and curative. Curative measures include triage on existing poor-quality data and investigating the latent symptoms of unidentified relationships in the data. The preventative measures are to introduce a regimen of using DQ tools to capture new information accurately and efficiently at the point of entry, and to help identify existing customers quickly and accurately.

For healthcare customers, we’ve managed to do just this by implementing helpIT systems’ technology: matchIT SQL to deal with the back-end data matching, validation and merging, and findIT S2 to empower users to quickly and accurately identify existing patients or validate new patient details with the minimum of keystrokes. This complementary approach gives a huge return on investment, allowing clinical end-users to focus on the task at hand rather than repeatedly dealing with data issues.

Whenever there is movement in data or new sources of information, data quality issues will arise. But when it comes to healthcare data quality, I’m sure healthcare DBAs and other administrators are fully aware of the stakes at hand. Improving and streamlining data capture, plus tapping into the various technology connectors that will give physicians and service providers access to all patient data, will have a profound effect on patient care, healthcare costs, physician workloads and access to relevant treatment. Ultimately, this is the desired outcome.

I’m delighted to be engaged further on this subject so if you have more insight to share, please comment on this or drop me a line.


Click & Collect – How To Do It Successfully?

In the UK this Christmas, the most successful retailers have been those that sell online but allow collection by the shopper – in fact, these companies have represented a large proportion of the retailers that had a good festive season. One innovation has been the rise of online retailers paying convenience stores to take delivery and provide a convenient collection point for the shopper, but two of the country’s biggest retailers, John Lewis and Next, reckon that click and collect has been the key to their Christmas sales figures – and of course they both have high volume e-commerce sites as well as many bricks and mortar stores.

The article here by the Daily Telegraph explains why “click and collect” is proving so popular, especially in a holiday period. The opportunities for major retailers are obvious, especially as they search for ways to respond to the Amazon threat – but how do they encourage their customers to shop online while also promoting in-store shopping? The key is successful data-driven marketing: know your customer, incentivize them to use loyalty programs and target them with relevant offers. However, this also presents a big challenge – the disparity and inconsistency in the data that the customer provides when they shop in these different places.

In store, they may not provide any information, or they may provide name and phone number, or they may use a credit card and/or their loyalty card. Online, they’ll provide name, email address and (if the item is being delivered) credit card details and their address. If they are collecting in store, they may just provide name and email address and pay on collection – and hopefully they’ll enter their loyalty card number, if they have one. To complicate matters further, people typically have multiple phone numbers (home, office, mobile), multiple addresses (home and office, especially if they have items delivered to their office) and even multiple email addresses. Successfully matching this disparate customer data to establish a Single Customer View can be a nightmare for the marketing and IT departments. To do this, they need software that can fulfill multiple sophisticated requirements, including:

  • Effective matching of customer records without being thrown off by data that is different or missing.
  • Sophisticated fuzzy matching to allow for keying mistakes and inconsistencies between data input by sales representatives in store and in call centers, and customers online.
  • The ability to recognize data that should be ignored – for example, the in-store purchase records where the rep keyed in the address of the store because the system demanded an address and they didn’t have time to ask for the customer’s address, or the customer didn’t want to provide it.
  • Address verification using postal address files to ensure that when the customer does request delivery, the delivery address is valid – and even when they don’t request delivery, to assist the matching process by standardizing the address.
  • The ability to match records (i) in real-time, in store or on the website (ii) off-line, record by record as orders are fed through for fulfillment and (iii) as a batch process, typically overnight as data from branches is fed through. The important point to note here is that the retailer needs to be able to use the same matching engine in all three matching modes, to ensure that inconsistencies in matching results don’t compromise the effectiveness of the processing.
  • Effective grading of matches so that batch and off-line matching can be fully automated without missing lots of good matches or mismatching records. With effective grading of matching records, the business can choose to flag matches that aren’t good enough for automatic processing so they can be reviewed by users later.
  • Recognition of garbage data, particularly data collected from the web site, to avoid it entering the marketing database and compromising its effectiveness.
  • Often, multiple systems are used to handle the different types of purchase and fulfillment. The software must be able to connect to multiple databases storing customer data in different formats for the different systems.
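To make the matching requirement a little more concrete, here is a deliberately simplified, hypothetical sketch (the table and column names are invented, and a production matching engine goes far beyond this): even a crude match key built from a phonetic last name and the start of the postcode lets you group candidate duplicates for closer fuzzy comparison.

    -- Build a crude match key per contact record
    SELECT  CustomerID,
            SOUNDEX(LastName) + '|' + UPPER(REPLACE(LEFT(Postcode, 4), ' ', '')) AS MatchKey
    INTO    #Candidates
    FROM    dbo.CustomerContacts;

    -- Any key shared by more than one record marks a candidate duplicate group
    SELECT  MatchKey, COUNT(*) AS RecordsInGroup
    FROM    #Candidates
    GROUP BY MatchKey
    HAVING  COUNT(*) > 1;

A real solution layers on address standardization, weighted scoring across name, address, phone and email, and graded match decisions – but the keying-and-grouping principle is the same.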

With a wide range of data quality solutions on the market, it’s often difficult to find a company that can check all of these boxes. That’s where helpIT systems comes in. If you are a multi-channel retailer currently facing these challenges, contact helpIT systems for a Free Data Analysis and an in depth look at how you can achieve a Single Customer View.

Taking a Look at Data Quality Services in SQL Server 2012

As part of our ongoing research and development, I’ve been looking at the new Data Quality Services functionality that’s available in SQL Server 2012 and how it can be used with your data to complement your matchIT SQL experience. We previously issued an overview before the release of 2012 (“What to Expect from DQS in SQL Server 2012”).

At the heart of DQS is the knowledge base.  It is essentially the memory bank of data that enables the DQS process to make decisions when dealing with the data it is presented with.

A very broad overview of how you might use DQS for data cleansing follows. Essentially, when using the data cleansing aspect of the product, each of your columns is mapped to a user-defined domain (for example, you may set up a domain called Firstname). The knowledge base functionality then allows you to build up a set of valid/invalid rules for the Firstname domain, which can be used to process your data. Once you have created some domains, you can also build composite domains, e.g. a Fullname composite domain could be constructed from the domains Prefix, Firstname and Lastname. Now when running a DQS cleansing project with the defined knowledge base, you can map in an addressee field and split out prefixes, first names and last names.

This works well (albeit slowly) if your data is already well structured and not large volume, but name and address data is frequently not well structured and you may be working with millions of records.

Once the knowledge discovery process is complete, you can review the results and decisions made, and tweak and make amendments to the results.  A large drawback, however, is that the normalization rules are value-based rather than pattern-based.  For example, in one case the value #1 was parsed as a premise.  In this case, I actually wanted the domain to translate the ‘#1’ to a ‘1’, so I made the necessary amendment to the rule (i.e. marked it as invalid, and specified a value to correct it to).  Further examples, such as #5 and #10, were also amended in the domain to translate them to their integer versions.  After re-running the process, I found that other values in the same format that were not manually amended were not corrected; the implication here is that you’d have to create correction rules for every single numerical possibility.  To try to get round this limitation, I attempted to apply a Term-Based Relation (which is effectively a find-and-replace option in DQS); however, because my hashes have no space before the following number (e.g. #10), the hash was not picked up as a separate term to be replaced.  If anyone has found a way of achieving this, then I’d be really interested to hear about it. What’s really needed is to allow regular expressions within the Term-Based Relation functionality, but this didn’t seem to work when I tried and I can find no reference to it being permissible in the documentation.
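One pragmatic workaround, outside DQS itself, is to normalise these patterns in T-SQL before the data reaches the cleansing project. This is only a sketch – the staging table and column names are invented – but it shows the idea:

    -- Strip the leading '#' from values like '#1', '#5', '#10' before DQS sees them
    UPDATE dbo.StagingAddresses
    SET    Premise = STUFF(Premise, 1, 1, '')
    WHERE  Premise LIKE '#[0-9]%';

It is an admission of defeat as far as doing everything inside DQS goes, but pattern-style fixes are exactly what T-SQL’s LIKE wildcards handle easily.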

I think that as a data analyst or administrator, you could spend an awful lot of time tuning DQS to handle name and address cleansing, and perhaps still not be satisfied with the results – but if for example you used a dedicated contact data cleansing tool and combined that with DQS to clean other data (for example product names, order details etc.) then potentially you have a very powerful combination of products.

Having looked at the initial data cleansing that was possible in DQS, I then moved on to have a look at the inbuilt matching functionality that DQS provides.  I’ll cover this in another post soon, but let’s just take a quick look at performance: with DQS the processing performance was much slower than I expected, even when just processing a relatively small table of data.  For example, running a simple 100,000 record file through DQS matching took 7 minutes on my machine – but only 30 seconds when processing through our matchIT SQL product on the same machine.  100,000 records is fairly small for a production database, when large organizations typically have millions of customers.  DQS would handle batches of new records fairly quickly, but as it will only match within a table, it won’t be possible to match this against existing data without adding it to the existing main table and incurring the longer run time. At this point, I’ve yet to run DQS on a very large volume database and I suspect that it will be extremely time consuming. However, DQS does have great potential for handling some customer-specific data patterns that packaged software would also need tuning for – the advantage being that the DQS user will be more self-sufficient when dealing with the customer-specific patterns, and then be able to use the packaged software with less tuning required to its out-of-the-box configuration.

 

If you are interested in expanding DQS capabilities with an integrated Data Quality toolset, you can take a look at our recent matchIT SQL Demo.

Keep your SQL Server data clean – efficiently!

Working with very large datasets (for example, when identifying duplicate records using matching software) can frequently throw up performance problems if you are running queries that return large volumes of data. However, there are some tips and tricks that you can use to ensure your SQL code works as efficiently as possible.

In this blog post, I’m going to focus on just a few of these – there are many other useful methods, so feel free to comment on this blog and suggest additional techniques that you have seen deliver benefit.

Physical Ordering of Data and Indices

Indices and the actual physical order of your database can be very important. Suppose, for example, that you are using matching software to run a one-off internal dedupe, looking to compare all records with several different match keys. Let’s assume that one of those keys is zip or postal code and it’s significantly slower than the other keys.

If you put your data into the physical postal code/zip order, then your matching process may run significantly faster since the underlying disk I/O will be much more efficient as the disk head won’t be jumping around (assuming that you’re not using solid state drives).  If you are also verifying the address data using post office address files, then again having it pre-ordered by postal code/zip will be a big benefit.

So how would you put your data into postcode order ready for processing?

There are a couple of options:

  • Create a clustered index on the postcode/zip field – this will cause the data to be stored in postcode/zip order.
  • If the table is in use and already has a clustered index, then the first option won’t be possible. However, you may still see improved overall performance if you run a “select into” query pulling out the fields required for matching, and ordering the results by postal code/zip. Select this data into a working table and then use that for the matching process – having added any additional non-clustered indices needed (see the sketch below).
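As a rough sketch of the second option (the table and column names here are placeholders for your own schema):

    -- Pull only the fields needed for matching into a working table
    SELECT  CustomerID, LastName, FirstName, Address1, Town, Postcode
    INTO    dbo.Customer_MatchWork
    FROM    dbo.Customer;

    -- Clustering on postcode stores the rows in postcode/zip order on disk
    CREATE CLUSTERED INDEX IX_MatchWork_Postcode
        ON dbo.Customer_MatchWork (Postcode);

    -- Add any other non-clustered indices the matching process needs
    CREATE NONCLUSTERED INDEX IX_MatchWork_Name
        ON dbo.Customer_MatchWork (LastName, FirstName);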

Avoid SELECT *

Only select the fields you need. SELECT * is potentially very inefficient when working with large databases (due to the large amount of memory needed). If you only need to select a couple of fields of data (where those fields are in a certain range), and those fields are indexed, then selecting only those fields allows the index to be scanned and the data returned. If you use SELECT *, then the DBMS has to join the index back to the main data table to fetch the remaining columns – which is going to be a lot slower with a large dataset.
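A quick hypothetical example, assuming a non-clustered index on (Postcode, LastName) and a made-up table name:

    -- Forces the engine back to the full data rows (key lookups or a table scan)
    SELECT * FROM dbo.Customer WHERE Postcode LIKE 'CR3%';

    -- Can be answered entirely from the non-clustered index
    SELECT Postcode, LastName FROM dbo.Customer WHERE Postcode LIKE 'CR3%';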

Clustered Index and Non-clustered Indices

Generally when working with large tables, you should ensure that your table has a clustered index on the primary key (a clustered index ensures that the data is ordered by the index – in this case the primary key).

For the best performance, clustered indices ought to be rebuilt at regular intervals to minimise disk fragmentation – especially if there are a lot of transactions occurring. Note that non-clustered indices will also be rebuilt at the same time – so if you have numerous indices then it can be time-consuming.
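A minimal example of such a rebuild (the index and table names are placeholders; in practice you would check fragmentation first, for example via sys.dm_db_index_physical_stats, and schedule this as a maintenance job):

    -- Rebuild just the clustered index, or everything on the table in one go
    ALTER INDEX IX_Customer_Postcode ON dbo.Customer REBUILD;
    ALTER INDEX ALL ON dbo.Customer REBUILD;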

Use Appropriate Non-clustered Indices to Boost Query Efficiency

Non-clustered indices can assist with the performance of your queries – by way of example, non-clustered indices may benefit the following types of query:

  • Columns that contain a large number of distinct values, such as a combination of last name and first name. If there are very few unique values, most queries will not use the index because a table scan is typically more efficient.
  • Queries that do not return large result sets.
  • Columns frequently involved in the search criteria of a query (the WHERE clause) that return exact matches.
  • Decision-support-system applications for which joins and grouping are frequently required. Create multiple non-clustered indexes on columns involved in join and grouping operations, and a clustered index on any foreign key columns.
  • Covering all columns from one table in a given query. This eliminates accessing the table or clustered index altogether.
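For the last point, a covering index is the usual tool. As a hedged sketch with invented names, the INCLUDE clause lets the index answer the query without touching the base table at all:

    -- Covers the WHERE clause, the ORDER BY and every selected column
    CREATE NONCLUSTERED INDEX IX_Customer_Postcode_Covering
        ON dbo.Customer (Postcode, LastName)
        INCLUDE (FirstName, Address1);

    SELECT LastName, FirstName, Address1
    FROM   dbo.Customer
    WHERE  Postcode = 'CR3 0AT'
    ORDER BY LastName;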

In terms of the best priority for creating indices, I would recommend the following:

1.) fields used in the WHERE condition

2.) fields used in table JOINS

3.) fields used in the ORDER BY clause

4.) fields used in the SELECT section of the query.

Also make sure that you use the tools within SQL Server to view the query plan for expensive queries, and use that information to help refine your indices and boost the efficiency of the plan.
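For example, in SQL Server Management Studio you can turn on the actual execution plan and request I/O and timing statistics before running the query in question (the table below is, again, just a placeholder):

    -- Logical reads and elapsed time are reported per statement in the Messages tab
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    SELECT LastName, Postcode          -- the expensive query under investigation
    FROM   dbo.Customer
    WHERE  Postcode LIKE 'CR3%';

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;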

Avoid Using Views

Views on active databases will generally perform more slowly, so try to avoid them. Also bear in mind that if you create indices on a view and the data in the base tables changes in some way, then the indices on both the base table and the view will need updating – which creates an obvious performance hit. In general, views are useful in data warehouse type scenarios where the main usage of the data is simply reporting and querying, rather than a lot of database updates.

Make use of Stored Procedures in SQL Server

Stored procedure code is compiled and cached, which should lead to performance benefits. That said, you need to be aware of parameter sniffing and design your stored procedures in such a way that SQL Server doesn’t cache an inefficient query execution plan. There are various techniques that can be used (sketched after this list):

  • Optimising for specific parameter values
  • Recompiling on every execution
  • Copying parameters into local variables
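The following is a hedged sketch of the three techniques, using an invented procedure, table and parameter; it illustrates the query hints involved rather than a recommended design:

    -- 1. Optimise the cached plan for a specific, representative parameter value
    CREATE PROCEDURE dbo.GetCustomersByTown @Town varchar(50)
    AS
        SELECT CustomerID, LastName, Postcode
        FROM   dbo.Customer
        WHERE  Town = @Town
        OPTION (OPTIMIZE FOR (@Town = 'London'));
    GO

    -- 2. Recompile on every execution, so each call gets its own plan
    CREATE PROCEDURE dbo.GetCustomersByTown_Recompile @Town varchar(50)
    AS
        SELECT CustomerID, LastName, Postcode
        FROM   dbo.Customer
        WHERE  Town = @Town
        OPTION (RECOMPILE);
    GO

    -- 3. Copy the parameter into a local variable, so the optimiser uses average
    --    column statistics rather than the value sniffed on the first call
    CREATE PROCEDURE dbo.GetCustomersByTown_Local @Town varchar(50)
    AS
    BEGIN
        DECLARE @LocalTown varchar(50) = @Town;
        SELECT CustomerID, LastName, Postcode
        FROM   dbo.Customer
        WHERE  Town = @LocalTown;
    END;
    GO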

For those interested, there’s a more in-depth but easy-to-follow description of these techniques on the following page of SQLServerCentral.com:

http://www.sqlservercentral.com/blogs/practicalsqldba/2012/06/25/sql-server-parameter-sniffing/

Queries to compare two tables or data sources

When using matching software to identify matches between two different data sources, you may encounter scenarios where one of the tables is small relative to another, very large, table, or where both tables are of similar sizes. We have found that some techniques for comparing across the two tables run fine where both tables are not too large (say, under ten million records), but do not scale if one or both of the tables are much larger than that. Our eventual solution gets a little too detailed to describe effectively here, but feel free to contact us for information about how we solved it in our matchIT SQL application.

And Finally

Finally, I’d recommend keeping an eye on the disks housing your SQL Server database files: ensure that there’s at least 30% storage space free and that the disks are not highly fragmented; checking this regularly helps maintain performance.

In summary, by making efforts to optimise the performance of your data cleansing operations, you will reduce the load on your database server, allow regular use of the applications needed to keep your data clean – and, as a result, keep your users happy.