The doctor won’t see you now – NHS data needs a health check!

On BBC Radio 4 the other day, I heard that people who have not been to see their local GP in the last five years could face being ‘struck off’ the register and denied access until they re-register – the story is also covered in most of the national press, including The Guardian. It’s an effort to save money on NHS England’s £9bn annual expenditure on GP practices, but is it the most cost-effective and patient-friendly approach to updating NHS records?

Under the contract, an NHS supplier (Capita) will write every year to all patients who have not been in to see their local doctor or practice nurse in the last five years. This is aimed at removing those who have moved away or died – every name on the register costs the NHS on average around £136 (as at 2013/14) in payments to the GP. After Capita receives the list of names from the GP practice, they’ll send out two letters, the first within ten working days and the next within six months. If they get no reply, the person will be removed from the list. Of course, as well as those who have moved away or died, this will end up removing healthy people who have not seen the GP and don’t respond to either letter.

An investigation in 2013 by Pulse, the magazine for GPs, revealed that “over half of patients removed from practice lists in trials in some areas have been forced to re-register with their practice, with GPs often blamed for the administrative error. PCTs (Primary Care Trusts) are scrambling to hit the Government’s target of removing 2.5 million patients from practice lists, often targeting the most vulnerable patients, including those with learning disabilities, the very elderly and children.” According to Pulse, the average proportion who were forced to re-register was 9.8%.

This problem of so-called ‘ghost patients’ falsely inflating GP patient lists, and therefore practice incomes, has been an issue for NHS primary care management since at least the 1990s, and probably long before that. What has almost certainly increased over the last twenty years is the number of temporary residents (e.g. from the rest of the EU) who are very difficult to track.

A spokesperson for the BMA on the radio was quite eloquent on why the NHS scheme was badly flawed, but had no effective answer when the interviewer asked what alternatives there were – that’s what I want to examine here, an analytical approach to a typical Data Quality challenge.

First, what do we know about the current systems? There is a single UK NHS number database, against which all GP practice database registers are automatically reconciled on a regular basis, so that transfers when people move and register with a new GP are well handled. Registered deaths, people imprisoned and those enlisting in the armed forces are also regularly reconciled. Extensive efforts are made to manage common issues such as naming conventions in different cultures, misspelling, etc. but it’s not clear how effective these are.

But if the GP databases are reconciled against the national NHS number database regularly, how is it that according to the Daily Mail “latest figures from the Health and Social Care Information Centre show there are 57.6 million patients registered with a GP in England compared to a population of 55.1 million”? There will be a small proportion of this excess due to inadequacies in matching algorithms or incorrect data being provided, but given that registering a death and registering at a new GP both require provision of the NHS number, any inadequacies here aren’t likely to cause many of the excess registrations. It seems likely that the two major causes are:

  • People who have moved out of the area and not yet registered with a new practice.
  • As mentioned above, temporary residents with NHS numbers who have left the country.

To Data Quality professionals, the obvious solution for the first cause is to use specialist list cleansing software and services to identify people who are known to have moved, using readily available data from Royal Mail, Equifax and other companies. This is how many commercial organisations keep their databases up to date and it is far more targeted than writing to every “ghost patient” at their registered address and relying on them to reply. New addresses can be provided for a large proportion of movers so their letters can be addressed accordingly – if they have moved within the local area, their address should be updated rather than the patient be removed. Using the same methods, Capita can also screen for deaths against third party deceased lists, which will probably pick up more deceased names than the NHS system – simple trials will establish what proportion of patients are tracked to a new address, have moved without the new address being known, or have died.
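For illustration, here is a minimal Python sketch of that kind of suppression screening. The file layouts, the column names (surname, forename, postcode) and the exact-key matching are all assumptions made for brevity – real list-cleansing services built on Royal Mail, Equifax and similar data use far richer matching.

```python
import csv

def load_keys(path, key_fields):
    """Build a set of simple lookup keys (e.g. name + postcode) from a reference file."""
    with open(path, newline="") as f:
        return {
            tuple(row[field].strip().lower() for field in key_fields)
            for row in csv.DictReader(f)
        }

def screen_register(register_path, movers_path, deceased_path):
    """Split a GP register into movers, deceased and unchanged patients."""
    movers = load_keys(movers_path, ["surname", "forename", "postcode"])
    deceased = load_keys(deceased_path, ["surname", "forename", "postcode"])

    moved, died, unchanged = [], [], []
    with open(register_path, newline="") as f:
        for patient in csv.DictReader(f):
            key = (patient["surname"].strip().lower(),
                   patient["forename"].strip().lower(),
                   patient["postcode"].strip().lower())
            if key in deceased:
                died.append(patient)       # candidate for removal
            elif key in movers:
                moved.append(patient)      # write to the new address or update the register
            else:
                unchanged.append(patient)  # no third-party evidence of a move or death
    return moved, died, unchanged
```

Where the data supplier can provide a new address for a mover, the ‘moved’ list would carry it, so that letters can be redirected or the register simply updated.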

Next, Capita could target the other category, the potential temporary residents from abroad, by writing to adults whose NHS number was issued in the last (say) 10 years.

The remainder of the list can be further segmented, using the targeted approach that the NHS already uses for screening or immunisation requests: for example, elderly people may have gone to live with other family members or moved into a care home, and young people may be registered at university or be sharing accommodation with friends – letters and other communications can be tailored accordingly to solicit the best response.
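As a rough illustration of that segmentation, the sketch below splits the residual list into the categories mentioned above. The field names (date_of_birth as a Python date, nhs_number_issue_year) and the age bands are purely hypothetical – the real segmentation would be driven by whatever attributes the NHS actually holds.

```python
from datetime import date

def segment_remaining(patients, recent_issue_years=10):
    """Split the residual list into broad contact segments for tailored letters."""
    today = date.today()
    segments = {"possible_temporary_resident": [], "elderly": [], "young_adult": [], "other": []}
    for p in patients:
        age = (today - p["date_of_birth"]).days // 365
        recently_issued = today.year - p["nhs_number_issue_year"] <= recent_issue_years
        if age >= 18 and recently_issued:
            segments["possible_temporary_resident"].append(p)   # may have left the country
        elif age >= 80:
            segments["elderly"].append(p)       # may be with family or in a care home
        elif 18 <= age <= 25:
            segments["young_adult"].append(p)   # may be at university or in shared housing
        else:
            segments["other"].append(p)
    return segments
```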

What remains after sending targeted letters in each category above probably represents people in a demographic that should still be registered with the practice. Further trials would establish the best approach (in terms of cost and accuracy) for this group: it may be cost-effective to write to them and remove non-responders, but if that only removed a small number of people, some of them wrongly, it may not be worth mailing them at all.

The bottom line is that well-established Data Quality practices of automatic suppression and change of address, allied with smart targeting, can reduce the costs of the exercise and will make sure that the NHS doesn’t penalise healthy people simply for… being healthy!

Golden Records Need Golden Data: 7 Questions to Ask

If you’ve found yourself reading this blog then you’re no doubt already aware of the importance of maintaining data quality through processes such as data verification, suppression screening, and duplicate detection. In this post I’d like to look a bit closer at how you draw value from, and make the best use of, the results of the hard work you invest into tracking down duplicates within your data.

The great thing about fuzzy matching is that it enables us to identify groups of two or more records that pertain to the same entity but that don’t necessarily contain exactly the same information. Records in a group of fuzzy matches will normally contain similar information with slight variations from one record to the next. For example, one record may contain a full forename whilst another contains just an abbreviated version or even none at all. You will also frequently encounter fuzzy matches where incorrectly spelt or poorly input data is matched against its accurate counterpart.
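As a toy illustration of the idea (not how matchIT or any particular product implements it), the sketch below treats two name records as a fuzzy match when the surnames are very similar and the forename is similar, abbreviated to an initial, or missing:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalised similarity between two strings (1.0 = identical)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def is_fuzzy_match(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Two records match when surnames are very similar and forenames are
    similar, abbreviated to an initial, or missing altogether."""
    if similarity(rec_a["surname"], rec_b["surname"]) < threshold:
        return False
    fa = rec_a.get("forename", "").strip()
    fb = rec_b.get("forename", "").strip()
    if not fa or not fb:
        return True                      # a missing forename does not rule out a match
    if fa[0].lower() == fb[0].lower() and (len(fa) == 1 or len(fb) == 1):
        return True                      # 'J' matches 'John'
    return similarity(fa, fb) >= threshold

# 'Jon Smith' is a fuzzy match for 'John Smith'; so is 'J Smith'.
print(is_fuzzy_match({"forename": "Jon", "surname": "Smith"},
                     {"forename": "John", "surname": "Smith"}))   # True
```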

Once you’ve identified these groups of fuzzy matches, what do you do with them? Ultimately you want to end up with only unique records within your data, but there are a couple of ways that you can go about reaching that goal. One approach is to try to determine the best record in a group of matches and discard all of the records that matched against it. Other times, you may find that you can draw more value from your data by taking the most accurate, complete, and relevant information from a group of matched records and merging it together, so that you’re left with a single hybrid record containing a better set of data than was available in any of the individual records from which it was created.
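A bare-bones sketch of the second, merge-style approach might look like the following, where each field of the hybrid record is filled with the longest non-empty value found in the group. ‘Longest value wins’ is only a placeholder rule – choosing the right criteria is exactly what the questions below are about.

```python
def merge_group(records: list[dict]) -> dict:
    """Build a hybrid record from a group of fuzzy-matched records by keeping,
    for each field, the longest non-empty value (a deliberately crude rule)."""
    fields = {f for r in records for f in r}
    hybrid = {}
    for field in fields:
        candidates = [r.get(field, "") for r in records if r.get(field, "").strip()]
        hybrid[field] = max(candidates, key=len) if candidates else ""
    return hybrid

group = [
    {"forename": "J",    "surname": "Smith", "email": "j.smith@example.com", "job_title": ""},
    {"forename": "John", "surname": "Smith", "email": "",                    "job_title": "Buyer"},
]
print(merge_group(group))
# Hybrid record: forename 'John', surname 'Smith', email 'j.smith@example.com',
# job_title 'Buyer' (key order may vary).
```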

Regardless of the approach you take, you’ll need to establish some rules to use when determining the best record or best pieces of information from multiple records. Removing the wrong record or information could actually end up making your data worse so this decision warrants a bit of thought. The criteria you use for this purpose will vary from one job to the next, but the following is a list of 7 questions that target the desirable attributes you’ll want to consider when deciding what data should be retained:

  1. How current is the data?
    You’ll most likely want to keep data that was most recently acquired.
  2. How complete is the data?
    How many fields are populated, and how well are those fields populated?
  3. Is the data valid?
    Have dates been entered in the required format? Does an email address contain an at sign?
  4. Is the data accurate?
    Has it been verified (e.g. address verified against PAF)?
  5. How reliable is the data?
    Has it come from a trusted source?
  6. Is the data relevant?
    Is the data appropriate for its intended use (e.g. keep female contacts over male if compiling a list of recipients for a women’s clothing catalogue)?
  7. Is there a predetermined hierarchy?
    Do you have a business rule in place that requires that one set of data always be used over another?

When you have such a large range of competing criteria to consider, how do you apply all of these rules simultaneously? The approach we at helpIT use in our software is to allow the user to weight each item or collection of data, so they can choose what aspects are the most important in their business context. This isn’t necessarily whether an item is present or not, or how long it is, but could be whether it was an input value or derived from supplied information, or whether it has been verified by reference to an external dataset such as a Postal Address File. Once the master record has been selected, the user may also want to transfer data from records being deleted to the master record e.g. to copy a job title from a duplicate to a master record which contains fuller/better name and address information, but no job title. By creating a composite record, you ensure that no data is lost.
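To make the weighting idea concrete, here is a hypothetical sketch (not matchIT’s actual scoring): each record in a group is scored against a set of criteria, the weights reflect the business context, the highest-scoring record becomes the master, and useful data such as a missing job title is then copied across from the losing records. The field names (last_updated, address_verified_against_paf, source) are assumptions for the example.

```python
from datetime import date

# Each criterion returns a value between 0 and 1 for a record.
# 'last_updated' is assumed to be a datetime.date.
CRITERIA = {
    "recency":      lambda r: 1.0 if (date.today() - r["last_updated"]).days < 365 else 0.5,
    "completeness": lambda r: sum(1 for v in r.values() if str(v).strip()) / len(r),
    "verified":     lambda r: 1.0 if r.get("address_verified_against_paf") else 0.0,
    "source":       lambda r: 1.0 if r.get("source") == "website" else 0.7,
}

# Weights reflect which aspects matter most in a given business context.
WEIGHTS = {"recency": 2.0, "completeness": 1.5, "verified": 3.0, "source": 1.0}

def choose_master(group: list[dict]) -> dict:
    """Pick the highest-scoring record, then transfer useful data from the rest."""
    def total(rec):
        return sum(WEIGHTS[name] * fn(rec) for name, fn in CRITERIA.items())
    master = max(group, key=total)
    for rec in group:
        if rec is not master and not master.get("job_title") and rec.get("job_title"):
            master["job_title"] = rec["job_title"]   # copy a job title the master lacks
    return master
```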

Hopefully this post will have given you something to think about when deciding how to deal with the duplicates you’ve identified in your data. I’d welcome any comments or questions.

The Retail Single Customer View

One of the more interesting aspects of working for a data quality company is the challenge associated with solving real world business issues through effective data management. In order to provide an end-to-end solution, several moving parts must be taken into consideration, data quality software being just one of them.

A few months back, helpIT was given the challenge of working with a multi-channel retail organization seeking a Single Customer View to improve the effectiveness of their marketing efforts. They received customer data from several channels, including Point of Sale, Website, and Call Center. Their hope was to link every transaction over the past 5 years to a single instance of the right customer.

In a vacuum, this is a pretty straightforward job:

  1. Standardize and clean all of the data
  2. Identify the priority of transaction data sources and merge all sources together based on individual contact information. Develop a single list or table of all unique customers and assign a unique ID
  3. Take all transaction data sources and then match against the unique customers and assign the ID of the unique customer to the transaction.
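In outline, and assuming an exact match key purely for brevity (the real project relied on fuzzy matching), steps 2 and 3 might look something like this sketch:

```python
import uuid

def customer_key(rec: dict) -> tuple:
    """Simplified exact key; the real project used fuzzy matching instead."""
    return (rec["surname"].lower().strip(),
            rec["forename"].lower().strip(),
            rec["postcode"].lower().strip())

def build_golden_table(sources_in_priority_order: list[list[dict]]) -> dict:
    """Step 2: merge sources in priority order into a table of unique customers."""
    golden = {}                                   # key -> golden record with an ID
    for source in sources_in_priority_order:
        for rec in source:
            key = customer_key(rec)
            if key not in golden:
                golden[key] = {**rec, "customer_id": str(uuid.uuid4())}
    return golden

def link_transactions(transactions: list[dict], golden: dict) -> list[dict]:
    """Step 3: stamp each transaction with the unique customer ID (None if unmatched)."""
    for txn in transactions:
        match = golden.get(customer_key(txn))
        txn["customer_id"] = match["customer_id"] if match else None
    return transactions
```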

With live business data it’s rarely if ever that simple. Here are some of the issues we had to circumvent:

  • Very high rate of repeat customers across multiple channels
  • Different data governance rules and standards at all points of capture
    • Point of Sale – Only name and email were required; when no address was provided, the store address was used. “Not Provided” was also accepted, causing other incomplete data
    • Website – Entered as preferred by the customer
    • Call Center – Typed into a CRM application, with no customer lookup process
  • No “My Account” section on the website, which meant that all orders were treated as new customers
  • Archaic Point-of-Sale application that was too expensive to replace for this project
  • Newly created SQL Server environment that acts as a central data repository but had no starting point for unique customers

To come up with a solution that would enable the customer to develop and then maintain a Single Customer View, we proposed the following workflow, which could be used for both purposes.

Website

This was immediately identified as the best source of information, because customers enter it themselves and have a genuine desire to receive their delivery or to be contacted if there is an issue with their order. The process started with an initial batch clean-up of all historical web orders as follows:

  1. Run all orders through Address Validation and National Change of Address (NCOA) to get the most up to date information on the contacts
  2. Standardize all data points using the matchIT SQL casing and parsing engine
  3. Perform contact-level deduplication with matchIT SQL, using a combination of exact and fuzzy matching routines and recording confidence scores for all matches.
  4. Our client identified the confidence thresholds that separated confident matches to commit automatically, matches that required manual review, and unique records (a simple sketch of this routing follows the list). They completed the manual review and incorporated some further logic based on these matches to prevent future reviews of the same issues. The final step in their process was to set a score threshold for committing transactions from other sources to the customer record.
  5. The deduplication process was finalized and a new SQL Server table was created with standardized, accurate, and unique customer data that would act as the “golden record”.
  6. All transactions from the Web history file were updated with a new column containing the unique customer ID from the “golden record” table.
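A simple sketch of the score-based routing described in step 4 – the thresholds shown are illustrative, not the ones the client actually chose:

```python
# Route match pairs by confidence score (illustrative thresholds).
AUTO_COMMIT = 90     # at or above: commit the match automatically
REVIEW      = 75     # between REVIEW and AUTO_COMMIT: queue for manual review

def route_matches(scored_pairs):
    """scored_pairs: iterable of (record_a, record_b, score) from the dedupe run."""
    commit, review, unique = [], [], []
    for a, b, score in scored_pairs:
        if score >= AUTO_COMMIT:
            commit.append((a, b))
        elif score >= REVIEW:
            review.append((a, b))
        else:
            unique.append((a, b))      # treated as distinct customers
    return commit, review, unique
```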

Call Center

This was the next area to tackle and was essentially a repetition of the process used on the Website data, except that the final step was to match the cleaned Call Center data against the existing “golden record” table.

After performing the overlap matching, any unique customers from the Call Center were added to the “golden record” table and assigned an ID. All of the overlapping Call Center customers received the matching ID from the “golden record” table, which was then appended to the related Call Center transactions.
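Continuing the earlier sketch (the same golden table and customer_key function, and again using an exact key only for brevity), the Call Center step boils down to an ‘absorb and stamp’ operation:

```python
import uuid

def absorb_source(golden: dict, customers: list, transactions: list, key_fn) -> dict:
    """Add customers that are new to the golden table, then stamp this source's
    transactions with the golden customer_id. Field names are illustrative."""
    for cust in customers:
        key = key_fn(cust)
        if key not in golden:                       # unique to this source
            golden[key] = {**cust, "customer_id": str(uuid.uuid4())}

    for txn in transactions:
        match = golden.get(key_fn(txn))
        txn["customer_id"] = match["customer_id"] if match else None
    return golden

# e.g. absorb_source(golden, call_center_customers, call_center_transactions, customer_key)
```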

Store

This was the tricky part!

Some of the data contained full and accurate customer information, but nearly 30% of the customer transaction data captured at the store level contained the store address information.

So how did we overcome this?

  1. We created a suppression table that contained only their store addresses
  2. All store transactions with the minimum information for capture (at least a name and address) were standardized and then matched against the store suppression file, yielding a list of matches (customers with store information as their address) and non-matches (customers who provided their own address information)
  3. For the customers who provided their own address, the process then followed the same procedure run on the Call Center data
  4. For the customers with store information, we had to use a different set of matching logic that ignored the address and instead looked at other data elements such as name, email, phone, credit card number and date of birth. Several matchkeys were required because of the inconsistency in which elements would produce a match from one record to the next (see the sketch after this list).
  5. For the remaining portion of customers in the Store file (3%), the client decided to put those customers in a hold table until another piece of information surfaced that would allow the transaction to be linked to a later one.
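The sketch below illustrates the multiple-matchkey idea from step 4: each key ignores the address entirely and only fires when every element it needs is populated on the store record. The key combinations and field names are illustrative, not the ones actually deployed.

```python
# Alternative matchkeys for store records that carry the store's own address.
# Each key ignores the address and relies on other identifying elements.
MATCHKEYS = [
    ("surname", "forename", "email"),
    ("surname", "forename", "phone"),
    ("surname", "forename", "date_of_birth"),
    ("surname", "card_number_hash"),          # hashed card number, illustrative
]

def find_golden_match(store_rec: dict, golden_records: list):
    """Return the first golden record that agrees with the store record on
    every field of any matchkey that is fully populated on the store record."""
    for key in MATCHKEYS:
        values = {f: str(store_rec.get(f, "")).strip().lower() for f in key}
        if not all(values.values()):
            continue                          # this key cannot be built from the store record
        for golden in golden_records:
            if all(str(golden.get(f, "")).strip().lower() == values[f] for f in key):
                return golden
    return None                               # no match: candidate for the hold table
```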

[Workflow diagram of the store process]

The key to the whole strategy was to identify a way to isolate records with alternative matching requirements on an automated basis. Once we separated the records with store addresses we were free to develop the ideal logic for each scenario, providing an end to end solution for a very complicated but frequently occurring data issue.

If you are attempting to establish the ever-elusive single customer view, remember that there are several moving parts to the process that go well beyond the implementation of data quality software. The solution may well require a brand new workflow to reach the desired objective.

 

For more information about establishing a Single Customer View or to sign up for a Free CRM Data Analysis, click here.