The Cost of Addressing Data

In a recent article in Database Marketing, James Lawson discussed the ongoing debate over whether address data should be free. Back in November 2012, the Open Data User Group (ODUG) presented a paper proposing an Open National Address Dataset (ONAD), built from three sources: Royal Mail’s Postcode Address File (PAF), GeoPlace’s National Address Gazetteer (NAG) and OS’s AddressBase Plus. All three files would be made available under the Open Government Licence as free data, just like OS Code-Point Open.

As we all know, Royal Mail’s PAF database is the most widely adopted reference file for UK addresses. It has been since its inception and I believe it will remain so for many years to come, free or not. In more recent years we’ve had various rumours and promises of alternative and complementary solutions, from the LLPG and NLPG, through Address Layer 2, to the more recent NAG and subsequent AddressBase offerings, but they’ve all had their challenges. As with any accurate, usable database, it takes a considerable amount of time and effort to compile, and somebody has to take that burden on.

I agree with the current arrangement that small, low-volume usage (as delivered by Royal Mail’s postcode finder) should be free, but the argument about savings and the implied benefits for public bodies is still being debated for the recently released AddressBase solution. The question of industry benefit really comes from the high-volume corporate users and the wider DM industry. The impact on the pre-sortation requirements that Royal Mail enforces on its direct business clients, as well as the rules it imposes on wholesale Downstream Access partners, would no doubt be part of any discussion. Whilst it would be the Holy Grail to ensure that every address entering Royal Mail’s network was perfectly addressed, I think any serious conversation on the topic will be deferred while a potential sale of the Royal Mail is in the wings.

Click & Collect – How To Do It Successfully?

In the UK this Christmas, the most successful retailers have been those that sell online but let the shopper collect in store – indeed, these companies account for a large proportion of the retailers that had a good festive season. One innovation has been the rise of online retailers paying convenience stores to take delivery and provide a handy collection point for the shopper, but two of the country’s biggest retailers, John Lewis and Next, reckon that click and collect has been the key to their Christmas sales figures – and of course they both have high-volume e-commerce sites as well as many bricks-and-mortar stores.

The article here by the Daily Telegraph explains why “click and collect” is proving so popular, especially in a holiday period. The opportunities for major retailers are obvious, especially as they search for ways to respond to the Amazon threat – but how do they encourage their customers to shop online while also promoting in-store shopping? The key is successful data-driven marketing: know your customer, incentivize them to use loyalty programs and target them with relevant offers. However, this also presents a big challenge – the disparity and inconsistency in the data that the customer provides when they shop in these different places.

In store, they may not provide any information, or they may provide name and phone number, or they may use a credit card and/or their loyalty card. Online they’ll provide name, email address and (if the item is being delivered), credit card details and their address. If they are collecting in store, they may just provide name and email address and pay on collection – and hopefully they’ll enter their loyalty card number, if they have one. To complicate matters further, people typically have multiple phone numbers (home, office, mobile), multiple addresses (home and office, especially if they have items delivered to their office) and even multiple email addresses. This can be a nightmare for the marketing and IT departments in successfully matching this disparate customer data in order to establish a Single Customer View. To do this, they need software that can fulfill multiple sophisticated requirements, including:

  • Effective matching of customer records without being thrown off by data that is different or missing.
  • Sophisticated fuzzy matching to allow for keying mistakes and inconsistencies between data input by sales representatives in store or in call centers and data entered by customers online (a simple illustration of the idea follows this list).
  • The ability to recognize data that should be ignored – for example, the in-store purchase records where the rep keyed in the address of the store because the system demanded an address and they didn’t have time to ask for the customer’s address, or the customer didn’t want to provide it.
  • Address verification using postal address files to ensure that when the customer does request delivery, the delivery address is valid – and even when they don’t request delivery, to assist the matching process by standardizing the address.
  • The ability to match records (i) in real-time, in store or on the website, (ii) off-line, record by record as orders are fed through for fulfillment, and (iii) as a batch process, typically overnight as data from branches is fed through. The important point to note here is that the retailer needs to be able to use the same matching engine in all three modes, to ensure that inconsistencies in matching results don’t compromise the effectiveness of the processing.
  • Effective grading of matches so that batch and off-line matching can be fully automated without missing lots of good matches or mismatching records. With effective grading of matching records, the business can choose to flag matches that aren’t good enough for automatic processing so they can be reviewed by users later.
  • Recognition of garbage data, particularly data collected from the web site, to avoid it entering the marketing database and compromising its effectiveness.
  • Often, multiple systems are used to handle the different types of purchase and fulfillment, so the software must be able to connect to multiple databases that store customer data in different formats for the different systems.
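
To make the fuzzy matching requirement a little more concrete, here is a deliberately simple sketch in T-SQL. The table and column names (Customers, CustomerID, LastName, Postcode) are assumptions for illustration, and SOUNDEX is only a crude stand-in for the much richer phonetic and fuzzy techniques a real matching engine uses – but it shows how candidate duplicate pairs can be surfaced even when surname spellings differ.

-- Hypothetical example: pair up records that share a postcode and a
-- similar-sounding surname, so they can be graded and reviewed as
-- potential duplicates.
SELECT a.CustomerID AS CustomerA,
       b.CustomerID AS CustomerB,
       a.LastName   AS LastNameA,
       b.LastName   AS LastNameB,
       a.Postcode
FROM   dbo.Customers AS a
JOIN   dbo.Customers AS b
  ON   a.Postcode = b.Postcode
 AND   SOUNDEX(a.LastName) = SOUNDEX(b.LastName)
 AND   a.CustomerID < b.CustomerID;   -- avoid self-pairs and mirror-image pairs

In practice, each candidate pair would then be graded against the other data points mentioned above (address, email, loyalty card number) before being accepted as a match.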

With a wide range of data quality solutions on the market, it’s often difficult to find a company that can check all of these boxes. That’s where helpIT systems comes in. If you are a multi-channel retailer currently facing these challenges, contact helpIT systems for a Free Data Analysis and an in-depth look at how you can achieve a Single Customer View.

When Charitable Donations Fall – Who’s to Blame?

I was listening to a program on BBC Radio 4 yesterday morning (You and Yours) about the difficulties that charities are facing in these straitened times:

“Christmas is the season for giving and is often the big year-end push for many charities. But according to a report compiled by the Charities Aid Foundation and the National Council for Voluntary Organisations charitable donations have fallen by 20% in real terms in the past year, with £1.7bn less being given.”

There was a lot of interesting feedback from the expert contributors:

  • Sarah Miller, head of public affairs at the Charity Commission, commented that “The top complaint made to the FRSB (the Fundraising Standards Board) … is to do with the use of data, where people are perhaps being sent mailings that they don’t wish to receive or perhaps incorrect information is being used on mailings or they want to know where the data has come from or perhaps a mailing is going to a deceased family member and they’ve asked for it to stop and perhaps the charity still hasn’t made that change – so that’s the top complaint by far”.
  • John Low, Chief Executive of the Charities Aid Foundation, stressed that “You must be as efficient in the way you run a charity as any business, and maybe more efficient, because it’s precious public money that you have and you have very serious responsibilities to your beneficiaries.”

It is certainly true that as a charity or any form of non-profit organization, you have far less margin for error when mailing your donors than a commercial organization. If I get duplicate mail from a retailer that I shop at, or incorrectly addressed mail that obviously hasn’t been able to obtain postal discounts even if it was actually delivered, it might make me wonder whether their prices have to be inflated to allow for such inefficiencies – but I’ll still do the price comparison when next shopping. When I get duplicate or incorrectly addressed mail from a charity that I give to, I get upset that they’re wasting my donation. Even more so given that I know there are money-saving solutions (ranging from desktop software to services and hybrid solutions) for ensuring that mail is not duplicated and is correctly addressed. Moreover, many mailers upset next of kin by mailing to the deceased, or simply waste large amounts of money by mailing to people who have moved.

Based on the feedback received by the FRSB, some charities have a pressing need to implement effective solutions for eliminating wastage in their direct mail:

  • Gone Away suppression will more than pay for itself by reducing print and post costs.
  • NCOA (National Change of Address) and other services will allow charities to mail donors at their new address.
  • Deceased and duplicate suppression will avoid the damage to the donor relationship that otherwise will inevitably occur.

Sarah Miller also told listeners:

“If there are ways that charities are interacting with you that you don’t like, do tell them. Tell them how you want to interact with them.”

I remember that about 15 years ago, one of our customers working for Botton Village (a centre for adults with learning disabilities and other special needs in North Yorkshire in the UK) won a direct marketing award simply because they asked their donors how often they would like to be contacted and at what time(s) of year. This led to a significant increase in donations. These days, of course, it is far less expensive to contact people by email, but some donors may prefer at least some communication by mail, or not want email contact at all. Consolidating and matching donor information when donors may give via the web or by post is obviously important – for example, so you can make sure that you claim Gift Aid for eligible donors, or avoid sending a scheduled communication if they’ve just donated.

Chris Mould, Executive Chairman of the Trussell Trust, the charity behind the UK Foodbank Network, talked about how a front-line food bank in the Network can get a website at minimal cost with online data collection: “It doesn’t have to reinvent the wheel”. This chimed with John Low’s recommendation that charities can become more efficient by cooperating on their resource requirements.

One last and very important point: all the experts on the program agreed that fundraising campaigns really do work – regular communication with your donors is important to show where the money is going, but efficiency is even more important.

If you are a charity struggling to get to grips with your data quality challenges, or if you’ve noticed a major drop in donations and want to know whether data quality is the cause, email us for a Free Data Quality Audit and we’ll highlight the issues that could be putting your initiatives at risk.

Keep your SQL Server data clean – efficiently!

Working with very large datasets (for example, when identifying duplicate records using matching software) can frequently throw up performance problems if you are running queries that return large volumes of data. However, there are some tips and tricks that you can use to ensure your SQL code works as efficiently as possible.

In this blog post, I’m going to focus on just a few of these – there are many other useful methods, so feel free to comment on this blog and suggest additional techniques that you have seen deliver benefit.

Physical Ordering of Data and Indices

Indices and the actual physical order of your data can be very important. Suppose, for example, that you are using matching software to run a one-off internal dedupe, comparing all records on several different match keys. Let’s assume that one of those keys is zip or postal code and that it’s significantly slower than the other keys.

If you put your data into physical postal code/zip order, then your matching process may run significantly faster, since the underlying disk I/O will be much more efficient – the disk head won’t be jumping around (assuming that you’re not using solid state drives). If you are also verifying the address data against post office address files, then having it pre-ordered by postal code/zip will again be a big benefit.

So how would you put your data into postcode order ready for processing?

There are a couple of options:

  • Create a clustered index on the postcode/zip field – this will cause the data to be stored in postcode/zip order.
  • If the table is in use and already has a clustered index, then the first option won’t be possible. However, you may still see improved overall performance if you run a “select into” query that pulls just the fields required for matching into a working table, ordered (or clustered) by postal code/zip, and then use that working table for the matching process – having added any other non-clustered indices needed. Both options are sketched below.
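
Here is a minimal sketch of both options in T-SQL. The table and column names (CustomerData, WorkingMatchData, CustomerID, LastName, Address1, Postcode) are assumed for illustration, and the working table is clustered on postcode because that guarantees its physical order:

-- Option 1: cluster the source table on the postcode field so rows are
-- physically stored in postcode order (only possible if the table does
-- not already have a clustered index).
CREATE CLUSTERED INDEX IX_CustomerData_Postcode
    ON dbo.CustomerData (Postcode);

-- Option 2: copy just the fields needed for matching into a working table,
-- then cluster the working table on postcode.
SELECT CustomerID, LastName, Address1, Postcode
INTO   dbo.WorkingMatchData
FROM   dbo.CustomerData;

CREATE CLUSTERED INDEX IX_WorkingMatchData_Postcode
    ON dbo.WorkingMatchData (Postcode);
-- Add any further non-clustered indices the matching process needs here.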

Avoid SELECT *

Only select the fields you need. SELECT * is potentially very inefficient when working with large databases, due to the volume of data that must be read and held in memory. If you only need to select a couple of fields (where those fields are in a certain range), and those fields are indexed, then selecting only those fields allows the query to be answered by scanning the index alone. If you use SELECT *, then the DBMS has to join the index with the main data table to fetch the remaining columns – which is going to be a lot slower with a large dataset.
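
A quick illustration, assuming a Customers table that has a non-clustered index on Postcode which also includes LastName (all names are hypothetical):

-- Covered query: can be satisfied from the index alone.
SELECT LastName, Postcode
FROM   dbo.Customers
WHERE  Postcode LIKE 'AB1 %';

-- Avoid: every matching row also requires a lookup into the full data row.
SELECT *
FROM   dbo.Customers
WHERE  Postcode LIKE 'AB1 %';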

Clustered Index and Non-clustered Indices

Generally when working with large tables, you should ensure that your table has a clustered index on the primary key (a clustered index ensures that the data is ordered by the index – in this case the primary key).

For the best performance, clustered indices ought to be rebuilt at regular intervals to minimise fragmentation – especially if there are a lot of transactions occurring. Note that if you rebuild all of the indices on a table, the non-clustered indices are rebuilt too – so if you have numerous indices it can be time-consuming.
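
For example, the rebuild can be scheduled as a simple maintenance job along these lines (the table and index names are assumed, and the fill factor is just an illustrative choice):

-- Rebuild every index on the table, leaving some free space in each page
-- to slow down future fragmentation.
ALTER INDEX ALL ON dbo.Customers
REBUILD WITH (FILLFACTOR = 90);

-- Or rebuild a single index:
-- ALTER INDEX PK_Customers ON dbo.Customers REBUILD;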

Use Appropriate Non-clustered Indices to Boost Query Efficiency

Non-clustered indices can assist with the performance of your queries – by way of example, they may benefit the following types of query:

  • Columns that contain a large number of distinct values, such as a combination of last name and first name. If there are very few unique values, most queries will not use the index because a table scan is typically more efficient.
  • Queries that do not return large result sets.
  • Columns frequently involved in the search criteria of a query (the WHERE clause) that return exact matches.
  • Decision-support-system applications for which joins and grouping are frequently required. Create multiple non-clustered indexes on columns involved in join and grouping operations, and a clustered index on any foreign key columns.
  • Covering all columns from one table in a given query. This eliminates accessing the table or clustered index altogether.

In terms of priority for creating indices, I would recommend the following order:

1.) fields used in the WHERE condition

2.) fields used in table JOINS

3.) fields used in the ORDER BY clause

4.) fields used in the SELECT section of the query.

Also make sure that you use the tools within SQL Server to view the execution plan for expensive queries, and let that information guide you in refining your indices to boost the efficiency of the plan.
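
As a hypothetical example (table and column names assumed), a query that filters on Postcode and returns name fields could be supported by an index that follows the priority order above, with the selected columns added as included columns so that the index covers the query:

CREATE NONCLUSTERED INDEX IX_Customers_Postcode
    ON dbo.Customers (Postcode)        -- WHERE / JOIN column leads the key
    INCLUDE (LastName, FirstName);     -- columns the SELECT needs

-- Check the I/O before and after to confirm the query now uses the index:
SET STATISTICS IO ON;
SELECT LastName, FirstName
FROM   dbo.Customers
WHERE  Postcode = 'AB1 2CD';
SET STATISTICS IO OFF;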

Avoid Using Views

Views on active databases will generally perform more slowly, so try to avoid them. Also bear in mind that if you create indices on a view and the data in the base tables changes, then the indices on both the base table and the view will need updating – which creates an obvious performance hit. In general, views are most useful in data warehouse scenarios, where the main usage of the data is reporting and querying rather than frequent updates.

Make use of Stored Procedures in SQL Server

Wrapping your queries in stored procedures means the code is compiled and its execution plan cached, which should lead to performance benefits. That said, you need to be aware of parameter sniffing and design your stored procedures in such a way that SQL Server doesn’t cache an inefficient execution plan. There are various techniques that can be used:

  • Optimising for specific parameter values
  • Recompiling on every execution
  • Copying parameters into local variables

For those interested, there’s a more in-depth but easy-to-follow description of these techniques on the following page of SQLServerCentral.com:

http://www.sqlservercentral.com/blogs/practicalsqldba/2012/06/25/sql-server-parameter-sniffing/
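
As a minimal sketch of the third technique – copying parameters into local variables – consider something along these lines (the procedure, table and column names are assumptions for illustration):

CREATE PROCEDURE dbo.usp_GetCustomersByPostcodeArea
    @PostcodeArea varchar(4)
AS
BEGIN
    -- Copying the parameter into a local variable stops SQL Server from
    -- "sniffing" the first value passed in and caching a plan tuned only to it.
    DECLARE @Area varchar(4) = @PostcodeArea;

    SELECT CustomerID, LastName, Postcode
    FROM   dbo.Customers
    WHERE  Postcode LIKE @Area + '%';

    -- Alternatives: add OPTION (RECOMPILE) to the query, or
    -- OPTION (OPTIMIZE FOR (@PostcodeArea = 'AB1')) to target a typical value.
END;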

Queries to compare two tables or data sources

When using matching software to identify matches between two different data sources, you may encounter scenarios where one table is small relative to another, much larger one, or where both tables are of similar size. We have found that some techniques for comparing across the two tables run fine when both tables are not too large (say, under ten million records), but do not scale if one or both of the tables are much larger than that. Our eventual solution gets a little too detailed to describe effectively here, but feel free to contact us for information about how we solved it in our matchIT SQL application.

And Finally

I’d also recommend keeping an eye on the disks housing your SQL Server database files: make sure that there’s at least 30% of the storage space free and that the disks are not highly fragmented; doing this regularly produces better performance.

In summary, by making the effort to optimise the performance of your data cleaning operations, you will reduce the load on your database server, allow regular use of the applications needed to keep your data clean – and as a result keep your users happy.

Where Is Your Bad Data Coming From?

As Kimball documents in The Data Warehouse Lifecycle Toolkit (available in all good book stores), there are five concepts that, taken together, can be considered to define data quality:

Accuracy – The correctness of values contained in each field of each database record.

Completeness – Users must be aware of what data is the minimum required for a record to be considered complete and to contain enough information to be useful to the business.

Consistency – High-level or summarized information is in agreement with the lower-level detail.

Timeliness – Data must be up-to-date, and users should be made aware of any problems by use of a standard update schedule.

Uniqueness – One business or consumer must correspond to only one entity in your data. For example, Jim Smyth and James Smith at the same address should somehow be merged as these records represent the same consumer in reality.

So, using Kimball’s list, we might know what kind of data we want in the database, but unfortunately, despite our best intentions, there are forces conspiring against good data quality. It doesn’t take a forensics degree to find them, but there are so many sources of poor data that you may not even know where to look. For that, we’ve come up with our own list. Let’s take a look…

1. Data Entry Mistakes.

The most obvious of the bad data sources, these take the form of simple mistakes that employees make when entering data into the system: typos, data entered into the wrong fields, variations on certain data elements, and so on. Even under ideal circumstances these are easy mistakes to make and therefore extremely common, but unfortunately they can be the source of high numbers of duplicate records. But why is it so hard to get the data right? Consider these circumstances that can exacerbate data entry problems:

  • Poorly trained staff with no expectations for data entry
  • High employee turnover
  • Under-resourcing of call centres that leads to rushed customer exchanges
  • Forms that do not allow room for all the relevant info
  • Unenforced business rules because bad data is not tracked down to its source

2. Lazy Customers.

Let’s face it. Customers are a key source of bad data. Whether they are providing information over the phone to a representative or completing a transaction online, customers can deliberately or inadvertently provide inaccurate or incomplete data. But you know this already. Here are a few specific circumstances to look out for, especially in retail settings:

  • In-store business rules that permit staff to enter store addresses or phone numbers in place of the real customer info
  • Multiple ‘rewards cards’ per household or family that are not linked together
  • Use of store rewards cards that link purchases to different accounts
  • Customers who use multiple emails, nicknames or addresses without realizing it
  • Web forms that allow incorrectly formatted data elements such as phone numbers or zip codes
  • Customers pushed for time who then skip or cheat on certain data elements
  • Security concerns over web transactions that lead customers to leave out certain data or simply lie to protect their personal information

3. Bad Form

Web forms. CRMs. ERP systems. The way they are designed can impact data quality. How? Some CRM systems are inflexible and may not allow easy implementation of data rules, leading to required fields being left blank or containing incomplete data. Indeed, many web forms allow any kind of gibberish to be entered into any field, which can immediately contaminate the database. Not enough space for the relevant info, or systems and forms that have not been updated to match the business process, also pose a challenge. Many systems simply do not perform an address check at entry – allowing invalid addresses into the system. When it comes to data quality, good form is everything.

4. Customization Simply Reroutes Bad Data

All businesses have processes and data items unique to that business or industry sector. Unfortunately, when systems do not provide genuine flexibility and extensibility, IT will customize the system as necessary. For example, a CRM system may be adjusted to allow a full range of user-defined data (e.g. to allow a software company to store multiple licence details for each customer). Where this happens, the hacks and workarounds can lead to a lack of data integrity in the system (e.g. storing data in fields designed for other data types, such as dates in character fields).

5. Data Erosion is Beyond Your Control

Businesses and consumers move address. People get married and change their names. Business names change too, and contacts get promoted or replaced. Email addresses and phone numbers are constantly evolving. People die. No matter how sophisticated your systems are, some measure of data erosion is simply unavoidable. While good business rules will help you update data at relevant checkpoints, to maintain the best quality data it’s important to refresh it from reliable data sources on a regular basis.

6. New Data. Bad Data. Duplicate Data.

Many businesses regularly source new prospect lists that are subsequently loaded into the CRM. These can come from a variety of places, including list vendors, trade shows, publications, outbound marketing campaigns and even internal customer communications and surveys. Although it’s exciting to consider procuring a large new database of prospects, there are two ways this addition of data can go horribly wrong. First, the data itself is always suspect, falling prey to all the potential issues of data entry, data erosion and customer error. Second, even if you can corroborate or cleanse the data before loading it, there is still a chance you will be entering duplicate records that won’t always be quickly identified.

7. Overconfidence

OK, so this may not be a true ‘source’ of bad data, but it is the most important precipitating factor. You may think that by implementing business rules or by using a CRM’s built-in duplicate detection tools you are covered. In practice, business rules are important and valuable but are never foolproof, and they require constant enforcement, evaluation and updates. Moreover, built-in data quality features are typically fairly limited in scope, with the ability only to detect exact matches. They are simply not powerful enough to do the heavy lifting of a more sophisticated fuzzy and phonetic matching engine that will catch the subtle errors that can lead to major data quality issues. This false sense of confidence means you can easily overlook sources of poor data and neglect to perform critical data quality checks.

So if you keep these seven bad data sources in mind – are you home free? Unfortunately not. These are simply the building blocks of bad data. When even just some of these conditions occur simultaneously, the risk of bad data multiplies exponentially. The only true way to achieve the five-pronged data quality ideal outlined by Kimball (accuracy, completeness, consistency, timeliness and uniqueness) is through a comprehensive data quality firewall that addresses each of these components individually.

Stay tuned for more information on Best Practices in data quality that pinpoint specific business rules and software solutions to achieve true real-time data quality.

Assessing Your Data Quality Needs

So you have data quality issues. Who doesn’t? Should you embark on a data quality project? Maybe, but what are your objectives? Are there service issues related to poor data quality? Marketing issues? Other major integration or warehousing projects going on? And once you clean up your data – then what? What will you do with it? What benefit will a clean database bring to your organization? And without clear objectives, how can you even justify another major technology initiative?

Before any data quality project, it is critical to go beyond the immediate issues of duplicate records or bad addresses and understand the fundamental business needs of the organization and how cleaner data will enable you to make better business decisions. This will help you to establish accurate project parameters, keep your project on track and justify the investment to C-level executives. So where do you begin? At the beginning.

Look beyond the pain.
In most cases, a specific concern will be driving the urgency of the initiative but it will be well worth the effort to explore beyond the immediate pain points to other areas where data is essential. Plan to involve a cross-section of the departments including IT, marketing, finance, customer service and operations to understand the global impact that poor data quality could be having on your organization.

Look back, down and forward.
Consider the data quality challenges you’ve had in the past, the ones you face today and the ones that are yet to come. Is a merger on the horizon? Is the company migrating to a new platform? Do you anticipate significant staffing changes? Looking ahead in this way will ensure that the investment you make has a reasonable shelf-life.

Look at the data you don’t have.
As you review the quality of the data you have, also consider what’s missing and what information would be valuable to customer service reps or the marketing department. It may exist in another data silo somewhere that just needs to be made accessible or it could require new data be collected or appended.

Be the customer.
Call the Customer Service Department and put them through their paces. Sign up for marketing materials online. Place an order on the website. Use different addresses, emails and nicknames. Replicate perfectly reasonable scenarios that happen every day in your industry and see how your infrastructure responds. Take good notes on the places where poor data impacts your experience, and then look at the data workflow through fresh eyes.

Draw out the workflow.
Even in small organizations, there is tremendous value in mapping out the path your data takes through your business: where it is entered, used, changed, stored and lost. Doing this will uncover business rules (or the lack of them) that are likely impacting the data, departments with complementary needs, and places in the workflow where improvements can be made (and problems avoided).

Think big and small.
Management and C-Level executives tend to think big. Data analysts and technical staff tend to think granularly and departmental users usually fall somewhere in the middle. Ultimately, the best solution can only be identified if you consider the global, technical and strategic business needs.

The challenges of identifying, evaluating and implementing an effective data quality solution are fairly predictable, but problems almost always begin with incorrect assumptions about, and an incomplete understanding of, the overall needs of the organization. In some cases, the right data quality vendor can help you move through this process, but ultimately, failure to broaden the scope in this way can result in the purchase of a solution that does not meet all the requirements of the business.

Click here to download a comprehensive Business Checklist that will help you to identify the data quality business needs within your organization. Then stay tuned for our next post in our Data Quality Project series.

helpIT Systems is Driving Data Quality

For most of us around the US, the Department of Motor Vehicles is a dreaded place, bringing with it a reputation for long lines, mountains of paperwork and drawn-out processes. As customers, we loathe the trip to the DMV, and though we may not give it much thought while standing in line, the reality is that poor data quality is a common culprit behind some of these DMV woes. While it may seem unlikely that an organization as large and bureaucratic as the DMV can right the ship, today DMVs around the country are fighting back with calculated investments in data quality.

While improving the quality of registered driver data is not a new concept, technology systems implemented 15-20 years ago have long been a barrier for DMVs to actually take corrective action. However, as more DMVs begin to modernize their IT infrastructure, data quality projects are becoming more of a reality. Over the past year, helpIT has begun work with several DMVs to implement solutions designed to cleanse driver data, eliminate duplicate records, update addresses and even improve the quality of incoming data.

From a batch perspective, setting up a solution to cleanse the existing database paves the way for DMVs to implement other types of operational efficiencies like putting the license renewal process online, offering email notification of specific deadlines and reducing the waste associated with having (and trying to work with) bad data.

In addition to cleaning up existing state databases, some DMVs are taking the initiative a step further and working with helpIT to take more proactive measures by incorporating real-time address validation into their systems. This ‘real-time data quality’ creates a firewall of sorts, facilitating the capture of accurate data by DMV representatives while you provide it (via phone or at a window). With typedown technology embedded directly within DMV data entry forms, if there is a problem with your address, or you accidentally forget to provide information that affects its accuracy, like your apartment number or a street directional (North vs. South), the representatives are empowered to prompt you and request clarification.

Getting your contact data accurate from the start means your new license is provided promptly, without you having to make another visit, or call and wait on hold for 30 minutes, just to resolve a problem that may have been no more than a simple typo.

Having met several DMV employees over the past year, it’s obvious that they want you to have an excellent experience. Better data quality is a great place to start. Even while DMV budgets are slashed year after year, modest investments in data quality software are yielding big results in customer experience.

If you want to learn more about improving the quality of your data, contact us at 866.332.7132 for a free demo of our comprehensive suite of data quality products.