Where Big Data, Contact Data and Data Quality come together

We’ve been working in an area of untapped potential for Big Data for the last couple of years, which can best be summed up by the phrase “Contact Big Data Quality”. It doesn’t exactly roll off the tongue, so we’ll probably have to create yet another acronym, CBDQ… What do we mean by this? Well, our thought process started when we wondered exactly what people mean when they use the phrase “Big Data” and what, if anything, companies are doing in that arena. The more we looked into it, the more we concluded that although there are many different interpretations of “Big Data”, the one thing that underpins all of them is the need for new techniques to enable enhanced knowledge and decision making. I think the challenges are best summed up by the Forrester definition:

“Big Data is the frontier of a firm’s ability to store, process, and access (SPA) all the data it needs to operate effectively, make decisions, reduce risks, and serve customers. To remember the pragmatic definition of Big Data, think SPA — the three questions of Big Data:

  • Store. Can you capture and store the data?
  • Process. Can you cleanse, enrich, and analyze the data?
  • Access. Can you retrieve, search, integrate, and visualize the data?”

As part of our research, we sponsored a study by The Information Difference (available here) which answered such questions as:

  • how many companies have actually implemented Big Data technologies, and in what areas
  • how much money and effort organisations are investing in it
  • what areas of the business are driving investment
  • what benefits are they seeing
  • what data volumes are being handled

We concluded that plenty of technology is available to Store and Access Big Data, and many of the tools that provide Access also Analyze the data – but there is a dearth of solutions to Cleanse and Enrich Big Data, at least in terms of contact data, which is where we focus. There are two key hurdles to overcome:

  1. Understanding the contact attributes in the data, i.e. being able to parse, match and link contact information. If you can do this, you can cleanse contact data (remove duplication, correct and standardize information) and enrich it by adding attributes from reference data files (e.g. voter rolls, profiling sources, business information).
  2. Being able to do this for very high volumes of data spread across multiple database platforms.

The first of these should be addressed by standard data cleansing tools, but most of these only work well on structured data, maybe even requiring data of a uniform standard – and Big Data, by definition, will contain plenty of unstructured data of widely varying standards and degrees of completeness. At helpIT systems, we’ve always developed software that doesn’t expect data to be well structured and doesn’t rely on data being complete before we can work with it, so we’re already in pretty good shape for clearing this hurdle – although semantic annotation of Big Data is more akin to a journey than a destination!
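To give a flavour of what “parse, match and link” means in practice, here is a minimal sketch in Python. It is emphatically not our parsing/matching engine – real contact-data tools use reference tables, phonetic keys and weighted scoring – but it shows the general shape: normalize each field, then score field-by-field similarity to flag likely duplicates. The record fields and threshold are illustrative assumptions.

```python
import difflib
import re

def normalize(record):
    """Lower-case each field, strip punctuation and collapse whitespace."""
    return {k: re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", v.lower())).strip()
            for k, v in record.items()}

def similarity(a, b):
    """Crude field-by-field similarity between two normalized records (0..1)."""
    scores = [difflib.SequenceMatcher(None, a[k], b[k]).ratio() for k in a]
    return sum(scores) / len(scores)

r1 = {"name": "Mr. John A. Smith", "city": "New   York"}
r2 = {"name": "john a smith",      "city": "new york"}

score = similarity(normalize(r1), normalize(r2))
print(score > 0.85)  # high score -> candidate duplicate pair
```

Even this toy version shows why structure matters: the score is only meaningful once punctuation, case and spacing noise have been stripped away, and a production tool must also cope with fields that are missing, mislabelled or free-text.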

The second hurdle is the one that we have been focused on for the last couple of years, and we believe that we’ve now got the answer – using in-memory processing for our proven parsing/matching engine to achieve super-fast and scalable performance on data from any source. Our new product, matchIT Hub, will be launching later this month, and we’re all very excited by the potential it has not just for Big Data exploitation, but also for:

  • increasing the number of matches that can safely be automated in enterprise Data Quality applications, and
  • providing matching results across the enterprise that are always available and up-to-date.
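Why does in-memory processing change the economics of matching at high volume? A common technique (this is a generic illustration, not a description of matchIT Hub’s internals) is to hold a “blocking” index in memory, so that only records sharing a key are ever compared – comparisons then scale with block size rather than with the square of the total record count. The blocking key below is a deliberately naive assumption for the sketch.

```python
from collections import defaultdict

def block_key(record):
    """Toy blocking key: first 4 letters of surname + first letter of city."""
    surname = record["name"].split()[-1].lower()
    return surname[:4] + record["city"][0].lower()

def find_candidates(records):
    """Group records in memory by blocking key. Keys shared by more than
    one record identify the only pairs worth detailed comparison."""
    index = defaultdict(list)
    for i, rec in enumerate(records):
        index[block_key(rec)].append(i)
    return {k: v for k, v in index.items() if len(v) > 1}

records = [
    {"name": "John Smith", "city": "York"},
    {"name": "Jon Smith",  "city": "york"},
    {"name": "Ann Jones",  "city": "Leeds"},
]
print(find_candidates(records))  # {'smity': [0, 1]}
```

Because the index lives in memory, it can also be kept continuously up to date as records arrive from different source platforms – which is what makes always-available, enterprise-wide matching results feasible.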

In the next post, I’ll write about the potential of in-memory matching coupled with readily available ETL tools.