A ThoughtPoint by Dr Barry Devlin, 9sight Consulting

July 2019

Sponsored by CortexAG

A German version of this article is available at Informatik Aktuell.

Dirty data is not the real problem. CortexDB reframes the challenge: figuring out the true meanings hidden in the grime created when data sources—from people to sensors—deliver low-quality data. In the fourth article of this series, we explore data distillation.

ThoughtPoint 4 of a 5-part Series

Dealing with Dirty Data

Data scientists have the most data-driven—as well as, allegedly, the sexiest—jobs in digital business. The goal is to seek out business insights from the extensive data available in a digitised world. Before it became glamorous, data science was called data mining, arguably a better name. Like physical mining, the initial challenge for data scientists is to refine relatively tiny nuggets of data gold from enormous quantities of raw data ore and, equally important, to generate clean, pure, gold-standard data.

Much of the dross in raw data comes from how we misspell names, abbreviate words, and misplace values in manual data entry, or how speech-to-text applications misunderstand our words. Further problems arise when we fail to fully specify the context of the data generated or captured by the billions of sensors deployed on the Internet of Things. The incoming data is dirty in many different ways, but cleansing each instance is only part of the problem. The real challenge arises when, from millions of arriving data records, we need to distil those that refer to the same real-world entity and uniquely specify the true meaning of each. Traditionally, name and address data is the most commonly cited example of this challenge, but it applies to any type of loosely structured information.
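To make that challenge concrete, here is a minimal, invented illustration of three incoming records that a human would recognise as one and the same customer, but that naive field-by-field comparison treats as three different people (all names and values are fabricated for this sketch):

```python
# Three incoming records describing the same customer; all values are invented.
# Exact comparison of the name or street fields fails, even though a human
# immediately sees a single person.
incoming_records = [
    {"name": "Jonathan Smith",  "street": "12 High Street", "city": "Singapore"},
    {"name": "J. Smyth",        "street": "12 High St.",    "city": "Singapore"},
    {"name": "SMITH, JONATHAN", "street": "High Street 12", "city": "SINGAPORE"},
]
```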

Solutions to this type of problem—often called record linkage—date back many decades and aim to create a “golden record” of cleansed and reconciled data about each entity. Traditionally, the process consists of multiple, sequential, partially overlapping cleansing and comparison steps. The choice of which steps and in which order is often manually determined, depending on the types of data involved, the categories of problems seen, and the researcher’s skills and experience.
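As an illustration of those sequential steps, the sketch below shows a minimal, hand-built linkage pipeline: normalise each record, then compare candidate pairs with a crude similarity rule. The field names, abbreviation list, and comparison rule are invented for this example and are far simpler than any production process:

```python
import re
from itertools import combinations

ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}

def normalise(record):
    """Typical first steps: lower-case, strip punctuation, expand common abbreviations."""
    cleaned = {}
    for field, value in record.items():
        tokens = re.sub(r"[^\w\s]", " ", value.lower()).split()
        cleaned[field] = " ".join(ABBREVIATIONS.get(t, t) for t in tokens)
    return cleaned

def looks_like_same_entity(a, b):
    """Crude comparison rule: at least one shared name token and an identical city."""
    return bool(set(a["name"].split()) & set(b["name"].split())) and a["city"] == b["city"]

def link(records):
    """Compare every pair of records; real pipelines add blocking steps to avoid
    an O(n^2) explosion over millions of records."""
    cleaned = [normalise(r) for r in records]
    return [(i, j) for i, j in combinations(range(len(cleaned)), 2)
            if looks_like_same_entity(cleaned[i], cleaned[j])]
```

Every choice embedded here (which fields to normalise, which abbreviations to expand, which similarity rule to apply and in what order) has to be made by hand for each data set, which is exactly the manual effort described above.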

Now often called data wrangling, this process typically falls to data scientists and can consume up to 80% of their time. Their skills are better employed analysing data and producing insights than cleaning up dirty data. Data wrangling tools can certainly ease data scientists’ pain, but they mostly simplify and visually augment existing manual cleansing steps, rather than addressing the underlying issues beneath the “dirty data” problem.

Context is the Answer… But What’s the Question?

To solve this problem in a digital business—large numbers of individually dirty data records differing inconsistently across enormous data sets from multiple, often conflicting sources—we must reframe our thinking. The real business need is not to cleanse dirty data; rather, it is to distil from it unique identities for real-world entities, even while individual records may still contain irreconcilable differences. This shifts our focus from errors in the data values (naked data) to the context of data creation and use (context-setting information, CSI), allowing us to work around those errors.

By separately storing and continuously aligning naked data and CSI, as described in “CortexDB Reinvents the Database”, an Information Context Management System (ICMS) such as CortexDB enables the creation of an integrated and highly automated system for reconciling and cleansing data from multiple disjoint sources. A realistic scenario demonstrates what this means.

360° Customer Data Management

A fictitious Large European Automobile Producer (let’s call them LEAP for short) sells and services its vehicles through a network of dozens of dealers in the Far East. Each dealer runs its own IT systems, which vary in size and complexity with the scale of the dealership. Each manages its own customer data, sales, and services in its own language and according to local laws. Furthermore, LEAP itself has multiple, partially inconsistent IT systems for different business functions and/or regions, due to acquisitions and legacy IT development.

How can LEAP become an integrated digital business, consolidating data from all dealers and internal departments to optimise its operations, define and track KPIs at local and international levels, and deliver sales and service excellence to all its customers and partners? How can it hope to achieve even a small part of that aim when there is no complete, consistent master list of its customers?

Attempts to cleanse and consolidate the customer master set using data wrangling tools fail early. Each dealer’s data must be imported and cleansed individually, but this approach cannot account for customers who buy from multiple dealers. Furthermore, such manual systems are exceedingly difficult to apply to continuously changing data that must support real-time operational and reporting systems.

As an ICMS, CortexDB enables the creation of an integrated set of all customer data that can be distilled—cleansed and reconciled—continuously and automatically.

Every different customer record from every system, both internal and external, is stored individually, time-stamped, and in its raw format in the CortexDB document store. No standard structure, naming, or ordering of data fields needs to be defined in advance. New or changed schemata can be handled with equal ease. Every record, with full historical sequencing, consists of an unordered and unconstrained set of key:value pairs, stored forever. Storing a complete set of customer records in original form as naked data is a prerequisite to its ongoing distillation, as well as to any required auditing.
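A minimal sketch of what such stored records might look like follows. The wrapper structure, source names, and field names are invented for illustration and are not CortexDB’s actual storage format:

```python
from datetime import datetime, timezone

# Illustrative only: one dealer record kept exactly as received, as an
# unordered set of key:value pairs, wrapped with a load timestamp and a
# source tag so that every historical version can be retained.
raw_record = {
    "source": "dealer_042_crm",                  # hypothetical source system
    "loaded_at": datetime.now(timezone.utc).isoformat(),
    "payload": {                                 # stored as received: no fixed
        "cust_name": "J. Smyth",                 # schema, naming, or field order
        "addr_1": "12 High St.",
        "city": "Singapore",
    },
}

# A record for the same person from another system may arrive with entirely
# different field names and extra fields; it is simply stored as-is too.
another_record = {
    "source": "leap_warranty_system",            # hypothetical source system
    "loaded_at": datetime.now(timezone.utc).isoformat(),
    "payload": {"full_name": "Jonathan Smith", "country": "SG", "warranty_no": "W-1138"},
}
```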

As each record is loaded into the document store, CortexDB adds its CSI into the associated sixth normal form, inverted index database. Using a combination of techniques, including deterministic methods based on likely-unique fields such as social media IDs, probabilistic phonetic and linguistic methods, and semantic graph analysis, high-probability matching records are identified and tagged. As new records are added daily, they are automatically integrated into the database and assigned system-wide IDs that, with high probability, uniquely reference real-world entities. In this highly automated approach, human involvement is limited to oversight and validation of unusual cases.
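The sketch below shows, in a much simplified form, how such a combination of techniques might be applied: a deterministic check on a likely-unique field first, then a probabilistic phonetic fallback using a compact Soundex-style code. The field names are invented, the rules are arbitrary, and semantic graph analysis is omitted entirely; this illustrates the general approach, not CortexDB’s implementation:

```python
def soundex(name: str) -> str:
    """Compact American Soundex, so that, e.g., 'Smith' and 'Smyth' share the code S530."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4", **dict.fromkeys("mn", "5"), "r": "6"}
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result, prev = [], codes.get(letters[0], "")
    for c in letters[1:]:
        code = codes.get(c, "")
        if code and code != prev:       # skip uncoded letters, collapse adjacent duplicates
            result.append(code)
        prev = code
    return (letters[0].upper() + "".join(result) + "000")[:4]

def likely_same_customer(a: dict, b: dict) -> bool:
    """Deterministic match first, then a probabilistic phonetic fallback."""
    # 1. Deterministic: a likely-unique field such as a social media ID.
    if a.get("social_id") and a.get("social_id") == b.get("social_id"):
        return True
    # 2. Phonetic/linguistic: surnames sound alike and cities agree.
    return (soundex(a.get("surname", "")) == soundex(b.get("surname", ""))
            and a.get("city", "").strip().lower() == b.get("city", "").strip().lower())
```

In practice, deterministic, probabilistic, and graph-based evidence would be weighted and combined, rather than applied as a simple either/or, before a system-wide ID is assigned.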

The Meaning of Context

The scenario above shows how an ICMS can support the automated distillation of dirty customer identity data from multiple sources. It can be easily extended to a range of other types of data, such as product names and descriptions, contracts, medical diagnoses, and call transcripts.

By taking a step back from the common, simplistic view of “cleaning dirty data” and focusing on the context-setting information, we can see that the real need is to distil uniquely identifiable entities from poor-quality data. CortexDB’s unique indexed schemaless structure allows us to step beyond the data to the meaning of the information being stored and processed.

The fifth and final article in this series positions Information Context Management Systems in the larger picture of digital transformation and shows how CortexDB offers broader possibilities than traditional database systems.

Links to other articles in the series:

Article 1: CortexDB Reinvents the Database – June 2019
Article 2: Making Data Agile for Digital Business – June 2019
Article 3: Managing Data on Behalf of Different Actors – June 2019
Article 5: CortexDB Drives Agile Digital Transformation – July 2019