A ThoughtPoint by Dr Barry Devlin, 9sight Consulting

May 2019

Sponsored by Cortex AG

A German version of this article is available at Informatik Aktuell.

CortexDB is the first example of a new class of information management tool that allows data and context to be managed with both independence and integration, a feature that is central to reuse of information for different business purposes.

ThoughtPoint 1 of a 5-part Series

Over more than thirty years as a thought leader in information management, my observation is that most advances are incremental and built on existing ideas. I have encountered truly novel thinking only rarely. I have thus been delighted to meet a small German software company that has incorporated some innovative and potentially revolutionary thinking in its product.

Cortex AG designates the core of its product, CortexDB, as a database, and indeed as nothing less than a “Multi-Model NoSQL DBMS”. I believe they undersell their work. This focus on data and database models, and on the internal structure of the database, distracts from the fundamental difference between what CortexDB does and what all previous databases do.

I believe that CortexDB may be the first of a new type of product: an adaptive information context management system (ICMS).

To understand why and what this means, we need to look at the history of databases and go back to an old debate: the difference between data and information—the fundamentals of which I described in “Business unIntelligence”—as a prelude to redefining what CortexDB is and can do.

In the mid-1960s, researchers began defining data bases (yes, two words) as sets of data that could be used by more than one application, instead of having a specific dataset for every instance. When data is owned and used by only one application, it can be stored simply and efficiently as a sequential string of values (like a CSV file without a header). If you want to share data, you need at least to name the individual fields. Ideally, you also need some idea of the logical relationships between individual fields. Data stores thus became databases, as these “naked” data values were named, described and dressed up in hierarchical and later, relational structures.
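The difference between a “naked” sequence of values and shareable data can be made concrete in a few lines. The following sketch is my own illustration (the record and field names are invented), not anything from CortexDB:

```python
# A record as a single application might store it: just an ordered
# string of values, like one row of a CSV file without a header.
naked_record = ["10042", "Meier", "2019-05-09", "1499.00"]

# Without field names, a second application cannot safely interpret
# the values: is "1499.00" a price, a weight, or an account balance?

# The minimum context-setting information (CSI): names for the fields.
field_names = ["customer_id", "surname", "order_date", "amount_eur"]

# Pairing the values with their CSI yields information that another
# application can interpret and use.
record = dict(zip(field_names, naked_record))
print(record["amount_eur"])  # 1499.00
```

The hierarchical and relational structures mentioned above go further, adding descriptions and relationships between fields, but the principle is the same: data becomes information only when dressed in CSI.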

These names, descriptions and relationships are context-setting information (CSI), also called metadata. CSI forms the bridge between data and information. I use the following definitions:

Information: the recorded and stored symbols and signs we use to describe the world and our thoughts about it, and to communicate with each other. Information consists of data and CSI. It’s information that we really need and use as people and businesses.

Data: “facts”—measurements, statistics, the output of physical sensors, etc.—in the form of numeric or simple textual values. I put “facts” in quotes to signify that facts are seldom as hard or fixed as we assume; they are deeply dependent on the frame of reference of those who create and/or store them. As a result, CSI—as a representation of that frame of reference—is also the responsibility of these data creators.

Data base vs. Information base

The discussion above suggests that every database lies somewhere on the spectrum from data to information management. A key-value store is much closer to data. A relational database with a fully populated set of description tables is much closer to information. And because information is what we as people need, relational databases have come to dominate computing.

However, there is an additional consideration. The context—and by extension, meaning—of any piece of information in a database is precisely and only that intended by its creator(s), including engineers who understand the data and business experts who know the intended usage (CSI) of that data. Using the resulting information for another purpose may lead to errors if the meaning assumed by the second application aligns poorly with that defined for the first.

This is the exact challenge found when we take information from the operational environment to the world of BI, even though both are built on relational databases. The basic problem is that a relational database can only represent one context at a time: that of its creator. As a result, in the world of BI, we must build and populate a second database to hold the context needed for analytical tasks. In fact, we find we must build multiple databases for many different analytical contexts.

A second—and more pressing problem for modern digital business—is that of onboarding externally sourced data. Here, information from the external environment is often largely decontextualized (for example, in a CSV format file) as it is passed over the Internet to the receiving enterprise. Data wrangling is, in effect, the rebuilding of context needed to interpret and use the incoming data.

The underlying problem is the same in both cases: We have failed to recognise that there are multiple usage contexts for the same base data set. The database creator embeds the CSI for their original usage scenario in the database design. Subsequent usage scenarios pose varying levels of mismatch to that original design. As usage scenarios become ever more varied and complex, we need something beyond a traditional database to manage the mix and match of uses.

An Information Context Management System

This is where CortexDB comes in. It provides a system that almost completely separates the management of “naked data”[1] from the management of CSI, and ensures that the two remain fully synchronised when either changes. I call this an adaptive Information Context Management System (ICMS).

CortexDB stores the naked data in a document store where the only CSI is the names of the fields and the record IDs of the documents. (These are, of course, the minimum requirement for joining to the extended CSI.) Meanwhile, the CSI resides in a 6th normal form (6NF) relational structure. I’ll describe this structure in more detail in my next article.
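To make the separation tangible, here is a toy sketch, in no way CortexDB’s actual storage format, of naked data and extended CSI held in separate, joinable stores (all record IDs, field names and attributes are invented):

```python
# Document store of naked data: record IDs and field names are the
# only built-in context-setting information (CSI).
documents = {
    "rec-001": {"cust": "10042", "amt": "1499.00"},
    "rec-002": {"cust": "10077", "amt": "89.90"},
}

# Extended CSI maintained separately, keyed by field name.
csi = {
    "cust": {"label": "Customer ID", "type": "string"},
    "amt":  {"label": "Order amount", "type": "decimal", "unit": "EUR"},
}

def describe(record_id):
    """Join a naked record to its CSI to produce readable information."""
    return {
        csi[field]["label"]: value
        for field, value in documents[record_id].items()
    }

print(describe("rec-001"))
# {'Customer ID': '10042', 'Order amount': '1499.00'}
```

Because the CSI lives outside the documents, it can be extended or corrected without touching the stored values, which is the agility the rest of this article depends on.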

In the case of differing application needs that demand conflicting ways of envisaging the data, multiple sets of CSI that structure data access and use in incompatible ways can easily be created and maintained in parallel. Of course, where applications use information the same way, they all use the same set of CSI.
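The idea of parallel CSI sets over one shared data store can be sketched as follows. This is purely illustrative (the contexts and field mappings are invented for the example), not a description of CortexDB internals:

```python
# One shared store of naked data.
naked = {"rec-001": {"f1": "1499.00", "f2": "2019-05-09"}}

# Each usage context supplies its own CSI for the same fields, even
# where the two interpretations are incompatible with each other.
csi_sets = {
    "billing":   {"f1": "invoice_total", "f2": "invoice_date"},
    "analytics": {"f1": "revenue",       "f2": "period"},
}

def view(record_id, context):
    """Interpret the same naked record through one context's CSI."""
    names = csi_sets[context]
    return {names[field]: value for field, value in naked[record_id].items()}

print(view("rec-001", "billing"))
# {'invoice_total': '1499.00', 'invoice_date': '2019-05-09'}
print(view("rec-001", "analytics"))
# {'revenue': '1499.00', 'period': '2019-05-09'}
```

Neither view requires copying the data; adding a third context means adding a third CSI set, not a third database.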

By providing a single store of naked data, supported by multiple sets of CSI, CortexDB can support both operational/transactional applications and informational/analytical applications on the same data (provided, of course, that the naked data is stored at the appropriate atomic level of detail).

In the case of onboarding externally sourced information, the initial data store can be designed and created with only minimal knowledge of how the incoming data is internally interrelated or how it relates to existing internal data. Those elements of context and structure can be added later and incrementally as the understanding and use of this onboarded data evolves.

This latter case is an important example of how the context / meaning asserted by the data creator (e.g. in the Internet of Things) may be in large part unknown to the eventual users of the data (in this case, within the receiving enterprise). Here, the ICMS provides the means to define and explore multiple ways of interpreting and using poorly described information. As digital businesses create and exchange ever more data, this issue is set to become ever more common.

These two important areas of information processing—and others—are united in requiring far more and far deeper agility in the creation, structuring and use of data than traditional data processing. As the programming world has adopted the precepts of agile development, now too must data management.

In the second article of this series, I dive into more detail on how CortexDB is structured internally and show how it supports data agility.


[1] I’ve chosen to use “naked data” rather than “raw data” because the latter phrase has other uses, especially in onboarding external data. Naked data is information that is “stripped” of as much CSI as possible, consisting mostly of “pure” value data.