Data integration – the activity of creating a single instance of an event, transaction, customer, order, etc., from multiple sets of related data – has been a standard practice for more than 20 years. The creation of this “single version of the truth” has been the holy grail of many organizations desiring to become data-driven enterprises.

Today, we don’t need a room full of coders writing the programs that perform the data integration processes. We have very sophisticated model- and metadata-driven, easy-to-use data integration technology. But recently, this seemingly stable world has witnessed significant disruption. The data sources for the integration processes no longer consist of only well-behaved, mostly operational sets of data with known formats and data volumes. Enter the age of IoT and big data: massive sets of data with unknown or variable formats, sizes, quality, and sources, along with innovative, scalable technologies for storing and analyzing this data, further upending the status quo.

But disruption does not mean chaos!

Fortunately, data integration vendors have embraced these disruptive forces and changed the way we perform data integration to accommodate and tame the chaos in clever ways. Diyotta produced a concise blog post describing the 5 Principles of Modern Data Integration. Please read them over; they provide well-thought-out tenets of modern data integration processing, so there is no need for me to reiterate them.

I would like to focus on the future of data integration, hopefully capturing where we need to go from here. Here are a few suggestions for future development in data integration technologies that follow directly from those five principles.

Your comments, additions, or disagreements are always welcome!

  1. Creation of highly portable data integration processes. Creating portable data integration packages insulates your data integration processes from technological changes and/or evolving new data sources/targets. It means creating “containers” of data integration processes. You may start with a relational DBMS and later move your data and data integration processing elsewhere, say to Hadoop, without impacting the previously built processes. The key is that wherever you choose to create your BI or analytics environment via data integration, you can adapt to the inevitable, evolving changes in the technology. And your data integration architecture should have built-in resilience to these changes.
  2. Support for multiple deployment options. Organizations are deploying their analytic assets on-premises, in the cloud (public and private), and in traditional databases or on a Hadoop cluster. Data must be integrated both in the cloud and on-premises, and the integration should occur within databases, applications, or middleware. Finally, we need data integration to occur in batches, via a request/response mechanism (APIs), or in real time via micro-batches. Real time means the data integration platform must be able to capture large volumes of rapid but small messages from streaming data.
  3. Support for both traditional data pull (extract, then integrate) and the newer push process. Data is pushed from the source to the backend data integration service when something changes or on a timed schedule. This is quite different from the traditional data integration mechanism of pulling or extracting the data on demand. This push action would be far more reactive and timely for some sets of data. In this case, ETL would mean event-transform-load and would support the continuous delivery of data to the transformation and loading processes.
  4. Distributed vs. centralized. While we see global boundaries starting to shrink as businesses expand (e.g., Uber, Netflix, AirBnB), this also mandates that the data landscape become more diverse and distributed. This is most obvious as we begin to see cloud, mobile, and IoT initiatives gaining momentum across enterprises. Even if data volume is not an issue in your organization or you don’t have an IoT initiative, it is unrealistic in today’s analytics architecture to demand that all data end up in a single physical centralized data store. There are many reasons why data may remain distributed. We need modern architectures that can manage distributed data processing and storage as well as centralized ones.
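To make the push-based “event-transform-load” idea above more concrete, here is a minimal, hypothetical sketch in Python. It is not any vendor’s API: an in-memory queue stands in for the ingestion endpoint of a backend integration service, sources push change events as they happen, and the service transforms and loads them in micro-batches. All names (`PushIntegrationService`, the event fields, the batch size) are illustrative assumptions.

```python
import queue

class PushIntegrationService:
    """Toy sketch of a push-based (event-transform-load) integration service."""

    def __init__(self, batch_size=100):
        self._events = queue.Queue()   # stands in for a streaming ingestion endpoint
        self.batch_size = batch_size   # max events per micro-batch
        self.target = []               # stands in for a warehouse or data lake table

    def push(self, event):
        """Called by a source whenever something changes (the 'event' in E-T-L)."""
        self._events.put(event)

    def _transform(self, event):
        # Normalize a hypothetical customer-update event into the target schema.
        return {
            "customer_id": int(event["id"]),
            "name": event.get("name", "").strip().title(),
            "source": event.get("source", "unknown"),
        }

    def run_once(self):
        """Drain up to batch_size pending events, transform, and load one micro-batch."""
        batch = []
        while len(batch) < self.batch_size:
            try:
                batch.append(self._events.get_nowait())
            except queue.Empty:
                break
        self.target.extend(self._transform(e) for e in batch)
        return len(batch)

# Sources push as changes occur; the service loads on its own cadence.
svc = PushIntegrationService(batch_size=3)
svc.push({"id": "42", "name": "  ada lovelace ", "source": "crm"})
svc.push({"id": "7", "name": "alan turing", "source": "web"})
loaded = svc.run_once()
```

The point of the sketch is the inversion of control: the source initiates delivery, so the integration layer reacts continuously instead of polling or bulk-extracting on a schedule.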

Final thoughts:

Data integration and its associated technologies, data quality processing and data profiling, are undergoing a paradigm shift: from traditional RDBMS environments to faster platforms like Hadoop, from on-premises to the cloud, and from batch to real time or from bulk loads to singular API calls. Make no mistake: this shift in no way negates the need for this critical step in creating a data-driven organization. It may go by different names, like ELT, ETL, data prep, and data “munging” (simpler forms of data integration), but it still boils down to ensuring that the data for analytics is accessible, understandable, and of sufficient quality for reliable results.