Data Integration – Where do we go from here?

Posted by Claudia
Modern data integration

08 March 2016

Data integration – the activity of creating a single instance of an event, transaction, customer, order, etc., from multiple sets of related data – has been a standard practice for more than 20 years. The creation of this “single version of the truth” has been the holy grail of many organizations that aspire to become data-driven enterprises. Today, we no longer need a room full of coders writing the programs that perform the data integration processes; we have sophisticated, model- and metadata-driven, easy-to-use data integration technology.

But recently this seemingly stable world has witnessed significant disruption. The data sources feeding integration processes no longer consist only of well-behaved, mostly operational sets of data with known formats and volumes. Enter the age of IoT and big data: massive sets of data with unknown or variable formats, sizes, quality, and sources, along with innovative, scalable technologies for storing and analyzing that data, further upending the status quo. But disruption does not mean chaos! Fortunately, data integration vendors have embraced the disruptive forces and changed the way we perform data integration to accommodate and tame the chaos in clever ways.

Diyotta produced a concise blog describing the 5 Principles of Modern Data Integration. Please read them; they are well thought-out tenets of modern data integration processing, so there is no need for me to reiterate them here. Instead, I would like to focus on the future of data integration, hopefully capturing where we need to go from here. Here are a few suggestions for future development in data integration technologies that also follow directly from those five principles. Your comments, additions, or disagreements are always welcome!
  1. Creation of highly portable data integration processes. Creating portable data integration packages insulates your data integration processes from technological change and from evolving data sources and targets. It means creating “containers” of data integration processes: you may start with a relational DBMS and later move your data and your data integration processing elsewhere – say, to Hadoop – without impacting the previously built processes. The key is that wherever you choose to create your BI or analytics environment via data integration, you can adapt to the inevitable changes in the underlying technology, and your data integration architecture should have built-in resilience to those changes. (A minimal sketch of an engine-neutral pipeline definition follows this list.)
  2. Support for multiple deployment options. Organizations are deploying their analytic assets on-premises, in the cloud (public and private), and in traditional databases or on Hadoop clusters. Data must be integrated wherever it lives, whether in the cloud or on-premises, and the integration should be able to occur within databases, applications, or middleware. Finally, we need data integration to occur in batches, via a request/response mechanism (APIs), in micro-batches, or in real time. Real time means the data integration platform must be able to capture large volumes of rapid but small messages from streaming data.
  3. Support for both traditional data pull (extract, then integrate) and the newer push process. Here, data is pushed from the source to the back-end data integration service when something changes or on a timed schedule. This is quite different from the traditional data integration mechanism of pulling or extracting the data on demand, and the push action would be far more reactive and timely for some sets of data. In this case, ETL would mean event-transform-load and would support the continuous delivery of data to the transformation and loading processes. (See the second sketch after this list.)
  4. Distributed vs. centralized. As global boundaries shrink and businesses expand (e.g., Uber, Netflix, Airbnb), the data landscape becomes more diverse and distributed. This is most obvious as Cloud, Mobile, and IoT initiatives gain momentum across enterprises. Even if data volume is not an issue in your organization, or you don’t have an IoT initiative, it is unrealistic in today’s analytics architectures to demand that all data end up in a single physical, centralized data store. There are many reasons why data may remain distributed, and we need modern architectures that can manage distributed data processing and storage as well as centralized ones. (A third sketch below illustrates this with a simple federated query.)
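
To make the first principle concrete, here is a minimal sketch of what an engine-neutral pipeline definition might look like: the pipeline is described purely as metadata and handed to whatever execution engine is available. Every name here (PipelineSpec, InMemoryEngine, the sample transform steps) is an illustrative stand-in, not a real product API.

```python
# A pipeline described as engine-neutral metadata: the same spec can run against
# an RDBMS today and a Hadoop/Spark cluster later, only the engine binding changes.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PipelineSpec:
    """Engine-agnostic description of one integration flow."""
    source: str                                           # logical source name
    target: str                                           # logical target name
    transforms: List[str] = field(default_factory=list)   # named, declarative steps


class InMemoryEngine:
    """Stand-in execution engine; a real one would wrap JDBC, Hive, Spark, etc."""
    def __init__(self, tables: Dict[str, List[dict]]):
        self.tables = tables
        self.steps: Dict[str, Callable[[List[dict]], List[dict]]] = {
            "dedupe": lambda rows: [dict(t) for t in {tuple(sorted(r.items())) for r in rows}],
            "uppercase_names": lambda rows: [{**r, "name": r["name"].upper()} for r in rows],
        }

    def run(self, spec: PipelineSpec) -> None:
        rows = self.tables[spec.source]        # extract
        for step in spec.transforms:           # transform, step by step
            rows = self.steps[step](rows)
        self.tables[spec.target] = rows        # load


spec = PipelineSpec(source="orders_raw", target="orders_clean",
                    transforms=["dedupe", "uppercase_names"])
engine = InMemoryEngine({"orders_raw": [{"id": 1, "name": "acme"}, {"id": 1, "name": "acme"}]})
engine.run(spec)
print(engine.tables["orders_clean"])   # -> [{'id': 1, 'name': 'ACME'}]
```

The point of the design is that the spec never changes when the platform moves; only the engine bindings do.
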
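Here is an equally small sketch of the push-based, “event-transform-load” idea from point 3: the source pushes each change, and a handler transforms and loads it on arrival rather than waiting for a scheduled extract. The queue, handler, and target store are stand-ins for whatever streaming transport and landing zone you actually use; none of this is a specific product’s API.

```python
# Push-style integration: events arrive, get transformed, and are loaded
# continuously, instead of being pulled on a schedule.
import json
import queue
import threading

events = queue.Queue()      # stands in for Kafka, MQTT, a webhook endpoint, etc.
target_store = []           # stands in for the warehouse / data lake table

def transform(event: dict) -> dict:
    """Light, per-event transformation applied as the data arrives."""
    return {"device": event["device_id"].lower(),
            "reading_c": round((event["reading_f"] - 32) * 5 / 9, 2)}

def consumer():
    """Continuously drain the queue: event -> transform -> load."""
    while True:
        raw = events.get()
        if raw is None:                 # sentinel to stop the sketch
            events.task_done()
            break
        target_store.append(transform(json.loads(raw)))
        events.task_done()

threading.Thread(target=consumer, daemon=True).start()

# The source *pushes* changes as they happen; nothing polls or extracts.
events.put(json.dumps({"device_id": "Sensor-42", "reading_f": 98.6}))
events.put(None)
events.join()          # wait for the pushed event to be transformed and loaded
print(target_store)    # -> [{'device': 'sensor-42', 'reading_c': 37.0}]
```

The same shape also covers the real-time and micro-batch modes mentioned in point 2; a micro-batch handler would simply drain several events at a time before loading.
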
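Finally, a small sketch of point 4: rather than forcing every dataset into one central store, the integration layer fans a query out to the places the data already lives and merges the results. The three “sources” below are in-memory stand-ins for, say, an on-premises database, a cloud warehouse, and an IoT feed; the function and source names are hypothetical.

```python
# Federated fetch: query distributed sources in place and merge the results,
# without first copying everything into a single centralized store.
from concurrent.futures import ThreadPoolExecutor

on_prem_orders = [{"order_id": 1, "region": "EMEA", "amount": 120.0}]
cloud_orders   = [{"order_id": 2, "region": "AMER", "amount": 75.5}]
iot_readings   = [{"order_id": 3, "region": "APAC", "amount": 210.0}]

SOURCES = {
    "on_prem": lambda: on_prem_orders,   # would be a SQL query in practice
    "cloud":   lambda: cloud_orders,     # would be a warehouse API call
    "edge":    lambda: iot_readings,     # would be a streaming snapshot
}

def federated_fetch(region: str) -> list:
    """Query every source in parallel and merge only the rows we need."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda fetch: fetch(), SOURCES.values())
    return [row for rows in results for row in rows
            if region == "*" or row["region"] == region]

print(federated_fetch("*"))   # all three rows, merged without a central copy
```
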
Final thoughts: Data integration and its associated technologies, data quality processing and data profiling, are undergoing a paradigm shift: from traditional RDBMS environments to faster platforms like Hadoop, from on-premises to the cloud, from batch to real time, and from bulk loads to singular API calls. Make no mistake: this shift in no way negates the need for this critical step in creating a data-driven organization. It may go by different names, such as ELT, ETL, data prep, or data “munging” (simpler forms of data integration), but it still boils down to ensuring that the data for analytics is accessible, understandable, and of sufficient quality for reliable results.
  • Great read.
    I like your “event-transform-load” point; the use of data integration in real-time scenarios, on top of streaming data, is becoming mainstream.
    On another dimension, what are your thoughts on self-service data preparation? I feel that today the topic tends to be a bit restricted to the BI use case (the same way ETL was restricted to the BI use case in its early days).
    But if you take a broader view, you find that self-service data preparation can potentially be applied to any data integration use case. Don’t you think this could disrupt the market to a similar degree that data discovery did for BI “front-ends”?

    • Claudia Imhoff

      Thank you for your comments, Jean-Michal. Self-service data preparation is certainly becoming popular with the much-vaunted data scientist and other technologically savvy analysts. However, I do believe that it is still not something for the everyday, non-tech-savvy individual; it takes some knowledge of databases and of the technical jargon of joins, unions, and so on.

      Data prep could be used for other simple data integration situations, but remember that data prep is not ETL, ELT, or whatever you call the heavy-duty data integration, combined with data quality processing, used in many formal data warehouses. Care should be taken to apply data prep only to the scenarios for which it is well suited.

  • Well said Claudia. I think you’ve nailed it in terms of how people have historically viewed “data integration” for analytical use cases. I do believe, however, that there are a number of other use cases that fall under the integration umbrella that once were separated into EAI/ESB (app integration) vs. ETL/EDW (data integration) and are also ripe for rethinking. Organizations should look to more modern platforms that don’t have this legacy distinction and are built for self service and able to connect data, apps, APIs and Things faster. SnapLogic’s head of engineering recently published 10 New Requirements for Modern Integration that you might also find interesting: http://www.snaplogic.com/blog/modern_data_integration_requirements/

    Keep up the great work with the #BBBT!

    Darren

  • Hello there! I just want to offer you a huge thumbs up for the excellent information you have here in this post.

    I will be coming back to your web site for more soon.
