The Increasingly Distributed Data Landscape
Over the last several years, data complexity has increased dramatically in many companies and continues to grow as more data is captured and more databases and object stores emerge to store it. This includes both internal and external data sources.
- Core transaction processing systems are now on the cloud as well as on-premises.
- NoSQL databases are being adopted to allow web and mobile commerce applications to capture non-transactional data at scale.
- The adoption of cloud storage is also increasing – especially for capturing big data.
- Multiple data warehouses have been built creating islands of overlapping historical data.
- Big data platforms like Hadoop and Graph DBMSs have entered the enterprise extending analytical environments beyond the data warehouse.
The Modern Analytical Ecosystem
The emergence of big data has resulted in new analytical workloads that are not well suited to traditional data warehouse environments. These workloads, typically driven by data characteristics (variety, velocity and volume) and the types of analysis required, have caused many companies to extend their analytical setup beyond the data warehouse to include multiple analytical data stores. Multiple platforms now exist in the enterprise to support different analytical workloads. As a result, data integration and data movement have increased rapidly across data stores in this new analytical ecosystem.
The distributed data landscape is causing increased complexity. Different data integration technologies are being used in different parts of the ecosystem. Both production and agile self-service data integration technologies are in use, and silos have emerged. In addition, many companies are rapidly approaching a 'data deluge', with data arriving faster than they can consume it. The conclusion here is obvious. There has to be a better, more governed way to fuel productivity and agility without causing data inconsistency and chaos. Everyone for themselves is not an option.
Data Integration Use Cases in a Distributed Lake
- Data is being collected via streaming, batch ingest, replication and archiving with some data too big to move once captured.
- Data lakes / reservoirs are increasingly becoming distributed.
- Compliance with different data privacy laws in different jurisdictions around the world is a key reason why some data will be kept apart.
- Data integration software should exploit the power of underlying platforms to scale ETL processing.
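The last point above, pushdown ETL, can be sketched in a few lines: rather than extracting rows into a separate integration engine, the tool sends the transformation to the data platform as SQL and lets the platform's own engine do the work. This is a minimal illustration, not Diyotta's actual mechanism; the table and predicate names are invented, and an in-memory SQLite database stands in for a scalable platform.

```python
import sqlite3

def pushdown_transform(conn, source_table, target_table, predicate):
    """Run the transformation inside the data platform via generated SQL
    (pushdown) instead of pulling rows into the ETL engine.
    Table and predicate names here are illustrative."""
    conn.execute(
        f"CREATE TABLE {target_table} AS "
        f"SELECT * FROM {source_table} WHERE {predicate}"
    )
    conn.commit()

# Demo: an in-memory SQLite database stands in for a scalable platform.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 50.0), (2, 250.0), (3, 900.0)])
pushdown_transform(conn, "orders", "large_orders", "amount > 100")
print(conn.execute("SELECT COUNT(*) FROM large_orders").fetchone()[0])  # 2
```

The key property is that only SQL travels over the wire; the data itself never leaves the platform, which is what lets processing scale with the platform and avoids moving data that is "too big to move".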
Managing Distributed Data Integration Using Diyotta
Data Integration Requirements in a Distributed Landscape
- Integrate multiple data types in-motion and at rest
- Define once, execute anywhere
- Pushdown processing to exploit scalable platforms
- Execute in a hybrid environment
- Nest workflows and invoke 3rd party data integration jobs
- Support rule versioning for compliance
- Data integration on-demand and as a service
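"Define once, execute anywhere" generally means a transformation is specified once in a platform-neutral form and then rendered for whichever platform it must run on. The sketch below shows one way that idea can work; the mapping fields, dialect templates and table names are all assumptions for illustration, not Diyotta's internal representation.

```python
# Hypothetical "define once, execute anywhere" illustration:
# one logical mapping rendered as SQL for different target platforms.
MAPPING = {
    "source": "sales",
    "target": "sales_summary",
    "group_by": "region",
    "measure": "revenue",
}

# CTAS syntax differs per platform; these templates are assumptions.
DIALECTS = {
    "hive": ("CREATE TABLE {target} AS SELECT {group_by}, "
             "SUM({measure}) AS total FROM {source} GROUP BY {group_by}"),
    "teradata": ("CREATE TABLE {target} AS (SELECT {group_by}, "
                 "SUM({measure}) AS total FROM {source} "
                 "GROUP BY {group_by}) WITH DATA"),
}

def compile_mapping(mapping, dialect):
    """Render the platform-neutral mapping as SQL for one platform."""
    return DIALECTS[dialect].format(**mapping)

print(compile_mapping(MAPPING, "hive"))
print(compile_mapping(MAPPING, "teradata"))
```

Because the mapping itself carries no platform-specific syntax, moving a workload between platforms only requires selecting a different dialect, which is also what makes hybrid (cloud plus on-premises) execution practical.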
The explosion in the number of data sources, together with the need to analyse new types of data has led many companies to extend their analytical environments beyond the data warehouse to include new data stores and platforms optimised for new analytical workloads. The data is becoming harder to access because it is in multiple data stores and multiple formats and yet, paradoxically, business is demanding more and more agility, together with the ability to respond much more rapidly than ever before.
In this kind of environment, companies need new tools to manage and govern data ingestion, data integration and data movement across workload-optimised analytical systems. They also need the ability to scale to handle volume and velocity as required. Diyotta Data Integration Suite is a clear candidate technology to deal with this new set of requirements, address the data deluge, unify data integration silos and allow companies to remain agile in a distributed data landscape.