1st Principle of Modern Data Integration
Posted By Jonathan Wu
19 April 2016
This post is the first of five explaining the principles of modern data integration, an approach designed to address modern and Big Data needs.
Traditional Approaches Now Obsolete
The principles of modern data integration grew out of the challenges of working with traditional data integration technologies, which are architected to extract data from a relational database management system (RDBMS) and move it to an intermediary integration server for transformation and processing. After the data has been processed, it is loaded into a target database. This approach, also referred to as Extraction, Transformation and Load (ETL), works well for low data volumes and where the source and target systems are RDBMS. However, as data volumes grow, ETL becomes a bottleneck because the performance of the integration server(s) degrades. Adding ETL integration servers can offset the degradation, but it is a costly proposition that quickly becomes cost prohibitive. Other issues with ETL arise when new data types appear and different platforms are introduced. ETL often does not work in these cases due to incompatibility, which forces organizations to create workaround solutions for modern and Big Data movement and integration. These drawbacks call for a different paradigm for data integration scenarios such as loading data into Hadoop, moving data between Hadoop and platforms such as Teradata or Netezza, and offloading ETL from slow legacy ETL server(s) onto newer, more robust data architectures.
Take the Processing to Where the Data Lives
The first principle of modern data integration is to “take the processing to where the data lives.” Transactional systems such as accounting, payroll, e-commerce, logistics management, and customer relationship management, along with external data sources such as information brokers and sensors, are generating tremendous volumes of data. Most organizations have multiple transactional systems to facilitate their operational activities and many external sources that supply data. There is too much data to move every time the need to blend or transform it arises.
The logic of taking the processing to where the data lives is simple: place agents where the data lives and process it locally. An agent is software code that executes the instructions it receives. Carefully coordinated instructions are sent to the agent, and the work is done on the host platform before any data is moved. For example, concatenating the first name and last name of a customer to create a full name, or calculating the profit margin of each product by subtracting the cost from the sales price, could be performed on the host system before the results are moved. By taking the processing to where the data lives, you eliminate the bottleneck of the ETL server and decrease the movement of data across the network.
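To make the two examples above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular product: an in-memory SQLite database stands in for the host system, the table and column names are hypothetical, and the run_on_host function plays the role of the agent by executing the instruction it receives on the host and returning only the results.

```python
import sqlite3


def build_source():
    """Create an in-memory stand-in for a host system (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
        CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, sales_price REAL, cost REAL);
        INSERT INTO customers VALUES (1, 'Ada', 'Lovelace'), (2, 'Alan', 'Turing');
        INSERT INTO products VALUES (1, 'Widget', 10.0, 6.5), (2, 'Gadget', 25.0, 20.0);
        """
    )
    return conn


def run_on_host(conn, instruction):
    """Play the role of the agent: run the instruction where the data lives
    and hand back only the result rows."""
    return conn.execute(instruction).fetchall()


# Instructions sent to the agent. The concatenation and the subtraction are
# evaluated by the host database, not by an intermediary integration server.
FULL_NAME_SQL = (
    "SELECT id, first_name || ' ' || last_name AS full_name "
    "FROM customers ORDER BY id"
)
MARGIN_SQL = (
    "SELECT id, sales_price - cost AS profit_margin "
    "FROM products ORDER BY id"
)

source = build_source()
full_names = run_on_host(source, FULL_NAME_SQL)
margins = run_on_host(source, MARGIN_SQL)

print(full_names)
print(margins)
```

Only the derived columns leave the host here. The raw first name, last name, sales price, and cost never cross the network, which is the reduction in data movement the principle is after.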
The objective of the 1st principle of modern data integration is to use host systems for specific processing so that only the data that is needed is prepared and moved. The other four principles of modern data integration build upon each other to form an efficient and highly effective architecture for modern and Big Data.