2nd Principle of Modern Data Integration
Posted By Jonathan Wu
27 April 2016
This post is the second of five explaining the principles of modern data integration, an approach designed to address modern and big data needs.
The first principle of modern data integration is to “take the processing to where the data lives.” Its objective is to use host systems for specific processing, which creates efficiency because only the data that is needed is prepared and moved.
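As a rough illustration of the first principle, the sketch below compares extracting every row and filtering afterwards with letting the source system filter before anything is moved. It is only a sketch under stated assumptions: an in-memory SQLite database stands in for an operational system, and the table `call_records`, its columns, and the sample rows are hypothetical rather than taken from any real deployment.

```python
import sqlite3


def build_source():
    """Create a stand-in 'operational system' (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE call_records ("
        "id INTEGER PRIMARY KEY, caller TEXT, duration_s INTEGER, country TEXT)"
    )
    rows = [
        ("555-0100", 42, "US"),
        ("555-0101", 3600, "US"),
        ("555-0102", 15, "CA"),
        ("555-0103", 7200, "US"),
    ]
    conn.executemany(
        "INSERT INTO call_records (caller, duration_s, country) VALUES (?, ?, ?)",
        rows,
    )
    return conn


def extract_everything_then_filter(conn, min_duration_s):
    """Move every row out of the source, then filter on the client side."""
    all_rows = conn.execute(
        "SELECT caller, duration_s FROM call_records ORDER BY id"
    ).fetchall()
    kept = [row for row in all_rows if row[1] >= min_duration_s]
    return kept, len(all_rows)  # second value: rows that left the source


def filter_at_source(conn, min_duration_s):
    """Let the source apply the filter so only the needed rows are moved."""
    kept = conn.execute(
        "SELECT caller, duration_s FROM call_records "
        "WHERE duration_s >= ? ORDER BY id",
        (min_duration_s,),
    ).fetchall()
    return kept, len(kept)


if __name__ == "__main__":
    source = build_source()
    print(extract_everything_then_filter(source, 3600))
    print(filter_at_source(source, 3600))
```

Both functions return the same records, but the second moves only the rows that pass the filter, which is the efficiency the first principle is after.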
The second principle of modern data integration is to “fully leverage all platforms based on what they were designed to do well.” Together with the first principle, it is intended to create an optimal balance of processing and workload across platforms.
Let’s take a look at Sprint’s use case, where billions of call data records are generated each day by its operational systems. To perform fraud analytics, Sprint’s operational systems supply over 100 data feeds to Hadoop, a distributed storage and processing framework, and the data is then provisioned to an application for fraud detection. On the operational systems, processing is focused on sourcing the selected data and moving it to Hadoop. As with most operational systems, data processing is limited so as not to impact data entry and collection performance. Hadoop serves as the data repository, collecting all of the data feeds from the operational systems. It is designed to distribute processing across a cluster, and its built-in functions can handle the data integration requirements of modern and big data. After Hadoop’s capabilities have been applied, the integrated and transformed data is provisioned to an analytics application for fraud detection.
The flow of data in the Sprint use case highlights the function of each platform. Operational systems are designed for data collection, entry, and generation, with the byproduct of supplying the data needed for analysis. Hadoop is designed for distributed processing and handling big data. With over 100 data feeds, 12 billion records, and 6 terabytes of data each day, Hadoop provides an economical and efficient platform for processing the data. The analytics application is designed for output in the form of analysis, alerts, and reports. Across these three platforms, the processing of data ranges from extraction and transmission to extensive data blending and enrichment. Every platform is given the workloads that it handles well with its set of built-in functions.
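To put the daily volumes quoted above in perspective, here is a small back-of-the-envelope calculation. It is only a sketch: it assumes decimal terabytes (10**12 bytes), an evenly spread load over 24 hours, and exactly 100 feeds (the post says "over 100," so the per-feed average is an upper bound). The function name `daily_load_summary` is made up for this illustration.

```python
# Figures quoted in the post; decimal terabytes are an assumption.
RECORDS_PER_DAY = 12_000_000_000
BYTES_PER_DAY = 6 * 10**12
FEEDS = 100  # "over 100" in the post, so the per-feed average is an upper bound
SECONDS_PER_DAY = 24 * 60 * 60


def daily_load_summary(records=RECORDS_PER_DAY, total_bytes=BYTES_PER_DAY, feeds=FEEDS):
    """Derive average rates from the daily totals, assuming an even load."""
    return {
        "avg_bytes_per_record": total_bytes / records,
        "records_per_second": records / SECONDS_PER_DAY,
        "megabytes_per_second": total_bytes / SECONDS_PER_DAY / 10**6,
        "avg_records_per_feed": records / feeds,
    }


if __name__ == "__main__":
    for name, value in daily_load_summary().items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions the average record is about 500 bytes, and the platform has to absorb roughly 139,000 records (about 69 MB) every second, sustained around the clock. Real traffic is unlikely to be this even, so peak rates would be higher.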
The objective of the second principle of modern data integration is to use the source and target platforms for the capabilities they were built to provide. Modern data integration allows you to call those native functions locally, process data within powerful platforms, and distribute the workload in the most efficient way possible. The other three principles of modern data integration continue to build upon the first two principles to form an efficient and highly effective architecture to address modern and big data.