Accelerating Data Lake Implementations
Posted By Jonathan Wu
12 June 2016
Organizations are addressing their big data and modernization needs with data lakes, which have become an important component of an enterprise information management strategy and architecture. A data lake is a repository for raw digital data in its native format, whether structured, semi-structured, or unstructured, and is often used to offload historical data from operational systems and to collect data feeds from external sources and cloud-based applications. Hadoop, as a cost-effective data lake technology platform, provides massive storage and processing capabilities. In many cases, we are seeing organizations use a data lake as a collection and processing point that feeds data sets into their existing data warehouses and data marts to support analytical, reporting, and discovery capabilities.
While numerous data lake use cases and benefits have been publicized, the greatest challenge to adoption is getting data into and out of Hadoop. Today, with the data landscape widely distributed across on-premises systems and the cloud, one of the most common pain points enterprises face is how to ingest data into the data lake seamlessly and efficiently. A common practice is to write custom code for moving and ingesting data into Hadoop, which requires talented and skilled professionals. If Hadoop is new to an organization, in-house staff will need time and training to master the knowledge required to create that code. An alternative is to retain external consultants, which can be an expensive approach. And even after the custom code has been developed and the consultants paid for their services, maintaining it when new data feeds are needed or existing feeds change remains a challenge. Five factors need to be addressed before considering custom code a viable option: 1) in-house skills, 2) enterprise development standards, 3) sustainability in production, 4) total cost of ownership, and 5) future scalability. Sprint studied these five factors before implementing Hadoop.
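To make the custom-code option concrete, here is a minimal, hypothetical sketch of the kind of hand-written ingestion logic described above: it validates incoming CSV records and lands them in a date-partitioned layout of the sort commonly used in Hadoop landing zones. A local directory stands in for HDFS, and the `ingest_csv` function, feed name, and directory convention are illustrative assumptions, not Diyotta's or Sprint's actual code.

```python
import csv
import io
import os
import tempfile
from datetime import date

def ingest_csv(source, landing_root, feed_name, load_date):
    """Parse CSV records and land them in a date-partitioned
    directory layout (e.g. feed=orders/dt=2016-06-12/part-0.csv),
    mirroring a common HDFS landing-zone convention.
    NOTE: writes to a local path standing in for HDFS."""
    reader = csv.reader(source)
    header = next(reader)
    partition = os.path.join(
        landing_root, f"feed={feed_name}", f"dt={load_date.isoformat()}")
    os.makedirs(partition, exist_ok=True)
    out_path = os.path.join(partition, "part-0.csv")
    count = 0
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        for row in reader:
            # Drop malformed records whose field count
            # does not match the header.
            if len(row) != len(header):
                continue
            writer.writerow(row)
            count += 1
    return out_path, count

# Usage: ingest a small in-memory feed into a temporary landing zone.
feed = io.StringIO("id,amount\n1,9.99\n2,19.50\nbad_row\n3,4.25\n")
root = tempfile.mkdtemp()
path, n = ingest_csv(feed, root, "orders", date(2016, 6, 12))
print(path, n)  # 3 valid records landed; the malformed row is dropped
```

Even in this toy form, the sketch hints at the maintenance burden the paragraph describes: every new feed means new parsing, validation, and partitioning decisions encoded by hand.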
In addition to evaluating these technical considerations, the project team at Sprint also had to address the business drivers for the initiative. The business team needed access to the data as soon as possible and could not wait 10 to 12 months for custom code to be created. They also needed to provision data from Hadoop into downstream analytical applications and could not afford to have traditional data integration technology become the bottleneck in the data movement and transformation process. After ruling out custom code development and benchmarking data integration technologies, Sprint selected Diyotta as its modern data integration technology for Hadoop platforms. With the Diyotta Modern Data Integration Suite, Sprint moves and integrates about 12 billion records, approximately 6 terabytes of data, per day into Hadoop. Sprint also found that its in-house team using Diyotta's technology completed the data integration development three times faster, and at a quarter of the cost, compared with hiring external consultants to develop custom code.
Diyotta provides a modern, productive, and sustainable approach to ingesting data into a Hadoop data lake, and it leverages the processing power of existing data platforms such as Hadoop, Teradata, and others. Diyotta's Modern Data Integration Suite is a proven technology that accelerates data lake implementations.
To learn more about Diyotta, please visit our website: www.diyotta.com.