Best Practices for Building A Data Lake in Hadoop

Posted by Diyotta Inc.

25 August 2015

 


Big Data is the ‘oil’ of the 21st century, and just as oil is inherently useless when unrefined, so is data. It is no surprise that the modern data environment has changed drastically in the age of big data and the Internet of Things. Not only has the technology changed; so have the data types. The classic approach of placing all data into the traditional data warehouse environment before analyzing it has become a bottleneck.

Avoiding the Data Swamp

Enter the data lake, a term that has gained much traction, acting as the landing area for raw data from the many, and ever-increasing, data sources in organizations. While there are differing opinions on the data lake, one thing is clear: it’s all about how you manage it. We want to focus on managing this landing zone to ensure the data lake solution addresses organizational needs, is sustainable and maintainable, and, most importantly, does not turn into a swamp.

Typically, the landing zone is a set of tables that mimic source definitions. Data in the landing zone is usually not persisted, and its tables are explicitly modeled and managed. This kind of landing zone is never a good option for any type of analytics: users typically do not have access to it, and it is mainly suited to structured data in a classic data warehouse architecture. It’s pretty limiting, isn’t it?

There is much to learn from the classic landing zone, such as data governance and metadata lineage, but there is also much to overcome: long development times for data ingestion pipelines, the constraint of pre-defined schemas, and limitations in data storage, loading, and access methods. The data lake itself brings a whole slew of issues of its own, including a lack of data discovery as well as data refinement and data security concerns. Clearly, we are in desperate need of a different type of landing zone.

This new type of landing zone should have three crucial components. First, it should be able to load any and every type of data, from structured (classic) to semi-structured (email, social media, text) and unstructured (media, logs, sensors). Second, organizations should not have to build the model in advance, decide up front what to load and what not to load, or be constrained by the “classic” path of building the schema before loading. Third, it should enable agility in business analytics, getting data analyzed faster and allowing deeper insights that previously were not possible.
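The first two components amount to schema-on-read: land the raw files first and derive the structure when you read them. As a minimal sketch of that pattern, assuming a Hadoop cluster with Spark available (the HDFS paths and dataset names here are hypothetical, not from any specific product):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-landing").getOrCreate()

# Land semi-structured event data as-is; no target schema is built in advance.
# Spark derives the structure at read time (schema-on-read).
events = spark.read.json("hdfs:///lake/raw/events/")

# The same pattern covers delimited extracts from classic, structured sources.
orders = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("hdfs:///lake/raw/orders/"))

# Inspect the structure after loading rather than defining it before.
events.printSchema()
orders.printSchema()
```

The point is the order of operations: files land in the lake immediately, and modeling happens later, at read time, instead of gating ingestion.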
So what is the key to building a better data lake solution?

Fast & Seamless Loading – The solution should not create development bottlenecks for data ingestion pipelines; it should allow any type of data to be loaded seamlessly and consistently. Data ingestion pipelines into Hadoop should be fast and automated and should manage target structures, without requiring target schemas to be built before loading. You should be able to stand up a new data lake initiative within days.

Flexibility & Discovery – A data lake solution should not lock you into a technology pattern or behavior. It should allow flexible data refinement policies and automatic data discovery, and provide an agile development environment.

Quick Data Transformation – You should not have to take data out of Hadoop to prepare and transform it. Instead, you should be able to rapidly apply out-of-the-box transformations that are ready for use, and make those transformations in the native environment (see the sketch after this list).

Audit, Balance & Control – You should get accurate statistics and load-control data for better insight into your processes, with integrated capabilities that turn those statistics into an operational dashboard.

Metadata Lineage Insight – You should get forward and backward, step-by-step, full lineage at the entity and attribute level, with built-in impact analysis for managing changes faster.

Operations Visibility – The data lake solution should provide real-time operations monitoring and debugging capabilities, with real-time alerts on new data arrivals.
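As a hedged sketch of the transformation and audit points above, again assuming Spark on the Hadoop cluster (the column names, paths, and audit layout are illustrative assumptions, not a prescribed design): the data is refined in place, and simple load statistics are written back into the lake.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lake-refinement").getOrCreate()

# Refine raw landed data inside the cluster; nothing leaves Hadoop for an
# external ETL server.
raw = spark.read.json("hdfs:///lake/raw/events/")

refined = (raw
           .filter(F.col("event_id").isNotNull())   # drop unusable records
           .dropDuplicates(["event_id"])            # de-duplicate on the key
           .withColumn("event_date", F.to_date("event_ts")))

# Audit, balance & control: capture basic load statistics so an operational
# dashboard can report on every run.
stats = spark.createDataFrame(
    [(raw.count(), refined.count())],
    ["rows_read", "rows_written"],
).withColumn("load_ts", F.current_timestamp())

refined.write.mode("append").partitionBy("event_date").parquet(
    "hdfs:///lake/refined/events/")
stats.write.mode("append").parquet("hdfs:///lake/audit/load_stats/")
```

Keeping the statistics as just another dataset in the lake makes the audit trail queryable with the same tools as the data itself.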

Most importantly, you should be able to use existing skill sets without the need to learn new and rapidly evolving technologies. In order to extract the most value out of your data, you need to be able to adapt quickly and integrate your data seamlessly. So, how do you get the most out of your ‘oil’? We will give you a hint: it’s called Modern Data Integration, and it’s the key to refining your data and keeping your data lake from becoming a data swamp.

Subscribe to our Blog!

Learn more about Modern Data Integration and upcoming Diyotta events, webinars, and product updates.
