Data lakes provide a cost-effective central repository of raw digital data in its native format (structured, semi-structured, or unstructured) for data discovery, analysis, and provisioning to other applications. The ability to draw on information from multiple sources, both internal and external, for rich analysis and insight is extremely valuable. Unfortunately, simply "dumping" data into a data lake often leads to information chaos and murky business value. Information chaos occurs when redundant, near-duplicate, or undecipherable data elements, sets, or files are difficult to distinguish from one another, or their contents are simply unknown.

Data Governance for Data Lakes

When a data lake suffers from information chaos, most individuals are reluctant to use it because of the additional work required. Without data governance, figuring out where data came from and what happened to it between its origin and its present state in the lake can be extremely frustrating and time-consuming. Data governance for data lakes provides that clarity. However, many applications that provide data governance for data lakes focus only on the lake's own ecosystem: the activities applied to the data before it lands in the lake are seldom recorded in an easily understandable manner, if at all.

Comprehensive Data Lineage and Governance

Comprehensive data governance for data lakes requires data lineage: the historical record of the activities and events applied to data from its system of origin to its current state. Comprehensive lineage must start where the data originated and include the details of its movement from the system of record, along with every activity applied to the data both before it lands in the data lake and while it resides there. While custom scripting of data movement and integration is common practice for data lakes, it provides no lineage capabilities, and deciphering a custom script to understand everything it does can be very difficult.
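The idea of lineage as a historical record can be sketched in a few lines of code. The structure below is purely illustrative, not any product's actual data model: each dataset carries an append-only list of events recording which activity ran on which system, so the full path from system of record to the lake can be replayed.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

@dataclass
class LineageEvent:
    """One recorded activity applied to a dataset (e.g. extract, transform, land)."""
    activity: str    # what was done, e.g. "extract" or "mask-pii"
    system: str      # the system where the activity ran
    timestamp: str   # ISO-8601 time the event was recorded

@dataclass
class LineageRecord:
    """Historical record of a dataset from system of origin to the data lake."""
    dataset: str
    origin_system: str
    events: List[LineageEvent] = field(default_factory=list)

    def record(self, activity: str, system: str) -> None:
        """Append an event; the list is never rewritten, only extended."""
        self.events.append(LineageEvent(
            activity, system, datetime.now(timezone.utc).isoformat()))

    def trace(self) -> List[str]:
        """Return the dataset's history as readable steps, oldest first."""
        return [f"{e.timestamp} {e.system}: {e.activity}" for e in self.events]

# Trace a hypothetical customer extract from its system of record into the lake.
rec = LineageRecord("customers.csv", origin_system="CRM")
rec.record("extract", "CRM")
rec.record("mask-pii", "staging")
rec.record("land-in-lake", "hadoop-lake")
```

A custom integration script performs the same three steps but leaves no such record behind, which is exactly the gap described above.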

Transparent Data Lakes

For transparent data lakes, an efficient and scalable approach is needed for curating, cataloguing, and ingesting data while utilizing the processing power of existing Big Data platforms such as Hadoop, Massively Parallel Processing (MPP) platforms, and NoSQL. Data lineage should visually represent the data's path from source through movement and transformation activities to landing, with definitions at each stage of the process. In addition, the capability to catalogue selected data and organize datasets as they are ingested into a data lake is vital for data discovery and analytics.
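Cataloguing at ingestion time can be illustrated with a minimal sketch. The catalog structure, function names, and fields below are hypothetical, chosen only to show the principle: a dataset is registered with its source, schema, and discovery tags at the moment it lands, so it can be found later without spelunking through raw files.

```python
# In-memory stand-in for a data catalog; a real one would be a persistent store.
catalog = {}

def ingest_and_catalog(name, source, schema, tags):
    """Register a dataset in the catalog at ingestion time so it can be
    discovered later by name, source system, or tag."""
    catalog[name] = {
        "source": source,   # system of record the data came from
        "schema": schema,   # column names and types at landing
        "tags": tags,       # labels used for discovery
    }

def discover(tag):
    """Find all catalogued datasets carrying a given discovery tag."""
    return [name for name, meta in catalog.items() if tag in meta["tags"]]

ingest_and_catalog(
    "sales_2017", source="POS",
    schema={"store_id": "int", "amount": "decimal"},
    tags=["sales", "finance"])
ingest_and_catalog(
    "web_clicks", source="clickstream",
    schema={"session_id": "string", "url": "string"},
    tags=["marketing"])

print(discover("finance"))  # → ['sales_2017']
```

Without this step, an analyst confronting the raw lake has only file names to go on, which is how information chaos takes hold.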

Diyotta’s Modern Data Integration (MDI) Suite orchestrates the movement and integration of Big Data. The MDI Suite obtains metadata from the system of record and generates metadata from the data flows designed within the module, thereby providing comprehensive data lineage. The MDI Suite’s metadata repository can integrate with data governance applications such as IBM Information Governance Catalog and Apache Atlas for Hadoop to provide transparency for data lakes.

To learn more about Diyotta’s Modern Data Integration Suite, please visit www.diyotta.com/products.