Spark is all the rage in the world of big data, and it’s more than hype. Its in-memory processing engine is gaining momentum because of the speed at which it can process complex analytics and compute-intensive data integration work.
According to John O’Brien of Radiant Advisors in the recent research, Why Spark Matters, “One can accurately state that Spark is ‘the hot thing’ in big data these days. Its popularity has grown rapidly over the past 18 months, demonstrated by the fact that most practitioners are evaluating Spark as a component of their big data architectures. While skeptics may have wondered if Spark warrants the attention it has received, the marketplace is nevertheless embracing the processing engine as an addition to the Hadoop ecosystem. This is due both to its ability to address some of the shortcomings of MapReduce and to the demand for faster and more powerful processing for the full data pipeline.”
With Spark adoption exploding, it raises the question, “Now that I’m in the process of digesting the elephant (Hadoop), should I really undertake yet another open source project that could potentially consume my already scarce resources?” A rational answer starts with the three significant challenges that enterprise organizations face when they deploy Spark for high-performance, advanced analytics:
- Companies moving from well-defined, highly structured environments like data warehouses into less-defined, less-structured modern data platforms still require enterprise data standards.
- The move to modern data platforms like Hadoop has already consumed valuable resources due to the complexity of the migration; the difficulty of an additional move to Spark concerns most IT organizations.
- Even though Spark has the speed and analytic prowess necessary to process complex algorithms, data blending and enrichment still occupy 70-80% of the effort for every project.
While Spark has made it into many labs in large organizations, there is still a healthy hesitation about bringing it into production environments. Today’s announcement of Diyotta 3.5 includes technology to overcome all three of these significant challenges to Spark adoption. Let’s address them one at a time:
The move to modern data platforms like Hadoop has already consumed valuable resources due to the complexity of the migration; the difficulty of an additional move to Spark concerns most IT organizations.
To address the issue of complexity and the need for additional resources, Diyotta 3.5, with its deep Spark integration, uses an abstraction layer to hide the complexity of migrating existing data integration from both traditional and modern data platforms to Spark. From a user perspective, once you have your Spark instance up and running, all you need to do is register the instance and change one pull-down menu from your previous execution engine to Spark.
The simplicity of this change supports an incremental move to Spark: companies can start with MapReduce and Hive, move to Hive and Tez, and ultimately make the move to native Spark for the highest increase in performance. The years of data blending and enrichment work your organization has already done do not need to be reworked. Your investment is not lost in the migration; it is fully leveraged.
Companies moving from well-defined, highly structured environments like data warehouses into less-defined, less-structured modern data platforms still require enterprise data standards.
To address the ongoing need to support data governance, Diyotta 3.5 utilizes rich metadata capture, storage, and access, even for data stored or processed in modern data platforms like Hadoop and Spark. Features already present in the Diyotta Modern Data Integration Suite are now applied to Spark.
First, we capture data definitions and metadata from traditional platforms. Where it’s necessary, we can even reverse-engineer existing code with up to 80% accuracy. Second, we store that metadata in our platform. Third, we transfer that metadata into modern data platforms to the extent that these less-structured environments support metadata. And finally, we make sure that all centrally stored metadata is accessible to data engineers, data scientists, and end users. This rich metadata includes full lineage that can support the most robust data governance programs.
Even though Spark has the speed and analytic prowess necessary to process complex algorithms, data blending and enrichment still occupy 70-80% of the effort for every project.
While data integration continues to rise as a primary use case for Spark, our assumption is that not all integration processing will be done in-memory. Diyotta 3.5 continues to allow data engineers the freedom to leverage all platforms for their dataflows and to distribute processing to the platforms best suited for each phase of the flow.
For example, a customer with a Teradata or Netezza data warehouse may want to do some of the data transformation on their traditional platforms, utilizing the rich native functions already in operation there. The resulting data set can be loaded into Hadoop to be blended and enriched with less-structured data scraped from the internet and social media. Once that blending is complete, the data is moved to the Spark processing engine for final, heavy-lifting transformations, aggregations, and advanced analytics.
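To make the shape of that flow concrete, here is a minimal sketch in plain Python. It is not Diyotta's API and it does not connect to any real platform: the `Stage` class, the `run_flow` helper, the sample rows, and the platform labels are all hypothetical, chosen only to show how each phase of a dataflow can be tagged with the platform that runs it.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]

# Hypothetical sample data standing in for warehouse and social media sources.
SALES: List[Row] = [
    {"customer": "a", "region": "east", "amount": 100},
    {"customer": "b", "region": "west", "amount": 0},
    {"customer": "c", "region": "east", "amount": 50},
]
MENTIONS: Dict[str, int] = {"a": 3, "c": 1}


@dataclass
class Stage:
    """One phase of a dataflow; `platform` is only a label in this sketch."""
    name: str
    platform: str
    run: Callable[[List[Row]], List[Row]]


def transform_in_warehouse(rows: List[Row]) -> List[Row]:
    # Stand-in for a native warehouse transformation: drop empty sales.
    return [r for r in rows if r["amount"] > 0]


def blend_in_hadoop(rows: List[Row]) -> List[Row]:
    # Stand-in for blending with less-structured data: attach mention counts.
    return [{**r, "mentions": MENTIONS.get(r["customer"], 0)} for r in rows]


def aggregate_in_spark(rows: List[Row]) -> List[Row]:
    # Stand-in for the final aggregation: totals per region.
    totals: Dict[str, Row] = {}
    for r in rows:
        t = totals.setdefault(
            r["region"], {"region": r["region"], "amount": 0, "mentions": 0}
        )
        t["amount"] += r["amount"]
        t["mentions"] += r["mentions"]
    return list(totals.values())


def run_flow(stages: List[Stage], rows: List[Row]) -> Tuple[List[Row], List[str]]:
    """Run stages in order and record which platform handled each one."""
    trace: List[str] = []
    for stage in stages:
        rows = stage.run(rows)
        trace.append(f"{stage.platform}: {stage.name}")
    return rows, trace


FLOW = [
    Stage("transform", "warehouse", transform_in_warehouse),
    Stage("blend", "hadoop", blend_in_hadoop),
    Stage("aggregate", "spark", aggregate_in_spark),
]

result, trace = run_flow(FLOW, SALES)
print(result)
print(trace)
```

In this sketch, moving a phase to a different platform means changing one label on a `Stage` while the flow definition stays the same, which is the kind of separation an abstraction layer is meant to provide.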
For anyone who has hesitated to make a move to Spark, your wait is over. Diyotta 3.5 speeds the transition to Spark and makes Spark an integral part of your overall data ecosystem.