Apache Hive

Apache Hive is the data warehousing component of the Hadoop ecosystem. It works well with structured data and enables ad-hoc queries against large transactional datasets held in distributed storage. Hive queries are written in HiveQL (HQL), a language similar to SQL. HQL statements are broken down into one or more MapReduce jobs, which are then executed across a Hadoop cluster. As a result, Hive queries have high latency. Hive is not suited to every application, and certainly not to those that need the quick turnaround of a transactional database such as DB2. It is also a poor fit for write-heavy workloads, because Hive is designed primarily for reading and analyzing data.
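To illustrate why a SQL-like statement turns into MapReduce work, the sketch below mimics in plain Python how an aggregation such as SELECT region, SUM(amount) FROM sales GROUP BY region could be split into a map, a shuffle and a reduce phase. The table, its columns and the function names are hypothetical, and this is not Hive's actual execution plan, only the general idea.

```python
from collections import defaultdict

# Hypothetical rows of a "sales" table: (region, amount).
rows = [("east", 10), ("west", 5), ("east", 7), ("west", 1)]

def map_phase(records):
    """Emit one (key, value) pair per input row."""
    for region, amount in records:
        yield region, amount

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate the values of each key; here, a SUM per region."""
    return {key: sum(values) for key, values in groups.items()}

totals = reduce_phase(shuffle(map_phase(rows)))
print(totals)
```

On a real cluster each phase runs as distributed tasks with data written between stages, which is one reason the latency is high compared with a transactional database.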

Apache Spark ELT

As database technology has evolved, business users have looked for the cheapest possible way to store data. “Cold data” that is rarely accessed can take up premium storage space while adding little value to the business. Moving that data and the existing ELT workloads to Apache Spark can lower the cost per terabyte and make the project far more scalable. ELT stands for Extract, Load, Transform. It is a data integration process in which data is loaded into the target first and transformed there, rather than on an intermediate server as in ETL; here the target processing engine is the open source Apache Spark. The approach is mostly useful for processing the large data sets required for BI and Big Data analytics. For traditional data warehouse projects, Apache Spark ELT is a way to cleanse, filter, reformat and translate business data and prepare it for future analytics at lower cost and in less time.
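As a rough sketch of the load-then-transform order that the name ELT describes, the example below copies raw records into a target unchanged and only then cleanses, filters and reformats them with SQL inside that target. It uses Python's built-in sqlite3 module purely as a stand-in for the target engine; the table names and sample records are hypothetical, and a real deployment would run the transform step on Spark instead.

```python
import sqlite3

# Hypothetical raw records, extracted as-is from a source system.
raw_orders = [
    ("1001", " alice ", "19.90"),
    ("1002", "BOB", "5.00"),
    ("1003", "carol", "not-a-number"),
]

conn = sqlite3.connect(":memory:")  # stand-in for the target engine

# Load: copy the raw data into the target without changing it.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_orders)

# Transform: cleanse, filter and reformat inside the target using SQL.
conn.execute(
    """
    CREATE TABLE clean_orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           LOWER(TRIM(customer))     AS customer,
           CAST(amount AS REAL)      AS amount
    FROM raw_orders
    WHERE amount GLOB '[0-9]*.[0-9]*'
    """
)

clean = conn.execute("SELECT * FROM clean_orders ORDER BY order_id").fetchall()
print(clean)
```

Because the raw table is kept, the transformation can be rerun or changed later without extracting the data again.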

Big Data

For ages, people have been accumulating and analyzing data to understand and organize the world. In the analogue age, collecting and processing data was costly and time-consuming, and statisticians and other information analysts were generally limited to working with small data samples. The shift came with the digitalization of the world. As the name suggests, Big Data is a massive volume of data of different kinds (structured, unstructured, semi-structured) that accumulates at high velocity and holds valuable insights for key decision making. Traditional database techniques struggle to integrate or process data on this scale. What is required is a new set of tools with a fresh perspective on algorithms and analytics. Marketing, Media/Entertainment, Healthcare, Politics (with data-based electorate reports in many countries), Banking, Telecom, Education and Manufacturing have all changed their dynamics to accommodate the insights of Big Data.

Big Data Processing with Spark ETL

As Big Data keeps growing, a business needs to process and integrate data from various sources to make the right decisions. Apache Hadoop’s two-stage MapReduce operation is gradually being displaced by Apache Spark. Spark’s in-memory architecture allows user programs to load data into a cluster’s memory and query it repeatedly. Unlike traditional ETL, Big ETL today is powered by open source projects such as Spark, with a community of developers and users working hand in hand to improve them. Long, busy ETL pipelines become easier to streamline. Big Data processing with Apache Spark ETL has become a viable choice for many because it is user friendly, quick and cheap, and can deliver analytics in near real time.


Cloudera

Cloudera is a modern platform for data management and analytics. It was one of the first companies to turn Hadoop into an enterprise data hub. As enterprises needed more services on top of Hadoop, such as security, provisioning and management, Cloudera set out to make Hadoop’s technology available at scale. Enterprise manageability and data processing and analysis are Cloudera’s best-known domains. Its ability to make Big Data instrumental to a company’s success explains its wide use across many organizations and domains.

Data Extraction and load into Hadoop

There are two aspects to this idea: extracting the data and loading it into Hadoop. Data extraction is the process of retrieving data from data sources, often poorly structured ones, to prepare it for storage and analytics. It is one of the primary steps in modernizing data integration from a traditional data warehouse model to a data lake model. Apache Hadoop is an open source software framework that stores data on a massive scale and runs many applications on clusters of commodity hardware. Hadoop’s distributed computing model processes big data quickly and can take in data from almost any source, whether on business premises or in the cloud. Data is extracted from systems such as relational databases or logs in data warehouses, put into a more structured format for querying and analysis, and loaded into Hadoop.

Data Integration

Data integration is the combination of technical and business processes used to bring data of different kinds, held in different places, together into meaningful and valuable business information. To date, data integration remains a manual operation in many companies. Many companies that work with Big Data on a day-to-day basis use various data integration software. However, even in larger corporations, beyond the enterprise data warehouse, data marts or reporting databases, getting data out of enterprise applications for simple reporting is often handled with hand-coded SQL scripts, and many use SQL to move data between different points for integration. Modern data integration builds on technologies and processes beyond the primary ETL functions. There is a need for added data quality, data governance and profiling, and data reliability. Today, modern data integration offers interoperable multi-platform solutions as well as cloud-based solutions in which the footprint of the integration tool is very light. Modern data integration can make or break a business: it underpins sound business analytics, multichannel marketing, intelligent business automation, alignment with digital transformation and a competitive edge.

Data Integration with Apache Spark

Data integration is the process of transforming Big Data (structured, semi-structured and unstructured) into a more useful set of information for a given business use case. Apache Spark is a fast cluster computing engine for processing and integrating very large data sets. Spark’s expressive development APIs allow data workers to efficiently run streaming jobs and integrate SQL workloads that require fast, iterative access to data sets. With Spark running on Hadoop YARN, developers can use Spark’s processing power to integrate data, derive insights and enrich data within a single, shared dataset in Hadoop. With Apache Spark, modern data integration has become a highly cost-effective and timely enterprise solution.

Data Lake

A data lake is a vast storage repository that holds data in its native format until it is queried. The data structure and requirements are not defined until the data is needed. A data lake uses a flat architecture to store data, unlike a hierarchical data warehouse, which stores data in files or folders. Each data element in a data lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question comes up, the data lake can be queried for the relevant data, and that smaller subset can then be analyzed to help answer the question.
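The idea of a flat store with unique identifiers and metadata tags can be sketched in a few lines of Python. The class below is a toy model under stated assumptions, not how any particular data lake product is implemented: the DataLake class, its put and query methods and the tag names are all hypothetical.

```python
import uuid

class DataLake:
    """Toy flat store: every object sits at the same level, keyed by an ID."""

    def __init__(self):
        self._objects = {}  # object_id -> (raw payload, metadata tags)

    def put(self, payload: bytes, **tags) -> str:
        """Store the payload in its native form and return its unique ID."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = (payload, dict(tags))
        return object_id

    def query(self, **wanted):
        """Return the IDs of objects whose tags match every requested tag."""
        return [
            object_id
            for object_id, (_, tags) in self._objects.items()
            if all(tags.get(key) == value for key, value in wanted.items())
        ]

lake = DataLake()
log_id = lake.put(b"GET /index.html 200", source="weblog", fmt="text")
csv_id = lake.put(b"id,amount\n1,9.5\n", source="billing", fmt="csv")

# A business question about billing only needs the matching subset.
matches = lake.query(source="billing")
```

Nothing about the payload's structure is declared at write time; only the tags are used to find the subset worth analyzing.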

Data Loading with Hadoop

One of the primary objectives of a data warehouse is to bring disparate data formats into one unified data system that helps in forming key business decisions. Data loading is the process of copying data sets from a source file, folder or application into a data store or a processing platform such as Hadoop. Data is extracted from a data source and transformed into a format supported by the destination application, where it is then loaded. The Apache Hadoop platform includes the Hadoop Distributed File System (HDFS), which is designed for scalability as well as fault tolerance. HDFS stores large files by dividing them into blocks (typically 64 or 128 MB) and replicating each block on three or more servers. Capacity and performance can be scaled by adding DataNodes, while a single NameNode manages data placement and monitors server availability. HDFS clusters in production use today reliably hold petabytes of data on thousands of nodes.
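The block and replication arithmetic can be made concrete with a small calculation. The helper below is a hypothetical illustration rather than part of any Hadoop API; it assumes a 128 MB block size and a replication factor of three by default, both of which are configurable on a real cluster.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Return (block count, stored block replicas, raw storage in MB)."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The last block only occupies as much space as the data it holds,
    # so raw storage is the file size times the replication factor.
    raw_storage_mb = file_size_mb * replication
    return blocks, replicas, raw_storage_mb

print(hdfs_footprint(1000))
print(hdfs_footprint(1000, block_size_mb=64))
```

A 1000 MB file therefore becomes 8 blocks of up to 128 MB, stored as 24 block replicas that take about 3000 MB of raw disk across the cluster.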

ETL (Extraction, Transformation, Loading)

The methodology and tasks of ETL are not relevant to data warehousing alone. A myriad of proprietary applications and database systems are key to the functioning of any enterprise. For data to be shared between applications or systems, it has to be integrated so that at least two applications have the same view of it; that is what this mechanism is all about.

Extraction of Data

During extraction, the desired data is identified and extracted from various sources, such as database systems and applications. It is not always possible to identify the specific subset of interest in advance. In that case more data than necessary has to be extracted, and the relevant data is identified later. Depending on the source system’s capabilities (for example, operating system resources), some transformations may take place during the extraction process. The size of the extracted data varies from hundreds of kilobytes up to gigabytes, depending on the source system and the business situation.


HBase

HBase is a column-oriented storage system that lets users look up data stored in Hadoop by key, much as they would use an index in a conventional RDBMS. It is an open-source NoSQL database built on Hadoop and modeled after Google Bigtable. One of its key features is random access with strong consistency for large data sets, whether unstructured or semi-structured, in a schemaless database organized by column families. The open-source code enables handling petabytes of data on thousands of nodes. It can rely on the data redundancy, batch processing and other features that the Hadoop ecosystem offers to distributed applications.
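The data model described here, rows addressed by key with values grouped into column families, can be pictured as nested dictionaries. The sketch below is only a mental model in plain Python, not the HBase client API; the put and get helpers, the row keys and the column names are hypothetical, and real HBase additionally keeps timestamped versions of each cell.

```python
from collections import defaultdict

# row key -> column family -> column qualifier -> value
table = defaultdict(lambda: defaultdict(dict))

def put(row_key, family, qualifier, value):
    """Write one cell; rows need not share the same set of columns."""
    table[row_key][family][qualifier] = value

def get(row_key, family, qualifier, default=None):
    """Random access to a single cell by row key, family and qualifier."""
    row = table.get(row_key)
    if row is None:
        return default
    return row.get(family, {}).get(qualifier, default)

put("user#42", "profile", "name", "Ada")
put("user#42", "activity", "last_login", "2024-01-01")
put("user#7", "profile", "email", "g@example.com")

name = get("user#42", "profile", "name")
missing = get("user#7", "profile", "name")
```

The two rows carry different columns without any schema change, which is what "schemaless" means in this context.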


Mainframe

A mainframe, often called “big iron”, is a very large and fast computing system capable of supporting hundreds or even thousands of users simultaneously. Mainframe computers can offer greater availability and security than smaller-scale machines. Historically, mainframes have been associated with centralized rather than distributed computing, although that distinction is blurring as smaller computers take on more powerful roles and mainframes become more multi-purpose. Organizations use mainframes for specialized and critical applications that involve bulk data processing, such as market analysis, enterprise resource planning and the analysis of customer behaviour around a product or service.


Apache Pig

Apache Pig is a high-level procedural language platform that does not require very advanced training or programming skills. It is meant to simplify querying large data sets in Hadoop and MapReduce. Apache Pig provides easier access to data held in Hadoop clusters and offers a means of analyzing large datasets with Hadoop applications. It makes it possible to implement simple or complex workflows and to designate multiple data inputs, which can then be processed by multiple operations.

Transportation of Data

After extraction, the data has to be physically transported to the target system or to an intermediate system for further processing. Depending on the chosen means of transportation, some transformations can be done during this process, too. Real-Time Transport Protocol (RTP) is an internet protocol standard that specifies a way for programs to manage the real-time transmission of multimedia data over either unicast or multicast network services. RTP is widely used in internet telephony applications. However, it does not itself guarantee real-time delivery of multimedia data, which depends largely on the network. What RTP does provide is the information needed to manage and coordinate the data as it arrives.
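What RTP gives a receiver to work with is easiest to see in its fixed 12-byte packet header, defined in RFC 3550, which carries a sequence number and a timestamp among other fields. The sketch below parses that header with Python's struct module; the function name and the sample packet are made up for illustration, and header extensions and CSRC lists are not handled.

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Decode the fixed 12-byte RTP header at the start of a packet."""
    if len(packet) < 12:
        raise ValueError("an RTP packet needs at least 12 header bytes")
    # Network byte order: two bytes of flags, then 16-bit sequence number,
    # 32-bit timestamp and 32-bit synchronization source (SSRC) identifier.
    first, second, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": first >> 6,
        "padding": bool(first & 0x20),
        "extension": bool(first & 0x10),
        "csrc_count": first & 0x0F,
        "marker": bool(second & 0x80),
        "payload_type": second & 0x7F,
        "sequence_number": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }

# Made-up packet: version 2, payload type 0, sequence 17, timestamp 160.
sample = struct.pack("!BBHII", 0x80, 0, 17, 160, 0x1234ABCD) + b"payload"
header = parse_rtp_header(sample)
```

A receiver can use the sequence number to detect loss and reordering and the timestamp to schedule playback, even though delivery itself is left to the network.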