
Parallel Computation Frameworks in Data Engineering

In this article, our focus is on the parallel computation frameworks that are currently gaining prominence in data engineering.

In data engineering, you often have to pull in data from several sources, then combine, clean, or aggregate it.

What is Parallel Computing?

Parallel computing is the foundation of nearly all contemporary data processing tools. But why has it become so crucial in the context of big data? The primary factors are memory and processing power, but mostly memory.

When big data processing tools perform a processing task, they split it into several smaller subtasks. The processing tools then distribute these subtasks over several computers. These are usually commodity computers, which means they are widely available and relatively inexpensive.

If each computer processed the entire task alone, it would take a significant amount of time. However, by dividing the task into smaller subtasks and having multiple computers work on them simultaneously, the task can be completed much faster.
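
As a toy illustration (not a big data tool), the same idea can be sketched with Python's standard multiprocessing module, where local worker processes stand in for the computers in a cluster:

# split a task into subtasks and run them in parallel on local worker processes
from multiprocessing import Pool

def subtask(chunk):
    # each worker computes a partial result on its own subset of the data
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]        # split the task into 4 subtasks
    with Pool(processes=4) as pool:
        partial_sums = pool.map(subtask, chunks)   # run the subtasks in parallel
    print(sum(partial_sums))                       # combine the partial results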

Benefits of Parallel Computing

Having multiple processing units provides the obvious benefit of extra processing power. However, for big data, parallel computing offers a second advantage that often has an even bigger impact: memory. In short, the benefits are:

  • The extra processing power.

  • Instead of needing to load all of the data into one computer's memory, you can partition the data and load the subsets into the memory of different computers.

  • The memory footprint per computer stays relatively small, so the data can fit in the memory closest to the processor, the RAM.

Risks of Parallel Computing

It's important to keep in mind that there's a cost involved in switching to parallel computing. When you split a task into smaller subtasks and then combine the results, the processes need to communicate with each other. This communication overhead can become a bottleneck if the task is small or if you have only a few processing units, because the benefit of splitting the work may not outweigh the cost.

For example, if you only have two processing units, it might not be worth splitting a task that only takes a few hundred milliseconds to complete. Additionally, because of the overhead, the speed increase won't be linear. This effect is known as parallel slowdown.
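
A rough local sketch of this effect is shown below; exact timings depend on your machine, but for a task this tiny the start-up and communication overhead of the worker processes usually outweighs any gain:

# parallel slowdown in miniature: overhead dominates for a tiny task
import time
from multiprocessing import Pool

def tiny_task(x):
    return x * x

if __name__ == "__main__":
    data = list(range(1_000))

    start = time.perf_counter()
    serial = [tiny_task(x) for x in data]
    print("serial:  ", time.perf_counter() - start)

    start = time.perf_counter()
    with Pool(processes=2) as pool:
        parallel = pool.map(tiny_task, data)       # cost of starting and talking to workers
    print("parallel:", time.perf_counter() - start)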

Quiz on Parallel Computing

Check your knowledge from this article by answering the question below.

Q.1. Why parallel computing?

Question: You've seen the benefits of parallel computing. However, you've also seen it's not the silver bullet to fix all problems related to computing.

Which of these statements is NOT correct?

Select one answer from the options below:

  • Parallel computing can be used to speed up any task.

  • Parallel computing can optimize the use of multiple processing units.

  • Parallel computing can optimize the use of memory between several machines.


Parallel Computing Frameworks

Alright, now that we understand the fundamentals of parallel computing, let's look at the parallel computing frameworks that are currently popular in the data engineering industry.

1. Hadoop

Apache Hadoop

If you've explored big data ecosystems at all, chances are you're familiar with Hadoop.

Hadoop is a collection of open-source projects maintained by the Apache Software Foundation. A few of them may seem a touch outdated, but they are still relevant to our data engineering journey.

There are two Hadoop projects we want to focus on in this article:

  1. HDFS
  2. MapReduce

HDFS

HDFS stands for Hadoop Distributed File System. It resembles the file system on your personal computer, with the key distinction that the files reside across multiple computers.

HDFS has been pivotal in the world of big data and, by extension, parallel computing. Today, cloud-managed storage systems such as Amazon S3 often take its place.

MapReduce

MapReduce was one of the pioneering paradigms for processing big data: a program divides a task into subtasks and distributes the workload and data across several processing units.

In the case of MapReduce, these processing units take the form of several computers within the cluster.
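
To make the paradigm concrete, here is a toy word-count sketch in plain Python that mimics the map, shuffle, and reduce phases; Hadoop itself would distribute these steps across the cluster:

# conceptual word count: map, shuffle, and reduce phases in plain Python
from collections import defaultdict

lines = ["spark and hadoop", "hadoop and hive"]

# Map phase: turn every input record into (key, value) pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the pairs by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine the values for each key
reduced = {word: sum(counts) for word, counts in grouped.items()}
print(reduced)   # {'spark': 1, 'and': 2, 'hadoop': 2, 'hive': 1}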

Nonetheless, MapReduce had its drawbacks, with one being the challenge of composing MapReduce jobs. A plethora of software applications emerged to tackle this issue, and one of these was Hive.

2. Hive

Hive sits as a layer on top of the Hadoop ecosystem, making data from various sources queryable in a structured manner through Hive's variant of SQL: HiveQL.

Initially developed by Facebook, the project is now maintained by the Apache Software Foundation. Though Hive jobs originally ran on MapReduce, Hive has since integrated with various other data processing engines.

Hive: An Illustration

Let's examine an example:

This Hive query calculates the mean age of Grand Slam players per year of their participation.

-- calculates the mean age of tennis players per year
SELECT year, AVG(age)
FROM view.grand_slam_events
GROUP BY year

As anticipated, this query bears a striking resemblance to a conventional SQL query. Yet, beneath the surface, this query undergoes transformation into a task that can operate across a cluster of computers.

Hive query => Hive => Hadoop MapReduce
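
Because Hive has since integrated with engines beyond MapReduce, the same query could, for example, be submitted through Spark's SQL interface. The sketch below is only illustrative: it assumes a Spark installation built with Hive support and that the table from the example exists in the Hive metastore.

# hedged sketch: running the HiveQL query on Spark instead of MapReduce
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("hive-on-spark")
    .enableHiveSupport()      # requires a Spark build with Hive support
    .getOrCreate())

result = spark.sql("""
    SELECT year, AVG(age)
    FROM view.grand_slam_events
    GROUP BY year
""")
result.show()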

3. Spark

Apache Spark

The other parallel computation framework we'll cover is Spark. Spark distributes data processing tasks across clusters of computers. While MapReduce-based systems tend to need expensive disk writes between jobs, Spark tries to keep as much processing as possible in memory. In this regard, Spark is also a response to the limitations of MapReduce.

The disk writes inherent in MapReduce posed particular constraints in interactive exploratory data analysis, where each step builds upon its precursor.

Spark originated at UC Berkeley's AMPLab. The Apache Software Foundation currently maintains the project.

Resilient Distributed Datasets (RDD)

Spark's architecture rests upon a data structure called resilient distributed datasets, or RDDs.

Without going into technical intricacies, an RDD is a data structure that maintains data distributed across multiple nodes. Unlike DataFrames, RDDs lack named columns. Conceptually, you might envision an RDD as a list of tuples.

We can perform two categories of operations on these data structures:

  1. Transformations, such as map or filter.

  2. Actions, such as count or first.

Transformations yield new, transformed RDDs, whereas actions yield a single result.
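
Here is a minimal sketch of both kinds of operations, assuming a local Spark installation; the data and variable names are made up for illustration:

# transformations vs. actions on an RDD (local Spark, illustrative data)
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# an RDD of (player, age) tuples distributed across the local workers
ages = sc.parallelize([("Federer", 36), ("Nadal", 31), ("Williams", 36)])

# Transformation: returns a new, lazily evaluated RDD
over_35 = ages.filter(lambda row: row[1] > 35)

# Actions: trigger the computation and return a single result
print(over_35.count())   # 2
print(over_35.first())   # ('Federer', 36)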

PySpark

When dealing with Spark, developers commonly use a programming language interface like PySpark. PySpark is the Python interface for Spark.

PySpark encompasses a DataFrame abstraction, enabling operations reminiscent of those performed on pandas DataFrames. PySpark, in conjunction with Spark, manages all the intricate parallel computing operations.

Note: Interfaces to Spark also exist in languages like R or Scala.

PySpark: An Illustration

Consider the following PySpark example. Similar to the Hive Query previously encountered, it computes the average age of tennis players based on the year of the Grand Slam event. Instead of utilising the SQL abstraction, as seen in the Hive example, it employs the DataFrame abstraction.

# DataFrame abstraction in PySpark

# grand_slam_events_spark is a Spark DataFrame with Year and Age columns
(grand_slam_events_spark
    .groupBy('Year')    # group the rows by tournament year
    .mean('Age')        # compute the average age per group
    .show())            # action: trigger the computation and print the result
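
For completeness, here is a hedged sketch of how grand_slam_events_spark might have been created; the file path and options are assumptions for illustration:

# hypothetical setup for the example above
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("grand-slam").getOrCreate()

# assumed CSV file containing (at least) Year and Age columns
grand_slam_events_spark = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("grand_slam_events.csv"))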

Summary

Spark is a highly efficient data processing tool that distributes tasks across a cluster of computers. Its memory-based processing approach avoids many of the expensive disk writes found in MapReduce systems, resulting in significantly faster and more cost-effective data processing.

Apache Spark serves as an excellent answer to the limitations of MapReduce.

Hadoop:

  • Collection of open source packages for Big Data
  • MapReduce is a part of Hadoop architecture
  • HDFS is also part of Hadoop

Hive:

  • Initially used Hadoop MapReduce
  • It was built out of the need to run structured queries on top of parallel processing

PySpark:

  • Uses the DataFrame abstraction
  • Is the Python interface for the Spark framework

Conclusion: In the big data world, Spark is probably the more popular choice for data processing.

