Introduction to Apache Spark
Jul 11, 2019
MapReduce and Spark are both used for large-scale data processing. However, MapReduce has some shortcomings that render Spark more useful in a number of scenarios.
Shortcomings of MapReduce
- Every workflow has to go through a map and a reduce phase: it can't accommodate a join, a filter, or more complicated workflows such as map-reduce-map (see the sketch after this list).
- MapReduce relies heavily on reading data from disk: this is a performance bottleneck, and especially bad for iterative algorithms that may cycle through the data several times.
- Only a native Java programming interface is available: Python can be used as well, but it makes implementation complex and is not very efficient for floating-point data.
- Programming it is not easy and requires lots of hand coding.
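For contrast, here is a minimal PySpark sketch of the kind of workflow MapReduce struggles with: a filter, a join, and an aggregation chained in a single job, with an intermediate result cached in memory for reuse. The file names and column names (orders.json, users.json, year, user_id, country) are hypothetical placeholders, not from any real dataset.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("workflow-sketch").getOrCreate()

# Assumed input files; substitute real paths and schemas.
orders = spark.read.json("orders.json")
users = spark.read.json("users.json")

# Filter, join, and aggregate in one pipeline -- no intermediate
# disk writes between stages, unlike chained MapReduce jobs.
recent = orders.filter(orders.year >= 2018)
joined = recent.join(users, on="user_id")

# Cache in memory so repeated passes avoid re-reading from disk,
# which helps iterative algorithms in particular.
joined.cache()

totals = joined.groupBy("country").count()
totals.show()

spark.stop()
```

In MapReduce, each of these stages would typically be a separate job writing its output to disk; here they compose into one program.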
Solution — Apache Spark
- A new framework: not a complete replacement for the Hadoop stack, but a replacement for Hadoop MapReduce, and more.
- Capable of using the Hadoop ecosystem, e.g., HDFS and YARN, as sketched below.
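As a hedged sketch of that second point (the HDFS path and application name are assumptions for illustration), a Spark application can schedule its executors via YARN and read its input directly from HDFS:

```python
from pyspark.sql import SparkSession

# Hypothetical sketch: run Spark on YARN and read input from HDFS.
spark = (
    SparkSession.builder
    .appName("hadoop-ecosystem-sketch")
    .master("yarn")  # schedule executors through Hadoop YARN
    .getOrCreate()
)

# Read a text file stored on HDFS (path is a placeholder), count its lines.
lines = spark.read.text("hdfs:///data/input/logs.txt")
print(lines.count())

spark.stop()
```

This is what "not a replacement for the Hadoop stack" means in practice: Spark swaps out the MapReduce engine while reusing Hadoop's storage and resource management layers.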
Solutions by Spark
- Spark provides over 20 highly…