Introduction to Apache Spark
Jul 11, 2019
MapReduce and Spark are both used for large-scale data processing. However, MapReduce has some shortcomings that render Spark more useful in a number of scenarios.
Shortcomings of MapReduce
- Every workflow has to go through a map and a reduce phase: it can't accommodate a join, a filter, or more complicated workflows such as map-reduce-map (see the sketch after this list).
- MapReduce relies heavily on reading data from disk: this is a performance bottleneck, and especially bad for iterative algorithms that may cycle through the data several times.
- Only a native Java programming interface is available: Python can be used as well, but it makes implementation complex and is not very efficient for floating-point data.
- Programming it is not easy and requires lots of hand coding.
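For contrast, here is a minimal PySpark sketch of the kind of workflow MapReduce struggles with: a filter, a join, and an aggregation chained in a single job, with an intermediate result cached in memory for reuse. The file names and column names (orders.json, users.json, year, user_id, country) are hypothetical placeholders, not from any real dataset.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("workflow-sketch").getOrCreate()

# Assumed input files; substitute real paths and schemas.
orders = spark.read.json("orders.json")
users = spark.read.json("users.json")

# Filter, join, and aggregate in one pipeline -- no intermediate
# disk writes between stages, unlike chained MapReduce jobs.
recent = orders.filter(orders.year >= 2018)
joined = recent.join(users, on="user_id")

# Cache in memory so repeated passes avoid re-reading from disk,
# which helps iterative algorithms in particular.
joined.cache()

totals = joined.groupBy("country").count()
totals.show()

spark.stop()
```

In MapReduce, each of these stages would typically be a separate job writing its output to disk; here they compose into one program.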
Solution — Apache Spark
- A new framework: not a complete replacement for the Hadoop stack, but a replacement for Hadoop MapReduce, and more.
- Capable of using the Hadoop ecosystem, e.g., HDFS and YARN, as sketched below.
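As a hedged sketch of that second point (the HDFS path and application name are assumptions for illustration), a Spark application can schedule its executors via YARN and read its input directly from HDFS:

```python
from pyspark.sql import SparkSession

# Hypothetical sketch: run Spark on YARN and read input from HDFS.
spark = (
    SparkSession.builder
    .appName("hadoop-ecosystem-sketch")
    .master("yarn")  # schedule executors through Hadoop YARN
    .getOrCreate()
)

# Read a text file stored on HDFS (path is a placeholder), count its lines.
lines = spark.read.text("hdfs:///data/input/logs.txt")
print(lines.count())

spark.stop()
```

This is what "not a replacement for the Hadoop stack" means in practice: Spark swaps out the MapReduce engine while reusing Hadoop's storage and resource management layers.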
Solutions by Spark
- Spark provides over 20 highly…