Introduction to Apache Spark

Saloni Goyal
4 min read · Jul 11, 2019

MapReduce and Spark are both used for large-scale data processing. However, MapReduce has several shortcomings that make Spark the more useful choice in a number of scenarios.

Shortcomings of MapReduce

  1. Every workflow has to go through a map and a reduce phase: it cannot accommodate a join, a filter, or more complicated workflows like map-reduce-map (see the sketch after this list).
  2. MapReduce relies heavily on reading data from disk: every intermediate result is written to disk and read back, a performance bottleneck that is especially bad for iterative algorithms that cycle through the same data several times.
  3. Only a native Java programming interface is available: Python can be used (via streaming), but it makes the implementation complex and is not very efficient for floating-point data.
  4. It is not easy to program and requires a lot of hand-coding even for common operations.
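To make the first two shortcomings concrete, here is a minimal PySpark sketch (the data and field layout are hypothetical) of a filter-join-map pipeline that runs as a single Spark job, but would require chaining several MapReduce jobs with a disk round-trip between each:

```python
from pyspark.sql import SparkSession

# Start a local Spark session; on a cluster this would point at YARN.
spark = SparkSession.builder.appName("pipeline-example").getOrCreate()
sc = spark.sparkContext

# Hypothetical data: (user_id, name) pairs and (user_id, amount) pairs.
users = sc.parallelize([(1, "ana"), (2, "bob"), (3, "carol")])
purchases = sc.parallelize([(1, 30.0), (1, 12.5), (3, 99.9)])

# filter -> join -> map as one workflow; MapReduce would need a
# separate job for each of these stages.
big = purchases.filter(lambda kv: kv[1] > 20.0)
joined = users.join(big)  # (user_id, (name, amount))
report = joined.map(lambda kv: (kv[1][0], kv[1][1]))

# cache() keeps the result in memory, so an iterative algorithm can
# re-scan it without re-reading anything from disk on each pass.
report.cache()
print(report.collect())
```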

Solution — Apache Spark

  • A new framework: not a complete replacement of the Hadoop stack, but a replacement for Hadoop MapReduce, and more
  • Capable of using the existing Hadoop ecosystem, e.g., HDFS and YARN (see the sketch below)
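As a minimal sketch of how Spark reuses the Hadoop ecosystem (assuming a cluster with YARN configured; the HDFS path is hypothetical), YARN schedules the executors while HDFS serves the input:

```python
from pyspark.sql import SparkSession

# Run on an existing Hadoop cluster: YARN does the scheduling,
# HDFS holds the data. The log path below is hypothetical.
spark = (SparkSession.builder
         .appName("spark-on-hadoop")
         .master("yarn")
         .getOrCreate())

log_lines = spark.sparkContext.textFile("hdfs:///data/logs/2019-07-11.log")
errors = log_lines.filter(lambda line: "ERROR" in line)
print(errors.count())
```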

Solutions by Spark

  1. Spark provides over 20 highly efficient distributed operations (map, filter, join, groupByKey, and more) that can be chained into arbitrary workflows, rather than being forced through a single map and reduce phase.
  2. Intermediate results can be kept in memory (by caching an RDD), so iterative algorithms avoid MapReduce's repeated disk reads.
  3. Concise APIs are available in Scala, Java, Python, and R, not just native Java.
  4. The high-level programming model means operations that take dozens of lines of hand-coded MapReduce shrink to a few lines of Spark.
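The conciseness point is easiest to see with the classic word count: a full MapReduce implementation takes dozens of lines of Java, while the Spark version is a handful of lines. A minimal sketch (input and output paths are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

# The entire word count: split lines into words, emit (word, 1) for
# each occurrence, then sum the counts per word.
counts = (sc.textFile("hdfs:///data/books/*.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("hdfs:///data/wordcounts")
```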
