Expressiveness

MapReduce requires every computation to be expressed in terms of Map and Reduce operations, which makes complex processing logic difficult to describe.

In addition to Map and Reduce, Spark provides a rich set of operators over several data abstractions, such as RDD, DataFrame, and Dataset, giving a far more flexible programming model.
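As a rough sketch of the difference, the Scala pipeline below chains five operators (filter, flatMap, map, reduceByKey, sortBy) in a few lines; classic MapReduce would need at least two separate Map/Reduce jobs (one for the aggregation, one for the sort). The input path is a placeholder.

```scala
import org.apache.spark.sql.SparkSession

object ExpressivenessDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("expressiveness-demo")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical input file: lines of free text.
    val lines = spark.sparkContext.textFile("input.txt")

    // Five chained operators; in classic MapReduce the reduceByKey and the
    // sortBy alone would each require a separate Map/Reduce job.
    val topWords = lines
      .filter(_.nonEmpty)
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)

    topWords.take(10).foreach(println)
    spark.stop()
  }
}
```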

Disk I/O

MapReduce reads its input from disk and writes its results back to disk after every operation; only a small amount of data is held in memory as a temporary cache, so the disk I/O overhead is high.

Spark keeps intermediate results in memory, which both speeds up iteration and avoids a great deal of recomputation. According to the official benchmark, the per-iteration running time is roughly Hadoop : Spark = 110 : 0.9.
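A minimal sketch of why this matters for iterative workloads: caching the parsed input once keeps every subsequent pass in memory, whereas a chain of MapReduce jobs would re-read the data from HDFS on each pass. The file path and iteration count are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object IterativeCacheDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("iterative-cache-demo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Parse once, then pin the partitions in executor memory.
    // Without cache(), every iteration would re-read and re-parse the file.
    val points = sc.textFile("points.txt")
      .map(_.split(",").map(_.toDouble))
      .cache()

    var sum = 0.0
    for (_ <- 1 to 10) {
      // Each pass reads the cached partitions from memory, not from disk.
      sum = points.map(_.sum).reduce(_ + _)
    }
    println(s"checksum after 10 passes: $sum")
    spark.stop()
  }
}
```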

Task latency

MapReduce divides a job into a series of stages, each involving disk I/O. Stages are strictly serialized: the next one cannot start until the previous one has finished.

Spark schedules tasks based on a DAG (directed acyclic graph). Transformations within a stage are pipelined in memory with no intermediate disk I/O, so iterative operations run with much lower latency.
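A sketch of the DAG behavior, using placeholder data: transformations are lazy and only build the graph; narrow operations like filter and map are pipelined into one stage, and only the shuffle for reduceByKey introduces a stage boundary. Nothing executes until an action is called.

```scala
import org.apache.spark.sql.SparkSession

object DagDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dag-demo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val nums = sc.parallelize(1 to 1000000)

    // filter and map are narrow transformations: Spark pipelines them into a
    // single stage and applies them in memory, record by record. No job runs yet.
    val evensSquared = nums.filter(_ % 2 == 0).map(n => (n % 10, n.toLong * n))

    // reduceByKey requires a shuffle, so it starts a new stage; count() is the
    // action that finally triggers execution of the whole DAG.
    val distinctKeys = evensSquared.reduceByKey(_ + _).count()
    println(s"distinct keys: $distinctKeys")
    spark.stop()
  }
}
```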

Memory management

When a MapReduce job starts, each task's JVM is given a fixed maximum heap size, and the task can never use more memory than that limit.

When the configured heap is not enough, Spark can additionally use operating-system memory outside the JVM heap, guaranteeing a baseline of memory while avoiding the resource waste of over-allocating memory up front.
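A sketch of the relevant configuration, assuming Spark's documented settings: a fixed executor heap plus optional off-heap memory managed outside the JVM. The sizes below are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object MemoryConfigDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("memory-config-demo")
      .master("local[*]")
      // Fixed JVM heap per executor; placeholder size.
      .config("spark.executor.memory", "4g")
      // Let Spark manage memory outside the JVM heap (OS memory),
      // so the heap does not have to be over-provisioned up front.
      .config("spark.memory.offHeap.enabled", "true")
      .config("spark.memory.offHeap.size", "2g")
      .getOrCreate()

    // ... job code ...
    spark.stop()
  }
}
```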

Parallel processing

In MapReduce, each task runs in its own process, and a process executes tasks one at a time, in sequence.

In Spark, each task runs as a thread inside a long-lived executor process, which increases parallelism and avoids per-task process startup costs.
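A sketch of the thread-based model: with master local[4], a single JVM runs up to four tasks concurrently, one per thread, where MapReduce would fork a process per task. The partition count is a placeholder.

```scala
import org.apache.spark.sql.SparkSession

object ThreadParallelismDemo {
  def main(args: Array[String]): Unit = {
    // local[4]: one JVM whose executor runs up to 4 tasks at once, one per thread.
    val spark = SparkSession.builder()
      .appName("thread-parallelism-demo")
      .master("local[4]")
      .getOrCreate()
    val sc = spark.sparkContext

    // 8 partitions -> 8 tasks, executed 4 at a time on the 4 worker threads.
    val data = sc.parallelize(1 to 8, numSlices = 8)
    val threads = data.map(_ => Thread.currentThread().getName).distinct().collect()
    threads.foreach(println)
    spark.stop()
  }
}
```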