Source:

blog.csdn.net/forward__/a…

blog.csdn.net/ibeifeng8/a…

blog.csdn.net/luanpeng825…


1. Connections and differences between Hadoop and Spark

They solve problems at different levels

First, Hadoop and Spark are both big-data frameworks, but they exist for very different purposes. Hadoop is essentially distributed data infrastructure: it spreads huge data sets across the nodes of a cluster of commodity machines for storage, which means you don’t need to buy and maintain expensive server hardware. Hadoop also indexes and tracks that data, making big-data processing and analysis far more efficient than before.

Spark, by contrast, is a tool for processing big data that lives in distributed storage; it does not provide distributed storage of its own.

The two can be used together or separately

Beyond the distributed storage layer known as HDFS, Hadoop also ships with a data-processing engine called MapReduce, so you can skip Spark entirely and use Hadoop’s own MapReduce to process your data. Conversely, Spark does not have to be attached to Hadoop to survive. But, as noted above, it provides no file-management system of its own, so it must be paired with some distributed file system: Hadoop’s HDFS, or another cloud-based data platform. In practice Spark is still deployed on Hadoop by default, and the combination is widely considered the best, as the sketch below illustrates.
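To make the pairing concrete, here is a minimal PySpark sketch of Spark reading data out of HDFS, assuming a running Hadoop cluster with `pyspark` installed; the namenode address and path are hypothetical:

```python
# A minimal sketch of Spark reading from HDFS. Spark has no storage
# layer of its own: it reads from whatever distributed file system it
# is pointed at, here HDFS. The namenode address and path are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-on-hdfs").getOrCreate()

logs = spark.read.text("hdfs://namenode:9000/data/logs/")  # hypothetical path
print(logs.count())
spark.stop()
```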

What follows is perhaps the most concise explanation of MapReduce, excerpted from the web:

We want to count all the books in the library. You count shelf 1 and I count shelf 2. That’s Map: the more people we have, the faster the counting goes.

Now we get together and add up everyone’s tallies. That’s Reduce.
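To make the analogy concrete, here is a minimal sketch of that counting job in plain Python (no Hadoop involved), with map and reduce as ordinary functions:

```python
# Each "person" maps a shelf to a count (Map), then the counts are
# combined into one total (Reduce). Shelf contents are toy data.
from functools import reduce

shelves = [
    ["book"] * 120,  # shelf 1
    ["book"] * 95,   # shelf 2
    ["book"] * 143,  # shelf 3
]

counts = map(len, shelves)                    # Map: count each shelf independently
total = reduce(lambda a, b: a + b, counts)    # Reduce: add up everyone's tallies
print(total)                                  # 358
```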

Spark leaves MapReduce in the dust on processing speed

Spark is much faster than MapReduce because of how it processes data. MapReduce works in steps: “It reads data from the cluster, runs one processing pass, writes the results back to the cluster, reads the updated data from the cluster, runs the next pass, writes those results back to the cluster, and so on,” explains Kirk Borne, a data scientist at Booz Allen Hamilton.

Spark, on the other hand, does all of its data analysis in memory, in near real time: “It reads the data from the cluster, performs all the required analysis, writes the results back to the cluster. Done,” Borne says. Spark’s batch processing is nearly 10 times faster than MapReduce’s, and its in-memory analysis is nearly 100 times faster.
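For a feel of the in-memory model Borne describes, here is a minimal PySpark sketch, assuming a local installation (`pip install pyspark`). `cache()` keeps the dataset in memory, so the second pass does not re-read it from storage:

```python
# A minimal sketch of Spark's in-memory processing on toy data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("in-memory-demo").getOrCreate()

numbers = spark.sparkContext.parallelize(range(1_000_000))

# cache() keeps the RDD's partitions in memory, so repeated passes
# over the same data avoid re-reading from storage.
numbers.cache()

total = numbers.reduce(lambda a, b: a + b)             # first pass fills the cache
evens = numbers.filter(lambda n: n % 2 == 0).count()   # second pass reads from memory

print(total, evens)
spark.stop()
```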

MapReduce is perfectly acceptable if the data you process and the results you need are mostly static, and you have the patience to wait for batch jobs to finish.

But if you need to analyze streaming data, such as readings collected from sensors on a factory floor, or if your application requires multiple passes over the data, you probably want Spark.

Most machine learning algorithms require multiple passes over the data (see the sketch below). Spark is also commonly used for real-time marketing campaigns, online product recommendations, network security analytics, and machine log monitoring.
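Here is a minimal sketch of that multiple-pass pattern, assuming a local PySpark setup: a toy gradient-descent loop that re-scans the same cached RDD once per iteration, which is exactly where in-memory data pays off. The data, learning rate, and model are made up for illustration:

```python
# Toy gradient descent fitting y = w * x over a cached RDD; each
# iteration is a full pass over the same data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("multi-pass").getOrCreate()

# Toy data: points (x, y) lying on the line y = 3x.
points = spark.sparkContext.parallelize([(x, 3.0 * x) for x in range(100)]).cache()

w = 0.0      # model weight
lr = 3e-4    # learning rate, chosen small enough to converge on this data
for _ in range(20):
    # Mean gradient of the squared error, computed in one pass.
    grad = points.map(lambda p: (w * p[0] - p[1]) * p[0]).mean()
    w -= lr * grad

print(round(w, 2))  # approaches 3.0
spark.stop()
```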

Disaster recovery

The two recover from failures in different ways, but both do it well. Because Hadoop writes data back to disk after every processing step, it is inherently resilient to system errors.

Spark stores its data objects in Resilient Distributed Datasets (RDDs) spread across the cluster. “These data objects can be kept in memory or on disk, so an RDD also provides full disaster-recovery capability,” Borne points out.
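Here is a minimal sketch of that memory-or-disk choice, assuming a local PySpark setup: `StorageLevel` controls where an RDD’s partitions live, and if a node fails, lost partitions are recomputed from the RDD’s lineage.

```python
# Persisting an RDD with an explicit storage level.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-storage").getOrCreate()

rdd = spark.sparkContext.parallelize(range(10_000))

# Keep partitions in memory, spilling to disk when memory is tight;
# MEMORY_ONLY and DISK_ONLY are the other ends of the spectrum.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

print(rdd.sum())
spark.stop()
```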

2. Hive: Hive is a data-warehouse tool built on Hadoop. It maps structured data files (and even unstructured data) onto database tables, provides a simple SQL-like query language, and translates those SQL statements into MapReduce jobs. Its advantage is a low learning curve: simple MapReduce statistics can be produced with SQL-like statements, with no need to develop a dedicated MapReduce application, which makes it a very good fit for statistical analysis in a data warehouse. With Hive, instead of writing MapReduce, you write SQL.
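To make the “write SQL instead of MapReduce” point concrete, here is a minimal sketch using Spark SQL locally as a stand-in for Hive (the table and column names are made up); in Hive proper, essentially the same SELECT would run as HiveQL and be compiled into MapReduce jobs:

```python
# One SQL statement replaces a hand-written aggregation job.
# Table and column names ("orders", "user", "amount") are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sql-demo").getOrCreate()

orders = spark.createDataFrame(
    [("alice", 30.0), ("bob", 12.5), ("alice", 7.5)],
    ["user", "amount"],
)
orders.createOrReplaceTempView("orders")

spark.sql("SELECT user, SUM(amount) AS total FROM orders GROUP BY user").show()
spark.stop()
```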
