Author: magic good

Source: Hang Seng LIGHT Cloud Community

The concept of Hadoop and its evolution

Hadoop originated in Nutch. Nutch was designed as a large, whole-web search engine with web crawling, indexing, querying, and so on, but as the number of crawled pages grew, it ran into a serious scalability problem: how to store and index billions of web pages.

Two papers published by Google in 2003 and 2004 provided a feasible solution to the problem.

  • GFS, a distributed file system that can handle the storage of massive numbers of web pages.
  • MapReduce, a distributed computing framework that can handle the indexing of massive numbers of web pages.

Nutch’s developers completed the corresponding open source implementations, HDFS and MapReduce, which were then spun off from Nutch into a separate project, Hadoop. Hadoop became an Apache top-level project in January 2008 (the same year Cloudera was founded) and entered a period of rapid growth.

  • Broadly speaking, Hadoop refers to the big data ecosystem, which includes a great deal of other software.
  • In the narrow sense, Hadoop refers to the Apache Hadoop software alone.

This section describes the historical versions of Hadoop.

0.x series: the earliest open source versions of Hadoop, used mainly abroad, since big data had not yet taken off in China at the time. The 1.x and 2.x series evolved from this base.

1.x series: the second generation of open source Hadoop, mainly fixing bugs in the 0.x series.

2.x series: a major architectural change that introduced many new features, such as the YARN platform. It is the most widely used series in China, since the country was in its big data boom at the time.

3.x series: introduces some important features and optimizations, including HDFS erasure coding, support for multiple NameNodes (two or more), MapReduce native task optimization, YARN cgroup-based memory and disk I/O isolation, and a minimum JDK requirement of JDK 1.8. It was released late and is not used much yet, but it is set to become mainstream.

Introduction to the three corporate distributions of Hadoop

– The free open source Apache version

Official website: hadoop.apache.org/

Advantages: with open source contributors all over the world, the code iterates and updates quickly

Disadvantages: version upgrades, version maintenance, version compatibility, and version patching may not be handled very thoroughly. It is fine for learning, but avoid it in production environments as far as possible

Where to download all Apache software (including historical versions):

archive.apache.org/dist/

– The free and open source Hortonworks version

Official website: hortonworks.com/

Hortonworks was founded by Yahoo's vice president of Hadoop development, who led some two dozen core members out to form the company. Its core products, HDP and HDF, are free and open source, and they come with a complete web management interface, Ambari (ambari.apache.org/), through which we can manage the state of our cluster.

– The paid Cloudera Manager software

Official website: www.cloudera.com/

Cloudera is a US big data company. Starting from the open source Apache Hadoop, it applies various internal patches to guarantee stable operation across versions, and it ships a matching version of every piece of software in the big data ecosystem, which solves problems such as difficult upgrades and version incompatibilities.

Hadoop module composition

  1. Hadoop HDFS: a distributed file system with high reliability and high throughput.
  2. Hadoop MapReduce: a distributed offline parallel computing framework (a minimal sketch follows this list).
  3. Hadoop YARN: a framework for job scheduling and cluster resource management.
  4. Hadoop Common: a utility module that supports the other modules.
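
To make the division of labor concrete, here is a minimal sketch, essentially the classic WordCount program from the Hadoop tutorials: HDFS stores the input and output files, MapReduce provides the map/reduce programming model, and YARN schedules the job's tasks across the cluster. The input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: for every line read from the HDFS input split, emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // args[0] and args[1] are HDFS paths; YARN schedules the map/reduce tasks.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```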

The architecture model of Hadoop

NameNode and ResourceManager single-node architecture model

File system core modules:

  • NameNode: the master node of the cluster; it manages the cluster's metadata
  • SecondaryNameNode: assists the NameNode in managing Hadoop's metadata
  • DataNode: a secondary node of the cluster; it stores the cluster's actual data (see the client sketch after this list)
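
As a brief illustration of how these roles cooperate: an HDFS client asks the NameNode for metadata (which blocks make up a file and where they live) and then streams the actual data to and from DataNodes. The sketch below uses the standard FileSystem API; the namenode-host address and the /demo/hello.txt path are placeholder assumptions.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // "namenode-host" is a placeholder. The client only asks the NameNode for
    // metadata; the block data itself flows to and from DataNodes.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);

    Path file = new Path("/demo/hello.txt"); // hypothetical path

    // Write: the NameNode allocates blocks, DataNodes store the bytes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello hdfs");
    }

    // Read: the NameNode returns block locations, data streams from DataNodes.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }

    fs.close();
  }
}
```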

Data computing core modules:

  • ResourceManager: receives users' computing request tasks and allocates cluster resources
  • NodeManager: receives and runs the tasks assigned by the ApplicationMaster
  • ApplicationMaster: the ResourceManager starts one ApplicationMaster for each computing task; the ApplicationMaster then applies for resources and assigns tasks (see the sketch after this list)
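
As a small sketch of the ResourceManager's role as the cluster's front door for computation, the code below uses the standard YarnClient API to list the applications the ResourceManager currently tracks; the ResourceManager's address is taken from the client's yarn-site.xml configuration.

```java
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnAppLister {
  public static void main(String[] args) throws Exception {
    // The client talks to the ResourceManager; its address comes from yarn-site.xml.
    YarnConfiguration conf = new YarnConfiguration();
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // For each submitted job the ResourceManager has started an ApplicationMaster,
    // and NodeManagers run the tasks that the ApplicationMaster hands out.
    for (ApplicationReport app : yarnClient.getApplications()) {
      System.out.printf("%s  %s  %s%n",
          app.getApplicationId(), app.getName(), app.getYarnApplicationState());
    }

    yarnClient.stop();
  }
}
```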

NameNode and ResourceManager high-availability architecture model

File system core modules:

  • NameNode: the master node of the cluster, managing the cluster's metadata; two NameNodes are usually deployed to achieve high availability (HA)
  • JournalNode: the process that manages the shared metadata (edit logs); an odd number of JournalNodes is usually deployed
  • DataNode: a secondary node used for data storage

Data computing core modules:

  • ResourceManager: the master node of the YARN platform, which receives the various tasks; two ResourceManager nodes are deployed and built into a highly available pair
  • NodeManager: a secondary node of the YARN platform, which processes the tasks assigned by the ResourceManager (a configuration sketch follows this list)
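
To make the high-availability layout concrete, here is a minimal configuration sketch. In a real cluster these properties normally live in hdfs-site.xml and yarn-site.xml rather than in code; the logical service name mycluster and the hosts host1, host2, jn1, jn2, jn3 are placeholder assumptions.

```java
import org.apache.hadoop.conf.Configuration;

public class HaConfigSketch {
  public static Configuration haConf() {
    Configuration conf = new Configuration();

    // HDFS HA: two NameNodes sharing their edit log through three JournalNodes.
    conf.set("dfs.nameservices", "mycluster"); // logical service name (placeholder)
    conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
    conf.set("dfs.namenode.rpc-address.mycluster.nn1", "host1:8020"); // placeholder hosts
    conf.set("dfs.namenode.rpc-address.mycluster.nn2", "host2:8020");
    // An odd number of JournalNodes holds the shared edit log.
    conf.set("dfs.namenode.shared.edits.dir",
        "qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster");
    conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

    // YARN HA: two ResourceManagers built into a highly available pair.
    conf.set("yarn.resourcemanager.ha.enabled", "true");
    conf.set("yarn.resourcemanager.ha.rm-ids", "rm1,rm2");
    conf.set("yarn.resourcemanager.hostname.rm1", "host1");
    conf.set("yarn.resourcemanager.hostname.rm2", "host2");

    return conf;
  }
}
```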

The current status of Hadoop

Since 2015, Hadoop's problems have steadily come to light. Analysts at Gartner, IDG, and elsewhere, Hadoop users, and insiders from the Hadoop and big data world have increasingly been reporting problems.

The reasons are as follows:

  • The Hadoop stack is too complex: too many components, difficult integration, and a high cost to run well
  • Hadoop has not innovated fast enough (or started from too low a base), and it lacks a unified philosophy and governance, which makes integration between its many components complicated
  • It has been hit hard by cloud technology; in particular, S3-style object storage offers cheaper, easier-to-use, and more scalable storage than HDFS, shaking Hadoop's foundation, HDFS
  • Expectations of Hadoop were set too high. Hadoop grew out of cheap storage and batch processing, yet people expected it to solve all of big data's problems
  • Talent is expensive and scarce

Conclusion

In conclusion, as the first generation of big data solutions, Hadoop has passed its peak, and big data has entered its second generation: the distributed database.

Distributed databases, MPP databases in particular, have solved the basic analytical problems of big data well, and they will keep developing toward being easier and faster to use.

Advanced data analysis is sinking into the database. The difficulty in advanced data analysis is not the analysis itself but the quantity and quality of the data. Expect more innovation in this area.