I. Overview of big data
Concept
Big Data refers to data sets that cannot be captured, managed, and processed by conventional software tools within an acceptable time frame. It is a massive, fast-growing, and diversified information asset that requires new processing modes to provide stronger decision-making power, insight, discovery, and process-optimization capabilities. Big data technology mainly addresses the storage of massive data and the analysis and computation of massive data.
Characteristics
- Volume (large amount of data)
- Velocity (high speed of generation and processing)
- Variety (diverse data types)
- Value (low value density)
II. The Hadoop framework
What is Hadoop
- Hadoop is a distributed system infrastructure developed by the Apache Foundation.
- It mainly solves the problems of massive data storage and massive data analysis and computation.
- In broad terms, Hadoop usually refers to a broader concept — the Hadoop ecosystem.
The history of Hadoop
Hadoop's core components trace back to three Google papers:
- GFS => HDFS
- MapReduce => MR
- BigTable => HBase
The three major Hadoop distributions
Hadoop has three major distributions: Apache, Cloudera, and Hortonworks.
- Apache is the original (most basic) distribution and the best choice for getting started;
- Cloudera is widely used by large Internet companies;
- Hortonworks is well documented;
Advantages of Hadoop
- High reliability: Hadoop maintains multiple copies of data, so data is not lost even if a computing element or storage unit fails;
- High scalability: tasks and data are distributed across the cluster, which can easily scale to thousands of nodes;
- Efficiency: under the MapReduce model, Hadoop processes tasks in parallel, which speeds them up;
- High fault tolerance: failed tasks are automatically reassigned;
III. Hadoop composition
In Hadoop 1.x, MapReduce handles both business-logic computation and resource scheduling, which couples them tightly, while HDFS is responsible for data storage.
In Hadoop 2.x, YARN is added for resource scheduling, so MapReduce only handles business-logic computation and HDFS remains responsible for data storage.
HDFS
- NameNode (NN): stores file metadata, such as the file name, directory structure, file attributes (creation time, number of replicas, permissions), the block list of each file, and the DataNodes on which each block resides.
- DataNode (DN): stores the file block data and the checksums of that data in the local file system.
- Secondary NameNode (2NN): an auxiliary daemon that monitors HDFS status and takes snapshots of HDFS metadata at regular intervals.
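As a concrete (if simplified) illustration of how a client interacts with these roles, the sketch below uses Hadoop's Java FileSystem API to write and then read a small file. The fs.defaultFS address and the path are placeholders, not values taken from this article.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsQuickStart {
    public static void main(String[] args) throws Exception {
        // The client asks the NameNode for metadata only; the actual
        // block data is streamed to/from the DataNodes.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/tmp/hello.txt"); // placeholder path

            // Write: the NameNode records the file's metadata and block list,
            // while the DataNodes store the block replicas.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the client learns the block locations from the NameNode,
            // then reads the blocks directly from the DataNodes.
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```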
MapReduce
MapReduce divides the computation process into two phases, Map and Reduce (a minimal word-count sketch follows this list):
- The Map phase processes the input data in parallel;
- The Reduce phase aggregates the results of the Map phase.
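The split between the two phases is easiest to see in the classic word-count example. The sketch below follows the standard Hadoop WordCount closely (trimmed for brevity); the input and output directories are taken from the command line and are assumptions.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each MapTask processes one input split in parallel
    // and emits (word, 1) for every word it sees.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: all counts for the same word are grouped together
    // and summed into the final result.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir (assumed)
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir (assumed)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```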
YARN
ResourceManager (RM) provides the following functions:
- Handle client requests;
- Monitor the NodeManager;
- Start or monitor ApplicationMaster;
- Allocate and schedule resources;
NodeManager (NM) has the following functions:
- Manage resources on a single node;
- Process commands from the ResourceManager;
- Handle commands from the ApplicationMaster;
ApplicationMaster (AM) does the following:
- Starts MapTasks according to the split information (job.split);
- Requests resources for the application and allocates them to its internal tasks;
- Monitors tasks and handles fault tolerance;
- Container: a resource abstraction in YARN that encapsulates multi-dimensional resources on a node, such as memory, CPU, disk, and network.
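To see these roles from a client's perspective, the hedged sketch below uses the YarnClient API to ask the ResourceManager for the list of applications it knows about. It assumes a yarn-site.xml with the ResourceManager address is on the classpath; no hostnames here are taken from this article.

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        // The client talks to the ResourceManager, which schedules Containers
        // on NodeManagers and launches one ApplicationMaster per application.
        Configuration conf = new YarnConfiguration();

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();
        try {
            // Ask the RM for all applications it is currently tracking.
            List<ApplicationReport> apps = yarnClient.getApplications();
            for (ApplicationReport app : apps) {
                System.out.printf("%s\t%s\t%s%n",
                        app.getApplicationId(), app.getName(), app.getYarnApplicationState());
            }
        } finally {
            yarnClient.stop();
        }
    }
}
```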
IV. Hadoop ports
9870: NameNode Web access port
```xml
<!-- 9870: HDFS NameNode web UI port -->
<property>
  <name>dfs.namenode.http-address</name>
  <value>0.0.0.0:9870</value>
  <description>
    The address and the base port where the dfs namenode web ui will listen on.
  </description>
</property>
```
8088: Web access port of ResourceManager
```xml
<!-- 8088: YARN ResourceManager web UI port -->
<property>
  <description>The http address of the RM web application.</description>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>${yarn.resourcemanager.hostname}:8088</value>
</property>

<!-- yarn-site.xml configuration -->
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>centos7202</value>
</property>
```
9868: Secondary NameNode Web access port
```xml
<!-- 9868: HDFS Secondary NameNode web UI port -->
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>0.0.0.0:9868</value>
  <description>
    The secondary namenode http server address and port.
  </description>
</property>

<!-- hdfs-site.xml configuration -->
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>centos7203:9868</value>
</property>
```
19888: Web access port of the JobHistory Server
```xml
<!-- 19888: MapReduce JobHistory Server web UI port -->
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>0.0.0.0:19888</value>
</property>

<!-- mapred-site.xml -->
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>centos7203:19888</value>
</property>
```
10020: IPC connection port of the JobHistory Server
```xml
<!-- 10020: MapReduce JobHistory Server IPC port -->
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>0.0.0.0:10020</value>
  <description>MapReduce JobHistory Server IPC host:port</description>
</property>

<!-- mapred-site.xml -->
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>centos7203:10020</value>
</property>
```
8485: JournalNode RPC connection port
```xml
<!-- 8485: JournalNode RPC port -->
<property>
  <name>dfs.journalnode.rpc-address</name>
  <value>0.0.0.0:8485</value>
  <description>
    The JournalNode RPC server address and port.
  </description>
</property>
```