1. Introduction and development history of Hadoop

1.1 In the narrow sense, Hadoop refers to an open-source software project from Apache.

  • An open-source software framework implemented in Java
  • Allows distributed processing of large data sets across clusters of computers using a simple programming model

1.2 Hadoop core components

  • Hadoop HDFS (distributed file storage system): provides storage for massive data sets
  • Hadoop YARN (cluster resource management and task scheduling framework): handles resource allocation and task scheduling
  • Hadoop MapReduce (distributed computing framework): handles computation over massive data sets
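To make the "simple programming model" behind MapReduce more concrete, here is a minimal sketch in plain Python of the map, shuffle, and reduce phases applied to a word count. It runs in a single process and does not use Hadoop's actual Java API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["Hadoop stores data", "Hadoop processes data"]
    print(word_count(sample))
```

In this sketch the user writes only map_phase and reduce_phase. On a real cluster, the framework would handle splitting the input, moving the intermediate pairs between machines, and retrying failed tasks.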

1.3 Official website: hadoop.apache.org/

1.4 In the broad sense, Hadoop refers to the big data ecosystem built around Hadoop.

1.5 History of Hadoop

  • Father of Hadoop: Doug Cutting
  • Hadoop originated from Nutch, a subproject of Apache Lucene. Nutch was designed to build a large, full-web search engine. Its bottleneck was how to store and index billions of web pages.
  • Google's three papers:
  1. The Google File System: Google's distributed file system (GFS)
  2. MapReduce: Simplified Data Processing on Large Clusters: Google's distributed computing framework
  3. Bigtable: A Distributed Storage System for Structured Data: Google's structured data storage system
  • Download address for Chinese versions of the three papers: download.csdn.net/download/qq…

1.6 Summary

  • In the narrow sense, Hadoop refers to the software itself; in the broad sense, it refers to the ecosystem
  • Doug Cutting is the father of Hadoop
  • Hadoop has its roots in the Nutch project
  • It was inspired by three papers from Google
  • It became a top-level Apache Software Foundation project in 2008

2. Hadoop features, advantages, and domestic and foreign applications

2.1 Hadoop features and advantages

2.2 Hadoop applications abroad

2.3 Hadoop domestic applications

2.4 Summary

  • The magic of Hadoop's success is versatility: it cleanly separates what to do from how to do it. What to do is a business problem; how to do it is a technical problem. The user takes care of the business and Hadoop takes care of the technology
  • The appeal of Hadoop's success is simplicity

3. Hadoop distribution and architecture changes

3.1 Hadoop Distribution

  • Apache open-source community version: hadoop.apache.org/
  • Commercial distributions:
    Cloudera: www.cloudera.com/products/op…
    Hortonworks: https://www.cloudera.com/products/hdp.html
  • Latest version at the time of writing: 3.2.2

4. Hadoop architecture changes (1.0 to 2.0)

  1. Hadoop 1.0

    HDFS (distributed file storage)

    MapReduce (resource management and distributed data processing)

  2. Hadoop 2.0

    HDFS (distributed file storage)

    MapReduce (distributed data processing)

    YARN (cluster resource management and task scheduling)

  In other words, 2.0 moved resource management out of MapReduce and into the separate YARN component.

5. Hadoop architecture changes (new in 3.0)

  • Hadoop 3.0 keeps the same architectural components as Hadoop 2.0; the 3.0 release focuses on performance optimization.

  • Hadoop Common: streamlined kernel, classpath isolation, shell script refactoring
  • Hadoop HDFS: erasure coding (EC) and support for multiple NameNodes
  • Hadoop MapReduce: task-level native optimization and automatic inference of memory parameters
  • Hadoop YARN: Timeline Service v2 and queue configuration
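
The idea behind erasure coding can be sketched with simple XOR parity: store one parity block alongside two data blocks, and any single lost block can be rebuilt from the other two. The sketch below is plain Python for illustration only. HDFS's own erasure coding policies are mostly based on Reed-Solomon codes and are set up through HDFS itself, not through code like this; the function names here are hypothetical.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length blocks byte by byte."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(data1: bytes, data2: bytes) -> bytes:
    """Compute the parity block that is stored next to the two data blocks."""
    return xor_blocks(data1, data2)


def recover_block(surviving: bytes, parity: bytes) -> bytes:
    """Rebuild the one lost data block from the surviving block and the parity."""
    return xor_blocks(surviving, parity)


def storage_overhead(data_units: int, parity_units: int) -> float:
    """Raw storage used per unit of user data for a (data, parity) layout."""
    return (data_units + parity_units) / data_units


if __name__ == "__main__":
    block1, block2 = b"hado", b"op!!"
    parity = encode_parity(block1, block2)
    # Pretend block1 was lost and rebuild it from block2 and the parity block.
    print(recover_block(block2, parity) == block1)
    # Compare with keeping 3 full copies, which costs 3.0 units per unit of data.
    print(storage_overhead(6, 3))
```

The overhead helper shows why erasure coding is attractive for storage: a layout with 6 data units and 3 parity units uses 1.5 units of raw storage per unit of data, while three full replicas use 3.0.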