Goals

  • Understanding big data
  • The big data ecosystem, starting from the Hadoop framework

1. Getting to know big data

Skip the conceptual material on big data characteristics, prospects (and the money in them), advantages, and so on, and focus directly on the practical side of the job

1.1 Department business process

1.2 Department organizational structure

2. Hadoop and the big data ecosystem

2.1 What is Hadoop

Hadoop is the cornerstone of the big data ecosystem. Once you understand the divide-and-conquer idea behind MapReduce (see the WordCount sketch below), you will find it much easier to pick up the rest of the big data processing tools, such as Hive, Spark, and Flink. These days, Hadoop in the broad sense usually refers to the whole Hadoop ecosystem rather than just the core framework.

  • Hadoop mainly solves two problems: the storage of massive data, and the analysis and computation of massive data.
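
To make the divide-and-conquer idea concrete, here is a minimal WordCount sketch written against the standard org.apache.hadoop.mapreduce API: the Map phase emits (word, 1) pairs in parallel, and the Reduce phase sums the counts per word. The class name, job name, and the input/output paths read from the command line are placeholders for illustration, not part of any particular project.

```java
// Minimal WordCount sketch with the classic Hadoop MapReduce API.
// Paths and job name are placeholders for illustration.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: split each input line into words and emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts collected for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Roughly speaking, packaging this class into a jar and running `hadoop jar wordcount.jar WordCount <input> <output>` would spread the map tasks over the cluster and write the per-word counts into the output directory.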

2.2 Development of Hadoop

  • Development timeline

    • 2006 – Hadoop 1.0

    • 2012 – Hadoop 2.0

    • 2018 – Hadoop 3.0

  • The differences between Hadoop 1.x and Hadoop 2.x

2.3 Hadoop's advantages

  • High reliability

    Hadoop keeps multiple replicas of the data at the storage layer, so data is not lost even when a compute or storage element fails.

  • High scalability

    Tasks and data are distributed across the cluster, which can easily be scaled out to thousands of nodes.

  • High efficiency

    Following the MapReduce model, Hadoop processes tasks in parallel, which speeds up computation.

  • High fault tolerance

    Failed tasks are automatically reassigned.

2.4 Hadoop components

Note: this is only a preliminary introduction; each component will be covered in detail later.

  • HDFS
    • NameNode: Stores file metadata such as the file name, directory structure, file attributes (creation time, replica count, permissions), the block list of each file, and the DataNodes on which each block resides (see the HDFS client sketch after this component list)
    • DataNode: Stores file block data and the checksum of block data in the local file system
    • Secondary NameNode: An auxiliary background process that monitors the HDFS status and takes snapshots of the HDFS metadata at regular intervals
  • MapReduce
    • Map phase: Input data is processed in parallel
    • Reduce phase: Summarizes the Map results
  • Yarn
    • ResourceManager (RM)
      • Processes client requests
      • Monitors the NodeManagers
      • Starts and monitors ApplicationMasters
      • Allocates and schedules resources
    • NodeManager (NM)
      • Manages the resources on a single node
      • Processes commands from the RM
      • Processes commands from the AM
    • ApplicationMaster (AM)
      • Splits the input data for the application
      • Requests resources for the application and assigns them to its internal tasks
      • Monitors tasks and provides fault tolerance
    • Container
      • The resource abstraction in Yarn; it encapsulates the multi-dimensional resources of a node, such as memory, CPU, disk, and network
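
To see how the NameNode and the DataNodes split the work, here is a small sketch using the HDFS Java client (org.apache.hadoop.fs.FileSystem): the file status and block list come from the NameNode's metadata, while the host names returned for each block are the DataNodes that actually store it. The file path and the NameNode address are placeholder assumptions for illustration.

```java
// Sketch of the HDFS Java client API; the path below is a placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsMetadataDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS would normally point at the NameNode, e.g. hdfs://namenode:8020
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/tmp/demo.txt");

            // Metadata answered by the NameNode: length, replica count, permissions, ...
            FileStatus status = fs.getFileStatus(file);
            System.out.println("len=" + status.getLen()
                    + " replication=" + status.getReplication()
                    + " perms=" + status.getPermission());

            // Block list of the file and the DataNodes hosting each block.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset=" + block.getOffset()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
        }
    }
}
```

This is also why the Secondary NameNode only snapshots metadata: the block data itself is read from and written to the DataNodes directly and never passes through the NameNode.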

3. Big data technology ecosystem

3.1 Schematic diagram of the ecosystem

3.2 Explanation of related technologies

  • Sqoop: used to transfer data between Hadoop/Hive and traditional relational databases such as MySQL. It can import data from a relational database into HDFS, or export data from HDFS back into a relational database.
  • Flume: a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive volumes of log data. You can plug custom data sources into the logging pipeline to collect data, and Flume can also perform simple processing before writing the data to various (customizable) data sinks.
  • Kafka: a high-throughput distributed publish-subscribe messaging system (a producer sketch follows after this list) with the following features:
    • Messages are persisted through an O(1) disk data structure, which keeps performance stable over the long term even with terabytes of stored messages.
    • High throughput: Kafka can support millions of messages per second, even on very ordinary hardware.
    • Supports partitioning messages across Kafka brokers and consuming them with clusters of consumer machines.
    • Supports parallel loading of data into Hadoop.
  • Flink: the most widely used stream processing framework for stateful computations over unbounded and bounded data streams.
  • Spark: the most widely used in-memory computing framework for big data. It can run computations on data stored in Hadoop.
  • HBase: a distributed, column-oriented database suitable for storing unstructured data.
  • Hive: a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides simple SQL queries, which are translated into MapReduce jobs for execution, making it well suited to statistical analysis over the warehouse.
  • ZooKeeper: a reliable coordination service for large distributed systems. It provides configuration maintenance, naming, distributed synchronization, and group services. Its goal is to encapsulate complex and error-prone key services and expose them to users through simple interfaces in a stable, high-performance system.
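
As a small taste of the publish-subscribe model mentioned above, here is a minimal Kafka producer sketch using the official kafka-clients Java API. The broker address, topic name, and message contents are placeholder assumptions; in a real pipeline the records might come from Flume or an application, with downstream consumers such as Flink or Spark jobs subscribed to the same topic.

```java
// Minimal Kafka producer sketch; broker address and topic name are placeholders.
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class LogProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");              // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Publish a few messages to a topic; any consumer group subscribed to
        // the same topic will receive them independently (publish-subscribe).
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 3; i++) {
                producer.send(new ProducerRecord<>("app-logs", "key-" + i, "log line " + i));
            }
            producer.flush();
        }
    }
}
```

Because every consumer group reads a topic independently, Kafka works well as a buffer between log collectors and the processing frameworks listed above.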

OK, this is my first, preliminary pass over the big data ecosystem. In later posts I will summarize each piece more thoroughly, from theory to practice.

Well, that’s all for today, bye bye ~