Goals

  • Understanding big data
  • The big data ecosystem, starting from the Hadoop framework

1. Getting to know big data

Skip the conceptual material on big data characteristics, prospects (and the money in them), advantages, and so on, and focus directly on the practical side of the job

1.1 Department business process

1.2 Department organizational structure

2. Hadoop and the big data ecosystem

2.1 What is Hadoop

Hadoop is the cornerstone of the big data ecosystem. Once you understand the divide-and-conquer idea behind MapReduce (see the WordCount sketch below), you will find it much easier to pick up the rest of the big data processing tools, such as Hive, Spark, and Flink. These days, Hadoop in the broad sense usually refers to the whole Hadoop ecosystem rather than just the core framework.

  • Hadoop mainly solves two problems: the storage of massive data, and the analysis and computation of massive data.
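
To make the divide-and-conquer idea concrete, here is a minimal WordCount sketch written against the standard org.apache.hadoop.mapreduce API: the Map phase emits (word, 1) pairs in parallel, and the Reduce phase sums the counts per word. The class name, job name, and the input/output paths read from the command line are placeholders for illustration, not part of any particular project.

```java
// Minimal WordCount sketch with the classic Hadoop MapReduce API.
// Paths and job name are placeholders for illustration.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: split each input line into words and emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts collected for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Roughly speaking, packaging this class into a jar and running `hadoop jar wordcount.jar WordCount <input> <output>` would spread the map tasks over the cluster and write the per-word counts into the output directory.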

2.2 Development of Hadoop

  • Development timeline

    • 2006 – Hadoop 1.0

    • 2012 – Hadoop 2.0

    • 2018 – Hadoop 3.0

  • The differences between Hadoop 1.x and Hadoop 2.x

2.3 Hadoop's advantages

  • High reliability

    Hadoop keeps multiple replicas of the data at the storage layer, so data is not lost even when a compute or storage element fails.

  • High scalability

    Tasks and data are distributed across the cluster, which can easily be scaled out to thousands of nodes.

  • High efficiency

    Following the MapReduce model, Hadoop processes tasks in parallel, which speeds up computation.

  • High fault tolerance

    Failed tasks are automatically reassigned.

2.4 Hadoop components

Note: this is only a preliminary introduction; each component will be covered in detail later.

  • HDFS
    • NameNode: Stores file metadata such as the file name, directory structure, file attributes (creation time, replica count, permissions), the block list of each file, and the DataNodes on which each block resides (see the HDFS client sketch after this component list)
    • DataNode: Stores file block data and the checksum of block data in the local file system
    • Secondary NameNode: An auxiliary background process that monitors the HDFS status and takes snapshots of the HDFS metadata at regular intervals
  • MapReduce
    • Map phase: Input data is processed in parallel
    • Reduce phase: Summarizes the Map results
  • Yarn
    • ResourceManager (RM)
      • Processes client requests
      • Monitors the NodeManagers
      • Starts and monitors ApplicationMasters
      • Allocates and schedules resources
    • NodeManager (NM)
      • Manages the resources on a single node
      • Processes commands from the RM
      • Processes commands from the AM
    • ApplicationMaster (AM)
      • Splits the input data for the application
      • Requests resources for the application and assigns them to its internal tasks
      • Monitors tasks and provides fault tolerance
    • Container
      • The resource abstraction in Yarn; it encapsulates the multi-dimensional resources of a node, such as memory, CPU, disk, and network
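
To see how the NameNode and the DataNodes split the work, here is a small sketch using the HDFS Java client (org.apache.hadoop.fs.FileSystem): the file status and block list come from the NameNode's metadata, while the host names returned for each block are the DataNodes that actually store it. The file path and the NameNode address are placeholder assumptions for illustration.

```java
// Sketch of the HDFS Java client API; the path below is a placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsMetadataDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS would normally point at the NameNode, e.g. hdfs://namenode:8020
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/tmp/demo.txt");

            // Metadata answered by the NameNode: length, replica count, permissions, ...
            FileStatus status = fs.getFileStatus(file);
            System.out.println("len=" + status.getLen()
                    + " replication=" + status.getReplication()
                    + " perms=" + status.getPermission());

            // Block list of the file and the DataNodes hosting each block.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset=" + block.getOffset()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
        }
    }
}
```

This is also why the Secondary NameNode only snapshots metadata: the block data itself is read from and written to the DataNodes directly and never passes through the NameNode.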

3. Big data technology ecosystem

3.1 Schematic diagram of the ecosystem

3.2 Explanation of related technologies

  • Sqoop: used to transfer data between Hadoop/Hive and traditional relational databases such as MySQL. It can import data from a relational database into HDFS, or export data from HDFS back into a relational database.
  • Flume: a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive volumes of log data. You can plug custom data sources into the logging pipeline to collect data, and Flume can also perform simple processing before writing the data to various (customizable) data sinks.
  • Kafka: a high-throughput distributed publish-subscribe messaging system (a producer sketch follows after this list) with the following features:
    • Messages are persisted through an O(1) disk data structure, which keeps performance stable over the long term even with terabytes of stored messages.
    • High throughput: Kafka can support millions of messages per second, even on very ordinary hardware.
    • Supports partitioning messages across Kafka brokers and consuming them with clusters of consumer machines.
    • Supports parallel loading of data into Hadoop.
  • Flink: the most widely used stream processing framework for stateful computations over unbounded and bounded data streams.
  • Spark: the most widely used in-memory computing framework for big data. It can run computations on data stored in Hadoop.
  • HBase: a distributed, column-oriented database suitable for storing unstructured data.
  • Hive: a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides simple SQL queries, which are translated into MapReduce jobs for execution, making it well suited to statistical analysis over the warehouse.
  • ZooKeeper: a reliable coordination service for large distributed systems. It provides configuration maintenance, naming, distributed synchronization, and group services. Its goal is to encapsulate complex and error-prone key services and expose them to users through simple interfaces in a stable, high-performance system.
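
As a small taste of the publish-subscribe model mentioned above, here is a minimal Kafka producer sketch using the official kafka-clients Java API. The broker address, topic name, and message contents are placeholder assumptions; in a real pipeline the records might come from Flume or an application, with downstream consumers such as Flink or Spark jobs subscribed to the same topic.

```java
// Minimal Kafka producer sketch; broker address and topic name are placeholders.
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class LogProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");              // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Publish a few messages to a topic; any consumer group subscribed to
        // the same topic will receive them independently (publish-subscribe).
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 3; i++) {
                producer.send(new ProducerRecord<>("app-logs", "key-" + i, "log line " + i));
            }
            producer.flush();
        }
    }
}
```

Because every consumer group reads a topic independently, Kafka works well as a buffer between log collectors and the processing frameworks listed above.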

OK, this is my first, preliminary pass over the big data ecosystem. In later posts I will summarize each piece more thoroughly, from theory to practice.

Well, that’s all for today, bye bye ~