Takeaway:
- Chapter 1: Getting to know Hadoop
- Chapter 2: A More efficient WordCount
- Chapter 3: Getting data from elsewhere onto Hadoop
- Chapter 4: Take your data on Hadoop somewhere else
- Chapter 5: Hurry up, my SQL
- Chapter 6: Polygamy
- Chapter 7: More and more Analysis tasks
- Chapter 8: My data in Real time
- Chapter 9: My data is public
- Chapter 10: Awesome machine learning
Beginners often ask me, on my blog and on QQ, that they want to move in the direction of big data: what technologies should they learn, and what does the learning route look like? They think big data is hot, with good job prospects and high salaries. If that is why you want to move toward big data, that is fine too. Then let me ask: what is your major, and what about computers/software interests you? A computer major interested in operating systems, hardware, networking, servers? A software major interested in software development, programming, writing code? Or a mathematics/statistics major especially interested in data and numbers?
In fact, this is to point out the three development directions in big data: platform building/optimization/operations/monitoring, big data development/design/architecture, and data analysis/mining. Please don't ask me which is easy, which has good prospects, or which pays more.
Let’s start with the 4V features of big data:
- Large amount of data, TB->PB
- Various data types, such as structured and unstructured text, logs, videos, pictures, geographic locations, etc.
- High commercial value, but this value needs to be mined from the massive data quickly, through data analysis and machine learning;
- High timeliness requirements: the demand for processing massive data is no longer limited to offline computing.
Nowadays, in order to deal with these characteristics of big data, there are more and more open source big data frameworks. Here are some common ones:
- File storage: Hadoop HDFS, Tachyon, KFS
- Offline computing: Hadoop MapReduce, Spark
- Streaming and real-time computing: Storm, Spark Streaming, S4, Heron
- K-V and NoSQL databases: HBase, Redis, MongoDB
- Resource management: YARN, Mesos
- Log collection: Flume, Scribe, Logstash, Kibana
- Message systems: Kafka, StormMQ, ZeroMQ, RabbitMQ
- Query and analysis: Hive, Impala, Pig, Presto, Phoenix, SparkSQL, Drill, Flink, Kylin, Druid
- Distributed coordination services: ZooKeeper
- Cluster management and monitoring: Ambari, Ganglia, Nagios, Cloudera Manager
- Data mining and machine learning: Mahout, Spark MLlib
- Data synchronization: Sqoop
- Task scheduling: Oozie
- ...
Dizzying, isn't it? There are more than 30 of them above. Never mind mastering them all; I'd guess few people can even use them all.
As far as I am concerned, my main experience is in the second direction (development/design/architecture), so please take my advice with that in mind.
Chapter 1: Getting to know Hadoop
1.1 Learn to use Baidu and Google
Whatever problem you have, try searching and solve it yourself.
Google is the first choice; if you can't get to it, use Baidu.
1.2 Official documentation is the preferred reference
Especially for getting started, official documentation is always the preferred document.
I believe most people working in this field are educated enough that English documentation is manageable. If you really can't get through it, go back to step 1.1.
1.3 Let Hadoop run first
Hadoop can be regarded as the originator of big data storage and computing, and most open source big data frameworks now rely on Hadoop or are well compatible with it.
Here’s what you need to know about Hadoop:
- Hadoop 1.0, Hadoop 2.0
- MapReduce, HDFS
- NameNode, DataNode
- JobTracker, TaskTracker
- Yarn, ResourceManager, and NodeManager
Build your own Hadoop using 1.1 and 1.2; getting it to run is enough.
It is recommended to install from the release package on the command line rather than with a management tool.
Also: you only need to know about Hadoop 1.0; everything now uses Hadoop 2.0.
1.4 Try using Hadoop
- HDFS directory operation commands;
- Commands to upload and download files;
- Submit and run the MapReduce example program;
- Open the Hadoop web UI, and view Job status and Job logs;
- Know where the Hadoop system logs are.
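A minimal sketch of the tasks above, driving the HDFS CLI and the bundled example jar from Python. The HDFS paths, local file names, and the example jar location/version are placeholders; adjust them to your own installation.

```python
# Minimal sketch: drive the HDFS CLI and run the bundled WordCount example.
import subprocess

def run(cmd):
    """Run a command and fail loudly if it returns a non-zero exit code."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Directory operations and file upload/download on HDFS.
run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo/input"])
run(["hdfs", "dfs", "-put", "local.txt", "/user/demo/input/"])
run(["hdfs", "dfs", "-ls", "/user/demo/input"])
run(["hdfs", "dfs", "-get", "/user/demo/input/local.txt", "local_copy.txt"])

# Submit the WordCount example shipped with Hadoop; the jar name and path
# are placeholders and vary by Hadoop version.
run(["hadoop", "jar",
     "/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar",
     "wordcount", "/user/demo/input", "/user/demo/output"])
```

After the job finishes, the Hadoop web UI shows it under the job list, and the output sits in /user/demo/output on HDFS.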
1.5 It’s time you understood how they work
- MapReduce: how divide and conquer works;
- HDFS: where data is stored and what a replica is;
- What Yarn is and what it does;
- What the NameNode does;
- What the ResourceManager does.
1.6 Write a MapReduce program
Write your own (or copy) WordCount, package it, and submit it to Hadoop to run.
Don't know Java? Shell or Python will do; there's something called Hadoop Streaming.
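If you go the Hadoop Streaming route, here is a minimal WordCount sketch in Python. Streaming feeds input lines on stdin and expects tab-separated key/value pairs on stdout; the jar path and HDFS paths in the submit command are placeholders.

```python
# wordcount_streaming.py: one file playing both roles; in practice you may
# prefer two separate scripts (mapper.py and reducer.py).
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")                  # emit (word, 1)

def reducer():
    # Between map and reduce, Hadoop sorts by key, so equal words arrive together.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current:
            count += int(value)
        else:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

# Submit with Hadoop Streaming (the jar path is a placeholder, it varies by version):
# hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#   -files wordcount_streaming.py \
#   -mapper "python wordcount_streaming.py map" \
#   -reducer "python wordcount_streaming.py reduce" \
#   -input /user/demo/input -output /user/demo/wc_out
```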
If you have carefully completed the above steps, congratulations, you already have one foot in the door.
Chapter 2: A More efficient WordCount
2.1 Learn some SQL
Do you know databases? Can you write SQL? If not, learn SQL.
2.2 The SQL version of WordCount
How many lines of WordCount did you write (or copy) in 1.6?
Let me show you mine:
SELECT word,COUNT(1) FROM wordcount GROUP BY word;
That's the beauty of SQL: what takes dozens or even hundreds of lines of code to program takes one line of SQL here. Using SQL to process and analyze data on Hadoop is convenient, efficient, and easy to pick up, so more and more big data processing frameworks, offline and real-time alike, are actively providing SQL interfaces.
2.3 Hive SQL On Hadoop
What is Hive? The official explanation:
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax.
Why is Hive called a data warehouse tool rather than a database tool? Some readers may not know what a data warehouse is. A data warehouse is a logical concept; underneath, it still uses databases. Data in a data warehouse has two characteristics: it is the most complete historical data (massive), and it is relatively stable. "Relatively stable" means that, unlike the databases behind business systems, where data is updated frequently, data that enters the warehouse is rarely updated or deleted and is only queried, in large volumes. Hive has both of these characteristics, so Hive is suitable as a warehouse tool for massive data, not as a database tool.
2.4 Installing and Configuring Hive
Refer to 1.1 and 1.2 to install and configure Hive, until you can enter the Hive command line.
2.5 Try Using Hive
Refer to 1.1 and 1.2 to create the Wordcount table in Hive and run the SQL statement in 2.2. Locate the SQL task you just ran on the Hadoop WEB interface.
Check whether the SQL query result is the same as that in MapReduce in 1.4.
2.6 How Does Hive work
You clearly wrote SQL, so why does the Hadoop web UI show MapReduce jobs?
2.7 Learn Basic Hive Commands
Create and drop tables; Load data into a table; Download Hive table data.
For details about Hive syntax and commands, see 1.2.
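A minimal sketch of the basic operations from 2.7, issued through the `hive -e` command line from Python. The table name, column layout, delimiter, and paths are only examples.

```python
# Sketch: create a table (one word per row), load data, query it, export it, drop it.
import subprocess

def run_hql(statement):
    subprocess.run(["hive", "-e", statement], check=True)

run_hql("CREATE TABLE IF NOT EXISTS wordcount (word STRING) "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'")
run_hql("LOAD DATA LOCAL INPATH '/tmp/words.txt' INTO TABLE wordcount")
run_hql("SELECT word, COUNT(1) FROM wordcount GROUP BY word")

# "Download" table data by writing query results to a local directory.
run_hql("INSERT OVERWRITE LOCAL DIRECTORY '/tmp/wordcount_export' "
        "SELECT * FROM wordcount")
run_hql("DROP TABLE wordcount")
```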
If you have thoroughly followed the steps of Chapters 1 and 2 in “Notes for Beginners in Big Data Development”, you should already have the following skills and knowledge:
- The difference between Hadoop 1.0 and Hadoop 2.0;
- How MapReduce works (again, the classic question: given a 10GB file and only 1GB of memory, how do you use a Java program to find the 10 most frequently occurring words? see the sketch after this list);
- The HDFS data read and write process; how to PUT data into HDFS and download data from HDFS;
- How to write a simple MapReduce program, and, when it fails, where to look in the logs;
- Can write simple SQL statements such as SELECT, WHERE, GROUP BY, etc.
- The process of converting Hive SQL into MapReduce;
- Common Hive statements include: Create tables, delete tables, load data to tables, partition tables, and download data in tables to a local directory.
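As promised above, here is a single-machine sketch of the classic "10 most frequent words with limited memory" question, in Python rather than Java. It illustrates the same divide-and-conquer idea MapReduce relies on: partition by hash of the word, count each partition separately, then merge the per-partition winners. File names and the partition count are examples.

```python
import heapq
from collections import Counter

NUM_PARTITIONS = 100   # chosen so each partition's word counts fit in memory

def split_into_partitions(path):
    files = [open(f"part_{i}.txt", "w") for i in range(NUM_PARTITIONS)]
    with open(path) as f:
        for line in f:
            for word in line.split():
                # All occurrences of a given word land in the same partition.
                files[hash(word) % NUM_PARTITIONS].write(word + "\n")
    for fh in files:
        fh.close()

def top10(path):
    split_into_partitions(path)
    candidates = []
    for i in range(NUM_PARTITIONS):
        counts = Counter()
        with open(f"part_{i}.txt") as f:
            for word in f:
                counts[word.strip()] += 1
        # The global winners must be among each partition's local top 10,
        # because a word never spans two partitions.
        candidates.extend(counts.most_common(10))
    return heapq.nlargest(10, candidates, key=lambda kv: kv[1])

if __name__ == "__main__":
    print(top10("big_10GB_file.txt"))
```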
HDFS is the distributed storage framework provided by Hadoop and can store massive amounts of data. MapReduce is the distributed computing framework provided by Hadoop and can be used to compute and analyze the massive data on HDFS. Hive is SQL on Hadoop: it provides a SQL interface, so developers only need to write easy-to-use SQL, which Hive translates into MapReduce and executes.
Your “big data platform” looks like this:
So the question is, how does the massive data get to HDFS?
Chapter 3: Getting data from elsewhere onto Hadoop
This can also be called data collection, which collects data from various data sources onto Hadoop.
3.1 HDFS PUT Command
You’ve probably used this before.
The put command is also commonly used in actual environments. It is usually used in shell and Python scripts.
Proficiency is recommended.
3.2 HDFS API
HDFS provides APIs for writing data, so you can write data into HDFS from your own programs; the PUT command itself uses the API.
In real environments you generally do not write programs against the API to put data into HDFS yourself; it is usually wrapped by other frameworks, for example Hive's INSERT statement or Spark's saveAsTextFile.
It is recommended to understand the principle and write Demo.
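The native API is Java's FileSystem class; as a Python-flavored demo, here is a sketch using the third-party `hdfs` package (a WebHDFS client). The NameNode web address, the user name, and the paths are placeholders for your own cluster, and the WebHDFS port differs between Hadoop 2 and 3.

```python
from hdfs import InsecureClient

client = InsecureClient("http://namenode-host:9870", user="hadoop")

# Write a small file directly into HDFS.
with client.write("/user/demo/api_demo.txt", overwrite=True, encoding="utf-8") as writer:
    writer.write("hello hdfs api\n")

# List a directory and read the file back.
print(client.list("/user/demo"))
with client.read("/user/demo/api_demo.txt", encoding="utf-8") as reader:
    print(reader.read())
```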
3.3 Sqoop
Sqoop is an open source framework mainly used for data exchange between Hadoop/Hive and traditional relational databases such as Oracle/MySQL/SQLServer.
Just as Hive translates SQL into MapReduce, Sqoop translates your specified parameters into MapReduce and executes them in Hadoop to exchange data between Hadoop and other databases.
Download and configure Sqoop yourself (Sqoop1 is recommended; Sqoop2 is more complex).
Understand common configuration parameters and methods for Sqoop.
Use Sqoop to synchronize data from MySQL to HDFS; use Sqoop to synchronize data from MySQL to a Hive table;
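A sketch of those two exercises as Sqoop 1 command lines, launched from Python. The JDBC URL, credentials, table names, and target paths are all placeholders.

```python
import subprocess

mysql_conn = ["--connect", "jdbc:mysql://mysql-host:3306/testdb",
              "--username", "test", "--password", "secret"]

# MySQL table -> HDFS directory.
subprocess.run(["sqoop", "import", *mysql_conn,
                "--table", "orders",
                "--target-dir", "/user/demo/orders",
                "-m", "1"], check=True)

# MySQL table -> Hive table (Sqoop creates and loads the Hive table).
subprocess.run(["sqoop", "import", *mysql_conn,
                "--table", "orders",
                "--hive-import", "--hive-table", "default.orders",
                "-m", "1"], check=True)
```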
PS: If Sqoop will be your data exchange tool of choice later on, it is worth mastering well; otherwise, understanding it and being able to run a demo is enough.
3.4 Flume
Flume is a distributed framework for collecting and transporting massive amounts of log data. Because it is a "log collection and transport framework", it is not suitable for collecting and transporting relational database data.
Flume collects logs in real time from network protocols, message systems, and file systems, and transports them to HDFS.
Therefore, if your business has data from these sources and needs real-time collection, Flume should be considered.
Download and configure Flume.
Use Flume to monitor a file that is continuously being appended to, and transport the data to HDFS;
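A minimal agent configuration sketch for that exercise: an exec source tailing a file, a memory channel, and an HDFS sink. Property names follow Flume 1.x conventions, but the agent name, log path, and HDFS path are placeholders. It is written out from Python here only to keep a single example language; normally you would just save the text as a .conf file.

```python
FLUME_CONF = """
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode-host:8020/flume/app-logs/%Y%m%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.useLocalTimeStamp = true
"""

with open("tail_to_hdfs.conf", "w") as f:
    f.write(FLUME_CONF)

# Then start the agent, for example:
# flume-ng agent --conf conf --conf-file tail_to_hdfs.conf --name a1
```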
PS: Flume configuration and use are relatively complicated; if you do not have enough interest and patience, you can skip Flume for now.
3.5 Alibaba open source DataX
The reason I introduce this is that the data exchange tool between Hadoop and relational databases currently used in our company was developed on the basis of DataX, and it works very well.
You can refer to my blog post entitled Download and Use of Taobao DataX, a Massive Data exchange tool for Heterogeneous Data Sources.
DataX is now in version 3.0 and supports many data sources.
You can also do secondary development on top of it.
PS: Anyone interested can explore and use it, and compare it to Sqoop.
If you have done all the above, your “big data platform” should look like this:
Chapter 4: Take your data on Hadoop somewhere else
The previous chapter described how to collect data from data sources onto Hadoop. Once the data is on Hadoop, you can analyze it with Hive and MapReduce. The next question is: how do you synchronize the analysis results from Hadoop to other systems and applications?
In fact, the methods here are basically the same as those in Chapter 3.
4.1 HDFS GET Command
GET files from the HDFS to the local PC. Mastery is required.
4.2 HDFS API
Same as 3.2.
4.3 Sqoop
Same as 3.3.
Use Sqoop to synchronize data from HDFS to MySQL; use Sqoop to synchronize Hive table data to MySQL;
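A sketch of the reverse direction with Sqoop 1 export, again launched from Python. The JDBC URL, credentials, table name, and paths are placeholders; the MySQL table must already exist with a matching schema, and the delimiter must match the one the Hive table was created with.

```python
import subprocess

subprocess.run(["sqoop", "export",
                "--connect", "jdbc:mysql://mysql-host:3306/testdb",
                "--username", "test", "--password", "secret",
                "--table", "word_counts",
                # For a Hive-managed table, point at its warehouse directory.
                "--export-dir", "/user/hive/warehouse/wordcount",
                # Matches the '\t' field delimiter used when the table was created.
                "--input-fields-terminated-by", "\\t",
                "-m", "1"], check=True)
```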
4.4 DataX
Same as 3.5.
If you have done all the above, your “big data platform” should look like this:
If you have thoroughly followed the steps of Chapters 3 and 4 in “Writing for Beginners in Big Data Development 2”, you should already have the following skills and knowledge:
Know how to collect existing data to HDFS, including offline collection and real-time collection;
You already know that Sqoop (or DataX as well) is a data exchange tool between HDFS and other data sources;
You already know that Flume can be used for real-time log collection.
You can build a Hadoop cluster, collect data into Hadoop, analyze the data with Hive and MapReduce, and synchronize the analysis results to other data sources.
The next problem is that, as Hive gets used more and more, you will find many problems, especially its slowness: in most cases, even when the data volume is very small, it still has to apply for resources and start MapReduce before it can execute.
Chapter 5: Hurry up, my SQL
In fact, everyone has noticed that Hive, with MapReduce as its execution engine behind the scenes, really is a bit slow.
As a result, there are more and more SQL On Hadoop frameworks, and from what I understand, the most commonly used are SparkSQL, Impala, and Presto in order of popularity.
These three frameworks provide SQL interfaces to quickly query and analyze data on Hadoop, running either partly or entirely in memory. For a comparison of the three, refer to 1.1.
We are currently using SparkSQL, probably for these reasons:
- We use Spark for other things as well and don't want to introduce too many frameworks;
- Impala demands too much memory, and we cannot set aside that many resources for it.
5.1 About Spark and SparkSQL
What is Spark, and what is SparkSQL? Some core concepts and definitions of Spark; how SparkSQL relates to Spark; how SparkSQL relates to Hive; and why SparkSQL runs faster than Hive.
5.2 How Do I Deploy and Run SparkSQL
What are the deployment modes of Spark? How do I run SparkSQL on Yarn? Use SparkSQL to query tables in Hive.
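A minimal sketch of 5.2: querying a Hive table through SparkSQL with the PySpark API. It assumes Spark was deployed with Hive support and can see the Hive metastore (for example, hive-site.xml on its classpath); the table is the wordcount table from Chapter 2.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sparksql-wordcount")
         .enableHiveSupport()      # lets spark.sql() see Hive tables
         .getOrCreate())

# The same WordCount SQL as 2.2, now executed by Spark instead of MapReduce.
result = spark.sql("SELECT word, COUNT(1) AS cnt FROM wordcount GROUP BY word")
result.show(10)

spark.stop()

# Submit on Yarn, for example:
# spark-submit --master yarn --deploy-mode client sparksql_wordcount.py
```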
PS: Spark is not a technology that can be mastered in a short period of time. Therefore, you are advised to start with SparkSQL first.
For Spark and SparkSQL, you may refer to http://lxw1234.com/archives/category/spark
If you have done all the above, your “big data platform” should look like this:
Chapter 6: Polygamy
Please don't be misled by the name. What I actually want to talk about is collecting data once and consuming it multiple times.
In actual business scenarios, especially for some monitoring logs, if you want to learn some metrics from the logs right away (real-time computation is covered in a later chapter), analyzing the data from HDFS is too slow. And even though the data is collected by Flume, Flume cannot roll files to HDFS at very short intervals, because that would produce a huge number of small files.
Kafka is the answer to the need to collect data once and consume it many times.
6.1 About Kafka
What is Kafka?
The core concepts and definitions of Kafka.
6.2 How To Deploy and Use Kafka
Use a single-node Kafka and successfully run the built-in producer and consumer examples.
Write and run your own producer and consumer programs in Java.
Integrate Flume and Kafka: use Flume to monitor a log and send the log data to Kafka in real time.
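The text suggests Java for the producer/consumer exercise; to keep a single example language, here is the same idea as a sketch using the third-party kafka-python package. The broker address and topic name are placeholders, and the topic should be created first (for example with kafka-topics.sh).

```python
from kafka import KafkaProducer, KafkaConsumer

BROKERS = "localhost:9092"
TOPIC = "demo-logs"

# Producer: send a few messages.
producer = KafkaProducer(bootstrap_servers=BROKERS)
for i in range(5):
    producer.send(TOPIC, f"message {i}".encode("utf-8"))
producer.flush()

# Consumer: read them back from the beginning of the topic.
consumer = KafkaConsumer(TOPIC,
                         bootstrap_servers=BROKERS,
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)   # stop if idle for 5 seconds
for message in consumer:
    print(message.offset, message.value.decode("utf-8"))
```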
If you have done all the above, your “big data platform” should look like this:
Here, the data collected by Flume is not sent directly to HDFS but to Kafka first; the data in Kafka can then be consumed by multiple consumers at the same time, one of which synchronizes the data to HDFS.
If you have thoroughly followed the steps of Chapters 5 and 6 in “Writing for Beginners in Big Data Development 3”, you should already have the following skills and knowledge:
- Why Spark is faster than MapReduce;
- How to use SparkSQL instead of Hive to run SQL faster;
- How to use Kafka to collect data once and consume it many times;
- How to write your own programs to produce to and consume from Kafka.
From the study so far, you have mastered most of the skills on a big data platform: data collection, data storage and computation, and data exchange. Each of these steps needs a task (program) to complete it, and there are dependencies between tasks; for instance, the data computation task can only run after the data collection task has finished successfully. If a task fails, an alarm needs to be sent to the development and operations staff, along with complete logs for troubleshooting.
Chapter 7: More and more Analysis tasks
Not only analysis tasks: data collection and data exchange are tasks too. Some of these tasks run on a schedule, while others are triggered by other tasks. When the platform has hundreds or thousands of tasks to maintain and run, crontab alone is not enough, and a scheduling and monitoring system is needed to do this. The scheduling and monitoring system is the backbone of the entire data platform; similar to an AppMaster, it is responsible for allocating and monitoring tasks.
7.1 Apache Oozie
1. What is Oozie? What are the functions? 2. What types of tasks (programs) can Oozie schedule? 3. What task triggering modes does Oozie support? 4. Install Oozie.
7.2 Other Open-source Task scheduling systems
Azkaban:
https://azkaban.github.io/
light-task-scheduler:
https://github.com/ltsopensource/light-task-scheduler
Zeus:
https://github.com/alibaba/zeus
And so on…
In addition, I previously developed a task scheduling and monitoring system on my own; for details, see "Task Scheduling and Monitoring System of a Big Data Platform".
If you have done all the above, your “big data platform” should look like this:
Chapter 8: My data in Real time
Chapter 6 on Kafka mentioned some business scenarios that need real-time metrics. Real time can be divided into absolute real time and near real time: absolute real time generally requires millisecond-level latency, while near real time generally tolerates second- or minute-level latency. For absolute real-time scenarios, Storm is commonly used; for other near-real-time scenarios, either Storm or Spark Streaming will do. Of course, if you are able to, you can also write your own programs for this.
8.1 Storm
1. What is Storm? What are its possible application scenarios? 2. What are Storm's core components, and what role does each play? 3. Simple installation and deployment of Storm. 4. Write your own demo program and use Storm to complete a real-time computation over a data stream.
8.2 Spark Streaming
1. What is Spark Streaming, and how is it related to Spark? 2. Spark Streaming versus Storm: what are their respective advantages and disadvantages? 3. Use Kafka + Spark Streaming to complete a demo program for real-time computation.
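A sketch of that Kafka + Spark demo: a running word count over a Kafka topic. This uses Spark Structured Streaming, which newer Spark versions favor over the original DStream API from this article's era. The broker and topic are placeholders, and the job must be submitted with the spark-sql-kafka connector package matching your Spark version.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("kafka-streaming-wc").getOrCreate()

lines = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "demo-logs")
         .load()
         .selectExpr("CAST(value AS STRING) AS line"))

counts = (lines
          .select(explode(split(col("line"), " ")).alias("word"))
          .groupBy("word")
          .count())

query = (counts.writeStream
         .outputMode("complete")     # keep a running total per word
         .format("console")
         .start())
query.awaitTermination()
```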
If you have done all the above, your “big data platform” should look like this:
At this point, your big data platform infrastructure has been formed, including data acquisition, data storage and computing (offline and real-time), data synchronization, task scheduling and monitoring modules. Now it’s time to think about how best to make the data available.
Chapter 9: My data is public
A data platform usually provides data access to the outside (the business side), which generally includes the following aspects:
Offline: For example, the data of the last day is provided to the specified data source (DB, FILE, FTP) every day. Offline data can be provided by Sqoop, DataX and other offline data exchange tools.
Real-time: For example, the recommendation system of online website needs to obtain the recommendation data for users from the data platform in real time, which requires a very low delay (less than 50 milliseconds).
Possible solutions include HBase, Redis, MongoDB, and ElasticSearch based on delay requirements and real-time data query requirements.
OLAP analysis: besides requiring a standardized underlying data model, OLAP also demands ever faster response to queries. Possible solutions include Impala, Presto, SparkSQL, and Kylin. If your data model has some scale, Kylin is the best choice.
Ad-hoc query: the data behind ad-hoc queries is arbitrary, and it is generally hard to build a general data model for it, so possible solutions include Impala, Presto, and SparkSQL.
With so many mature frameworks and solutions, you need to choose the right ones in light of your own business requirements and your data platform's technical architecture. There is only one principle: the simpler and more stable, the better.
If you already know how to deliver data well, your “big data platform” should look something like this:
Chapter 10: Awesome machine learning
On this topic, I am a layman myself and can only give a brief introduction. Having majored in mathematics, I am very ashamed and regret that I did not study it well.
In our business, there are three types of problems that can be solved by machine learning:
- Classification problems: including binary and multi-class classification. Binary classification solves prediction problems, such as predicting whether an email is spam; multi-class classification solves problems like categorizing text;
- Clustering problem: roughly categorize users based on the keywords they have searched.
- Recommendation problem: Make recommendations based on users’ browsing history and clicking behavior.
Most industries use machine learning to solve these kinds of problems.
An introduction to the learning route:
- Mathematical foundations;
- Machine Learning in Action; knowing Python is best;
- Spark MLlib provides some packaged algorithms, as well as methods for feature processing and feature selection.
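To give a feel for the kind of "packaged algorithm plus feature processing" Spark MLlib offers, here is a tiny sketch using the DataFrame-based pyspark.ml API: a toy spam/not-spam classifier matching the binary classification example above. The training sentences are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

train = spark.createDataFrame([
    ("win a free prize now", 1.0),
    ("meeting notes for tomorrow", 0.0),
    ("free money claim today", 1.0),
    ("lunch at noon?", 0.0),
], ["text", "label"])

# Feature processing (tokenize + hash to term-frequency vectors) and the model,
# chained into one Pipeline.
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features", numFeatures=1 << 10),
    LogisticRegression(maxIter=10),
])

model = pipeline.fit(train)
test = spark.createDataFrame([("claim your free prize",)], ["text"])
model.transform(test).select("text", "prediction").show()
```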
Machine learning is really awesome, and it is also the goal of my study.
Add machine learning to your big data platform.