“Written up front”: I am “Yun Qi”, a big data developer who loves technology and also writes poetry. The nickname comes from a line in one of Wang Anshi’s poems that I am very fond of. On the one hand, blogging is a small summary and record of my own learning; on the other hand, I hope it helps more friends who are interested in big data. If you are also interested in data middle platforms, data modeling, data analysis, or Flink/Spark/Hadoop/data warehouse development, follow me and let us mine the value of data together ~

1. Foreword

I previously wrote an article, “Interviewing at nearly 20 large and mid-sized companies in a month: breaking through the Internet winter and landing successfully”. Many friends left comments and private messages asking about my big data learning route, about the work experience required, and about switching into big data, and that was impossible to explain clearly in a few words. So I spent a month compiling the big data learning route I followed when I started, beginning with the most basic big data cluster setup. I hope it can help you.

But before we start, I hope you can think clearly about why you want to develop in the direction of big data, especially if you are feeling lost. Let me just ask: what is your major, and what about computers and software interests you?

Are you a computer science major interested in operating systems, hardware, networking, and servers? A software major interested in software development, programming, and writing code? Or a math or statistics major with a special interest in data and numbers?

Please leave a comment on this topic (•̀ ω •́)✧

This is actually related to the three development directions of big data:

  • Platform building/optimization/operation and maintenance/monitoring
  • Big data development/design/architecture
  • Data analysis/mining

Nowadays, to cope with these characteristics of big data, open source big data frameworks keep appearing and keep getting stronger. Let’s first list some common ones:

File storage: Hadoop HDFS, Tachyon, and KFS

Offline computing: Hadoop MapReduce and Spark

Real-time computing: Storm, Spark Streaming, Flink

K-V and NoSQL databases: HBase, Redis, and MongoDB

Resource management: YARN and Mesos

Log collection: Flume, Scribe, Logstash, and Kibana

Messaging systems: Kafka, StormMQ, ZeroMQ, RabbitMQ

Query analysis: Hive, Impala, Pig, Presto, Phoenix, SparkSQL, Drill, Flink, Kylin, Druid

Distributed coordination service: Zookeeper

Cluster management and monitoring: Ambari, Ganglia, Nagios, Cloudera Manager

Data mining, machine learning: Mahout, Spark MLlib

Data synchronization: Sqoop

Task scheduling: Oozie

Dazzling, right? There are more than 30 of them above. Never mind being proficient in all of them; the people who can even use them all are probably few.

Personally, I mainly work in the second direction (development/design/architecture), so I’ll start with the history of big data. Since my own experience is limited, this article draws on the views of many teachers in the field, offered for your reference and mutual learning.

2. The history of big data

On the history of big data, I think Luo Junwu’s article “In the AI era, still don’t understand big data?” explains it very clearly. Big data has gone through five stages in its nearly thirty-year history.

2.1 Initiation stage: Emergence of data warehouse

In the 1990s, business intelligence (also known as BI) was born. It turns existing business data into knowledge that helps bosses make business decisions. For example, in the retail scenario, sales data and inventory information of goods need to be analyzed in order to make a reasonable purchase plan.

Obviously, business intelligence depends on data analysis, which requires aggregating data from multiple business systems (e.g., transaction systems, warehousing systems) and running range queries over large amounts of data. Traditional databases, however, are built for the create, read, update, and delete operations of a single business and cannot meet this demand, which prompted the emergence of the data warehouse concept.

The traditional data warehouse, for the first time, treated data analysis as a distinct application scenario and implemented it with a separate solution that does not depend on the business database.

2.2 Technological change: Hadoop was born

Around 2000, the PC Internet era arrived, and with it came a huge amount of information with two very typical features:

  • Data scale is getting bigger: Internet giants like Google and Yahoo can generate hundreds of millions of pieces of behavioral data a day.
  • Diversified data types: in addition to structured business data, there is massive user behavior data as well as multimedia data represented by images and videos.

Clearly, the traditional data warehouse could not support business intelligence in the Internet era. Starting in 2003, Google published three seminal papers (commonly known as the “Google troika”): GFS, MapReduce, and BigTable. These three papers laid the theoretical foundation for modern big data technology.

The trouble was that Google did not open source these three systems; it only published detailed design papers. In 2005, Yahoo funded Hadoop, an open source implementation built according to these papers, and this technological change officially opened the era of big data.

Hadoop has the following advantages over traditional data warehouses:

  • Fully distributed: clusters can be built from cheap machines, fully meeting the needs of massive data storage.
  • Weakened data formats: the data model is separated from the data storage, meeting the analysis needs of heterogeneous data.

As Hadoop technology matured, the concept of “data lake” was introduced at the Hadoop World Conference in 2010.

You can read my blog about the data lake theory.

What is the use of Data Lake? Let’s take a look…

Enterprises can build data lakes based on Hadoop and make data their core asset. Thus, the data lake kicked off the commercialization of Hadoop.

2.3 Data Factory Era: The rise of big data platforms

Commercial Hadoop involves ten or so technologies, and the whole data development pipeline is very complex. Completing a single data requirement involves data extraction, data storage, data processing, data warehouse construction, multidimensional analysis, data visualization, and a whole chain of processes. Such a high technical threshold obviously restricts the popularization of big data technology.

At this point, the big data platform (following the platform-as-a-service, or PaaS, idea) came into being. It is a full-link solution for data development scenarios that can greatly improve data R&D efficiency: data is processed as if on an assembly line, and raw data quickly becomes indicators that appear in reports or data products.

2.4 The data value era: Alibaba proposes the data middle platform

Around 2016, we were already in the mobile Internet era. With the popularity of big data platforms, a great many big data application scenarios were spawned.

New problems then appeared: to meet business requirements quickly, the chimney-style development model left data completely fragmented across different business lines, which caused a great deal of duplicated development of data indicators. This not only lowered R&D efficiency but also wasted storage and computing resources, making big data applications more and more expensive.

At this point Jack Ma (“Papa Ma”, as Chinese netizens call him), with great foresight, proposed the concept of the “data middle platform”, and the slogan “One Data, One Service” began to be heard across the big data field. “The core idea of the data middle platform is to avoid duplicate computation of data, improve data sharing capability, and empower the business through data services.”

For more on the Alibaba data middle platform, see this article reposted from Tan Hu and Chen Xiaoyong:

A detailed explanation of the Alibaba Cloud data middle platform: one article to fully understand the “internet celebrity” of big data

3. What are the core technologies of big data?

The concept of big data is abstract, and the sheer size of the big data technology stack will blow your mind.

The big data technology system is huge and complex. The basic technologies cover data collection, data preprocessing, distributed storage, NoSQL databases, data warehousing, machine learning, parallel computing, visualization, and other categories at various technical levels. First, here is a generalized big data processing framework, divided mainly into the following aspects: “data collection and preprocessing, data storage, data cleaning, data query and analysis, and data visualization.”

  • “Data collection”: this is the first step of big data processing. Data sources fall mainly into two categories. The first is the relational databases of the various business systems, which are extracted or synchronized periodically through tools such as Sqoop or Canal. The second is the various tracking-point (buried-point) logs, which are collected in real time through Flume.
  • “Data storage”: once collected, data is stored in HDFS; in the case of real-time log streams, it is output through Kafka to a streaming computing engine.
  • “Data analysis”: this step is the core of data processing and includes both offline and streaming processing. The corresponding computing engines include MapReduce, Spark, and Flink, and the processed results are saved to a pre-designed data warehouse or to storage systems such as HBase, Redis, or an RDBMS (a minimal sketch follows this list).
  • “Data application”: the various data application scenarios, such as data visualization, business decision-making, and AI.
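To make the flow concrete, here is a minimal sketch in Scala of the storage and analysis steps: a Spark batch job that reads raw behavior logs from HDFS, cleans them, and writes the result back for downstream use. Every path, field layout, and name below is invented for illustration, not taken from any real system:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch of the storage -> analysis -> application flow.
object BehaviorLogClean {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("behavior-log-clean").getOrCreate()
    import spark.implicits._

    // Raw behavior logs, assumed already collected by Flume onto HDFS.
    val raw = spark.read.textFile("hdfs:///ods/behavior_log/dt=2021-01-01")

    // Cleaning: drop malformed tab-separated lines, keep (user_id, action, ts).
    val cleaned = raw.map(_.split("\t"))
      .filter(_.length >= 3)
      .map(f => (f(0), f(1), f(2).toLong))
      .toDF("user_id", "action", "ts")

    // Persist the detail data for downstream warehouse layers and reports.
    cleaned.write.mode("overwrite")
      .parquet("hdfs:///dwd/behavior_log_clean/dt=2021-01-01")

    spark.stop()
  }
}
```

The same shape applies to streaming: swap the HDFS read for a Kafka source and the batch engine for Spark Streaming or Flink.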

4. Data warehouse architecture under big data

A data warehouse is a way of organizing data from a business perspective; it is the foundation of big data applications and of the data middle platform. A data warehouse system generally adopts the layered structure shown in the figure below.

[Figure: layered data warehouse architecture]

In this scheme, our development focus is on the DWD layer, the detail data layer, which mainly holds wide tables and detailed data. In the DWS layer, we aggregate data along different dimensions; in principle, the DWS layer is a data mart layer, generally divided by subject, and belongs to the realm of dimensional modeling. ADS is the application layer that outputs the various reports. A sketch of developing against these layers follows.
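As a hedged illustration of what this layering looks like in code (all table and field names here are invented for the example), a daily DWS aggregation and an ADS report over a hypothetical DWD order detail table could be expressed with Spark SQL like this:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical layering example: DWD detail -> DWS aggregate -> ADS report.
object WarehouseLayers {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("warehouse-layers")
      .enableHiveSupport()
      .getOrCreate()

    // DWS: aggregate the DWD order wide table along the "city" dimension, per day.
    spark.sql(
      """INSERT OVERWRITE TABLE dws_trade_city_1d PARTITION (dt = '2021-01-01')
        |SELECT city,
        |       COUNT(DISTINCT user_id) AS buyers,
        |       SUM(amount)             AS gmv
        |FROM dwd_order_detail
        |WHERE dt = '2021-01-01'
        |GROUP BY city""".stripMargin)

    // ADS: an application-layer report, e.g. the day's top 10 cities by GMV.
    spark.sql(
      """INSERT OVERWRITE TABLE ads_city_gmv_top10 PARTITION (dt = '2021-01-01')
        |SELECT city, gmv
        |FROM dws_trade_city_1d
        |WHERE dt = '2021-01-01'
        |ORDER BY gmv DESC
        |LIMIT 10""".stripMargin)

    spark.stop()
  }
}
```

The point of the layering is visible even in this toy: the ADS query never touches detail data, only the already-aggregated DWS table, so the expensive computation is done once and shared.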

5. Study guide

First, take away a study guide to read:

——-> The growth path of a big data development engineer (compiled from Zhihu)

Second, the Alibaba Cloud big data ACA and ACP (both are Alibaba Cloud big data certifications and worth taking!)

——-> The Alibaba Cloud big data development practice series (also known as my road of big data development on Alibaba Cloud)

Below is a system architecture diagram and a warehouse layered model diagram that I designed with Alibaba Cloud’s big data development components (I will share the specific design ideas when I get the chance).

Here, I strongly recommend Alibaba’s book, “The Road to Big Data: Alibaba’s Big Data Practice”. A true masterpiece!!

Then take a look at a predecessor’s study guide to the big data open source frameworks (it is so detailed that I was too lazy to draw my own). Finally, a word at the end: after all, this blogger has only been in the industry for two years, so for friends’ questions I can only try to give different suggestions for different people.

For fresh graduates

Personally, I think fresh graduates should focus on laying a good foundation. Undergraduate programs generally offer courses such as data structures, fundamentals of algorithms, operating systems, compiler principles, and computer networks. Learn these courses well: with a solid foundation, learning everything else is not a big problem, and interviews at many large companies ask about exactly these things. If you are going to work in IT, they will be very helpful.

As for which language to learn, I think that for the big data industry, Java is the more common choice. If you have the time and interest, you can also learn Scala, which is a great language for writing Spark.

You must set up a cluster environment. If conditions allow, build a small distributed cluster; if not, install virtual machines on your own computer and build a pseudo-distributed cluster. This not only helps you understand Hadoop thoroughly, it also lets you actually do something with it. Every pit you step into is a precious treasure.

Then you can try writing some of the common operations of data computation, such as deduplication, sorting, and table joins; a small sketch follows.
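For example, a first exercise might look like this sketch on toy in-memory data (the tables, names, and values are all made up), runnable locally with Spark:

```scala
import org.apache.spark.sql.SparkSession

// Classic beginner exercises on toy data: dedup, sort, table join.
object BasicOps {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("basic-ops")
      .master("local[*]") // local mode, fine for a pseudo-distributed setup
      .getOrCreate()
    import spark.implicits._

    // Toy tables: an orders table (with one duplicate row) and a users table.
    val orders = Seq((1, "u1", 30.0), (2, "u2", 15.0), (2, "u2", 15.0))
      .toDF("order_id", "user_id", "amount")
    val users = Seq(("u1", "Hangzhou"), ("u2", "Beijing"))
      .toDF("user_id", "city")

    // De-duplication: drop the repeated order row.
    val deduped = orders.distinct()

    // Sorting: largest orders first.
    val sorted = deduped.orderBy($"amount".desc)

    // Table association: join orders to each user's city.
    sorted.join(users, "user_id").show()

    spark.stop()
  }
}
```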

I also have a young friend who graduated this year with a big data major from a 211 university. Two weeks after coming to Hangzhou as an intern, he already had data warehouse tasks running in production, so I crowned him (promoted) ~→ the strongest intern (he keeps showing off to me that the intern who arrived before him is still doing chores…). Hahaha.

For those who have work experience and want to change careers

Companies mainly examine three things: first, your fundamentals; second, your ability to learn; third, your ability to solve problems.

Fundamentals are easy to examine: a few written test questions basically reveal your level.

Learning ability is very important, because writing Java web applications is not the same as writing MapReduce. There are many kinds of big data processing technologies today, and enterprises rarely use only one of them. The industry also develops relatively fast, so you must keep learning new things and putting them into practice.

The ability to solve problems matters at any time, and especially in data development. We often run into data problems, for example, the final BI numbers do not match. Generally speaking, a final figure is derived from many pieces of original data and is processed N times along the way. This requires you to be sensitive to data and able to get to the bottom of a problem and solve it in as short a time as possible.

If your basics are solid, just review them in the two weeks before changing jobs. Learning ability and problem-solving ability should be exercised in your daily work. The three points above are the minimum requirements of experienced hiring; if you also study some big data in your spare time, that is a nice plus.

Those are some personal experiences and insights; I hope they help you (•̀ㅂ•́)و✧

I am “Yun Qi”, a big data developer who loves technology and writes poetry. Welcome to follow my official account [Yun Qi Qi], Love & Peace!