I don’t know whether you are a computer science graduate or already working in the field. In short, anyone with a Java foundation will find big data easier to learn, while those starting from zero need to begin with Java and Linux.

If you are a strong learner and have a lot of self-discipline, you can do it on your own.

Even for capable learners, the biggest drawback of self-study is the lack of real big data projects to practice on.

The made-up practice projects shared online simply cannot meet enterprise requirements, so real project experience is something you will have to arrange for yourself. Of course, if your employer can put you in a big data position while you are still learning, that problem solves itself.

Now let me talk about the current hot employment directions in big data:

1. Big data research and development

2. Big data analysis and mining

3. Deep learning

4. Artificial Intelligence




Java:

Just learn JavaSE, the standard edition of Java.
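
By way of calibration, “JavaSE level” means classes, collections, generics, and ideally Java 8 lambdas, since the Java APIs of every tool below are built from those pieces. A tiny sketch of that level (the word list is just made-up sample data):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class JavaSeDemo {
    public static void main(String[] args) {
        // Collections, generics, and lambdas: the core JavaSE skills
        // that the big data APIs below all build on.
        List<String> words = Arrays.asList("hadoop", "hive", "hadoop", "spark");
        Map<String, Long> counts = words.stream()
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        System.out.println(counts);  // e.g. {spark=1, hive=1, hadoop=2}
    }
}
```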

Linux:

The goal here is to master the theoretical foundations of the Linux operating system and the practical knowledge of server configuration, building real hands-on ability through plenty of experiments. Linux holds an important position in the industry and is used very widely; studying it deepens your understanding of server operating systems, strengthens your configuration skills, and lets you apply computer networking fundamentals in practice.

Master Linux installation, command-line operation, user management, disk management, file system management, software package management, process management, system monitoring, and troubleshooting, along with network configuration and the setup and administration of DNS, DHCP, HTTP, FTP, SMTP, and POP3 services. This lays a solid foundation for learning other network operating systems and for software development later on. If you also find time for Java web development and its frameworks, your big data studies will feel a little freer.

Now that the basics are out of the way, let’s talk about which big data technologies you still need to learn. You can study them in the order written here.

Hadoop:

What problem does Hadoop solve? Hadoop provides reliable storage and processing for big data: data too big for one computer to store, and too big for one computer to process in the required time. HDFS takes care of the distributed storage, and MapReduce takes care of the distributed processing.

Remember, this can serve as a checkpoint in your big data studies.
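
To make that concrete, here is a minimal sketch of the HDFS side through the Java API. It assumes a reachable NameNode at the made-up address hdfs://namenode:8020 and the hadoop-client library on the classpath; the file path is hypothetical too:

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");  // hypothetical NameNode
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/hadoop/hello.txt");
        // HDFS replicates each block across machines, which is the
        // "reliable storage" half of what Hadoop provides.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```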

ZooKeeper:

ZooKeeper is a distributed, open-source coordination service for distributed applications. It is an open-source implementation of Google’s Chubby and an important component of Hadoop and HBase. It provides consistency services for distributed applications, including configuration maintenance, naming, distributed synchronization, and group services.

Its goal is to encapsulate complex, error-prone critical services and hand users a simple, easy-to-use interface backed by an efficient and stable system.

ZooKeeper ships recipes for distributed exclusive locks, leader election, and queues under zookeeper-3.4.3/src/recipes. The lock and queue recipes come in Java and C versions; the election recipe has only a Java version.
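
As a taste of the client API, here is a minimal sketch that connects and creates an ephemeral znode, assuming a ZooKeeper server at localhost:2181; the /demo path is made up:

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Connect to a (hypothetical) local server; the watcher fires on session events.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000,
                event -> connected.countDown());
        connected.await();
        // An ephemeral znode disappears automatically when the session ends,
        // which is the building block behind the lock and election recipes above.
        zk.create("/demo", "hello".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        System.out.println("created /demo");
        zk.close();
    }
}
```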

MySQL:

MySQL is a relational database management system developed by the Swedish company MySQL AB and now owned by Oracle. It is one of the most popular relational database management systems, and in web applications it is among the most widely used RDBMS (Relational Database Management System) software.

As a relational system, MySQL keeps data in separate tables rather than dumping everything into one big warehouse, which improves both speed and flexibility.

MySQL uses SQL, the most common standardized language for accessing databases. The software follows a dual-licensing policy, split into a community edition and a commercial edition. Because it is small, fast, open source, and has a low total cost of ownership, MySQL is the usual choice of website database for small and medium-sized sites.
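
In big data work you will mostly touch MySQL from code. A minimal JDBC sketch, assuming the MySQL Connector/J driver is on the classpath; the test database, users table, and credentials are all hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class MySqlDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; adjust host, database, and credentials.
        String url = "jdbc:mysql://localhost:3306/test";
        try (Connection conn = DriverManager.getConnection(url, "root", "secret");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT id, name FROM users WHERE id > ?")) {
            ps.setInt(1, 100);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + " " + rs.getString("name"));
                }
            }
        }
    }
}
```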

Sqoop:

Sqoop is used to import data from MySQL into Hadoop. You can also export a MySQL table as a file and put it into HDFS. Of course, watch the load this puts on MySQL in a production environment.
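
Sqoop is normally driven from the command line. Here is a sketch of a typical import, launched from Java with ProcessBuilder so we stay in one language; the connection string, table, and target directory are all made up, and it assumes the sqoop binary is on the PATH:

```java
import java.io.IOException;

public class SqoopImportDemo {
    public static void main(String[] args) throws IOException, InterruptedException {
        // A typical "sqoop import" invocation; every value below is hypothetical.
        ProcessBuilder pb = new ProcessBuilder(
                "sqoop", "import",
                "--connect", "jdbc:mysql://dbhost:3306/shop",
                "--username", "reader",
                "--password", "secret",
                "--table", "orders",
                "--target-dir", "/user/hadoop/orders",
                "-m", "1");              // one mapper keeps the load on MySQL low
        pb.inheritIO();                  // stream sqoop's output to this console
        int exit = pb.start().waitFor();
        System.out.println("sqoop exited with " + exit);
    }
}
```

The -m 1 flag caps the import at a single mapper, which is the bluntest way to limit the pressure on the source database mentioned above.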

Hive:

This is a great tool for anyone who already knows SQL syntax: it makes handling big data easy, without the hassle of writing MapReduce programs. What about Pig, some ask? Pig and Hive do roughly the same job, so mastering either one of them is enough.
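
Hive is usually queried over JDBC through HiveServer2. A minimal sketch, assuming the hive-jdbc driver is on the classpath; the host and the logs table are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveDemo {
    public static void main(String[] args) throws Exception {
        // HiveServer2 usually listens on port 10000; host and table are hypothetical.
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
             Statement st = conn.createStatement();
             // Plain SQL; Hive compiles this into distributed jobs for you.
             ResultSet rs = st.executeQuery(
                     "SELECT level, COUNT(*) AS cnt FROM logs GROUP BY level")) {
            while (rs.next()) {
                System.out.println(rs.getString("level") + " " + rs.getLong("cnt"));
            }
        }
    }
}
```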

Oozie:

Oozie can manage your Hive, MapReduce, and Spark scripts for you. It checks whether your jobs ran correctly, alerts you on failures, retries them, and, most importantly, lets you declare dependencies between tasks. I’m sure you will love it; without it, staring at the pile of scripts and cron jobs will make you feel terrible.
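
Oozie jobs are defined in a workflow.xml on HDFS and submitted to the Oozie server. A minimal sketch using the Java client API, with every URL, path, and property value made up for illustration:

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical Oozie server URL and workflow application path.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");
        Properties conf = oozie.createConfiguration();
        // The workflow.xml at this HDFS path defines the job graph and its dependencies.
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/hadoop/wf-demo");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "resourcemanager:8032");
        String jobId = oozie.run(conf);   // submit and start the workflow
        System.out.println("workflow started: " + jobId);
    }
}
```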

HBase:

This is a NoSQL database in the Hadoop ecosystem. Its data is stored as key-value pairs, and each key is unique, so it can be used for deduplication. It can store far more data than MySQL, which is why it is often used as the destination store after big data processing.
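
A minimal sketch of a write through the HBase Java client API. It assumes a running cluster reachable through the made-up ZooKeeper host below, and a pre-created events table with a column family d:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk-host");  // hypothetical ZooKeeper host
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("events"))) {
            // Writing the same row key twice simply overwrites the row,
            // which is exactly why HBase is handy for deduplication.
            Put put = new Put(Bytes.toBytes("user-42#2018-06-01"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("count"), Bytes.toBytes("1"));
            table.put(put);
        }
    }
}
```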

Kafka:

This is a useful queueing tool. What is a queue for? Think of lining up to buy tickets: data, too, has to be handled one batch at a time. A queue also keeps the peace between you and the colleagues feeding you data. Without one, they dump, say, hundreds of GB of files on you and wonder why you can’t keep up; you can’t really blame them, since they don’t do big data. With Kafka in the middle, you just tell them to put the data into the queue, and you take it out piece by piece as you process it. Then they can stop complaining and go optimize their own program, because any backlog on their side is now their problem, not a fault of what they handed you. For online real-time data you can also land it into HDFS, usually together with a tool called Flume, which specializes in simple data transport and can write to all sorts of receivers (Kafka included).
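
Putting data into the queue is only a few lines with the Kafka producer API. A minimal sketch; the broker address and the events topic are hypothetical:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");   // hypothetical broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Drop each event into the queue; consumers pull at their own pace.
            producer.send(new ProducerRecord<>("events", "user-42", "page_view"));
        }
    }
}
```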

Spark:

Spark is designed to make up for the speed limits of MapReduce-based data processing: it loads data into memory for computation instead of repeatedly reading it from slow disks. It is particularly suited to iterative computation, so people who work on algorithms are especially keen on it. Spark is written in Scala, but you can drive it from either Java or Scala, since both run on the JVM.
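
Here is the classic word count as a minimal sketch with Spark’s Java API (Spark 2.x style). local[*] runs it inside the current JVM, and the input path is hypothetical:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("wordcount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/user/hadoop/input");
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);  // intermediate data stays in memory
        counts.take(10).forEach(t -> System.out.println(t._1 + " " + t._2));
        sc.stop();
    }
}
```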

Later on, I will go through the knowledge points above and work through them one by one.