The field of big data is so broad that people who want to start learning big data and related technologies are often intimidated. The variety of big data technologies also makes it difficult for beginners to choose where to start.


That’s why I wanted to write this article. It will help you get started on your journey to learn big data and to find a job in the big data industry. The biggest challenge we face right now is choosing the right role based on our interests and skills.

To address this issue, I describe each big-data-related role in detail in this article, taking into account the different backgrounds of engineers and computer science graduates.

I have tried to answer, in detail, every question that people encounter or might encounter while learning big data. To help you choose a career path according to your interests, I have also added a decision tree to help you find the right one.

(Figure: the learning path tree)

With the help of this tree, you can choose a path based on your interests and goals, and then begin your journey of learning big data.

Table of contents

1. How do YOU get started?


2. What are the job requirements in the big data field?

3. What is your field and where do you fit in?

4. Map out your role by domain

5. How to become a big data engineer?

5.1 Terminology of big data

5.2 Systems and architectures you need to know

5.3 Learn to design solutions and technologies

6. Big data learning path

7. Resources

1. How do YOU get started?

The most common question people ask me when they want to start learning about big data is: “Should I learn Hadoop (an open-source framework for distributed storage and computing, consisting of HDFS and MapReduce, the open-source implementations of Google’s GFS and MapReduce respectively; thanks to its ease of use and scalability it has become a popular framework for massive data processing, and its name comes from a toy elephant belonging to its inventor’s son), distributed computing, Kafka (a distributed publish/subscribe messaging system developed at LinkedIn), NoSQL (non-relational databases), or Spark (an open-source cluster computing framework similar to Hadoop, but with some differences)?”

I usually have one answer: “It depends on what you want to do.”

So let’s approach the problem in a methodical way. We will explore this learning path step by step.

2. What are the job requirements in the big data field?

There are many fields in the big data industry. Generally they can be divided into two categories:

Big data engineering

Big data analytics

These two domains are distinct from each other, yet interrelated.

Big data engineering involves the design, deployment, acquisition, and maintenance of large amounts of data. Big data engineers need to design and deploy such a system to make relevant data available to different consumers and internal applications.

Big data analytics, on the other hand, takes advantage of the massive amounts of data made available by the systems that big data engineers design. Big data analytics includes trend analysis, pattern analysis, and the development of various classification, prediction, and forecasting systems.

So, in a nutshell, big data analytics performs advanced computations on the data, while big data engineering covers the design, deployment, and operation of the platform on which those computations run.

3. What is your field and where do you fit in?

Now that we’ve looked at the types of careers available in the industry, let’s figure out how to determine which field is right for you. That way, we can determine your position in the industry.

Generally speaking, based on your educational background and industry experience, we can categorize you as follows:

Education background

(Including interests, not necessarily related to your college education)

Computer science

Mathematics

Industry experience

Newcomer (fresher)

Data scientist

Computer engineer (working in a data-related field)

So, with the categories above, you can position your field as follows:

Example 1: “I am a computer science graduate with no solid math skills.”

If you are interested in computer science or mathematics but have no prior industry experience, you count as a newcomer.

Example 2: “I am a computer science graduate working in database development.”

Your interest is in computer science and you are suited for the role of computer engineer (data related engineering).

Example 3: “I have a statistics background and work as a data scientist.”

You have an interest in the field of mathematics, which fits your career role as a data scientist.

Use these examples as a reference to identify your own field.

(The fields defined here are critical to determining your learning path in the big data industry.)

4. Map out your role by domain

Now that you’ve identified your field, let’s map out the positions you want to work toward.

If you have excellent programming skills, understand how computers work over networks, and have no interest in mathematics or statistics, then you should aim for a big data engineering position.

If you’re good at programming and have an education or interest in math or statistics, you should aim for a position as a big data analyst.

5. How to become a big data engineer

Let’s start by defining what an industry-recognized big data engineer needs to learn and understand. The first and most important step is to identify your requirements: you cannot just start learning big data technologies without knowing what you need them for. Otherwise, you would be shooting in the dark.

In order to understand your needs, you must understand common big data terminology. So let’s look at what big data really means.

5.1 Terminology of big data

Big data engineering usually includes two aspects — data requirements and processing requirements.

5.1.1 Terminology of data requirements

Structure: You should know that data can be stored in tables or files. Data stored in a predefined data model (that is, with a schema) is called structured data. If the data is stored in a file with no predefined model, it is called unstructured data. (Type: structured/unstructured).

Capacity: We use capacity to define the amount of data. (Type: S/M/L/XL/XXL/streaming)

Sink throughput: the rate at which the system can accept incoming data. (Type: H/M/L)

Source throughput: the rate at which the sources update and push data into the system. (Type: H/M/L)
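To make the structured/unstructured distinction above concrete, here is a minimal Python sketch (the file contents and field names are invented for illustration): structured data carries a schema we can compute against directly, while unstructured data must be parsed ad hoc.

```python
import csv
import io

# Structured data: rows conform to a predefined model (fixed columns, known types).
structured = io.StringIO("order_id,amount\n1001,250.0\n1002,99.5\n")
rows = list(csv.DictReader(structured))
total = sum(float(r["amount"]) for r in rows)  # the schema lets us aggregate directly

# Unstructured data: free text with no predefined model; extraction is ad hoc.
log_line = "2017-03-01 12:00:03 user=42 viewed product page /shoes"
mentions_product = "product" in log_line  # keyword search, no schema to rely on

print(total)             # 349.5
print(mentions_product)  # True
```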

5.1.2 Terminology of processing requirements

Query time: the time the system takes to answer a query. (Type: long/medium/short)

Processing time: the time required to process the data. (Type: long/medium/short)

Accuracy: the accuracy of the data processing. (Type: accurate/approximate)

5.2 Systems and Architectures you need to know

Scenario 1:

To analyze a company’s sales performance, you need to design a system that creates a data pool from multiple sources, such as customer data, leads data, call center data, sales data, product data, blogs, etc.

5.3 Learn to design solutions and technologies

Solution for scenario 1: a sales data pool

(This is my personal solution, if you think of a better solution please share it below)

So how does a data engineer solve this problem?

It is important to remember that the purpose of a big data system is not only to integrate data from various sources seamlessly and make it usable, but also to make it easy and fast to analyze that data and to build applications on top of it (in this case, an intelligent dashboard).

Define the final goal:

  1. Create a data pool by combining data from a variety of sources.

  2. Data is automatically updated at regular intervals (in this case, maybe once a week).

  3. Data is available for analysis (possibly even daily).

  4. An easily accessible architecture that allows seamless deployment of the analytics dashboard.

Now that we know our final goal, let’s try to formulate our requirements in as formal terms as possible.

5.3.1 Data requirements

Structure: Most data is structured and has a defined data model. But data sources such as web logs, customer interaction/call center data, image data from sales catalogs, product advertising data, etc., are unstructured. The availability and requirements of image and multimedia advertising data may depend on individual companies.

Conclusion: Structured and unstructured data

Size: L or XL (choose Hadoop)

Sink throughput: high

Quality: medium (Hadoop and Kafka)

Completeness: incomplete

5.3.2 Processing requirements

Query time: medium to long

Processing time: medium to short

Accuracy: accurate

As multiple data sources are integrated, it is important to note that different data will enter the system at different rates. For example, web logs can be streamed into the system in a highly granular continuous stream.

Based on the above analysis of our system requirements, we can recommend the following big data systems.
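Purely as an illustration, the reasoning above can be sketched as a toy rule-based function. The rules and names below are simplified assumptions of mine, not a definitive selection procedure.

```python
# Illustrative sketch: map the requirement profile from scenario 1 to a
# candidate technology stack. The rules are deliberately simplistic.
def suggest_stack(structure, size, sink_throughput, accuracy):
    stack = []
    if structure in ("unstructured", "mixed"):
        stack.append("HDFS")              # schema-less storage for raw files
    if size in ("L", "XL", "XXL"):
        stack.append("Hadoop")            # distributed batch processing at volume
    if sink_throughput == "high":
        stack.append("Kafka")             # buffers fast incoming data streams
    if accuracy == "accurate":
        stack.append("batch processing")  # favor exact results over approximations
    return stack

# Scenario 1: mixed data, size L/XL, high sink throughput, accurate processing.
print(suggest_stack("mixed", "XL", "high", "accurate"))
# ['HDFS', 'Hadoop', 'Kafka', 'batch processing']
```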

6. Big data learning path

Now you have an understanding of the big data industry and the different roles and requirements of big data practitioners. Let’s take a look at which path you should follow to become a big data engineer.

We know that the big data space is full of multiple technologies. Therefore, it is important that you learn the techniques relevant to your big data role. It’s a little different from any normal field, like data science or machine learning, where you can start somewhere and try to do everything in that field.

Below you’ll find a tree you should traverse to find your own path. Even if some of the technologies in the tree are marked as a data scientist’s strengths, it is always good to know each technology down to the leaf nodes once you commit to a path. The tree is derived from the Lambda architecture paradigm.

(Figure: the learning path tree)

One of the basic concepts that any engineer who wants to deploy an application must know is Bash scripting. You have to be comfortable with Linux and bash scripting. This is the basic requirement for processing big data.
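To get a feel for what “comfortable with bash” means here, a tiny sketch of shell-based log crunching (the log file and its contents are invented for the example):

```shell
#!/bin/sh
# Create a toy access log (contents invented for the example).
printf 'GET /home\nGET /shoes\nPOST /cart\nGET /home\n' > access.log

# Count requests per HTTP method: a pipe-based "map" (cut), "shuffle" (sort),
# and "reduce" (uniq -c) -- the same shape as a MapReduce job, in miniature.
cut -d ' ' -f 1 access.log | sort | uniq -c | sort -rn
```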

At its core, most big data technologies are written in Java or Scala. But don’t worry, if you don’t want to write code in these languages, you can always choose Python or R, because most big data technologies now support Python and R.

So you can start with any of these languages. I recommend Python or Java.

Next, you need to be familiar with working in the cloud. That’s because if you aren’t handling big data in the cloud, no one will take you seriously. Try practicing with small data sets on AWS, SoftLayer, or any other cloud vendor. Most of them offer a free tier that students can use to practice. You can skip this step for a while if you want, but be sure to have worked in the cloud before any interview.

Next, you need to understand a distributed file system. The most popular distributed file system is the Hadoop distributed file system. At this stage you can also learn about NoSQL databases that you find relevant to your field. The following figure can help you choose a NoSQL database to study based on your area of interest.

The paths so far are hard basics that every big data engineer must know.

Now, you decide whether you want to work with streams of data or with large amounts of data at rest. This is a choice between Volume and Velocity, two of the four V’s used to define big data (Volume, Velocity, Variety, and Veracity).

So let’s say you’ve decided to use data streams to develop real-time or near-real-time analytics systems. Then you should take the Kafka route; otherwise, take the MapReduce route, and follow your chosen path onward. Note that on the MapReduce path you do not need to learn both Pig and Hive; studying just one of them is enough.
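For a feel of what the MapReduce route involves, here is a single-machine Python sketch of the classic word-count pattern; real frameworks distribute the map, shuffle, and reduce phases across a cluster, but the shape of the computation is the same.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Shuffle + reduce: group pairs by key and sum the counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big ideas", "data beats opinions"]
print(reduce_phase(map_phase(lines)))
# {'big': 2, 'data': 2, 'ideas': 1, 'beats': 1, 'opinions': 1}
```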

Summary: how to traverse the tree.

  1. Start at the root node and perform a depth-first traversal.

  2. Stop at each node and check out the resources given in its links.

  3. If you are knowledgeable and confident with the technology, move on to the next node.

  4. Try to complete at least 3 programming problems at each node.

  5. Move on to the next node.

  6. Reach a leaf node.

  7. Start over with an alternative path.

I actually ran into the last step (#7) myself! To be honest, hardly any real application involves only streaming or only slow, batch data processing, so you need to be technically proficient enough to implement a complete Lambda architecture.

Also, note that this is not the only way to learn big data techniques. You can always create your own path. But this is a path that can be used by anyone.

If you want to enter the world of big data analytics, you can follow the same path, but don’t try to make everything perfect.

For data scientists who handle big data, you will need to add a machine learning path to the tree above and focus on that rather than on the tree alone. But we can talk about the machine learning path another time.

Add the selected NoSQL database based on the data type you used in the tree above.

(Table: data store type requirements and software selection)

As you can see, there are a large number of NoSQL databases to choose from. So it often depends on the type of data you’re going to use.

And in order to provide a definitive answer as to what type of NoSQL database to use, you need to consider your system requirements such as latency, availability, elasticity, accuracy and of course the type of data you are currently working with.

7. Resources

Bash Guide for Beginners, by Machtelt Garrels

1. Python

Python for Everybody specialization, from Coursera (https://www.coursera.org/specializations/python)

Learning path for data science in Python, from Analytics Vidhya (https://www.analyticsvidhya.com/learning-paths-data-science-business-analytics-business-intelligence-big-data/)

2. Java

Introduction to Programming with Java 1: Starting to Code in Java, from edX (https://www.edx.org/course/introduction-programming-java-1-starting-uc3mx-it-1-1x)

Intermediate and Advanced Java Programming, from Udemy (https://www.udemy.com/intermediate-advanced-java-programming/)

Introduction to Programming with Java 2, from edX (https://www.edx.org/course/introduction-programming-java-2-writing-uc3mx-it-1-2x)

Object Oriented Java Programming: Data Structures and Beyond specialization, from Coursera (https://www.coursera.org/specializations/java-object-oriented)

3. Cloud

Big Data Technology Fundamentals, from Amazon Web Services (https://www.edx.org/course/introduction-programming-java-starting-uc3mx-it-1-1x)

Big Data on AWS, from Amazon Web Services (https://aws.amazon.com/training/course-descriptions/bigdata/)

4. HDFS

Big Data and Hadoop Essentials, from Udemy (https://www.udemy.com/big-data-and-hadoop-essentials-free-tutorial/)

Big Data Fundamentals, from Big Data University (https://bigdatauniversity.com/learn/big-data/)

Hadoop Starter Kit, from Udemy (https://www.udemy.com/hadoopstarterkit/)

Apache Hadoop documentation (https://hadoop.apache.org/docs/r2.7.2/)

Book: Hadoop Cluster Deployment (http://shop.oreilly.com/product/0636920033448.do)

5. Apache ZooKeeper

Apache ZooKeeper documentation (https://zookeeper.apache.org/doc/r3.4.6/)

Book: ZooKeeper (http://shop.oreilly.com/product/0636920028901.do)

6. Apache Kafka

Apache Kafka course for complete beginners (http://shop.oreilly.com/product/0636920028901.do)

Learn Apache Kafka Basics and Advanced Topics, from Udemy (https://www.udemy.com/learn-apache-kafka-basics-and-advanced-topics/)

Apache Kafka documentation (https://kafka.apache.org/documentation/)

Book: Learning Apache Kafka, by Nishant Garg (https://www.amazon.in/Learning-Apache-Kafka-Nishant-Garg-ebook/dp/B00U2MI8MI/256-7260357-1334049?_encoding=UTF8&tag=googinhydr18418-21)

7. SQL

Managing Big Data with MySQL (https://www.udemy.com/beginners-guide-to-postgresql/)

SQL Course (http://www.sqlcourse.com)

PostgreSQL Beginner’s Guide, from Udemy (https://www.udemy.com/beginners-guide-to-postgresql/)

Book: High Performance MySQL (http://shop.oreilly.com/product/0636920022343.do)

8. Hive

Accessing Hadoop Data Using Hive (https://cognitiveclass.ai/learn/big-data/)

Learning Hive in the Apache Hadoop ecosystem (https://cognitiveclass.ai/learn/big-data/)

Apache Hive documentation (https://hive.apache.org)

Book: Programming Hive (https://hive.apache.org)

9. Pig

Apache Pig 101, from Big Data University (https://cognitiveclass.ai/courses/introduction-to-pig/)

Hadoop Programming with Apache Pig (https://bigdatauniversity.com/courses/introduction-to-pig/)

Apache Pig documentation (https://pig.apache.org/docs/r0.12.0/)

Book: Programming Pig (http://shop.oreilly.com/product/0636920044383.do)

10. Apache Storm

Real-Time Analytics with Apache Storm, from Udacity (https://www.udacity.com/course/real-time-analytics-with-apache-storm--ud381)

Apache Storm documentation (https://storm.apache.org/)

11. Amazon Kinesis

Amazon Kinesis documentation (https://aws.amazon.com/cn/documentation/kinesis/)

Amazon Kinesis Streams developer resources, from Amazon Web Services (https://aws.amazon.com/documentation/kinesis/)

12. Apache Spark

Data Science and Engineering with Apache Spark, from edX (https://www.edx.org/xseries/data-science-engineering-apache-spark)

Apache Spark documentation (https://spark.apache.org/docs/latest/)

Book: Learning Spark (https://www.edx.org/xseries/data-science-engineering-apacher-sparktm)

13. Apache Spark Streaming

Apache Spark Streaming documentation (http://spark.apache.org/streaming/)

Endnotes

I hope you enjoyed reading this article. With this learning path, you will be able to embark on your journey in the big data industry. I’ve covered most of the main concepts you’ll need to look for a job.