What is big data?
Big data refers to data sets that conventional software tools cannot capture, manage and process within an acceptable period of time. It is a massive, fast-growing and diverse information asset that requires new processing models in order to deliver stronger decision-making, insight, discovery and process-optimization capabilities.
The five V's of big data are volume, velocity, variety, value and veracity.
Why learn big data?
Global data is growing explosively and aggregating on a massive scale. Big data computing technology addresses the collection, storage, computation and analysis of that data. The big data market is estimated to reach USD 80 billion by 2022, with an annual growth rate of 15.37%. The era of big data opens a new period in which society puts the value of data to use. The formulation and implementation of national big data strategies and policies also favors the continued growth of the big data market.
- A wide range of applications: the industry has reached an unprecedented scale, and sectors such as finance, government affairs, transportation, telecommunications, commerce and trade, medical care, education, tourism, industry and agriculture all continue to invest in it.
- High salaries: the average monthly salary in the industry is 22,690 yuan. By salary band: 30K-50K, 29.7%; 20K-30K, 43.2%; 15K-20K, 12.2%; 10K-15K, 2.7%; 6K-8K, 8.1%.
- A large talent gap: the industry posts 154,598 positions per day in total, including 50,916 on zhaopin.com, 55,804 on 51job.com, 10,000+ on hunting recruitment and 37,878 on job club.
- Policy support: the state vigorously promotes its big data development strategy, and the policy environment for the industry is favorable.
Government Work Report of the Second Session of the 12th National People’s Congress: “We will set up a platform for entrepreneurship and innovation in emerging industries to catch up with the advanced in new-generation mobile communications, integrated circuits, big data and advanced manufacturing.”
During the period of the 18th National Congress of the Communist Party of China, the State Council issued the “Action Outline for Promoting the Development of Big Data”, signaling that big data has become a new driving force for economic transformation and development.
According to the report to the 19th National Congress of the Communist Party of China (CPC), “Speed up the building of China into a manufacturing powerhouse, speed up the development of advanced manufacturing, and promote the deep integration of the Internet, big data, artificial intelligence and the real economy.”
Big data learning route outline:
Stage 1: Java language foundation stage
1.1 Overview of the Java programming language
1.1.1 Introduction to Computer Languages and Programming 1.1.2 Introduction to the Java Ecosystem……
1.2 Basic Java Syntax
1.2.1 Branch Loop Statement 1.2.2 If Branch Structure……
1.3 Object-oriented programming
1.3.1 Software Lifecycle 1.3.2 Software Design Principles……
1.4 Object-oriented advanced programming
1.4.1 Package Management and Functions 1.4.2 JavaBean Specifications……
1.5 Common Libraries in Java
1.5.1 Wrapper Class 1.5.2 Packing and Unpacking……
1.6 Enumeration and exception classes
1.6.1 Enumeration Definition and Use 1.6.2 Viewing the underlying implementation through enumeration’s Class file……
1.7 Java data structures and collections framework generics
1.7.1 Data Structure Examples 1.7.2 Definition and Usage of Arrays……
1.8 I/O streams in Java
1.8.1 Common Operations of the File Class 1.8.2 Recursively Traversing folders……
1.9 Multithreading in Java
1.9.1 Relationship between Programs, Processes, and Threads
1.10 Network programming and reflection in Java
1.10.1 Network Communication Protocol 1.10.2 Seven-Layer Network Protocol……
1.11 New Java 8 features
1.11.1 Lambda Expressions 1.11.2 Functional Programming in Java……
1.12 Java foundation enhancement
1.12.1 Introduction and Construction of Tomcat 1.12.2 Software B/S and C/S……
02. Common basic commands
03. System management
04. Enhanced Linux operation
05. Programming Linux shell
06. Hadoop ecosystem
07. Overview of distributed systems
08. Getting Started with Hadoop
09. Hadoop pseudo-distributed mode
10. Hadoop fully distributed mode
11. Basic concepts of HDFS
12. Application development of HDFS
13. I/O flow operations of HDFS
14. NameNode working mechanism
15. DataNode working mechanism
16. Introduction to ZooKeeper
17. ZooKeeper
18. Principles of the HA framework
19. Hadoop HA cluster configuration
20. MapReduce framework principles
21. Shuffle mechanism
22. MapReduce case 1
23. MapReduce case 2
24. Getting started with Hive
25. Hive DDL data definition
26. Hive partitioned tables
27. Hive bucketed tables
28. Hive queries
29. Hive advanced queries: joins and sorting
30. Hive functions
31. Hive DML data management
32. Hive file storage
33. Hive enterprise tuning
34. Hive enterprise tuning II
35. Hive enterprise-level project practice
36. Flume in detail
37. Sqoop in detail
38. HBase concepts
39. HBase operations
40. HBase integration
41. HBase in practice and optimization
Stage 3: Distributed computing framework
3.1 Scala
3.1.1 Installing IDEA And Configuring Environment Variables 3.1.2 Maven Local Library Configuration 3.1.3 JDK Environment Variables 3.1.4 IDEA Version Configuration……
3.2 Spark Core
3.2.1 Big Data Architecture 3.2.2 Architecture 3.2.3 Spark Cluster 3.2.4 Spark Cluster Configuration……
3.3 Spark SQL
3.3.1 History of Spark SQL 3.3.2 Principle of Spark SQL 3.3.3 DataFrame Overview 3.3.4 Method of Creating a DataFrame……
3.4 Spark Streaming
3.4.1 Spark Streaming Overview 3.4.2 Principles of Spark Streaming 3.4.3 Comparison between Spark Streaming and Storm 3.4.4 Concept of DStream……
3.5 Kafka
3.5.1 Basic Concepts of Kafka 3.5.2 Development History of Kafka 3.5.3 Application Background of Kafka 3.5.4 Basic JMS……
3.6 ElasticSearch
3.6.1 Introduction to Full-text Search Technology 3.6.2 Getting Started with ES: Installation and Configuration 3.6.3 Installing ES Plug-ins 3.6.4 Basic Operations of ES……
3.7 Logstash
3.7.1 Logstash Overview 3.7.2 Input Component 3.7.3 Filter Component 3.7.4 Output Component……
3.8 Kibana
3.8.1 Kibana Introduction 3.8.2 Kibana Environment Preparation 3.8.3 Kibana Installation 3.8.4 Kibana Demo……
3.9 Redis
3.9.1 What is NoSQL 3.9.2 Classification of NoSQL Databases 3.9.3 Introduction to Redis 3.9.4 History of Redis……
4.1 Internet finance: advertising
Project introduction: Build an advertising platform and run an advertising business that attracts potential customers and promotes products. The work includes launching a microservice platform, a bidding module, customer group profiling and personalized product recommendation.
4.2 E-commerce Platform
Project description: Embedded services, user segmentation and profiling, establishment of a credit system, and online activities.
4.3 Shared Bikes
Project introduction: Derive travel patterns from users' movement trajectories, and dispatch bikes dynamically according to the travel patterns of user groups and regional conditions.
4.4 Industrial big data
Project description: State Grid _ provincial power transmission/transformation monitoring project. It monitors the sensing equipment on the lines, keeps the equipment safe, reduces the cost of failures, dynamically monitors the working condition of lines and secondary substation equipment, and automates alarms.
4.5 Transportation
Project introduction: Guizhou transportation bureau offline/real-time monitoring project. Real-time data collected at traffic checkpoints is used to monitor traffic and accident conditions dynamically, avoid congestion and accidents, measure speed accurately, detect cloned license plates, provide the best travel plans, forecast congestion coefficients and plan optimal routes at all levels.
4.6 Tourism
Project description: Anshun Smart Tourism integrates all kinds of tourism-related application systems and information resources to achieve information sharing and cooperation with public security, transportation, industry and commerce and other related fields, and to jointly build a healthy tourism cloud ecosystem.
4.7 Healthcare
Project introduction: At a municipal people's hospital, disease prevalence keeps rising as the population ages. The project adds a big data platform that collects medical data to improve diagnostic accuracy, prevent certain diseases and monitor rehabilitation progress, easing the difficulty of seeing a doctor and reducing the incidence of disease.
Stage 5: Big data analysis
5.1 Fundamentals of data analysis
5.1.1 Introduction to AI, Machine Learning and Deep Learning 5.1.2 Data Science……
5.2 Preparing the Working Environment
5.2.1 Common Python Techniques for Data Analysis 5.2.2 Python String Operations……
5.3 Concepts and criteria of data visualization
5.3.1 Python Matplotlib library 5.3.2 Matplotlib Architecture……
5.4 Python machine learning
5.4.1 Basic Concepts of Machine Learning 5.4.2 Classification algorithms and Regression Algorithms……
5.5 Selecting a Model
5.5.1 Training Model 5.5.2 Test Model……
5.6 Tree Building Process
5.6.1 Important Parameters of the Decision Tree in sklearn 5.6.2 Obtaining Feature Importance Scores from the Decision Tree……
5.7 Grid Search
5.7.1 10-fold cross-validation 5.7.2 Model evaluation indicators and Model selection……
5.8 The three types of naive Bayes algorithms in sklearn
5.8.1 Bernoulli Model 5.8.2 Multinomial Model……
5.9 Color Features
5.9.1 Texture Features 5.9.2 Shape Features……
5.10 Handwritten digit recognition
5.10.1 Face Recognition 5.10.2 Object Recognition……
5.11 Basic composition of the text
5.11.1 Common Python Text Processing Functions (String Operations) 5.11.2 Regular Expressions……
5.12 Basic composition of the text
5.12.1 Topic Model and LDA 5.12.2 Latent Dirichlet Allocation (LDA)……
Big data video tutorials:
Tutorial 1: Introduction to Big Data and Career Development (2019)
This tutorial introduces the basic concepts and ecosystem of Hadoop in big data, as well as its applications in the enterprise. It ends by building a Hadoop environment and showing how Hadoop performs analysis and statistics.
2019 Introduction to Big Data and Career Development: pan.baidu.com/s/17rJ2iBRD…
Tutorial 2: Hadoop ecosystem video tutorial
This tutorial covers the Hadoop ecosystem technologies, including Linux, HDFS, MapReduce, ZooKeeper, Hive and Sqoop. It teaches by comparison, from the basics to advanced topics, so that you can handle the Hadoop ecosystem with ease.
Learn Hadoop basics in 5 days: pan.baidu.com/s/1gMrPQKKt…
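MapReduce, one of the technologies this tutorial covers, splits a job into a map step, a shuffle step and a reduce step. The sketch below imitates those three steps for a word count in plain Python. It runs in a single process and does not use Hadoop; the function names and the sample lines are made up for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: two short lines of text.
counts = word_count(["big data big value", "data velocity"])
print(counts)
```

In a real cluster the map and reduce functions run on different machines and the framework performs the shuffle; here everything happens in memory only to show the data flow.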
Tutorial 3: New Hive tutorial
In enterprises, offline data is mainly derived from existing files with fixed formats or structured data accumulated in databases. How to efficiently manage data and conduct basic statistical analysis is a skill that every big data developer must master.
2019 new Hive introductory tutorial: pan.baidu.com/s/1iVFTXVm0…
Tutorial 4: Introduction to Hadoop (2019)
This introduction covers the Hadoop ecosystem technologies, including Linux, HDFS, MapReduce, ZooKeeper, Hive and Sqoop.
2019 latest Hadoop tutorial: pan.baidu.com/s/1NfMUR4zT…
Tutorial 5: Hive course in detail
In enterprises, offline data comes mainly from existing files with fixed formats or from structured data accumulated in databases. Managing that data efficiently and running basic statistical analysis on it is a skill every big data developer must master. Building on a Hadoop cluster, this tutorial systematically covers the role of Hive, its installation and deployment, common built-in functions, an introduction to UDFs and the components for importing and exporting data, and explains them with enterprise scenarios.
Hive mandatory tutorial: pan.baidu.com/s/1I-RsrZPi…
Tutorial 6: Statistical machine learning algorithms in detail
Decision tree is a basic classification and regression method. Learning usually involves three steps: feature selection, decision tree generation, and decision tree pruning.
2019 big data statistical machine learning algorithms: pan.baidu.com/s/1aFPKBgCc…
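Of the three decision tree steps named above, feature selection is the easiest to show in a few lines. The sketch below picks the split feature with the highest information gain, which is one common criterion and an assumption here, since the tutorial description does not say which criterion it uses. The toy weather data and all function names are made up for illustration; tree generation and pruning are not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting the data on one feature."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


def best_feature(rows, labels):
    """Feature selection: choose the feature with the largest information gain."""
    return max(rows[0].keys(), key=lambda feature: information_gain(rows, labels, feature))


# Hypothetical toy data set: the label depends only on "outlook".
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]
chosen = best_feature(rows, labels)
print(chosen)
```

Tree generation would apply the same selection recursively to each subset of rows, and pruning would then remove branches that do not improve accuracy on held-out data.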
Tutorial 7: Spark basics and source code analysis
Apache Spark is one of the most widely used in-memory computing frameworks in the big data industry. The course focuses on RDD features and applications, which help you understand Spark's task submission process and caching mechanism.
A full set of Spark video tutorials: pan.baidu.com/s/1235kpqE4…
Tutorial 8: Playing with data visualization
Data visualization technology improves the readability of data by presenting it as charts. It is widely used across platforms and in business intelligence to make results easier to interpret and share.
2019 latest HBase quick start (series): pan.baidu.com/s/1RbjmaBDC…
Tutorial 9: Logistic regression tutorial for machine learning
This tutorial covers classification (logistic regression) and regression (linear regression). As you build your process with logistic regression or linear regression (the simpler the better), you will become familiar with some of the core concepts of machine learning. You will also learn how to prepare your data and what the challenges are, such as filling in missing values and selecting features.
Big data tutorial, logistic regression in machine learning: pan.baidu.com/s/1ElzIP6np…
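The data preparation and modeling steps mentioned above can be sketched in plain Python. The example fills a missing value with the mean of the observed values (one simple strategy among several) and then fits a one-feature logistic regression by gradient descent. The data, the learning rate and the function names are assumptions made for illustration, not material from the tutorial.

```python
import math


def fill_missing(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit weight w and bias b by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Classify as 1 when the predicted probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


# Hypothetical data: hours studied (one value missing) and pass/fail labels.
hours = fill_missing([1.0, 2.0, None, 4.0, 5.0, 6.0])
passed = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(hours, passed)
print(predict(w, b, 1.0), predict(w, b, 6.0))
```

A library such as scikit-learn would normally handle both steps; writing them out by hand only makes the gradient update and the mean-filling step visible.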
Tutorial 10: Introduction to machine learning
This course introduces supervised, semi-supervised and unsupervised learning in machine learning, and explains in detail how data plus algorithms make up AI applications.
Big data tutorial, linear regression in machine learning: pan.baidu.com/s/1i3gpkVrr…
Tutorial 11: Advanced big data tutorial: SVM models
The classical support vector machine algorithm only handles binary classification, but practical data mining applications usually require multi-class classification. One solution is to combine multiple binary support vector machines, using a one-vs-rest (one-to-many) scheme, a one-vs-one (one-to-one) scheme or an SVM decision tree. Another is to construct a combination of multiple classifiers: the idea is to offset the inherent shortcomings of SVM with the strengths of other algorithms and so improve classification accuracy on multi-class problems. For example, combining SVM with rough set theory yields a combined multi-class classifier with complementary strengths.
Big data tutorial, SVM models in machine learning: pan.baidu.com/s/1GmOy-iU2…
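The one-vs-rest and one-vs-one combination schemes described above can be sketched independently of the binary learner. To stay within the standard library, the example below uses a simple nearest-centroid scorer as a stand-in for a binary SVM, so it is not an SVM implementation; only the way binary models are combined is the point. All names and the toy points are made up for illustration.

```python
from itertools import combinations


def train_binary(points, signs):
    """Stand-in binary learner (NOT an SVM): a nearest-centroid linear scorer.

    Returns score(x), which is positive when x is closer to the centroid
    of the +1 class than to the centroid of the -1 class.
    """
    def centroid(sign):
        members = [p for p, s in zip(points, signs) if s == sign]
        return [sum(coords) / len(members) for coords in zip(*members)]

    pos, neg = centroid(+1), centroid(-1)

    def score(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, neg))
        return d_neg - d_pos

    return score


def train_one_vs_rest(points, labels):
    """One binary model per class: that class against all the others."""
    return {
        c: train_binary(points, [+1 if label == c else -1 for label in labels])
        for c in sorted(set(labels))
    }


def predict_one_vs_rest(models, x):
    """Pick the class whose model gives the highest score."""
    return max(models, key=lambda c: models[c](x))


def train_one_vs_one(points, labels):
    """One binary model per pair of classes, trained on those two classes only."""
    models = {}
    for a, b in combinations(sorted(set(labels)), 2):
        subset = [(p, label) for p, label in zip(points, labels) if label in (a, b)]
        pts = [p for p, _ in subset]
        signs = [+1 if label == a else -1 for _, label in subset]
        models[(a, b)] = train_binary(pts, signs)
    return models


def predict_one_vs_one(models, x):
    """Each pairwise model votes; the class with the most votes wins."""
    votes = {}
    for (a, b), score in models.items():
        winner = a if score(x) > 0 else b
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)


# Hypothetical toy data: three well-separated classes in the plane.
points = [(0, 0), (0, 1), (5, 0), (5, 1), (0, 5), (1, 5)]
labels = ["a", "a", "b", "b", "c", "c"]
ovr = train_one_vs_rest(points, labels)
ovo = train_one_vs_one(points, labels)
print(predict_one_vs_rest(ovr, (4.8, 0.4)), predict_one_vs_one(ovo, (0.5, 4.5)))
```

With k classes, one-vs-rest trains k binary models on all the data, while one-vs-one trains k(k-1)/2 models on smaller subsets; swapping `train_binary` for a real binary SVM leaves the combination logic unchanged.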
Tutorial 12: Linear regression case study: the multivariate relationship between advertising and media
This course covers the industrial application of regression models, the important method of hyperparameter tuning, loading data sets to obtain the raw data, and a walkthrough of model selection and the modeling process.
Big data tutorial, linear regression in machine learning: pan.baidu.com/s/1i3gpkVrr…
Tutorial 13: Quick start with Spark
Apache Spark is one of the most widely used in-memory computing frameworks in the big data industry. The course focuses on RDD features and applications, which help you understand Spark's task submission process and caching mechanism.
2019 big data Spark quick start (series): pan.baidu.com/s/1z_et0uq8…
Tutorial 14: Quick start with the Spark GraphX series
Spark GraphX is a distributed graph processing framework. In social networks, users are linked by complex relationships, such as the friends and followers of WeChat, QQ and Weibo users, and these relationships form a huge graph. Such a graph cannot be processed on a single machine and requires a distributed graph processing framework like Spark GraphX.
2019 Spark GraphX series: pan.baidu.com/s/1_9PDPimg…
Tutorial 15: Lambda expressions in 2 days
This video series covers a new feature of Java 8: Lambda expressions.
2019 big data, Lambda expressions in 2 days: pan.baidu.com/s/180n1SMnp…
Tutorial 16: Quick start with Scala
This video series is a comprehensive introduction to Scala, from the basics to advanced topics. It is mainly aimed at learners who already have a foundation in another programming language, such as Java, which makes Scala easier to pick up.
Big data Scala quick start (series): pan.baidu.com/s/1_V0E5DZY…
Tutorial 17: Learn more about Scala
This video series is a comprehensive introduction to Scala, from the basics to advanced topics. It is mainly aimed at learners who already have a foundation in another programming language, such as Java, which makes Scala easier to pick up.
A full set of Scala video tutorials: pan.baidu.com/s/18AUDdTUS…
Tutorial 18: AI essentials: understanding machine learning through mathematics
From the perspective of deep learning engineering practice, this tutorial helps engineers review and learn the calculus used in deep learning.
Big data AI essentials, understanding machine learning through mathematics: pan.baidu.com/s/1Q_fqIE5R…
Tutorial 19: Introduction to Java multithreading (2019)
Java provides built-in support for multithreaded programming. A thread is a single sequential flow of control within a process. A process can run multiple threads concurrently, each performing a different task.
2019 Java multithreading tutorial: pan.baidu.com/s/1kHUkh7Zq…
Tutorial 20: Quick start with Flink for big data (2019)
Flink is an open-source distributed platform for stream and batch processing. Its core is a streaming dataflow engine, and batch processing is implemented on top of that engine. Spark takes the opposite approach: its core is a batch engine, and streaming is implemented on top of it.
Big data quick start with Flink (series): pan.baidu.com/s/1g3ubsn8R…
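The contrast drawn above between a streaming-first engine and a batch-first engine can be pictured with two small functions. This is a conceptual sketch in plain Python and uses neither the Flink nor the Spark API: the first function handles every event on arrival, and the second buffers events into small batches before processing them. The function names, the batch size and the sample events are made up for illustration.

```python
def process_per_event(events, handle):
    """Streaming-first model: handle each event as soon as it arrives."""
    return [handle(event) for event in events]


def process_micro_batches(events, batch_size, handle_batch):
    """Batch-first model: group events into small batches, then process each batch."""
    results = []
    for start in range(0, len(events), batch_size):
        batch = events[start:start + batch_size]
        results.append(handle_batch(batch))
    return results


# Hypothetical stream of click counts.
clicks = [3, 1, 4, 1, 5, 9, 2]

# One output per event.
per_event = process_per_event(clicks, lambda value: value * 2)

# One output per batch of three events; the last batch is smaller.
per_batch = process_micro_batches(clicks, 3, sum)

print(per_event)
print(per_batch)
```

The per-event model produces a result for each record with low latency, while the micro-batch model produces one result per group of records and only after the group is complete.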
Tutorial 21: Azkaban scheduling framework, the latest crash course for beginners
This course is intended for anyone who knows or has systematically studied the components of the Hadoop ecosystem. If you have no background in big data, you may follow the concepts, but many of the operations will be hard to relate to.
2019 latest beginner crash course on the Azkaban scheduling framework [Qianfeng Big Data]: pan.baidu.com/s/1RVLh8UVL…
Tutorial 22: Introduction to Java design patterns (2019)
Design patterns represent best practices and are widely adopted by experienced object-oriented software developers.
2019 Java design patterns [Qianfeng Big Data]: pan.baidu.com/s/1FqdYFOOA…
Tutorial 23: Stream operations on collections, a new Java 8 feature
This course introduces stream operations on collections: data preparation and the use of the collect, reduce, max and min, matching, count and forEach methods, among others.
2019 Java 8 new features, stream operations on collections (series): pan.baidu.com/s/1ttcPxagR…
Tutorial 24: Linear regression, a complete walkthrough
This course explains the derivation of parameter estimation, which in industrial practice should be tied to the business problem, and covers the meaning and derivation of the hypothesis function, the loss function and the optimization function.
Big data tutorial, linear regression in machine learning: pan.baidu.com/s/1i3gpkVrr…
Tutorial 25: ElasticSearch quick start tutorial
Full-text search is in great demand. The open-source solution Elasticsearch (Elastic) is a great tool for it and is currently the first choice among full-text search engines.
2019 latest ElasticSearch quick start tutorial: pan.baidu.com/s/182RTgdJN…
Tutorial 26: Latest HBase quick start (2019)
HBase is a distributed, column-oriented open-source database built on HDFS. It is a distributed storage system for structured data, and it can be used to build large-scale structured storage clusters on inexpensive PC servers. It is a basic framework that every big data developer should master.
2019 latest HBase quick start (series): pan.baidu.com/s/1RbjmaBDC…
Tutorial 27: Oozie scheduling framework, the latest quick start (2019)
Oozie is a workflow-based task scheduling tool in the big data ecosystem and a common tool for big data engineers. In this course, you will learn the principles, installation and configuration of Oozie, how to schedule Shell scripts with Oozie, how to schedule multiple Shell scripts according to their logical order, how to schedule MapReduce jobs directly, and how to schedule multiple jobs according to their logical order.
2019 latest beginner crash course on the Oozie scheduling framework: pan.baidu.com/s/1Wmh41Q4m…
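Scheduling several jobs according to their logical order comes down to running each job only after the jobs it depends on have finished. The sketch below shows that idea with Python's standard `graphlib` module (Python 3.9 or later). It does not use Oozie or its workflow definition format; the job names and the `run_job` callback are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "import_logs": set(),
    "clean_logs": {"import_logs"},
    "daily_report": {"clean_logs"},
    "export_report": {"daily_report"},
    "archive_logs": {"clean_logs"},
}


def execution_order(jobs):
    """Return an order in which every job comes after its dependencies."""
    return list(TopologicalSorter(jobs).static_order())


def run_workflow(jobs, run_job):
    """Run the jobs one by one in dependency order and return that order."""
    finished = []
    for job in execution_order(jobs):
        run_job(job)  # in a real scheduler this would launch a script or a MapReduce job
        finished.append(job)
    return finished


order = run_workflow(workflow, lambda job: None)
print(order)
```

A workflow scheduler adds what this sketch leaves out, such as time-based triggers, retries and running independent jobs in parallel.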
Tutorial 28: Latest Flume quick start tutorial (2019)
Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating and transferring massive volumes of logs. Its streaming architecture makes it flexible and simple. It is one of the frameworks every big data development engineer must know, and it lends itself to code development and maintenance.
2019 latest Flume quick start tutorial: pan.baidu.com/s/1gLowi7EZ…
Tutorial 29: Spark Livy, from getting started to mastery
Livy is Cloudera's solution for connecting to and managing Spark through REST.
Big data tutorial, Spark Livy from getting started to mastery: pan.baidu.com/s/1h6oU3gLW…