Which basic skill-tree points should you unlock for a 2021 algorithm position?

This article was written in April 2020. It has about 7000 words and is expected to take 18 minutes to read. Sit down and relax.

0 – Preface

Note: This article assumes that traditional algorithms are a basic skill for all engineers, so the algorithms mentioned below mainly refer to machine learning and deep learning algorithms.

Although my job search still focuses on back-end development, I want to graduate from my current master’s program in artificial intelligence and keep the path to algorithm engineering open, so I need a simple plan for the knowledge and skills that the algorithm side requires.

With the goal of getting an offer for a 2021 algorithm position, this article starts from the 2020 interview write-ups for algorithm positions and analyzes which points of the skill tree need to be unlocked.

1 – Differences between algorithm positions

The first thing to note is that technical jobs in different fields are differentiated according to their exposure to research and business.

Following Howard’s answer to the Zhihu question “What is the difference between academic research and industrial R&D?” [1], algorithm positions can be roughly divided into:

Business-oriented: development in most cases;
Technology-oriented: both research and development;
Research-oriented: research in most cases.

In recent years this split has become especially visible among algorithm positions. A few years ago most of these algorithms were still at the research stage; only after some mature machine learning libraries were open-sourced did industry discover that the algorithms could generate real value. As a result, algorithm positions have grown explosively in recent years, especially those for business-oriented algorithm engineers (because most companies still want these algorithms to deliver more business value faster).

Of course, it is already a little late to ride that wave. The early days, when knowing how to use a framework and tune parameters was enough to get an offer, are over. Algorithm hiring now looks like the normal state after a boom: it requires not only a solid theoretical foundation but also rich project experience.

Personally, I prefer business-oriented algorithm engineering, so this article is written with that goal in mind. If you are interested in the details of the three kinds of positions, you can read the article by Xi Xiaoyao on the official account, “Refuse to follow the trend, talk about the differences and experience of several algorithm posts” [2].

2 – Impressions after reading the 2020 interview write-ups

To get a better understanding of what companies in various industries value (and probably which technologies they use), I chose to look for answers directly in interview write-ups for algorithm positions, mainly those posted on Niuke [3].

The points found can be simply divided into the following categories:

Pure mathematics related
Machine learning
Deep learning
NLP related
Recommendation algorithm

Traditional algorithms are not included (LeetCode and books such as Jianzhi Offer are more than enough for those). The number of interview write-ups I could read is limited, and so is the content each one provides, so the lists below cannot claim to be complete. They should still capture a large share of the frequent keywords (if it were really necessary, one could write a crawler plus keyword extraction).

Because the items do not depend much on one another, they are listed in the order in which they appeared.

2.1 – Pure mathematics related

Event probability calculation
Dirichlet distribution
Maximum likelihood estimation and Bayesian estimation
…

2.2 – Machine learning

Data cleaning and data smoothing
Commonly used dimensionality reduction methods, PCA
LDA(Linear Discriminant Analysis)
Decision tree, ID3, C4.5, CART
XGBoost, LightGBM, Random Forest, Adaboost, GBDT
SVM principle, duality problem
L1 and L2 regularization
Overfitting
Feature selection methods
LR (Logistic Regression) and SVM, linear SVM and LR
Clustering methods, K-means, hierarchical clustering
Model evaluation metrics and ROC
Principle of Naive Bayes
scikit-learn, NumPy
Bagging and boosting
Ensemble learning
Classification methods
Optimization when putting a model online
Continuous values, discrete values, the benefits of discretizing continuous features
Regression methods, linear regression, Ridge regression, Lasso regression, LR
The relation between information gain, information gain ratio and Gini index
The principle and significance of one-hot encoding
Optimizers (gradient descent, …)
Statistical learning algorithms
…
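The relation between information gain, information gain ratio and the Gini index is easier to see with numbers. The following is only a minimal sketch: the helper names and the toy weather data are hypothetical and chosen for illustration. ID3 splits on information gain, C4.5 on gain ratio, and CART on the Gini index.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def gini(values):
    """Gini impurity of a list of discrete values."""
    n = len(values)
    return 1.0 - sum((c / n) ** 2 for c in Counter(values).values())

def split_by(feature, labels):
    """Group the labels by the value of the feature."""
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    return list(groups.values())

def information_gain(feature, labels):
    """Entropy before the split minus weighted entropy after it (ID3)."""
    n = len(labels)
    after = sum(len(g) / n * entropy(g) for g in split_by(feature, labels))
    return entropy(labels) - after

def gain_ratio(feature, labels):
    """Information gain normalized by the entropy of the feature itself (C4.5)."""
    split_info = entropy(feature)
    return information_gain(feature, labels) / split_info if split_info > 0 else 0.0

def gini_index(feature, labels):
    """Weighted Gini impurity after the split (CART); lower is better."""
    n = len(labels)
    return sum(len(g) / n * gini(g) for g in split_by(feature, labels))

# Hypothetical toy data: the feature separates the labels perfectly.
weather = ["sunny", "sunny", "rain", "rain"]
play = ["no", "no", "yes", "yes"]

ig = information_gain(weather, play)
gr = gain_ratio(weather, play)
gi = gini_index(weather, play)
print(ig, gr, gi)
```

Gain ratio exists because information gain favors features with many distinct values; dividing by the entropy of the feature penalizes such splits.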

2.3 – Deep learning

Feedforward Neural Network
Back Propagation
Convolution, pooling, fully connected layers
CNN (convolution), RNN (vanishing gradient problem), LSTM, GRU
GAN
Object detection, R-CNN, Fast R-CNN, Faster R-CNN, YOLO, SSD…
Softmax, Sigmoid
Embedding
Attention mechanism
GCN (Graph Convolutional Network)
Optimizers (Gradient Descent, BGD, SGD, Adam, Adagrad…)
TensorFlow, Keras, PyTorch
Activation functions (Sigmoid, Softmax, ReLU…)
MobileNet
Dropout
CPU and GPU acceleration
…

2.4 – NLP related

Keyword extraction, TF-IDF
Named entities
LDA(Latent Dirichlet Allocation)
word2vec
BERT, Transformer
…

2.5 – Recommendation algorithm

Content-based recommendations
Collaborative filtering recommendation, UserCF, and ItemCF
How to deal with sparse matrices
…
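As an illustration of the ItemCF keyword above, here is a minimal sketch of item-based collaborative filtering with cosine similarity. The rating matrix and the `recommend` helper are hypothetical; a real system would use a sparse matrix format and many more users and items.

```python
import numpy as np

# Hypothetical user-item ratings: rows are users, columns are items,
# 0 means the user has not interacted with the item.
ratings = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Cosine similarity between item columns.
norms = np.linalg.norm(ratings, axis=0)
item_sim = (ratings.T @ ratings) / np.outer(norms, norms)
np.fill_diagonal(item_sim, 0.0)  # an item should not recommend itself

def recommend(user, k=1):
    """Return the indices of the top-k unseen items for the given user."""
    scores = ratings[user] @ item_sim      # weight similarities by the user's ratings
    scores[ratings[user] > 0] = -np.inf    # never recommend items already seen
    return np.argsort(scores)[::-1][:k]

top = recommend(0)
print(top)
```

UserCF follows the same pattern with the roles swapped: compute similarity between user rows instead of item columns.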

2.6 – Summary of experience

In most algorithm interviews, the interviewer’s questions revolve around the projects on the resume. Many of the points above show up in those projects, and the interviewer may dig deeper with follow-up questions such as:

Why should the original problem of SVM be transformed into a dual problem? Why is the dual problem easy to solve? Why can’t the original problem be solved?
In K-means, I wanted 100 clusters but ended up with only 98. Why?
What is the difference between LR and SVM?
PCA produces a first principal component and a second principal component. How is the first principal component determined, and why does it come first?
Bagging or boosting: which one reduces variance? Why?
…
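The PCA question can be checked numerically. The sketch below uses hypothetical synthetic data and plain NumPy (not any particular library’s PCA class): the first principal component is the eigenvector of the covariance matrix with the largest eigenvalue, and that eigenvalue equals the variance of the data projected onto it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D data stretched along the x-axis (std 3.0 vs 0.5).
X = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
Xc = X - X.mean(axis=0)                      # PCA works on centered data

cov = np.cov(Xc, rowvar=False)               # 2x2 sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues

order = np.argsort(eigvals)[::-1]            # sort by explained variance, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pc1 = eigvecs[:, 0]                          # first principal component
projected_var = np.var(Xc @ pc1, ddof=1)     # variance along pc1

print("eigenvalues:", eigvals)
print("variance along pc1:", projected_var)
```

The first component comes first simply because no other direction captures more variance; the second is the best direction among those orthogonal to the first.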

Therefore, while learning we should know not only the how but also the why behind each point. One reason is to be able to answer such questions in an interview; the other is to better understand the tools in hand.

3 – The basic skill tree for algorithm positions

The points collected from the interview write-ups are still somewhat messy, so I referred to several algorithm learning-route posts to categorize and sort them. My main reference is the Heart of the Machine article “Complete AI learning route, the most detailed collection of Chinese and English resources” [4]. I also referred to a summary image of knowledge points that I found I no longer remember when or where; if anyone knows the source, please let me know in the comments. A link to the image will be attached at the end of the article (the image is too big to read comfortably inline).

3.1 – Fundamentals of Mathematics

Higher mathematics
Linear algebra
Probability theory and mathematical statistics

This does not mean you have to fully digest the three textbooks above before moving on. In fact, many methods in artificial intelligence use only a small part of them. There are books and documents that specifically summarize the mathematics needed for machine learning [4]; you can find them in the Heart of the Machine official account article (I also put a link at the end of this article). Readers with a mathematical foundation can use them for review, while those without one are still advised to go back to the textbooks for the parts that are hard to understand.

Some roadmaps also add convex optimization to the mathematical foundations. Personally, I found convex optimization the most tedious course to study on its own, since it consists mostly of definitions and proofs of theoretical results. I therefore suggest studying a specific topic in depth only when you run into it later.

3.2 – Programming basics

For numerical analysis and artificial intelligence, Python and its libraries are convenient and sufficient for getting started. Version 3.5 or 3.6 is currently recommended.

Anaconda (or Miniconda) is a handy tool for managing Python virtual environments and packages. It can be tricky at times (for example, when an algorithm framework has unusual environment requirements), but it is sufficient for most purposes in the initial stage.

PyCharm is the Python IDE that most people use. If you do not mind some tinkering, you can use VS Code instead.

3.3 – Data processing/analysis/mining

In practice, many machine learning and deep learning methods only work well on high-quality data, that is, data with enough information and sufficiently little noise and erroneous information. Real data collection rarely produces data that clean, so some preliminary data processing (collection, cleaning, sampling, denoising, dimensionality reduction, etc.) is needed.

In addition to the basic Python language, you need to master the basic data processing libraries, such as NumPy, pandas, and Matplotlib.
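As a rough idea of what such preprocessing looks like, here is a minimal pandas sketch. The table, its column names and the plausible age range of 0 to 100 are all hypothetical assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a duplicate row, missing values and an outlier.
raw = pd.DataFrame({
    "age": [25, 25, np.nan, 40, 200],
    "city": ["A", "A", "B", None, "B"],
})

clean = raw.drop_duplicates().copy()                          # remove exact duplicate rows
clean["age"] = clean["age"].fillna(clean["age"].median())     # impute missing ages
clean["city"] = clean["city"].fillna("unknown")               # label missing categories
clean["age"] = clean["age"].clip(lower=0, upper=100)          # cap implausible values

print(clean)
```

Whether to impute, drop or cap a value depends on the data and the downstream model; the choices above are only one reasonable default.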

This book is a hands-on guide to efficiently solving a wide variety of data analysis problems using Python libraries including NumPy, pandas, Matplotlib, and IPython. If you work through all of its code once, you will be able to solve most data analysis problems.

In addition, there are [4]:

Feature engineering in data mining
Some data mining projects

Data mining can help us preliminarily understand some relationships among data features, and add or delete some features to help subsequent learning. Data mining can be introduced systematically through introductory books or courses, most of which are not very sophisticated.

3.4 – Traditional machine learning

3.4.1 – Introduction

If you start learning mathematical and theoretical formulas at the beginning without trying to figure out how to use them, it’s hard to understand why you need them at all.

Before studying a machine learning algorithm, first get a rough idea of what it does, and then ask “how is this implemented?” Exploring the theory with that question in mind makes the mathematics and formulas easier to understand.

Here is a recommended site: the AI learning library for product managers.

An encyclopedia of the artificial intelligence field, well suited to beginners and newcomers to AI. The vast majority of AI materials on the market are “science and engineering materials” that pursue rigor. The world is not short of rigorous, accurate and obscure AI materials, but it is short of easy-to-understand content. We want to get rid of complicated formulas, complicated logic and complicated terminology, and build an AI knowledge base that liberal arts students can understand.

3.4.2 – Theory

The theoretical parts of machine learning are:

Problems addressed by machine learning
- Classification
  - Decision tree
  - K-nearest neighbors
  - SVM
  - Logistic regression
  - Bayes
  - Random forest
  - …
- Regression
  - Linear regression
  - Least squares regression
  - Local regression
  - Neural network
  - …
- Clustering
  - K-means
  - EM
  - …
- Dimensionality reduction
  - Principal component analysis (PCA)
  - Linear discriminant analysis (LDA)
  - …
- …
Regression
- Linear regression
- Logistic regression
- …
Decision trees and random forests
- ID3
- C4.5
- CART
- Regression tree
- Random forest
- …
SVM
- Linearly separable
- Linearly non-separable
Maximum entropy and EM algorithm
Multi-algorithm combination and model optimization
- Model selection
- Model state analysis
- Model optimization
- Model ensembling
Bayesian network
Hidden Markov model (HMM)
- Markov chain
- Hidden Markov model
Topic model (LDA)
Ensemble learning
…

Inner monologue: this is basically the table of contents of some textbook.

Recommended courses [4]:

Machine Learning – Andrew Ng (Coursera)
Machine Learning – Andrew Ng, source: NetEase Cloud Classroom, a translated repost of the course above.
CS229 – Andrew Ng, Stanford, source: NetEase Cloud Classroom, similar to Machine Learning but with more mathematical requirements and formula derivations.
Machine Learning Foundations – Lin Xuantian (Hsuan-Tien Lin), National Taiwan University, source: Bilibili. The teacher is humorous, the course focuses on the theoretical knowledge of machine learning, and the supporting book is Learning From Data.

Recommended books [4] :

The “watermelon book” Machine Learning – Zhou Zhihua, mainly covers the core mathematical theory and algorithms of machine learning.
Statistical Learning Methods – Li Hang, a more complete and specialized treatment of machine learning theory, very good for building a solid theoretical foundation.
Pattern Recognition and Machine Learning (PRML for short) – Christopher Bishop, director of the Microsoft Research Cambridge laboratory. It is rated 9.5 on Douban and has been made freely available by Microsoft at www.microsoft.com/en-us/resea…

3.4.3 – Practice

After the initial introduction and the theory, you can start practicing in order to actually learn to apply the algorithms.

First, some common tools that extend your capabilities (so you don’t reinvent the wheel):

scikit-learn: a powerful third-party Python machine learning library that covers everything from data preprocessing to model training. Using scikit-learn in practice greatly reduces the time spent writing code and the amount of code written, freeing up more energy to analyze data distributions, adjust models, and tune hyperparameters.
XGBoost: a massively parallel boosted tree tool, described as the fastest and best open source boosted tree toolkit available and more than 10 times faster than common toolkits. In data science, it is used by a large number of Kaggle competitors for data mining competitions, including more than two Kaggle competition winners. At industrial scale, the distributed version of XGBoost is widely portable, running on YARN, MPI, Sun Grid Engine, and many other platforms. It also retains the various optimizations of the single-machine parallel version, making it a good solution to industrial-scale problems.
LightGBM: LightGBM (Light Gradient Boosting Machine) is also a distributed gradient boosting framework based on decision tree algorithms. To meet the industry’s need for shorter model computation time, LightGBM is designed around two main ideas: 1. reduce memory usage, so that a single machine can use as much data as possible without sacrificing speed; 2. reduce the cost of communication, improving the efficiency of multi-machine parallelism and achieving linear speedup. LightGBM aims to be a fast, efficient, low-memory, high-accuracy data science tool that supports parallel and large-scale data processing.
…
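To show how little code a typical scikit-learn workflow needs, here is a minimal sketch covering preprocessing, training and evaluation. The choice of the built-in iris dataset, logistic regression and the 70/30 split are assumptions made for illustration, not recommendations from the referenced articles.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out 30% of the data for evaluation, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Chain preprocessing and model so the scaler is fit on training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Swapping `LogisticRegression` for another estimator leaves the rest of the code unchanged, which is much of the library’s appeal.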

Then you can go to Kaggle and measure yourself against the experts. If you have the ability and the ideas, you can also create a project of your own.

If you have a deeper understanding of some algorithms, you can even try to reproduce them in your own code.

Recommended books:

Hands-On Machine Learning with Scikit-Learn and TensorFlow: this book is divided into two parts. The first part introduces basic machine learning algorithms, and each chapter comes with a scikit-learn practical project. The second part introduces neural networks and deep learning, and each chapter comes with a TensorFlow practical project. If you only need machine learning, read part one.

3.5 – Deep learning

3.5.1 – Introduction

The AI learning library for product managers is also recommended here.

3.5.2 – Theory

The theoretical parts of deep learning are roughly [4]:

Basic neural network
- Neurons
- Activation function
- Basic structure: input layer, hidden layer, output layer
- Back propagation algorithm
CNN
- Convolution layer
- Pooling layer
- Fully connected layer
- Typical network structures of CNN (LeNet, AlexNet, VGG, ResNet…)
RNN
- Unidirectional RNN
- Bidirectional RNN
- Deep RNN
- LSTM
- GRU
GAN
…
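The first group of topics (neurons, activation function, layers, back propagation) fits into a few lines of NumPy. The sketch below is an illustration under stated assumptions: one hidden layer, sigmoid activations, a mean squared error loss, and random hypothetical data. It compares the back-propagated gradient with a finite-difference estimate, a common way to check a hand-written backward pass.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    """Input layer -> hidden layer -> output layer."""
    W1, b1, W2, b2 = params
    h = sigmoid(x @ W1 + b1)        # hidden activations
    y_hat = sigmoid(h @ W2 + b2)    # output activations
    return h, y_hat

def loss(params, x, y):
    """Half mean squared error over the batch."""
    _, y_hat = forward(params, x)
    return 0.5 * np.mean(np.sum((y_hat - y) ** 2, axis=1))

def backward(params, x, y):
    """Back propagation: apply the chain rule layer by layer."""
    W1, b1, W2, b2 = params
    h, y_hat = forward(params, x)
    n = x.shape[0]
    delta2 = (y_hat - y) * y_hat * (1 - y_hat) / n   # dLoss/d(pre-activation 2)
    dW2 = h.T @ delta2
    db2 = delta2.sum(axis=0)
    delta1 = (delta2 @ W2.T) * h * (1 - h)           # dLoss/d(pre-activation 1)
    dW1 = x.T @ delta1
    db1 = delta1.sum(axis=0)
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # hypothetical batch: 4 samples, 3 features
y = rng.uniform(size=(4, 1))         # hypothetical targets in [0, 1)
params = [rng.normal(size=(3, 5)), np.zeros(5),
          rng.normal(size=(5, 1)), np.zeros(1)]

grads = backward(params, x, y)

# Finite-difference check on a single weight, W1[0, 0].
eps = 1e-6
params[0][0, 0] += eps
loss_plus = loss(params, x, y)
params[0][0, 0] -= 2 * eps
loss_minus = loss(params, x, y)
params[0][0, 0] += eps               # restore the original weight
numeric = (loss_plus - loss_minus) / (2 * eps)

print("backprop:", grads[0][0, 0], "numeric:", numeric)
```

Frameworks such as TensorFlow and PyTorch compute these gradients automatically, but writing them once by hand makes the later theory much less mysterious.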

You can start broad and then, building on that foundation, choose one direction for in-depth study:

Computer vision (image and video processing, mainly CNN);
Natural language processing (NLP), including text and speech processing, mainly RNN for sequence data;
Generative models (GAN, VAE, etc.).

Recommended courses [4]:

“Deep Learning” – Andrew Ng, source: NetEase Cloud Classroom. The specialization consists of five courses: 01. Neural networks and deep learning; 02. Improving deep neural networks: hyperparameter tuning, regularization and optimization; 03. Structuring machine learning projects; 04. Convolutional neural networks; 05. Sequence models.
fast.ai, source: Bilibili. On a par with Ng’s “Deep Learning”, the biggest characteristic of this course is that it is “top-down” rather than “bottom-up”, which makes it an excellent course for learning deep learning through practice. Chinese subtitles, source: CSDN.
CS230 – Andrew Ng, Stanford, source: Bilibili. Covers the basic models and techniques of deep learning such as CNNs, RNNs, LSTM, Adam, Dropout, BatchNorm and Xavier/He initialization, and touches on areas such as healthcare, autonomous driving, sign language recognition, music generation and natural language processing.

Recommended books [4]:

The open source book Neural Network and Deep Learning by Qiu Xipeng of Fudan University. Qiu spent three years organizing this deep learning knowledge system, combining his own research, daily teaching and practice. The book mainly introduces the basic knowledge of neural networks and deep learning, the main models (feedforward networks, convolutional networks, recurrent networks, etc.) and their applications in computer vision, natural language processing and other fields [5].
Deep Learning: the book covers basic mathematical knowledge, machine learning experience, and the current theory and development of deep learning. Guided by the thinking of its three expert authors, it can help AI enthusiasts and practitioners gain a comprehensive understanding of deep learning.
The hit post “Deep Learning 500 Questions”, whose author is Tan Jiyong, an outstanding graduate of Sichuan University. The project collects 500 questions and answers in the form of deep learning interview Q&A, covering topics such as probability, linear algebra, machine learning, deep learning, and computer vision. The book is not finished yet, but it had already received 24,000 stars on GitHub (currently 37,000).

3.5.3 – Practice

After the initial introduction and the theory, you can start practicing deep learning algorithms in order to learn to apply them flexibly.

First, some common tools that extend your capabilities (so you don’t reinvent the wheel):

TensorFlow, Google’s open source deep learning framework. Its interfaces are low-level and it may be difficult to get started with.
Keras, a high-level neural network API written in Python that can run with TensorFlow, CNTK, or Theano as its back end. Keras is easy to start with, but its heavy encapsulation can make customization cumbersome, so it is geared toward fast experimentation and quick validation.
PyTorch, a deep learning framework released by Facebook that focuses on lower-level APIs dealing directly with array expressions. It received a lot of attention last year and has become the preferred solution for academic research and for deep learning applications that need to optimize custom expressions.

As for which tool is better, there is plenty of debate among each tool’s supporters. There is no need to agonize over which one to choose.

After choosing a tool and learning it, you can go to Kaggle and measure yourself against the experts. If you have the ability and the ideas, you can also create a project of your own.

3.6 – Other

Reinforcement learning, transfer learning, computer vision, NLP, recommendation systems, knowledge graphs, etc., are not covered here, but you can find them in the Heart of the Machine article [4].

3.7 – Paper reading

Most of the theoretical content of machine learning and deep learning comes from the papers published in the field of computer research, and the current cutting-edge technologies are also published in the papers in recent years.

As an expansion stage after the introduction, theory and practice, you can increase your knowledge by reading cutting-edge papers.

Since reading cutting-edge papers is not a required skill for a business-oriented algorithm engineer, I will not cover it here. Similarly, you can find an introduction to reading cutting-edge papers in the Heart of the Machine article.

4 –

Not long ago, a certain “404” website, YouTube, pushed me a video titled “Don’t Learn Machine Learning” by Daniel Bourke. Its core message is that you should not learn algorithms merely for the sake of learning algorithms, but in order to build products (or applications, or to solve problems). Readers who are able to access it can take a look (I have not yet seen a translated repost on domestic sites; if I have the time and opportunity, I may translate and repost it).

Studying just to get an offer is not necessarily the best path. My own goal is to keep back-end development as my main line. I have not given up on algorithms completely, partly because of my major and more because I know that some problems can only be solved effectively with these algorithms. I want to use more algorithms to empower programmers to solve more problems.

5 – References

[1] What is the difference between research in academia and R&D in industry? – Zhihu, www.zhihu.com/question/36…
[2] Refuse to follow the trend, talk about the differences and experience of several algorithm posts – Xi Xiaoyao
[3] Selected collection of algorithm engineer interview write-ups – Niuke, www.nowcoder.com/discuss/exp…
[4] Complete AI learning route, the most detailed collection of Chinese and English resources – Heart of the Machine, mp.weixin.qq.com/s/dI0im1AZm…
[5] Fudan Professor Qiu Xipeng published Neural Network and Deep Learning – Datawhale