By Zhang Xiangyu
Becoming a qualified development engineer is not easy. It requires mastering a series of skills from development to debugging to optimization, and each of these skills demands real effort and experience. Becoming a qualified machine learning algorithm engineer (hereinafter referred to as an algorithm engineer) is even harder, because on top of the general skills of an engineer, you also need to master a sizable network of machine learning knowledge.
Let's break down the skills required of a qualified algorithm engineer and look at them one by one.
Figure 1 Machine learning algorithm engineer skill tree
01 Basic development ability
Algorithmic engineers need to be engineers first and foremost, so they need to master some of the skills that all development engineers need to master.
Some students have a misunderstanding here, thinking that an algorithm engineer only needs to think about and design algorithms and need not care how they are implemented, because someone else will implement whatever they come up with. This idea is wrong. In most positions at most companies, algorithm engineers are responsible for the whole process from algorithm design to implementation to deployment.
I have seen companies adopt an organizational structure that separates algorithm design from algorithm implementation, but under this structure it is unclear who is actually responsible for the algorithm's results, and both the designers and the developers suffer a great deal. The specific reasons are beyond the scope of this article, but what I hope you remember is this: basic development skills are required of all algorithm engineers.
There are so many skills involved in basic development that I’ve chosen only two of the most important ones to illustrate.
[Unit Test]
In enterprise applications, a complete solution to a problem usually includes many processes, each of which needs iterative optimization and debugging. How to divide complex tasks into modules and ensure the correctness of the overall process? The most practical approach is unit testing.
Unit testing is not simply a testing skill; it is first and foremost a design ability. Not every piece of code can be unit tested: code can only be unit tested if it has been divided into units, that is, modules, in the first place. After breaking a project down into modules that can be developed and tested independently, and pairing each module with separate, repeatable unit tests, the correctness of each module can be guaranteed; and if the correctness of each module is guaranteed, the correctness of the overall process can be guaranteed.
For algorithm development, where the workflow changes frequently, good module design and unit testing are an important safeguard against digging holes for yourself and others. They are also an important prerequisite for modifying and optimizing your code later.
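To make this concrete, here is a minimal sketch, with a hypothetical normalize() step standing in for a real feature-processing module, of what an independently testable unit plus its repeatable tests can look like:

```python
# A minimal, hypothetical example of a unit-testable module.
# normalize() is an illustrative feature-processing step, not from the article.

def normalize(values):
    """Scale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # avoid division by zero for constant input
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Repeatable unit tests for this single module (runnable with pytest or plain asserts).
def test_normalize_basic():
    assert normalize([0, 5, 10]) == [0.0, 0.5, 1.0]

def test_normalize_constant_input():
    assert normalize([3, 3, 3]) == [0.0, 0.0, 0.0]


if __name__ == "__main__":
    test_normalize_basic()
    test_normalize_constant_input()
    print("all tests passed")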
[Logic abstraction and reuse]
Abstracting and reusing logic is arguably the most important principle in all software development. One important measure of a programmer's code quality is the proportion of duplicated and near-duplicated code in it. Large amounts of duplicated or similar code reflect laziness on the engineer's part, who finds it easiest to copy and paste. Not only does this look ugly, it is also very error prone, not to mention hard to maintain.
Algorithm development projects often contain a lot of similar logic, such as near-identical processing applied to multiple features, or many similar steps in the raw-data ETL. If you do not abstract away the repeated logic, the code becomes a long accumulation of flat, repetitive lines that are cumbersome to read and maintain.
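As a small illustration of pulling repeated logic into one place, the sketch below assumes a made-up clip-and-fill preprocessing step applied to several columns; the function and column names are invented for the example:

```python
# A hypothetical sketch: the same clip-and-fill logic reused across features
# instead of being copy-pasted once per column.
import pandas as pd

def clip_and_fill(series, lower, upper, fill_value=0.0):
    """Clip outliers to [lower, upper] and fill missing values."""
    return series.clip(lower, upper).fillna(fill_value)

df = pd.DataFrame({"age": [25, None, 240], "price": [9.9, 120000.0, None]})

# One abstraction, many call sites; changing the logic later touches one place only.
df["age"] = clip_and_fill(df["age"], 0, 100)
df["price"] = clip_and_fill(df["price"], 0, 10000)
print(df)
```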
02 Fundamentals of probability and statistics
Probability and statistics are one of the cornerstones of machine learning. From a certain perspective, machine learning can be regarded as a systematic way of thinking about and understanding an uncertain world through the lens of probability. Learning to look at problems from a probabilistic point of view, and to describe problems in the language of probability, is one of the most important foundations for deeply understanding and skillfully using machine learning techniques.
Probability theory covers a lot of ground, but its concrete expression is through specific distributions, so learning the common probability distributions and their properties well is very important.
- For discrete data, the Bernoulli, binomial, multinomial, Beta, Dirichlet, and Poisson distributions need to be understood and mastered.
- For continuous data, the Gaussian and exponential distributions are the most important ones. These distributions run through all the models of machine learning and appear in all kinds of data on the Internet and in the real world; understanding the distribution of your data is the prerequisite for doing anything with it.
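To get a feel for these distributions in code, here is a small sketch using scipy.stats; the specific parameters are arbitrary, chosen only for illustration:

```python
# A small sketch (assumed example, not from the article) of working with
# common distributions via scipy.stats.
from scipy import stats

# Discrete: probability of exactly 3 successes in 10 Bernoulli(0.2) trials.
print(stats.binom.pmf(k=3, n=10, p=0.2))

# Discrete: Poisson probability of 2 events when the rate is 4 per interval.
print(stats.poisson.pmf(k=2, mu=4))

# Continuous: density of a standard Gaussian at x = 1.0, and a sample of 5 draws.
print(stats.norm.pdf(1.0, loc=0.0, scale=1.0))
print(stats.norm.rvs(loc=0.0, scale=1.0, size=5))
```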
In addition, the theory of hypothesis testing also needs to be mastered. In this era of so-called big data, the thing most likely to deceive you is probably the data itself. Only by mastering hypothesis testing, confidence intervals, and related theory can you tell whether a conclusion drawn from data is real or spurious: for example, whether two groups of data really differ, or whether a metric really improved after a new strategy went live. Such questions are extremely common in practical work; without this ability you are effectively flying blind in the era of big data.
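As one hedged illustration of this kind of check, the following sketch runs a two-sample t-test on simulated "before and after" data; the numbers are made up and the Welch t-test is just one common choice:

```python
# A minimal sketch of an A/B-style comparison with a two-sample t-test.
# The data here is simulated for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=0.100, scale=0.03, size=5000)    # e.g. baseline CTR
treatment = rng.normal(loc=0.103, scale=0.03, size=5000)  # e.g. CTR after a new strategy

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)  # Welch's t-test
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# A small p-value suggests the observed lift is unlikely to be pure noise;
# a large one means the "improvement" may not be real.
```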
On the statistics side, the common parameter estimation methods need to be mastered, such as maximum likelihood estimation, maximum a posteriori (MAP) estimation, the EM algorithm, and so on. Like optimization theory, these are theories that apply to all models: the foundations of the foundations.
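For a minimal sense of what maximum likelihood estimation does, here is a toy sketch for Gaussian data, where the MLE has a closed form (the sample mean and the biased sample variance):

```python
# A toy maximum likelihood example: for Gaussian data, the MLE of the mean is the
# sample mean and the MLE of the variance is the (biased) sample variance.
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)

mu_mle = data.mean()                        # closed-form MLE of the mean
sigma2_mle = ((data - mu_mle) ** 2).mean()  # MLE of the variance (divides by n, not n-1)
print(mu_mle, sigma2_mle ** 0.5)            # should be close to 5.0 and 2.0
```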
03 Machine learning theory
Just because more and more out-of-the-box open source toolkits are available does not mean algorithm engineers can skip learning and mastering the fundamentals of machine learning. There are two main reasons:
- Mastering the theory lets you apply the various tools and techniques flexibly, rather than just copying recipes. Only on this basis can you truly build a machine learning system and keep optimizing it. Otherwise you are just stacking bricks with machine learning tools, not a qualified engineer: when problems appear you will not be able to solve them, let alone optimize the system.
- The purpose of learning the basic theory of machine learning is not only to learn how to build machine learning systems, but, more importantly, that this theory embodies a set of ideas and ways of thinking, including probabilistic thinking, matrix thinking, optimization thinking, and so on. These ways of thinking are extremely helpful for data processing, analysis, and modeling in the era of big data. Without them, if you still reason about problems in a big data environment with the old non-probabilistic, scalar mindset, the efficiency and depth of your thinking will be very limited.
The theory of machine learning is broad in both depth and scope, far more than a single article can summarize. So here I list some of the core elements and introduce the parts that are most helpful in practical work; after mastering these basics you can continue to explore and learn on your own.
[Basic Theory]
Basic theories are those that do not involve any specific models, but focus on learning itself. Here are some useful basic concepts:
- VC dimension. The VC dimension is an interesting concept: roughly speaking, the VC dimension of a class of functions is the largest number of samples that functions in the class can shatter, that is, separate into every possible labeling. What is its significance? When you have chosen a model and its features, it tells you roughly how large a dataset that choice of model and features is capable of classifying. In addition, the size of the VC dimension of a class of functions also reflects how prone that class is to overfitting.
- Information theory. From a certain point of view, machine learning and information theory are two sides of the same problem: the optimization of a machine learning model can also be viewed as a process of minimizing the amount of information in the dataset. Understanding the basic concepts of information theory is of great benefit when studying machine learning theory, for example the information gain that decision trees use to choose splits, or the information entropy that measures the amount of information in data (a short sketch after this list makes both concrete). For this part, refer to Elements of Information Theory.
- Regularization and the bias-variance tradeoff. If the principal contradiction in China at this stage is "the contradiction between the people's ever-growing needs for a better life and unbalanced and inadequate development", then the principal contradiction in machine learning is the contradiction between the model trying to fit the data and the model trying not to overfit the data. Regularization is one of the core techniques for resolving this contradiction. The specific regularization methods are not discussed here; what needs to be understood is the idea behind all of them: the bias-variance tradeoff. How different algorithms balance and trade off these two is an important difference among them, and understanding this point matters greatly for understanding the core differences between algorithms.
- Optimization theory. Solving most machine learning problems can be divided into two phases: modeling and optimization. Modeling means describing the problem with one of the models we will discuss later; optimization is the process of finding the model's optimal parameters once modeling is done. There are many models in common use in machine learning, but not nearly as many optimization methods behind them; in other words, many models share the same optimization methods, and the same optimization method can be used to optimize many different models. A thorough understanding of the commonly used optimization methods and the ideas behind them is necessary for understanding how model training works and for explaining the training results you see in various situations. These include maximum likelihood, maximum a posteriori, gradient descent, quasi-Newton methods, L-BFGS, and so on (a minimal gradient descent sketch follows this list).
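To make the information theory item above concrete, here is a small self-contained sketch, with made-up labels, that computes entropy and the information gain of a split, the quantity a decision tree uses to choose split points:

```python
# Entropy of a label list and information gain of a binary split (illustrative only).
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, left, right):
    """Entropy reduction achieved by splitting `labels` into `left` and `right`."""
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

parent = ["+", "+", "+", "-", "-", "-"]
print(information_gain(parent, ["+", "+", "+"], ["-", "-", "-"]))   # perfect split: 1.0 bit
print(information_gain(parent, ["+", "-"], ["+", "+", "-", "-"]))   # uninformative split: 0.0
```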
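And for the optimization item, a minimal gradient descent sketch for L2-regularized linear regression on synthetic data; the learning rate and penalty strength are illustrative choices, not recommendations:

```python
# Batch gradient descent for ridge (L2-regularized) linear regression, from scratch.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

w = np.zeros(3)
lr, lam = 0.1, 0.01                      # learning rate and L2 penalty strength
for step in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y) + 2 * lam * w   # gradient of MSE + lam*||w||^2
    w -= lr * grad
print(w)   # close to true_w; a larger lam shrinks the weights toward zero (less variance, more bias)
```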
There are many more basic theories of machine learning. You can start from the concepts above, treat them as entry points, and let the other topics you encounter along the way spread out like a network as you keep accumulating knowledge. In addition to Andrew Ng's famous courses, the open course Learning from Data is also worth studying. It has no background requirements, and what it teaches is the foundation that underlies all models, very close to the core essence of machine learning. A Chinese-language version of the course, Machine Learning Foundations, is also available online and is taught by a student of the instructor of the English-language course.
[Supervised learning]
Having understood the basic concepts of machine learning, you can move on to learning specific models. In current industrial practice, supervised learning is still the most widely applied, because many of the problems we meet in reality amount to predicting some attribute of something, and such problems can be turned into supervised learning problems through reasonable abstraction and transformation.
Before studying complex models, I recommend starting with the simplest one, typically naive Bayes. Naive Bayes makes a strong assumption that does not hold for many problems, and its model structure is very simple, so its performance is often not the best. But precisely because of its simple form, it helps a learner deeply understand every step of modeling and optimization, which is very useful for understanding how machine learning works. At the same time, after a clever transformation, the naive Bayes model can be written in a form unified with that of logistic regression, which offers another angle for interpreting logistic regression and helps a great deal in understanding this most commonly used of models more deeply.
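A minimal scikit-learn sketch of training and using a naive Bayes model, on a standard toy dataset purely for illustration:

```python
# Gaussian naive Bayes on a toy dataset with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB().fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
print("class probabilities for one sample:", model.predict_proba(X_test[:1]))
```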
Once you have mastered the basic workflow of machine learning models, you need to learn the two most fundamental model forms: the linear model and the tree model, corresponding respectively to linear/logistic regression and to regression/classification trees (decision trees). The commonly used models today, both shallow models and the deep models of deep learning, are built on these two basic forms.
The questions to consider carefully when studying these two models are: what is the essential difference between them? Why do we need both? How do they differ in training and prediction accuracy, efficiency, complexity, and so on? Once you understand these essential differences, you can choose between them freely according to the problem and the data.
Having mastered the two basic forms, the next step is to master their complex forms. The complex form of the linear model is the multi-layer linear model, that is, the neural network. The complex forms of the tree model are the boosting combinations represented by GBDT and the bagging combinations represented by the random forest.
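The contrast between the two basic forms and their complex forms is easy to feel in code. The sketch below, with default-ish hyperparameters chosen only for illustration, puts a linear model, a single tree, a boosted ensemble, and a bagged ensemble side by side:

```python
# Comparing a linear model, a single tree, and their "complex forms"
# (a boosted tree ensemble and a bagged tree ensemble) on the same data.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
models = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "decision tree": DecisionTreeClassifier(max_depth=4),
    "gradient boosting (GBDT)": GradientBoostingClassifier(),
    "random forest (bagging)": RandomForestClassifier(n_estimators=200),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()   # 5-fold cross-validated accuracy
    print(f"{name:28s} {score:.3f}")
```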
The ideas of boosting and bagging are themselves worth learning and understanding, as they represent two general ways of strengthening a model. Boosting keeps improving on the basis of what has already been learned, while bagging embodies the idea that "three cobblers together outdo a Zhuge Liang": combining multiple weak classifiers yields a strong classifier.
The two combination methods have their own strengths and weaknesses, but the shared idea can be borrowed in daily work. For example, in recommendation systems we often use multiple recall sources, which from a certain point of view is a bagging idea: each individual recall source may not perform best on its own, but combining several of them achieves better results than any one alone. The idea matters more than the specific model.
[Unsupervised learning]
Although supervised learning currently accounts for the majority of machine learning application scenarios, unsupervised learning is also very important both in terms of data size and function.
One major category of unsupervised learning is clustering. Clustering is used in two ways: one takes the clustering result itself as the end goal, and the other uses the clustering result as a feature in supervised learning. These two uses are not tied to any particular clustering method; they are just different ways of using the results, which takes continuous learning, accumulation, and reflection at work. At the introductory stage, what you need to grasp are the core differences between clustering algorithms. For instance, among the most common methods, what kinds of problems suit Kmeans and DBSCAN respectively? What are the assumptions behind the Gaussian mixture model? What are the relationships among documents, topics, and words in LDA? These models are best studied together so as to grasp the connections and differences between them, rather than being treated as isolated things.
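A small comparison sketch of the clustering methods mentioned above, on toy two-dimensional data with illustrative parameter choices:

```python
# KMeans, DBSCAN, and a Gaussian mixture applied to the same toy data.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)   # no k needed; noise is labeled -1
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)  # soft, Gaussian-shaped clusters

print(set(kmeans_labels), set(dbscan_labels), set(gmm_labels))
```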
Besides clustering, embedding representations are another important family of unsupervised learning methods. The difference from clustering is that clustering uses existing features to partition the data, while an embedding creates new features, a brand-new representation of each sample. This new representation provides a fresh perspective on the data, and that perspective opens up new possibilities for processing it. Although this practice emerged from the NLP field, it is so general that it can be applied to many kinds of data with good results, and it has become a must-have skill.
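As one simple, non-neural example of an embedding representation (classic LSA rather than word2vec-style methods, with a made-up toy corpus), each document below gets a new dense representation that is no longer tied to the original word features:

```python
# Embedding documents into a low-dimensional dense space via TF-IDF + truncated SVD (LSA).
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "user clicked the red shoes",
    "user bought red running shoes",
    "stock market fell sharply today",
    "markets dropped on economic news",
]
tfidf = TfidfVectorizer().fit_transform(docs)          # sparse bag-of-words features
embeddings = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
print(embeddings)   # each row is a dense 2-d representation of one document
```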
A good starting point for machine learning theory is An Introduction to Statistical Learning with Applications in R, which explains the commonly used models and their theoretical foundations well and provides suitable exercises to consolidate the knowledge. For more advanced study there are The Elements of Statistical Learning, a deeper treatment from an overlapping set of authors, and the well-known Pattern Recognition and Machine Learning.
04 Development languages and tools
Having mastered enough theoretical knowledge, we also need enough tools to put these theories into practice. In this part, we introduce some commonly used languages and tools.
[Development languages]
Python has been the most popular language in data science and algorithm work in recent years, mainly because of its low barrier to entry, ease of use, complete tool ecosystem, and good platform support, so I will not belabor it here. Besides Python, I recommend learning R as well, for the following reasons:
- The R language has the most complete statistical tool chain. We discussed the importance of probability and statistics above, and R provides the most comprehensive support in this area. Some everyday statistical needs can be met faster in R than in Python. Python's statistical tooling keeps improving, but R still has the largest and most active community in statistical science.
- It cultivates vectorized, matrix, and tabular thinking. All data types in R are vectorized: an integer variable is essentially a one-dimensional vector of length one. On this basis R builds efficient matrix and data frame (DataFrame) types and supports very rich, intuitive operations on them. These data types and ways of thinking have been adopted by more modern languages and tools, for example the ndarray in NumPy and the DataFrame introduced in newer versions of Spark, both directly or indirectly inspired by R, and the operations defined on them mirror the operations on data frames and vectors in R. Just as learning programming is best started from the C language, for data science and algorithm development I suggest everyone learn R, not only the language itself but also the thinking it embodies, which is of great benefit for mastering and understanding modern tools.
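The same vectorized and tabular style carries over directly to NumPy and Pandas; a tiny sketch with made-up data:

```python
# Vectorized and tabular thinking in Python: operate on whole arrays and columns,
# not on individual scalars in explicit loops.
import numpy as np
import pandas as pd

prices = np.array([10.0, 12.5, 9.9, 30.0])
discounted = prices * 0.9                       # one expression, no loop

df = pd.DataFrame({"category": ["a", "a", "b", "b"], "price": prices})
df["discounted"] = df["price"] * 0.9            # column-wise, data-frame-style operation
print(df.groupby("category")["price"].mean())   # grouped aggregation in one line
```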
Besides R, Scala is also worth learning. It is currently one of the languages that best combines the object-oriented and functional paradigms: it does not force you to write in a functional style, yet gives you full support when you want to. This makes it relatively easy to pick up, and as your experience and understanding grow, you can write increasingly advanced and elegant code with it.
[Development tools]
In terms of development tools, the Python suite is by far the most convenient. NumPy, SciPy, scikit-learn, Pandas, and Matplotlib cover most single-machine analysis and training tasks. For model training there are also more specialized tools that can deliver better accuracy and performance, such as LibSVM, Liblinear, XGBoost, and so on.
In terms of big data tools, Hadoop and Spark remain the mainstream for offline computing, while Spark Streaming and Storm are the mainstream choices for real-time computing. Several newer platforms that have emerged in recent years, such as Flink and TensorFlow, are also worth watching. It is worth noting that mastering Hadoop and Spark means not only their coding interfaces but also a basic understanding of how they run: for example how the MapReduce process is executed on Hadoop, which operations are expensive on Spark, and how aggregateByKey and groupByKey differ in behavior (a small sketch of this last point follows). Only with this understanding can you use these big data platforms freely; otherwise it is easy to run into slow jobs, crashes, memory blow-ups, and other problems.
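As a hedged sketch of that last point, the snippet below (which assumes a local Spark installation and uses made-up key-value pairs) computes per-key sums both ways; groupByKey shuffles every raw value across the network, while aggregateByKey pre-combines values within each partition first:

```python
# groupByKey vs. aggregateByKey on a PySpark RDD (requires a local Spark installation).
from pyspark import SparkContext

sc = SparkContext("local[*]", "shuffle-demo")
pairs = sc.parallelize([("a", 1), ("a", 3), ("b", 2), ("b", 4)])

# groupByKey ships every individual value to the reducer before summing.
sums_grouped = pairs.groupByKey().mapValues(lambda vals: sum(vals)).collect()

# aggregateByKey combines within each partition first, so far less data is shuffled.
sums_aggregated = pairs.aggregateByKey(0, lambda acc, v: acc + v, lambda a, b: a + b).collect()

print(sums_grouped, sums_aggregated)
sc.stop()
```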
05 Architecture Design
Finally, we spend some time talking about the architecture of machine learning systems.
The architecture of a machine learning system refers to the overall set of systems, and the relationships among them, that support model training, prediction, and the stable and efficient operation of the resulting services.
As business scale and complexity grow, machine learning inevitably moves toward systematization and platformization. At that point you need to design an overall architecture according to the characteristics of the business and of machine learning itself, including the architecture of the upstream data warehouse and data pipelines, the architecture of model training, the architecture of online serving, and so on. Learning this is not as straightforward as the earlier topics: there are few ready-made textbooks, and it relies more on abstracting and summarizing from extensive practice while continuously evolving and improving the current system. But it is definitely high ground worth fighting for as an algorithm engineer. My advice here is to practice more, summarize more, abstract more, and iterate more.
The current state of the machine learning algorithm engineer field
It’s arguably the best time to be a machine learning algorithm engineer, and they’re in high demand across industries. Typical examples include the following segments:
- Recommendation systems. A recommendation system solves the problem of efficiently matching and distributing information in massive-data scenarios. Machine learning plays an important role in candidate recall, result ranking, user profiling, and other parts of this process.
- Advertising systems. Advertising systems have much in common with recommendation systems, but also significant differences: besides the platform and the users, the interests of advertisers must be considered as well. Two parties become three, which makes some problems more complicated. Their use of machine learning is similar to that of recommendation.
- Search systems. Machine learning is used extensively in both the underlying infrastructure and the upper-level ranking of a search system. Moreover, on many websites and apps, search is a very important traffic entrance, so machine learning optimization of the search system directly affects the efficiency of the whole site.
- Risk control systems. Risk control, especially in Internet finance, is another important battlefield for machine learning in recent years. It is no exaggeration to say that the ability to use machine learning well largely determines an Internet finance company's risk control capability, and risk control capability is itself the core competitive strength underpinning these companies' business security.
But as the saying goes, with higher pay comes greater responsibility, and the demands placed on algorithm engineers keep rising. In general, a senior algorithm engineer should be able to handle the entire pipeline of "data collection → data analysis → model training and tuning → model deployment" and continuously optimize every link in it. An engineer might start at one point in this pipeline and gradually expand their scope outward.
In addition to the fields listed above, there are many traditional industries that are also exploring the ability of machine learning to solve traditional problems, and the future potential of the industry is huge.
[References]
For the relationship between naive Bayes and logistic regression, see: www.cs.cmu.edu/~tom/mlbook…
Learning from Data
An Introduction to Statistical Learning with Applications in R (a Chinese translation, titled Introduction to Statistical Learning, is available, along with video lectures covering the first seven chapters)
The Elements of Statistical Learning
Elements of Information Theory
About the author: Zhang Xiangyu, head of the recommendation algorithm department at a second-hand trading platform and algorithm architect, responsible for the recommendation system and other algorithm-related work. Email: [email protected]
This is an original article and may not be reproduced without permission.