This article was published by Rong Chang on Tencent's WeChat account.
Preface
Introduction: As a mathematics major who moved into machine learning midway through my career, I naturally stepped into numerous pitfalls along the way and took many detours that could have been avoided. I therefore wanted to put together a summary of how to get into machine learning, as a small contribution to those who come after me.
In March 2016, when AlphaGo defeated Lee Sedol, artificial intelligence began to enter the public eye on a large scale. Not only are Internet engineers paying attention to the development of artificial intelligence; people outside the industry have also begun to notice its impact on daily life. With the steady improvement of face recognition, the popularity of personalized news recommendation apps, and the spread of open-source tools such as TensorFlow, more and more people are gradually moving into the field of artificial intelligence: developers with a computer science background, electronics and communication engineers, and people from traditional sciences such as mathematics and physics.
As someone who changed careers, it is natural to introduce my own background first. As an undergraduate I majored in mathematics and applied mathematics, which a layman can think of as pure mathematics. During my PhD, my research focused on dynamical systems and fractal geometry, still pure mathematics, with little connection to computing.
If you want to know what I used to work on, see my Zhihu article “Complex Dynamical Systems (1) — Fatou Sets and Julia Sets”. As for machine learning, I had never encountered it, or even heard of it, during my school years. However, I did write some C++ code out of professional necessity during my studies, and I also left some footprints on the UVa Online Judge.
2015: Attempting the Transition
Hard is the road, hard is the road; so many branching paths — where am I now?
After graduating in 2015, I happened to land a job at Tencent doing machine learning related work. When I first arrived, however, I was under a lot of pressure, and looking back I took some detours that I should not have taken. Li Bai's poem “Hard Is the Road” captures my mood at the time: “Hard is the road, hard is the road; so many branching paths — where am I now?”
In October 2015 I came into contact with a recommendation project for the first time, the XX recommendation project. It was the second recommendation project our group had taken on at the time. The recommendation system was still being built directly on the big data cluster, without any explanatory documentation or front-end pages, and the whole system and its processes were complicated and tedious. Still, while working with this system I gradually began to learn some simple Linux commands and the use of SQL.
In fact, I not only learned SQL through this system; I also helped the business side extract data while on ADS duty at the time, which further deepened my grasp of basic SQL. For learning SQL, I read two very good introductory textbooks in 2015, “SQL Basics Tutorial” and “Hive Programming Guide”, and I also read “Linux Command Line and Shell Scripting” to get a feel for the shell. After working for a while, I wrote an article called “Hive Basics” to summarize some common SQL patterns.
Besides using SQL to process data, you also need to understand common machine learning algorithms in order to actually do machine learning. The first machine learning algorithm I encountered was Logistic Regression, and once Logistic Regression comes up, the concept of cross-validation is unavoidable; it is one of the basic concepts in machine learning.
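For readers new to these two concepts, here is a minimal sketch, using scikit-learn and purely synthetic data (the original project's features and labels are not described in this article), of what logistic regression evaluated with k-fold cross-validation looks like:

```python
# Minimal sketch: logistic regression evaluated with k-fold cross-validation.
# The data here is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(1000, 10)                        # 1000 samples, 10 features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic binary label

model = LogisticRegression(max_iter=1000)
# 5-fold cross-validation: train on 4 folds, validate on the held-out fold,
# and report the accuracy of each fold plus the mean across folds.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```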
New features can be constructed from the category attributes of articles and the basic features of users, for example by taking inner products; later on I gradually added outer products and Cartesian products of features as well. Beyond feature crosses, there are many other ways to construct features, such as standardization, normalization, discretization, and binarization. And beyond constructing features, judging which features matter is itself a key problem.
The most common method is to look at the weights of the trained model; mathematical tools such as Pearson's correlation coefficient and KL divergence can also be used to roughly judge whether a feature is useful. During this period I also wrote several articles: “Cross-Validation”, “Introduction to Feature Engineering”, and “KL Divergence”. As for feature engineering, beyond reading the necessary books, the most important thing is practice; only practice builds up real experience.
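As a rough illustration of the two quick checks mentioned above — inspecting trained model weights and computing Pearson correlations against the label — here is a small, self-contained sketch on synthetic data; the data and the number of features are assumptions for illustration only:

```python
# Two quick feature-importance checks: absolute weights of a trained linear
# model, and the Pearson correlation of each feature with the label.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # only features 0 and 1 matter

X_std = StandardScaler().fit_transform(X)        # standardize so weights are comparable
clf = LogisticRegression(max_iter=1000).fit(X_std, y)

for j in range(X_std.shape[1]):
    corr, _ = pearsonr(X_std[:, j], y)
    print(f"feature {j}: |weight| = {abs(clf.coef_[0][j]):.3f}, pearson = {corr:.3f}")
```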
Before the recommendation system was rebuilt, model weights were computed offline with the Logistic Regression algorithm and then imported into the online system for real-time scoring. In addition to this offline algorithm, in December 2015 I studied FTRL, an algorithm capable of online learning. After that research, I shared a summary within the team at the beginning of 2016, and I recently reposted the article, “Follow the Regularized Leader”, on my WeChat official account.
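For readers who have not seen it, the following is a compact sketch of the FTRL-Proximal per-coordinate update for logistic regression, following McMahan et al. (2013); the hyperparameters and toy data stream are illustrative assumptions, not the settings used in the project:

```python
# FTRL-Proximal for logistic regression: per-coordinate adaptive learning
# rates (n), accumulated adjusted gradients (z), and L1-induced sparsity.
import numpy as np

class FTRLProximal:
    def __init__(self, dim, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = np.zeros(dim)   # accumulated "adjusted" gradients
        self.n = np.zeros(dim)   # accumulated squared gradients

    def weights(self):
        w = np.zeros_like(self.z)
        active = np.abs(self.z) > self.l1          # L1 keeps the rest at exactly 0
        w[active] = -(self.z[active] - np.sign(self.z[active]) * self.l1) / (
            (self.beta + np.sqrt(self.n[active])) / self.alpha + self.l2)
        return w

    def update(self, x, y):
        w = self.weights()
        p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))    # predicted probability
        g = (p - y) * x                            # gradient of the log loss
        sigma = (np.sqrt(self.n + g * g) - np.sqrt(self.n)) / self.alpha
        self.z += g - sigma * w
        self.n += g * g
        return p

# one pass over a synthetic stream of (x, y) examples
rng = np.random.RandomState(0)
model = FTRLProximal(dim=5)
for _ in range(1000):
    x = rng.randn(5)
    y = int(x[0] + 0.5 * x[1] > 0)
    model.update(x, y)
print("learned weights:", model.weights())
```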
Through the XX recommendation project I learned that data is the cornerstone of any machine learning project. If the data quality is poor, it must be preprocessed, and sometimes developers even have to be pushed to fix data reporting at the source. Generally speaking, to do a recommendation project well, the most important thing besides feature engineering and algorithms is data verification. The practice at the time was to cross-check data from multiple sources: the results computed offline by the algorithm, the results computed online, and the results actually displayed in the product. These three must be completely consistent; if they are not, the discrepancy has to be investigated rather than pushing the project forward regardless. I stepped into countless data pitfalls during this period, so the lesson is to check the data again and again.
2016: From Zero to One
Standing on the shoulders of giants, one can see further. — Learning the recommendation system
“Only by standing on the shoulders of giants can one see further.” In February 2016, in addition to tuning the personalization algorithm for the home page of the XX recommendation project, I started another small project: opening up a new tab on the home page, that is, recommending different items to different users. A simple way to do this is to use ItemCF or a heat-propagation algorithm, so that after a user has listened to a particular program, similar programs are recommended.
In fact, this scenario has long been proven in industry; it is not new at all. It is just like a user viewing a book on an e-commerce site and then being recommended other related books. I previously wrote about a simple recommendation algorithm in “Material Diffusion Algorithm”, which readers may find useful, and I will keep improving the material on ItemCF and the heat-conduction algorithm in later blog posts.
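For concreteness, here is a minimal item-based collaborative filtering (ItemCF) sketch: it builds an item-item cosine-similarity matrix from implicit feedback and recommends items similar to those a user has already listened to. The interaction matrix below is a toy example, not project data:

```python
# Minimal ItemCF: item-item cosine similarity from a user-item matrix,
# then score unseen items by their similarity to items the user consumed.
import numpy as np

# rows = users, cols = items; 1 means the user listened to the item
R = np.array([
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 1, 1],
], dtype=float)

norms = np.linalg.norm(R, axis=0, keepdims=True)
norms[norms == 0] = 1.0
sim = (R.T @ R) / (norms.T @ norms)      # item-item cosine similarity
np.fill_diagonal(sim, 0.0)               # an item should not recommend itself

def recommend(user_id, top_k=2):
    scores = R[user_id] @ sim            # aggregate similarity to consumed items
    scores[R[user_id] > 0] = -np.inf     # mask items already consumed
    return np.argsort(scores)[::-1][:top_k]

print("recommendations for user 0:", recommend(0))
```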
“Read a book a hundred times and its meaning will reveal itself.” While using the whole recommendation system, I only had a rough idea of how it was put together, and to understand machine learning algorithms as a whole, doing projects alone is far from enough. While working on the recommendation business, I spent some time reading Zhou Zhihua's textbook Machine Learning, published in early 2016. Personally, I did not find the book difficult, but to appreciate its subtleties I needed another book alongside it, namely Machine Learning in Action. Machine Learning in Action not only describes the principles of the main machine learning algorithms but also provides detailed source code, which is enough to take any beginner from getting started to getting hands-on.
The road ahead is long and has no ending; yet high and low I will search with my will unbending.
By “from zero to one” I mean experiencing, during that year, how to start a new line of business from scratch. In 2016, in order to bring machine learning into business security, an XX project team was set up within the department. Before this project the department had made no large-scale attempts of this kind: there was no prior success to draw on, not even a suitable existing system to reuse, and security is fundamentally a different business from recommendation.
For a recommendation system, the accuracy of the recommendations determines whether the CTR reaches its target. For a security system, however, the precision needs to exceed 99% before it can go online to crack down on black industry activity. The most commonly used algorithm in our earlier recommendation systems was logistic regression, which can hold both item and user features; the other algorithms were mainly ItemCF and the heat-conduction algorithm. As a result, during the XX project the previous technical solutions were not usable, and a new framework had to be built around the actual scenarios of business security.
At that time, however, I had no actual experience with the security business, and the tentative plan was to pilot machine learning on the XX1 and XX2 businesses. To prepare, I first researched several startups known for “machine learning + security”. One of them was company XX, which had published an article describing how machine learning can be applied to business security, namely by building an adversarial system combining unsupervised learning, supervised learning, and human labeling. I summarized the outlier detection algorithms I studied over those two or three months in the following articles: “Outlier Detection Algorithms (I)”, “Outlier Detection Algorithms (II)”, “Outlier Detection Algorithms (III)”, and “An Overview of Outlier Detection Algorithms”.
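As one representative example of the unsupervised side of such a system, here is a short sketch using Isolation Forest from scikit-learn; the synthetic data and contamination rate are assumptions for illustration, not the algorithms or settings actually used in the project:

```python
# One common unsupervised outlier-detection method: Isolation Forest.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
normal = rng.randn(500, 2)                             # bulk of "normal" accounts
outliers = rng.uniform(low=-6, high=6, size=(20, 2))   # scattered anomalies
X = np.vstack([normal, outliers])

detector = IsolationForest(contamination=0.05, random_state=42).fit(X)
labels = detector.predict(X)                           # -1 = outlier, 1 = inlier
print("points flagged as outliers:", int((labels == -1).sum()))
```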
At the end of 2016, by coincidence, a colleague saw an article I had written in November 2016 and asked me how to build a game AI. At the time I knew little about game AI beyond the fact that DeepMind had built AlphaGo and had used deep neural networks to play Atari games back in 2013. In December I spent some time studying reinforcement learning and deep learning, and built a simple DQN network for reinforcement learning experiments. After several rounds of discussion, I finally produced a simple game AI in January 2017 that could learn to play on its own through machine learning. Although I am not in a game department, this experience gave me a strong interest in game AI, and I wrote two articles, “Reinforcement Learning and Functional Analysis” and “Deep Learning and Reinforcement Learning”.
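For readers unfamiliar with DQN, the following is a bare-bones sketch of its core ingredients — a Q-network, an epsilon-greedy policy, and the temporal-difference target. The environment, replay buffer, and training loop are omitted, and PyTorch is used purely for illustration; the article does not state which framework or network the original game AI used:

```python
# Core DQN pieces: a Q-network, epsilon-greedy action selection, and the
# TD loss computed against a frozen target network.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNet(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions))

    def forward(self, s):
        return self.net(s)

def select_action(qnet, state, n_actions, epsilon=0.1):
    # epsilon-greedy exploration: random action with probability epsilon
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(qnet(state).argmax().item())

def td_loss(qnet, target_net, batch, gamma=0.99):
    s, a, r, s_next, done = batch
    q = qnet(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q(s, a)
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values      # max_a' Q_target(s', a')
        target = r + gamma * (1 - done) * q_next
    return F.mse_loss(q, target)

qnet = QNet(state_dim=4, n_actions=2)
state = torch.zeros(4)
print("greedy action for a dummy state:", select_action(qnet, state, n_actions=2, epsilon=0.0))
```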
2017: Rallying Again
While working on my day-to-day projects, I also got into quantum computing in 2017. Over the following months I kept investigating the basics of quantum computing and some technical approaches to quantum machine learning, and wrote two articles, “Quantum Computing (I)” and “Quantum Computing (II)”, introducing its basic concepts and techniques.
Thirty years of rank and fame are but dust and dirt; eight thousand li of road, under cloud and moon.
By “rallying again” I mean once more taking a new project from zero to one in 2017. In July 2017, as the machine learning framework for business security had gradually matured and the XX project was winding down, a new project landed in my hands: the cloud intelligent operations (O&M) project. The operations center was still in its exploratory, early stage, and the industry's push for intelligent operations only really began in 2017 — the progression from manual operations, through automated operations, to AI-driven operations, known as AIOps. Only by reaching that stage can the operations center truly get to the point where O&M can be done over a cup of coffee.
I officially joined the operations project in August 2017. From my conversations with the O&M colleagues, the business had several pain points and difficulties at the time, for example anomaly detection on Cloud Monitor time series, root cause analysis in Hubble, root cause analysis in the cloud ROOT system, troubleshooting, and cost optimization. Given the shortage of AIOps staff and the scarcity of academic research on such technical solutions, applying machine learning to operations was a huge challenge. By contrast, the shield recommendation system of those days, however it was built, could easily be adopted by other recommendation businesses; there was relatively mature in-house experience and there were numerous success stories in academia and industry. Intelligent operations, however, only began to be promoted in 2017; before that, everything was manual operations and DevOps. So how to build an intelligent operations system the department could use as soon as possible became a major challenge. The main problems were the following:
- Heavy historical baggage
- A shortage of AIOps staff
- No mature system framework
Since introducing technology from outside was not an option, we could only rely on our own research, while our collaborators were mainly focused on business operations and O&M development. The first intelligent-operations project I touched was Hubble's multi-dimensional drill-down analysis. The business scenario: when the success rate or some other metric drops, we need to pinpoint the anomalous dimension values — carrier, province, phone model and so on — that explain the drop. This is classic root cause analysis. After some investigation I found several key papers to refer to, and after weighing them I wrote a document called “Exploration of Root Cause Analysis”. PS: beyond Hubble's multi-dimensional drill-down, I feel this kind of method can also be used in BI-style business analysis, to answer questions such as “why is DAU falling?” or “why did revenue miss expectations?”
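To make the idea concrete, here is a deliberately simplified, illustrative sketch of dimension drill-down: for each value of one dimension (say, carrier), it computes how much that slice contributes to the overall drop in success rate between a baseline window and the current window. Production methods (for example Adtributor-style approaches) use more careful measures such as explanatory power and surprise; the numbers below are made up:

```python
# Toy drill-down on one dimension (carrier). Each entry is (successes, requests).
# This simple decomposition assumes per-slice request volumes stay roughly
# unchanged between the two windows.
baseline = {"carrier_A": (9800, 10000), "carrier_B": (4900, 5000), "carrier_C": (990, 1000)}
current  = {"carrier_A": (9750, 10000), "carrier_B": (3500, 5000), "carrier_C": (985, 1000)}

def rate(d):
    s = sum(v[0] for v in d.values())
    n = sum(v[1] for v in d.values())
    return s / n

total_requests = sum(v[1] for v in current.values())
total_drop = rate(baseline) - rate(current)
print(f"overall success rate dropped by {total_drop:.4f}")

for key in baseline:
    s0, n0 = baseline[key]
    s1, n1 = current[key]
    contrib = (s0 - s1) / total_requests   # this slice's share of the overall drop
    print(f"{key}: rate {s0/n0:.3f} -> {s1/n1:.3f}, contribution {contrib:.4f}")
```

Running this example points squarely at carrier_B, whose slice accounts for nearly all of the overall decline.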
Besides Hubble's multi-dimensional drill-down, time series anomaly detection for Cloud Monitor was an even harder project. The previous Monitor anomaly detection relied on developers setting three thresholds (maximum, minimum, and volatility) for each curve based on its shape. The result was poor accuracy, insufficient coverage, and enormous labor cost. When millions of curves need anomaly detection, manually configuring thresholds for every curve is simply unreasonable; the consequence is that someone has to be on duty every week, and problems may still not be found in time. For time series, the algorithms that usually come to mind are ARIMA, the RNN and LSTM models from deep learning, and Facebook's recently open-sourced Prophet tool. I have investigated all of these and will write articles introducing the use of ARIMA, RNN, and Prophet later; comments and discussion are welcome.
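As a small illustration of what replacing hand-set max/min/volatility thresholds can look like in its simplest form, here is a rolling-statistics detector that compares each point against a mean and standard deviation computed from its own recent history; the window size, the factor k, and the synthetic series are assumptions for illustration only:

```python
# Moving-statistics anomaly detection: flag points that deviate from the
# rolling mean by more than k rolling standard deviations.
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)
ts = pd.Series(100 + 10 * np.sin(np.arange(500) / 20) + rng.randn(500))
ts.iloc[300] += 60                       # inject an obvious spike

window, k = 50, 3.0
mean = ts.rolling(window).mean()
std = ts.rolling(window).std()
anomalies = ts[(ts - mean).abs() > k * std]
print("anomalous points at indices:", list(anomalies.index))
```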
In fact, the time series prediction and anomaly detection algorithms above mainly operate on a single series, and essentially work only for relatively stable series with clear historical patterns. If a separate model is built for every curve, it is no better than threshold detection, and the labor cost remains huge. Moreover, in the actual Cloud Monitor scenario, each of these per-series anomaly detection models has its own defects, and none can achieve the goal of “one person watching over millions of KPI curves”.
Therefore, after extensive research, we proposed a new technical solution and managed to detect anomalies on “millions of curves” with only a handful of models: an unsupervised scheme combined with a supervised one. The first layer uses unsupervised algorithms to filter out most of the candidate anomalies, and the second layer uses supervised algorithms to improve precision and recall.
Across the various time series anomaly detection algorithms, papers usually target one class of series with one class of model, for which the results can indeed be optimal. In our scenario, however, the curves we see come in so many strange shapes that I could not even tell you how many there are, so tying ourselves to one model for one class of series is out of the question. Machine learning does, however, offer ensemble-style methods: take the outputs of multiple models as features and use those features to train one relatively general model, so that anomalies can be detected across all Cloud Monitor time series. I summarized this kind of method in “Introduction to Time Series (I)”. In the end we achieved “one person watching over a million curves” and successfully removed the need for the business to set thresholds by hand.
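The following condensed sketch illustrates the shape of such a scheme: several simple detectors turn each window into features, a cheap unsupervised rule filters candidate anomalies, and a supervised classifier trained on labeled windows makes the final decision. The features, synthetic data, and model choice here are illustrative assumptions, not the production design described above:

```python
# Two-layer idea: detector outputs become features; an unsupervised rule
# pre-filters candidates; a supervised model makes the final call.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def detector_features(window):
    """Describe the last point of a window with the outputs of several simple detectors."""
    x, last = window[:-1], window[-1]
    return np.array([
        (last - x.mean()) / (x.std() + 1e-9),   # z-score vs. recent history
        last - np.median(x),                    # deviation from the median
        last - x[-1],                           # first difference
    ])

def make_windows(series, size=30):
    return np.array([series[i - size:i + 1] for i in range(size, len(series))])

# synthetic labeled data: spikes injected into a smooth curve
rng = np.random.RandomState(7)
series = 50 + 5 * np.sin(np.arange(2000) / 30) + rng.randn(2000)
labels = np.zeros(len(series), dtype=int)
spikes = rng.choice(np.arange(100, 2000), size=40, replace=False)
series[spikes] += 30
labels[spikes] = 1

windows = make_windows(series)
y = labels[30:]                                   # label of each window's last point
X = np.array([detector_features(w) for w in windows])

# layer 1: cheap unsupervised filter keeps only suspicious candidates
candidates = np.abs(X[:, 0]) > 2.0
# layer 2: supervised model scores the remaining candidates
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[candidates], y[candidates])
print("candidate windows:", int(candidates.sum()),
      "| predicted anomalies:", int(clf.predict(X[candidates]).sum()))
```

The key design point is that the features are detector outputs rather than raw values, which is what lets one model generalize across curves of very different shapes.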
Into the Future
This is what my heart holds dear; though I die nine deaths, I shall not regret it.
Over the course of changing careers, I have taken detours, felt the pain of chasing down data problems, enjoyed the satisfaction of hitting business targets, and learned what it takes to build a system from zero to one. I am writing this article to record my growth over the past two years, hoping it makes a modest contribution to those who aspire to move into machine learning. From these two years of project experience: to take a project from zero to one, you need a good plan at the start and then step-by-step adjustments as the project progresses; but without sufficient accumulated knowledge, it is hard to find the right way forward.
“This is what my heart holds dear; though I die nine deaths, I shall not regret it.” Sometimes people give up a long-term goal for short-term gains. But if you want to go further, the best plan is to grow together with your team; ideally everyone holds a long-term goal and does not let small fluctuations throw them off. At the same time, when a team or an individual is too eager for quick results, failure often follows, whereas persistent learning is the only way to do research and to get real work done.
The poet Lu You once taught the younger generation: “If you truly wish to learn poetry, the effort lies outside the poem.” That is, to really write good poetry you have to put effort into life itself and experience its ups and downs, rather than just reading poems over and over.
Readers of Demi-Gods and Semi-Devils will remember the scene where Jiumozhi comes to challenge the Shaolin Temple and demonstrates Shaolin's seventy-two unique skills in front of the monks, frightening many of them. Xuzhu, who was watching the contest, told the Shaolin monks that although Jiumozhi's moves were indeed Shaolin's unique skills, they were in essence driven by the Little Formless Skill: the techniques were the same, but the inner force behind them was Taoist. The reason the Shaolin monks could not see through Jiumozhi's kung fu is that in their own training they had spent their lives clutching the secret manuals, practicing at most a dozen or so of the seventy-two skills.
In fact, Jiumozhi's own practice shows that the key to mastering martial arts does not lie in the manuals. Without finding the crucial sutras and the underlying method of channeling inner force, no matter how many years one spends holding the manuals and practicing, there will always be an essential gap from the true masters.
In my two years in the Social Network Operations Department of SNG, I have worked on recommendation projects, done a security project, and am now doing an operations project. As far as I know (I am not sure this is accurate), I am the only person in the department who has worked on all three kinds of projects, having used a recommendation system and built two systems from zero to one. At present my personal interest is focused on AIOps, because I believe machine learning must have its place in the traditional field of business operations, and I believe that in the near future AIOps will be applied in all kinds of operations scenarios, truly reaching the stage where O&M can be done over a cup of coffee.
The summary articles mentioned in this piece can be found on my personal official account: Mathematical Life.
About the Author
Rong Chang is a machine learning researcher at Tencent. He graduated from the National University of Singapore with a PhD in mathematics. At present he is focused on applying artificial intelligence to recommendation systems, business security systems, and intelligent operations systems, such as the intelligent monitoring system of Cloud Monitor.
You are welcome to follow the Tencent WeChat official account (TencentCOC) to be the first to get more practical operations and maintenance know-how~