This chapter begins with a review of the development history of artificial intelligence, then surveys its current state, and finally introduces some basic concepts of machine learning. As the opening of the book, it is intended to give readers a general idea of what machine learning is and what changes it may bring to our lives.
1.1 Background
As Einstein said, “Throughout the history of science, from Greek philosophy to modern physics, there have been repeated attempts to reduce seemingly extremely complex natural phenomena to a few simple basic concepts and relations, which are the basic principles of natural philosophy as a whole.” The history of human progress is, in a sense, a process of induction and deduction. From the slash-and-burn Neolithic age through the Industrial Revolution to modern science and technology, human beings have accumulated a vast store of experience, ranging from common sense such as “you reap what you sow” to formal theories such as relativity. Human civilization advances along the axis of time, and artificial intelligence may be what we need to harness past experience and propel society to its next leap forward.
The origins of artificial intelligence can be traced back to the 17th century or earlier, when the idea of artificial intelligence was rooted in reasoning. People imagined that if two philosophers or historians disagreed, instead of quarrelling endlessly, they could abstract all the theories in the world into a language similar to mathematical symbols and settle the dispute simply by taking out a pen and calculating. This kind of abstract logic guided later generations; today, industry likewise uses machine learning to abstract business logic into numbers that can be computed, thereby solving business problems. In ancient times, however, such logic existed only in the minds of scientists. It was not until the advent of machines that artificial intelligence gained widespread attention as a discipline.
Any account of the origins of modern artificial intelligence must begin with one name: Turing (see Figure 1-1).
Figure 1-1 Alan Turing
With the outbreak of the Second World War, more and more machines began to replace manual work, and people began to wonder when machines might do the thinking instead of human beings. In the 1940s, discussions about artificial intelligence began to take off. However, a standard was needed to determine how intelligent a machine must be to count as artificial intelligence. Turing described such a standard in the most straightforward terms: the Turing test (see Figure 1-2).
Figure 1-2 Turing Test
In 1950 Alan Mathison Turing, a pioneer of computer science and cryptography, published a paper called “Computing Machinery and Intelligence,” which defined a test of artificial intelligence: a subject converses with a machine that claims to have human intelligence. During the test, the tester and the testee are separated, and the tester can only ask questions through some device (such as a keyboard); any question is allowed. After a number of questions, if the tester can correctly tell which respondent is human and which is a machine, the machine fails the Turing test; if the tester cannot tell the machine from the human, the machine is said to have human intelligence.
Another important milestone was the birth of artificial intelligence as a discipline, which took place at the Dartmouth Conference in 1956. The conference proposed that “every aspect of learning or any other feature of intelligence can be described so precisely that a machine can be made to simulate it.” This is much like machine learning today: we extract features that can represent a business, then use algorithms to train models that make predictions about unknown outcomes. The conference gave a boost to the development of artificial intelligence across a broader range of fields. In the following 20 years, breakthroughs were made in artificial intelligence, especially in research on related statistical algorithms; representative methods such as the neural network were born in this period. With the support of these algorithms, more real-world scenes could be simulated mathematically, and humans gradually learned to make predictions by combining data and algorithms, achieving intelligent applications to a certain extent.
AI also encountered many challenges in its development. In the 1970s, as the theoretical algorithms gradually matured, the development of artificial intelligence ran into a bottleneck in computing resources. As computational complexity grew exponentially, even the mainframes of the 1970s could not keep up. At the same time, the Internet was still in its infancy and the accumulation of data was just getting started, so scientists often did not have enough data to train models. Take Optical Character Recognition (OCR) as an example: to train an OCR model with high accuracy for a given scene, tens of millions of data samples are needed, which was impossible at the time in terms of data acquisition, storage, and computing costs. As a result, artificial intelligence was constrained for a long time by the lack of computing power and data.
Despite nearly two decades in the doldrums, data scientists never stopped exploring artificial intelligence. In the 21st century, with the explosive growth of the Internet, more and more image and text data were shared on the web and stored on the servers of Internet giants, along with records of users’ browsing and shopping. The Internet became a big data warehouse, and many Internet titans turned their attention to data mining. Databases became gold mines; data scientists began to use lines of formulas and code to dig out the value behind the data, and more and more companies started to buy and sell data. These codes and formulas are the subject of this book — machine learning algorithms. “Alibaba is a data company,” Jack Ma said in public speeches many years ago. Cloud computing is the “hoe” wielded over fertile patches of data that machine learning algorithms cultivate. The accumulation of petabyte-scale data forced people to move from single-machine computing to multiple machines, and parallel computing theory came into wide use, giving birth to the concept of cloud computing. Cloud computing, or distributed computing, simply takes a very complex task apart, lets hundreds of machines each perform a small module of the task, and then aggregates the results.
The open source distributed computing architecture represented by Hadoop provides technical support for distributed computing to more enterprises, and with efficient deep learning frameworks such as Caffe and TensorFlow open sourced, even small businesses can develop and improve algorithmic models on their own. Applications of artificial intelligence have become commonplace and have gradually merged into our lives. People are used to typing a term into Google and instantly getting tens of millions of results back, paying by face or fingerprint, and receiving smart product recommendations when shopping on Taobao. The development of image recognition, text recognition, and speech recognition has had a transformative impact on our lives. In 2016, Google’s showcase of artificial intelligence took the AI industry to a new level: machine intelligence defeating a human Go player had long been considered an impossible task, but AlphaGo pulled it off. The success of AlphaGo not only proved the practicality of deep learning and Monte Carlo search algorithms, but also reaffirmed that humans are no longer the only vehicle for generating intelligence. Any machine that can receive, store, and analyze information can produce intelligence; the key factors are the magnitude of the information and the depth of the algorithm.
The history of artificial intelligence is a history of collecting and analyzing past experience. Before the emergence of machines, human beings could judge things only on the basis of a small amount of information, gained through others’ sharing and their own practice; such cognition of the external world was limited by brain power and knowledge. Unlike the human brain, a machine in the abstract sense can be treated as an information black hole, absorbing all information and performing large-scale analysis, generalization, and deduction on the data day and night. If humans share the knowledge gained by machine learning, artificial intelligence can be formed. Therefore, as human society develops, the accumulation of data and the iteration of algorithms will further advance artificial intelligence as a whole.
As mentioned above, the development of artificial intelligence rests on machine-driven induction of experience, so the engine behind artificial intelligence is the focus of this book — machine learning algorithms. Machine learning is a multidisciplinary field, drawing on biology, statistics, computer science, and other disciplines. At the current stage of development, machine learning algorithms mainly abstract real-life scenes into mathematical formulas and rely on the enormous computing power of machines to produce models, through iteration and deduction, that predict or classify new problems. In fact, the development of artificial intelligence has gone hand in hand with the evolution of machine learning algorithms: it is the continuous development of these algorithms and the improvement of computing power that have brought the artificial intelligence industry to its current heat. The following sections introduce some of the current achievements of machine learning algorithms, so that readers can understand how they are used.
1.2 Development Status
The previous section reviewed the development history of artificial intelligence. Setting aside the limitations of hardware such as computing power, artificial intelligence today can be summarized as the combination of data and intelligent algorithms: past experience is analyzed to obtain a model, and that model is used to guide actual business. If you think of AI as a human brain, the blood inside it is the data, and the vessels that carry the blood through the brain are the machine learning algorithms. Therefore, before introducing machine learning algorithms, we must first understand the characteristics of the era of big data, and then look at what machine learning algorithms can do about today’s data explosion.
1.2.1 Data status
The 21st century is destined to belong to the Internet. This digital age has spawned many new terms, such as cloud computing, e-commerce, and the sharing economy. Big data is also a product of the Internet age, appearing in newspapers, on television, and on the web. “Big data” has become synonymous with the information age, and many people have begun to be shaped by it even before they have had time to understand it. What is data? Data has existed since the beginning of the objective world, from the speed, angle, and mass of celestial bodies to the birth and evolution of human civilizations. Data is everywhere; its value lies in how it is collected and used.
It was the Internet that drove humans to begin collecting and using data in earnest. What strikes me most about the era of big data is that its future map is both clear and vague. What is clear is that people have realized that data is valuable and have started to collect it. According to a recent storage-market research report, the world stores roughly 50 exabytes of data per year, coming from the Internet, healthcare, communications, public security, the military, and other industries. Next, let’s take a look at how this data is generated.
Take Facebook, the world’s largest social networking service. Facebook now has 950 million users, and every action of those users, including every notification, page visit, and view of a friend’s page, is tracked by Facebook’s servers and turned into historical behavior data. Those 950 million users spend an average of more than 6.5 hours a month on Facebook, generating an incredible amount of data: about 500 terabytes every day. What does that include? People share 2.5 billion content items a day, including status updates, wall posts, images, videos, and comments; give 2.7 billion likes a day; and upload 300 million images a day.
Internet giants such as Facebook, Google, and Alibaba are already amassing data and analyzing it to feed their businesses. But to this day, less than one percent of all the data produced in the world each year is saved, and less than ten percent of that can be tagged and analyzed. This creates bottlenecks in two places: between data generation and data collection, and between collected data and data that can be analyzed.
As for the bottleneck between data generation and data collection, one reason is the cost of hardware storage, but with the development of hard disk technology and growing production capacity this constraint is gradually weakening. The author believes the main reason for the imbalance is the lack of data collection standards. Internet companies have formed mature systems for data collection and standard setting, such as website click-behavior and log collection, but in most other industries, especially traditional ones, data collection methods are still being explored, and that exploration looks set to continue for a long time. Although ideas such as Internet thinking and the Internet of Everything are widely advocated, the Internet’s experience with data collection is hard to replicate in traditional industries. The Internet industry has a natural advantage: its data is hosted in databases and recorded on disk in binary form, so with a little processing it becomes high-quality structured data. In traditional industries such as construction, however, the data is laid out brick by brick on a construction site; how such data is converted into binary storage must be specified by new standards and is further limited by technical means. If image recognition were smart enough to quantify a whole site from a single photo, we might be able to solve this problem; for traditional industries, the process of data intelligence may require patience.
Data collection needs more standards and technical support, but data application has its own defects. It would be life-changing if all the data collected in the world were fully utilized, but unfortunately only a small percentage is available for analysis. Two main factors cause this dilemma. First, the current mainstream machine learning algorithms are supervised learning algorithms, and supervised learning requires labeled data, which often depends on manual labeling. For example, to train a movie recommendation model, in addition to the known feature data of each movie we also need labels indicating how good the movie is, similar to the movie scores on Douban. Such data is hard to generate by computation alone and must be labeled by hand. The consequence is that large labeled sample sets (tens of millions of samples) are difficult to produce: it would be a huge undertaking for 10 million people to sit down, watch a movie, and rate it. Manual labeling is also expensive; many third-party companies specialize in labeling, and labeling services often sell at a high price.
Another factor behind the low percentage of analyzable data is the limited ability to process unstructured data, that is, data such as text, images, voice, or video. This data comes from users’ comments on Tieba, profile pictures on social software, video presentations on live-streaming platforms, and so on. Although current technology can analyze text and images, mass processing and feature extraction are still at a relatively basic stage. Take image recognition: mature applications include face recognition and fingerprint recognition, but recognizing each kind of object requires training a corresponding model, and that model needs a large number of training samples to reach a good accuracy rate; a mature model usually needs tens of millions of training samples. Face data is relatively easy to obtain, so the corresponding models are easy to train, but if we need a model to recognize a particular cup, it is hard to gather enough training data for that cup, which raises the threshold of image recognition in specific scenes.
As the Internet continues to evolve, so does the generation of data. According to the widely cited “Digital Universe” report jointly released by International Data Corporation (IDC) and EMC, the global digital universe will expand to 40,000 exabytes by 2020, more than 5,200 GB per person. How this amount of data will be stored and used effectively is something we cannot yet imagine. What is certain is that data will become an important resource, just like water, electricity, and coal. In the era of big data, and especially in the coming era of data explosion, data will show ever greater potential, and human society will enter the era of Data Technology (DT).
1.2.2 Status quo of machine learning algorithms
We have talked about big data; now let us talk about machine learning. The traditional way machines work is that a programmer feeds the machine a series of instructions, which can be understood as code, and the machine executes them step by step with predictable results. This logic does not hold in machine learning, where we feed data into the machine (more accurately, into a machine learning algorithm), and the machine returns results learned from the data itself; the learning is performed by the algorithm. We can define machine learning as a method in which a computer uses existing data (experience) to produce a model and then uses that model to predict the future. The process is very similar to human learning, except that the machine is a monster capable of analyzing huge volumes of data and learning tirelessly (see Figure 1-3).
Figure 1-3 Differences between machine learning and people
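To make the contrast concrete, here is a minimal sketch in Python using scikit-learn; the spam example, threshold, and data are invented for illustration, not taken from this book.

```python
from sklearn.linear_model import LogisticRegression

# Traditional programming: a programmer hand-writes the rule.
def is_spam_by_rule(num_links):
    return num_links > 5  # a fixed threshold chosen by a human

# Machine learning: the rule is learned from labeled examples.
X = [[0], [1], [2], [8], [9], [10]]     # feature: number of links in a message
y = [0, 0, 0, 1, 1, 1]                  # label: 0 = normal, 1 = spam
model = LogisticRegression().fit(X, y)  # the algorithm produces a model from data
print(model.predict([[7]]))             # the model predicts for new, unseen data
```

In the first function a human fixed the rule in advance; in the second, the rule emerges from the data itself, which is exactly the difference Figure 1-3 illustrates.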
Machine learning has deep connections to pattern recognition, statistical learning, data mining, computer vision, speech recognition, and natural language processing. Living in the DT era, machine learning is everywhere, like a shadow; the artificial intelligence applications born of machines analyzing big data are changing our way of life and way of thinking bit by bit. Many readers will ask: what can machine learning actually do? In fact, machine learning already serves every aspect of our lives. A simple shopping scene will illustrate how machine learning is applied in daily life.
It is 2016, and if you have not tried online shopping, you are really behind the times; online shopping has become a way of life. Let us briefly look at how machine learning algorithms are applied to shopping. Say we are at a restaurant and see someone wearing a nice short-sleeved shirt. We want to buy the same one, but we are too embarrassed to ask, so we take a candid photo of the shirt and use Pailitao, Taobao’s search-by-image feature (see Figure 1-4), to find the same style.
Figure 1-4 Pailitao
This uses machine learning’s image recognition technology. However, there are often many styles similar to a given shirt, so the results must be sorted according to certain rules, which involves training a machine learning model: the model ranks all the similar styles and produces the final display order.
Of course, most of the time we search for goods by typing, but if we are feeling lazy we can also speak our query, which uses voice-to-text technology. After we search for a product, a list of recommendations appears in the sidebar of the page, and every user’s recommendation list is different — so-called “a thousand faces for a thousand people.” This scene relies on the user profiles behind the recommendation system, and user profiling is a typical application of big data and machine learning: by mining a user’s characteristics, such as gender, age, income, and hobbies, the system recommends goods the user may buy, achieving personalized recommendation.
At this point we finally put the items in the cart and start to place the order, only to find that the money in our online bank account is not enough and we want a loan. We then discover that a credit line is already available. How is this line calculated? This is a financial risk control problem, and financial risk control likewise relies on models trained by machine learning algorithms.
After the order is placed, our products are scheduled for delivery; currently, except for a few remote areas, they arrive within five days. In this period the goods are packaged, shipped from stock to transit warehouses, distributed between lower-level and higher-level warehouses, and finally delivered. So many steps can be completed in a short time because the warehouses forecast demand in advance and pre-position goods near where demand is likely to arise. This forecasting is also based on machine learning algorithms.
Our courier picks up the goods and opens the map to navigate. The system has designed his delivery route for him, avoiding congestion and keeping the route as short as possible — again computed by machine learning algorithms. What if the courier arrives and we find the clothes are the wrong size? We open customer service, type in the question, and get an instant reply, because the “customer service person” is probably not a person at all, just a customer service robot. An intelligent customer service system uses text semantic analysis to identify the user’s question accurately and give the corresponding answer. It can also analyze the context of the problem: serious complaints that may require compensation, such as “your product gave me a bad stomach,” are picked out by the robot through sentiment analysis and handed over to a human specialist.
Above, the author has simply listed several applications of machine learning in online shopping, involving many intelligent algorithms, including model training and prediction, semantic analysis, text sentiment analysis, image recognition, and voice recognition. Even in this most common scene, online shopping, machine learning algorithms run through almost the whole process.
Many more examples could be listed; since there are far too many scenarios to enumerate, some high-frequency machine learning scenarios are grouped as follows.
- Clustering scenarios: crowd division and product category division, etc.
- Classification scenarios: advertising click prediction and website user click prediction, etc.
- Regression scenarios: rainfall forecast, commodity purchase forecast and stock turnover forecast, etc.
- Text analysis scenarios: news label extraction, automatic text classification and text key information extraction.
- Graph algorithms: social networking service (SNS) relationship mining and financial risk control, etc.
- Pattern recognition: speech recognition, image recognition and handwriting recognition.
The applications listed above are only a small part of the application scenarios of machine learning algorithms. In fact, as data accumulates, machine learning can penetrate all walks of life and play a huge role in each industry. With the spread of ideas such as data intelligence and data-driven operation, machine learning algorithms are becoming a universal basic capability. We can foresee that with the development of algorithms and computing power, machine learning will find deeper applications in fields such as finance, health care, education, and security. In particular, the author hopes machine learning algorithms can make breakthroughs in cracking the genetic code and conquering cancer. Meanwhile, new concepts and technologies such as driverless cars and Augmented Reality (AR) also depend on the development of machine learning algorithms. I believe that in the future, machine learning algorithms will truly transform life and change human destiny.
1.3 Basic concepts of machine learning
Before introducing the machine learning process, we need to understand some basic concepts. Because machine learning is a multidisciplinary field, it shares many concepts with statistics but differs from traditional statistics in certain ways, and without these concepts it is hard to read and understand the literature. This section helps readers grasp the basic terms and concepts of machine learning: first the basic process of machine learning, and then the basic concepts used in its data, algorithms, and evaluation.
1.3.1 Machine learning process
Machine learning is a process in which data flows in, is analyzed, and yields results. Many people spend most of their time on algorithm selection or optimization, but in fact every step of machine learning is crucial, and there are already abundant materials covering the implementation of algorithms; the author would rather spend more time on data processing and on the machine learning process as a whole.
The whole process of machine learning can be roughly divided into six steps, arranged top-down in the order of data flow: scene analysis, data preprocessing, feature engineering, model training, model evaluation, and offline/online service (see Figure 1-5). The basic function of each step is introduced below.
Figure 1-5 Data mining process
(1) Scene analysis. Scene analysis means first thinking through the whole business logic and abstracting the business scenario. For example, predicting advertisement clicks means judging whether a user will click on an advertisement, which can be abstracted into a binary classification problem; we can then choose algorithms suited to supervised learning and binary classification. In short, scene abstraction is about matching the business logic with an algorithm.
(2) Data preprocessing. Data preprocessing mainly cleans the data: handling null values and garbled characters in the data matrix, splitting and sampling the overall data, and normalizing or standardizing single or multiple fields. The main objective of this stage is to reduce the influence of differing scales and of noisy data on the training set.
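A minimal preprocessing sketch, assuming a toy table with made-up columns; pandas and scikit-learn are one common tool choice, not a prescription from this book.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# A toy data set containing a null value.
df = pd.DataFrame({"age": [63.0, 37.0, None, 56.0],
                   "chol": [233.0, 250.0, 204.0, 236.0]})

df["age"] = df["age"].fillna(df["age"].mean())  # handle null values

train, test = train_test_split(df, test_size=0.25, random_state=0)  # split/sample

scaler = StandardScaler()              # standardize fields: zero mean, unit variance
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)   # reuse the training statistics on the test split
```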
(3) Feature engineering. There is nothing wrong with the statement that feature engineering is the single most important step in machine learning. Especially now, with the popularity of open source algorithm libraries and the continuous maturing of algorithms, the choice of algorithm is not necessarily the most critical factor; in a sense, the quality of the feature engineering determines the quality of the final model. An example illustrates the role of feature engineering: in 2014 an Internet giant held a big data competition with more than 1,000 teams, and in the end almost all of them used the same set of algorithms, because the merits of algorithms are relatively easy to judge and the choice of algorithms is limited. The selection and derivation of features, however, is highly uncertain — 100 people may come up with 100 different feature sets — so in the later stages of the competition the teams were really competing on the quality of their features. When the algorithm is relatively fixed, good features determine good results.
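As a small illustration of deriving features, here is a sketch that turns a hypothetical raw purchase log into per-user features; the derived columns are arbitrary choices, which is exactly the point — another analyst might derive entirely different ones.

```python
import pandas as pd

# Hypothetical raw purchase log.
log = pd.DataFrame({"user": ["a", "a", "b", "b", "b"],
                    "price": [10.0, 30.0, 5.0, 5.0, 200.0]})

# Derive per-user features from the raw records.
features = log.groupby("user")["price"].agg(
    purchase_count="count", total_spend="sum", max_spend="max").reset_index()
print(features)
```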
(4) Model training. The “logistic regression binary classification” component shown in Figure 1-6 represents the algorithm training process. After data preprocessing and feature engineering, the training data enters the algorithm training module and a model is generated. The “prediction” component then reads the model and the prediction set and computes the prediction results.
Figure 1-6 Model training
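A minimal sketch of the training and prediction components of Figure 1-6, using scikit-learn’s logistic regression on synthetic data as a stand-in for the preprocessed, feature-engineered training set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for preprocessed, feature-engineered data.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_pred, y_train, y_true = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)  # the "training" component
predictions = model.predict(X_pred)                 # the "prediction" component
```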
(5) Model evaluation. The result of a machine learning algorithm is generally a model, and the quality of the model directly affects the downstream business. Evaluating the maturity of the model is, in effect, evaluating the whole machine learning process.
(6) Offline/online service. In actual business applications, machine learning usually works together with a scheduling system. A typical scenario is as follows: every day, the day’s incremental data is imported into a database table; the scheduling system starts the offline training service, which generates the latest offline model; online prediction then serves real-time requests (usually through a RESTful API that sends data to a server running the model and returns the result). Figure 1-7 shows this architecture.
Figure 1-7 Machine learning service architecture
By using this architecture, offline training and online prediction can be combined and the whole business logic from offline to online can be connected.
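A minimal sketch of the online-prediction half of Figure 1-7, assuming the offline job has already saved a trained model to model.pkl; Flask is an arbitrary choice of web framework, and the route and payload shape are invented for illustration.

```python
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:  # produced nightly by the offline training job
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body such as {"features": [[63, 1, 3]]}.
    features = request.get_json()["features"]
    return jsonify({"prediction": model.predict(features).tolist()})

if __name__ == "__main__":
    app.run()
```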
1.3.2 Data source structure
The basic flow of machine learning was introduced above; next comes the data structure of machine learning. If a machine learning algorithm is compared to a data processing factory, then the data entering the factory is the raw material the algorithm processes. So what structure must that data have? If you follow articles about big data, you will have heard the terms “structured data” and “unstructured data,” from which “semi-structured data” can also be derived. These three kinds of data structure are introduced below.
(1) Structured data. Structured data refers to the log-like data commonly handled in databases. It is stored in a matrix structure and can be displayed as a two-dimensional table, as shown in Figure 1-8.
Figure 1-8 Example of structured data
Structured data consists of two parts: the meaning of each field, that is, headers such as age, sex, and cp in Figure 1-8, and the specific value of each field. Generally speaking, the data processed by machine learning algorithms is structured data, because machine learning puts the data into a matrix for mathematical operations, and structured data is exactly data stored in matrix form; hence machine learning algorithms usually support only structured data.
Two very important concepts in structured data also need to be introduced: features and the target column. These are the two terms used most frequently in machine learning. A feature is an attribute of the object the data describes; for example, if a record describes a person, the person’s height, weight, gender, and age are all features. In a structured data set, each column of data is usually one feature.
The target column holds the label of each piece of data. As introduced earlier, the principle of machine learning is to learn experience from historical data, and the target column records the outcome for each record. For example, suppose we want to use medical data to predict whether a person has heart disease. To train the model we need hundreds of thousands of training records, and each of them must be labeled; that is, we must know in advance which examination indicators belong to people who are sick and which to people who are not, so that the predictive model can be learned. Figure 1-9 shows the data required for heart disease prediction: the boxed field indicates whether the person is sick, and that column is the target column; the other three fields, age, sex, and cp, describe the person’s attributes and are feature columns.
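In code, the split between feature columns and the target column of Figure 1-9 looks like this (a sketch with made-up values):

```python
import pandas as pd

# Three feature columns plus one target column, as in Figure 1-9.
data = pd.DataFrame({"age": [63, 37, 41],
                     "sex": [1, 1, 0],
                     "cp":  [3, 2, 1],
                     "ifhealth": [1, 0, 0]})  # target column: sick or not (the labels)

X = data[["age", "sex", "cp"]]  # feature columns describing each person
y = data["ifhealth"]            # target column the model learns to predict
```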
(2) Semi-structured data. Semi-structured data is stored with a certain structure, but not in the two-dimensional row form of a database. Typical semi-structured data is a file with an XML extension, as shown in Figure 1-10.
Figure 1-9 Target column description
Figure 1-10 Semi-structured data
Another type of semi-structured data is a data table in which some fields are text and some are numeric, as shown in Table 1-1.
Table 1-1 Semi-structured data
| ID | Occupation | Income |
|---|---|---|
| Xiao Li | Teacher | 241 |
| Wang | Cook | 521 |
| Liu | Driver | 421 |
| Xiao Fang | Athlete | 23636 |
Semi-structured data is often used for data transmission, but it is still some distance from direct use by machine learning algorithms: it must first be converted into structured data.
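One common conversion, sketched below on the data of Table 1-1, is to one-hot encode the text field into numeric columns so the whole table becomes structured; pandas is an arbitrary choice of tool here.

```python
import pandas as pd

# Table 1-1: the occupation field is text, income is numeric.
df = pd.DataFrame({"id": ["Xiao Li", "Wang", "Liu", "Xiao Fang"],
                   "occupation": ["teacher", "cook", "driver", "athlete"],
                   "income": [241, 521, 421, 23636]})

# One-hot encode the text field; the result is a fully numeric, structured matrix.
structured = pd.get_dummies(df, columns=["occupation"])
print(structured)
```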
(3) Unstructured data. Mining unstructured data has always been a hot topic in machine learning, and with the development of deep learning, the processing of unstructured data finally seems to have found a direction. Typical unstructured data is an image, text, or voice file. Such data cannot be stored in a matrix structure; the current practice is to convert it into a binary storage format and then mine the information with algorithms. Chapters 6 and 7 detail how to use deep learning to process unstructured data.
The above introduced the three types of data structure encountered in real business scenarios. Machine learning algorithms support structured data well; semi-structured and unstructured data are usually transformed before being mined. The method of converting unstructured data into structured data is described in Chapter 4.
1.3.3 Algorithm classification
Having introduced the process and data structures of machine learning, we now briefly explain the classification of algorithms. Machine learning covers dozens of scenarios, including clustering, regression, classification, and text analysis; there are roughly 30 commonly used algorithm types, plus many variants. We divide machine learning into four types: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.
(1) Supervised learning. Supervised learning means that every training sample fed to the algorithm has a corresponding expected value, the target value; the learning process is in fact a mapping from feature values to the target column. For example, suppose we know a stock’s history along with information such as its earnings and the company’s headcount, and we want to predict the stock’s future trend. In training the model, we hope to obtain a formula that reflects how profit and headcount influence the trend. Training on the features of past data together with the final outcomes is supervised learning, so the training data of a supervised learning algorithm must consist of feature values plus a target column.
As shown in Figure 1-11, ifhealth is the target column and age, sex, and cp are the feature columns — a typical supervised learning training set. Because supervised learning relies on each sample being labeled, it can learn an exact target value mapped to each feature sequence, so it is commonly used for regression and classification scenarios. Table 1-2 lists common supervised learning algorithms.
Table 1-2 Supervised learning
| Category | Algorithms |
|---|---|
| Classification algorithms | K-nearest neighbors, naive Bayes, decision tree, random forest, GBDT, support vector machine, etc. |
| Regression algorithms | Logistic regression, linear regression, etc. |
One problem with supervised learning is the high cost of obtaining target values. For example, to predict the quality of movies, we must rely on manually labeling a large number of movies when generating the training set. Such human cost makes supervised learning a relatively expensive method; how to obtain large amounts of labeled data has always been a hard problem in supervised learning.
Figure 1-11 Supervised learning
(2) Unsupervised learning. Having seen supervised learning, unsupervised learning is easy to understand: it is machine learning that does not rely on labeled training samples. Since there is no target column, no final outcome accompanies the features, so such data is generally unsuitable for regression and classification scenarios. Unsupervised learning mainly solves clustering problems, because when the target value is missing from the training data, all we can do is compare the distances between different samples. Common unsupervised learning algorithms are shown in Table 1-3, and a minimal clustering example follows the table.
Table 1-3 Unsupervised learning

| Category | Algorithms |
|---|---|
| Clustering algorithms | K-Means, DBSCAN, etc. |
| Recommendation algorithms | Collaborative filtering, etc. |
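The following sketch clusters made-up, unlabeled points with K-Means; note that no target column is supplied anywhere.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled two-dimensional points: no target column exists.
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# K-Means groups the samples purely by the distances between them.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # the cluster assigned to each sample
```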
Compared with supervised learning, one major advantage of unsupervised learning is that it does not rely on labeled data. In many situations, especially when labeling would require heavy manual effort, unsupervised or semi-supervised learning can be tried instead.
(3) Semi-supervised learning. Semi-supervised learning has gradually become popular in recent years. As mentioned above, obtaining labeled data is very resource-consuming in some scenarios, yet unsupervised learning struggles with classification and regression, so people began to apply machine learning with only part of the samples labeled. Algorithms trained on partially labeled data are semi-supervised learning. Many current semi-supervised algorithms are variants of supervised ones; this book introduces one of them, the label propagation algorithm. Semi-supervised algorithms actually have many applications, and we recommend studying them in depth.
(4) Reinforcement learning. Reinforcement learning is a complex type of machine learning that emphasizes a system constantly interacting with the outside world, obtaining external feedback, and then determining its own behavior. Reinforcement learning is currently a hot category in artificial intelligence; typical cases include driverless cars and AlphaGo playing Go. The hidden Markov segmentation algorithm introduced in this book embodies a reinforcement learning idea.
The above introduced supervised, unsupervised, semi-supervised, and reinforcement learning. Supervised learning mainly solves classification and regression scenarios; unsupervised learning mainly solves clustering scenarios; semi-supervised learning solves classification scenarios where labeled data is hard to obtain; and reinforcement learning targets scenarios that require continual reasoning within a process. This book introduces all four types; the specific classification is shown in Table 1-4 to facilitate targeted learning.
Table 1-4 Algorithm classification
| Category | Algorithms |
|---|---|
| Supervised learning | Logistic regression, K-nearest neighbors, naive Bayes, random forest, support vector machine |
| Unsupervised learning | K-Means, DBSCAN, collaborative filtering, LDA |
| Semi-supervised learning | Label propagation |
| Reinforcement learning | Hidden Markov |
1.3.4 Overfitting problem
Many problems arise during model training, such as unreasonable parameter or gradient settings and incomplete data cleaning. But if you ask a data mining engineer what the most common problem in the field is, the answer will probably be “overfitting,” which is why this book devotes a separate section to it.
Overfitting, literally fitting too closely, often occurs in the training and prediction of linear classifiers or linear models. It is a problem frequently encountered in data mining: for example, we train a model whose prediction accuracy on the training set is very high, say 95%, but when we predict on another data set the accuracy is only 30%. Such a situation is likely the result of overfitting during training.
The principle of overfitting is that the machine learning algorithm learns the training set too thoroughly. This may sound hard to grasp, so here is an example. Suppose we have a set of two-dimensional data displayed in a coordinate system and we perform linear regression on it. If the fitted curve is the dotted line shown in Figure 1-12, this is actually underfitting: the fit is not ideal because the regression has failed to produce a curve consistent with the data distribution.
Figure 1-12 Linear fitting curve I
Now let’s look at Figure 1-13.
Figure 1-13 Linear fitting curve II
If the final fit is as shown in Figure 1-13, this is the ideal situation: the curve captures the overall distribution of the data. So what does overfitting look like? See Figure 1-14.
Figure 1-14 Linear fitting curve III
The situation in Figure 1-14 is typical overfitting: the curve passes through the data exactly. Some people might ask: isn’t the purpose of linear regression to find the curve that best fits the trend of the data? Why is a perfect match bad? Because the purpose of training a regression curve or classifier is to classify or predict other data sets. If the fit to the training set is too “perfect,” the model is likely to be inaccurate on prediction sets, because it clings to the idiosyncrasies of the training set and lacks robustness. Therefore, in machine learning training, fitting the training data 100% is not necessarily good.
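The three figures can be reproduced in spirit with a short sketch that fits polynomials of increasing degree to noisy points (the data and degrees are invented for illustration); the training score keeps rising, which is precisely the trap.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 20)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 20)  # noisy samples

# Degree 1 underfits (Figure 1-12), degree 4 fits well (Figure 1-13),
# degree 15 chases every noise point and overfits (Figure 1-14).
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)
    print(degree, model.score(X, y))  # training R^2 rises even as generalization worsens
```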
Having seen the phenomenon and principle of overfitting, what causes it? The reasons can be summarized as follows.
- The training samples are too uniform. For example, if the training set contains only white ducks, the resulting model will certainly mispredict black ducks. Training samples should therefore be as comprehensive as possible, covering all data types.
- The training samples contain too much noise, that is, interfering data in the data set. Too much noise causes the model to memorize noise features while ignoring the true relationship between input and output.
- The model is too complex. Having too many model parameters is also an important cause of overfitting. A mature model is therefore not necessarily very complex; what matters is that its output remains stable across different data sets.
For a problem as common as overfitting, many prevention and mitigation methods exist, as shown below.
- When building a model, start relatively simple; do not make the model very complex with many features from the outset, which easily causes overfitting, and when an over-complex model does overfit, it is also hard to identify which features are at fault.
- Data sampling must cover all data types as far as possible, and data must be cleaned before training; otherwise a large amount of noise will raise the probability of overfitting.
- During training, mathematical means can also prevent overfitting: a penalty function can be added to the algorithm. Readers who want to go deeper can refer to the L1 and L2 regularization norms, which this book does not expand on; a minimal illustration follows this list.
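A minimal illustration of the penalty idea, comparing plain linear regression with L2 (Ridge) and L1 (Lasso) penalized versions on synthetic data; the alpha values are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

X, y = make_regression(n_samples=50, n_features=20, noise=10.0, random_state=0)

# Ridge adds an L2 penalty and Lasso an L1 penalty to the loss function,
# shrinking the coefficients so the model cannot chase noise in the training set.
for name, model in [("plain", LinearRegression()),
                    ("L2/Ridge", Ridge(alpha=1.0)),
                    ("L1/Lasso", Lasso(alpha=1.0))]:
    model.fit(X, y)
    print(name, abs(model.coef_).max())  # penalized models end up with smaller weights
```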
The above describes the overfitting problem, its causes, and how to prevent it. Because overfitting is very likely to be encountered when using machine learning algorithms, mastering this knowledge and the means to deal with it is very important. We hope that after studying this section, readers can consciously avoid overfitting during model training.
1.3.5 Evaluation of results
The previous sections introduced concepts and terms that may be encountered during data mining. We know that the ultimate goal of a machine learning algorithm is to generate a model, and the quality of that model must be assessed through certain indicators. This section introduces the concepts used to evaluate results in machine learning. The common ones include precision, recall, the F1 value, ROC, and AUC. These may seem abstract, but each metric evaluates results from a different dimension; their meanings are described below.
(1) Precision, recall, and F1 value. Since Precision, Recall, and F1 (F-measure) are often compared together, these three related indicators are introduced together. To calculate them, we first need to understand the meanings of TP, TN, FP, and FN.
- TP (True Positive): the sample is positive and the model predicts it as positive.
- TN (True Negative): the sample is negative and the model predicts it as negative.
- FP (False Positive): the sample is negative but the model predicts it as positive.
- FN (False Negative): the sample is positive but the model predicts it as negative.
These four concepts are easier to grasp with a practical example. Suppose a prediction set contains 500 girls and 100 boys, and a model generated by machine learning must identify the girls in the data set, so girls are the positive samples. Suppose our final prediction result contains 70 girls and 20 boys. Let us compute TP, TN, FP, and FN first. TP is the number of girls predicted to be girls, so TP is 70; FP is the number of boys predicted to be girls, so FP is 20; FN is the number of girls predicted not to be girls, so FN is 500 − 70 = 430; and TN is the number of boys predicted not to be girls, so TN is 100 − 20 = 80.
Precision, recall, and the F1 value are then computed as follows:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
As the formulas show, precision is, in plain terms, the proportion of the model’s positive predictions that are correct, while recall is the proportion of all positive samples that the model finds. Precision and recall thus evaluate the model along two dimensions: the former measures accuracy, the latter coverage. In practice we want both to be as high as possible, but the two indicators pull against each other, so the F1 value was created as a combined evaluation of precision and recall. Much model evaluation today is done through the F1 value, which takes both indicators into account.
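Checking the girl/boy example by hand (TP = 70, FP = 20, FN = 430, as computed above):

```python
tp, fp, fn = 70, 20, 430

precision = tp / (tp + fp)                          # 70/90  ≈ 0.778
recall = tp / (tp + fn)                             # 70/500 = 0.140
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.237
print(precision, recall, f1)
```

This model is fairly precise but has very low coverage, so its F1 value is low — exactly the trade-off described above.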
(2) ROC and AUC. The Receiver Operating Characteristic (ROC) curve is a curve commonly used to evaluate models in binary classification scenarios. Figure 1-15 shows an ROC curve.
Figure 1-15 ROC curve
The jagged arc in Figure 1-15 is the ROC curve. Its horizontal axis is the false positive rate (derived from the FP value above) and its vertical axis is the true positive rate (derived from the TP value). How do you evaluate a model from it? The closer the curve comes to the upper left corner, the better the model. The area enclosed between the ROC curve and the horizontal axis is the AUC value (the colored region in Figure 1-15); the larger the AUC, the better the model. AUC ranges from 0 to 1 and is usually greater than 0.5; when it reaches 0.9 or above, the model is considered quite good.
The above introduced the concepts of ROC and AUC. AUC is computed as the area under the ROC curve, and AUC, like the F1 value, evaluates the final result, while the ROC curve itself conveys information about the model through its smoothness and slope.
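A small sketch of computing the ROC curve points and the AUC for a binary classifier, with invented labels and scores:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and model scores.
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.5]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points along the ROC curve
print(roc_auc_score(y_true, y_score))              # the area under it: the AUC
```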
1.4 Summary of this chapter
As the opening chapter, this chapter introduced the protagonist of the book — machine learning algorithms — through the development history of artificial intelligence. Machine learning algorithms already run through our daily lives, and it is the popularity of these intelligent algorithms that has drawn more and more people to this new technology. We introduced the current state of machine learning through examples to help readers map out the application fields of the discipline, and we introduced some basic concepts to help beginners get started. With this background in place, the following chapters formally introduce the whole process of machine learning.