MLlib is the machine learning library provided by Spark. You can easily build machine learning applications by calling the algorithms encapsulated in MLlib, which offers a wealth of algorithms for classification, regression, clustering, recommendation, and more. MLlib also standardizes the APIs for machine learning algorithms, making it easier to combine multiple algorithms into a single Pipeline or workflow. Through this article you will learn:

  • What is machine learning
  • Big data and machine learning
  • Machine learning classification
  • Introduction to Spark MLlib

Machine learning, a branch of artificial intelligence, is a multidisciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, computational complexity theory, and other disciplines. Machine learning theory is about designing and analyzing algorithms that allow computers to “learn” automatically. Because learning algorithms rely heavily on statistical theory, machine learning is closely tied to statistical inference and is also known as statistical learning theory. In algorithm design, machine learning theory focuses on learning algorithms that are achievable and effective.

Source: Mitchell, T. (1997). Machine Learning. McGraw Hill.

What is machine learning


Machine learning is applied throughout artificial intelligence, in areas such as expert systems, automated reasoning, natural language understanding, pattern recognition, computer vision, and intelligent robotics. As a sub-discipline of artificial intelligence, it focuses on enabling machines to learn from past experience, model uncertainty in data, and make predictions about the future. Typical applications include search, recommendation systems, spam filtering, face recognition, speech recognition, and so on.

Big data and machine learning

In the era of big data, data is generated at an astonishing rate. The Internet, the mobile Internet, the Internet of Things, GPS devices, and more produce data all the time, and the storage and computing capacity required to process it grows geometrically. As a result, a family of big data technologies represented by Hadoop has emerged, providing a reliable foundation for storing and processing this data.

Data, information, and knowledge form three levels, from largest to smallest. Raw data alone rarely explains a problem; it must be combined with human experience to become information. Information, in essence, is what eliminates uncertainty: when we speak of information asymmetry, we mean that without sufficient information some uncertainties cannot be removed. Knowledge is the highest level, which is why data mining is also called knowledge discovery.

The task of machine learning is to apply algorithms to big data and mine the latent knowledge behind it. The more training data there is, the more machine learning can show its advantages. Problems that machine learning previously could not solve well, such as speech recognition and image recognition, can now be tackled with big data technologies, and their performance has improved greatly.

Machine learning classification

Machine learning is mainly divided into the following categories:

  • Supervised learning

    It is essentially a synonym for classification. The supervision in learning comes from the labeled instances in the training dataset. For example, in the postal code recognition problem, a set of handwritten postal code images and their corresponding machine-readable labels are used as training examples to supervise the learning of a classification model. Common supervised learning algorithms include linear regression, logistic regression, decision trees, naive Bayes, support vector machines, and so on.

  • Unsupervised learning

    It is essentially a synonym for clustering. The learning process is unsupervised because the input instances carry no class labels. The task of unsupervised learning is to discover latent structure in a given dataset. For example, if you give a machine pictures of cats and dogs without any labels and ask it to sort them into two groups, it will eventually split them into two categories, but it will not know which group contains the cat photos and which contains the dog photos; to the machine they are simply groups A and B. Common unsupervised learning algorithms include k-means clustering, principal component analysis (PCA), and so on.

  • Semi-supervised learning

    Semi-supervised learning is a machine learning technique that uses both labeled and unlabeled instances when learning a model. Its goal is to let the learner exploit unlabeled samples to improve learning performance, without relying on external interaction.

    Semi-supervised learning answers a very real practical need, because in real applications it is easy to collect large numbers of unlabeled samples, while obtaining labeled samples costs manpower and resources. For example, in computer-aided medical image analysis, a large number of medical images can be obtained from hospitals, but expecting medical experts to annotate the lesions in every one of them is unrealistic, so labeled data is scarce. This imbalance is even more pronounced in Internet applications: in web page recommendation, users are asked to mark the pages they find interesting, but few are willing to spend the time to provide such labels, so labeled pages are rare, while the countless pages on the Internet can all serve as unlabeled samples.

  • Reinforcement learning

    Also known as evaluative learning, reinforcement learning is an important machine learning method with many applications in intelligent robot control and in analysis and prediction. A common model for reinforcement learning is the standard Markov Decision Process (MDP).

Introduction to Spark MLlib

MLlib is Spark’s machine learning library, which simplifies the engineering practice of machine learning. MLlib contains a wealth of machine learning algorithms: classification, regression, clustering, collaborative filtering, principal component analysis, and more. Currently, MLlib is divided into two code packages: spark.mllib and spark.ml.

spark.mllib

Spark MLlib is an important part of Spark and is the machine learning library it originally provided. One drawback of the library is that when a dataset is complex and needs multiple processing steps, or when new data must be combined with several already-trained models, applications built with Spark MLlib can become complex and even hard to understand and implement.

spark.mllib is the original, RDD-based algorithm API and is currently in maintenance mode. The library contains four categories of common machine learning algorithms: classification, regression, clustering, and collaborative filtering. Note that no new functionality is being added to the RDD-based API.
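
To make that concrete, here is a minimal, hypothetical sketch of what the RDD-based spark.mllib API looks like, using K-means as an example; the input path and parameter values are purely illustrative:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{SparkConf, SparkContext}

object MllibKMeansSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mllib-kmeans").setMaster("local[*]"))

    // Each line is assumed to hold space-separated numeric features, e.g. "1.0 2.0 3.0"
    val data = sc.textFile("data/points.txt") // hypothetical input path
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()

    // Train a K-means model with 2 clusters and at most 20 iterations
    val model = KMeans.train(data, 2, 20)

    model.clusterCenters.foreach(println)
    sc.stop()
  }
}
```

Note how every step works directly on RDDs; the DataFrame-based spark.ml API described below wraps the same ideas in a higher-level, more composable form.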

spark.ml

Spark 1.2 introduced the ML Pipeline. Over the course of several releases, Spark ML has overcome some of the limitations MLlib had in handling machine learning problems (complexity, unclear workflow) and provides users with a machine learning library based on the DataFrame API, making the whole process of building machine learning applications simple and efficient.

Spark ML is not an official name and is used to refer to the MLlib library based on the DataFrame API. DataFrame provides a friendlier API than RDD. The many benefits of DataFrame include Spark data sources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and a unified API across languages.

The Spark ML API provides a number of feature-processing functions, such as feature selection, feature transformation, encoding of categorical features as numbers, regularization, and dimensionality reduction. In addition, the DataFrame-based ML library supports building machine learning Pipelines, which organize the tasks in a machine learning workflow so that they are easier to run and migrate. The spark.ml library is the recommended one.
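
As a rough illustration of how stages are chained, here is a minimal Pipeline sketch that tokenizes text, hashes it into feature vectors, and trains a logistic regression classifier; the data, column names, and parameter values are made up for the example:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ml-pipeline").master("local[*]").getOrCreate()

    // Toy training data: (id, text, label) -- purely illustrative
    val training = spark.createDataFrame(Seq(
      (0L, "spark mllib makes learning easy", 1.0),
      (1L, "hadoop stores massive data", 0.0),
      (2L, "machine learning with spark ml", 1.0),
      (3L, "relational databases and sql", 0.0)
    )).toDF("id", "text", "label")

    // Each stage consumes the DataFrame produced by the previous one
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(1000)
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(training)

    // The fitted PipelineModel replays the same sequence of steps on new data
    val test = spark.createDataFrame(Seq(
      (4L, "learning algorithms in spark"),
      (5L, "sql queries on tables")
    )).toDF("id", "text")

    model.transform(test).select("id", "text", "prediction").show(false)
    spark.stop()
  }
}
```

The point of the Pipeline is that feature processing and model training are declared once as a sequence of stages, and the fitted model applies that same sequence to any new DataFrame.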

Data transformation

Data transformation is an important part of data preprocessing, covering tasks such as data normalization, discretization, and feature derivation. Spark ML provides a variety of data transformation algorithms; the full list is documented on the official website.


Among these transformation algorithms, term frequency-inverse document frequency (TF-IDF), Word2Vec, and PCA are the most common; if you have done any text mining, they should be familiar.
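
The sketch below shows, under the same illustrative assumptions as before, how TF-IDF is typically computed with spark.ml transformers: a Tokenizer splits sentences into words, HashingTF turns them into term-frequency vectors, and IDF rescales them:

```scala
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

object TfIdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("tf-idf").master("local[*]").getOrCreate()

    // Two toy documents; the text is purely illustrative
    val sentences = spark.createDataFrame(Seq(
      (0, "spark is a fast engine for big data"),
      (1, "mllib is the machine learning library of spark")
    )).toDF("id", "sentence")

    // Tokenize, hash into term-frequency vectors, then rescale by inverse document frequency
    val words = new Tokenizer().setInputCol("sentence").setOutputCol("words").transform(sentences)
    val tf = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(256).transform(words)
    val idfModel = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(tf)

    idfModel.transform(tf).select("id", "features").show(false)
    spark.stop()
  }
}
```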

Data reduction

Big data is the foundation of machine learning and provides it with ample training data. When the volume of data is very large, redundant dimensions and attributes need to be removed or reduced through data reduction techniques in order to simplify the dataset. Much like sampling, this reduces the amount of data without compromising its integrity. Spark ML provides a number of feature selection and dimensionality reduction methods, which are listed in the official documentation.


Feature selection and dimensionality reduction are common techniques in machine learning. The methods above can be used to cut down the number of features, eliminate noise, and preserve the structural characteristics of the original data. Principal component analysis (PCA) in particular has played an important role in statistics and machine learning.
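
As an example of dimensionality reduction, here is a minimal PCA sketch with spark.ml; the five-dimensional toy vectors and the choice of two principal components are arbitrary:

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object PcaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pca").master("local[*]").getOrCreate()

    // Toy five-dimensional feature vectors
    val data = spark.createDataFrame(Seq(
      Tuple1(Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0)),
      Tuple1(Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)),
      Tuple1(Vectors.dense(6.0, 1.0, 3.0, 8.0, 9.0))
    )).toDF("features")

    // Project the 5-dimensional features down to 2 principal components
    val pcaModel = new PCA()
      .setInputCol("features")
      .setOutputCol("pcaFeatures")
      .setK(2)
      .fit(data)

    pcaModel.transform(data).select("pcaFeatures").show(false)
    spark.stop()
  }
}
```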

Machine learning algorithms

Spark supports common machine learning algorithms such as classification, regression, clustering, and recommendation; the full list of supported algorithms is available in the official documentation.
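
As one concrete example, the sketch below clusters a handful of made-up two-dimensional points with the DataFrame-based KMeans; contrast it with the RDD-based K-means shown earlier:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object MlKMeansSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ml-kmeans").master("local[*]").getOrCreate()

    // Toy two-dimensional points forming two obvious groups near (0, 0) and (9, 9)
    val points = spark.createDataFrame(Seq(
      (0.0, 0.2), (0.1, 0.1), (0.2, 0.0),
      (9.0, 9.2), (9.1, 9.1), (9.2, 9.0)
    )).toDF("x", "y")

    // Assemble the raw columns into the single vector column that ML algorithms expect
    val assembler = new VectorAssembler().setInputCols(Array("x", "y")).setOutputCol("features")
    val data = assembler.transform(points)

    // Fit a K-means model with 2 clusters and a fixed seed for reproducibility
    val model = new KMeans().setK(2).setSeed(1L).fit(data)

    model.clusterCenters.foreach(println)
    model.transform(data).select("x", "y", "prediction").show()
    spark.stop()
  }
}
```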


Conclusion

This article gave a general introduction to machine learning, covering its basic concepts, its main categories, and the Spark machine learning library. In the next article, I will share a machine learning application built on the Spark ML library, mainly involving the LDA topic model and K-means clustering.
