This article is edited and compiled from the talk "Machine Learning / Deep Learning Engineering Practice" given by Wu Jianjun, senior AI expert at Ping An Life Insurance, at the Ping An Life Insurance & DataFunTalk algorithm-themed technology salon "Latest Research and Application Practice of Machine Learning / Deep Learning in the Financial Field".
Today I will cover the following topics: an overview of AI application technology at Ping An Life, data processing and encoding, model application and real-time serving, and algorithm and model training.
First, an overview of AI application technology at Ping An Life. Development work sits on the big data platform and is divided into platform-level and application-level development. Platform-level development mainly covers the offline computing platform, the real-time computing platform, the multidimensional analysis engine, and so on; application-level development covers data collection and cleaning, statistical report development, portrait (profile) mining, and so on. Algorithm research is divided into three directions. The first is statistical analysis: financial data is complicated, statistical analysis is used very frequently, and it consumes a great deal of human and financial resources. The other two are machine learning and deep learning. Machine learning mainly addresses classification and recommendation, knowledge graphs, and natural language processing; deep learning addresses quantitative actuarial work and vision models; reinforcement learning is still under development. The backend systems are divided into two parts: component development and service development. The components are mainly the service framework, the training platform, the container platform, and some distributed storage components. Model services are systems developed specifically for particular applications and interface with the corresponding application services.
Above is our platform architecture. Data collection mainly relies on Kafka, with a dedicated collection mechanism for legacy systems; collected data lands in Hadoop and relational databases. Data cleaning relies on Hive and Spark: Hive handles HQL, while Spark processes the more complex data. We also need insight analysis, which splits into two parts: fast real-time analysis on a single table, and real-time analysis across multiple joined tables. Druid and ES are used for single-table multidimensional analysis, and Presto and Impala for multi-table joins. Some actuarial and quantitative models use MATLAB and SAS, and deep learning uses TensorFlow. HBase and Redis are mainly used for portrait storage and real-time queries, and a container platform exposes models as containerized services to external callers.
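As a rough illustration of the Hive + Spark cleaning step described above, the sketch below shows simple HQL handled through Spark SQL and more complex logic handled in the DataFrame API; the table and column names are placeholders, not the actual warehouse schema.

```python
# A minimal sketch of Hive + Spark data cleaning; names are illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cleaning").enableHiveSupport().getOrCreate()

raw = spark.sql("SELECT * FROM ods.policy_events")            # simple HQL stays in Hive
cleaned = (raw.dropDuplicates(["event_id"])                    # complex logic done in Spark
              .filter(F.col("event_time").isNotNull())
              .withColumn("event_date", F.to_date("event_time")))
cleaned.write.mode("overwrite").saveAsTable("dw.policy_events_clean")
```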
Next, why do we use AI? AI is now widely used in the financial field: many businesses are data-driven, and finance depends heavily on data. Specific applications include agent management — Ping An has millions of agents, and AI algorithms are used to manage agent recruitment, sales, and promotion — as well as intelligent customer service, intelligent collection, intelligent claims, intelligent seat (call-center) assistance, and many other scenarios.
Data is the core, so let us talk about our data and data processing. Data mining in the insurance industry faces many challenges. The first is the long decision cycle and low-frequency interaction: insurance is not a daily consumable or a necessity, so the purchase decision is relatively rational. Whether a model is effective can therefore only be verified over a long testing period, and that cycle carries considerable risk. Second, the data is complex and unstable. The complexity comes from the many business lines and the many storage media within each line; the scenarios that generate the data differ, and so do the data types, which include text, LBS data, images, and natural language. Third, trial and error is relatively expensive: internet companies validate models through A/B testing, but the insurance business cannot be run that way.
So how do we solve these problems? We generally work from three directions: portrait generation, quality inspection, and data embedding. The first is how the portraits are generated; the second is verifying the data, because the data we obtain is not necessarily reliable; the third is embedding the data — because the data is so heterogeneous, we need a standardized representation of it.
To establish portrait production, we first build data layering (ODS, DW, DM, MM). This layering is mainly a business matter; there is no technical difficulty, but it is not easy to build well. The data schema is then abstracted and unified for behavior data, fact data, and image data. A large amount of behavior data is generated every day, such as insurance phone calls, claim settlements, and online button clicks. Behavior is abstracted into five elements: who performs what action, on what object, at what time, and with what intensity. Fact data is abstracted as (subject, predicate, object) triplets. Once the data is abstracted and unified, there are several ways to produce portraits. The first is workshop-style production, building portraits one by one according to leaders' requirements, which is exhausting. The current practice is to standardize the format of portrait requirements and build an automatic production mechanism; the advantages are that it saves manpower and that requirements can be reused.
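A minimal sketch of the two unified schemas described above, written as Python dataclasses; the field names are illustrative assumptions, not the production schema.

```python
# Unified schemas for behavior data (five elements) and fact data (triplets).
from dataclasses import dataclass
from datetime import datetime

@dataclass
class BehaviorEvent:
    # "who performs what action, on what object, at what time, with what intensity"
    subject_id: str        # who
    action: str            # what action, e.g. "click", "call", "claim"
    object_id: str         # what object, e.g. a product or page id
    occurred_at: datetime  # when
    intensity: float       # strength of the action (duration, amount, count, ...)

@dataclass
class Fact:
    # fact data abstracted as a (subject, predicate, object) triplet
    subject: str
    predicate: str
    obj: str

event = BehaviorEvent("agent_001", "click", "product_A", datetime.now(), 1.0)
fact = Fact("policy_123", "belongs_to", "customer_456")
```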
Next, how do we check data quality? Quality inspection has always been difficult, because the indicators are complex and their timeliness is poor, so it is hard to judge whether a computed value is accurate. We mainly work from three aspects. First, stability: although it is hard to judge stability from the model alone, we need to know how unstable an indicator is. Second, indicator importance: some indicators are very important in reality but not in the model, and modeling this way causes problems, so the importance of variables within the model must be assessed. Third, correlated variables must be eliminated; the measures used for correlation include the correlation coefficient, PCA/RUFS, and the variance inflation factor, and dimensionality reduction is also required (PCA is not stable, so the RUFS algorithm is used). The inspection tool is developed with Spark + Python, supports flexible configuration, and outputs results with one click.
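A rough sketch of such indicator-level checks follows. The talk names stability, the correlation coefficient, and the variance inflation factor; using PSI as the stability measure and computing VIF by hand are my assumptions about how these checks are commonly implemented, not the team's actual tool.

```python
# Illustrative quality checks: stability (PSI), correlation, and VIF.
import numpy as np
import pandas as pd

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of one indicator (assumption)."""
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))
    cuts[0], cuts[-1] = -np.inf, np.inf
    e = np.histogram(expected, cuts)[0] / len(expected) + 1e-6
    a = np.histogram(actual, cuts)[0] / len(actual) + 1e-6
    return float(np.sum((a - e) * np.log(a / e)))

def vif(df: pd.DataFrame) -> pd.Series:
    """Variance inflation factor: 1 / (1 - R^2 of each column regressed on the rest)."""
    out = {}
    for col in df.columns:
        y = df[col].values
        X = np.column_stack([np.ones(len(df)), df.drop(columns=[col]).values])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        r2 = 1 - (y - X @ beta).var() / y.var()
        out[col] = 1.0 / max(1 - r2, 1e-12)
    return pd.Series(out)

df_old = pd.DataFrame(np.random.randn(1000, 3), columns=["x1", "x2", "x3"])
df_new = pd.DataFrame(np.random.randn(1000, 3), columns=["x1", "x2", "x3"])
print(psi(df_old["x1"], df_new["x1"]))   # stability of one indicator across periods
print(df_old.corr())                     # correlation coefficients
print(vif(df_old))                       # multicollinearity check
```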
There are many data embedding methods, such as image embedding, word embedding, and graph-node embedding. For structured data, features were first combined manually, then GBDT was used for feature combination and FM was used to represent features with low-dimensional factorized vectors; now KB encoding can also be used. Text data mainly uses TF-IDF and word2vec, and image data mainly uses SIFT and CNN features. The main goal of this work is to unify the representation of all these kinds of data.
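The sketch below illustrates the GBDT feature-combination step just mentioned, using scikit-learn as a stand-in (the talk does not say which GBDT implementation was used): each sample is re-encoded by the index of the leaf it falls into in every tree, and the one-hot leaf indicators can then feed an FM or linear model.

```python
# GBDT leaf-index encoding as automatic feature combination (illustrative).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

X = np.random.randn(1000, 20)
y = (X[:, 0] + X[:, 1] * X[:, 2] > 0).astype(int)

gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3).fit(X, y)
leaves = gbdt.apply(X)[:, :, 0]                         # (n_samples, n_trees) leaf indices
leaf_features = OneHotEncoder().fit_transform(leaves)   # sparse combined features
print(leaf_features.shape)                              # input to an FM / LR layer downstream
```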
Next, I will talk about how we train algorithms and models. We mainly rely on distributed or parallel machine learning: with large data volumes, complex models, and huge parameter counts, a parallel platform is what solves the problem. The platform must offer efficient communication, reliable fault tolerance, and strong descriptive ability; a framework that cannot describe an algorithm cannot run it, which is why descriptive ability matters. The types of parallelism are model parallelism, data parallelism, and hybrid parallelism, with data parallelism the most common at present. The data is split and distributed to the workers, each worker computes a gradient and sends it to the model, the model aggregates them and returns the result to the workers, and each worker then updates its local model.
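A toy, single-process illustration of that data-parallel loop follows: each "worker" computes a gradient on its own shard, the gradients are aggregated centrally, and the updated model goes back to every worker. Real clusters do this over a network; this only shows the control flow.

```python
# Toy data-parallel SGD: shard the data, average per-worker gradients, update.
import numpy as np

def local_gradient(w, X, y):
    # least-squares gradient on one worker's shard
    return 2 * X.T @ (X @ w - y) / len(y)

X, y = np.random.randn(10_000, 5), np.random.randn(10_000)
shards = np.array_split(np.arange(len(y)), 4)   # data split across 4 workers
w = np.zeros(5)

for step in range(100):
    grads = [local_gradient(w, X[idx], y[idx]) for idx in shards]  # workers in parallel
    w -= 0.1 * np.mean(grads, axis=0)            # central node aggregates and updates
# the updated w is then sent back to all workers for the next round
```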
Distributed machine learning requires strong descriptive ability, which can be reduced to a programming-paradigm question: can a given paradigm describe the algorithm completely? The earliest paradigm is MP (message passing), typically implemented by MPI; it only provides basic communication primitives, places almost no limits on programming, has a high programming threshold, and has no error-recovery mechanism. Next came MR (MapReduce), typified by Hadoop, which is simple to program and fault-tolerant, but strict and inflexible, and exchanging data through disk makes it inefficient. Then came the DAG (directed acyclic graph) model, typified by Spark: in-memory computing, relaxed restrictions, and flexible implementation of complex algorithms, but no cycles are allowed (for the sake of fault tolerance), so it cannot support large numbers of iterations. Then came the computation graph, typified by TensorFlow: its advantages are automatic differentiation and support for arbitrary iteration, so most NN algorithms can be implemented; its disadvantage is weak fault tolerance. Later came dynamic computation graphs, represented mainly by Torch, as well as TensorFlow's support for changing the graph during computation, which is useful for RNNs.
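A small example of what "automatic differentiation on a computation graph" buys you. This is written against the TensorFlow 2 eager API for brevity; the setup described in the talk predates it and used graph-mode TensorFlow, but the idea is the same: the framework derives gradients of arbitrary compositions automatically.

```python
# Automatic differentiation: gradients of a composed expression without manual math.
import tensorflow as tf

w = tf.Variable([1.0, -2.0])
x = tf.constant([[0.5, 1.5], [2.0, -1.0]])
y = tf.constant([1.0, 0.0])

with tf.GradientTape() as tape:
    logits = tf.linalg.matvec(x, w)
    loss = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits))

grad = tape.gradient(loss, w)   # gradient obtained from the recorded computation
print(grad.numpy())
```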
The parameter update model addresses how workers in a cluster are synchronized. The first method is BSP, implemented mainly by Pregel (not open source) and Spark: only after all, say, 10 workers complete a round and send their parameters to the central node are the parameters updated and sent back; this is slow but guarantees convergence. Then there is ASP, which is fully asynchronous, so it is mainly used for single-machine multi-core settings with a shared-memory model; random updates carry no convergence guarantee, but if the model is highly sparse there are few conflicts, and it even has a certain regularization effect. SSP is typified by Petuum: the fastest worker is synchronized with the slowest worker only when the bound between them exceeds a threshold, which gives high speed while still guaranteeing convergence. PS-Lite is based on the parameter server (PS) architecture: it relies on distributed storage, supports massive numbers of parameters, and supports all three update modes above. These are the two aspects of distributed machine learning that need to be understood.
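The toy sketch below reduces the three synchronization modes to the decision each worker faces before applying an update; it is illustrative only, not any framework's API.

```python
# BSP / ASP / SSP reduced to a single "may this worker proceed?" check.
def may_proceed(my_clock, slowest_clock, mode, staleness=3):
    if mode == "BSP":   # wait until every worker has finished the same round
        return my_clock == slowest_clock
    if mode == "ASP":   # never wait; no convergence guarantee in general
        return True
    if mode == "SSP":   # proceed unless the gap to the slowest worker exceeds the bound
        return my_clock - slowest_clock <= staleness
    raise ValueError(mode)

print(may_proceed(10, 6, "SSP"))  # False: the fast worker must wait for stragglers
```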
Let me share how we did it. Our distributed machine learning cluster relies on Spark. Spark's characteristics: a DAG describes the computing tasks, RDDs abstract the data operations, data is exchanged in memory, parameters are updated synchronously (BSP), it connects seamlessly with the production environment, and it mainly adopts data parallelism. Many packages are built on the Spark distributed cluster. MLlib implements decision trees, SVM, and LR. Splash implements MCMC, Gibbs sampling, and LDA, and is about 20 times faster than MLlib. Deeplearning4j mainly does deep learning on Spark and supports GPUs, but it is not as flexible as TensorFlow and does not let you freely define network structures. Finally there is PAMLkit, which supports NB, AdaGrad+FM, and FTRL+LR algorithms.
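As a minimal example of the kind of MLlib model named above, the sketch below trains a logistic regression with the Spark ML pipeline API; the HDFS paths and column names are placeholders.

```python
# Minimal Spark ML logistic regression; paths and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("lr-demo").getOrCreate()
df = spark.read.parquet("hdfs:///path/to/training_data")        # placeholder path

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label", regParam=0.01)
model = lr.fit(assembler.transform(df))
model.write().overwrite().save("hdfs:///path/to/lr_model")
```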
Our experience with the Spark-based distributed cluster: first, understand the algorithm thoroughly, with no gaps in understanding. Second, keep the code structure clean: the Gradient, Updater, and Optimizer classes should be independent of each other. Then there is the tuning experience, all summed up from pits we actually fell into: use sparse vectors wherever possible and traverse or compute in a sparse manner; if you are not careful here, performance deteriorates badly, as the sketch below illustrates.
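A small illustration of that sparse-vector advice, assuming pyspark.ml's SparseVector: operate on the stored indices and values directly instead of densifying the vector.

```python
# Sparse traversal vs. densifying a mostly-zero vector (illustrative).
from pyspark.ml.linalg import Vectors

v = Vectors.sparse(1_000_000, [(3, 1.5), (42_000, -0.7)])   # huge, mostly-zero vector
w = {3: 2.0, 42_000: 1.0}                                   # sparse weights as a dict

# Good: dot product touching only the non-zero entries
dot = sum(val * w.get(int(idx), 0.0) for idx, val in zip(v.indices, v.values))
print(dot)

# Bad: v.toArray() would materialise a million floats just to multiply two of them
```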
Now let us talk about deep learning with TensorFlow. Our TensorFlow applications are mainly oriented toward structured data, supplemented by vision and text. The DNN algorithm is the most widely used; other related algorithms (CNN, AE) have been applied successfully, and reinforcement learning is under development. Distributed TensorFlow has these characteristics: you build the training cluster and assign tasks in code; you have to start processes manually on each machine; you have to split the data in advance and copy it to each machine by hand; and there is basically no fault-tolerance mechanism. Our training setup went through three stages. First, single machine, single GPU: read all the data at once and feed it into GPU memory batch by batch. Then single machine, dual GPU: adopt an input queue and send data from the queue to the GPUs in turn; in synchronous mode, the gradients from the GPUs are averaged before the parameters are updated. Finally, multi-machine multi-GPU: adopt between-graph replication in quasi-synchronous mode, partition the data in advance, and instead of starting each process by hand, distribute the data and start the services with PDSH.
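The talk describes hand-rolled input queues and per-GPU gradient averaging in graph-mode TensorFlow. As a rough modern stand-in for the same synchronous, gradient-averaging pattern, the sketch below uses tf.distribute.MirroredStrategy, which performs the averaging automatically; it illustrates the idea rather than the original setup.

```python
# Synchronous multi-GPU training via gradient averaging (modern equivalent).
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()        # one replica per visible GPU
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"),
                                 tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")

# batches are split across replicas; gradients are averaged before each update
x = tf.random.normal([1024, 10])
y = tf.random.normal([1024, 1])
model.fit(tf.data.Dataset.from_tensor_slices((x, y)).batch(64), epochs=1)
```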
Most of the modeling is based on Spark plus TensorFlow: Spark mainly contributes its parallel capability, while TensorFlow builds the complex models. How do we combine the two, for example for a GBDT+FM+DNN model? GBDT+FM is trained on Spark and the DNN on TensorFlow. In the first phase, the Spark output is copied directly to TensorFlow: the data is split and copied to each machine. Currently, the data output by Spark training is stored in HDFS; PDSH then starts each TensorFlow worker process, and each worker reads its part of the data from HDFS and continues training. We are now developing a setup in which Spark and the TensorFlow cluster coexist, where each RDD partition can start a computation graph, allowing stacked programming.
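A hedged sketch of that hand-off follows: after Spark writes its output to HDFS, each TensorFlow worker reads only its own shard and continues training. The paths, the TFRecord format, the feature spec, and the static sharding scheme are assumptions for illustration; reading hdfs:// paths also requires TensorFlow to be built with HDFS support.

```python
# One TensorFlow worker reading its shard of Spark output from HDFS (illustrative).
import tensorflow as tf

task_index, num_workers = 0, 4                            # normally set per worker (e.g. via pdsh/env)
files = tf.io.gfile.glob("hdfs:///spark_output/part-*")   # placeholder path
my_files = files[task_index::num_workers]                 # simple static sharding

ds = (tf.data.TFRecordDataset(my_files)
        .map(lambda r: tf.io.parse_single_example(
            r, {"features": tf.io.FixedLenFeature([128], tf.float32),
                "label": tf.io.FixedLenFeature([], tf.float32)}))
        .batch(256))
# ds then feeds the DNN part of the GBDT+FM+DNN pipeline
```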
After a model is trained, it has to provide services, and serving brings its own challenges. There are many models, modeling has a long history, business needs are broad, hundreds of models are in production, and operation is scattered and hard to monitor. There are also many modeling platforms — MATLAB, Java, Python, SAS, R, Spark, and TensorFlow — because many important quantitative and actuarial models exist. The algorithm strategies are complex, covering decision-tree algorithms, various linear models, deep learning models, traditional time-series algorithms, and so on, and the algorithms are often combined. In addition, the data processing differs from model to model and is messy and personalized: the data includes both historical and real-time data and requires online joins, and different models need different processing. The goals the system must achieve are: centralized management and unified monitoring; fast rollout, resource savings, scalability, and high reliability; no restrictions on the modeling engineers, with cross-platform support; support for typical model-format conversions; and a set of well-defined, typical data-processing operators.
Many open-source components are used to reach these goals. The service framework adopts Thrift, which offers cross-language communication and supports Python, Java, C++, and so on; it is mature and stable, having been open source for ten years and widely used; and it is lightweight and simple, with a compiler of less than 3 MB. ZooKeeper is used for service coordination, Redis for online storage, Netty as the external communication library, Docker as the running container, and Nginx for load balancing.
The model application architecture is divided into three layers: the model processing layer, the data computing layer, and the interface layer. The core of the model processing layer is the model parser, which achieves cross-platform and cross-language support and handles three output formats: PMML (linear models), Protobuf (TensorFlow), and a custom format. After training, the model is exported as model files, which the online algorithm service loads to provide the service. Business applications call the model over HTTP, and the load balancer routes requests to the application services. The application services correspond to the data computing layer, where the relevant operators are defined for data processing and feature combination; the result is passed to the model router, which calls the corresponding model services. There is also a management and monitoring platform.
– END