Over the past 15 years, Ant Financial has reshaped payments and everyday life, serving more than 1.2 billion people around the world, an achievement inseparable from the support of technology. At the 2019 Hangzhou Computing Conference, Ant Financial shared the technology it has accumulated over those 15 years, as well as its future-oriented fintech innovations. We have compiled some of the best talks and will continue to publish them under the "Ant Financial Technology" banner; this article is one of them.


SQLFlow has received a great deal of industry and community attention since it was open-sourced in April. The SQLFlow project is run in a community-led way, in open collaboration with external developers. Didi Chuxing, one of Ant Financial's important partners in the open-source community, has deployed SQLFlow in its own production applications.

On September 27, Xie Liang, chief data scientist of Didi's Data Science department, and Wang Yi, researcher at Ant Financial, gave a detailed introduction at the Cloud Computing Conference to SQLFlow's product form, its mission and vision, its applications at Didi, and its future prospects.

Starting from the vision of SQLFlow

If you are not familiar with SQLFlow, you can read our previous introduction article or check out the project website:

https://sqlflow.org

To put it simply, SQLFlow = SQL + AI. You can think of SQLFlow as a compiler that translates extended SQL statements into code that the AI engine can run.
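As an illustration, here is the kind of extended statement SQLFlow accepts, following the iris example from the SQLFlow getting-started documentation (everything from the TO TRAIN clause onward is a SQLFlow extension; exact keywords can vary between versions):

```sql
-- Standard SQL selects the training data; the SQLFlow extension
-- clauses that follow are compiled into model-training code.
SELECT * FROM iris.train
TO TRAIN DNNClassifier                 -- a premade TensorFlow estimator
WITH model.hidden_units = [10, 10],    -- two hidden layers
     model.n_classes = 3
COLUMN sepal_length, sepal_width, petal_length, petal_width
LABEL class
INTO sqlflow_models.my_dnn_model;      -- where to save the trained model
```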


SQLFlow's vision is to make artificial intelligence accessible to everyone: anyone who understands the business logic should be able to use AI, so that the people who know the business best can apply artificial intelligence freely.

In the traditional modeling process, business experts (analysts, operations specialists, product experts, and so on) propose specific requirements and cooperate with product, data science, algorithm, development, and testing roles to complete a modeling task. Because of different professional backgrounds, communication costs are often high: business experts may not understand the principles and details of AI, while algorithm engineers find it hard to grasp the subtleties of the business logic. Models built under these conditions are also often not abstracted into general models with wider applicability.

For SQLFlow to solve these problems, three core elements are involved:

  • Data describes the business logic. This is already well supported by SQLFlow statements.
  • AI enables deep data analysis. Much of a data analyst's current work is taking raw data and turning it into indicators that describe and evaluate the state of the business. But the core of the job is more than simply summarizing and processing data: analysts need to spend more time building better predictive models, then reading the data and studying its internal relationships. SQLFlow gives them the power to mine data in depth, so they can correctly interpret the user behavior behind the data and abstract reasonable behavior rules or business logic.
  • It must be an easy-to-use tool that minimizes the user's learning cost and barrier to entry.

Potential users of SQLFlow include operations experts, business analysts, and data analysts, all of whom know the business intimately. They only need to call the corresponding AI solution directly; a single statement, a short piece of SQL, completes a modeling task. In such a process, business experts interact with SQLFlow only through SQL, reducing communication cost and loss. Lower modeling costs allow business experts to explore more aggressively and experiment more imaginatively. At the same time, high-value code and abstracted know-how are deposited, in the form of models, into the SQLFlow model pool. For example, if an operations expert in Xining sees that a model is called frequently, an analyst in Beijing can also call that model and transfer it to similar problems in his own region, further reducing his modeling cost; with SQLFlow, the spread of experience and knowledge can easily cross the boundaries of geography and industry.


Where is SQLFlow used?

SQLFlow has been deployed at scale at both Ant Financial and Didi, with good feedback. At Didi it is used in business-intelligence scenarios; at Ant Financial it is used in precision-marketing scenarios, meeting the flexible needs of business experts. SQLFlow is also exploring richer usage scenarios.

How does Didi use SQLFlow?

In applying SQLFlow, the first problem Didi needed to solve was data integration.

Didi's big data platform is built on Hive, so SQLFlow mainly connects to Hive clusters. In the architecture diagram, the blue part is the SQLFlow server, with three components around it. The first is Didi Notebook at the top, where all data analysts and operations specialists work and write SQL code.

Below it, the SQLFlow server interacts with two more components. In the lower left corner is the data server, which parses the SQL code into a parsed representation and validates the data parts. In the lower right corner are the model libraries; for example, SQLFlow supports Keras, XGBoost, and other libraries. These model libraries fetch the corresponding data from the database according to the parsed statements.

There is two-way communication between the data server and the model libraries: a model fetches data for training or prediction, and the predicted results and the trained model are returned to the data server for storage, either for the next run or for operations experts to do precision-marketing screening. Finally, the task information is returned through the model library to the SQLFlow server and surfaced in Didi Notebook.

Starting from the open-source cooperation between Didi and Ant, Xie Liang, chief data scientist of Didi, explained how SQLFlow is applied in Didi's business scenarios to help improve business efficiency, including:

  • Applying a DNN classification model to fine-grained subsidy coupon issuance;
  • Using SHAP + XGBoost model interpretation to understand the factors influencing user behavior and their intensity, helping operators locate operating levers;
  • Using an autoencoder with cluster analysis to analyze the temporal distribution of drivers' capacity and mine drivers' behavior patterns.

Each is introduced in turn below.


Using SQLFlow for supervised classification modeling

Classification models are an important direction in machine learning. Here is a case study of predicting the target passengers for Didi's coupons.

How are the recipients of Didi's coupons selected? Operations experts decide, based on passengers' historical ride behavior, whether to send them vouchers. Take a promotion for dining and entertainment scenarios: which users, on seeing such a scenario, are more likely to go out and spend on related consumption? Coupons are then sent to those passengers in a targeted way, converting travel demand and thereby creating user value and revenue.

In the past, completing this whole modeling process was tedious, requiring not only extensive cross-team cooperation but also the time of experts in different fields. By the time the modeling was finished and the model trained, the best window for the campaign had often been missed. The rapid growth of the business therefore places higher demands on the cooperation between the company's data and business departments and on the speed and process of model development.


SQLFlow meets this requirement. The analyst only needs to feed the user data to SQLFlow to build an efficient classifier, processing intermediate features and feature combinations with bucketize or vocabularize, and finally output the trained model into a data set called income_model. The code shown in the boxes above is even simpler, with the last line of code completing the entire model-training process. As a result, there is almost no learning curve for analysts.
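As a hedged sketch, a coupon-targeting training statement might look like the following; the table name, columns, and bucket boundaries are hypothetical, and the COLUMN-clause transform syntax differs across SQLFlow versions:

```sql
-- Hypothetical coupon-conversion classifier. BUCKET discretizes a
-- numeric feature into ranges; all names and boundaries here are
-- illustrative, not Didi's actual schema.
SELECT age, trip_cnt, avg_fare, did_convert
FROM passenger_features
TO TRAIN DNNClassifier
WITH model.hidden_units = [128, 64],
     model.n_classes = 2
COLUMN trip_cnt,
       BUCKET(NUMERIC(age), [18, 25, 35, 50]),
       BUCKET(NUMERIC(avg_fare), [10, 30, 100])
LABEL did_convert
INTO income_model;   -- the trained model set named in the text
```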

Using SQLFlow for black-box model interpretation

More often than not, it is not enough for data analysts and operations experts to know what happened; they also need to know why and how. For example, when Didi analyzes the factors that influence passenger activity, we need a model that predicts passenger activity from passengers' past ride behavior, analyzes which factors influence whether they take a taxi, and embeds those factors into the whole customized marketing plan so as to retain users better.

In this case, we need to capture the user's current life-cycle stage, including days since registration, level, behavior score, and so on. From the user's travel demand, we need to know the estimated mileage the user accepted on historical rides and the mileage accumulated on the platform. We must also understand the user's ride experience, including the number of requests, pickup distance, answer time, whether there was queuing, and so on. Because these dimensions differ in data form and business meaning, it is hard for operations staff to determine, through simple data aggregation and comparison, which factors affect user activation and retention in which business scenarios. We therefore need a model to abstract this information and then rank it by importance.

At Didi, we use SQL in SQLFlow to extract users' travel data over a past period, invoke the model through the interpretability (EXPLAIN) extension, and then use a SHAP + XGBoost interpretation model to understand the factors influencing user behavior and quantify their intensity. After modeling, each user is plotted as a point for each of the dimensions listed above, with the dimensions on the vertical axis and feature value on the horizontal axis; from this plot you can see what influence each dimension has for each person. All this information can be written into a large Hive table, which operations experts can use to find operating scenarios and improve efficiency. Whether generating SHAP values or querying Hive tables, SQLFlow lets an operations specialist carry out, with simple SQL statements, complex modeling tasks that would normally require a highly specialized AI algorithm engineer.
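This pairing of an XGBoost model with SHAP-based interpretation can be sketched as follows; the table and column names are hypothetical, and the WITH attributes follow the SQLFlow documentation but may differ by version:

```sql
-- Train a gradient-boosted tree on 30 days of ride features
-- (hypothetical table user_trips_30d, label is_active)...
SELECT * FROM user_trips_30d
TO TRAIN xgboost.gbtree
WITH objective = "binary:logistic",
     train.num_boost_round = 30
LABEL is_active
INTO sqlflow_models.activity_model;

-- ...then explain it; SQLFlow renders a SHAP summary plot in which
-- each user contributes one point per feature dimension.
SELECT * FROM user_trips_30d
TO EXPLAIN sqlflow_models.activity_model
WITH summary.plot_type = "dot";
```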

Unsupervised clustering with SQLFlow

The third example is unsupervised clustering. The actual scenario is stratifying drivers by driving preference: clustering the driver population according to the characteristics of each driver's driving time over a period, so as to identify different types of drivers and inform subsequent strategy delivery and management.

Didi needs to arrange transport capacity reasonably according to drivers' driving habits. There are tens of thousands of active drivers on the platform; how do you score or distinguish them? That is the hard part.

Previously, Didi classified drivers subjectively based on historical experience and common sense: those who drive more than eight hours a day were called high-capacity drivers, and those who drive less were called medium-capacity drivers. Alternatively, drivers were divided into five categories by a very complicated set of rules, based on things like online time in the past 30 days and whether they accepted dispatches: high-capacity drivers, active medium-capacity drivers, low-frequency medium-capacity drivers, active low-capacity drivers, occasional drivers, and so on. But this approach has many problems. Drivers within both the high-capacity and medium-capacity groups have very different driving habits and time distributions across time and space, which means capacity needs to be characterized at a finer granularity across time periods.

The figure above shows the distribution of driving time for 160,000 drivers in one region over a day. The horizontal axis is the 144 ten-minute intervals of a 24-hour day; the color encodes the standardized driving time in each interval, with brighter colors meaning longer driving time. As you may have noticed, the spectrum is messy, and it is hard to see any pattern in when drivers are on the road.

AutoEncoder-based clustering in SQLFlow

To solve this problem, Didi's data scientists used a deep-learning autoencoder in SQLFlow to perform unsupervised clustering on drivers' driving hours. The model automatically divided the driving patterns of the 160,000 drivers into five categories. After clustering, drivers with the same behavior patterns were cleanly grouped, with distinct differences between groups.
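A sketch of what such a clustering statement can look like, assuming the autoencoder-based DeepEmbeddingClusterModel in the sqlflow_models library; the table and column layout are hypothetical, and attribute names may differ by version:

```sql
-- Unsupervised clustering of per-driver activity vectors:
-- one row per driver, one column per ten-minute slot of the day.
-- Note there is no LABEL clause, since the task is unsupervised.
SELECT * FROM driver_slot_minutes
TO TRAIN sqlflow_models.DeepEmbeddingClusterModel
WITH model.n_clusters = 5,        -- the five driver groups in the text
     model.pretrain_epochs = 10,  -- autoencoder pretraining
     train.batch_size = 256
INTO driver_cluster_model;
```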

The results show that about 40,000 drivers are truly occasional drivers: they basically do not drive, or take no orders after going online. The second group, roughly drivers 40,000 to 60,000 in the figure, are typical rush-hour drivers, some of whom prefer the evening rush. The third group are the real so-called high-capacity drivers: they take orders from morning to night and are more likely to treat Didi as a career. The fourth type are low-frequency medium-capacity drivers, who take an occasional order, more than the first group, but with no fixed pattern. Finally there are the late-night drivers, who go out in the middle of the night and go home to sleep in the early morning.

For these drivers with different driving habits and preferences discovered through data mining, the most important task is to design reasonable incentives and operating strategies that deploy capacity to meet passenger demand. What used to be very complicated and tedious work now takes only simple SQL code, effectively helping operations experts decompose the characteristics and structure of capacity throughout the day, greatly improving the success rate of operating strategies and the efficiency of business staff.

As the three examples above show, SQLFlow is a truly data-and-intelligence-driven product that enables business staff to solve the most complex business problems with the simplest logic.


The value and future of SQLFlow

We know that in computer science, the closer a computing unit is to a data unit, the more efficient it is. SQLFlow pursues the same goal: integrating AI capabilities with business entities for productivity gains.

The end of this direction is what you think is what you get.

When Iron Man builds his new reactor, all he has to do is grab the images, drop them into the system to see whether they fit, and swap in another if they don't. SQLFlow is approaching this state, and this is the end state we think SQLFlow should reach.

Operations experts should not have to spend time and energy learning to build AI models; instead they should apply their business expertise to clearly specify prediction targets and data inputs, try different models, and explore solutions through SQLFlow, achieving "what you want is what you get."

Finally, SQLFlow is a bridge between business analysts and AI, and also between data and insight. We expect millions of analysts to cross this bridge and meet science and wisdom in the future.


The examples and runtime environment in this article can be obtained through the SQLFlow Docker image:

docker run -it -p 8888:8888 sqlflow/sqlflow:didi

SQLFlow website:

http://sqlflow.org/

SQLFlow documentation:

https://sql-machine-learning.github.io/doc_index/sqlflow_getstarted/

SQLFlow on GitHub:

https://github.com/sql-machine-learning/sqlflow


Cloud native, TEE, shared intelligence, converged computing: what are all these? Ant Financial's most cutting-edge technologies are revealed in the e-book "Ant Financial Online Technical Interpretation". Follow the "Ant Financial Technology" official account and reply "online" in the dialog box to download it for free.