Ant Financial recently ran a live online lecture series, “Fighting the Epidemic Together, Breaking Through with Technology”. We have compiled the lectures and are publishing them on the “Ant Financial Technology” public account.

Today we share how Alipay achieves lightweight coupling and an elastic, dynamic development model on mobile, with an in-depth look at its technology selection and practical experience. The speaker is Wen Shengzhang, a core developer of the Ant Financial mPaaS client. The full text of the talk follows:

We provide a mobile development platform called mPaaS, which has been officially released on Alibaba Cloud. It grew out of Alipay's own mobile components and aims to provide a fast-iterating architecture with dynamic capabilities. It is a complete solution: mobile development SDKs, mobile build tools, and a set of back-end services and tools, delivered together as one architecture. Our main goal is to help you build better-performing apps with mPaaS. In today's live session, we share some of the dynamic-capability practices from Alipay's use of mPaaS.

Elastic, dynamic client-side architecture explained

The first topic is elastic, dynamic capability on the client. At Alipay we face three problems. The first is massive business volume, which we address with Hybrid solutions, an old topic. The second is a highly available monitoring and operations system with timely, rapid release, which includes releasing according to specific conditions, gray release, fast rollback, and fast iteration. The third is opening up our Hybrid solutions.

Part1: Use Hybrid architecture to meet massive service requirements

In the first part, we show how a Hybrid architecture meets a large number of business requirements. Alipay is a national-scale app that carries a great deal of business, and the traditional way of iterating cannot keep up with it. For a Double 11 (11-11) or Double 12 (12-12) campaign, or other operational activities, we need very fast iteration on both iOS and Android, as well as fast rollback. On mobile we currently have four technology options: Native, HTML5, React Native, which is a more recent cross-platform solution, and Flutter, which we are also trying out.

Here is how the four compare. For developers who come from native development, Native has the lowest development cost: there is little new to learn, and we already know the system's UI framework and its API calls well. Its user-experience cost is also low, because it is built directly on the platform UI architectures, such as UIKit on iOS and the UI framework on Android. Given the hardware of today's mobile devices, the native experience is very good, but its dynamic capability is very weak.

We cannot deliver new Native capabilities dynamically; even shipping a marketing component this way is impossible. So in the early days of mobile, the first answer to this problem was HTML5. It is built on the WebView technology stack, with pages written by front-end developers. To interact with the Native side we introduced a component that was much discussed at the time, the JsBridge, with one implementation for iOS and one for Android. With these two JsBridge implementations, pages can reach Native capabilities and support both simple interactions and more complex operations.
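To make the bridge idea concrete, here is a minimal sketch of the native-side dispatch, written in Python only for illustration. The JSON message shape (`api`, `params`, `callbackId`), the `JsBridge` class, and the `getNetworkType` handler are all assumptions made for this example; the talk does not describe the actual message format.

```python
import json
from typing import Any, Callable, Dict


class JsBridge:
    """Hypothetical native-side dispatcher for calls coming from a web page."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[dict], Any]] = {}

    def register(self, name: str, handler: Callable[[dict], Any]) -> None:
        # Expose one native capability under a name the page can call.
        self._handlers[name] = handler

    def handle_message(self, raw: str) -> str:
        """Take a JSON message from the page and return a JSON reply."""
        msg = json.loads(raw)
        handler = self._handlers.get(msg["api"])
        if handler is None:
            reply = {"callbackId": msg["callbackId"], "error": "api not found"}
        else:
            result = handler(msg.get("params", {}))
            reply = {"callbackId": msg["callbackId"], "result": result}
        # The callbackId lets the page match the reply to its pending call.
        return json.dumps(reply)


bridge = JsBridge()
bridge.register("getNetworkType", lambda params: {"type": "wifi"})
reply = json.loads(bridge.handle_message(
    json.dumps({"api": "getNetworkType", "params": {}, "callbackId": 1})))
```

Keeping a single dispatcher with a fixed message shape is what allows the same page code to run against both the iOS and the Android implementation.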

For example, when we need to deliver a business page dynamically, we write it as an HTML page, publish it to our platform, and deliver it to the Native client, which renders it quickly. As HTML5 technology developed, we began to ask whether we could use a front-end DSL to write what we want. That stage brought solutions such as React-JS and React-Native.

React-Native uses the React DSL but renders Native components. Native developers first have to learn front-end languages, but in return it bypasses the WebView and still provides dynamic capability, so its experience can be as smooth as Native. In practice, staying compatible with the characteristics of both platforms caused many problems; we invested a lot of effort in solving them, and it did not go smoothly. Its dynamic nature is something we value highly, though. It adopts a front-end-style layout model such as Flex and abandons the platforms' original layout managers. Many solutions have grown out of the RN approach; Taobao has a similar solution, Weex. More recently Google released Flutter, which completely overturns the original Native development model. It is new to us from head to toe.

From a Native perspective, Flutter is a canvas-based scheme on both Android and iOS. It draws on a canvas, takes over the canvas's events, and handles rendering and responses to user interaction in its own engine. With Flutter, the first cost is learning the Dart language. The second cost is understanding how the whole Flutter engine works. As for dynamic capability, it has officially been given up for now: there is no plan to support dynamic delivery. Flutter is built on the Skia engine, a good rendering engine, so the user experience is very good. Those are the differences among the four options. Next, let's look at our H5-based solution, an architecture that consists of a container plus offline packages.

In the traditional H5 approach, the client simply embeds a WebView and provides a few JS APIs, so that a downloaded HTML page can not only talk to the server but also use some local capabilities. But its rendering performance, and problems such as the white screen on first load, remain unsolved. To solve these problems and bring the experience closer to native, we built the current architecture. The first thing we did was adopt the UC kernel on Android. The UC kernel smooths over the differences between Android versions and device models, which is the browser-compatibility problem so common in the front-end world.

With that solved, we want to give the container more capabilities. We first unify the JS bridge behavior of the container; this is the middle layer, the third layer, shown in green. Once that layer is unified, all that remains for front-end developers is their business details and business flows. The container also pulls offline packages, which are one of our biggest levers for solving the first-screen rendering problem. Offline packages in the container work together with our distribution platform, which we call MDS.

With the release platform, we can quickly deliver an offline package to the user's phone, so that when the user opens the page it opens almost immediately. Distribution and updating of offline packages are both under our control. In addition, based on some algorithms, we can predict whether a user is likely to open a service. If so, the package is delivered to the phone in advance, and the service opens quickly from the offline package.

In the lower layers of the container, we connect to the existing Native capabilities by wrapping them in unified API interfaces, such as network, multimedia, and Push. The container itself can also be upgraded quickly. Through distribution, we can deliver the latest UC kernel to users' phones, so that users are upgraded without noticing and get the latest features. On top of all this, we call each business an application.

So we have businesses A, B, C, D, up to N, and each application is a business-level concept. At this level we can control release and ROLLBACK for each one, which helps contain faults. We divide the startup of an H5 application into several layers. First, the entry point: an application can be launched from a URL or from a Native button. Each H5 application is abstracted into a layer called App.

At the entry point we pass the launch parameters into our H5 container, which is called Nebula. Once the application starts, there are lifecycle callbacks: a MicroApp has callbacks such as onLaunch, onStart, and onStop. At the Native layer on the Android side, the call passes through a manager to an Activity and a Fragment, and then a page is created. This page is the H5 container's page, and it loads some scripts. If you want to monitor certain requests or business metrics, for example, writing a plug-in here is a very good choice.

Beyond that, what the container mainly offers is a smooth communication channel between Native and the Web side. What the slide shows here is the JS Bridge concept. Native uses evaluateJavascript to send JS code to the Web side, and the Web side uses console messages to send data back to Native. We want to make the Web experience as good as possible. That comes down to a few issues: loading of the first page, differences between platforms, and caching of offline resources. For these we provide the following solutions.

First, we move network requests out of the WebView layer into the Native layer, because the Native layer has stronger networking capabilities. For example, we can support protocols other than HTTP and WebSocket. It also solves some cross-origin problems. Because we control the app, we can also control which domain names are accessed. This also gives us front-end/back-end separation: page resources can be loaded in advance, so front-end developers only need to care about the communication of their business data.

Second, we provide differential updates. What does that mean? Suppose I delivered a 1 MB offline package, found some bugs in it, fixed them, and released version 1.0.1. The changes between the two packages may be tiny, such as one changed string or a few lines of code. With differential update, only the changed part is delivered: the MDS release platform sends the difference to the client at an appropriate time, and the client automatically merges it into a new offline package.

The next time the user opens the service, the offline package is already up to date. This reduces traffic and also lets us respond quickly to business needs.
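The merge step can be illustrated with a file-level diff. This is only a sketch under stated assumptions: the talk does not say which diff algorithm MDS uses, so the package model (a mapping from file path to bytes) and the `make_patch` and `apply_patch` helpers are hypothetical. A production system may well use a binary diff instead.

```python
from typing import Dict, Optional

Package = Dict[str, bytes]          # file path -> file contents
Patch = Dict[str, Optional[bytes]]  # None marks a file to delete


def make_patch(old: Package, new: Package) -> Patch:
    """Server side: keep only the files that changed, were added, or removed."""
    patch: Patch = {}
    for path, data in new.items():
        if old.get(path) != data:
            patch[path] = data
    for path in old:
        if path not in new:
            patch[path] = None
    return patch


def apply_patch(old: Package, patch: Patch) -> Package:
    """Client side: merge the patch into the installed package."""
    merged = dict(old)
    for path, data in patch.items():
        if data is None:
            merged.pop(path, None)
        else:
            merged[path] = data
    return merged


v1_0_0 = {"index.html": b"<h1>Helo</h1>", "app.js": b"run();", "old.css": b"a{}"}
v1_0_1 = {"index.html": b"<h1>Hello</h1>", "app.js": b"run();"}
patch = make_patch(v1_0_0, v1_0_1)
updated = apply_patch(v1_0_0, patch)
```

Here the patch carries one changed file and one deletion, while the unchanged `app.js` is never sent again.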

The third is a combination of push and pull, which is quite interesting. The traditional HTTP model is always pull: the client initiates a request and the server returns a response. HTTP/2 does have the concept of server push, but we go beyond that. We have a component called Sync, a stable long-lived connection provided by the mPaaS platform. Through this long connection the server can send things to the client in advance, such as offline packages or other data, especially event-driven data.

Fourth, when an offline package release fails, we can fall back to the online version. This is the safeguard for the case where downloading the offline package fails. Another point is independence from the system browser kernel, which I mentioned before. These problems used to be more common, but even on the versions we support now, Android 4.3, 4.4 and above, some still occur. Our UC WebView is updated dynamically in real time and does not follow the Android system, so shipping kernel updates early gives users a more stable offline-package experience and reduces front-end bugs.
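The fallback decision can be sketched as a simple resource lookup. The function name, the return shape, and the `example.com` URL are all hypothetical; the talk only states that a failed offline package falls back to the online page.

```python
from typing import Dict, Optional, Tuple


def resolve_resource(
    path: str,
    offline_package: Optional[Dict[str, bytes]],
    fallback_base_url: str,
) -> Tuple[str, str]:
    """Return ("offline", path) if the package has the file, else an online URL."""
    if offline_package is not None and path in offline_package:
        return ("offline", path)
    # Package missing (e.g. download failed) or file not in it: go online.
    url = fallback_base_url.rstrip("/") + "/" + path.lstrip("/")
    return ("online", url)


package = {"index.html": b"<html></html>"}
hit = resolve_resource("index.html", package, "https://example.com/app/")
no_package = resolve_resource("index.html", None, "https://example.com/app/")
missing_file = resolve_resource("/missing.js", package, "https://example.com/app/")
```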

Next are deeply customized components. Here we provide solutions like Ant Design that let users get started quickly and build a page. The last point is monitoring. Monitoring is the most boring part, but also the most important. We need monitoring points for business stability and for the container itself. With monitoring in place we can respond quickly to users and fix the problems they run into, and we can collect operational data that helps with the next round of product improvements and bug fixes.

Our H5 container is highly extensible. First, we provide JS APIs through which H5 code can call the Native capabilities mentioned above. These include not only common capabilities but also data storage and global broadcast, and you can add custom extension APIs. Second, the Nebula container provides container plug-ins. Plug-ins are event-driven: the container exposes a series of lifecycle events for H5, and when a lifecycle callback fires, the plug-in receives the corresponding event.
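An event-driven plug-in mechanism of this kind can be sketched as follows. This is not Nebula's actual plug-in API: the `PluginManager` class and the event names `page.start` and `page.finish` are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Plugin = Callable[[Dict[str, str]], None]


class PluginManager:
    """Hypothetical registry that routes container events to plug-ins."""

    def __init__(self) -> None:
        self._plugins: DefaultDict[str, List[Plugin]] = defaultdict(list)

    def register(self, event: str, plugin: Plugin) -> None:
        self._plugins[event].append(plugin)

    def dispatch(self, event: str, payload: Dict[str, str]) -> int:
        """Deliver the event to every subscribed plug-in; return how many ran."""
        plugins = self._plugins.get(event, [])
        for plugin in plugins:
            plugin(payload)
        return len(plugins)


# A monitoring plug-in that records every page the container starts loading.
seen_urls: List[str] = []
manager = PluginManager()
manager.register("page.start", lambda payload: seen_urls.append(payload["url"]))
delivered = manager.dispatch("page.start", {"url": "/index.html"})
ignored = manager.dispatch("page.finish", {"url": "/index.html"})
```

Because plug-ins only subscribe to events, monitoring code like this can be added or removed without touching the business page.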

Based on these events we can do all sorts of things. We can also switch features on and off, which is ideal for A/B tests and similar scenarios. Switches can be configured for specific conditions such as user group, device model, and region, so that we can run a gray release or an A/B test for a specific region or group in advance. One of the most notable features of our H5 container is that it is much more stable than the system WebView. We track two indicators, the crash rate per PV and the ANR rate per PV. On crash rate, our container performs more than twice as well as the traditional one; ANR measures how likely the app is to freeze while you are scrolling. We mainly solved two core problems: WebView stability and inconsistent WebView experience.

Part2: Monitoring and publishing platform for stable service running and rapid iteration

Now a little about our monitoring and release platform, a large back-end capability built around the H5 container ecosystem. First, we need to release quickly. Ordinary H5 pages are in effect released in real time: if you have a server, publishing your front-end pages updates them immediately. Using offline packages loses some of that speed, so we need to make up for it here.

So we integrate with MDS, and offline packages are released through MDS. Based on its configuration, MDS first releases the offline package to a CDN. It then checks the release conditions, such as whether this is a gray release or a full release; for a gray release it also applies the gray-release configuration. The address of the offline package is sent to the client through a request between client and server, and the client downloads the package from the CDN. This is the rapid release capability we provide. Rapid release requires both intelligent gray release and incremental diffing of offline packages, which is invisible to customers: the customer only uploads the two versions of the offline package to our server, which automatically computes the difference and delivers it to the client. MDS must also meet very high performance requirements. MDS can handle 50,000 QPS, and the reach rate to clients can be as high as 99.99%.

Next is monitoring and diagnosis. Monitoring deserves our attention because, once a business is developed and shipped to clients, a poor user experience means low retention. So we add tracking points for crashes, smoothness, power consumption, traffic and so on, as well as for business unavailability. Tracking points collect how users use the app, and that data has to be reported. Reporting everything in real time, however, is not a reasonable strategy.

First, it affects the user experience, because real-time reporting may need a process or thread that is always running. Second, it consumes a lot of the user's data traffic. We also need to take the user's situation into account, for example with customized switches or special sampling. If one user reports a crash to us, we do not need to collect crashes from the whole user base, because that user's scenario may be quite special; we need the ability to capture logs from a specific user.

We have three ways to report. The first is automatic upload. The second is periodic upload, such as every hour or every other day. The third is upload driven by a diagnostic command, which is the scenario I just mentioned: a user reports a crash, we ask for the user's email address or account name, and then we can fetch the logs from exactly that user's phone. From the logs uploaded by users and the logs generated by the app itself, we can reconstruct the page navigation path. After the analysis we can decide what to optimize. If the crash rate or ANR rate reaches a certain level, we need to trip a fuse, for example by disabling the offline package's page, and then fall back or repair.
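The fuse decision at the end of this step can be sketched as a threshold check on the two rates. The thresholds, the minimum sample size, and the `PageStats` type are hypothetical values chosen for illustration; the talk does not give the real limits.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    page_views: int
    crashes: int
    anrs: int


def should_fuse(
    stats: PageStats,
    max_crash_rate: float = 0.01,  # assumed limit, not from the talk
    max_anr_rate: float = 0.02,    # assumed limit, not from the talk
    min_views: int = 100,          # avoid tripping on tiny samples
) -> bool:
    """Return True when the page should be disabled and fall back."""
    if stats.page_views < min_views:
        return False
    crash_rate = stats.crashes / stats.page_views
    anr_rate = stats.anrs / stats.page_views
    return crash_rate > max_crash_rate or anr_rate > max_anr_rate


unhealthy = should_fuse(PageStats(page_views=1000, crashes=20, anrs=0))
healthy = should_fuse(PageStats(page_views=1000, crashes=5, anrs=5))
too_few_samples = should_fuse(PageStats(page_views=10, crashes=5, anrs=0))
```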

These are the four capabilities we offer. The first is fault isolation. If we find that a page is producing faults, we use a switch prepared in advance: for a new business whose switch is on, we turn it off immediately to block the fault and stop the bleeding.

The second applies when the app crashes on the splash screen, so that users may not be able to get into the app at all. In that case we enter a safe mode, upload diagnostic data from within safe mode, clean up some of the app's data, and restart. This is an automatic recovery capability.
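One common way to decide when to enter safe mode is to count launches that never completed. The sketch below assumes that approach; the `LaunchGuard` class and the threshold of three are hypothetical, and a real app would persist the counter on disk rather than in memory.

```python
class LaunchGuard:
    """Counts launches that started but never reported success."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.unfinished_launches = 0  # would live in persistent storage

    def on_launch_begin(self) -> bool:
        """Call at process start. Returns True if safe mode should be entered."""
        if self.unfinished_launches >= self.threshold:
            return True
        self.unfinished_launches += 1
        return False

    def on_launch_success(self) -> None:
        # The app got past the splash screen, so the launch was healthy.
        self.unfinished_launches = 0

    def on_safe_mode_cleanup_done(self) -> None:
        # Diagnostics uploaded and data cleaned; allow a normal restart.
        self.unfinished_launches = 0


guard = LaunchGuard(threshold=3)
# Three launches crash before on_launch_success is ever called.
launches = [guard.on_launch_begin() for _ in range(4)]
guard.on_safe_mode_cleanup_done()
after_cleanup = guard.on_launch_begin()
```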

The third is traffic fusing. If network calls reach a certain request volume or failures reach a certain level, for example when the server side or a business in the client app is hit by a DDoS, we need to trip a traffic fuse to bring the traffic down.
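A traffic fuse can be approximated with a sliding-window request limit. This is a generic technique swapped in for illustration, not the mechanism mPaaS is stated to use; the class name and limits are hypothetical. Time is passed in explicitly so the behavior is deterministic.

```python
from collections import deque
from typing import Deque


class TrafficFuse:
    """Rejects calls once too many were made within the sliding window."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps: Deque[float] = deque()

    def allow(self, now: float) -> bool:
        # Forget requests that have left the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_requests:
            return False  # fuse is open: drop the request
        self._timestamps.append(now)
        return True


fuse = TrafficFuse(max_requests=3, window_seconds=1.0)
burst = [fuse.allow(t) for t in (0.0, 0.1, 0.2, 0.3)]
after_window = fuse.allow(1.05)
```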

The fourth, and a very important one, is dynamic repair. There are several means. The first is the switch. The second is a version update of the offline package: if the previous package has problems, we fix it, upload it, and quickly release it to all users. The third is a Hotpatch for Native code. On Android we use DexPatch, which can repair Native Java code, and it also supports monitoring and rollback: you click a rollback button in the mPaaS console. We need to be able to roll back new changes quickly, because there is no guarantee that the next piece of code is bug free.

If verification shows no problems, the repair stands; if the hot fix itself is problematic, we roll back quickly and send a new Patch. Native releases may move slowly, so once we adopt the H5 scheme, H5 iteration is certainly faster than Native. Suppose an app contains N products, each developed at its own pace. The life cycle of each product is the loop shown on the left: the business side sets goals, the development side writes code, we run a gray release with continuous monitoring, then release officially and move to operations monitoring, and after operations monitoring we do gray verification and set the next goals.

Across versions, the key concept is gray release. For a gray release we need to know certain conditions about users. Suppose we build a performance optimization for Android phones: we use conditions to select Android devices, for example those with a low-end CPU such as the Snapdragon 600 or 500 series, and then verify the performance fix on them.
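Condition-based screening of this kind can be sketched as a rule match over device attributes. The field names (`platform`, `cpu_series`) and the value strings are hypothetical; only the idea of selecting Android devices with Snapdragon 600 or 500 series CPUs comes from the talk.

```python
from typing import Dict, Set


def matches_gray_rule(rule: Dict[str, Set[str]], device: Dict[str, str]) -> bool:
    """A device is in the gray group only if every rule field allows its value."""
    return all(device.get(field) in allowed for field, allowed in rule.items())


rule = {
    "platform": {"android"},
    "cpu_series": {"snapdragon-500", "snapdragon-600"},
}
low_end_android = {"platform": "android", "cpu_series": "snapdragon-600"}
high_end_android = {"platform": "android", "cpu_series": "snapdragon-800"}
iphone = {"platform": "ios"}

in_gray = matches_gray_rule(rule, low_end_android)
not_in_gray = matches_gray_rule(rule, high_end_android)
ios_excluded = matches_gray_rule(rule, iphone)
```

Only devices that match receive the new offline package; everyone else stays on the current version until the fix is verified.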

Part3: A better Hybrid scheme, and how mini programs differ from HTML5

Next is the recently popular mini program solution. A mini program is a more advanced architecture built on top of the H5 solution.

First, a mini program is a new mobile application format based on Web technology that integrates native capabilities. The biggest difference from H5 is that H5 is an open technology, whereas a mini program is a complete development framework. You write code following the framework and its DSL, compile it into a mini program package, and publish it to the corresponding platform. The platform renders pages from your DSL, and its rendering engine can be swapped at any time; it does not have to be a WebView. It could, for example, implement a renderer based on Skia.

Mini programs have four advantages. First, they are easy to get: scanning a QR code opens one directly, and you do not need to set up a server or think about things like a CDN. Second, mini programs can link to each other and connect to Native. Third, their security and reliability are guaranteed by the major platforms. Fourth, rendering is determined by the tags you specify. For example, native components can render certain DSL components, in which case rendering performance is somewhat better than H5.

In the overall architecture, the rendering layer and the logic layer of a mini program are completely separate.

You can think of the rendering layer as a WebView, which is one possible choice, while event handling and logic execution run in a separate JS engine. The two communicate through a JS Bridge at the Native layer. On each update, the logic layer passes the data that needs to change to the Native layer. The Native layer takes the DOM information already rendered in the rendering layer, computes the difference, and delivers it to the rendering layer, so every UI update renders only the changed data. Execution in the rendering layer and the logic layer is therefore completely isolated; this is what we mean by the dual-thread model of mini programs. And because the platform container stands behind the mini program, Native storage, network, and multimedia capabilities can be opened up to it quickly. As a result, mini program development gives us a framework with the best performance and the strongest dynamic capability at the lowest cost.

Mini programs also come with an IDE and with build and release capabilities. The mini program IDE that Alipay provides can develop Alipay mini programs, Taobao mini programs, and mPaaS mini programs. Their technical architectures differ, but the DSL is the same, so one set of code works for all of them. At startup, the Native layer first parses the parameters you pass in, preloads the mini program's resources, and creates a Render page. The Render page may be based on the UC WebView, as I said, but it does not have to be; this is completely transparent to the business. After the rendering layer is initialized, we create a JS worker, which is the logic layer we just talked about. Once created, the worker listens for events from the mini program and dispatches them to callbacks in the mini program's business code, which respond with calls such as setData. Those calls are passed to the Native layer, which forwards the data to the rendering layer. In the next event loop, the rendering layer renders the difference, and we see the result we want. This is the dual-thread concept of mini programs: logic on the worker side, rendering on the Render side.
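The data flow from the logic layer through Native to the rendering layer can be sketched in a single process. This is a simplification under stated assumptions: the real layers run in separate JS contexts, and the `NativeLayer`, `RenderLayer`, and `diff_data` names, as well as the key-level diff, are hypothetical.

```python
from typing import Any, Dict, List


def diff_data(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only keys whose value is new or changed."""
    return {k: v for k, v in new.items() if k not in old or old[k] != v}


class RenderLayer:
    """Stands in for the page; it only ever receives differences."""

    def __init__(self) -> None:
        self.view_data: Dict[str, Any] = {}
        self.applied: List[Dict[str, Any]] = []

    def apply(self, changes: Dict[str, Any]) -> None:
        self.view_data.update(changes)
        self.applied.append(changes)


class NativeLayer:
    """Sits between the logic worker and the renderer and computes the diff."""

    def __init__(self, render: RenderLayer) -> None:
        self.render = render
        self.last_data: Dict[str, Any] = {}

    def on_set_data(self, data: Dict[str, Any]) -> None:
        merged = {**self.last_data, **data}
        changes = diff_data(self.last_data, merged)
        self.last_data = merged
        if changes:  # skip the render pass when nothing changed
            self.render.apply(changes)


render = RenderLayer()
native = NativeLayer(render)
# The logic layer (worker) reacts to events by calling setData.
native.on_set_data({"title": "Hi", "count": 1})
native.on_set_data({"title": "Hi", "count": 2})
native.on_set_data({"count": 2})
```

The second call reaches the renderer as a single changed key, and the third produces no render pass at all.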

Mini programs are highly standardized, and that standardization provides several capabilities. First, packages are built to a standard: every mini program must follow the prescribed package structure. Second, UI components and APIs are provided. Third, there is an entry specification: the mini program's content and overall page control converge on a relatively small entry point, which helps prevent all kinds of risk as far as possible.

Security and privacy control are built into our Native container. If a mini program wants a privacy-sensitive permission, such as the user's location or phone number, the Native side must respond at the UI level and request permission from the user. We also provide widgets, which you can think of as plug-ins or something like UI components.

As for the ecosystem, Alipay's mPaaS mini programs are already used in many places, such as Ele.me, Gaode, and Taopiaopiao.

We package all of our mini program capabilities and deliver them to customers through mPaaS. After integrating the mPaaS container, customers can develop the mini programs they want on top of this ecosystem and run them in their own apps. We want an open ecosystem in which customers benefit from mini programs and gain dynamic capability of their own. We want to increase user engagement, connect a huge number of services, and reach multiple services quickly.

mPaaS is a complete engine that integrates Alipay's technology, and it provides complete solutions from the server side to the client side. On the client there are the client SDK and the client framework. The client and server communicate through our custom RPC protocol, covering both push and pull; push is the Sync component mentioned earlier. For custom HTTP-based protocols, mPaaS also provides a gateway. Beyond that there are user behavior data analysis, message push, release, switches, and so on.

On the back end, mPaaS has metering and billing control and multi-tenant control, which are our own capabilities. As a whole, mPaaS now runs on Alibaba Cloud.

mPaaS is now available to external customers on Alibaba Cloud. You are welcome to try it.

Slides: files.alicdn.com/tpsservice/…