Author: Liu Changqing (Zhu Shui)
This article describes the full-link technology concept that the Mobile Taobao team was the first to introduce in the mobile field. It is about 12,000 words long and takes about 15 minutes to read. Readers will learn how thinking about experience optimization in mobile technology has changed, along with the accumulated lessons and R&D practice behind software-defined experience.
Existing App architecture challenges
Since All in Wireless began in 2013, Alibaba Group's mobile technology has gone through several key stages over more than ten years of development:
- Stage one solved the pain point of large-scale concurrent business development and defined the Atlas framework (a containerization framework that provides component decoupling, dynamic deployment support, and so on);
- Stage two built ACCS (Taobao's wireless full-duplex, low-latency, high-security channel service), a long-connection, duplex, encrypted network capability that filled the gap in end-to-end mobile service capability and caught up with the industry;
- Stage three built dynamic R&D frameworks such as Weex and mini programs around business characteristics, and mobile technology entered a dynamic, cross-platform period.
In the middle and later stages, we aligned the BUs and co-built capabilities through the Ali mobile group mechanism. Since then, the mobile infrastructure has largely taken shape, and each domain has accumulated several sets of reusable capabilities. Apps have generally settled into a three-layer architecture: upper-layer business, a middle layer of R&D frameworks or containers, and basic capabilities. Our team is one of the main maintainers of the wireless client-side infrastructure and is responsible for building the group's basic mobile capabilities. In recent years the team has focused on in-depth performance optimization of Taobao business scenarios. Through experience-optimization projects we examined the App architecture and the related call links horizontally, and found the following common problems across the group's Apps:
(Figure 1 Taobao App architecture challenge)
- Inefficient O&M troubleshooting: In the monitoring phase, most problems are either not monitored at all or the reported information is too thin to support analysis, so troubleshooting has to rely on logs. The next problem is missing logs: when an exception occurs, logs are not uploaded proactively and must be retrieved manually, and they cannot be retrieved while the user is offline. Once a log has been pulled, the next problem is that it cannot be understood. On top of that, EagleEye logs on the server are kept for only 5 minutes. By the end of one such round, half a day has usually passed.
- Incomplete end-to-end tracing: A complete business link flows through every layer from client to cloud. Take placing an order as an example: the request is triggered on the client, is processed by several client modules, travels to the server over an unstable mobile network, and triggers N back-end application calls. Which problems in these calls affect the order transaction? Which steps slow down the whole flow? When a request does not return, is it a server problem or a network problem? Without clear definitions, the performance of each layer of the call link is not fully exposed. In addition, the client is asynchronous by nature, which poses a major challenge to stage measurement and link stitching. At present there is no unified call specification across the client layers, and without a topology the call link cannot be reconstructed, so end-to-end tracing is out of reach.
- No unified measurement standard for optimization: In the past, each R&D framework defined its performance measurement standard in its own closed loop. Whether for native client technology or cross-platform technology, data was collected with generic standards from a technology-oriented perspective. As a result, implementations and performance differed greatly from business to business, and the measurements were, to put it plainly, not close to what users actually perceive. Online data therefore struggled to reflect the real situation and trends. Over a long period, Taobao's experience kept deteriorating, and each year it was improved mainly through campaign-style optimization that could not be sustained as routine practice.
- High cost of mobile PaaS enablement: After the group's SDK components are exported to the BUs, the basic capabilities are embedded in the host environments of different Apps and run into the same kinds of problems described above. For BU engineers the infrastructure is a black box. If a problem involves the infrastructure, troubleshooting is even harder, and there are no tools for self-service diagnosis. They can only ask for help, pulling various people into chat groups, which makes support costly.
The above are our reflections, from the perspective of App architecture, on the client's current shortcomings in O&M troubleshooting, measurement and monitoring, and full-link optimization. They also set the direction of our subsequent work.
Observable system
From monitoring to observability
Observability is a way of thinking; it places no specific requirements on technical implementation. What matters is applying it to our business iteration and problem insight. Traditional O&M may only give us an overview of top-level alarms and anomalies. When we need to locate errors at a deeper level, we usually pull people into a group and then hunt for the characteristics of the problem by hand; sometimes the developer of one module ends up analyzing the dependencies between all the modules. Solving a problem typically involves more than three roles (business, testing, development, architecture, platform, and so on).
Compared with traditional monitoring, observability lets us observe the health of the system better and locate and solve problems quickly by combining and correlating data. “Monitoring tells us which parts of the system are working, and observability tells us why they are not working.” Figure 2 illustrates the relationship between the two. Monitoring focuses on the macro view, while observability includes the scope of traditional monitoring.
(Figure 2 Relationship between monitoring and observability)
The core is still to observe the output of each module, the key calls, and the dependencies in order to judge the overall working state. These key points are summarized as Traces, Loggings, and Metrics.
Key observability data
(Figure 3 Key data of observability)
Combining the definitions of Traces, Loggings, and Metrics with the current situation at Taobao, we interpret them as follows:
- Loggings (logs): Built on the existing TLOG (wireless client-side logging system) log channel, these record events that occur while the App runs and intermediate output of program execution. They can describe the running state of the system in detail, for example page jumps, request logs, and global information such as CPU and memory usage. Most logs are not linked to one another. With the introduction of structured call-link logs, logs in call-chain scenarios can be converted to Traces once structured, which supports single-device troubleshooting.
- Metrics (indicators): Aggregated values used on macro dashboards. They come in various dimensions and indicators, but carry no detail and cannot be used to locate problems.
- Traces: The most standardized form of call log. In addition to defining the parent-child relationship between calls (generally through TraceID and SpanID), a trace records the details of each operation, such as service, method, attributes, status, and duration. Traces can replace part of Logs. In the long run, the Metrics of each module and method can also be obtained by aggregating Traces, but the log volume is large and storage is costly.
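As a rough illustration of the last point, that per-module Metrics can be derived by aggregating Traces, the following minimal sketch groups span records by module and computes a call count, average duration, and error rate. The `Span` fields and the `aggregate_metrics` function are hypothetical names chosen for this example, not part of TLOG or any Taobao system.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class Span:
    """One call record (hypothetical fields, for illustration only)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    module: str
    duration_ms: float
    ok: bool


def aggregate_metrics(spans: List[Span]) -> Dict[str, Dict[str, float]]:
    """Aggregate raw spans into per-module metrics."""
    by_module = defaultdict(list)
    for span in spans:
        by_module[span.module].append(span)
    return {
        module: {
            "count": len(group),
            "avg_ms": mean(s.duration_ms for s in group),
            "error_rate": sum(not s.ok for s in group) / len(group),
        }
        for module, group in by_module.items()
    }
```

The trade-off noted above still applies: the aggregate is cheap to keep, while the raw spans it is computed from are what make storage expensive.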
Full link observability architecture
The observability concepts above have been put into practice to some extent on the back end, but the characteristics and current state of the mobile field raise the following problems:
- Call specification: unlike the cloud, the client is fully asynchronous, has an extremely rich set of asynchronous APIs, and has no unified call specification;
- Multiple technology domains: there are many R&D frameworks, and their capabilities are black boxes to the outside, so stitching them together carries many costs that are hard to see in advance;
- Differences between client and cloud: the client side consists of a massive number of distributed devices, so the challenges of observability there are fundamentally different from those on the server. On the server, logs and metrics can be fully reported through a single system. On a single device, the volume of tracking points and the volume of logs differ greatly, which is why the tracking-point system and the logging system are separate on the client. The client has to balance single-device troubleshooting across massive numbers of devices with defining indicator trends over big data;
- Client-cloud association: in practice the client and the cloud have always been disconnected. From the client's perspective, how can we better perceive back-end state and associate the two? For example, how do we keep extending coverage of serverRT (back-end request processing time) from IDC to CDN, and how do we make the client's full-link identifier visible to the back end?
Therefore, we need to define the full link of the mobile technology field around the problems above, and build domain-level analysis capabilities and sound evaluation standards. This gives us deeper insight into mobile-side problems and lets us keep serving the group's Apps, including cross-domain problems, in troubleshooting and performance measurement.
(Figure 4. Definition of full-link observable architecture)
- Data layer: defines indicator specifications and collection schemes, and reports data based on OpenTracing;
- Domain layer: evolves from problem discovery to problem location, and builds up a continuous performance optimization system along with technology upgrades;
- Platform layer: benchmarks against other group Apps and competitors, combines online and offline indicators, brings in the device manufacturers' perspective, and drives App performance improvement;
- Business layer: provides an end-to-end view from the full-link perspective, serving not only client-side engineers but also cross-domain R&D staff working on other technology stacks.
Reviewing the goal of the full-link observability project, we set it as “building a full-link observability system, improving performance, driving improvement of the business experience, and improving the efficiency of problem location”. The following chapters focus on the practice at each layer.
OpenTracing observability architecture on mobile
Full-link composition
(Figure 5 End-to-end situation, detailed scene hierarchical diagram)
The existing end-to-end link is long, and the client side hosts a variety of R&D frameworks and capabilities. Although the back-end call link is clear, it is not connected to the client from a full-link perspective. Take a user browsing a product detail page as an example: opening the first screen triggers calls to modules such as Ultron, MTOP (wireless gateway), and DX in varying order. Each module has its own processing flow, and each stage has its own duration and status (success, failure, and so on). When the user then scrolls, the sequence of module calls is different again. Since these elements can combine freely across scenarios, the full link has to be defined along several dimensions based on actual user scenarios:
- Scenario definition: one user operation is one scenario, so a click and a scroll are separate scenarios. A scenario can also be a combination of several single scenarios;
- Capability layering: each scenario involves business, framework, container, and request calls, which can be layered by domain;
- Stage definition: each layer has its own stages. For example, the framework layer has four local stages, while the request layer can also include back-end processing stages;
- User journey: a journey consists of several scenarios.
The full link decomposes a complex large call into a finite number of small structured calls, from which various cases can be derived:
- a full link of “single scenario + single stage”;
- a full link of “single scenario + several layers + several stages”;
- a full link of “several scenarios + several layers + several stages”;
- …
Falco: based on the OpenTracing model
To support the industry-standard combination of Logs + Metrics + Tracing, the full-link project adopted the distributed call specification OpenTracing and used it to re-model the client architecture described above (the result is referred to below as Falco).
The OpenTracing specification is the model basis of Falco and is not reproduced here; for the complete design specification, see OpenTracing. IO/docs/overvi… . Falco defines the call-chain tracing model for the client domain. Its main table structure is as follows:
(Figure 6 Falco data table model)
- Span common header: the yellow section, which corresponds to the basic Span attributes of the OpenTracing specification;
- Scene: corresponds to OpenTracing baggage and is passed through from the root span to carry the business scenario. The naming rule is “business identifier_behavior”. For example, the first screen of the detail page is ProductDetail_FirstScreen and a detail refresh is ProductDetail_Refresh;
- Layer: corresponds to OpenTracing Tags and defines the concept of a layer. There are currently three: the business layer, the container layer, and the capability layer. Modules that handle business logic belong to the business layer, named Business; view containers such as DX and Weex belong to the container layer, named frameworkContainer; modules that provide a single atomic capability, such as MTOP and Picture, belong to the capability layer, named ability. Layers make it possible to compare the performance of different modules with the same capability at the same layer;
- Stages: corresponds to OpenTracing Tags and records the stages within one module call. Each layer is divided into key stages according to its domain model, so that different modules in the same layer are measured the same way. For example, DX and TNode can be compared on preprocessing, parsing, and rendering time to see where each is stronger or weaker. The preprocessing stage, for instance, is named preProcessStart; stage names can also be customized;
- Module: corresponds to OpenTracing Tags and represents a logical module, for example DX, MTOP, the picture library, or the network library;
- Logs: corresponds to OpenTracing Logs. Logs are recorded only to TLog and are not output to UT tracking points.
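To make the mapping above concrete, here is a minimal sketch of how such a span record could be modeled: the scene travels as baggage inherited from the root span, while layer, module, and stage timestamps are stored as tags. The class `FalcoSpan`, its field names, the tag keys `"layer"`, `"module"`, and `"scene"`, and the sample values are assumptions made for illustration. Only the layer names, `preProcessStart`, and the scene naming come from the text, and the actual Falco table schema is not shown in this article.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Layer names as given in the text; everything else here is illustrative.
LAYERS = ("Business", "frameworkContainer", "ability")


@dataclass
class FalcoSpan:
    # Span common header (OpenTracing base attributes).
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    operation: str
    start_ms: int
    # Baggage is copied to every descendant span; it carries the scene.
    baggage: Dict[str, str] = field(default_factory=dict)
    # Tags hold the layer, the module, and stage timestamps.
    tags: Dict[str, object] = field(default_factory=dict)
    # Logs are (timestamp_ms, message) pairs.
    logs: List[Tuple[int, str]] = field(default_factory=list)

    def mark_stage(self, name: str, at_ms: int) -> None:
        """Record a stage timestamp as a tag, e.g. preProcessStart."""
        self.tags[name] = at_ms

    def child(self, span_id: str, operation: str, layer: str,
              module: str, start_ms: int) -> "FalcoSpan":
        """Create a child span that inherits the trace ID and baggage."""
        if layer not in LAYERS:
            raise ValueError(f"unknown layer: {layer}")
        return FalcoSpan(
            trace_id=self.trace_id,
            span_id=span_id,
            parent_span_id=self.span_id,
            operation=operation,
            start_ms=start_ms,
            baggage=dict(self.baggage),
            tags={"layer": layer, "module": module},
        )


root = FalcoSpan("trace-1", "0", None, "openDetail", 1000,
                 baggage={"scene": "ProductDetail_FirstScreen"},
                 tags={"layer": "Business", "module": "Detail"})
render = root.child("0.1", "render", "frameworkContainer", "DX", 1005)
render.mark_stage("preProcessStart", 1006)
```

Because the scene is baggage rather than a tag, every span in the call tree can be attributed to a scenario without each module having to know which scenario it is serving.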
Falco: key points
(Figure 7 Key implementation of Falco)
- End-to-end traceID: generation satisfies the principles of uniqueness, fast generation, extensibility, readability, and short length;
- Call abstraction and restoration: the traceID, multi-level span serial numbers, and pass-through propagation make upstream and downstream relationships explicit;
- Client-cloud stitching: the core problem solved is stitching the client to the cloud. The client-side ID is passed through to the server, and the server stores its mapping to the EagleEye ID. The access layer returns the EagleEye ID, so the client-side full-link record also holds it. With this bidirectional mapping we can tell whether a request that did not return failed in the network stage, never reached the access layer, or was not answered by a business service. The familiar, coarse-grained “network problem” can thus be pinned down and explained;
- Layered measurement: the core purpose is to give modules of the same kind a consistent basis for comparison and to support horizontal performance comparison after a framework upgrade. The approach is to abstract a client-side domain model. Taking frameworks as an example, although frameworks differ, certain key calls and parsing steps are the same, so they can be abstracted into standard stages; other layers are handled similarly;
- Structured tracking points: first, columnar storage is used, which suits aggregation and compression of large data sets and reduces data volume; second, business + scenario + stage are consolidated into one table to make associated queries easy;
- Falco-based consolidation of domain problems: this covers key definitions of complex problems, trace logs for tracking problems, and tracking points for special needs. All domain-problem information is structured and stored in Falco, and domain developers can keep building analysis capabilities on this accumulated domain information. Only with an effective supply of data and integrated domain interpretation can deeper problems be defined and solved.
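The “multi-level span serial number” idea can be sketched as follows, assuming dotted IDs such as 0, 0.1, and 0.1.2, in which a child appends its sequence number to the parent's ID. The exact ID format Falco uses is not specified here, so `SpanIdAllocator`, `parent_of`, and `build_tree` are hypothetical helpers that only show why such IDs make the call tree recoverable from a flat list of records.

```python
from typing import Dict, List, Optional


class SpanIdAllocator:
    """Hands out dotted, multi-level span IDs (assumed format)."""

    def __init__(self) -> None:
        self._children: Dict[str, int] = {}

    def root(self) -> str:
        return "0"

    def child(self, parent_id: str) -> str:
        # The n-th child of span P gets the ID "P.n", counting from 1.
        n = self._children.get(parent_id, 0) + 1
        self._children[parent_id] = n
        return f"{parent_id}.{n}"


def parent_of(span_id: str) -> Optional[str]:
    """The parent is encoded in the ID itself: drop the last level."""
    head, sep, _ = span_id.rpartition(".")
    return head if sep else None


def build_tree(span_ids: List[str]) -> Dict[str, List[str]]:
    """Restore the call topology from a flat list of span IDs."""
    tree: Dict[str, List[str]] = {sid: [] for sid in span_ids}
    ordered = sorted(span_ids, key=lambda s: [int(p) for p in s.split(".")])
    for sid in ordered:
        parent = parent_of(sid)
        if parent in tree:
            tree[parent].append(sid)
    return tree
```

With IDs of this shape, no separate parent pointer has to be transmitted between modules: passing the current span ID downstream is enough for the callee to allocate its own child IDs.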
(Figure 8 Falco domain problem model)
Operation and maintenance practice based on Falco
O&M is a broad category. It revolves around the key steps of discovering a problem, taking it over, locating and analyzing it, and fixing it, and it spans everything from observing high-level indicators and alerting to single-device troubleshooting and log analysis. Everyone knows that doing this well requires many capabilities at each step, but in practice it is hard to achieve and the parties involved do not trust the results. The Taobao client has long had problems with indicator accuracy and log retrieval efficiency. For example, many of the Taobao App's APM performance indicators used to be inaccurate; business teams did not trust them, and they could not guide actual optimization. This chapter shares the Taobao App's optimization practice in indicator accuracy and log retrieval efficiency.
(Figure 9 Working back from the user's problem journey to the O&M system)
Macro index system
Taking the horizontal client-performance campaign as an opportunity, APM started upgrade work grounded in what users actually perceive. The core covers the visual and interactive indicators for startup, external links, and the various business scenarios. The main efforts to bring the end points of these indicators closer to users' perception are as follows:
- 8060 algorithm upgrade: visually useful elements (such as pictures and text) are extracted for the calculation, and elements the user cannot perceive (blank controls, placeholder images) are removed. A visual specification was also drawn up so that custom controls can be marked, for example for the picture library and skeleton (fishbone) images;
- H5 domain: supports the UC page-element visual and interactive algorithm and the backtracking algorithm of the front-end JSTracker (event tracking framework), aligning with the H5 page visual algorithm;
- Complex scenarios: a visual specification for custom frameworks was defined, Flutter and TNode (a dynamic R&D framework) were connected, the measurement standards of the R&D frameworks were calibrated, and each framework implements the 8060 algorithm itself;
- External links: the H5 page measurement standard was connected, and the negative “leave” actions for external-link traffic were redefined.
Take startup as an example. After APM calibration, the measurement includes the stage in which pictures appear on screen. The reported numbers went up, but they match what the business side asked for more closely.
(Figure 10 Startup data trend after calibration)
External links are similar: after H5 was connected, the numbers under the new standard also rose, but they are closer to what users perceive.
(Figure 11 External-link data before and after calibration)
Through this campaign, visual indicators for several R&D frameworks were connected and calibrated.
Single-machine screening system
Troubleshooting currently centers on TLOG. This round of work focused on the key links of log reporting, log analysis, and location and diagnosis (the corresponding pain points being: no logs, logs that cannot be understood, and problems that are hard to locate). This section describes what the O&M troubleshooting system did to make fault location more efficient.
(Figure 12 Core functions of single machine troubleshooting and locating)
- Raise the log upload success rate, so that logs are available when a problem is investigated. First, built-in proactive log upload is triggered at several points, in core scenarios or when a problem is reported (for example on public-opinion feedback or when a newly launched feature misbehaves), to raise the log availability rate. Second, TLOG itself was upgraded, including its sharding strategy, retries, and log governance, to fix the slow log uploads that users had often complained about. Finally, abnormal information of all kinds is collected as snapshots and reported over the MTOP link in offline mode to help reconstruct the scene;
- Improve log location efficiency. First, logs are classified, with quick filtering for page logs, full-link logs, and so on. Next, the full-link call topology of each scenario is connected so that the failing node can be seen at a glance and the problem dispatched quickly. Finally, errors, slowness, UI jank, and similar problems are structured, on the principle that each domain interprets its own problems. Jank logs, for example, come in several types, such as APM frozen frames, ANR, and main-thread stalls; business-side problems include request failures, request RT longer than XX, and blank pages. Connecting the capabilities of each domain improves the ability to diagnose and locate problems quickly;
- Build full-link tracing capability. EagleEye (Alibaba's back-end distributed tracing system) serves a large number of applications and produces a large volume of logs, so it has to sample them. Calls that miss the sample are cached for only 5 minutes, so within those 5 minutes we must find a way to tell EagleEye to keep the log longer. In stage one, the back-end parsing service extracts the EagleEye IDs of the call chain and notifies EagleEye to store the corresponding trace logs; after a successful notification they are kept for 3 days. In stage two, when the gateway detects an exception, the EagleEye ID is extracted and EagleEye is notified earlier. In stage three, similar to scenario tracking, the EagleEye trace logs of core scenarios are obtained and stored on the Ferris Wheel platform. Stage one is online and supports jumping to the associated EagleEye platform. Because it generally takes 5 minutes or more to get from the occurrence of a problem to its investigation, the success rate is not high; it needs to be raised with stages two and three, which are being planned and developed;
- Build platform capability on top of client-side full-link log parsing. For visualization, the content of full-link logs is displayed in structured form so that abnormal nodes can be spotted quickly. In addition, based on structured logs, the platform can quickly diagnose duration anomalies, interface errors, and data-size anomalies in full-link logs.
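The diagnosis step in the last item can be sketched as a set of simple rules applied to structured span records. This is a minimal illustration rather than the platform's actual logic: the `SpanRecord` fields, the status value `"failed"`, and the default thresholds (2000 ms and 512 KB) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpanRecord:
    """One parsed line of a structured full-link log (hypothetical shape)."""
    span_id: str
    module: str
    status: str          # e.g. "succeed" or "failed"
    duration_ms: float
    payload_bytes: int


def diagnose(spans: List[SpanRecord],
             slow_ms: float = 2000.0,
             max_payload_bytes: int = 512 * 1024) -> List[Tuple[str, str]]:
    """Return (span_id, finding) pairs for the three anomaly types."""
    findings: List[Tuple[str, str]] = []
    for span in spans:
        if span.status == "failed":
            findings.append((span.span_id, "interface error"))
        if span.duration_ms > slow_ms:
            findings.append((span.span_id, "duration anomaly"))
        if span.payload_bytes > max_payload_bytes:
            findings.append((span.span_id, "data-size anomaly"))
    return findings
```

Rules like these only become possible once logs are structured: on free-text logs, each check would need its own fragile text-matching pattern.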
The above are some of our O&M efforts this year. The aim is to use technology upgrades to replace process-based enablement with technical enablement in troubleshooting.
Next are examples from Taobao's own practice and from other group Apps that have adopted the system.
Full-link operation and maintenance practice
Investigating lag in Taobao
Colleagues reported that when using the Taobao App overseas, the App lagged and some pages could not be opened. Following the feedback troubleshooting process, we retrieved the TLOG logs.
- The “full-link visualization” function (Figure 13) shows that the network span with spanID 0.1 on the H5 page has the status “failed”, which is why the page could not be opened.
- The duration-anomaly feature of “full-link diagnosis” (Figure 14) shows that many network durations fall at 2s and 3s+, with some exceeding 8s. The time is spent in the request call (transmission) stage, which is related to overseas users' slow access to Alibaba's CDN nodes.
(Figure 13 Full-link visualization function)
(Figure 14 Full-link stuck diagnosis function)
Ele.me main link access
Cold start full link
(Figure 14 Ele.me Full link view – Cold Start full link)
Store full link
(Figure 15 Ele.me Full link view – Shop full link)
Optimization practice based on Falco
New index system
This section describes how, based on the Falco observability model, we built an online performance baseline from the end-to-end full-link perspective and used data to drive continuous improvement of the Taobao App experience. The first step is to build the indicator system, which mainly includes the following:
- Indicator definitions and specifications: staying close to what users perceive, indicators are defined along the user's journey from click, to content rendering, to scrolling. They cover technical scenarios such as page opening, content on screen, click response, and scrolling, and are measured by indicators such as page visible and interactive time, image-on-screen time, scrolling frame rate (finger-tracking), and frozen frames;
- Indicator measurement scheme: in principle, indicators belong to their own domains. Jank indicators, for example, can follow the manufacturer's standard (Apple MetricKit), a self-built standard (such as APM's main-thread jank and ANR), or indicators customized by business domain (such as full-link scenarios), for example MTOP request failure or the detail page's header image appearing on screen;
- Indicator composition: indicators are collected both online and offline. Based on online and offline data and the related specifications, App experience optimization is driven from the user's perspective and the competitive landscape.
(Figure 16 App Performance index System)
- For example, APM defines scrolling-related indicators as follows:
(Figure 17 Definition scheme of APM related indicators)
- Take the full-link scenario as another example. For one user interaction in a specific business, it covers the whole call link from the front end to the server and back to the client. The detail-page first-screen scenario is defined as follows:
(Figure 18 Full-link scenario – Details first screen Definition)
And so on.
Optimization under the new index system
In FY22, platform technology took the full-link perspective, treated experience as the outcome, and went deep into business development for optimization. Around indicator definitions and the breakdown of problem domains, it ran major dedicated optimizations targeting what users actually perceive. From the bottom up, these include: general network-layer strategy optimization across the request lifecycle, evolving from connection to transmission to timeout strategies; technical strategy upgrades aimed at user perception, such as gateway and image optimization; technical refactoring for business scenarios, such as preprocessing and preloading in the venue framework and a lightweight security guard; and even experience grading in the business, for example not enabling on-device intelligence for the home-page information feed on low-end devices. The relevant practices are introduced below.
(Figure 19 Technical Solution for full-link optimization of Taobao App)
Request streamlining and speed-up: minimal invocation practice
Take an MTOP request as the scenario. The link mainly involves the interaction between MTOP and the network library. Analysis of the current full-link thread model shows the following reasons why a request is slow on its way from MTOP to the network layer:
- Multiple data copies: under the existing network-layer mechanism, the network library intercepts requests through a hook on NSURLConnection and the “URL Loading System” and then forwards them to the network library for transmission. This involves multiple data copies, and the interception itself is time-consuming;
- Multiple thread switches: the thread model is overly complex, and completing one request involves frequent thread switches;
- Asynchronous turned synchronous: the original implementation processes tasks with an NSOperationQueue. The underlying queue binds request and response together, so after sending, an operation is released only once its response arrives. An “HTTP operation” thus holds the complete I/O of one HTTP send-and-receive. This violates the natural parallelism of network requests, and the operation queue easily fills up and blocks.
These problems are most pronounced when many requests arrive at once and competition for system resources is intense (for example at cold start, when dozens of requests rush in).
(Figure 20 before and after thread model optimization – minimalist invocation)
In the new scheme, MTOP calls the network library interface directly to improve performance:
- Simplified thread model: the hook mechanism of the system URL Loading System is bypassed, and the thread switching for sending and receiving data is streamlined, reducing the number of thread switches;
- Avoiding congestion on weak networks: the sending and receiving of data packets are split, so a long air-interface RT does not limit I/O concurrency;
- Replacing the deprecated API: the old NSURLConnection path is replaced with direct calls to the network library API.
Data effect: the gain is more pronounced where system resources are tighter, for example on low-end devices.
(Figure 21 A/B optimization gains of minimalist invocation)
Weak-network strategy optimization: Android multi-channel network practice
When the Wi-Fi signal is poor or the network is weak, repeated retries sometimes do little to improve the success rate. The system provides a capability that lets a device send requests over the cellular interface while it is on Wi-Fi. The network application layer can use this to reduce timeout errors and raise the request success rate.
From Android API level 21 onward, the system provides a new way to obtain network objects, which applications can use to obtain a connected cellular network even when the device already has a data connection over Ethernet.
Therefore, when both Wi-Fi and cellular networks are available on the user's device, requests can be dispatched to Wi-Fi and cellular at the same time under a specific policy to accelerate the network.
Core changes:
- Precondition: a cellular network is available while the device is on Wi-Fi;
- Trigger: when a request has returned no data for longer than a set time, a retry over the cellular network is triggered. The original request is not interrupted. Whichever channel returns first is used for the response, and the request that returns later is cancelled;
- Timing control: the threshold is configured per scenario through Orange and needs to be adjusted dynamically according to network strength;
- Product form and compliance: when the feature is used, the user is shown the notice “improving browsing experience by using WIFI and mobile network at the same time, which can be turned off in Settings – General”. The pop-up is shown the first time the feature is triggered after each launch.
(Figure 22 Android multi-channel Network capability optimization + user compliance authorization)
Data effect: when competition for network resources is intense, the Wi-Fi + cellular dual-channel scenario gives a clearer improvement in long-tail latency and timeout rate. In the AB test, home page API performance at P99/P999 improved by 23%/63% respectively, and the error rate fell by 1.19‰. Home page image performance at P99/P999 improved by 12%/58% respectively, and the error rate fell by 0.41‰.
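The trigger and race logic from the core changes above can be sketched roughly as follows. This is a minimal illustration in plain Java, not the Taobao implementation: the class `MultiChannelRace`, the method `race`, and the `switchAfterMs` parameter are hypothetical names, and the two channels are modelled as suppliers of `CompletableFuture`. On Android, the cellular channel would typically be obtained with `ConnectivityManager.requestNetwork` and a `NetworkRequest` that asks for `NetworkCapabilities.TRANSPORT_CELLULAR`; that part is left out so the sketch stays self-contained.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical sketch: retry on a second channel after a silent period, first response wins. */
final class MultiChannelRace {
    // Daemon threads so the timer pool never keeps the JVM alive.
    private static final ScheduledExecutorService SCHEDULER =
            Executors.newScheduledThreadPool(1, r -> {
                Thread t = new Thread(r, "multi-channel-timer");
                t.setDaemon(true);
                return t;
            });

    /**
     * Starts the request on Wi-Fi. If no result has arrived after switchAfterMs,
     * starts the same request on cellular without interrupting the original.
     * Whichever channel completes first settles the result; the other is cancelled.
     */
    static <T> CompletableFuture<T> race(Supplier<CompletableFuture<T>> wifi,
                                         Supplier<CompletableFuture<T>> cellular,
                                         long switchAfterMs) {
        CompletableFuture<T> result = new CompletableFuture<>();
        CompletableFuture<T> primary = wifi.get();
        forward(primary, result);

        ScheduledFuture<?> timer = SCHEDULER.schedule(() -> {
            if (result.isDone()) {
                return; // Wi-Fi answered in time, no need for cellular
            }
            CompletableFuture<T> backup = cellular.get();
            forward(backup, result);
            // If Wi-Fi wins after all, drop the late cellular request.
            result.whenComplete((v, e) -> backup.cancel(true));
        }, switchAfterMs, TimeUnit.MILLISECONDS);

        result.whenComplete((v, e) -> {
            timer.cancel(false);   // no-op if the timer already fired
            primary.cancel(true);  // no-op if Wi-Fi was the winner
        });
        return result;
    }

    /** Copies the outcome of one future into another; later outcomes are ignored. */
    private static <T> void forward(CompletableFuture<T> from, CompletableFuture<T> to) {
        from.whenComplete((v, e) -> {
            if (e != null) {
                to.completeExceptionally(e);
            } else {
                to.complete(v);
            }
        });
    }
}
```

Cancelling a `CompletableFuture` only marks it as cancelled; a real client would also need to abort the underlying connection. In this sketch the threshold is a fixed argument, whereas the article describes it as remotely configured and adjusted to network strength.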
Technical Policy grading – Picture grading practice
Device performance varies widely and services keep growing more complex. On low-end devices many services cannot deliver the intended effect and instead cause problems such as lag. In the past we optimized performance with “delay, concurrency, preloading”, but this only worked around the problem, and the core link still had to pay for the time-consuming critical calls. We therefore need to grade the business experience and handle business processes according to that grade: high-end devices get the most complete and sophisticated experience, while low-end devices can still use the core functions smoothly. The ideal outcome is the best combination of user experience and core business metrics; failing that, we degrade some functions (without affecting core business metrics) in exchange for better performance. The initial plan is to achieve this in two steps:
- In the first stage, business grading needs a rich policy library and set of judgment conditions. We will build this general capability into the core components so that businesses can quickly implement grading.
- In the second stage, once many businesses have adopted the grading capability and a large body of grading policies and AB data has accumulated, we can recommend and optimize grading policies for individual businesses, so that similar businesses can reuse them quickly and efficiency improves.
Traditional CDN adaptation rules dynamically assemble the “best” image size from factors such as the network, the view size, and the system, in order to reduce network bandwidth and bitmap memory footprint and to improve image loading on the device. From the device grading perspective, and based on the given UED specification, we made the compression parameters configurable and extended the existing CDN adaptation rules to implement image grading policies for different device models. This capability further reduces image size and speeds up getting images on screen.
(Figure 23 Device-based image grading rules)
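As a rough illustration of device-based image grading, the sketch below picks a requested image size and compression quality from a device tier. Everything in it is assumed for the example: the three tiers, the scale and quality values, and the `_<w>x<h>q<quality>.jpg` suffix format are hypothetical and are not the actual rules from the UED specification or the CDN.

```java
/** Hypothetical device tiers; the article does not name the actual tiers. */
enum DeviceTier { HIGH, MEDIUM, LOW }

/** Hypothetical sketch of per-tier image parameters layered on view-size adaptation. */
final class ImageGrading {
    /** Illustrative share of the view size (in percent) requested for each tier. */
    static int scalePercent(DeviceTier tier) {
        switch (tier) {
            case HIGH:   return 100;
            case MEDIUM: return 85;
            default:     return 70;
        }
    }

    /** Illustrative compression quality for each tier; lower means smaller files. */
    static int quality(DeviceTier tier) {
        switch (tier) {
            case HIGH:   return 90;
            case MEDIUM: return 75;
            default:     return 60;
        }
    }

    /** Builds a made-up size/quality suffix from the view size and the device tier. */
    static String cdnSuffix(int viewWidthPx, int viewHeightPx, DeviceTier tier) {
        int width = viewWidthPx * scalePercent(tier) / 100;
        int height = viewHeightPx * scalePercent(tier) / 100;
        return "_" + width + "x" + height + "q" + quality(tier) + ".jpg";
    }
}
```

With these assumed values, a 400x300 view on a low-tier device requests a 280x210 image at quality 60, while a high-tier device requests the full 400x300 at quality 90. In practice the values would be configuration rather than constants, which is what makes the compression parameters adjustable per device model.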
Lightweight Link Architecture – Secure Visa-free practices
When the app is pulled up from an external link, the path from launch through the “customs” request to landing page loading (the main requests still go through MTOP) invokes security signing several times. Signing is a CPU-intensive task with a pronounced long tail on low-end devices, and an overly long pull-up time causes traffic to bounce. In FY22 S1 we did a lot of performance optimization on the pull-up link for the “waves” business; better performance lowers the bounce rate. At present, security signing accounts for a large share of the time spent on the “customs” request, where the performance cost is greatest. We therefore want to make it possible to skip security signing (the “visa-free”, that is, signature-free, mode described here), which services can adopt according to their actual situation to increase the value of incoming traffic.
(Figure 24 Security Visa-free architecture changes)
- Gateway protocol update: the protocol now supports a visa-free setting and exposes a visa-free interface. If a service API is set as visa-free, the gateway passes the corresponding header to the network library.
- AMDC scheduling service: for stability, AMDC (the wireless network policy scheduling service) will in the short term dispatch these requests to the online security production environment, so the AMDC scheduling module decides from the description identifier whether to return the visa-free VIP to the client. Once the function is stable, requests will be dispatched flexibly to the online main-site environment.
- Migration of the validation module: the security extension capability is currently preinstalled in the AServer access layer. To reduce operation and maintenance cost, it will be migrated from AServer to security as a whole. AServer will then have no extension module, and security will enable validation and other functions according to API/header features.
- MTOP VWP error retry: in VWP (visa-free) mode, if a request fails with an invalid-signature error, the MTOP layer retries by degrading to the old link to protect the user experience.
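The error-retry behaviour in the last item can be sketched as follows. This is only an illustration under stated assumptions: `SignFreeFallback`, `Transport`, `Response`, and the `INVALID_SIGNATURE` code are hypothetical names and values, not the MTOP API, and the transport is reduced to one method that either signs the request or does not.

```java
/** Hypothetical sketch: try the signature-free path first, fall back to the signed path. */
final class SignFreeFallback {
    /** Made-up error code standing in for an "invalid signature" rejection. */
    static final int INVALID_SIGNATURE = 419;

    /** Minimal response model for the example. */
    static final class Response {
        final int code;
        final String body;

        Response(int code, String body) {
            this.code = code;
            this.body = body;
        }
    }

    /** Hypothetical transport: sends one request, signed or unsigned. */
    interface Transport {
        Response send(String api, boolean signed);
    }

    /**
     * Sends the request without signing when the API is marked signature-free.
     * If the server rejects it as an invalid signature, retries once on the
     * old signed link so the user still gets a response.
     */
    static Response request(String api, boolean signFreeEnabled, Transport transport) {
        if (signFreeEnabled) {
            Response unsigned = transport.send(api, false);
            if (unsigned.code != INVALID_SIGNATURE) {
                return unsigned; // fast path: no CPU spent on signing
            }
            // Otherwise degrade to the signed link below.
        }
        return transport.send(api, true);
    }
}
```

The fallback costs one extra round trip when the unsigned request is rejected, so this kind of scheme pays off only if rejections are rare for the APIs that opt in.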
Summary & Prospect
Conclusion: This article describes how, facing the existing challenges on mobile, we built observability through call-link tracing, standard logging, and scenario tracing, and how, from a full-link perspective and on top of this new observability, we built a full-link operation and maintenance system and a continuous performance optimization system. This fills in the call-chain tracing capability that mobile has long lacked, enables fast localization of problems in complex call scenarios, and replaces the inefficient manual troubleshooting of the past with a process that technology can take over. Around this capability we built full-link metrics and a full-link performance indicator system, took governance deep into business scenarios, upgraded the platform's technical capabilities, and used data to drive and track business experience improvements over the long term.
Shortcomings: Although scenarios in the Taobao App are being connected one after another, there is still a large gap to the goal of pinpointing a problem within 15 minutes, and many sticking points remain, such as the log upload success rate, the timeliness of obtaining server-side logs, the efficiency of problem localization, quality checks on data at the access source, the technical side's understanding of domain problems, and the continuous accumulation of structured information. Finally, the user experience of the product as a whole needs continuous optimization.
Outlook: Continuing the mobile-native technology concept of the Alibaba mobile technology group, we need to go deep into the mobile domain and face the challenges of multiple R&D frameworks (east-west) and the end-to-end full link (north-south) in order to improve both technology and experience. In the first phase of experience optimization in 2018 we introduced similar concepts and tried them on requests. Only now have we found a suitable structural and theoretical basis, carried out in-depth practice based on the characteristics of mobile, and continued to develop the definition of domain problems and the solution model. We hope to create an observability technology system for the mobile domain and turn it into lasting architecture.