Author: Xianyu Technology, Wu Hui

Background

In daily development, we receive feedback, inquiries, and one-off troubleshooting requests from users, partners, customer service, operations, product, and testing. Most of these problems reach a developer only after being passed along by several people.

Solving these problems is painful for developers: they go back and forth to obtain the necessary query parameters and pull business data from multiple platforms. If no platform tool exists, they must construct SQL queries, RPC service input parameters, cache keys, and so on to obtain the business data, and finally rely on years of business knowledge to find the answer in that data.

We can summarize this as two factors that hurt the efficiency of daily troubleshooting: problems are handed over many times, and the troubleshooting path is long.



Two problems a day, and half the day is gone?

This article introduces a general scheme for improving the efficiency of daily business troubleshooting. It has been applied at Xianyu with good results.

The scheme has the following characteristics:

One-click access to all associated business data: input parameters of different dimensions return consistent results;

Business data that is easy to understand and usable by everyone: data attributes are explained so that non-technical staff can run queries themselves;

Diagnosable business data: abnormal data is flagged when present.

Business data panorama and diagnosis

Overall approach

We divide the problems into two types:

Abnormal cases: the business data is abnormal (missing data, inconsistent status, etc.);

Business data queries: the data is normal, and someone needs to look up related data (document status, benefit redemption, etc.).

After categorizing the problems and abstracting the troubleshooting process, the approach becomes fairly clear:

1. Provide a general data aggregation query method that obtains all associated data by calling the services of various systems.

2. Interpret the business data and supplement its semantic meaning.

3. Abstract a set of business rules and use it to determine whether the data is abnormal.

In addition, the system should be highly extensible and cheap to onboard, so that other businesses can adopt it quickly.

The overall structure is as follows:

The following describes how to implement data panorama, business semantic interpretation, and business data diagnosis.

Data panorama

This part is the foundation of the entire system; all subsequent capabilities build on the aggregated query data. The aggregated query calls the interfaces of multiple systems. Each business depends on different services, so the query can only be written by the developer responsible for that business (others do not know which data to fetch). If writing the query costs too much, business teams will be reluctant to onboard and the scheme loses generality. We therefore chose GraphQL as the data aggregation query service. GraphQL: graphql.cn/

Xianyu has prior experience with GraphQL: a GraphQL statement fetched all the data at once, and the front end rendered directly from that data to build pages quickly. With business logic moved to the front end, the server only needs to focus on building stable domain services. This removes the data format conventions and front-end/back-end integration debugging from the development process and improves R&D efficiency. In many cases, GraphQL can also serve as a FaaS solution.

Take Xianyu's recycling business as an example. The full link of this business involves transaction, funds, Sesame Credit, Ant energy, commission settlement, and valuation data. Its GraphQL statement is roughly as follows:

As shown above, we input the order number bizOrderId, and the GraphQL executor calls multiple systems: the order service queryBizOrder, the valuation service HSF_quoteService, the agreement payment query service HSF_agreementpayBillQuery, the SPU query service HSF_spuQuery, the settlement query service settleBillQuery, and Tair (a distributed cache similar to Redis), which is used to query the user's business status. The aggregated data is obtained from six systems.



Using GraphQL to aggregate business data greatly reduces the cost for business teams, and in general all the data can be obtained without intruding on the systems.
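The fan-out behind such an aggregated query can be sketched in plain Java. This is not GraphQL and not Xianyu's implementation: the class PanoramaAggregator, the domain names, and the canned sample data are all hypothetical, and each lambda simply stands in for one remote service call that a GraphQL executor would resolve.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: every registered fetcher stands in for one remote
// service call. The sample fetchers return canned data, not real responses.
class PanoramaAggregator {
    private final Map<String, Function<String, Map<String, Object>>> fetchers =
        new LinkedHashMap<>();

    PanoramaAggregator register(String domain,
                                Function<String, Map<String, Object>> fetcher) {
        fetchers.put(domain, fetcher);
        return this;
    }

    // Calls every fetcher with the same order ID and merges the results
    // into one map keyed by domain name.
    Map<String, Object> query(String bizOrderId) {
        Map<String, Object> panorama = new LinkedHashMap<>();
        for (Map.Entry<String, Function<String, Map<String, Object>>> entry
                : fetchers.entrySet()) {
            try {
                panorama.put(entry.getKey(), entry.getValue().apply(bizOrderId));
            } catch (RuntimeException ex) {
                // One failing service should not hide the data from the others.
                panorama.put(entry.getKey(),
                    Map.of("error", String.valueOf(ex.getMessage())));
            }
        }
        return panorama;
    }

    // Illustrative wiring with made-up data; "settle" simulates a failing call.
    static PanoramaAggregator sample() {
        return new PanoramaAggregator()
            .register("order", id -> Map.of("bizOrderId", id, "status", 6))
            .register("quote", id -> Map.of("quoteId", "Q-" + id, "price", 2500))
            .register("settle", id -> { throw new IllegalStateException("timeout"); });
    }
}
```

Calling PanoramaAggregator.sample().query("1001") returns one map with an entry per domain. Recording a failure under its domain key, instead of aborting, keeps the rest of the panorama usable for troubleshooting.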

Business semantic transformation

Data conversion turns the GraphQL query data into understandable business data, including human-readable (Chinese) names for attribute fields and explanations of attribute values, so that it can be displayed visually and non-technical staff can understand it at lower cost. Three problems need to be solved here:

Data regrouping

In the example above, the “number of user credit orders” comes from Redis-like middleware rather than a specific domain service, but in terms of business logic it belongs to the “user” domain. To express this more intuitively, we classify it as a user attribute.

Uncertainty of input data

The query input supplied with a piece of feedback is not always of the same dimension. In the transaction link, for example, the user may provide an order ID, refund order ID, capital order ID, benefit ID, and so on. These IDs map to one another one-to-one. If only the order number can be used for queries, you sometimes need at least one reverse lookup to get the order ID before using the tool. To achieve “one-click direct” access, we need to support multi-dimensional input. In the example above, we can also obtain the aggregated data from a valuation ID:



You can see that the data structures returned by the two GraphQL statements are different, but the valid business data is the same. We call this heterogeneous data.

Heterogeneous data unification

Different input parameters lead to different GraphQL statements and differently structured results, which hinders business field interpretation and data rule checking. We therefore need to merge the heterogeneous data into one fixed data structure that serves both as the unified page display and as the input to subsequent data diagnosis.

We use JSONPath to convert this into a uniform data structure and regroup the data:
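A minimal sketch of the idea, assuming the query results are nested maps. To stay self-contained it uses a tiny dotted-path reader as a stand-in for real JSONPath expressions (for example "order.quote.price" in place of $.order.quote.price). The class name, field names, and sample values are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class HeterogeneousUnifier {
    // Resolves a dotted path against nested maps; returns null if any step
    // is missing. Stand-in for evaluating a JSONPath expression.
    static Object read(Map<String, Object> root, String path) {
        Object current = root;
        for (String key : path.split("\\.")) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }

    // Builds the unified structure: target field name -> value at source path.
    static Map<String, Object> unify(Map<String, Object> source,
                                     Map<String, String> fieldToPath) {
        Map<String, Object> unified = new HashMap<>();
        fieldToPath.forEach((field, path) -> unified.put(field, read(source, path)));
        return unified;
    }

    // Shape when querying by order ID: the quote hangs under the order.
    static Map<String, Object> fromOrderQuery() {
        Map<String, Object> quote = Map.of("quoteId", "Q-1", "price", 2500);
        Map<String, Object> order = Map.of("bizOrderId", "1001", "quote", quote);
        Map<String, Object> result = Map.of("order", order);
        return unify(result, Map.of(
            "orderId", "order.bizOrderId",
            "quoteId", "order.quote.quoteId",
            "quotePrice", "order.quote.price"));
    }

    // Shape when querying by valuation ID: the order hangs under the quote.
    static Map<String, Object> fromQuoteQuery() {
        Map<String, Object> order = Map.of("bizOrderId", "1001");
        Map<String, Object> quote =
            Map.of("quoteId", "Q-1", "price", 2500, "order", order);
        Map<String, Object> result = Map.of("quote", quote);
        return unify(result, Map.of(
            "orderId", "quote.order.bizOrderId",
            "quoteId", "quote.quoteId",
            "quotePrice", "quote.price"));
    }
}
```

Both entry points produce the same unified map, so field interpretation and rule checks only need to be written once against the unified field names.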

In addition to JSONPath, we implemented some secondary value conversions for other cases:

State value conversion: usually used to translate enumerable state values, for example transaction status = 6 is rendered as transaction status = 6 (transaction success);

Virtual attribute values: similar to macros, for example “is an overpriced order” = eval(order.price>2000), which evaluates an expression to fill in an attribute field that does not already exist.
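The two conversions can be sketched as follows. This is an illustration in plain Java, not the article's eval mechanism: the class ValueConverters and the field names tradeStatus and isOverpricedOrder are hypothetical, and only the status value 6 and the 2000 price threshold come from the text above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ValueConverters {
    // Hypothetical enum dictionary; only status 6 is taken from the article.
    static final Map<Integer, String> TRADE_STATUS = Map.of(6, "transaction success");

    // State value conversion: keeps the raw value and appends its meaning.
    // Unknown values fall back to the raw number.
    static String describeStatus(int status) {
        String label = TRADE_STATUS.get(status);
        return label == null ? String.valueOf(status) : status + "(" + label + ")";
    }

    // Returns a copy of the order with the status translated and the
    // virtual attribute "isOverpricedOrder" computed from the price.
    static Map<String, Object> enrich(Map<String, Object> order) {
        Map<String, Object> enriched = new LinkedHashMap<>(order);
        enriched.put("tradeStatus", describeStatus((Integer) order.get("tradeStatus")));
        enriched.put("isOverpricedOrder",
            ((Number) order.get("price")).doubleValue() > 2000);
        return enriched;
    }
}
```

Keeping the raw value next to its label lets developers still match the displayed data against logs and database rows.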

We now have a tool to query business panorama data. You can use it as a query console for business data.

With unified business data, the next step is to implement diagnostic capabilities for business data.

Business diagnosis

From a development perspective, determining whether data is abnormal is a piece of logic like this:

    if (actual value != expected value) { print abnormal result }

Here [expected value] is a data set, which is expressed as:

When data A appears in the data, the data should also have results [B1,B2,B3].

We refined the business logic to form a business data rule expression model:

These rules can be derived from business test cases (TCs) and can be visualized to help newcomers learn the business logic.
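The "when A appears, expect [B1, B2, B3]" model can be sketched in plain Java before any rule engine is involved. The classes Rule and Expectation, the helper number, and the sample rule are hypothetical names for illustration; the sample rule mirrors the credit-order check used in this article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class RuleModel {
    // One expected result (a "B"): a readable message plus its check.
    static final class Expectation {
        final String message;
        final Predicate<Map<String, Object>> check;

        Expectation(String message, Predicate<Map<String, Object>> check) {
            this.message = message;
            this.check = check;
        }
    }

    // "When `when` holds (data A is present), every expectation must hold."
    static final class Rule {
        final String name;
        final Predicate<Map<String, Object>> when;
        final List<Expectation> expectations;

        Rule(String name, Predicate<Map<String, Object>> when,
             List<Expectation> expectations) {
            this.name = name;
            this.when = when;
            this.expectations = expectations;
        }
    }

    // Treats missing or non-numeric values as 0 so checks never throw.
    static double number(Object value) {
        return value instanceof Number ? ((Number) value).doubleValue() : 0.0;
    }

    // Runs all rules in turn and collects one message per failed expectation.
    static List<String> diagnose(Map<String, Object> data, List<Rule> rules) {
        List<String> errors = new ArrayList<>();
        for (Rule rule : rules) {
            if (!rule.when.test(data)) {
                continue; // precondition not met: rule does not apply
            }
            for (Expectation expectation : rule.expectations) {
                if (!expectation.check.test(data)) {
                    errors.add(rule.name + ": " + expectation.message);
                }
            }
        }
        return errors;
    }

    static List<Rule> sampleRules() {
        return List.of(new Rule(
            "creditOrderPaysWithCredit",
            data -> "1".equals(data.get("isCreditOrder")),
            List.of(new Expectation(
                "idleCreditPayAmount should be greater than 0",
                data -> number(data.get("idleCreditPayAmount")) > 0))));
    }
}
```

An empty list from diagnose means the data passed every applicable rule. A rule whose precondition does not match is skipped rather than counted as a failure.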

We chose QLExpress (github.com/taobao/qlex…) to express and execute these rules. For example:

    if (order.isCreditOrder == "1") then {
        return order.idleCreditPayAmount > 0;
    } else {
        return true;
    }

The rule above is executed as follows:

    // defaultContext is the converted data, used as the QLExpress context
    Object executeResult = QLExpressUtil.execute(ruleExpress, defaultContext,
        errorList, false, false);
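    // Convert the execution result; an empty result counts as success. If the result is a failure, run the formater once more to get the error message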
    QLExpResult qlExpResult = buildQlExpResult(context, executeResult, formater);
    if (!qlExpResult.getSuccess()) {
        errors.add(DiagnosisError
                     .of(ErrorLevel.BIZ_ERROR.name(), ruleName, String.valueOf(qlExpResult.getData())));
    }

After all the rules are executed in turn, the results are displayed together with the business data, giving panoramic business data with diagnostic results:

Overall execution process

Finally, to summarize the entire implementation process:

These are the core capabilities behind panoramic troubleshooting and diagnosis of daily business data. With a few page-based configuration interfaces, multiple businesses can be onboarded quickly.

Comparison of application effects

Finally, we compare the troubleshooting efficiency with and without this tool across several dimensions:

Capability extensions

In addition, we have extended more capabilities to more efficiently support day-to-day operations:

Integration with a DingTalk Q&A bot

You can troubleshoot from your phone when you are on vacation, resting, or away from your computer, handling issues faster and avoiding a potential breakup warning.

Automated test case verification

If the business system interacts with external (partner) systems, internal replay tools cannot ensure that the external services behave normally. With the diagnostic capability, we can verify at the data level whether anything is abnormal:

Conclusion

By analyzing and categorizing daily problems and abstracting the troubleshooting process, we arrived at a general solution that improves the efficiency of one-off problem troubleshooting. It serves product, operations, customer service, R&D, and testing staff, is extensible to new businesses at low onboarding cost, and significantly reduces the cost of troubleshooting. The implementation mainly consists of three parts: GraphQL for the aggregated query of business data; JSONPath for regrouping heterogeneous data from different dimensions and interpreting business semantics; and QLExpress for expressing and executing business data rules to diagnose the data.

Of course, there are still many cases where one-click troubleshooting is not possible. We will continue to optimize toward the goal of “problem to me”: combining log retrieval and user behavior playback tools to provide one-click access across more dimensions; handling abnormal data, including data correction, approval, and guidance; providing troubleshooting tools for customer service, Q&A, and other scenarios, with sensitive data classification to prevent data leakage; and consuming business messages for reconciliation, monitoring, and alerting.