Preface

Logan is the basic mobile logging component of Meituan-Dianping Group. The name combines “Log” and “An” and stands for an individual logging service. Logan is also the name of “Uncle Wolverine,” and we hope the product will be as sharp as he is.

Logan has been iterating steadily for over a year. At present, most apps at Meituan-Dianping have integrated Logan and use it to collect, upload, and analyze logs. We have recently decided to open source the storage SDK part of the Logan ecosystem (Android/iOS), hoping to help more developers solve the pain points of log collection on mobile. We also welcome developers from the community to build the Logan ecosystem with us. The GitHub project can be found at github.com/Meituan-Dia…

Background

As the business grows, the volume of logs on mobile devices grows with it. Yet there has been no systematic way to handle mobile logs: in most cases each kind of log is processed separately, and problems are located from those separate results. Once the user base reaches a certain scale, however, many “difficult problems” can no longer be solved this way. One of the biggest headaches for mobile developers is: “Why can’t I reproduce the bug when I use the same phone model and the same system version as the user and follow the user’s steps?” This is especially true for Android developers, because phone models, system versions, and network environments vary so widely. It is not surprising that even with exactly the same phone, the bug still cannot be reproduced. Many developers will recognize the following scene:

User (boss): I found that the XX page of our app cannot be opened and the UI is not displayed. Please follow up on this problem.

You: OK.

So we checked the phone model and system version, found a phone with the same model and version, and tried to reproduce the problem, only to find that everything worked. We called the user again to ask exactly what they had done and what network they were on, then tried again and still failed. Finally, we checked the crash logs, the network logs, and the event-tracking logs (which had not been reported yet).

Your inner monologue: Strange. There is no crash and the network is reachable, so why is the UI not displayed?

A few hours later…

User (boss): Any result on this problem?

You: I have tried every way I can think of to reproduce it… It is still not clear what caused the problem.

User (boss): So it’s my fault?

You:…

If a bug is viewed as a “crime scene”, the developer is the “detective” solving the case. After the crime, the detective gathers clues by various means and deduces how the crime unfolded, just as the developer queries various logs to work out what the app went through on the user’s phone during that period. Generally speaking, traditional log collection methods have the following shortcomings:

  • Logs are not reported in time. Reporting logs requires network requests, and frequent network requests drain the battery. SDKs therefore usually report logs only after a certain amount has accumulated or at fixed intervals.
  • Limited information is reported. Because logs are reported over the network fairly often, each log is usually kept small to save users’ data traffic. This is especially true of high-frequency, real-time logs such as network logs.
  • Log islands. Logs of different types are reported to different log systems and are isolated from each other.
  • Logs are incomplete. As the number of log types grows, some log SDKs sample the logs they report.

Facing the challenge

Within Meituan-Dianping Group there are more than 20 kinds of mobile logs, and the number keeps growing as the business expands. At this scale, the shortcomings mentioned above are magnified enormously.

Logs are not all reported to a single system. Developers may need to look through different types of logs in several systems, which greatly increases the cost of locating a problem. Coming to work every day to find difficult bugs still hanging unresolved is painful. It is like a detective facing a hard case who has tried every means of collecting clues and still has nothing; the frustration is easy to imagine. The way we collect logs to reproduce user bugs is very similar to the way detectives solve crimes: collecting clues to piece together a relatively complete crime scene. If we stick to this line of thinking, there is no better way to deal with these problems.

However, although debugging resembles detective work, it is not the same. We are dealing with bugs, not real cases. In other words, our “victim” is still visible: we can get more information from it and even have a “heart-to-heart talk” with it. Put differently, where we used to piece together the user’s scenario from assorted logs, could we instead first obtain all the logs the user produced during the period in which the bug occurred (unsampled and more detailed), and then aggregate and analyze those logs (filtering out the irrelevant ones) to reconstruct the scenario in which the bug appeared?

Case analysis

The focus of the new approach shifts from “logs” to “users,” which we call “case analysis.” Put simply, the traditional approach gathers logs scattered across systems and pieces together the scenario in which the problem occurred, while the new approach aggregates and analyzes all the logs generated by one user to find that scenario. We explored this at the technical level, and the new solution needs to meet the following functional requirements:

  • Support collection of multiple kinds of logs, unify the underlying log protocol, and hide the differences between log types.
  • Record logs locally and report them only when necessary, so that logs are not lost.
  • Keep log content as detailed as possible, without sampling.
  • Make log types extensible, so that the upper layer can define its own.
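To make these requirements more concrete, here is a minimal Python sketch of what a unified, type-extensible log record and a local store might look like. The names `LogEntry` and `LocalLog`, the numeric type codes, and the JSON-lines encoding are all hypothetical choices made for illustration; they are not Logan’s actual protocol or API.

```python
import json
import time
from dataclasses import dataclass, field
from typing import List


@dataclass
class LogEntry:
    """One record in a single envelope shared by every log type (hypothetical layout)."""
    log_type: int   # code defined by the upper layer, e.g. 1 = code-level, 2 = network
    content: str    # free-form payload; each log type decides what goes in here
    timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))

    def to_line(self) -> str:
        # Every type serializes to the same envelope, so storage never inspects the type.
        return json.dumps({"t": self.log_type, "c": self.content, "ts": self.timestamp_ms})


class LocalLog:
    """Keeps entries locally and hands them over only when a report is requested."""

    def __init__(self) -> None:
        self._lines: List[str] = []

    def write(self, entry: LogEntry) -> None:
        # No sampling: every entry that is written is kept.
        self._lines.append(entry.to_line())

    def report(self) -> List[str]:
        # Called only "when necessary"; returns everything recorded so far and clears it.
        pending, self._lines = self._lines, []
        return pending
```

Because the type is just a field of the envelope, adding a new log type in this sketch means choosing a new code in the upper layer; the storage side does not change.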

Technically, the solution also needs to be:

  • Lightweight, with as small a footprint as possible
  • Easy to use through a simple API
  • Non-invasive
  • High-performance

Architecture

Against this background, Logan was born. Its core system consists of four modules:

  • Log input
  • Log storage
  • Back-end system
  • Front-end system

Best practices

Log input

Common log types include code-level logs, network logs, user behavior logs, crash logs, and H5 logs. These form Logan’s input layer: a copy of each log can be stored in Logan without affecting the original logging functionality. Logan offers two advantages here. Log content can be richer, since more information can be carried at write time, and there is no sampling. Logs are uploaded together only at an appropriate time, which saves users’ data traffic and battery.

Take network logs as an example. Normally, network logs record only fields such as end-to-end latency, request size, and response size, and they are sampled. In Logan, network logs are not sampled, and in addition to the fields above they can record the request headers, the response headers, and the original URL.

Log storage

The Logan storage SDK, which is the focus of this open source release, addresses several shortcomings of most mobile logging libraries in the industry:

  • Lag that affects performance
  • Log loss
  • Security
  • Scattered logs

Logan’s own log protocol solves the problem of aggregating logs in local storage. It processes data in a “compress first, then encrypt” order and does both as a stream, which avoids CPU spikes and lowers overall CPU usage. A cross-platform C library formats data according to the log protocol and splits large logs into fragments. An mmap mechanism is introduced to prevent log loss, and AES is used to encrypt logs to keep them secure. Because the core logic is implemented in the C layer, Logan gains cross-platform support and better performance while addressing these pain points.
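The reasoning behind “compress first, then encrypt” can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not Logan’s C implementation: `zlib` stands in for the compressor, and `keystream_xor` is a toy stream cipher used only as a stand-in for AES (the standard library has no AES); the class and function names are hypothetical.

```python
import hashlib
import zlib


def keystream_xor(data: bytes, key: bytes, offset: int = 0) -> bytes:
    """Toy stream cipher (SHA-256 counter keystream, XOR). NOT AES; illustration only."""
    out = bytearray()
    for i, byte in enumerate(data, start=offset):
        # Keystream block number i // 32, byte i % 32 within that block.
        block = hashlib.sha256(key + (i // 32).to_bytes(8, "big")).digest()
        out.append(byte ^ block[i % 32])
    return bytes(out)


class StreamingLogWriter:
    """Compresses each log line as it arrives, then encrypts the compressed bytes."""

    def __init__(self, key: bytes) -> None:
        self._key = key
        self._compressor = zlib.compressobj()
        self._offset = 0            # position in the keystream
        self.output = bytearray()   # stands in for the log file

    def _encrypt_and_store(self, chunk: bytes) -> None:
        self.output += keystream_xor(chunk, self._key, self._offset)
        self._offset += len(chunk)

    def write(self, line: bytes) -> None:
        # Work is done per line, so there is no large batch job at upload time.
        self._encrypt_and_store(self._compressor.compress(line))

    def close(self) -> bytes:
        self._encrypt_and_store(self._compressor.flush())
        return bytes(self.output)


def restore(blob: bytes, key: bytes) -> bytes:
    """Reverse order on the receiving side: decrypt first, then decompress."""
    return zlib.decompress(keystream_xor(blob, key))
```

The order matters because encrypted bytes look random and no longer compress: in this sketch, compressing repetitive log lines before encryption yields a much smaller result than applying `zlib.compress` to the already encrypted text.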

To save storage space on the phone, Logan keeps only the logs generated in the most recent 7 days; expired logs are deleted automatically. On Android devices, Logan keeps logs in the app sandbox to protect the log files.

For details, see the article “Meituan-Dianping’s Basic Mobile Logging Library: Logan”.
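A 7-day retention rule of this kind could be sketched as follows. The sketch assumes, purely for illustration, that there is one log file per day named `YYYY-MM-DD.log`; that naming scheme and the function name `delete_expired_logs` are hypothetical and do not describe Logan’s real file layout.

```python
import datetime
import os
from typing import List


def delete_expired_logs(log_dir: str, today: datetime.date, keep_days: int = 7) -> List[str]:
    """Delete dated log files outside the retention window; return the deleted names."""
    # With keep_days=7 the window is today plus the 6 days before it.
    cutoff = today - datetime.timedelta(days=keep_days)
    deleted = []
    for name in sorted(os.listdir(log_dir)):
        stem, ext = os.path.splitext(name)
        if ext != ".log":
            continue
        try:
            day = datetime.date.fromisoformat(stem)
        except ValueError:
            continue  # not a dated log file; leave it alone
        if day <= cutoff:
            os.remove(os.path.join(log_dir, name))
            deleted.append(name)
    return deleted
```

Passing `today` in as a parameter, instead of reading the clock inside the function, keeps the rule easy to test.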

Backend system

The back end is the data center that receives and processes logs; it is Logan’s brain. It has four main functions:

  • Log reception
  • Log parsing and archiving
  • Log analysis
  • Data platform

Log reception

The client can report logs in two modes: active reporting and retrieval reporting. Active reporting happens when customer service guides the user to report, or when instrumentation in the app triggers a report on specific behaviors (such as a user complaint). In retrieval reporting, the back end sends a retrieval command to the client, which then uploads its logs; this mode is not described further here. All reported logs are received by the Logan back end.

Log parsing and archiving

The logs reported by the client are encrypted and compressed, so the back end must decrypt, decompress, and restore the data, and then archive it in structured storage.

Log analysis

Different types of logs are made up of different fields, each carrying its own information. Network logs contain information such as the requested interface name, end-to-end latency, packet size, and request headers. User behavior logs contain information such as page opens and click events. All types of logs are analyzed and their information is linked together to form a complete personal log.
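One simple way to link logs of different types into a single personal log is to merge them by timestamp. The Python sketch below assumes that each type arrives as a list of records already sorted by a `ts` field; the field names and the function `build_personal_log` are hypothetical and do not describe the Logan back end’s actual data model.

```python
import heapq
from typing import Dict, Iterable, List


def build_personal_log(*streams: Iterable[Dict]) -> List[Dict]:
    """Merge per-type log streams, each sorted by 'ts', into one chronological log."""
    # heapq.merge reads the streams lazily and compares records by the key only.
    return list(heapq.merge(*streams, key=lambda record: record["ts"]))


network_logs = [
    {"ts": 100, "type": "network", "msg": "GET /shop 200, 340 ms"},
    {"ts": 420, "type": "network", "msg": "POST /order 500, 1200 ms"},
]
behavior_logs = [
    {"ts": 90, "type": "behavior", "msg": "open ShopPage"},
    {"ts": 400, "type": "behavior", "msg": "click Buy"},
]

personal_log = build_personal_log(network_logs, behavior_logs)
```

Read in order, the merged records tell one story: the user opened a page, a request succeeded, the user clicked a button, and the following request failed.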

Data platform

The data platform is the data source for the front-end system and for third-party platforms. Because personal logs are confidential, access to the data goes through a strict permission review. The data platform also collects past cases, extracts the characteristics of each problem, records the solutions, and uses them to offer suggestions for new cases.

Front-end system

A good front-end analysis system makes it possible to locate problems quickly and improves efficiency. Developers search for logs in the Logan front-end system and open the log details page to view the specific content, and from there locate and solve problems.

At present, the Logan front-end log details page within the group has the following functions:

  • Log visualization. All logs are structured and displayed in chronological order.
  • Timeline. Log data is visualized, and its semantics are conveyed graphically.
  • Log search. Related log content can be located quickly.
  • Log filtering. Multiple log types are supported, and you can select which ones to analyze.
  • Log sharing. Opening a share link jumps directly to the shared log entry.

When visualizing log data, Logan tries to convey the semantics of the logs graphically; this view is called the timeline.
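Once logs are structured, filtering and search reduce to simple predicates over records. The following Python sketch assumes each record is a dictionary with `type` and `msg` fields; the field names and the function `filter_logs` are hypothetical and are not taken from the Logan front end.

```python
from typing import Dict, Iterable, List, Optional, Set


def filter_logs(records: Iterable[Dict],
                types: Optional[Set[str]] = None,
                keyword: Optional[str] = None) -> List[Dict]:
    """Keep records whose type is selected and whose message contains the keyword."""
    selected = []
    for record in records:
        if types is not None and record["type"] not in types:
            continue  # log filtering: this type was not selected
        if keyword is not None and keyword.lower() not in record["msg"].lower():
            continue  # log search: case-insensitive substring match
        selected.append(record)
    return selected


records = [
    {"type": "behavior", "msg": "open ShopPage"},
    {"type": "network", "msg": "GET /shop 200"},
    {"type": "network", "msg": "POST /order 500"},
    {"type": "code", "msg": "Order request failed"},
]
network_only = filter_logs(records, types={"network"})
order_related = filter_logs(records, keyword="order")
```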

Each line represents a log type. The same log type has multiple shapes and colors that indicate different semantics.

For example, in the timeline, code-level logs are classified by log category:

Error logs can be easily distinguished by color differences. Click the red dot to jump directly to error log details.

Case analysis process

  • When users encounter a problem, they contact customer service to report it.

  • Customer service receives the feedback, records the case, sorts out the problem, and guides the user to report Logan logs.

  • After receiving the case, the developer searches for the user’s Logan logs, uses features such as log filtering, time locating, and the timeline to analyze them, and reconstructs the “scene” of the case.

  • Finally, the developer locates the problem in the code, fixes it, and closes the case.

Locating the problem

Using the user’s information, search for the user’s logs in the Logan front-end system. Open the log details and first use the time locating function to jump to the logs recorded when the problem occurred. The surrounding log context shows how the app was running at the time and lets you roughly infer the cause. Then use log filtering to find the key logs and check each possible cause. Finally, locate the problem in the code.

In practice, of course, problems are much more complicated than this, and we have to go back and forth between the logs and the code. At that point Logan’s advanced features, such as the timeline, become useful. The timeline makes abnormal logs easy to spot, and clicking an icon on it jumps to the log details. Using the trace information in network logs, you can also view the detailed call stack and response values of the back-end service that handled the request.
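The time locating step can be pictured as a binary search over timestamps followed by a slice of surrounding context. This Python sketch assumes the records are already sorted by a `ts` field; the function name `context_around` and its parameters are hypothetical and are not part of the Logan front end.

```python
import bisect
from typing import Dict, List


def context_around(records: List[Dict], problem_ts: int,
                   before: int = 2, after: int = 2) -> List[Dict]:
    """Return the first record at or after problem_ts plus its neighbours."""
    timestamps = [record["ts"] for record in records]
    # Index of the first record whose timestamp is >= problem_ts.
    idx = bisect.bisect_left(timestamps, problem_ts)
    return records[max(0, idx - before): idx + after + 1]


records = [{"ts": t, "msg": "event at %d" % t} for t in (10, 20, 30, 40, 50, 60)]
window = context_around(records, problem_ts=35, before=1, after=1)
```

Here `window` holds the records with timestamps 30, 40, and 50: the entry just before the reported time, the first entry after it, and one more for context.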

Future plans

  • Machine learning analysis. First, collect past cases and their solutions, extract and analyze the features of each case, and store the cases in a database in structured form. Then use machine learning to analyze newly reported logs quickly, point out likely problems in them, and suggest solutions.
  • Open data platform. Business teams can obtain data through the open data platform and build tools and products suited to their own business.

Platform support

Platform   iOS   Android   Web   Mini Programs
Support    √     √         √     √

The Logan SDK currently supports the four platforms above. The iOS and Android versions are being open-sourced in this release, and the other platforms will follow. Stay tuned.

Test coverage

Travis and Circle do not support the Android NDK environment well, and Logan currently uses NDK version 16.1.4479499 to stay compatible with older Android devices, so we have not configured CI in the GitHub repository. Developers can run the test cases locally; test coverage is 80% or higher.

Open source project

A Logan-centric case analysis ecosystem has developed within the group. This release open-sources the iOS and Android client modules and a simplified version of log parsing. The Mini Program and Web versions are on their way to being open-sourced, and the back-end and front-end systems are also in our open source plan.

In the future, we will provide a data platform built on Logan’s data, with advanced features such as machine learning, troubleshooting suggestions for logs, and large-scale feature analysis.

Finally, we hope to provide a more complete integrated case analysis ecosystem, and we welcome your suggestions to build our community.

Module          Open Source   Processing   Planning
iOS             √
Android         √
Web                           √
Mini Programs                 √
Back End                                   √
Front End                                  √

Team

Zhou Hui, project sponsor and senior mobile architect at Meituan-Dianping.

Jiang Teng, core developer of the project.

Li Cheng, core developer of the project.

Bai Fan, core developer of the project.

Recruitment

Dianping Mobile R&D Center, based in Shanghai, provides basic infrastructure services for most mobile apps of Meituan-Dianping Group, including network communication, mobile monitoring, push delivery, dynamic engines, and mobile R&D tools. The team also handles business development for traffic distribution, UGC, the content ecosystem, the integration center, and more. We are looking for people who want to focus on mobile development. Please send your resume to hui.zhou#dianping.com.