Data collection is the source of every data application and guides enterprises in making product, operations, and business decisions. In this article, Wang Zhuozhou explains in detail how to start from data collection requirements and arrive at an efficient, usable data collection scheme. The main contents are as follows: · Definition and importance of data collection · Common data collection schemes in the industry · Principles of data collection · Case analysis of data collection

I. Definition and importance of data collection

Data collection refers to the process of gathering and acquiring data of all kinds in order to meet the needs of data statistics, analysis, and mining. Typically, it means collecting data within an enterprise. In today’s Internet industry, as the traffic dividend declines, more and more enterprises are turning to refined operations to mine the value of each user in depth. Methodologies and practices such as data-driven decision making and refined operations are becoming increasingly important and have been adopted by more and more enterprises. Both make decisions based on data, and data collection is their foundation and prerequisite.

Data collection ultimately serves data applications. If there is no data application requirement, putting a lot of effort into data collection is meaningless. Data application is a fairly broad category: it ranges from the simplest statistical reports, through complex interactive online analysis, to the personalized recommendation that is currently so popular.

Whatever the kind of data application, it can be roughly divided into five stages, as shown in the figure. First, data is collected in various ways. The collected data is then transmitted downstream, either in real time or in batches. Next, an appropriate data model is chosen for ETL and modeling, and an appropriate storage scheme is chosen according to the subsequent application. Once the data is modeled and stored, it can be used for statistics, analysis, and mining. Finally, the results are put to use: on the one hand, they can be displayed through data visualization to support product, operations, and business decisions; on the other hand, they can be fed back into the product directly, for example as a “guess what you like” feature.

Clearly, in a typical data application, data collection is the first link: it is the source and starting point of everything that follows. If data collection is done poorly and overall data quality suffers, making up for it in later stages is very costly and far less effective. The quality of the final data application, and of the decisions and feedback based on it, will inevitably be affected as well.

In this sense, the importance of data collection cannot be overemphasized.

It was precisely this awareness of the importance of data collection that gave rise to Shence Data’s vision: “help thirty million Chinese enterprises rebuild their data foundation and achieve digital management”. We hope that through our efforts we can help our customers and partners collect better and more comprehensive data and maximize its value. For the same reason, over the past five years we have done a great deal of work and helped many customers, both in data collection technology and in data governance schemes. For example, we have built a strong data collection SDK R&D team, open-sourced all of our SDKs, and maintain an open source discussion community of nearly 1,500 people. At the same time, we keep sharing our accumulated experience with the industry, so that data collection technology is no longer mysterious and its ecosystem becomes better and healthier.

II. Common data collection schemes in the industry

At present, there are three common buried point (event tracking) methods on the market: code buried point, full buried point, and visual buried point.

1. Code buried point

With a code buried point, the client integrates the SDK and initializes it when the client starts; then, whenever an event (behavior) occurs, the client explicitly calls the SDK’s interface to trigger the corresponding event.
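As a minimal sketch of that flow, the Python below models a hypothetical client-side SDK. The `TrackingSDK` class, its `track` method, the placeholder URL, and the `ViewProduct` event are all illustrative assumptions, not the API of any real SDK.

```python
import time
from typing import Any, Dict, List, Optional


class TrackingSDK:
    """Hypothetical tracking SDK, for illustration only."""

    def __init__(self, server_url: str) -> None:
        # Called once when the client starts.
        self.server_url = server_url
        self.events: List[Dict[str, Any]] = []

    def track(self, event: str, properties: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
        # Called explicitly by client code at the point where the behavior occurs.
        record = {
            "event": event,
            "time": int(time.time() * 1000),
            "properties": dict(properties or {}),
        }
        self.events.append(record)
        return record


# Client startup: initialize the SDK (the URL is a placeholder).
sdk = TrackingSDK(server_url="https://example.invalid/collect")

# When the behavior occurs, the client calls the SDK explicitly,
# choosing the event name and attributes itself.
sdk.track("ViewProduct", {"product_id": "p-1001", "price": 59.9})
```

Because the call site is ordinary client code, the event name and every attribute are under the developer's control, which is where both the flexibility and the release cost of this method come from.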

The code buried point is the most common buried point method and also the most versatile. Its advantages are as follows: (1) buried points can be controlled precisely; (2) custom events and attributes can be added flexibly; (3) it can meet the needs of more refined analysis.

The code buried point also has some disadvantages: (1) the up-front cost of adding buried points is relatively high; (2) any change to a buried point has to ship with a new client release.

2. Full buried point

The full buried point, also known as the no-buried point, no-code buried point, traceless buried point, or automatic buried point, means that development engineers need to write no code, or only a small amount of code, to automatically collect all user behavior data in advance; the objects to be analyzed and counted are then selected and configured in the data analysis product.

The advantages of the full buried point are as follows: (1) the up-front cost of adding buried points is relatively low; (2) if the analysis requirements or event design change, the application does not need to modify its buried points and release a new version; (3) it effectively solves the problem of “historical data backtracking”.

The full buried point also has some disadvantages: (1) for technical reasons, it is difficult to fully cover some complex operations, such as zooming and scrolling; (2) business-related data cannot be collected automatically; (3) it cannot meet the needs of more refined analysis; (4) it is prone to various compatibility problems; (5) the amount of data transmitted is large, which wastes resources.

3. Visual buried point

The visual buried point, as the name suggests, means defining buried points through a visual interface. It generally relies on the techniques used for the full buried point.

There are two ways to implement the visual buried point:

First, nothing is collected by default; elements are then circled (selected) in a visual tool, and only the selected ones are collected.

Second, full buried point collection is enabled by default, and the automatically collected events are then renamed in a visual tool.

For example, for the login button on the login page, the event name collected by the full buried point is usually fixed, such as AppClick. With the help of the visual buried point, we can rename that AppClick event to something more meaningful, such as login.
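The renaming in the login example can be sketched as a rule table applied to automatically collected events. The `element_path` property, the rule format, and the path strings below are assumptions made for illustration; a real product would define its own way of identifying elements.

```python
from typing import Any, Dict, Tuple

# Hypothetical rules produced by a visual configuration tool:
# (generic event name, element path) -> business event name.
RENAME_RULES: Dict[Tuple[str, str], str] = {
    ("AppClick", "LoginPage/LoginButton"): "login",
}


def apply_visual_rules(event: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the event, renamed if a visual rule matches."""
    key = (event["event"], event["properties"].get("element_path"))
    renamed = dict(event)
    renamed["event"] = RENAME_RULES.get(key, event["event"])
    return renamed


login_click = {"event": "AppClick", "properties": {"element_path": "LoginPage/LoginButton"}}
other_click = {"event": "AppClick", "properties": {"element_path": "HomePage/Banner"}}

renamed_login = apply_visual_rules(login_click)   # becomes a "login" event
renamed_other = apply_visual_rules(other_click)   # no rule matches, stays "AppClick"
```

Since the rules live in configuration rather than in client code, they can be changed without releasing a new version of the App.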

Compared with the code buried point and the full buried point, the visual buried point looks pretty cool, but it too has its pros and cons.

Advantages: the buried point stays close to the business scenario, and the technical threshold is lowered, so non-technical personnel such as operations staff and data analysts can set up buried points themselves.

Disadvantages: since the visual buried point relies on the full buried point, it naturally inherits the latter’s disadvantages, such as compatibility issues and the inability to collect business-related data.

So, what is the future trend of buried points?

As I see it, buried points will gradually become more scenario-oriented, professional, and intelligent. One example is adding dynamic attributes to events through visualization, that is, visual association of dynamic attributes.

III. Principles of data collection

Faced with so many data collection schemes, how should we choose?

In the past five years, Shence has served more than 1,500 enterprise customers. Through in-depth work with them, we have found that no buried point scheme is perfect for every scenario. Each scheme has its own advantages and disadvantages, and each has scenarios it suits and scenarios it does not. Faced with so many options, we should neither blindly choose whatever saves the most trouble nor chase whichever method looks “coolest”; what matters most is to choose the method that best meets our needs, based on the actual analysis requirements and business scenarios. Only if several schemes all meet the requirements should we then prefer the “easy” or “cool” one.

Take the search page shown above as an example: the requirement is that when the user clicks the search button, an event is triggered with the keyword entered by the user as an event attribute.

For this requirement, the code buried point scheme is very simple to implement. The full buried point scheme cannot fully satisfy it on its own: although it can automatically collect the click on the search button, it cannot automatically obtain the keyword as an attribute of that click event. The requirement can still be met by supplementing the full buried point with a small amount of code. The visual buried point scheme can also meet it, provided that dynamic attribute association is available.
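The gap between automatic collection and business attributes can be illustrated with a small simulation. Everything here is hypothetical: `auto_collect_click` stands in for a full buried point click handler, and the optional `hook` stands in for the “certain code” a developer writes to attach the keyword.

```python
from typing import Any, Callable, Dict, Optional

PropertyHook = Callable[[], Dict[str, Any]]


def auto_collect_click(element_id: str, hook: Optional[PropertyHook] = None) -> Dict[str, Any]:
    """Simulate a full buried point click event, optionally extended by a code hook."""
    event = {"event": "AppClick", "properties": {"element_id": element_id}}
    if hook is not None:
        # Business data comes from code the developer writes,
        # not from automatic collection.
        event["properties"].update(hook())
    return event


search_box = {"text": "running shoes"}  # stands in for the search input field

# Automatic collection alone: the click is captured, the keyword is not.
plain = auto_collect_click("search_button")

# Automatic collection plus a small hook: the keyword becomes an event attribute.
with_keyword = auto_collect_click(
    "search_button", hook=lambda: {"keyword": search_box["text"]}
)
```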

Therefore, there is no silver bullet in data collection: no universal, perfect solution fits every application scenario. What we can do is choose the most appropriate data collection scheme for each scenario.

Although there is no silver bullet, there are still some general principles of data collection that we can refer to. They can be summed up in four words: large, complete, detailed, and timely.

Large: take full account of the growth in user scale and data scale, and prepare for the accumulation of data assets.

Complete: collect from all terminals, collect the full set of user behavior rather than a sample, and cover the entire life cycle of the user’s use of the product.

Detailed: collect attributes and dimensions as comprehensively as possible and keep as much detail as possible, so that the accumulated data assets are of higher quality. For example, collect user behavior data from five perspectives: Who, When, Where, How, and What.

Timely: improve the timeliness of data collection as far as technical conditions and cost allow, so as to improve the timeliness of subsequent data applications.
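One possible way to make the five perspectives concrete is an event record with one field per perspective. The field layout below, including what goes under `where` and `how`, is an assumption for illustration rather than a prescribed schema.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class BehaviorEvent:
    who: str                # user ID or anonymous ID
    when: int               # event time, milliseconds since the Unix epoch
    where: Dict[str, Any]   # e.g. page, city, network
    how: Dict[str, Any]     # e.g. device, operating system, app version
    what: str               # the behavior itself
    details: Dict[str, Any] = field(default_factory=dict)  # attributes of the behavior


event = BehaviorEvent(
    who="anon-42",
    when=1_700_000_000_000,
    where={"page": "SearchPage"},
    how={"os": "Android", "app_version": "1.2.0"},
    what="Search",
    details={"keyword": "running shoes"},
)

# A plain dictionary is convenient for serialization and storage.
record = asdict(event)
```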

IV. Case analysis of data collection

Case 1: Connecting App and H5

In recent years, hybrid App development has become increasingly popular, and the need to connect App and H5 has become increasingly urgent. What does connecting App and H5 mean? Connecting, or “getting through”, means that after H5 integrates the JavaScript data collection SDK, the events triggered in H5 are not sent to the server directly. Instead, they are first sent to the App-side data collection SDK, which performs secondary processing on them and then stores them in its local cache. Why should App and H5 be connected? Mainly for the following reasons.

1. Data loss rate

In the industry, the loss rate of data collected on the App side is generally around 1%, while that of data collected in H5 is generally around 5% (mainly because of caching, network conditions, or page switching). If App and H5 are connected, all events triggered in H5 are first sent to the App-side data collection SDK, processed there, and merged into the local cache. Once the synchronization policy is satisfied, the data is uploaded, which reduces the H5 data loss rate from about 5% to about 1%.

2. Data accuracy

As is well known, H5 cannot directly obtain device-related information and can only obtain limited information by parsing the UserAgent value. Parsing the UserAgent value runs into at least two problems: (1) some information, such as the version number of the application, cannot be obtained this way at all;

(2) some information can be obtained by parsing the UserAgent value, but the result may be incorrect.

If App and H5 are connected, the App-side data collection SDK can supply this information, ensuring the accuracy and completeness of the event information.

3. User ID

If users use our product before registering or logging in, we generally identify them with anonymous IDs. However, App and H5 use different rules to identify anonymous users (iOS generally uses IDFA or IDFV, while H5 generally uses cookies), so a single user of our product ends up as two anonymous users. If App and H5 are connected, the two anonymous IDs can be unified (the App-side anonymous ID prevails).

How do you connect them? In connecting App and H5, Shence Data went through three stages and designed three schemes to meet the needs of different periods.

Scheme 1: Imagine that an H5 page is embedded in your App. If the user starts the App without registering or logging in, how should the user be identified? We might use an anonymous ID or a device ID, but H5 and App generate anonymous IDs by different rules. H5 commonly uses cookies. Android uses the Android ID, the increasingly popular OAID, or a UUID. iOS commonly uses IDFA and falls back to IDFV when IDFA is restricted. Therefore, on both Android and iOS, when the App is mixed with H5, two anonymous IDs are generated for a user who has not registered or logged in. This is equivalent to having two anonymous users, which is obviously inconsistent with reality.

So the first problem we faced when connecting App and H5 was user identification. When the App loads the embedded H5 page, it actively passes the anonymous ID generated on the App side to H5, so that all events generated in H5 are identified by the App’s anonymous ID and user identity is unified. This was the first version of Shence’s solution for connecting App and H5, in 2016.
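The idea behind this first scheme can be sketched in a few lines. The `H5Tracker` class and its `adopt_app_identity` method are hypothetical names, and the ID strings are made up; the sketch only shows the effect of handing the App-side anonymous ID to H5.

```python
from typing import Any, Dict


class H5Tracker:
    """Hypothetical H5-side tracker that normally identifies users by a cookie value."""

    def __init__(self, cookie_id: str) -> None:
        self.anonymous_id = cookie_id

    def adopt_app_identity(self, app_anonymous_id: str) -> None:
        # The App passes its own anonymous ID in when it loads the H5 page.
        self.anonymous_id = app_anonymous_id

    def track(self, event: str) -> Dict[str, Any]:
        return {"event": event, "anonymous_id": self.anonymous_id}


app_anonymous_id = "app-anon-7f3a"        # e.g. derived from a device identifier
h5 = H5Tracker(cookie_id="cookie-19c2")

before = h5.track("H5PageView")           # identified by the cookie: a second "user"
h5.adopt_app_identity(app_anonymous_id)
after = h5.track("H5PageView")            # identified by the App's ID: the same user
```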

Scheme 2: To solve the problem of data accuracy, Shence Data upgraded to the second version of the solution.

As we all know, a web page viewed in a browser has no way to freely obtain information about the user’s device. A page opened on a computer cannot access the user’s disk, and a page opened on a mobile device cannot freely access the user’s camera or sensors. So how does H5 acquire device information?

In general, H5 obtains the current UserAgent (UA) value and parses it. However, UA parsing has many problems, mainly on the Web and on Android, and especially across the many browsers on Android. UA formats are not unified, so the following situations are often encountered:

(1) the UA value is difficult to parse during data collection; (2) the parsed data does not reflect the real device; (3) on both Android and iOS, many developers read and modify the UA value in order to implement special functions. Some engineers append to the value they read, which is the best approach; others replace the standard UA value entirely, which causes UA parsing to fail or to return incorrect results.

An event triggered in H5 usually needs to carry basic attributes such as the App version number, operating system type, operating system version number, and screen size. These cannot all be resolved from the UA value alone, which places higher requirements on connecting App and H5.

For this reason, Shence sends the events generated in H5 to the data collection SDK integrated in the App. When the App-side SDK receives an event, it performs secondary processing on the event’s attributes and corrects them where necessary. This ensures both the accuracy and the completeness of the collected data.
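A minimal sketch of such secondary processing is shown below. The attribute names, the `APP_CONTEXT` values, and the `process_h5_event` function are assumptions chosen for the example; the source does not describe the actual transport or field names.

```python
from typing import Any, Dict

# Attributes the App side knows reliably (values are made up for the example).
APP_CONTEXT: Dict[str, Any] = {
    "app_version": "3.4.1",
    "os": "Android",
    "os_version": "13",
    "screen_width": 1080,
    "screen_height": 2400,
}


def process_h5_event(h5_event: Dict[str, Any], app_anonymous_id: str) -> Dict[str, Any]:
    """Secondary processing of an H5 event on the App side."""
    properties = dict(h5_event.get("properties", {}))
    # App-side values overwrite UA-derived guesses and fill in missing fields.
    properties.update(APP_CONTEXT)
    return {
        "event": h5_event["event"],
        "anonymous_id": app_anonymous_id,  # the App-side anonymous ID prevails
        "properties": properties,
    }


h5_event = {
    "event": "H5PageView",
    "anonymous_id": "cookie-19c2",
    # os_version was mis-parsed from the UA; app_version could not be parsed at all.
    "properties": {"os_version": "10", "url": "/promo"},
}
processed = process_h5_event(h5_event, app_anonymous_id="app-anon-7f3a")
```

Attributes that only H5 knows, such as the page URL, pass through unchanged, while device-level attributes come from the App.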

Since most of Shence’s customers use private deployment, it is difficult for Shence to measure their data loss rates. However, the general figure in the industry is that “the data loss rate of App is about 1%, and that of H5 and Web is about 5%”. The reason for the fivefold difference is that H5’s local cache is limited, so a failed upload means the data is lost. In addition, H5 in an App usually takes the form of a single page that sends network requests itself; if the user leaves the page, its pending network requests are cancelled and the data cannot be fully synchronized. This raises the bar for connecting App and H5: how can connecting the two reduce data loss?

The events collected by the App are not synchronized in real time: there are many of them and they occur frequently, so uploading immediately after each one would put great pressure on the server. Therefore, the App generally has a local cache. All collected events are stored in the local cache first and synchronized once certain conditions are met; in other words, the data synchronization strategy is built on the cache. If H5 events are passed to the App for secondary processing as described above, they enter the App’s local cache and follow the App’s synchronization strategy, which greatly reduces the probability of losing H5 events.
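The cache-based strategy can be sketched as follows. The `EventCache` class and its size-based flush condition are assumptions; the source only says that events are synchronized “after certain conditions are met”, so a real SDK may also use time intervals, network type, or other triggers.

```python
from typing import Any, Callable, Dict, List


class EventCache:
    """Hypothetical local cache with a simple size-based synchronization policy."""

    def __init__(self, send: Callable[[List[Dict[str, Any]]], bool], flush_size: int = 3) -> None:
        self.send = send
        self.flush_size = flush_size
        self.pending: List[Dict[str, Any]] = []

    def add(self, event: Dict[str, Any]) -> None:
        # App events and forwarded H5 events go through the same cache.
        self.pending.append(event)
        if len(self.pending) >= self.flush_size:
            self.flush()

    def flush(self) -> None:
        if not self.pending:
            return
        batch = list(self.pending)
        # Events are removed only after a successful upload, so a failed
        # request leaves them in the cache for the next attempt.
        if self.send(batch):
            del self.pending[: len(batch)]


uploaded: List[Dict[str, Any]] = []
network_up = {"ok": False}


def fake_send(batch: List[Dict[str, Any]]) -> bool:
    """Stand-in for the network upload; succeeds only when the network is up."""
    if network_up["ok"]:
        uploaded.extend(batch)
    return network_up["ok"]


cache = EventCache(fake_send, flush_size=3)
for name in ("AppStart", "H5PageView", "H5Click"):
    cache.add({"event": name})      # the third add triggers a flush, which fails

pending_after_failure = len(cache.pending)
network_up["ok"] = True
cache.flush()                       # the retry succeeds and empties the cache
```

The contrast with a bare H5 page is that a failed or cancelled request here does not discard the events; they simply wait for the next synchronization.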

These are the focus of the second version of the App and H5 solution, in which user identity, data accuracy, and data integrity are all addressed.

Scheme 3: The third version of the solution is the result of continuously improving and iterating on the second version.

Assume the following scenario: the H5 pages in an App are developed by a third-party supplier. In this case, the following two problems arise:

(1) The third-party supplier is not a Shence customer, so its H5 pages cannot collect data with the Shence SDK, let alone be connected to the App;

(2) The third-party supplier is a Shence customer. App and H5 can then be connected, but in many cases the App owner is forced to receive a lot of unnecessary data, so-called “dirty data”, while the H5 supplier finds that it cannot collect complete data and that many events are lost “inexplicably”. This is because, once App and H5 are connected, H5 events are passed to the App by default. In this case we need to consider more details, and have the App apply an H5 whitelist so that only events from whitelisted H5 pages are uploaded to the App.
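One way to picture the whitelist is as a check on the H5 page’s host before an event is forwarded to the App. The host names, the `H5_WHITELIST` set, and matching by host rather than by some other key are all assumptions; the source does not specify how the whitelist is expressed.

```python
from typing import Any, Dict, Set
from urllib.parse import urlparse

# Hypothetical whitelist configured on the App side.
H5_WHITELIST: Set[str] = {"m.example.com", "pay.example.com"}


def should_forward_to_app(h5_event: Dict[str, Any]) -> bool:
    """Forward an H5 event to the App SDK only if its page host is whitelisted."""
    host = urlparse(h5_event["url"]).hostname
    return host in H5_WHITELIST


own_page = {"event": "H5Click", "url": "https://m.example.com/cart"}
vendor_page = {"event": "H5Click", "url": "https://vendor.example.net/widget"}

forward_own = should_forward_to_app(own_page)        # whitelisted: goes to the App
forward_vendor = should_forward_to_app(vendor_page)  # not whitelisted: stays with H5
```

Events from pages that are not whitelisted can then be uploaded by the H5 SDK itself, so the supplier keeps its data and the App does not receive it.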

A further scenario then arises: the third-party supplier agrees to pass its data to the App but also requires keeping a copy for itself. In general, connecting App and H5 looks like a fairly common scenario, but it often faces many challenges in practice.

From 2016 to today, we have kept updating and iterating on the App and H5 connection, aiming to adapt to all kinds of complex scenarios, especially those involving third-party development frameworks and third-party browsers.

Case 2: App startup and exit

1. App startup

What does “App startup” mean? Some say that using the App counts as “App startup”; but if you use a music player and the music keeps playing after the player goes to the background, does that count as a “startup”? Others say that “App startup” should be defined by usage time; but when a user who needs to pay in “Jingdong” jumps to “WeChat” to complete the payment and then jumps back to “Jingdong”, does that count as a “startup” of WeChat? Or suppose a nuisance call comes in while the user is using “WeChat”, and the user hangs up immediately but the interruption still lasts two seconds. The user went from “WeChat” to the call and back to “WeChat” in those two seconds; is that a “startup”?

A few years ago, the many functions on a mobile phone, such as Apps and H5, were isolated islands. As technology developed, these islands became connected to one another. As a result, there are now many ways to trigger an “App startup”:

First, the user taps the icon to start the App, which is the most common way. Second, the App is woken up from the background, also known as a “hot start”.

Third, the App is woken up from H5. For example, if a friend shares a Jingdong product with you through WeChat, after you open the link you are prompted to “Open with App” in the upper right corner; if the Jingdong App is installed on your phone, it will be launched. Fourth, one App wakes up another, as with map jumps, payment jumps, push jumps, and mini program jumps.

Once the definition of “App startup” is clear, the next important task is to collect it. In this process, we face the following challenges:

Challenge 1: Whether it is the first startup

The first startup is the first time the App is started after the user installs it. This scenario is usually called activation, and some mechanism is needed to determine whether a startup is the first one. Some say you can set a local flag to tell. However, the settings of both Android and iOS allow the user to clear the local cache, so a local flag is not reliable. Others say you can store the flag on the SD card, but reading and writing the SD card requires permission, which is also difficult in practice. So there are technical challenges in determining whether a startup is the first one.

Challenge 2: Cold start and hot start

Often we press the Home button to send the App to the background. If it stays there for a long time or system resources run short, the system may reclaim the App, and the next startup is technically a cold start, even though by our earlier definition it is a hot start. So distinguishing cold starts from hot starts is a very complicated matter.

Challenge 3: Recovery from the background

There are two common ways to recover from the background: (1) tap the App icon; (2) double-press the Home button to display the application list and select the App from it. Whether the collection scheme covers these different recovery scenarios is a test of the technology, and such complex and changeable scenarios also need to be considered during data analysis.

Challenge 4: iOS passive startup

Many people have never come across this concept and do not understand it; it is a term Shence introduced for certain specific scenarios. What is a passive startup? It is unique to iOS. Suppose we are using an App and it goes to the background for some reason. After a certain period of time (called a “specific time” in the official iOS documentation), the system puts the App into a “zombie state”. If the App’s backend then sends a push to the user, the iOS device initializes the App on receiving the push, starting from the first page, without the user being aware of it. Under the full buried point collection logic, this initialization triggers the App startup and page view events. A startup in this scenario is called a “passive startup”.

That is why, over the past two years or so, we heard many customers ask: why do we collect users who only have “startup” and “page view” events but no “exit”? Because of the technical limitations at the time, this was usually roughly classified as fake traffic (“brushing”). As we encountered more and more scenarios, we dug deep until we figured it out. One consequence is that users sometimes do not understand why the daily active user data collected by Shence (usually judged by “startup”) is lower than that collected by other tools; the reason is that we distinguish between “normal startup” and “passive startup”. This is closely related to Shence’s values: we need to collect real data from real scenarios in order to bring value to the enterprise.

Challenge 5: Android multi-process

How should multi-process be understood? Many common Apps have a “scan” function, which inevitably uses the camera. Android has many ROMs and compatibility is complex, so the “scan” page crashes easily. But “scan” is not necessarily a core component of the App, and even if it has problems it should not affect the normal running of the App. Therefore, a separate process is generally configured for the service logic or page of “scan”, so that “scan” and the main business run as two independent processes that do not affect each other.

This causes problems for determining App startup on Android, because it is hard to tell whether the two processes belong to the same App. So the startup concept differs between Android and iOS. On Android, when a user opens a page more than 30 seconds after leaving the previous page of the App, we treat it as an “App startup”; this is called the “session mechanism”. Similarly, when a user leaves a page and no new page is opened within 30 seconds, it counts as an “App exit”.
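The 30-second session mechanism can be sketched as an offline pass over page intervals. The `sessionize` function and the event names `AppStart` and `AppEnd` are assumptions for illustration; a real SDK makes this decision live with timers and shared state across processes rather than over a finished list.

```python
from typing import List, Optional, Tuple

SESSION_TIMEOUT_S = 30  # threshold taken from the text


def sessionize(page_intervals: List[Tuple[float, float]]) -> List[Tuple[str, float]]:
    """Turn (open_time, close_time) page intervals into AppStart / AppEnd events.

    Times are in seconds and intervals are sorted by open time. The pages may
    belong to any process of the App; only the gaps between them matter.
    """
    events: List[Tuple[str, float]] = []
    last_close: Optional[float] = None
    for opened, closed in page_intervals:
        if last_close is None:
            events.append(("AppStart", opened))
        elif opened - last_close > SESSION_TIMEOUT_S:
            # No page was open for more than 30 s: the previous session ended.
            events.append(("AppEnd", last_close))
            events.append(("AppStart", opened))
        last_close = closed if last_close is None else max(last_close, closed)
    if last_close is not None:
        events.append(("AppEnd", last_close))
    return events


# Main page, then the scan page 5 s later (same session), then a page 60 s later.
timeline = sessionize([(0, 10), (15, 20), (80, 95)])
```

Opening the scan page in a separate process 5 seconds after leaving the main page does not produce a second startup, because the gap is under the threshold.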

Challenge 6: Compliance

Compliance is a familiar topic. For Shence, because our SDK is open source, the collection behavior of the Shence SDK is fully visible, and it has to be compliant.

So what impact does compliance have on startup? Data collection needs to gather user-related information, such as the device ID. “Compliance” requires the user’s consent before data is collected, which is the privacy policy prompt that commonly pops up in Apps. Data collection may also involve system permissions. Only when users have clearly agreed can enterprises carry out data collection.

However, this consent process happens after the user has started the App, so the moment for collecting the App startup has already passed. Achieving compliance therefore affects the collection of “App startup” to some extent.

2. App exit

In most cases, an App counts as exited when it is no longer displayed. Common scenarios include the user pressing the Home button, the App crashing, and the App jumping to another App. For music players and sports-related Apps, however, some special judgments are needed. In collecting “App exit”, we also face challenges:

Challenge 1: Reasons for App exit

A clear understanding of why users exit the App helps in analyzing the product and the business.

Challenge 2: App usage duration

Not only do we need to capture the “App exit” action, we also need to know how long the user has been using the App. Some say you can record a timestamp at “startup” and at “exit” and compute the usage duration from the two; but how should these timestamps be taken?

In most cases we use client time for the timestamp, but what if the device time changes between startup and exit, whether manually or for network reasons?

Several abnormal situations can occur: “exit” minus “startup” equals 0 or close to 0; the date of “startup” is August 1 and the date of “exit” is August 30, giving an implausibly long usage time; or the user manually sets the date to July 30 before “exit”, giving a negative usage time. These situations are obviously inconsistent with reality. Therefore, collecting App usage duration cannot rely solely on device time.

So how does Shence meet this challenge? Both Android and iOS provide a special “counter”: when the operating system boots, the counter starts from 0, and this is the time since boot that can be seen in the phone’s “Settings”. Because this counter does not change when the device time is modified, we use it to calculate the user’s App usage duration, and the result is unaffected by changes to the device time.
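The same idea can be shown with Python’s `time.monotonic()`, used here as a stand-in for the operating system’s boot counter (on Android, for example, `SystemClock.elapsedRealtime()` plays this role). The `UsageTimer` class is a hypothetical helper, and the injected fake clock exists only to make the example deterministic.

```python
import time
from typing import Callable, Optional


class UsageTimer:
    """Measure usage duration with a monotonic clock instead of wall-clock time."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._started_at: Optional[float] = None

    def on_start(self) -> None:
        self._started_at = self._clock()

    def on_exit(self) -> float:
        if self._started_at is None:
            raise RuntimeError("on_exit() called before on_start()")
        # A monotonic clock never goes backwards, so this cannot be negative,
        # whatever the user does to the device date and time.
        duration = self._clock() - self._started_at
        self._started_at = None
        return duration


# A fake monotonic clock: "startup" at tick 100.0, "exit" at tick 142.5.
ticks = iter([100.0, 142.5])
timer = UsageTimer(clock=lambda: next(ticks))
timer.on_start()
duration_s = timer.on_exit()
```

The wall-clock timestamps can still be recorded for display; only the duration is derived from the monotonic counter.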

Challenge 3: Reissuing the exit event

A few years ago, someone proposed this scenario: if the user’s phone falls into water, can Shence still collect the exit event? My answer was that if the phone can be taken out of the water, powered on, and the App started normally, then the exit event can be reissued.

What is reissuing? Users may quit at any moment while using the App. So once the user starts the App, we begin timing and record a timestamp at regular intervals. The next time the user starts the App, if we find that the timestamp is still there but no exit event was triggered, we immediately reissue the exit event for the previous session.
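A minimal sketch of this reissue logic follows. The `ExitTracker` class, the `last_heartbeat` storage key, and the `reissued` flag are hypothetical; a dictionary stands in for persistent local storage, and time values are plain numbers supplied by the caller.

```python
from typing import Any, Dict, List, Optional


class ExitTracker:
    """Hypothetical exit-event reissue based on a periodically saved timestamp."""

    def __init__(self, storage: Dict[str, Any]) -> None:
        self.storage = storage          # stands in for persistent local storage
        self.events: List[Dict[str, Any]] = []

    def on_start(self, now: float) -> None:
        leftover: Optional[float] = self.storage.pop("last_heartbeat", None)
        if leftover is not None:
            # The previous run never reported its exit: reissue it now,
            # stamped with the last time the App was known to be in use.
            self.events.append({"event": "AppEnd", "time": leftover, "reissued": True})
        self.events.append({"event": "AppStart", "time": now})
        self.storage["last_heartbeat"] = now

    def heartbeat(self, now: float) -> None:
        # Called at regular intervals while the App is in use.
        self.storage["last_heartbeat"] = now

    def on_exit(self, now: float) -> None:
        # A normal exit reports the event and clears the saved timestamp.
        self.events.append({"event": "AppEnd", "time": now})
        self.storage.pop("last_heartbeat", None)


disk: Dict[str, Any] = {}

first_run = ExitTracker(disk)
first_run.on_start(now=0)
first_run.heartbeat(now=10)
first_run.heartbeat(now=20)      # the App is killed after this; on_exit never runs

second_run = ExitTracker(disk)   # the next launch reads the same storage
second_run.on_start(now=500)
```

The reissued exit is only as precise as the heartbeat interval: here the App is known to have been alive at 20, so that is the time attached to the exit.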

Both “startup” and “exit” are common scenarios in actual data collection and business analysis. Shence tackles every customer scenario and every challenge head-on, which reflects both our sense of responsibility to customers and our pursuit of perfection.

About the author

Wang Zhuozhou is the author of “Android Full Buried Point Solution” and “iOS Full Buried Point Solution” and the head of the R&D department of Shence Data Governance. He has 10+ years of Android and iOS development experience, was among the first engineers in China to work on Android research and development, and developed and maintains the first commercial open source Android and iOS data buried point SDK in China. He previously worked at Beijing Tianyu Langtong Communication Equipment Co., LTD. as an Android system engineer. He graduated from Beijing Institute of Technology with a major in software engineering.

This talk is divided into two parts: “Data Collection” and “Data Governance”. The above covers only data collection.

About Shence Data

Shence Data is a professional big data analysis and marketing technology service provider. Focusing on user-level big data analysis and management needs, the company has launched products including Shence Analysis, Shence User Portrait, Shence Intelligent Operation, Shence Intelligent Recommendation, and Shence Guest View.

In addition, it provides big data consulting and complete solutions. Shence Data has accumulated service and customer success experience with more than 1,500 paying enterprise users, including China Union Pay, Xiaomi, China Post Consumer Finance, Haitong Securities, Guangfa Securities, Orient Securities, Central Bank, Baixin Bank, CyTS, Ping an Life Insurance, Sichuan Airlines, VIPKID, Oriental Pearl, China Resources, Youzan, Baixing.com, Goods Lala, Flash delivery, Donkey Mom, Keep, 36 Krypton, Largo, VUE, Chunyu Doctor, Jumei Youpin.com, Edge Edge games, Laogou, and Funenjoy, and provides customers with professional consulting, implementation, and technical support services such as comprehensive indicator sorting and data model building. To learn more about Shence Data or to ask data-driven questions, please call 4006509827 to speak with a professional consultant.