On October 13, 2020, Ding Yan, technical director of Scallop (Fanbei), delivered a keynote speech titled “Practice of Scallop in Data Governance” at the “Shence 2020 Data-driven User Conference”, whose theme was “Digital legitimate trend”. (The slide download link is attached at the end of the article.)
This article is based on his live speech, and the main contents are as follows:
Founded nine years ago, Scallop is a well-known mobile Internet learning platform in China, with tens of millions of registered users. Its products include Scallop Words, Scallop Reading, Scallop Listening, Scallop Spoken Language, the Scallop Python course, an Excel course, and a data analysis course.
When I look at data governance, I find there is a consensus that it is a complex issue. Today I will briefly explain Scallop's practice in data governance, not strictly following a theoretical model but focusing on what we actually do.
What is data governance
Data governance generally consists of six aspects, as shown in the following figure:
At Scallop, our technical architecture is a microservices architecture, corresponding to the product matrix and content matrix.
Scallop's practical goals and implementation strategy
In data governance, we often see companies that are in transition or have a long history run into a variety of problems, which can generally be attributed to a heavy historical burden or a complex existing architecture.
To help the business grow better and faster, Scallop organizes its products and content into several business lines, each run by a different team. When processing data as a whole, however, we have to face the following conflicting practical goals:
· All business data are relatively independent but interconnected.
· Each business line's data keeps a certain degree of freedom, so that efficient business lines can move at their own pace: “fast” lines coexist with “slow” ones, and “slow” lines are not allowed to hold back “fast” ones.
· Business data should not affect each other.
These goals require the data to be connected yet independent, while still keeping a certain degree of freedom, which is hard to achieve in practice. In response, Scallop came up with three measures:
1. Data classification
It is usually hard to know how to handle data as one undifferentiated whole, but once we classify it into tiers, processing becomes much more efficient. I therefore divide data into critical data, global data, and general data. Critical data often faces strict requirements, for example on quality. Global data affects the overall development of the enterprise; it is shared data and should be strictly reviewed and controlled. General data tolerates a certain degree of inaccuracy and disorder.
For all three types, however, we always want to make sure that they are isolated from one another.
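The three tiers can be written down as a small policy table. The following is only a minimal sketch, not Scallop's actual implementation: the dataset names, the `TierPolicy` fields, and the default for unregistered datasets are all hypothetical, and the flag values merely paraphrase the descriptions above (whether critical data is shared is not stated in the talk, so `shared=False` is an assumption).

```python
from dataclasses import dataclass
from enum import Enum


class DataTier(Enum):
    CRITICAL = "critical"
    GLOBAL = "global"
    GENERAL = "general"


@dataclass(frozen=True)
class TierPolicy:
    review_required: bool      # must changes go through a governance group?
    shared: bool               # is the data shared across business lines?
    tolerate_inaccuracy: bool  # is some inaccuracy or disorder acceptable?


# Hypothetical policy table; the flags are assumptions based on the text.
POLICIES = {
    DataTier.CRITICAL: TierPolicy(review_required=True, shared=False, tolerate_inaccuracy=False),
    DataTier.GLOBAL: TierPolicy(review_required=True, shared=True, tolerate_inaccuracy=False),
    DataTier.GENERAL: TierPolicy(review_required=False, shared=False, tolerate_inaccuracy=True),
}

# Hypothetical registry that assigns each dataset to exactly one tier.
DATASET_TIERS = {
    "payment_order": DataTier.CRITICAL,
    "user_profile_tags": DataTier.GLOBAL,
    "words.page_view": DataTier.GENERAL,
}


def policy_for(dataset: str) -> TierPolicy:
    """Look up the governance policy for a dataset."""
    # Assumption: anything not registered is treated as general data.
    tier = DATASET_TIERS.get(dataset, DataTier.GENERAL)
    return POLICIES[tier]
```

Keeping the tier assignment in one registry makes the classification explicit and easy to review, which is the point of classifying first and governing second.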
2. Data governance
We adopt a different strategy for each category of data, and a dedicated governance group leads, drives, and oversees its implementation.
As for membership, the leader of each governance group must be someone with authority within the enterprise, and the members must include every stakeholder among the data producers; that is, any organization that can generate the data must have a representative in the governance group.
· The governance group for critical data is made up of direct managers. Control starts at production: individual business lines and microservices may not produce critical data on their own;
· Global data is often shared and interrelated across organizations; the user profiles we commonly use, for example, are global data under unified management.
For example, a user's tags are generated by different events and behaviors, which are scattered across different organizations. When a user of “Scallop Words” chooses a CET-4 word book, we can infer that the user is most likely a college student who needs to prepare for CET-4. From the perspective of other organizations, such as Scallop Speak, this user tag is useful as well.
· General data has a high degree of freedom and is managed relatively loosely. We usually use automated means to compute macro-level data quality statistics, such as volume and conformance to specifications, on a weekly or biweekly basis, and then generate and publish reports regularly.
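The automated statistics for general data could look roughly like the sketch below. It is an illustration under stated assumptions rather than the tooling described in the talk: the `REQUIRED_FIELDS` tuple, the event dictionaries, and the `quality_report` function are hypothetical, and "conformance" here only means that the required fields are present.

```python
from collections import Counter

# Hypothetical specification: fields every general-data event should carry.
REQUIRED_FIELDS = ("event", "user_id", "time")


def quality_report(events):
    """Summarize volume and specification conformance per event name."""
    volume = Counter()
    nonconforming = Counter()
    for ev in events:
        name = ev.get("event") or "<missing>"
        volume[name] += 1
        # An event is nonconforming if any required field is absent or None.
        if any(ev.get(field) is None for field in REQUIRED_FIELDS):
            nonconforming[name] += 1
    return {
        name: {
            "count": count,
            "nonconforming": nonconforming[name],
            "conformance_rate": 1 - nonconforming[name] / count,
        }
        for name, count in volume.items()
    }


# Example batch, e.g. one week of collected events (made-up values).
batch = [
    {"event": "words.page_view", "user_id": 1, "time": 1602547200},
    {"event": "words.page_view", "user_id": None, "time": 1602547260},
]
report = quality_report(batch)
```

A job like this would run on a weekly or biweekly schedule and publish `report` to the teams, which matches the loose, report-driven management described for general data.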
3. Technical support
All of this has to be carried out in the product itself, which requires team members to have not only the awareness but also the ability to do so. Scallop therefore provides targeted support at the technical level, as shown in the figure below:
· Plan namespaces
Each type of data has its corresponding namespace. For example, general data should live in an isolated namespace, while global data should span namespaces. Overall, all data should follow a unified plan.
· Wrap the SDK
Building on Shence's data production and data collection, we wrap its SDK ourselves and add specification checks and filtering for the data.
· Data gateway
At the gateway we apply strict validation, along with steps such as distribution and cleaning, to the data passing through, especially critical data and part of the global data; for general data, gateway processing usually takes the form of statistics, auditing, and so on.
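The three measures above (namespaces, a wrapped SDK, and a data gateway) can be sketched together as follows. This is a hedged illustration only: it does not use the real Shence SDK, and the namespace names, `TrackingClient`, `Gateway`, and the `namespace.name` event naming convention are all assumptions made for the example.

```python
# Hypothetical namespaces: one per business line, plus a shared one.
NAMESPACES = {"words", "reading", "global"}
# Assumption: data in these namespaces is strictly validated at the gateway.
STRICT_NAMESPACES = {"global"}


class ValidationError(ValueError):
    pass


def validate(event: dict) -> None:
    """Specification check: 'namespace.name' event names and a user id."""
    namespace, _, name = str(event.get("event") or "").partition(".")
    if namespace not in NAMESPACES or not name:
        raise ValidationError(f"bad event name: {event.get('event')!r}")
    if event.get("user_id") is None:
        raise ValidationError("missing user_id")


class TrackingClient:
    """Stand-in for a wrapped collection SDK; `send` is any callable."""

    def __init__(self, namespace: str, send):
        self.namespace = namespace
        self.send = send

    def track(self, name: str, user_id, **props):
        event = {"event": f"{self.namespace}.{name}", "user_id": user_id, **props}
        validate(event)  # reject malformed data before it leaves the client
        self.send(event)


class Gateway:
    """Rejects invalid events in strict namespaces; forwards and counts the rest."""

    def __init__(self):
        self.accepted = []
        self.rejected = []
        self.audit = {"ok": 0, "bad": 0}

    def handle(self, event: dict) -> None:
        namespace = str(event.get("event") or "").partition(".")[0]
        try:
            validate(event)
            ok = True
        except ValidationError:
            ok = False
        if namespace in STRICT_NAMESPACES and not ok:
            self.rejected.append(event)
            return
        # Everything that is forwarded is also counted for audit statistics.
        self.audit["ok" if ok else "bad"] += 1
        self.accepted.append(event)


gateway = Gateway()
words = TrackingClient("words", send=gateway.handle)
words.track("choose_wordbook", user_id=42, book="CET-4")
gateway.handle({"event": "global.user_tag", "user_id": None})   # strict: rejected
gateway.handle({"event": "words.page_view", "user_id": None})   # general: forwarded, counted as bad
```

Validating in the client wrapper catches mistakes at the source, while the gateway remains the single place where strictness can differ by namespace.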