By Ali Yun-Qin Qi
Readers familiar with big data will have come across the terms data assets and data asset management, but adding "the intelligence field" may be a little confusing. Do data assets have anything to do with intelligence? Don't worry; let's look at these two key terms first.
Data assets and data asset management
- Data assets are data sets in cyberspace that carry data ownership (exploration rights, use rights, and ownership rights) and are valuable, measurable, and readable. — Wikipedia
- What about data asset management? Data Asset Management (DAM) is a set of business functions for planning, controlling, and providing data and information assets, including developing, implementing, and supervising the plans, policies, programs, projects, processes, methods, and procedures related to data, in order to control, protect, deliver, and enhance the value of data assets. — White Paper on Data Asset Management Practices
The two definitions read a little officially. Put simply, the subject of a data asset is data, with the emphasis that the data is "owned", valuable, and measurable; data asset management, in turn, is about applying methods to maximize the value of that data. It boils down to two keywords: data and value.
Exploration in the field of intelligence
So what does all this have to do with the field of intelligence? Before answering that, consider the problems intelligence work faces today. In my opinion, two problems need to be solved:
- Getting started is hard. Without shared data infrastructure and basic materials, every developer has to walk through the general machine learning workflow: problem definition, data preparation, feature engineering, model selection, training and tuning, and model evaluation. Could we instead let users get up to speed quickly and focus on the core problem, worrying not about data or evaluation but only about model selection and tuning?
- Poor data reusability. For almost every task, the first step is data collection and processing. Machine learning generally requires a large amount of data, at minimum on the order of tens of thousands of samples, and the data must be preprocessed, with different methods for different problems. What happens after the model for that particular problem is trained? Unless a similar problem comes up later, the data is single-use: once consumed, it is effectively wasted. Isn't it depressing that all the data we spent so much time collecting and processing ends up this way?
To solve the two key problems above, and since we happen to be responsible for data asset-related business, we want to see whether data asset management methods can address them. In short, we want to standardize data, make it trustworthy, and share it by managing intelligence-related data in a systematic and standardized way. The specific ideas are as follows:
Data management
In short, we manage the data used in the field of intelligence: common tasks, common data sets, common models, evaluation metrics, and so on. A developer can browse the tasks of interest along with existing model implementations, and each task already comes with its corresponding data set. All that remains to worry about is model selection and tuning, which saves a great deal of time and makes it much easier to get started. Similar to Kaggle, we focus on common tasks, such as code intelligence tasks, and build our common data sets and model algorithms around them.
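To make this concrete, here is a minimal sketch of what such a task registry might look like. The class names, fields, and dataset URI are illustrative assumptions, not our actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Task:
    """A managed task with its curated data set and evaluation metrics."""
    name: str                  # e.g. "text-to-code"
    dataset_uri: str           # where the prepared data set lives
    metrics: List[str]         # e.g. ["BLEU", "accuracy"]
    baseline_models: List[str] = field(default_factory=list)


class TaskRegistry:
    """In-memory registry: a developer picks a task and gets data and metrics for free."""

    def __init__(self) -> None:
        self._tasks: Dict[str, Task] = {}

    def register(self, task: Task) -> None:
        self._tasks[task.name] = task

    def get(self, name: str) -> Task:
        return self._tasks[name]


# Hypothetical usage: the developer only chooses the task, then focuses on the model.
registry = TaskRegistry()
registry.register(Task("text-to-code", "warehouse://datasets/text_to_code", ["BLEU"]))
print(registry.get("text-to-code").dataset_uri)
```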
One part of data management deserves special mention: we will build a data warehouse for code intelligence, in which raw data is turned into the data sets each task requires through ETL. On top of this we can observe data lineage and run impact analysis, so that when a problem appears in a later step we can analyze and locate it immediately.
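As an illustration only, the following sketch shows an ETL step that derives a task data set from raw code records while recording lineage back to the raw record ids; the record fields and function name are assumptions, not our real warehouse schema.

```python
from typing import Dict, List

# Hypothetical raw records from the code warehouse.
raw_code = [
    {"id": 1, "code": "def add(a, b): return a + b", "comment": "add two numbers"},
    {"id": 2, "code": "print('hi')", "comment": None},
]

# Lineage: which raw records each derived data set was built from.
lineage: Dict[str, List[int]] = {}


def build_text_to_code_dataset(records: List[dict]) -> List[dict]:
    """ETL step: keep only commented code and reshape it into (text, code) pairs."""
    kept = [r for r in records if r["comment"]]
    lineage["text_to_code"] = [r["id"] for r in kept]
    return [{"text": r["comment"], "code": r["code"]} for r in kept]


print(build_text_to_code_dataset(raw_code))
print(lineage)  # a downstream problem can be traced back to the raw record ids
```

With such a lineage table in place, impact analysis becomes a reverse lookup: given a raw record that turns out to be bad, find every derived data set that contains it.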
Data sharing
We explore the associations between data to make it easier for users to share. Sharing here consists of two parts: data set sharing and model sharing.
For data set sharing, we want to build common data sets around given tasks. For example, in code intelligence, the same annotated source code can, after standardized processing, be applied to different code tasks. The most common, Code Completion, can use it to predict the code that follows; combined with code comments, it can be used to train Code Search, text-to-code translation, and code-to-text translation. For data users, a query like SELECT * FROM code WHERE comment IS NOT NULL may, after processing, yield the training data for the text-to-code task without any additional collection or processing. For us as the data provider, the only concerns are whether the samples are rich enough, whether they are up to date, whether the data is secure, and so on.
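The sketch below shows, under assumed field names, how one annotated corpus can feed several tasks at once; the splitting logic is simplified for illustration.

```python
# Hypothetical annotated corpus: each row is a code snippet plus an optional comment.
corpus = [
    {"code": "def add(a, b):\n    return a + b", "comment": "add two numbers"},
    {"code": "x = 1\nprint(x)", "comment": None},
]

# Code Completion: only the code is needed; split each snippet into context -> target.
completion = [
    {"context": c["code"].rsplit("\n", 1)[0], "target": c["code"].rsplit("\n", 1)[-1]}
    for c in corpus
]

# Text-to-code / code search: only the commented subset is usable, i.e. the
# "SELECT * FROM code WHERE comment IS NOT NULL" view mentioned above.
text_to_code = [
    {"text": c["comment"], "code": c["code"]} for c in corpus if c["comment"]
]

print(len(completion), "completion examples,", len(text_to_code), "text-to-code examples")
```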
As for model sharing, since deep models are generally hard to interpret, we can at least establish a unified, standard model input/output interface, so that different models can be plugged in and swapped out easily, and their effects compared and measured.
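Here is a minimal sketch of such an interface, assuming a Python setting; the class and method names are purely illustrative.

```python
from abc import ABC, abstractmethod
from typing import List


class CodeModel(ABC):
    """Unified input/output contract: any model honoring it can be swapped in."""

    @abstractmethod
    def train(self, examples: List[dict]) -> None:
        ...

    @abstractmethod
    def predict(self, query: str) -> str:
        ...


class EchoBaseline(CodeModel):
    """Trivial placeholder model, used only to show the plug-in shape."""

    def train(self, examples: List[dict]) -> None:
        self.examples = examples

    def predict(self, query: str) -> str:
        return self.examples[0]["code"] if self.examples else ""


def evaluate(model: CodeModel, data: List[dict]) -> float:
    """The same harness works for every model, so results are directly comparable."""
    model.train(data)
    hits = sum(model.predict(d["text"]) == d["code"] for d in data)
    return hits / len(data) if data else 0.0
```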
Data quality
Data quality mainly considers the following data indicators:
- Completeness: whether data is missing;
- Conformity: whether data is stored according to the required specifications;
- Consistency: whether data values carry conflicting meanings;
- Accuracy: whether the data is wrong;
- Uniqueness: whether data is duplicated;
- Timeliness: whether data is uploaded within the required time.
Every deep learning task depends on data acquisition and processing, and this step takes a lot of time because its quality is strongly related to the final training result. Data quality is therefore a particularly important part of the data infrastructure we are building. During data acquisition and data sharing, we set up strict validation rules to ensure the accuracy of our data. Take the source code mentioned above: after initial processing such as removing blank lines and adding header and tail marker characters, basic checks are run at storage time to verify that each field is not empty and that enumeration values meet the requirements. For specific code tasks such as text-to-code, the text and the code must not be empty, and stricter checks, such as validating the length of the text, may also apply. Of course, different tasks have different data specifications and requirements. What we need to do is abstract the data requirements of common tasks, cover the data processing of all common tasks as far as possible, and support some task-specific requirements to ensure the validity of the data.
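As a sketch under assumed field names and thresholds (none of the rule names or limits below come from our actual specification), task-level validation could look like this:

```python
from typing import Callable, Dict, List


def not_empty(field: str) -> Callable[[dict], bool]:
    return lambda row: bool(row.get(field))


def max_length(field: str, limit: int) -> Callable[[dict], bool]:
    return lambda row: len(row.get(field, "")) <= limit


# Common rules shared by all tasks, plus task-specific rules.
RULES: Dict[str, List[Callable[[dict], bool]]] = {
    "common": [not_empty("code")],
    "text_to_code": [not_empty("text"), max_length("text", 512)],
}


def validate(task: str, rows: List[dict]) -> List[dict]:
    """Keep only the rows that pass the common rules plus the task-specific rules."""
    checks = RULES["common"] + RULES.get(task, [])
    return [r for r in rows if all(check(r) for check in checks)]


sample = [
    {"text": "add two numbers", "code": "def add(a, b): return a + b"},
    {"text": "", "code": "pass"},  # fails the not_empty("text") rule
]
print(validate("text_to_code", sample))
```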
Final thoughts
Data asset management is a combination of solutions and standards; the items above are just the few we consider most relevant to the intelligence field. Others include data standards, domain modeling, and so on. Data standards mainly address the consistency and accuracy of data, while a data model is an abstraction of the characteristics of real-world data, used to describe the concepts and definitions of a set of data. These concepts and methods can serve as references for our intelligence work and call for further research and practice.
This article attempts to apply standardized process methods such as data asset management to the field of intelligence, in the hope that the two fields can spark new ideas through this cross-domain collision. The core goal is to improve the efficiency and value of data, and hopefully to offer some help to machine learning developers.
References
- White Paper on Data Asset Management Practices