Introduction: In this article, we discuss the key elements of effective data product development. Data products are built by algorithm engineers, so let's start with how to assess an algorithm engineer's skill level.

Over the past two years, I have been working with algorithm engineers at IT companies. I am an algorithm engineer myself, and I have interviewed algorithm engineers at all levels. When I first joined an IT company, I was confronted with a big problem: it was difficult to find out a candidate's real algorithmic skill level in a short time. There were two main difficulties:

  • Most new graduates had hardly published any academic papers, so I had to rely on their former colleagues to assess their algorithmic and programming ability.
  • Some candidates from little-known IT companies say they were involved in developing various online projects, but it is unclear what role they actually played, how complex the problems were, or how effective their algorithms turned out to be.

Both difficulties come down to the same issue: in a short period of time, there is no credible standard for measuring the other party's algorithmic ability.

I did research in academia for nearly 20 years and trained a group of talented people. Some of them went into academia (NYU, UNC, Emory, UoT, Purdue, Yale, UoA, FSU, etc.) and others into industry (Uber, DiDi, Goldman Sachs, etc.). During their years in my lab, they worked with me to study interesting problems in an application scenario, collect and/or process the relevant data, design algorithms, and derive the relevant theory. In the academic world, I also interviewed many young researchers, mainly by reading their recommendation letters and published articles and by attending their interviews and academic talks. There, high-quality articles and monographs have always been a necessary condition for getting a good job. The number of papers published in top journals and conferences and the number of citations are generally regarded as credible criteria, because the reviewers at these venues have very high standards. They essentially apply two criteria:

  • Does the question itself carry real significance, whether academic, scientific, or commercial?
  • Do your algorithm and theory actually solve the problem, and to what extent?

When I assess a candidate's algorithmic skill level now, I mainly ask the following basic questions about the algorithms the candidate uses most often:

  • Can you tune the relevant models, and what is your own view of them?
  • How deep and broad is your understanding of these algorithms?
  • Why does an algorithm sometimes fail to work?
  • If this algorithm doesn’t work, can you develop a better one?

Many students without rigorous algorithmic training often get stuck on the second question. For mid-level and senior algorithm engineers, it is better to have high-quality academic papers/monographs or widely recognized products in industry; such a track record reflects deep thinking about a class of problems and serves as a credible hard standard. In contrast, someone who has worked at Google's DeepMind for a few years will generally be assumed to be good, because DeepMind itself represents a credible soft standard. A good algorithm team should have plenty of algorithm engineers who meet both the hard and the soft standard, but the question remains: what kind of algorithm engineers are needed for efficient data product development?

Furthermore, in recent years we have come into contact with many algorithm students, worked with them every day, and listened to them talk almost daily about various models (such as X-Learner, XGBoost, CNN, Transformer, GCN, U-Net, etc.) and technical indicators (such as AUC, F-score, etc.), and finally about offline and online business metrics (such as ROI, GMV, etc.). However, it is practically impossible for us to run all of their algorithms and check every detail, so how can we ensure the quality of data product development?

First of all, in data product development, selecting good business metrics is the most important standard for ensuring the quality of a data product, and establishing reasonable business metrics is at the core of implementing the application layer of any data product.

  • Sometimes business metrics are straightforward: for products related to human safety, reducing accident and fatality rates is a common business metric. However, for strategies related to user lifecycle management, including acquisition, activation, retention, and win-back, the business metrics used to evaluate success are not that simple. Many companies use lifetime value (LTV) as their business metric: the sum of all the economic revenue a company earns from the interactions of all of its users. At the individual level this is "customer lifetime value", the sum of the future revenue each buyer may bring to the business. If we further consider how users contribute to a company's revenue, the user lifecycle can be broken down into nascent, mature, declining, silent, and reactivated phases, and LTV can be further decomposed into core business metrics such as user volume, retention, daily active users (DAU), and average revenue per user. The other most commonly used metric is Return on Investment, ROI = (profit before tax / total investment) x 100%, which is often used to measure the effectiveness and efficiency of various operational strategies. A deep question for business strategy is whether LTV or ROI is the better business metric; for lack of space, we do not address that issue here. Please look out for future updates.
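As a minimal illustrative sketch (not from the original article), the two formulas above could be computed as follows; all numbers and function names here are hypothetical:

```python
# Toy implementations of the two business metrics discussed above.
# All figures are made up for illustration.

def roi(pretax_profit, total_investment):
    """Return on Investment = (profit before tax / total investment) x 100%."""
    return pretax_profit / total_investment * 100.0

def customer_ltv(revenues):
    """Customer lifetime value as the simple sum of the revenue
    one user is expected to bring in over their lifecycle."""
    return sum(revenues)

# A user expected to contribute these revenues, month by month:
print(customer_ltv([12.0, 9.5, 7.0, 4.0]))                     # 32.5
print(roi(pretax_profit=250_000, total_investment=1_000_000))  # 25.0
```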

Secondly, in data product development we need to consider two types of technical indicators: the algorithm's technical metrics and the model's diagnostic statistics. Most data products are organic combinations that can be decomposed into a series of classification and regression modules.

  • The technical metrics of an algorithm (zhuanlan.zhihu.com/p/84665209) can be grouped by whether the algorithm performs classification or regression. The commonly used metrics for classification algorithms include accuracy, precision, recall, F-score, AUC (Area Under Curve), the Gini coefficient, gain charts, KS (Kolmogorov-Smirnov), etc. For example, gain charts are widely used in targeting problems to determine which decile a user falls into for a particular activity, while KS measures the degree of separation between the score distributions of positive and negative cases (see the first sketch after this list). The common metrics for regression algorithms include mean absolute error, mean squared error, logarithmic loss, root mean squared error, the coefficient of determination, and the adjusted coefficient of determination.

  • We also need to consider the model's diagnostic statistics, because the algorithm's technical metrics cannot fully reflect whether the model's core assumptions are reasonable. This means using statistical model-diagnostic tools to find outliers and high-influence points in the data, to describe the data's distribution, to select better loss functions, and to improve the model's generalization (see the second sketch after this list).

  • The algorithm's technical metrics are generally macro-level indicators, while the model's diagnostic tools are mostly micro-level indicators of individual data points under a given model.
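A first minimal sketch, computing the classification metrics named above (precision, recall, F-score, AUC, and the KS statistic) with scikit-learn on toy scores; the data is purely illustrative:

```python
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, roc_curve)

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])
y_pred = (y_score >= 0.5).astype(int)  # threshold the scores at 0.5

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("AUC:      ", roc_auc_score(y_true, y_score))

# KS statistic: the maximum separation between the cumulative score
# distributions of positives and negatives, i.e. max |TPR - FPR|.
fpr, tpr, _ = roc_curve(y_true, y_score)
print("KS:       ", np.max(tpr - fpr))
```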
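A second minimal sketch of model diagnostics, assuming an ordinary linear regression fitted with statsmodels; the synthetic data and the common 4/n cutoff for Cook's distance are illustrative choices, not prescriptions from the text:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)
y[5] += 8.0  # plant an artificial outlier

model = sm.OLS(y, sm.add_constant(x)).fit()
influence = model.get_influence()

# Cook's distance flags points whose removal would move the fit the most.
cooks_d, _ = influence.cooks_distance
suspects = np.where(cooks_d > 4 / len(x))[0]  # rule-of-thumb cutoff
print("high-influence points:", suspects)
print("studentized residual at index 5:",
      influence.resid_studentized_external[5])
```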

With that in mind, let’s talk about the relationship between technical and business metrics in data product development.

1. Module decomposition and development of data products

To build a good data product, another important task is to effectively decompose the business metrics into modules, find a clear solution, and carry out the corresponding data construction, algorithm development, and system optimization. To achieve this, there are two key points:

  • A deep understanding of the business, which requires the relevant algorithm engineers to stumble through the business for a long time in order to find the key points for solving the problem.

  • An effective command of the depth and breadth of the various algorithms by the relevant personnel, so that each module can actually be built up.

What companies need most are algorithm engineers who understand both the business and the algorithmic technology; they are the key to data product development.

2. Generalization of modules

In developing each module, we generally collect one or several training data sets, build a set of models, use cross-validation to get feedback from the algorithm's technical metrics and the model's diagnostic statistics, and select and improve the model until it reaches a certain accuracy (see the sketch after the list below).

  • The generalization of a model refers to its ability to adapt to new data sets. Good accuracy on the training data set does not guarantee that the model will generalize once it is online.

  • The claim that good accuracy guarantees good generalization as long as the test set is large enough is not necessarily true either.

  • One of the most critical ways to ensure generalization is to build the underlying data efficiently and systematically and to learn the causal relationships underlying the data, so that the model also applies to new data that shares those causal relationships with the training set.
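A minimal sketch of the module-development loop described above, assuming scikit-learn and a synthetic data set; the model and metric choices are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for one module's training data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = GradientBoostingClassifier(random_state=0)

# Cross-validated technical metric used as feedback for model selection.
cv_auc = cross_val_score(clf, X_train, y_train, cv=5, scoring="roc_auc")
print("cross-validated AUC:", cv_auc.mean())

# Held-out performance as a (weak) proxy for generalization: a large gap
# between these two numbers is an early warning that the model may not
# hold up online, where the data can drift away from the training set.
clf.fit(X_train, y_train)
print("held-out AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```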

3. Translating technical metrics into business metrics

In general, achieving good technical metrics in every module does not automatically translate into good business metrics; that depends on how effectively the business metrics have been decomposed. Even with a good decomposition, the technical metrics of some modules may have little impact on improving the business metrics, so when building the product we must run the whole pipeline end to end and find the key modules to optimize, even though, at a certain cost, everyone still wants every module to reach its best technical metrics. This is the hardest point in bringing a data product to production, and more case studies are needed to analyze it in depth.

Finally, take the data product AlphaGo as an example. Its business metric is beating the competition, and it is broken down into four modules (a policy network, a value network, a fast rollout policy, and Monte Carlo tree search) that form a complete system.

At the algorithmic level, AlphaGo combines deep learning, reinforcement learning, and Monte Carlo tree search, and made groundbreaking advances in these methods, producing a substantial leap in playing strength and a record of beating multiple world champions. These achievements stem from the AlphaGo team's efficient decomposition of metrics, advanced algorithmic skill, and effective construction of the underlying data.


About the authors

Tenured professor of biostatistics at the University of North Carolina at Chapel Hill. He joined Didi Chuxing in 2018, leading engineers to build a set of innovative theories and platforms for the operation of the Didi Chuxing platform.

Doctor of statistics from North Carolina State University. Joined Didi Chuxing in 2018, mainly engaged in the research and application of statistics and machine learning in two-sided exchange markets.

We are recruiting

Engineers and experts with research and practical experience in big data engines (such as Spark and Flink) are welcome to join the Didi Big Data Architecture Department to take on the daily challenges of trillion-record data processing in the Internet + travel industry.

Delivery email | [email protected]

Please use "Name + Department + Direction" as the email subject.
