Author | Hu Yu, Tsinghua University postdoc, core team member of Fangyun Intelligence, AI algorithm expert

The founding team of Fangyun has deep experience in technical R&D and enterprise management. Drawing on long-term industry accumulation and a deep understanding of the digital industry, Fangyun evaluates R&D teams in a data-driven way, enabling enterprises to accurately measure the work efficiency of R&D organizations and individuals and to allocate R&D resources rationally. It helps technology decision makers accurately measure R&D organizational performance (for upward reporting and horizontal communication) and individual performance (for downward management). In 2020, we made many attempts at data analysis based on actual user data, achieved notable results, transformed the research results into practical applications, and substantially improved product capabilities.

Algorithm research is grounded in data. Whether it is modeling and analysis based on mathematics and empirical knowledge, or data analysis based on statistics and machine learning, all of it relies on data.

In the first step of our algorithm research, we established an independent data indicator system and carried out all subsequent research on top of it. The indicator system consists of three levels: the first level is the most basic metadata, the second level is calculated from the first, and the third level is calculated from the second and first levels. Generally speaking, higher-level indicators have higher information density and convey deeper information when used for representation. On the other hand, data analysis does not mean that the higher the level of the indicators selected, the better the result: the necessary indicators must be chosen at each level according to the specific scenario and the algorithm's requirements. For example, in our experiments the KMeans algorithm classified better with low-level indicators, while the SVM algorithm required high-level indicators.
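To make the hierarchy concrete, here is a minimal sketch in Python; the column names and formulas are hypothetical illustrations, not Fangyun's actual indicator schema:

```python
import pandas as pd

# Hypothetical level-1 metadata per employee (illustrative columns only).
df = pd.DataFrame({
    "employee": ["a", "b", "c"],
    "commits": [120, 45, 80],        # level 1: raw repository counts
    "tasks_closed": [30, 12, 25],    # level 1: raw tracker counts
    "active_days": [20, 15, 22],     # level 1: days with any activity
    "hours_logged": [160, 120, 170], # level 1: raw time records
})

# Level-2 indicators are computed from level-1 metadata.
df["commit_frequency"] = df["commits"] / df["active_days"]
df["avg_hours_per_task"] = df["hours_logged"] / df["tasks_closed"]

# A level-3 indicator combines level-2 (and level-1) indicators into a denser signal.
df["throughput_index"] = df["commit_frequency"] / df["avg_hours_per_task"]

print(df[["employee", "commit_frequency", "avg_hours_per_task", "throughput_index"]])
```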

The second step: KMeans. Given that the metadata collection was relatively complete and the amount of data small, we chose the KMeans algorithm, following scikit-learn's estimator selection flowchart, to perform unsupervised clustering of employees' behavioral data.

In addition to selecting some basic indicator data, we introduced the RFM idea, using employees' recency of work (R), frequency of work (F), and volume of work (M) within a specified period as clustering indicators, and achieved a very clear classification effect. The key point is that we evaluated not only employees' work-result data through basic indicators, but also their work-process data through the RFM method. Clustering on the combination of these two kinds of data classifies and characterizes employees well, and the classification results can be interpreted directly from the meaning of the indicators.
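A minimal sketch of this approach with scikit-learn, using synthetic RFM values; the value ranges and the cluster count are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical RFM features per employee over a fixed period:
# R = days since last activity, F = number of work items, M = total workload.
X = np.column_stack([
    rng.integers(0, 30, 200),   # recency (days)
    rng.integers(1, 60, 200),   # frequency (count)
    rng.integers(1, 400, 200),  # workload (e.g., story points)
]).astype(float)

# Standardize so no single RFM dimension dominates the distance metric.
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled)

# Cluster centers, mapped back to the original units, read directly as
# R/F/M profiles, which is what makes the clusters easy to interpret.
print(scaler.inverse_transform(kmeans.cluster_centers_))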

The third step: SVM. Having obtained good clustering results, we considered the data quality reliable, effectively giving us an objective data set. On this basis, we asked enterprise managers to provide employee performance scores as labels, yielding a supervised-learning training set with which to study the prediction of employee behavior. We made a variety of attempts in this work and, through feature engineering, finally selected the 15 most effective indicators to represent employee behavior.
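The training setup might look like the following sketch; the synthetic data stands in for the 15 real behavioral indicators, and the kernel and parameters are assumptions, not the settings actually used:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for the real data set: 15 behavioral indicators per employee,
# with labels derived from managers' performance scores (synthetic here).
X, y = make_classification(n_samples=300, n_features=15, n_informative=8, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# SVMs are scale-sensitive, so standardization belongs in the pipeline.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```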

Here we review the research process as a reference for future work. In the initial SVM analysis, we selected more than 60 indicators for supervised learning, but the learning effect was poor and the separation between categories was very low, mainly because with so many indicators the SVM could not find clear boundaries between categories. We therefore applied feature-engineering methods to reduce the dimensionality. First, through Pearson correlation analysis, the indicators were divided into 24 groups by degree of correlation; since the indicators within each group were highly correlated, one representative indicator could be selected from each. In this selection process, our research team chose the 24 most representative indicators according to the actual situation. Second, 24 indicators were still too many for SVM, so we used the RFE (recursive feature elimination) algorithm to determine which indicators had the greatest impact on learning accuracy and thereby select the most effective ones. In the RFE process, we used five algorithms as filters, namely Lasso, Ridge, logistic regression, RandomForestClassifier, and linear SVM, obtaining the most effective features under each. We then selected the features regarded as "effective" by the most algorithms, such as average task completion time. If all five filters consider a feature effective, it is a good feature for supervised learning.
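The voting scheme can be sketched as follows; the synthetic data, the choice of how many features each filter keeps, and the hyperparameters are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso, LogisticRegression, Ridge
from sklearn.svm import LinearSVC

# Synthetic stand-in for the 24 representative indicators.
X, y = make_classification(n_samples=300, n_features=24, n_informative=10, random_state=0)

filters = {
    "lasso": Lasso(alpha=0.01),
    "ridge": Ridge(alpha=1.0),
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "linear_svm": LinearSVC(C=1.0, max_iter=5000),
}

# Each filter votes for the features it keeps under RFE;
# features kept by the most filters win.
votes = np.zeros(X.shape[1], dtype=int)
for name, estimator in filters.items():
    selector = RFE(estimator, n_features_to_select=15).fit(X, y)
    votes += selector.support_.astype(int)

print("vote counts per feature:", votes)
print("features endorsed by all five filters:", np.flatnonzero(votes == len(filters)))
```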

In addition, feature filtering should consider whether the filter and the classifier share the same algorithmic paradigm. For example, if the classifier is to be an SVM, then the filters should be chosen from the SVM family. Only in this way can we ensure that the selected features are the most effective under the corresponding classification algorithm.

The fourth step: data distribution fitting. Although the first three steps achieved certain results, a careful examination of the existing data revealed two remaining problems. First, some data were still missing or incorrectly entered, which is a data-error problem. Second, the relatively complete data contained some extreme values; these are not necessarily erroneous and may instead reflect the unusual behavior of individual employees. In either case (provided that missing values have been preprocessed), we can judge the distribution of the data by fitting distribution functions to it and look for the outliers.

In the distribution-fitting study, after trying various distribution functions, we settled on four common ones, the normal, F, chi-square, and Gamma distributions, to fit employee behavior data. Taking the normal distribution as an example: if an indicator fits it, we can treat data inside the 5% tails on each side as routine behavior and data beyond the 5% tails as candidate abnormal behavior. Further analysis shows that data between the 5% and 0.1% (1/1000) tails on one side is sometimes reasonable, while data beyond the 0.1% tail is most likely abnormal. Through such analysis, distribution fitting lets us find abnormal employee behavior data and propose corresponding management strategies.
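A minimal sketch of this analysis with SciPy, using the normal distribution and a synthetic "daily hours" indicator; the thresholds follow the text, everything else is an assumption:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical indicator, e.g., daily working hours, assumed roughly normal.
hours = rng.normal(loc=8.0, scale=1.2, size=500)
hours[:3] = [18.0, 0.2, 16.5]  # inject a few extreme records

# Fit a normal distribution to the observed data.
mu, sigma = stats.norm.fit(hours)

# 5% tails flag candidate anomalies; 0.1% tails flag near-certain ones.
lo5, hi5 = stats.norm.ppf([0.05, 0.95], mu, sigma)
lo01, hi01 = stats.norm.ppf([0.001, 0.999], mu, sigma)

candidates = hours[(hours < lo5) | (hours > hi5)]
near_certain = hours[(hours < lo01) | (hours > hi01)]
print(f"5% tail flags: {candidates.size}, 0.1% tail flags: {near_certain.size}")

# A goodness-of-fit check (cf. the significance discussion below) could use,
# e.g., stats.kstest(hours, "norm", args=(mu, sigma)).
```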

We also initially proposed that data should be considered to conform to a given distribution only when the fit is statistically significant. Judged this way, however, some data failed the significance requirement yet clearly carried strong practical information, so we concluded that significance need not be a precondition of the analysis. This also shows that in the digital era, data should be analyzed with practical means that guide the business, rather than by overly academic or rigid criteria.

To sum up, along these four main research lines, we carried out a series of standard algorithm studies on the employee behavior data of cooperating customers, covering feature engineering, unsupervised learning, supervised learning, and data distribution fitting. We then transformed the research results into concrete applications in combination with actual usage scenarios. The specific applications are summarized next.

(2) Product transformation results

Transforming research results into products is a process of continuous accumulation, of quantitative change producing qualitative change. In the initial study, we research at multiple points without knowing which results will ultimately become practical applications. As research accumulates, results that can become actual product features emerge at three levels. At the first level, some good research points, typically scenario-specific solutions, translate directly into product functionality. At the second level, a single function point may seem to have little value, but once a typical function point appears, the seemingly useless ones turn out to be effective complements to it. At the third level, several studies exhibit commonalities that can be distilled into product ideas and product models, which are more valuable than single-point functions. This path from research to product, rooted in practice and refined by summary, is a useful reference.

After exploring several research points, we kept thinking about how to turn them into practical functions, which requires combining customer needs with our own analysis of user pain points and product design. The main line of our 2020 research was the employee behavior portrait. Whether supervised or unsupervised, the learning amounts to selecting an appropriate set of indicators and weights to rank employees. We therefore integrated a variety of ranking algorithms and ultimately let users choose their own ranking mode. Each mode offers a different algorithm or ranking method, so back-end intelligence serves diversified needs at the front end; this is how products in the digital era provide personalized functions in an intelligent way. Specifically, we provide users with four alternative modes for ranking employees.

Mode 1: industry best practice. We develop a set of indicators and corresponding weights from the existing cases of mature users. Users select the category of case they want, and we compute the ranking from their actual data. There are two scoring modes here: one uses weights customized in the product; the other starts from an existing score ranking, uses KMeans to confirm the excellence of the different categories, and recovers the indicator weights with a regression tree.
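The second scoring mode's weight recovery can be sketched like this; the synthetic data and the use of tree feature importances as recovered weights are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical indicator matrix and an existing score ranking from a mature case.
X = rng.random((200, 6))
true_weights = np.array([0.4, 0.25, 0.15, 0.1, 0.07, 0.03])
scores = X @ true_weights + rng.normal(0, 0.02, 200)

# Fit a regression tree on the scores and read feature importances
# back as (approximate) indicator weights.
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, scores)
print("recovered indicator weights:", np.round(tree.feature_importances_, 3))
```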

Mode 2: AI clustering. The system runs KMeans three or more times on employees' natural-state data, adjusting the indicator set and weights each time. The customer then chooses the clustering result that matches their expectations, and that choice determines the indicator set and weights.
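A sketch of running several weighted KMeans configurations for the customer to choose from; the configurations are hypothetical, and the silhouette score is just one possible quality cue, not necessarily what the product displays:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.random((150, 8)))

# Three candidate configurations: different indicator subsets and per-indicator weights.
configs = [
    {"features": [0, 1, 2], "weights": [1.0, 1.0, 1.0]},
    {"features": [0, 3, 4, 5], "weights": [2.0, 1.0, 1.0, 0.5]},
    {"features": list(range(8)), "weights": [1.0] * 8},
]

for i, cfg in enumerate(configs):
    # Weighting features before KMeans is equivalent to a weighted Euclidean distance.
    Xw = X[:, cfg["features"]] * np.array(cfg["weights"])
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(Xw)
    print(f"config {i}: silhouette = {silhouette_score(Xw, labels):.3f}")
```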

Mode 3: AI supervised learning. KMeans clusters the employees into n categories, the customer ranks and scores the n categories by excellence, and the system then judges the importance of the different indicators from those scores using the RFE algorithm (with decision-tree regression or classification as the estimator).
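A sketch of this pipeline, with a hypothetical customer ranking of the clusters:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 10))  # hypothetical behavioral indicators

# Step 1: cluster employees into n categories.
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: the customer ranks the clusters by excellence (hypothetical mapping).
rank_of_cluster = {0: 2, 1: 0, 2: 1}
y = np.array([rank_of_cluster[c] for c in cluster_ids])

# Step 3: RFE with a decision-tree estimator reveals which indicators drive the ranking.
selector = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=4).fit(X, y)
print("most influential indicators:", np.flatnonzero(selector.support_))
```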

Mode 4: AI-assisted customization (fully manual). The user specifies n indicators and the corresponding n weights, and the system ranks employees accordingly. The available algorithms are weighted sum, RandomForestRegressor, and GradientBoostingRegressor. Note that the latter two take the weighted sum as the score y and the weighted indicators as the input x, and a model is then trained on them.
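A sketch of this mode; the indicator count, weights, and model parameters are illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

rng = np.random.default_rng(0)

# User-specified indicators and weights (hypothetical values).
X = rng.random((200, 5))
w = np.array([0.3, 0.3, 0.2, 0.1, 0.1])

# y is the weighted sum of the chosen indicators; the regressors are trained against it.
y = X @ w

for model in (RandomForestRegressor(n_estimators=100, random_state=0),
              GradientBoostingRegressor(random_state=0)):
    model.fit(X, y)
    ranking = np.argsort(-model.predict(X))  # employees ordered by predicted score
    print(type(model).__name__, "top 5 employees:", ranking[:5])
```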

Fangyun Intelligence's various AI performance evaluation methods have been verified in practice and productized.

(3) Algorithm accuracy analysis

Data-analysis results generally need a certain accuracy before we can say the algorithm solves the problem to some degree. In digital transformation, however, we need not judge an algorithm by absolute prediction accuracy. When we evaluate employee behavior, the training-set labels reflect people's judgments and are highly subjective, and that subjectivity changes dynamically; what the algorithm captures may sometimes be an objective law and sometimes merely a manager's temporary mood. We should evaluate algorithms from practice: an algorithm that conforms to cognition and known rules is a good one, but an algorithm that can explain or capture short-term user attitudes is also reliable. Specifically, we summarize the accuracy of the existing studies as follows.

1. KMeans is unsupervised learning and has no accuracy metric, but it can explain our discovery of the "old oxen" (quietly diligent employees) and the "Mr. Nanguos" (those who merely make up the numbers), which is in line with management common sense.

2. In SVM prediction, we first obtained a key conclusion: high, medium, and low management rigor correspond to medium, high, and low employee performance, respectively. This conclusion accords with common-sense rules, so we can infer that the algorithm is effective.

3. Training SVM on past employee data plus labels, our accuracy in predicting the future was initially only 60%, but after sample screening and parameter tuning it reached 93%.

4. In the data-rationality analysis, by fitting employee behavior data with different distributions, we first selected the employees within the 95% interval and then those between the 95% and 0.1% thresholds, in order to precisely identify employees with data problems. Practice shows that we do capture the extreme behavior points, but we also capture some reasonable behavior points lying beyond the 5% threshold.

(4) Research summary and next step plan

The purpose of algorithm research and data analysis is to find new user needs and develop new product functions. In part (2), we summarized the path from research to actual product functions: first, good research points are transformed directly into product features; second, some low-value function points support the typical function points; third, the common ideas reflected across studies are transformed into product ideas and product models.

Follow-up research will likewise explore more product functions and product models along these three lines. The main ideas proposed so far are:

First, embed project-management knowledge and processes into the product to help enterprise managers complete project management simply and efficiently. Dynamically assigning people to different tasks is a typical feature; on that basis, analyzing and ranking employee behavior becomes a very useful auxiliary function, since we can assign employees to different tasks according to their behavioral characteristics.

Second, deepen single-point functions. When training the SVM model, we found that a model trained in one month did not predict the following month (or other months) with stable accuracy, most likely because the evaluation criteria fluctuate from month to month. We can therefore train a model for each month over long-term data, obtaining multiple models, and predict each new month's data with the models of past months. One month's data is then evaluated differently under each past month's model, which reflects the fluctuation of the evaluation standard across months.
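A sketch of this idea: one model per month, with every month's data scored under every month's model; the synthetic data merely mimics month-to-month drift:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# One synthetic (X, y) set per month, standing in for monthly behavior data + labels.
months = {m: make_classification(n_samples=150, n_features=15,
                                 n_informative=8, random_state=m)
          for m in range(4)}

# Train one model per month.
models = {m: SVC().fit(X, y) for m, (X, y) in months.items()}

# Score every month's data under every month's model; off-diagonal drops
# in this matrix reflect fluctuation of the evaluation standard over time.
acc = np.zeros((len(months), len(months)))
for i, model in models.items():
    for j, (X, y) in months.items():
        acc[i, j] = model.score(X, y)
print(np.round(acc, 2))
```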

Third, upgrade the product model. We can use a lightweight front end that collects the simple, necessary data and leave the complex analysis to the back end. On the front end, the user is presented with personalized choices of data and models; the back end performs the diversified, intelligent analysis and presents the interface (e.g., intelligent, templated processes) and the analysis results (e.g., behavior-space mapping) to the user. Even the process, data, and algorithms can be customized, with the system providing the analysis results.

Learn more: www.farcloud.com