Machine Learning Pipeline – Feature Processing – Part 1
Part 0, Looking back
Log data processing: Processes log data from the original Hive table or HDFS
Sample feature processing: sample labeling, sample cleaning, sampling, and CXR calibration.
In the sample feature processing logic above, we selected several fields that can uniquely identify the primary traffic: the user's unique hardware ID (imei), the trigger ID of the current user behavior (triggerId), the identifier of the current ad slot (posid), the ID of the ad involved (adid), the label field marking whether a click occurred, and the timestamp field recording when the logged behavior happened. All of these fields matter for downstream feature processing.
Roughly speaking, these fields record whether a user (imei), at a certain time (timestamp), in a certain request (triggerId), on a certain ad slot (posid), performed a behavior on a certain ad (adid) that is known to have converted (label).
Part 1, The main content of this issue
Picking up where the previous articles left off, we have already introduced the log data processing and sample processing stages of an enterprise machine learning pipeline. Following that structure, we now begin to introduce feature processing.
The difference between feature processing and sample processing is as follows: a sample uniquely identifies a behavior state and prepares the necessary fields for feature processing. The sample stage determines the number of samples used in model training, the ratio of positive to negative samples, and the user distribution, none of which should be changed by downstream processes.
So-called feature processing means gathering more ad, user, traffic, and context data (ad/user/context, "AUC" data for short) to enrich the feature data and the organizational forms available to the model.
Generally, when we receive a batch of data, we observe the value format of each field, count each field's coverage, and do some macro-level statistics and processing before using the data.
ID-type features can be treated as sparse features; text-type and category-type features can likewise be treated as discrete features.
Continuous features are generally discretized into buckets before being fed into the model to obtain embeddings. Similarly to GBDT + LR, continuous features can be discretized by a tree model and then used in combination with another model, an approach that has achieved good online results in industry. Of course, some people feed continuous features directly into the dense part of the model as a single dimension, but the experiments I ran that way were mediocre, with a wave of negative optimization.
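As a minimal sketch of the bucket-discretization step, here is how it might look with Spark ML's Bucketizer (the split points and column names are illustrative assumptions, not values from any real pipeline):

import org.apache.spark.ml.feature.Bucketizer

// Map a continuous column into bucket indices; each bucket id can then be embedded.
val splits = Array(Double.NegativeInfinity, 0.0, 10.0, 100.0, 1000.0, Double.PositiveInfinity)
val bucketizer = new Bucketizer()
  .setInputCol("downloads")           // hypothetical continuous feature
  .setOutputCol("downloads_bucket")
  .setSplits(splits)
val bucketed = bucketizer.transform(df)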
Field coverage is critical to a feature's impact; coverage above roughly 70% generally yields a clearly positive effect. Of course, there are exceptions, such as real-time features: the number of users with recent short-term behavior is particularly small, yet the effect is still particularly obvious.
Here is some code that uses spark-shell to count the coverage of data fields:
import org.apache.spark.sql.functions._

// Read tab-separated logs; column 4 is appname, column 5 is flag.
val df = spark.read.textFile("/hdfs/user/app/data.20210701/*")
  .map { e =>
    val cols = e.split("\t")
    (cols(4), cols(5))
  }
  .toDF("appname", "flag")
  .cache()

// Coverage = share of rows where the field is not the "-" placeholder.
df.agg(
  (sum(when($"appname" === "-", 0).otherwise(1)) / count("*")).as("appnameCoverage"),
  (sum(when($"flag" === "-", 0).otherwise(1)) / count("*")).as("flagCoverage")
).show()
By the time window used to compute them, feature data can be divided into aggregated multi-day historical features, day-level features, and real-time features, where real-time features can be regarded as a supplement to the day-level features.
Features can also be divided into single-column features, cross features, and sequence features, where sequence features further split into aggregated historical sequence features and real-time sequence features. (Real-time here means near real-time.)
We mainly introduce the feature processing of a machine learning system from the following four aspects:
(1) Context-side features
(2) Ad-side features
(3) User-side features
(4) Feature organization form
1.1. Context-side Features
So-called context-side features carry information about the environment of the current request, where the environment includes both the device context and the request context.
The device context includes the operating system (OS), software version, language, hardware ID, and screen width and height (screen size) of the requesting device.
The request context includes fields such as the user-request timestamp, request channel, sessionId, request IP, mobile network type (net), ad slot posid, number of ads requested, and so on. Going deeper, one can treat the ads already displayed above the current ad in the same slot as features of the current ad, much like Baidu's UBMQ.
Some of these fields we typically preprocess further. For example:
(1) For the IP field, we keep the prefixes formed by the first one, two, and three segments of the IP; after all, IP addresses with similar prefixes have a certain similarity in network space.
(2) For the timestamp field, we can convert the timestamp into year, month, day, hour, minute, and second, and derive whether the request fell on a weekend or a weekday. The stages of a day can also be divided into buckets (a sketch follows this list).
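A minimal sketch of these two derivations, assuming the samples already sit in a DataFrame with a dotted-quad string column ip and an epoch-second column timestamp (both column names are my assumptions):

import org.apache.spark.sql.functions._

val ctx = df
  // IP prefixes: substring_index keeps the text up to the nth "." separator.
  .withColumn("ip_seg1", substring_index($"ip", ".", 1))
  .withColumn("ip_seg2", substring_index($"ip", ".", 2))
  .withColumn("ip_seg3", substring_index($"ip", ".", 3))
  // Break the epoch-second timestamp into calendar fields.
  .withColumn("ts", from_unixtime($"timestamp"))
  .withColumn("hour", hour($"ts"))
  .withColumn("dow", dayofweek($"ts"))                     // 1 = Sunday ... 7 = Saturday
  .withColumn("is_weekend", $"dow".isin(1, 7).cast("int"))
  .withColumn("day_bucket", (hour($"ts") / 6).cast("int")) // four coarse stages of the day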
These attributes can be stored with triggerId as the key and the corresponding fields as the value. Use the triggerId field in the sample table to left-join the context data, and assign default values to the rows the join did not match (this ensures the number of samples does not change).
This is also why the triggerId field is kept in the sample, as shown in the sketch below.
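A minimal sketch of that join, with hypothetical samples and contextFeatures DataFrames and "-" as the placeholder default:

// Left join keeps every sample row; na.fill supplies defaults for unmatched string columns.
val joined = samples
  .join(contextFeatures, Seq("triggerId"), "left")
  .na.fill("-")
assert(joined.count() == samples.count())   // the sample count must not change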
1.2. Ad-side Features (Ad)
The so-called ad-side features generally refer to the item-related features of our ads.
The ad inventory is far smaller than the organic item inventory of a recommender system. Generally speaking, an app-download ad business may have only about 5K ads, which is not very many, so the various ad IDs are themselves strong features.
Ad-side features generally include:
(1) Ad ID features: the ad ID, ad plan ID (planid), creative ID (ideaid), and the ad's corresponding first-level and second-level category IDs (much as on Douyin, a first-level category might be Entertainment and a second-level category Short Video).
(2) Generalization features: ad name, advertiser company name, advertiser company category, the keywords the advertiser set for the ad, the one-sentence description, bidding type, template ID, average bid over the past X days, ad tags, the ad's targeted time periods, ad-group targeting, and whether the ad uses new creative material.
(3) Personalized features: for app ads, the package name, package size, download count, app store ranking, number of reviews, number of positive reviews, and so on.
(4) Statistical features: we can compute statistics along various ad dimensions, such as the 7-day historical click-through rate, average download rate, conversion rate, clicks, downloads, and conversions at the ad granularity (sketched below). Engineers usually compute the click count and similar statistics and then apply some bucketing operation, for example: (clicks * 1000) / 5, or something along those lines.
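A minimal sketch of such statistics, assuming a hypothetical samples7d DataFrame holding the last 7 days of samples with an adid column and a 0/1 label column:

import org.apache.spark.sql.functions._

// Ad-granularity 7-day stats, followed by the coarse click-count bucket from the text.
val adStats = samples7d.groupBy("adid").agg(
  count("*").as("impressions"),
  sum("label").as("clicks"),
  (sum("label") / count("*")).as("ctr7d")
)
val adFeatures = adStats.withColumn("click_bucket", (($"clicks" * 1000) / 5).cast("long"))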
For an advertising system, item-related features generally describe the creative material itself, and they are all very important. For these features we generally use the IDs directly, bucket the numerical ones, and cross them with other features; feature organization is introduced below.
We can store the above features with adid as the key and the corresponding fields as the value. Use the adid in the sample table to left-join the ad data, and assign default values to the rows the join did not match.
1.3. User-side Features (User)
In general, context-side and ad-side features are relatively easy to obtain, and we try all of them in the initial stage of feature optimization. User-side features, on the other hand, can keep expanding as our user behavior logs grow richer and more complete, so there is a lot engineers can do there. Below we use an app-download advertising system as a demo.
User-side features include:
(1) Basic user attributes, including the user's age, gender, education, province, city, and so on.
(2) Aggregated historical behavior features. For example: which ads/organic items the user has viewed in the past 7/14/30 days, which ads they clicked, which apps they downloaded, installed, and used, how long they used each app, which words they searched in the past, and so on. Considering the latency requirements of online real-time predict, these behavior lists can be sorted in reverse time order and truncated to the latest 5/10 behaviors for model training (see the sketch after this list).
(3) Statistical features. For example: the user's average click-through rate, download rate, conversion rate, and so on over ads in the past 7/14/30 days. After obtaining the items in a user's behavior, we can also derive the category features of each granularity that those items map to; for example, a user may be particularly fond of sports, games, and entertainment, with a very high click-through rate on those categories.
(4) Real-time features. Here we treat real-time features as a supplement to the day-level features: what the user has viewed, clicked, downloaded, searched, and so on within the last day, aggregated into a sequence. For this list, engineers can take the timestamp of the current request, subtract the timestamp of each behavior in the list, and discretize the gap into interval buckets. Be careful here that the offline modules do not introduce feature data leakage (time travel).
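A minimal sketch covering both sequence steps (the schema and bucket boundaries are my assumptions): keep only behaviors strictly before the request, truncate to the latest N, and discretize each behavior's age:

import org.apache.spark.sql.functions._

val N = 10
// req_ts: epoch seconds of the request; behavior_ts: array of epoch-second behavior times.
val truncateAndBucket = udf { (reqTs: Long, behaviorTs: Seq[Long]) =>
  behaviorTs
    .filter(_ < reqTs)             // behaviors strictly before the request: no leakage
    .sortBy(t => -t)               // newest first
    .take(N)                       // latest N behaviors only
    .map { t =>
      val gap = reqTs - t          // seconds since the behavior
      if (gap < 3600) 0            // within the hour
      else if (gap < 6 * 3600) 1   // within six hours
      else if (gap < 24 * 3600) 2  // within the day
      else 3
    }
}
val seqFeatures = df.withColumn("gap_buckets", truncateAndBucket($"req_ts", $"behavior_ts"))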
Feature leakage can produce an abnormally high offline AUC, even up to 0.999*, which can be spotted from the offline evaluation metrics.
As user behavior data keeps growing, there is a great deal engineers can do with user-side features; I won't go into more detail here.
We can store the above features with imei as the key and the corresponding fields as the value. Use the imei in the sample table to left-join the user data, and assign default values to the rows the join did not match.
User behavior data is very rich, and we may store many user data tables; all of them should be keyed by imei and left-joined one by one.
Note that when using historical user behavior data, the behavior must have occurred before the sample time; this effectively avoids data leakage.
1.4. Feature Organization Form
Continuing on, we have now looked at the various forms of context-side, user-side, and ad-side features. In practice, however, we use not only single-column features but also certain cross products.
Typically, engineers cross the ad side with the context side, for example to capture the current ad's click-through rate in the current context.
Going further, engineers can cross the ad side with the user side, for example to capture the current user's, or the user's historical behaviors', click-through rate on the current ad. If it is a user behavior sequence, we simply cross the ad side with each element of the user sequence one by one.
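A minimal sketch of a second-order ID cross, concatenating the pair and hashing it into a fixed sparse ID space (the column names and hash-space size are assumptions):

import org.apache.spark.sql.functions._

val NUM_BINS = 1 << 22   // assumed size of the hashed feature space
val crossed = df
  .withColumn("ad_x_pos", concat_ws("_", $"adid", $"posid"))          // string cross
  .withColumn("ad_x_pos_id", pmod(hash($"ad_x_pos"), lit(NUM_BINS)))  // hashed sparse id

Hashing keeps the cross space bounded at the cost of occasional collisions, a common trade-off for high-cardinality crosses.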
Beyond second-order crosses, there are also third- and higher-order crosses. Feature crossing can be done not only by hand but also by models; although a DNN can perform high-order crossing implicitly, manual feature crossing and selection remain essential.
For sequence features, engineers generally apply a pooling operation inside the DNN model. The conventional choices are sum pooling and average pooling; there are also attention-weighted sum pooling operations, including self-attention and DIN-style attention, as well as approaches that model the time factor within the sequence, such as Alibaba's DIEN network.
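As a minimal plain-Scala sketch of what sum and average pooling do to a sequence of embedding vectors (independent of any specific DNN framework):

// Pool a sequence of d-dimensional embeddings into one d-dimensional vector.
def sumPool(seq: Seq[Array[Float]], dim: Int): Array[Float] = {
  val out = new Array[Float](dim)
  for (v <- seq; i <- 0 until dim) out(i) += v(i)
  out
}
def avgPool(seq: Seq[Array[Float]], dim: Int): Array[Float] =
  if (seq.isEmpty) new Array[Float](dim)
  else sumPool(seq, dim).map(_ / seq.size)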
DNN network structures keep evolving, and we will introduce them gradually in subsequent articles. If you are interested, feel free to reach out privately.
With that, the theoretical part of enterprise machine learning pipeline feature processing has been covered. This installment is already too long to go through the code, so the hands-on industrial practice will be introduced in the next installment.
Writing is not easy; if you got something out of this, please like, share, and follow, the triple combo ~
You are welcome to scan the QR code and follow the author's official account: Algorithm Full-Stack Road.