The previous article briefly introduced the definition and applications of recommendation systems. This second article of the series introduces user portraits and how to build them from text.

The directory is as follows:

  • User portrait
    • Definition of user portrait
    • Key to user profiling
    • Methods for constructing user portraits
  • From text to user portraits
    • Building user portraits
    • Structured text
    • Tag selection
  • Summary

User portrait

Definition of user portrait

In essence, a user portrait models and abstracts an attribute-label system for each user from massive amounts of user data. These attributes usually need to carry some commercial value.

From the computer's perspective, however, a user portrait is a vectorization of user information. Vectorization exists for computation: the user portrait is meant to be read by machines, not by people.

The user label system is generally divided into multiple categories (first-level classification), and each category has multiple sub-categories (second-level classification). Sub-categories can also be further divided into smaller sub-categories such as third-level and fourth-level classification. The major categories usually include the following:

  • Demographic attributes. User inherent attributes, such as age and gender;
  • Interest preferences. Users' personal preferences, including preferred categories, brands, distances, merchants, etc.;
  • Characteristic population. Groups with specific meanings, such as students, travel experts, car owners, mothers and children, food lovers, etc.;
  • User classification. The hierarchical division of users, such as membership level, consumption level, preferential sensitivity, etc.
  • LBS attributes. Various attributes related to the user's location, such as the user's resident city and country, hometown, footprint, resident business district, etc.

For a recommender system, the user portrait is not its goal, but a by-product of a key step in the process of building the recommender system.

Recommendation systems are usually divided into recall and sort phases, and user portraits may be used in both phases.

Key to user profiling

The key elements of user profiling are dimension and quantification.

Dimensions

Each dimension should first have an understandable name. For example, when recommending mobile phones to users, dimensions include price, brand, memory size, appearance, etc. It is also necessary to ensure that the dimensions of users and of phones can be matched against each other.

Generally, the more dimensions there are, the more detailed the user portrait will be, but the computation cost also grows, so there is a trade-off to be made.

Which dimensions to use should be driven by the purpose of the recommendation, rather than building a portrait for its own sake. The goal is to improve the hit rate, so consider which factors may influence whether a user clicks on an item. When recommending phones, for example, brand, price and configuration all affect whether the user clicks through to the details page, which is why a phone product title carries exactly this kind of key information, e.g. "iPhone 11 128GB, 5999 yuan".

Quantification

The quantification of a user portrait is essentially a matter of data processing, which can also be called feature engineering. It should be goal-oriented: choose the quantification method according to the recommendation effect it produces.

Methods for constructing user portraits

According to how users are quantified, the methods fall into three types:

1. Checking the records

Use the original data directly as the content of the user portrait, such as demographic information from registration, or purchase and browsing history. Usually only data cleaning is done, with no abstraction or summarization of the data itself. This is very useful in scenarios such as user cold start.

2. Piling up data

This method accumulates historical data and does statistics on it, which is also the most common kind of user portrait data. Typical examples are interest tags: mine the tags from historical behavior, do statistics along the tag dimensions, and use the statistical results as the quantified portrait.

3. Black box

Machine learning methods are used to learn dense vectors that humans cannot interpret intuitively; these play a very important role in recommendation systems. For example:

  • Using shallow semantic models to construct users' reading interests;
  • Latent factors obtained by matrix factorization;
  • Using deep models to learn users' embedding vectors.

The drawback of this approach is that the resulting user portrait data is usually not explainable and cannot be directly understood by people.


From text to user portraits

Text data is the most common form of information in Internet products: it is large in quantity, fast to process and cheap to store. Common text data includes:

  • For users: the name, gender, hobbies and comments provided at registration, and so on;
  • For items: the title, the description, the body of the content itself (typically news and articles), and other textual attributes.

Next, I’ll show you some ways to create user portraits from textual data.

Building user portraits

To build a basic version of a user portrait based on text information about the user and the item, the usual steps are as follows:

  1. Structured text: structure all the unstructured text, discarding the dross and keeping the essence, i.e. retaining the key information;
  2. Tag selection: combine the structured information of users and items based on user behavior data.

The first step is the most critical and fundamental; its accuracy, granularity and coverage all determine the quality of the user portrait. This step mainly relies on text mining algorithms, and the commonly used ones are introduced below.

The second step is to pass the item's portrait on to the user based on the user's historical behavior.

Some of these steps and common algorithms are described below.

Structured text

Original text data is usually expressed in natural language, i.e. it is "unstructured". But computers can only index, retrieve, vectorize and compute over structured data, so the text must first be structured before any subsequent processing.

For text information, mature NLP algorithms can be used to analyze it and obtain the following kinds of information:

  1. Keyword extraction: the most basic source of tags, and also basic data for other text analysis; commonly used algorithms are TF-IDF and TextRank;
  2. Entity recognition: identifying nouns such as people, locations and places, books, movies and TV series, historical and trending events, etc., usually done by combining dictionaries with a CRF model;
  3. Content classification: classifying text according to a classification taxonomy to express coarse-grained structured information;
  4. Text clustering: dividing texts into multiple clusters with unsupervised algorithms, without a manually defined taxonomy; the cluster IDs are also a common component of user portraits;
  5. Topic models: learning topic vectors from a large number of existing texts, then predicting the probability distribution of a new text over the topics. This is also a clustering idea; topic vectors are not in tag form, but they are another common component of user portraits;
  6. Embedding: embeddings can be learned from the word level up to the text level; the goal is to mine the semantic information beneath the literal words and express it in a limited number of dimensions.
1. TF-IDF

TF stands for Term Frequency, i.e. word frequency, and IDF stands for Inverse Document Frequency. The idea behind this keyword-extraction method is simple:

Words that appear repeatedly in a text will be important, and words that appear in all texts will be less important.

According to this idea, they are quantified into TF and IDF:

  • TF: term frequency, the number of times the word appears in the text from which we want to extract keywords;
  • IDF: across all texts, count the number of texts in which the word appears, denoted n (the document frequency), and denote the total number of texts as N.

Therefore, the IDF of a word is calculated as:

$$ \mathrm{IDF} = \log \frac{N}{n + 1} $$
It has the following characteristics:

  1. N is the same for all words, so words that appear in fewer texts (smaller n) get a larger IDF;
  2. The 1 in the denominator prevents the document frequency n of some words from being 0 and making the result infinite;
  3. For new words, n would itself be 0, but they can default to the average document frequency of all words.

The final TF-IDF score is TF * IDF, which gives each word a weight. There are two ways to select keywords based on these weights:

  1. Taking the top-K words is simple and direct, but the value of K has to be chosen; if fewer than K words can be extracted, all of them become keywords, which is unreasonable;
  2. Computing the average weight of all words and taking the words whose weight exceeds the average as keywords.

In addition, depending on the actual situation, some filtering conditions may be added, such as extracting only verbs and nouns as keywords.
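To make this concrete, here is a minimal sketch in plain Python; the toy corpus, the assumption that documents are already tokenized into word lists, and the above-average filtering rule are all illustrative:

```python
# A minimal TF-IDF keyword-extraction sketch in plain Python.
# Assumption: documents are already tokenized into word lists.
import math
from collections import Counter

def tfidf_keywords(target_doc, corpus):
    """Return the words of target_doc whose TF-IDF weight is above average (method 2)."""
    N = len(corpus)                      # total number of texts
    df = Counter()                       # document frequency n of each word
    for doc in corpus:
        df.update(set(doc))

    tf = Counter(target_doc)             # term frequency in the text to extract from
    weights = {w: tf[w] * math.log(N / (df[w] + 1))   # TF * IDF, +1 avoids division by zero
               for w in tf}

    avg = sum(weights.values()) / len(weights)
    return [w for w, score in weights.items() if score > avg]

corpus = [["apple", "phone", "price"], ["apple", "fruit"], ["phone", "brand", "price"]]
print(tfidf_keywords(["apple", "apple", "phone", "new"], corpus))
```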

2. TextRank

TextRank is a derivative of Google's famous PageRank algorithm, which Google uses to measure the importance of web pages. The idea of TextRank is similar and can be summarized as follows:

  1. First, set a window of K words over the text and count the co-occurrence relationships between words inside the window, treating them as an undirected graph (a graph is a network of connected nodes; "undirected" means a connection between two nodes has no direction, the relationship alone is enough);
  2. All words start with an importance weight of 1;
  3. Each node distributes its weight evenly among the nodes connected to it;
  4. Each node takes the sum of the weights distributed to it by all other nodes as its new weight;
  5. Steps 3 and 4 are repeated until the weights of all nodes converge.

The word weights computed by this algorithm show an interesting property: words with co-occurrence relationships support each other in becoming keywords.
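A small Python sketch of these steps follows; the window size, the toy sentence, and the damping factor d = 0.85 (taken from the original PageRank/TextRank formulation, not mentioned in the simplified steps above) are illustrative assumptions:

```python
# A small TextRank sketch following the five steps above.
# Assumptions: the text is a tokenized word list; the damping factor d = 0.85
# comes from the original PageRank/TextRank formulation.
from collections import defaultdict

def textrank(words, window=3, d=0.85, max_iter=100, tol=1e-6):
    # Step 1: co-occurrence relations inside the window, stored as an undirected graph
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for u in words[i + 1:i + window]:
            if u != w:
                neighbors[w].add(u)
                neighbors[u].add(w)

    # Step 2: every word starts with weight 1
    weight = {w: 1.0 for w in neighbors}

    # Steps 3-5: each node shares its weight evenly with its neighbors, repeat until convergence
    for _ in range(max_iter):
        new = {w: (1 - d) + d * sum(weight[u] / len(neighbors[u]) for u in neighbors[w])
               for w in neighbors}
        if max(abs(new[w] - weight[w]) for w in new) < tol:
            return sorted(new.items(), key=lambda kv: -kv[1])
        weight = new
    return sorted(weight.items(), key=lambda kv: -kv[1])

print(textrank("the quick brown fox jumps over the lazy dog and the quick fox".split()))
```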

3. Content classification

In the portal-site era, every website had its own channel system, which is a very large content classification taxonomy. In today's mobile Internet era, news apps likewise categorize news under different channels, such as trending, entertainment, sports, technology, finance and so on. This approach yields the coarsest-grained structured information and is one way to explore users' interests during cold start.

Long texts give content classification plenty of information to work with, but short texts are harder to classify because less information can be extracted. A classical algorithm for short-text classification is SVM, and the most commonly used tool is Facebook's open-source fastText.
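As a hedged usage sketch of the fastText Python package, where the training file name and its "__label__xxx text" line format are assumptions about your own data:

```python
# A usage sketch of Facebook's fastText Python package for short-text classification.
# Assumption: "train.txt" is your own training file with lines in fastText's
# "__label__sports some text ..." format.
import fasttext

model = fasttext.train_supervised(input="train.txt", epoch=25, lr=0.5)
labels, probs = model.predict("new iphone released with a larger screen")
print(labels, probs)
```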

4. Entity recognition

Named Entity Recognition (NER) is usually treated as a sequence labeling problem in NLP, the same family of problems as word segmentation and part-of-speech tagging.

In a sequence labeling problem, given a character sequence, each character is traversed from left to right and classified as we go. The label set varies with the task:

  1. Word segmentation: splitting words, where each character is labeled as one of three classes: "word beginning", "word middle" or "word end";
  2. Part-of-speech tagging: labeling each segmented word with one of the defined parts of speech;
  3. Entity recognition: labeling each word as one of the defined named entity types.

Common algorithms are the hidden Markov model (HMM) and the conditional random field (CRF).

There is also a more practical, non-model approach: dictionary matching. Prepare dictionaries of the various entities in advance, store them in a trie, and look up each segmented word in the dictionary; if a word is found, it is treated as a predefined entity.

As an industrial-grade tool, spaCy is more efficient than NLTK.
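For example, a minimal spaCy sketch, assuming the small English model has already been downloaded:

```python
# A minimal spaCy NER sketch; it assumes the small English model has been
# downloaded first with: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple released the iPhone 11 in Cupertino in September 2019.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # entity text and its type, e.g. ORG, GPE, DATE
```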

5. Clustering

Currently the most common clustering method is the topic model. As an unsupervised algorithm, topic models represented by LDA can capture topics more accurately and achieve soft clustering, i.e. each text can belong to multiple clusters.

The LDA model requires the number of topics to be set. If time allows, experiments can be run to select the number of topics K: compute the average similarity between the K topics each time and pick a K that gives a low value. If time is short, in the recommendation setting the number of topics can simply be made as large as computing resources allow.

Note also that the topic model gives the distribution of a text over the topics, and the few topics with the highest probability can be retained as the topics of that text.

The engineering difficulty of LDA lies in parallelization. If the amount of text does not reach massive scale, improving a single machine's configuration is usually enough. Open-source training tools include Gensim, PLDA and so on.
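A minimal Gensim sketch follows; the toy tokenized texts and the choice of two topics are illustrative assumptions:

```python
# A Gensim LDA sketch. The toy tokenized texts and the choice of 2 topics are
# illustrative assumptions; in practice K is chosen as discussed above.
from gensim import corpora, models

texts = [["phone", "brand", "price"],
         ["travel", "hotel", "flight"],
         ["phone", "screen", "battery"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# Topic distribution of a new text: keep the highest-probability topics as its topics
new_bow = dictionary.doc2bow(["phone", "battery", "price"])
print(lda.get_document_topics(new_bow))
```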

6. Word embedding

Word embedding, also known as word vectors: in all the previous approaches except LDA, the results are sparse tags, whereas word embedding obtains a dense vector for each word.

Simply put, a word may hide a lot of semantic information. "Beijing", for example, may carry "capital", "China", "north China", "municipality", "big city" and so on. Across all texts these latent semantics are limited in number, say 128, so each word can be expressed with a 128-dimensional vector, where the value in each dimension represents how strongly the word carries that meaning.

The uses of these vectors are:

  1. Computing the similarity between words to expand structured tags;
  2. Summing them to obtain a dense vector of a text;
  3. Clustering.

The most famous algorithm here is Word2Vec, which uses a shallow neural network to learn the vector representation of each word. Its biggest contribution is engineering optimization, so that a single machine can produce results on a corpus of millions of words in minutes.
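A minimal gensim Word2Vec sketch; the toy sentences are illustrative, and vector_size=128 simply mirrors the 128-dimension example above:

```python
# A gensim Word2Vec sketch. The toy sentences are illustrative; vector_size=128
# matches the 128-dimension example above (parameter name as in gensim >= 4).
from gensim.models import Word2Vec

sentences = [["beijing", "is", "the", "capital", "of", "china"],
             ["shanghai", "is", "a", "big", "city", "in", "china"]]
model = Word2Vec(sentences, vector_size=128, window=5, min_count=1)

vec = model.wv["beijing"]                 # dense 128-dimensional vector of a word
print(model.wv.most_similar("beijing"))   # similar words, useful for expanding tags
```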

Tag selection

After completing the first step, structuring the text information, you have tags (keywords, categories, etc.), topics and embedding vectors. The second step is then: how do we attach this structured information to users?

The first approach is simple and crude: directly accumulate the tags of the items the user has interacted with.
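A minimal sketch of this accumulation, assuming hypothetical item tags and a made-up user history:

```python
# A minimal sketch of the first approach: accumulate the tags of the items the
# user has interacted with. The item tags and the user's history are made-up data.
from collections import Counter

item_tags = {
    "item1": ["phone", "apple", "128GB"],
    "item2": ["phone", "android"],
    "item3": ["laptop", "apple"],
}
user_history = ["item1", "item3", "item1"]   # items the user clicked or bought

profile = Counter(tag for item in user_history for tag in item_tags[item])
print(profile.most_common())                 # e.g. [('apple', 3), ('phone', 2), ...]
```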

The second approach treats the user's behavior toward items, i.e. whether they consumed them or not, as a classification problem. The user has effectively labeled some data for us through their actions, so picking out the features they are actually interested in becomes a feature selection problem.

The two most commonly used methods are the chi-square test (CHI) and information gain (IG). The basic idea is:

  1. Treat the structured content of an item as a document;
  2. Treat the user's behavior toward the item as a category label;
  3. All the items a user has seen form a text collection;
  4. Run a feature selection algorithm on this text collection to pick out what each user cares about.
1. Chi-square test

The chi-square test is a supervised method, i.e. it requires category labels. Why are labels needed? Because the chi-square test originally served text classification, and feature selection serves the classification task.

Essentially, the chi-square test checks the hypothesis that a word and a certain category C are independent of each other. The larger the deviation from this hypothesis, the more likely the word belongs to category C, which makes it a candidate keyword.

Specifically, to compute the chi-square value of a word Wi with respect to a category Cj, four counts are needed:

  1. A: the number of texts of category Cj that contain the word Wi;
  2. B: the number of texts not of category Cj that contain the word Wi;
  3. C: the number of texts of category Cj that do not contain the word Wi;
  4. D: the number of texts not of category Cj that do not contain the word Wi.

Shown as a table:

                         Belongs to category Cj    Does not belong to Cj    Total
Contains word Wi         A                         B                        A+B
Does not contain Wi      C                         D                        C+D
Total                    A+C                       B+D                      N=A+B+C+D

The chi-square value is then calculated as:

$$ \chi^2(W_i, C_j) = \frac{N\,(AD - BC)^2}{(A+B)(A+C)(B+D)(C+D)} $$
For this calculation, there are several notes:

  1. The counts should be computed for every word and category pair, and any word that helps one of the categories should be kept;
  2. Because we only compare chi-square values, N can be dropped: it is the total number of texts and is the same for every word;
  3. The higher the chi-square value, the farther we are from the hypothesis that the word and the category are independent, so the more likely they are related and the more likely the word is a keyword.
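As a sketch, the score can be computed directly from the four counts; the example numbers are made up:

```python
# A sketch of the chi-square score computed from the four counts A, B, C, D
# defined above. N is kept in the formula, but it can be dropped when the
# scores are only used to rank words. The example counts are made up.
def chi_square(A, B, C, D):
    N = A + B + C + D
    return N * (A * D - B * C) ** 2 / ((A + B) * (A + C) * (B + D) * (C + D))

# the word appears in 30 of 40 "consumed" texts and in 10 of 60 other texts
print(chi_square(A=30, B=10, C=10, D=50))
```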
2. Information gain

Information gain is also a supervised keyword selection method that requires annotation information.

First we need the concept of information entropy, which measures the uncertainty of something. For example, given an arbitrary text and asked to guess its category: if each category contains roughly the same number of texts, it is hard to guess; but if a few categories dominate, say category C accounts for 90% of the texts, then guessing category C gives a very high probability of being right.

The difference between the two situations is that their information entropy differs:

  1. When the numbers of texts in each category are similar, the information entropy is larger;
  2. When a few categories clearly have more texts, the information entropy is smaller.

Next, suppose we first pick out, from the whole pile, only the texts that contain the word W, and again try to guess the category of a text; the same two situations can occur. If the whole corpus is in situation 1 but the texts containing W are in situation 2, then W has clearly played a big role, and this change is exactly the information gain.

The steps of information gain calculation are as follows:

  1. Compute the information entropy of the global text collection;
  2. Compute the conditional entropy of each word, i.e. the entropy of the texts after the word is known: compute the entropy of the texts that contain the word and of those that do not, then take the weighted average by their proportions;
  3. Subtract the two to obtain the information gain of each word, as in the sketch below.
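A minimal sketch of these three steps, with made-up texts and labels; texts are represented as word sets and labels as the user's behavior:

```python
# A sketch of the three steps above: global entropy minus the weighted entropy
# after splitting the texts by whether they contain the word. Texts are
# represented as word sets and labels as the user's behavior; data is made up.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(texts, labels, word):
    with_w = [lab for t, lab in zip(texts, labels) if word in t]
    without_w = [lab for t, lab in zip(texts, labels) if word not in t]
    conditional = (len(with_w) * entropy(with_w)
                   + len(without_w) * entropy(without_w)) / len(labels)
    return entropy(labels) - conditional

texts = [{"phone", "price"}, {"phone", "brand"}, {"travel", "hotel"}, {"travel", "flight"}]
labels = ["consumed", "consumed", "skipped", "skipped"]
print(information_gain(texts, labels, "phone"))   # 1.0: "phone" fully separates the labels
```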

The best-known use of information gain is in decision tree classification: the classical decision tree algorithm computes the information gain of each attribute when choosing a split point and always selects the attribute with the largest information gain as the split node.

The difference between the chi-square test and information gain is that the former selects a set of tags separately for each behavior (category), while the latter selects globally.


Summary

This article first introduced what a user portrait is and the common ways of building one, then described how to build user portraits from text data and how to combine item information with user information.


Reference:

  • "Recommendation System Thirty-Six Lessons", lessons 4-5
