The FFM algorithm, short for Field-aware Factorization Machines, is an improved version of FM (Factorization Machines). FFM was originally proposed by Yu-Chin Juan and his team. By introducing the concept of a field, FFM groups features of the same nature into the same field.

In FFM, every one-dimensional feature learns a separate latent vector for each field that the other features belong to. Latent vectors are therefore related not only to features but also to fields, and this is where the term field-aware in FFM comes from.

So what does FFM improve on compared with FM?

A review of FM

The sample π‘₯ is the 𝑛 dimension vector, and π‘₯𝑖 is the value on the 𝑖 dimension. 𝑣𝑖 is the hidden vector with length 𝐾 corresponding to π‘₯𝑖, and 𝑉 is the model parameter, so all samples use the same π‘₯𝑖, that iswithAll use.

Field of FFM

In FFM, each one-dimensional feature is assigned to a specific field, and the field-to-feature relation is one-to-many: a field contains several features, while each feature belongs to exactly one field.

| field | Field1 = age | Field2 = city | Field2 = city | Field2 = city | Field3 = gender | Field3 = gender |
| --- | --- | --- | --- | --- | --- | --- |
| feature | x1 = age | x2 = Beijing | x3 = Shanghai | x4 = Shenzhen | x5 = male | x6 = female |
| user1 | 23 | 1 | 0 | 0 | 1 | 0 |
| user2 | 25 | 0 | 1 | 0 | 0 | 1 |

For a continuous feature, the feature itself can form a Field on its own; alternatively, the continuous feature can be discretized, with each bucket becoming a feature, for example:

| field | Field1 = age | Field1 = age | Field1 = age | Field1 = age |
| --- | --- | --- | --- | --- |
| feature | age < 20 | 20-30 | 30-40 | age > 40 |
| user1 | 0 | 1 | 0 | 0 |
| user2 | 0 | 1 | 0 | 0 |

Discrete features are one-hot encoded, and features describing the same attribute are grouped into one Field. Whether continuous or discrete, the features share one property: within a Field, only one feature has a non-zero value, and all other features are 0.
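To make this layout concrete, here is a small sketch (the field names, index numbering, and the encode helper are mine, purely illustrative) that turns a raw user row into the (field, feature, value) triples an FFM implementation typically consumes:

```python
# Encode a user row from the table above as (field, feature, value) triples.
FIELD_AGE, FIELD_CITY, FIELD_GENDER = 0, 1, 2

def encode(age, city, gender):
    city_idx = {"Beijing": 1, "Shanghai": 2, "Shenzhen": 3}
    gender_idx = {"male": 4, "female": 5}
    return [
        (FIELD_AGE,    0,                  float(age)),  # continuous feature
        (FIELD_CITY,   city_idx[city],     1.0),         # one-hot: single 1 per Field
        (FIELD_GENDER, gender_idx[gender], 1.0),
    ]

print(encode(23, "Beijing", "male"))
# [(0, 0, 23.0), (1, 1, 1.0), (2, 4, 1.0)]
```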

The FFM model says that the latent vector $v_i$ is related not only to the feature $x_i$ itself but also to the field of the feature it is multiplied with. That is, $v_i$ expands into a two-dimensional array of vectors $v_{i,f}$, where $f = 1, \dots, F$ and $F$ is the total number of fields. FFM keeps only the quadratic term of the FM formula:

$$\hat{y} = \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_{i,f_j},\, v_{j,f_i} \rangle\, x_i x_j$$

As an example, let's use the table data above to compute $\hat{y}$ for user1.
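For user1, only $x_1$ (age), $x_2$ (Beijing), and $x_5$ (male) are non-zero, so under the model equation above the quadratic sum collapses to three terms:

$$\hat{y}(\text{user1}) = \langle v_{1,f_2}, v_{2,f_1} \rangle\, x_1 x_2 + \langle v_{1,f_3}, v_{5,f_1} \rangle\, x_1 x_5 + \langle v_{2,f_3}, v_{5,f_2} \rangle\, x_2 x_5$$

with $x_1 = 23$ and $x_2 = x_5 = 1$. Note how the age vector that meets a city feature, $v_{1,f_2}$, differs from the age vector that meets a gender feature, $v_{1,f_3}$.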

Since $x_2$, $x_3$, and $x_4$ belong to the same field, $f_2 = f_3 = f_4$, so the same latent vector $v_{1,f_2}$ appears whenever $x_1$ pairs with a city feature, and a single variable can stand in for all of them. The same holds for the gender features $x_5$ and $x_6$.

Now take the partial derivative of $\hat{y}$ with respect to $v_{1,f_2}$:

$$\frac{\partial \hat{y}}{\partial v_{1,f_2}} = v_{2,f_1}\, x_1 x_2 + v_{3,f_1}\, x_1 x_3 + v_{4,f_1}\, x_1 x_4$$

Both sides of this equation are vectors of length $K$.

Note that $x_2$, $x_3$, and $x_4$ are the one-hot representation of the same attribute, i.e. only one of them is 1 and all the others are 0. In this sample $x_2 = 1$ and $x_3 = x_4 = 0$, so

$$\frac{\partial \hat{y}}{\partial v_{1,f_2}} = v_{2,f_1}\, x_1 x_2$$

In the general case, for any two non-zero features $x_i$ and $x_j$:

$$\frac{\partial \hat{y}}{\partial v_{i,f_j}} = v_{j,f_i}\, x_i x_j$$

since $x_j$ belongs to Field $f_j$ and every other $x_m$ in the same Field equals 0. In a real project, $x$ is a very high-dimensional sparse vector, so when differentiating we only need to attend to the non-zero terms.

You may have a question here: $v$ is a model parameter, and to find $v$ with gradient descent we need the derivative of the loss function with respect to $v$, so why do we compute the derivative of $\hat{y}$ with respect to $v$?

In projects that actually predict click-through probability, $\hat{y}$ is not used as the output directly; instead, a sigmoid layer is stacked on top of it, taking $z = \hat{y}$.

From $z = \hat{y}$ we get

$$a = \sigma(z) = \frac{1}{1 + e^{-z}}$$

where $a$ denotes the predicted click-through rate.

Let $y = 0$ denote a negative sample, $y = 1$ a positive sample, and $C$ the cross-entropy loss function

$$C = -\left[\, y \ln a + (1 - y)\ln(1 - a) \,\right]$$

Following the standard backpropagation derivation for a sigmoid output, we obtain

$$\frac{\partial C}{\partial z} = a - y, \qquad \frac{\partial C}{\partial v} = (a - y)\,\frac{\partial \hat{y}}{\partial v}$$

which answers the question above: the derivative of $\hat{y}$ with respect to $v$ is exactly the factor the chain rule needs.
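Combining $\partial C/\partial z = a - y$ with the chain rule gives a complete update step. The sketch below builds on the ffm_forward_grad sketch above; I use plain SGD with L2 regularization for brevity, while the paper itself trains with AdaGrad:

```python
import numpy as np

def sgd_step(triples, V, y, lr=0.1, lam=2e-5):
    """One plain-SGD step on a single sample, y in {0, 1}.
    Reuses ffm_forward_grad from the sketch above."""
    z, grads = ffm_forward_grad(triples, V)    # z = y_hat
    a = 1.0 / (1.0 + np.exp(-z))               # sigmoid layer
    dz = a - y                                  # dC/dz for cross-entropy
    for (i, f), g in grads.items():
        # chain rule: dC/dv_{i,f} = (a - y) * d y_hat / d v_{i,f}, plus L2
        V[i, f] -= lr * (dz * g + lam * V[i, f])
    return a
```

Note that only the latent vectors touched by the sample's non-zero features get updated, which is what makes training on sparse data cheap.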

After reading this blog, it should be easier to derive the formulas in the paper "Field-aware Factorization Machines for CTR Prediction". The paper uses $y = 1$ to represent positive samples and $y = -1$ to represent negative samples, hence the logistic loss $\log\!\left(1 + e^{-y\hat{y}}\right)$ that appears in its Section 3.1.

With this derivation, let’s look at the formula for FFM.

Suppose a sample has $n$ features belonging to $F$ fields; then the quadratic term of FFM has $nF$ latent vectors. In the FM model, each one-dimensional feature has only one latent vector, so FM can be regarded as a special case of FFM: the FFM model in which all features are assigned to a single Field. From the field-aware property of FFM, its model equation can be derived:

$$\hat{y} = \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_{i,f_j},\, v_{j,f_i} \rangle\, x_i x_j$$

Here $f_j$ is the field to which the $j$-th feature belongs. If the latent vectors have length $K$, FFM has $nFK$ quadratic parameters, far more than the $nK$ of the FM model. In addition, since the latent vectors depend on the field, the FFM quadratic term cannot be simplified the way FM's can, so the prediction time complexity is $O(Kn^2)$.
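As a purely illustrative sanity check (these numbers are hypothetical, chosen only to show the scale): with $n = 10^6$ features, $F = 30$ fields, and $K = 4$, the quadratic parameter counts compare as

$$\underbrace{nFK}_{\text{FFM}} = 1.2 \times 10^{8} \qquad \text{vs.} \qquad \underbrace{nK}_{\text{FM}} = 4 \times 10^{6}$$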

Note that since each latent vector in FFM only needs to learn the effect within one specific field, the latent dimension can usually be much smaller than FM's:

$$K_{\text{FFM}} \ll K_{\text{FM}}$$

Thanks to the introduction of fields, FFM gives every pair of crossing feature groups its own independent latent vectors, which yields better feature-combination power. Again, FM can be regarded as FFM with only one Field.

FFM paper address: www.csie.ntu.edu.tw/~cjlin/pape…

If you feel this article helped you, you are welcome to like, share, and comment. Your support is the power behind creation!