Label propagation combined with simple models can surpass graph neural networks

Original post: cf020031308.github.io/… (2020)

  1. Diffusion-based algorithms are also effective for node classification tasks, but they tend to be undervalued
    • A rule of thumb: on the OGBN-ArXiv dataset, simply assigning each node the most common class among its neighbors already achieves about 61% accuracy (a minimal sketch of such a baseline follows this list)
  2. The prediction residuals of adjacent nodes are positively correlated

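To make the rule of thumb above concrete, here is a minimal sketch (my own illustration, not from the original post) of a neighbor-majority baseline over a generic edge list; the toy graph, labels, and function name are placeholders rather than OGBN-ArXiv data.

```python
import numpy as np
from collections import Counter, defaultdict

def neighbor_majority(edges, labels, train_mask):
    """Predict each node's class as the most common class among its
    labeled (training) neighbors; fall back to the global majority
    class when a node has no labeled neighbor."""
    n = len(labels)
    neighbors = defaultdict(list)
    for u, v in edges:                    # undirected toy graph
        neighbors[u].append(v)
        neighbors[v].append(u)
    global_majority = Counter(labels[train_mask]).most_common(1)[0][0]
    pred = np.full(n, global_majority)
    for node in range(n):
        seen = [labels[nb] for nb in neighbors[node] if train_mask[nb]]
        if seen:
            pred[node] = Counter(seen).most_common(1)[0][0]
    return pred

# Toy usage (hypothetical data, not OGBN-ArXiv).
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]
labels = np.array([0, 0, 1, 1])
train_mask = np.array([True, True, True, False])
print(neighbor_majority(edges, labels, train_mask))
```
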
Correct and Smooth (C&S)

  1. Base Predict: classify nodes with a simple non-GNN model, such as an MLP on node features or node embeddings
  2. Correct: model the residual between the ground-truth labels and the predictions of the previous step with another simple (non-GNN) method
  3. Smooth: smooth the results of the previous step with the Label Propagation (2015) algorithm

Label Propagation

Denote the normalized adjacency matrix $S = D^{-1/2} A D^{-1/2}$ and let $Y$ be the matrix of node results (one row per node). The LP algorithm propagates the results on the training set to the whole graph by iterating $Y = (1 - \alpha) Y + \alpha S Y$ until convergence.

The converged solution $\hat Y$ minimizes $\mathrm{tr}(\hat Y^T (I - S) \hat Y) + (\frac{1}{\alpha} - 1) \|\hat Y - Y\|^2$. Minimizing the first term makes the estimate smooth over the graph, while the second term keeps the estimate as close as possible to the actual labels.
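
For concreteness, here is a minimal numpy sketch of this iteration (my own illustration; the toy graph, α, and the iteration count are made up, and the anchor term uses the original label matrix, matching the $\|\hat Y - Y\|^2$ term of the objective above):

```python
import numpy as np

def label_propagation(A, Y0, alpha=0.8, num_iters=50):
    """Iterate Y <- (1 - alpha) * Y0 + alpha * S @ Y with
    S = D^{-1/2} A D^{-1/2}; Y0 holds the known results
    (one-hot rows on training nodes, zeros elsewhere)."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    S = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    Y = Y0.copy()
    for _ in range(num_iters):
        Y = (1 - alpha) * Y0 + alpha * S @ Y
    return Y

# Toy usage: a 4-node path graph with the two end nodes labeled.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Y0 = np.zeros((4, 2))
Y0[0, 0] = 1.0   # node 0: class 0
Y0[3, 1] = 1.0   # node 3: class 1
print(label_propagation(A, Y0).argmax(axis=1))   # expected: [0 0 1 1]
```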

Correct’s implementation details

In the C&S pipeline, only the Smooth step is explicitly tied to LP; the algorithms used in the other two steps can be replaced.

In this paper, the LP algorithm is also used in the Correct step to propagate the residual error on the training set to the whole graph: $E = (1 - \alpha) E + \alpha S E$.

But during this iteration the residual keeps shrinking: $\|(1 - \alpha) E + \alpha S E\|_2 \le (1 - \alpha) \|E\|_2 + \alpha \|S\|_2 \|E\|_2 = \|E\|_2$ (since $\|S\|_2 = 1$ for the symmetrically normalized adjacency), so the correction needs to be rescaled in one of two ways:

  1. Autoscale: multiply the residual estimates outside the training set by a coefficient so that their mean absolute value matches that of the residuals on the training set
  2. Scaled Fixed Diffusion (FDiff-scale): use another diffusion, $E = D^{-1} A E$, while keeping the residuals on the training set fixed (here $D^{-1} A$ can be viewed as a transition matrix); both scalings are sketched below

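A rough numpy sketch of both corrections (my own reading of the description above; the hyperparameters, the single rescaling coefficient in Autoscale, and the final scale factor s in FDiff-scale are illustrative assumptions and may differ from the paper's implementation):

```python
import numpy as np

def normalize_sym(A):
    """S = D^{-1/2} A D^{-1/2}."""
    deg = A.sum(axis=1)
    d = np.zeros_like(deg)
    d[deg > 0] = deg[deg > 0] ** -0.5
    return d[:, None] * A * d[None, :]

def correct_autoscale(A, probs, labels_1hot, train_mask, alpha=0.8, iters=50):
    """Propagate training residuals with E <- (1 - alpha) * E + alpha * S @ E,
    then rescale residuals outside the training set so their mean absolute
    value matches the mean absolute residual on the training set."""
    S = normalize_sym(A)
    E = np.zeros_like(probs)
    E[train_mask] = labels_1hot[train_mask] - probs[train_mask]
    target = np.abs(E[train_mask]).sum(axis=1).mean()
    for _ in range(iters):
        E = (1 - alpha) * E + alpha * S @ E
    current = np.abs(E[~train_mask]).sum(axis=1).mean()
    if current > 0:
        E[~train_mask] *= target / current
    return probs + E

def correct_fdiff_scale(A, probs, labels_1hot, train_mask, s=1.0, iters=50):
    """Scaled Fixed Diffusion: iterate E <- D^{-1} A E while resetting the
    residuals on the training set to their true values after every step,
    then add the propagated residual with a scale factor s."""
    deg = A.sum(axis=1, keepdims=True)
    P = A / np.maximum(deg, 1.0)            # D^{-1} A, a transition matrix
    E_train = labels_1hot[train_mask] - probs[train_mask]
    E = np.zeros_like(probs)
    E[train_mask] = E_train
    for _ in range(iters):
        E = P @ E
        E[train_mask] = E_train             # training residuals stay fixed
    return probs + s * E
```

Either corrected prediction matrix would then be fed to the Smooth step (LP, as sketched in the previous section).
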
Experimental results

The model has very few parameters (it performs very well even without learning any parameters) and trains quickly. Its classification accuracy surpasses SotA on many datasets, and it ranks first on several tasks of OGB's node-classification leaderboard.