0x00 Preface

With the social relationship data in place and the principles and implementation of the PageRank algorithm covered, it’s time to do something interesting.

This article takes the roughly 5 million (500W) Jianshu follower records crawled earlier and uses PageRank to find the top-ranked users.

0x01 Preparations

1. Data preparation

The data storage format is as follows. It is also the format we often use in production, so the data was already converted to it during the crawling stage. The data describes a directed graph, with the user on the left and the follower on the right.

Note: the IDs used here are the user IDs generated by Jianshu; with such an ID it is easy to construct the URL of the user’s home page.
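As a rough illustration of reading this format, here is a minimal sketch. It assumes each line holds exactly two whitespace-separated IDs, user first and follower second; the separator, the function name `parse_edges`, and the sample IDs are assumptions made for the example, not taken from the original data.

```python
from typing import Iterable, Iterator, Tuple


def parse_edges(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (follower, user) edges from lines of the form 'user follower'."""
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        user, follower = parts
        # Following someone acts like a link to them, so the edge
        # points from the follower to the user being followed.
        yield follower, user


# Made-up IDs, only to show the shape of the data.
sample = ["u_a u_b", "u_a u_c", "", "u_b u_c"]
edges = list(parse_edges(sample))
print(edges)
```

Pointing each edge from follower to user is a design choice: it lets a follow count as a "vote" for the followed user, which is what PageRank expects of a link.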

2. Program preparation

Once again I set aside my own program: instead of writing my own demo, I used a Python package, NetworkX. Once you understand the principles, it is usually better to rely on a mature open source implementation.

NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.

NetworkX is very easy to use; it generally takes three steps:

  1. Import the NetworkX package
  2. Initialize a graph
  3. Compute PageRank
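The three steps can be sketched on a tiny hand-made graph. The node names and edges below are invented for illustration, and each edge is written as (follower, user); `alpha=0.85` is the damping factor and matches the library default.

```python
import networkx as nx  # step 1: import the package

# Step 2: initialize a directed graph; each edge is (follower, user).
G = nx.DiGraph()
G.add_edges_from([("b", "a"), ("c", "a"), ("c", "b"), ("a", "c")])

# Step 3: compute PageRank; the result is a dict mapping node -> score.
pr = nx.pagerank(G, alpha=0.85)

for node, score in sorted(pr.items()):
    print(node, round(score, 4))
```

On the real dataset the same calls apply; only the edge list passed to `add_edges_from` changes.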

0x02 Implementation and Results

Implementation

Using the NetworkX package is very simple. I originally wanted to draw the graph with Matplotlib, but the dataset is too large; the output can later be explored in Gephi or Tableau instead. The code here simply computes the PageRank values.

The final result can be sorted to print the PageRank values of the top 10 users, as shown in the figure below:
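A minimal sketch of the sorting step, assuming the scores are held in a dict that maps user ID to PageRank value (the shape `nx.pagerank` returns). The helper name `top_users` and the scores are made up for the example.

```python
import heapq
from typing import Dict, List, Tuple


def top_users(scores: Dict[str, float], n: int = 10) -> List[Tuple[str, float]]:
    """Return the n (user, score) pairs with the highest scores, best first."""
    return heapq.nlargest(n, scores.items(), key=lambda kv: kv[1])


# Made-up scores, only to show the call.
scores = {"u_a": 0.40, "u_b": 0.21, "u_c": 0.39}
for user, score in top_users(scores, n=2):
    print(f"{user}\t{score:.4f}")
```

`heapq.nlargest` avoids sorting the whole dict, which helps a little when only the top 10 of several million users are needed; `sorted(..., reverse=True)[:10]` would give the same result.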

Results

There is not much to say about the results. Run the data, take the top users, and you will find that most of the highest-ranked users have a very large number of followers, and correspondingly a large number of posts and reads.

Because the whole crawl started from my own account as the entry point, the top of the ranking is dominated by three types of users:

  1. Data-related practitioners (most of the people I follow and most of my followers are similar users)
  2. "Chicken soup" (inspirational) writers (a staple theme on Jianshu)
  3. Jianshu users who draw a lot (I used to draw stick figures, so there are many such connections)

0xFF Summary

I did this mainly for fun and to deepen my understanding of the PageRank algorithm. After running the program I feel I gained a lot, and it opened up a number of ideas.

The first was a very interesting observation. When I initially ran the program on a smaller dataset (200,000 records), one user ranked near the top, around seventh, even though he had only one follower. Looking at that follower, I found that the follower ranked second, which is what pulled him up to seventh place. This confirms what the previous article said: if a page with a high PageRank value links to another page, the PageRank value of the linked page increases accordingly.
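This effect can be reproduced on a toy graph. The sketch below is a simplified power-iteration PageRank written for illustration, not NetworkX's implementation; the graph and node names are invented. `lucky` has a single follower (`star`, who has many followers), while `other` has two followers who both rank low.

```python
from collections import defaultdict


def pagerank(edges, d=0.85, iterations=100):
    """Simplified power-iteration PageRank over (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out = defaultdict(list)
    for src, dst in edges:
        out[src].append(dst)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-edges is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not out[v])
        new = {v: (1.0 - d) / n + d * dangling / n for v in nodes}
        for src in nodes:
            for dst in out[src]:
                new[dst] += d * rank[src] / len(out[src])
        rank = new
    return rank


# Each edge is (follower, user). Five accounts follow "star"; two of them
# also follow "other". "star" follows "lucky" and "f1".
edges = [("f1", "star"), ("f2", "star"), ("f3", "star"), ("f4", "star"),
         ("f5", "star"), ("f1", "other"), ("f2", "other"),
         ("star", "lucky"), ("star", "f1")]
pr = pagerank(edges)
for user in sorted(pr, key=pr.get, reverse=True):
    print(user, round(pr[user], 4))
```

Although `other` has twice as many followers as `lucky`, `lucky` ends up ranked higher, because its one follower carries far more rank than the two low-ranked followers of `other` combined.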

Then I ran the program on the larger dataset. After filtering the chicken soup writers out of the result set, I found many bloggers I had not followed before whose blogs are of very high quality. So could a blogging site such as Jianshu or CSDN use the PageRank value as one of the weights in its recommendation system? I don’t know how their recommendations actually work, and I am not aware of any company that uses PageRank as a recommendation weight, so there are presumably limitations.

The principles of the PageRank algorithm and one basic use case are now roughly covered. Follow-up posts will look at community detection, then at MapReduce implementations of these algorithms, and at how to optimize those MapReduce programs in practice.



  • The author: Mudong Koshi
  • Links to this article: www.mdjs.info/2017/09/09/…
  • Copyright Notice: All articles on this blog are licensed under a CC BY-NC-SA 3.0 license unless otherwise stated. Reprint please indicate the source!