Preface

Now that you’ve clicked on it, let me be upfront: the title is a bit of a hook, but it really is what we’re going to do. Today’s goal is to write a crawler that grabs all the weibo data posted by a target user. Without further ado, let’s begin happily.

Development tools

Python version: 3.6.4
Related modules:

The argparse module;

DecryptLogin module;

lxml module;

tqdm module;

Prettytable module;

Pyecharts module;

Jieba module;

Wordcloud module;

And some modules that come with Python.

Environment setup

Install Python, add it to your environment variables, and then install the required modules with pip.

Note that the DecryptLogin module is updated frequently, so to make sure it works for all the related examples on this public account, please keep the DecryptLogin module up to date. The upgrade command is as follows:

pip install DecryptLogin --upgrade

How it works

Here’s a brief walkthrough of the whole crawling process. The first step, of course, is a simulated login to Sina Weibo. As before, we use the open-source simulated-login package DecryptLogin to log in to Weibo. The code looks like this:

# Simulate login to Weibo with DecryptLogin
# (requires `from DecryptLogin import login` at module level)
@staticmethod
def login(username, password):
  lg = login.Login()
  _, session = lg.weibo(username, password, 'mobile')
  return session

Next, the program asks the user to enter the ID of the target user they want to crawl. So how do you find a weibo user ID? Take Liu Yifei’s weibo as an example: open her homepage and you can see the link:

So Liu Yifei’s weibo user ID is 3261134763.
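
Since argparse is among the modules listed above, one way the program might accept that ID is via a command-line argument. Here is a minimal sketch, assuming a hypothetical --user_id flag (the flag name and its default are mine for illustration, not from the original source):

# Minimal sketch: read the target weibo user ID from the command line.
# The --user_id flag name and its default value are assumptions for illustration.
import argparse

parser = argparse.ArgumentParser(description='Crawl all weibos posted by a target user.')
parser.add_argument('--user_id', type=str, default='3261134763',
                    help="target weibo user ID, e.g. Liu Yifei's is 3261134763")
args = parser.parse_args()
user_id = args.user_id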

Using the weibo user ID entered by the user, we access the following links with the logged-in session:

# Link 1: the user's weibo list page
url = f'https://weibo.cn/{user_id}'
res = self.session.get(url, headers=self.headers)
# Link 2: the user's info page
url = f'https://weibo.cn/{user_id}/info'
res = self.session.get(url, headers=self.headers)

In the browser, these pages look something like this:

Obviously, here we can use xpath to extract some basic information about the target user:

from lxml import etree  # lxml provides the XPath-capable HTML parser

# Link 1: basic stats and page count
selector = etree.HTML(res.content)
base_infos = selector.xpath("//div[@class='tip2']/*/text()")
num_wbs, num_followings, num_followers = int(base_infos[0][3:-1]), int(base_infos[1][3:-1]), int(base_infos[2][3:-1])
num_wb_pages = selector.xpath("//input[@name='mp']")
num_wb_pages = int(num_wb_pages[0].attrib['value']) if len(num_wb_pages) > 0 else 1
# Link 2: nickname from the page title
selector = etree.HTML(res.content)
nickname = selector.xpath('//title/text()')[0][:-3]

I won’t go into detail about what XPath is here; you can write these expressions easily just by looking at the page’s source code:

After extraction, the information is printed out so the program user can confirm that the user fetched with the ID they entered is really the user they want to crawl. Only after the user confirms the information is correct do we start crawling that user’s weibo data:

# Let the user confirm whether to download all of this user's weibos
tb = prettytable.PrettyTable()
tb.field_names = ['Username', 'Followings', 'Followers', 'Weibos', 'Weibo pages']
tb.add_row([nickname, num_followings, num_followers, num_wbs, num_wb_pages])
print('The user information obtained is as follows:')
print(tb)
is_download = input("Crawl all of this user's weibos? (y/n, default: y) --> ")
if is_download == 'y' or is_download == 'yes' or not is_download:
  userinfos = {'user_id': user_id, 'num_wbs': num_wbs, 'num_wb_pages': num_wb_pages}
  self.__downloadWeibos(userinfos)

XPath is also used to extract the weibo data itself. To view a user’s weibos, you only need to visit the following link:

# page is the page number of the user's weibo list
url = f'https://weibo.cn/{user_id}?page={page}'

Speaking of techniques, there are only two things worth mentioning:

  • Save the data every 20 pages of weibos, so that an unexpected interruption of the crawler doesn’t throw away everything crawled so far;

  • Pause for x seconds after every n pages crawled, where n is regenerated randomly each time and x is also regenerated randomly each time.

That’s the general idea; a rough sketch of the crawl loop follows below, and the remaining details are in the source code.
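
As a rough sketch of the loop described above (under my own assumptions: parse_weibo_page and save_weibos are hypothetical placeholders, and the concrete page ranges and pause lengths are illustrative, not taken from the source):

import random
import time

def crawl_all_pages(session, headers, user_id, num_wb_pages):
    """Sketch of the crawl loop: fetch every page, save periodically, pause randomly."""
    weibos = []
    pages_since_pause, n = 0, random.randint(1, 5)
    for page in range(1, num_wb_pages + 1):
        url = f'https://weibo.cn/{user_id}?page={page}'
        res = session.get(url, headers=headers)
        weibos.extend(parse_weibo_page(res.content))  # hypothetical parser
        # save every 20 pages so an unexpected interruption does not lose earlier data
        if page % 20 == 0:
            save_weibos(weibos, user_id)  # hypothetical saver
        # pause for a random x seconds after a random n pages; both keep changing
        pages_since_pause += 1
        if pages_since_pause >= n:
            time.sleep(random.uniform(1, 5))
            pages_since_pause, n = 0, random.randint(1, 5)
    save_weibos(weibos, user_id)  # final save
    return weibos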

Data visualization

As usual, let’s first crawl a wave of data. For convenience, we’ll look at the visualizations of Liu Yifei’s weibo data.

Take a look at the word cloud made from all of her weibos (original weibos only):
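
For reference, a word cloud like this can be built from the crawled texts with the jieba and wordcloud modules listed above. A minimal sketch, assuming weibo_texts is a list of the original weibo strings and the font path is a placeholder you would swap for any CJK-capable font on your machine:

import jieba
from wordcloud import WordCloud

def make_wordcloud(weibo_texts, savepath='wordcloud.png', font_path='simhei.ttf'):
    """Sketch: segment the weibo texts with jieba and render them as a word cloud."""
    # cut all weibos into words and join them with spaces for WordCloud
    words = jieba.cut(' '.join(weibo_texts))
    text = ' '.join(words)
    # font_path must point to a font that supports Chinese characters
    wc = WordCloud(font_path=font_path, width=800, height=600, background_color='white')
    wc.generate(text)
    wc.to_file(savepath)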

Next, the numbers of original and retweeted weibos:

And the number of weibos posted each year:
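
A chart like that can be drawn with the pyecharts module listed above. A minimal sketch, assuming counts_by_year is a hypothetical dict mapping year to weibo count (this uses the chained API of pyecharts v1+):

from pyecharts import options as opts
from pyecharts.charts import Bar

def plot_yearly_counts(counts_by_year, savepath='weibos_per_year.html'):
    """Sketch: render a bar chart of the number of weibos posted each year."""
    years = sorted(counts_by_year)
    bar = (
        Bar()
        .add_xaxis([str(year) for year in years])
        .add_yaxis('Number of weibos', [counts_by_year[year] for year in years])
        .set_global_opts(title_opts=opts.TitleOpts(title='Weibos per year'))
    )
    bar.render(savepath)  # writes an interactive HTML chart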

Sure enough, she posts far fewer weibos these days. Check out her first weibo, zoaIU7o2d 😀:

“Hello everyone, I’m Liu Yifei”


How many likes do her original weibos get each year?

How many retweets?

And how many comments?

If you enjoyed this article, please give it a like and follow me; I share a Python simulated-login example every day. The next article will cover automatic check-in for NetEase Cloud Music.

All done~ For the complete source code, see my profile or message me privately to get the relevant files.