In the last section, we covered the basic crawling workflow and the skills it requires. For anyone just getting started, laying a solid foundation and mastering those basic steps is the most important thing.
Today we'll look at a slightly more complex example to reinforce and consolidate that knowledge.
Do you use Weibo?
If you do, you know that a single post from a big V can overwhelm the network and strain the servers behind it. The Weibo hot search list has become a traffic mecca, and plenty of people go to almost crazy lengths to get onto it.
If you use Weibo, you also know that the comment sections under posts are a gold mine, hiding countless witty and classic replies. They are definitely not to be missed.
Whether it is an official announcement or a breakup, a public feud or a cheating scandal, it all blows up on Weibo. So how do we quickly fetch Weibo posts and their comments, and how do we turn that into an automated, practical crawler? Follow along and see how it is done.
A first attempt
Let's start by grabbing the comments under a single Weibo post. What do we need to do?
Analyzing the Weibo page
Let's start with the following post:
weibo.com/1312412824/…
We'll take Lin Zhiling's post announcing her marriage as the example.
Open Chrome Developer Tools (F12), switch to the Network tab, refresh the page, and you will see a request like this:
We copy the URL and put it into Postman (if you don't know Postman, download it; it is a great API testing tool), but the response is not normal:
weibo.com/aj/v6/comme…
At this point, the natural thing to try is adding cookies.
In the Request Headers section of the Network panel there is a Cookie field. Copy it, add it in Postman, and try again.
This time it works, and the request finally returns the data we want.
Note how Postman adds cookies.
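If you prefer to verify this step in Python rather than Postman, here is a minimal sketch. It assumes the comment endpoint and post id that appear in full later in this article; the cookie string is a placeholder you copy from your own browser:

import requests

# placeholder: paste the Cookie value copied from Chrome's Request Headers
headers = {'Cookie': '<your weibo cookie>'}

url = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&id=4380261561116383&page=1'
res = requests.get(url, headers=headers)

print(res.status_code)   # 200 means the request went through
print(res.text[:200])    # peek at the start of the response body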
Streamlining the URL
Let's take a closer look at this URL. It carries a lot of parameters, some of which can be trimmed away.
The process is to delete the parameters one by one and resend the request in Postman, checking for which parameter combinations the response is still normal.
I ended up with the following minimal URL:
weibo.com/aj/v6/comme…
Here I really want you to get your hands dirty: remove the parameters one by one, send the request, look at the response, and work out for yourself where the condensed URL above came from.
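If you would rather script this trial-and-error than click through Postman, the sketch below automates it. FULL_PARAMS is a placeholder for whatever query parameters your browser actually sent (they are truncated in the text above), and "response is normal" is checked by looking for the data.html field used later in this article:

import requests

BASE = 'https://weibo.com/aj/v6/comment/big'
HEADERS = {'Cookie': '<your weibo cookie>'}
# placeholder: fill in every parameter copied from the browser request
FULL_PARAMS = {'ajwvr': '6', 'id': '4380261561116383', 'page': '1'}

def try_without(param_name):
    # resend the request with one parameter removed and report whether it still works
    params = {k: v for k, v in FULL_PARAMS.items() if k != param_name}
    res = requests.get(BASE, params=params, headers=HEADERS)
    try:
        ok = 'html' in (res.json().get('data') or {})
    except ValueError:   # the body was not JSON at all
        ok = False
    print(param_name, '->', 'can be dropped' if ok else 'required')

for name in FULL_PARAMS:
    try_without(name)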
Paginating the URL
After streamlining the URL, we still need to deal with comment pagination.
Looking at the response again, you will notice a page field near the end of the response body.
So let's add a page parameter to the URL, set it to 2, and see whether that field changes.
The parameter takes effect, and the pagination problem is solved.
Getting the data
Now that we have a usable URL pattern, it's time to send the requests and extract the data:
import requests
import json
from bs4 import BeautifulSoup
import pandas as pd
import time

Headers = {'Cookie': 'SINAGLOBAL=4979979695709.662.1540896279940; SUB=_2AkMrYbTuf8PxqwJRmPkVyG_nb45wwwHEieKdPUU1JRMxHRl-yT83qnI9tRB6AOGaAcavhZVIZBiCoxtgPDNVspj9jtju; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9W5d4hHnVEbZCn4G2L775Qe1; _s_tentry=-; Apache=1711120851984.973.1564019682028; ULV=1564019682040:7:2:1:1711120851984.973.1564019682028:1563525180101; login_sid_t=8e1b73050dedb94d4996a67f8d74e464; cross_origin_proto=SSL; Ugrow-G0=140ad66ad7317901fc818d7fd7743564; YF-V5-G0=95d69db6bf5dfdb71f82a9b7f3eb261a; WBStorage=edfd723f2928ec64|undefined; UOR=bbs.51testing.com,widget.weibo.com,www.baidu.com; wb_view_log=1366*7681; WBtopGlobal_register_version=307744aa77dd5677; YF-Page-G0=580fe01acc9791e17cca20c5fa377d00|1564363890|1564363890'}

def sister(page):
    sister = []
    for i in range(0, page):
        print("page: ", i)
        url = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&id=4380261561116383&page=%s' % int(i)
        req = requests.get(url, headers=Headers).text
        html = json.loads(req)['data']['html']
        content = BeautifulSoup(html, "html.parser")
        comment_text = content.find_all('div', attrs={'class': 'WB_text'})
        for c in comment_text:
            # each comment node looks like "username：comment", keep only the comment part
            sister_text = c.text.split("：")[1]
            sister.append(sister_text)
        time.sleep(5)
    return sister

if __name__ == '__main__':
    print("start")
    sister_comment = sister(1001)
    sister_pd = pd.DataFrame(columns=['sister_comment'], data=sister_comment)
    sister_pd.to_csv('sister.csv', encoding='utf-8')
Note that the Cookie in Headers expires, so when you run this code you need to replace it with a freshly copied Cookie of your own.
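If you do not want to edit the script every time the cookie expires, one option is to read it from an environment variable instead of hard-coding it. A tiny sketch, where the WEIBO_COOKIE variable name is my own choice:

import os

# assumes you ran something like: export WEIBO_COOKIE='SINAGLOBAL=...; SUB=...'
Headers = {'Cookie': os.environ.get('WEIBO_COOKIE', '')}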
Code parsing:
- Since the response is JSON with the comment HTML nested in the data field, parse the JSON first. You can use json.loads to convert the string to JSON, or call requests.get(url, headers=Headers).json() directly to get the JSON body of the response (a short sketch of both approaches follows this list).
- Pandas handles the data output: create a DataFrame object and save it to a CSV file with the to_csv function.
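A minimal sketch of the two parsing approaches mentioned above, assuming the streamlined URL and a fresh Cookie:

import json
import requests

url = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&id=4380261561116383&page=1'
headers = {'Cookie': '<your weibo cookie>'}

# approach 1: take the raw text and parse the JSON yourself
raw = requests.get(url, headers=headers).text
html_1 = json.loads(raw)['data']['html']

# approach 2: let requests parse the JSON for you
html_2 = requests.get(url, headers=headers).json()['data']['html']

print(html_1[:100])   # an HTML fragment containing the comments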
At this point, a simple Weibo comment crawler is done. Simple enough, isn't it?
Automatic crawler
Let's see whether we can optimize the crawling process.
The current implementation requires us to manually find a post and use it as the starting point for the crawler. What we want instead is to enter only a big V's screen name plus a short text fragment from one of their posts, and have the post and its comments crawled automatically.
OK, let's work towards that goal.
Weibo search
Since we are dealing with big Vs, search is inevitably involved. We can first try Weibo's own search; the address is as follows:
s.weibo.com/user?q=Lin
Again, let's put it into Postman and see whether it can be accessed directly.
It returns data normally, which saves us a lot of trouble. The next step is to analyze and parse the response to extract the data that is useful to us.
Looking at the data returned by this interface, there is a uid field, which is the unique ID of each Weibo user. We grab it and save it for later use.
As for how to locate the uid, I have marked it in the figure; a little analysis should make it clear.
def get_uid(name):
    try:
        url = 'https://s.weibo.com/user?q=%s' % name
        res = requests.get(url).text
        content = BeautifulSoup(res, 'html.parser')
        user = content.find('div', attrs={'class': 'card card-user-b s-pg16 s-brt1'})
        user_info = user.find('div', attrs={'class': 'info'}).find('div')
        href_list = user_info.find_all('a')
        if len(href_list) == 3:
            # the third <a> tag carries the uid attribute
            uid = href_list[2].get('uid')
            return uid
        else:
            print("There are something wrong")
            return False
    except:
        raise
The code uses only the techniques we have already covered, so I believe you can follow it.
Using the M site
The M site refers to the mobile version of a website, i.e. pages built for mobile devices. Most sites prepend "m." to their main domain as the address of the M site; for example, m.baidu.com is Baidu's M site.
Let's open Weibo's M site, go to Lin Zhiling's profile page, and look at the requests in the Network panel. Any surprises?
We first find a URL like this:
m.weibo.cn/api/contain…
Then, scrolling the page further, we find a similar URL in the Network panel:
m.weibo.cn/api/contain…
Along with this URL, the page displays newly loaded posts, so this is clearly the API that requests Weibo posts. As before, put the second URL into Postman and see which parameters can be omitted.
It turns out that as long as you pass in the correct containerid, together with the page parameter that controls paging, the API returns the corresponding posts. But where does containerid come from? We just obtained a uid, so let's see whether that uid can be used to get the containerid.
Some experience is needed here. I kept calling the interface "m.weibo.cn/api/container/getIndex" with different parameters, such as the common type, id, value and name, to see what it returned. Finally, after persistent effort, I found that the combination of type and value works and returns the corresponding containerid.
There is really no shortcut here; it comes down to trial and error and experience, which is also the painful part. After all, we are probing another system's API in a completely black-box fashion.
Now we can write code to get the containerid (and if you look carefully, you will see that the interface returns plenty of other interesting information you can try to grab yourself):
def get_userinfo(uid):
    try:
        url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=%s' % uid
        res = requests.get(url).json()
        containerid = res['data']['tabsInfo']['tabs'][1]['containerid']
        mblog_counts = res['data']['userInfo']['statuses_count']
        followers_count = res['data']['userInfo']['followers_count']
        userinfo = {
            "containerid": containerid,
            "mblog_counts": mblog_counts,
            "followers_count": followers_count
        }
        return userinfo
    except:
        raise
The code is all basic operations, so I won't over-explain it. Next we save the post information.
The post information lives under res['data']['cards'], including the number of comments, reposts, likes and so on. So we write a function to parse this JSON data:
def get_blog_info(cards, i, name, page):
    blog_dict = {}
    if cards[i]['card_type'] == 9:
        scheme = cards[i]['scheme']                 # URL of the weibo post
        mblog = cards[i]['mblog']
        mblog_text = mblog['text']
        create_time = mblog['created_at']
        mblog_id = mblog['id']
        reposts_count = mblog['reposts_count']      # number of reposts
        comments_count = mblog['comments_count']    # number of comments
        attitudes_count = mblog['attitudes_count']  # number of likes
        with open(name, 'a', encoding='utf-8') as f:
            f.write("----page " + str(page) + ", weibo no. " + str(i + 1) + "----" + "\n")
            f.write("weibo url: " + str(scheme) + "\n" +
                    "created at: " + str(create_time) + "\n" +
                    "text: " + mblog_text + "\n" +
                    "likes: " + str(attitudes_count) + "\n" +
                    "comments: " + str(comments_count) + "\n" +
                    "reposts: " + str(reposts_count) + "\n")
        blog_dict['mblog_id'] = mblog_id
        blog_dict['mblog_text'] = mblog_text
        blog_dict['create_time'] = create_time
        return blog_dict
    else:
        print("This card is not a weibo post")
        return False
Function parameters:
- The first argument accepts the return value of res['data']['cards'], which is a list of card dictionaries
- The second argument is the loop counter for the outer calling function
- The third parameter is the name of the big V to crawl
- The fourth parameter is the page number being crawled
Finally, the function returns a dictionary with the post's id, text and creation time.
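A quick usage sketch of this function, assuming res is the JSON already returned by the getIndex API above; the file name passed as name is just an example:

cards = res['data']['cards']
for i in range(len(cards)):
    blog_dict = get_blog_info(cards, i, 'linzhiling.txt', page=1)
    if blog_dict:
        print(blog_dict['mblog_id'], blog_dict['create_time'])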
Searching for a specific post
We also need to be able to locate a particular post from a text fragment, so that we can then grab the comments under it.
We define another function that calls get_blog_info above, takes the post information out of the dictionary it returns, and compares it with the text we entered. If the post contains that text fragment, we have found the post we are looking for.
def get_blog_by_text(containerid, blog_text, name):
    blog_list = []
    page = 1
    while True:
        try:
            url = 'https://m.weibo.cn/api/container/getIndex?containerid=%s&page=%s' % (containerid, page)
            res_code = requests.get(url).status_code
            if res_code == 418:
                print("Requests are too frequent, try again later")
                return False
            res = requests.get(url).json()
            cards = res['data']['cards']
            if len(cards) > 0:
                for i in range(len(cards)):
                    print("-----crawling page " + str(page) + ", weibo no. " + str(i + 1) + "------")
                    blog_dict = get_blog_info(cards, i, name, page)
                    if blog_dict is False:   # this card is not a weibo post, stop this page
                        break
                    blog_list.append(blog_dict)
                    mblog_text = blog_dict['mblog_text']
                    create_time = blog_dict['create_time']
                    if blog_text in mblog_text:
                        print("Found the target weibo post")
                        return blog_dict['mblog_id']
                    elif checkTime(create_time, config.day) is False:
                        print("No matching weibo post found")
                        return blog_list
                page += 1
                time.sleep(config.sleep_time)
            else:
                print("There are no weibo posts at all")
                break
        except:
            pass
The code may look long, but it only uses what we have already learned.
The only things to note are the checkTime function and the config file.
The checkTime function is defined as follows
def checkTime(inputtime, day):
    try:
        intime = datetime.datetime.strptime("2019-" + inputtime, '%Y-%m-%d')
    except:
        # could not parse the time (e.g. a relative timestamp); assume the post is recent
        return True
    now = datetime.datetime.now()
    n_days = now - intime
    days = n_days.days
    if days < day:
        return True
    else:
        return False
This function limits how far back in time we search: posts from, say, more than 90 days ago are no longer of interest, which also improves efficiency. The config file defines a day item to control this time range:
day = 90          # only crawl posts within this many days; 60 means posts from two months ago until now
sleep_time = 5    # delay between requests, 5-10s is recommended
Getting the comments
Fetching the comments works the same way as in our first attempt, so I won't repeat the explanation:
def get_comment(self, mblog_id, page):
    comment = []
    for i in range(0, page):
        print("-----crawling comment page " + str(i) + "------")
        url = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&id=%s&page=%s' % (mblog_id, i)
        req = requests.get(url, headers=self.headers).text
        html = json.loads(req)['data']['html']
        content = BeautifulSoup(html, "html.parser")
        comment_text = content.find_all('div', attrs={'class': 'WB_text'})
        for c in comment_text:
            _text = c.text.split("：")[1]
            comment.append(_text)
        time.sleep(config.sleep_time)
    return comment

def download_comment(self, comment):
    comment_pd = pd.DataFrame(columns=['comment'], data=comment)
    timestamp = str(int(time.time()))
    comment_pd.to_csv(timestamp + 'comment.csv', encoding='utf-8')
Defining the running function
Finally, we define the run function, which collects everything the user needs to enter and passes it on to the logic functions below:
from weibo_spider import WeiBo
from config import headers

def main(name, spider_type, text, page, iscomment, comment_page):
    print("Start...")
    weibo = WeiBo(name, headers)
    print("Getting the UID...")
    uid = weibo.get_uid()
    print("Getting the user info...")
    userinfo = weibo.get_userinfo(uid)
    if spider_type == "Text" or spider_type == "text":
        print("Searching for the weibo post by text...")
        blog_info = weibo.get_blog_by_text(userinfo['containerid'], text, name)
        if isinstance(blog_info, str):
            if iscomment == "Yes" or iscomment == "YES" or iscomment == "yes":
                print("Crawling the comments...")
                comment_info = weibo.get_comment(blog_info, comment_page)
                weibo.download_comment(comment_info)
                print("Comments crawled successfully")
                return True
            return True
        else:
            print("The target weibo post was not found")
            return False
    elif spider_type == "Page" or spider_type == "page":
        blog_info = weibo.get_blog_by_page(userinfo['containerid'], page, name)
        if blog_info and len(blog_info) > 0:
            return True
    else:
        print("Please enter a correct option")
        return False

if __name__ == '__main__':
    target_name = input("type the name: ")
    spider_type = input("type spider type(Text or Page): ")
    text = "hello"
    page_count = 10
    iscomment = "No"
    comment_page_count = 100
    while spider_type not in ("Text", "text", "Page", "page"):
        spider_type = input("type spider type(Text or Page): ")
    if spider_type == "Page" or spider_type == "page":
        page_count = input("type page count(Max is 50): ")
        while int(page_count) > 50:
            page_count = input("type page count(Max is 50): ")
    elif spider_type == "Text" or spider_type == "text":
        text = input("type blog text for search: ")
        iscomment = input("type need crawl comment or not(Yes or No): ")
        while iscomment not in ("Yes", "YES", "yes", "No", "NO", "no"):
            iscomment = input("type need crawl comment or not(Yes or No): ")
        if iscomment == "Yes" or iscomment == "YES" or iscomment == "yes":
            comment_page_count = input("type comment page count(Max is 1000): ")
            while int(comment_page_count) > 1000:
                comment_page_count = input("type comment page count(Max is 1000): ")
    result = main(target_name, spider_type, text, int(page_count), iscomment, int(comment_page_count))
    if result:
        print("Crawl finished successfully!!")
    else:
        print("Crawl failed!!")
Although the code is long, most of it is plain logic and not difficult. The only part worth pointing out is the input() function, which blocks the program until the user presses the Enter key.
You should notice this line of code:
weibo = WeiBo(name, headers)
Here, the functions written earlier are encapsulated in the WeiBo class; this is object-oriented thinking at work.
Crawlers and toolsets
Let's take a look at the definition of the WeiBo crawler class:
class WeiBo(object):

    def __init__(self, name, headers):
        self.name = name
        self.headers = headers

    def get_uid(self):
        # get the user's UID
        ...

    def get_userinfo(self, uid):
        # get the user info, including containerid
        ...

    def get_blog_by_page(self, containerid, page, name):
        # crawl weibo posts page by page
        ...

    def get_blog_by_text(self, containerid, blog_text, name):
        # locate a weibo post by a text fragment
        ...

    def get_comment(self, mblog_id, page):
        # crawl the comments under a post
        ...

    def download_comment(self, comment):
        # save the comments to a CSV file
        ...
In the class's initialization function, we pass in the name of the big V to crawl and the headers (cookie) we prepared, and the functions above become methods of the class, so the instance weibo can call them. The toolset, the common logic we abstracted out, has also been covered already:
import datetime
from config import day

def checkTime(inputtime, day):
    ...

def get_blog_info(cards, i, name, page):
    ...
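As a quick sanity check, here is a minimal usage sketch of the class. headers is the cookie dict imported from config, and the name is only an example:

from weibo_spider import WeiBo
from config import headers

weibo = WeiBo("林志玲", headers)
uid = weibo.get_uid()
userinfo = weibo.get_userinfo(uid)
print(userinfo['mblog_counts'], userinfo['followers_count'])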
Well, at this point, you can happily run the crawler, then have a cup of tea and wait for the program to finish.
Let's look at the final result.
Quite satisfying, isn't it?
The full code is here:
A first attempt
github.com/zhouwei713/…
Automatic Weibo crawler
github.com/zhouwei713/…
Conclusion
Today I used the Weibo crawler as an example and walked through how to analyze web pages, how to deal with anti-crawling measures, how to make use of the M site, and other techniques. After today's lesson you should be able to handle some simple crawling work on your own. Ready to give it a try?