The following article is from Cloud cloud, author: Wu Yu Beichen
This is a small exercise after learning the basics of Python. This time, I will crawl the Honor of Kings anchor ranking from Panda TV's web page and demonstrate the principle of a crawler without the help of any third-party framework.
1. The idea of implementing a Python crawler
Step 1: Define your purpose
1. Find the page whose data you want to crawl.
Step 2: Simulate the HTTP request, extract the data, and process it
1. Simulate the HTTP network request: send a request to the server and obtain the HTML it returns to us. 2. Use regular expressions to extract the data we need from the HTML (in this case, the anchor name and popularity). 3. Process the extracted data and display it in a form we can read at a glance.
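The three steps above can be sketched with nothing but the standard library. The helper names and the sample HTML fragment below are illustrative assumptions, not the actual Panda TV markup; the extraction step is demonstrated on a hard-coded fragment so the sketch runs without network access:

```python
from urllib import request
import re

def fetch_html(url):
    """Step 1: send an HTTP request and decode the server's response as text."""
    with request.urlopen(url) as r:
        return str(r.read(), encoding='utf-8')

def extract(pattern, html):
    """Step 2: pull all regex matches out of the HTML string."""
    return re.findall(pattern, html)

# Step 3 (demonstrated offline): extract names from a tiny hand-written
# fragment shaped like the data we care about, then display them.
sample = '<span class="name">Anchor A</span><span class="name">Anchor B</span>'
names = extract(r'<span class="name">([\s\S]*?)</span>', sample)
for rank, name in enumerate(names, start=1):
    print(str(rank) + '. ' + name)
```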
2. View the page source and observe the key values
We should first find the page we want to crawl, namely the Honor of Kings category on Panda TV, and then look at the page's source code to see where the data we care about lives. Here is a screenshot of the page:
[Figure: screenshot of the page (web page.png)]
Next, we view the page's HTML source in the browser; the exact steps differ from browser to browser, so search for instructions if you are unsure. This time we need each anchor's name and view count. In the source code below, we can quickly locate this key data, as shown in the figure:
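To make the regex idea concrete before the full listing, here is a sketch that runs the same kind of patterns against a tiny hand-written fragment. The fragment only mimics the shape of Panda TV's video-info blocks (the exact markup is an assumption; check the real source in your own browser):

```python
import re

# Hand-written stand-in for one video-info block on the page.
# The class names and nesting are assumptions for illustration.
html = '''
<div class="video-info">
  <span class="video-nickname"><i class="icon"></i>SomeAnchor</span>
  <span class="video-number">12.3万</span>
</div>
'''

# Non-greedy capture groups grab just the text between the tags:
# the name sits between </i> and the next </span>, the view count
# inside the video-number span.
name = re.findall(r'</i>([\s\S]*?)</span>', html)[0].strip()
number = re.findall(r'<span class="video-number">([\s\S]*?)</span>', html)[0]
print(name, number)
```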
[Figure: the HTML source code (html source code.png)]
3. The concrete implementation of the Python crawler
Here we create a crawler class `Spider`, use different regular expressions to pull the data out of the HTML tags, and then rearrange and print it. The specific code is as follows:

```python
from urllib import request
import re


class Spider:
    # The page we need to crawl
    url = 'https://www.panda.tv/cate/kingglory'
    # Regex: the div block containing each anchor's info
    reString_div = r'<div class="video-info">([\s\S]*?)</div>'
    # Regex: the anchor's name (between </i> and the closing </span>)
    reString_name = r'</i>([\s\S]*?)</span>'
    # Regex: the anchor's view count
    reString_number = r'<span class="video-number">([\s\S]*?)</span>'

    def __fetch_content(self):
        """Request the page and return its HTML as a string."""
        r = request.urlopen(Spider.url)
        data = r.read()
        htmlString = str(data, encoding='utf-8')
        return htmlString

    def __analysis(self, htmlString):
        """Extract the raw name/number pairs with regular expressions."""
        videoInfos = re.findall(Spider.reString_div, htmlString)
        anchors = []
        for html in videoInfos:
            name = re.findall(Spider.reString_name, html)
            number = re.findall(Spider.reString_number, html)
            anchor = {'name': name, 'number': number}
            anchors.append(anchor)
        return anchors

    def __refine(self, anchors):
        """Refine the data further: strip whitespace and unwrap the lists."""
        f = lambda anchor: {'name': anchor['name'][0].strip(),
                            'number': anchor['number'][0]}
        newAnchors = list(map(f, anchors))
        return newAnchors

    def __sort(self, anchors):
        """Sort the anchors by popularity, highest first."""
        anchors = sorted(anchors, key=self.__sort_seed, reverse=True)
        return anchors

    def __sort_seed(self, anchor):
        """Turn a view-count string into a comparable number.

        The regex keeps the decimal part; '万' means 10,000 in Chinese,
        so '12.3万' becomes 123000.
        """
        list_nums = re.findall(r'\d+\.?\d*', anchor['number'])
        number = float(list_nums[0])
        if '万' in anchor['number']:
            number = number * 10000
        return number

    def __show(self, anchors):
        """Display the data: print the already-sorted ranking."""
        for rank in range(0, len(anchors)):
            print('rank ' + str(rank + 1) + ' : ' +
                  anchors[rank]['name'] + '\t' + anchors[rank]['number'])

    def startRun(self):
        """Program entry point."""
        htmlString = self.__fetch_content()
        anchors = self.__analysis(htmlString)
        anchors = self.__refine(anchors)
        anchors = self.__sort(anchors)
        self.__show(anchors)


# Create a crawler and run it
spider = Spider()
spider.startRun()
```
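One subtle piece of the listing is the sort key, which turns a view-count string such as '12.3万' into a comparable number ('万' means 10,000 in Chinese). A standalone sketch of that conversion:

```python
import re

def popularity(number_string):
    """Convert a view-count string like '12.3万' or '876' to a float.

    Mirrors the sort-key idea in the Spider class above: grab the
    leading number (decimal part included) and scale by 10,000 when
    the string carries the '万' suffix.
    """
    nums = re.findall(r'\d+\.?\d*', number_string)
    value = float(nums[0])
    if '万' in number_string:
        value *= 10000
    return value

print(popularity('876'))     # 876.0
print(popularity('12.3万'))  # roughly 123000
```

Because the converted values are plain floats, they can be fed directly to `sorted(..., key=..., reverse=True)` to rank anchors by popularity.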
Running the crawler, we should see output like the following:
[Figure: crawler output (execute crawler.png)]