The following article is from Cloud cloud, author: Wu Yu Beichen
This is a small exercise after learning the basics of Python. This time, I will crawl the Honor of Kings anchor ranking from Panda TV's web page and demonstrate the principle of a crawler without the help of any third-party framework.
1. The idea of implementing a Python crawler
Step 1: Define your purpose
1. Find the page whose data you want to crawl.
Step 2: Simulate the HTTP request, extract the data, and process it
1. Simulate the HTTP network request: send a request to the server and obtain the HTML it returns to us. 2. Use regular expressions to extract the data we need from the HTML (in this case, the anchor name and popularity). 3. Process the extracted data and display it in a form we can read at a glance.
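The three steps above can be sketched with nothing but the standard library. The helper names and the sample HTML fragment below are illustrative assumptions, not the actual Panda TV markup; the extraction step is demonstrated on a hard-coded fragment so the sketch runs without network access:

```python
from urllib import request
import re

def fetch_html(url):
    """Step 1: send an HTTP request and decode the server's response as text."""
    with request.urlopen(url) as r:
        return str(r.read(), encoding='utf-8')

def extract(pattern, html):
    """Step 2: pull all regex matches out of the HTML string."""
    return re.findall(pattern, html)

# Step 3 (demonstrated offline): extract names from a tiny hand-written
# fragment shaped like the data we care about, then display them.
sample = '<span class="name">Anchor A</span><span class="name">Anchor B</span>'
names = extract(r'<span class="name">([\s\S]*?)</span>', sample)
for rank, name in enumerate(names, start=1):
    print(str(rank) + '. ' + name)
```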
2. View the page source and observe the key values
We should first find the page we want to crawl, namely the Honor of Kings category on Panda TV, and then look at the page's source code to see where the data we care about lives. Here is a screenshot of the page:
[Figure: screenshot of the page (web page.png)]
Next, we view the page's HTML source in the browser; the exact steps differ from browser to browser, so search for instructions if you are unsure. This time we need each anchor's name and view count. In the source code below, we can quickly locate this key data, as shown in the figure:
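To make the regex idea concrete before the full listing, here is a sketch that runs the same kind of patterns against a tiny hand-written fragment. The fragment only mimics the shape of Panda TV's video-info blocks (the exact markup is an assumption; check the real source in your own browser):

```python
import re

# Hand-written stand-in for one video-info block on the page.
# The class names and nesting are assumptions for illustration.
html = '''
<div class="video-info">
  <span class="video-nickname"><i class="icon"></i>SomeAnchor</span>
  <span class="video-number">12.3万</span>
</div>
'''

# Non-greedy capture groups grab just the text between the tags:
# the name sits between </i> and the next </span>, the view count
# inside the video-number span.
name = re.findall(r'</i>([\s\S]*?)</span>', html)[0].strip()
number = re.findall(r'<span class="video-number">([\s\S]*?)</span>', html)[0]
print(name, number)
```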
[Figure: the HTML source code (html source code.png)]
3. The concrete implementation of the Python crawler
Here we create a crawler class `Spider`, use different regular expressions to pull the data out of the HTML tags, and then rearrange and print it. The specific code is as follows:

```python
from urllib import request
import re


class Spider:
    # The page we need to crawl
    url = 'https://www.panda.tv/cate/kingglory'
    # Regex: the div block containing each anchor's info
    reString_div = r'<div class="video-info">([\s\S]*?)</div>'
    # Regex: the anchor's name (between </i> and the closing </span>)
    reString_name = r'</i>([\s\S]*?)</span>'
    # Regex: the anchor's view count
    reString_number = r'<span class="video-number">([\s\S]*?)</span>'

    def __fetch_content(self):
        """Request the page and return its HTML as a string."""
        r = request.urlopen(Spider.url)
        data = r.read()
        htmlString = str(data, encoding='utf-8')
        return htmlString

    def __analysis(self, htmlString):
        """Extract the raw name/number pairs with regular expressions."""
        videoInfos = re.findall(Spider.reString_div, htmlString)
        anchors = []
        for html in videoInfos:
            name = re.findall(Spider.reString_name, html)
            number = re.findall(Spider.reString_number, html)
            anchor = {'name': name, 'number': number}
            anchors.append(anchor)
        return anchors

    def __refine(self, anchors):
        """Refine the data further: strip whitespace and unwrap the lists."""
        f = lambda anchor: {'name': anchor['name'][0].strip(),
                            'number': anchor['number'][0]}
        newAnchors = list(map(f, anchors))
        return newAnchors

    def __sort(self, anchors):
        """Sort the anchors by popularity, highest first."""
        anchors = sorted(anchors, key=self.__sort_seed, reverse=True)
        return anchors

    def __sort_seed(self, anchor):
        """Turn a view-count string into a comparable number.

        The regex keeps the decimal part; '万' means 10,000 in Chinese,
        so '12.3万' becomes 123000.
        """
        list_nums = re.findall(r'\d+\.?\d*', anchor['number'])
        number = float(list_nums[0])
        if '万' in anchor['number']:
            number = number * 10000
        return number

    def __show(self, anchors):
        """Display the data: print the already-sorted ranking."""
        for rank in range(0, len(anchors)):
            print('rank ' + str(rank + 1) + ' : ' +
                  anchors[rank]['name'] + '\t' + anchors[rank]['number'])

    def startRun(self):
        """Program entry point."""
        htmlString = self.__fetch_content()
        anchors = self.__analysis(htmlString)
        anchors = self.__refine(anchors)
        anchors = self.__sort(anchors)
        self.__show(anchors)


# Create a crawler and run it
spider = Spider()
spider.startRun()
```
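One subtle piece of the listing is the sort key, which turns a view-count string such as '12.3万' into a comparable number ('万' means 10,000 in Chinese). A standalone sketch of that conversion:

```python
import re

def popularity(number_string):
    """Convert a view-count string like '12.3万' or '876' to a float.

    Mirrors the sort-key idea in the Spider class above: grab the
    leading number (decimal part included) and scale by 10,000 when
    the string carries the '万' suffix.
    """
    nums = re.findall(r'\d+\.?\d*', number_string)
    value = float(nums[0])
    if '万' in number_string:
        value *= 10000
    return value

print(popularity('876'))     # 876.0
print(popularity('12.3万'))  # roughly 123000
```

Because the converted values are plain floats, they can be fed directly to `sorted(..., key=..., reverse=True)` to rank anchors by popularity.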
Running the crawler, we should see output like the following:
[Figure: crawler output (execute crawler.png)]