Author: Jiang Xia

| Zhihu: www.zhihu.com/people/1024…

| GitHub: github.com/JiangXia-10…

| CSDN: blog.csdn.net/qq_4115394…

| Juejin: juejin.cn/user/651387…

| WeChat Official Account: 1024 notes

This article contains 3756 words and takes 13 minutes to read

Many readers have been asking me about Python web crawlers and where to find learning resources. This article briefly introduces the basics of Python crawlers and demonstrates them by crawling product information from Taobao and saving it to an Excel file. As usual, the source code will be synced to GitHub; feel free to download and use it.

Before writing a crawler, we need to know some basics about how crawlers work.

First of all, we need to understand the basic principles of crawlers:

The basic web page request process can be divided into two steps:

1. Request: Every web page presented to the user must go through this step, which is to send a Request to the server.

2. Response: After receiving the user’s request, the server will verify the validity of the request and then send the Response content to the user (client). The client receives the Response content from the server and displays the content (i.e. the web page), as shown in the figure below.

There are two ways to request a web page:

1. GET: the most common method, generally used to obtain or query resource information; it is the method most websites use and it responds quickly.

2. POST: compared with GET, POST can additionally upload parameters in a form, so besides querying information it can also submit or modify information.
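As a minimal illustration of the two methods (using httpbin.org as a stand-in test endpoint, not the site we will crawl later), a requests sketch:

import requests

# GET: parameters are appended to the URL as a query string
r = requests.get("https://httpbin.org/get", params={"q": "glasses"})
print(r.status_code, r.url)

# POST: parameters are sent in the form body
r = requests.post("https://httpbin.org/post", data={"q": "glasses"})
print(r.status_code)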

Therefore, before writing the crawler, we should first determine who to send the request to and how to send it.

To know whom to send the request to, we need the request URL. Taking the Taobao search URL for glasses as an example:

https://s.taobao.com/search?q=%E7%9C%BC%E9%95%9C&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306

Here the Chinese word for glasses (眼镜) has been URL-encoded as %E7%9C%BC%E9%95%9C:

Here we just need to know that the value after q is the name of the product we are searching for, and other parameters are not used for the moment.
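If you want to check the encoding yourself, the standard library's urllib.parse does the conversion; a quick sketch:

from urllib.parse import quote, unquote

# encode the Chinese keyword the same way the browser does
print(quote("眼镜"))                   # %E7%9C%BC%E9%95%9C
# decode the escaped value back to readable text
print(unquote("%E7%9C%BC%E9%95%9C"))   # 眼镜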

Because crawlers don’t just crawl one page of information, we skip to the next page:

Comparing the URLs, you can see that the paging parameter s equals 44 * (page - 1), i.e. each page offsets the results by 44 items.
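A quick sketch of how the offsets work out for the first few pages:

# each results page holds 44 items, so the offset s grows in steps of 44
for page in range(1, 4):
    s = 44 * (page - 1)
    print("page", page, "-> &s=" + str(s))   # page 1 -> &s=0, page 2 -> &s=44, page 3 -> &s=88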

Searching the page source for g_page_config, we find the product data embedded there as JSON:

, "data" : {" postFeeText ":" freight ", "trace" : "msrp_auction", "auctions" : [{" p4p ": 1," p4pSameHeight ": true," nid ":" 536766094512 ", "categ Ory ":""," PID ":"","title":" Myopia \ u003cSPAN class\u003dH\ u003E glasses \ u003C/SPAN \u003e Male Have weight Ultra Light All Frame \ u003CSPAN ","pid":"","title":" Myopia \ u003CSPAN class\u003dH\ u003E glasses \ u003C /span\u003e Male Have weight Ultra Light All frame \ u003CSPAN Class \u003dH\ U003E glasses \ U003C/SPAN \ U003E frame half frame comfort can be matched with \ U003CSPAN Class \ u003dH \ \ u003c u003e glasses/span \ u003e anti-fog eye myopic lens, "" raw_title" : "danyang glasses frame spectacle frame eyes box anti-radiation optical mirror," "pic_url" : "/ / g-search1.alicdn.com/img/ba o/uploaded/i4/imgextra/i2/104870285060645671/TB2ulgla4vzQeBjSZPfXXbWGFXa_!!0-saturn_solar.jpg"Copy the code

In this JSON, postFeeText is the shipping fee text, raw_title is the product title, and pic_url is the address of the product image. The other fields we care about are:

view_price: price;

nick: shop name;

item_loc: location (region) of the item;

view_sales: sales volume.

These fields correspond to the information shown on the search results page. To see how the request itself is made, open the browser developer tools and check Network -> Headers -> Request Method.

Now that we know the basics, we can write a small crawler, such as the following code:

import requests

url = 'https://s.taobao.com/search?q=%E7%9C%BC%E9%95%9C&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306&bcoffset=3&ntoffset=3&p4ppushleft=1%2C48&s=44'
strhtml = requests.get(url)
print(strhtml.text)

This fetches the content of the page and prints it as HTML source.

We request the site with the requests library. Loading a library in Python is done with import followed by the library name; in the code above, the statement that loads the requests library is: import requests.

To fetch the data with GET, call the get method of the requests library by typing a dot after requests, as shown below:

requests.get

Store the obtained data in the strhtml variable as follows:

strhtml = requests.get(url)

At this point, strhtml is a Response object representing the whole response from the server, but what we need is the source code of the page. The following expression gives the page source:

strhtml.text
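Besides .text, the Response object exposes a few attributes that are handy when debugging a crawler. A small sketch (note that without the request headers added later, Taobao may return a login page instead of real results):

import requests

strhtml = requests.get('https://s.taobao.com/search?q=%E7%9C%BC%E9%95%9C')
print(strhtml.status_code)        # HTTP status code; 200 means OK
print(strhtml.encoding)           # encoding guessed from the response headers
print(strhtml.apparent_encoding)  # encoding detected from the content itself
print(len(strhtml.text))          # length of the page source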

Next we will complete the Taobao crawler. The information we want to collect is: product name, shop name, price, region, and number of payers.

First we define a function that builds the list of URLs to request:

def Geturls(q, x):
    url = "https://s.taobao.com/search?q=" + q + "&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm" \
          "=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306"
    urls = []
    urls.append(url)
    if x == 1:
        return urls
    for i in range(1, x):
        url = "https://s.taobao.com/search?q=" + q + "&commend=all&ssid=s5-e&search_type=item" \
              "&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306" \
              "&bcoffset=3&ntoffset=3&p4ppushleft=1%2C48&s=" + str(i * 44)
        urls.append(url)
    return urls

Then define a function to get the HTML page:

def GetHtml(url):
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    return r

Then define a function to get information about the item and insert it into Excel:

Let's first look at the re library:

The re library is Python's standard library for regular expressions.

The re library typically uses the raw string type to express a regular expression, written as r'test'.

A raw string is a string in which backslashes are not treated as escape characters.
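A quick illustration of the difference, in plain Python:

# in a normal string, \n is an escape sequence for a newline
print(len("a\nb"))    # 3 characters: 'a', newline, 'b'
# in a raw string, the backslash and the n are kept as they are
print(len(r"a\nb"))   # 4 characters: 'a', '\', 'n', 'b'
# this is why regular expressions are usually written as raw strings, e.g. r'"raw_title":"(.*?)"'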

The main functions of the re library include re.match(), re.search(), re.findall(), re.finditer(), re.split() and re.sub().

Here we use the findall() function to extract information, for example:

a = re.findall(r'"raw_title":"(.*?)"', html)
Putting it together, the function looks like this:

def GetandintoExcel(html):
    global count
    # product name
    a = re.findall(r'"raw_title":"(.*?)"', html)
    # shop name
    b = re.findall(r'"nick":"(.*?)"', html)
    # price
    c = re.findall(r'"view_price":"(.*?)"', html)
    # region
    d = re.findall(r'"item_loc":"(.*?)"', html)
    # sales volume
    e = re.findall(r'"view_sales":"(.*?)"', html)
    x = []
    for i in range(len(a)):
        try:
            x.append((a[i], b[i], c[i], d[i], e[i]))
        except IndexError:
            break
    for i in range(len(x)):
        worksheet.write(count + i + 1, 0, x[i][0])
        worksheet.write(count + i + 1, 1, x[i][1])
        worksheet.write(count + i + 1, 2, x[i][2])
        worksheet.write(count + i + 1, 3, x[i][3])
        worksheet.write(count + i + 1, 4, x[i][4])
    count = count + len(x)
    return print("Data crawled for this page")

The main function is as follows:

if __name__ == "__main__":
    count = 0
    # request headers
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
        "cookie": "t=dc3b73639f4d5652aee8e9f52d7d0c41; sgcookie=E100VfIm5WNNIHQbxK40GoWlA%2BiEh8%2BL3IfWZ8gaPQLmeINstkbFpvtbviklWcVtFdWGQqp2UpMRm4mVXJeOwD1RzA%3D%3D; tracknick=%5Cu5C0F%5Cu5C0F%5Cu5C0F%5Cu5C0F%5Cu54466; _cc_=UtASsssmfA%3D%3D; thw=cn; enc=l%2Fjb6N5FBl9K0ekOiije0dOrXynlA1PT6kAWiXlE8MP7XwVwWABeB1r%2F4%2FN%2FROmEcqBpM4Uk%2FlCcbvHxEX4HhA%3D%3D; cna=E7gdGOrz1lwCAXOs+dCyLVoL; _m_h5_tk=b3004b87cb6eecc0cc7c7154acf7a244_1606566002810; _m_h5_tk_enc=082f300176ed45b823551f15ca981ee0; cookie2=2c290690f10861f7bdd981cba5ac078f; v=0; _tb_token_=0a7840e5536b; JSESSIONID=CE9BABF53453F46B8B6A2FAA274428C1; alitrackid=www.taobao.com; lastalitrackid=www.taobao.com; hng=CN%7Czh-CN%7CCNY%7C156; xlly_s=1; _samesite_flag_=true; tfstk=cVuOB9wPApvG8ZVKacKhcclUWCOhZtfTn1wAkQuqyoMJW-7AiGgoy0ZkfSPvIBC..; l=eBjdYUdPOiL-FAJDBOfwourza77OSIRAguPzaNbMiOCPOZCp53UFWZR2YsT9C3GVh6RXR3rEk3ObBeYBqIv4n5U62j-la_kmn; isg=f9i2fszewe3qwrhBE5OFMfVnXt4DynJaP_rUvlZnyQQzxLJN80UA3iXutEM2-414"
    }
    q = input("Which product do you want to crawl: ")
    x = int(input("How many pages do you want to crawl: "))
    urls = Geturls(q, x)
    workbook = xlsxwriter.Workbook(q + ".xlsx")
    worksheet = workbook.add_worksheet()
    worksheet.set_column('A:A', 70)
    worksheet.set_column('B:B', 40)
    worksheet.set_column('C:C', 20)
    worksheet.set_column('D:D', 20)
    worksheet.set_column('E:E', 20)
    worksheet.write('A1', 'Product name')
    worksheet.write('B1', 'Shop name')
    worksheet.write('C1', 'Price')
    worksheet.write('D1', 'Region')
    worksheet.write('E1', 'Number of payers')
    for url in urls:
        html = GetHtml(url)
        s = GetandintoExcel(html.text)
        time.sleep(5)
    workbook.close()

Finally run the program:

That is how to crawl Taobao product information with Python. The complete code is as follows:

import re
import requests
import xlsxwriter
import time
import math


def Geturls(q, x):
    url = "https://s.taobao.com/search?q=" + q + "&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm" \
          "=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306"
    urls = []
    urls.append(url)
    if x == 1:
        return urls
    for i in range(1, x):
        url = "https://s.taobao.com/search?q=" + q + "&commend=all&ssid=s5-e&search_type=item" \
              "&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306" \
              "&bcoffset=3&ntoffset=3&p4ppushleft=1%2C48&s=" + str(i * 44)
        urls.append(url)
    return urls


def GetHtml(url):
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    return r


def GetandintoExcel(html):
    global count
    a = re.findall(r'"raw_title":"(.*?)"', html)   # product name
    b = re.findall(r'"nick":"(.*?)"', html)        # shop name
    c = re.findall(r'"view_price":"(.*?)"', html)  # price
    d = re.findall(r'"item_loc":"(.*?)"', html)    # region
    e = re.findall(r'"view_sales":"(.*?)"', html)  # sales volume
    x = []
    for i in range(len(a)):
        try:
            x.append((a[i], b[i], c[i], d[i], e[i]))
        except IndexError:
            break
    for i in range(len(x)):
        worksheet.write(count + i + 1, 0, x[i][0])
        worksheet.write(count + i + 1, 1, x[i][1])
        worksheet.write(count + i + 1, 2, x[i][2])
        worksheet.write(count + i + 1, 3, x[i][3])
        worksheet.write(count + i + 1, 4, x[i][4])
    count = count + len(x)
    return print("Data crawled for this page")


if __name__ == "__main__":
    count = 0
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
        "cookie": "t=dc3b73639f4d5652aee8e9f52d7d0c41; sgcookie=E100VfIm5WNNIHQbxK40GoWlA%2BiEh8%2BL3IfWZ8gaPQLmeINstkbFpvtbviklWcVtFdWGQqp2UpMRm4mVXJeOwD1RzA%3D%3D; tracknick=%5Cu5C0F%5Cu5C0F%5Cu5C0F%5Cu5C0F%5Cu54466; _cc_=UtASsssmfA%3D%3D; thw=cn; enc=l%2Fjb6N5FBl9K0ekOiije0dOrXynlA1PT6kAWiXlE8MP7XwVwWABeB1r%2F4%2FN%2FROmEcqBpM4Uk%2FlCcbvHxEX4HhA%3D%3D; cna=E7gdGOrz1lwCAXOs+dCyLVoL; _m_h5_tk=b3004b87cb6eecc0cc7c7154acf7a244_1606566002810; _m_h5_tk_enc=082f300176ed45b823551f15ca981ee0; cookie2=2c290690f10861f7bdd981cba5ac078f; v=0; _tb_token_=0a7840e5536b; JSESSIONID=CE9BABF53453F46B8B6A2FAA274428C1; alitrackid=www.taobao.com; lastalitrackid=www.taobao.com; hng=CN%7Czh-CN%7CCNY%7C156; xlly_s=1; _samesite_flag_=true; tfstk=cVuOB9wPApvG8ZVKacKhcclUWCOhZtfTn1wAkQuqyoMJW-7AiGgoy0ZkfSPvIBC..; l=eBjdYUdPOiL-FAJDBOfwourza77OSIRAguPzaNbMiOCPOZCp53UFWZR2YsT9C3GVh6RXR3rEk3ObBeYBqIv4n5U62j-la_kmn; isg=f9i2fszewe3qwrhBE5OFMfVnXt4DynJaP_rUvlZnyQQzxLJN80UA3iXutEM2-414"
    }
    q = input("Which product do you want to crawl: ")
    x = int(input("How many pages do you want to crawl: "))
    urls = Geturls(q, x)
    workbook = xlsxwriter.Workbook(q + ".xlsx")
    worksheet = workbook.add_worksheet()
    worksheet.set_column('A:A', 70)
    worksheet.set_column('B:B', 40)
    worksheet.set_column('C:C', 20)
    worksheet.set_column('D:D', 20)
    worksheet.set_column('E:E', 20)
    worksheet.write('A1', 'Product name')
    worksheet.write('B1', 'Shop name')
    worksheet.write('C1', 'Price')
    worksheet.write('D1', 'Region')
    worksheet.write('E1', 'Number of payers')
    for url in urls:
        html = GetHtml(url)
        s = GetandintoExcel(html.text)
        time.sleep(5)
    workbook.close()

Finally, a word about the legality of crawling. Almost every website has a file named robots.txt, although some sites do not provide one. For a site without robots.txt, any data that is not protected by a password can be fetched by a web crawler, that is, all of the site's page data can be crawled. If the site does have a robots.txt file, you need to check whether it forbids visitors from fetching certain data. Take Baidu as an example: visit www.baidu.com/robots.txt in a browser.

You can see that Baidu allows certain named crawlers to access some of its paths, while for everyone else all crawling is forbidden by the following rule:

User-agent: *

Disallow: /

This rule means that, apart from the crawlers explicitly listed earlier in the file, no crawler is allowed to fetch any data.
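If you want to check a robots.txt file programmatically, the standard library ships a parser; a small sketch (the crawler name MyCrawler is made up, and the expected results are only what the rules above suggest):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.baidu.com/robots.txt")
rp.read()

# an arbitrary, unlisted crawler falls under "User-agent: *" and is blocked
print(rp.can_fetch("MyCrawler", "https://www.baidu.com/"))    # expected: False
# Baiduspider is one of the crawlers explicitly listed, so some paths are allowed
print(rp.can_fetch("Baiduspider", "https://www.baidu.com/"))  # expected: True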

So our crawler should respect the rules of the websites it visits and stay within what they allow.

That said, many websites also have anti-crawler mechanisms. A crawler is, in the end, code that simulates human browsing behavior in order to fetch data in bulk. As the amount of data fetched grows, it puts heavy pressure on the server being visited and may even bring it down, so many websites adopt anti-crawling strategies.

How does the server recognize crawler behavior? The first, fairly crude, anti-crawling mechanism is to check the User-Agent of the connection to see whether the visit comes from a browser or from code; if it comes from code, the server blocks the incoming IP once the traffic rises. In the browser developer tools we can find the URL, Form Data, Request Headers and so on, and the server decides by looking at the User-Agent keyword under Request Headers. We can therefore wrap a browser's request header into our own request.

Therefore, we only need to construct the parameters of the request header. Create a request header as follows:

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}
response = requests.get(url, headers=headers)

Also keep in mind that a crawler's access frequency is much higher than a person's: we might look at one page per second, while code can fetch hundreds or thousands of pages in the same time, which again puts pressure on the server and will still trigger anti-crawling. So the second common anti-crawling mechanism is the familiar verification code (CAPTCHA): the server counts the access frequency of each IP, and if it exceeds a set threshold it returns a verification code. A real user fills it in and carries on; code gets its IP blocked. At this point we have two solutions. The first, and most common, is to add a delay, for example fetching once every 5 seconds:

import time

time.sleep(5)

However, this approach reduces the crawler's efficiency. The second solution is to set proxy IPs. requests has a built-in proxies parameter: first build your own pool of proxy IPs, assign it to proxies as a dictionary, and pass it to requests, as follows:

proxies={
"http":"http://xx.xx.x.xx.xxx",
"https":"http://xx.xx.x.xx.xxx",
}
response = requests.get(url, proxies=proxies)

Finally, crawling is a skill worth learning and an interesting hobby, but do not use it to break a website's rules or to do anything illegal.

As with my other hands-on articles, all of the source code will be synced to GitHub; feel free to download and use it.

This article source address: github.com/JiangXia-10…

Finally, if you found this article useful, please like it and recommend it to more people. You are also welcome to follow the WeChat Official Account 1024 notes for free access to plenty of learning resources (videos, source code, and documents)!

Other recommendations:

  • Getting started with Python (3): Using tuples
  • Getting started with Python (4): Using sets
  • Getting started with Python (5): Using dicts
  • Getting started with Python (6): Calling custom functions
  • Getting started with Python (1): String formatting