Pits are everywhere, so tread carefully and cherish them
It’s been more than a week since I fell into the Python crawler pit. Oh no, this article was supposed to be written last weekend, but weekends always fly by, so all I ended up with was an outline and nothing to fill it, and now it’s been more than two weeks. Consider this a tribute to my week of Python Scrapy crawling.
Introduction to Scrapy
Documentation (Chinese): scrapy-chs.readthedocs.io
Documentation (English, 1.2): doc.scrapy.org/en/1.2/inde…
The more I write the harder it is to stop, so I’ll save the introduction for a separate article and update the link here once it’s written. For today, let me fill in the pits I dug last week.
Scrapy’s meta
Scrapy’s meta is used to pass data into scrapy.Request(); whatever you attach rides along with the request and comes back out in the callback. The meta must be a dictionary.
def parse(self, response):
    # attach data to the request through meta (must be a dict)
    yield scrapy.Request(url='https://www.baidu.com', callback=self.detailPage,
                         meta={'website': 'baidu.com'})

def detailPage(self, response):
    website = response.meta['website']   # read it back in the callback
JSON parsing in Python:
A crawler can’t get by without JSON parsing. Many traditional websites may not need it, but plenty of newer sites use JSON for data transfer and dynamic rendering, so JSON parsing matters a great deal for crawlers.
Python’s json module provides four functions: loads(), load(), dumps(), and dump().
json.loads(): parses a JSON string into a dict or list.
json.load(): similar to json.loads(), but not the same; it reads from a file and parses the contents into a dict or list.
json.dumps(): converts a dict or list into a JSON string, i.e. the reverse of json.loads().
json.dump(): writes a dict or list to a file as JSON, i.e. the reverse of json.load().
import json

person = {'name': 'qitiandasheng', 'age': 18}   # renamed so the built-ins dict/str aren't shadowed
json_str = json.dumps(person)    # dict -> JSON string
data = json.loads(json_str)      # JSON string -> dict

with open('test.json', 'w') as f:
    json.dump(person, f)         # dict -> file
with open('test.json') as f:
    data = json.load(f)          # file -> dict
String functions:
replace(): takes two arguments; the first is the substring to be replaced and the second is what to replace it with. strip(): strips whitespace from both ends of the string.
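A quick sketch of both (the string here is a made-up example):

title = '  Hello, World  '
print(title.strip())                     # 'Hello, World': whitespace gone from both ends
print(title.replace('World', 'Scrapy'))  # '  Hello, Scrapy  ': first arg replaced by second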
Regular expressions:
For string processing there are the even more powerful regular expressions. To use them in Python, you first need to import the re module (import re). There are two ways to use regular expressions in Python. One is to compile the pattern first with re.compile() and then call matching methods on the resulting object, without passing the pattern into each call; if a pattern is used several times, this is the recommended way. The other is to call the re module’s functions directly, passing the regular expression as the first argument.
re.compile(): takes a regular expression string (a raw string r'' is recommended, so special characters don’t need extra escaping) and returns a regular expression object. re.match(): matches only from the beginning of the string. re.search(): matches starting at any position and returns the first match. re.findall(): matches at any position and returns a list of all results.
match() and search() return a match object, which has two methods. group(): returns the matched string (pass a group number to get a single captured group). groups(): returns all captured groups as a tuple.
The type of the elements in the list returned by re.findall() depends on how the regular expression is written. When the elements are strings: your pattern captures no groups, and each value is the full string the pattern matched (with exactly one group, each value is that group’s match). When the elements are tuples: your pattern has multiple capturing groups, and each tuple holds the captured groups’ values.
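A small sketch tying these together (the pattern and strings are made-up examples):

import re

pattern = re.compile(r'(\w+)@(\w+)\.com')   # compiled once, reusable

m = pattern.search('contact: dasheng@baidu.com')
print(m.group())    # 'dasheng@baidu.com': the whole match
print(m.groups())   # ('dasheng', 'baidu'): all captured groups as a tuple

text = 'a@x.com b@y.com'
print(re.findall(r'\w+@\w+\.com', text))      # no groups: ['a@x.com', 'b@y.com']
print(re.findall(r'(\w+)@(\w+)\.com', text))  # two groups: [('a', 'x'), ('b', 'y')]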
M.baidu.com/feed/data/l…
Newline ^M
Windows and Linux use different newline characters. Files edited on Windows show an extra ^M at the end of each line once uploaded to Linux.
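One way to strip them, sketched in Python (the filename is just an example, and this assumes you’d rather not depend on a tool like dos2unix):

# read the file as bytes and replace Windows CRLF endings with Unix LF
with open('spider.py', 'rb') as f:
    content = f.read().replace(b'\r\n', b'\n')
with open('spider.py', 'wb') as f:
    f.write(content)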
Pages that have been crawled will not be crawled again
Scrapy has a built-in mechanism: once a spider has sent a request to a URL, Scrapy will not process a second request for the same URL.
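If you genuinely need to request the same URL twice, scrapy.Request accepts a dont_filter=True flag that bypasses the duplicate filter. A minimal sketch (the spider name and URL are made up):

import scrapy

class RefreshSpider(scrapy.Spider):
    name = 'refresh'
    start_urls = ['https://www.baidu.com']

    def parse(self, response):
        # without dont_filter=True, the scheduler would drop this
        # request as a duplicate of the one already sent
        yield scrapy.Request(response.url, callback=self.parse_again, dont_filter=True)

    def parse_again(self, response):
        self.logger.info('fetched %s a second time', response.url)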
allowed_domains
When a spider is generated, the domain you give it is added to its allowed_domains list, and Scrapy will not request URLs whose domain is not in allowed_domains.
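A minimal sketch of the attribute (the domains are just examples):

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['baidu.com']           # off-site requests get filtered out
    start_urls = ['https://www.baidu.com']

    def parse(self, response):
        # this request is silently dropped: example.com is not in allowed_domains
        yield scrapy.Request('https://www.example.com', callback=self.parse)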
HTTPS configuration for Fiddler
First, configure Fiddler itself to capture and decrypt HTTPS packets.
Then set the proxy IP and port on the phone, e.g. 192.168.1.5:8888, open that address in the phone’s browser, and tap FiddlerRoot certificate to install the certificate; after that, HTTPS packets can be captured.
Common mistakes for Python beginners: www.oschina.net/question/89…
I’ve been writing for my WeChat official account for more than two years without ever bothering with typesetting, hahaha. Finally finished; off to sleep, zzz. If I have time tomorrow, I’ll make up the Scrapy introduction.