Pits are everywhere, so tread carefully and cherish them
It’s been more than a week since I fell into the Python crawler pit. Oh no, this article was supposed to be written last weekend, but weekends always fly by, so all I ended up with was an outline and nothing to fill it, and now it’s been more than two weeks. Consider this a tribute to my week of Python Scrapy crawling.
Introduction to Scrapy
Documentation (Chinese): scrapy-chs.readthedocs.io
Documentation (English, 1.2): doc.scrapy.org/en/1.2/inde…
The more I write the harder it is to stop, so I’ll save the introduction for a separate article and update the link here once it’s written. For today, let me fill in the pits I dug last week.
Scrapy’s meta
Scrapy’s meta is used to pass data into scrapy.Request(); whatever you attach rides along with the request and comes back out in the callback. The meta must be a dictionary.
def parse(self, response):
    # attach data to the request through meta (must be a dict)
    yield scrapy.Request(url='https://www.baidu.com', callback=self.detailPage,
                         meta={'website': 'baidu.com'})

def detailPage(self, response):
    website = response.meta['website']   # read it back in the callback
JSON parsing in Python:
A crawler can’t get by without JSON parsing. Many traditional websites may not need it, but plenty of newer sites use JSON for data transfer and dynamic rendering, so JSON parsing matters a great deal for crawlers.
Python’s json module provides four functions: loads(), load(), dumps(), and dump().
json.loads(): parses a JSON string into a dict or list.
json.load(): similar to json.loads(), but not the same; it reads from a file and parses the contents into a dict or list.
json.dumps(): converts a dict or list into a JSON string, i.e. the reverse of json.loads().
json.dump(): writes a dict or list to a file as JSON, i.e. the reverse of json.load().
import json

person = {'name': 'qitiandasheng', 'age': 18}   # renamed so the built-ins dict/str aren't shadowed
json_str = json.dumps(person)    # dict -> JSON string
data = json.loads(json_str)      # JSON string -> dict

with open('test.json', 'w') as f:
    json.dump(person, f)         # dict -> file
with open('test.json') as f:
    data = json.load(f)          # file -> dict
String functions:
replace(): takes two arguments; the first is the substring to be replaced and the second is what to replace it with. strip(): strips whitespace from both ends of the string.
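A quick sketch of both (the string here is a made-up example):

title = '  Hello, World  '
print(title.strip())                     # 'Hello, World': whitespace gone from both ends
print(title.replace('World', 'Scrapy'))  # '  Hello, Scrapy  ': first arg replaced by second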
Regular expressions:
For string processing there are the even more powerful regular expressions. To use them in Python, you first need to import the re module (import re). There are two ways to use regular expressions in Python. One is to compile the pattern first with re.compile() and then call matching methods on the resulting object, without passing the pattern into each call; if a pattern is used several times, this is the recommended way. The other is to call the re module’s functions directly, passing the regular expression as the first argument.
re.compile(): takes a regular expression string (a raw string r'' is recommended, so special characters don’t need extra escaping) and returns a regular expression object. re.match(): matches only from the beginning of the string. re.search(): matches starting at any position and returns the first match. re.findall(): matches at any position and returns a list of all results.
match() and search() return a match object, which has two methods. group(): returns the matched string (pass a group number to get a single captured group). groups(): returns all captured groups as a tuple.
The type of the elements in the list returned by re.findall() depends on how the regular expression is written. When the elements are strings: your pattern captures no groups, and each value is the full string the pattern matched (with exactly one group, each value is that group’s match). When the elements are tuples: your pattern has multiple capturing groups, and each tuple holds the captured groups’ values.
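A small sketch tying these together (the pattern and strings are made-up examples):

import re

pattern = re.compile(r'(\w+)@(\w+)\.com')   # compiled once, reusable

m = pattern.search('contact: dasheng@baidu.com')
print(m.group())    # 'dasheng@baidu.com': the whole match
print(m.groups())   # ('dasheng', 'baidu'): all captured groups as a tuple

text = 'a@x.com b@y.com'
print(re.findall(r'\w+@\w+\.com', text))      # no groups: ['a@x.com', 'b@y.com']
print(re.findall(r'(\w+)@(\w+)\.com', text))  # two groups: [('a', 'x'), ('b', 'y')]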
M.baidu.com/feed/data/l…
Newline ^M
Windows and Linux use different newline characters. Files edited on Windows show an extra ^M at the end of each line once uploaded to Linux.
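One way to strip them, sketched in Python (the filename is just an example, and this assumes you’d rather not depend on a tool like dos2unix):

# read the file as bytes and replace Windows CRLF endings with Unix LF
with open('spider.py', 'rb') as f:
    content = f.read().replace(b'\r\n', b'\n')
with open('spider.py', 'wb') as f:
    f.write(content)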
Pages that have been crawled will not be crawled again
Scrapy has a built-in mechanism: once a spider has sent a request to a URL, Scrapy will not process a second request for the same URL.
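If you genuinely need to request the same URL twice, scrapy.Request accepts a dont_filter=True flag that bypasses the duplicate filter. A minimal sketch (the spider name and URL are made up):

import scrapy

class RefreshSpider(scrapy.Spider):
    name = 'refresh'
    start_urls = ['https://www.baidu.com']

    def parse(self, response):
        # without dont_filter=True, the scheduler would drop this
        # request as a duplicate of the one already sent
        yield scrapy.Request(response.url, callback=self.parse_again, dont_filter=True)

    def parse_again(self, response):
        self.logger.info('fetched %s a second time', response.url)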
allowed_domains
When a spider is generated, the domain you give it is added to its allowed_domains list, and Scrapy will not request URLs whose domain is not in allowed_domains.
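A minimal sketch of the attribute (the domains are just examples):

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['baidu.com']           # off-site requests get filtered out
    start_urls = ['https://www.baidu.com']

    def parse(self, response):
        # this request is silently dropped: example.com is not in allowed_domains
        yield scrapy.Request('https://www.example.com', callback=self.parse)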
HTTPS configuration for Fiddler
First, configure Fiddler itself to capture and decrypt HTTPS packets.
Then set the proxy IP and port on the phone, e.g. 192.168.1.5:8888, open that address in the phone’s browser, and tap FiddlerRoot certificate to install the certificate; after that, HTTPS packets can be captured.
Common mistakes for Python beginners: www.oschina.net/question/89…
I’ve been writing for my WeChat official account for more than two years without ever bothering with typesetting, hahaha. Finally finished; off to sleep, zzz. If I have time tomorrow, I’ll make up the Scrapy introduction.