Forbidden by robots.txt
Analysis:
After checking robots.txt, I found that the site uses the Robots protocol, which defines which pages or files on the site crawlers are allowed to fetch. You can visit www.baidu.com/robots.txt to view the permissions, which include, for example:
User-agent: Baiduspider
Disallow: /baidu
Scrapy obeys the Robots protocol by default, so pages disallowed by it are simply not crawled.
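As a side note, you can check such rules programmatically. A minimal sketch using Python's standard urllib.robotparser (not part of Scrapy itself, purely illustrative):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.baidu.com/robots.txt")
rp.read()  # fetch and parse the rules
# Per the rules quoted above, Baiduspider may not fetch /baidu
print(rp.can_fetch("Baiduspider", "https://www.baidu.com/baidu"))  # expected: False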
Solution:
Disable Scrapy's ROBOTSTXT_OBEY setting: locate the variable in settings.py and set it to False.
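In a Scrapy project this lives in settings.py:

# settings.py
ROBOTSTXT_OBEY = False  # stop filtering requests against robots.txt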
TypeError: Object of type 'Selector' is not JSON serializable
Problem: JSON serialization failed
Reason: forgot to call extract()
extract(): serializes the matched nodes as Unicode strings and returns them in a list
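A minimal sketch of the fix (the spider, site, and CSS selectors here are illustrative, not the original code):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                # quote.css("span.text") alone returns a SelectorList,
                # which the JSON exporter cannot serialize
                "text": quote.css("span.text::text").extract_first(),
                "tags": quote.css("a.tag::text").extract(),
            }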
TypeError: write() argument must be str, not bytes
Code:
filename = open('test.json', 'w')  # text mode: writing bytes to this file raises the TypeError
Solution:
Change the mode to 'wb' so the file is opened in binary write mode.
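A minimal sketch of the corrected code (the bytes payload is just an example):

filename = open('test.json', 'wb')  # binary write mode accepts bytes
filename.write(b'{"title": "example"}')
filename.close()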
Install Scrapy
Environment:
Python: 3.6.3, macOS: 10.13.5
Installation:
pip3 install scrapy --user
Error:
Running scrapy -v fails with:
bash: scrapy: command not found
Solution:
- Check whether the dependencies are installed correctly: open Scrapy's GitHub repository and compare the output of pip list against the requirements in setup.py
- If the dependencies are all installed correctly, create a soft (symbolic) link to the scrapy executable, as shown below:
find / -name scrapy
ln -s /Users/macbook/Library/Python/3.6/bin/scrapy /usr/local/bin/scrapy
Now the scrapy command runs as expected.