This article is participating in Python Theme Month. See the event link for more details.
Chen is still a Scrapy beginner, so there may be mistakes; please point them out in the comments at the bottom. Thank you very much!
As of this writing, Scrapy's latest version is 2.5.0.
All right, cut the crap and let’s go!
Command Line Help
Almost every command line tool ships with built-in help; it's practically an industry default. Scrapy is no exception.
$ scrapy -h
Scrapy 2.5.0 - project: scrapybot

Usage:
  scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
check Check spider contracts
commands
crawl Run a spider
edit Edit spider
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
list List available spiders
parse Parse URL (using its spider) and print the results
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
Use "scrapy <command> -h" to see more info about a command
startproject
Creates a new Scrapy project, automatically generating the standard project structure.
Command format
$ scrapy startproject <project_name> [project_dir]
# project_dir is optional; if omitted, a folder with the same name as the project is created
Example
$ scrapy startproject example
New Scrapy project 'example', using template directory 'd:\devtools\python\python39\lib\site-packages\scrapy\templates\project', created in:
D:\WorkSpace\Personal\my-scrapy\example
You can start your first spider with:
cd example
scrapy genspider example example.com
$ ls
example readme.md venv
The project structure is roughly as shown below.
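A sketch of the layout that startproject generates (annotations mine; details can vary slightly between Scrapy versions):

example/
├── scrapy.cfg            # deploy configuration file
└── example/              # the project's Python module
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # spider and downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/          # directory where your spiders live
        └── __init__.py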
genspider
Generates a new spider from a predefined template. This is very handy and can greatly speed up spider development, provided you have a good set of templates.
Command format
$ scrapy genspider [-t template] <name> <domain>
# the template name is optional; the default template (basic) is used if omitted
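You can list the built-in templates with -l; on Scrapy 2.5 the output looks like this:

$ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed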
Example
$ cd example/example/spiders/
$ scrapy genspider exampleSpider example.spider.com
Created spider 'exampleSpider' using template 'basic' in module:
example.spiders.exampleSpider
The created spider looks like this:
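A sketch based on the basic template (the exact class name is derived from the spider name and may differ by version):

import scrapy


class ExamplespiderSpider(scrapy.Spider):
    name = 'exampleSpider'
    allowed_domains = ['example.spider.com']
    start_urls = ['http://example.spider.com/']

    def parse(self, response):
        # callback invoked for each downloaded response; empty in the template
        pass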
crawl
Runs a spider. This command looks a bit like runspider, but it has to be executed inside a standard Scrapy project structure.
Command format
$ scrapy crawl <spider>
# <spider> is the name of the spider we created with genspider above
Example
$ scrapy crawl exampleSpider
2021-07-25 01:50:59 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: example)
2021-07-25 01:50:59 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21), pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
2021-07-25 01:50:59 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-07-25 01:50:59 [scrapy.crawler] INFO: Overridden settings:
...
 'start_time': datetime.datetime(2021, 7, 24, 17, 51, 0, 206683)}
2021-07-25 01:51:02 [scrapy.core.engine] INFO: Spider closed (finished)
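One option worth knowing here: crawl can export the scraped items straight to a file with -o, the format being inferred from the file extension (json, csv, xml and so on):

$ scrapy crawl exampleSpider -o items.json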
runspider
This command also runs a spider, but it can execute a standalone spider file that lives outside any project.
Command format
$ scrapy runspider <spider_file.py>
Example
$ scrapy runspider exampleSpider.py
2021-07-25 01:54:24 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: example)
2021-07-25 01:54:24 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21), pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
2021-07-25 01:54:24 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
...
 'start_time': datetime.datetime(2021, 7, 24, 17, 54, 24, 908097)}
2021-07-25 01:54:31 [scrapy.core.engine] INFO: Spider closed (finished)
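To make the difference concrete, here is a minimal self-contained spider sketch that runspider can execute without a project (quotes.toscrape.com is a public demo site, used here purely for illustration):

# standalone.py, run with: scrapy runspider standalone.py
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # yield one dict per quote found on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }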
bench
Runs a quick benchmark: Scrapy spawns a simple spider that crawls a locally generated site as fast as possible, giving a rough baseline of how many pages per minute your machine can fetch.
I still haven't found a concrete use for the result; I hope a reader can leave a comment and answer Chen's doubts!
Example
$ scrapy bench
2021-07-25 01:58:17 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: example)
2021-07-25 01:58:17 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21), pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
...
2021-07-25 01:58:19 [scrapy.extensions.logstats] INFO: Crawled 90 pages (at 5400 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:20 [scrapy.extensions.logstats] INFO: Crawled 196 pages (at 6360 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:21 [scrapy.extensions.logstats] INFO: Crawled 285 pages (at 5340 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:22 [scrapy.extensions.logstats] INFO: Crawled 369 pages (at 5040 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:23 [scrapy.extensions.logstats] INFO: Crawled 433 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:24 [scrapy.extensions.logstats] INFO: Crawled 513 pages (at 4800 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:25 [scrapy.extensions.logstats] INFO: Crawled 593 pages (at 4800 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:26 [scrapy.extensions.logstats] INFO: Crawled 657 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:27 [scrapy.extensions.logstats] INFO: Crawled 721 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:28 [scrapy.extensions.logstats] INFO: Crawled 785 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
...
 'start_time': datetime.datetime(2021, 7, 24, 17, 58, 18, 691354)}
2021-07-25 01:58:29 [scrapy.core.engine] INFO: Spider closed (closespider_timeout)
check
Checks spider contracts, somewhat like a static code check: it verifies ahead of time whether the spiders you wrote behave as declared.
Command format
$ scrapy check [-l] <spider>
Example
$ scrapy check -l
first_spider
* parse
* parse_item
second_spider
* parse
* parse_item
$ scrapy check
[FAILED] first_spider:parse_item
>>> 'RetailPricex' field is missing

[FAILED] first_spider:parse
>>> Returned 92 requests, expected 0..4
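For context, contracts are declared in a callback's docstring. A minimal sketch (the URL and field names here are invented for illustration):

import scrapy


class FirstSpider(scrapy.Spider):
    name = 'first_spider'

    def parse_item(self, response):
        """Contracts describing what this callback should return.

        @url http://www.example.com/some-page
        @returns items 1 16
        @returns requests 0 0
        @scrapes Title Price
        """
        yield {
            'Title': response.css('h1::text').get(),
            'Price': response.css('.price::text').get(),
        }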
list
Lists all available crawlers for the current project
Command format
$ scrapy list
Example
$ scrapy list
hotList
edit
Edits a spider: an editor opens so you can modify the spider's code, which is handy for quick temporary changes.
Command format
$ scrapy edit <spider>
Example
$ scrapy edit hotList
'%s' is not recognized as an internal or external command, operable program or batch file.
There seems to be no default editor configured on this machine, hence the error.
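Presumably the fix is to configure an editor. Scrapy takes it from the EDITOR setting, or from the EDITOR environment variable if set; a sketch:

# in your project's settings.py
EDITOR = 'vim'

# or set it for the current Windows cmd session
set EDITOR=notepad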
fetch
Downloads a web page using the Scrapy downloader and prints the response body.
Command format
$ scrapy fetch <url>
# List of supported parameters
#   --spider=SPIDER: fetch the page with the specified spider; useful for verifying that a spider works
#   --headers: print the response headers instead of the body
#   --no-redirect: do not follow redirects
Example
$ scrapy fetch https://www.baidu.com
2021-07-25 02:16:16 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: csdnHot)
2021-07-25 02:16:16 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21), pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
...
2021-07-25 02:16:17 [scrapy.core.engine] INFO: Spider closed (finished)
<!DOCTYPE html>
<html><!--STATUS OK--><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8" /><meta http-equiv="Cache-control" content="no-cache" /><meta name="viewport" content="width=device-width, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no" /><style type="text/css">body {margin: 0; text-align: center; font-size: 14px; font-family: Arial,Helvetica,LiHei Pro Medium; color: #262626;} form {position: relative; margin: 12px 15px 91px; height: 41px;} img {border: 0} .wordWrap {margin-right: 85px;} #word {background-color: #FFF; border: 1px solid #6E6E6E; color: #000; font-size: 18px; height: 27px; padding: 6px; width: 100%; -webkit-appearance: none; -webkit-border-radius: 0; border-radius: 0;} .bn {background-color: #F5F5F5; border: 1px solid #787878; font-size: 16px; ...
# prints the HTML of the Baidu home page
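Two handy combinations (sketches; --nolog is a global Scrapy option that suppresses the log output):

$ scrapy fetch --headers https://www.baidu.com          # print request and response headers instead of the body
$ scrapy fetch --nolog https://www.baidu.com > baidu.html  # save the body without the log noise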
view
This command uses Scrapy to pull up the browser and open the page, which is handy for doing some analysis on the page.
The official documentation mentions that spiders and ordinary users sometimes see different pages, so this lets you confirm whether the page can actually be crawled.
Command format
$ scrapy view <url>
# List of supported parameters
#   --spider=SPIDER: fetch the page with the specified spider; useful for verifying that a spider works
#   --no-redirect: do not follow redirects
Example
$ scrapy view https://www.baidu.com
2021-07-25 02:20:37 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: csdnHot)
2021-07-25 02:20:37 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21), pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
2021-07-25 02:20:37 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
...
The page is downloaded to a local file and opened, and it really isn't the same Baidu we usually see!
shell
Another command you'll use constantly while developing spiders. Unlike fetch, it drops you into an interactive console where you can test the code that parses a page.
Command format
$ scrapy shell [url]
Example
$ scrapy shell --nolog -c '(response.status, response.url)' https://blog.csdn.net/phoenix/web/blog/hotRank?page=0&pageSize=25
(200, 'https://blog.csdn.net/phoenix/web/blog/hotRank?page=0')
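Note that the URL above contains &, so it's safer to quote it in the shell. A typical interactive session looks roughly like this (the site and selectors are illustrative):

$ scrapy shell 'https://quotes.toscrape.com' --nolog
>>> response.status
200
>>> response.css('title::text').get()
'Quotes to Scrape'
>>> fetch('https://quotes.toscrape.com/page/2/')   # fetch another page in the same session
>>> view(response)                                 # open the current response in a browser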
parse
Tries to crawl the given URL and parse it with the spider that handles it; often used to test whether the parsing code works.
Command format
$ scrapy parse <url> [options]
# List of supported parameters
--spider=SPIDER: bypass spider autodetection and force use of specific spider
-a NAME=VALUE: set spider argument (may be repeated)
--callback or -c: spider method to use as callback for parsing the response
--meta or -m: additional request meta that will be passed to the callback request. This must be a valid JSON string. Example: --meta='{"foo": "bar"}'
--cbkwargs: additional keyword arguments that will be passed to the callback. This must be a valid JSON string. Example: --cbkwargs='{"foo": "bar"}'
--pipelines: process items through pipelines
--rules or -r: use CrawlSpider rules to discover the callback (i.e. spider method) to use for parsing the response
--noitems: don't show scraped items
--nolinks: don't show extracted links
--nocolour: avoid using pygments to colorize the output
--depth or -d: depth level for which the requests should be followed recursively (default: 1)
--verbose or -v: display information for each depth level
--output or -o: dump scraped items to a file
Example
$ scrapy parse https://blog.csdn.net/rank/list --spider=hotList
...
2021-07-25 02:27:06 [scrapy.core.engine] INFO: Spider closed (finished)
>>> STATUS DEPTH LEVEL 0 <<<
# Scraped Items ------------------------------------------------------------
[]
# Requests -----------------------------------------------------------------
[]
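When items come back empty like this, it can help to force a specific callback and follow requests deeper; a sketch (parse_page is a hypothetical callback name):

$ scrapy parse --spider=hotList -c parse_page -d 2 -v https://blog.csdn.net/rank/list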
settings
Views the crawler configuration. Inside a project it prints the project's setting value; otherwise it prints Scrapy's default.
Command format
$ scrapy settings [options]
Example
$ scrapy settings --get BOT_NAME
csdnHot
References
- Scrapy 2.5 documentation