This article is part of Python Theme Month; see the event page for more details.

Chen is still a Scrapy beginner, so if anything here is wrong, I hope the experts will point it out in the comments. Thank you very much!

As of this writing, Scrapy’s latest version is 2.5.0

All right, cut the crap and let’s go!

Command Line Help

Almost any command-line tool ships with built-in help; it is practically an industry default, and Scrapy is no exception.

$ scrapy -h
Scrapy 2.5.0 - project: scrapybot

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  commands
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command

startproject

Creates a new Scrapy project and automatically generates the standard project structure.

The command format

$ scrapy startproject <project_name> [project_dir]
# project_dir is optional; if omitted, a folder with the same name as the project is created

Command example

$ scrapy startproject example
New Scrapy project 'example', using template directory 'd:\devtools\python\python39\lib\site-packages\scrapy\templates\project', created in:
    D:\WorkSpace\Personal\my-scrapy\example

You can start your first spider with:
    cd example
    scrapy genspider example example.com

$ ls
example  readme.md  venv

The generated project structure, as laid out by Scrapy 2.5's default template, looks roughly like this (the comments are my annotations):
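example/
├── scrapy.cfg            # deploy configuration file
└── example/              # the project's Python module
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # spider and downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/          # directory where your spiders live
        └── __init__.py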

genspider

Generates a new spider from a predefined template. This can greatly speed up spider development, provided you have a good set of templates.

The command format

$ scrapy genspider [-t template] <spider_name> <domain>
# the template name is optional; the default "basic" template is used if omitted

Command example

$ cd example/example/spiders/
$ scrapy genspider exampleSpider example.spider.com
Created spider 'exampleSpider' using template 'basic' in module:
  example.spiders.exampleSpider

The generated spider looks like the following.
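A sketch of what the basic template produces in Scrapy 2.5 (the exact class name is derived from the spider name and may differ slightly):

import scrapy


class ExamplespiderSpider(scrapy.Spider):
    name = 'exampleSpider'
    allowed_domains = ['example.spider.com']
    start_urls = ['http://example.spider.com/']

    def parse(self, response):
        # the basic template leaves the parsing logic to you
        pass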

crawl

Runs a spider. This command looks a bit like runspider, but it must be executed inside a standard Scrapy project structure and looks the spider up by name.

The command format

$ scrapy crawl <spider_name>
# spider_name is the name of the spider we created with genspider above

Command example

$ scrapy crawl exampleSpider
2021-07-25 01:50:59 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: example)
2021-07-25 01:50:59 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21), pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
2021-07-25 01:50:59 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-07-25 01:50:59 [scrapy.crawler] INFO: Overridden settings:
...
 'start_time': datetime.datetime(2021, 7, 24, 17, 51, 0, 206683)}
2021-07-25 01:51:02 [scrapy.core.engine] INFO: Spider closed (finished)
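A useful variation: the -o option dumps the scraped items to a file, which saves you writing a pipeline for quick jobs (items.json is just an example filename):

$ scrapy crawl exampleSpider -o items.json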

runspider

This command also runs a spider, but it can execute a standalone external spider file directly, with no project required.

The command format

$ scrapy runspider <spider_file.py>

Command example

$ scrapy runspider exampleSpider.py
2021-07-25 01:54:24 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: example)
2021-07-25 01:54:24 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21), pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
2021-07-25 01:54:24 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
...
 'start_time': datetime.datetime(2021, 7, 24, 17, 54, 24, 908097)}
2021-07-25 01:54:31 [scrapy.core.engine] INFO: Spider closed (finished)
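For runspider to work, the file only has to contain a complete Spider subclass; no project scaffolding is needed. A minimal self-contained sketch (the quotes.toscrape.com URL is only an illustration):

import scrapy


class StandaloneSpider(scrapy.Spider):
    # a spider in a single file, runnable via `scrapy runspider thisfile.py`
    name = 'standalone'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # yield the page title as a plain dict item
        yield {'title': response.css('title::text').get()}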

bench

Benchmarking: Scrapy spins up a local HTTP server and crawls it as fast as possible with a simple built-in spider, to measure the maximum raw throughput your hardware and setup can reach.

In other words, the pages/min figures in the log below are the result: a rough upper bound on how fast Scrapy can crawl on this machine.

Command example

$ scrapy bench
2021-07-25 01:58:17 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: example)
2021-07-25 01:58:17 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21), pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
...
2021-07-25 01:58:19 [scrapy.extensions.logstats] INFO: Crawled 90 pages (at 5400 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:20 [scrapy.extensions.logstats] INFO: Crawled 196 pages (at 6360 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:21 [scrapy.extensions.logstats] INFO: Crawled 285 pages (at 5340 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:22 [scrapy.extensions.logstats] INFO: Crawled 369 pages (at 5040 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:23 [scrapy.extensions.logstats] INFO: Crawled 433 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:24 [scrapy.extensions.logstats] INFO: Crawled 513 pages (at 4800 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:25 [scrapy.extensions.logstats] INFO: Crawled 593 pages (at 4800 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:26 [scrapy.extensions.logstats] INFO: Crawled 657 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:27 [scrapy.extensions.logstats] INFO: Crawled 721 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:28 [scrapy.extensions.logstats] INFO: Crawled 785 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
...
 'start_time': datetime.datetime(2021, 7, 24, 17, 58, 18, 691354)}
2021-07-25 01:58:29 [scrapy.core.engine] INFO: Spider closed (closespider_timeout)

check

Checks spider contracts, a bit like a static check for your spiders: it lets you verify in advance that callbacks return what they claim, before running a full crawl.

The command format

$ scrapy check [-l] <spider>

Command example

$ scrapy check -l
first_spider
  * parse
  * parse_item
second_spider
  * parse
  * parse_item

$ scrapy check
[FAILED] first_spider:parse_item
>>> 'RetailPricex' field is missing

[FAILED] first_spider:parse
>>> Returned 92 requests, expected 0..4
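Contracts live in the docstring of the callback they describe, and `scrapy check` executes them. A minimal sketch (the URL and the expected counts are illustrative):

import scrapy


class ContractDemoSpider(scrapy.Spider):
    name = 'contractDemo'

    def parse(self, response):
        """The lines below are the contract that `scrapy check` verifies.

        @url http://quotes.toscrape.com/
        @returns items 1 16
        @returns requests 0 0
        @scrapes title
        """
        for quote in response.css('div.quote'):
            # each yielded item must contain the fields named by @scrapes
            yield {'title': quote.css('span.text::text').get()}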

list

Lists all available crawlers for the current project

The command format

$ scrapy list

Command example

$ scrapy list
hotList

edit

Edits a spider: opens the spider's code in an editor, which is handy for quick temporary changes. The editor is taken from the EDITOR environment variable (or the EDITOR setting).

The command format

$ scrapy edit <spider>

Command example

$ scrapy edit hotList
'%s' is not recognized as an internal or external command, operable program or batch file.
# No default editor is configured on this machine, hence the error

fetch

Downloads a web page with the Scrapy downloader and writes its content to standard output.

The command format

$ scrapy fetch <url>
# Supported options
# --spider=SPIDER: fetch the page with the specified spider, useful for verifying how that spider sees it
# --headers: print the response headers instead of the response body
# --no-redirect: do not follow HTTP 3xx redirects

Command example

$ scrapy fetch https://www.baidu.com
2021-07-25 02:16:16 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: csdnHot)
2021-07-25 02:16:16 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21), pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
...
2021-07-25 02:16:17 [scrapy.core.engine] INFO: Spider closed (finished)
<!DOCTYPE html>
<html><!--STATUS OK--><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><meta http-equiv="Cache-control" content="no-cache" /><meta name="viewport" content="width=device-width,minimum-scale=1.0,maximum-scale=1.0,user-scalable=no" /> ...
# ...the full HTML of the Baidu home page is printed
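If you only care about the headers, --headers skips the body dump:

$ scrapy fetch --headers https://www.baidu.com
# prints the request and response headers instead of the page body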

view

Downloads the page with Scrapy and then opens it in the browser, so you can analyze the page exactly as Scrapy sees it.

The official documentation notes that spiders sometimes see a different page than ordinary users do, so this is a good way to confirm whether the page can actually be crawled.

The command format

$ scrapy view <url>
# Supported options
# --spider=SPIDER: fetch the page with the specified spider, useful for verifying how that spider sees it
# --no-redirect: do not follow HTTP 3xx redirects

Command example

$ scrapy view https://www.baidu.com
2021-07-25 02:20:37 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: csdnHot)
2021-07-25 02:20:37 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21), pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
2021-07-25 02:20:37 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
...

It downloads the page to a local file and opens it in the browser, and what we see really isn't the same Baidu we see day to day!

shell

Another command frequently used during spider development. The difference from fetch is that it gives you an interactive console in which you can test your parsing code against the downloaded page.

The command format

$ scrapy shell [url]

Command example

$ scrapy shell --nolog -c '(response.status, response.url)' https://blog.csdn.net/phoenix/web/blog/hotRank?page=0&pageSize=25
(200, 'https://blog.csdn.net/phoenix/web/blog/hotRank?page=0')
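Run without -c, the command drops you into an interactive console with the response already fetched. A sketch of a typical session (the site and selectors are illustrative):

$ scrapy shell 'http://quotes.toscrape.com/'
...
>>> response.status
200
>>> response.css('title::text').get()
'Quotes to Scrape'
>>> fetch('http://quotes.toscrape.com/page/2/')  # fetch a new page in the same session
>>> view(response)  # open the current response in the browser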

parse

Fetches the given URL and parses it with the spider that handles it; often used to test whether the parsing code works.

The command format

$ scrapy parse <url> [options]
# List of supported parameters
--spider=SPIDER: bypass spider autodetection and force use of specific spider
--a NAME=VALUE: set spider argument (may be repeated)
--callback or -c: spider method to use as callback for parsing the response
--meta or -m: additional request meta that will be passed to the callback request. This must be a valid JSON string. Example: --meta='{"foo": "bar"}'
--cbkwargs: additional keyword arguments that will be passed to the callback. This must be a valid JSON string. Example: --cbkwargs='{"foo": "bar"}'
--pipelines: process items through pipelines
--rules or -r: use CrawlSpider rules to discover the callback (i.e. spider method) to use for parsing the response
--noitems: don't show scraped items
--nolinks: don't show extracted links
--nocolour: avoid using pygments to colorize the output
--depth or -d: depth level for which the requests should be followed recursively (default: 1)
--verbose or -v: display information for each depth level
--output or -o: dump scraped items to a file

Command example

$ scrapy parse https://blog.csdn.net/rank/list --spider=hotList
...
2021-07-25 02:27:06 [scrapy.core.engine] INFO: Spider closed (finished)

>>> STATUS DEPTH LEVEL 0 <<<
# Scraped Items ------------------------------------------------------------
[]

# Requests -----------------------------------------------------------------
[]
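To follow extracted requests one level deeper and see the results per level, -d and -v can be combined (a hypothetical invocation; the right callback depends on your spider):

$ scrapy parse --spider=hotList -c parse -d 2 -v https://blog.csdn.net/rank/list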

settings

Gets the value of Scrapy settings. Inside a project it shows the project's settings; otherwise it shows Scrapy's defaults.

The command format

$ scrapy settings [options]

Command example

$ scrapy settings --get BOT_NAME
csdnHot
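Any setting can be queried the same way; outside a project you get Scrapy's built-in defaults, e.g. DOWNLOAD_DELAY defaults to 0:

$ scrapy settings --get DOWNLOAD_DELAY
0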

References

  • Scrapy 2.5 documentation