Directory

I. Preface
II. Principle
  2.1 The crawl process
  2.2 Interpretation of each part
  2.3 Scrapy data flow analysis
III. Understanding
IV. Practice: download PyCharm and build a Scrapy project
V. Feedback on building the Scrapy project
  5.1 How to find bugs faster
  5.2 Errors caused by coexisting interpreters
VI. Knowledge supplement
  6.1 Spider classes and methods

I. Preface

Scrapy is a fast, high-level screen scraping and Web scraping framework for Python development that crawls Web sites and extracts structured data from pages. Scrapy is versatile and can be used for data mining, monitoring, and automated testing.

Crawler enthusiasts believe that scrapy's advantage is its high degree of customization, which makes it well suited to learning and researching crawler technology, though there is a lot of related knowledge to pick up, so completing a crawler takes a while. Others say that scrapy does not run on Python 3 and is not as widely applicable as expected.

Generally speaking, a web crawler is a program that crawls data across the Internet, either broadly or toward specific targets. Described more precisely, it fetches the HTML of particular web pages. The usual approach is to define an entry page; since a page generally contains the URLs of other pages, these URLs are extracted from the current page and added to the crawler's queue, and the crawler then enters each new page and repeats the operation recursively.

II. Principle

Scrapy uses Twisted’s asynchronous network library to handle network communication. The architecture is clean and includes various middleware interfaces that can be flexibly implemented.

To get a first impression of scrapy's structure and how it works, we start with the following diagram:



2.1 Crawl Process

The green lines in the figure above are the data flow. First the Scheduler hands the initial URL to the Downloader for download, and the result is then passed to the Spider for analysis. The Spider produces two kinds of results: links that need to be fetched further, such as the "next page" link mentioned earlier, which are passed back to the Scheduler; and data that needs to be saved, which is sent to the Item Pipeline for post-processing (detailed analysis, filtering, storage, and so on). In addition, various middleware can be installed in the data flow channels to do whatever processing is necessary.

2.2 Interpretation of each part

  • Scrapy Engine: handles the flow of data through the whole system and triggers transactions.
  • Scheduler: receives requests from the engine, queues them, and returns them when the engine asks for them again.
  • Downloader: downloads web content and returns it to the spider.
  • Spiders: the spiders do the main work; they define the parsing rules for a specific domain name or group of pages. You write classes that analyze the responses and extract items (the data being scraped) or additional URLs to follow. Each spider is responsible for a specific site (or set of sites).
  • Item Pipeline: handles the items that spiders extract from web pages. Its main tasks are to clean, validate and store the data. When a page has been parsed by a spider, the items are sent to the pipeline and processed by several components in a specific order.
  • Downloader Middlewares: a hook framework between the Scrapy engine and the Downloader that handles the requests and responses passing between them.
  • Spider Middlewares: a hook framework between the Scrapy engine and a Spider that handles the Spider's response input and request output.
  • Scheduler Middlewares: middleware between the Scrapy engine and the Scheduler that handles the requests and responses passing between them.

2.3 Scrapy data stream analysis

The steps are as follows:

STEP 1: The engine opens a website, finds the Spider that handles that site, and asks it for the first URL(s) to crawl.
STEP 2: The engine fetches the first URL from the Spider and schedules it as a Request with the Scheduler.
STEP 3: The engine asks the Scheduler for the next URL to crawl.
STEP 4: The Scheduler returns the next URL to be crawled to the engine, which forwards it to the Downloader through the downloader middleware (request direction).
STEP 5: Once the page is downloaded, the Downloader generates a Response for the page and sends it back to the engine through the downloader middleware (response direction).
STEP 6: The engine receives the Response from the Downloader and passes it to the Spider through the spider middleware (input direction) for processing.
STEP 7: The Spider handles the Response and returns the extracted Items and any new Requests to the engine.
STEP 8: The engine sends the extracted Items to the Item Pipeline and the new Requests to the Scheduler.
STEP 9: The process repeats from step 2 until there are no more Requests in the Scheduler, at which point the engine closes the site.

Reference source: "Learning Scrapy", author: JasonDing, link: www.jianshu.com/p/a8aad3bf4…
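To make the data flow described above concrete, here is a minimal spider sketch (it is not part of the referenced article; the site and the CSS selectors are illustrative placeholders). parse() receives the Response from the Downloader, yields extracted data that travels on to the Item Pipeline, and yields new Requests that go back to the Scheduler:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["http://quotes.toscrape.com/"]  # illustrative entry page

        def parse(self, response):
            # Extracted data is returned to the engine and sent to the Item Pipeline
            for quote in response.css("div.quote"):
                yield {"text": quote.css("span.text::text").extract_first()}

            # A "next page" link becomes a new Request and goes back to the Scheduler
            next_page = response.css("li.next a::attr(href)").extract_first()
            if next_page is not None:
                yield scrapy.Request(response.urljoin(next_page), callback=self.parse)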

III. Understanding

When most popular-science articles mention scrapy, they introduce it as a crawler framework. What a framework does is encapsulate repetitive work.

For example, suppose you use Linux to process a set of data in four steps, each of which requires retyping the command line and creating a new directory, with a wait in between. Processing one data set takes a long time, you may forget which files are stored in which directory, and worst of all, when someone else in the group needs to do the same thing with other data, they have to go through it all over again, which wastes a lot of unnecessary time.

Then someone came up with the idea of writing the commands into an sh file and executing it directly, saving the waiting time between the four steps. The running time of the program stays the same, but the scattered steps become a single process, which not only improves efficiency but also saves you from sitting in front of the computer typing commands for a long time.

Then, building on that sh file, someone thought of writing a framework that leaves the parts that differ between users, such as the data source and the path, as inputs, so that anyone who wants to process similar data in the same way can use the framework and avoid spending a lot of time writing sh files over and over. From ideas like these, frameworks slowly took shape.

A framework, in layman's terms, is the extraction of the common features of similar processes.

IV. Practice

4.1 Install scrapy first

pip install scrapy


If "Uninstalling six-1.4.1" appears in the output, change the command to:

sudo pip install Scrapy --upgrade --ignore-installed six

pip is a package management tool for Python; if it was not installed along with Python, it can be installed manually. pip runs on the Unix/Linux, OS X, and Windows platforms.

Verify that scrapy is installed successfully

scrapy version


A successful installation should look like this:

  • A common problem on the Mac is the error "dynamic module does not define init function"

ImportError: dynamic module does not define init function (init_openssl)


Solutions:

The most likely reason for this problem is that Python is a 32-bit build while the operating system is 64-bit. How do you check the Python build and the operating system's bit width on your machine?

uname -a


You can get information about your computer’s operating system

import platform
platform.architecture()


You can see the bit width of the current Python build, as in the following example:
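On a 64-bit Python build the output typically looks like the following (the second element of the tuple may differ by platform):

    >>> import platform
    >>> platform.architecture()
    ('64bit', '')

If the first element reads '32bit' while uname -a reports a 64-bit system, you have hit the mismatch described above.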

Scrapy Pits on OSX (www.cnblogs.com/Ray-liang/p…)

4.2 Create the project, download PyCharm and configure it

Here we use the classic "crawl the Douban 9-point list" example. Douban list link: www.douban.com/doulist/126…

4.2.1 Project establishment

First enter the command in the terminal:

scrapy startproject book


Successful establishment will result in:

New Scrapy project 'book', using template directory '/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/templates/project', created in:
    /Users/wuxinyao/Desktop/book

You can start your first spider with:
    cd book
    scrapy genspider example example.com

Go back to the directory where you ran the command and you will see that a directory named book has been generated. After entering it, use the command line to create the main crawler Python file, named douban in this example. The command is:

scrapy genspider douban https://www.douban.com/doulist/1264675/


The url above is the url that the crawler is targeting. If it succeeds, it will display the following code:

Created spider 'douban' using template 'basic' in module:
  book.spiders.douban

4.2.2 Proceed with PyCharm

Pycharm download url: www.jetbrains.com/pycharm/dow…

  • Create main file

You must create main.py in the book project's root directory, i.e. at the same level as the auto-generated scrapy.cfg. Put the following in main.py:

from scrapy import cmdline
cmdline.execute("scrapy crawl douban".split())
  • Modify douban.py

In the spiders directory, find douban.py and comment out the allowed_domains line (# allowed_domains = ['www.douban.com/doulist/126…']), then change the parse method to:

def parse(self, response):
    print response.body
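For reference, after these edits douban.py might look roughly like this (a sketch reconstructed from the steps above, not the original author's exact file; the URL comes from the genspider command used earlier):

    # -*- coding: utf-8 -*-
    import scrapy

    class DoubanSpider(scrapy.Spider):
        name = "douban"
        # allowed_domains = ["www.douban.com/doulist/1264675/"]  # commented out as described above
        start_urls = ["https://www.douban.com/doulist/1264675/"]

        def parse(self, response):
            # For now, simply dump the raw page source (Python 2 print, as in this article)
            print response.body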

  • Add a User-Agent

In settings.py add:

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0'

4.2.3 Common Problems: Downloaded scrapy packages cannot be imported

Unresolved reference 'scrapy'

Because of PyCharm's permissions, scrapy may fail to download directly from the IDE, so this error keeps appearing. There are several ways to deal with the problem:

  • Try downloading scrapy from the IDE again

File -> Default Settings -> Project Interpreter

Select a version of the interpreter, click the plus sign in the lower left corner, search for scrapy in the new pop-up screen, and click Install to download it.

Macs come with Python pre-installed, but many people download an additional Python for various reasons (for example, to switch from a 32-bit to a 64-bit build) and store it in a different directory. The Mac's built-in Python is also prone to permission issues when installing new packages, so Mac users often have multiple Python interpreters.

From this screen you can select the interpreter you want to use:


This does not necessarily solve the problem; the download can fail for a variety of reasons, such as permissions, or because the version of some package that scrapy requires is not new enough. It is equivalent to running pip install scrapy in a terminal.

Even if the download succeeds, the program may still not run. But this is the simplest solution, so try it first.

  • Re-edit the path

Run -> Edit Configuration


In Script, enter the absolute path of main.py, and in Python Interpreter select the interpreter version you want to use.

If scrapy works in the terminal, you can use which scrapy to find where scrapy is located and select a Python version with a similar path, or use which python to find the absolute path of a working Python and select that interpreter.

Successful execution results:




The first few lines look like this; it is essentially a dump of the site's source code. If you view the page source in a browser, you will see the same content.

4.3 Extracting title name and author name

First look at the source code for the site:

The information we want can be found in the corresponding block of the page source.

Following the original author's approach, the extraction proceeds as follows:

  • Extract the outer frame that wraps each book entry.
  • Extract the title.
  • Extract the author.

    This code goes into the parse(self, response) function in douban.py. Comment out the previous print response.body and add the extraction code directly; a sketch of what it might look like follows.
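    A sketch of the extraction, following the original author's approach (the XPath expressions, the class name bd doulist-subject and the regex are assumptions about the Douban page structure at the time and may need adjusting):

        # douban.py -- a sketch of the extraction step
        import re
        import scrapy

        class DoubanSpider(scrapy.Spider):
            name = "douban"
            start_urls = ["https://www.douban.com/doulist/1264675/"]

            def parse(self, response):
                # 1. The outer frame: one block per book entry
                for each in response.xpath('//div[@class="bd doulist-subject"]'):
                    # 2. The title
                    title = each.xpath('div[@class="title"]/a/text()').extract_first()
                    # 3. The author, taken from the abstract block with a regex
                    match = re.search('<div class="abstract">(.*?)<br', each.extract(), re.S)
                    if title:
                        print title.strip()
                    if match:
                        print match.group(1).strip()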

    4.4 Run Scrapy

    In fact, by this point a complete small program has taken shape, and the output should be:


    The program may well not produce this result the first time, and various errors may appear. However, while debugging you will see output that helps you understand scrapy's structure:

    2017-07-20 xx:50:53 [scrapy.middleware] INFO: Enabled extensions: ...
    2017-07-20 20:50:53 [scrapy.middleware] INFO: Enabled downloader middlewares: ...
    2017-07-20 20:50:53 [scrapy.middleware] INFO: Enabled spider middlewares: ...
    2017-07-20 20:50:53 [scrapy.middleware] INFO: Enabled item pipelines: ...
    2017-07-20 20:50:53 [scrapy.core.engine] INFO: Scrapy ...
    2017-07-20 20:50:54 [scrapy.core.engine] INFO: Spider closed (finished) ...

    Possible problems:

    If you do not get the above output, look closely at the output to see whether it contains:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)
    

    Python 2's str type defaults to ASCII encoding, which conflicts with Unicode encoding, so this error is reported.

    Just add this to main.py:

    import sys
    reload(sys)
    sys.setdefaultencoding('utf8')

    That will solve the problem.

    4.5 Deepening the small project

    So far the program only prints the raw page. Scrapy's Item and Pipeline are designed to extract and store the data properly; the following subsections set them up step by step.

    4.5.1 Description of the subdirectories generated in the first step:

    Enter the command:

    scrapy startproject myproject
             ls myproject
    
    • scrapy.cfg: the project configuration file
    • myproject/: the project's Python module, where you will add code later
    • myproject/items.py: the project's items file
    • myproject/pipelines.py: the project's pipelines file
    • myproject/settings.py: the project's settings file
    • myproject/spiders/: the directory where the spiders are placed

    4.5.2 Item is like a dictionary in Python

    As you can see from the previous section, Item is the container that holds the data, and we need to model Item to get the data fetched from the site.

    After the Item is defined in items.py, it needs to be referenced in the main crawler file (douban.py).

    In your own crawler file, you first import the Item class (the project here is called myproject):

    from myproject.items import MyprojectItem
    
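    As a concrete illustration, a minimal items.py might look like this (a sketch; the field names are assumptions, not the original author's code):

        # myproject/items.py
        import scrapy

        class MyprojectItem(scrapy.Item):
            title = scrapy.Field()   # book title
            author = scrapy.Field()  # book author

    In the spider's parse() method you then fill and yield the item (item = MyprojectItem(), item['title'] = ..., yield item); the yielded items are what the pipeline and the feed export described below receive.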

    4.5.3 Feed output settings in settings.py

    Once the Item holds the captured data, if you want to export it as a table (for example, a CSV file that Excel can open), you need to add the following two settings to settings.py:

    FEED_FORMAT: the output format, which can be CSV, XML or JSON.
    FEED_URI: the output location, which can be a local file or an FTP server.

    Such as:

    FEED_URI = u'file:///G://dou.csv'
    FEED_FORMAT = 'CSV'

    In this case, the output file is stored on drive G and is called dou.csv, a CSV file.
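    As an alternative (or a complement) to the feed export, items can be handled by an item pipeline in pipelines.py. A minimal sketch, assuming the project layout above, simply passes items through and could be extended to clean, filter or store them; it is enabled with the ITEM_PIPELINES setting:

        # myproject/pipelines.py -- a sketch, not the original author's code
        class MyprojectPipeline(object):
            def process_item(self, item, spider):
                # clean / validate / store the item here
                return item

        # settings.py: enable the pipeline (the number sets its order, 0-1000)
        # ITEM_PIPELINES = {'myproject.pipelines.MyprojectPipeline': 300}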


    V. Feedback on building the Scrapy project

    5.1 How to find bugs faster

    For beginners, the difficulties lie mostly in configuring PyCharm and getting comfortable with regular expressions; it is better to first understand how information and data flow through scrapy. That way, even when the program hits a bug (which is extremely common), you can quickly figure out which file the problem is in. Scrapy's console output is not a great debugging tool: it does not display the key error directly, so it is best to keep the separate responsibilities of each part in mind and search for bugs accordingly.

    5.2 Possible errors caused by coexistence of interpreters

    As mentioned when introducing the setup above, there may be several Python interpreters coexisting on your computer. If errors are reported while importing packages, change the Python interpreter in Run/Edit Configuration. If the console keeps reporting errors and cannot connect to the console, rebuild the project and select an interpreter from the external libraries under the usr/bin directory.

    VI. Knowledge supplement

    6.1 Classes and methods in spiders

    • name: the spider's name, a mandatory and unique string; it is what you enter when running the spider.
    • allowed_domains: the domain names the spider is allowed to crawl, as a string or a list. Optional.
    • start_urls: the URLs to visit first.
    • start_requests(): by default takes the URLs from start_urls and generates a Request for each one, with parse as the default callback; this is where scheduling starts. It can be overridden so that crawling starts from a specified URL instead, which is usually done to simulate a login or to obtain a dynamic token first. To obtain such a token you can do something like this:
    from scrapy.http import Request, FormRequest

    def start_requests(self):
        # url is the address of the login page; the cookiejar meta keeps the session cookie
        return [Request(url, meta={'cookiejar': 1}, callback=self.login)]

    The url here is the login URL you are logging in to. When you access it, the server returns a response containing the token that has to be submitted with the next login request, so the login callback extracts that token from the returned response with a regular expression, XPath, or similar.

    start_requests hands the response downloaded by the Downloader to its callback, which is the login method defined here. In the login method, besides parsing out the dynamic token, you can also perform the simulated login itself.

    def login(self, response):
        # Parse the dynamic token out of the login page, e.g. with XPath
        token = response.xpath('//h1/text()').extract()
        headers = {}   # request headers expected by the server
        postdata = {}  # form data to post, including the token extracted above
        # Post the login form and keep the session cookie recorded above
        return [FormRequest(url, headers=headers, formdata=postdata,
                            meta={'cookiejar': response.meta['cookiejar']},
                            callback=self.loged, dont_filter=True)]

    The cookie used here is the one in the returned response, that is, the cookie recorded in start_requests above. The url is the address the data is actually posted to, which can usually be found with Firebug. At this point you can already obtain cookies, so after logging in you have them. Once this method has been called and the login has succeeded, you can access other pages in the loged method using the make_requests_from_url method described below. If a verification code has to be entered here, you can download the image and type it in manually; that is covered in another note.

    If start_requests is overridden, the URLs in start_urls are not accessed first; they have to be requested explicitly later when you want to visit them, as explained below.

    Also note that start_requests is called automatically only once.

    make_requests_from_url(url):
    

    When you give this method a URL, the result is automatically returned to parse. start_requests and make_requests_from_url are the only methods that scrapy invokes automatically. This is important because it is how they combine with the rules in the CrawlSpider, described later.

    To access the URLs in start_urls in that case, use make_requests_from_url():

    def loged(self, response):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    parse():

    By default, scrapy passes responses to parse(), which is where a Spider class extracts web content. In a CrawlSpider, however, parse() is used internally to further process the links obtained from the rules, so overriding the parse method in a CrawlSpider is not recommended.

    Rule():

    Rule specifies how to guide the Downloader to follow links; a brief sketch is given below.
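    As an illustration (a minimal sketch that is not from the referenced article; the start URL and the allow pattern are placeholders), a CrawlSpider combines Rule and LinkExtractor like this:

        from scrapy.spiders import CrawlSpider, Rule
        from scrapy.linkextractors import LinkExtractor

        class MyCrawlSpider(CrawlSpider):
            name = "mycrawl"
            start_urls = ["http://example.com/"]  # placeholder entry page

            # Follow links whose URLs match the pattern and hand each downloaded
            # page to parse_item; parse itself is left alone, as advised above.
            rules = (
                Rule(LinkExtractor(allow=r"/page/\d+"), callback="parse_item", follow=True),
            )

            def parse_item(self, response):
                yield {"url": response.url}  # extraction logic goes here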

    The import used is: from scrapy.linkextractors import LinkExtractor as LKE

    Reference author: ChrisPop, link: http://www.jianshu.com/p/a1018729d695

    Conclusion: choosing a new approach is always difficult at first, and adapting to new software always comes with bugs. Once you get through the initial phase, things open up and become much more comfortable. We wish you more and more fun writing crawler programs with scrapy!

    This article comes from the cloud community partner "Datapai THU". For related information, follow the "Datapai THU" WeChat public account.