Want to crawl website data? You have to get onto the website first! For most large websites, the first hurdle to accessing their data is logging in. Follow along with me and learn how to simulate logging in to a website.

Why simulate login?

There are two types of websites on the Internet: those that require login and those that do not. (This is nonsense!)

So, for websites that do not require login, we can fetch the data directly, simple and easy. But for sites where the data is only visible after login, or only partially visible without it, we have to log in obediently. (Unless you hack straight into someone else's database, but please don't resort to hacking!)

Therefore, for a website that requires login, we need to simulate the login: on the one hand to obtain the information and data on pages behind the login, and on the other hand to get the post-login cookie so that we can use it in subsequent requests.

The idea behind simulated login

When it comes to simulated login, everyone's first reaction is: Ugh! Isn't that easy? Open the browser, enter the URL, find the username and password boxes, type in the username and password, and click Login!

There is nothing wrong with this approach, which is how our Selenium simulated login works.

In addition, with Requests we can send requests that carry cookies from a session that is already logged in, bypassing the login step.

We can also use Requests to send a POST request carrying the information required to log in to the site.

These are the three common ways to simulate logging in to a website, and Scrapy also uses the latter two methods, since the first is unique to Selenium.
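As a quick illustration, here is a minimal sketch of the two Requests-based approaches; the URLs, cookie values and form fields below are placeholders, not taken from any real site:

import requests

# Approach 1: reuse cookies copied from a browser session that is already logged in
cookies = {"sessionid": "xxx"}  # placeholder cookie copied from the browser's debug tools
resp = requests.get("https://example.com/profile", cookies=cookies)

# Approach 2: POST the login credentials, then let the session keep the returned cookies
login_data = {"username": "xxx", "password": "xxx"}  # placeholder credentials
session = requests.Session()
session.post("https://example.com/login", data=login_data)
resp = session.get("https://example.com/profile")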

Scrapy's two approaches to simulated login:

1. Request the page directly, carrying cookies from a session that is already logged in
2. Attach the information required for website login to a POST request for login

Simulated login examples

Simulated login with cookies

Each login method has its advantages, disadvantages and usage scenarios. Let’s take a look at the application scenarios of login with cookies:

1. The cookie has a very long expiration time, which is common on some less rigorous websites, so we can log in once and not worry about it expiring.
2. We can get all the data we need before the cookie expires.
3. It can be used together with other programs, for example using Selenium to log in, saving the cookies it obtains locally, and reading the local cookies before Scrapy sends its requests (a minimal sketch follows below).
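To make scenario 3 concrete, here is a minimal sketch, assuming Chrome, a hypothetical cookies.json file and a manual login in the opened browser window; none of this is taken from the original article's code:

import json

import scrapy
from selenium import webdriver


def save_cookies(path="cookies.json"):
    # Run once, separately: open the site, log in by hand, then dump the cookies to disk
    driver = webdriver.Chrome()
    driver.get("http://www.renren.com/")
    input("Log in in the browser window, then press Enter...")
    with open(path, "w") as f:
        json.dump(driver.get_cookies(), f)
    driver.quit()


class CookieFileSpider(scrapy.Spider):
    # Reads the locally saved cookies before Scrapy sends the first request
    name = "cookie_file_demo"
    start_urls = ["http://www.renren.com/972990680/profile"]

    def start_requests(self):
        with open("cookies.json") as f:
            cookies = {c["name"]: c["value"] for c in json.load(f)}
        yield scrapy.Request(self.start_urls[0], cookies=cookies, callback=self.parse)

    def parse(self, response):
        self.logger.info("page length: %d", len(response.body))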

Let’s take a look at this simulated login through the long-forgotten renren site.

Let’s start by creating a Scrapy project:

> scrapy startproject login

In order to crawl successfully, first set the robots protocol to False in settings.py:

ROBOTSTXT_OBEY = False

Next, we create a crawler:

> scrapy genspider renren renren.com

We open renren.py under the spiders directory; the generated code is as follows:

# -*- coding: utf-8 -*-
import scrapy


class RenrenSpider(scrapy.Spider):
    name = 'renren'
    allowed_domains = ['renren.com']
    start_urls = ['http://renren.com/']

    def parse(self, response):
        pass


As we know, start_urls holds the first URLs we crawl, the initial URLs from which we collect data. Say I need to crawl renren's personal center page, http://www.renren.com/972990680/profile. If I put that URL directly into start_urls and request it as-is, think about it: would that succeed?

No, right! Because we haven’t logged in yet, we can’t see the personal center page.

So where do we add our login code?

What we do know is that we must be logged in before the framework requests a web page in start_urls.

Let’s go to the source of the Spider class and find this code:

def start_requests(self):
    cls = self.__class__
    if method_is_overridden(cls, Spider, 'make_requests_from_url'):
        warnings.warn(
            "Spider.make_requests_from_url method is deprecated; it "
            "won't be called in future Scrapy releases. Please "
            "override Spider.start_requests method instead (see %s.%s)." % (
                cls.__module__, cls.__name__
            ),
        )
        for url in self.start_urls:
            yield self.make_requests_from_url(url)
    else:
        for url in self.start_urls:
            yield Request(url, dont_filter=True)

def make_requests_from_url(self, url):
    """ This method is deprecated. """
    return Request(url, dont_filter=True)


As you can see from the source code, this method takes each URL from start_urls and constructs a Request object to request it. So we can override the start_requests method to do a little extra work, namely adding cookies while constructing the Request object.

The rewritten start_requests method looks like this:

# -*- coding: utf-8 -*-
import scrapy
import re

class RenrenSpider(scrapy.Spider):
    name = 'renren'
    allowed_domains = ['renren.com']
    # Personal center page URL
    start_urls = ['http://www.renren.com/972990680/profile']

    def start_requests(self):
        # Cookies obtained with Chrome's debug tools from a request made after logging in
        cookiesstr = "anonymid=k3miegqc-hho317; depovince=ZGQT; _r01_=1; JSESSIONID=abcDdtGp7yEtG91r_U-6w; ick_login=d2631ff6-7b2d-4638-a2f5-c3a3f46b1595; ick=5499cd3f-c7a3-44ac-9146-60ac04440cb7; t=d1b681e8b5568a8f6140890d4f05c30f0; societyguester=d1b681e8b5568a8f6140890d4f05c30f0; id=972990680; xnsid=404266eb; XNESSESSIONID=62de8f52d318; jebecookies=4205498d-d0f7-4757-acd3-416f7aa0ae98|||||; Ver = 7.0; loginfrom=null; jebe_key=8800dc4d-e013-472b-a6aa-552ebfc11486%7Cb1a400326a5d6b2877f8c884e4fe9832%7C1575175011619%7C1%7C1575175011639; jebe_key=8800dc4d-e013-472b-a6aa-552ebfc11486%7Cb1a400326a5d6b2877f8c884e4fe9832%7C1575175011619%7C1%7C1575175011641; wp_fold=0"
        cookies = {i.split("=")[0]: i.split("=")[1] for i in cookiesstr.split("; ")}

        # Request with cookies attached
        yield scrapy.Request(
            self.start_urls[0],
            callback=self.parse,
            cookies=cookies
        )

    def parse(self, response):
        # Search the personal center page for the keyword "闲欢" (the author's name) and print the matches
        print(re.findall("闲欢", response.body.decode()))

I first logged in to renren with a valid account. After logging in, I used Chrome's debug tools to copy the Cookie from a request and added it to the Request object. Then, in the parse method, I search the page for the keyword "闲欢" and print the matches.

Let’s run this crawler:

>scrapy crawl renren

In the run log we can see the following lines:

2019-12-01 13:06:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.renren.com/972990680/profile?v=info_timeline> (referer: http://www.renren.com/972990680/profile)
['闲欢', '闲欢', '闲欢', '闲欢', '闲欢', '闲欢', '闲欢']
2019-12-01 13:06:55 [scrapy.core.engine] INFO: Closing spider (finished)

We can see that the information we need has been printed.

We can add COOKIES_DEBUG = True to the Settings configuration to see how cookies are being passed.
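For clarity, the relevant lines in settings.py would then look like this:

ROBOTSTXT_OBEY = False
COOKIES_DEBUG = True  # log the cookies sent with and received from each request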

After adding this configuration, we can see the following information in the log:

2019-12-01 13:06:55 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET http://www.renren.com/972990680/profile?v=info_timeline>
Cookie: anonymid=k3miegqc-hho317; depovince=ZGQT; _r01_=1; JSESSIONID=abcDdtGp7yEtG91r_U-6w; ick_login=d2631ff6-7b2d-4638-a2f5-c3a3f46b1595; ick=5499cd3f-c7a3-44ac-9146-60ac04440cb7; t=d1b681e8b5568a8f6140890d4f05c30f0; societyguester=d1b681e8b5568a8f6140890d4f05c30f0; id=972990680; xnsid=404266eb; XNESSESSIONID=62de8f52d318; jebecookies=4205498d-d0f7-4757-acd3-416f7aa0ae98|||||; Ver = 7.0; loginfrom=null; jebe_key=8800dc4d-e013-472b-a6aa-552ebfc11486%7Cb1a400326a5d6b2877f8c884e4fe9832%7C1575175011619%7C1%7C1575175011641; wp_fold=0; JSESSIONID=abc84VF0a7DUL7JcS2-6w

Send a POST request to simulate login

Let’s use GitHub as an example to illustrate this simulated login method.

Let’s start by creating a crawler called Github:

> scrapy genspider github github.com

To simulate login with a POST request, we first need to know the login URL and the parameters the login requires. Using the debug tools, we can see the following information about the login request:

From the request information we can see that the login URL is https://github.com/session, and the required parameters are as follows:

commit: Sign in
utf8: ✓
authenticity_token: bbpX85KY36B7N6qJadpROzoEdiiMI6qQ5L7hYFdPS+zuNNFSKwbW8kAGW5ICyvNVuuY5FImLdArG47358RwhWQ==
ga_id: 101235085.1574734122
login: [email protected]
password: XXX
webauthn-support: supported
webauthn-iuvpaa-support: unsupported
required_field_f0e5:
timestamp: 1575184710948
timestamp_secret: 574aa2760765c42c07d9f0ad0bbfd9221135c3273172323d846016f43ba761db

That's a lot of parameters for one request, whew!

There is also the required_field_f0e5 parameter, which is regenerated dynamically every time the page loads. Fortunately it is always submitted empty, which saves us one parameter: we simply don't pass it.

The rest of the parameters are placed on the page as shown below:

We use XPath to extract each parameter. The code is as follows (I have replaced the username and password with xxx; fill in your real username and password when running):

# -*- coding: utf-8 -*-
import scrapy
import re

class GithubSpider(scrapy.Spider):
    name = 'github'
    allowed_domains = ['github.com']
    # Login page URL
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # Get the request parameters from the login page
        commit = response.xpath("//input[@name='commit']/@value").extract_first()
        utf8 = response.xpath("//input[@name='utf8']/@value").extract_first()
        authenticity_token = response.xpath("//input[@name='authenticity_token']/@value").extract_first()
        ga_id = response.xpath("//input[@name='ga_id']/@value").extract_first()
        webauthn_support = response.xpath("//input[@name='webauthn-support']/@value").extract_first()
        webauthn_iuvpaa_support = response.xpath("//input[@name='webauthn-iuvpaa-support']/@value").extract_first()
        # required_field_4ed5 = response.xpath("//input[@name='required_field_4ed5']/@value").extract_first()
        timestamp = response.xpath("//input[@name='timestamp']/@value").extract_first()
        timestamp_secret = response.xpath("//input[@name='timestamp_secret']/@value").extract_first()

        # Construct the POST parameters
        post_data = {
            "commit": commit,
            "utf8": utf8,
            "authenticity_token": authenticity_token,
            "ga_id": ga_id,
            "login": "[email protected]",
            "password": "xxx",
            "webauthn-support": webauthn_support,
            "webauthn-iuvpaa-support": webauthn_iuvpaa_support,
            # "required_field_4ed5": required_field_4ed5,
            "timestamp": timestamp,
            "timestamp_secret": timestamp_secret
        }

        # Print the parameters
        print(post_data)

        # Send the POST request
        yield scrapy.FormRequest(
            "https://github.com/session",  # Login request URL
            formdata=post_data,
            callback=self.after_login
        )

    # Actions after a successful login
    def after_login(self, response):
        # Find the "Issues" field on the page and print it
        print(re.findall("Issues", response.body.decode()))

Here we use Scrapy's FormRequest to send the POST request. Let's run the crawler and see what happens:

2019-12-01 15:14:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/login> (referer: None)
{'commit': 'Sign in', 'utf8': '✓', 'authenticity_token': '3P4EVfXq3WvBM8fvWge7FfmRd0ORFlS6xGcz5mR5A00XnMe7GhFaMKQ8y024Hyy5r/RFS9ZErUDr1YwhDpBxlQ==', 'ga_id': None, 'login': '[email protected]', 'password': '54ithero', 'webauthn-support': 'unknown', 'webauthn-iuvpaa-support': 'unknown', 'timestamp': '1575184487447', 'timestamp_secret': '6a8b589266e21888a4635ab0560304d53e7e8667d5da37933844acd7bee3cd19'}
2019-12-01 15:14:47 [scrapy.core.scraper] ERROR: Spider error processing <GET https://github.com/login> (referer: None)
Traceback (most recent call last):
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/cxhuan/Documents/python_workspace/scrapy_projects/login/login/spiders/github.py", line 40, in parse
    callback=self.after_login
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/http/request/form.py", line 32, in __init__
    querystr = _urlencode(items, self.encoding)
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/http/request/form.py", line 73, in _urlencode
    for k, vs in seq
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/http/request/form.py", line 74, in <listcomp>
    for v in (vs if is_listlike(vs) else [vs])]
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/utils/python.py", line 107, in to_bytes
    'object, got %s' % type(text).__name__)
TypeError: to_bytes must receive a unicode, str or bytes object, got NoneType
2019-12-01 15:14:47 [scrapy.core.engine] INFO: Closing spider (finished)

Looking at the error message, the problem lies with ga_id: its value is None, and to_bytes cannot handle None. So we need to handle the case where ga_id is None.

Modify the code as follows:

ga_id = response.xpath("//input[@name='ga_id']/@value").extract_first()
if ga_id is None:
    ga_id = ""
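As an aside, extract_first() also accepts a default value, so the same guard could be written in one line; this is just an alternative, not what the code above does:

ga_id = response.xpath("//input[@name='ga_id']/@value").extract_first(default="")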

Run the crawler again, this time to see the results:

Set-Cookie: _gh_sess=QmtQRjB4UDNUeHdkcnE4TUxGbVRDcG9xMXFxclA1SDM3WVhqbFF5U0wwVFp0aGV1UWxYRWFSaXVrZEl0RnVjTzFhM1RrdUVabDhqQldTK3k3TEd3KzNXSzgvRXlVZncvdnpURVVNYmtON0IrcGw1SXF6Nnl0VTVDM2dVVGlsN01pWXNUeU5XQi9MbTdZU0lTREpEMllVcTBmVmV2b210Sm5Sbnc0N2d5aVErbjVDU2JCQnA5SkRsbDZtSzVlamxBbjdvWDBYaWlpcVR4Q2NvY3hwVUIyZz09LS1lMUlBcTlvU0F0K25UQ3loNHFOZExnPT0%3D--8764e6d2279a0e6960577a66864e6018ef213b56; path=/; secure; HttpOnly
2019-12-01 15:25:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/> (referer: https://github.com/login)
['Issues', 'Issues']
2019-12-01 15:25:18 [scrapy.core.engine] INFO: Closing spider (finished)

We can see that the information we need has been printed; the login was successful.

For form requests, FormRequest provides another method, from_response, which automatically fetches the form from the page. We simply pass in the username and password and send the request.

Let’s take a look at the source of this method:

@classmethod
def from_response(cls, response, formname=None, formid=None, formnumber=0, formdata=None,
                  clickdata=None, dont_click=False, formxpath=None, formcss=None, **kwargs):

    kwargs.setdefault('encoding', response.encoding)

    if formcss is not None:
        from parsel.csstranslator import HTMLTranslator
        formxpath = HTMLTranslator().css_to_xpath(formcss)

    form = _get_form(response, formname, formid, formnumber, formxpath)
    formdata = _get_inputs(form, formdata, dont_click, clickdata, response)
    url = _get_form_url(form, kwargs.pop('url', None))

    method = kwargs.pop('method', form.method)
    if method is not None:
        method = method.upper()
        if method not in cls.valid_form_methods:
            method = 'GET'

    return cls(url=url, method=method, formdata=formdata, **kwargs)

We can see that this method has quite a few parameters, most of them related to locating the form. If the login page has only one form, Scrapy can locate it easily, but what about pages with multiple forms? That is what these parameters are for: telling Scrapy which form to use for logging in.

Of course, this method assumes that the action attribute of the form on the page contains the URL to which the request is submitted.
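For example, if a login page contained several forms, we could point from_response at the right one with the form-locating parameters; the XPath below is only an illustration, not taken from GitHub's actual page:

yield scrapy.FormRequest.from_response(
    response,
    formxpath="//form[contains(@action, 'session')]",  # pick the login form among several
    formdata={"login": "xxx", "password": "xxx"},
    callback=self.after_login
)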

In the Github example, our login page only has a login form, so all we need to do is pass in the username and password. The code is as follows:

# -*- coding: utf-8 -*-
import scrapy
import re

class Github2Spider(scrapy.Spider):
    name = 'github2'
    allowed_domains = ['github.com']
    start_urls = ['http://github.com/login']

    def parse(self, response):
        yield scrapy.FormRequest.from_response(
            response,  # Automatically find the form in the response
            formdata={"login": "[email protected]", "password": "xxx"},
            callback=self.after_login
        )

    # Actions after a successful login
    def after_login(self, response):
        # Find the "Issues" field on the page and print it
        print(re.findall("Issues", response.body.decode()))

After running the crawler, we can see the same result as before.

Isn't that a lot easier? No need to hunt down all those request parameters. Doesn't it feel like magic?

Conclusion

This article introduced several ways to simulate login to a website with Scrapy; you can practice with the methods described here. Of course, captchas are not covered: they are a complex and difficult topic that I will introduce later.